Data Engineering with Hydra – The Basics

Postgres itself already offers a solid foundation for running analytical workloads efficiently, and we have chosen Hydra to enhance these capabilities further. Hydra uses techniques such as columnar storage, vectorized execution, and query parallelization to process queries efficiently, and together these features significantly improve the performance of analytical workloads in Postgres.

Columnar storage:

  Columnar storage in Postgres is a technique where data is organized and stored column by column instead of row by row. Hydra's columnar tables support updates, deletes, and vacuuming. Within Hydra, columnar data is organized into units commonly referred to as "stripes" and "chunks".

Stripes: A stripe typically contains a certain number of rows from each column. The number of rows in a stripe can vary depending on the implementation and configuration of the columnar storage system.

Chunks: The data is further divided into smaller units called “chunks”. A chunk contains a fixed number of contiguous values from a single column. The purpose of dividing data into chunks is to enable efficient compression and data retrieval.
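Stripe and chunk sizes are tunable. As a sketch, assuming the citus_columnar settings (the GUC names and the illustrative values below are taken from that extension, not from this post):

-- Rows per stripe (illustrative value):
SET columnar.stripe_row_limit = 150000;
-- Rows per chunk group within a stripe (illustrative value):
SET columnar.chunk_group_row_limit = 10000;

Larger stripes favor bulk scans and compression, while smaller chunks make it cheaper to skip data a query does not need.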

This design choice has several advantages for analytical workloads:

  • Multifaceted: Columnar storage is a common choice when working with huge volumes of data. Paired with row-based heap tables, it supports HTAP (Hybrid Transactional/Analytical Processing), combining transactional storage with real-time analytics in one system. Columnar storage is also efficient at loading new data quickly.
  • Data Compression: Columnar storage is well-suited for data compression. Similar values within a column can be encoded efficiently, reducing storage requirements and speeding up data access.
  • Aggregation Efficiency: Analytical queries often involve aggregating data, and columnar storage excels in performing these types of operations, resulting in faster query execution.

Our columnar approach:

 Download the relevant Linux columnar package, initialize the Postgres data cluster, install the required dependencies, and configure it as needed.

# Fetch the example columnar extension repository
git clone git@github.com:rbernierZulu/columnar-tables-extension.git
# Refresh package lists, then install the columnar package
sudo apt-get update
sudo dpkg -i postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb
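With the extension installed, a columnar table is created by appending USING columnar to CREATE TABLE. A minimal sketch; the table name is hypothetical, and the extension name is an assumption based on the citus-columnar package installed above (on Hydra's managed warehouse, columnar is already enabled and is the default):

-- Enable the extension if the packaging requires it:
CREATE EXTENSION IF NOT EXISTS citus_columnar;

-- A hypothetical analytics table, stored column by column:
CREATE TABLE page_views (
    site_id   int,
    url       text,
    viewed_at timestamptz
) USING columnar;

-- Bulk insert; columnar tables are optimized for large appends:
INSERT INTO page_views
SELECT s % 10, 'https://example.com/page/' || s, now()
FROM generate_series(1, 100000) s;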

Vectorized Execution:

 Vectorized execution is a technique where operations are performed on entire columns at once rather than row by row. This offers several benefits:

  • Increased Processing Speed: By operating on entire columns in a vectorized manner, Hydra takes advantage of modern processor architectures optimized for parallelism and SIMD (single instruction, multiple data) operations. This leads to substantial performance improvements.
  • Reduced CPU Overhead: Vectorized execution reduces the need for repetitive CPU instructions, resulting in lower CPU overhead and better utilization of computational resources.
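Analytical aggregates are where vectorization shows up most clearly. A small sketch, reusing the hypothetical page_views table from above:

-- Column-at-a-time aggregation over a columnar table; EXPLAIN ANALYZE
-- shows the chosen plan and timing so the effect can be measured:
EXPLAIN ANALYZE
SELECT site_id, count(*), max(viewed_at)
FROM page_views
GROUP BY site_id;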

Query Parallelization:

 Hydra efficiently processes queries in parallel, distributing the workload across multiple processors to speed up the analytical processing.

  • Enhanced Scalability: Parallel query processing allows Hydra to scale horizontally by adding more processing nodes. As query volume and complexity increase, the system can handle load efficiently without compromising on performance.
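Stock Postgres already exposes settings that govern how many workers a single query may use, and Hydra builds on this machinery. A sketch with illustrative values:

-- Allow up to 4 parallel workers per Gather node (illustrative value):
SET max_parallel_workers_per_gather = 4;

-- If the planner chooses a parallel plan, a "Gather" node with
-- launched workers appears in the output:
EXPLAIN
SELECT count(*) FROM page_views;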

Support Heap Tables:

 Heap tables are the default table type in Postgres. In Hydra, where columnar is the default, we add USING heap to the end of a CREATE TABLE statement to get a traditional row-based table (see the sketch after this list). Heap tables store data row-wise, making them well-suited for transactional workloads. Hydra's integration with heap tables brings several advantages:

  • Transactional Workloads: In scenarios where real-time data processing is required, such as capturing and processing continuous data streams, heap tables provide the necessary efficiency for handling frequent inserts, updates, and deletes.
  • Optimal Data Ingestion: Heap tables excel in rapidly ingesting data due to their efficient row-wise storage. They are thus well-suited for real-time data-driven applications involving a constant stream of incoming data or bulk data uploads.
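A minimal sketch of a row-based table in Hydra; the table and column names are hypothetical:

-- USING heap opts out of columnar storage for this table:
CREATE TABLE orders (
    order_id   bigserial PRIMARY KEY,
    customer   text,
    amount     numeric,
    created_at timestamptz DEFAULT now()
) USING heap;

-- Frequent single-row writes are where heap tables shine:
INSERT INTO orders (customer, amount) VALUES ('acme', 42.00);
UPDATE orders SET amount = 43.00 WHERE order_id = 1;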

Unveiling the data mystery: Seamless loading of PostgreSQL data into a Hydra data warehouse

In previous sections, we delved into the intricacies of data optimization, columnar storage, vectorized execution, and the capabilities that Hydra brings to the table. Now let's take a step further and explore the practical steps involved in seamlessly loading PostgreSQL data into the Hydra data warehouse.

Buckle up as we embark on a journey to harness the power of Hydra for unmatched analytical performance.

Setting up:

  We start with an operational PostgreSQL database, which serves as the source. Next, we'll ensure that the Hydra data warehouse is likewise operational – setting it up takes only a few minutes and a few simple steps.

To set up a team, a warehouse, a region, and a capacity, we only need to log in to Hydra and choose their values; storage and compute resources can be sized to our needs.

Below is the data present in the source database before taking the dump; this is the data we are going to restore into Hydra.

Loading the data:

First, we need the PostgreSQL source database. Here we use pg_dump and pg_restore to transfer the data onto Hydra.

pg_dump:

By using pg_dump we can capture data from specific tables in the database by passing -t tablename1 -t tablename2. When the dump is later imported into Hydra, the tables are created as columnar by default.

pg_dump --no-acl -Fc --no-owner -h localhost -U postgres -d db_1 > db_1.dump
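For example, to dump only selected tables, add -t flags to the same command (the table names here are illustrative):

# Dump only the orders and page_views tables (hypothetical names):
pg_dump --no-acl -Fc --no-owner -h localhost -U postgres -d db_1 \
  -t orders -t page_views > db_1_partial.dump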

Once the dump is created for the existing data, the very next step is to create connectivity between the PostgreSQL database and Hydra. For that, we add details about our Hydra setup to the pg_service.conf service file.
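A minimal pg_service.conf entry might look like the following sketch. The host matches the example used later in this post, while the port, user, and database name are placeholders to be filled in from the Hydra dashboard:

# ~/.pg_service.conf – connection details for the Hydra warehouse
[hydra]
host=hydra-hostname.us-east-1.hydras.io
port=<port>
user=<username>
dbname=<database_name>

Clients can then connect by service name, e.g. psql "service=hydra".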

The database dump that we created earlier using pg_dump is now imported into Hydra using pg_restore by passing the hostname, username, password, and database name of the Hydra database (which we can find on the dashboard) as arguments.

    PGPASSWORD=<postgres_password> \
    pg_restore \
    -h hydra-hostname.us-east-1.hydras.io \
    -p <port> \
    -U <username> \
    -d <database_name> \
    mydb.dump

Here is the data that has been restored to the Hydra data warehouse.

We will explore more in future sessions.
