
Mastering pgbench for Database Performance Tuning – Part II

Introduction

In the realm of database performance tuning, the pursuit of optimal throughput and responsiveness is a never-ending journey. In our previous blog, we introduced pgbench, walked through setting it up and running it for the first time, and covered tips, tricks, and common pitfalls of benchmarking. Building upon that foundation, we now venture into advanced optimization territory, delving deeper into pgbench and more of its options. We will also explore how to interpret pgbench results effectively, transforming raw metrics into actionable insights that drive tangible performance enhancements.

If you haven’t read our previous blog, you can read it here.

Exploring pgbench with more options

In our previous blog, we explored the basic pgbench options: -c for the number of clients, -j for the number of threads, -t for the number of transactions, and -T for the time limit. But the pgbench utility offers many more options. Let’s explore a few of them and see how they can be used.

No vacuum -n

This option skips the vacuum that pgbench normally runs before the test. It is typically used for custom test runs driven by a script that does not include the default tables.

#Test run without -n
[postgres@localhost bin]$ ./pgbench db
pgbench (16.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>

#Test run with -n
[postgres@localhost bin]$ ./pgbench -n db
pgbench (16.3)
transaction type: <builtin: TPC-B (sort of)>

Comparing the two runs: without -n, pgbench vacuums the default tables before starting the test; with -n, the vacuum step is skipped.
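
Since -n is meant for custom runs, here is a minimal sketch: custom.sql is a hypothetical script that never touches the default pgbench tables, so there is nothing for the built-in vacuum to do.

#A hypothetical custom-script run with -n
[postgres@localhost bin]$ cat custom.sql
\set x random(1, 1000000)
SELECT :x;
[postgres@localhost bin]$ ./pgbench -n -f custom.sql db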

Connect -C

This option establishes a new connection for each transaction, instead of once per client session. It is useful for measuring connection overhead.

#Using -C
[postgres@localhost bin]$ ./pgbench -C db
pgbench (16.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
number of failed transactions: 0 (0.000%)
latency average = 5.255 ms
average connection time = 2.524 ms
tps = 190.284094 (including reconnection times)

Observing the tps, it now includes the time taken for reconnections. Without this option, the tps is reported without the initial connection time.

Rate -R & Latency limit -L

These two options are often used together. Rate (-R) specifies the target rate of transactions per second, while latency limit (-L) counts the transactions that exceed the specified limit, given in milliseconds.

[postgres@localhost bin]$ ./pgbench -R 10 -L 10 db
pgbench (16.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
number of failed transactions: 0 (0.000%)
number of transactions skipped: 0 (0.000%)
number of transactions above the 10.0 ms latency limit: 0/10 (0.000%)
latency average = 3.153 ms
latency stddev = 0.480 ms
rate limit schedule lag: avg 0.311 (max 0.518) ms
initial connection time = 2.973 ms
tps = 6.476235 (without initial connection time)

Here we can see the number of transactions above the specified latency limit and the number of transactions skipped. In this run, no transactions crossed the given limits.

Report per command -r

This option reports per-statement statistics after the benchmark finishes.

[postgres@localhost bin]$ ./pgbench -r db
pgbench (16.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
number of failed transactions: 0 (0.000%)
latency average = 1.020 ms
initial connection time = 2.315 ms
tps = 980.199961 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set aid random(1, 100000 * :scale)
         0.000           0  \set bid random(1, 1 * :scale)
         0.000           0  \set tid random(1, 10 * :scale)
         0.000           0  \set delta random(-5000, 5000)
         0.029           0  BEGIN;
         0.169           0  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.053           0  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.056           0  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.043           0  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.054           0  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.611           0  END;

It shows the latency and failure count for each statement of the default script used by pgbench.

Summary

To summarize, effective benchmarking is crucial for the optimal performance and scalability of databases. By leveraging benchmarking tools such as pgbench, database administrators and developers can simulate real-world workloads, measure performance metrics, and identify areas for optimization. However, successful benchmarking requires careful planning, execution, and analysis. In this blog, we explored a few more options of the pgbench utility. If you haven’t checked our previous blog on pgbench, you can find it here.

Thank you and stay tuned for more…

Mastering pgbench for Database Performance Tuning

Introduction

In today’s world of databases, performance is one of the key aspects that needs to be taken care of, irrespective of the size of the database. If you’re in charge of managing a PostgreSQL database, ensuring peak performance is undoubtedly a top priority. Enter pgbench, PostgreSQL’s own benchmarking powerhouse, designed to help you simulate workloads and test different database strategies to optimize efficiency. Whether you’re a seasoned DBA or just starting out, understanding how to leverage pgbench can be a game-changer for your database operations. This comprehensive guide will cover everything you need to know to get started with pgbench, interpret its results, and implement its insights for optimal database performance.

pgbench

pgbench is a benchmarking tool included with PostgreSQL that simulates a standard workload: a simple program for running benchmark tests on PostgreSQL. By default, pgbench runs tests based on the TPC-B benchmark, which involves executing five SELECT, UPDATE, and INSERT commands per transaction. It is available as part of the PostgreSQL installation. Using pgbench, we can simulate various workloads on PostgreSQL databases by generating multiple client connections and by using predefined scripts or sets of parameters.

Setting up and running my first pgbench test

pgbench should be invoked with the -i (initialize) option to create and populate its test tables. Initializing with pgbench -i database_name creates four tables (pgbench_accounts, pgbench_branches, pgbench_history and pgbench_tellers), dropping any existing tables of those names.

Whenever -i (initialize) is given, pgbench checks for previous tables of these names and creates them again. To scale up, use the scale factor option -s; with -s 20, pgbench inserts data into the tables to represent a workload 20 times larger than the default scale. Each tuple inserted represents a row of data in the database tables.
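
A minimal initialization sketch, where mydb is a hypothetical database name:

#Initialize with a scale factor of 20
[postgres@localhost bin]$ ./pgbench -i -s 20 mydb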

There are various parameters in pgbench for performing benchmark tests. The most common options are -c for the number of clients, -j for the number of threads, -t for the number of transactions, -T for the time limit, and -f for specifying a custom script file.

Running pgbench

Let’s simulate a workload with 50 client connections and 100 transactions per client against the database mydb.
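
The corresponding invocation, with mydb as the example database (defaults apply for everything not specified):

[postgres@localhost bin]$ ./pgbench -c 50 -t 100 mydb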

Understanding the Output

Let’s analyze the output. Some defaults are assumed if you do not supply them as input, such as scaling factor = 1 and number of threads = 1.

The number of clients and the number of transactions per client are given as input to pgbench, and the number of transactions actually processed is the product of the two. Zero transactions failed.

Latency average is the time taken to respond to a given request, measured in milliseconds. Here it is 92.234 ms.

Initial connection time is the time taken for the client to establish a connection with the server. Here it took 416.143 ms.

Most important is tps, i.e. transactions per second: the number of transactions the system processes in a one-second interval. It varies with the complexity of the transactions (or queries), system resources, concurrency levels, and database configuration. Higher tps is better.

Tips and Tricks

  • Understand the workloads – It is important to understand the workloads of the application and simulate them accordingly, using customized SQL scripts and by adjusting the scale factor.
  • Control the concurrency level – We can control the concurrency level of the pgbench test using the options -c and -j.
  • Monitor system metrics – Use system monitoring tools such as top to watch CPU usage, memory utilization, disk I/O, and network activity during pgbench tests.
  • Tune PostgreSQL configuration – Adjust postgresql.conf parameters such as shared_buffers, max_connections, and work_mem based on the results (see the sketch after this list).
  • Run multiple tests – For an accurate assessment of performance, run multiple pgbench tests rather than relying on one.
  • Documentation – Documenting the methodologies and results of your testing activities will help in further analysis.
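
As a hedged illustration of the tuning tip above, such parameters can also be changed without hand-editing postgresql.conf; the values here are placeholders, not recommendations:

ALTER SYSTEM SET shared_buffers = '2GB';    -- static: takes effect only after a restart
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();                    -- applies reloadable settings such as work_mem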

Common pitfalls to avoid in database benchmarking

  • Benchmarking without understanding the workload.
  • Running the test with insufficient CPU, memory, storage, or network resources for the expected workload.
  • Using incorrect benchmarking tools and methodologies, which can lead to misleading results.
  • Not testing under realistic conditions, or failing to replicate the environment, which leads to improper results.
  • Considering the result of a single test rather than the average of multiple tests.
  • Not documenting the process and results properly.

Summary

To summarize, effective benchmarking is crucial for the optimal performance and scalability of databases. By leveraging benchmarking tools such as pgbench, database administrators and developers can simulate real-world workloads, measure performance metrics, and identify areas for optimization. However, successful benchmarking requires careful planning, execution, and analysis. In this blog, we explored some features and results of pgbench, including tips, tricks, and common pitfalls to avoid in database benchmarking.

Thank you and stay tuned for more…

Troubleshooting PostgreSQL Replication Setup Errors with repmgr

Introduction

Hey All

Managing a PostgreSQL replication setup involves various challenges, especially when encountering errors during configuration. This blog post explores common errors encountered while setting up replication using repmgr, a popular replication manager for PostgreSQL. We’ll delve into the causes of these errors and provide solutions to overcome them, ensuring a seamless replication setup. We had discussed setting up High Availability in Postgres using repmgr in this post. If you did not get a chance to go through it, now might be a good time for it.

These are some of the errors/issues I faced in my initial work with repmgr and EC2 instances. They might look very basic, but they reinforce the fact that minute details are important.

Error#1

This was the first bump in my otherwise smooth journey with EC2 instances & repmgr.

[postgres@ip-172-31-14-99 ~]$ /usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
ERROR: connection to database failed
DETAIL:
connection to server at "52.64.241.210", port 5432 failed: timeout expired

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=52.64.241.210 fallback_application_name=repmgr options=-csearch_path=

Solution

We need to edit the inbound rules of the security group so that traffic can flow between all of the PostgreSQL EC2 instances.
Steps: Instances –> Security –> Security group –> Edit Inbound rules

Add a PostgreSQL rule (port 5432) to the security group, as described below.
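
The same change can be made from the AWS CLI. A hedged sketch, assuming all instances share one security group (the group ID is a placeholder):

# allow PostgreSQL traffic (port 5432) between members of the same security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 5432 \
    --source-group sg-0123456789abcdef0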

Result

[postgres@ip-172-31-14-99 bin]$ /usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

#Verifying Primary registration
[postgres@ip-172-31-14-99 bin]$ /usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=52.64.241.210 user=repmgr dbname=repmgr connect_timeout=2

Error#2

This is a basic error that we DBAs deal with on a daily basis: “No entries in pg_hba.conf”. I encountered this error while performing the standby clone dry-run operation.

[postgres@ip-172-31-7-43 ~]$ /usr/pgsql-14/bin/repmgr -h 52.64.241.210 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

DETAIL:
connection to server at "52.64.241.210", port 5432 failed: FATAL:  no pg_hba.conf entry for replication connection from host "13.210.66.27", user "repmgr", no encryption

ERROR: connection to database failed
DETAIL:
connection to server at "52.64.241.210", port 5432 failed: FATAL:  no pg_hba.conf entry for replication connection from host "13.210.66.27", user "repmgr", no encryption

Solution

It clearly indicates that we need to make an entry in the pg_hba.conf of the Primary server.

cd $PGDATA
vi pg_hba.conf
host    replication     repmgr          {Standby1 IP}           md5
host    repmgr          repmgr          {Standby1 IP}           md5

# We have to add details of all of the Standby nodes and the Witness node too, if applicable.

Note: As we will clone the data from the Primary to both of the Standby nodes, there is no need to make these changes on the Standby nodes. But on the Witness node, we need to enter the IPs in the pg_hba.conf file and change the parameters in the postgresql.conf file. (For more details, refer to this post.)

Result

[postgres@ip-172-31-7-43 ~]$ /usr/pgsql-14/bin/repmgr -h 52.64.241.210 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
  pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/14/data -h 52.64.241.210 -p 5432 -U repmgr -X stream
INFO: all prerequisites for "standby clone" are met

Error#3

I encountered this error while setting up the Witness node. Even after you have made an entry in the pg_hba.conf file, it is possible to get the following error:

[postgres@ip-172-31-8-219 ~]$ /usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf witness register -h 52.64.241.210
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to witness node "node4" (ID: 4)
ERROR: unable to connect to witness node "node4" (ID: 4)
DETAIL:
connection to server at "ec2-13-211-191-92.ap-southeast-2.compute.amazonaws.com" (172.31.8.219), port 5432 failed: FATAL:  role "repmgr" is not permitted to log in

HINT: the witness node must be running before it can be registered

Solution

Log in to psql and check the attributes of the user:

[postgres@ip-172-31-8-219 ~]$ psql
psql (14.11)
Type "help" for help.
postgres=# \du+

                                          List of roles
 Role name |                         Attributes                         | Member of | Description
-----------+------------------------------------------------------------+-----------+-------------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}        |
 repmgr    | Superuser, Cannot login                                    | {}        |

# It shows Role "repmgr" cannot login
# Simply alter the role 
postgres=# ALTER ROLE repmgr LOGIN;
ALTER ROLE
postgres=# \du+
                                          List of roles
 Role name |                         Attributes                         | Member of | Description
-----------+------------------------------------------------------------+-----------+-------------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}        |
 repmgr    | Superuser                                                  | {}        |
 

Result

After altering the role, retry registering the witness node.

[postgres@ip-172-31-8-219 ~]$ /usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf witness register -h 52.64.241.210
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to witness node "node4" (ID: 4)
INFO: connecting to primary node
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
INFO: witness registration complete
NOTICE: witness node "node4" (ID: 4) successfully registered

Conclusion

In PostgreSQL replication setups, encountering errors is not uncommon, but with the right knowledge and troubleshooting steps, they can be resolved effectively. In this post, we have discussed some of the common errors encountered during the configuration of replication using repmgr. By understanding the root cause of these errors and implementing the provided solutions, administrators can ensure a smooth and reliable PostgreSQL replication setup, facilitating data redundancy and high availability for their applications. These might already be known to many experienced DBAs, but they can serve as a useful guide for novice PostgreSQL enthusiasts and less-experienced DBAs, saving precious time. If you have encountered any other issues while building PostgreSQL HA systems using repmgr, please feel free to comment below and reach out to us. As always, we’re happy to help.

Thank you and Stay tuned…

Building PostgreSQL HA Systems with repmgr

Introduction

Welcome to our latest blog post! Today, we’ll delve into the intricacies of setting up a PostgreSQL High Availability (HA) production cluster for a prominent client in the banking sector. Our task? Designing and implementing a robust HA solution for their PODS, with PostgreSQL as the database backbone. Join us as we navigate the process of establishing a four-node repmgr cluster, ensuring continuous availability and data integrity for our esteemed client.

Why repmgr

repmgr is a popular tool used for managing PostgreSQL replication and failover in high availability (HA) environments. It simplifies the setup and management of PostgreSQL replication by providing easy-to-use commands and tools for configuration, monitoring, and failover. One of its key features is its ability to automatically detect primary node failures and promote a standby node to become the new primary.

Prerequisites

  • Provision four EC2 instances.
  • Install PostgreSQL 14 on each node (why not the latest – Application compatibility)
  • Install repmgr on each node

Setup

We split the four nodes into 1 primary node, 2 standby nodes & 1 witness node.

Before setting up repmgr, we need to add the IPv4 addresses of the four nodes to a single security group for the EC2 instances.

In Primary node:

Create the user and database accordingly, and add entries for the remaining nodes in the pg_hba.conf of the primary server. A sketch of the user and database creation follows.
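
A minimal sketch, assuming the myuser/mydb names that appear in the conninfo examples below; a dedicated superuser keeps the setup simple, but adjust to your own security policy:

#Create the repmgr user and database (names are placeholders)
/usr/pgsql-14/bin/createuser --superuser myuser
/usr/pgsql-14/bin/createdb --owner=myuser mydb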

#Parameter changes in postgresql.conf

max_wal_senders = 10
max_replication_slots = 10
wal_level ='logical' #to enable logical decoding in future
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

#Restart the services after changing the parameters in postgresql.conf file.

Then, we need to create a repmgr configuration file with the following parameters:

cluster='clientname_prod'
node_id=1
node_name=node1
conninfo='host=node1 user=myuser dbname=mydb connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Now, we need to register the Primary using the command

/usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register

#To check whether the primary server is registered or not,
/usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf  cluster show

Once the primary is registered, we can set up the standby node.

In Standby node

Two steps:

One is standby clone and the other is standby register.
Before that, let us create the repmgr configuration file on the standby server.

node_id=2
node_name=node2
conninfo='host=(Standby IP address) user=myuser dbname=mydb connect_timeout=2'
data_directory='/var/lib/pgsql/14/data'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Now, let us do a dry run before cloning using the command,

/usr/pgsql-14/bin/repmgr -h (Primary IP address) -U myuser -d mydb -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

If it is successful, let us do the standby clone using the command

/usr/pgsql-14/bin/repmgr -h (Primary IP address) -U myuser -d mydb -f /var/lib/pgsql/repmgr.conf standby clone

Once the cloning is done, register the standby using the command

/usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register

Make sure that you stop the PostgreSQL server while doing the standby clone, and start it again before registering the standby.

Repeat the same process with the second standby too.

In Witness node

On the witness node, the parameter changes need to be made in the postgresql.conf file and entries need to be added to pg_hba.conf, similar to what we did on the Primary; also create the repmgr configuration file accordingly.

To register the witness node, use the command

/usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf witness register -h (Primary IP)

Now we have all four nodes ready and registered as needed. You can use the cluster show command to check the nodes and verify that everything is in order.

Automate the failover

To automate failover, repmgr ships with a daemon called repmgrd. It is a background process that runs continuously to manage PostgreSQL replication and failover, and it is a crucial component of the repmgr toolset, providing automated monitoring, management, and failover capabilities for PostgreSQL replication clusters.

To activate repmgrd, use the command

/usr/pgsql-14/bin/repmgrd -f /var/lib/pgsql/repmgr.conf

#To check the events, use the command:
/usr/pgsql-14/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event

Conclusion

In conclusion, the successful implementation of the four-node repmgr cluster enabled our client to achieve HA for their Microservices architecture, bolstering data resilience and ensuring uninterrupted service delivery in a mission-critical banking environment. With repmgr’s robust features and our expertise, we provided a reliable and scalable solution tailored to the client’s specific requirements. To this setup, we also added PgBouncer and keepalived. If you want to know more about the architecture and how it works in a practical approach, please stay tuned for this series.

Unlocking Initial Sync for Logical Replication in AWS RDS for PostgreSQL

PostgreSQL’s logical replication is handled through publications and subscriptions. Publications define what data to replicate, while subscriptions specify which server receives the updates.

Initial synchronization ensures the subscriber has all the data. Common methods include:

Snapshot: Copies all data at once. Simple but slow for large datasets.
Base Backup and Continuous Archiving: Makes a base copy and then tracks changes. More suitable for large datasets.

However, when dealing with massive databases in RDS PostgreSQL, such as those in the terabyte range, using pg_basebackup for base backup might not work due to security restrictions on file systems. pg_basebackup requires deep file system access, which is limited in RDS for security reasons. In such cases, replication needs to be set up using the pg_dump utility instead.

Before proceeding, ensure the following configuration parameters are set and restart the cluster:

max_replication_slots: 10
max_wal_senders: 10
track_commit_timestamp: on
wal_level: logical
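
On RDS, these values cannot be edited in postgresql.conf directly; they are changed through a DB parameter group. A hedged sketch using the AWS CLI, where the parameter-group name is a placeholder and rds.logical_replication is the RDS switch that turns wal_level to logical:

aws rds modify-db-parameter-group \
    --db-parameter-group-name my-pg-params \
    --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot" \
                 "ParameterName=track_commit_timestamp,ParameterValue=1,ApplyMethod=pending-reboot"

Static parameters such as these take effect only after the instance is rebooted.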

Next, create a publication and add tables to it in the source database:

CREATE PUBLICATION lsr_sync_01;
ALTER PUBLICATION lsr_sync_01 ADD TABLE sales_data;
ALTER PUBLICATION lsr_sync_01 ADD TABLE customer_info;
ALTER PUBLICATION lsr_sync_01 ADD TABLE transactions;

Create a replication slot:

To create a replication connection to the database and establish a replication slot, follow these steps:

Open a separate terminal session.

Keep the session active, preferably using a screen session.

Run the following command:

[postgres@primary ~]$ psql -h primary -U repmgr "dbname=osdb_lsr replication=database"
psql (10.23, server 14.10)
WARNING: psql major version 10, server major version 14.
         Some psql features might not work.
Type "help" for help.

osdb_lsr=# CREATE_REPLICATION_SLOT lsr_sync_01 LOGICAL pgoutput;
  slot_name  | consistent_point |    snapshot_name    | output_plugin
-------------+------------------+---------------------+---------------
 lsr_sync_01 | 0/C000110        | 00000003-00000002-1 | pgoutput
(1 row)

Now let's proceed with backing up the database using pg_dump.

pg_dump -h primary -U repmgr -t sales_data -t customer_info -t transactions -Fd -f osdb_lsr_intial_sync -j 2 --no-publications --no-subscriptions --snapshot=00000003-00000002-1 --compress=0 -v osdb_lsr

Restore the backup on the target for LSR:

pg_restore -h standby -U repmgr -Fd -j 2 -v -d osdb_lsr osdb_lsr_intial_sync

Create a subscription:

Create the subscription on the standby server, referring to the primary database, but keep it in a disabled state.

CREATE SUBSCRIPTION lsr_sync_sub_01 CONNECTION 'host=primary port=5432 dbname=osdb_lsr user=repmgr password=*****' PUBLICATION lsr_sync_01
WITH (
  copy_data = false,
  create_slot = false,
  enabled = false,
  connect = true,
  slot_name = 'lsr_sync_01'
);

Advance the replication origin:

Retrieve the external ID from the pg_subscription table and use it to advance the replay position to the consistent point captured during replication slot creation.

standby=# SELECT 'pg_'||oid::text AS "external_id" FROM pg_subscription WHERE subname = 'lsr_sync_sub_01';
 external_id
-------------
 pg_32119
(1 row)

Take the LSN (the consistent_point) captured during replication slot creation and advance the origin to it, so that replication starts from that point.

standby=# SELECT pg_replication_origin_advance('pg_32119', '0/C000110') ;
 pg_replication_origin_advance
-------------------------------
 
(1 row)

Enable the subscription:

ALTER SUBSCRIPTION lsr_sync_sub_01 ENABLE;

Check the replication status:

SELECT * FROM pg_stat_replication;
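
pg_stat_replication on the publisher shows the walsender serving the subscription. As an additional sanity check (not part of the original run), the subscriber side can be inspected as well:

-- on the subscriber: one row per apply/sync worker; the LSNs should keep advancing
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;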

This series of steps establishes logical replication in PostgreSQL, ensuring efficient data synchronization between databases.

Open-source Data Engineering with PostgreSQL

Blog-2: Installation and Setup on Ubuntu

Introduction:

Welcome back to the series on Open-source Data Engineering with PostgreSQL. In this post, we shall delve into the installation and configuration of Apache Spark and Apache Drill on an Ubuntu environment. Our aim is to guide you through each step, building upon the foundational concepts we’ve previously covered. By following this comprehensive walkthrough, you’ll have all the necessary prerequisites for a streamlined installation process, setting the stage for an optimized data engineering environment that facilitates efficient data loading and exploration. Let’s dive in and set the stage for a seamlessly integrated experience.

Prerequisites:

Before we embark on the installation journey, make sure your Ubuntu operating system meets the following prerequisites:

  • Sufficient disk storage
  • Apache Spark and Apache Drill require Java 8 or Java 11 to run
    • Here is the command to install Java if it is not already installed
      • sudo apt-get install openjdk-8-jdk
  • Apache Spark interacts with various data sources, including PostgreSQL. To connect to a Postgres database, we need the PostgreSQL JDBC Driver; we can download it from https://jdbc.postgresql.org/download/
  • Place the PostgreSQL JDBC Driver JAR file in /opt/spark/jars so that it is accessible to Apache Spark (see the sketch after this list).
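
A hedged sketch of the driver download and placement; the driver version below is a placeholder, so check https://jdbc.postgresql.org/download/ for the current one:

wget https://jdbc.postgresql.org/download/postgresql-42.7.3.jar
sudo mkdir -p /opt/spark/jars
sudo mv postgresql-42.7.3.jar /opt/spark/jars/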

Installation and setup

Apache Spark:

  • Download the latest Apache Spark release (version 3.5.0 as of this writing) from https://spark.apache.org/downloads.html; any other available version can be used as needed.
  • First, update the system using the commands
    sudo apt-get update
    sudo apt-get upgrade
  • Change to the directory where you want Spark extracted, then extract the tarball
    cd /path/to/your/desired/directory
    tar -xzf spark-3.5.0-bin-hadoop3.tar.gz
  • Update your .bashrc file with the following environment variables
    export HADOOP_HOME=/path/to/your/hadoop
    export SPARK_HOME=/path/to/your/spark
  • Navigate to the Spark directory and run the following commands to bring up the Spark shell
    cd /path/to/your/spark
    spark-shell

Apache Drill:

  • Download the latest version of Apache Drill (version 1.19.0 as of this writing) using
    wget https://www.apache.org/dyn/closer.cgi/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz
  • First, run
    sudo apt-get update
    sudo apt-get upgrade
  • Use the following commands to extract the tarball in the desired directory
    cd /path/to/your/desired/directory
    tar -xvf apache-drill-1.19.0.tar.gz
  • Navigate to Drill’s bin directory and run any of the following commands to start the Drill shell or the sqlline shell:
    ./drill-embedded
    ./sqlline
    ./sqlline -u jdbc:drill:zk=local

What’s next:

This installation guide has equipped you with the essential steps to set up Apache Spark and Apache Drill on your Ubuntu environment. By following these instructions, you’ve laid a solid foundation for efficient data engineering, enabling seamless data loading and exploration.

As you embark on your journey with Spark and Drill, remember that this is just the beginning. In upcoming blogs, we will delve deeper into advanced configurations, optimization strategies, and real-world use cases. Stay tuned for more insights and practical tips that will elevate your data engineering experience. The upcoming blogs will provide valuable insights for harnessing the full potential of Apache Spark and Apache Drill in your data projects. Happy exploring!

Open-source Data Engineering with PostgreSQL

Overview – A Curtain raiser

Introduction:

In the ever-evolving landscape of Data Management, organizations are constantly seeking efficient ways to handle, transform, and query massive datasets, and Data Archiving has become an important component of Data Engineering in addressing this, including for archived data. This blog series focuses on a robust solution using Open-source tools such as Apache Spark and Apache Drill. Throughout this series, we will delve into the intricacies of transforming table data into the Parquet format (columnar storage optimized for analytics) and using Apache Drill to query this data seamlessly. The important pillars of this blog series are:

Open-source tools

In the architecture described for archiving and data engineering, two key open-source tools play pivotal roles: Apache Spark and Apache Drill. Together, they form a robust foundation for enabling efficient data loading and exploration within a scalable and flexible environment.

Archiving

As organizations accumulate massive volumes of data, archiving becomes imperative for both cost optimization and performance enhancement. Parquet, a highly efficient columnar storage format, has gained prominence due to its ability to compress and store data in a way that facilitates fast analytics. In this series, we delve into the motivations behind archiving data and how Parquet addresses these challenges.

Legacy Datasets

A key emphasis of this series is on how data archiving plays a pivotal role in liberating production environments from the burden of legacy datasets. By effectively storing and archiving historical data in a structured and efficient format, organizations not only ensure smoother operational workflows but also unlock the potential for enhanced analytics.

Choosing Parquet format

Parquet is chosen for its superiority in analytical data storage due to its columnar storage architecture. This format excels in optimizing query performance through predicate pushdown and efficient encoding techniques like run-length encoding and dictionary encoding, which reduce storage requirements and speed up query processing. Its flexibility in schema evolution allows seamless modifications, ensuring compatibility across versions for smooth system upgrades.

Apache Spark:

  • Apache Spark is a powerful open-source data processing engine known for its distributed computing capabilities.
  • In this context, Apache Spark serves as the backbone for the data-loading process. Its scalability and ability to handle diverse data sources make it well-suited for efficiently extracting, transforming, and loading (ETL) data into the desired format, which, in this case, involves archiving data into the Parquet format.

Apache Drill:

  • Apache Drill is a schema-free SQL query engine designed for exploring and querying large-scale datasets.
  • In the described setup, it is instrumental in querying and extracting insights from the Parquet-archived data.
  • The schema-free nature of Apache Drill aligns well with the flexibility of the Parquet format, allowing for seamless querying and analysis without strict schema constraints.

As we conclude this introductory chapter of our blog series on transforming and archiving legacy datasets with Apache Spark, Parquet files, and Apache Drill, we’ve only just scratched the surface.

In the upcoming articles, we’ll walk you through practical implementation strategies and share real-world case studies.

Stay tuned for more!

Timestamp-based Change Data Capture

Introduction

Hey all!! Hope you are following the series of topics around Data Integration with PostgreSQL, The World’s Most Advanced Open Source Relational Database. In our previous blog, we explored Change Data Capture (CDC) and its methods. If you missed it, we recommend reading that blog first.

Timestamp-based CDC (Change Data Capture) is an approach to capturing and tracking changes in a database by associating a timestamp with each change. It is a versatile and widely used approach to tracking changes in databases, and it supports various use cases like data replication and synchronization. In this blog, we will explore the prerequisites for executing timestamp-based CDC, along with the setup and the resulting data.

Setup the Environment

Before diving into CDC, let’s ensure we have a timestamp column in the table (source) and the necessary components installed and configured.

PostgreSQL Database

  • Install PostgreSQL on the server or your local machine.
  • Create a database and the required tables that will be part of the CDC.

Pentaho Data Integration (ETL tool)

  • Install PDI on the server or your local machine – you may refer to this blog for a step-by-step procedure
  • Connect with the databases.
  • Create the data pipeline for CDC.

Designing the CDC Pipeline

In this CDC pipeline, we will create a new transformation, building a pipeline that extracts the data from one table (t1) in one PostgreSQL database and loads it into another table (t2) in another PostgreSQL database on an incremental basis.

Define the date range

  • Go to Design → Input → Get system info
  • Here, we are getting the start and end range of the transformation.
  • Hop the result to the next step as input.

Export data from Source

  • Go to Design → Input → Table input
  • Here, we will extract the data from the table (t1) using the timestamp column, via an SQL statement (a sample follows this list).
  • Hop the result to the next step as input.
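
A sketch of the kind of statement the Table input step would run. The two ? placeholders are filled, in order, with the start and end of the range hopped in from the Get system info step, and the column names follow the t1 example used later in this post:

SELECT id, name, last_update
FROM t1
WHERE last_update > ? AND last_update <= ?;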

Import data into Target

  • Go to Design → Output → Insert/update
  • Here, we insert the data (the output from the previous step) into the target table (t2) according to the given conditions.

Input and Output

  • Assume that both tables (t1 & t2), in different databases, with columns (id, name & last_update), start out empty.
  • Insert data into the table (t1) using an insert command (see the sketch after this list).
  • Check the data in table (t2) before running the transformation.
  • Run the transformation. If it is successful, you will see a ✅ at the top of each step in the pipeline.
  • Now, check the target table (t2).
  • Compare the data in both tables (t1 & t2).
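
A hedged SQL walkthrough of these steps, assuming the t1/t2 tables with columns (id, name, last_update):

-- on the source database
INSERT INTO t1 (id, name, last_update) VALUES (1, 'alice', now());

-- after running the transformation, on the target database
SELECT * FROM t2;  -- the newly inserted row should appear here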

Note: As discussed in the previous blog, timestamp-based CDC does not capture DELETE operations.

Conclusion

In conclusion, the adoption of timestamp-based Change Data Capture (CDC) in PostgreSQL marks a pivotal stride in the realm of data integration. By leveraging this sophisticated approach, you can build systems to capture and propagate real-time changes efficiently. The journey through the nuances of timestamp-based CDC in PostgreSQL+Pentaho DI showcases not just a technical evolution but a strategic leap towards a more responsive and interconnected data ecosystem. As we witness the seamless synchronization of data across diverse platforms and applications, the potential for data-driven decision-making becomes boundless. In the ever-evolving landscape of data integration, embracing timestamp-based CDC in PostgreSQL is not just a best practice—it’s a transformative key to unlocking the true power of real-time data.

Stay tuned for more in Data Integration with PostgreSQL blog post series!!

Unleashing the Power of Change Data Capture

Introduction

Here I am again! Talking about the series of topics around Data Integration with PostgreSQL, The World’s Most Advanced Open Source Relational Database. If you haven’t looked at the previous blog in the series, I’d highly recommend reading it; it is available here.

In the ever-evolving landscape of data management, organisations are constantly seeking ways to stay ahead of the modifications, additions and deletions in real-time. CDC is a transformative technology that allows organisations to achieve data integrity and consistency across all systems and deployment environments. Additionally, it allows organisations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases and data warehouses.
Change Data Capture (CDC) has emerged as a game-changer, providing a dynamic solution to keep pace with the rapidly changing world of data. This blog explores the three W’s of CDC and the methods of CDC.

Three W’s of Change Data Capture

What is Change Data Capture?

Change Data Capture (CDC) is a technique used to capture the changes made to data in a source database so that the same changes can be applied to a target database. Replication is one important use case where the data is constantly changing; CDC reduces the load by sending only the incremental changes.

Why Change Data Capture is important?

CDC enables real-time synchronisation of data between source and target systems. This is critical for applications where access to the most current information is essential, such as financial transactions or healthcare.

Unlike traditional methods that involve replicating entire datasets, CDC focuses only on the changes. This reduces the processing overhead, minimises network traffic, and optimises resource utilisation.

When to use Change Data Capture?

CDC is particularly useful when dealing with datasets that undergo frequent changes. In scenarios where data is constantly updated, capturing only the changes helps maintain system efficiency.

In environments where data warehouses and analytical systems need to stay current, implementing CDC ensures that updates are propagated swiftly and without the need for extensive data transfers.

Methods of Change Data Capture

Change Data Capture (CDC) comes in various types, each with its own specific use cases and requirements. The primary methods of CDC we want to discuss are:

Timestamp based CDC

In timestamp-based CDC, we use a timestamp column named “last_updated”, “last_modified” or “date_modified” (as per your convenience) in the source table to identify changes. A query sketch follows.
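
A minimal sketch, assuming a hypothetical orders table with a last_updated column; each sync remembers the timestamp of its previous run and fetches only newer rows:

-- :last_sync_time is a placeholder for the timestamp recorded by the previous sync
SELECT * FROM orders
WHERE last_updated > :last_sync_time;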

Pros: 

  • Simple to implement, especially when timestamps are readily available.
  • No additional load on the source database compared to trigger-based CDC.

Cons:

  • Timestamp columns need to be updated on every change.
  • Delete operations will not be identified, as deleted rows leave no timestamp behind.

Log based CDC

In log based CDC, changes are captured directly from the transaction logs of the source database. This method is efficient because it reads the logs generated by the database management system (DBMS) to identify changes.

Pros:

  • More reliable and efficient.
  • There is no need to change the schemas of the database system or add additional columns.

Cons:

  • More complex to implement.
  • Dependency on database-specific features.

Trigger based CDC

In trigger-based CDC, changes in the source are identified using database triggers: special stored procedures that are automatically executed when a specified event (such as an insert, update, or delete) occurs on a particular table. (A sketch follows the pros and cons below.)

Pros:

  • Simple to implement.
  • Directly tied to database events, making it easy to understand.

Cons:

  • Can introduce additional overhead on the source database.
  • May not be as real-time as log-based approaches.
  • Complexity increases with the number of triggers.
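
A minimal sketch of trigger-based CDC, assuming a hypothetical table t1 with an integer id column and a separate change-log table:

-- a change-log table to record what happened and when
CREATE TABLE t1_changes (op char(1), id int, changed_at timestamptz DEFAULT now());

CREATE OR REPLACE FUNCTION capture_t1_change() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    INSERT INTO t1_changes (op, id) VALUES ('D', OLD.id);
  ELSE
    INSERT INTO t1_changes (op, id) VALUES (left(TG_OP, 1), NEW.id);  -- 'I' or 'U'
  END IF;
  RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t1_cdc AFTER INSERT OR UPDATE OR DELETE ON t1
FOR EACH ROW EXECUTE FUNCTION capture_t1_change();

Note that, unlike the timestamp approach, this records DELETEs too.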

Conclusion

As we conclude this exploration of Change Data Capture, its uses and its methods, one thing is clear: CDC is not just a technological tool, it is a strategic asset for organisations navigating the complexities of a data-driven world. By embracing this approach, businesses can not only adapt to change but thrive in an environment where accurate, timely, and reliable data is the cornerstone of success.

If you are interested in exploring CDC in your PostgreSQL database using Pentaho Data Integration (PDI), please reach out to us; let’s connect and collaborate and extract more from your PostgreSQL database.

Thank you and Stay tuned!!

Independence Day – 15th August

OpenSource DB wishes you a “Happy 77th Independence Day”!

On this day, let’s celebrate and unite, embracing the diversity that fortifies our strength!