
The 101s of Mastering Database Inheritance in PostgreSQL

PostgreSQL is renowned for its powerful features and flexibility, and one of its most intriguing capabilities is support for table inheritance. Table inheritance in PostgreSQL allows you to model hierarchical data structures directly within your database schema. This feature can simplify schema design, reduce redundancy, and enhance query efficiency. In this blog post, we’ll explore how PostgreSQL implements inheritance, its benefits, and practical examples to help you master this powerful feature.

What is Table Inheritance in PostgreSQL?

Table inheritance in PostgreSQL allows you to create a hierarchical relationship between tables. This means a table, known as a child, can inherit columns and constraints from another table, known as a parent. This design pattern helps model real-world relationships more naturally and supports object-oriented programming concepts within your relational database.

Key Concepts:

  • Parent Table: The base table from which other tables inherit. It typically contains common attributes and constraints.
  • Child Table: A table that inherits from a parent table. It can have additional attributes or constraints specific to its type.

All check constraints and not-null constraints on a parent table are automatically inherited by its children, unless explicitly specified otherwise with NO INHERIT clauses. Other types of constraints (unique, primary key, and foreign key constraints) are not inherited.
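
For illustration, here is a minimal sketch (the table and constraint names are hypothetical) showing that a plain CHECK constraint is inherited by default, while a NO INHERIT constraint stays on the parent only:

CREATE TABLE parent_demo (
    id  INT,
    qty INT,
    CHECK (qty >= 0),                                   -- inherited by children by default
    CONSTRAINT parent_only CHECK (id > 0) NO INHERIT    -- not inherited
);

CREATE TABLE child_demo () INHERITS (parent_demo);
-- child_demo has the qty check, but not the parent_only constraint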

Using Inheritance in PostgreSQL

Here are the steps to create and use inheritance in PostgreSQL:

Create the Parent table

Create the parent table with the common attributes that will be inherited by the child tables:

postgres=# CREATE TABLE vehicle (
    id SERIAL PRIMARY KEY,
    brand VARCHAR(50),
    model VARCHAR(50),
    year INT
);
CREATE TABLE

Here, we have created the table vehicle as the parent table with attributes that are common to all vehicles.

Create Child tables

Create the child tables that inherit from the parent table, using the INHERITS keyword to specify inheritance:

postgres=# CREATE TABLE car (
    num_doors INT
) INHERITS (vehicle);

CREATE TABLE bike (
    has_sidecar BOOLEAN
) INHERITS (vehicle);
CREATE TABLE
CREATE TABLE

Here, car and bike are child tables that inherit the columns from the vehicle table. Each child table also has its own additional attributes.

Insert Data into child tables

Insert data into the car and bike tables:

postgres=# INSERT INTO car (brand, model, year, num_doors)
VALUES ('Toyota', 'Corolla', 2023, 4);

INSERT INTO bike (brand, model, year, has_sidecar)
VALUES ('Harley-Davidson', 'Sportster', 2024, FALSE);
INSERT 0 1
INSERT 0 1

Check Data in Parent table

Check the data in the parent table:

postgres=# SELECT * FROM vehicle;
-[ RECORD 1 ]----------
id    | 1
brand | Toyota
model | Corolla
year  | 2023
-[ RECORD 2 ]----------
id    | 2
brand | Harley-Davidson
model | Sportster
year  | 2024

Here, we can see the data from both child tables when querying the parent table as well.
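
If you need only the rows stored directly in the parent table (excluding rows from its children), PostgreSQL provides the ONLY keyword; a quick sketch based on the tables above:

-- Returns rows from vehicle and all of its children (car, bike)
SELECT * FROM vehicle;

-- Returns only rows stored directly in vehicle itself
SELECT * FROM ONLY vehicle;
-- In this example it returns no rows, because all data was inserted into the child tables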

Add and check the constraints

We will first add a check constraint to the car table and then inspect the constraints on the parent and child tables:

postgres=# ALTER TABLE car ADD CONSTRAINT check_num_doors CHECK (num_doors > 0);
ALTER TABLE
postgres=# SELECT
    conname AS constraint_name,
    contype AS constraint_type,
    condeferrable,
    condeferred,
    c.relname AS table_name
FROM                    
    pg_constraint AS con
JOIN                    
    pg_class AS c ON con.conrelid = c.oid
WHERE                                    
    c.relname IN ('vehicle', 'car', 'bike');
-[ RECORD 1 ]---+----------------
constraint_name | vehicle_pkey
constraint_type | p
condeferrable   | f
condeferred     | f
table_name      | vehicle
-[ RECORD 2 ]---+----------------
constraint_name | check_num_doors
constraint_type | c
condeferrable   | f
condeferred     | f
table_name      | car

Use Cases

  • E-commerce Platforms: Inherit attributes of a generic product table into specific product types like clothing, electronics, and furniture. This approach streamlines managing different product categories while retaining common attributes.
  • Content Management Systems: Use inheritance to manage various content types (e.g., article, video, blog_post) that share common attributes like title, author, and publish_date.
  • Vehicle Tracking Systems: Model different types of vehicles (e.g., car, bike, truck) with shared attributes such as license_plate, registration_date, and owner.

Conclusion

PostgreSQL’s table inheritance is a robust feature that enhances schema design by allowing you to model hierarchical relationships and reduce redundancy. By understanding and leveraging inheritance, you can create more organized and flexible database schemas. However, it’s essential to balance its benefits with potential complexities and performance considerations. With thoughtful design and regular performance monitoring, you can effectively use inheritance to manage and query data in PostgreSQL.

Thank you and stay tuned…

Essential PostgreSQL Maintenance Activities for Optimal Performance

Maintaining the performance and health of a PostgreSQL database is essential to ensure smooth operations and reliable data management. Regular maintenance activities help optimize performance, prevent issues, and keep your database running efficiently. In this blog post, we’ll explore the essential PostgreSQL maintenance activities that one should incorporate into their routine to achieve optimal performance.

Why Database Maintenance Matters

Database maintenance is essential for ensuring the optimal performance, reliability, and security of your database systems. Regular maintenance activities, such as vacuuming, analyzing, and backing up, help optimize query performance, reclaim disk space, and prevent data corruption. By proactively managing resources and monitoring system health, maintenance reduces the risk of downtime and performance issues, supports scalability, and ensures compliance with security standards. Investing in routine maintenance not only enhances database efficiency but also protects against potential data loss and security vulnerabilities, ultimately safeguarding your business operations and data integrity.

The following list highlights some of the prominent activities:

Vacuum

PostgreSQL’s VACUUM command has to process each table on a regular basis for several reasons:

  1. To recover or reuse disk space occupied by updated or deleted rows.
  2. To update data statistics used by the PostgreSQL query planner.
  3. To update the visibility map, which speeds up index-only scans.
  4. To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.

Source: https://www.postgresql.org/docs/16/routine-vacuuming.html

Different types of Vacuum

  • VACUUM : Basic operation to reclaim storage and update statistics.
    VACUUM table_name;
  • VACUUM FULL: More intensive operation that rewrites the entire table, reclaiming more space but requiring more resources and locking the table.
    VACUUM FULL table_name;

Automatic VACUUM

  • Autovacuum: PostgreSQL includes an autovacuum process that automatically runs vacuum and analyze operations.

Analyze

ANALYZE collects statistics about the contents of tables in the database, which helps the query planner make better, more efficient decisions.

ANALYZE table_name;

The autovacuum process also performs ANALYZE automatically.

Reindex

REINDEX rebuilds indexes to optimize their performance and reduce bloat or fragmentation. It also helps resolve issues such as index corruption, ensuring data integrity.

REINDEX INDEX index_name;
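
On busy systems, a plain REINDEX takes locks that block writes to the table. Assuming PostgreSQL 12 or later, the CONCURRENTLY option is a less disruptive alternative; a minimal sketch:

-- Rebuild a single index without blocking concurrent writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY index_name;

-- Rebuild all indexes of a table
REINDEX TABLE table_name;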

Backup and Recovery

Having a backup and recovery strategy for databases is very important. Regular backups protect against data loss due to hardware failures, corruption, or human errors and also ensure you can recover data in the event of a system failure or disaster.

There are a plethora of utilities and tools available in the Postgres ecosystem to manage database backups. Here is our detailed blog on Postgres backups.

Monitoring and Analysis

Effective database maintenance requires comprehensive monitoring of various critical aspects. Performance metrics such as query performance, index usage, and response time are essential to ensure efficient operation. Resource utilization should be tracked, including CPU and memory usage, disk I/O, and available disk space, to prevent bottlenecks and ensure optimal performance. System health monitoring involves checking log files for errors, verifying backup status, and managing disk space. Security is paramount, and can be well maintained with regular reviews of user access, authentication logs, and patch management. Configuration settings need periodic review and adjustment, while maintenance tasks like vacuuming and reindexing are crucial for performance. Additionally, monitoring replication lag, failover mechanisms, and application activity ensures data consistency and high availability. Finally, capacity planning by analyzing growth trends helps in anticipating future needs and avoiding performance issues.

Tools/Extensions

There are a lot of tools/extensions available for monitoring and analysis. Here are some of them:

  • pg_stat_statements is used to track query performance and identify slow-running queries (a usage sketch follows this list).
  • pgBadger is used to analyze PostgreSQL log files and generate detailed performance reports.
  • pgAdmin provides graphical monitoring of performance and health metrics, and can also perform maintenance tasks like vacuuming and reindexing.
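
As referenced in the pg_stat_statements bullet above, here is a minimal usage sketch; it assumes the module has already been added to shared_preload_libraries and the server restarted:

-- Enable the extension in the current database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 5 statements by total execution time (column names as of PostgreSQL 13+)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;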

Update

Regular maintenance and updates are crucial for ensuring the security and stability of your PostgreSQL database. Applying updates and patches addresses security vulnerabilities, enhances stability, and resolves bugs to improve overall functionality.

Start by regularly checking for updates for PostgreSQL and its extensions to stay current with the latest releases. Before applying updates to your production environment, thoroughly test them in a staging environment to ensure compatibility and minimize disruptions. Finally, follow PostgreSQL’s prescribed upgrade process to apply updates and patches methodically, maintaining a robust and secure database system.

Configuration review

Reviewing and adjusting the database configuration settings to match the workload of your environment is important. However, manually reviewing and adjusting parameters like shared_buffers, work_mem, maintenance_work_mem, and effective_cache_size is complex, especially when your environment has a varying workload.
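
As a starting point, these parameters can be inspected and adjusted from SQL; a minimal sketch (the value shown is a placeholder, not a recommendation):

-- Inspect current values
SHOW shared_buffers;
SHOW work_mem;

-- Persist a change (written to postgresql.auto.conf)
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();   -- note: some parameters, such as shared_buffers, require a restart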

DBtune

Leveraging machine learning to understand specific workload requirements, DBtune is an AI-enabled tuning engine that automates and enhances the parameter tuning process using cutting-edge technologies. This advanced methodology allows for precise tuning and continuous improvement of database performance by analysing complex patterns and making data-driven adjustments.

Learn more about DBtune:
Ref1 : https://opensource-db.com/the-dna-of-dba-mastering-parameter-tuning/
Ref2 : https://opensource-db.com/postgresql-performance-tuning-from-manual-to-automatic-dbtune/

Feel free to reach out to us for a demo.

Best Practices/tips for PostgreSQL Maintenance

  1. Establish a Routine: Create a regular maintenance schedule for vacuuming, backups, and configuration reviews.
  2. Monitor Continuously: Use monitoring tools to track performance, detect issues, and make informed adjustments.
  3. Use the Latest Tools: Tools like DBtune can improve performance while also reducing resource costs.
  4. Document Procedures: Maintain detailed records of maintenance activities, configurations, and issues.
  5. Test Changes: Test configuration changes and updates in a staging environment before applying them to production.
  6. Bring in Expertise: Engaging seasoned professionals like us, who bring deep technical knowledge and experience, can significantly enhance your database operations.

Conclusion

Effective maintenance is key to ensuring the performance, reliability, and longevity of your PostgreSQL database. By incorporating essential maintenance activities such as vacuuming, analyzing, backing up, and monitoring, you can optimize your database and prevent potential issues.

Adopting a proactive approach to maintenance and leveraging the right tools like DBtune and best practices will help you manage your PostgreSQL database effectively, ensuring it meets the demands of your applications and users.

Happy maintaining!

Index Bloat Management in PostgreSQL

Introduction:

Index bloat occurs when an index accumulates unnecessary or unused space, which can degrade database performance. This bloat typically results from operations such as deletions, which leave behind “dead tuples” that are no longer needed but continue to consume space within the index. Understanding and managing index bloat is crucial for maintaining efficient database operations and ensuring optimal performance. In this guide, we’ll walk through the process of creating an index, checking for bloat, simulating bloat, and addressing it through reindexing.

Why it is important to monitor and reduce Index bloat

Monitoring and reducing index bloat is essential for optimizing database performance and efficiency. Bloated indexes consume excessive disk space, increase maintenance overhead, and can significantly slow down query performance by forcing the database engine to sift through more data than necessary. By addressing index bloat, you not only free up valuable storage resources but also improve system responsiveness and reduce the time required for backups and recovery operations. Efficient index management minimizes resource consumption, prevents fragmentation, and ensures that your database operates smoothly and effectively, ultimately leading to a better user experience and lower operational costs.

Step 1: Creating an Index

We begin by creating an index on the test table:

postgres=# CREATE INDEX idx_test_id ON test(id);
CREATE INDEX

Step 2: Checking Initial Index Bloat

We use pgstattuple to check for any initial bloat:

postgres=# SELECT * FROM pgstattuple('idx_test_id');
-[ RECORD 1 ]------+------
table_len          | 16384
tuple_count        | 40
tuple_len          | 640
tuple_percent      | 3.91
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 7348
free_percent       | 44.85

The output shows the current state of the index, including metrics such as the total index length (table_len), the number of tuples (tuple_count), and the amount of free space (free_space). At this point, there is no bloat, as indicated by a dead_tuple_count of 0.

Step 3: Simulating Index Bloat

To simulate index bloat, we delete some rows from the test table:

postgres=# DELETE FROM test WHERE id BETWEEN 25 AND 30;
DELETE 6

Deleting rows can create dead tuples in the index, which are no longer needed but still occupy space.
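
Before running the full bloat query in the next step, the same pgstattuple extension also provides pgstatindex, which gives a quick view of B-tree index density and fragmentation; a small sketch using the index created above:

-- Quick B-tree index health check (pgstattuple extension)
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx_test_id');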

Step 4: Checking Index Bloat

We can verify this by checking the index bloat using the following query:

SELECT
  current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
  ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::float/otta END)::numeric,1) AS tbloat,
  CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
  iname, /*ituples::bigint, ipages::bigint, iotta,*/
  ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::float/iotta END)::numeric,1) AS ibloat,
  CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM (
  SELECT
    schemaname, tablename, cc.reltuples, cc.relpages, bs,
    CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
  FROM (
    SELECT
      ma,bs,schemaname,tablename,
      (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
      (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
    FROM (
      SELECT
        schemaname, tablename, hdr, ma, bs,
        SUM((1-null_frac)*avg_width) AS datawidth,
        MAX(null_frac) AS maxfracsum,
        hdr+(
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
        ) AS nullhdr
      FROM pg_stats s, (
        SELECT
          (SELECT current_setting('block_size')::numeric) AS bs,
          CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
          CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
        FROM (SELECT version() AS v) AS foo
      ) AS constants
      GROUP BY 1,2,3,4,5
    ) AS foo
  ) AS rs
  JOIN pg_class cc ON cc.relname = rs.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
) AS sml
ORDER BY wastedbytes DESC;

(Source: https://wiki.postgresql.org/wiki/Show_database_bloat)

This query is designed to analyze the “bloat” in a PostgreSQL database, both at the table and index level.

Step 5: Reindexing the Index

To ensure the index is optimized, we perform a reindex operation:

postgres=# REINDEX INDEX idx_test_id;
REINDEX

Reindexing rebuilds the index from scratch, potentially reducing bloat and optimizing performance.

Step 6: Verifying Bloat Removal

We verify the state of the index after reindexing:

postgres=# SELECT * FROM pgstattuple('idx_test_id');
-[ RECORD 1 ]------+------
table_len          | 16384
tuple_count        | 34
tuple_len          | 544
tuple_percent      | 3.32
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 7468
free_percent       | 45.58

The results show a slight reduction in the tuple_count and a corresponding increase in free_space, confirming that the index has been compacted and optimized.

Conclusion:

In summary, index bloat can negatively impact database performance by occupying unnecessary space. By creating an index, simulating bloat, and then performing a reindex operation, we have demonstrated how we could effectively manage and mitigate the effects of bloat. Regularly monitoring and maintaining index health through techniques such as reindexing helps preserve database performance and efficiency. The pgstattuple extension in PostgreSQL is a powerful tool for analyzing and managing index bloat. While we didn’t observe significant bloat after deleting rows, reindexing still provided benefits by optimizing the index structure. Regularly checking and managing bloat is crucial for maintaining efficient database performance.

Unlocking PostgreSQL’s potential with pgstattuple

Introduction

pgstattuple is an open-source utility in PostgreSQL, provided as an extension. It helps manage and clear table bloat, which in turn improves table performance. Its ability to provide detailed insights into the extent of bloat makes pgstattuple a crucial tool for optimizing database efficiency and ensuring that your tables operate smoothly.

PostgreSQL is renowned for its robustness, flexibility, and advanced features. One such feature, pgstattuple, provides invaluable insights into the health and efficiency of your database tables and indexes. In this blog, we’ll delve into what pgstattuple is, how it works, and how it can be a game-changer for database maintenance.

What is pgstattuple?

pgstattuple is an extension for PostgreSQL that helps you analyze the physical storage of database tables. It provides detailed statistics on the space usage within your database, including how much space is wasted due to bloat. This information is crucial for maintaining optimal performance and ensuring efficient use of disk space.

Why is pgstattuple Important?

Over time, PostgreSQL databases can accumulate bloat due to various operations like inserts, updates, and deletes. This bloat results in unused space within tables and indexes, which can degrade performance and lead to inefficient storage use. pgstattuple helps you identify and quantify this bloat, allowing you to take corrective actions such as vacuuming or reindexing.

Step 1

Creating the Extension and Table

postgres=# create Extension pgstattuple;
CREATE EXTENSION
postgres=# create table test (id int, name varchar(90));
CREATE TABLE

 Install the pgstattuple extension, which is necessary for analyzing bloat in tables. A table named test is created with two columns: id and name.

Step 2

Insert Data into the Table

postgres=# INSERT INTO test (id, name)
SELECT generate_series(1, 50) AS id, 'some_text' AS name;
INSERT 0 50

We insert 50 rows into the test table. Each row has an id from 1 to 50, and the name is set to 'some_text'.

Step 3

 Initial Bloat Check

Check Table Bloat Using pgstattuple:

postgres=# select * from pgstattuple ('test');
-[ RECORD 1 ]------+------
table_len          | 8192
tuple_count        | 50
tuple_len          | 1900
tuple_percent      | 23.19
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 5964
free_percent       | 72.8
  • The initial check shows:
    • table_len: Total table size in bytes (8192 bytes).
    • tuple_count: Number of live tuples (50).
    • tuple_len: Total size of live tuples (1900 bytes).
    • dead_tuple_count: Number of dead tuples (0).
    • free_space: Free space available in the table (5964 bytes).
  • At this stage, there are no dead tuples, meaning there is no bloat.

Step 4

Creating Bloat

Delete Some Rows:

postgres=# DELETE FROM test WHERE id BETWEEN 41 AND 50;
DELETE 10

We deleted 10 rows from the table where the id is between 41 and 50.

Check Bloat

postgres=# select * from pgstattuple ('test');
-[ RECORD 1 ]------+------
table_len          | 8192
tuple_count        | 40
tuple_len          | 1520
tuple_percent      | 18.55
dead_tuple_count   | 10
dead_tuple_len     | 380
dead_tuple_percent | 4.64
free_space         | 5964
free_percent       | 72.8
  • After the deletion:
    • dead_tuple_count: The number of dead tuples is now 10, indicating bloat.
    • dead_tuple_len: Size of dead tuples (380 bytes).
    • free_space: Free space remains the same (5964 bytes), but bloat is now present.

Step 5

Removing Bloat

  • Vacuum the Table:
postgres=# vacuum test;
VACUUM

Running VACUUM helps to reclaim space from dead tuples, thus reducing bloat.

Check Bloat After Vacuuming:

postgres=# select * from pgstattuple ('test');
-[ RECORD 1 ]------+------
table_len          | 8192
tuple_count        | 40
tuple_len          | 1520
tuple_percent      | 18.55
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 6404
free_percent       | 78.17

After vacuuming:

  • dead_tuple_count: The dead tuples are now 0, meaning bloat has been cleared.
  • free_space: The free space has increased to 6404 bytes

Note: It is recommended to schedule the VACUUM activity during off-peak hours.
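
For tables that bloat quickly, autovacuum can also be tuned per table so that dead tuples are cleaned up sooner; a minimal sketch using the test table from above (the threshold is illustrative, not a recommendation):

-- Trigger autovacuum after ~5% of the table has changed instead of the default 20%
ALTER TABLE test SET (autovacuum_vacuum_scale_factor = 0.05);

-- Verify the per-table setting
SELECT relname, reloptions FROM pg_class WHERE relname = 'test';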

Conclusion:

pgstattuple is an indispensable tool for PostgreSQL database administrators aiming to maintain optimal performance and efficient storage. By offering detailed insights into table and space utilization, it helps identify areas where space is being wasted and performance may be degraded. Regular use of pgstattuple can reveal critical information about dead tuples, free space, and overall storage efficiency.
Incorporating pgstattuple into your database maintenance routine allows for proactive management, helping to mitigate issues related to bloat before they impact performance. By analyzing the results, you can make informed decisions about vacuuming, reindexing, and adjusting autovacuum settings, ultimately leading to a healthier and more efficient PostgreSQL environment. In this blog, we used pgstattuple to monitor and manage table bloat effectively in order to ensure optimal database performance.

Migration from Supabase to RDS : User Guide

In today’s rapidly evolving tech landscape, organizations often find themselves needing to adapt their infrastructure to better meet their needs. Whether driven by requirements for greater control, performance, scalability, or compliance, migrating from a managed service like Supabase to an on-premises or managed database service such as AWS RDS is a significant undertaking. Even though Supabase is a powerful, open-source Backend-as-a-Service platform built on PostgreSQL, and provides a range of features including real-time capabilities, authentication, and auto-scaling, there might be several reasons to migrate away from Supabase. In this blog, let’s explore the step by step approach for executing the PostgreSQL commands for a smooth migration process.

Migrating from one environment to another involves various strategies, each suited to different constraints and requirements. The approach you choose will depend on factors such as system limitations, acceptable downtime, and specific migration needs. Considering the less downtime, we are going for the logical replication strategy to migrate.

Step 1: Export Users

Before setting up the replication, let’s export users from Supabase and restore them into RDS using pg_dumpall:

#Using pg_dumpall to get the users.
pg_dumpall --globals-only \
  -U postgres -h <your_supabase_hostname> \
  2>stderr_globals.txt > supabase_globals.sql

#To restore them in RDS
psql -f supabase_globals.sql -h <your_RDS_hostname> -U postgres -d postgres -W

Since Supabase is a managed service, it includes its own default users, which will be present in the exported supabase_globals.sql file. These default users may not be compatible with AWS RDS, which is also a managed service. You can either remove these users from the SQL file before restoring it, or delete them from RDS after the restoration process is complete.

Step 2: Export Schema

After the users are imported into RDS, we have to export and restore the schema as well.

# Using pg_dump to export the schema
pg_dump --schema-only \
  -U postgres -h <your_supabase_hostname> \
--schema=<schema_name> --schema=<schema_name> \
  2>stderr_schema.txt > postgres_schema_only.sql

# To restore them in RDS
psql -f postgres_schema_only.sql -h <your_RDS_hostname> -U postgres -d postgres -W

Export the necessary schemas and restore them, similar to what was done for the users. You may skip Supabase’s default schemas if they are not needed.

Step 3: Setup for logical replication

On Source

  • Create publication for the tables in required schemas.
#Create publication for tables in schema_1 and schema_2. Add the schema names as per your environment.
CREATE PUBLICATION <publication_name> FOR TABLES IN SCHEMA schema_1,schema_2;
  • Check the replication slot status (the slot appears on the source once the subscription is created on the target):
#To check the status of replication_slot
select * from pg_replication_slots;

Make sure that replication is active.

On Target

  • Create subscription for the publication
CREATE SUBSCRIPTION <subscription_name>
CONNECTION 'dbname=postgres host=<your_supabase_hostname> user=postgres password=<postgres_password>'
PUBLICATION <publication_name>
WITH (copy_data = true);

A replication slot will be automatically created on the source.

  • Check the subscription status
select * from pg_subscription_rel;

You can get the information about the relationship between subscriptions and the tables they replicate.
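
For a quick readability check, the srsubstate column can be decoded as in this small sketch:

-- srsubstate: 'i' = initializing, 'd' = copying data,
--             's' = synchronized, 'r' = ready (streaming normally)
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;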

Step 4: Validate the data migration

Ensure that all tables, indexes, and constraints have been imported correctly. Compare the data in your AWS RDS instance with the original data in Supabase to ensure accuracy. You can use DBeaver tool for validation.
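
As a basic sanity check in addition to a GUI tool, you can compare per-table row counts on both sides; a hedged sketch that generates the count queries for every user table:

-- Run on both Supabase and RDS, then compare the outputs
SELECT format('SELECT %L AS table_name, count(*) FROM %I.%I;',
              schemaname || '.' || relname, schemaname, relname) AS count_query
FROM pg_stat_user_tables
ORDER BY schemaname, relname;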

Step 5: Migrate Sequences

There are a few limitations with the replication of sequence values with this approach. To overcome that, you can use the setval function in PostgreSQL to set the value of a sequence. This is particularly useful when you want to ensure that a sequence is synchronised with the maximum value in a table column, which is often necessary after data has been imported or modified.
Note: You may want to generate dynamic SQL like the following beforehand and execute it just before the cutover.

SELECT setval('<sequence_name>', (SELECT max(<column_name>) FROM <table_name>), true);

However, we need to stop the connections to the source while setting these sequence values, so that no data flows from the application to the source (and from the source to the target), because the maximum sequence value may change if new data keeps arriving. We also need some downtime to point the application to the target (i.e., on-premises or a different cloud) database. In addition, such setval statements need to be created for each and every sequence in the database.
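
To avoid writing these statements by hand, they can be generated from the catalogs; a hedged sketch that covers sequences owned by table columns (serial and identity columns), to be reviewed before running:

-- Generate one setval statement per sequence owned by a table column
SELECT format(
         'SELECT setval(%L, (SELECT COALESCE(max(%I) + 1, 1) FROM %I.%I), false);',
         quote_ident(sns.nspname) || '.' || quote_ident(seq.relname),
         att.attname, tns.nspname, tbl.relname) AS setval_statement
FROM pg_class seq
JOIN pg_namespace sns ON sns.oid = seq.relnamespace
JOIN pg_depend dep ON dep.classid = 'pg_class'::regclass
                  AND dep.objid = seq.oid
                  AND dep.refclassid = 'pg_class'::regclass
                  AND dep.deptype IN ('a', 'i')
JOIN pg_class tbl ON tbl.oid = dep.refobjid
JOIN pg_namespace tns ON tns.oid = tbl.relnamespace
JOIN pg_attribute att ON att.attrelid = tbl.oid AND att.attnum = dep.refobjsubid
WHERE seq.relkind = 'S';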

Step 6: Cutover

After swapping the connection strings in your application from the old database to the new one and redirecting all traffic, we need to tear down the replication between the old and new databases.

On New database (Target)

We need to disable and then drop the subscription:

ALTER SUBSCRIPTION <subscription_name> DISABLE;

DROP SUBSCRIPTION <subscription_name>;

The replication slot will be automatically dropped once the subscription is dropped.

On Old database(Source)

We need to drop the publication

DROP PUBLICATION <publication_name>;

Conclusion

Migrating from a managed service like Supabase to an on-premises or managed database service such as AWS RDS is a complex endeavor. It requires a strategic approach to ensure minimal disruption and data integrity. Begin by exporting users and schema from Supabase, noting that on Supabase the postgres user does not have full superuser privileges; even so, using the postgres user for migration tasks is advisable. Next, leverage logical replication to synchronize data with minimal downtime, and carefully validate data and sequences in the target environment.

Challenges include ensuring user permissions and system compatibility, as Supabase users cannot be granted replication roles and version differences may require specific handling. By meticulously planning, executing, and validating each step, organizations can smoothly transition their database infrastructure to AWS RDS, achieving better control, performance, and scalability.

Stay tuned and Happy Migrating …

PostgreSQL performance tuning from Manual to Automatic – DBtune

Introduction:

In the previous blog post – The DNA of DBA: Mastering Parameter Tuning, we explored how mastering PostgreSQL parameter tuning is essential for every successful DBA, and how leveraging DBtune’s AI-driven insights can automate optimization. Now, let’s delve deeper into the mechanics of database tuning and how DBtune elevates this practice to new heights.

In today’s fast-paced digital world, ensuring your databases perform at peak efficiency is more challenging than ever. Enterprises face pressures from slow database queries, high infrastructure costs, and frequent downtimes, which significantly impact business operations and user experience.

What is Database tuning?

Database tuning is the practice of keeping a database fit and responsive. As databases change, grow, and slow down based on workloads and machine specifications, tuning adapts a database to its current use-case, workload, and machine. This includes adjustments to queries, DBMS parameters, indexes, and OS parameters.

Why does tuning matter?

From a technical perspective, tuning has a:

  • Direct impact on system performance, such as TPS (Transactions Per Second) with throughput and average query time with latency.
  • Improved scalability.
  • Enhanced reliability.

From a business perspective, it:

  • Decreases infrastructure costs.
  • Increases user satisfaction.
  • Reduces downtime.
  • Boosts productivity.

This blog covers DBMS parameter tuning of PostgreSQL using DBtune, an AI-powered solution designed to transition manual parameter tuning to automatic optimization, enhancing PostgreSQL performance with minimal effort. Parameter tuning involves adjusting the knobs to best fit the workload, and these parameters are application-dependent.

Meet DBtune: Your Database tuning friend

DBtune is a Software as a Service (SaaS) platform that uses advanced AI techniques to analyze and optimize PostgreSQL configurations. By continuously learning from real-world usage patterns, DBtune ensures your database runs at its best, adapting to changing demands without manual intervention.

Key Features of DBtune:

  • Automated Optimization: DBtune removes the guesswork from tuning by automatically adjusting parameters to enhance performance.
  • Real-time Monitoring: Database performance metrics are presented on an intuitive dashboard.
  • Cost Efficiency: Reduces expenses by optimizing infrastructure usage.
  • Easy Integration: Supports various PostgreSQL environments, including self-hosted and managed IaaS services like Amazon EC2, GCP VMs, Azure VMs, etc.

Optimising PostgreSQL with DBtune step by step:

Step 1: Setting up Environment

  • Choose database engine and hosting, then download and run the DBtune client on your CLI.
  • Then you can add the PostgreSQL database connection string in the dbtune_config.yml file.
  • Then start the DBtune client by running python3 dbtune.py, and initiate the tuning session by clicking the start tuning button on the UI page.

Step 2: Initiating the tuning session

  • Once the DBtune client is running, head back to the DBtune dashboard to monitor the tuning session. We can apply some workloads like pgbench, benchbase, and other benchmark tests to monitor the database performance.
  • DBtune runs 30 iterations through various configurations, continuously learning and adapting to find the optimal settings for your specific workload. This lets it adjust dynamically to the database’s changing needs.

Step 3: Observing the results

  • The impact of DBtune is evident in the results. As the tuning session progresses, you can notice significant changes in performance metrics such as:
  • Increased Transactions Per Second (TPS): up to a 4x increase in TPS, meaning the database can handle more transactions, thanks to optimised configurations.
  • Reduced Query Latency: up to a 2.2x reduction in query latency, ensuring faster application response times for users.
  • Optimised Resource Usage: efficient utilisation of CPU, memory, and other resources, resulting in potential cost savings on infrastructure.

Conclusion:

DBtune offers a revolutionary solution for optimising PostgreSQL performance through its AI-powered automation and user-friendly interface. By seamlessly integrating parameter tuning techniques, it improves query speed and maximises resource utilisation, resulting in enhanced database efficiency and reduced infrastructure costs.

Ready to experience these benefits and see how DBtune transforms your database environment into a high-performing, cost-effective asset?

Setting up pgBadger in the PostgreSQL

Introduction

PostgreSQL, as a widely adopted relational database management system, powers applications ranging from small-scale projects to enterprise-level solutions. Monitoring and analyzing PostgreSQL performance and operational logs are essential tasks for ensuring database reliability, optimizing performance, and troubleshooting issues effectively. pgBadger is an open-source PostgreSQL log analyzer that provides fast, detailed, and user-friendly log analysis reports. Written in Perl, pgBadger processes log output from a running PostgreSQL instance and transforms it into comprehensive HTML reports. These reports offer valuable insights into various aspects of PostgreSQL performance and operations, including error occurrences, checkpoint behavior, vacuum operations, and more.

Key Features of pgBadger

  1. Performance Analysis: pgBadger helps you analyze the performance of your PostgreSQL instance by highlighting slow queries and providing details on query execution times.
  2. Error Reporting: The tool efficiently logs and reports errors, making it easier to identify and troubleshoot issues within your PostgreSQL system.
  3. Checkpoint and Vacuum Behavior: pgBadger provides insights into checkpoint and vacuum operations, which are crucial for database maintenance and performance.
  4. Detailed Reports: The HTML reports generated by pgBadger are easy to read and navigate, with interactive charts and graphs that provide a visual representation of the data.
  5. Customization: The tool offers various options to customize the report output, including filtering logs by date range, log level, and more.

Installation Prerequisites

pgBadger is written in pure Perl, a highly versatile programming language known for its text processing capabilities. To use pgBadger, you need to have a modern Perl distribution installed on your system. Additionally, the generated HTML reports utilize a JavaScript library for rendering interactive charts and graphs, but no additional software is required for this as your web browser will handle the rendering.

Perl Installation

Perl needs to be installed on your system. You can install it using the following command:

 yum install -y perl perl-devel

This command installs both the Perl interpreter and the Perl development libraries, ensuring that you have everything needed to run pgBadger effectively.

PgBadger Installation

Step 1: Install pgBadger

To install pgBadger along with its dependencies, follow these steps:

Install pgBadger and Dependencies

You can install pgBadger and its necessary dependencies, such as perl-Text-CSV_XS and perl-UNIVERSAL-isa, using the following command:

sudo dnf install pgbadger

This command installs the latest version of pgBadger from the community repository.

To verify that pgBadger is installed correctly, you can check the version:

[postgres@localhost ~]$ pgbadger --version
pgBadger version 12.4

Step 2: PostgreSQL Configuration

To ensure pgBadger can effectively analyze your PostgreSQL logs, you need to modify the configuration in the postgresql.conf file. Here are the parameters you should set:

log_autovacuum_min_duration = 0 
log_checkpoints = on
log_connections = on
log_disconnections = on
log_duration = on
log_lock_waits = on
log_temp_files = 0 
log_statement = 'all'
log_min_duration_statement = 0 
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ' 

After making these changes, restart your PostgreSQL server to apply the new configuration:

sudo systemctl restart postgresql

Step 3: Generate pgBadger Report

After restarting PostgreSQL, you can generate the pgBadger report using the following command:

[postgres@localhost ~]$  pgbadger -f stderr  -o file_name /var/lib/pgsql/16/data/log/postgresql-*.log
[========================>] Parsed 29915 bytes of 29915 (100.00%), queries: 42, events: 13
LOG: Ok, generating txt report...

This command will process the log file and generate a report. The output format is inferred from the output file’s extension, so give it a .html extension (for example, -o report.html) to produce an HTML report that you can view in your web browser; without an extension, a plain-text report is generated, as shown above.

Best Practices for Using pgBadger

Following best practices ensures you get the most out of pgBadger and maintain an optimized PostgreSQL environment:

  • Regularly Rotating and Archiving Log Files:
    • Log Rotation: Configure PostgreSQL to rotate logs daily or when they reach a certain size to prevent large files from becoming unmanageable (a configuration sketch follows this list).
    • Archiving: Archive old log files to a separate directory or storage system to save space and maintain log history.
  • Choosing the Right Log Format:
    • Standard Error (stderr): The default and most commonly used format.
    • CSV Logs: Use CSV format for easier parsing and integration with other tools, if needed.
  • Keeping pgBadger Updated:
    • Latest Version: Regularly check for updates and upgrade to the latest version of pgBadger to benefit from new features and bug fixes.
    • Release Notes: Review release notes for any changes that might affect your setup.
  • Integrating pgBadger with Other Monitoring Tools:
    • Comprehensive Monitoring: Combine pgBadger with other tools like Nagios, Prometheus, or Grafana for a complete monitoring solution.
    • Alerting: Set up alerts for specific performance metrics to proactively address issues.
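
Assuming the built-in logging collector is in use, log rotation can be configured with the standard logging parameters; a minimal sketch (the values are illustrative):

-- Requires logging_collector = on (changing it needs a server restart)
ALTER SYSTEM SET log_rotation_age = '1d';              -- rotate daily
ALTER SYSTEM SET log_rotation_size = '100MB';          -- or when the file reaches 100 MB
ALTER SYSTEM SET log_filename = 'postgresql-%a.log';   -- weekday-based names give a 7-day cycle
ALTER SYSTEM SET log_truncate_on_rotation = on;
SELECT pg_reload_conf();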

Conclusion:

pgBadger is a powerful tool for PostgreSQL DBAs and developers to make use of PostgreSQL log files for monitoring and to take the required actions. Its fast processing capabilities, comprehensive reports, and customization options make it invaluable for optimizing database performance, troubleshooting issues, and ensuring reliable database operations.

With pgBadger installed, you can now use it to analyze your PostgreSQL logs and generate detailed HTML reports. This tool will help you gain insights into database performance, errors, and other critical operational metrics, ultimately aiding in maintaining a healthy and efficient PostgreSQL environment. Stay tuned for more info

Understanding PostgreSQL User Management and Password Encryption

In PostgreSQL, user management plays a crucial role in controlling access to databases, ensuring data security, and facilitating efficient database administration. Understanding how PostgreSQL handles user accounts, privileges, and authentication mechanisms is essential for maintaining a secure and well-managed database environment. In this blog post, we will delve into how PostgreSQL handles user authentication, password storage, and how to change password encryption methods.

PostgreSQL User Management

PostgreSQL uses a role-based authentication system where users (or roles) can be granted specific privileges to access databases and perform operations. Let’s explore a practical example using PostgreSQL’s pg_shadow view, which provides insights into user account details, including password hashes.

postgres=# select usename,usesysid,passwd from pg_shadow;
 usename  | usesysid |                                                                passwd                                                                 
----------+----------+---------------------------------------------------------------------------------------------------------------------------------------
 postgres |       10 | SCRAM-SHA-256$4096:E7aNSt5s+P8/Y8wkhiUiig==$KzGClfCwduklVgojVIDm/zTnIZ9k4+r9vpY2ul6NagU=:/McQWOQi0AJ5aBRK5cmvSfQOquPZAl5mPOoEM4fCwgg=
 akhil    |    25884 | SCRAM-SHA-256$4096:N020XkU3CMC9tCvXrEwZMQ==$BWnkZUvu4pD9l+RKYsoEa+RdpzAk1tMQnOXZd2XdWvM=:t4qO9sJgxWfOEdyWL8tkGc3C6k91pxr4hkiX96klt98=
(2 rows)

In the above query, pg_shadow displays user details, including password hashes encrypted using SCRAM-SHA-256.

Changing User Passwords

Changing the password for role akhil

mydb=# ALTER ROLE akhil PASSWORD 'newpassword';

This command updates the password for the user akhil to newpassword. PostgreSQL automatically encrypts the password using the configured encryption method (scram-sha-256 by default).

Managing Password Encryption Methods

PostgreSQL allows administrators to configure password encryption methods globally. Here’s how we can alter the encryption method

mydb=# ALTER SYSTEM set password_encryption to md5;
ALTER SYSTEM
mydb=# SELECT pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)
mydb=# show password_encryption ;
 password_encryption 
---------------------
 md5
(1 row)

Changing the password encryption to md5 alters how PostgreSQL encrypts user passwords.

Supported values are scram-sha-256 (the default) and md5. Older releases also accepted password, which sent the password in clear text; this is no longer supported and would in any case be a security concern.
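
To see which hashing scheme each existing role currently uses, you can inspect the stored verifier; a small sketch:

-- Identify the hash method of each stored password
SELECT usename,
       CASE WHEN passwd LIKE 'SCRAM-SHA-256%' THEN 'scram-sha-256'
            WHEN passwd LIKE 'md5%'           THEN 'md5'
            WHEN passwd IS NULL               THEN 'no password set'
            ELSE 'unknown'
       END AS hash_method
FROM pg_shadow;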

Impact of not altering the Password

Even though we have altered the password encryption method from SCRAM-SHA-256 to md5, the password is still stored with the old hash, i.e. SCRAM-SHA-256, because it was encrypted before the password encryption method was changed.

postgres=# show password_encryption;
 password_encryption 
---------------------
 md5
(1 row)

postgres=# select usename,usesysid,passwd from pg_shadow;
 usename  | usesysid |                                                                passwd                                                                 
----------+----------+---------------------------------------------------------------------------------------------------------------------------------------
 postgres |       10 | SCRAM-SHA-256$4096:E7aNSt5s+P8/Y8wkhiUiig==$KzGClfCwduklVgojVIDm/zTnIZ9k4+r9vpY2ul6NagU=:/McQWOQi0AJ5aBRK5cmvSfQOquPZAl5mPOoEM4fCwgg=
 akhil    |    25884 | SCRAM-SHA-256$4096:N020XkU3CMC9tCvXrEwZMQ==$BWnkZUvu4pD9l+RKYsoEa+RdpzAk1tMQnOXZd2XdWvM=:t4qO9sJgxWfOEdyWL8tkGc3C6k91pxr4hkiX96klt98=
(2 rows)

So, we need to reset the password after changing the password encryption method. PostgreSQL will now store the new password as an md5 hash:

postgres=# ALTER ROLE akhil PASSWORD 'somepassword';
ALTER ROLE
postgres=# select usename,usesysid,passwd from pg_shadow;
 usename  | usesysid |                                                                passwd                                                                 
----------+----------+---------------------------------------------------------------------------------------------------------------------------------------
 postgres |       10 | SCRAM-SHA-256$4096:E7aNSt5s+P8/Y8wkhiUiig==$KzGClfCwduklVgojVIDm/zTnIZ9k4+r9vpY2ul6NagU=:/McQWOQi0AJ5aBRK5cmvSfQOquPZAl5mPOoEM4fCwgg=
 akhil    |    25884 | md575cb1fb77702b4fdd2501914c6d6d1a3
(2 rows)

Now, we can see that the password hash for the user akhil is stored using the md5 encryption method.

Use-Case

Assume you want to use an extension/tool that supports only md5, while your password encryption method is SCRAM-SHA-256. In this case, changing only the password encryption to md5 does not allow you to access or use the extension/tool; you need to reset the password after changing the encryption method, as shown above. Recently, one of our clients faced this very issue.

Conclusion

PostgreSQL offers robust mechanisms for managing user authentication and password security. By understanding how PostgreSQL encrypts passwords and configuring encryption methods, administrators can ensure database security aligns with organizational policies and best practices.

Managing PostgreSQL users and passwords effectively not only enhances security but also ensures seamless operation and compliance with data protection standards. In this blog, we delved into how PostgreSQL handles user authentication, password storage, and how to change password encryption methods, as well as the impact and necessity of resetting passwords after altering the password encryption method.

Stay tuned for more info

PostgreSQL MVCC

Introduction:

PostgreSQL is a powerful open-source relational database system that uses a sophisticated concurrency control mechanism called Multi-Version Concurrency Control (MVCC). MVCC is fundamental for maintaining data consistency and enabling concurrent transactions without locking issues. This blog post will delve into the intricacies of PostgreSQL MVCC, provide practical examples, and outline best practices to follow.

What is MVCC?

MVCC, or Multi-Version Concurrency Control, is a database management technique that provides a consistent view of the database to each transaction by maintaining multiple versions of data. This allows concurrent transactions to read and write data without interfering with each other, thus improving performance and maintaining data integrity.

How MVCC Works in PostgreSQL

PostgreSQL implements MVCC by keeping multiple versions of a data row. Each row version is tagged with transaction IDs that indicate when the row was created and when it was deleted (if applicable). This mechanism allows PostgreSQL to provide a snapshot of the database at the start of each transaction, ensuring a consistent view.

Key Concepts

  • Transaction ID (TXID): A unique identifier assigned to each transaction.
  • Snapshot: A view of the database at a particular point in time.
  • Visibility Rules: Determine which version of a row is visible to a transaction based on TXIDs.

Step 1: Create a table

Create the table phone with the columns id and balance:

mydb=# create table phone (id int, balance money);
CREATE TABLE

Insert a row into the table:

mydb=# insert into phone values (1,100);
INSERT 0 1

Insert another row:

mydb=# insert into phone values (2,200);
INSERT 0 1

We then add two more rows and check the values:

mydb=# insert into phone values (3,300),(4,400);
INSERT 0 2

Verify the inserted rows:

mydb=# select * from phone ;
 id | balance 
----+---------
  1 | $100.00
  2 | $200.00
  3 | $300.00
  4 | $400.00
(4 rows)

Step 2: Hidden columns

SELECT * does not return the hidden system columns; we have to select them explicitly:

mydb=# select ctid,xmin,Xmax, *from phone;
 ctid  | xmin | xmax | id | balance 
-------+------+------+----+---------
 (0,1) |  742 |    0 |  1 | $100.00
 (0,2) |  743 |    0 |  2 | $200.00
 (0,3) |  744 |    0 |  3 | $300.00
 (0,4) |  744 |    0 |  4 | $400.00
(4 rows)
  • ctid – the row identifier (physical location of the row version) within its page.
  • xmin – the ID of the transaction that created the row version.
  • xmax – the ID of the transaction that deleted (or is attempting to delete) the row version.

These columns are necessary for the MVCC engine. Postgres can store several versions of a single row at a time, and those hidden columns help control the visibility of the versions for running transactions/queries.

Step 3: Begin a Transaction

Start a transaction and run a few experiments:

Begin the transaction:

mydb=# begin;
BEGIN 

Get the current value of the second row:

mydb=*# select ctid, xmin, xmax, * from phone where id = 2;
 ctid  | xmin | xmax | id | balance 
-------+------+------+----+---------
 (0,2) |  743 |    0 |  2 | $200.00
(1 row)

Update  the balance:

mydb=*#  update phone set balance = 450 where id=2;
UPDATE 1
  

Read the row back:

mydb=*# select  ctid, xmin, xmax, * from phone where id=2;
 ctid  | xmin | xmax | id | balance 
-------+------+------+----+---------
 (0,5) |  745 |    0 |  2 | $450.00
(1 row)

  
mydb=*# select txid_current_if_assigned();
 txid_current_if_assigned 
--------------------------
                      745
(1 row)

Note that the xmin of the new row version is equal to the ID of the currently running transaction.

Step 4: Connect from a second session

Open another session with Postgres and read the value of the record that is being modified:

Connect to the database:

[postgres@localhost ~]$ psql -p 5432 -d mydb -U postgres
psql (16.3)

Read the row:

mydb=# select  ctid, xmin, xmax, * from phone where id=2;
 ctid  | xmin | xmax | id | balance 
-------+------+------+----+---------
 (0,2) |  743 |  745 |  2 | $200.00
(1 row)

This version has a different ctid, xmin and xmax compared to the version created by the first transaction: xmax is now 745, meaning this row version is marked as updated by transaction 745, but since that transaction has not committed yet, the second session still sees the old version.

Now, in the first session, commit the transaction:

mydb=*# commit;
COMMIT

In the second session, read the row one more time:

mydb=# select  ctid, xmin, xmax, * from phone where id=2;
 ctid  | xmin | xmax | id | balance 
-------+------+------+----+---------
 (0,5) |  745 |    0 |  2 | $450.00
(1 row)

That new version of the row was committed and is visible to all future queries/transactions.

Best Practices for Using MVCC in PostgreSQL

1. Use Appropriate Isolation Levels

PostgreSQL supports different isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable (in PostgreSQL, Read Uncommitted behaves the same as Read Committed). Choose the appropriate level based on your application’s consistency and performance requirements; a short example follows the list below.

  • Read Committed: Suitable for most applications, providing a balance between performance and consistency.
  • Repeatable Read: Use when you need to ensure that reads within a transaction are consistent.
  • Serializable: Provides the highest level of isolation but can impact performance due to increased locking.
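
For reference, a minimal sketch of choosing an isolation level per transaction, using the phone table from the example above:

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- all reads in this transaction see one consistent snapshot
SELECT count(*) FROM phone;
COMMIT;

-- Or change the default for the current session
SET default_transaction_isolation = 'repeatable read';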

2. Keep Transactions Short

Long-running transactions can lead to increased bloat and impact database performance. Aim to keep transactions as short as possible to reduce the number of old row versions and improve MVCC efficiency.

3. Regularly Vacuum the Database

Vacuuming helps to reclaim storage occupied by dead row versions and maintain MVCC performance. Schedule regular VACUUM operations, especially on frequently updated tables.

4. Monitor and Manage Bloat

Bloat occurs when the database accumulates too many dead row versions. Use tools like pg_stat_user_tables and pgstattuple to monitor bloat and take corrective actions.
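
For example, a quick way to spot tables accumulating dead tuples, using the statistics views:

-- Tables with the most dead row versions
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;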

5. Optimize Queries and Indexes

Efficient queries and proper indexing can minimize the impact on MVCC performance. Use EXPLAIN to analyze query plans and create indexes to speed up data retrieval.

Conclusion

PostgreSQL’s MVCC is a robust mechanism that allows for efficient and concurrent data access while maintaining consistency. By understanding how MVCC works and following best practices, you can optimize your PostgreSQL database for better performance and reliability. Whether you’re handling basic transactions or managing complex concurrent operations, MVCC ensures your data remains consistent and accessible.


Streaming PostgreSQL Data to Kafka Topics using Debezium

Introduction:

In today’s fast-paced digital landscape, real-time data processing is no longer a luxury but a necessity. Businesses need to react to changes instantly, whether for updating inventories, personalising user experiences, or detecting fraud. This need for immediacy has led to the adoption of Change Data Capture (CDC) techniques, which enable the continuous capture and delivery of database changes to downstream systems.

Why this setup?

Setting up a data pipeline from PostgreSQL to Kafka using Debezium CDC and docker offers several compelling advantages:

  • Real-time Data Capture: Debezium captures changes to data as they occur, allowing applications to react in near real-time.
  • Low impact on source systems: Debezium uses the transaction log to capture data changes, instead of running batch jobs, which can reduce the impact on source systems.
  • Fault tolerance: Debezium ensures continuous data availability and reliability. Even if an application stops unexpectedly, it can resume reading from the topic where it left off after restarting.
  • Snapshotting: Debezium captures a consistent snapshot of the database to ensure a reliable starting point.
  • Automatic topic creation: Debezium automatically generates kafka topics for each table.
  • Data serialisation: Debezium supports various data formats, including Avro, JSON Schema, and Protobuf.
  • Low delay: Debezium produces change events with a very low delay, in the millisecond range, while avoiding increased CPU usage.
  • Scalability: Kafka clusters are easy to build and scale.

Overview of the setup:

In this blog, we’ll stream data from PostgreSQL to Kafka topics using Debezium Change Data Capture (CDC) on Docker. The setup involves using Docker Compose to create and manage containers for PostgreSQL, Kafka, Debezium, and Kafka UI, enabling a seamless, real-time data pipeline.

Components: 

  • PostgreSQL: An open-source relational database management system (RDBMS) that stores the data. We’ll use the logical replication feature of PostgreSQL to capture changes.
  • Kafka: A distributed streaming platform that serves as the intermediary, capturing the data changes and making them available to downstream consumers.
  • Debezium: An open-source CDC tool that captures row-level changes in the database and sends them to Kafka topics. It leverages PostgreSQL’s logical decoding feature to stream changes in real-time.
  • Kafka UI: A web-based user interface to monitor and interact with Kafka clusters, topics, and messages.

Step by step Guide:

  • Docker Compose File: Explains the structure and purpose of each service in the docker-compose.yml file. Create a new folder, and within that, create the docker-compose.yml file with relevant content as described below.
  • PostgreSQL Dbserver: Describes the setup of the PostgreSQL container.
  • Kafka: Explains the configuration of the Kafka broker.
  • Connect: Details the Debezium Connect service setup.
  • Kafka-UI: Describes the Kafka UI for monitoring.

Docker compose file:

version: '3.9'
services:
  postgres:
    image: postgres:16
    container_name: dbserver
    hostname: dbserver
    ports:
      - 30432:5432
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=mydb

  kafka:
    image: apache/kafka
    container_name: kafka
    hostname: kafka
    ports:
      - 9092:9092
    environment:
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - KAFKA_ADVERTISED_HOST_NAME=kafka
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
      - KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1
    volumes:
      - ./cfg/server.properties:/mnt/shared/config/server.properties

  connect:
    image: debezium/connect:nightly
    container_name: connect
    hostname: connect
    ports:
      - 8083:8083
    depends_on:
      - kafka
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - SCHEMA_HISTORY_INTERNAL_KAFKA_BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - CONFIG_STORAGE_TOPIC=connect_configs
      - OFFSET_STORAGE_TOPIC=connect_offsets
      - STATUS_STORAGE_TOPIC=connect_statuses

  kafka-ui:
    image: provectuslabs/kafka-ui
    container_name: kafkaui
    hostname: kafkaui
    depends_on:
      - kafka
    ports:
      - 8888:8080
    restart: always
    environment:
      - KAFKA_CLUSTERS_0_NAME=local
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092

Running the containers: Start the Docker cluster by running the command: ‘docker compose up -d‘

Configuring PostgreSQL for CDC:

Enable logical replication in PostgreSQL by modifying the ‘postgresql.conf‘ file with the following settings:

wal_level = logical 
max_wal_senders = 10

Entries to add to the ‘pg_hba.conf‘ file (trust authentication is used here only for this demo setup):

host replication postgres 0.0.0.0/0 trust
host replication user 0.0.0.0/0 trust

Restart the PostgreSQL server/services after making the changes to ‘postgresql.conf’ and ‘pg_hba.conf’ files.
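
After the restart, you can verify the settings from psql and, later on (once the connector from the next step is registered), confirm that Debezium has created its replication slot and, with the pgoutput plugin, a publication; a small sketch:

-- Confirm logical decoding is enabled
SHOW wal_level;          -- expected: logical
SHOW max_wal_senders;

-- After the connector is registered (next step)
SELECT slot_name, plugin, active FROM pg_replication_slots;
SELECT pubname FROM pg_publication;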

Setting up Postgres Connector for Debezium:

Define the connector configuration as JSON in the /kafka/config directory. Save the following content in a file named ‘connect-postgres-source.properties’ within the /kafka/config directory.

{
"name": "postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "dbserver",
"database.port": "5434",
"database.user": "user",
"database.password": "postgres",
"database.dbname": "mydb",
"topic.prefix": "pgdata",
"plugin.name": "pgoutput"
}
}

Adding the Postgres Connector for Debezium:

  • Use curl or Postman to POST this configuration to the Kafka Connect REST API, from within the /kafka/config directory:
curl -d @"connect-postgres-source.properties" -H "Content-Type: application/json" -X POST http://localhost:8083/connectors

Monitoring and verifying Data flow:

Once the Postgres Connector for Debezium is successfully created, Debezium automatically creates a unique topic for each table and streams the data from the initial snapshot. You can now verify the data flow from PostgreSQL to Kafka, and monitor it from the Kafka UI at http://localhost:8888

Following the initial snapshot, any change that gets done on the Postgres tables is automatically streamed into the Kafka topics, which can again be verified from the Kafka UI at http://localhost:8888

In the Kafka UI, we can see the topics that contain the PostgreSQL data. We can view the actual change events by drilling down into the topics.

Conclusion:

In this blog, we set up a real-time data pipeline from PostgreSQL to Kafka using Debezium CDC and Docker. This approach ensures minimal impact on source systems, fault tolerance, low latency, and scalability. Using Docker Compose, we easily deployed PostgreSQL, Kafka, Debezium, and Kafka UI for efficient data streaming and monitoring. This setup enables businesses to react instantly to data changes, enhancing operational efficiency.

Experience the power of real-time data streaming with PostgreSQL, Kafka, and Debezium. Happy streaming!