
PostgreSQL Partitioning Made Easy: Features, Benefits, and Tips

Partitioning is a powerful database technique designed to improve the performance, scalability, and manageability of large tables. By dividing a large table into smaller, more manageable pieces, partitioning allows for more efficient data access and maintenance. In PostgreSQL, one of the leading open-source database systems, partitioning plays a crucial role in optimizing the handling of large datasets.

This guide will walk you through the fundamentals of partitioning in PostgreSQL, highlighting its key advantages and exploring the different partitioning strategies available. By the end, you’ll have a solid understanding of how to choose the right partitioning approach to best meet your specific needs, whether you’re managing vast amounts of data or looking to optimize your database’s performance.

What Is Partitioning?

Partitioning allows you to split a database table into multiple segments based on conditions you define. Selecting appropriate criteria is crucial, as it should align with the characteristics of your data and its intended usage. Although partitioned tables are physically stored in separate pieces, they appear unified to applications, allowing seamless access and operations.

Why Use Partitioning?

Partitioning significantly improves database performance by optimizing data access speed while streamlining database management and maintenance processes.

1. Boost Query Performance

Partitioning optimizes data access and processing through more efficient resource usage:

Focused Data Access:

By applying filters in your queries, PostgreSQL limits the search to relevant partitions. This reduces disk operations and uses memory caching for frequently accessed data.

Enhanced Scalability in PostgreSQL 12+:

Earlier PostgreSQL versions suggested keeping partitions to about 100 for speed. Since version 12, improvements have enabled the efficient management of thousands of partitions.

2. Simplify Maintenance

Partitioning makes routine tasks like adding, modifying, or removing data easier and faster:

Segmented Data Management:

Consider a system that keeps five years of sales records, organized by month. When a new month starts, you can add a new partition and remove the outdated partitions from five years ago.

Faster Deletions:

Dropping or truncating entire partitions is much quicker than deleting rows one by one. These methods also reduce the load on maintenance processes like VACUUM.
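
For example, a monthly retention job might look like the hedged sketch below (the table and partition names are hypothetical and follow the monthly layout described above):

-- Option 1: detach first, keeping the old data around as a standalone table
ALTER TABLE sales DETACH PARTITION sales_jan2019;
DROP TABLE sales_jan2019;

-- Option 2: drop the expired partition directly
DROP TABLE sales_feb2019;

-- Compare with a row-by-row delete, which creates dead tuples that VACUUM must clean up:
-- DELETE FROM sales WHERE sale_date >= '2019-02-01' AND sale_date < '2019-03-01';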

Partitioning Options in PostgreSQL

PostgreSQL supports several partitioning strategies, allowing you to organize data according to your operational requirements.

1. Range-Based Partitioning

Description: Divides data into segments based on specified ranges.

Best For: Time-series data, such as logs or historical records.
Examples: Data grouped by date, event date, etc.

postgres=# CREATE TABLE sales (
    sale_id   SERIAL,
    sale_date DATE NOT NULL,
    amount    NUMERIC
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_jan2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE sales_feb2024 PARTITION OF sales
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

postgres=# INSERT INTO sales (sale_date, amount) VALUES
    ('2024-01-15', 150.00),
    ('2024-02-10', 200.00);
INSERT 0 2

postgres=# SELECT * FROM sales WHERE sale_date = '2024-01-15';
 sale_id | sale_date  | amount 
---------+------------+--------
       1 | 2024-01-15 | 150.00
(1 row)
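
To see the "focused data access" benefit mentioned earlier, you can ask for the plan; only the matching partition should be scanned (a minimal sketch; the exact plan text varies by version, data volume, and settings):

postgres=# EXPLAIN SELECT * FROM sales WHERE sale_date = '2024-01-15';
-- Expect the plan to reference only sales_jan2024, for example a line similar to:
--   Seq Scan on sales_jan2024 sales
--     Filter: (sale_date = '2024-01-15'::date)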

2. List-Based Partitioning

Description: Organizes data into partitions based on specific, predefined values.

Best For: Grouping categorical data, like regions or departments.

Examples: Splitting records by geographic area or job title.

postgres=# CREATE TABLE customers (
    customer_id SERIAL,
    region      TEXT NOT NULL,
    name        TEXT
) PARTITION BY LIST (region);

CREATE TABLE customers_us PARTITION OF customers FOR VALUES IN ('US');
CREATE TABLE customers_europe PARTITION OF customers FOR VALUES IN ('Europe');

postgres=# INSERT INTO customers (region, name) VALUES
    ('US', 'John Doe'),
    ('Europe', 'Jane Smith');
INSERT 0 2

postgres=# SELECT * FROM customers WHERE region = 'Europe';
 customer_id | region |    name    
-------------+--------+------------
           2 | Europe | Jane Smith
(1 row)

3. Hash-Based Partitioning

Description: Distributes data evenly across partitions using a hash function.

Best For: Scenarios where balanced distribution is crucial to prevent performance bottlenecks.

postgres=# CREATE TABLE user_sessions (
    session_id   SERIAL,
    user_id      INT NOT NULL,
    session_data TEXT
) PARTITION BY HASH (user_id);

-- With MODULUS 4, all four remainders (0-3) need a partition;
-- otherwise inserts whose hash falls into a missing remainder will fail.
CREATE TABLE user_sessions_part1 PARTITION OF user_sessions
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_sessions_part2 PARTITION OF user_sessions
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_sessions_part3 PARTITION OF user_sessions
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_sessions_part4 PARTITION OF user_sessions
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

postgres=# INSERT INTO user_sessions (user_id, session_data) VALUES
    (1, 'Session 1 Data'),
    (5, 'Session 5 Data');
INSERT 0 2

postgres=# SELECT * FROM user_sessions WHERE user_id = 1;
 session_id | user_id |  session_data  
------------+---------+----------------
          1 |       1 | Session 1 Data
(1 row)

4. Composite Partitioning

For more intricate setups, you can create subpartitions within primary partitions, a method called composite or hierarchical partitioning.

For instance, you could divide a sales table first by month and then further by product category. This structure improves query performance, especially for operations requiring specific subsets of data, like retrieving sales for a specific product in a particular month.

Composite partitions can be nested to multiple levels. With declarative partitioning, PostgreSQL supports the combinations LIST-LIST, LIST-RANGE, LIST-HASH, RANGE-RANGE, RANGE-LIST, RANGE-HASH, HASH-HASH, HASH-LIST, and HASH-RANGE.

For HASH-LIST, HASH-RANGE, and HASH-HASH composite partitions, make sure a partition exists for every hash remainder, because the hash function can route a row to any of them based on its hash value.

postgres=# CREATE TABLE product_sales (
    sale_id   SERIAL,
    sale_date DATE NOT NULL,
    category  TEXT,
    amount    NUMERIC
) PARTITION BY RANGE (sale_date);

CREATE TABLE product_sales_2024 PARTITION OF product_sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    PARTITION BY LIST (category);

CREATE TABLE product_sales_2024_electronics PARTITION OF product_sales_2024
    FOR VALUES IN ('Electronics');
CREATE TABLE product_sales_2024_clothing PARTITION OF product_sales_2024
    FOR VALUES IN ('Clothing');

postgres=# INSERT INTO product_sales (sale_date, category, amount) VALUES
    ('2024-03-15', 'Electronics', 300.00),
    ('2024-03-16', 'Clothing', 150.00);
INSERT 0 2

postgres=# SELECT * FROM product_sales WHERE category = 'Electronics';
 sale_id | sale_date  |  category   | amount 
---------+------------+-------------+--------
       1 | 2024-03-15 | Electronics | 300.00
(1 row)

Limitations of PostgreSQL Partitions

PostgreSQL partitions have a few limitations:

Foreign keys that reference a partitioned table are restricted: because any primary key or unique constraint on a partitioned table must include the partition key columns, other tables can only reference it on a key that includes those columns (and versions before PostgreSQL 12 do not support foreign keys that reference a partitioned table at all).

On older releases, row-level triggers had to be set up individually for each partition rather than on the partitioned table as a whole; newer PostgreSQL versions allow them to be created on the partitioned table itself, from where they are propagated to the partitions.

Range partitioning does not allow NULL values in the partition key; rows with a NULL key can only be stored in a default partition, if one exists.

Workaround for foreign key references to a partitioned table

Instead of creating the foreign key on the partitioned table, you can reverse the reference and put it on the other table.

Example: let's say the event table is partitioned; you cannot add this reference on the event table:

ALTER TABLE ONLY public.event
    ADD CONSTRAINT "FK_1" FOREIGN KEY (type_id) REFERENCES public.event_type(id);

You can instead add the reference on the event_type table like this:

ALTER TABLE ONLY public.event_type
    ADD CONSTRAINT "FK_1" FOREIGN KEY (id) REFERENCES public.event(type_id);

TL;DR
a) What are the main types of partitioning in PostgreSQL?
The main types are range, list, and hash partitioning.
b) Can I combine multiple partitioning strategies?
Yes, you can use subpartitioning to combine strategies, such as range and list.
c) How does partitioning improve query performance?
Partitioning reduces the amount of data scanned by focusing on relevant partitions, speeding up query execution.
d) Are there any limitations to PostgreSQL partitioning?
Limitations include the need for careful planning and potential overhead in managing partitions.
e) What tools can help with managing partitions?
Tools like pg_partman can simplify partition creation, management, and maintenance.

Conclusion: 
Partitioning in PostgreSQL significantly enhances database performance, scalability, and manageability by dividing large tables into smaller, more efficient segments. It boosts query performance by focusing on data access and optimizing disk usage, while also simplifying maintenance tasks like adding, modifying, or removing data. PostgreSQL offers various partitioning strategies—range, list, hash, and composite partitioning—each catering to specific data types and operational needs. Organizations can streamline their database management and improve overall system performance by choosing the appropriate partitioning method.

Understanding Replication Conflicts in PostgreSQL

In any PostgreSQL production database environment, high availability is paramount. Streaming replication is a powerful feature that enables high availability by creating a standby node which is a replica of the primary database. It allows the standby node to serve read-only queries, offloading read traffic from the primary and improving overall system performance. It also provides disaster recovery capabilities, as the standby can act as a backup that’s always up-to-date with the primary database. With such a setup, PostgreSQL provides a robust solution for businesses to handle large-scale workloads and ensure their applications remain operational even in case of hardware or software failures.

Even though streaming replication is beneficial, it also brings some challenges. One of the issues I've recently encountered in one of our clients' production environments is:

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

We observed this issue when trying to run a query on one of the standby nodes that is in streaming replication with the primary.

Common causes

This error occurs when a read-only query on the standby node tries to access row versions that are about to be or have been removed during the recovery process. Essentially, the query is trying to read data that no longer exists on the replica because it’s being cleaned up as part of the WAL recovery due to:

  • Long-running queries on the Standby server.
  • Replication Lag.
  • Vacuuming of old rows.

What is a replication conflict?

A replication conflict is a type of conflict that occurs only on the standby server, never on the primary. It arises when the recovery process cannot apply WAL information received from the primary because doing so would interfere with query processing on the standby. Some of these conflicts are:

  • Tablespace conflict occurs when a tablespace that queries on the standby are using for temporary files is dropped on the primary.
  • Lock conflict occurs when you query a table on standby while the same table is being modified on the primary.
  • Snapshot conflict occurs when a backend process tries to access rows on the standby server that have been vacuumed out on the primary.
  • Deadlock conflict occurs due to query cancellations occurring because of deadlocks on standby.

To monitor these replication conflicts, there is a pg_catalog view called pg_stat_database_conflicts. This view shows statistics on query cancellations caused by recovery conflicts on the standby server; the counters are only meaningful there, since these conflicts do not occur on the primary. Here is the output of this view:

postgres=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock | confl_active_logicalslot 
-------+-----------+------------------+------------+----------------+-----------------+----------------+--------------------------
     5 | postgres  |                0 |          0 |              0 |               0 |              0 |                        0
 16469 | test      |                0 |          0 |              0 |               0 |              0 |                        0
     1 | template1 |                0 |          0 |              0 |               0 |              0 |                        0
     4 | template0 |                0 |          0 |              0 |               0 |              0 |                        0
(4 rows)

Here you can see that there are no conflicts on my local standby. When a conflict does occur, the corresponding column beginning with 'confl_' is incremented, depending on the type of conflict.

How PostgreSQL handles these conflicts

PostgreSQL has a parameter max_standby_streaming_delay that determines what happens when WAL replay encounters a replication conflict. PostgreSQL suspends replay of the WAL information for at most max_standby_streaming_delay milliseconds; if the conflicting query is still running after that time, PostgreSQL cancels it. The default value is 30 seconds. If it is set to -1, the standby waits indefinitely and conflicting queries are never cancelled.
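
For example, to check the current value on a standby and raise it to five minutes, a reload is enough (a minimal sketch; pick a value that matches how long your standby queries typically run):

postgres=# SHOW max_standby_streaming_delay;
postgres=# ALTER SYSTEM SET max_standby_streaming_delay = '5min';
postgres=# SELECT pg_reload_conf();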

How to Avoid

To avoid replication conflicts, one simple option is to disable Hot Standby on the replica by setting hot_standby = off. This eliminates the chance of conflicts because the standby won’t process queries. However, this is only practical if the standby server is used exclusively for high availability and not for read queries.

To avoid lock conflicts, it’s important not to run queries that acquire an ACCESS EXCLUSIVE lock, such as DROP TABLE, TRUNCATE, or ALTER TABLE. A special case is VACUUM truncation, which can cause conflicts on the standby. While you can’t avoid the ACCESS EXCLUSIVE lock during VACUUM truncation, you can disable it for individual tables in the database.
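
If VACUUM truncation is the trigger, the truncation step can be switched off per table via its storage parameter (available since PostgreSQL 12; the table name below is only a placeholder):

postgres=# ALTER TABLE public.some_large_table SET (vacuum_truncate = off);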

For snapshot conflicts, enabling hot_standby_feedback on the standby can help prevent the primary from removing tuples that the standby still needs to see. However, this can lead to table bloat on the primary due to long-running queries, so use it with caution.
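
Enabling it on the standby is a reload-level change (a minimal sketch, run on the standby node):

postgres=# ALTER SYSTEM SET hot_standby_feedback = on;
postgres=# SELECT pg_reload_conf();
postgres=# SHOW hot_standby_feedback;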

Summary

In PostgreSQL, streaming replication is a vital feature for achieving high availability, data redundancy, and disaster recovery. While it offers many advantages, it also introduces challenges, particularly around replication conflicts that can occur on the standby server. The “canceling statement due to conflict with recovery” error is a common issue that arises when long-running queries or replication lag interferes with the WAL recovery process.

By understanding the different types of replication conflicts such as lock conflicts & snapshot conflicts, you can better manage and mitigate these issues. Monitoring tools like pg_stat_database_conflicts can help identify the occurrence of these conflicts and guide you in resolving them.

To prevent replication conflicts, you can consider strategies like disabling hot standby, avoiding certain types of queries that acquire exclusive locks, and enabling settings like hot_standby_feedback. However, each solution has its trade-offs, and it's crucial to strike a balance between performance and the prevention of conflicts. Ultimately, carefully managing replication, query performance, and system configuration will ensure your PostgreSQL setup remains robust, reliable, and efficient, even in the face of replication challenges.

Mastering the pg_settings view in PostgreSQL

Introduction

PostgreSQL, a powerful and highly customizable open-source relational database system, offers a wide range of tools to manage and optimize its configuration. One of the key resources for interacting with PostgreSQL’s settings is the pg_settings system catalog view. This view acts as a central interface for administrators and developers to access and modify database configuration parameters. Unlike the SHOW and SET commands, which allow for basic configuration management, pg_settings provides more granular details, such as the minimum and maximum allowable values for parameters, the source of the current settings, and whether changes require a database restart. This blog will explore the uses of the pg_settings system catalog view for understanding, viewing, and modifying the PostgreSQL configurations effectively.

Uses of pg_settings

pg_settings is ideal when you need:

  • A comprehensive view of current PostgreSQL configuration.
  • Metadata to understand the implications of specific parameters.
  • To troubleshoot or debug configuration issues.
  • To test parameter changes in a live session.
  • Insights into default settings, minimum/maximum values, and change contexts.

Step 1: Understanding pg_settings

The pg_settings view provides a structured look at PostgreSQL configuration parameters. Let's examine an example query:

postgres=# select * from pg_settings;
-[ RECORD 1 ]---+----------------------------------------
name            | allow_in_place_tablespaces
setting         | off
unit            | 
category        | Developer Options
short_desc      | Allows tablespaces directly inside pg_tblspc, for testing.
extra_desc      | 
context         | superuser
vartype         | bool
source          | default
min_val         | 
max_val         | 
enumvals        | 
boot_val        | off
reset_val       | off
sourcefile      | 
sourceline      | 
pending_restart | f

Let's break down the pg_settings output:

  • name : The name of the configuration parameter.
  • setting : The current value of the parameter.
  • unit : The unit of the setting, if applicable.
  • category : The parameter’s category (e.g., Memory, Developer Options).
  • short_desc : A brief description of what the parameter does.
  • extra_desc : An extended description of the parameter.
  • context : Where and how the parameter can be modified (e.g., superuser, session).
  • vartype : The type of the value (e.g., integer, boolean).
  • source : The origin of the current value.
  • min_val and max_val : The allowed range for the setting, if applicable.
  • boot_val : The value used when PostgreSQL starts up.
  • reset_val : The value that will be used if the parameter is reset.
  • sourcefile and sourceline : The configuration file and line number defining the parameter.
  • pending_restart : Indicates if a restart is pending for the change to take effect.

A few columns that need to be highlighted are:

CONTEXT: This column indicates whether a change to the parameter requires a RESTART or just a SIGHUP (reload).
UNIT: You may see a difference between the SHOW output and the value in pg_settings.setting – this is because pg_settings reports values in the base measurement unit given in the unit column.
SOURCE: As part of our standard database deployments, we follow the best practice of segregating the parameters into:
– parameters that need a restart
– parameters that need a reload
– specific extension-related parameters
– custom parameters per instance
This approach helps us to be more cautious while changing the parameters.
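
For example, a couple of simple queries against pg_settings can drive that segregation (illustrative only):

postgres=# SELECT context, count(*) FROM pg_settings GROUP BY context ORDER BY 2 DESC;
postgres=# SELECT name, setting FROM pg_settings WHERE context = 'postmaster';  -- need a restart
postgres=# SELECT name, setting FROM pg_settings WHERE context = 'sighup';      -- need a reload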

Step 2: Viewing a Specific Parameter

To view the details of a specific configuration parameter, use a SELECT query:

postgres=#  SELECT * FROM pg_settings WHERE name = 'log_statement';
-[ RECORD 1 ]---+------------------------------------
name            | log_statement
setting         | none
unit            | 
category        | Reporting and Logging / What to Log
short_desc      | Sets the type of statements logged.
extra_desc      | 
context         | superuser
vartype         | enum
source          | default
min_val         | 
max_val         | 
enumvals        | {none,ddl,mod,all}
boot_val        | none
reset_val       | none
sourcefile      | 
sourceline      | 
pending_restart | f

This query will return detailed information about the log_statement configuration, including its current value and the possible settings (none, ddl, mod, all).

Step 3: Modifying a Parameter Temporarily

To temporarily change a parameter, use the SET command:

postgres=#  SET log_statement TO mod;
SET

To verify that the change has taken place, run the same query again:

postgres=#  SELECT * FROM pg_settings WHERE name = 'log_statement';
-[ RECORD 1 ]---+------------------------------------
name            | log_statement
setting         | mod
unit            | 
category        | Reporting and Logging / What to Log
short_desc      | Sets the type of statements logged.
extra_desc      | 
context         | superuser
vartype         | enum
source          | session
min_val         | 
max_val         | 
enumvals        | {none,ddl,mod,all}
boot_val        | none
reset_val       | none
sourcefile      | 
sourceline      | 
pending_restart | f

The setting has been updated to mod, and the source column now reflects that the change was made for the current session.

Step 4: Direct Updates in pg_settings

While it might seem intuitive to update the pg_settings view directly, this approach only affects the current session and is ineffective for permanent changes:

postgres=# UPDATE pg_settings SET setting = 'all' WHERE name = 'log_statement';
 set_config 
------------
 all
(1 row)

UPDATE 0

Verify the change using:

postgres=#  SELECT * FROM pg_settings WHERE name = 'log_statement';
-[ RECORD 1 ]---+------------------------------------
name            | log_statement
setting         | all
unit            | 
category        | Reporting and Logging / What to Log
short_desc      | Sets the type of statements logged.
extra_desc      | 
context         | superuser
vartype         | enum
source          | session
min_val         | 
max_val         | 
enumvals        | {none,ddl,mod,all}
boot_val        | none
reset_val       | none
sourcefile      | 
sourceline      | 
pending_restart | f

This confirms that the setting has been applied for the current session. However, as you will see in the next steps, this change will be lost once the server is restarted.

Step 5: Checking with SHOW Command

To check the current setting of log_statement after making the change and restarting the database server:

postgres=# show log_statement;
 log_statement 
---------------
 none
(1 row)

Notice: Even though you updated the pg_settings view, the setting shows as none when queried with SHOW. This is because changes made through pg_settings are session-based and temporary; the underlying configuration hasn't been permanently updated.

In order to make permanent changes to any database configuration parameter(s), you can either update the postgresql.conf file directly with the appropriate value(s) for the corresponding parameter(s), or use the ALTER SYSTEM SET ... command.
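
For instance, to persist the log_statement change across restarts, ALTER SYSTEM writes the value to postgresql.auto.conf, and a reload is sufficient for this particular parameter:

postgres=# ALTER SYSTEM SET log_statement = 'mod';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# SHOW log_statement;
 log_statement
---------------
 mod
(1 row)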

Conclusion

The pg_settings view is a vital resource for PostgreSQL administrators and developers. It provides an in-depth view of the database’s configuration settings, enabling you to optimize performance, diagnose issues, and better understand the system’s behavior. By leveraging pg_settings, you can manage and fine-tune your PostgreSQL instance with enhanced precision and confidence. Furthermore, it empowers you to track and make informed changes to your configuration, ensuring the system operates at peak performance while meeting specific workload requirements. In essence, mastering pg_settings is key to maintaining a robust, efficient, and secure PostgreSQL environment.

Let us know how you handle postgresql.conf parameter changes by commenting below.

Secure and Seamless Distributed PostgreSQL Deployments: My experience at the Hyderabad Postgres Meetup

Last Friday, I had the privilege of attending the Hyderabad PostgreSQL User Group Meetup, a fantastic gathering of PostgreSQL enthusiasts and professionals. One of the highlights of the event was an enlightening session by Lahari Giddi, SDET at pgEdge, on Distributed PostgreSQL Deployments for Air-Gapped Systems.

Lahari’s talk was informative, delving into the critical role PostgreSQL plays in enabling secure and seamless deployments in isolated, high-security environments. The session shed light on innovative approaches and best practices for managing Distributed PostgreSQL deployments in air-gapped systems, a topic of immense relevance for organizations prioritizing data security.

The meetup was a vibrant platform for knowledge sharing, networking, and celebrating the versatility of PostgreSQL. It reaffirmed the power of community-driven learning and the ever-evolving potential of open-source databases.

The Concept of Air-Gapped Systems

Definition: Air gapping is a security measure that involves physically isolating a computer or network, preventing it from connecting to other network devices.

The session began by explaining air-gapped systems, a term that immediately caught my attention. These systems are physically isolated from external networks, making them a go-to solution for environments requiring heightened security. For someone like me who works in tech but hadn't deeply explored this concept, the talk's clear breakdown of physical isolation, data transfer controls, and robust security protocols was just awesome.

From defense systems to secure offline environments, it became evident why air-gapped solutions are essential in our interconnected yet vulnerable world.

Distributed PostgreSQL in Action

The heart of the session focused on Distributed PostgreSQL deployments, a powerful solution to handle large datasets securely, especially in air-gapped systems. Here are its capabilities:

  • Scalability to manage growing workloads
  • High Availability through replication strategies
  • Secure Integration with encryption and manual synchronization

These features resonated deeply, particularly as Lahari emphasized the importance of fault tolerance and offline functionality. It got me thinking: in scenarios like defense, where systems must function without network dependencies, distributed PostgreSQL offers an elegant yet robust solution.

Why pgEdge is a Perfect Fit for Air-Gapped Systems

One standout part of the session was how pgEdge caters specifically to air-gapped systems. Lahari detailed some impressive benefits:

  • End-to-end encryption for data security
  • Seamless offline failovers
  • Streamlined operations via CLI tools

Hearing about the implementation process and watching the demo made it easy to follow. The talk demystified it with a 3-step process – upgrading the repository, the CLI, and the platform components. The demonstration of commands and upgrades, from downloading repositories to upgrading components like Spock, made it look surprisingly approachable.

What’s Next

I’m going to continue digging into the Air-gapped system setup using the TAR myself and see if I can break it in any way (lol)

Stay tuned for my next blog!

Check out the YouTube video recording of the complete talk presentation here, for your reference : https://youtu.be/z0oTJK-RMTY

Efficient Data Management: Overcoming the Challenges of Large Tables with an Archival Strategy

In today’s data-driven world, businesses rely heavily on the power of databases to store and manage vast amounts of information. However, as data grows, it becomes more challenging to keep systems performant and efficient. One of the key strategies for managing large datasets in databases like PostgreSQL is an effective archival strategy.

Archiving is the process of moving older, less frequently accessed data from active tables to an archival table, which can later be moved to a more cost-effective, less performance-intensive storage solution if required. Without proper management of large tables, query performance can degrade, backups can become unwieldy, and maintenance tasks can become time-consuming. This is where having a well-defined archival strategy comes into play. In this blog post, let's discuss some of the challenges of managing large tables and the importance of having an archival strategy.

Challenges of Managing Large Size Tables

Managing large tables in PostgreSQL presents several challenges. Here are some of them,

  1. Query Performance Degradation: As tables grow in size, the time taken to run queries, especially complex ones, increases. Indexes may become less efficient, and queries can lead to significant I/O operations.
  2. Longer Backup and Restore Times: Large tables result in large database backups. Restoring such backups can be a lengthy process, which may disrupt business operations.
  3. Increased Resource Usage: Storing vast amounts of historical data can become costly in terms of CPU utilization and disk storage.
  4. Maintenance Overhead: Operations like vacuuming, reindexing, or analyzing large tables take longer and may lock the table, affecting other operations.

Importance of Having an Archival Strategy

An archival strategy is essential for several reasons. Here are some of them,

  1. Improved Performance: By archiving old data, we reduce the size of active tables, which leads to faster queries, better use of indexes, and improves overall database performance.
  2. Reduced Costs: Archiving allows you to move data from the active tables, where we can significantly cut operational expenses like computational and storage costs.
  3. Better Backup Management: Smaller active tables make backup and restore operations faster and less resource-intensive.
  4. Regulatory Compliance and Data Retention: In some industries, data retention is a legal requirement. A proper archival strategy ensures compliance with these regulations by safely storing older data in a manner that can be easily retrieved when necessary.

Archival Strategies

In managing large datasets, we have recently worked on two different archival strategies in two different scenarios.

Approach1 : Moving data through insert/delete

Scenario: Imagine we have a main table called main_table, which contains data that is up to 3 years old. This table has a created_at column, which stores the timestamp of when each record was created. The goal is to keep only the most recent 3 months of data in the main table. Any data older than 3 months should be moved to an archival table called archive_table.
Note: main_table and archive_table must have the same structure/DDL syntax.

In this approach, we implement a procedure that manually moves old data from main_table to archive_table based on the created_at timestamp: it inserts the old data into the archive table and then deletes it from the main table.

CREATE OR REPLACE PROCEDURE archive_old_data()
LANGUAGE plpgsql AS $$
DECLARE
    --Step1 declare the variables
    date TIMESTAMP := CURRENT_TIMESTAMP;
    rows_to_archive INTEGER;
    rows_archived INTEGER;
    rows_removed INTEGER;
BEGIN
    --Step2 Count the number of rows to archive
    SELECT COUNT(*) INTO rows_to_archive 
    FROM public.main_table 
    WHERE created_at < date - INTERVAL '3 months';
    
    --Step3 If no rows to archive, log and exit
    IF rows_to_archive = 0 THEN
        RAISE NOTICE 'No rows to archive.';
        RETURN;
    END IF;

    --Step4 Insert old transactions into the archive
    INSERT INTO public.archive_table
    SELECT * 
    FROM public.main_table 
    WHERE created_at < date - INTERVAL '3 months';

    --Step5 Count the number of rows inserted into the archive
    GET DIAGNOSTICS rows_archived = ROW_COUNT;

    --Step6 Verify if the number of rows archived matches the count
    IF rows_to_archive <> rows_archived THEN
        RAISE EXCEPTION 'Row count mismatch: expected %, archived %', rows_to_archive, rows_archived;
    END IF;

    --Step7 Delete archived transactions from the original table
    DELETE FROM public.main_table 
    WHERE created_at < date - INTERVAL '3 months';
    
    --Step8 Count the number of rows deleted from the main table
    GET DIAGNOSTICS rows_removed = ROW_COUNT;

    RAISE NOTICE '% rows archived and % rows deleted successfully', rows_archived,rows_removed;
END;
$$;

Let's break down the above procedure into 8 steps:

  • Step 1 Declare Variables: Set up variables to hold the current date and track row counts.
  • Step 2 Count Rows to Archive: Count how many rows are older than 3 months in main_table.
  • Step 3 Check for Rows to Archive: If there are no rows to archive, exit early.
  • Step 4 Insert into Archive Table: Insert rows older than 3 months into archive_table.
  • Step 5 Count Inserted Rows: Check how many rows were inserted into archive_table.
  • Step 6 Verify Row Count: Ensure the number of rows archived matches what was counted earlier.
  • Step 7 Delete Archived Data: Delete the archived rows from main_table.
  • Step 8 Count Deleted Rows: Count how many rows were deleted from main_table.
  • Finally, Log the Result: Output a success message with the counts of archived and deleted rows.

Adapt this procedure to your environment and schedule a cron job using the pg_cron extension to execute it daily during non-business hours, so the impact is minimal. If you are new to Postgres, or unfamiliar with pg_cron, check our detailed blog on scheduling a cron job with the pg_cron extension: https://opensource-db.com/automating-postgresql-tasks-with-pg_cron/
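
For illustration, assuming pg_cron is already installed and configured, the procedure could be scheduled with something like the following (the job name and schedule are arbitrary; verify the call against your pg_cron version):

postgres=# SELECT cron.schedule('archive-old-data', '0 1 * * *', $$CALL public.archive_old_data()$$);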

Approach2: Partitioning with attaching/detaching

Scenario: Consider a main table named main_table, which stores data up to 2 years old. The table is partitioned monthly by the created_at column, which holds the timestamp for when each record was created. The objective is to ensure that main_table only retains the most recent 12 months of data; any data older than 12 months should be moved to an archival table called archive_table, which is also partitioned.
Note: main_table and archive_table must have the same structure / DDL syntax.

In this approach, we detach partitions older than 12 months from main_table and attach them to archive_table.

#Step1 Creating the partitioned table for 12 months
postgres=# CREATE TABLE public.main_table_2024_12 PARTITION OF public.main_table
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
CREATE TABLE

#Step2 At the end of each month, Detach the partitioned table from the main table.
postgres=# ALTER TABLE public.main_table DETACH PARTITION public.main_table_2024_12;
ALTER TABLE

#Step3 At the end of each month, Attach the partitioned table to the archive table.
postgres=# ALTER TABLE public.archive_table ATTACH PARTITION public.main_table_2024_12 FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
ALTER TABLE

Let’s break-down the above steps:

  • Step 1: Create a partition for each month's data (for example, December 2024) within main_table, ensuring that all data created in that month is handled in its own partition. We have created one for December 2024; create one partition for each upcoming month accordingly.
  • Step 2: At the end of the month, detach the partition (main_table_2024_12) from main_table so that it is no longer part of the active table, preparing it for archival.
  • Step 3: Attach the detached partition to archive_table, effectively moving the data to the archival structure while maintaining the same partitioning scheme.

Adapt these commands to your environment and write a simple script that automates the whole process, as sketched below.
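
As a rough illustration, the monthly roll-over could be wrapped in an anonymous block like the one below. This is only a sketch: it assumes the main_table_YYYY_MM naming convention used above and should be adapted and tested for your environment.

DO $$
DECLARE
    cutoff    date;
    part_name text;
BEGIN
    -- Partition that has just fallen outside the 12-month retention window
    cutoff    := date_trunc('month', current_date) - interval '12 months';
    part_name := format('main_table_%s', to_char(cutoff, 'YYYY_MM'));

    -- Detach from the active table and attach to the archive table
    EXECUTE format('ALTER TABLE public.main_table DETACH PARTITION public.%I', part_name);
    EXECUTE format(
        'ALTER TABLE public.archive_table ATTACH PARTITION public.%I FOR VALUES FROM (%L) TO (%L)',
        part_name, cutoff, (cutoff + interval '1 month')::date);
END;
$$;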

Since these approaches are based on our specific requirements, it is recommended to adapt and test them in your test environment before implementing them in production.

Conclusion

Archiving is crucial for maintaining the health and performance of a PostgreSQL database as it grows. By employing a well-thought-out archival strategy, you can significantly improve query performance, reduce storage costs, and ensure efficient backup processes. Whether you choose a manual insert/delete approach or a more sophisticated partitioning strategy, the key is to regularly assess your data growth and adjust your archival methods accordingly.

The two approaches we’ve discussed — moving data through insert/delete and partitioning with attaching/detaching — each offer distinct advantages depending on your use case. Choosing the right one depends on factors like your data access patterns, performance requirements, and system complexity. By embracing a suitable archival strategy, you can ensure that your PostgreSQL database remains scalable, performant, and cost-effective in the long run.

Stay tuned!!

Configuring pgAudit for Effective Database Auditing

In today’s highly regulated and security-conscious world, database auditing has become a critical requirement for monitoring database activity, ensuring compliance, and preventing malicious activity. PostgreSQL, one of the most powerful open-source relational databases, provides a robust auditing mechanism through the pgAudit extension.

In our previous blog, we explored the importance of this extension and its installation. In this blog, we'll explore how to configure some of the pgAudit extension's parameters effectively to achieve secure and comprehensive auditing in your PostgreSQL environment.

Understanding some of the parameters

pgaudit.log

The pgaudit.log setting in postgresql.conf specifies what kinds of SQL statements should be logged by pgAudit. It’s a flexible configuration option that lets you filter what gets logged, making it easier to manage log volumes and ensure you’re capturing the right activities for your audit trail.

The general format for configuring pgaudit.log is:

pgaudit.log = 'option1, option2, ...'
# If you prefer to use ALTER SYSTEM instead:
ALTER SYSTEM SET pgaudit.log TO 'option1, option2, ...';

Each option specifies a category of PostgreSQL operations to be logged. Multiple classes can be provided as a comma-separated list, and classes can be subtracted by prefixing the class with a - sign. The default is none. For the full list of available options, please check the previous blog.

For example,

# Enabling one option
postgres=# ALTER SYSTEM SET pgaudit.log TO 'WRITE';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# SHOW pgaudit.log;
 pgaudit.log
-------------
 WRITE
(1 row)

# Enabling multiple options
postgres=# ALTER SYSTEM SET pgaudit.log TO 'WRITE,READ';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# show pgaudit.log;
 pgaudit.log
-------------
 WRITE,READ
(1 row)

# Excluding an option
postgres=# ALTER SYSTEM SET pgaudit.log TO 'ALL,-MISC';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# SHOW pgaudit.log;
 pgaudit.log
-------------
 ALL,-MISC
(1 row)

pgaudit.log_catalog

The pgaudit.log_catalog parameter controls whether statements that touch only catalog (system) tables are included in the audit logs. The default is on.

There are both pros and cons to enabling this parameter. Here are some:

  • Enabling this parameter
    • Pros
      • Detailed audit trail of system activity.
      • Compliance and security.
      • Helps detect unauthorized access/usage.
    • Cons
      • Increased log volume.
      • Performance overhead.

Enabling or disabling this option is purely based on your organization's requirements. If you want to disable it, here is the command:

postgres=# ALTER SYSTEM SET pgaudit.log_catalog TO 'off';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
postgres=# SHOW pgaudit.log_catalog;
 pgaudit.log_catalog
---------------------
 off
(1 row)

pgaudit.log_client & pgaudit.log_level

pgaudit.log_client specifies whether audit messages are also sent to the client process (such as psql) in addition to the server log. The default is off.

The pgaudit.log_level parameter determines the log level used for pgAudit messages, controlling how verbose they appear in the PostgreSQL logs. pgaudit.log_level only takes effect when pgaudit.log_client is on; otherwise the default level, log, is used.

 Severity         | Usage                                                                                                 | syslog | eventlog
------------------+-------------------------------------------------------------------------------------------------------+--------+-------------
 DEBUG1 .. DEBUG5 | Provides successively-more-detailed information for use by developers.                               | DEBUG  | INFORMATION
 INFO             | Provides information implicitly requested by the user, e.g., output from VACUUM VERBOSE.             | INFO   | INFORMATION
 NOTICE           | Provides information that might be helpful to users, e.g., notice of truncation of long identifiers. | NOTICE | INFORMATION
 WARNING          | Provides warnings of likely problems, e.g., COMMIT outside a transaction block.                      | NOTICE | WARNING
 LOG              | Reports information of interest to administrators, e.g., checkpoint activity.                        | INFO   | INFORMATION

These are the available options. (Source)
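
For example, to make audit messages visible in an interactive psql session at NOTICE level, both parameters can be changed with a reload (a minimal sketch):

postgres=# ALTER SYSTEM SET pgaudit.log_client = on;
ALTER SYSTEM
postgres=# ALTER SYSTEM SET pgaudit.log_level = 'notice';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)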

There are a lot more parameters in pgAudit to look at, but focusing on key settings like pgaudit.log, pgaudit.log_catalog, pgaudit.log_client, and pgaudit.log_level will help you establish a solid auditing foundation. As your auditing needs evolve, you can explore additional parameters to fine-tune your logging configuration. Always remember to balance detailed auditing with performance considerations, especially in production environments. If required, feel free to take our help in optimizing and configuring your pgAudit setup to best suit your needs.

Conclusion

In conclusion, the pgAudit extension provides a powerful and flexible mechanism for auditing PostgreSQL database activity. By configuring key parameters such as pgaudit.log, pgaudit.log_catalog, pgaudit.log_client, and pgaudit.log_level, you can tailor your auditing to meet both security and compliance needs.

Choosing the right logging options, whether you’re tracking write/read operations, system table access, or client connections, is crucial to ensuring an effective and manageable audit trail. While enabling comprehensive auditing may lead to increased log volumes and some performance overhead, it significantly enhances your ability to monitor, detect, and prevent unauthorized activity in your PostgreSQL environment.

By carefully balancing auditing granularity and performance, you can establish a robust auditing strategy that aligns with your organization’s security policies and regulatory requirements, providing better visibility and control over your database operations.

Understanding pgAudit: PostgreSQL Auditing Made Easy

PostgreSQL is one of the most robust and flexible relational database management systems available today, offering a wide array of features for database administrators (DBAs) and developers alike. However, like any database, security and auditing are critical concerns when managing sensitive data. pgaudit is an extension for PostgreSQL that helps with auditing database activity, providing a clear and comprehensive audit trail of database operations.

In this blog post, we’ll dive into what pgaudit is, how to install and configure it, and explore some uses that make it essential for security and compliance.

What is pgAudit?

pgaudit is an open-source PostgreSQL extension that enhances PostgreSQL’s logging functionality by adding detailed audit logging capabilities. It logs all the database activity that is required to track and monitor user interactions, including queries, user sessions, and system-level events. By default, PostgreSQL provides basic logging, which includes information such as the start and end times of queries. However, for many compliance frameworks (e.g., ISO), it is often necessary to capture more granular data. This is where pgAudit can play a major role, by providing detailed activity logs, including:

  • SQL queries executed (including the values of parameters).
  • Successful or failed login attempts.
  • Changes to database objects (e.g., table creation or dropping).
  • DML operations (INSERT, UPDATE, DELETE).
  • SELECT queries with detailed information, such as the user issuing them.

Why pgAudit?

The need for robust auditing is more important than ever. Organizations dealing with sensitive information (such as financial or health data) need to comply with various regulations. pgaudit offers several key benefits:

  • Compliance with Regulatory Standards like GDPR ,ISO etc
  • Enhanced Security
  • Transparency and Monitoring
  • Simplified Troubleshooting

Installation and Configuration

To begin using pgaudit, you must first install the PostgreSQL development tools. Here, I have a setup of Ubuntu 24.04 with PostgreSQL 17.2.

sudo apt-get update
sudo apt-get install postgresql-contrib postgresql-server-dev-all

Once we have the required tools installed, let's install pgaudit:

sudo apt-get install postgresql-17-pgaudit  
# Replace 17 with your PostgreSQL version

To complete the setup, we need to change a parameter in the postgresql.conf file:

# In postgresql.conf
shared_preload_libraries = 'pgaudit'
# This change requires a restart of the PostgreSQL service

Log in to psql to create and configure the extension:

postgres@ip-172-31-23-19:~$ psql
psql (17.2 (Ubuntu 17.2-1.pgdg24.04+1))
Type "help" for help.

postgres=# create extension pgaudit;
CREATE EXTENSION
postgres=# \dx
                   List of installed extensions
  Name   | Version |   Schema   |           Description
---------+---------+------------+---------------------------------
 pgaudit | 17.0    | public     | provides auditing functionality
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)
# Configuring pgaudit.log to 'READ'
postgres=# alter system set pgaudit.log to 'READ';
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

Let's look at some commonly used pgaudit.log settings:

  • WRITE – Log all SQL writes like INSERT, UPDATE, DELETE, TRUNCATE
  • FUNCTION – Log calls to functions and stored procedures
  • ROLE – Log user creation, role grants etc.
  • DDL – Log all DDL statements like CREATE TABLE
  • DML – Enable logging for all DML statements
  • MISC – Log miscellaneous commands like VACUUM, EXPLAIN

Verifying the log file

Let’s verify in our log file

2024-11-27 11:37:01.857 UTC [10329] LOG:  parameter "pgaudit.log" changed to "READ"
2024-11-27 11:38:00.473 UTC [10495] postgres@postgres LOG:  AUDIT: SESSION,1,1,READ,SELECT,,,"SELECT pg_catalog.lower(name) FROM pg_catalog.pg_settings  WHERE pg_catalog.lower(name) LIKE pg_catalog.lower('log\_dire%')

Here, we can see that the pgaudit extension captures the SELECT query because we set pgaudit.log to 'READ'.

Conclusion

pgaudit is a powerful extension for PostgreSQL that provides the necessary auditing capabilities for businesses to comply with regulations, enhance security, and track database activities. Its ability to log detailed information about SQL queries, user actions, and changes to database objects makes it an invaluable tool for DBAs, security teams, and compliance officers.

With pgaudit in place, you can ensure that your PostgreSQL databases are fully auditable, which is crucial for regulatory compliance and maintaining a secure database environment. Implementing it correctly in your PostgreSQL system can go a long way toward improving your organization’s security posture and reducing the risk of unauthorized data access or manipulation.

Introduction to pg_repack: Dealing with PostgreSQL bloat

PostgreSQL is a powerful and widely used relational database management system (RDBMS) known for its stability, scalability, and flexibility. However, over time, as data in PostgreSQL tables and indexes grows, performance can degrade due to fragmentation. This is where pg_repack comes in: a tool that helps in reorganising and optimising your PostgreSQL database to improve performance without significant downtime. In this blog post, we'll cover what pg_repack is, its primary uses, and how to install and configure it for your PostgreSQL setup.

What is pg_repack?

pg_repack is an open-source PostgreSQL extension designed to eliminate table and index bloat without requiring long table locks or downtime. It removes bloat from tables and indexes and can optionally restore the physical order of clustered indexes. It works online, without holding an exclusive lock on the processed tables during processing.

Key Features of pg_repack

  • Online Repacking: Unlike the traditional VACUUM or REINDEX commands in PostgreSQL, pg_repack works online, meaning it doesn’t lock tables or indexes during the process. This makes it ideal for production environments where downtime is critical.
  • Index Rebuilding: pg_repack can rebuild indexes, removing fragmentation, and optimizing the storage for quicker access.
  • Table Repacking: It can also repack entire tables, reclaiming space occupied by dead tuples and reducing the size of the data files.
  • Database Repacking: In addition to individual tables and indexes, pg_repack can perform repacking on an entire database.
  • Space Savings: By reorganizing data and removing wasted space, pg_repack helps reduce the storage requirements of your PostgreSQL instance.

Common Uses of pg_repack

Here are some common scenarios where pg_repack can help:

  1. Reducing Table Bloat: Over time, as data is inserted, updated, and deleted in tables, PostgreSQL’s internal storage can become fragmented. Repacking these tables can restore performance by removing unused space.
  2. Rebuilding Indexes: Index fragmentation can slow down query performance. pg_repack rebuilds indexes in a way that eliminates fragmentation and optimizes query speed.
  3. Improving Query Performance: By eliminating bloat in both tables and indexes, queries that rely on these objects will experience faster execution times due to reduced I/O operations.
  4. Reclaiming Disk Space: After a large number of DELETE operations or bulk updates, you may find that your database size doesn’t shrink automatically. pg_repack can shrink the disk space used by these operations.
  5. Running in Production with Minimal Impact: Since pg_repack works online, you can run the tool in a live production environment without locking tables for long periods. This is particularly useful for large, active databases.

Installation and Configuration

The installation process for pg_repack is straightforward and depends on your operating system. Here are the steps to install on RHEL 9 running PG16.

[root@localhost ~]# yum install pg_repack_16
#Check for the version to verify download
pg_repack --version 

Once the packages are installed, add pg_repack to shared_preload_libraries in postgresql.conf

shared_preload_libraries='pg_repack'
#Restart the postgresql server

After the restart, log in to the database server and create the extension:

postgres=# create extension pg_repack;
CREATE EXTENSION
postgres=> \dx
                                  List of installed extensions
   Name    | Version |   Schema   |                         Description                          
-----------+---------+------------+--------------------------------------------------------------
 pg_repack | 1.5.0   | public     | Reorganize tables in PostgreSQL databases with minimal locks
 plpgsql   | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

Testing

Create a Table

Create a table named test with columns id (integer) and name (varchar with a maximum length of 250 characters). The id column is set as the primary key:

postgres=# CREATE TABLE test (id int primary key, name varchar(250) );
CREATE TABLE

The table has been successfully created.

Insert 100,000 Rows

Insert 100,000 rows into the test table:

postgres=# INSERT INTO test (id, name)
SELECT
    gs.id,
    'name_' || gs.id AS name                                           
FROM generate_series(1, 100000) AS gs(id);
INSERT 0 100000

100,000 rows have been successfully inserted.

Check the Size of the Table

After inserting the rows, check the size of the test table:

postgres=# \dt+
                                   List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+------+-------+----------+-------------+---------------+---------+-------------
 public | test | table | postgres | permanent   | heap          | 4360 kB | 
(1 row)

The size of the test table is 4360 kB.

Update All Rows

Update all rows in the test table, setting the name column to ‘somename’ for all records:

postgres=# update test set name ='somename';
UPDATE 100000

100,000 rows have been updated.

Check the Size of the Table After Update

postgres=# \dt+
                                   List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+------+-------+----------+-------------+---------------+---------+-------------
 public | test | table | postgres | permanent   | heap          | 8688 kB | 
(1 row)

The size of the table has increased to 8688 kB.

Delete Rows

postgres=# DELETE FROM test WHERE id BETWEEN 50000 AND 100000;
DELETE 50001

50,001 rows have been deleted.

Check the Size of the Table After Deletion

After the deletion, check the size of the test table:

postgres=# \dt+
                                   List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+------+-------+----------+-------------+---------------+---------+-------------
 public | test | table | postgres | permanent   | heap          | 6520 kB | 
(1 row)

The size of the table is now 6520 kB.

Using pg_repack to Reclaim Space in PostgreSQL

To reclaim space in a PostgreSQL table, you can use the pg_repack extension. Here’s the process to repack a table and check the space before and after the operation:

To repack the test table and reclaim the unused space

[postgres@localhost ~]$ pg_repack --table=test -d postgres -e
LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) select repack.version(), repack.version_sql()
LOG: (query) SET statement_timeout = 0
LOG: (query) SET search_path = pg_catalog, pg_temp, public
LOG: (query) SET client_min_messages = warning
LOG: (query) SELECT r FROM (VALUES ($1, 'r')) AS given_t(r,kind) WHERE NOT EXISTS(  SELECT FROM repack.tables WHERE relid=to_regclass(given_t.r)) AND NOT EXISTS(  SELECT FROM pg_catalog.pg_class c WHERE c.oid=to_regclass(given_t.r) AND c.relkind = given_t.kind AND given_t.kind = 'p')
LOG: 	(param:0) = test
LOG: (query) SELECT t.*, coalesce(v.tablespace, t.tablespace_orig) as tablespace_dest FROM repack.tables t,  (VALUES (quote_ident($1::text))) as v (tablespace) WHERE (relid = $2::regclass) ORDER BY t.relname, t.schemaname
LOG: 	(param:0) = (null)
LOG: 	(param:1) = test
INFO: repacking table "public.test"
LOG: (query) SELECT pg_try_advisory_lock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG: 	(param:0) = 16185446
LOG: 	(param:1) = 25021
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT pg_get_indexdef(indexrelid) FROM pg_index WHERE indrelid = $1 AND NOT indisvalid
LOG: 	(param:0) = 25021
LOG: (query) SELECT indexrelid, repack.repack_indexdef(indexrelid, indrelid, $2, FALSE)  FROM pg_index WHERE indrelid = $1 AND indisvalid
LOG: 	(param:0) = 25021
LOG: 	(param:1) = (null)
LOG: (query) SELECT repack.conflicted_triggers($1)
LOG: 	(param:0) = 25021
LOG: (query) SELECT repack.create_index_type(25024,25021)
LOG: (query) SELECT repack.create_log_table(25021)
LOG: (query) CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.test FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('id')
LOG: (query) ALTER TABLE public.test ENABLE ALWAYS TRIGGER repack_trigger
LOG: (query) SELECT repack.disable_autovacuum('repack.log_25021')
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SELECT pg_backend_pid()
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 25021 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL SERIALIZABLE
LOG: (query) SELECT set_config('work_mem', current_setting('maintenance_work_mem'), true)
LOG: (query) SELECT coalesce(array_agg(l.virtualtransaction), '{}')   FROM pg_locks AS l   LEFT JOIN pg_stat_activity AS a     ON l.pid = a.pid   LEFT JOIN pg_database AS d     ON a.datid = d.oid   WHERE l.locktype = 'virtualxid'   AND l.pid NOT IN (pg_backend_pid(), $1)   AND (l.virtualxid, l.virtualtransaction) <> ('1/1', '-1/0')   AND (a.application_name IS NULL OR a.application_name <> $2)  AND a.query !~* E'^\\s*vacuum\\s+'   AND a.query !~ E'^autovacuum: '   AND ((d.datname IS NULL OR d.datname = current_database()) OR l.database = 0)
LOG: 	(param:0) = 41042
LOG: 	(param:1) = pg_repack
LOG: (query) DELETE FROM repack.log_25021
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 25021 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS SHARE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.create_table($1, $2)
LOG: 	(param:0) = 25021
LOG: 	(param:1) = pg_default
LOG: (query) INSERT INTO repack.table_25021 SELECT id,name FROM ONLY public.test
LOG: (query) SELECT repack.disable_autovacuum('repack.table_25021')
LOG: (query) COMMIT
LOG: (query) CREATE UNIQUE INDEX index_25024 ON repack.table_25021 USING btree (id)
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: 	(param:0) = SELECT * FROM repack.log_25021 ORDER BY id LIMIT $1
LOG: 	(param:1) = INSERT INTO repack.table_25021 VALUES ($1.*)
LOG: 	(param:2) = DELETE FROM repack.table_25021 WHERE (id) = ($1.id)
LOG: 	(param:3) = UPDATE repack.table_25021 SET (id, name) = ($2.id, $2.name) WHERE (id) = ($1.id)
LOG: 	(param:4) = DELETE FROM repack.log_25021 WHERE id IN (
LOG: 	(param:5) = 1000
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)
LOG: 	(param:0) = {}
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: 	(param:0) = SELECT * FROM repack.log_25021 ORDER BY id LIMIT $1
LOG: 	(param:1) = INSERT INTO repack.table_25021 VALUES ($1.*)
LOG: 	(param:2) = DELETE FROM repack.table_25021 WHERE (id) = ($1.id)
LOG: 	(param:3) = UPDATE repack.table_25021 SET (id, name) = ($2.id, $2.name) WHERE (id) = ($1.id)
LOG: 	(param:4) = DELETE FROM repack.log_25021 WHERE id IN (
LOG: 	(param:5) = 0
LOG: (query) SELECT repack.repack_swap($1)
LOG: 	(param:0) = 25021
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_drop($1, $2)
LOG: 	(param:0) = 25021
LOG: 	(param:1) = 4
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) ANALYZE public.test
LOG: (query) COMMIT
LOG: (query) SELECT pg_advisory_unlock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG: 	(param:0) = 16185446
LOG: 	(param:1) = 25021

Checking the Table Size After Repacking

After the repack process completes, you can check the table size again:


postgres=# \dt+
                                   List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+------+-------+----------+-------------+---------------+---------+-------------
 public | test | table | postgres | permanent   | heap          | 2192 kB | 
(1 row)

The size of the test table has been reduced to 2192 kB, indicating that the unused space has been reclaimed and the table’s storage has been optimized.

pg_repack offers several other options; review the list below and use the ones that fit your requirements.

[postgres@localhost ~]$ pg_repack --help
pg_repack re-organizes a PostgreSQL database.

Usage:
  pg_repack [OPTION]... [DBNAME]
Options:
  -a, --all                 repack all databases
  -t, --table=TABLE         repack specific table only
  -I, --parent-table=TABLE  repack specific parent table and its inheritors
  -c, --schema=SCHEMA       repack tables in specific schema only
  -s, --tablespace=TBLSPC   move repacked tables to a new tablespace
  -S, --moveidx             move repacked indexes to TBLSPC too
  -o, --order-by=COLUMNS    order by columns instead of cluster keys
  -n, --no-order            do vacuum full instead of cluster
  -N, --dry-run             print what would have been repacked
  -j, --jobs=NUM            Use this many parallel jobs for each table
  -i, --index=INDEX         move only the specified index
  -x, --only-indexes        move only indexes of the specified table
  -T, --wait-timeout=SECS   timeout to cancel other backends on conflict
  -D, --no-kill-backend     don't kill other backends when timed out
  -Z, --no-analyze          don't analyze at end
  -k, --no-superuser-check  skip superuser checks in client
  -C, --exclude-extension   don't repack tables which belong to specific extension

Connection options:
  -d, --dbname=DBNAME       database to connect
  -h, --host=HOSTNAME       database server host or socket directory
  -p, --port=PORT           database server port
  -U, --username=USERNAME   user name to connect as
  -w, --no-password         never prompt for password
  -W, --password            force password prompt

Generic options:
  -e, --echo                echo queries
  -E, --elevel=LEVEL        set output message level
  --help                    show this help, then exit
  --version                 output version information, then exit

Read the website for details: <https://reorg.github.io/pg_repack/>.
Report bugs to <https://github.com/reorg/pg_repack/issues>.
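
For instance, a dry run reports what would be repacked without making any changes, and --only-indexes rebuilds just the indexes of a given table. These are illustrative invocations only; adjust the table and database names to your environment:

#Report what pg_repack would do, without changing anything
pg_repack --dry-run --table=test -d postgres

#Rebuild only the indexes of the test table
pg_repack --only-indexes --table=test -d postgres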

Conclusion

pg_repack is a powerful tool for optimizing PostgreSQL databases by reducing fragmentation and reclaiming disk space. It helps maintain high performance in production systems by reorganizing tables and indexes online, with minimal impact on ongoing operations. By regularly running pg_repack, you can keep your database lean, fast, and efficient.

With its simple installation and straightforward usage, pg_repack is an essential tool for PostgreSQL administrators who want to ensure their databases remain performant as they grow. Whether you’re working with a large, active database or a smaller, more static one, pg_repack can help you maintain optimal performance with minimal effort.

Unlocking PostgreSQL 17: Enhancing the Developer Experience

PostgreSQL, widely regarded as one of the most powerful and reliable open-source relational database systems, has launched its latest version: PostgreSQL 17. This major release brings exciting improvements in developer experience and new features that simplify the work of data professionals, database administrators, and developers alike. Whether you’re looking to optimize your data workflows, refine your queries, or leverage cutting-edge SQL capabilities, PostgreSQL 17 has you covered. In this post, we’ll explore some of the standout features that make this release a game-changer for the community.

SQL/JSON Enhancements

PostgreSQL 17 builds on its already strong support for JSON data by introducing several new features aimed at making JSON handling even more powerful and intuitive.

  • New JSON_TABLE Command: This new addition allows you to convert JSON data into a tabular format, making it easier to query and manipulate JSON objects in your database. The JSON_TABLE function enables developers to map JSON keys to specific columns and use SQL queries to interact with structured JSON data as if it were a traditional relational table.
  • Enhanced JSON Query Functions: PostgreSQL 17 expands its suite of SQL/JSON functions with the introduction of JSON_EXISTS, JSON_VALUE, and JSON_QUERY:
    • JSON_EXISTS allows you to check if a specified path exists within a JSON object, making it easier to validate data structures.
    • JSON_VALUE extracts a scalar value from a JSON object, returning it directly as a result.
    • JSON_QUERY retrieves a JSON subdocument, allowing for greater flexibility in handling nested structures.

These enhancements are designed to streamline your ability to work with JSON, allowing you to do more with less effort. Whether you’re parsing complex JSON data or building sophisticated APIs, these tools will enable your applications to interact more efficiently with JSON-based datasets.
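
As a quick, self-contained sketch of these features (the inline JSON documents below are made up for illustration):

-- JSON_TABLE: project a JSON array into rows and columns
SELECT jt.*
FROM JSON_TABLE(
       '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'::jsonb,
       '$[*]'
       COLUMNS (
         id   INT  PATH '$.id',
         name TEXT PATH '$.name'
       )
     ) AS jt;

-- The new query functions
SELECT JSON_EXISTS('{"customer": {"id": 7}}'::jsonb, '$.customer.id');  -- t
SELECT JSON_VALUE('{"customer": {"id": 7}}'::jsonb, '$.customer.id');   -- 7
SELECT JSON_QUERY('{"items": [1, 2, 3]}'::jsonb, '$.items');            -- [1, 2, 3]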

MERGE Enhancements

The MERGE statement, which was introduced in PostgreSQL 15, continues to evolve with PostgreSQL 17, now offering even more flexibility and ease of use for developers.

  • Conditional Updates: The new version expands the functionality of MERGE by adding the ability to perform conditional updates. This means you can now merge data based on specific criteria, simplifying many complex data manipulation tasks. For example, you can now update records in a target table if they meet certain conditions in the source table—without needing to write complex CASE statements or use multiple queries.
  • RETURNING Clause in MERGE: Another notable enhancement is the addition of the RETURNING clause to MERGE. This allows you to return values from the rows that are updated, inserted, or deleted in the MERGE operation. This feature is especially useful for applications that need to track changes or perform additional operations on the modified data immediately after the merge.
  • View Updates: PostgreSQL 17 also allows MERGE operations on views, making it much easier to perform complex data manipulations on views without needing to worry about whether the underlying data structure is a physical table or a virtual view.

With these enhancements, MERGE becomes a much more powerful tool for handling upserts, conditional updates, and data synchronization tasks, enabling you to write more efficient, maintainable SQL.
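
Here is a minimal sketch that combines a conditional action with the new RETURNING clause; the inventory and incoming tables are hypothetical:

MERGE INTO inventory AS t
USING incoming AS s
   ON t.sku = s.sku
WHEN MATCHED AND s.qty = 0 THEN           -- conditional action on matched rows
  DELETE
WHEN MATCHED THEN
  UPDATE SET qty = t.qty + s.qty
WHEN NOT MATCHED THEN
  INSERT (sku, qty) VALUES (s.sku, s.qty)
RETURNING merge_action(), t.sku, t.qty;   -- RETURNING on MERGE is new in 17

The merge_action() function in the RETURNING list reports whether each returned row was inserted, updated, or deleted.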

Bulk Data Handling

Handling large datasets efficiently is a key concern for database administrators, and PostgreSQL 17 has made significant strides in this area.

  • Up to 2x Faster Exports with COPY: PostgreSQL 17 introduces performance improvements to the COPY command, which is often used for bulk data loading and unloading. Exports using COPY can now be up to 2x faster, making it much quicker to move large datasets in and out of your database. This is a game-changer for scenarios such as backup/restore operations, data migration, and exporting large data sets for analytics or reporting purposes.
  • New ON_ERROR Option for COPY: One of the most requested features in PostgreSQL is more granular error handling during bulk imports. In version 17, the COPY command now includes an ON_ERROR option, which allows you to specify whether to continue the import process when an error occurs. With this option, you can choose to skip problematic rows and continue processing the rest of the data. This is particularly useful when dealing with large datasets that may have some malformed rows, but you still want to ensure the majority of the data is imported successfully.

These improvements make PostgreSQL 17 an even better choice for businesses that need to manage large volumes of data efficiently and without interruption.
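
As a rough sketch (the staging table and file path are placeholders), the ON_ERROR option skips rows that fail input conversion rather than aborting the whole load; note that it does not bypass constraint violations:

-- Skip rows that fail data-type conversion instead of stopping the import
COPY sales_staging FROM '/tmp/sales.csv'
  WITH (FORMAT csv, HEADER true, ON_ERROR ignore);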

Conclusion

PostgreSQL 17 is a powerhouse of new features and performance enhancements that will benefit developers, database administrators, and organizations of all sizes. With its improved support for JSON, advanced MERGE capabilities, and faster bulk data handling, PostgreSQL 17 solidifies its position as a top-tier database platform for modern applications.

If you’re not already using PostgreSQL 17, now is the perfect time to upgrade. Whether you’re building data-intensive applications, managing large datasets, or just looking to improve the performance of your existing PostgreSQL installation, this release brings a wealth of improvements that will make your life as a developer and database administrator easier and more productive.

Setting up PostgreSQL 17 is easier than ever, and OpenSource DB can assist you with the upgrade process. With our expertise, we’ll ensure a smooth transition to PostgreSQL 17, helping you unlock its full potential for your projects. Contact us today to get started!

Automating PostgreSQL Tasks with pg_cron

In the world of database management, automation is key to maintaining efficiency and ensuring that routine tasks are executed consistently. For PostgreSQL users, the pg_cron extension provides a powerful way to schedule jobs directly within the database. In this post, we’ll explore what pg_cron is, how to install it, and how to use it effectively.

What is pg_cron?

pg_cron is an extension for PostgreSQL that allows you to run scheduled jobs using familiar cron syntax. With this, you can automate various tasks, such as database maintenance, data processing, and reporting, all from within your PostgreSQL environment. This eliminates the need for external cron jobs or scripts, making your setup cleaner and more integrated.

Installing pg_cron

Identify the version of pg_cron that is compatible with your PostgreSQL installation. For example, to install the pg_cron (pg_cron_16) extension on PostgreSQL 16, run:

#For RHEL / Fedora Derivatives:
sudo yum install -y pg_cron_16

#For Ubuntu / Debian Derivatives:
sudo apt-get install -y postgresql-16-cron

Setting Up pg_cron

To start the pg_cron background worker when PostgreSQL starts, you need to add pg_cron to shared_preload_libraries in postgresql.conf. Note that pg_cron does not run any jobs while the server is in hot standby mode, but it starts automatically when the server is promoted.

By default, the pg_cron background worker expects its metadata tables to be created in the postgres database. However, you can configure this by setting the cron.database_name configuration parameter in postgresql.conf.

#Parameters that need to be changed for pg_cron
shared_preload_libraries = 'pg_cron'
cron.database_name = 'postgres'

After changing these parameters, restart the PostgreSQL server.

You can then create the pg_cron extension and its metadata tables with:

postgres=# CREATE EXTENSION pg_cron;
CREATE EXTENSION

After creating the extension, grant usage privileges on the cron schema to the postgres user:

postgres=# GRANT USAGE ON SCHEMA cron TO postgres;
GRANT

Let’s schedule a VACUUM operation on the postgres database at 10 P.M. every day:

postgres=# SELECT cron.schedule('0 22 * * *', 'VACUUM');
 schedule 
----------
        1
(1 row)

Let’s see how we can manage the scheduled jobs. The cron.job table lists every job that is currently scheduled:

postgres=# SELECT * FROM cron.job;
 jobid |  schedule   | command | nodename  | nodeport | database | username | active | jobname
-------+-------------+---------+-----------+----------+----------+----------+--------+---------
     1 | 0 22 * * *  | VACUUM  | localhost |     5432 | postgres | postgres | t      |
(1 row)
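
Notice that the jobname column is empty because we did not name the job. cron.schedule also accepts a job name as its first argument, which makes jobs easier to identify later (a minimal sketch; the returned jobid will vary):

SELECT cron.schedule('nightly-vacuum', '0 22 * * *', 'VACUUM');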

To check the logs of the scheduled job runs, query the cron.job_run_details table:

postgres=# select * from cron.job_run_details;
-[ RECORD 1 ]--+----------------------------------
jobid          | 1
runid          | 1
job_pid        | 35683
database       | postgres
username       | postgres
command        | VACUUM
status         | succeeded
return_message | VACUUM
start_time     | 2024-10-01 13:27:29.473088+05:30
end_time       | 2024-10-01 13:27:29.47479+05:30

Here’s a breakdown of each field:

  • jobid: This is the unique identifier for the scheduled job. Each job scheduled in pg_cron has a distinct jobid.
  • runid: This indicates the unique identifier for this specific execution of the job. If the job is scheduled to run multiple times, each execution will have a different runid.
  • job_pid: This is the process ID of the PostgreSQL backend process that executed the job. It can be useful for tracking the job’s execution in system logs.
  • database: This shows the database context in which the job was run. In this case, the job was executed in the postgres database.
  • username: This indicates the username under which the job was executed. Here, the job ran as the postgres user.
  • command: This is the SQL command that was executed as part of the job. In this instance, it indicates that a VACUUM operation was performed.
  • status: This shows the outcome of the job execution. A status of succeeded indicates that the job completed successfully without errors.
  • return_message: This message reflects the output from the executed command. In this case, it confirms that the VACUUM command was executed.
  • start_time: This is the timestamp when the job started executing. It includes the time zone offset (+05:30), indicating the local time zone.
  • end_time: This is the timestamp when the job finished executing. It also includes the time zone offset.

To unschedule a previously scheduled job in pg_cron, run:

SELECT cron.unschedule(job_id);
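
For example, to remove the VACUUM job created earlier, pass its jobid from cron.job (the function returns true on success); a job can also be removed by name if it was scheduled with one:

SELECT cron.unschedule(1);
SELECT cron.unschedule('nightly-vacuum');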

Conclusion

pg_cron is a powerful tool that can significantly enhance your PostgreSQL experience by simplifying job scheduling and automation. By integrating routine tasks directly within the database, you can save time and reduce complexity. Whether you’re handling maintenance, data processing, or reporting, pg_cron provides the flexibility and control you need.

Ready to automate your PostgreSQL tasks? Give pg_cron a try, and see how it can streamline your database operations! Contact us today to know more about our database maintenance services.

Happy Automation!!