Bulk loading with COPY in PostgreSQL 16

Introduction:

At OpenSource DB, we work with customers on a variety of data engineering projects, and among the many critical aspects of data warehousing, the efficient and rapid movement of data is paramount. This is where PostgreSQL’s COPY feature shines. COPY is a versatile, high-performance utility for moving data in and out of PostgreSQL quickly and effectively. For data warehousing tasks such as bulk loading and unloading, it is a game-changer: it provides a direct, efficient path for ingesting data from various sources into PostgreSQL tables, or for exporting data from those tables for further analysis. This significantly accelerates ETL (Extract, Transform, Load) processes and improves the overall performance and scalability of data warehousing solutions built on PostgreSQL. Whether you are dealing with terabytes of data or more, COPY simplifies the process and keeps your data warehousing operations running smoothly.

PostgreSQL has long been revered for its extensibility, reliability, and data integrity, and with each new release it evolves to meet the growing demands of modern applications. PostgreSQL 16, the latest iteration of this venerable open-source database management system, brings with it a transformative feature: concurrent bulk loading. As the PG16 release notes put it: “This release includes improvements for bulk loading using COPY in both single and concurrent operations, with tests showing up to a 300% performance improvement in some cases.”

This feature redefines the way you load massive amounts of data into your PostgreSQL database, promising improved performance, reduced downtime, and greater efficiency. In this blog, we’ll delve into the feature, understand how it works, and explore its implications for your PostgreSQL-powered applications.

Here are the PostgreSQL hackers mailing list discussion and the commit link, for background on the work that went into this feature:

Discussion: https://www.postgresql.org/message-id/20221029025420.eplyow6k7tgu6he3%40awork3.anarazel.de

Commit: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=31966b151e6ab7a6284deab6e8fe5faddaf2ae4c

Also, note that async I/O is not included in PostgreSQL 16. While there were a couple of related patches, Postgres 17 is expected to be the first release to offer substantial and meaningful asynchronous I/O benefits.

Summary of change:

Today, when you add data to a table and Postgres needs to create new pages for it, it must hold the relation extension lock each time it adds a page. Previously, a great deal of work was done while holding that lock; as Andres Freund describes in his mailing list thread, it was far too much work to do under the lock. His refactoring in PostgreSQL 16 reduces the time the relation extension lock is held.

While we’re on this topic, another PG16 feature enhancement that deserves a mention is “Enhancements to storage.”

The latest version brings a substantial enhancement (a roughly threefold improvement for 16 clients) when performing concurrent COPY operations into a single relation; a minimal sketch of such a workload follows the list below. In PG16, the relation extension lock is held only during the actual extension. In previous versions, the lock remained held while the system also:

  • Acquired a victim buffer for the new page, which might entail writing out the old buffer contents first, possibly flushing Write-Ahead Log (WAL) as well.
  • Wrote a zero page at extension time and then later wrote out the actual page contents, which could almost double the write rate.
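
To make this concrete, here is a minimal sketch of the kind of workload that benefits: several sessions each running their own COPY into the same table. The table name, file paths, and the two-session split are illustrative assumptions, not something prescribed by the commit.

-- Each COPY runs in its own session/connection, started at the same time.
-- All sessions insert into the same relation, so as the table grows they
-- all compete for the relation extension lock.

-- Session 1:
COPY sales FROM '/path/to/sales_part_1.csv' WITH (FORMAT csv, HEADER true);

-- Session 2 (a separate connection, run concurrently):
COPY sales FROM '/path/to/sales_part_2.csv' WITH (FORMAT csv, HEADER true);

In PG15 and earlier, adding clients to such a load quickly ran into contention on the extension lock; in PG16 the lock is held only for the extension itself, which is where the reported threefold improvement at 16 clients comes from.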

Now, let’s discuss a simple use case:

-- Step 1: Data Preparation
-- Assume you have a CSV file named 'sales_data.csv' prepared externally,
-- with a header row whose columns match the target table.

-- Step 2: Initiating the Bulk Load
-- Use the COPY command (the 'sales' table must already exist):
COPY sales FROM '/path/to/sales_data.csv' WITH (FORMAT csv, HEADER true);

-- Step 3: Concurrent Loading
-- A single COPY runs in one backend, but several sessions can each run
-- their own COPY into the same table, and in PG16 they scale much better
-- because the relation extension lock is held only during the actual
-- extension.

-- Step 4: Reduced Lock Contention
-- Concurrent queries and transactions can continue running without
-- significant disruption while the load proceeds.

-- Step 5: Resource Management
-- PG16 can extend a relation by multiple pages at a time, reducing the
-- per-page overhead of a large load.

-- Step 6: Incremental Loading
-- If desired, you can split the input and load it in smaller batches to
-- avoid massive, time-consuming single operations.

Each backend running COPY reports its progress in the pg_stat_progress_copy view.
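
For example, while a load is running you can watch it from another session with a simple query against this view (the columns below are the standard ones; bytes_total is 0 when the input size is unknown, such as for STDIN):

SELECT pid,
       relid::regclass AS target_table,
       command,            -- COPY FROM or COPY TO
       type,               -- FILE, PROGRAM, PIPE, or CALLBACK
       bytes_processed,
       bytes_total,        -- 0 if the input size is not known
       tuples_processed,
       tuples_excluded     -- rows filtered out by a WHERE clause
FROM pg_stat_progress_copy;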

A new COPY FROM … WITH option introduced in PG16 is DEFAULT 'default_string':

DEFAULT
Specifies the string that represents a default value. Each time the string is found in the input file, the default value of the corresponding column will be used. This option is allowed only in COPY FROM, and only when not using binary format.
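
Here is a minimal sketch of how this might be used; the orders table, the marker string '\D', and the inline rows are illustrative assumptions:

-- The 'status' column has a default that COPY can fall back to.
CREATE TABLE orders (
    id     int,
    status text DEFAULT 'pending'
);

-- Any input field that exactly matches the string '\D' is replaced by
-- the default value of its column.
COPY orders FROM STDIN WITH (FORMAT csv, DEFAULT '\D');
1,shipped
2,\D
\.

-- Row 2 ends up with status = 'pending', taken from the column default.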

Conclusion

To gauge the potential benefits of this feature in PostgreSQL 16, keep an eye on specific wait events during bulk loads with the COPY command: the “Lock” wait event type and, within it, the “extend” wait event. If these are causing a bottleneck, you will see it in pg_stat_activity: the backends running COPY will show this wait event far more often than others.
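
A quick way to check is to sample pg_stat_activity while the load is running; this minimal query lists only the sessions currently waiting on the extension lock:

SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND wait_event = 'extend';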

Frequent “extend” waits indicate a bottleneck on the relation extension lock. This is exactly the case where PostgreSQL 16 can deliver an improvement of up to 300%.
