PostgreSQL is a powerful and widely used relational database management system (RDBMS) known for its stability, scalability, and flexibility. However, as the data in PostgreSQL tables and indexes grows over time, performance can degrade due to bloat and fragmentation. This is where pg_repack comes in: a tool that reorganises and optimises your PostgreSQL database to improve performance without significant downtime.
In this blog post, we'll cover what pg_repack is, its primary uses, and how to install and configure it for your PostgreSQL setup.
What is pg_repack?
pg_repack is an open-source PostgreSQL extension designed to remove bloat from tables and indexes without requiring exclusive table locks or downtime. It can also optionally restore the physical order of clustered indexes. Because it works online, it does not hold an exclusive lock on the tables it processes.
Key Features of pg_repack
- Online Repacking: Unlike VACUUM FULL or REINDEX in PostgreSQL, pg_repack works online: it does not hold an exclusive lock on tables or indexes for the duration of the operation (only brief locks at the start and end). This makes it ideal for production environments where downtime is unacceptable.
- Index Rebuilding: pg_repack can rebuild indexes, removing fragmentation and optimizing storage for quicker access.
- Table Repacking: It can also repack entire tables, reclaiming space occupied by dead tuples and reducing the size of the data files.
- Database Repacking: In addition to individual tables and indexes, pg_repack can perform repacking on an entire database.
- Space Savings: By reorganizing data and removing wasted space, pg_repack helps reduce the storage requirements of your PostgreSQL instance.
Common Uses of pg_repack
Here are some common scenarios where pg_repack can help:
- Reducing Table Bloat: Over time, as data is inserted, updated, and deleted in tables, PostgreSQL’s internal storage can become fragmented. Repacking these tables can restore performance by removing unused space.
- Rebuilding Indexes: Index fragmentation can slow down query performance. pg_repack rebuilds indexes in a way that eliminates fragmentation and optimizes query speed.
- Improving Query Performance: By eliminating bloat in both tables and indexes, queries that rely on these objects will experience faster execution times due to reduced I/O operations.
- Reclaiming Disk Space: After a large number of DELETE operations or bulk updates, you may find that your database size doesn’t shrink automatically. pg_repack can shrink the disk space used by these operations.
- Running in Production with Minimal Impact: Since pg_repack works online, you can run the tool in a live production environment without locking tables for long periods. This is particularly useful for large, active databases.
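Before repacking, it helps to identify which tables actually carry dead tuples. One way to do this is to query the pg_stat_user_tables statistics view; the query below is a sketch, and the 10% dead-tuple threshold is an illustrative assumption rather than an official rule:

```sql
-- Tables with the most dead tuples (candidates for repacking).
-- The 10% ratio threshold is an illustrative assumption.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
  AND n_dead_tup::numeric / GREATEST(n_live_tup, 1) > 0.10
ORDER BY n_dead_tup DESC;
```

Tables near the top of this list, especially large ones, are the ones most likely to benefit from a repack.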
Installation and Configuration
The installation process for pg_repack is straightforward and depends on your operating system. Here are the steps to install it on RHEL 9 running PostgreSQL 16, using the pg_repack_16 package from the PGDG repository.
[root@localhost ~]# yum install pg_repack_16
# Check the version to verify the installation
pg_repack --version
Once the packages are installed, add pg_repack to shared_preload_libraries in postgresql.conf:
shared_preload_libraries = 'pg_repack'
# Restart the PostgreSQL server for the change to take effect
After the restart, log in to the database and create the extension:
postgres=# create extension pg_repack;
CREATE EXTENSION
postgres=> \dx
List of installed extensions
Name | Version | Schema | Description
-----------+---------+------------+--------------------------------------------------------------
pg_repack | 1.5.0 | public | Reorganize tables in PostgreSQL databases with minimal locks
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)
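The \dx output already shows the extension version; you can also call the extension's own version function (the same one pg_repack invokes internally, as seen in its query log later in this post) to confirm that the client and the server-side extension match:

```sql
SELECT repack.version();
```

A version mismatch between the pg_repack client binary and the installed extension will cause the tool to abort, so this is a quick sanity check.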
Testing
Create a Table
Create a table named test with columns id (integer) and name (varchar with a maximum length of 250 characters). The id column is set as the primary key:
postgres=# CREATE TABLE test (id int primary key, name varchar(250) );
CREATE TABLE
The table has been successfully created.
Insert 100,000 Rows
Insert 100,000 rows into the test table:
postgres=# INSERT INTO test (id, name)
SELECT
gs.id,
'name_' || gs.id AS name
FROM generate_series(1, 100000) AS gs(id);
INSERT 0 100000
100,000 rows have been successfully inserted.
Check the Size of the Table
After inserting the rows, check the size of the test table:
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+---------+-------------
public | test | table | postgres | permanent | heap | 4360 kB |
(1 row)
The size of the test table is 4360 kB.
Update All Rows
Update all rows in the test table, setting the name column to ‘somename’ for all records:
postgres=# update test set name ='somename';
UPDATE 100000
100,000 rows have been updated.
Check the Size of the Table After Update
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+---------+-------------
public | test | table | postgres | permanent | heap | 8688 kB |
(1 row)
The size of the table has nearly doubled to 8688 kB. This is expected: under PostgreSQL's MVCC, an UPDATE writes a new version of each row, and the old versions remain in the table as dead tuples until they are vacuumed.
Delete Rows
postgres=# DELETE FROM test WHERE id BETWEEN 50000 AND 100000;
DELETE 50001
50,001 rows have been deleted.
Check the Size of the Table After Deletion
After the deletion, check the size of the test table:
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+---------+-------------
public | test | table | postgres | permanent | heap | 6520 kB |
(1 row)
The size of the table is now 6520 kB. Deleting rows marks them as dead tuples; most of the freed space stays inside the data files as reusable holes rather than being returned to the operating system.
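You can confirm that the remaining space is occupied by dead tuples by checking the table's statistics (the counters are maintained by the statistics collector, so they may lag slightly behind the actual state):

```sql
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'test';
```

A high n_dead_tup relative to n_live_tup is exactly the situation pg_repack is designed to clean up.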
Using pg_repack to Reclaim Space in PostgreSQL
To reclaim space in a PostgreSQL table, you can use the pg_repack extension. Here’s the process to repack a table and check the space before and after the operation:
To repack the test table and reclaim the unused space, run pg_repack with the -e (echo queries) option:
[postgres@localhost ~]$ pg_repack --table=test -d postgres -e
LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) SET search_path TO pg_catalog, pg_temp, public
LOG: (query) select repack.version(), repack.version_sql()
LOG: (query) SET statement_timeout = 0
LOG: (query) SET search_path = pg_catalog, pg_temp, public
LOG: (query) SET client_min_messages = warning
LOG: (query) SELECT r FROM (VALUES ($1, 'r')) AS given_t(r,kind) WHERE NOT EXISTS( SELECT FROM repack.tables WHERE relid=to_regclass(given_t.r)) AND NOT EXISTS( SELECT FROM pg_catalog.pg_class c WHERE c.oid=to_regclass(given_t.r) AND c.relkind = given_t.kind AND given_t.kind = 'p')
LOG: (param:0) = test
LOG: (query) SELECT t.*, coalesce(v.tablespace, t.tablespace_orig) as tablespace_dest FROM repack.tables t, (VALUES (quote_ident($1::text))) as v (tablespace) WHERE (relid = $2::regclass) ORDER BY t.relname, t.schemaname
LOG: (param:0) = (null)
LOG: (param:1) = test
INFO: repacking table "public.test"
LOG: (query) SELECT pg_try_advisory_lock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG: (param:0) = 16185446
LOG: (param:1) = 25021
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT pg_get_indexdef(indexrelid) FROM pg_index WHERE indrelid = $1 AND NOT indisvalid
LOG: (param:0) = 25021
LOG: (query) SELECT indexrelid, repack.repack_indexdef(indexrelid, indrelid, $2, FALSE) FROM pg_index WHERE indrelid = $1 AND indisvalid
LOG: (param:0) = 25021
LOG: (param:1) = (null)
LOG: (query) SELECT repack.conflicted_triggers($1)
LOG: (param:0) = 25021
LOG: (query) SELECT repack.create_index_type(25024,25021)
LOG: (query) SELECT repack.create_log_table(25021)
LOG: (query) CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.test FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('id')
LOG: (query) ALTER TABLE public.test ENABLE ALWAYS TRIGGER repack_trigger
LOG: (query) SELECT repack.disable_autovacuum('repack.log_25021')
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SELECT pg_backend_pid()
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 25021 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL SERIALIZABLE
LOG: (query) SELECT set_config('work_mem', current_setting('maintenance_work_mem'), true)
LOG: (query) SELECT coalesce(array_agg(l.virtualtransaction), '{}') FROM pg_locks AS l LEFT JOIN pg_stat_activity AS a ON l.pid = a.pid LEFT JOIN pg_database AS d ON a.datid = d.oid WHERE l.locktype = 'virtualxid' AND l.pid NOT IN (pg_backend_pid(), $1) AND (l.virtualxid, l.virtualtransaction) <> ('1/1', '-1/0') AND (a.application_name IS NULL OR a.application_name <> $2) AND a.query !~* E'^\\s*vacuum\\s+' AND a.query !~ E'^autovacuum: ' AND ((d.datname IS NULL OR d.datname = current_database()) OR l.database = 0)
LOG: (param:0) = 41042
LOG: (param:1) = pg_repack
LOG: (query) DELETE FROM repack.log_25021
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'relation' AND granted = false AND relation = 25021 AND mode = 'AccessExclusiveLock' AND pid <> pg_backend_pid()
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS SHARE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.create_table($1, $2)
LOG: (param:0) = 25021
LOG: (param:1) = pg_default
LOG: (query) INSERT INTO repack.table_25021 SELECT id,name FROM ONLY public.test
LOG: (query) SELECT repack.disable_autovacuum('repack.table_25021')
LOG: (query) COMMIT
LOG: (query) CREATE UNIQUE INDEX index_25024 ON repack.table_25021 USING btree (id)
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: (param:0) = SELECT * FROM repack.log_25021 ORDER BY id LIMIT $1
LOG: (param:1) = INSERT INTO repack.table_25021 VALUES ($1.*)
LOG: (param:2) = DELETE FROM repack.table_25021 WHERE (id) = ($1.id)
LOG: (param:3) = UPDATE repack.table_25021 SET (id, name) = ($2.id, $2.name) WHERE (id) = ($1.id)
LOG: (param:4) = DELETE FROM repack.log_25021 WHERE id IN (
LOG: (param:5) = 1000
LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)
LOG: (param:0) = {}
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: (param:0) = SELECT * FROM repack.log_25021 ORDER BY id LIMIT $1
LOG: (param:1) = INSERT INTO repack.table_25021 VALUES ($1.*)
LOG: (param:2) = DELETE FROM repack.table_25021 WHERE (id) = ($1.id)
LOG: (param:3) = UPDATE repack.table_25021 SET (id, name) = ($2.id, $2.name) WHERE (id) = ($1.id)
LOG: (param:4) = DELETE FROM repack.log_25021 WHERE id IN (
LOG: (param:5) = 0
LOG: (query) SELECT repack.repack_swap($1)
LOG: (param:0) = 25021
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) SAVEPOINT repack_sp1
LOG: (query) SET LOCAL lock_timeout = 100
LOG: (query) LOCK TABLE public.test IN ACCESS EXCLUSIVE MODE
LOG: (query) RESET lock_timeout
LOG: (query) SELECT repack.repack_drop($1, $2)
LOG: (param:0) = 25021
LOG: (param:1) = 4
LOG: (query) COMMIT
LOG: (query) BEGIN ISOLATION LEVEL READ COMMITTED
LOG: (query) ANALYZE public.test
LOG: (query) COMMIT
LOG: (query) SELECT pg_advisory_unlock($1, CAST(-2147483648 + $2::bigint AS integer))
LOG: (param:0) = 16185446
LOG: (param:1) = 25021
Checking the Table Size After Repacking
After the repack process completes, you can check the table size again:
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+---------+-------------
public | test | table | postgres | permanent | heap | 2192 kB |
(1 row)
The size of the test table has reduced to 2192 kB, indicating that the unused space has been reclaimed, and the table’s storage has been optimized.
pg_repack offers many more options; review them and use the ones that match your requirements.
[postgres@localhost ~]$ pg_repack --help
pg_repack re-organizes a PostgreSQL database.
Usage:
pg_repack [OPTION]... [DBNAME]
Options:
-a, --all repack all databases
-t, --table=TABLE repack specific table only
-I, --parent-table=TABLE repack specific parent table and its inheritors
-c, --schema=SCHEMA repack tables in specific schema only
-s, --tablespace=TBLSPC move repacked tables to a new tablespace
-S, --moveidx move repacked indexes to TBLSPC too
-o, --order-by=COLUMNS order by columns instead of cluster keys
-n, --no-order do vacuum full instead of cluster
-N, --dry-run print what would have been repacked
-j, --jobs=NUM Use this many parallel jobs for each table
-i, --index=INDEX move only the specified index
-x, --only-indexes move only indexes of the specified table
-T, --wait-timeout=SECS timeout to cancel other backends on conflict
-D, --no-kill-backend don't kill other backends when timed out
-Z, --no-analyze don't analyze at end
-k, --no-superuser-check skip superuser checks in client
-C, --exclude-extension don't repack tables which belong to specific extension
Connection options:
-d, --dbname=DBNAME database to connect
-h, --host=HOSTNAME database server host or socket directory
-p, --port=PORT database server port
-U, --username=USERNAME user name to connect as
-w, --no-password never prompt for password
-W, --password force password prompt
Generic options:
-e, --echo echo queries
-E, --elevel=LEVEL set output message level
--help show this help, then exit
--version output version information, then exit
Read the website for details: <https://reorg.github.io/pg_repack/>.
Report bugs to <https://github.com/reorg/pg_repack/issues>.
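Based on the options above, here are a few useful invocations, assuming a local cluster and the postgres database from this demo:

```shell
# Preview what would be repacked, without changing anything
pg_repack --dry-run --table=test -d postgres

# Rebuild only the indexes of a table, leaving the heap untouched
pg_repack --only-indexes --table=test -d postgres

# Repack all tables in the public schema, using 4 parallel jobs per table
pg_repack --jobs=4 --schema=public -d postgres
```

A dry run first is a good habit on production systems, since it shows exactly which relations would be touched before any work is done.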
Conclusion
pg_repack is a powerful tool for optimizing PostgreSQL databases by reducing fragmentation and reclaiming disk space. It helps maintain high performance in production systems by reorganizing tables and indexes online, with minimal impact on ongoing operations. By regularly running pg_repack, you can keep your database lean, fast, and efficient.
With its simple installation and straightforward usage, pg_repack is an essential tool for PostgreSQL administrators who want to ensure their databases remain performant as they grow. Whether you’re working with a large, active database or a smaller, more static one, pg_repack can help you maintain optimal performance with minimal effort.
