When setting up physical streaming replication in PostgreSQL, a single misstep can lead to critical issues, even when all the standard steps are followed.
In one of my recent setups, I began by preparing the primary server. I enabled archive_mode, configured an archive_command, and tuned a few WAL parameters. Then I used pg_basebackup to initialize the standby from the primary. Everything looked fine: the standby connected, streaming replication started, and WAL was flowing.
But shortly after running write-heavy operations on the primary, the standby crashed.
The root cause? A misconfiguration that was silently copied during the base backup. The archive configuration from the primary was still active on the standby, but the archive destination didn't exist there. WAL segments started piling up, couldn't be archived, and filled up the disk, which eventually caused the standby to crash.
In this blog, I'll walk you through:
- The steps I followed to set up replication
- What exactly went wrong and why
- How I fixed the issue and the lessons I learned
If you're using pg_basebackup for physical replication, this is a mistake worth learning from: it's not always the complex parts that fail, but the ones you didn't even think to double-check.
Replication setup:
Here are the steps I followed to set up physical streaming replication.
Create a replication user:
CREATE ROLE rep_user WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'password';
Configure postgresql.conf on the primary:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /test/archive/%f'
min_wal_size = 2GB
max_wal_size = 8GB
hot_standby = on
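Since archive_mode (and wal_level) only take effect at server start, restart PostgreSQL after editing postgresql.conf. A minimal sketch, assuming a systemd-managed PostgreSQL 13 install; the service name may differ on your system:
# Restart the primary so archive_mode and the WAL settings take effect
sudo systemctl restart postgresql-13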
Update pg_hba.conf to allow replication connections from the standby:
host replication rep_user standby_ip/32 md5
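pg_hba.conf changes are picked up on a reload, no restart needed. A quick sketch, assuming you can run psql as a superuser:
# Reload the configuration so the new pg_hba.conf entry is applied
psql -c "SELECT pg_reload_conf();"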
On the standby:
On the standby server, it's important to stop PostgreSQL before running pg_basebackup. The tool takes a binary copy of the primary server's data directory to initialize the standby, and if the standby is still running during this process, the result can be an incomplete or inconsistent copy. That can lead to issues like WAL corruption, recovery failures, or replication not starting at all. Stopping the standby ensures a clean snapshot of the primary's state and avoids complications during the setup.
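For example, again assuming a systemd-managed PostgreSQL 13 install:
# Stop the standby before wiping its data directory and taking the base backup
sudo systemctl stop postgresql-13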
Remove old standby data directory:
rm -rf /var/lib/pgsql/13/data/*
Run pg_basebackup on the standby:
pg_basebackup -h primary_ip -U rep_user -D /var/lib/pgsql/13/data -X stream -P -R
This takes a complete binary copy of the primary. The -R flag automatically creates standby.signal and writes the connection settings (primary_conninfo) into postgresql.auto.conf.
Start PostgreSQL on the standby:
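Assuming the same systemd service name as before:
# Start the standby; it should enter recovery and begin streaming from the primary
sudo systemctl start postgresql-13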
Verify replication status immediately:
On the primary:
SELECT pid, usename, application_name, client_addr, state, sync_state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
On the standby:
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
SELECT pg_is_in_recovery();
On a healthy standby, pg_is_in_recovery() returns t, and the receive and replay LSNs advance together as changes stream in.
Crash issues:
After running a batch of inserts on the primary, the standby crashed. Here is the sequence of events:
- WAL generation on the primary surged due to the heavy insert activity.
- The standby, trying to stay in sync, started replaying WAL at a high rate.
- It had also inherited the primary's archive_command, copied over during pg_basebackup.
- Unfortunately, the directory /test/archive/ didn't exist on the standby.
- As a result, the archive_command failed silently and WAL segments started piling up (the pg_stat_archiver check after this list shows how to catch this early).
- PostgreSQL couldn't recycle or remove old WAL because archiving was still marked as in progress.
- Disk usage on the standby exploded, filling up pg_wal.
- At the same time, memory usage spiked due to WAL replay buffers.
- Together, these eventually caused the standby to crash abruptly.
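A failure like this is visible in pg_stat_archiver long before the disk fills up. A check worth running on any node with archiving enabled:
-- A rising failed_count with a stale last_archived_time means archiving is broken
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;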
Here are the logs:
-postgres@new-0LOG: connection authenticated: identity="postgres" method=peer (/var/lib/pgsql/13/data/pg_hba.conf:85)
-postgres@new-0LOG: connection authorized: user=postgres database=new application_name=psql
-postgres@postgres-0LOG: disconnection: session time: 0:00:06.593 user=postgres database=postgres host=[local]
-postgres@new-0LOG: duration: 5.188 ms
-postgres@new-0LOG: duration: 0.972 ms
--@-0LOG: could not create archive status file "pg_wal/archive_status/0000000100000000000000B2.done": No space left on device
--@-0FATAL: could not extend file "base/16384/16388.1": No space left on device
--@-0HINT: Check free disk space.
--@-0CONTEXT: WAL redo at 0/B224D8D8 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/16384/16388, blk 146658
--@-0FATAL: could not create file "pg_wal/xlogtemp.22337": No space left on device
--@-0LOG: startup process (PID 22333) exited with exit code 1
--@-0LOG: terminating any other active server processes
--@-0LOG: could not open temporary statistics file "pg_stat/global.tmp": No space left on device
--@-0LOG: shutting down due to startup process failure
--@-0LOG: database system is shut down
The WAL files had consumed all the available disk space. Worse, PostgreSQL couldn't even write temporary stats files or archive status markers. It was stuck.
Recovery steps:
- Checked what was consuming the space and found the pg_wal directory flooded with WAL files.
- Before going any further, I copied all the WAL files to a separate location and then cleaned up pg_wal to free space (see the sketch after this list).
- After cleaning up the directory, I tried restarting the database, and it failed again.
- The log showed PostgreSQL was still looking for a specific WAL segment, which made sense: the database needed that segment to finish recovery.
- I found that segment in the backup location, copied it back into pg_wal, and fixed the ownership and permissions.
- This time the standby started successfully.
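Here is a sketch of that cleanup. The paths are illustrative for a PostgreSQL 13 install, and the segment name is the one from the logs above; adjust both to your layout, and never delete WAL without keeping a copy first:
# Confirm pg_wal is what filled the disk
du -sh /var/lib/pgsql/13/data/pg_wal
# Copy every segment somewhere safe before removing anything
mkdir -p /backup/wal_dump
cp /var/lib/pgsql/13/data/pg_wal/0* /backup/wal_dump/
# Free the space, then restore the one segment recovery asked for
rm /var/lib/pgsql/13/data/pg_wal/0*
cp /backup/wal_dump/0000000100000000000000B2 /var/lib/pgsql/13/data/pg_wal/
chown postgres:postgres /var/lib/pgsql/13/data/pg_wal/0000000100000000000000B2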
Lessons learned:
- pg_basebackup copies postgresql.conf as-is, including parameters that only make sense on the primary.
- Always review the configuration on the standby before starting it (see the sketch after this list).
- WAL bloat is deadly in replication.
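One way to deal with the inherited settings, sketched with ALTER SYSTEM so the overrides land in postgresql.auto.conf; the standby archive path here is illustrative:
-- On the standby: either disable archiving entirely (needs a restart to apply) ...
ALTER SYSTEM SET archive_mode = off;
-- ... or keep archiving but point it at a directory that actually exists there
ALTER SYSTEM SET archive_command = 'cp %p /standby/archive/%f';
SELECT pg_reload_conf();  -- archive_command changes apply on reload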
Let me know if you've run into similar issues or have tricks for preventing archive-related mishaps in replication. Always learning.