When setting up physical streaming replication in PostgreSQL, a single misstep can lead to critical issues, even when all the standard steps are followed.
In one of my recent setups, I began by preparing the primary server. I enabled archive_mode, configured an archive_command, and tuned a few WAL parameters. Then I used pg_basebackup to initialize the standby from the primary. Everything looked fine: the standby connected, streaming replication started, and WAL was flowing.
But shortly after running write-heavy operations on the primary, the standby crashed.
The root cause? A misconfiguration that was silently copied during the base backup. The archive configuration from the primary was still active on the standby, but the archive destination didn't exist there. WAL segments started piling up, couldn't be archived, and filled up the disk, which eventually caused the standby to crash.
In this blog, I'll walk you through:
- The steps I followed to set up replication
- What exactly went wrong and why
- How I fixed the issue and the lessons I learned
If you're using pg_basebackup for physical replication, this is a mistake worth learning from: it's not always the complex parts that fail, but the ones you didn't even think to double-check.
Replication setup:
Here are the steps I followed to set up physical streaming replication.
Create a replication user:
CREATE ROLE rep_user WITH REPLICATION LOGIN ENCRYPTED PASSWORD 'password';
Configure postgresql.conf on the primary:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /test/archive/%f'
min_wal_size = 2GB
max_wal_size = 8GB
hot_standby = on
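Since archive_mode (and wal_level) only take effect at server start, restart PostgreSQL after editing postgresql.conf. A minimal sketch, assuming a systemd-managed PostgreSQL 13 install; the service name may differ on your system:
# Restart the primary so archive_mode and the WAL settings take effect
sudo systemctl restart postgresql-13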
Update pg_hba.conf to allow replication connections from the standby:
host replication rep_user standby_ip/32 md5
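pg_hba.conf changes are picked up on a reload, no restart needed. A quick sketch, assuming you can run psql as a superuser:
# Reload the configuration so the new pg_hba.conf entry is applied
psql -c "SELECT pg_reload_conf();"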
On the standby:
On the standby server, it's important to stop PostgreSQL before running pg_basebackup. The tool takes a binary copy of the primary server's data directory to initialize the standby, and if the standby is still running during this process, the result can be an incomplete or inconsistent copy. That can lead to issues like WAL corruption, recovery failures, or replication not starting at all. Stopping the standby ensures a clean snapshot of the primary's state and avoids complications during the setup.
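For example, again assuming a systemd-managed PostgreSQL 13 install:
# Stop the standby before wiping its data directory and taking the base backup
sudo systemctl stop postgresql-13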
Remove old standby data directory:
rm -rf /var/lib/pgsql/13/data/*
Run pg_basebackup on the standby:
pg_basebackup -h primary_ip -U rep_user -D /var/lib/pgsql/13/data -X stream -P -R
This takes a complete binary copy of the primary. The -R flag automatically creates standby.signal and writes the connection settings (primary_conninfo) into postgresql.auto.conf.
Start PostgreSQL on the standby:
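Assuming the same systemd service name as before:
# Start the standby; it should enter recovery and begin streaming from the primary
sudo systemctl start postgresql-13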
Verify replication status immediately:
On the primary:
SELECT pid, usename, application_name, client_addr, state, sync_state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
On the standby:
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
SELECT pg_is_in_recovery();
On a healthy standby, pg_is_in_recovery() returns t, and the receive and replay LSNs advance together as changes stream in.
Crash issues:
After running a batch of inserts on the primary, the standby crashed. Here is the sequence of events:
- WAL generation on the primary surged due to the heavy insert activity.
- The standby, trying to stay in sync, started replaying WAL at a high rate.
- It had also inherited the primary's archive_command, copied over during pg_basebackup.
- Unfortunately, the directory /test/archive/ didn't exist on the standby.
- As a result, the archive_command failed silently and WAL segments started piling up (the pg_stat_archiver check after this list shows how to catch this early).
- PostgreSQL couldn't recycle or remove old WAL because archiving was still marked as in progress.
- Disk usage on the standby exploded, filling up pg_wal.
- At the same time, memory usage spiked due to WAL replay buffers.
- Together, these eventually caused the standby to crash abruptly.
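A failure like this is visible in pg_stat_archiver long before the disk fills up. A check worth running on any node with archiving enabled:
-- A rising failed_count with a stale last_archived_time means archiving is broken
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;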
Here are the logs:
-postgres@new-0LOG: connection authenticated: identity="postgres" method=peer (/var/lib/pgsql/13/data/pg_hba.conf:85)
-postgres@new-0LOG: connection authorized: user=postgres database=new application_name=psql
-postgres@postgres-0LOG: disconnection: session time: 0:00:06.593 user=postgres database=postgres host=[local]
-postgres@new-0LOG: duration: 5.188 ms
-postgres@new-0LOG: duration: 0.972 ms
--@-0LOG: could not create archive status file "pg_wal/archive_status/0000000100000000000000B2.done": No space left on device
--@-0FATAL: could not extend file "base/16384/16388.1": No space left on device
--@-0HINT: Check free disk space.
--@-0CONTEXT: WAL redo at 0/B224D8D8 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/16384/16388, blk 146658
--@-0FATAL: could not create file "pg_wal/xlogtemp.22337": No space left on device
--@-0LOG: startup process (PID 22333) exited with exit code 1
--@-0LOG: terminating any other active server processes
--@-0LOG: could not open temporary statistics file "pg_stat/global.tmp": No space left on device
--@-0LOG: shutting down due to startup process failure
--@-0LOG: database system is shut down
The WAL files had consumed all the available disk space. Worse, PostgreSQL couldn't even write temporary stats files or archive status markers. It was stuck.
Recovery steps:
- Checked what was consuming the space and found the pg_wal directory flooded with WAL files.
- Before going any further, I copied all the WAL files to a separate location and then cleaned up pg_wal to free space (see the sketch after this list).
- After cleaning up the directory, I tried restarting the database, and it failed again.
- The log showed PostgreSQL was still looking for a specific WAL segment, which made sense: the database needed that segment to finish recovery.
- I found that segment in the backup location, copied it back into pg_wal, and fixed the ownership and permissions.
- This time the standby started successfully.
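Here is a sketch of that cleanup. The paths are illustrative for a PostgreSQL 13 install, and the segment name is the one from the logs above; adjust both to your layout, and never delete WAL without keeping a copy first:
# Confirm pg_wal is what filled the disk
du -sh /var/lib/pgsql/13/data/pg_wal
# Copy every segment somewhere safe before removing anything
mkdir -p /backup/wal_dump
cp /var/lib/pgsql/13/data/pg_wal/0* /backup/wal_dump/
# Free the space, then restore the one segment recovery asked for
rm /var/lib/pgsql/13/data/pg_wal/0*
cp /backup/wal_dump/0000000100000000000000B2 /var/lib/pgsql/13/data/pg_wal/
chown postgres:postgres /var/lib/pgsql/13/data/pg_wal/0000000100000000000000B2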
Lessons learned:
- pg_basebackup copies postgresql.conf as-is, including parameters that only make sense on the primary.
- Always review the configuration on the standby before starting it (see the sketch after this list).
- WAL bloat is deadly in replication.
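One way to deal with the inherited settings, sketched with ALTER SYSTEM so the overrides land in postgresql.auto.conf; the standby archive path here is illustrative:
-- On the standby: either disable archiving entirely (needs a restart to apply) ...
ALTER SYSTEM SET archive_mode = off;
-- ... or keep archiving but point it at a directory that actually exists there
ALTER SYSTEM SET archive_command = 'cp %p /standby/archive/%f';
SELECT pg_reload_conf();  -- archive_command changes apply on reload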
Let me know if you've run into similar issues or have tricks for preventing archive-related mishaps in replication. Always learning.