When a Disk Expansion Brought Down a PostgreSQL Archive Pipeline

Production database incidents rarely look like what you expect. They don’t arrive with sirens. They accumulate quietly — a mount point that didn’t remount, a WAL directory no one checked, a permission that evaporated after a restart. By the time the alert fires, the root cause is three steps back, and the damage is already compounding.

This post is a complete account of one such incident: a 100% full physical disk, a VM restart that silently broke archive connectivity, a growing backlog of unarchived WAL files, a careless remount that crashed the database, and a second disk fill event that struck during recovery. We document every phase — what happened, why it happened, and exactly how it was resolved.

Understanding the environment

Before walking through the incident, it helps to understand the components involved and how they relate to each other.

PostgreSQL WAL
Write-Ahead Logging is PostgreSQL’s durability mechanism. Every change is written to a WAL segment in pg_wal before it is applied to the data files. Segments must be either archived or recycled; if neither happens, pg_wal grows until the disk is full.

/archive mount point
A dedicated filesystem where archived WAL segments are shipped by the archive_command. PostgreSQL tracks which files have been successfully archived; if archiving fails, the segment stays in pg_wal.

Shared memory
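PostgreSQL uses POSIX shared memory segments to coordinate between its worker processes. These are created at startup and owned by the postgres OS user; on Linux they appear as files named PostgreSQL.<id> under /dev/shm. OS-level events such as reboots, remounts, or permission changes can leave a segment missing or inaccessible to the processes that need it.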

pg_basebackup
A utility that takes a consistent physical snapshot of the entire cluster. Used here as the recovery baseline — allowing cleanup of WAL history without risking data loss.

In PostgreSQL’s continuous archiving setup, archive_status inside pg_wal tracks each WAL segment’s state: .ready means waiting to be archived, .done means successfully archived. A broken archive destination means an endless queue of .ready files and a growing pg_wal directory.
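
The .ready markers described above can be counted directly from the filesystem. The following is a minimal sketch, assuming a POSIX shell and that the caller passes the real pg_wal path; the function name count_ready_wal is hypothetical, not something PostgreSQL ships.

```shell
# count_ready_wal PG_WAL_DIR
# Prints how many WAL segments are still waiting to be archived.
# PG_WAL_DIR is the pg_wal directory of your cluster (supplied by the caller).
count_ready_wal() {
    status_dir="$1/archive_status"
    if [ ! -d "$status_dir" ]; then
        echo "archive_status not found: $status_dir" >&2
        return 1
    fi
    # Each segment awaiting archival has a <segment>.ready marker file.
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}
```

A count that keeps rising between two runs a few minutes apart is a strong hint that archive_command is failing.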

The incident — a complete timeline

Phase 1 : Root cause

Physical mount point hits 100% capacity.

The underlying physical volume hosting the database server filled to capacity. This wasn’t a sudden spike — it was the slow result of insufficient capacity planning. No automated alert triggered in time, and the first sign of trouble was a manual observation that disk writes were failing. At this point the database was still running, but archiving was already degraded.

Phase 2 : Infrastructure change

Decision to expand storage requires VM restart.

The chosen remediation — expanding the physical disk volume — required a full VM server restart to allow the OS to recognise the new capacity and resize the filesystem. This is a routine operation in many environments. A pre-restart checklist was not followed, and no explicit note was made of which mount points existed and which would need to be manually re-verified after boot.

Phase 3 : Silent failure

/archive fails to remount after restart.

The VM restarted successfully and PostgreSQL came back up normally. However, the /archive mount point — likely an NFS share, SAN volume, or external block device — did not remount automatically. This is a classic /etc/fstab misconfiguration: the entry was either missing, had incorrect options, or had noauto set. The unmounted state was not immediately visible because PostgreSQL continued running — it simply started failing to archive, silently queuing WAL segments in pg_wal.
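
A missing mount like this one can be detected mechanically after boot. The sketch below assumes Linux, where the kernel's mount table is readable at /proc/self/mounts; the function names is_mounted and report_missing_mounts are hypothetical helpers, and the mounts file is a parameter only so the check can be exercised against a sample file.

```shell
# is_mounted MOUNT_POINT [MOUNTS_FILE]
# Succeeds when MOUNT_POINT appears as a mount target in the mounts table.
# MOUNTS_FILE defaults to /proc/self/mounts (Linux-specific assumption).
is_mounted() {
    # Field 2 of each line in the mounts table is the mount point.
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/self/mounts}"
}

# report_missing_mounts MOUNTS_FILE MOUNT_POINT...
# Prints every expected mount point that is absent; returns non-zero if any is missing.
report_missing_mounts() {
    mounts_file="$1"
    shift
    missing=0
    for mp in "$@"; do
        if ! is_mounted "$mp" "$mounts_file"; then
            echo "NOT MOUNTED: $mp"
            missing=1
        fi
    done
    return "$missing"
}
```

Run after a restart as, for example, report_missing_mounts /proc/self/mounts /archive, and treat a non-zero exit status as a failed restart.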

Phase 4 : Accumulation

WAL files build up unnoticed in pg_wal.

With the archive destination unreachable, every WAL segment completed by PostgreSQL was marked .ready and left in pg_wal. The archiver process retries a failed archive_command indefinitely and never discards a segment that has not been archived. Over time, the directory grew significantly. Because the primary disk had just been expanded, there was temporarily enough headroom to absorb the growth, which is precisely why it went unnoticed for an extended period.
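
Growth of this kind is easy to catch by comparing the directory size against a known baseline. This is a rough sketch under stated assumptions: the helper names pg_wal_kb and wal_growth_alert are hypothetical, the baseline is a figure you measure yourself during normal operation, and the default factor of 2 is only an illustrative threshold.

```shell
# pg_wal_kb DIR
# Prints the size of DIR in KiB as reported by du.
pg_wal_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# wal_growth_alert CURRENT_KB BASELINE_KB [FACTOR]
# Succeeds (exit 0) when CURRENT_KB is larger than FACTOR times BASELINE_KB.
# FACTOR defaults to 2, an illustrative value rather than a recommendation.
wal_growth_alert() {
    factor="${3:-2}"
    [ "$1" -gt $(( $2 * factor )) ]
}
```

A cron job could call wal_growth_alert "$(pg_wal_kb /path/to/pg_wal)" 2000000 and page someone when it succeeds; the path and the baseline here are placeholders.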

Phase 5 : Discovery

Unmounted /archive noticed; remounted without assessment.

Eventually the unmounted /archive was noticed — either through a routine check or secondary alert. The natural instinct was to fix it immediately: remount the partition and let archiving resume. This was done without first assessing how many WAL files had accumulated, without checking shared memory state, and without reviewing PostgreSQL’s current process ownership. The mount was restored while the database was live.

Phase 6 : Crash

PostgreSQL throws FATAL shared memory error.

Almost immediately after the remount, the PostgreSQL process crashed with a fatal error. The database became completely unavailable — no connections accepted, all queries failing. This is the moment the incident escalated from a background issue to a full production outage.

FATAL: could not open shared memory segment "/PostgreSQL.xxxxxxxx": Permission denied

Why did this error appear?

This error is caused by a mismatch between the POSIX shared memory segment that PostgreSQL created and the OS-level permissions at the time of access. When a filesystem is remounted — especially if it involves changes to mount options, ownership, or security context — the OS can revoke or alter shared memory permissions. PostgreSQL’s background processes then fail to access the segment they need to coordinate, causing a fatal exit.

It can also occur if the VM restart caused the shared memory segment to be recreated under different permissions than the postgres OS user expects — a state the database only discovers when it next tries to access it after an external filesystem event triggers a process restart or re-check.

The remount was not the cause of the outage in isolation — it was the trigger. The underlying permission drift had already occurred during the VM restart. The remount was the event that forced PostgreSQL to re-examine its shared memory context and discover the inconsistency.

The second wave: archive disk fills during recovery

With the database down and the team focused on restoring it, a second problem was quietly forming. The WAL files that had accumulated in pg_wal over the period of the broken archive began streaming to /archive the moment the database came back online and archiving resumed.

This is a well-known recovery trap. PostgreSQL drains the archival queue as fast as the archive_command allows, so a backlog of any significant size reaches the destination in one sustained burst. At the default segment size of 16 MB, a backlog of a few hundred segments already amounts to several gigabytes, and in environments with high write throughput it can reach tens of gigabytes.

The /archive mount point filled to 100% — not because of the original planning failure, but because a WAL backlog was dumped into a partition whose baseline available space had never been sized to absorb a burst of that magnitude.
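
Whether a backlog will fit can be estimated before archiving resumes. The sketch below assumes the default WAL segment size of 16 MB (clusters initialised with a different size need the third argument) and uses hypothetical helper names; it compares the number of pending segments against the free space that df reports.

```shell
# backlog_kb READY_COUNT [SEGMENT_MB]
# Estimated size in KiB of READY_COUNT pending WAL segments.
# SEGMENT_MB defaults to 16, PostgreSQL's default segment size.
backlog_kb() {
    echo $(( $1 * ${2:-16} * 1024 ))
}

# free_kb DIR
# Free space in KiB on the filesystem holding DIR (POSIX df output format).
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# backlog_fits READY_COUNT FREE_KB [SEGMENT_MB]
# Succeeds when the estimated backlog is no larger than FREE_KB.
backlog_fits() {
    [ "$(backlog_kb "$1" "${3:-16}")" -le "$2" ]
}
```

For example, 640 pending segments come to 10485760 KiB (10 GiB), so backlog_fits 640 "$(free_kb /archive)" fails on a partition with less free space than that, which is the signal to make room before remounting.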

The resolution — step by step

Recovery required addressing three distinct problems in sequence: restoring database availability, clearing the WAL backlog, and reclaiming archive disk space — all without risking data integrity.

Step 1 : Verify psql connectivity.

Before making any changes, we confirmed whether PostgreSQL could accept connections at all. This establishes a baseline: is this a full crash requiring a restart, or is the postmaster still alive?
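
As an alternative to the psql probe we used, the pg_isready utility that ships with PostgreSQL reports the same baseline through its exit status. The sketch below only translates those documented exit codes into words; the function name describe_pg_isready is hypothetical, and the host and port in the usage comment are assumptions.

```shell
# describe_pg_isready EXIT_CODE
# Translates the exit status of pg_isready into a short description.
describe_pg_isready() {
    case "$1" in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections (for example, still starting up)" ;;
        2) echo "no response: the server is likely down" ;;
        3) echo "no attempt made (for example, invalid parameters)" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}

# Usage (host and port are assumptions):
#   pg_isready -h localhost -p 5432
#   describe_pg_isready "$?"
```

A status of 2 points to a full crash that needs a restart, while 1 suggests the postmaster is alive but not yet ready.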

Step 2 : Restore shared memory permissions.

We identified the POSIX shared memory segment owned by PostgreSQL and restored the correct permissions for the postgres OS user. This was the direct fix for the FATAL error.

Step 3 : Restart PostgreSQL cleanly.

After permissions were restored, the service was restarted to reinitialise all process ownership cleanly. This also triggered the archive queue to begin draining.

Step 4 : Take pg_basebackup to alternate volume.

Rather than work on the full /archive partition directly, a full base backup was taken to a separate volume with available space. This provided a consistent recovery point before any destructive cleanup.

Step 5 : Clean archive WAL with pg_archivecleanup.

We used pg_archivecleanup to remove all WAL segments older than the base backup’s starting WAL segment, safely reclaiming the /archive partition without touching segments needed for recovery.

Key commands used

# Check whether PostgreSQL accepts connections
psql -U postgres -c "SELECT 1;"

# List System V shared memory segments owned by postgres
ipcs -m | grep postgres

# Remove a stale System V segment (use the shmid from above)
ipcrm -m <shmid>

# Alternatively, for the POSIX segment named in the FATAL error:
# inspect it under /dev/shm and fix its permissions directly
ls -la /dev/shm/ | grep PostgreSQL
chmod 600 /dev/shm/PostgreSQL.<segment_id>

# Restart the PostgreSQL service
sudo systemctl restart postgresql

# Take a base backup to an alternate location with free space
pg_basebackup -h localhost -U replication_user \
  -D /mnt/backup/basebackup_$(date +%Y%m%d) \
  -Fp -Xs -P -R

# Clean old WAL files from /archive, keeping everything from the
# backup's starting WAL segment onward
pg_archivecleanup /archive <starting_wal_segment_of_backup>

# Verify archive disk usage after cleanup
df -Th /archive

Why take a base backup before cleaning WAL?

This is the most important sequencing decision in the recovery. pg_archivecleanup permanently removes WAL files. If a WAL segment you delete turns out to be needed for point-in-time recovery, that recovery path is gone. By taking a base backup first, you establish a new consistent restore point, so any WAL older than that backup’s starting segment is no longer needed to recover from that backup forward. The trade-off is that you can no longer restore to a point in time before the new backup, so confirm that this is acceptable before deleting anything.
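
One way to pick the cut-off safely is to use the backup history file that PostgreSQL archives for each base backup, then preview the deletion with the dry-run flag of pg_archivecleanup. This is a sketch, not the exact procedure we ran: the helper name newest_backup_history is hypothetical, and it assumes a single timeline and that the newest history file belongs to the base backup you just took and verified.

```shell
# newest_backup_history ARCHIVE_DIR
# Prints the name of the newest *.backup history file in ARCHIVE_DIR.
# WAL file names are fixed-width hexadecimal, so a byte-wise sort orders them
# chronologically within one timeline (assumption: no timeline switch).
newest_backup_history() {
    found=$(find "$1" -maxdepth 1 -type f -name '*.backup' \
        | sed 's|.*/||' | LC_ALL=C sort | tail -n 1)
    [ -n "$found" ] || return 1
    echo "$found"
}

# Usage (the /archive path is taken from this post):
#   keep=$(newest_backup_history /archive) || exit 1
#   pg_archivecleanup -n /archive "$keep"   # dry run: only lists files
#   pg_archivecleanup /archive "$keep"      # actually removes them
```

Before the real run, compare the chosen file name with the START WAL LOCATION line in the backup_label file of the new base backup; if they disagree, stop.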

After pg_archivecleanup completed, the /archive partition dropped back to a healthy usage level. Archiving resumed normally, WAL segments began clearing in real time, and the database returned to full operational status.

Post-incident checklist: what should have been in place

  • Impact assessment before infrastructure changes. If a planned activity will affect the database host, prefer a clean, planned shutdown of PostgreSQL before the change over leaving it running through it.
  • fstab entry for /archive with correct mount options. Every critical mount point used by a running database must persist across reboots. The entry should include appropriate options for the mount type (NFS, block device, etc.) and should be tested with mount -a before relying on it in production.
  • Alert on pg_wal directory size. Set a threshold alert — for example, if pg_wal exceeds 2× its normal steady-state size — to catch silent archiving failures before they become disk events.
  • Alert on archive_command failure rate. PostgreSQL logs every failed archive attempt. Feed these into your monitoring and alert if the failure rate exceeds zero for more than a configurable window.
  • Pre-restart mount point inventory. Before any VM restart, produce a list of active mount points and their expected post-boot state. Verify the list after the restart completes before declaring the restart successful.
  • Pre-remount database state check. Before reattaching any filesystem that a live PostgreSQL instance depends on, check:
    • Is the postmaster running?
    • What is the current shared memory state?
    • How many WAL files are pending archival?
    • How much space is available on the target partition?
  • Size /archive for burst capacity. The archive partition must accommodate not just steady-state WAL throughput but a reasonable burst — for example, if archiving is interrupted for several hours. Size accordingly, and alert before it reaches 70% to give runway for cleanup.
  • Post-restart archive verification. After every infrastructure restart, run SELECT pg_switch_wal(); and confirm the resulting WAL segment appears in the archive destination within the expected time window. This 30-second check would have caught this incident immediately.
  • Documented runbook for shared memory errors. The FATAL shared memory permission error has a known, repeatable fix. Document it. Having a runbook reduces mean time to resolution from an hour of investigation to five minutes of execution.
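
The post-restart archive verification from the checklist can be scripted. The following is a minimal sketch with a hypothetical helper name; it assumes archive_command stores each segment in /archive under its original file name (a compressing archive_command would add a suffix) and that psql can connect as the postgres user.

```shell
# wait_for_archived ARCHIVE_DIR SEGMENT [TIMEOUT_SECONDS]
# Polls once per second until SEGMENT exists in ARCHIVE_DIR.
# Returns 0 when the file appears, non-zero when the timeout (default 30) expires.
wait_for_archived() {
    remaining="${3:-30}"
    while [ "$remaining" -gt 0 ]; do
        if [ -f "$1/$2" ]; then
            return 0
        fi
        sleep 1
        remaining=$(( remaining - 1 ))
    done
    [ -f "$1/$2" ]
}

# Usage (connection details are assumptions):
#   seg=$(psql -U postgres -Atc "SELECT pg_walfile_name(pg_switch_wal());")
#   wait_for_archived /archive "$seg" 30 || echo "segment $seg was not archived"
```

If there has been no write activity since the last switch, pg_switch_wal() does nothing and the query names a segment that was completed earlier, so the check is most meaningful on a system with some traffic.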

Conclusion

Cascade failures exploit invisible dependencies : The /archive mount was critical infrastructure for PostgreSQL’s durability guarantee — but it wasn’t treated as such. Infrastructure dependencies that aren’t explicitly documented tend to surface only when they break, and usually at the worst time.

Silence is not success : PostgreSQL did not crash when archiving broke. It continued running and queuing files. This silent failure mode is a feature, not a bug — but it demands that monitoring fill the gap. If the database doesn’t crash when something is wrong, your alerting system must identify it and generate an alert.

Recovery actions have downstream consequences : Restoring the archive mount looked like a solution. It was actually the trigger for a second problem. Before any recovery action, ask: what happens immediately after this? What queues will drain, what processes will restart, what volumes will receive data? Think two steps ahead.

Base backup before destructive cleanup — always : pg_archivecleanup is not reversible. Taking a base backup before running it is not optional in a production environment; it is the safety net that makes the cleanup safe. This principle extends to any irreversible cleanup operation during an incident.

Runbooks compress recovery time : None of these problems is novel: the FATAL shared memory error, the WAL cleanup procedure, the base backup process. Documented runbooks with exact commands turn a 90-minute fire drill into a 10-minute procedure. The time to write a runbook is right after an incident, not during the next one.

Every production database incident is a story about gaps between expectation and reality. The expectation was that a VM restart is routine. The reality was that routine restarts expose every unmaintained assumption in your infrastructure. Closing those gaps — with fstab entries, with monitoring, with runbooks, with pre-action checklists — is the unglamorous work that separates environments that recover quickly from those that don’t recover at all.
