Battle-Tested: Being a PostgreSQL DBA

Being a PostgreSQL DBA means living in a world where production issues don’t arrive on a schedule. The past few months brought a series of incidents that reinforced an important truth about this role: every problem is an opportunity to sharpen one’s diagnostic skills and operational practices.

These weren’t isolated incidents that happened in a single chaotic day. They were distinct challenges that emerged over time, each teaching something different about running PostgreSQL in production.

Incident 1: The Query That Kept Getting Worse

A few weeks after resolving the I/O situation (covered below as Incident 2), a different pattern emerged. A specific report that had always been “a bit slow” was now timing out completely. As our data volume grew, what had been tolerable performance became unacceptable.

Diagnosing the Problem

When we examined the query, we immediately spotted something that made me wince: the WHERE clause contained the same filter condition twice. This was a logic error from when the query had been modified months earlier, and it had been hiding in plain sight.

SELECT exam_id,
       exam_date
FROM student_exam
WHERE exam_result_status = 'PASS'
  AND (
       student_roll_no = 'STU10234'
    OR student_email = 'abc@gmail.com'
    OR student_phone = '1234567890'
    OR student_roll_no = 'STU10234'
    OR student_registration_id = 'STU10234'
  )
ORDER BY exam_date DESC
LIMIT 1;

But the duplicated student_roll_no condition wasn’t the only issue. The query was performing sequential scans on a table that had grown significantly over the past quarter. What worked fine on a smaller dataset was now completely impractical.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT exam_id,
       exam_date
FROM student_exam
WHERE exam_result_status = 'PASS'
  AND (
       student_roll_no = 'STU10234'
    OR student_email = 'abc@gmail.com'
    OR student_phone = '1234567890'
    OR student_roll_no = 'STU10234'
    OR student_registration_id = 'STU10234'
  )
ORDER BY exam_date DESC
LIMIT 1;
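To confirm that the sequential scans were a sustained pattern rather than a one-off plan choice, the cumulative scan counters in pg_stat_user_tables can be checked over time — a quick sketch against the table from the query above:

```sql
-- Sketch: compare sequential vs. index scan activity on the table
SELECT relname,
       seq_scan,        -- number of sequential scans initiated
       seq_tup_read,    -- tuples read by those sequential scans
       idx_scan,        -- number of index scans initiated
       n_live_tup       -- rough count of live rows
FROM pg_stat_user_tables
WHERE relname = 'student_exam';
```

A climbing seq_scan count alongside a growing n_live_tup is a strong hint that the plan will keep degrading as the table grows.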

The Fix

We cleaned up the WHERE clause to eliminate the redundant condition—basic SQL hygiene that should have been caught in code review. Then we analyzed the access pattern and created a composite index on the status column that is always filtered and the date column used for ordering, so PostgreSQL could walk the index in exam_date order and stop at the first match instead of scanning the whole table.

CREATE INDEX CONCURRENTLY idx_exam_status_date ON student_exam(exam_result_status, exam_date DESC);

The performance improvement was significant. More importantly, we now had a query that would scale with continued data growth rather than degrading further.

SELECT schemaname,
       relname,
       indexrelname,
       idx_scan,
       idx_tup_read,
       idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_exam_status_date';

This incident reminded me that query performance isn’t static. A query that performs acceptably today might become a problem as data volumes increase. As DBAs, we need to think about trajectory, not just current state.
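One hedged way to watch that trajectory is to snapshot relation sizes on a schedule and trend them; the snapshot table below is a hypothetical name, not something from our setup:

```sql
-- Sketch: periodically record table sizes so growth can be trended over time
CREATE TABLE IF NOT EXISTS dba_size_snapshots (
    captured_at timestamptz DEFAULT now(),
    relname     text,
    total_bytes bigint
);

-- Run this from a scheduled job (cron, pg_cron, etc.)
INSERT INTO dba_size_snapshots (relname, total_bytes)
SELECT relname,
       pg_total_relation_size(relid)
FROM pg_stat_user_tables;
```

Comparing snapshots week over week makes “this table doubled last quarter” visible long before a query times out.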

Incident 2: The Slow Burn of High I/O

The first sign was subtle — occasional complaints about application slowness that would disappear before we could investigate. Over several weeks, these complaints became more frequent and persistent. Eventually, it became clear that there was a real problem.

When the I/O metrics finally trended high enough to trigger alerts, we needed to understand the complete picture. We looked at query statistics and identified the top 20 most resource-intensive queries. This wasn’t about finding a single smoking gun — it was about understanding the workload pattern.

SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       shared_blks_read,
       shared_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

There isn’t always a single query causing the problem. In our case, multiple factors were combining to create sustained I/O pressure that our storage subsystem couldn’t handle during peak hours. As a DBA, it is important to analyse the system over a period of time, not just at the moment an alert fires.

Checkpoint frequency can also create artificial pressure. Our checkpoint_timeout setting was forcing frequent checkpoint operations, and each checkpoint created a predictable I/O spike. During low-traffic periods this was manageable; during peak hours it compounded the other I/O demands.

SHOW checkpoint_timeout;
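checkpoint_timeout is only part of the picture: checkpoints are also triggered when WAL volume exceeds max_wal_size, so it helps to review the related settings together. A small sketch using pg_settings:

```sql
-- Sketch: review the settings that govern checkpoint behaviour in one place
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout',
               'max_wal_size',
               'checkpoint_completion_target');
```

If pg_stat_bgwriter later shows many requested (rather than timed) checkpoints, max_wal_size is usually the setting doing the forcing.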

We also identified a recurring application error pattern that was generating unnecessary database traffic. Each occurrence was small, but multiplied across thousands of requests, it added up. The development team acknowledged the issue and committed to fixing it in their next release cycle.
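Small-but-frequent statements like these tend to hide when you sort by total time; sorting pg_stat_statements by call count instead surfaces them — a rough sketch:

```sql
-- Sketch: find high-frequency queries whose cost comes from volume, not weight
SELECT query,
       calls,
       mean_exec_time,
       calls * mean_exec_time AS approx_total_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
```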

We tuned checkpoint_timeout to better match our actual workload characteristics and dialed back the logging verbosity to capture what we needed without creating unnecessary write volume.

ALTER SYSTEM SET checkpoint_timeout = '15min'; -- depends on your workload pattern

-- Reload the configuration so the change takes effect
SELECT pg_reload_conf();

The improvements were measurable but not dramatic, which made sense. We’d been living with this problem for weeks, gradually adapting to degraded performance. Fixing it felt more like relief than rescue.

SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend
FROM pg_stat_bgwriter;

Incident 3: The Silent Space Consumer

About a month later, we noticed something odd during a routine capacity planning review: WAL disk usage on the primary was trending upward in a way that didn’t match our transaction growth. It wasn’t critical yet, but the trend was concerning.

Tracking Down the Cause

This investigation took some time because the problem was gradual. WAL files were accumulating slowly enough that it didn’t trigger any immediate alarms, but the growth rate was clearly wrong.

df -h /var/lib/pgsql/
du -sh /var/lib/pgsql/*/pg_wal/

-- In psql: take two WAL position readings 30 minutes apart
SELECT now()::timestamp(0), pg_current_wal_insert_lsn() AS reading1;
SELECT pg_sleep(1800);
SELECT now()::timestamp(0), pg_current_wal_insert_lsn() AS reading2;
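The two readings can then be turned into a WAL generation rate with pg_wal_lsn_diff; the LSN literals below are placeholders for whatever the readings actually returned:

```sql
-- Sketch: bytes of WAL generated between the two readings (~30 minutes apart)
-- Arguments are (later LSN, earlier LSN); values here are illustrative only
SELECT pg_size_pretty(
         pg_wal_lsn_diff('0/5D400000', '0/5A000000')
       ) AS wal_generated;
```

If the per-hour rate is modest but pg_wal keeps growing, something is preventing WAL from being recycled — which is exactly what pointed us at replication slots.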

When we finally checked pg_replication_slots, we found the culprit: a replication slot for a standby server that had been decommissioned weeks earlier. Nobody had dropped the slot after the replica was shut down.

SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retention_bytes
FROM pg_replication_slots;

PostgreSQL was doing exactly what it was designed to do—holding onto WAL files to ensure that the replica could catch up if it came back online. Of course, it was never coming back online.

The Resolution

We dropped the orphaned replication slot, which allowed PostgreSQL to immediately begin purging WAL files that were no longer needed. We also committed to managing WAL retention more proactively going forward.

SELECT pg_drop_replication_slot('standby_server_02');

More importantly, we added replication slot monitoring to our standard health checks.

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;

This incident happened because we had a gap in our operational runbooks—decommissioning a replica should always include dropping its replication slot, but we didn’t have that documented.
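Beyond the runbook fix, PostgreSQL 13 and later also offer a configuration guardrail: max_slot_wal_keep_size caps how much WAL any slot may retain, invalidating slots that fall too far behind instead of filling the disk. A hedged sketch — the limit here is illustrative, not our actual value:

```sql
-- Sketch: cap WAL retained by replication slots (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';  -- illustrative limit
SELECT pg_reload_conf();
```

This trades a full disk for a broken replica, which is usually the better failure mode — but it means standby invalidation must be monitored.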

Lessons from the Trenches

Looking back at these three incidents, they tell a story about the reality of database administration:

Problems evolve over time. The I/O issue didn’t appear overnight—it grew gradually as our traffic patterns changed and our configuration assumptions became outdated. Monitoring trends is just as important as responding to alerts.

Context matters. That poorly optimized query worked fine when it was written. It only became a problem as our data volume grew. Understanding how systems change over time is crucial for preventing future incidents.

Operational discipline prevents problems. The orphaned replication slot existed because we didn’t have proper decommissioning procedures. Technical debt accumulates when you skip the cleanup steps.

Every incident is a learning opportunity. Each of these problems exposed gaps in our processes—configuration management, code review, infrastructure hygiene. Fixing the immediate issue is important, but improving your operational practices is how you prevent the next incident.

Beyond These Incidents

These three incidents are resolved, but they’ve raised other questions. We’ve identified concerns with our high availability setup that need attention. While none of these issues caused a complete outage, they revealed architectural considerations we need to address to ensure long-term reliability.

What Being a DBA Really Means

This is the reality of being a PostgreSQL DBA: incidents don’t arrive neatly packaged or on a predictable schedule. They emerge gradually, they compound with other issues, and they require different diagnostic approaches.

You need to understand configuration tuning for the I/O problems, query optimization for the performance issues, and replication mechanics for the WAL accumulation. You need to think in timescales ranging from immediate response to long-term capacity planning.

Most importantly, you need to see patterns. The I/O issue taught me to review configurations regularly as workloads evolve. The query problem reminded me that performance is relative to data volume. The WAL accumulation showed me the importance of operational checklists and proper decommissioning procedures.

Each incident made me a better DBA—not just because we solved the problem, but because we learned something about how PostgreSQL behaves in production and how to build more resilient operational practices.

Tomorrow will bring different challenges. But with each incident, our mental model gets sharper, our diagnostic toolkit gets broader, and our runbooks get better. That’s what being battle-tested really means.
