Understanding Replication Conflicts in PostgreSQL

In any PostgreSQL production database environment, high availability is paramount. Streaming replication is a powerful feature that enables high availability by creating a standby node which is a replica of the primary database. It allows the standby node to serve read-only queries, offloading read traffic from the primary and improving overall system performance. It also provides disaster recovery capabilities, as the standby can act as a backup that’s always up-to-date with the primary database. With such a setup, PostgreSQL provides a robust solution for businesses to handle large-scale workloads and ensure their applications remain operational even in case of hardware or software failures.

Even though streaming replication is beneficial, it also brings you some challenges. One of the issues I’ve recently encountered in one of our client’s production environments is

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

We observed this issue when we were trying to run a query on one of the Standby nodes which is in streaming replication with Primary.

Common causes

This error occurs when a read-only query on the standby node tries to access row versions that are about to be or have been removed during the recovery process. Essentially, the query is trying to read data that no longer exists on the replica because it’s being cleaned up as part of the WAL recovery due to:

Long-running queries on the Standby server.
Replication Lag.
Vacuuming of old rows.

What is replication conflict?

It is a type of Conflict that occurs only on the Standby server, and not on the Primary server. This issue occurs when the recovery process is unable to apply WAL information from Primary to Standby if it would interfere with query processing on the standby. There are some of the conflicts:

Tablespace conflict occurs if an expected temp tablespace is dropped on Primary which is a part of queries being processed on Standby.
Lock conflict occurs when you query a table on standby while the same table is being modified on the primary.
Snapshot conflict occurs when a backend process tries to access rows on the standby server that have been vacuumed out on the primary.
Deadlock conflict occurs due to query cancellations occurring because of deadlocks on standby.

To monitor these replication conflicts, there is a pg_catalog view called pg_stat_database_conflicts . This view shows the statistics of the occurrence of query cancellation due to conflicts with recovery on the standby server as these conflicts don’t occur on the primary server. Here is the output of this view:

postgres=# select * from pg_stat_database_conflicts;
 datid |  datname  | confl_tablespace | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock | confl_active_logicalslot 
-------+-----------+------------------+------------+----------------+-----------------+----------------+--------------------------
     5 | postgres  |                0 |          0 |              0 |               0 |              0 |                        0
 16469 | test      |                0 |          0 |              0 |               0 |              0 |                        0
     1 | template1 |                0 |          0 |              0 |               0 |              0 |                        0
     4 | template0 |                0 |          0 |              0 |               0 |              0 |                        0
(4 rows)

Here you can see that there are no conflicts on my local standby. If there is any occurrence of this issue, it will display as 1 in any of the columns that begin with ‘confl_’ depending on the type of conflict.

How PostgreSQL handles these conflicts

PostgreSQL has a parameter max_standby_streaming_delay that determines what happens when WAL replay encounters a replication conflict. PostgreSQL suspends replay of the WAL information for at most max_standby_streaming_delay milliseconds. If the conflicting query is still running after that time, PostgreSQL cancels it. Default value is 30 seconds . If it is set to, this setting would be disabled.

How to Avoid

To avoid replication conflicts, one simple option is to disable Hot Standby on the replica by setting hot_standby = off. This eliminates the chance of conflicts because the standby won’t process queries. However, this is only practical if the standby server is used exclusively for high availability and not for read queries.

To avoid lock conflicts, it’s important not to run queries that acquire an ACCESS EXCLUSIVE lock, such as DROP TABLE, TRUNCATE, or ALTER TABLE. A special case is VACUUM truncation, which can cause conflicts on the standby. While you can’t avoid the ACCESS EXCLUSIVE lock during VACUUM truncation, you can disable it for individual tables in the database.

For snapshot conflicts, enabling hot_standby_feedback on the standby can help prevent the primary from removing tuples that the standby still needs to see. However, this can lead to table bloat on the primary due to long-running queries, so use it with caution.

Summary

In PostgreSQL, streaming replication is a vital feature for achieving high availability, data redundancy, and disaster recovery. While it offers many advantages, it also introduces challenges, particularly around replication conflicts that can occur on the standby server. The “canceling statement due to conflict with recovery” error is a common issue that arises when long-running queries or replication lag interferes with the WAL recovery process.

By understanding the different types of replication conflicts such as lock conflicts & snapshot conflicts, you can better manage and mitigate these issues. Monitoring tools like pg_stat_database_conflicts can help identify the occurrence of these conflicts and guide you in resolving them.

To prevent replication conflicts, you can consider strategies like disabling, or avoiding certain types of queries that acquire exclusive locks, and enabling settings like hot_standby_feedback. However, each solution has its trade-offs, and it’s crucial to strike a balance between performance and the prevention of conflicts. Ultimately, carefully managing replication, query performance, and system configuration will ensure your PostgreSQL setup remains robust, reliable, and efficient, even in the face of replication challenges.

Common causes

What is replication conflict?

How PostgreSQL handles these conflicts

How to Avoid

Summary

Leave a Comment