PostgreSQL Automatic failover with repmgr

Introduction

Welcome to our latest blog!! In today’s world of distributed databases, ensuring high availability and reliability is paramount. This blog explores the crucial concepts of automatic failover and switchover using repmgr, a popular tool for managing replication and failover in PostgreSQL environments. We know how important it is for the crucial sectors like banking to have a HA solution for their PODS, with PostgreSQL. In the previous blog , we have seen the designing and implementation part of the HA systems but now we will look into the automatic failover and switchover part of the story.

Automatic failover is a critical feature in database systems that ensures minimal downtime in the event of a primary server failure. When the primary node becomes unavailable due to hardware failure, network issues, or other reasons, an automatic failover mechanism seamlessly promotes a standby node to become the new primary node. This transition happens automatically without manual intervention, thereby reducing downtime and ensuring continuous service availability. Here are the detailed process of achieving HA and also retaining old primary back.

step1: check the daemon status

[postgres@node1 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf daemon status
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1   | standby |   running | standby | running | 1641 | no      | 0 second(s) ago    
 2  | standby | primary | * running |          | running | 1716 | no      | n/a

We are able to see the both primary and standby running the daemon status

Step2 : Stopping the Primary Server

Stop the primary PostgreSQL server.

postgres@node1 ~]$ /usr/pgsql-15/bin/pg_ctl stop -D /var/lib/pgsql/15/data/
waiting for server to shut down.... done
server stopped

Confirms the server shutdown.

Step 3:Check the Cluster show

checking the cluster show on the standby server

[postgres@node1 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf  cluster show
WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
 ID | Name    | Role    | Status        | Upstream  | Location | Priority | Timeline | Connection string                                                               
----+---------+---------+---------------+-----------+----------+----------+----------+----------------------------------------------------------------------------------
 1  | node1   | standby |   running     | ? standby | default  | 100      | 2        | host=192.168.232.150 user=repmgr dbname=repmgr password=repmgr connect_timeout=2
 2  | standby | primary | ? unreachable | ?         | default  | 100      |          | host=192.168.232.151 user=repmgr dbname=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)'s upstream node "standby" (ID: 2)
  - unable to determine if node "node1" (ID: 1) is attached to its upstream node "standby" (ID: 2)
  - unable to connect to node "standby" (ID: 2)
  - node "standby" (ID: 2) is registered as an active primary but is unreachable

HINT: execute with --verbose option to see connection error messages

Node “standby” (ID: 2) is registered as an active primary but is unreachable

Step 4 : Restarting the old primary and checking cluster

[postgres@standby ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name    | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                               
----+---------+---------+----------------------+----------+----------+----------+----------+----------------------------------------------------------------------------------
 1  | node1   | standby | ! running as primary |          | default  | 100      | 3        | host=192.168.232.150 user=repmgr dbname=repmgr password=repmgr connect_timeout=2
 2  | standby | primary | * running            |          | default  | 100      | 2        | host=192.168.232.151 user=repmgr dbname=repmgr password=repmgr connect_timeout=2

WARNING: following issues were detected
  - node "node1" (ID: 1) is registered as standby but running as primary

Step5 : Converting old primary to standby clone

Stop the old primary server to perform this activity.

[postgres@standby ~]$ /usr/pgsql-15/bin/repmgr -h 192.168.232.150 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/15/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.232.150 user=repmgr dbname=repmgr
DETAIL: current installation size is 30 MB
INFO: "repmgr" extension is installed in database "repmgr"
WARNING: target data directory appears to be a PostgreSQL data directory
DETAIL: target data directory is "/var/lib/pgsql/15/data"
HINT: use -F/--force to overwrite the existing data directory
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
  /usr/pgsql-15/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/15/data -h 192.168.232.150 -p 5432 -U repmgr -X stream 
INFO: all prerequisites for "standby clone" are met

Confirms prerequisites for standby clone.

[postgres@standby ~]$ /usr/pgsql-15/bin/repmgr -h 192.168.232.150 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone -F
NOTICE: destination directory "/var/lib/pgsql/15/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.232.150 user=repmgr dbname=repmgr
DETAIL: current installation size is 30 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
WARNING: directory "/var/lib/pgsql/15/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/var/lib/pgsql/15/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/pgsql-15/bin/pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/15/data -h 192.168.232.150 -p 5432 -U repmgr -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/15/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record

Completes the standby clone process.

Step 6: Starting the New Standby Server

[postgres@standby ~]$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data start
waiting for server to start....2024-06-19 12:09:05.620 IST [4644] LOG:  redirecting log output to logging collector process
2024-06-19 12:09:05.620 IST [4644] HINT:  Future log output will appear in directory "log".
 done
server started

Confirms the server has started.

Step 7: Registering the New Standby

[postgres@standby ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register -F
INFO: connecting to local node "standby" (ID: 2)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "standby" (ID: 2) successfully registered
Standby registration is complete and successful.

Check the cluster status:

postgres@standby ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                               
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------
 1  | node1   | primary | * running |          | default  | 100      | 3        | host=192.168.232.150 user=repmgr dbname=repmgr password=repmgr connect_timeout=2
 2  | standby | standby |   running | node1    | default  | 100      | 3        | host=192.168.232.151 user=repmgr dbname=repmgr password=repmgr connect_timeout=2

Step 8: Performing a Switchover

Switch roles between primary and standby server.

[postgres@localhost data]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby switchover --always-promote
NOTICE: executing switchover on node "standby" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "standby" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1) 
DETAIL: executing server command "/usr/pgsql-15/bin/pg_ctl  -D '/var/lib/pgsql/15/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/27000028
NOTICE: promoting standby to primary
DETAIL: promoting server "standby" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "standby" (ID: 2) was successfully promoted to primary
NOTICE: node "standby" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "standby" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

Standby promoted to primary, and node1 demoted to standby. Switchover successful

Step 9: Final Cluster Status Check

[postgres@localhost data]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                              
----+------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
 1  | node1  | standby |   running | standby    | default  | 100      | 7        |host=192.168.232.150 user=repmgr dbname=repmgr password=repmgr connect_timeout=2
 2  | standby  | primary | * running |          | default  | 100      | 8        | host=192.168.232.151 user=repmgr dbname=repmgr password=repmgr  connect_timeout=2

node1 is now the standby and standby is now the primary.

Conclusion

Knowing the important of having a HA systems for their PODs, with PostgreSQL and using a tool like repmgr will be asset to the organization . In this blog , we have performed automatic failover by detecting failure of the primary and promoting the most suitable standby without any manual intervention. And also process of retaining the old primary to standby later to primary again is a must important to know for the DBA’s.

Thank you and stay tuned!!