PostgreSQL Failover in Action: Testing VIP Management and Split-Brain Detection Without Keepalived

Our journey to PGConf India has officially begun! Starting today, we’re launching a daily blog series that will run every working day as we count down to PGConf India in March 2026. This series goes beyond simple anticipation: we’ll explore PostgreSQL’s cutting-edge developments, dive into real-world database challenges and solutions, and examine the innovations driving the PostgreSQL ecosystem forward. Whether you’re preparing for the talks, eager to network with the community, or simply want to deepen your PostgreSQL expertise, join us on this journey as we build momentum toward one of India’s premier database conferences.

Simple Mantra: Prepared. Postgres-Powered. PGConf India-Bound.

In our first blog, we look at production PostgreSQL environments, where ensuring continuous database availability while preventing data inconsistencies is paramount. One of the most critical challenges in high-availability (HA) clusters is managing Virtual IP (VIP) failover during primary node failures and protecting against split-brain scenarios, in which multiple nodes simultaneously believe they are the primary. This is especially important to get right when using repmgr.

This blog demonstrates a production-ready PostgreSQL HA solution using repmgr across a three-node cluster, showcasing automated VIP management without the need for additional tools like keepalived. Through real-world testing scenarios, we’ll explore how repmgr’s event notification mechanism can intelligently handle VIP assignment during failovers and, more importantly, how it detects and mitigates split-brain conditions by removing the VIP from all nodes until manual intervention resolves the conflict.

We’ll walk through two critical scenarios, with detailed logs and validation steps that demonstrate the cluster’s resilience and safety mechanisms.

This blog presents hands-on testing results for a production-ready PostgreSQL high-availability cluster using repmgr across three nodes. The servers are:

  • Node1 – Current primary
  • Node2 – Standby
  • Node3 – Witness
  • VIP – 192.168.121.100

To set up a three-node repmgr cluster, check out our earlier blog: Building PostgreSQL HA Systems with repmgr

Add this parameter to repmgr.conf:

event_notification_command = '/var/lib/pgsql/vip_scripts.sh >> /var/log/repmgr/repmgr.log'

Note: The repmgr.conf settings and the script must be identical across all nodes.
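The VIP script itself isn’t shown in this post, but a minimal skeleton of such an event handler could look like the sketch below. Everything here is an assumption rather than the actual script: the event name would arrive as `$1` only if the configuration passes repmgr’s `%e` placeholder (e.g. `vip_scripts.sh %e`), and the VIP and interface values come from this lab setup.

```shell
#!/bin/bash
# Hypothetical sketch of /var/lib/pgsql/vip_scripts.sh -- names, events
# handled, and privileges (e.g. sudo rules for ip/arping) are assumptions.
VIP="192.168.121.100"
IFACE="eth0"

# Decide what to do with the VIP based on the repmgr event name ($1).
vip_action_for_event() {
    case "$1" in
        standby_promote|repmgrd_failover_promote)
            echo "assign" ;;   # this node has just become the primary
        standby_follow|node_rejoin)
            echo "remove" ;;   # this node is (back to being) a standby
        *)
            echo "ignore" ;;   # events that don't affect VIP placement
    esac
}

# Assign the VIP and announce it via gratuitous ARP so clients switch over.
assign_vip() {
    sudo ip addr add "${VIP}/24" dev "$IFACE"
    sudo arping -c 3 -U -I "$IFACE" "$VIP"
}

# Drop the VIP from this node.
remove_vip() {
    sudo ip addr del "${VIP}/24" dev "$IFACE"
}
```

The real script in this post goes further (multi-check primary verification, remote VIP removal, split-brain detection), but the event-to-action dispatch above is the core shape of any `event_notification_command` handler.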

Testing Scenarios:

Let’s look at these two testing scenarios in this blog.

  1. Primary Failover – VIP automatically shifts from Primary to Standby.
  2. Split-Brain Protection – VIP removal from both Primary and Standby nodes during network partition.

Scenario 1 – VIP automatically shifts from Primary to Standby when the Primary fails

The expectation is that on a planned or unplanned failure of the primary, the standby will be promoted to primary. In that case, the VIP should be removed from the old primary and assigned to the new primary. Let’s look at the before-failover and after-failover outcomes.

Before failover

Let’s check the cluster and VIP on the current primary node i.e., node1

#Checking the cluster show of repmgr on node1
[postgres@node1 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 50       | 3        | host=192.168.121.194 user=postgres dbname=osdb port=5432
 2  | node2   | standby |   running | node1   | default  | 50       | 3        | host=192.168.121.82 user=postgres dbname=osdb port=5432
 3  | node3  | witness | * running | node1   | default  | 0        | n/a      | host=192.168.121.172 user=postgres dbname=osdb port=5432

#Checking the IP assigned to the server
[postgres@node1 ~]$ hostname -I
192.168.121.194 192.168.121.100

Let’s check the cluster and VIP on the standby node i.e., node2

#Checking the cluster show of repmgr on node2
[postgres@node2 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 50       | 3        | host=192.168.121.194 user=postgres dbname=osdb port=5432
 2  | node2   | standby |   running | node1   | default  | 50       | 3        | host=192.168.121.82 user=postgres dbname=osdb port=5432
 3  | node3  | witness | * running | node1   | default  | 0        | n/a      | host=192.168.121.172 user=postgres dbname=osdb port=5432

#Checking the IP assigned to the server
[postgres@node2 ~]$ hostname -I
192.168.121.82

Observations:

As per the output gathered on both node1 and node2, here are the observations:

  • node1 is primary and node2 is standby
  • node2 and node3 have node1 as their upstream
  • VIP 192.168.121.100 is assigned to node1
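As an extra validation step before triggering the failover, it’s worth confirming that clients connecting through the VIP land on a writable primary. The sketch below is illustrative (the `check_vip_is_primary` helper is not part of the post’s script; connection parameters are from this lab):

```shell
#!/bin/bash
# Sketch: verify that connections through the VIP reach the primary.
# User/db/port values are taken from this post's lab setup.
check_vip_is_primary() {
    local vip="$1" in_recovery
    # pg_is_in_recovery() returns f on a primary, t on a standby
    in_recovery=$(psql -h "$vip" -p 5432 -U postgres -d osdb \
                       -Atc 'SELECT pg_is_in_recovery();') || return 2
    [ "$in_recovery" = "f" ]
}

# Example: check_vip_is_primary 192.168.121.100 && echo "VIP -> primary"
```

The same one-liner is useful after failover: the host behind the VIP changes, but the check should still return success.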

After failover 

Let’s stop the PostgreSQL services and check the cluster show & VIP in node1

#Stopping the PostgreSQL services
[root@node1 ~]# systemctl stop postgresql-15
[root@node1 ~]# su - postgres
Last login: Wed Jan 28 09:13:41 UTC 2026 on pts/0

#Checking the cluster show in node1
[postgres@node1 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
ERROR: connection to database failed
DETAIL:
connection to server at "192.168.121.194", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
DETAIL: attempted to connect using:
  user=postgres dbname=osdb host=192.168.121.194 port=5432 connect_timeout=2 fallback_application_name=repmgr options=-csearch_path=

#Checking the IPs assigned to the server
[postgres@node1 ~]$ hostname -I
192.168.121.194

Let’s check the cluster show and VIP in node2

#Checking the cluster show of repmgr on node2
[postgres@node2 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 50       |          | host=192.168.121.194 user=postgres dbname=osdb port=5432
 2  | node2   | primary | * running |          | default  | 50       | 4        | host=192.168.121.82 user=postgres dbname=osdb port=5432
 3  | node3  | witness | * running | ? node1 | default  | 0        | n/a      | host=192.168.121.172 user=postgres dbname=osdb port=5432
WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - unable to connect to node "node3" (ID: 3)'s upstream node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages

#Checking the IPs assigned to the server
[postgres@node2 ~]$ hostname -I
192.168.121.82 192.168.121.100

Observations:

As per the output gathered on both node1 and node2, here are the observations:

  • node1 is in failed state.
  • node2 is promoted as primary.
  • VIP is removed from node1.
  • VIP is assigned to node2.

Analyzing the script logs:

[2026-01-28 07:39:16] =========================================
[2026-01-28 07:39:16] Event triggered: manual
[2026-01-28 07:39:16] Local IP: 192.168.121.82
[2026-01-28 07:39:16] Checking for split-brain scenario...
[2026-01-28 07:39:16] Local node (192.168.121.82): NOT in recovery
[2026-01-28 07:39:17] ✓ Check 1 passed: pg_is_in_recovery() = false
[2026-01-28 07:39:20] ✗ Check 2 failed: repmgr role =
[2026-01-28 07:39:20] ✓ Check 3 passed: Cluster primary =  (local: 192.168.121.82)
[2026-01-28 07:39:20] Primary verification: 2/3 checks passed - NODE IS PRIMARY
[2026-01-28 07:39:20] Decision: This node IS PRIMARY - ensuring VIP is assigned
[2026-01-28 07:39:20] Adding VIP 192.168.121.100 to eth0
[2026-01-28 07:39:20] ✓ VIP successfully added
[2026-01-28 07:39:20] Updating ARP tables for 192.168.121.100
[2026-01-28 07:39:24] ✓ ARP tables updated (gratuitous ARP sent)
[2026-01-28 07:39:24] Attempting to remove VIP from remote node 192.168.121.194
[2026-01-28 07:39:25] ✓ VIP removed from remote node 192.168.121.194
[2026-01-28 07:39:25] VIP management completed
[2026-01-28 07:39:25] =========================================
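Notice the 2-out-of-3 majority vote in the log: check 2 failed (the repmgr metadata had not yet caught up right after promotion), yet the node was still correctly identified as primary. A minimal sketch of that decision logic, where the `check_*` helpers are hypothetical stand-ins for the real queries, might look like:

```shell
#!/bin/bash
# Hedged reconstruction of the "2/3 checks passed" logic seen in the log.
# Each check_* helper is assumed to echo "pass" or "fail":
#   check_recovery        -> pg_is_in_recovery() = false?
#   check_repmgr_role     -> repmgr metadata says this node is primary?
#   check_cluster_primary -> cluster-wide view agrees on this node?
verify_primary() {
    local passed=0
    [ "$(check_recovery)" = "pass" ]        && passed=$((passed + 1))
    [ "$(check_repmgr_role)" = "pass" ]     && passed=$((passed + 1))
    [ "$(check_cluster_primary)" = "pass" ] && passed=$((passed + 1))
    # Majority rule: tolerate one failed check, e.g. repmgr metadata
    # lagging briefly after a promotion, as in the log above.
    [ "$passed" -ge 2 ]
}
```

Requiring a majority rather than unanimity is what lets the VIP move immediately after promotion instead of waiting for every metadata source to converge.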

Scenario 2 – VIP removal from both Primary and Standby nodes during split-brain detection

Currently, the VIP is assigned to the new primary, node2. If the old primary comes back online, we will see two primaries in the cluster show output, as shown below:

[postgres@node1 ~]$ /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+----------------------+----------+----------+----------+----------+-----------------------------------------------------------
 1  | node1 | primary | * running            |          | default  | 50       | 3        | host=192.168.121.194 user=postgres dbname=osdb port=5432
 2  | node2   | standby | ! running as primary |          | default  | 50       | 4        | host=192.168.121.82 user=postgres dbname=osdb port=5432
 3  | node3  | witness | * running            | node1   | default  | 0        | n/a      | host=192.168.121.172 user=postgres dbname=osdb port=5432
WARNING: following issues were detected
  - node "node2" (ID: 2) is registered as standby but running as primary

In this case, we have to be careful to avoid the split-brain scenario, which is dangerous for transactional data. So the script removes the VIP from both nodes.
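The core of the detection is simple: ask both nodes whether they are in recovery, and if neither is, two primaries exist. A hedged sketch, where `is_in_recovery` is a hypothetical helper (e.g. wrapping `psql -h <host> -Atc 'SELECT pg_is_in_recovery();'`):

```shell
#!/bin/bash
# Sketch of the split-brain test implied by the logs below.
# is_in_recovery <host> is assumed to echo "t" (standby) or "f" (primary).
split_brain_detected() {
    local local_state remote_state
    local_state=$(is_in_recovery "$1")
    remote_state=$(is_in_recovery "$2")
    # Both nodes accepting writes => two primaries => split brain.
    [ "$local_state" = "f" ] && [ "$remote_state" = "f" ]
}

# On detection, the safe reaction is the one shown in the logs:
# drop the VIP everywhere and refuse to reassign it automatically.
```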

Let’s check whether the VIP is assigned to either of the two nodes.

#Checking the IPs in node1
[postgres@node1 ~]$ hostname -I
192.168.121.194

#Checking the IPs in node2
[postgres@node2 ~]$ hostname -I
192.168.121.82

Observation: the VIP is not assigned to either node.

Analyzing the script logs:

[2026-01-28 07:40:02] =========================================
[2026-01-28 07:40:02] Event triggered: manual
[2026-01-28 07:40:02] Local IP: 192.168.121.194
[2026-01-28 07:40:02] Checking for split-brain scenario...
[2026-01-28 07:40:02] Local node (192.168.121.194): NOT in recovery
[2026-01-28 07:40:03] Remote node (192.168.121.82): NOT in recovery
[2026-01-28 07:40:03] 🚨 SPLIT-BRAIN DETECTED! Both nodes think they are primary!
[2026-01-28 07:40:03] Removing VIP from BOTH nodes as safety measure...
[2026-01-28 07:40:03] Removing VIP 192.168.121.100 from eth0
[2026-01-28 07:40:03] VIP not assigned to this node
[2026-01-28 07:40:03] VIP removed from remote node 192.168.121.82
[2026-01-28 07:40:03] ⚠ MANUAL INTERVENTION REQUIRED - Resolve split-brain before restoring VIP
[2026-01-28 07:40:03] Split-brain detected - VIP management aborted
[2026-01-28 07:40:03] =========================================

Once the VIP is removed from both nodes, manual intervention is required to reconfigure the cluster and ensure all nodes are in sync.
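One common resolution path is to rejoin the old primary as a standby of the new one. The commands below are a hedged sketch using standard repmgr tooling (exact flags depend on your setup and on whether the timelines have diverged); they are wrapped in a function here purely for illustration and would be run on node1:

```shell
#!/bin/bash
# Sketch: demote the old primary (node1) back to a standby of node2.
rejoin_old_primary() {
    # 1. PostgreSQL must be stopped on the old primary before rejoining
    sudo systemctl stop postgresql-15

    # 2. Rejoin node1 as a standby of the new primary (node2);
    #    --force-rewind runs pg_rewind if the timelines have diverged
    /usr/pgsql-15/bin/repmgr -f /var/lib/pgsql/repmgr.conf node rejoin \
        -d 'host=192.168.121.82 user=postgres dbname=osdb port=5432' \
        --force-rewind --verbose

    # 3. Once "cluster show" reports a single primary again, restore the
    #    VIP on node2 (manually, or via the event script on the next event)
}
```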

Important Considerations

This approach requires careful script maintenance, thorough testing, and a deep understanding of cluster states. The script must handle edge cases, network partitions, and various failure modes—responsibilities that mature tools like keepalived are specifically designed to manage. When implementing custom VIP management scripts:

  • Testing is crucial: Every failover scenario must be thoroughly tested in non-production environments.
  • Monitoring is essential: Script execution logs must be actively monitored to catch issues early.
  • Manual intervention readiness: Teams must be prepared to intervene when split-brain or other complex scenarios occur.
  • Regular validation: The script logic should be reviewed and updated as cluster configurations evolve.

Check out the script on GitHub: Click here

Conclusion

The testing results demonstrate that repmgr, combined with intelligent VIP management scripts, provides a robust and reliable solution for PostgreSQL high availability. By leveraging repmgr’s event notification parameters and implementing comprehensive split-brain detection logic, we’ve achieved:

  • Seamless automatic failover with VIP transfer from the failed primary to the promoted standby in seconds.
  • Proactive split-brain protection that automatically removes the VIP from all nodes when conflicting primaries are detected.
  • Production-grade reliability without the complexity of additional clustering software like keepalived.

The script’s multi-check verification approach—validating recovery status, repmgr roles, and cluster state—ensures accurate primary identification while the split-brain detection mechanism safeguards against data divergence by forcing manual intervention when necessary. This solution strikes an optimal balance between automation and safety, making it an excellent choice for production PostgreSQL deployments where both uptime and data consistency are non-negotiable.

For organizations seeking a lightweight yet powerful HA solution, this repmgr-based approach provides enterprise-grade failover capabilities with the added benefit of simplified architecture and reduced operational overhead.

See this in action at PGConf India 2026 – Inside PostgreSQL High Availability: Quorum, Split-Brain, and Failover at Scale, presented by me, Venkat Akhil. See you there!
