pgroll in Action: Client-Side Evaluation of Zero-Downtime Schema Migrations

Introduction

Schema migrations in PostgreSQL are notoriously hard to get right, especially when your systems require zero downtime. Anyone who’s juggled production changes knows the pain: DDL locks, application crashes, and convoluted rollback plans.

This blog post dives into our real-world proof of concept with pgroll, a promising open-source migration tool built by the team at Xata. It offers a new approach to schema migrations: one that is client-driven, GitOps-friendly, and focused on zero downtime.

The primary goal of this PoC was clear: to demonstrate practically how pgroll facilitates zero-downtime migrations. We aimed to validate its core features within a real-world scenario: dual-schema versioning, the expand/contract pattern, safe rollbacks, and non-blocking DDL changes. For this, we simulated an evolving user management system, a perfect use case where schema changes must be applied safely while applications continuously read and write data.

Our Objectives for the PoC

When the client approached us, they had specific concerns and requirements. Our objectives for the pgroll PoC directly addressed these:

  • Validate Zero-Downtime Migrations: Confirm that schema changes can be applied without locking tables or disrupting application traffic, which was a top priority for them.
  • Demonstrate Expand/Contract Pattern: Show how pgroll supports multiple schema versions during the transition phase, ensuring no application downtime.
  • Test Backfilling with Large Datasets: Verify pgroll’s efficient handling of data migration for existing rows, especially for their large user base.
  • Validate Safe Rollback Mechanism: Demonstrate robust rollback capability without data loss or downtime, a critical safety net.
  • Evaluate Developer Experience: Assess ease of use, debugging, and seamless integration into their existing CI/CD workflows.

Environment Setup: Getting Started

Setting up the environment for the PoC was straightforward.

Prerequisites

We ensured the following were in place:

  • PostgreSQL 12+ running locally on port 5432.
  • pgroll CLI installed and available in $PATH.
  • Basic understanding of YAML-based migration definitions.
  • Familiarity with tools like psql, pgAdmin (optional), or any PostgreSQL client.

Installation

Installation of pgroll was quick:

go install github.com/xataio/pgroll@latest

export PATH=$PATH:$HOME/go/bin

pgroll --version

Initializing pgroll

Next, we initialized pgroll within our test database:

pgroll init

The expected output confirmed successful initialization:

Successfully initialized pgroll in database "postgres"

This step created pgroll’s internal schema, including the migrations table to track history, helper functions, triggers for out-of-band DDL capture, and internal state management for consistency across deployments. This initial setup gave us confidence in the tool’s foundational robustness.
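
For the curious, the result of pgroll init can be inspected directly from psql; in our setup the internal state lives in a schema named pgroll (exact object names may vary between pgroll versions, so treat this as a sketch):

\dn                              -- the internal "pgroll" schema should now be listed

SELECT * FROM pgroll.migrations; -- migration history, one row per applied migration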

Migration 1: Creating the Users Table

Our first migration was simple: creating a users table, mimicking the client’s core data.

Migration File: 01_create_users_table.yaml

operations:
  - create_table:
      name: users
      columns:
        - name: id
          type: serial
          pk: true
        - name: name
          type: varchar(255)
          unique: true
        - name: description
          type: text
          nullable: true

Applying the Migration

We applied this migration with the --complete flag, as no old clients existed yet, making it safe for immediate finalization.

pgroll start 01_create_users_table.yaml --complete

Verifying Table Creation and Initial Data

To simulate a real-world scenario with existing data, we inserted 100,000 rows into the new users table:

INSERT INTO users (name, description)
SELECT
    'user_' || suffix,
    CASE WHEN random() < 0.5 THEN 'description for user_' || suffix ELSE NULL END
FROM generate_series(1, 100000) AS suffix;

This inserted a significant amount of data, with roughly 50% of rows having NULL descriptions, a perfect setup for our next, more challenging migration.
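
A quick aggregate confirmed the split before we moved on (plain SQL, nothing pgroll-specific):

SELECT count(*) AS total_rows,
       count(*) FILTER (WHERE description IS NULL) AS null_descriptions
FROM users;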

Migration 2: Making the Description Column NOT NULL

This was where pgroll truly shone. The client required enforcing a NOT NULL constraint on the description column. The challenge: existing NULL values needed to be populated, ongoing writes might still insert NULLs, and, critically, we couldn’t lock the table during the update.

Solution with pgroll

pgroll’s approach to this problem was elegant and effective.

Migration File: 02_user_description_set_nullable.yaml

operations:
  - alter_column:
      table: users
      column: description
      nullable: false
      up: "SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END"
      down: description

The up clause defined how to transform old data (where description could be NULL) into the new NOT NULL format. The down clause, for rollback, would restore the original value. This explicit definition was highly valuable for ensuring data integrity.
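
Before starting the migration, the up expression can be previewed against live rows with an ordinary query; a quick sketch of how we sanity-checked the transformation:

SELECT name,
       description,
       CASE WHEN description IS NULL
            THEN 'description for ' || name
            ELSE description
       END AS backfilled_description
FROM users
LIMIT 5;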

Starting the Migration

pgroll start 02_user_description_set_nullable.yaml

As we watched the progress of the updates, pgroll began backfilling the 100,000 rows in batches. Crucially, there were no table locks, and the system remained fully online. This was a significant win for the client’s zero-downtime mandate.
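
While the backfill ran, a plain catalog query was enough to confirm that the expand phase was active and that both versioned schemas existed side by side:

SELECT schema_name
FROM information_schema.schemata
WHERE schema_name LIKE 'public_0%'
ORDER BY schema_name;

-- Expected while the migration is in progress:
--   public_01_create_users_table
--   public_02_user_description_set_nullable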

Under the Hood: What pgroll Does During Migration

To fully appreciate pgroll’s capabilities, we delved into its internal mechanisms, which are designed to support the zero-downtime philosophy.

A. Adding Temporary Columns

When we initiated Migration 2, pgroll didn’t immediately modify the description column. Instead, it added a temporary column and a tracking flag:

-- New column with NOT NULL constraint

_pgroll_new_description TEXT CHECK (_pgroll_new_description IS NOT NULL) NOT VALID

-- Tracking flag

_pgroll_needs_backfill BOOLEAN DEFAULT TRUE

The use of NOT VALID was key here; it avoids a full-table scan, allowing for a very fast DDL operation, which is essential for maintaining performance.
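
For readers less familiar with this PostgreSQL feature, the hand-written equivalent of the pattern pgroll leans on looks roughly like the sketch below (the constraint name is ours, and pgroll’s actual generated DDL differs):

-- Adding the constraint without validating existing rows is fast and needs only a brief lock:
ALTER TABLE users
  ADD CONSTRAINT users_description_not_null
  CHECK (description IS NOT NULL) NOT VALID;

-- Validation can happen later; it scans the table but does not block concurrent reads and writes:
ALTER TABLE users
  VALIDATE CONSTRAINT users_description_not_null;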

B. Applying the Expand Phase (Dual Schema)

During the migration, pgroll effectively maintains two coexisting schemas:

  • public_01_create_users_table → representing the old view of the table.
  • public_02_user_description_set_nullable → representing the new view of the table.

This means applications can continue to query either the “old” or “new” schema based on their search_path:

Old View:

SELECT id, name, description FROM users;

New View:

SELECT id, name, _pgroll_new_description AS description FROM users;

Applications can choose their schema by setting their search_path, enabling a graceful transition without forcing all services to update simultaneously.

C. Data Synchronization with Triggers

To ensure data consistency between the old and new schema views, pgroll automatically creates two triggers:

  • _pgroll_trigger_users__pgroll_new_description: This trigger copies data from the description column to _pgroll_new_description using the up logic we defined.
  • _pgroll_trigger_users_description: This trigger handles the reverse flow, crucial for rollbacks.
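
We could confirm these triggers straight from the catalog with a standard pg_trigger query (nothing pgroll-specific here):

SELECT tgname
FROM pg_trigger
WHERE tgrelid = 'public.users'::regclass
  AND NOT tgisinternal;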

Testing Dual-Write Behavior

We tested this by simulating concurrent writes using both schema versions:

-- Use old schema

SET search_path = public_01_create_users_table;

INSERT INTO users(name, description) VALUES ('Alice', 'this is Alice'), ('Bob', NULL);

-- Switch to new schema

SET search_path = public_02_user_description_set_nullable;

SELECT * FROM users WHERE name IN ('Alice', 'Bob');

The result was impressive:

 id | name  | description
----+-------+----------------------
  … | Alice | this is Alice
  … | Bob   | description for Bob


The trigger automatically filled the NULL value for ‘Bob’ using our up expression, confirming that even during the dual-schema phase, data integrity was maintained without manual intervention.

Completing the Migration

Once all applications were safely migrated to the new schema, the final step was to complete the migration:

pgroll complete

The result was a clean, final schema:

  • The old schema (public_01_create_users_table) was removed.
  • _pgroll_new_description was seamlessly renamed to description.
  • _pgroll_needs_backfill and all temporary triggers were dropped.

Final Schema:

Running \d users in psql now showed description as text with a NOT NULL constraint.
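
For an application-level check that does not depend on psql meta-commands, the same fact is visible in information_schema:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'users'
ORDER BY ordinal_position;

-- description should now report is_nullable = 'NO'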

All indexes and constraints were preserved, and critically, there was no downtime at any stage. This “contract” phase truly solidified the zero-downtime promise.

Rollback Test: Simulating a Bug

One of the biggest concerns was the ability to rollback safely if a migration introduced an issue. We intentionally simulated a bug to test this.

Migration 3: Adding a Typoed Column

We created a migration file (03_add_is_active_column.yaml) with a deliberate typo (is_atcive instead of is_active):

operations:
  - add_column:
      table: users
      column:
        name: is_atcive # Intentional typo!
        type: boolean
        nullable: true
        default: "true"

We then started this migration:

pgroll start 03_add_is_active_column.yaml

As expected, a new schema (public_03_add_is_active_column) and a new shadow column (_pgroll_new_is_atcive) were created.

Performing the Rollback

If this were a real production scenario and we discovered the typo, we would immediately roll back:

pgroll rollback

The result was immediate and flawless:

  • The faulty schema public_03_add_is_active_column was removed.
  • The shadow column _pgroll_new_is_atcive was dropped.
  • The system instantly reverted to the previous stable state.

This validated pgroll’s instant and safe rollback capability, eliminating the need for complex, manual undo scripts and significantly reducing the fear of broken deployments.
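
A couple of plain information_schema checks confirmed that nothing was left behind after the rollback:

-- The versioned schema for the aborted migration is gone:
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = 'public_03_add_is_active_column';

-- The shadow column has been dropped from the physical table:
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'users'
  AND column_name LIKE '%is_atcive%';

-- Both queries return zero rows once the rollback has completed.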

Client Application Integration

A crucial aspect for the client was understanding how their applications would integrate with pgroll’s dual-schema approach. The key is for applications to dynamically set their search_path to access the correct schema version.

Example: Go Application Integration

We provided a Go application example to illustrate this:

// Requires the standard database/sql, fmt, log, and os packages, plus
// github.com/lib/pq for the "postgres" driver and pq.QuoteIdentifier.
db, err := sql.Open("postgres", "postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable")
if err != nil {
    log.Fatal(err)
}
// The schema version is injected at deploy time, e.g. "public_02_user_description_set_nullable".
searchPath := os.Getenv("DB_SCHEMA_VERSION")
_, err = db.Exec(fmt.Sprintf("SET search_path = %s", pq.QuoteIdentifier(searchPath)))
if err != nil {
    log.Fatalf("Failed to set search path: %v", err)
}

This demonstrated how the search_path could be externalized (e.g., via environment variables), making it easy to manage during deployments.

Deployment Strategy

This mechanism naturally lends itself to blue/green or canary deployment strategies:

  • v1 (Legacy) → search_path = public_01_create_users_table
  • v2 (Updated) → search_path = public_02_user_description_set_nullable


This allows for gradual rollouts: deploy the new application version pointing to the new schema, observe health, and then decommission the old version, all while maintaining continuous service.

Key Observations & Benefits from the PoC

Our PoC highlighted several significant benefits for the client:

  • Two-Phase Migrations (start / complete): Enables gradual cutover without downtime, a core client requirement.
  • Expand/Contract Pattern: Old and new schemas coexist, ensuring safe transitions for all applications.
  • Batched Backfilling: Handles large datasets efficiently without locking, crucial for performance.
  • Automatic Triggers: Keeps old and new columns in sync during migration, reducing manual effort.
  • Instant Rollbacks: Eliminates fear of broken deployments, providing a vital safety net.
  • No Application Pauses: Writes are accepted throughout the migration, maintaining continuous service.
  • NOT VALID Constraints: Avoids expensive scans when adding constraints, optimizing DDL performance.

Limitations & Considerations

While pgroll is incredibly powerful, our PoC also surfaced a few considerations that the client should be aware of:

  • Requires schema-aware clients: Enforce search_path via connection pool configuration or application-level middleware to standardize access.
  • Adds temporary columns/triggers: Monitor disk space and performance impact on exceptionally large tables, though in most cases the impact is minimal and temporary.
  • Not all DDL operations supported yet: Always check the pgroll documentation for current compatibility. For unsupported operations, a brief downtime might still be necessary, or a different approach might be needed.
  • Learning curve for teams: Plan for internal documentation, training sessions, and pair programming to help engineering teams adopt best practices.

Conclusion: A Strong Candidate for Production

This Proof of Concept successfully demonstrates that pgroll delivers on the promise of zero-downtime schema migrations for PostgreSQL. It genuinely enables teams to:

  • Make backward-incompatible changes safely.
  • Eliminate maintenance windows, a huge win for business continuity.
  • Significantly reduce the risk of deployment failures.
  • Support modern blue/green and canary deployment strategies with ease.

With proper orchestration of the search_path in client applications, pgroll fits seamlessly into modern deployment pipelines and microservices architectures. We are confident that this tool can be a game-changer for the client’s database evolution.

Next Steps

Based on the success of this PoC, we’ve recommended the following next steps to the client:

  • Integrate with CI/CD: Automate pgroll start, health checks, and complete steps within their existing continuous integration/continuous deployment pipelines.
  • Monitor Migration Progress: Build dashboards and alerts to actively monitor ongoing migrations in real-time.
  • Adopt in Staging: Conduct long-running tests with realistic traffic patterns in a staging environment to catch any unforeseen issues.
  • Train Engineering Teams: Develop internal best practices and provide thorough training on pgroll usage and rollback procedures.
  • Evaluate Production Readiness: A comprehensive assessment of monitoring, alerting, and backup interactions within their production environment.

Appendix A: Commands Reference

  • pgroll init: Initialize pgroll metadata
  • pgroll start <migration.yaml>: Begin a migration (expand phase)
  • pgroll complete: Finalize a migration (contract phase)
  • pgroll rollback: Revert the last migration
  • \dn: List schemas (check versioned schemas)
  • \d+ <schema>.<table>: Inspect the view definition


Appendix B: Sample Migration Flow

# Initialize pgroll for the first time

pgroll init

# First migration (complete immediately as no old clients exist)

pgroll start 01_create_users_table.yaml --complete

# Second migration (a two-phase migration)

pgroll start 02_user_description_set_nullable.yaml

# … At this point, new applications can start using the new schema.

# … Old applications continue to use the old schema.

# … We would monitor traffic and application health here.

pgroll complete # Once all applications are updated and stable, finalize the migration.

# Third migration (demonstrating a rollback scenario)

pgroll start 03_add_is_active_column.yaml

# … Oh no, a bug or typo detected!

pgroll rollback # Instantly revert to the previous stable state.

Tool Version: pgroll v0.5.0

PostgreSQL: 16

Recommendation: Based on the successful outcomes of this PoC, we confidently recommend that the client proceed with further testing in a staging environment. pgroll has proven itself a strong candidate for enabling genuine zero-downtime migrations in their production PostgreSQL environments.
