Introduction
Schema migrations in PostgreSQL are notoriously hard to get right, especially when your systems require zero downtime. Anyone who’s juggled production changes knows the pain: DDL locks, application crashes, and convoluted rollback plans.
This blog dives into our real-world proof of concept using pgroll, a promising open-source migration tool built by the team at Xata, which offers a new approach to schema migrations: one that is client-driven, GitOps-friendly, and focused on zero downtime.
The primary goal of this PoC was clear: to demonstrate practically how pgroll facilitates zero-downtime migrations. We aimed to validate its core features, namely dual-schema versioning, the expand/contract pattern, safe rollbacks, and non-blocking DDL changes, within a real-world scenario. For this, we simulated an evolving user management system, a perfect use case where schema changes must be applied safely while applications continuously read and write data.
Our Objectives for the PoC
When the client approached us, they had specific concerns and requirements. Our objectives for the pgroll PoC directly addressed these:
| Objective | Description |
| --- | --- |
| Validate Zero-Downtime Migrations | Confirm that schema changes can be applied without locking tables or disrupting application traffic, which was a top priority for them. |
| Demonstrate Expand/Contract Pattern | Show how pgroll supports multiple schema versions during the transition phase, ensuring no application downtime. |
| Test Backfilling with Large Datasets | Verify pgroll’s efficient handling of data migration for existing rows, especially for their large user base. |
| Validate Safe Rollback Mechanism | Demonstrate robust rollback capability without data loss or downtime, a critical safety net. |
| Evaluate Developer Experience | Assess ease of use, debugging, and seamless integration into their existing CI/CD workflows. |
Environment Setup: Getting Started
Setting up the environment for the PoC was straightforward.
Prerequisites
We ensured the following were in place:
- PostgreSQL 14+ running locally on port 5432.
- pgroll CLI installed and available in $PATH.
- Basic understanding of YAML-based migration definitions.
- Familiarity with tools like psql, pgAdmin (optional), or any PostgreSQL client.
Installation
Installation of pgroll was quick:
go install github.com/xataio/pgroll@latest
export PATH=$PATH:$HOME/go/bin
pgroll --version
Initializing pgroll
Next, we initialized pgroll within our test database:
pgroll init
The expected output confirmed successful initialization:
Successfully initialized pgroll in database "postgres"
This step created pgroll’s internal schema, including the migrations table to track history, helper functions, triggers for out-of-band DDL capture, and internal state management for consistency across deployments. This initial setup gave us confidence in the tool’s foundational robustness.
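As a quick sanity check, a catalog query like the one below lists everything pgroll manages. This assumes the default name of pgroll for its internal state schema; the versioned public_* schemas only appear later, once migrations are started:

-- List pgroll-related schemas: the internal state schema plus any
-- versioned schemas created for migrations
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = 'pgroll'
   OR schema_name LIKE 'public_%';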
Migration 1: Creating the Users Table
Our first migration was simple: creating a users table, mimicking the client’s core data.
Migration File: 01_create_users_table.yaml
operations:
  - create_table:
      name: users
      columns:
        - name: id
          type: serial
          pk: true
        - name: name
          type: varchar(255)
          unique: true
        - name: description
          type: text
          nullable: true
Applying the Migration
We applied this migration with the --complete flag, as no old clients existed yet, making it safe for immediate finalization.
pgroll start 01_create_users_table.yaml --complete
Verifying Table Creation and Initial Data
To simulate a real-world scenario with existing data, we inserted 100,000 rows into the new users table:
INSERT INTO users (name, description)
SELECT
'user_' || suffix,
CASE WHEN random() < 0.5 THEN 'description for user_' || suffix ELSE NULL END
FROM generate_series(1, 100000) AS suffix;
This successfully inserted a significant amount of data, with approximately 50% having NULL descriptions – a perfect setup for our next, more challenging migration.
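A quick count (plain SQL, nothing pgroll-specific) is enough to confirm the rough 50/50 split before moving on:

-- Sanity check: roughly half the rows should have a NULL description
SELECT count(*) FILTER (WHERE description IS NULL)     AS null_descriptions,
       count(*) FILTER (WHERE description IS NOT NULL) AS filled_descriptions
FROM users;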
Migration 2: Making the Description Column NOT NULL
This was where pgroll truly shined. The client required the enforcement of a NOT NULL constraint on the description column. The challenge: existing NULL values needed to be populated, ongoing writes might still insert NULLs, and, critically, we couldn’t lock the table during the update.
Solution with pgroll
pgroll’s approach to this problem was elegant and effective.
Migration File: 02_user_description_set_nullable.yaml
operations:
- alter_column:
table: users
column: description
nullable: false
up: "SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END"
down: description
The up clause defined how to transform old data (where description could be NULL) into the new NOT NULL format. The down clause, for rollback, would restore the original value. This explicit definition was highly valuable for ensuring data integrity.
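Because the up clause is just a SQL expression, it can be previewed against the existing data before the migration is ever started. The query below is purely illustrative and not something pgroll requires:

-- Preview the up transformation against rows that currently hold NULLs
SELECT id,
       name,
       CASE WHEN description IS NULL
            THEN 'description for ' || name
            ELSE description
       END AS migrated_description
FROM users
WHERE description IS NULL
LIMIT 5;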
Starting the Migration
pgroll start 02_user_description_set_nullable.yaml
As we watched the updates progress, pgroll backfilled the 100,000 existing rows in batches. Crucially, there were no table locks, and the system remained fully online. This was a significant win for the client’s zero-downtime mandate.
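One simple way to convince yourself that nothing is being blocked is to watch for waiting sessions while the backfill runs. This is a generic lock-monitoring query, not a pgroll feature:

-- Sessions currently waiting on a lock; during the backfill this should not
-- show application queries stuck behind the migration
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';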
Under the Hood: What pgroll Does During Migration
To fully appreciate pgroll’s capabilities, we delved into its internal mechanisms, which are designed to support the zero-downtime philosophy.
A. Adding Temporary Columns
When we initiated Migration 2, pgroll didn’t immediately modify the description column. Instead, it added a temporary column and a tracking flag:
-- New column with NOT NULL constraint
_pgroll_new_description TEXT CHECK (_pgroll_new_description IS NOT NULL) NOT VALID

-- Tracking flag
_pgroll_needs_backfill BOOLEAN DEFAULT TRUE
The use of NOT VALID was key here; it avoids a full-table scan, allowing for a very fast DDL operation, which is essential for maintaining performance.
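The snippet below sketches the idea with plain DDL. The constraint name is invented purely for illustration; pgroll creates and manages the actual objects itself:

-- Illustrative only: adding a CHECK constraint as NOT VALID is a fast,
-- metadata-only change that skips the full-table scan
ALTER TABLE users
  ADD CONSTRAINT users_description_not_null
  CHECK (_pgroll_new_description IS NOT NULL) NOT VALID;

-- Validation can then run later, scanning the table under a much weaker lock
ALTER TABLE users VALIDATE CONSTRAINT users_description_not_null;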
B. Applying the Expand Phase (Dual Schema)
During the migration, pgroll effectively maintains two coexisting schemas:
- public_01_create_users_table → representing the old view of the table.
- public_02_user_description_set_nullable → representing the new view of the table.
This means applications can continue to query either the “old” or “new” schema based on their search_path:
Old View:
SELECT id, name, description FROM users;
New View:
SELECT id, name, _pgroll_new_description AS description FROM users;
Applications can choose their schema by setting their search_path, enabling a graceful transition without forcing all services to update simultaneously.
C. Data Synchronization with Triggers
To ensure data consistency between the old and new schema views, pgroll automatically creates two triggers:
- _pgroll_trigger_users__pgroll_new_description: This trigger copies data from the description column to _pgroll_new_description using the up logic we defined.
- _pgroll_trigger_users_description: This trigger handles the reverse flow, crucial for rollbacks.
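Both triggers show up in the catalog; a query like this lists them (the exact names may differ between pgroll versions):

-- List the non-internal triggers on the users table
SELECT tgname
FROM pg_trigger
WHERE tgrelid = 'public.users'::regclass
  AND NOT tgisinternal;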
Testing Dual-Write Behavior
We tested this by simulating concurrent writes using both schema versions:
-- Use old schema
SET search_path = public_01_create_users_table;
INSERT INTO users(name, description) VALUES ('Alice', 'this is Alice'), ('Bob', NULL);

-- Switch to new schema
SET search_path = public_02_user_description_set_nullable;
SELECT * FROM users WHERE name IN ('Alice', 'Bob');
The result was impressive:
| id | name | description |
| --- | --- | --- |
| … | Alice | this is Alice |
| … | Bob | description for Bob |
The trigger automatically filled the NULL value for ‘Bob’ using our up expression, confirming that even during the dual-schema phase, data integrity was maintained without manual intervention.
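For the curious, the underlying physical table stays in the original public schema, with the versioned schemas exposing views over it, so both columns can be inspected side by side. The column names below are the ones from our run:

-- Look at the physical table directly: old and shadow columns together
SET search_path = public;

SELECT name, description, _pgroll_new_description
FROM users
WHERE name IN ('Alice', 'Bob');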
Completing the Migration
Once all applications were safely migrated to the new schema, the final step was to complete the migration:
pgroll complete
The result was a clean, final schema:
- The old schema (public_01_create_users_table) was removed.
- _pgroll_new_description was seamlessly renamed to description.
- _pgroll_needs_backfill and all temporary triggers were dropped.
Final Schema:
Running \d users in psql now showed description as text with a NOT NULL constraint.
All indexes and constraints were preserved, and critically, there was no downtime at any stage. This “contract” phase truly solidified the zero-downtime promise.
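The same result can be confirmed from any client with a standard catalog query rather than a psql meta-command:

-- Confirm the finalized column definitions
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'users'
ORDER BY ordinal_position;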
Rollback Test: Simulating a Bug
One of the biggest concerns was the ability to rollback safely if a migration introduced an issue. We intentionally simulated a bug to test this.
Migration 3: Adding a Typoed Column
We created a migration file (03_add_is_active_column.yaml) with a deliberate typo (is_atcive instead of is_active):
operations:
  - add_column:
      table: users
      column:
        name: is_atcive  # Intentional typo!
        type: boolean
        nullable: true
        default: "true"
We then started this migration:
pgroll start 03_add_is_active_column.yaml
As expected, a new schema (public_03_add_is_active_column) and a new shadow column (_pgroll_new_is_atcive) were created.
Performing the Rollback
If this were a real production scenario and we discovered the typo, we would immediately roll back:
pgroll rollback
The result was immediate and flawless:
- The faulty schema public_03_add_is_active_column was removed.
- The shadow column _pgroll_new_is_atcive was dropped.
- The system instantly reverted to the previous stable state.
This validated pgroll’s instant and safe rollback capability, eliminating the need for complex, manual undo scripts and significantly reducing the fear of broken deployments.
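The same catalog checks used earlier make the cleanup easy to verify; both of the following illustrative queries should return zero rows after the rollback:

-- The versioned schema for the faulty migration should be gone ...
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name = 'public_03_add_is_active_column';

-- ... and so should the shadow column on the physical table
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'users'
  AND column_name = '_pgroll_new_is_atcive';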
Client Application Integration
A crucial aspect for the client was understanding how their applications would integrate with pgroll’s dual-schema approach. The key is for applications to dynamically set their search_path to access the correct schema version.
Example: Go Application Integration
We provided a Go application example to illustrate this:
import (
    "database/sql"
    "fmt"
    "log"
    "os"

    "github.com/lib/pq" // registers the "postgres" driver; provides pq.QuoteIdentifier
)

db, err := sql.Open("postgres", "postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable")
if err != nil {
    log.Fatal(err)
}
searchPath := os.Getenv("DB_SCHEMA_VERSION") // e.g., "public_02_user_description_set_nullable"
_, err = db.Exec(fmt.Sprintf("SET search_path = %s", pq.QuoteIdentifier(searchPath)))
if err != nil {
    log.Fatalf("Failed to set search path: %v", err)
}
This demonstrated how the search_path could be externalized (e.g., via environment variables), making it easy to manage during deployments.
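Once a service is up, a quick smoke test can confirm which schema version the session actually resolves against:

-- Verify the effective search_path and the schemas it resolves to
SHOW search_path;
SELECT current_schemas(true);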
Deployment Strategy
This mechanism naturally lends itself to blue/green or canary deployment strategies:
| Service Version | search_path |
| --- | --- |
| v1 (Legacy) | public_01_create_users_table |
| v2 (Updated) | public_02_user_description_set_nullable |
This allows for gradual rollouts: deploy the new application version pointing to the new schema, observe health, and then decommission the old version, all while maintaining continuous service.
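One low-friction way to manage this, sketched below with hypothetical role names, is to pin the search_path at the database-role level so each service version picks up the right default at connection time:

-- Hypothetical per-service roles, each pinned to its schema version
ALTER ROLE app_v1 SET search_path = public_01_create_users_table;
ALTER ROLE app_v2 SET search_path = public_02_user_description_set_nullable;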
Key Observations & Benefits from the PoC
Our PoC highlighted several significant benefits for the client:
| Feature | Benefit |
| --- | --- |
| Two-Phase Migrations (start / complete) | Enables gradual cutover without downtime, a core client requirement. |
| Expand/Contract Pattern | Old and new schemas coexist, ensuring safe transitions for all applications. |
| Batched Backfilling | Handles large datasets efficiently without locking, crucial for performance. |
| Automatic Triggers | Keeps old/new columns in sync during migration, reducing manual effort. |
| Instant Rollbacks | Eliminates fear of broken deployments, providing a vital safety net. |
| No Application Pauses | Writes accepted throughout migration, maintaining continuous service. |
| NOT VALID Constraints | Avoids expensive scans when adding constraints, optimizing DDL performance. |
Limitations & Considerations
While pgroll is incredibly powerful, our PoC also surfaced a few considerations that the client should be aware of:
| Limitation | Mitigation |
| --- | --- |
| Requires schema-aware clients | Enforce search_path via connection pool configurations or application-level middleware to standardize access. |
| Adds temporary columns/triggers | Monitor disk space and performance impact on exceptionally large tables, though for most cases, the impact is minimal and temporary. |
| Not all DDL operations supported yet | Always check the pgroll documentation for current compatibility. For unsupported operations, a brief downtime might still be necessary, or a different approach might be needed. |
| Learning curve for teams | Plan for internal documentation, training sessions, and pair programming to help engineering teams adopt best practices. |
Conclusion: A Strong Candidate for Production
This Proof of Concept successfully demonstrates that pgroll delivers on the promise of zero-downtime schema migrations for PostgreSQL. It genuinely enables teams to:
- Make backward-incompatible changes safely.
- Eliminate maintenance windows, a huge win for business continuity.
- Significantly reduce the risk of deployment failures.
- Support modern blue/green and canary deployment strategies with ease.
With proper orchestration of the search_path in client applications, pgroll fits seamlessly into modern deployment pipelines and microservices architectures. We are confident that this tool can be a game-changer for the client’s database evolution.
Next Steps
Based on the success of this PoC, we’ve recommended the following next steps to the client:
- Integrate with CI/CD: Automate pgroll start, health checks, and complete steps within their existing continuous integration/continuous deployment pipelines.
- Monitor Migration Progress: Build dashboards and alerts to actively monitor ongoing migrations in real-time.
- Adopt in Staging: Conduct long-running tests with realistic traffic patterns in a staging environment to catch any unforeseen issues.
- Train Engineering Teams: Develop internal best practices and provide thorough training on pgroll usage and rollback procedures.
- Evaluate Production Readiness: A comprehensive assessment of monitoring, alerting, and backup interactions within their production environment.
Appendix A: Commands Reference
| Command | Purpose |
| --- | --- |
| pgroll init | Initialize pgroll metadata |
| pgroll start <migration.yaml> | Begin migration (expand phase) |
| pgroll complete | Finalize migration (contract phase) |
| pgroll rollback | Revert last migration |
| \dn | List schemas (check versioned schemas) |
| \d+ <schema>.<table> | Inspect the view definition |
Appendix B: Sample Migration Flow
# Initialize pgroll for the first time
pgroll init
# First migration (complete immediately as no old clients exist)
pgroll start 01_create_users_table.yaml --complete
# Second migration (a two-phase migration)
pgroll start 02_user_description_set_nullable.yaml
# … At this point, new applications can start using the new schema.
# … Old applications continue to use the old schema.
# … We would monitor traffic and application health here.
pgroll complete # Once all applications are updated and stable, finalize the migration.
# Third migration (demonstrating a rollback scenario)
pgroll start 03_add_is_active_column.yaml
# … Oh no, a bug or typo detected!
pgroll rollback # Instantly revert to the previous stable state.
Tool Version: pgroll v0.5.0
PostgreSQL: 16
Recommendation: Based on the successful outcomes of this PoC, we confidently recommend that the client proceed with further testing in a staging environment. pgroll has proven itself a strong candidate for enabling genuine zero-downtime migrations in their production PostgreSQL environments.
