PG18 Hacktober: 31 Days of New Features : Internationalization: pg_unicode_fast Collation

Welcome back to the Day29 of the PG18Hacktober !!

pg_unicode_fast is a new built-in collation introduced in PostgreSQL 18, available exclusively for UTF-8 encoded databases. Unlike traditional ICU or OS locale-based collations which apply complex language-specific rules, pg_unicode_fast sorts text by Unicode code point values directly. It provides full Unicode semantics for case transformations, meaning functions like lower(), upper(), and initcap() correctly and fully map Unicode characters according to the Unicode standard.

For pattern matching operations such as LIKE and regular expressions, pg_unicode_fast uses the Standard variant of Unicode compatibility properties, ensuring efficient and compliant processing. Importantly, this collation offers stable and predictable results within the same PostgreSQL major version.

What is `pg_unicode_fast`?

pg_unicode_fast is a built-in collation, available if your database encoding is UTF-8. Unlike natural language collations, pg_unicode_fast:

Sorts text by Unicode code point value (not by language-dependent cultural rules).
Applies full Unicode case mapping for case transformations. This means functions like lower() and upper() handle all Unicode-defined cases correctly.
Leverages Unicode Compatibility Properties for pattern matching, making operations like LIKE and regex handling more compliant and efficient.
Delivers stable, predictable results within a PostgreSQL major version.

For most text-heavy workloads that need correctness without the overhead of language tailoring, pg_unicode_fast strikes a balance between speed and Unicode accuracy.

What does the PG18 documentation say?

This collation sorts by Unicode code point values rather than natural language order. For the functions lower, initcap, and upper, it uses Unicode full case mapping. For pattern matching (including regular expressions), it uses the Standard variant of Unicode Compatibility Properties. Behavior is efficient and stable within a Postgres major version. It is only available for encoding UTF8.

Verify pg_unicode_fast collation exists in your database

postgres=# SELECT collname, collprovider, collversion
FROM pg_collation
WHERE collname LIKE 'pg_unicode%';
    collname     | collprovider | collversion
-----------------+--------------+-------------
 pg_unicode_fast | b            | 1
(1 row)

This output confirms that only pg_unicode_fast is available among collations starting with 'pg_unicode%' in the PostgreSQL 18 instance.
The result:

collname: pg_unicode_fast is present.
collprovider: b means “builtin” (provided by PostgreSQL internally, not dependent on OS locale or ICU).
collversion: 1 is the version of this built-in collation.

Performance testing

Let’s perform a performance comparison between pg_unicode_fast and ICU Unicode collations in PostgreSQL, focusing on three key workloads:

Bulk ORDER BY (sorting)
lower() / upper() function calls
Pattern matching using LIKE

This will help us evaluate how the new pg_unicode_fast collation improves performance over standard ICU-based Unicode collations in real-world text processing scenarios.

#Created a table and inserted 10 million records
postgres=# CREATE TABLE collation_test (
  id serial PRIMARY KEY,
  content TEXT
);
CREATE TABLE
postgres=# INSERT INTO collation_test(content)
SELECT chr(65 + (random()*57)::int) ||
       chr(1040 + (random()*64)::int)
FROM generate_series(1, 10000000);
INSERT 0 10000000

Sorting

postgres=# EXPLAIN ANALYZE SELECT * FROM collation_test ORDER BY content COLLATE "und-x-icu";
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1853794.76..1878794.71 rows=9999977 width=40) (actual time=18842.292..25959.443 rows=10000000.00 loops=1)
   Sort Key: content COLLATE "und-x-icu"
   Sort Method: external merge  Disk: 215328kB
   Buffers: shared hit=10259 read=33992, temp read=53830 written=53970
   ->  Seq Scan on collation_test  (cost=0.00..144247.77 rows=9999977 width=40) (actual time=0.097..6759.340 rows=10000000.00 loops=1)
         Buffers: shared hit=10256 read=33992
 Planning Time: 0.049 ms
 Execution Time: 32310.280 ms
(8 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM collation_test ORDER BY content COLLATE "pg_unicode_fast";
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1853794.76..1878794.71 rows=9999977 width=40) (actual time=16459.850..23493.584 rows=10000000.00 loops=1)
   Sort Key: content COLLATE pg_unicode_fast
   Sort Method: external merge  Disk: 215328kB
   Buffers: shared hit=10350 read=33898, temp read=53830 written=53970
   ->  Seq Scan on collation_test  (cost=0.00..144247.77 rows=9999977 width=40) (actual time=0.080..6718.560 rows=10000000.00 loops=1)
         Buffers: shared hit=10350 read=33898
 Planning Time: 0.057 ms
 Execution Time: 29942.329 ms
(8 rows)

The built-in pg_unicode_fast collation shows a clear performance benefit in sorting large datasets – a 7% faster overall execution compared to ICU "und-x-icu". This is a meaningful gain at scale.

Case mapping

postgres=# EXPLAIN ANALYZE SELECT lower(content COLLATE "und-x-icu") FROM collation_test;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on collation_test  (cost=0.00..169247.71 rows=9999977 width=32) (actual time=0.167..10908.392 rows=10000000.00 loops=1)
   Buffers: shared hit=10420 read=33828
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.109 ms
 Execution Time: 17121.534 ms
(6 rows)

postgres=# EXPLAIN ANALYZE SELECT lower(content COLLATE "pg_unicode_fast") FROM collation_test;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on collation_test  (cost=0.00..169247.71 rows=9999977 width=32) (actual time=0.174..7160.083 rows=10000000.00 loops=1)
   Buffers: shared hit=10514 read=33734
 Planning Time: 0.041 ms
 Execution Time: 13394.296 ms
(4 rows)

Applying lower() to the entire dataset executed 28% faster under pg_unicode_fast

Pattern matching

postgres=# EXPLAIN ANALYZE SELECT * FROM collation_test
    WHERE content COLLATE "und-x-icu" LIKE 'A%';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..97430.41 rows=992 width=8) (actual time=0.256..370.130 rows=87870.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=10650 read=33640
   ->  Parallel Seq Scan on collation_test  (cost=0.00..96331.21 rows=413 width=8) (actual time=0.808..342.654 rows=29290.00 loops=3)
         Filter: ((content)::text ~~ 'A%'::text)
         Rows Removed by Filter: 3304043
         Buffers: shared hit=10650 read=33640
 Planning:
   Buffers: shared hit=3
 Planning Time: 0.061 ms
 Execution Time: 451.846 ms
(12 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM collation_test
    WHERE content COLLATE "pg_unicode_fast" LIKE 'A%';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..97430.41 rows=992 width=8) (actual time=0.338..337.682 rows=87870.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=10931 read=33359
   ->  Parallel Seq Scan on collation_test  (cost=0.00..96331.21 rows=413 width=8) (actual time=0.179..326.891 rows=29290.00 loops=3)
         Filter: ((content)::text ~~ 'A%'::text)
         Rows Removed by Filter: 3304043
         Buffers: shared hit=10931 read=33359
 Planning Time: 0.061 ms
 Execution Time: 439.951 ms
(10 rows)

The pg_unicode_fast collation performs slightly better, about a 2.7% faster runtime in this parallel I/O bound filtering query (439.95 ms vs 451.85 ms).

Use cases

Bulk data warehousing with millions of Unicode strings where correctness and speed are needed, but not language-specific sorting.
Search applications needing Unicode case-folded searching and fast LIKE/regex operations.
Replication/logical decoding across heterogeneous systems where identical collation behavior is mandatory.
Non-localized applications such as analytics, logging, or identity providers dealing with internationalized but non-locale-sensitive data.

For highly localized, culturally-correct sorting (e.g., alphabetical order for dictionaries in a specific language), ICU collations may still be preferable.

Summary

pg_unicode_fast in PostgreSQL 18 represents a strategic advancement in Unicode text processing. It accelerates sorting, case mapping and pattern matching by simplifying collation logic and relying on efficient Unicode standards rather than complex locale rules. This makes pg_unicode_fast a compelling choice for high-performance UTF-8 workloads that demand Unicode correctness without the overhead of language tailoring.

Migrating to pg_unicode_fast can yield meaningful, consistent performance gains across large datasets and numerous text operations, strengthening PostgreSQL’s capability as a performant Unicode-compliant database platform.

What’s next?

Let’s dive into how PostgreSQL 18 has enhanced its pg_catalog views to be more intelligent and informative from observability standpoint. These improvements empower DBAs and Devs with richer insights and better observability into internal database operations and performance.

Stay tuned!!

What is pg_unicode_fast?