Welcome to the 27th blog post of PG18 Hacktober!
Among the many new features, one quiet and small feature is making its debut, promising to enable developers and DBAs approach case-insensitive string matching: the casefold() function.
While it might not grab headlines like AIO, parallel query improvements or advanced security features, casefold() is a powerful addition that will significantly improve data handling, especially for applications dealing with internationalized text.
Out with the Old (Sort Of): The lower() Dilemma
For years, lower() has been the go-to function for converting strings to lowercase, often used in conjunction with LIKE or = for case-insensitive comparisons:
SELECT * FROM products WHERE lower(name) = lower('New Widget');
This works perfectly fine for simple English text. However, the world of Unicode is far more complex than ASCII. lower() operates based on locale-specific rules, which can lead to unexpected and incorrect results when dealing with certain characters and languages.
Consider these common pitfalls of lower():
- Growing Characters: The German eszett (ß) converts to ss when case-folded, but lower() might not handle this.
- Contextual Lowercasing: The Greek capital sigma (Σ) can become either σ (standard) or ς (final). lower() might only produce one, leading to missed matches.
- Dotless/Dotted I: In Turkish, I (capital i) lowercases to ı (dotless i), and İ (capital I with dot) lowercases to i (dotted i). lower() can often misinterpret these.
These inconsistencies mean that lower() isn’t truly reliable for robust, Unicode-aware case-insensitive comparisons. This is where casefold() steps in.
Enter casefold():
PostgreSQL 18’s casefold(text) function is designed as the definitive solution for case-insensitive string operations. It performs Unicode case folding, which is a process specifically engineered to convert strings into a common form where case distinctions are eliminated.
Here’s why casefold() is a game-changer:
- Unicode Standard Compliance:
casefold()adheres to the Unicode Consortium’s recommendations for case folding, ensuring consistent and correct behavior across all languages. - True Equivalence: It guarantees that if two strings are considered “case-insensitively equal” by Unicode rules, their casefold() results will be identical.
- Robust Matching: You can be confident that your search queries and comparisons will catch all relevant results, without being tripped up by nuanced linguistic differences.
How to Use casefold()
The usage is straightforward and mirrors lower():
— More reliable case-insensitive comparison
SELECT * FROM users WHERE casefold(email) = casefold('user@example.com');
For performance, especially on large tables, you can create a functional index using casefold():
CREATE INDEX users_email_casefold_idx ON users (casefold(email));
This index will allow PostgreSQL to quickly find matching rows without having to scan the entire table and apply casefold() to every row during a query.
When to make the switch
If you’re upgrading to PostgreSQL 18 and your application relies on case-insensitive string matching – especially if you handle multi-lingual data – you should strongly consider migrating from lower() to casefold().
While lower() will continue to exist and has its place for simple transformations, casefold() is the technically superior choice for ensuring accurate, Unicode-compliant case-insensitive comparisons.
This seemingly small addition to PostgreSQL 18 is a testament to the community’s commitment to building a database that is not only powerful and performant but also incredibly robust and globally aware. Give casefold() a try, your data (and your users) will thank you for it.
Here is a test setup and a series of scripts to demonstrate the power and necessity of the casefold() function, highlighting where it succeeds and where lower() fails.
Before starting remember that we are using database that is configured with the ICU (International Components for Unicode). Only ICU has the advanced mapping rules to handle cases like ß -> ss in your dev/production environment.
postgres=# CREATE EXTENSION IF NOT EXISTS icu_ext;
NOTICE: extension "icu_ext" already exists, skipping
CREATE EXTENSION
Test setup
Prerequisite: You must have PostgreSQL 18 installed and running.
- Connect to your PostgreSQL instance using psql or any SQL client.
- Create a new database for our test (optional, but clean):
postgres=# CREATE DATABASE icu_test_db
LOCALE_PROVIDER = 'icu'
ICU_LOCALE = 'und-u-kf-lower'
TEMPLATE = template0;
CREATE DATABASE
postgres=# \c icu_test_db
- Run the first script below (01-setup.sql) to create the table and populate it with our specific test cases.
1. Setup script
This script creates our test table and inserts four strings designed to test the specific Unicode edge cases that casefold() solves.
01-setup.sql
— Create our test table
icu_test_db=# CREATE TABLE string_tests (
id SERIAL PRIMARY KEY,
description TEXT,
test_string TEXT
);
CREATE TABLE
— Insert our test data
icu_test_db=# INSERT INTO string_tests (description, test_string) VALUES
('Basic English', 'Hello'),
('German Eszett (ß)', 'Straße'),
('Greek Sigma (σ vs ς)', 'ὀδυσσεύσ'), -- Ends with standard sigma
('Turkish Dotless I (I)', 'ISPARTA');
INSERT 0 4
icu_test_db=# SELECT * FROM string_tests;
id | description | test_string
----+-----------------------+-------------
1 | Basic English | Hello
2 | German Eszett (ß) | Straße
3 | Greek Sigma (σ vs ς) | ὀδυσσεύσ
4 | Turkish Dotless I (I) | ISPARTA
(4 rows)
2. Testing with casefold()
This script runs comparisons against our test data using casefold(). Notice how it correctly matches strings that are semantically equivalent but have different characters.
02-test-casefold.sql
— Test 1: Basic English (Sanity Check)
— ‘Hello’ should match ‘hello’
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'Basic English'
AND casefold(test_string) = casefold('hello');
id | description | test_string
----+---------------+-------------
1 | Basic English | Hello
(1 row)
— Test 2: German Eszett (ß)
— ‘Straße’ (with ß) should match ‘strasse’ (with ss)
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'German Eszett (ß)'
AND casefold(test_string) = casefold('strasse');
id | description | test_string
----+-------------------+-------------
2 | German Eszett (ß) | Straße
(1 row)
— Test 3: Greek Sigma (σ vs ς)
— ‘ὀδυσσεύσ’ (with σ) should match ‘ὀδυσσεύς’ (with ς – final sigma)
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'Greek Sigma (σ vs ς)'
AND casefold(test_string) = casefold('ὀδυσσεύς');
id | description | test_string
----+----------------------+-------------
3 | Greek Sigma (σ vs ς) | ὀδυσσεύσ
(1 row)
When you run this script, all queries will successfully return a row. This demonstrates that casefold() correctly identifies all these pairs as equivalent.
3. Testing with lower() (the “why we need casefold()” script)
This script runs the exact same comparisons using the old lower() function. This is where you will see the test failures.
03-test-lower.sql
— Test 1: Basic English (Sanity Check)
— ‘Hello’ should match ‘hello’
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'Basic English'
AND lower(test_string) = lower('hello');
id | description | test_string
----+---------------+-------------
1 | Basic English | Hello
(1 row)
— Test 2: German Eszett (ß)
— ‘Straße’ (with ß) vs ‘strasse’ (with ss)
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'German Eszett (ß)'
AND lower(test_string) = lower('strasse');
id | description | test_string
----+-------------+-------------
(0 rows)
— Test 3: Greek Sigma (σ vs ς)
— ‘ὀδυσσεύσ’ (with σ) vs ‘ὀδυσσεύς’ (with ς – final sigma)
icu_test_db=# SELECT * FROM string_tests
WHERE description = 'Greek Sigma (σ vs ς)'
AND lower(test_string) = lower('ὀδυσσεύς');
id | description | test_string
----+-------------+-------------
(0 rows)
When you run this script, only the first query (Basic English) will return a row.
- Test 2 (German) fails: lower(‘Straße’) is straße, which does not equal strasse.
- Test 3 (Greek) fails: lower(‘ὀδυσσεύσ’) is ὀδυσσεύσ, which does not equal ὀδυσσεύς.
This script clearly proves the limitations of lower() for case-insensitive matching.
4. Indexing for Performance
Finally, here is how you would use casefold() in a production environment to ensure your case-insensitive searches are fast.
04-test-index.sql
— 1. Create a functional index using casefold()
icu_test_db=# CREATE UNIQUE INDEX users_name_casefold_idx ON string_tests (casefold(test_string));
— 2. Run an EXPLAIN to see the index in action
— PostgreSQL will see the WHERE clause matches the index definition
icu_test_db=# EXPLAIN SELECT * FROM string_tests
WHERE casefold(test_string) = 'strasse';
QUERY PLAN
---------------------------------------------------------------------------------------------
Index Scan using users_name_casefold_idx on string_tests (cost=0.42..8.44 rows=1 width=36)
Index Cond: (casefold(test_string) = 'strasse'::text)
(2 rows)
Conclusion
The casefold() function is a small but mighty addition to PostgreSQL 18, finally providing a robust, Unicode-compliant solution for case-insensitive string matching. It solves complex linguistic challenges where the traditional lower() function falls short.
As we’ve demonstrated, its true power—like correctly handling German ß or Greek σ—is unlocked by using the ICU locale provider, which is easily enabled with the icu_ext extension. By creating a functional index on casefold(your_column), you can achieve this superior accuracy without sacrificing performance, even on massive tables.If you’re upgrading to PostgreSQL 18 and deal with international data, it’s time to move beyond lower() and embrace casefold() for more accurate, reliable, and globally-aware applications.
What’s Next:
We’ll continue our exploration of PostgreSQL 18’s Internationalization features – LIKE comparisons for nondeterministic collations. Stay tuned!
