In Part 1 and Part 2 of pg_search: Modern Full-Text Search Inside PostgreSQL, we explored the architecture, motivation, and performance advantages of pg_search in PostgreSQL, and we demonstrated practical implementation, including installation, BM25 indexing, and real-world search scenarios.
In this continuation, we shift the focus from features to mechanics.
This article examines how BM25 behaves in practice: how it improves ranking stability, prevents keyword bias, handles multi-term queries, and integrates scoring directly within the index layer. We also explore how the broader PostgreSQL ecosystem is adopting BM25 through different approaches, from native extensions like ParadeDB’s pg_search to hybrid architectures and search bridges.
Rather than reintroducing the algorithm itself, this discussion focuses on its practical impact on query execution, ranking quality, and system design decisions in production environments.
Moving Beyond “Frequency-Based” Ranking
Traditional PostgreSQL ranking (like ts_rank) often creates a bias toward documents that repeat keywords excessively. BM25 introduces controlled term saturation and length normalization to solve this.
1. Saturation: The Law of Diminishing Returns
In BM25, repeating a term 50 times does not make a document 50x more relevant. The “value” of each additional match diminishes, preventing keyword-stuffing from gaming the system.
- Example: Imagine searching for “Database Optimization”
- Document A (Spammy): Contains the word “Optimization” 40 times in a hidden list.
- Document B (High Quality): Contains “Optimization” 5 times in a technical guide.
- The Result: BM25 uses a saturation curve (k1 parameter). After the first few occurrences, the score contribution plateaus. Document B’s context-rich matches are weighted effectively, while Document A’s 40th mention adds almost zero value.
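To make the saturation curve concrete, here is a minimal sketch of the term-frequency factor from the standard BM25 formula, with length normalization switched off so only k1 is at work. The default k1 = 1.2 and the function name are assumptions for illustration, not values taken from pg_search.

```python
def tf_component(tf: float, k1: float = 1.2) -> float:
    """BM25 term-frequency factor with length normalization disabled (b = 0).

    k1 = 1.2 is a commonly used default, assumed here for illustration.
    """
    return tf * (k1 + 1) / (tf + k1)


spammy = tf_component(40)   # Document A: 40 occurrences, roughly 2.14
quality = tf_component(5)   # Document B: 5 occurrences, roughly 1.77

# How much the 40th mention adds on top of the first 39
marginal_gain = tf_component(40) - tf_component(39)

print(f"40 occurrences: {spammy:.3f}")
print(f" 5 occurrences: {quality:.3f}")
print(f"gain from the 40th mention: {marginal_gain:.4f}")
```

Eight times as many occurrences buys only about 20% more score, and the factor can never exceed k1 + 1 no matter how often the term is repeated.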
2. Normalization: The Fairness Factor
Longer documents naturally contain more words, which usually gives them an unfair advantage. BM25 avoids this bias through normalization factors embedded in scoring.
- Example: Searching for “Logical Replication.”
- Document A (Encyclopedia): A 100,000-word manual mentioning the term 10 times.
- Document B (Technical Blog): A 500-word post mentioning the term 10 times.
- The Result: BM25 recognizes that 10 matches in a short post indicate much higher “relevance density.” The normalization factor (b parameter) penalizes Document A for its volume, ensuring the concise, focused blog ranks higher.
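The effect of b can be sketched the same way, using the full term-frequency factor from the standard BM25 formula. The average document length of 2,000 words is a made-up figure for this example, and k1 = 1.2 and b = 0.75 are commonly used defaults assumed for illustration.

```python
def tf_component(tf: float, doc_len: int, avg_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency factor with document-length normalization."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


AVG_LEN = 2_000  # hypothetical average document length for the corpus

manual = tf_component(tf=10, doc_len=100_000, avg_len=AVG_LEN)  # roughly 0.40
blog = tf_component(tf=10, doc_len=500, avg_len=AVG_LEN)        # roughly 2.09

# With b = 0, length is ignored and both documents score the same
manual_b0 = tf_component(10, 100_000, AVG_LEN, b=0.0)
blog_b0 = tf_component(10, 500, AVG_LEN, b=0.0)

print(f"manual: {manual:.2f}, blog: {blog:.2f}")
```

The same 10 matches are worth about five times more in the short post. Setting b closer to 0 weakens this penalty, and b = 0 removes it entirely.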
3. Balance: Multi-Term Ranking Stability
Users rarely search for a single word; they search with intent: “postgres logical replication performance.” BM25 handles these queries by:
- Independent Evaluation: Scoring each term separately.
- Inverse Document Frequency (IDF): Weighting rare terms (e.g., “replication”) more heavily than common terms (e.g., “performance”).
- Probabilistic Combination: Merging scores so documents matching multiple distinct query terms rise to the top.
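The three steps above can be sketched as a toy scorer. This is an illustration under stated assumptions rather than how any particular extension implements it: the four-document corpus is invented, the IDF uses the Lucene-style formula, and the per-term scores are combined by simple summation, which is the standard BM25 form.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # commonly used defaults, assumed for illustration

# Invented toy corpus
docs = {
    "a": "postgres logical replication performance tuning guide",
    "b": "postgres performance tips for faster queries",
    "c": "postgres performance monitoring dashboards",
    "d": "cooking pasta at home",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)


def idf(term: str) -> float:
    """Rare terms get a higher weight than common ones."""
    total = len(tokens)
    containing = sum(1 for t in tokens.values() if term in t)
    return math.log(1 + (total - containing + 0.5) / (containing + 0.5))


def bm25(doc_id: str, query: str) -> float:
    """Score each query term independently, then sum the contributions."""
    counts = Counter(tokens[doc_id])
    length_norm = 1 - B + B * len(tokens[doc_id]) / avg_len
    score = 0.0
    for term in query.split():
        tf = counts[term]  # 0 when the term is absent
        score += idf(term) * tf * (K1 + 1) / (tf + K1 * length_norm)
    return score


query = "postgres logical replication performance"
ranking = sorted(docs, key=lambda d: bm25(d, query), reverse=True)
print(ranking)
```

In this corpus "replication" appears in one document and "performance" in three, so a match on "replication" contributes more to the final score, and the only document matching all four terms ranks first.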
Integrated Scoring Within the Index
One of the most important differences in modern PostgreSQL search extensions is where scoring happens.
In a native FTS setup, the database often retrieves candidates and then performs a secondary ranking pass. Modern BM25 implementations perform term lookup, score computation, and ranking evaluation directly inside the index access method.
- Tightly Coupled: Scoring happens during retrieval, not after.
- Low Overhead: No additional ranking pass is required.
- Efficient Filtering: Structured filters (like WHERE category_id = 5) can be pushed down into the index scan, ensuring the database only ranks what it actually needs to return.
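As a purely conceptual sketch of "score while you retrieve," the toy index below applies the structured filter and accumulates scores in a single pass over the postings. Everything here is hypothetical: the postings layout, the precomputed per-term scores, and the search function are illustrative and do not describe the internals of pg_search or any other extension.

```python
import heapq

# Hypothetical postings lists: term -> [(doc_id, per-term BM25 contribution)]
postings = {
    "replication": [(1, 1.9), (2, 1.4), (4, 0.8)],
    "logical": [(1, 1.1), (3, 0.9), (4, 1.2)],
}

# Hypothetical structured column stored alongside the index
category_id = {1: 5, 2: 7, 3: 5, 4: 5}


def search(terms, wanted_category, k):
    """Filter, score, and select the top k in one pass over the postings."""
    scores = {}
    for term in terms:
        for doc_id, term_score in postings.get(term, []):
            if category_id[doc_id] != wanted_category:
                continue  # rejected before any score is accumulated
            scores[doc_id] = scores.get(doc_id, 0.0) + term_score
    # Keep only the k best candidates instead of sorting everything
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


top = search(["logical", "replication"], wanted_category=5, k=2)
print(top)
```

Document 2 is in a different category, so it is skipped before any scoring work is done, and only the top two of the remaining candidates are returned.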
Where This is Used: The “BM25” Ecosystem
Because PostgreSQL is highly extensible, you don’t have to wait for “Native BM25” to be added to the core engine. Several industry-leading projects have already built this capability:
- ParadeDB (pg_search): Perhaps the most prominent implementation. It replaces the standard GIN index with a Rust-based bm25 index type powered by Tantivy (a Lucene alternative). It is used by companies like Bilt Rewards and Modern Treasury to provide Elasticsearch-quality search inside their primary database.
- Tiger Data (pg_textsearch): A high-performance extension often used in AI and RAG (Retrieval-Augmented Generation) applications. It uses a “memtable” architecture to make index updates extremely fast, making it ideal for update-heavy environments.
- ZomboDB: A bridge that makes an external Elasticsearch cluster look like a native PostgreSQL index. While it requires an external service, it offers the full power of the ELK stack through standard SQL.
Practical Implications for System Design
Bringing BM25 into PostgreSQL means:
- No External Services: You don’t need to manage an Elasticsearch or Solr cluster.
- Real-time Consistency: Because the index is part of the database, search results reflect committed data immediately (ACID compliance).
- Unified API: You can join your search results with other tables using standard SQL, with no complex ETL required.
Where This Becomes Most Valuable
BM25-based ranking is particularly impactful when:
- The dataset size exceeds tens of thousands of rows.
- Multi-field search is common (searching titles, tags, and content simultaneously).
- Hybrid Search is required (combining BM25 keyword matching with pgvector semantic search).
Conclusion
BM25 is more than a scoring formula; it represents a modern approach to relevance inside PostgreSQL.
By integrating probabilistic ranking, controlled term saturation, and document normalization directly into the index layer, today’s PostgreSQL ecosystem enables search quality that rivals dedicated search engines without leaving the database.
With solutions like ParadeDB’s pg_search and other ecosystem extensions, teams can deliver high-quality, transactionally consistent search while maintaining architectural simplicity.
BM25 inside PostgreSQL is no longer experimental; it is becoming a practical and production-ready choice for modern applications.
See this in action at PGConf India 2026 – pg_search: Bringing Elasticsearch-Grade Search to PostgreSQL presented by Mithun Chicklore Yogendra.
