Introduction:
One of our customers was experiencing high CPU utilization during specific operations. Close monitoring of the system showed that frequent SELECT, UPDATE, and INSERT operations, as well as stored procedure calls, were driving significant CPU usage.
Upon reviewing the table structures and DDL, we found that no indexes were in place. Based on these observations, we recommended HypoPG, an extension maintained by the PostgreSQL community. It simulates index creation, helping to evaluate the impact of indexes (and table partitioning) on performance without creating actual indexes on the tables.
HypoPG lets you simulate the creation of indexes without the CPU, I/O, and disk overhead of an actual index build. It helps determine whether adding an index to a table would improve performance. It is not intended for use in real-time workloads; rather, it lets you evaluate whether an index would have a positive impact before committing the resources to build it.
Indexes are essential for speeding up query execution in PostgreSQL, but without actually creating an index it can be hard to know whether the planner will use it, and on large tables index creation is time-consuming and resource-intensive. HypoPG addresses exactly this gap.
HypoPG dependencies:
HypoPG requires the following dependencies to be installed: postgresql16-contrib and postgresql16-devel.
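On RHEL-compatible systems with the PGDG repository configured, these can be installed with dnf (package names may differ on other distributions):
dnf install -y postgresql16-contrib postgresql16-devel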
To install HypoPG itself, you can either use the version-specific packages or build from the GitHub source.
Option 1: Install HypoPG for PostgreSQL 16 via RPM
dnf install -y hypopg_16
or
Option 2: Use GitHub
I prefer the GitHub approach, so I'll walk through it in detail:
git clone https://github.com/HypoPG/hypopg.git
Cloning into 'hypopg'...
remote: Enumerating objects: 2081, done.
remote: Counting objects: 100% (567/567), done.
remote: Compressing objects: 100% (227/227), done.
remote: Total 2081 (delta 370), reused 466 (delta 318), pack-reused 1514 (from 1)
Receiving objects: 100% (2081/2081), 835.68 KiB | 1.59 MiB/s, done.
Resolving deltas: 100% (1355/1355), done.
Once the clone completes, navigate to the hypopg directory:
[root@localhost ~]# cd hypopg/
Once you’re in the ‘hypopg’ directory, run the make command as follows:
[root@localhost hypopg]# make PG_CONFIG=/usr/pgsql-16/bin/pg_config
Then run make install:
[root@localhost hypopg]# sudo make PG_CONFIG=/usr/pgsql-16/bin/pg_config install
Finally, create the extension in Postgres:
postgres=# CREATE EXTENSION hypopg ;
CREATE EXTENSION
Once the extension is successfully created, you can start putting it to use.
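To double-check that the extension is registered, you can query the catalog (the version shown will depend on your build):
postgres=# SELECT extname, extversion FROM pg_extension WHERE extname = 'hypopg'; -- quick sanity check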
In my case, I wanted to test HypoPG on a table to demonstrate it to the client.
To test, I created the table:
postgres=# CREATE TABLE student ( regno INT, st_na VARCHAR(300), st_ag INT, joining_date DATE, gender VARCHAR(150), totalmarks INT );
CREATE TABLE
Insert the data:
postgres=# INSERT INTO student (regno, st_na, st_ag, joining_date, gender, totalmarks)
VALUES (1001, 'Alice Johnson', 20, '2022-01-01', 'Female', 850);
INSERT 0 1
I then inserted approximately 10 million rows using a custom script to conduct the test.
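The exact script isn't reproduced here, but a minimal generate_series-based equivalent that produces similar data might look like this (the value expressions are illustrative assumptions, not the original script):
postgres=# INSERT INTO student (regno, st_na, st_ag, joining_date, gender, totalmarks) -- illustrative bulk load
SELECT g,                                        -- registration number
       'Student ' || g,                          -- synthetic name
       18 + (g % 10),                            -- ages between 18 and 27
       DATE '2020-01-01' + (g % 1000),           -- joining dates spread over ~3 years
       CASE WHEN g % 2 = 0 THEN 'Male' ELSE 'Female' END,
       500 + (g % 400)                           -- total marks
FROM generate_series(1002, 10000000) AS g;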
postgres=# SELECT count(*) from student ;
count
---------
9909107
(1 row)
Retrieving the data:
postgres=# SELECT st_ag as age, joining_date as join from student where regno =100000;
age | join
-----+------------
23 | 2021-01-01
(1 row)
Explain plan:
postgres=# EXPLAIN SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
----------------------------------------------------------------
Seq Scan on student (cost=0.00..206439.84 rows=49546 width=8)
Filter: (regno = 100000)
(2 rows)
The Sequential Scan (Seq Scan) indicates that PostgreSQL is scanning the entire student table to find rows where regno = 100000. This means it does not use an index and instead checks each row one by one.
Cost=0.00..206439.84: This is the estimated startup cost (0.00) and total cost (206439.84) to execute the query.
Rows=49546: The planner estimates that 49546 rows match the condition.
Width=8: This is the average size in bytes of each output row (in this case, 8 bytes for the two selected columns: st_ag and joining_date).
A Seq Scan is inefficient on large tables when filtering for specific values unless an index is available. Consider adding an index on regno to improve performance.
Explain analyze:
postgres=# EXPLAIN ANALYZE SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on student (cost=0.00..206439.84 rows=49546 width=8) (actual time=64.544..4437.461 rows=1 loops=1)
Filter: (regno = 100000)
Rows Removed by Filter: 9909106
Planning Time: 0.177 ms
Execution Time: 4437.559 ms
(5 rows)
PostgreSQL performed a Sequential Scan, meaning it scanned all 9,909,107 rows in the student table to find rows where regno = 100000.
Estimated Cost: 0.00..206439.84
This range includes the startup and total estimated cost for scanning the entire table.
Estimated Rows: 49546
PostgreSQL overestimated how many rows would match; it expected 49K rows, but only 1 row matched.
Actual Time: 64.544..4437.461 ms
It took over 4.4 seconds to find the matching row because PostgreSQL had to check every row.
Rows Removed by Filter: 9909106
All rows except the single match were discarded after evaluating the filter condition.
Execution Time: 4437.559 ms
Confirms that the full table scan made the query slower.
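As an aside, the wide gap between the estimate (49,546 rows) and reality (1 row) suggests the planner statistics for this freshly loaded table are stale; refreshing them is a one-line command (not part of the original test run, just a general tip):
postgres=# ANALYZE student; -- refresh planner statistics for the table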
Next, we create a hypothetical index on the regno column, since that is the column the query filters on:
postgres=# SELECT * FROM hypopg_create_index('CREATE INDEX ON student(regno)');
indexrelid | indexname
------------+----------------------------
14541 | <14541>btree_student_regno
(1 row)
The hypothetical index was created. It exists only in this session's memory; nothing was written to disk.
Check the list of hypothetical indexes:
postgres=# SELECT * FROM hypopg_list_indexes ;
indexrelid | index_name | schema_name | table_name | am_name
------------+----------------------------+-------------+------------+---------
14541 | <14541>btree_student_regno | public | student | btree
(1 row)
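HypoPG also provides helper functions for inspecting hypothetical indexes, such as hypopg_get_indexdef() for the reconstructed DDL and hypopg_relation_size() for the estimated size the index would occupy on disk. For example:
postgres=# SELECT index_name,
       hypopg_get_indexdef(indexrelid) AS definition,       -- DDL of the hypothetical index
       pg_size_pretty(hypopg_relation_size(indexrelid)) AS estimated_size
FROM hypopg_list_indexes;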
EXPLAIN plan with the hypothetical index:
postgres=# EXPLAIN SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on student (cost=808.04..76200.29 rows=49546 width=8)
Recheck Cond: (regno = 100000)
-> Bitmap Index Scan on "<14541>btree_student_regno" (cost=0.00..795.65 rows=49546 width=0)
Index Cond: (regno = 100000)
(4 rows)
Bitmap Index Scan:
PostgreSQL first performs a Bitmap Index Scan using the index named “<14541>btree_student_regno”.
It identifies all the locations (TIDs) where regno = 100000 appears.
Cost=0.00..795.65 refers to the estimated cost of scanning the index.
Bitmap Heap Scan:
Then it uses a Bitmap Heap Scan to retrieve the actual rows from the student table using the TIDs found by the index scan.
Recheck Cond: This verifies that the rows still match regno = 100000, in case of any false positives.
cost=808.04..76200.29 rows=49546 width=8: These indicate the cost range, estimated matching rows, and average width (in bytes) of each result row.
Explain analyze with the hypothetical index:
postgres=# EXPLAIN ANALYZE SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on student (cost=0.00..206439.84 rows=49546 width=8) (actual time=72.526..5132.849 rows=1 loops=1)
Filter: (regno = 100000)
Rows Removed by Filter: 9909106
Planning Time: 0.225 ms
Execution Time: 5133.400 ms
(5 rows)
Notice that the plan has reverted to a Sequential Scan even though the hypothetical index is still defined. This is expected HypoPG behavior: EXPLAIN ANALYZE actually executes the query, and a hypothetical index has no physical structure to scan, so hypothetical indexes are considered only by plain EXPLAIN during planning. At execution time, PostgreSQL therefore still scans all 9,909,107 rows, discards 9,909,106 of them, and takes over 5 seconds (Execution Time: 5133.400 ms), just as before the hypothetical index was created.
Comparison: Before vs After Index
Before Index: PostgreSQL performs a full table scan using a Sequential Scan, resulting in higher cost and slower performance on large datasets.
After Index (HypoPG): PostgreSQL uses a simulated index with a Bitmap Index Scan, significantly reducing cost and improving query efficiency.
After creating the HypoPG index, the query plan changed from a costly Sequential Scan (cost=206439.84) to a much more efficient Bitmap Index Scan (cost=76200.29), significantly improving performance.
Listing the hypothetical indexes again:
postgres=# SELECT * FROM hypopg_list_indexes;
indexrelid | index_name | schema_name | table_name | am_name
------------+----------------------------+-------------+------------+---------
14541 | <14541>btree_student_regno | public | student | btree
(1 row)
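If you want to remove just one hypothetical index rather than all of them, HypoPG provides hypopg_drop_index(), which takes the indexrelid shown above (the OID will differ in your session):
postgres=# SELECT hypopg_drop_index(14541); -- OID from hypopg_list_indexes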
Resetting Hypothetical Indexes:
postgres=# SELECT hypopg_reset();
hypopg_reset
--------------
(1 row)
This command clears all hypothetical (virtual) indexes created using the hypopg extension.
Verifying Hypothetical Indexes Are Cleared:
postgres=# SELECT * FROM hypopg_list_indexes;
indexrelid | index_name | schema_name | table_name | am_name
------------+------------+-------------+------------+---------
(0 rows)
This shows that no hypothetical indexes remain, confirming a successful reset.
Creating a Real Index on the student Table:
After testing with the HypoPG tool, I created a real index on the same column in my environment to verify whether it improves performance in the same way.
postgres=# CREATE INDEX student_regno_idx on student (regno);
CREATE INDEX
The index (student_regno_idx) speeds up queries that filter on regno.
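A production note: a plain CREATE INDEX holds a lock that blocks writes to the table for the duration of the build. On a busy system, the concurrent variant avoids that blocking at the cost of a slower build (and it cannot run inside a transaction block); the equivalent concurrent build would have been:
postgres=# CREATE INDEX CONCURRENTLY student_regno_idx ON student (regno); -- non-blocking build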
Query Plan Using the Index:
postgres=# EXPLAIN SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
--------------------------------------------------------------------------------------
Bitmap Heap Scan on student (cost=928.42..76320.67 rows=49546 width=8)
Recheck Cond: (regno = 100000)
-> Bitmap Index Scan on student_regno_idx (cost=0.00..916.03 rows=49546 width=0)
Index Cond: (regno = 100000)
(4 rows)
Bitmap Index Scan:
PostgreSQL first performs a Bitmap Index Scan using the index named “student_regno_idx”.
It identifies all the locations (TIDs) where regno = 100000 appears.
Cost=0.00..916.03 refers to the estimated cost of scanning the index.
Bitmap Heap Scan:
Then it uses a Bitmap Heap Scan to retrieve the actual rows from the student table using the TIDs found by the index scan.
Recheck Cond: This verifies that the rows still match regno = 100000, in case of any false positives.
cost=928.42..76320.67 rows=49546 width=8: These indicate the cost range, estimated matching rows, and average width (in bytes) of each result row.
Explain analyze:
postgres=# EXPLAIN ANALYZE SELECT st_ag as age, joining_date as join from student where regno =100000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on student (cost=928.42..76320.67 rows=49546 width=8) (actual time=7.729..7.842 rows=1 loops=1)
Recheck Cond: (regno = 100000)
Heap Blocks: exact=1
-> Bitmap Index Scan on student_regno_idx (cost=0.00..916.03 rows=49546 width=0) (actual time=7.663..7.691 rows=1 loops=1)
Index Cond: (regno = 100000)
Planning Time: 0.483 ms
Execution Time: 8.055 ms
(7 rows)
Bitmap Index Scan:
PostgreSQL first performs a Bitmap Index Scan using the index named student_regno_idx.
It identifies all the locations (TIDs, tuple IDs) where the condition regno = 100000 holds.
Index Cond: (regno = 100000)
cost=0.00..916.03: This shows the estimated cost to scan the index.
rows=49546: PostgreSQL estimated that 49,546 rows would match the condition.
width=0: The index scan does not return any column data, only pointers to rows.
actual time=7.663..7.691: the index scan took about 7.7 ms in total (7.663 ms to produce its first result, 7.691 ms to complete), and only 1 matching row was found (far fewer than estimated).
Bitmap Heap Scan:
After identifying the matching row locations from the index, PostgreSQL performs a Bitmap Heap Scan on the student table to retrieve the actual row(s) using those pointers.
Recheck Cond: (regno = 100000):
PostgreSQL rechecks the condition on each row to ensure accuracy, guarding against false positives.
cost=928.42..76320.67: 928.42 is the startup cost (incurred before the first row is returned), and 76320.67 is the total estimated cost if all ~49K estimated rows were fetched.
rows=49546: Estimated number of matching rows (again, overestimated).
width=8: Estimated average width of each result row (8 bytes, likely due to selecting st_ag and joining_date).
actual time=7.729..7.842: the node completed in about 7.8 ms overall; since this figure includes the ~7.7 ms index scan beneath it, the heap fetch itself added only a fraction of a millisecond.
Heap Blocks: exact=1: Only 1 data block was read, a very efficient lookup.
rows=1: Only 1 actual row was returned.
Final Execution Info:
Planning Time: 0.483 ms
Execution Time: 8.055 ms
Overall: Query used an efficient index-based plan and completed quickly, even though PostgreSQL overestimated the number of rows.
Comparison: real index vs. hypothetical index (HypoPG)
Real index:
PostgreSQL used the real B-tree index at execution time. The planner still significantly overestimated the matching rows (estimated 49,546; actual 1), which inflated the cost estimate (928.42..76320.67), yet the query completed in about 8 ms instead of over 5 seconds.
Hypothetical index (HypoPG):
HypoPG simulated a B-tree index on the regno column without physically creating it. During planning, plain EXPLAIN produced essentially the same Bitmap Index Scan plan at a very similar estimated cost (808.04..76200.29, versus 206439.84 for the sequential scan), so the simulation accurately predicted the benefit of building the real index.
Conclusion:
In this exercise, we installed and configured HypoPG, created a sample student table, and tested query performance with and without a hypothetical index on the regno column. Using HypoPG, we simulated the index, observed a switch from a costly sequential scan to an efficient bitmap index scan in the execution plan, and confirmed the improvement without actually building the index on disk. Finally, we created the real index and verified similar performance benefits. HypoPG proved to be a powerful tool for evaluating index impact quickly, safely, and without consuming production resources.
