In today’s data-driven world, businesses rely heavily on the power of databases to store and manage vast amounts of information. However, as data grows, it becomes more challenging to keep systems performant and efficient. One of the keys to managing large datasets in databases like PostgreSQL is an effective archival strategy.
Archiving is the process of moving older, less frequently accessed data from active tables to an archival table, which can later be moved to a more cost-effective, less performance-intensive storage solution if required. Without proper management of large tables, query performance can degrade, backups can become unwieldy, and maintenance tasks can become time-consuming. This is where a well-defined archival strategy comes into play. In this blog post, let’s discuss some of the challenges of managing large tables and the importance of having an archival strategy.
Challenges of Managing Large Tables
Managing large tables in PostgreSQL presents several challenges. Here are some of them:
- Query Performance Degradation: As tables grow in size, the time taken to run queries, especially complex ones, increases. Indexes may become less efficient, and queries can lead to significant I/O operations.
- Longer Backup and Restore Times: Large tables result in large database backups. Restoring such backups can be a lengthy process, which may disrupt business operations.
- Increased Resource Usage: Storing vast amounts of historical data becomes costly in terms of disk storage and drives up CPU and I/O utilization.
- Maintenance Overhead: Operations like vacuuming, reindexing, or analyzing large tables take longer and may lock the table, affecting other operations.
Importance of Having an Archival Strategy
An archival strategy is essential for several reasons. Here are some of them:
- Improved Performance: By archiving old data, we reduce the size of active tables, which leads to faster queries, better use of indexes, and improved overall database performance.
- Reduced Costs: Archiving lets you move data out of the active tables, which can significantly cut operational expenses such as compute and storage costs.
- Better Backup Management: Smaller active tables make backup and restore operations faster and less resource-intensive.
- Regulatory Compliance and Data Retention: In some industries, data retention is a legal requirement. A proper archival strategy ensures compliance with these regulations by safely storing older data in a manner that can be easily retrieved when necessary.
Archival Strategies
While managing large datasets, we recently worked with two different archival strategies in two different scenarios.
Approach 1: Moving data through insert/delete
Scenario: Imagine we have a main table called main_table, which contains data that is up to 3 years old. This table has a created_at column, which stores the timestamp of when each record was created. The goal is to keep only the most recent 3 months of data in the main table. Any data older than 3 months should be moved to an archival table called archive_table.
Note: main_table and archive_table must have the same structure/DDL syntax.
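If archive_table does not exist yet, one convenient way to guarantee an identical structure is to create it with LIKE ... INCLUDING ALL. Here is a minimal sketch; the id and payload columns are hypothetical placeholders, and only created_at comes from the scenario:

-- Hypothetical layout for main_table; adjust the columns to match your schema
CREATE TABLE public.main_table (
    id          BIGSERIAL PRIMARY KEY,
    payload     JSONB,
    created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Create archive_table with the same structure, copying defaults, constraints and indexes
CREATE TABLE public.archive_table (LIKE public.main_table INCLUDING ALL);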
In this approach, we implement a procedure that manually moves the old data from main_table to archive_table based on the created_at timestamp. The procedure first inserts the old data into the archive table and then deletes it from the main table.
CREATE OR REPLACE PROCEDURE archive_old_data()
LANGUAGE plpgsql AS $$
DECLARE
    -- Step1: declare the variables
    cutoff_ts       TIMESTAMP := CURRENT_TIMESTAMP - INTERVAL '3 months';
    rows_to_archive INTEGER;
    rows_archived   INTEGER;
    rows_removed    INTEGER;
BEGIN
    -- Step2: count the number of rows to archive
    SELECT COUNT(*) INTO rows_to_archive
    FROM public.main_table
    WHERE created_at < cutoff_ts;

    -- Step3: if there are no rows to archive, log and exit
    IF rows_to_archive = 0 THEN
        RAISE NOTICE 'No rows to archive.';
        RETURN;
    END IF;

    -- Step4: insert the old rows into the archive table
    INSERT INTO public.archive_table
    SELECT *
    FROM public.main_table
    WHERE created_at < cutoff_ts;

    -- Step5: count the number of rows inserted into the archive
    GET DIAGNOSTICS rows_archived = ROW_COUNT;

    -- Step6: verify that the number of rows archived matches the count
    IF rows_to_archive <> rows_archived THEN
        RAISE EXCEPTION 'Row count mismatch: expected %, archived %', rows_to_archive, rows_archived;
    END IF;

    -- Step7: delete the archived rows from the original table
    DELETE FROM public.main_table
    WHERE created_at < cutoff_ts;

    -- Step8: count the number of rows deleted from the main table
    GET DIAGNOSTICS rows_removed = ROW_COUNT;

    RAISE NOTICE '% rows archived and % rows deleted successfully', rows_archived, rows_removed;
END;
$$;
Let’s break down the above procedure into 8 steps:
- Step1 Declare Variables: Set up variables to hold the archive cutoff timestamp (3 months before the current timestamp) and to track row counts.
- Step2 Count Rows to Archive: Count how many rows in main_table are older than 3 months.
- Step3 Check for Rows to Archive: If there are no rows to archive, exit early.
- Step4 Insert into Archive Table: Insert rows older than 3 months into archive_table.
- Step5 Count Inserted Rows: Check how many rows were inserted into archive_table.
- Step6 Verify Row Count: Ensure the number of rows archived matches what was counted earlier.
- Step7 Delete Archived Data: Delete the archived rows from main_table.
- Step8 Count Deleted Rows and Log: Count how many rows were deleted from main_table and output a success message with the counts of archived and deleted rows.
Adapt this procedure to your environment and schedule it with the pg_cron extension so that it runs daily during non-business hours, keeping the impact on your workload low. If you are new to Postgres, or if you are unfamiliar with pg_cron, check our detailed blog on scheduling a cron job using the pg_cron extension: https://opensource-db.com/automating-postgresql-tasks-with-pg_cron/
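For example, after testing the procedure manually, the daily run can be scheduled with pg_cron. The job name and the 02:00 schedule below are only examples; pick a window that matches your non-business hours:

-- Run the procedure once by hand to verify it behaves as expected
CALL archive_old_data();

-- Schedule it to run daily at 02:00 (server time) with pg_cron
SELECT cron.schedule('archive_old_data_daily', '0 2 * * *', 'CALL archive_old_data();');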
Approach 2: Partitioning with attaching/detaching
Scenario: Consider a main table named main_table, which stores data up to 2 years old. The table is partitioned monthly by the created_at column, which holds the timestamp for when each record was created. The objective is to ensure that main_table only retains the most recent 12 months of data, and any data older than 12 months should be moved to an archival table called archive_table, which is also partitioned.
Note: main_table and archive_table must have the same structure / DDL syntax.
In this approach, we detach the monthly partitions older than 12 months from main_table and attach them to archive_table.
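For reference, here is a minimal sketch of how the two partitioned parents could be declared. All columns other than created_at are hypothetical; the important part is that both tables use the same PARTITION BY RANGE (created_at) scheme:

-- Both parents share the same structure and partitioning scheme
CREATE TABLE public.main_table (
    id          BIGINT,
    payload     JSONB,
    created_at  TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE public.archive_table (
    id          BIGINT,
    payload     JSONB,
    created_at  TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);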
-- Step1: create the monthly partition (here, December 2024) on the main table
postgres=# CREATE TABLE public.main_table_2024_12 PARTITION OF public.main_table
FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
CREATE TABLE

-- Step2: at the end of each month, detach the oldest partition from the main table
postgres=# ALTER TABLE public.main_table DETACH PARTITION public.main_table_2024_12;
ALTER TABLE

-- Step3: attach the detached partition to the archive table
postgres=# ALTER TABLE public.archive_table ATTACH PARTITION public.main_table_2024_12 FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
ALTER TABLE
Let’s break down the above steps:
- Step 1: Create a partition for each month’s data (for example, December 2024) within main_table, ensuring that all data created in that month is handled in its own partition. We have created one for December 2024; create one partition for each upcoming month accordingly.
- Step 2: At the end of the month, detach the created partition (main_table_2024_12) from main_table so that it is no longer part of the active table, preparing it for archival.
- Step 3: Attach the detached partition to archive_table, effectively moving the data to the archival structure while maintaining the same partitioning scheme.
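To confirm that the partition has actually moved, you can list the child tables currently attached to archive_table, for example:

-- List the partitions attached to archive_table using the system catalogs
SELECT c.relname AS partition_name
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
WHERE p.relname = 'archive_table'
ORDER BY c.relname;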
Adapt these commands to your environment and write a simple script that automates the whole process.
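As one possible sketch, the monthly detach/attach can be wrapped in an anonymous DO block and scheduled (for example with pg_cron). It assumes the main_table_YYYY_MM naming pattern used above; adjust the 12-month offset to match how you count your retention window:

-- Move the partition that has just aged out of the 12-month window
DO $$
DECLARE
    part_start DATE := date_trunc('month', CURRENT_DATE) - INTERVAL '12 months';
    part_end   DATE := part_start + INTERVAL '1 month';
    part_name  TEXT := format('main_table_%s', to_char(part_start, 'YYYY_MM'));
BEGIN
    -- Detach the expiring partition from the active table
    EXECUTE format('ALTER TABLE public.main_table DETACH PARTITION public.%I', part_name);

    -- Attach the same partition to the archive table with the same bounds
    EXECUTE format(
        'ALTER TABLE public.archive_table ATTACH PARTITION public.%I FOR VALUES FROM (%L) TO (%L)',
        part_name, part_start, part_end
    );

    RAISE NOTICE 'Moved partition % to archive_table', part_name;
END;
$$;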
Since these approaches were built around our specific requirements, we recommend adapting them and testing them in a test environment before implementing them in production.
Conclusion
Archiving is crucial for maintaining the health and performance of a PostgreSQL database as it grows. By employing a well-thought-out archival strategy, you can significantly improve query performance, reduce storage costs, and ensure efficient backup processes. Whether you choose a manual insert/delete approach or a more sophisticated partitioning strategy, the key is to regularly assess your data growth and adjust your archival methods accordingly.
The two approaches we’ve discussed — moving data through insert/delete and partitioning with attaching/detaching — each offer distinct advantages depending on your use case. Choosing the right one depends on factors like your data access patterns, performance requirements, and system complexity. By embracing a suitable archival strategy, you can ensure that your PostgreSQL database remains scalable, performant, and cost-effective in the long run.
Stay tuned!!