Introduction
Here I am again! Talking about the series of topics around Data Integration with PostgreSQL, The World’s Most Advanced Open Source Relational Database. If you haven’t looked at the previous blog in the series, I’d highly recommend reading the same, it is available here.
In the ever-evolving landscape of data management, organisations are constantly seeking ways to stay ahead of the modifications, additions and deletions in real-time. CDC is a transformative technology that allows organisations to achieve data integrity and consistency across all systems and deployment environments. Additionally, it allows organisations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases and data warehouses.
Change Data Capture (CDC) has emerged as a game-changer, providing a dynamic solution to keep pace with the rapidly changing world of data. This blog explores three W’s of CDC and methods of CDC.
Three W’s of Change Data Capture
What is Change Data Capture?
Change Data Capture (CDC) is a technique used to capture the changes made to data in a source database. So that the changes can be implemented in the target database also. Replication is one of the important aspects where the data is constantly changing, CDC reduces the load by sending only incremental data.
Why Change Data Capture is important?
CDC enables real-time synchronisation of data between source and target systems. This is critical for applications where access to the most current information is essential, such as in financial transactions or healthcare etc
Unlike traditional methods that involve replicating entire datasets, CDC focuses only on the changes. This reduces the processing overhead, minimises network traffic, and optimises resource utilisation.
When to use Change Data Capture?
CDC is particularly used when dealing with datasets that undergo frequent changes. In scenarios where data is constantly updated, capturing only the changes helps maintain system efficiency.
In environments where data warehouses and analytical systems need to stay current, implementing CDC ensures that updates are propagated swiftly and without the need for extensive data transfers.
Methods of Change Data Capture
Change Data Capture (CDC) comes in various types, each having its own specific use cases and requirements. The primary methods of CDC ,we want to discuss :
Timestamp based CDC
In timestamp based CDC ,we use the timestamp column with the name “last_updated”, “ last_modified” or “date_modified” (as per your convenience) in the source table to identify changes.
Pros:
- Simple to implement, especially when timestamps are readily available.
- No additional load on the source database compared to trigger-based CDC.
Cons:
- Timestamp columns need to be updated every time.
- Delete operations will not be identified as there is no timestamp column for deleted rows.
Log based CDC
In log based CDC, changes are captured directly from the transaction logs of the source database. This method is efficient because it reads the logs generated by the database management system (DBMS) to identify changes.
Pros:
- More reliable and efficient.
- There is no need to change the schemas of the database system or add additional columns.
Cons:
- More complex to implement.
- Dependency on database-specific features.
Trigger based CDC
In trigger based CDC, changes in the source are identified using the database triggers. Triggers are special stored procedures that are automatically executed when a specified event (such as an update, insert, or delete) occurs on a particular table.
Pros:
- Simple to implement.
- Directly tied to database events, making it easy to understand.
Cons:
- Can introduce additional overhead on the source database.
- May not be as real-time as log-based approaches.
- Complexity increases with the number of triggers.
Conclusion
As we conclude the exploration of Change Data Capture ,uses and methods of CDC. Change Data Capture is not just a technological tool, it is a strategic asset for organisations navigating the complexities of a data-driven world. By embracing this approach, businesses can not only adapt to change but thrive in an environment where accurate, timely, and reliable data is the cornerstone of success.
If you are interested in exploring CDC in your PostgreSQL database using Pentaho Data Integration (PDI) , please reach out to us, let’s connect and collaborate and extract more from your PostgreSQL database.
Thank you and Stay tuned!!
Leave a Reply