Introduction
Have you experimented with Timestamp-Based Change Data Capture using the Pentaho Data Integration (PDI) tool? Achieving data replication from a source database to a target database through Timestamp-Based CDC with PDI is indeed straightforward, and perhaps you have already implemented it successfully. However, this method has two drawbacks when it comes to comprehensive data replication. Firstly, the timestamp column in the source table must be updated every time rows are inserted or updated. Secondly, ‘Delete’ operations pose a challenge, as there is no way to retrieve the timestamp column’s value for rows that have been deleted.
The first drawback is taken care of as long as you incorporate these two practices into your database design:
- Include a Date/Time column to capture the Last Modified / Last Updated timestamp in every table (or at least in the ones you plan to replicate to another database instance).
- Associate the column with an insert or update trigger that automatically sets it to the current timestamp whenever an insert or update is performed, or make sure your application code sets the column to the current timestamp on every insert or update (a minimal PostgreSQL sketch is shown below).
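As an illustration, here is a minimal PostgreSQL sketch of these practices; the table and column names (t1, last_updated) are examples only:

```sql
-- Example table with a last-updated timestamp column.
-- DEFAULT now() covers inserts; the trigger below covers updates.
CREATE TABLE t1 (
    id           integer PRIMARY KEY,
    name         text,
    last_updated timestamptz NOT NULL DEFAULT now()
);

-- Trigger function that refreshes last_updated on every UPDATE.
CREATE OR REPLACE FUNCTION set_last_updated() RETURNS trigger AS $$
BEGIN
    NEW.last_updated := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_t1_last_updated
    BEFORE UPDATE ON t1
    FOR EACH ROW
    EXECUTE FUNCTION set_last_updated();  -- use EXECUTE PROCEDURE on PostgreSQL 10 and older
```

With this in place, the application never has to remember to touch the column; the database keeps it current on its own.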
Coming to the second case, there is a simple and effective solution that achieves complete data replication with some modifications to the Timestamp-based CDC design we discussed in the previous blog. For a better understanding, we highly recommend following our series of topics around Data Integration with PostgreSQL (the World’s Most Advanced Open Source Relational Database System). So far, we have explored the installation of the Pentaho Data Integration (PDI) tool and its connection to the PostgreSQL database, Change Data Capture and its methods, and the implementation of Timestamp-based Change Data Capture. If you have already checked these out, you’re good to go!
We are dividing this topic into two parts to make it easier for you to follow along:
- Design of the CDC pipeline (this blog)
- Execution of the CDC pipeline (next blog)
Design of the CDC pipeline
In this pipeline, we will identify the differences between two tables (one from the source database and another from the target database) and apply the changes to the target database accordingly. Before designing this pipeline, save the transformation we used to achieve Timestamp-based CDC (previous blog) as ‘Timestamp based CDC’. Then open a new transformation.
As we need two inputs for the comparison step (Step 2.0), we will export the data from both the source database and the target database.
Export Data from Source (Step 1.1)
- Go to Design → Input → Table input
- Here, we will extract the data from the table (t1) in the source database by providing a SQL statement (a sample query is shown below).
- Hop the result to Step 2.0 as input.
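For example, the Table input step for the source could use a query like the one below; t1 and its columns are just the sample names used in this series:

```sql
-- Read all rows from the source table, sorted on the key column.
-- Merge rows (diff) in Step 2.0 expects both of its inputs to be
-- sorted on the key fields used for the comparison.
SELECT id, name, last_updated
FROM t1
ORDER BY id;
```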
Export Data from Target (Step 1.2)
- Go to Design → Input → Table input
- Here, we will extract the data from the table (t2) in the target database by providing a SQL statement (a sample query is shown below).
- Hop the result to Step 2.0 as input.
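The target side uses a matching query against t2, with the same column list and sort order so that the row-by-row comparison in Step 2.0 works correctly:

```sql
-- Read all rows from the target table, sorted the same way
-- as the source query above.
SELECT id, name, last_updated
FROM t2
ORDER BY id;
```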
Identify the difference in two tables (Step 2.0)
- Go to Design → Joins → Merge rows (diff)
- Here, we will compare the fields of the two tables (t1 & t2) and write the result of the comparison to a temporary flagfield column, which is set to identical, new, deleted, or changed for each row (a conceptual sketch follows below).
- Hop the result to Step 3.0 as input.
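If you prefer to think in SQL, the sketch below shows roughly what Merge rows (diff) computes; in PDI the comparison happens in-stream, so no such query is actually run, and the table and column names are again just the sample ones:

```sql
-- Conceptual equivalent of Merge rows (diff), flagging each key as
-- identical, new, deleted or changed (t1 = source, t2 = target):
SELECT COALESCE(s.id, t.id) AS id,
       CASE
           WHEN t.id IS NULL THEN 'new'        -- row exists only in the source
           WHEN s.id IS NULL THEN 'deleted'    -- row exists only in the target
           WHEN s.name IS DISTINCT FROM t.name THEN 'changed'
           ELSE 'identical'
       END AS flagfield
FROM t1 s
FULL OUTER JOIN t2 t ON s.id = t.id;
```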
Separate the difference (Step 3.0)
- Go to Design → Flow → Filter rows
- Here, we will filter out the identical rows from the previous output by giving a condition on the flagfield column, so that only the non-identical rows are passed forward.
- Hop the result to Step 4.0 as input.
Apply changes in target (Step 4.0)
- Go to Design → Output → Delete
- Here, we will use the non-identical rows from the previous output (as flagged in the flagfield column) to delete the corresponding rows from the target table (see the conceptual sketch below).
- Hop the result to Step 5.0 as input.
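The Delete step issues a keyed DELETE against the target table for each incoming row; assuming id is the key field, it conceptually executes:

```sql
-- Conceptual equivalent of the Delete step, run once per incoming
-- row with that row's key value bound to the parameter.
DELETE FROM t2 WHERE id = ?;
```

Rows flagged new have no matching key in the target, so nothing is removed for them; rows flagged deleted or changed have their old versions removed, and the Timestamp-based CDC in Step 5.0 then takes care of re-creating the new and changed rows.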
Timestamp based CDC (Step 5.0)
- Go to Design → Flow → Transformation executor
- Here, we will execute the transformation named ‘Timestamp based CDC’ (discussed in the previous blog), which replicates the insert and update operations.
- Save and run the whole transformation.
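Once the run completes, one way to sanity-check the result is to compare the two tables; a simple (hypothetical) check is to match the row counts:

```sql
-- Run on the source database:
SELECT count(*) FROM t1;

-- Run on the target database; the counts should match once
-- inserts, updates and deletes have been replicated:
SELECT count(*) FROM t2;
```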
Summary
The Pentaho Data Integration (PDI) toolset is vast, with numerous built-in options. We have explored just a few of them to address the drawbacks of a simple timestamp-based Change Data Capture and to effectively replicate ‘Delete’ operations from the source to the target database. Understanding these two aspects is crucial, as the timestamp-based approach is the most commonly used method of Change Data Capture. In this blog post, we explored the design (part 1) of the CDC pipeline. In the next post, we’ll solidify our understanding further by walking through an example that replicates all of the insert, update, and delete operations from the source database to the target database.
For a better understanding, we highly recommend following our series of topics around Data Integration with PostgreSQL (the World’s Most Advanced Open Source Relational Database System). So far, we have explored the installation of the Pentaho Data Integration (PDI) tool and its connection to the PostgreSQL database, learned about Change Data Capture and its methods and the implementation of Timestamp-based Change Data Capture, and topped it off by adding the ability to handle ‘Delete’ operations. Try designing this pipeline, and connect with us if you need any additional information.
Stay tuned!