Introduction
Pentaho Data Integration (PDI) serves as a robust ETL (Extract, Transform, Load) tool, playing a pivotal role in handling the complexities of data ingestion pipelines. As organizations accumulate vast amounts of data from diverse sources and in different formats, orchestrating seamless data movement becomes essential for informed decision-making. PDI stands out as a valuable solution, providing a comprehensive platform to design, deploy, and manage data pipelines efficiently. With its user-friendly interface and a wide range of transformation steps and connectors, PDI facilitates the extraction, transformation, and loading of data across various stages of the pipeline. Importantly, PDI is highly adaptable, making it well-suited for creating configurable data ingestion pipelines that can accommodate diverse data structures and formats. In a series of upcoming blogs, I will delve into the intricacies of configuring end-to-end data ingestion pipelines using PDI. The focus will be on demonstrating how to leverage PDI’s capabilities to handle the challenges posed by the increasing volume and diversity of data, particularly emphasizing its application in conjunction with PostgreSQL databases, ensuring a robust foundation for efficient data management and analytics.
Feel free to explore the previously published blogs on Data Engineering by my colleagues:
Building an Efficient Data Pipeline with PostgreSQL and Talend OpenStudio
Data Engineering with Hydra
Installation:
The basic requirements to install Pentaho Data Integration on the Ubuntu operating system are as follows:
- Pentaho Data Integration Community Edition
- Ubuntu 16 or above
- JDK 11 or above (Java Development Kit)
Step-1: Downloading the Pentaho Data Integration (PDI/Kettle) Software
Download PDI-CE from the SourceForge link. At the time of writing, the latest version of PDI is 9.4; you can download the latest stable version as per your requirements.
Step-2: Extracting the zip file
Extract the downloaded zip file, which will be in the Downloads folder. Right-click the file and choose ‘Extract Here’ if you want it extracted in the same Downloads folder.
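You can also extract the archive from the terminal. The filename below is an assumption based on the 9.4 Community Edition download and will differ for other versions; install the unzip utility first (sudo apt-get install unzip) if it is not already present:
cd ~/Downloads
unzip pdi-ce-9.4.0.0-343.zip
This produces the data-integration folder referenced in the following steps.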
Step-3: Checking Java Availability
Since Pentaho is written in Java, the PDI tool requires Java to run. Check whether Java is installed with the following command:
java -version
If you don’t already have a Java Runtime Environment (JRE) installed, download and install it using the following commands:
sudo apt-get update
sudo apt-get install openjdk-11-jre
Step-4: Launching Spoon
The last step is to launch the Spoon application. For this, go to the folder where we extracted PDI (the data-integration folder) in Step-2. Right-click within this folder, select ‘Open Terminal’, and type the command below:
./spoon.sh
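If Spoon does not start because the script is not executable, or because it picks up the wrong Java installation, the sketch below makes the launcher executable and points Pentaho at OpenJDK 11 through the PENTAHO_JAVA_HOME variable read by spoon.sh; the JDK path assumes the default Ubuntu location for the openjdk-11 package:
chmod +x spoon.sh
export PENTAHO_JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
./spoon.sh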
How to connect to a PostgreSQL Database:
- Before connecting from Spoon, we need to authorize the IP address of the Pentaho server in the pg_hba.conf file:
/etc/postgresql/*/main/pg_hba.conf
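As an illustration, an entry like the one below allows clients on a hypothetical 192.168.1.0/24 subnet (replace it with the network your Pentaho server actually sits on) to reach all databases with password authentication; scram-sha-256 assumes PostgreSQL 10 or newer, while older versions use md5:
host    all    all    192.168.1.0/24    scram-sha-256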
- We need to change the listen_addresses setting from 'localhost' to '*' in the postgresql.conf file:
/etc/postgresql/*/main/postgresql.conf
Locate the line #listen_addresses = 'localhost', uncomment it, and change it to listen_addresses = '*'.
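If you prefer to make this change non-interactively, a minimal sed sketch (assuming Ubuntu's default layout with a single cluster under /etc/postgresql) is:
sudo sed -i "s/^#listen_addresses = 'localhost'/listen_addresses = '*'/" /etc/postgresql/*/main/postgresql.conf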
- Restart the PostgreSQL server.
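On a systemd-based Ubuntu installation, this can be done with:
sudo systemctl restart postgresql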
- Open Spoon and create a new transformation.
- Click on the View option that appears in the upper-left corner, then right-click the Database connections node and select New.


- Fill in the Database Connection dialog window (example values are sketched after this list).
- Click on the Test button. A window reporting the result of the connection test shows up.
- Click on OK to close the test window.
- Click on OK again to close the database definition window. A new database connection is added to the tree.
- Right-click on the database connection and click on Share. The connection is available in all transformations you create from now onwards.
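For reference, the Database Connection dialog values might look like the sketch below; the connection name, host, database, and credentials are placeholders for your own environment, and 5432 is PostgreSQL's default port:
Connection name: pg_blog_demo
Connection type: PostgreSQL
Access: Native (JDBC)
Host name: 192.168.1.50
Database name: analytics
Port number: 5432
Username: pdi_user
Password: ********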
Next steps:
The integration of PostgreSQL with Pentaho Data Integration (PDI) marks just the initial step of a comprehensive Data Engineering journey. Throughout the upcoming series of blog posts, I will guide you through the intricacies of the complete data ingestion process. This journey will provide insights into seamlessly navigating end-to-end processes with Pentaho PDI while specifically focusing on optimizing workflows with PostgreSQL. In the forthcoming blog, our exploration will delve deeper into the sophisticated realm of Change Data Capture (CDC). This pivotal feature ensures that our data pipelines are not only capable of handling static datasets but are also adept at capturing and processing real-time changes within the PostgreSQL database. By unraveling the intricacies of CDC, we aim to equip you with the knowledge to build dynamic, responsive, and intelligent data solutions that align with the evolving nature of your PostgreSQL data sources. Stay tuned for an insightful exploration of CDC and its transformative impact on data integration workflows.