Open-source Data Engineering with PostgreSQL

Blog-2: Installation and Setup on Ubuntu

INTRODUCTION:

Welcome back to the series on Open-source Data Engineering with PostgreSQL. In this post, we delve into the installation and configuration of Apache Spark and Apache Drill on an Ubuntu environment, building upon the foundational concepts we’ve previously covered. By following this walkthrough, you’ll have all the prerequisites in place for a streamlined installation, setting the stage for a data engineering environment that supports efficient data loading and exploration. Let’s dive in.

Prerequisites:

Before we embark on the installation journey, make sure your Ubuntu operating system meets the following prerequisites:

  • Sufficient disk storage
  • Apache Spark and Apache Drill require Java 8 or Java 11 to run
    • If Java is not already installed, install it with
      • sudo apt-get install openjdk-8-jdk
  • Apache Spark interacts with various data sources, including PostgreSQL. To connect to a Postgres database, we need the PostgreSQL JDBC Driver, which we can download from https://jdbc.postgresql.org/download/
  • Place the PostgreSQL JDBC Driver JAR file in /opt/spark/jars so that it is accessible to Apache Spark.
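To double-check the Java prerequisite, it helps to know that Java 8 reports itself as "1.8.x" in version strings while Java 11 reports "11.x". The helper below is purely illustrative (not part of the install itself) and classifies a version string as supported or not:

```shell
# Illustrative helper: classify a "java -version"-style version string.
# Spark and Drill here expect Java 8 (reported as "1.8.x" or "8.x") or Java 11.
is_supported_java() {
  case "$1" in
    1.8.*|8.*|11.*) echo "supported" ;;
    *)              echo "unsupported" ;;
  esac
}

is_supported_java "1.8.0_392"   # prints "supported"
is_supported_java "17.0.2"      # prints "unsupported"
```

On a real system you would feed it the output of `java -version`; the point is simply that a Java 17 or newer JDK will not satisfy this prerequisite.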

Installation and Setup

Apache Spark:

  • Download Apache Spark (version 3.5.0 as of this writing) from https://spark.apache.org/downloads.html; any other available version can be downloaded as needed.
  • First, update the package lists and upgrade installed packages
    sudo apt-get update
    sudo apt-get upgrade
  • Change to the directory where you want to extract Spark, then extract the tarball
    cd /path/to/your/desired/directory
    tar -xzf spark-3.5.0-bin-hadoop3.tar.gz
  • Update your .bashrc file with the following environment variables
    export HADOOP_HOME=/path/to/your/hadoop
    export SPARK_HOME=/path/to/your/spark
  • Navigate to the Spark directory and launch the interactive Spark shell
    cd /path/to/your/spark
    spark-shell
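After updating .bashrc, a quick sanity check confirms the Spark binaries are reachable from the shell. A minimal sketch, assuming Spark was extracted to /opt/spark/spark-3.5.0-bin-hadoop3 (adjust the path to wherever you extracted it):

```shell
# Assumed install path -- replace with the directory you extracted Spark into.
export SPARK_HOME=/opt/spark/spark-3.5.0-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"

# Confirm the Spark bin directory is now on PATH, so `spark-shell` resolves
# from any working directory.
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "Spark bin is on PATH" ;;
  *)                     echo "Spark bin is missing from PATH" ;;
esac
```

Remember to run `source ~/.bashrc` (or open a new terminal) after editing the file so these exports take effect in your session.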

Apache Drill:

  • Download Apache Drill (version 1.19.0 as of this writing). Note that https://www.apache.org/dyn/closer.cgi/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz is a mirror-selection page, so fetching it with wget returns HTML rather than the tarball; either pick a mirror link from that page in a browser, or fetch the release directly from the Apache archive
    wget https://archive.apache.org/dist/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz
  • First, update the package lists and upgrade installed packages
    sudo apt-get update
    sudo apt-get upgrade
  • Change to the desired directory and extract the tarball
    cd /path/to/your/desired/directory
    tar -xvf apache-drill-1.19.0.tar.gz
  • Navigate to Drill’s bin directory and run any of the following commands to start the Drill (sqlline) shell:
    ./drill-embedded
    ./sqlline
    ./sqlline -u jdbc:drill:zk=local
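For context on the last command: the `zk=local` part of the connection string tells sqlline to start Drill in embedded mode on the local machine rather than locating a Drillbit through a ZooKeeper quorum. A small sketch of how that JDBC URL is assembled (the variable names are illustrative):

```shell
# zk=local -> embedded mode, no ZooKeeper needed; in a distributed cluster
# you would instead point zk at your ZooKeeper host:port list.
DRILL_MODE="zk=local"
JDBC_URL="jdbc:drill:${DRILL_MODE}"
echo "$JDBC_URL"    # prints "jdbc:drill:zk=local"
```

Embedded mode is what we want for this single-machine setup; the distributed variant becomes relevant once Drill runs across multiple nodes.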

What’s next:

This installation guide has equipped you with the essential steps to set up Apache Spark and Apache Drill on your Ubuntu environment. By following these instructions, you’ve laid a solid foundation for efficient data engineering, enabling seamless data loading and exploration.

As you embark on your journey with Spark and Drill, remember that this is just the beginning. In upcoming blogs, we will delve deeper into advanced configurations, optimization strategies, and real-world use cases that harness the full potential of Apache Spark and Apache Drill in your data projects. Happy exploring!
