Open-source Data Engineering with PostgreSQL

Blog-2: Installation and Setup on Ubuntu

INTRODUCTION:

Welcome back to the series on Open-source Data Engineering with PostgreSQL. In this post, we delve into the installation and configuration of Apache Spark and Apache Drill on an Ubuntu environment, building upon the foundational concepts we’ve previously covered. By following this walkthrough, you’ll have all the prerequisites in place for a streamlined installation, setting the stage for a data engineering environment that supports efficient data loading and exploration. Let’s dive in.

Prerequisites:

Before we embark on the installation journey, make sure your Ubuntu operating system meets the following prerequisites:

  • Sufficient disk storage
  • Apache Spark and Apache Drill require Java 8 or Java 11 to run
    • If Java is not already installed, install it with
      • sudo apt-get install openjdk-8-jdk
  • Apache Spark interacts with various data sources, including PostgreSQL. To connect to a Postgres database, we need the PostgreSQL JDBC Driver, which we can download from https://jdbc.postgresql.org/download/
  • Place the PostgreSQL JDBC Driver JAR file in /opt/spark/jars so that it is accessible to Apache Spark.
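To double-check the Java prerequisite, it helps to know that Java 8 reports itself as "1.8.x" in version strings while Java 11 reports "11.x". The helper below is purely illustrative (not part of the install itself) and classifies a version string as supported or not:

```shell
# Illustrative helper: classify a "java -version"-style version string.
# Spark and Drill here expect Java 8 (reported as "1.8.x" or "8.x") or Java 11.
is_supported_java() {
  case "$1" in
    1.8.*|8.*|11.*) echo "supported" ;;
    *)              echo "unsupported" ;;
  esac
}

is_supported_java "1.8.0_392"   # prints "supported"
is_supported_java "17.0.2"      # prints "unsupported"
```

On a real system you would feed it the output of `java -version`; the point is simply that a Java 17 or newer JDK will not satisfy this prerequisite.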

Installation and Setup

Apache Spark:

  • Download Apache Spark (version 3.5.0 as of this writing) from https://spark.apache.org/downloads.html; any other available version can be downloaded as needed.
  • First, update the package lists and upgrade installed packages
    sudo apt-get update
    sudo apt-get upgrade
  • Change to the directory where you want to extract Spark, then extract the tarball
    cd /path/to/your/desired/directory
    tar -xzf spark-3.5.0-bin-hadoop3.tar.gz
  • Update your .bashrc file with the following environment variables
    export HADOOP_HOME=/path/to/your/hadoop
    export SPARK_HOME=/path/to/your/spark
  • Navigate to the Spark directory and launch the interactive Spark shell
    cd /path/to/your/spark
    spark-shell
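After updating .bashrc, a quick sanity check confirms the Spark binaries are reachable from the shell. A minimal sketch, assuming Spark was extracted to /opt/spark/spark-3.5.0-bin-hadoop3 (adjust the path to wherever you extracted it):

```shell
# Assumed install path -- replace with the directory you extracted Spark into.
export SPARK_HOME=/opt/spark/spark-3.5.0-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"

# Confirm the Spark bin directory is now on PATH, so `spark-shell` resolves
# from any working directory.
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "Spark bin is on PATH" ;;
  *)                     echo "Spark bin is missing from PATH" ;;
esac
```

Remember to run `source ~/.bashrc` (or open a new terminal) after editing the file so these exports take effect in your session.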

Apache Drill:

  • Download Apache Drill (version 1.19.0 as of this writing). Note that https://www.apache.org/dyn/closer.cgi/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz is a mirror-selection page, so fetching it with wget returns HTML rather than the tarball; either pick a mirror link from that page in a browser, or fetch the release directly from the Apache archive
    wget https://archive.apache.org/dist/drill/drill-1.19.0/apache-drill-1.19.0.tar.gz
  • First, update the package lists and upgrade installed packages
    sudo apt-get update
    sudo apt-get upgrade
  • Change to the desired directory and extract the tarball
    cd /path/to/your/desired/directory
    tar -xvf apache-drill-1.19.0.tar.gz
  • Navigate to Drill’s bin directory and run any of the following commands to start the Drill (sqlline) shell:
    ./drill-embedded
    ./sqlline
    ./sqlline -u jdbc:drill:zk=local
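For context on the last command: the `zk=local` part of the connection string tells sqlline to start Drill in embedded mode on the local machine rather than locating a Drillbit through a ZooKeeper quorum. A small sketch of how that JDBC URL is assembled (the variable names are illustrative):

```shell
# zk=local -> embedded mode, no ZooKeeper needed; in a distributed cluster
# you would instead point zk at your ZooKeeper host:port list.
DRILL_MODE="zk=local"
JDBC_URL="jdbc:drill:${DRILL_MODE}"
echo "$JDBC_URL"    # prints "jdbc:drill:zk=local"
```

Embedded mode is what we want for this single-machine setup; the distributed variant becomes relevant once Drill runs across multiple nodes.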

What’s next:

This installation guide has equipped you with the essential steps to set up Apache Spark and Apache Drill on your Ubuntu environment. By following these instructions, you’ve laid a solid foundation for efficient data engineering, enabling seamless data loading and exploration.

As you embark on your journey with Spark and Drill, remember that this is just the beginning. In upcoming blogs, we will delve deeper into advanced configurations, optimization strategies, and real-world use cases that harness the full potential of Apache Spark and Apache Drill in your data projects. Happy exploring!
