Divide & Conquer with Partitioning and Sharding

As we keep adding data to a database, we often reach a point where both read and write operations on the database become extremely cost prohibitive. While it indicates a thriving business and is good in a way, the flip side is that the database starts becoming sluggish at this point in time, and we start losing business due to the slowness that creeps in. What can we do to avoid falling into this trap, and keep the database slim and slick so that it works like a well-oiled machine, while at the same time accommodating all read and write operations? Well, the answer lies in the age-old concept of Divide & Conquer, and the two most common implementations of this strategy are Partitioning and Sharding. Here, we’ll take a quick look at what these terms mean, and how to make effective use of these techniques. We’ll cover:

What is Sharding?
Why Is Sharding Used?
Types of Sharding
What is Partitioning?
Key differences between Partitioning and Sharding
Partitioning Example
Sharding Example
How to find in which shard the data is present
Conclusion:
When to use Partitioning?
When to use Sharding?

What is Sharding?

Sharding is actually a type of database partitioning, more specifically, Horizontal Partitioning. Each of the partitions is located on a separate server, and is called a “shard”. Sharding involves replicating [copying] the schema, and then dividing the data based on a shard key onto a separate database server instance, to spread load.

We can use the Citus Postgres extension to manage shards efficiently. The Citus extension facilitates automatic partitioning, rebalancing and query routing, which makes it a convenient and efficient solution for scaling a PostgreSQL database.

Why is Sharding Used?

By Sharding a larger table, you can store the new chunks of data, called logical shards, across multiple nodes to achieve horizontal scalability and improved performance.

Types of Sharding

Database Sharding is often implemented in one of these two ways:

Horizontal Sharding is effective when queries tend to return a subset of rows that are often grouped together.
For example, queries that filter data based on short date ranges are ideal for horizontal Sharding since the date range will necessarily limit querying to only a subset of the servers.

Vertical Sharding is effective when queries tend to return only a subset of columns of the data.
For example, if some queries request only names, and others request only addresses, then the names and addresses can be Sharded on to separate servers.

What is Partitioning?

Partitioning is a generic term that just means dividing your logical entities into different physical entities for performance, availability, or some other purpose.

Key differences between Partitioning and Sharding

Sharding and Partitioning are both about breaking up a large data set into smaller subsets. The difference is that Sharding implies the data is spread across multiple computers while Partitioning does not. Partitioning is about grouping subsets of data within a single database instance.

Partitioning Example:

Here is an example of how to shard a table called orders based on the order_date column

postgres=# CREATE TABLE orders (
    order_id BIGSERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL
)
PARTITION BY RANGE (order_date);
CREATE TABLE
postgres=# alter table orders add primary key ( order_id, order_date);
ALTER TABLE

Creating partition tables:

This creates an empty partitioned table and specifies that it should be partitioned by range, using the order_date column as the partition key.

CREATE TABLE orders_2021 PARTITION OF orders
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE orders_2022 PARTITION OF orders
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

This creates three partitions for the orders table, one for each year.

Sharding Example:

Creating and Managing Shards with the Citus database extension:

Citus is a distributed database built entirely as an open-source extension for PostgreSQL. With Citus, you can create tables that are transparently distributed or replicated across a cluster of PostgreSQL nodes.

This is the process to follow to manage Shards using the Citus extension:

A. Install the Citus extension:

CREATE EXTENSION citus;

B. Set up a coordinator node:

SELECT citus_add_node('coordinator_node', 5432);

This sets up a coordinator node that will act as the interface for your sharded database.

C. Set up worker nodes:

SELECT citus_add_node('worker_node1', 5432); SELECT citus_add_node('worker_node2', 5432);

These commands add two worker nodes to the cluster, which will store the sharded data.

D. Create a distributed table:

CREATE TABLE orders (
    order_id SERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL
)
DISTRIBUTED BY (order_date);

This creates a distributed table that will be sharded across the worker nodes based on the order_date column.

E. Insert data into the distributed table:

INSERT INTO orders (order_date, customer_id, ...) VALUES ('2019-01-01', 1, ...); INSERT INTO orders (order_date, customer_id, ...) VALUES ('2020-01-01', 2, ...); INSERT INTO orders (order_date, customer_id, ...) VALUES ('2021-01-01', 3, ...);

This inserts data into the distributed table, which will be automatically sharded across the worker nodes based on the order_date.

How to find in which shard the data is present

The rows of a distributed table are grouped into shards, and each shard is placed on a worker node in the Citus cluster. In the multi-tenant Citus use case we can determine which worker node contains the rows for a specific tenant by putting together two pieces of information: the shard id associated with the tenant id, and the shard placements on workers.

To find the worker node holding the data for order date = 2019-01-01, ask for the placement of rows whose distribution column has value 2019-01-01:

SELECT shardid, shardstate, shardlength, nodename, nodeport, placementid
  FROM pg_dist_placement AS placement,
       pg_dist_node AS node
 WHERE placement.groupid = node.groupid
   AND node.noderole = 'primary'
   AND shardid = (
     SELECT get_shard_id_for_distribution_column('orders', '2019-01-01')
   );

Conclusion:

While both Partitioning and Sharding help to split up huge volumes of data, and aid in performance and scalability, these options should be implemented only after careful consideration, as they could introduce certain level of complexity to the data layer. Besides we must also be cognizant of the additional costs and potential security vulnerabilities if we want to host the data on multiple Postgres instances.

When to use Partitioning

Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.

When to use Sharding?

Sharding is a great solution for applications with large data requirements and high-volume read/write workloads, but it does come with additional complexity. Consider whether the benefits outweigh the costs or whether there is a simpler solution before you begin implementation.