Sharding

DATABASE SHARDING

What is Database Sharding?

Database sharding is the process of storing a large database across multiple machines. A single machine, or database server, can store and process only a limited amount of data. Database sharding overcomes this limitation by splitting data into smaller chunks, called shards, and storing them across several database servers. All database servers usually have the same underlying technologies, and they work together to store and process large volumes of data.

Advantages of Database Sharding:

Solve Scalability Issue: With a single database server architecture any application experiences performance degradation when users start growing on that application.  Read and write queries become slower and the network bandwidth starts to saturate.  At some point, you will be running out of disk space. Database sharding fixes all these issues by partitioning the data across multiple machines.

High Availability: A problem with single server architecture is that if an outage happens then the entire application will be unavailable which is not good for a website with more number of users. This is not the case with a sharded database. If an outage happens in sharded architecture, then only some specific shards will be down. All the other shards will continue to operate and the entire application won’t become unavailable for the users.

Speed Up Query Response Time: When you submit a query in an application with a large monolithic database and have no sharded architecture, it takes more time to find the result. It has to search every row in the table and that slows down the response time for the query you have given. This doesn’t happen in sharded architecture. In a sharded database a query has to go through fewer rows and you receive the response in less time.

More Write Bandwidth: For many applications writing is a major bottleneck. With no master database serialising writes, sharded architecture allows you to write in parallel and increase your write throughput.Scaling Out: Sharding a database facilitates horizontal scaling, known as scaling out. In horizontal scaling, you add more machines in the network and distribute the load on these machines for faster processing and response. This has many advantages. You can do more work simultaneously and you can handle high requests from the users, especially when writing data because there are parallel paths through your system. You can also load balance web servers that access shards over different network paths, which are processed by different CPUs, and use separate caches of RAM or disk IO paths to process work.

Disadvantages of Database Sharding;

Adds Complexity in the System: You need to be careful while implementing a proper sharded database architecture in an application. It’s a complicated task and if it’s not implemented properly, then you may lose the data or get corrupted tables in your database. You also need to manage the data from multiple shard locations instead of managing and accessing it from a single entry point. This may affect the workflow of your team which can be potentially disruptive to some teams.

Rebalancing Data: In a sharded database architecture, sometimes shards become unbalanced (when one shard outgrows other shards). Consider an example that you have two shards of a database. One shard store the name of the customers begins with letter A through M. Another shard store the name of the customer begins with the letters N through Z. If there are so many users with the letter L then shard one will have more data than shard two. This will affect the performance (slow down) of the application and it will stall out for a significant portion of your users. The A-M shard will become unbalanced and it will be known as a database hotspot. To overcome this problem and to rebalance the data, you would need to do re-sharding for even data distribution. Moving data from one shard to another shard is not a good idea because it requires a lot of downtime.

Joining Data From Multiple Shards is Expensive: In a single database, joins can be performed easily to implement any functionalities. But in a sharded architecture, you need to pull the data from different shards and you need to perform joins across multiple networked servers. You can’t submit a single query to get the data from various shards. You need to submit multiple queries – one for each one of the shards, pull out the data, and join the data across the network. This is going to be a very expensive and time-consuming process. It adds latency to your system.

No Native Support: Sharding is not natively supported by every database engine. For example, PostgreSQL doesn’t include automatic sharding features, so there you have to do manual sharding. You need to follow the “roll-your-own” approach. It will be difficult for you to find the tips or documentation for sharding and troubleshoot the problem during the implementation of sharding.

Different Types of Database Shardings:

1.Key Based Sharding:

   If you are familiar with the concept of Hashing then this concept is easy to understand. Hashing is popular to store key-value pairs. Each key has a unique value. Analogically the Key Based Sharding has a Hash function that maps each row to its Shard by taking in some data from the row and mapping it to the unique value which is the Shard in which the data should be stored. 

Example:

Consider an example that you have 3 database servers and each request has an application id which is incremented by 1 every time a new application is registered. To determine which server data should be placed on, we perform a modulo operation on these applications id with the number 3. Then the remainder is used to identify the server to store our data.

      The downside of this method is elastic load balancing which means if you will try to add or remove the database servers dynamically it will be a difficult and expensive process. For example, in the above one if you will add 5 more servers then you need to add more corresponding hash values for the additional entries. Also, the majority of the existing keys need to be remapped to their new, correct hash value and then migrated to a new server. The hash function needs to be changed from modulo 3 to modulo 8. While the migration of data is in effect both the new and old hash functions won’t be valid. During the migration, your application won’t be able to service a large number of requests and you’ll experience downtime for your application till the migration completes.

Note: A shard shouldn’t contain values that might change over time. It should be always static otherwise it will slow down the performance.

2.Range Based Sharding:

Range-based sharding is the simplest sharding method to implement. Every shard holds a different set of data but they all have the same schema as the original database. In this method, you just need to identify in which range your data falls, and then you can store the entry to the corresponding shard. This method is best suitable for storing non-static data (example: storing the contact info for students in a college.)

The drawback of this method is that the data may not be evenly distributed on shards. In the above example, you might have a lot of customers whose names fall into the category of A-P. In such cases, the first shard will have to take more load than the second one and it can become a system bottleneck.

3.Vertical Based Sharding:

In this method, we split the entire column from the table and we put those columns into new distinct tables. Data is totally independent of one partition to the other ones. Also, each partition holds both distinct rows and columns. Take the example of Twitter features. We can split different features of an entity in different shards on different machines. On Twitter, users might have a profile, number of followers, and some tweets posted by themselves. We can place the user profiles on one shard, followers in the second shard, and tweets on a third shard.

4.Direct-Based Sharding:

To keep track of the data in a Database Shard, this architecture uses lookup tables. The lookup table can give you information about where the data is stored. This Database Sharding architecture is more flexible as it allows you to have freedom over the range of values in the lookup table, or create Shards based on algorithms and so on. The only drawback here is that every single time a query needs to execute, it needs to consult a lookup table to locate the concerned data. Also, the whole system will fail if the lookup table crashes because this architecture cannot function without it.

What are the limitations of Database Sharding?

Just like every other technique, creating Shards also has its own limitations. Some of the limitations are:

  • Complicated to implement.
  • Can easily lead to crashes and failure if not implemented properly.
  • Difficult to maintain Data Integrity
  • Can cause data loss.
  • Very few Databases have an in-built Sharding mechanism.
  • Sometimes the query performance decreases due to the increasing number of Shards.

When should I consider Database Sharding?

If the amount of your application data is growing rapidly and you run out of capacity on a single server, sharding can help you distribute the load across multiple servers and increase capacity.

If the number of reads or writes to your database exceeds what a single node or its read replicas can handle, you will observe slower response times or timeouts. Sharding can help you distribute the load and improve performance by allowing each shard to be optimized for specific queries or workloads.

Similarly, you can also face slower response times or timeouts when the network bandwidth needed by the application surpasses the bandwidth available to a single database node and any read replicas.

What are the challenges in Database Sharding?

Data Hotspots: There might be an imbalance in terms of data distribution as a few shards may have to store more data and hence will require more computational resources.

Operational Complexity: Rather than managing a single database, you have to maintain multiple shards. When querying, developers must read several shards and integrate the pieces of information. 

Infrastructure Costs: As you increase the number of shards, the cost directly increases. Your maintenance costs will also shoot up.

Conclusion:

We’ve discussed sharding, when to use it and how it can be set up. Sharding is an excellent solution for applications that need to manage a large amount of data and have it readily available for high amounts of reading and writing. Still, it makes things more complicated to operate. Before you start implementation, you should think about whether the benefits are worth the costs or if there is a more straightforward solution.

Why do we need multi-master replication in PostgreSQL and why now?

Introduction:

In today’s data-driven world, organizations face ever-increasing demands for scalability, high availability, and global data access. Traditional single-master replication architectures in PostgreSQL are no longer sufficient to meet these evolving needs. This blog explores the significance of adopting multi-master replication in PostgreSQL, highlighting its ability to empower businesses with enhanced scalability, resilience, and efficient data distribution across geographically dispersed environments.

Harnessing the Strengths of the Postgres Ecosystem:

	Upcoming native Support and Feature Integration

	Seamless Integration with Existing Tools and Extensions

	Robust Management and Monitoring Capabilities
  • Achieving Unprecedented/Ultra High Availability
    • Multi-Master replication allows multiple nodes to accept both read and write operations simultaneously, eliminating the single point of failure associated with traditional single-master architectures.
    • Organizations can ensure continuous availability of their critical applications and services, even in the face of node failures or maintenance operations.
  • Scaling Beyond Horizons
    • The ability to distribute read and write operations across multiple nodes empowers organizations to scale their database systems horizontally.
    • Multi-master replication unlocks the potential for handling high volumes of concurrent requests, supporting growing user bases and workloads.
  • Geographical Distribution and Reduced Latency
    • With Multi-Master replication, organizations can deploy database nodes in different geographical regions, bringing data closer to end-users.
    • Local writes on distributed nodes result in reduced latency, providing a seamless experience for users regardless of their location.
  • Enhanced Data Consistency and Conflict Resolution
    • Multi-Master replication provides sophisticated conflict resolution mechanisms to handle concurrent updates on multiple nodes.
    • This ensures data integrity and consistency across the distributed system, maintaining a single source of truth for applications and users.
  • Supporting Active-Active Architectures
    • Organizations require the flexibility to distribute data modifications across multiple nodes in active-active architectures.
    • Multi-Master replication facilitates de-centralized writes, empowering businesses to design resilient and scalable systems that can handle distributed workloads.
  • Future-Proofing Database Infrastructure
    • Implementing Multi-Master replication in PostgreSQL future-proofs database infrastructure by adopting a scalable and highly available architecture.
    • Organizations can adapt to evolving business needs, such as accommodating increased user demands, expanding into new markets, or integrating with partner systems.

Conclusion

PostgreSQL’s multi-master replication is no longer a luxury but a necessity for organizations striving to deliver high-performance, fault-tolerant, and globally scalable applications. By embracing Multi-Master replication, businesses can achieve unprecedented levels of high availability, scalability, and data consistency, empowering them to meet the demands of a rapidly evolving digital landscape. PostgreSQL’s robust support for multi-master replication positions it as a leading choice for organizations seeking to build resilient and future-proof database infrastructures.

Divide & Conquer with Partitioning and Sharding

As we keep adding data to a database, we often reach a point where both read and write operations on the database become extremely cost prohibitive. While it indicates a thriving business and is good in a way, the flip side is that the database starts becoming sluggish at this point in time, and we start losing business due to the slowness that creeps in. What can we do to avoid falling into this trap, and keep the database slim and slick so that it works like a well-oiled machine, while at the same time accommodating all read and write operations? Well, the answer lies in the age-old concept of Divide & Conquer, and the two most common implementations of this strategy are Partitioning and Sharding. Here, we’ll take a quick look at what these terms mean, and how to make effective use of these techniques. We’ll cover:

  • What is Sharding?
  • Why Is Sharding Used?
  • Types of Sharding
  • What is Partitioning?
  • Key differences between Partitioning and Sharding
  • Partitioning Example
  • Sharding Example
  • How to find in which shard the data is present
  • Conclusion:
    When to use Partitioning?
    When to use Sharding?

What is Sharding?

Sharding is actually a type of database partitioning, more specifically, Horizontal Partitioning. Each of the partitions is located on a separate server, and is called a “shard”.  Sharding involves replicating [copying] the schema, and then dividing the data based on a shard key onto a separate database server instance, to spread load.

We can use the Citus Postgres extension to manage shards efficiently. The Citus extension facilitates automatic partitioning, rebalancing and query routing, which makes it a convenient and efficient solution for scaling a PostgreSQL database.

Why is Sharding Used?

By Sharding a larger table, you can store the new chunks of data, called logical shards, across multiple nodes to achieve horizontal scalability and improved performance.

Types of Sharding

Database Sharding is often implemented in one of these two ways:

Horizontal Sharding is effective when queries tend to return a subset of rows that are often grouped together.
For example, queries that filter data based on short date ranges are ideal for horizontal Sharding since the date range will necessarily limit querying to only a subset of the servers.

Vertical Sharding is effective when queries tend to return only a subset of columns of the data.
For example, if some queries request only names, and others request only addresses, then the names and addresses can be Sharded on to separate servers.

What is Partitioning?

Partitioning is a generic term that just means dividing your logical entities into different physical entities for performance, availability, or some other purpose.

Key differences between Partitioning and Sharding

Sharding and Partitioning are both about breaking up a large data set into smaller subsets. The difference is that Sharding implies the data is spread across multiple computers while Partitioning does not. Partitioning is about grouping subsets of data within a single database instance.

Partitioning Example:

Here is an example of how to shard a table called orders based on the order_date column

postgres=# CREATE TABLE orders (
    order_id BIGSERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL
)
PARTITION BY RANGE (order_date);
CREATE TABLE
postgres=# alter table orders add primary key ( order_id, order_date);
ALTER TABLE

Creating partition tables:

This creates an empty partitioned table and specifies that it should be partitioned by range, using the order_date column as the partition key.

CREATE TABLE orders_2021 PARTITION OF orders
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE orders_2022 PARTITION OF orders
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

This creates three partitions for the orders table, one for each year.

Sharding Example:

Creating and Managing Shards with the Citus database extension:

Citus is a distributed database built entirely as an open-source extension for PostgreSQL. With Citus, you can create tables that are transparently distributed or replicated across a cluster of PostgreSQL nodes.

This is the process to follow to manage Shards using the Citus extension:

A. Install the Citus extension:

CREATE EXTENSION citus;

B. Set up a coordinator node:

SELECT citus_add_node('coordinator_node', 5432);

This sets up a coordinator node that will act as the interface for your sharded database.

C. Set up worker nodes:

SELECT citus_add_node('worker_node1', 5432);
SELECT citus_add_node('worker_node2', 5432);

These commands add two worker nodes to the cluster, which will store the sharded data.

D. Create a distributed table:

CREATE TABLE orders (
    order_id SERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL
)
DISTRIBUTED BY (order_date);

This creates a distributed table that will be sharded across the worker nodes based on the order_date column.

E. Insert data into the distributed table:

INSERT INTO orders (order_date, customer_id, ...) VALUES ('2019-01-01', 1, ...);
INSERT INTO orders (order_date, customer_id, ...) VALUES ('2020-01-01', 2, ...);
INSERT INTO orders (order_date, customer_id, ...) VALUES ('2021-01-01', 3, ...);

This inserts data into the distributed table, which will be automatically sharded across the worker nodes based on the order_date.

How to find in which shard the data is present

The rows of a distributed table are grouped into shards, and each shard is placed on a worker node in the Citus cluster. In the multi-tenant Citus use case we can determine which worker node contains the rows for a specific tenant by putting together two pieces of information: the shard id associated with the tenant id, and the shard placements on workers.

To find the worker node holding the data for order date = 2019-01-01, ask for the placement of rows whose distribution column has value 2019-01-01:

SELECT shardid, shardstate, shardlength, nodename, nodeport, placementid
  FROM pg_dist_placement AS placement,
       pg_dist_node AS node
 WHERE placement.groupid = node.groupid
   AND node.noderole = 'primary'
   AND shardid = (
     SELECT get_shard_id_for_distribution_column('orders', '2019-01-01')
   );

Conclusion:

While both Partitioning and Sharding help to split up huge volumes of data, and aid in performance and scalability, these options should be implemented only after careful consideration, as they could introduce certain level of complexity to the data layer. Besides we must also be cognizant of the additional costs and potential security vulnerabilities if we want to host the data on multiple Postgres instances.

When to use Partitioning

Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern. For example, you can archive older data in cheaper data storage.

When to use Sharding?

Sharding is a great solution for applications with large data requirements and high-volume read/write workloads, but it does come with additional complexity. Consider whether the benefits outweigh the costs or whether there is a simpler solution before you begin implementation.