DATABASE SHARDING - OpenSourceDB

What is Database Sharding?

Database sharding is the process of storing a large database across multiple machines. A single machine, or database server, can store and process only a limited amount of data. Database sharding overcomes this limitation by splitting data into smaller chunks, called shards, and storing them across several database servers. All database servers usually have the same underlying technologies, and they work together to store and process large volumes of data.

Advantages of Database Sharding:

Solve Scalability Issues: With a single database server, an application's performance degrades as its user base grows. Read and write queries become slower, network bandwidth starts to saturate, and at some point the server runs out of disk space. Database sharding addresses these issues by partitioning the data across multiple machines.

High Availability: A problem with a single-server architecture is that an outage makes the entire application unavailable, which is unacceptable for a website with a large number of users. This is not the case with a sharded database. If an outage happens in a sharded architecture, only the affected shards go down. All the other shards continue to operate, so the application does not become unavailable for every user.

Speed Up Query Response Time: When you submit a query to a large monolithic database with no sharded architecture, it takes more time to find the result. In the worst case the database has to scan every row in the table, which slows down the response time. In a sharded database, a query that targets a single shard has to go through fewer rows, so you receive the response in less time.

More Write Bandwidth: For many applications, writing is a major bottleneck. With no master database serialising writes, a sharded architecture lets you write in parallel and increase your write throughput.

Scaling Out: Sharding a database facilitates horizontal scaling, also known as scaling out. In horizontal scaling, you add more machines to the network and distribute the load across them for faster processing and response. This has many advantages. You can do more work simultaneously and handle a high volume of user requests, especially writes, because there are parallel paths through your system. You can also load balance web servers that access shards over different network paths, so that requests are processed by different CPUs and use separate RAM caches or disk IO paths.

Disadvantages of Database Sharding:

Adds Complexity to the System: You need to be careful when implementing a sharded database architecture in an application. It is a complicated task, and if it is not implemented properly you may lose data or end up with corrupted tables. You also need to manage data in multiple shard locations instead of managing and accessing it from a single entry point. This changes your team's workflow, which can be disruptive.

Rebalancing Data: In a sharded database architecture, shards sometimes become unbalanced (one shard outgrows the others). Consider a database with two shards. One shard stores customers whose names begin with the letters A through M, and the other stores customers whose names begin with N through Z. If a large number of customers have names starting with L, the first shard will hold more data than the second. This slows the application down, and it may stall for a significant portion of your users. The overloaded A-M shard is known as a database hotspot. To overcome this problem you need to re-shard so that the data is evenly distributed again. Moving data from one shard to another is costly because it can require a lot of downtime.

Joining Data From Multiple Shards is Expensive: In a single database, joins are easy to perform. In a sharded architecture, you need to pull the data from different shards and perform the join across multiple networked servers. You can't submit a single query to get the data from various shards. You need to submit multiple queries, one for each shard, pull out the data, and join it across the network. This is an expensive and time-consuming process, and it adds latency to your system.
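The per-shard query and application-side join described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the shards are plain in-memory Python structures standing in for separate servers, and every name here (CUSTOMER_SHARDS, ORDER_SHARDS, fetch_all_customers, fetch_all_orders, join_orders_with_customers) is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for shards. A real system would issue
# one network query per shard instead of reading a Python object.
CUSTOMER_SHARDS: List[Dict[int, str]] = [
    {1: "Alice", 3: "Carol"},  # customer shard 0: customer_id -> name
    {2: "Bob", 4: "Dave"},     # customer shard 1
]
ORDER_SHARDS: List[List[Tuple[int, int, float]]] = [
    [(100, 2, 25.0), (101, 1, 40.0)],  # (order_id, customer_id, amount)
    [(102, 3, 15.5)],
]


def fetch_all_customers() -> Dict[int, str]:
    """Query every customer shard and merge the partial results."""
    merged: Dict[int, str] = {}
    for shard in CUSTOMER_SHARDS:  # one "query" per shard
        merged.update(shard)
    return merged


def fetch_all_orders() -> List[Tuple[int, int, float]]:
    """Query every order shard and concatenate the partial results."""
    orders: List[Tuple[int, int, float]] = []
    for shard in ORDER_SHARDS:  # one "query" per shard
        orders.extend(shard)
    return orders


def join_orders_with_customers() -> List[Tuple[int, str, float]]:
    """Join orders to customer names in application code, not in the database."""
    customers = fetch_all_customers()
    return sorted(
        (order_id, customers[customer_id], amount)
        for order_id, customer_id, amount in fetch_all_orders()
        if customer_id in customers
    )


if __name__ == "__main__":
    for row in join_orders_with_customers():
        print(row)
```

Every shard is visited even though only three orders exist, which is where the extra latency comes from once each visit is a network round trip.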

No Native Support: Sharding is not natively supported by every database engine. For example, PostgreSQL doesn’t include automatic sharding features, so there you have to shard manually and follow a “roll-your-own” approach. With such an approach it can be difficult to find documentation or tips, and to troubleshoot problems that come up during implementation.

Different Types of Database Sharding:

1.Key Based Sharding:

If you are familiar with hashing, this concept is easy to understand. Hashing is commonly used to store key-value pairs, where each key maps to a value. Analogously, key-based sharding uses a hash function that takes some data from a row (the shard key) and maps it to a value that identifies the shard in which the row should be stored.

Example:

Consider an example in which you have 3 database servers and each request has an application ID that is incremented by 1 every time a new application is registered. To determine which server the data should be placed on, we perform a modulo operation on the application ID with the number 3. The remainder then identifies the server that stores the data.
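The modulo scheme in this example can be sketched in a few lines of Python. This is a minimal illustration rather than a production router: the names shard_for and NUM_SHARDS are hypothetical, and application IDs are assumed to be non-negative integers.

```python
NUM_SHARDS = 3  # number of database servers, as in the example above


def shard_for(application_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the 0-based index of the server that stores this application ID."""
    if application_id < 0:
        raise ValueError("application_id must be non-negative")
    # The remainder of the division picks the server.
    return application_id % num_shards


if __name__ == "__main__":
    for app_id in range(1, 7):
        print(f"application {app_id} -> server {shard_for(app_id)}")
```

Because the IDs are sequential, consecutive applications land on servers 1, 2, 0, 1, 2, 0 and so on, which spreads new rows evenly across the three servers.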

The downside of this method is that it does not scale elastically: adding or removing database servers dynamically is a difficult and expensive process. For example, if you add 5 more servers to the setup above, you need corresponding hash values for the additional servers, so the hash function has to change from modulo 3 to modulo 8. The majority of the existing keys then map to a different hash value and must be migrated to a new server. While the migration is in progress, neither the old nor the new hash function is valid for every key. During the migration, your application may be unable to serve a large number of requests, and you can experience downtime until the migration completes.
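To get a feel for how many keys have to move, the sketch below counts the keys whose server changes when the modulus goes from 3 to 8. The function name moved_fraction and the choice of 24,000 sequential IDs are assumptions made for this illustration.

```python
from typing import Iterable


def moved_fraction(keys: Iterable[int], old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard index changes when the modulus changes."""
    key_list = list(keys)
    moved = sum(1 for k in key_list if k % old_shards != k % new_shards)
    return moved / len(key_list)


if __name__ == "__main__":
    # Sequential IDs 0..23999, growing from 3 servers to 8 (3 plus 5 more).
    fraction = moved_fraction(range(24_000), old_shards=3, new_shards=8)
    print(f"{fraction:.1%} of keys map to a different server")
```

For this key set the result is 87.5%: in every block of 24 consecutive IDs, only 3 keep the same remainder under both modulo 3 and modulo 8, so the other 21 have to be migrated.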

Note: A shard key shouldn’t contain values that might change over time. It should always be static; otherwise a change to the key can force a row to move to a different shard, which slows down performance.

2.Range Based Sharding:

Range-based sharding is the simplest sharding method to implement. Every shard holds a different set of data, but they all have the same schema as the original database. In this method, you just need to identify which range your data falls into, and then you can store the entry in the corresponding shard. This method is best suited to storing non-static data (for example, the contact info for students in a college).

The drawback of this method is that the data may not be evenly distributed across shards. For example, if records are split by the first letter of a name, with A-P on the first shard and the remaining letters on the second, you might have far more names that fall into the A-P range. In such cases, the first shard has to take more load than the second one and can become a system bottleneck.
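A range lookup of this kind, together with the uneven distribution it can produce, might look like the following sketch. The A-P / Q-Z split, the sample names, and the identifiers RANGE_BOUNDARIES, shard_for_name and shard_counts are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter
from typing import Iterable, List

# Hypothetical split: names starting with A-P go to shard 0, Q-Z to shard 1.
# Each boundary is the first letter that belongs to the NEXT shard.
RANGE_BOUNDARIES: List[str] = ["Q"]


def shard_for_name(name: str) -> int:
    """Return the shard index for a name, based on its first letter."""
    first = name.strip()[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"unsupported name: {name!r}")
    # bisect_right counts how many boundaries are <= the first letter.
    return bisect_right(RANGE_BOUNDARIES, first)


def shard_counts(names: Iterable[str]) -> Counter:
    """Count how many names land on each shard."""
    return Counter(shard_for_name(n) for n in names)


if __name__ == "__main__":
    sample = ["Adams", "Brown", "Clark", "Lopez", "Lee", "Patel", "Quinn", "Young"]
    print(shard_counts(sample))
```

With this sample, six of the eight names land on shard 0 and only two on shard 1, which is the kind of skew described above. Adding more boundaries to RANGE_BOUNDARIES yields more shards without changing the lookup code.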

3.Vertical Based Sharding:

In this method, we split entire columns out of a table and put them into new, distinct tables. The data in one partition is completely independent of the data in the others, and each partition holds both distinct rows and distinct columns. Take Twitter as an example. We can split the different features of an entity into different shards on different machines. On Twitter, a user might have a profile, a number of followers, and some tweets they have posted. We can place the user profiles on one shard, the followers on a second shard, and the tweets on a third shard.
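The feature-to-shard split in the Twitter example can be sketched as a simple routing table. The shard names and the functions shard_for_feature and locations_for_user are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict

# Hypothetical mapping from a user feature to the shard that stores it.
FEATURE_TO_SHARD: Dict[str, str] = {
    "profile": "shard-profiles",
    "followers": "shard-followers",
    "tweets": "shard-tweets",
}


def shard_for_feature(feature: str) -> str:
    """Return the shard that holds the given feature's data."""
    try:
        return FEATURE_TO_SHARD[feature]
    except KeyError:
        raise ValueError(f"unknown feature: {feature!r}") from None


def locations_for_user(user_id: int) -> Dict[str, str]:
    """Describe where each part of one user's data would be read from."""
    return {
        feature: f"{shard}/user/{user_id}"
        for feature, shard in FEATURE_TO_SHARD.items()
    }


if __name__ == "__main__":
    print(locations_for_user(42))
```

Loading a complete user in this layout touches three shards, one per feature, whereas a request that needs only tweets touches just one.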

4.Directory-Based Sharding:

To keep track of where data lives, this architecture uses a lookup table that tells you which shard stores a given piece of data. This approach is more flexible, because you are free to choose the ranges of values in the lookup table or to assign shards based on an algorithm. One drawback is that every query has to consult the lookup table first to locate the data it needs. Another is that the lookup table is a single point of failure: the whole system fails if the lookup table crashes, because this architecture cannot function without it.
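A lookup table of this kind could be sketched as below. The class ShardDirectory and its methods assign and locate are hypothetical names, and the table is kept in memory purely for illustration; a real deployment would need to store and replicate it reliably.

```python
from typing import Dict


class ShardDirectory:
    """In-memory lookup table mapping a key to a shard name (illustrative only)."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def assign(self, key: str, shard: str) -> None:
        """Record, or change, which shard holds the data for this key."""
        self._table[key] = shard

    def locate(self, key: str) -> str:
        """Every query calls this first to find the shard to talk to."""
        try:
            return self._table[key]
        except KeyError:
            raise LookupError(f"no shard registered for key {key!r}") from None


if __name__ == "__main__":
    directory = ShardDirectory()
    directory.assign("customer:alice", "shard-1")
    directory.assign("customer:bob", "shard-2")
    print(directory.locate("customer:alice"))

    # Moving a key only requires updating the table (after copying the data).
    directory.assign("customer:alice", "shard-3")
    print(directory.locate("customer:alice"))
```

The flexibility comes from assign: any key can be placed on any shard without changing a hash function. The cost is the extra locate call on every query, and the dependence of the whole system on this one table.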

What are the limitations of Database Sharding?

Just like every other technique, creating Shards also has its own limitations. Some of the limitations are:

Complicated to implement.
Can easily lead to crashes and failure if not implemented properly.
Difficult to maintain Data Integrity.
Can cause data loss.
Very few Databases have an in-built Sharding mechanism.
Sometimes the query performance decreases due to the increasing number of Shards.

When should I consider Database Sharding?

If the amount of your application data is growing rapidly and you run out of capacity on a single server, sharding can help you distribute the load across multiple servers and increase capacity.

If the number of reads or writes to your database exceeds what a single node or its read replicas can handle, you will observe slower response times or timeouts. Sharding can help you distribute the load and improve performance by allowing each shard to be optimized for specific queries or workloads.

Similarly, you can also face slower response times or timeouts when the network bandwidth needed by the application surpasses the bandwidth available to a single database node and any read replicas.

What are the challenges in Database Sharding?

Data Hotspots: Data may be distributed unevenly, so that a few shards store more data than the others and therefore require more computational resources.

Operational Complexity: Rather than managing a single database, you have to maintain multiple shards. When querying, developers must read several shards and integrate the pieces of information.

Infrastructure Costs: As you increase the number of shards, the cost directly increases. Your maintenance costs will also shoot up.

Conclusion:

We’ve discussed sharding, when to use it, and how it can be set up. Sharding is an excellent solution for applications that need to manage a large amount of data and keep it readily available under high volumes of reads and writes. Still, it makes the system more complicated to operate. Before you start implementation, you should think about whether the benefits are worth the costs or whether there is a more straightforward solution.