Database partitioning is the backbone of distributed database management systems. It is the process of dividing a large dataset into several smaller partitions, which can be placed on different machines. In other words, it is a way of splitting data objects such as tables and index-organized tables into smaller pieces so that the data can be accessed and managed more easily.
As services and the user base grow, it becomes difficult for a single database server to function efficiently, and we may see performance degrade under a single-server architecture. Here are some situations that could arise:
Database partitioning helps us address all the above challenges by distributing data across several partitions. Each partition may reside on the same machine (co-resident) or on different machines (remote). The idea of co-resident partitioning is to reduce the size of individual indexes and the amount of I/O needed to update records. Similarly, the idea of remote partitioning is to increase the bandwidth of access to data by having more RAM (and so fewer disk accesses), more network interfaces, and more disk I/O channels available.
A high-level view of partitioned tables
There are several scenarios when partitioning data can be beneficial:
However, not all cases require data partitioning. So it is important to carefully analyse the specific needs of the system before deciding whether or not to use partitioning.
There are various data partitioning strategies. Let’s have a look at each one of them.
Horizontal partitioning (also known as database sharding when the partitions live on separate servers) is a strategy for splitting a table by rows, for example based on ranges of values of a partition key. Here we divide the table into smaller and more manageable tables, with each row of the table assigned to exactly one of the partitions.
Suppose a large table with many rows of customer data has slow query performance. We can partition the table horizontally into two separate tables: the first holds one half of the customer data and the second holds the other half. A query now goes to either partition 1 or partition 2, depending on the partition key. For example, if we store contact details for customers, we can keep the contacts whose names start with A-H on one partition and the contacts whose names start with I-Z on another.
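The A-H / I-Z split described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `partition_for`, the partition labels, and the sample customers are made up for the example and do not come from any particular database.

```python
def partition_for(name: str) -> str:
    """Pick a partition from the first letter of the customer name."""
    first = name.strip().upper()[:1]
    if "A" <= first <= "H":
        return "partition_1"
    if "I" <= first <= "Z":
        return "partition_2"
    raise ValueError(f"no partition defined for name {name!r}")


# Hypothetical sample rows; both partitions share the same schema.
customers = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Ivan", "phone": "555-0101"},
    {"name": "Hana", "phone": "555-0102"},
    {"name": "Zoe", "phone": "555-0103"},
]

partitions = {"partition_1": [], "partition_2": []}
for row in customers:
    partitions[partition_for(row["name"])].append(row)
```

A lookup by name only needs to touch the partition returned by `partition_for`, which is why queries on the partition key stay cheap.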
The benefit of horizontal partitioning: Horizontal partitioning is the most fundamental partitioning method. It divides the data into separate partitions that have the same schema as the original table. A query that filters on the partition key can therefore be answered from a single partition, without having to combine data from multiple partitions.
The disadvantage of horizontal partitioning: Data may not be evenly distributed across the partitions. For example, if many more customers have names in the range A-H than in the range I-Z, the first partition may experience a much heavier load than the second.
Vertical partitioning divides a table into smaller tables based on columns. It resembles normalization in that columns are moved into separate tables, but the goal is to split data by access pattern rather than to remove redundancy. For example, in a social media application like LinkedIn, a user's profile data, list of connections, and written articles can be placed on separate partitions: the first partition for user profile data, the second for the list of connections, and the third for the articles.
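As a rough sketch of the idea, the snippet below splits one user row into three column groups and puts it back together. The column names, the `COLUMN_GROUPS` mapping, and the helper functions are assumptions chosen for illustration, not the schema of any real application.

```python
# Hypothetical user row; all values are made up for the example.
user_row = {
    "user_id": 42,
    "name": "Asha",
    "headline": "Engineer",
    "connections": [7, 19],
    "articles": ["Intro to partitioning"],
}

# Which columns live on which vertical partition (an assumed layout).
COLUMN_GROUPS = {
    "profile": ["name", "headline"],
    "connections": ["connections"],
    "articles": ["articles"],
}


def split_vertically(row, key="user_id"):
    """Return one fragment per partition; each keeps the key column."""
    return {
        part: {key: row[key], **{col: row[col] for col in cols}}
        for part, cols in COLUMN_GROUPS.items()
    }


def reassemble(parts):
    """Rebuild the full row by merging the fragments from all partitions."""
    merged = {}
    for fragment in parts.values():
        merged.update(fragment)
    return merged


parts = split_vertically(user_row)
```

Reading only the profile touches one partition, while `reassemble` has to visit all three, which mirrors the cost of queries that need columns from several vertical partitions.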
There are a few disadvantages to using vertical partitioning:
The choice of which type of partitioning to use depends on the structure of the data. In some cases, it may be useful to combine horizontal and vertical partitioning. For example, if we have a large dataset of customer information with different kinds of data, we could use vertical partitioning to split the columns into groups and then horizontal partitioning to split the rows of each group.
There are many criteria available for data partitioning. Most of them use a partition key to identify the partition for each row. Some of the popular criteria are range partitioning, list partitioning, and hash partitioning.
In range partitioning, data is organized into partitions based on ranges of values of the partition key, i.e. each partition contains the rows whose partition key falls within a specific range. The ranges are typically contiguous and do not overlap, and each range specifies a lower and an upper bound for the partition. Any partition key equal to or higher than the upper bound of a range is placed in the next higher partition.
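The upper-bound rule can be sketched with a binary search over the bounds. The bound values and partition names below are hypothetical, and the final catch-all partition is an assumption added so that every key has a home.

```python
from bisect import bisect_right

# Exclusive upper bounds of the first three partitions (assumed values).
UPPER_BOUNDS = [100, 200, 300]
# One more name than bounds: the last partition catches everything >= 300.
PARTITION_NAMES = ["p0", "p1", "p2", "p_max"]


def range_partition(key: int) -> str:
    """Return the partition whose range contains the key.

    bisect_right counts the bounds that are <= key, so a key equal to an
    upper bound lands in the next higher partition.
    """
    return PARTITION_NAMES[bisect_right(UPPER_BOUNDS, key)]
```

With these bounds, key 99 falls in `p0` while key 100 already belongs to `p1`, since the upper bound itself is excluded from the lower partition.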
Range partitioning is used in a few specific cases:
In hash partitioning, rows are assigned to partitions by applying a hashing algorithm to the partition key. This is different from range partitioning, which groups rows by contiguous ranges of key values.
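A minimal sketch of hash partitioning, assuming the common "hash the key, then take it modulo the number of partitions" scheme. The function name and the key format are illustrative only. SHA-256 from the standard library is used instead of Python's built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in [0, num_partitions)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Count how 1,000 hypothetical keys spread over 4 partitions.
counts = [0, 0, 0, 0]
for i in range(1000):
    counts[hash_partition(f"user:{i}", 4)] += 1
```

Adjacent keys such as `user:1` and `user:2` usually end up on unrelated partitions, which spreads load evenly but makes range scans over the key touch every partition.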
Hash partitioning can be used in a few different ways:
One disadvantage of hash partitioning is that it can be expensive to dynamically add or remove database servers. For example, if keys are assigned as hash(key) mod N, changing the number of partitions N changes the mapping for most keys, so a large share of the data must be remapped and migrated. During this process, a large number of requests may not be served, resulting in downtime until the migration is complete. Consistent hashing reduces this problem by limiting how many keys have to move when a server is added or removed.
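To make the difference concrete, here is a small sketch that compares plain modulo hashing with a simple consistent-hash ring when a fifth server is added. The `HashRing` class, the server names, and the choice of 100 virtual nodes per server are assumptions for illustration and not a production design.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each server is placed at many points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(10_000)]
before = HashRing(["db0", "db1", "db2", "db3"])
after = HashRing(["db0", "db1", "db2", "db3", "db4"])

# Keys that change servers when going from 4 to 5 servers.
moved_ring = sum(before.node_for(k) != after.node_for(k) for k in keys)
moved_mod = sum(_hash(k) % 4 != _hash(k) % 5 for k in keys)
```

On the ring, the only keys that move are the ones taken over by the new server `db4`, which is about one fifth of them in expectation, while the modulo scheme is expected to remap about four fifths of the keys.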
In list partitioning, each partition is defined and selected based on a list of values for a particular column, rather than a set of contiguous ranges. Some key points:
As an example, consider a table with data for 20 video stores distributed among 4 regions:
Using list partitioning, we could partition the table so that rows belonging to the same region are stored in the same partition. This will help us to easily add or drop records relating to specific regions from the table.
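A short sketch of this layout follows. The assignment of store IDs to regions below is hypothetical, chosen only so that 20 stores are spread across 4 regions; the region names and helper function are likewise assumptions.

```python
# Hypothetical value lists: each region partition lists the store IDs it owns.
REGION_STORES = {
    "north": {1, 4, 9, 12, 17},
    "east": {2, 5, 10, 13, 18},
    "west": {3, 6, 11, 14, 19},
    "central": {7, 8, 15, 16, 20},
}

# Reverse lookup from store ID to the partition that lists it.
STORE_TO_REGION = {
    store_id: region for region, ids in REGION_STORES.items() for store_id in ids
}


def list_partition(store_id: int) -> str:
    """Return the partition whose value list contains the store ID."""
    if store_id not in STORE_TO_REGION:
        raise ValueError(f"store {store_id} is not in any partition's value list")
    return STORE_TO_REGION[store_id]


# Made-up rows, routed to their region partitions.
rows = [
    {"store_id": 4, "title": "Film A"},
    {"store_id": 7, "title": "Film B"},
    {"store_id": 12, "title": "Film C"},
]
partitions = {region: [] for region in REGION_STORES}
for row in rows:
    partitions[list_partition(row["store_id"])].append(row)

# Dropping all records for one region only touches that region's partition.
partitions["north"].clear()
```

Unlike range partitioning, the IDs in one partition do not need to be contiguous; membership in the list is all that matters.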
Composite partitioning is a method of partitioning data based on two or more partitioning techniques. In this method, data is first partitioned using one technique, and then each partition is further divided into sub-partitions using the same or a different method. Here all sub-partitions of a given partition together represent a logical subset of the data.
Composite partitioning can be useful for organizing and managing large datasets. It can help improve the performance and scalability of the database by providing more precise control over data placement.
There are several types of composite partitioning:
Composite Range-Range Partitioning: This method performs range partitioning on two columns. For example, we could first partition the data by date and then sub-partition each date range by price.
Composite Range-Hash Partitioning: This method first partitions the data using range partitioning and then sub-partitions it using hash partitioning within each range partition.
Composite Range-List Partitioning: This method first partitions the data using range partitioning and then sub-partitions it using list partitioning within each range partition.
Composite List-Range Partitioning: This method performs range sub-partitioning within a list partition. For example, first perform list partitioning by country name and then perform range sub-partitioning by date.
Composite List-Hash Partitioning: This method sub-partitions list-partitioned data using the hash partitioning technique.
Composite List-List Partitioning: This method performs list partitioning based on two table dimensions. For example, perform list partitioning by country name and list sub-partitioning by customer account status.
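As one illustration of these combinations, the sketch below implements a range-hash scheme: rows are first placed in a range partition by order date and then in a hash sub-partition by customer ID. The date bounds, partition names, and number of sub-partitions are all hypothetical.

```python
import hashlib
from bisect import bisect_right
from datetime import date

# Exclusive upper bounds for the date ranges (assumed values).
RANGE_BOUNDS = [date(2023, 1, 1), date(2024, 1, 1)]
RANGE_NAMES = ["before_2023", "y2023", "from_2024"]
SUBPARTITIONS = 4  # hash sub-partitions inside every range partition


def composite_partition(order_date: date, customer_id: str):
    """Return (range partition name, hash sub-partition index) for a row."""
    # Level 1: range partitioning on the date.
    range_part = RANGE_NAMES[bisect_right(RANGE_BOUNDS, order_date)]
    # Level 2: hash sub-partitioning on the customer ID.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    sub_part = int.from_bytes(digest[:8], "big") % SUBPARTITIONS
    return range_part, sub_part
```

A query for one year can skip the other range partitions entirely, while the hash level spreads that year's rows evenly across its sub-partitions.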
Data Partitioning is the backbone of modern distributed data management systems. Data Partitioning proves very effective in improving the availability, scalability, and performance of the system. In this blog, we tried to present a full conceptual understanding of Data Partitioning. Hope you liked it. Please share your views in the comments below.
Enjoy learning, Enjoy system design!