Partitioning
It is the technique to physically divide your data into separate data stores to improve scalability, reduce contention and optimize performance
Benefits of Partitioning
- Improved Scalability: Each partition can be saved on a node on a different server
- Improved Performance: Operations that effect more than one partition can run in parallel
- Improved Security: We can split the data based on the security requirement
- Match the data store to pattern of use : Use different types of data stores to store the data based on its usage
- Improved Availability: As data is distributed we can avoid a single point of failure
Designing Partitions
- Avoid Cross Partition Joins. The queries should be as self-contained as possible
- Data lookup is an expensive operation so replicate static data that is referenced frequently
- Periodic Rebalancing of Shards should be performed as some partitions might become hot spots causing performance degradation
- Avoid accessing data from multiple partitions on a frequent bases. If this is the case store them all in a single partition
Types of Partitioning
Horizontal Partitioning (Sharding)
Splitting the data based on values. e.g. Partition by date
Each partition is stored in a different data store/ node but they all have the same schema
Each partition holds a specific subset of the whole data
Selecting the correct partitioning key in this type of approach is very vital (The data should be distributed as evenly as possible).
If suboptimal columns are used the performance of the data store can be effected drastically.
It is difficult to change the partition key once data has been loaded into the system
Vertical Partitioning
Each partition holds a subset of fields that are stored in the data store (Splitting data based on columns)
The goal of this type of partitioning is to reduce the performance cost associated with I/O
The data that is stored in one partition will be accessed more frequently compared to others
This approach allows to apply different security constraints on different data. PII data can be stored in a separate partition where strict security constraints can be used
Allows to save slowly changing data and fast changing data separately
Functional Partitioning
Separate data from different domains into different physical nodes to ensure better isolation and performance
It allows to avoid concurrent access to the same table
Criteria for choosing Partition Key
High Cardinality
If the key only stores a few values. e.g. Yes/ No, Male/ Female, etc. the system will not scale horizontally
Partition Reshuffling
The column selected for partitioning should not be such that it uses items to be redistributed frequently
Generates contention as two partitions cannot be accessed as insert operation is taking place
Older partitions will have less frequency
Frequency
The column used for partitioning should have elements that have an even frequency distribution i.e. all partitions will have equal number of entries
A partition that is frequently accessed/ overloaded with data is referred to as a hot partition