For optimal performance and compressing across Clustered Columnstore Index (CCI), we need to ensure that each partition/ distribution should have at minimum 1 million records
If each partition will have less than 60 million rows then it is recommended to reduce the number of partitions
Partitioning Tables in Dedicated SQL Pool - Azure Synapse Analytics | Microsoft Docs
Types of Distribution
Hash Distribution
Distributes the rows across the compute nodes using a deterministic hash function
Distributes the data evenly so that Synapse compute power can be properly utilized
Works well for tables bigger than 2 GB
Used on tables that frequently receive insert, update and delete operations
Generally used for Fact Tables in Star Schema
Round Robin
It is the default distribution type in Synapse
Assigns rows evenly across the nodes. The data is distributed in the nodes randomly
There is no guarantee that similar data is going to be stored on the same node
Joining a round-robin table will always result in reshuffling resulting in a performance hit
Round Robin can be used if the table has no joining key or joins are performed very infrequently
If no column can distribute the data evenly use hash distribution
Temporary tables are generally created using Round Robin distribution
Replicated
It is a table that gets distributed across all 60 nodes of Synapse
Useful for small tables and tables that are joined frequency against a big fact table
Dimension tables are generally created using replicated distribution
Hash Key Characteristics
The column should be used for grouping operations. The grouped records should lie in 1 distribution
The hash key is used as a join condition, especially for large tables. This will prevent unnecessary data movement
Has more than 60 distinct keys so that the data can be distributed across the 60 nodes in Synapse (Many unique values)
It should not be a date column and should have very few (ideally no) NULL values
Best Practices
For more than 60 million records Clustered Columnstore Index (CCI) provides the best performance
For smaller tables heap and Clustered Index will provide better performance
Heaps are useful for temporary storage and are a great choice for staging tables
For Hash distribution, the column that is extensively used in join should be used
If round robin is used on a dimension table then the column used in the grouping operation should be used as a key
Round Robin tables provide good speed when performing data load operations
Frequently run queries are cached by Synapse at the Control Node
We should not enable cache when working with large datasets as it will throttle the control nodes and degrade the overall performance
Cheatsheet for Dedicated SQL Pool - Azure Synapse Analytics | Microsoft Learn