Joins use more shuffles than any other process in Spark. It always leads to bad performance
Broadcast Hash Join (Map Side Join/ Replicated Join)
- Used when we want to compare a small and large table
- The entire small table is copied into each partition location of the large table
- The table joining happens in the mapper stage itself
- Default: 10 Mb (Max size of small table)
Shuffle Hash Join
- Used when we join 2 large tables
- The partitions are shuffled to get similar data together (Done for both table)
- And then this shuffled partitions are joined together
- The values that going to be in a partition after shuffle is decided by hash function
- Used for non-indexed data
If we try to join normally for similar data we need to access all the partitions and then are join and reduced into a new partition
Sort Merge Join
- Used to join very large tables
- Groups the same keys to the same partition
- It is the default sort strategy in Spark since v2.3
- The join key needs to be orderable (sortable) for this algorithm to work
Joins in Spark SQL- Shuffle Hash, Sort Merge, BroadCast -24 Tutorials