This is a continuation of my previous posts as follows.
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Spark Join
Apache Spark has a type of join called Broadcast Join, which avoids shuffle processing. This method is effective when one table is small and the other is large. Essentially, it distributes the small table to all worker nodes, allowing each node to perform the join. This experiment will test the effectiveness of Broadcast Join for speed optimization using a small dataframe (small enough to fit in the memory of each worker node) and a large dataframe.
https://sparkbyexamples.com/spark/broadcast-join-in-spark/?expand_article=1
With broadcast join, Spark broadcast the smaller DataFrame to all executors and the executor keeps this DataFrame in memory and the larger DataFrame is split and distributed across all executors so that Spark can perform a join without shuffling any data from the larger DataFrame as the data required for join colocated on every executor.
BroadCast Join
join_df = part_df.select(part_df['request_port']).distinct().withColumn("random", F.round(F.rand()*(10-5)+5,0))
with timer('broadcast join dataframe'):
broadcast_df = part_df.join(join_df.hint('BROADCAST'), part_df.request_port == join_df.request_port, how='left')
broadcast_df.count()
with timer('sortmerge join dataframe'):
merge_df = part_df.join(join_df.hint('MERGE'), part_df.request_port == join_df.request_port, how='left')
merge_df.count()
with timer('shuffle hash join dataframe'):
shuffle_df = part_df.join(join_df.hint('SHUFFLE_HASH'), part_df.request_port == join_df.request_port, how='left')
shuffle_df.count()
I used Join Hints to suggest the join strategy. You can find more about JOIN hints here.
Although not shown here, I looked Physical Plan using explain(), to verify the Hint's effectiveness.
[shuffle hash join dataframe] done in 23.2022 s
[broadcast join dataframe] done in 11.7729 s
[sortmerge join dataframe] done in 38.4018 s
The broadcast join is the fastest!
Summary
When performing a JOIN operation between a small df and a large df, the broadcast join is the fastest strategy.
Top comments (0)