
Discussion on: The 5-minute guide to using bucketing in Pyspark

Maxime Moreau • Edited

Hi, thank you for sharing.
Could you elaborate on the gotchas? And why do you find the usage non-trivial? Is it because we have to save the df using the write method, etc.? So sometimes we would save the df "normally" and sometimes using buckets?

Thank you ;)

luminousmen

Thank you for your support, Maxime!
I say it's not trivial because you have to fulfill at least a few conditions.

  1. You have to read the data the same way it was bucketed: while writing, Spark applies a hash function to the bucketing key to decide which bucket each row goes into.
  2. spark.sql.shuffle.partitions must be the same as the number of buckets; otherwise we will get a standard shuffle (see the sketch after this list).
  3. Choose the bucket columns wisely; everything depends on the workload. Sometimes it is better to hand the optimization over to the Catalyst optimizer than to do it yourself.
  4. Choose the number of buckets wisely; this is also a tradeoff. If you have as many executors as buckets, loading will be fast. However, if the data volume per bucket is too small, it may not be very good in terms of performance.
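
Here's a minimal sketch of what I mean (the table name, bucket count, and column are made up for illustration): write the data with `bucketBy` + `saveAsTable`, keep `spark.sql.shuffle.partitions` equal to the bucket count, and read it back through the table name so Spark picks up the bucketing metadata.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

# Keep the shuffle partition count equal to the bucket count,
# otherwise joins on the bucketed key still trigger a regular shuffle.
spark.conf.set("spark.sql.shuffle.partitions", 16)

# Hypothetical dataset, just to have something to bucket.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "user_id")

# Write the data bucketed by the join key; bucketBy only works with
# saveAsTable, so the result lands in the metastore, not a plain path.
(df.write
   .bucketBy(16, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("users_bucketed"))

# Read it back through the table name so Spark sees the bucketing
# metadata; reading the files directly would lose that information.
users = spark.table("users_bucketed")

# A self-join on the bucketed key should now plan without an Exchange;
# check the physical plan to confirm.
users.join(users, "user_id").explain()
```

If everything lines up, the plan for the join on the bucketed key should show no Exchange (shuffle) step.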