DEV Community

Discussion on: The 5-minute guide to using bucketing in Pyspark

luminousmen Author

Thank you for your support, Maxime!
I say it's not trivial because at least a few conditions have to be met.

  1. You have to read the data the same way it was bucketed: when writing, Spark applies a hash function to the bucketing key to decide which bucket each row goes into.
  2. spark.sql.shuffle.partitions must match the number of buckets; otherwise we get a regular shuffle anyway.
  3. Choose the bucket columns wisely; everything depends on the workload. Sometimes it is better to hand the optimization over to the Catalyst optimizer than to do it yourself.
  4. Choose the number of buckets wisely; this is also a tradeoff. If you have as many executors as buckets, loading is fast. However, if the data volume is too small, too many buckets can actually hurt performance.
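To make condition 1 concrete, here is a simplified, pure-Python sketch of the routing idea. Spark actually uses Murmur3 hashing and the `DataFrameWriter.bucketBy(...).sortBy(...).saveAsTable(...)` API; Python's built-in `hash()` and the sample keys below stand in purely for illustration:

```python
# Simplified sketch of how a bucketed write routes rows.
# Real Spark: df.write.bucketBy(8, "user_id").saveAsTable("events")
# uses Murmur3 on the bucketing key; hash() here is illustrative only.
NUM_BUCKETS = 8

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Bucket index a row with this key would be written to."""
    return hash(key) % num_buckets

rows = [("user_1", 10), ("user_2", 25), ("user_1", 7)]
buckets = {}
for key, value in rows:
    # every row with the same key lands in the same bucket file
    buckets.setdefault(bucket_for(key), []).append((key, value))
```

Because rows with equal keys always land in the same bucket, two tables bucketed identically on the join key can be joined bucket-by-bucket, which is what lets Spark skip the shuffle.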