We all know how easy it is to overlook small parts of our code, especially when we have powerful tools like Apache Spark to handle the heavy lifting. Spark's core engine is great at optimizing our messy, complex code into a sleek, efficient physical plan. But here's the catch: Spark isn't flawless. It's on a journey to perfection, sure, but it still has its limits. And Spark is upfront about those limitations, listing them out in the documentation (sometimes as little notes).
But let’s be honest—how often do we skip the docs and head straight to Stack Overflow or ChatGPT for quick answers? I've been there too. The thing is, while these shortcuts can be useful, they don't always tell the whole story. So, if you're ready to dive in, let's talk about some common mistakes and how to avoid them. Stay with me; this is going to be a ride!
Table of Contents
- Mistake #1: Adding Columns the Wrong Way
- Mistake #2: Order of Narrow and Wide Transformations
- Mistake #3: Overlooking Data Serialization Format
- Mistake #4: Not Using Parallel Listing on Input Paths
- Mistake #5: Ignoring Data Locality
- Mistake #6: Relying on Default Number of Shuffle Partitions
- Mistake #7: Overlooking Broadcast Join Thresholds
- Mistake #8: Relying on the Default Storage Level for Cache
- Mistake #9: Misconfiguring Spark Memory Settings
- Mistake #10: Relying Only on Cache and Persist
Mistake #1: Adding Columns the Wrong Way
Client: "Hey, can you add 5 columns? Make it quick, okay?"
Developer: "Sure, I'll just use withColumn() in a loop 5 times!"
Client: (Happy) "Great! Now, can you add 10 more columns? Make it quick, and keep the code short!"
Developer: "No problem! I'll loop 15 times now."
Spark: "Sorry, I can't optimize that."
But wait—according to Spark's documentation...
Don't use withColumn in a loop.
Solution: use select() or selectExpr() instead.
Here is a complete helper:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Builds one flat select list: new expressions whose text matches an existing column name
// replace it; other source columns are kept as-is, and the remaining new expressions are appended.
def addOrReplaceColumns(newColumns: List[Column], sourceColumns: List[String]): List[Column] = {
  val (replacements, additions) = newColumns.partition(c => sourceColumns.contains(c.toString()))
  val untouched = sourceColumns.diff(replacements.map(_.toString())).map(name => col(name))
  replacements ++ additions ++ untouched
}
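A minimal usage sketch (the DataFrame df and the new column expressions are hypothetical): the helper produces the whole column list once, so everything is added in a single projection instead of one withColumn per column.
import org.apache.spark.sql.functions.lit

// Hypothetical: add two brand-new columns in a single projection.
val newCols = List(lit(0).as("retry_count"), lit("v2").as("schema_version"))
val result = df.select(addOrReplaceColumns(newCols, df.columns.toList): _*)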
Mistake #2: Order of Narrow and Wide Transformations
Normally we focus on business logic when developing a data solution, and it's common to ignore the order of narrow and wide transformations. The thing is, Spark recommends grouping all the narrow transformations first and then the wide ones. For example,
if you have
narrow, wide, narrow, narrow, wide, narrow
then try to rearrange it as,
narrow, narrow, narrow, narrow, wide, wide
Spark can then optimize your code more effectively: the consecutive narrow transformations are pipelined within a single stage, and fewer shuffles (moving much less data) are required. See the sketch below.
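A small illustrative sketch (the orders DataFrame and column names are hypothetical): applying the narrow filter and projection before the wide groupBy means the shuffle only has to move the rows and columns that survive.
import org.apache.spark.sql.functions.{col, sum}

// Narrow transformations first: filter and project shrink the data before any shuffle.
val slimmed = orders
  .filter(col("status") === "COMPLETED")   // narrow
  .select("customer_id", "amount")         // narrow

// Wide transformation last: the groupBy shuffle now moves far less data.
val totals = slimmed
  .groupBy("customer_id")                  // wide (shuffle)
  .agg(sum("amount").as("total_amount"))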
Mistake #3: Overlooking Data Serialization Format
By default, Spark uses Java serialization, which is not the most efficient option. Switching to Kryo serialization can lead to better performance, as it is faster and uses less memory. Use the following configuration to enable Kryo:
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
But Kryo does not support all Serializable types, and it requires you to register the classes you'll use in the program in advance for best performance.
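Here is a minimal sketch of registering classes with Kryo at startup; the case classes are hypothetical placeholders for your own record types.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain classes that end up in shuffles or RDD caches.
case class Click(userId: Long, url: String)
case class Session(userId: Long, clicks: Seq[Click])

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write small numeric IDs instead of full class names.
  .registerKryoClasses(Array(classOf[Click], classOf[Session]))

val spark = SparkSession.builder().config(conf).getOrCreate()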
Mistake #4: Not Using Parallel Listing on Input Paths
When reading files from storage systems like Amazon S3, Azure Data Lake Storage (ADLS), or even local storage, Spark needs to list and find all matching files in the input directory before starting the next task. This listing process can become a bottleneck, especially when dealing with large directories or a vast number of files. By default, Spark uses only a single thread to list files, which can significantly slow down the start of your job.
To mitigate this, you can increase the number of threads used for listing files by setting the spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads property. This allows Spark to parallelize the file listing process, speeding up the initialization phase of your job.
spark.conf.set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", 10)
Mistake #5: Ignoring Data Locality
Data locality significantly impacts the performance of Spark jobs.
When data and the code processing it are close together, computation is faster, as there is less need to move large chunks of data. Spark scheduling prioritizes data locality to minimize data movement, following levels of locality from best to worst: PROCESS_LOCAL (data and code in the same JVM), NODE_LOCAL (data on the same node), RACK_LOCAL (data on the same rack but different node), and ANY (data elsewhere on the network).
Spark tries to schedule tasks at the highest locality level possible, but this isn't always feasible. If no idle executors have unprocessed data at the desired locality level, Spark can either wait for a busy executor to free up or fall back to a lower locality level by moving data to an idle executor. The time Spark waits before falling back can be adjusted using the spark.locality.wait settings. Adjusting these settings can help improve performance in scenarios with long-running tasks or when data locality is poor.
With moderate data skew, a cluster with ample resources, or data that is already cached with .cache(), increasing the wait usually pays off more than falling back to a lower locality level.
spark.conf.set("spark.locality.wait", "10s")
Mistake #6: Relying on Default Number of Shuffle Partitions
By default, Spark uses 200 partitions for shuffle operations (e.g., join, groupBy). This number might be too high or too low, depending on your dataset and cluster size.
AQE (enabled by default from Databricks Runtime 7.3 LTS onwards, and in Apache Spark 3.2.0+) adjusts the shuffle partition number automatically at each stage of the query, based on the size of the map-side shuffle output.
Still, it's advisable to set the shuffle partition count explicitly before performing a wide transformation if you need tight control. If you are unsure, Spark's guidance is to start with the total number of cores in your cluster.
spark.conf.set("spark.sql.shuffle.partitions", "num_core_in_cluster")
And don't forget to tune spark.default.parallelism (the equivalent setting for RDD operations) accordingly as well.
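A small sketch of wiring both settings to the cluster's core count; sc.defaultParallelism usually reflects the total executor cores on a cluster, but verify that assumption for your deployment.
// On most cluster managers, defaultParallelism reflects the total executor cores.
val totalCores = spark.sparkContext.defaultParallelism

// Shuffle partition count for DataFrame/SQL wide operations (join, groupBy, ...).
spark.conf.set("spark.sql.shuffle.partitions", totalCores.toString)

// spark.default.parallelism applies to RDD operations and is read at context start,
// so it is best supplied at submit time, e.g. --conf spark.default.parallelism=<cores>.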
Mistake #7: Overlooking Broadcast Join Thresholds
Scenario:
Developer: "I thought small lookup tables would be broadcasted automatically and my each of executors has 32GB of memory! Why are my joins so slow?"
Spark: "Sorry, your lookup table is just above the default threshold."
Broadcast joins can drastically speed up join operations when one of the tables is small enough to fit into memory on each worker node. However, if you don't adjust the broadcast join threshold, Spark might not broadcast tables that could be effectively broadcasted, leading to unnecessary shuffling.
Solution:
Adjust the broadcast join threshold using spark.sql.autoBroadcastJoinThreshold. If your lookup table is slightly larger than the default 10MB limit, increase the threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) // 50MB
When setting the broadcast join threshold, don't base it only on executor memory. The driver loads the small table into memory first before distributing it to executors. Make sure the threshold is suitable for both driver and executor memory capacities to prevent memory issues and optimize performance.
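Alternatively, when you know a specific table is small, you can request the broadcast explicitly instead of relying on the size threshold. A minimal sketch, assuming hypothetical facts and dim DataFrames joined on a product_id column:
import org.apache.spark.sql.functions.broadcast

// Explicit hint: ship dim to every executor regardless of autoBroadcastJoinThreshold,
// provided it genuinely fits in driver and executor memory.
val joined = facts.join(broadcast(dim), Seq("product_id"), "left")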
Mistake #8: Relying on the Default Storage Level for Cache
It's crucial to select the appropriate storage level for caching and persisting data based on the type of executors in your cluster and on what you are optimizing for. By understanding the trade-offs between speed, memory usage, and fault tolerance, you can tailor your Spark configuration to meet the specific needs of your application. The table below summarizes the common choices.
| Executor Type | Primary Objective | Recommended Storage Level | Description | Alternative for Fault Tolerance | Notes |
|---|---|---|---|---|---|
| Memory-Optimized | Fast access, low memory usage | `MEMORY_ONLY_SER` | Stores RDD as serialized objects in memory. Balances speed and memory efficiency. | `MEMORY_ONLY_SER_2` | Use `MEMORY_ONLY` if serialization overhead is not a concern. |
| Memory-Optimized | Fast access, low memory usage | `MEMORY_ONLY` | Stores RDD as deserialized objects in memory. Fastest access, highest memory usage. | `MEMORY_ONLY_2` | Use for small datasets that fit comfortably in memory. |
| CPU-Optimized | Balanced memory and disk | `MEMORY_AND_DISK_SER` | Serialized storage in memory, spills to disk if needed. Good for large datasets. | `MEMORY_AND_DISK_SER_2` | Preferred when memory is limited; avoids out-of-memory errors. |
| CPU-Optimized | Balanced memory and disk | `MEMORY_AND_DISK` | Deserialized storage in memory, spills to disk. Faster access than `MEMORY_AND_DISK_SER`. | `MEMORY_AND_DISK_2` | Use when memory can accommodate deserialized objects, with fallback to disk. |
| General Purpose | Flexibility, moderate size datasets | `MEMORY_AND_DISK` | Deserialized in-memory, spills to disk. Good balance for general use cases. | `MEMORY_AND_DISK_2` | Good for mixed workloads; balances speed and fault tolerance. |
| General Purpose | Flexibility, moderate size datasets | `MEMORY_ONLY_SER` | Serialized in-memory storage. Optimized for memory efficiency and speed. | `MEMORY_ONLY_SER_2` | Suitable for datasets that fit well in memory after serialization. |
| Disk-Optimized | Low memory, high fault tolerance | `DISK_ONLY` | Stores RDD partitions only on disk. Minimizes memory usage but slowest access. | `DISK_ONLY_2` | Suitable for very large datasets where memory is a constraint. |
| Disk-Optimized | Low memory, high fault tolerance | `MEMORY_AND_DISK_SER` | Serialized storage in memory with spillover to disk. More efficient than deserialized. | `MEMORY_AND_DISK_SER_2` | Balances disk usage and memory efficiency. |
The `_2` options (e.g., `MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`) are useful for scenarios where fault tolerance is crucial. They replicate data across two nodes, ensuring data is not lost if a node fails. This is particularly valuable in environments where reliability is prioritized over resource efficiency, such as production systems handling critical data or real-time processing pipelines.
The `_SER` options (e.g., `MEMORY_AND_DISK_SER`) store the RDD as serialized Java objects (one byte array per partition) in memory. They are more memory-efficient than the corresponding deserialized levels, but slower due to serialization/deserialization overhead.
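A short sketch of picking a level explicitly instead of relying on the default (the events DataFrame is hypothetical; cache() is simply persist() with the default level, which is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames/Datasets):
import org.apache.spark.storage.StorageLevel

// Explicit storage level instead of the default used by cache().
val cachedEvents = events.persist(StorageLevel.MEMORY_AND_DISK_SER)

cachedEvents.count()      // first action materializes the cache
// ... reuse cachedEvents across several downstream transformations ...
cachedEvents.unpersist()  // release memory/disk when no longer needed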
Mistake #9: Misconfiguring Spark Memory Settings
Scenario:
Developer: "My Spark job keeps failing with out-of-memory errors. I gave it all the memory available!"
Spark: "Memory isn't just for you; I need some for myself, too."
Many users allocate almost all available memory to the executor heap space (spark.executor.memory) without considering Spark's overhead memory, causing frequent out-of-memory errors. Additionally, insufficient memory can lead to excessive garbage collection (GC) pauses, slowing down jobs.
Solution:
Properly configure memory settings by tuning spark.executor.memory and spark.executor.memoryOverhead.
--conf spark.executor.memory=4g --conf spark.executor.memoryOverhead=512m
Ensure you leave enough memory overhead for Spark's off-heap needs (JVM overheads, interned strings, network buffers, etc.). By default, roughly 10% of executor memory (with a 384MB minimum) is reserved as overhead; 10-15% is a reasonable starting point.
spark.memory.fraction expresses the size of the unified memory region M (shared by execution and storage) as a fraction of (JVM heap space - 300MiB), with a default of 0.6. The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
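A rough sizing sketch with assumed numbers: a 4g executor heap with the default spark.memory.fraction of 0.6 gives roughly (4096 - 300) * 0.6, about 2278 MiB, for the unified execution-plus-storage region; the rest stays free for user code and metadata.
import org.apache.spark.sql.SparkSession

// Assumed values: 4g executor heap, 512m overhead; adjust to your cluster.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.memoryOverhead", "512m")
  .getOrCreate()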
Mistake #10: Relying Only on Cache and Persist
Many Spark developers are familiar with the cache() and persist() methods for improving performance, but they often overlook the value of checkpoint(). While cache() and persist() keep data in memory or on disk to speed up processing, they don't provide fault tolerance in the case of a failure. checkpoint(), on the other hand, saves the data to a reliable storage system and truncates the lineage, allowing for fault recovery and faster recomputation.
Using checkpoint() not only ensures that your job can recover from failures but also helps Spark optimize the execution of other jobs that share the same lineage. This can lead to improved performance and resource utilization.
spark.sparkContext.setCheckpointDir("path/to/checkpoint/dir")
val dfCheckpointed = df.checkpoint()  // checkpoint() returns a new DataFrame; use the returned value downstream
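A brief sketch of where this pays off (the loop and column names are hypothetical): in iterative jobs, checkpointing periodically keeps the lineage, and therefore the query plan, from growing without bound.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical iterative job: every pass extends the lineage of `current`.
var current: DataFrame = df
for (i <- 1 to 20) {
  current = current.filter(col("score") > i)   // lineage (and query plan) keeps growing
  if (i % 5 == 0) {
    current = current.checkpoint()             // truncate lineage; plan stays small
  }
}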