This is a continuation of my previous posts:
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Using Cache
Spark RDDs (and the DataFrames built on them) are recomputed every time an action is executed on them. You can avoid this recomputation with cache() or persist(), which keep the computed data in memory (or on disk, depending on the storage level).
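As a quick sketch of the two APIs (df here is a placeholder for any DataFrame):

from pyspark import StorageLevel

# cache() is shorthand for persist() with the default storage level
# (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for plain RDDs).
df.cache()

# persist() lets you choose the storage level explicitly; note that a
# DataFrame can hold only one storage level at a time, so this line is
# an alternative to cache(), not a follow-up call:
# df.persist(StorageLevel.DISK_ONLY)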
Comparison between using cache and no cache
Please note that cache() and persist() are lazy operations, not actions: they only mark the data for caching and do not compute anything by themselves. The cache is actually populated only when a Spark action (for example, count(), show(), take(), or write()) is subsequently executed on the same DataFrame, Dataset, or RDD.
Let's try cache!
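The timer used below is a simple context manager for measuring wall-clock time. Its definition comes from the earlier posts in this series; the sketch here is an assumption that matches the output format shown later:

from contextlib import contextmanager
import time

@contextmanager
def timer(name):
    # Print the wall-clock time taken by the enclosed block.
    start = time.time()
    yield
    print(f"[{name}] done in {time.time() - start:.4f} s")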
with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()
part_df.count()  # run an action so the cache is actually materialized

with timer('after cache'):
    part_df.select("backend_port").distinct().count()
[before cache] done in 4.5241 s
[after cache] done in 1.6293 s
It's faster with cache()!
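One follow-up worth knowing: when the cached data is no longer needed, you can free the executors' memory explicitly with unpersist(), which is part of the standard DataFrame API:

# Drop part_df from the cache so the memory is available to later stages.
part_df.unpersist()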
Summary
- RDDs are recomputed for every action, so caching data that is reused makes things faster
- cache() and persist() are lazy, so nothing is actually cached until an action runs on the data