This is a continuation of my previous posts:
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Using Cache
Spark RDDs (and the DataFrames built on them) are recomputed every time an action is executed on them. You can avoid this recomputation with cache() or persist(), which keep the computed data in memory (or on disk, depending on the storage level).
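As a quick sketch of the two APIs (df here is a placeholder for any DataFrame):

from pyspark import StorageLevel

# cache() is shorthand for persist() with the default storage level
# (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for plain RDDs).
df.cache()

# persist() lets you choose the storage level explicitly; note that a
# DataFrame can hold only one storage level at a time, so this line is
# an alternative to cache(), not a follow-up call:
# df.persist(StorageLevel.DISK_ONLY)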
Comparison between using cache and no cache
Please note that cache() and persist() are lazy operations, not actions: they only mark the data for caching and do not compute anything by themselves. The cache is actually populated only when a Spark action (for example, count(), show(), take(), or write()) is subsequently executed on the same DataFrame, Dataset, or RDD.
Let's try cache!
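The timer used below is a simple context manager for measuring wall-clock time. Its definition comes from the earlier posts in this series; the sketch here is an assumption that matches the output format shown later:

from contextlib import contextmanager
import time

@contextmanager
def timer(name):
    # Print the wall-clock time taken by the enclosed block.
    start = time.time()
    yield
    print(f"[{name}] done in {time.time() - start:.4f} s")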
with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()
part_df.count()  # run an action so the cache is actually materialized

with timer('after cache'):
    part_df.select("backend_port").distinct().count()
[before cache] done in 4.5241 s
[after cache] done in 1.6293 s
It's faster with cache()!
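One follow-up worth knowing: when the cached data is no longer needed, you can free the executors' memory explicitly with unpersist(), which is part of the standard DataFrame API:

# Drop part_df from the cache so the memory is available to later stages.
part_df.unpersist()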
Summary
- RDDs are recomputed for every action, so caching data that is reused makes things faster
- cache() and persist() are lazy, so nothing is actually cached until an action runs on the data