When To Cache?

#datascience #python #pyspark #spark

In Spark, memory is used for two purposes:

Storage - memory used to store/cache data that will be used later. This data usually occupies memory for a long time.
Execution - memory used while running operations such as joins, sorts, aggregations and especially shuffles (when data is redistributed across executors, for example by a join or a group-by).
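To get a feel for how much room the two purposes share, here is a rough back-of-the-envelope sketch. It assumes the unified memory manager with its default settings (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, and a fixed 300 MiB reservation); the 4 GiB executor heap and the helper name are made up for illustration, and real numbers will differ with overhead and off-heap settings.

```python
# Assumed defaults of Spark's unified memory manager (Spark 2.x and later).
RESERVED_MIB = 300        # fixed amount Spark reserves for itself
MEMORY_FRACTION = 0.6     # spark.memory.fraction
STORAGE_FRACTION = 0.5    # spark.memory.storageFraction


def unified_memory(heap_mib):
    """Return (shared storage+execution pool, protected storage share) in MiB.

    Hypothetical helper: it only reproduces the arithmetic, not Spark's code.
    """
    usable = heap_mib - RESERVED_MIB
    unified = usable * MEMORY_FRACTION          # shared by storage and execution
    storage_protected = unified * STORAGE_FRACTION  # cached blocks here are not evicted by execution
    return unified, storage_protected


unified, storage_protected = unified_memory(4096)  # hypothetical 4 GiB heap
print(f"storage + execution pool: {unified:.1f} MiB")
print(f"storage share safe from eviction: {storage_protected:.1f} MiB")
```

Storage and execution can borrow from each other when one side is idle, so every cached block beyond the protected share competes directly with joins and shuffles for the same pool.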

One has to be careful with caching. Cache only data you're sure will be reused later, for example an intermediate result that more than one action reads. Anything else just takes memory away from execution.

Dynamic Resource Allocation and Caching

The idea behind dynamic allocation is that you define the initial, minimum and maximum number of executors. Spark starts with the initial executors, then decides based on the load whether to request more. Once the heavy lifting is done, Spark releases the executors that have been sitting idle. This is a great way to increase efficiency, but what if you had cached something?

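Because you cached something, the executors holding that data can't simply be freed. By default Spark never treats an executor with cached blocks as idle (spark.dynamicAllocation.cachedExecutorIdleTimeout defaults to infinity), and releasing one means losing its cached partitions unless they are available on another executor. That's what you need to bear in mind.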
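As a rough sketch of the settings involved, the snippet below collects the dynamic allocation properties in a dict and renders them as spark-submit flags. The property names are real Spark configuration keys; the values and the to_submit_flags helper are hypothetical and only there for illustration. Dynamic allocation also needs an external shuffle service or shuffle tracking on the cluster, which is left out here.

```python
# Hypothetical values for illustration; tune them for your own cluster.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": "2",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
    # Executors with no cached blocks are released after this idle period.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # Executors holding cached blocks are kept forever by default.
    # A finite value lets Spark release them; their cached data is lost
    # and gets recomputed if it is needed again.
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "1800s",
}


def to_submit_flags(conf):
    """Render a config dict as a list of spark-submit --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(to_submit_flags(dynamic_allocation_conf)))
```

The same key/value pairs could instead be passed one by one to SparkSession.builder.config(key, value) when the session is created.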

Cache vs. Persist

We'll just talk about RDDs (and not Datasets). There's no real difference between cache() and persist(): cache() is simply persist() with the default storage level, while persist() also lets you pass a storage level explicitly. For RDDs the default is MEMORY_ONLY; for Datasets it's MEMORY_AND_DISK.

You can change the storage level to suit the use case by passing it to persist(). Also, there's no uncache() function on RDDs. There's just unpersist().

Databricks has improved caching quite a bit by introducing Delta caching, but it is limited to some file formats (it works on Parquet files, which includes Delta tables).
