Apache Spark: Unpersisting RDD after next action? - apache-spark

In spark programming, when I call persist/cache() on an RDD, I found that its lifespan for reusability is not optimal in many cases:
Namely, it always last for a few hours, after which an RDD storage is evicted from executor's memory and disk. This sometimes cause performance/GC problems: sometimes an RDD storage drains memory long after the RDD implementation itself on driver has been garbage collected (until a few hours later, but for a job that cache/checkpoint often this is still inefficient). Sometimes vice versa: an RDD storage get evicted even the RDD object is still referenced by the driver jvm, and it may be reused later.
I'm looking for a way to override it. The "unpersist()" function is rarely useful: due to lazy execution it can only be called after next action, which can't be determined by the time it is created. Is there a pattern to mark an RDD as "unpersist after next action"? This can save a lot of memory and disk space.

Related

how spark handles out of memory error when cached( MEMORY_ONLY persistence) data does not fit in memory?

I'm new to the spark and i am not able to find clear answer that What happens when a cached data does not fit in memory?
many places i found that If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
for example:lets say 500 partition is created and say 200 partition didn't cached then again we have to re-compute the remaining 200 partition by re-evaluating the RDD.
If that is the case then OOM error should never occur but it does.What is the reason?
Detailed explanation is highly appreciated.Thanks in advance
There are different ways you can persist in your dataframe in spark.
1)Persist (MEMORY_ONLY)
when you persist data frame with MEMORY_ONLY it will be cached in spark.cached.memory section as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level and can some times cause OOM when the RDD is too big and cannot fit in memory(it can also occur after recalculation effort).
To answer your question
If that is the case then OOM error should never occur but it does.What is the reason?
even after recalculation you need to fit those rdd in memory. if there no space available then GC will try to clean some part and try to allocate it.if not successfully then it will fail with OOM
2)Persist (MEMORY_AND_DISK)
when you persist data frame with MEMORY_AND_DISK it will be cached in spark.cached.memory section as deserialized Java objects if memory is not available in heap then it will be spilled to disk. to tackle memory issues it will spill down some part of data or complete data to disk. (note: make sure to have enough disk space in nodes other no-disk space errors will popup)
3)Persist (MEMORY_ONLY_SER)
when you persist data frame with MEMORY_ONLY_SER it will be cached in spark.cached.memory section as serialized Java objects (one-byte array per partition). this is generally more space-efficient than MEMORY_ONLY but it is a cpu-intensive task because compression is involved (general suggestion here is to use Kyro for serialization) but this still faces OOM issues similar to MEMORY_ONLY.
4)Persist (MEMORY_AND_DISK_SER)
it is similar to MEMORY_ONLY_SER but one difference is when no heap space is available then it will spill RDD array to disk the same as (MEMORY_AND_DISK) ... we can use this option when you have a tight constraint on disk space and you want to reduce IO traffic.
5)Persist (DISK_ONLY)
In this case, heap memory is not used.RDD's are persisted to disk. make sure to have enough disk space and this option will have huge IO overhead. don't use this when you have dataframes that are repeatedly used.
6)Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)
These are similar to above mentioned MEMORY_ONLY and MEMORY_AND_DISK. the only difference is these options replicate each partition on two cluster nodes just to be on the safe side.. use these options when you are using spot instances.
7)Persist (OFF_HEAP)
Off heap memory generally contains thread stacks, spark container application code, network IO buffers, and other OS application buffers. even you can utilize this part of the memory from RAM for caching your RDD with the above option.

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
From what I know so far that there is not data movement that happens when we used cache a dataframe and it is just saved on executor's memory. So it should be just a matter of setting/unsetting a flag.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger DataFrame persistence / caching. With the action you do add an extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to executors (memory, disk or their combinations). This moving data from where it lives to executors does add an extra overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on executors). They're peers to exchange blocks on demand (using torrent-like protocol).
Unpersisting a DataFrame is simply to send a request (sync or async) to BlockManagers to remove RDD blocks. If it happens in async manner, the overhead is none (minus the extra work executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD are just abstractions to describe distributed computations and do nothing at creation time, this persist / unpersist is just setting / unsetting a flag.
The change can be noticed at execution time.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.

which is faster in spark, collect() or toLocalIterator()

I have a spark application in which I need to get the data from executors to driver and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on Internet, it returns an iterator rather than sending whole RDD instantly, so it has better memory performance, but what about speed? How is the performance between collect() and toLocalIterator() when it comes to execution/computation time?
The answer to this question depends on what would you do after making df.collect() and df.rdd.toLocalIterator(). For example, if you are processing a considerably big file about 7M rows and for each of the records in there, after doing all the required transformations, you needed to iterate over each of the records in the DataFrame and make a service calls in batches of 100.
In the case of df.collect(), it will dumping the entire set of records to the driver, so the driver will need an enormous amount of memory. Where as in the case of toLocalIterator(), it will only return an iterator over a partition of the total records, hence the driver does not need to have enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cause you a lot of expense, where as toLocalIterator() will not and it will be faster and reliable as well.
On the other hand if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also if your file size is so small that Spark's default partitioning logic does not break it down into partitions at all then df.collect() will be more faster.
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However lower memory footprint can partially compensate that, depending on a particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
As for the toLocalIterator, it is used to collect the data from the RDD scattered around your cluster into one only node, the one from which the program is running, and do something with all the data in the same node. It is similar to the collect method, but instead of returning a List it will return an Iterator.
So, after applying a function to an RDD using foreach you can call toLocalIterator to get an iterator to all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it to an RDD again after doing the operations you need, use the SparkContext to parallelize it.

Does Spark write intermediate shuffle outputs to disk

I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:
Spark’s internal scheduler may truncate the lineage of the RDD graph
if an existing RDD has already been persisted in cluster memory or on
disk. A second case in which this truncation can happen is when an RDD
is already materialized as a side effect of an earlier shuffle, even
if it was not explicitly persisted. This is an under-the-hood
optimization that takes advantage of the fact that Spark shuffle
outputs are written to disk, and exploits the fact that many times
portions of the RDD graph are recomputed.
As I understand there are different persistence policies, for example, the default MEMORY_ONLY which means the intermediate result will never be persisted to the disk.
When and why will a shuffle persist something on disk? How can that be reused by further computations?
When
It happens with when operation that requires shuffle is first time evaluated (action) and cannot be disabled
Why
This is an optimization. Shuffling is one of the expensive things that happen in Spark.
How that can be reused by further computations?
It is automatically reused with any subsequent action executed on the same RDD.

Is Spark RDD cached on worker node or driver node (or both)?

Can any one please correct my understanding on persisting by Spark.
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially.
Meaning, If there is a cluster of 100 Nodes, and RDD is computed in partitions of first and second nodes. If we cached this RDD, then Spark is going to cache its value only in first or second worker nodes.
So when this Spark application is trying to use this RDD in later stages, then Spark driver has to get the value from first/second nodes.
Am I correct?
(OR)
Is it something that the RDD value is persisted in driver memory and not on nodes ?
Change this:
then Spark is going to cache its value only in first or second worker nodes.
to this:
then Spark is going to cache its value only in first and second worker nodes.
and...Yes correct!
Spark tries to minimize the memory usage (and we love it for that!), so it won't make any unnecessary memory loads, since it evaluates every statement lazily, i.e. it won't do any actual work on any transformation, it will wait for an action to happen, which leaves no choice to Spark, than to do the actual work (read the file, communicate the data to the network, do the computation, collect the result back to the driver, for example..).
You see, we don't want to cache everything, unless we really can to (that is that the memory capacity allows for it (yes, we can ask for more memory in the executors or/and the driver, but sometimes our cluster just doesn't have the resources, really common when we handle big data) and it really makes sense, i.e. that the cached RDD is going to be used again and again (so caching it will speedup the execution of our job).
That's why you want to unpersist() your RDD, when you no longer need it...! :)
Check this image, is from one of my jobs, where I had requested 100 executors, however the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver:
RDD.cache is a lazy operation. it does nothing until unless you call an action like count. Once you call the action the operation will use the cache. It will just take the data from the cache and do the operation.
RDD.cache- Persists the RDD with default storage level (Memory only).
Spark RDD API
2.Is it something that the RDD value is persisted in driver memory and not on nodes ?
RDD can be persisted to disk and Memory as well . Click on the link to Spark document for all the option
Spark Rdd Persist
# no actual caching at the end of this statement
rdd1=sc.read('myfile.json').rdd.map(lambda row: myfunc(row)).cache()
# again, no actual caching yet, because Spark is lazy, and won't evaluate anything unless
# a reduction op
rdd2=rdd2.map(mysecondfunc)
# caching is done on this reduce operation. Result of rdd1 will be cached in the memory of each worker node
n=rdd1.count()
So to answer your question
If we have performed a cache() on an RDD its value is cached only on those nodes where actually RDD was computed initially
The only possibility of caching something is on worker nodes, and not on driver nodes.
cache function can only be applied to an RDD (refer), and since RDD only exists on the worker node's memory (Resilient Distributed Datasets!), it's results are cached in the respective worker node memory. Once you apply an operation like count which brings back the result to the driver, it's not really an RDD anymore, it's merely a result of computation done RDD by the worker nodes in their respective memories
Since cache in the above example was called on rdd2 which is still on multiple worker nodes, the caching only happens on the worker node's memory.
In the above example, when do some map-red op on rdd1 again, it won't read the JSON again, because it was cached
FYI, I am using the word memory based on the assumption that the caching level is set to MEMORY_ONLY. Of course, if that level is changed to others, Spark will cache to either memory or storage based on the setting
Here is an excellent answer on caching
(Why) do we need to call cache or persist on a RDD
Basically caching stores the RDD in the memory / disk (based on persistence level set) of that node, so that the when this RDD is called again it does not need to recompute its lineage (lineage - Set of prior transformations executed to be in the current state).

Resources