DataFrame Lifespan in memory, Spark? - apache-spark

My question is about memory management and garbage collection (GC) inside Spark.
If I create an RDD, how long will it live in my executor memory?
# Program starts
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("").master("yarn").getOrCreate()
df = spark.range(10)
df.show()
# other operations
# Program ends
Will it be deleted automatically once execution finishes? If yes, is there any way to delete it manually during program execution?
How and when is garbage collection invoked in Spark? Can we implement a custom GC, as in a Java program, and use it in Spark?

DataFrames are Java objects, so once no reference to an object remains, it is eligible for garbage collection.
Spark - Scope, Data Frame, and memory management
Calling a custom GC is not possible.
Manually calling spark's garbage collection from pyspark
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview

"how long it will live in my Executor memory."
In this particular case Spark will never materialize the full dataset; instead it iterates through the rows one by one. Only a few operators materialize the full dataset, including sorts, joins, group-bys, and writes.
"Will it be automatically deleted once my Execution finishes."
Spark automatically cleans up any temporary data.
"If Yes, Is there any way to delete it manually during program execution."
Spark only keeps data around if it is in use or has been manually persisted. What are you trying to accomplish in particular?
"How and when Garbage collection called in Spark."
Spark runs on the JVM, and the JVM runs GC automatically when certain thresholds are hit.

Related

Does spark automatically un-cache and delete unused dataframes?

I have the following strategy to change a dataframe df.
df = T1(df)
df.cache()
df = T2(df)
df.cache()
.
.
.
df = Tn(df)
df.cache()
Here T1, T2, ..., Tn are n transformations that return Spark dataframes. Repeated caching is used because df has to pass through a lot of transformations and is used multiple times in between; without caching, lazy evaluation of the transformations might make using df in between very slow. What I am worried about is that the n dataframes that are cached one by one will gradually consume the RAM. I read that Spark automatically un-caches "least recently used" items. Based on this I have the following queries:
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
Based on the answer to the above two queries, is the above strategy correct?
How is "least recently used" parameter determined? I hope that a dataframe, without any reference or evaluation strategy attached to it, qualifies as unused - am I correct?
Results are cached on Spark executors. A single executor runs multiple tasks and can hold multiple cached results in memory at a given point in time. Within one executor, cached results are ranked by when they were last requested: the one most recently used in a computation always has rank 1, and the others are pushed down. When the available space is full, the cached result with the lowest rank is dropped to make room for a new one.
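The ranking described above can be sketched with a small toy cache. This is only an illustration of least-recently-used eviction, not Spark's storage code: ToyLRUCache, its capacity (counted in entries rather than bytes), and the key names are all hypothetical.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy LRU store; illustrative only, not Spark's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity          # max number of entries (stand-in for memory size)
        self.blocks = OrderedDict()       # ordered from least to most recently used

    def get(self, key):
        if key not in self.blocks:
            return None                   # a miss: the result would have to be recomputed
        self.blocks.move_to_end(key)      # reading an entry makes it the most recent
        return self.blocks[key]

    def put(self, key, value):
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        evicted = []
        while len(self.blocks) > self.capacity:
            old_key, _ = self.blocks.popitem(last=False)  # drop the least recently used
            evicted.append(old_key)
        return evicted


cache = ToyLRUCache(capacity=2)
cache.put("df_after_T1", "rows1")
cache.put("df_after_T2", "rows2")
cache.get("df_after_T1")                      # touch T1, so T2 becomes the oldest
evicted = cache.put("df_after_T3", "rows3")   # over capacity: the oldest entry is dropped
print(evicted)                                # ['df_after_T2']
```

Note that in this sketch "used" means "read or written most recently"; whether a Python variable still points at the dataframe plays no part in the ranking.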
Does a spark dataframe, having no reference and evaluation strategy attached to it, get selected for garbage collection as well? Or does a spark dataframe never get garbage collected?
A dataframe is an execution expression, and unless an action is called, no computation is materialised. Moreover, everything is cleared once the executor is done with the computation for that task. Only when a dataframe is cached (before calling the action) are results kept aside in executor memory for further use, and these cached results are cleared based on LRU.
Based on the answer to the above two queries, is the above strategy correct?
Your example looks like the transformations are applied in sequence and the reference to the previous dataframe is not used afterwards (so it is unclear why you are using cache). If multiple executions run on the same executor, some results may be dropped, and they will be recomputed when requested again.
N.B. Nothing is executed unless a Spark action is called. Transformations are chained and optimised by the Spark engine when an action is called.
In my experience with Spark, and from my communication with Cloudera, we should unpersist/uncache the data ourselves; if we do not, the job will start to slow down, and the problem becomes more severe for streaming jobs.
I have nothing to support my answer but
read here and here for details

When does Spark evict broadcasted dataframe from Executors?

I have a question about what happens when we broadcast a dataframe.
Copies of the broadcasted dataframe are sent to each executor.
So when does Spark evict these copies from each executor?
I find this topic functionally easy to understand, but the manuals are harder to follow technically, and improvements are always in the offing.
My take:
There is a ContextCleaner that runs on the driver for every Spark application.
It is created and started immediately when the SparkContext starts.
It covers all sorts of objects in Spark.
The ContextCleaner thread cleans up RDD, shuffle, and broadcast state, as well as accumulators, using the keepCleaning method of this class, which runs continuously. It decides which objects need eviction because they are no longer referenced, and these are placed on a list. Objects are registered for tracking through methods such as registerShuffleForCleanup. That is to say, a check is made to see whether any live root objects still point to a given object; if none do, that object is eligible for clean-up and eviction.
The context-cleaner-periodic-gc thread asynchronously requests the standard JVM garbage collector. Its periodic runs start when ContextCleaner starts and stop when ContextCleaner terminates.
Spark makes use of the standard Java GC.
This https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html is a good reference next to the Spark official docs.
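The idea of "clean up once nothing references the object any more" can be sketched in plain Python with weak references. This is an analogy only: ToyBroadcast, ToyCleaner, and register_for_cleanup are hypothetical names, and Python's weakref.finalize stands in for the JVM mechanism described above.

```python
import gc
import weakref


class ToyBroadcast:
    """Stand-in for a driver-side handle such as a broadcast variable."""

    def __init__(self, broadcast_id):
        self.broadcast_id = broadcast_id


class ToyCleaner:
    """Tracks handles weakly and records a cleanup once a handle is unreachable."""

    def __init__(self):
        self.cleaned = []   # ids whose remote state would be removed

    def register_for_cleanup(self, handle):
        # weakref.finalize does not keep `handle` alive; the callback runs
        # after the handle has been garbage collected.
        weakref.finalize(handle, self.cleaned.append, handle.broadcast_id)


cleaner = ToyCleaner()
b = ToyBroadcast(broadcast_id=7)
cleaner.register_for_cleanup(b)

before = list(cleaner.cleaned)   # handle still referenced: nothing cleaned yet
del b                            # drop the last strong reference
gc.collect()                     # explicitly request a collection
after = list(cleaner.cleaned)
print(before, after)             # [] [7]
```

The point of the sketch is the ordering: cleanup is driven by garbage collection of the driver-side handle, which is why a driver that rarely runs GC benefits from a periodic GC request.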

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
From what I know so far, no data movement happens when we cache a dataframe; it is just saved in the executors' memory. So it should be just a matter of setting/unsetting a flag.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger DataFrame persistence / caching. With the action you do add an extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to executors (memory, disk, or a combination of the two). Moving the data from where it lives to the executors does add an extra overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on executors). They act as peers that exchange blocks on demand (using a torrent-like protocol).
Unpersisting a DataFrame simply sends a request (sync or async) to the BlockManagers to remove RDD blocks. If it happens asynchronously, there is no overhead (apart from the extra work executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD are just abstractions to describe distributed computations and do nothing at creation time, this persist / unpersist is just setting / unsetting a flag.
The change can be noticed at execution time.
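A toy model may make the "flag now, cost later" behaviour concrete. ToyDataset below is a hypothetical class written for illustration; it is not how Spark is implemented, and it only mimics the laziness described in this answer.

```python
class ToyDataset:
    """Toy lazy dataset; illustrative only, not Spark's implementation."""

    def __init__(self, compute):
        self.compute = compute            # function that produces the rows
        self.persist_requested = False
        self.stored = None                # filled in by the first action after persist()
        self.compute_calls = 0

    def persist(self):
        self.persist_requested = True     # cheap: only sets a flag
        return self

    def unpersist(self):
        self.persist_requested = False
        self.stored = None                # drop the stored rows
        return self

    def collect(self):                    # plays the role of an action
        if self.stored is not None:
            return self.stored
        self.compute_calls += 1
        rows = self.compute()
        if self.persist_requested:
            self.stored = rows            # the storage cost is paid here
        return rows


ds = ToyDataset(lambda: [x * x for x in range(5)]).persist()
calls_after_persist = ds.compute_calls    # 0: persist() did no work
ds.collect()
ds.collect()
calls_while_cached = ds.compute_calls     # 1: the second action reused stored rows
ds.unpersist()
ds.collect()
calls_after_unpersist = ds.compute_calls  # 2: recomputed after unpersist
```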
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.

Why does SparkContext.parallelize use memory of the driver?

Now I have to create a parallelized collection using sc.parallelize() in pyspark (Spark 2.1.0).
The collection in my driver program is big. When I parallelize it, I find it takes up a lot of memory on the master node.
It seems that the collection is still kept in Spark's memory on the master node after I parallelize it to each worker node.
Here's an example of my code:
# my python code
from pyspark import SparkContext
sc = SparkContext()
a = [1.0] * 1000000000
rdd_a = sc.parallelize(a, 1000000)
total = rdd_a.reduce(lambda x, y: x + y)  # named total to avoid shadowing the built-in sum
I've tried
del a
to destroy it, but it didn't work. Spark, which is a Java process, is still using a lot of memory.
After I create rdd_a, how can I destroy a to free the master node's memory?
Thanks!
The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current task. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.
Now, if the input were a file, the task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task effectively carries its own 1,000 numbers. And given that the master needs to keep track of all 1,000,000 tasks, that gets quite large.
The collection in my driver program is big. When I parallelize it, I find it takes up a lot of memory on the master node.
That's how it is supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.
Quoting the scaladoc of parallelize
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.
Note "a local Scala collection": that means the collection you want to map to an RDD (or create an RDD from) is already in the memory of the driver.
In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (that's already in the memory) is wrapped in this nice data abstraction called RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that. It's simply too late. But Spark plays nicely and pretends the data is as distributed as other datasets you could have processed using Spark.
That's why parallelize is only meant for small datasets to play around (and mainly for demos).
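To illustrate why the data has to sit on the driver first, here is a minimal sketch of splitting a local list into contiguous slices. slice_collection is a hypothetical helper written for this example; it shows the general idea of even slicing and is not taken from Spark's source.

```python
def slice_collection(data, num_slices):
    """Split a local list into num_slices contiguous, near-equal slices."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(data)
    # Slice i covers the index range [i*n // num_slices, (i+1)*n // num_slices).
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]


local_data = [1.0] * 10          # the whole collection already lives in driver memory
partitions = slice_collection(local_data, 4)
sizes = [len(p) for p in partitions]
print(sizes)                     # [2, 3, 2, 3]
```

Every element has to exist in the local list before any slice can be handed out, which is the reason the driver's memory use grows with the size of the collection.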
As Jacek's answer says, parallelize is only meant for demos with small datasets; within a parallelize call you can access any variable defined in the driver.

Is a static file loaded and unloaded for each batch operation in Spark?

I am using Spark to perform some operations on my data.
I need to use an auxiliary dictionary to help with my data operations.
streamData = sc.textFile("path/to/stream")
dict = sc.textFile("path/to/static/file")
//some logic like:
//if(streamData["field"] exists in dict)
// do something
My question is: is the dict in memory all the time or does it need to be loaded and unloaded each time Spark is working on a batch?
Thanks
Without an explicit cache, dict is only a lazy description of the file, so the file is read again whenever an action needs it. If you need to reuse it, call dict.cache() after initializing it; the cached data then stays in memory unless it has to be evicted to make room for other objects at runtime. You can also persist the RDD to disk with .persist(StorageLevel.DISK_ONLY) if it is too large to cache in memory. This post has a useful summary on RDD mechanics.
