How long does RDD remain in memory? - apache-spark

Since memory is limited, I assume Spark automatically removes RDDs from each node at some point. Is that time configurable? How does Spark decide when to evict an RDD from memory?
Note: I'm not talking about rdd.cache()

I'd like to know is this time configurable? How does spark decide when
to evict an RDD from memory
An RDD is an object just like any other. If you don't persist/cache it, it behaves like any other object in a managed language and is collected once no live root objects point to it.
The "how" part, as @Jacek points out, is the responsibility of an object called ContextCleaner. If you want the details, this is what the cleaning method looks like:
private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}
If you want to learn more, I suggest browsing Spark's source or, even better, reading @Jacek's book "Mastering Apache Spark" (this points to an explanation of ContextCleaner).
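The mechanism described above (cleanup is triggered when the driver-side object becomes unreachable) can be illustrated with a small Python sketch. This is only an analogy using Python's weakref module; ToyCleaner and ToyRDD are hypothetical names and none of this is Spark's actual ContextCleaner code.

```python
import gc
import weakref


class ToyCleaner:
    """Toy stand-in for a driver-side cleaner; not Spark's ContextCleaner."""

    def __init__(self):
        self.cleaned = []  # ids whose cleanup task has run

    def register_for_cleanup(self, obj, obj_id):
        # weakref.finalize runs the callback once obj becomes unreachable.
        # The callback must not hold a strong reference to obj itself.
        weakref.finalize(obj, self.cleaned.append, obj_id)


class ToyRDD:
    """Hypothetical driver-side handle identified by an integer id."""

    def __init__(self, rdd_id):
        self.rdd_id = rdd_id


cleaner = ToyCleaner()
rdd = ToyRDD(7)
cleaner.register_for_cleanup(rdd, rdd.rdd_id)

before = list(cleaner.cleaned)  # nothing cleaned while `rdd` is referenced
del rdd                         # drop the last strong reference
gc.collect()                    # force a collection so the demo is deterministic
after = list(cleaner.cleaned)   # the cleanup task for id 7 has now run
```

In this sketch there is no timer involved: cleanup happens as a consequence of garbage collection, which is the point the answer makes about unpersisted RDDs.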

In general, it is as Yuval Itzchakov wrote, "just like any other object", but... (there's always a "but", isn't there?)
In Spark, it's not that obvious since we have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on executors, which somehow have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes on stage. It is the Spark application's garbage collector, responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, with the aim of reducing the memory requirements of long-running, data-heavy Spark applications.
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (provided the spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
You can see it at work by taking a thread dump of a Spark application using jconsole or jstack. ContextCleaner uses a daemon thread named Spark Context Cleaner that cleans RDD, shuffle, and broadcast states.
You can also see its work by enabling INFO or DEBUG logging levels for org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG

Measuring the Impact of GC
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.
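As a minimal sketch of how these flags might be passed to executors, the helper below builds spark-submit arguments around the spark.executor.extraJavaOptions property. The function name gc_logging_conf is hypothetical, and the flags are the ones quoted above; they may not be accepted by newer JVMs that use a different GC logging syntax.

```python
# JVM flags quoted in the tuning guide excerpt above.
GC_LOGGING_FLAGS = ["-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps"]


def gc_logging_conf(extra_flags=()):
    """Return spark-submit arguments that enable GC logging on executors."""
    java_opts = " ".join(list(GC_LOGGING_FLAGS) + list(extra_flags))
    return ["--conf", "spark.executor.extraJavaOptions=" + java_opts]


# Arguments to append to a spark-submit command line.
args = gc_logging_conf()
```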
Advanced GC Tuning
To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
Java Heap space is divided in to two regions Young and Old. The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.
The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].
A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.

According to the Resilient Distributed Data-set paper -
Our worker nodes cache RDD partitions in memory as
Java objects. We use an LRU replacement policy at the
level of RDDs (i.e., we do not evict partitions from an
RDD in order to load other partitions from the same
RDD) because most operations are scans. We found this
simple policy to work well in all our user applications so
far. Programmers that want more control can also set a
retention priority for each RDD as an argument to cache.
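The policy quoted above can be sketched as a toy model: an LRU store of (rdd_id, partition) blocks that never evicts a block of the same RDD as the one being inserted. ToyBlockStore is a hypothetical class written only to illustrate the quoted rule; it is not Spark's memory store and ignores block sizes.

```python
from collections import OrderedDict


class ToyBlockStore:
    """Toy LRU cache of (rdd_id, partition) blocks; not Spark's MemoryStore."""

    def __init__(self, capacity):
        self.capacity = capacity     # max number of cached blocks
        self.blocks = OrderedDict()  # ordered from least to most recently used

    def get(self, rdd_id, partition):
        key = (rdd_id, partition)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return True
        return False

    def put(self, rdd_id, partition):
        key = (rdd_id, partition)
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return True
        while len(self.blocks) >= self.capacity:
            # Oldest block that belongs to a *different* RDD, if any.
            victim = next((k for k in self.blocks if k[0] != rdd_id), None)
            if victim is None:
                return False  # only same-RDD blocks remain: do not cache
            del self.blocks[victim]
        self.blocks[key] = True
        return True


store = ToyBlockStore(capacity=2)
store.put(1, 0)
store.put(1, 1)
same_rdd_cached = store.put(1, 2)   # would have to evict RDD 1's own blocks
other_rdd_cached = store.put(2, 0)  # evicts the oldest block of RDD 1
remaining = list(store.blocks)
```

Here the third partition of RDD 1 is simply not cached, while a block of RDD 2 displaces the least recently used block of RDD 1.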

Related

When does Spark evict broadcasted dataframe from Executors?

I have a question about what happens when we broadcast a dataframe.
Copies of the broadcast dataframe are sent to each executor.
So, when does Spark evict these copies from each executor?
I find this topic functionally easy to understand, but the manuals are harder to follow technically, and improvements are always in the offing.
My take:
There is a ContextCleaner that is running on the Driver for every Spark App.
It is created and immediately started when the SparkContext starts.
It deals with all sorts of objects in Spark.
The ContextCleaner thread cleans RDD, shuffle, broadcast and accumulator state using the keepCleaning method of this class, which runs for as long as the cleaner is alive. It determines which objects need eviction because they are no longer referenced, and these are placed on a queue. Objects are registered for tracking through methods such as registerShuffleForCleanup. That is to say, a check is made to see whether any live root objects still point to a given object; if none do, that object is eligible for clean-up (eviction).
context-cleaner-periodic-gc asynchronously requests the standard JVM garbage collector. Periodic runs of this are started when ContextCleaner starts and stopped when ContextCleaner terminates.
Spark makes use of the standard Java GC.
This https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html is a good reference next to the Spark official docs.

DataFrame Lifespan in memory, Spark?

My question is about memory management and GC in Spark internals.
If I create an RDD, how long will it live in my executor memory?
# Program starts
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("").master("yarn").getOrCreate()
df = spark.range(10)
df.show()
# other operations
# Program ends
Will it be deleted automatically once execution finishes? If yes, is there any way to delete it manually during program execution?
How and when is garbage collection called in Spark? Can we implement a custom GC, as in a Java program, and use it in Spark?
DataFrames are Java objects, so once no references to one remain, it is eligible for garbage collection.
Spark - Scope, Data Frame, and memory management
Calling a custom GC is not possible.
Manually calling spark's garbage collection from pyspark
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview
"how long it will leave in my Executor memory."
In this particular case Spark will never materialize the full dataset; instead it iterates through it row by row. Only a few operators materialize the full dataset, including sorts, joins, group-bys and writes.
"Will it be automatically deleted once my Execution finishes."
Spark automatically cleans up any temporary data.
"If Yes, Is there any way to delete it manually during program execution."
Spark only keeps data around if it is in use or has been manually persisted. What are you trying to accomplish in particular?
"How and when Garbage collection called in Spark."
Spark runs on the JVM, and the JVM will automatically run GC when certain thresholds are hit.

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
From what I know so far, no data movement happens when we cache a dataframe; it is just saved in the executors' memory. So it should be just a matter of setting/unsetting a flag.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger DataFrame persistence / caching. With the action you do add an extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to executors (memory, disk or their combinations). This moving data from where it lives to executors does add an extra overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on executors). They act as peers that exchange blocks on demand (using a torrent-like protocol).
Unpersisting a DataFrame simply sends a request (sync or async) to the BlockManagers to remove the RDD blocks. If it happens asynchronously, the overhead is none (apart from the extra work executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD are just abstractions to describe distributed computations and do nothing at creation time, this persist / unpersist is just setting / unsetting a flag.
The change can be noticed at execution time.
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.
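The "flag now, cost at execution time" idea from this answer can be shown with a toy lazy dataset in plain Python. ToyDataset is a hypothetical class that only mimics the behaviour described; it does not use or reproduce any Spark API.

```python
class ToyDataset:
    """Toy lazy dataset; mimics the flag-then-materialize idea only."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the data
        self._persist = False    # the "flag" set by persist()
        self._cached = None      # filled in by the first action
        self.computations = 0    # how many times compute actually ran

    def persist(self):
        self._persist = True     # cheap: nothing is computed here
        return self

    def unpersist(self):
        self._persist = False
        self._cached = None      # drop the materialized data
        return self

    def collect(self):
        """An 'action': computes the data, and caches it if the flag is set."""
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data


ds = ToyDataset(lambda: [x * x for x in range(5)]).persist()
runs_after_persist = ds.computations      # persist() alone did no work
ds.collect()
ds.collect()
runs_after_two_actions = ds.computations  # the second action hit the cache
```

The extra cost appears only in the first action, which both computes and stores the data; later actions reuse it.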

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5 node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2 compressed CSV file and ended up with a perfectly aggregated result.
Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:
16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...
Here is my EMR config:
[
  {
    "classification":"spark-env",
    "properties":{
    },
    "configurations":[
      {
        "classification":"export",
        "properties":{
          "PYSPARK_PYTHON":"python34"
        },
        "configurations":[
        ]
      }
    ]
  },
  {
    "classification":"spark",
    "properties":{
      "maximizeResourceAllocation":"true"
    },
    "configurations":[
    ]
  }
]
From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. I.e., I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error?
Here is the code I'm running:
import statistics
import sys

import arrow
import pyspark
import pyspark.sql
import pyspark.sql.functions

# SESSION_TIMEOUT and save_session are defined elsewhere (not shown).

def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions

def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated

spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)
# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))
sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)
Basically, nothing more than windowing and a groupBy to aggregate the data.
It starts with a few of those errors, and their number keeps increasing until the job halts.
I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
I feel your pain..
We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16 core VMs and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we gave them. And this was a relatively straightforward Spark application that was causing this to happen.
We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
Doing more research, I found the answer to why this happens here: http://web.archive.org/web/20190806000138/https://mapr.com/blog/best-practices-yarn-resource-management/
Since on Centos/RHEL 6 there are aggressive allocation of virtual memory due to OS behavior, you should disable virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
That page had a link to a very useful page from IBM: https://web.archive.org/web/20170703001345/https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.
Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
I recommend reading through both pages -- the information is very handy.
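To make the virtual memory check concrete, here is a small arithmetic sketch. It assumes YARN's documented default of 2.1 for yarn.nodemanager.vmem-pmem-ratio; the function names are hypothetical and the 30 GB virtual usage figure is an invented example value, not a measurement from this thread.

```python
def vmem_limit_gb(container_pmem_gb, vmem_pmem_ratio=2.1):
    """Virtual memory a container may use before the vmem check kills it."""
    return container_pmem_gb * vmem_pmem_ratio


def exceeds_vmem(container_pmem_gb, vmem_used_gb, vmem_pmem_ratio=2.1):
    """True if the container's virtual memory usage is over the limit."""
    return vmem_used_gb > vmem_limit_gb(container_pmem_gb, vmem_pmem_ratio)


# A 10.4 GB container, as in the error message from the question.
limit = vmem_limit_gb(10.4)
# Hypothetical process that has reserved 30 GB of virtual memory.
killed_default = exceeds_vmem(10.4, 30.0)
killed_relaxed = exceeds_vmem(10.4, 30.0, vmem_pmem_ratio=4.0)
```

With the default ratio the hypothetical container would be killed even if its physical usage were low; raising the ratio (or disabling the check) avoids that.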
If you're not using spark-submit, and you're looking for another way to specify the yarn.nodemanager.vmem-check-enabled parameter mentioned by Duff, here are 2 other ways:
Method 2
If you're using a JSON Configuration file (that you pass to the AWS CLI or to your boto3 script), you'll have to add the following configuration:
[{
"Classification": "yarn-site",
"Properties": {
"yarn.nodemanager.vmem-check-enabled": "false"
}
}]
Method 3
If you use the EMR console, add the following configuration:
classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]
I had the same problem in a huge cluster that I'm working on now. The problem will not be solved by adding memory to the workers. Sometimes during aggregation Spark will use more memory than it has, and the Spark jobs will start to use off-heap memory.
One simple example:
If you have a dataset that you need to reduceByKey, it will sometimes aggregate more data on one worker than on another, and if this data exceeds the memory of that worker you get that error message.
Setting spark.yarn.executor.memoryOverhead will help if you set it to 50% of the memory used by the worker (just as a test to see whether it works; you can lower it after more tests).
But you need to understand how Spark allocates memory in the cluster:
Most commonly, Spark uses 75% of the machine's memory. The rest goes to the OS.
Spark has two types of memory during execution: one part is for execution and the other is for storage. Execution memory is used for shuffles, joins, aggregations, etc. Storage memory is used for caching and for propagating data across the cluster.
One good thing about this memory allocation: if you are not using cache in your job, Spark can use that storage space for execution, which partly avoids the OOM error. As the Spark documentation puts it:
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
But how can we use that?
You can change some configurations. Add the memoryOverhead configuration to your job call, but consider adding this too: raise spark.memory.fraction to 0.8 or 0.85 and reduce spark.memory.storageFraction to 0.35 or 0.2.
Other configurations can help, but that needs to be checked for your case. See all these configurations here.
Now, what helped in my case.
I have a cluster with 2.5K workers and 2.5TB of RAM, and we were facing OOM errors like yours. We just increased spark.yarn.executor.memoryOverhead to 2048 and enabled dynamic allocation. When we call the job, we don't set the memory for the workers; we leave that for Spark to decide. We just set the overhead.
But in some tests on my small cluster, changing the size of execution and storage memory solved the problem.
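The effect of these settings can be estimated with some simple arithmetic. The sketch below assumes the Spark 2.x documented behaviour: a default overhead of 10% of executor memory with a 384 MB minimum, a fixed 300 MB reservation, and defaults of 0.6 and 0.5 for spark.memory.fraction and spark.memory.storageFraction. The function names and the 8192 MB executor size are hypothetical, and other Spark versions may use different defaults.

```python
RESERVED_MB = 300  # fixed reservation subtracted from the heap (assumed, Spark 2.x)


def container_mb(executor_memory_mb, overhead_mb=None):
    """Memory YARN must grant: executor heap plus off-heap overhead."""
    if overhead_mb is None:
        # Assumed default: 10% of the heap, but at least 384 MB.
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb


def unified_memory_mb(executor_memory_mb, fraction=0.6, storage_fraction=0.5):
    """Split the unified region into protected storage and the remainder."""
    usable = (executor_memory_mb - RESERVED_MB) * fraction
    storage = usable * storage_fraction
    return {"usable": usable, "storage": storage, "execution": usable - storage}


# Hypothetical 8 GB executor heap.
default_container = container_mb(8192)                  # default overhead
tuned_container = container_mb(8192, overhead_mb=2048)  # overhead set to 2048
defaults = unified_memory_mb(8192)
tuned = unified_memory_mb(8192, fraction=0.8, storage_fraction=0.2)
```

Under these assumptions, raising the fraction and lowering the storage fraction noticeably increases the memory that execution can rely on, at the cost of less space for user data structures and cached blocks.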
Try repartition. It worked in my case.
The dataframe was not so big at the very beginning, when it was loaded with read.csv(). The data file was 10 MB or so, which might require, say, several hundred MB of memory in total for each processing task in an executor.
I checked the number of partitions to be 2 at the time.
Then it grew like a snowball during the following operations: joining with other tables and adding new columns. And then I ran into the memory limit issue at a certain step.
I checked the number of partitions again; it was still 2, inherited from the original data frame, I guess.
So I tried to repartition it at the very beginning, and there was no problem anymore.
I have not read much material about Spark and YARN yet. What I do know is that there are executors on nodes. An executor can handle many tasks depending on the resources. My guess is that one partition is atomically mapped to one task, and its volume determines the resource usage. Spark cannot slice a partition if it grows too big.
A reasonable strategy is to determine the node and container memory first, say 10GB or 5GB. Ideally, either could serve any data processing job; it is just a matter of time. Given the 5GB memory setting, suppose that after testing you find a reasonable number of rows per partition is 1000 (that is, no step fails during processing). We could then do the following, in pseudo code:
import math
ROWS_PER_PARTITION = 1000
# `spark` is an existing SparkSession; "file_uri" and other_args are placeholders
input_df = spark.read.csv("file_uri", *other_args)
total_rows = input_df.count()
original_num_partitions = input_df.rdd.getNumPartitions()
num_partitions = max(math.ceil(total_rows / ROWS_PER_PARTITION), original_num_partitions)
input_df = input_df.repartition(num_partitions)
Hope it helps!
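The partition count calculation from this answer can be isolated into a small pure-Python helper that is easy to check without a cluster. The function name choose_num_partitions is hypothetical; the result would be passed to repartition() in a real job.

```python
import math


def choose_num_partitions(total_rows, rows_per_partition, current_partitions):
    """Partitions needed so that none holds more than rows_per_partition rows."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    needed = math.ceil(total_rows / rows_per_partition)
    # Never go below the current partition count, and never below 1.
    return max(needed, current_partitions, 1)


n_small = choose_num_partitions(1500, 1000, 2)     # 2 partitions are enough
n_large = choose_num_partitions(250_000, 1000, 2)  # needs 250 partitions
```

Rounding up rather than down keeps every partition at or below the tested row limit.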
I had the same issue on a small cluster running a relatively small job on Spark 2.3.1.
The job reads a parquet file, removes duplicates using groupBy/agg/first, then sorts and writes a new parquet file. It processed 51 GB of parquet files on 4 nodes (4 vcores, 32 GB RAM).
The job was constantly failing at the aggregation stage. I wrote a bash script to watch the executors' memory usage and found that in the middle of the stage one random executor starts taking double the memory for a few seconds. When I correlated the time of this moment with the GC logs, it matched a full GC that frees a large amount of memory.
At last I understood that the problem is somehow related to GC. ParallelGC and G1 cause this issue constantly, but ConcMarkSweepGC improves the situation. The issue appears only with a small number of partitions. I ran the job on EMR where OpenJDK 64-Bit (build 25.171-b10) was installed. I don't know the root cause of the issue; it could be related to the JVM or the operating system. But it is definitely not related to heap or off-heap usage in my case.
UPDATE1
Tried Oracle HotSpot, the issue is reproduced.

Apache Spark: Unpersisting RDD after next action?

In Spark programming, when I call persist/cache() on an RDD, I find that its lifespan for reuse is not optimal in many cases:
Namely, it always lasts for a few hours, after which the RDD storage is evicted from the executors' memory and disk. This sometimes causes performance/GC problems: sometimes the RDD storage drains memory long after the RDD object itself has been garbage collected on the driver (it is only released a few hours later, which is still inefficient for a job that caches/checkpoints often). Sometimes vice versa: the RDD storage gets evicted even though the RDD object is still referenced by the driver JVM and may be reused later.
I'm looking for a way to override it. The "unpersist()" function is rarely useful: due to lazy execution it can only be called after the next action, which can't be determined at the time the RDD is created. Is there a pattern to mark an RDD as "unpersist after next action"? This can save a lot of memory and disk space.

Resources