PySpark G1 Garbage Collector - garbage-collection

I had an error in my PySpark application saying the garbage collector ran out of memory.
I read an article about the G1 garbage collector, so I want to try it.
How do I set it in a PySpark application? I couldn't find any instructions on that.
Thanks!

The main point to remember here is that the cost of garbage collection is proportional to the number of Java objects.
This is the start...
To optimize it in Spark, see this and this to discover the principal configuration parameters to use.
The rest is experience...
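As a concrete starting point, here is a minimal sketch of switching the executors to G1 in a PySpark application. spark.executor.extraJavaOptions is the standard property for passing JVM flags to the executors; the application name is just a placeholder.

# A minimal sketch: enable the G1 collector on the executor JVMs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("g1-gc-example")                                   # placeholder name
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  # executor GC flag
    .getOrCreate()
)

# Note: the driver JVM is already running at this point, so driver-side flags
# (spark.driver.extraJavaOptions) should be passed via spark-submit or
# spark-defaults.conf instead of being set here.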

Related

DataFrame Lifespan in memory, Spark?

My question is more related to memory management and GC in Spark internally.
If I create an RDD, how long will it live in my executor memory?
# Program starts
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("").master("yarn").getOrCreate()
df = spark.range(10)
df.show()
# other operations
# Program ends!!!
Will it be automatically deleted once my execution finishes? If yes, is there any way to delete it manually during program execution?
How and when is garbage collection called in Spark? Can we implement a custom GC like in a Java program and use it in Spark?
DataFrames are Java objects, so if no reference is found, your object is eligible for garbage collection.
Spark - Scope, Data Frame, and memory management
Calling Custom gc not possible
Manually calling spark's garbage collection from pyspark
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview
"how long it will leave in my Executor memory."
In this particular case spark will no materialize the full dataset ever, instead it will iterate through one by one. Only a few operators materialize the full dataset. This includes, sorts/joins/groupbys/writes/etc
"Will it be automatically deleted once my Execution finishes."
spark automatically cleans any temp data.
"If Yes, Is there any way to delete it manually during program execution."
spark only keeps that data around if its in use or has been manually persisted. what are you trying to accomplish in particular?
"How and when Garbage collection called in Spark."
Spark runs on the JVM and the JVM with automatically GC when certain metrics are hit.
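On the "delete it manually" point, a small sketch of what that usually looks like in PySpark (assuming an existing SparkSession named spark): unpersist what you persisted, or clear the whole cache, and let the JVM reclaim the blocks on its own schedule.

# A small sketch: explicitly releasing cached data during program execution.
df = spark.range(10).cache()
df.count()                   # materializes and caches the DataFrame
df.unpersist()               # drops this DataFrame's cached blocks
spark.catalog.clearCache()   # drops every cached table/DataFrame in the session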

Profiling spark executor memory

I have been wanting to find a good way to profile a Spark application's executors when it is run from a Jupyter notebook interface. I basically want to see details like heap memory usage, young and perm gen memory usage, etc. over time for a particular executor (at least the ones that fail).
I see many solutions out there, but nothing that seems mature and easy to install/use.
Are there any good tools that let me do this easily?
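One lightweight option, sketched below rather than a full profiler, is to poll the driver's REST API from the notebook: the executors endpoint reports per-executor memory and GC counters. The field names used here (memoryUsed, maxMemory, totalGCTime) come from the ExecutorSummary payload and may vary slightly between Spark versions, and localhost:4040 is an assumption about where the driver UI runs.

# A rough sketch: sample executor memory/GC stats over time via the Spark UI's REST API.
import time
import requests

base = "http://localhost:4040/api/v1"      # adjust host/port for your driver UI
app_id = requests.get(f"{base}/applications").json()[0]["id"]

for _ in range(10):                         # take 10 samples, 5 seconds apart
    for ex in requests.get(f"{base}/applications/{app_id}/executors").json():
        print(ex["id"], ex.get("memoryUsed"), ex.get("maxMemory"), ex.get("totalGCTime"))
    time.sleep(5)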

Is garbage collection time part of execution time of a task in apache spark?

I am a beginner in Apache Spark and came across the garbage collection time of tasks in the Spark web UI. Does the execution time of a task include the task's garbage collection time?
The answer is yes: the garbage collection time shown in the Spark UI is part of the total execution time of a task. If your GC is taking more time than the real execution, you should check what you are doing.
If you are facing any problem with GC, there are tons of ways to improve Spark's memory usage or its GC behavior.
According to the Databricks blog, long GC times are a recurring problem in any big company that uses GBs of memory to execute its tasks:
For example, garbage collection takes a long time, causing the program to experience long delays, or even crash in severe cases.
You can see the full text here.
You can also look at how to improve or tune your Spark application to avoid long GC times, the GC Overhead Limit error, or even OOM errors during execution.
Please check this part of the documentation.
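If GC time dominates, one common first step, sketched here with placeholder sizes that you would tune for your own cluster, is to give each executor a larger heap and fewer concurrent tasks so that fewer objects are alive at once.

# A rough sketch: trade executor parallelism for per-executor heap to ease GC pressure.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning-example")             # placeholder name
    .config("spark.executor.memory", "8g")    # larger heap per executor (placeholder value)
    .config("spark.executor.cores", "2")      # fewer concurrent tasks per heap (placeholder value)
    .getOrCreate()
)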

How long does an RDD remain in memory?

Considering that memory is limited, I had a feeling that Spark automatically removes RDDs from each node. I'd like to know: is this time configurable? How does Spark decide when to evict an RDD from memory?
Note: I'm not talking about rdd.cache()
I'd like to know: is this time configurable? How does Spark decide when to evict an RDD from memory?
An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would, and be collected once there are no live references pointing to it.
The "how" part, as @Jacek points out, is the responsibility of an object called ContextCleaner. If you want the details, this is what the cleaning method looks like:
private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}
If you want to learn more, I suggest browsing Spark's source or, even better, reading @Jacek's book "Mastering Apache Spark" (this points to an explanation regarding ContextCleaner).
In general, it's as Yuval Itzchakov wrote: "just like any other object", but... (there's always a "but", isn't there?)
In Spark, it's not that obvious, since we have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on the executors. They will somehow have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes into play. It's the Spark application's garbage collector, responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, aimed at reducing the memory requirements of long-running, data-heavy Spark applications.
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (and the spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
You can see it working by doing the dump of all the threads in a Spark application using jconsole or jstack. ContextCleaner uses a daemon Spark Context Cleaner thread that cleans RDD, shuffle, and broadcast states.
You can also see its work by enabling the INFO or DEBUG logging level for the org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG
Measuring the Impact of GC
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.
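In PySpark terms, a minimal sketch of passing exactly those flags to the executors looks like this (the application name is a placeholder):

# A minimal sketch: enable GC logging on the executors; the messages appear
# in each executor's stdout, as described above.
from pyspark.sql import SparkSession

gc_opts = "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

spark = (
    SparkSession.builder
    .appName("gc-logging-example")
    .config("spark.executor.extraJavaOptions", gc_opts)
    .getOrCreate()
)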
Advanced GC Tuning
To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
Java heap space is divided into two regions: Young and Old. The Young generation is meant to hold short-lived objects, while the Old generation is intended for objects with longer lifetimes.
The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].
A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.
According to the Resilient Distributed Datasets paper:
Our worker nodes cache RDD partitions in memory as Java objects. We use an LRU replacement policy at the level of RDDs (i.e., we do not evict partitions from an RDD in order to load other partitions from the same RDD) because most operations are scans. We found this simple policy to work well in all our user applications so far. Programmers that want more control can also set a retention priority for each RDD as an argument to cache.
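In the current API, the closest user-facing control over retention is the storage level you pick when persisting; a small sketch (assuming an existing SparkSession named spark):

# A small sketch: choosing a storage level so evicted partitions spill to disk
# instead of being recomputed.
from pyspark import StorageLevel

df = spark.range(1000000)
df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk under memory pressure
df.count()                                 # materializes the cache
df.unpersist()                             # release the blocks when done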

Spark uses sophisticated ways to leverage memory space - explain

I was watching a video on Apache Spark here, where the speaker Paco Nathan says the following:
"If you have 128 GB of RAM, you are not going to throw them all at once at the JVM. That will just cause a lot of garbage collection. And so one of the things with Spark is, use more sophisticated ways to leverage the memory space, do more off-heap."
I am not able to understand what he says with regard to how Spark efficiently handles this scenario.
Also, more specifically, I did not understand the statement
"If you have 128 GB of RAM you are not going to throw them all at once at the JVM. That will just cause a lot of garbage collection."
Can someone explain what the reasoning actually is behind these statements?
"If you have 128 GB of RAM you are not going to throw them all at once
at the jvm.That will just cause of lot of garbage collection"
This means that you will not assign all the memory to the JVM only when there is memory requirement for other stuff like garbage collection, off-heap operations, etc.
Spark does this by assigning fractions of the memory(that you have assigned to Spark executors) for such operations as shown in image below(for Spark 1.5.0):
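As a rough sketch of that split in configuration terms, these are the legacy (pre-1.6) properties the Spark 1.5.0 answer refers to; Spark 1.6+ replaced them with the unified spark.memory.fraction / spark.memory.storageFraction model.

# A rough sketch, assuming the legacy (pre-Spark-1.6) memory model: only part of
# the executor heap is given to storage and shuffle, the rest is left for task
# objects and JVM/GC overhead. The fractions shown are the old defaults.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.executor.memory", "8g")           # total executor heap (placeholder)
    .set("spark.storage.memoryFraction", "0.6")   # fraction for cached blocks
    .set("spark.shuffle.memoryFraction", "0.2")   # fraction for shuffle buffers
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()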
