Spark Driver Memory calculation - apache-spark

I know how to calculate executor cores and memory, but can anyone explain on what basis spark.driver.memory is calculated?

Operations on Datasets such as collect and take require moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with an OutOfMemoryError.
You increase spark.driver.memory when you collect large volumes of data to the driver.
As per High Performance Spark by Holden Karau and Rachel Warren (O’Reilly):
most of the computational work of a Spark query is performed by the
executors, so increasing the size of the driver rarely speeds up a
computation. However, jobs may fail if they collect too much data to
the driver or perform large local computations. Thus, increasing the
driver memory and correspondingly the value of
spark.driver.maxResultSize may prevent the out-of-memory errors in
the driver.
A good heuristic for setting the Spark driver memory is simply the
lowest possible value that does not lead to memory errors in the
driver, i.e., which gives the maximum possible resources to the
executors.
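For instance, here is a minimal PySpark sketch of raising both settings (the sizes are illustrative, not recommendations; in practice spark.driver.memory has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-memory-example")            # hypothetical app name
    .config("spark.driver.memory", "4g")         # raise only if collects fail with OOM
    .config("spark.driver.maxResultSize", "2g")  # cap on total size of results returned to the driver
    .getOrCreate()
)

# collect() pulls the whole Dataset into the driver process -- the kind of call
# that motivates increasing the two settings above.
rows = spark.range(1000).collect()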

Spark driver memory is the amount of memory to use for the driver process, i.e. the process running the main() function of the application and where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g).
JVM memory is divided into separate parts. At a broad level, the JVM heap is physically divided into two parts – the Young Generation and the Old Generation.
Young generation is the place where all the new objects are created. When young generation is filled, garbage collection is performed. This garbage collection is called Minor GC.
Old Generation memory contains the objects that are long lived and survived after many rounds of Minor GC. Usually garbage collection is performed in Old Generation memory when it’s full. Old Generation Garbage Collection is called Major GC and usually takes longer time.
Java garbage collection is the process of identifying and removing unused objects from memory and freeing that space for objects created in future processing. One of the best features of the Java programming language is automatic garbage collection, unlike languages such as C where memory allocation and deallocation are manual.
The garbage collector is a program running in the background that looks at all the objects in memory and finds the ones that are not referenced by any part of the program. All these unreferenced objects are deleted and their space is reclaimed for allocation to other objects.
Sources:
https://spark.apache.org/docs/latest/configuration.html
https://www.journaldev.com/2856/java-jvm-memory-model-memory-management-in-java#java-memory-model-8211-permanent-generation

Related

In Spark, what is the meaning of the spark.executor.pyspark.memory configuration option?

Documentation explanation is given as:
The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python's memory use, and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests.
Note: This feature is dependent on Python's resource module; therefore, the behaviours and limitations are inherited. For instance, Windows does not support resource limiting, and actual resource is not limited on macOS.
There are two other configuration options: one controlling the amount of memory allocated to each executor (spark.executor.memory) and another controlling the amount of memory that each Python process within an executor can use before it starts to spill data over to disk (spark.python.worker.memory).
Can someone please explain what then is the behaviour and use of spark.executor.pyspark.memory configuration and in what ways is it different from spark.executor.memory and spark.python.worker.memory?
I have extended my answer a little. Please also follow the links at the end of the answer; they are quite useful and include pictures that help to understand the whole picture of Spark memory management.
We need to dig into Spark memory management (mm) to figure out what spark.executor.pyspark.memory is.
So, first of all, there are two big parts of Spark mm:
Memory inside the JVM;
Memory outside the JVM.
Memory inside the JVM is divided into 4 parts:
Storage memory - for Spark cached data, broadcast variables, etc.;
Execution memory - for storing intermediate data required while executing Spark tasks;
User memory - for user purposes. You can store your own data structures, UDFs, UDAFs, etc. here;
Reserved memory - for Spark's internal purposes; it is hardcoded to 300MB as of Spark 1.6.
Memory outside the JVM is divided into 2 parts:
Off-heap memory - memory outside the JVM that is still used for JVM purposes, e.g. by Project Tungsten;
External process memory - specific to SparkR or PySpark; it is used by processes that reside outside of the JVM.
So, the parameter spark.executor.memory (or --executor-memory for spark-submit) determines how much memory is allocated inside the JVM heap per executor. This memory is split between reserved memory, user memory, execution memory and storage memory. To control this split there are 2 more parameters: spark.memory.fraction and spark.memory.storageFraction.
According to the Spark documentation:
spark.memory.fraction is responsible for the fraction of the heap used for execution and storage;
spark.memory.storageFraction is responsible for the amount of
storage memory immune to eviction, expressed as a fraction of the
size of the region set aside by spark.memory.fraction. So if
storage memory isn't used, execution memory may acquire all the
available memory and vice versa. This parameter controls how much
memory execution can evict if necessary.
More details here.
Please look at the pictures of the heap memory parts here.
Finally, the heap will be split in the following way:
Reserved memory is hardcoded to 300MB;
User memory is calculated as (spark.executor.memory - reserved memory) * (1 - spark.memory.fraction);
Spark memory (which consists of storage memory and execution memory) is calculated as (spark.executor.memory - reserved memory) * spark.memory.fraction. All this memory is then split between storage memory and execution memory according to the spark.memory.storageFraction parameter.
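As a back-of-the-envelope sketch (assuming spark.executor.memory is 4g and the default values spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5), the split works out roughly like this:

executor_memory_mb = 4 * 1024                   # spark.executor.memory = 4g
reserved_mb = 300                               # hardcoded reserved memory
usable_mb = executor_memory_mb - reserved_mb    # 3796 MB

memory_fraction = 0.6                           # spark.memory.fraction (default)
storage_fraction = 0.5                          # spark.memory.storageFraction (default)

user_mb = usable_mb * (1 - memory_fraction)     # ~1518 MB of user memory
spark_mb = usable_mb * memory_fraction          # ~2278 MB of storage + execution memory
storage_mb = spark_mb * storage_fraction        # ~1139 MB immune to eviction
execution_mb = spark_mb - storage_mb            # ~1139 MB for execution

print(f"user: {user_mb:.0f} MB, storage: {storage_mb:.0f} MB, execution: {execution_mb:.0f} MB")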
The next parameter you asked about is spark.executor.pyspark.memory. It's a part of external process memory and it controls how much memory the Python daemon is able to use. The Python daemon is used, for example, for executing UDFs written in Python.
And the last one is spark.python.worker.memory. In this article I found the following explanation: the JVM process and the Python process communicate with each other through the py4j bridge, which exposes objects between the JVM and Python. So spark.python.worker.memory controls how much memory can be occupied by py4j for creating objects before spilling them to disk.
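To summarize, here is a hedged sketch of where each of the three settings applies (the values are placeholders, not recommendations):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "4g")           # JVM heap per executor
    .config("spark.executor.pyspark.memory", "1g")   # cap for the Python workers spawned by each
                                                     # executor (external process memory, outside the JVM)
    .config("spark.python.worker.memory", "512m")    # aggregation buffer per Python worker
                                                     # before it spills to disk
    .getOrCreate()
)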
You can read more about mm in the following articles:
Memory management inside JVM;
Decoding Memory in Spark — Parameters that are often confused;
One more SO answer explaining off-heap memory configuration
How to tune Apache Spark jobs

How does Spark handle an out-of-memory error when cached (MEMORY_ONLY persistence) data does not fit in memory?

I'm new to Spark and I am not able to find a clear answer to the question: what happens when cached data does not fit in memory?
In many places I found that if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
For example: let's say 500 partitions are created and 200 partitions could not be cached; then we have to re-compute those 200 partitions by re-evaluating the RDD each time they are needed.
If that is the case, then an OOM error should never occur, but it does. What is the reason?
A detailed explanation is highly appreciated. Thanks in advance.
There are different ways you can persist your dataframe in Spark.
1) Persist (MEMORY_ONLY)
When you persist a data frame with MEMORY_ONLY it will be cached in the storage memory region as deserialized Java objects. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level and can sometimes cause OOM when the RDD is too big and cannot fit in memory (it can also occur after the recalculation effort).
To answer your question
If that is the case, then an OOM error should never occur, but it does. What is the reason?
Even after recalculation you need to fit those RDD partitions in memory. If there is no space available, GC will try to free some space and allocate it; if that is not successful, the job will fail with OOM.
2) Persist (MEMORY_AND_DISK)
When you persist a data frame with MEMORY_AND_DISK it will be cached in the storage memory region as deserialized Java objects; if memory is not available in the heap, it will be spilled to disk. To tackle memory issues it will spill part of the data, or all of it, to disk. (Note: make sure to have enough disk space on the nodes, otherwise no-disk-space errors will pop up.)
3) Persist (MEMORY_ONLY_SER)
When you persist a data frame with MEMORY_ONLY_SER it will be cached in the storage memory region as serialized Java objects (one byte array per partition). This is generally more space-efficient than MEMORY_ONLY but it is more CPU-intensive because serialization is involved (the general suggestion here is to use Kryo for serialization), and it can still face OOM issues similar to MEMORY_ONLY.
4) Persist (MEMORY_AND_DISK_SER)
It is similar to MEMORY_ONLY_SER, but the difference is that when no heap space is available it will spill the serialized partitions to disk, the same as MEMORY_AND_DISK. We can use this option when we have a tight constraint on memory and want to reduce IO traffic.
5) Persist (DISK_ONLY)
In this case, heap memory is not used for caching; RDDs are persisted to disk. Make sure to have enough disk space; this option has a huge IO overhead. Don't use it for dataframes that are used repeatedly.
6) Persist (MEMORY_ONLY_2 or MEMORY_AND_DISK_2)
These are similar to the above-mentioned MEMORY_ONLY and MEMORY_AND_DISK. The only difference is that these options replicate each partition on two cluster nodes, just to be on the safe side. Use them when you are running on spot instances.
7) Persist (OFF_HEAP)
Off-heap memory generally contains thread stacks, Spark container application code, network IO buffers, and other OS application buffers. With this option you can utilize that part of RAM for caching your RDD as well.
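A small PySpark sketch of these storage levels (the DataFrame and its size are illustrative; the _SER variants are specific to the Scala/Java API because PySpark data is always serialized on the Python side):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-levels").getOrCreate()
df = spark.range(10_000_000)

df.persist(StorageLevel.MEMORY_ONLY)           # deserialized objects; recompute what does not fit
# df.persist(StorageLevel.MEMORY_AND_DISK)     # spill partitions that do not fit to disk
# df.persist(StorageLevel.DISK_ONLY)           # no heap used for caching, heavy IO
# df.persist(StorageLevel.MEMORY_AND_DISK_2)   # additionally replicate each partition on two nodes

df.count()      # an action materializes the cache
df.unpersist()  # release the cached blocks when done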

How long does RDD remain in memory?

Considering that memory is limited, I had a feeling that Spark automatically removes RDDs from each node. I'd like to know: is this time configurable? How does Spark decide when to evict an RDD from memory?
Note: I'm not talking about rdd.cache()
I'd like to know is this time configurable? How does spark decide when
to evict an RDD from memory
An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would and be collected once there are no live root objects pointing to it.
The "how" part, as @Jacek points out, is the responsibility of an object called ContextCleaner. If you want the details, this is what the cleaning method looks like:
private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}
If you want to learn more, I suggest browsing Spark's source or, even better, reading @Jacek's book "Mastering Apache Spark" (which includes an explanation of ContextCleaner).
In general, it is as Yuval Itzchakov wrote, "just like any other object", but... (there's always a "but", isn't there?)
In Spark, it's not that obvious since we have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on executors. They somehow will have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes into play. It's the Spark application's garbage collector, responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, aimed at reducing the memory requirements of long-running, data-heavy Spark applications.
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (and spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
You can see it working by doing the dump of all the threads in a Spark application using jconsole or jstack. ContextCleaner uses a daemon Spark Context Cleaner thread that cleans RDD, shuffle, and broadcast states.
You can also see its work by enabling INFO or DEBUG logging levels for org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG
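With that logging enabled, a hedged sketch of seeing the cleaner in action from PySpark (the RDD and its size are illustrative):

import gc
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-cleaner-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).cache()
rdd.count()    # materialize the cached blocks on the executors

rdd = None     # drop the last live reference on the driver
gc.collect()   # nudge the Python side; the JVM GC runs on its own schedule

# Once the JVM garbage-collects the dangling RDD object, ContextCleaner's
# weak-reference queue picks it up and you should see a CleanRDD task logged.
# There is no fixed, user-configurable expiry time for an unreferenced RDD.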
Measuring the Impact of GC
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.
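For example, one way to pass those flags from PySpark (a sketch; the flags are the JDK 8 ones quoted above, newer JDKs use -Xlog:gc* instead):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions",
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)
# The GC messages end up in each executor's stdout file under its work
# directory on the worker nodes, not in the driver's output.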
Advanced GC Tuning
To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
Java Heap space is divided in to two regions Young and Old. The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.
The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].
A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.
According to the Resilient Distributed Datasets paper:
Our worker nodes cache RDD partitions in memory as
Java objects. We use an LRU replacement policy at the
level of RDDs (i.e., we do not evict partitions from an
RDD in order to load other partitions from the same
RDD) because most operations are scans. We found this
simple policy to work well in all our user applications so
far. Programmers that want more control can also set a
retention priority for each RDD as an argument to cache.

Spark spilling independent of executor memory assigned

I've noticed strange behavior when running a pyspark application with Spark 2.0. In the first step of my script, involving a reduceByKey (and thus shuffle) operation, I observe that the amount of shuffle write is roughly in line with my expectations, but that much more spilling occurs than I had expected. I tried to avoid these spills by increasing the amount of memory assigned per executor, up to 8x the original amount, but see basically no difference in the amount spilled. Strangely, I also see that while this stage is running, hardly any of the assigned storage memory is used (as reported in the executors tab of the Spark web UI).
I saw this earlier question, which led me to believe that increasing executor memory might help avoid the spills: How to optimize shuffle spill in Apache Spark application. This leads me to believe that some hard limit is leading to the spills, and not the spark.shuffle.memoryFraction parameter. Does such a hard limit exist, possibly among HDFS parameters? Otherwise, what could be done to avoid spills besides increasing executor memory?
Many thanks, R
Spilling behavior in PySpark is controlled using spark.python.worker.memory:
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
which is by default set to 512MB. Moreover PySpark uses its own reducing mechanism with External(GroupBy|Sorter|Merger) and exhibits slightly different behavior than its native counterpart.
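So for the reduceByKey case above, a hedged sketch would be to raise that aggregation buffer (the value is illustrative):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("reduce-by-key-spill")        # hypothetical app name
    .set("spark.python.worker.memory", "2g")  # default is 512m; less spilling during aggregation
)
sc = SparkContext(conf=conf)

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())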

Spark uses sophisticated ways to leverage memory space - explain

I was watching a video on Apache Spark here, where the speaker Paco Nathan says the following:
"If you have 128 GB of RAM, you are not going to throw them all at once at the jvm.That will just cause a lot of garbage collection. And so one of the things with spark is, use more sophisticated ways to leverage the memory space, do more off-heap."
I am not able to understand what he says with regard to how spark efficiently handles this scenario.
Also, more specifically, I did not understand the statement at all:
"If you have 128 GB of RAM you are not going to throw them all at once at the jvm.That will just cause of lot of garbage collection"
Can someone explain what the reasoning actually is behind these statements?
"If you have 128 GB of RAM you are not going to throw them all at once
at the jvm.That will just cause of lot of garbage collection"
This means that you will not assign all of the memory to the JVM, because memory is also required for other things, such as off-heap operations, and a very large heap simply causes a lot of garbage collection.
Spark does this by assigning fractions of the memory (that you have assigned to the Spark executors) for such operations, as shown in the image below (for Spark 1.5.0):

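As a hedged sketch, explicitly enabling Spark's off-heap (Tungsten) memory looks like this; the sizes are placeholders, and the off-heap pool comes on top of the JVM heap set by spark.executor.memory:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")           # JVM heap, subject to garbage collection
    .config("spark.memory.offHeap.enabled", "true")  # let Spark manage memory outside the heap
    .config("spark.memory.offHeap.size", "16g")      # off-heap pool, not scanned by the GC
    .getOrCreate()
)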