When does Spark evict broadcasted dataframe from Executors? - apache-spark

I have a question about broadcasting a DataFrame.
Copies of a broadcast DataFrame are sent to each Executor.
So, when does Spark evict these copies from each Executor?

I find this topic functionally easy to understand, but the manuals are harder to follow technically, and there are always improvements in the offing.
My take:
There is a ContextCleaner running on the Driver of every Spark application.
It is created and immediately started when the SparkContext starts.
It is concerned with all sorts of objects in Spark, not just broadcast variables.
The ContextCleaner thread cleans RDD, shuffle, broadcast, and accumulator state via its keepCleaning method, which runs for the lifetime of the application. It decides which objects are due for eviction because they are no longer referenced, and these get placed on a reference queue of cleanup tasks. Objects are registered for cleanup via methods such as registerShuffleForCleanup. That is to say, a check is made to see whether there are any live root objects still pointing to a given object; if not, that object is eligible for clean-up and eviction.
A context-cleaner-periodic-gc task asynchronously requests the standard JVM garbage collector to run. Periodic runs of it are started when the ContextCleaner starts and stopped when the ContextCleaner terminates.
Spark relies on the standard Java GC.
This page, https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html, is a good reference alongside the official Spark docs.
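To make the eviction side concrete, a broadcast variable can also be removed explicitly rather than waiting for the ContextCleaner. Below is a minimal PySpark sketch of that idea (the lookup data and names are my own); it uses an explicit broadcast variable created with sc.broadcast to keep things simple, together with the Broadcast API's unpersist() and destroy() methods:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-cleanup-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast a small lookup table; a copy is cached on each executor that uses it.
lookup = sc.broadcast({"a": 1, "b": 2})

rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).sum())

# unpersist() removes the cached copies from the executors; the data can still
# be re-broadcast if the variable is used again later.
lookup.unpersist(blocking=True)

# destroy() removes all data and metadata for good; the variable must not be
# used after this call.
lookup.destroy()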

Related

DataFrame Lifespan in memory, Spark?

My question is more related to memory management and GC in Spark internally.
If I create an RDD, how long will it live in my Executor memory?
# Program Starts
spark = SparkSession.builder.appName("").master("yarn").getOrCreate()
df = spark.range(10)
df.show()
# other Operations
# Program end!!!
Will it be automatically deleted once my execution finishes? If yes, is there any way to delete it manually during program execution?
How and when is garbage collection called in Spark? Can we implement a custom GC, as in a Java program, and use it in Spark?
DataFrames are Java objects, so if no reference to your object is found, it is eligible for garbage collection.
Spark - Scope, Data Frame, and memory management
Calling Custom gc not possible
Manually calling spark's garbage collection from pyspark
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview
"how long it will leave in my Executor memory."
In this particular case Spark will never materialize the full dataset; instead it will iterate through it one row at a time. Only a few operators materialize the full dataset: sorts, joins, group-bys, writes, etc.
"Will it be automatically deleted once my Execution finishes."
Spark automatically cleans up any temporary data.
"If Yes, Is there any way to delete it manually during program execution."
Spark only keeps that data around if it is in use or has been manually persisted. What are you trying to accomplish in particular?
"How and when Garbage collection called in Spark."
Spark runs on the JVM, and the JVM will automatically run GC when certain thresholds are hit.
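On the "delete it manually" part: a cached DataFrame can be dropped explicitly instead of waiting for the JVM or the ContextCleaner. A minimal PySpark sketch (the DataFrame here is just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-cleanup-demo").getOrCreate()

df = spark.range(10).cache()
df.count()                    # action: materializes the cache on the executors

df.unpersist(blocking=True)   # drop this DataFrame's blocks from executor memory

spark.catalog.clearCache()    # or drop every cached table/DataFrame at once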

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand whether there is extra overhead/delay to persist and un-persist a DataFrame in memory.
From what I know so far, there is no data movement when we cache a DataFrame; it is just saved in the executors' memory. So it should be just a matter of setting/unsetting a flag.
I am caching a DataFrame in a Spark Streaming job and wanted to know if this could lead to additional delay in batch execution.
if there is an extra overhead/delay to persist and un-persist a dataframe in memory.
It depends. If you only mark a DataFrame to be persisted, nothing really happens since it's a lazy operation. You have to execute an action to trigger DataFrame persistence / caching. With the action you do add an extra overhead.
Moreover, think of persistence (caching) as a way to precompute data and save it closer to executors (memory, disk or their combinations). This moving data from where it lives to executors does add an extra overhead at execution time (even if it's just a tiny bit).
Internally, Spark manages data as blocks (using BlockManagers on executors). The BlockManagers are peers that exchange blocks on demand (using a torrent-like protocol).
Unpersisting a DataFrame is simply a matter of sending a request (sync or async) to the BlockManagers to remove the RDD blocks. If it happens asynchronously, the overhead is essentially none (minus the extra work executors have to do while running tasks).
So it should be just a matter of setting/unsetting a flag.
In a sense, that's how it is under the covers. Since a DataFrame or an RDD is just an abstraction that describes a distributed computation and does nothing at creation time, persist / unpersist really is just setting / unsetting a flag.
The change can be noticed at execution time.
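A small PySpark illustration of that laziness (the names and sizes are mine): marking a DataFrame for persistence returns immediately, and the actual caching cost is only paid on the first action.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("lazy-persist-demo").getOrCreate()

df = spark.range(1_000_000)

df.persist(StorageLevel.MEMORY_ONLY)  # just sets a flag; nothing is cached yet

df.count()   # first action: partitions are computed and stored on the executors
df.count()   # subsequent actions read from the cached blocks

df.unpersist(blocking=False)  # async request to the BlockManagers to drop the blocks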
I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.
If you use async caching (the default), there should be a very minimal delay.

How long does RDD remain in memory?

Considering that memory is limited, I had a feeling that Spark automatically removes RDDs from each node. I'd like to know: is this time configurable? How does Spark decide when to evict an RDD from memory?
Note: I'm not talking about rdd.cache()
I'd like to know is this time configurable? How does Spark decide when to evict an RDD from memory
An RDD is an object just like any other. If you don't persist/cache it, it will act as any other object under a managed language would and be collected once there are no alive root objects pointing to it.
The "how" part, as #Jacek points out is the responsibility of an object called ContextCleaner. Mainly, if you want the details, this is what the cleaning method looks like:
private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}
If you want to learn more, I suggest browsing Spark's source or, even better, reading @Jacek's book "Mastering Apache Spark" (this link points to the explanation of ContextCleaner).
In general, it's as Yuval Itzchakov wrote: "just like any other object". But... (there's always a "but", isn't there?)
In Spark, it's not that obvious, since we also have shuffle blocks (among the other blocks managed by Spark). They are managed by BlockManagers running on executors, and these somehow have to be notified when an object on the driver gets evicted from memory, right?
That's where ContextCleaner comes to stage. It's Spark Application's Garbage Collector that is responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs that is aimed at reducing the memory requirements of long-running data-heavy Spark applications.
ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (and spark.cleaner.referenceTracking Spark property is enabled, which it is by default). It is stopped when SparkContext is stopped.
You can see it working by doing the dump of all the threads in a Spark application using jconsole or jstack. ContextCleaner uses a daemon Spark Context Cleaner thread that cleans RDD, shuffle, and broadcast states.
You can also see its work by enabling INFO or DEBUG logging levels for org.apache.spark.ContextCleaner logger. Just add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.ContextCleaner=DEBUG
Measuring the Impact of GC
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.
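For reference, these GC flags can be passed to the executor JVMs via spark.executor.extraJavaOptions. A hedged PySpark sketch of one way to do that (the app name is arbitrary; driver-side JVM options generally have to go through spark-submit or spark-defaults.conf, since the driver JVM is already running by the time this code executes):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-logging-demo")
    # GC logging flags from the paragraph above, applied to the executor JVMs.
    .config("spark.executor.extraJavaOptions",
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)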
Advanced GC Tuning
To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
Java heap space is divided into two regions, Young and Old. The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.
The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].
A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.
According to the Resilient Distributed Datasets paper:
Our worker nodes cache RDD partitions in memory as Java objects. We use an LRU replacement policy at the level of RDDs (i.e., we do not evict partitions from an RDD in order to load other partitions from the same RDD) because most operations are scans. We found this simple policy to work well in all our user applications so far. Programmers that want more control can also set a retention priority for each RDD as an argument to cache.

How to know which piece of code runs on driver or executor?

I am new to Spark. How do I know which piece of code will run on the driver and which will run on the executors?
Do we always have to try to code such that everything runs on the executors? Are there any recommendations/ways to make most of your code run on the executors?
Update: As far as I understand, transformations run on executors and actions run on the driver because they need to return a value. So is it fine if an action runs on the driver, or should it also run on an executor? Where does the driver actually run? On the cluster?
Any Spark application consists of a single Driver process and one or more Executor processes. The Driver process will run on the Master node of your cluster and the Executor processes run on the Worker nodes. You can increase or decrease the number of Executor processes dynamically depending upon your usage but the Driver process will exist throughout the lifetime of your application.
The Driver process is responsible for a lot of things including directing the overall control flow of your application, restarting failed stages and the entire high level direction of how your application will process the data.
Coding your application so that more data is processed by Executors falls more under the purview of optimising your application so that it processes data more efficiently/faster making use of all the resources available to it in the cluster.
In practice, you do not really need to worry about making sure that more of your data is being processed by executors.
That being said, there are some actions which, when triggered, necessarily involve moving data around. If you call the collect action on an RDD, all the data is brought to the driver process, and if your RDD holds a sufficiently large amount of data, an out-of-memory error will be triggered, as the single machine running the driver process will not be able to hold all of it.
Keeping the above in mind, Transformations are lazy and Actions are not.
Transformations basically transform one RDD into another. But calling a transformation on an RDD does not actually result in any data being processed anywhere, Driver or Executor. All a transformation does is that it adds to the DAG's lineage graph which will be executed when an Action is called.
So the actual processing happens when you call an Action on an RDD. The simplest example is that of calling collect. As soon as an action is called, Spark gets to work and executes the previously saved DAG computations on the specified RDD, returning the result back. Where these computations are executed depends entirely on your application.
There is no simple and straightforward answer here.
As a rule of thumb, everything that is executed inside closures of higher order functions like mapPartitions (map, filter, flatMap) or combineByKey should be handled mostly by the executor machines. Everything outside these is handled by the driver. But be aware that this is a serious simplification.
Depending on the specific method and language, at least part of the job can be handled by the driver. For example, when you use combine-like methods (reduce, aggregate), the final merging is applied locally on the driver machine. Complex algorithms (like many ML / MLlib tools) can interleave distributed and local processing when needed.
Moreover, data processing is only a fraction of the whole job. The driver is responsible for bookkeeping, accumulator processing, initial broadcasting and other secondary tasks. It also handles lineage and DAG processing and generates execution plans for the higher level APIs (Dataset, Spark SQL).
While the whole picture is relatively complex in practice your choices are relatively limited. You can:
Avoid collecting data (collect, toLocalIterator) to process locally.
Perform more work on the workers with tree* (treeAggregate, treeReduce) methods.
Avoid unnecessary tasks which increase bookkeeping costs.
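A small PySpark sketch of the rule of thumb above (everything here is illustrative): the lambdas passed to map and reduce run inside tasks on the executors, while collect brings data to the driver and the final merge of reduce happens there as well; treeReduce pushes more of the merging onto the executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-executor-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000), numSlices=8)   # driver: builds the lineage only

squares = rdd.map(lambda x: x * x)                # lambda runs inside executor tasks

total = squares.reduce(lambda a, b: a + b)        # per-partition work on executors,
                                                  # final merge on the driver
total_tree = squares.treeReduce(lambda a, b: a + b, depth=2)  # pushes more merging
                                                              # onto the executors

sample = squares.take(5)        # brings only a few rows to the driver
everything = squares.collect()  # brings ALL rows to the driver; avoid on big data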
To this part of your question: "Update: As far as I understand, transformations run on executors and actions run on the driver because they need to return a value."
It is not true that only transformations run on the executors and all actions run on the driver.
If we have to join 2 datasets with no aggregate operation to be performed, e.g.:
dataset1.join(dataset2, dataset1.col("colA").equalTo(dataset2.col("colA")), "left_semi")
        .as(Encoders.bean(Some.class))
        .write().save("/user/datasetresult");
In this case, as soon as an executor finishes working on its partition, it starts writing its result to HDFS (or some other persistent store) without waiting for the other executors to complete. This is why we see different part files, which are technically the partitions that each executor processed.
The driver does not wait for all executors to complete their computation before the results start being written.
Where does the driver actually run? on cluster?
It depends on the --deploy-mode chosen.
With --deploy-mode client, the gateway machine where you launch your Spark application is your driver machine.
With --deploy-mode cluster, the cluster manager (in YARN/Mesos) chooses a machine that it determines has sufficient memory to run the driver.

Why are accumulators sent directly to the driver?

With Spark, if I've already defined my accumulators to be associative and reducible, why does each worker send them directly to the driver rather than reducing them incrementally along with my actual job? It seems a bit goofy to me.
Each task in Spark maintains its own accumulator, and its value is sent back to the driver when that particular task has finished.
Since accumulators in Spark are mostly a diagnostic and monitoring tool, sharing accumulators between tasks would make them almost useless. Not to mention that a worker failure after a particular task has finished would result in a loss of data and make accumulators even less reliable than they already are.
Moreover, this mechanism is pretty much the same as a standard RDD reduce, where task results are continuously sent to the driver and merged locally.
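For illustration, a minimal PySpark accumulator sketch (the counter name and data are mine): each task updates its own local copy inside the foreach closure, the per-task values are sent back to the driver as tasks finish, and only the driver can read the merged total.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # lives on the driver; tasks hold local copies

def check(line):
    if not line.strip():
        bad_records.add(1)        # updated inside the task on an executor

sc.parallelize(["a", "", "b", ""]).foreach(check)

# The merged value is only readable on the driver, after the action completes.
print(bad_records.value)          # -> 2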
