Sharing RDDs with storage level NONE among Spark jobs - apache-spark

I have multiple Spark jobs which share a part of the dataflow graph including an expensive shuffle operation. If I persist that RDD, I see huge improvement (22x) as expected.
However, even when I keep the storage level of those RDDs as NONE, I still see up to a 4x improvement just by sharing the RDDs among jobs.
Why?
I am under the assumption that Spark always recomputes RDDs with storage level NONE, and that those are not subject to eviction/spilling.
My Spark version is 3.3.1. Showing the code is difficult as the code is spread in multiple files in a bigger system. I am essentially doing the following:
1. Identify the repetitive (and expensive) Spark operation across jobs. I do that by maintaining my own lineage traces [1].
2. After the first execution of those Spark operations, I keep the RDD handles locally in a cache, which is a hashmap <lineage-trace, RDD>.
3. The next time onwards, when I get the same operation, I simply reuse the cached RDD.
If I persist the RDDs in the second step (by calling rdd.persist(StorageLevel.MEMORY_AND_DISK)), I see a huge improvement. But even if I just reuse the same RDD (storage level NONE), I still see an improvement.
[1] LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems. Arnab Phani, Benjamin Rath, Matthias Boehm. SIGMOD 2021

If we have a look at the source code for the rdd.persist(StorageLevel) method, we see the following:
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
So that calls a persist method with an extra input argument. It looks like this:
/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw SparkCoreErrors.cannotChangeStorageLevelError()
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}
In there, we see something interesting. If the current storageLevel (not the new one) == StorageLevel.NONE, we're going to registerRDDForCleanup and persistRDD on this RDD.
Now, the default value for storageLevel is StorageLevel.NONE. That means that your case (calling persist on an unpersisted RDD) falls under this category.
So we found out that calling rdd.persist(StorageLevel.NONE) actually does something with your RDD! Let's have a look at both of these operations.
registerRDDForCleanup
registerRDDForCleanup is a method of the ContextCleaner class. It looks like this:
/** Register an RDD for cleanup when it is garbage collected. */
def registerRDDForCleanup(rdd: RDD[_]): Unit = {
  registerForCleanup(rdd, CleanRDD(rdd.id))
}
// some other code between here that I removed for this explanation
/** Register an object for cleanup. */
private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
  referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
}
So this method actually adds a cleanup task (associated with this RDD) to some buffer called the referenceBuffer. That looks like this:
/**
 * A buffer to ensure that `CleanupTaskWeakReference`s are not garbage collected as long as they
 * have not been handled by the reference queue.
 */
private val referenceBuffer =
  Collections.newSetFromMap[CleanupTaskWeakReference](new ConcurrentHashMap)
So, as the code comment says, this referenceBuffer is a buffer that ensures the cleanup tasks' weak references don't get garbage collected before they have been handled. On this reading, your RDDs are garbage collected less eagerly, which would improve your performance!
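To make the role of the buffer more concrete, here is a minimal plain-Java sketch of the same pattern: a weak reference subclass, a reference queue, and a concurrent set that holds the reference objects. It is a stand-in and not Spark's code; CleanerSketch, CleanupRef and the string task label are names invented for illustration. Note that the set holds the reference objects strongly, while each referent (the RDD in Spark's case) is only weakly referenced.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the cleaner; every name here is illustrative.
final class CleanerSketch {
    // Weak reference that remembers which cleanup task belongs to its referent.
    static final class CleanupRef extends WeakReference<Object> {
        final String task;

        CleanupRef(String task, Object referent, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.task = task;
        }
    }

    private final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();

    // Strongly holds the CleanupRef objects (not their referents) until handled.
    private final Set<CleanupRef> referenceBuffer =
        Collections.newSetFromMap(new ConcurrentHashMap<>());

    // Registers a cleanup task for target and returns the tracking reference.
    CleanupRef registerForCleanup(Object target, String task) {
        CleanupRef ref = new CleanupRef(task, target, referenceQueue);
        referenceBuffer.add(ref);
        return ref;
    }

    // Number of cleanup references that have not been handled yet.
    int pending() {
        return referenceBuffer.size();
    }

    public static void main(String[] args) {
        CleanerSketch cleaner = new CleanerSketch();
        Object rdd = new Object(); // pretend this is an RDD handle
        CleanupRef ref = cleaner.registerForCleanup(rdd, "CleanRDD(42)");
        System.out.println(cleaner.pending() + " pending, referent reachable: "
            + (ref.get() == rdd));
    }
}
```

In this sketch the registered object stays reachable only because the caller still holds it in a local variable; the buffer by itself keeps just the bookkeeping object alive.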
persistRDD
The second method that was called on our RDD is the persistRDD method. I won't go into too much detail (since it is less important here), but this method (in SparkContext.scala) basically adds this RDD to a Map in which the SparkContext keeps track of all persisted RDDs.
Conclusion
We could go deeper in this investigation, but that would become impractical to write/read. I think this level of abstraction is enough to understand that calling rdd.persist(StorageLevel) actually does something to make your RDDs not be garbage collected too soon!
Hope this helps :)

Related

Returning results from executor to driver

I have a spark application, which basically takes in a big dataset, performs some computations over it, and finally does some IO to store it in a database. All of these stages happen on executors, and driver gets (collects) a boolean from each task, representing the success/failure status of that task (e.g. computation or IO may fail for some items).
E.g., following is an over-simplified lineage (in the actual implementation, there are multiple repartitioning and computation steps):
readSomeDataset()
  .repartition()
  .mapPartitions { /* do some calculation */ }
  .mapPartitions { /* do some IO */ }
  .collect()
Problem:
Based on the result of the computations, I would like to do something else on the driver (like publishing a message saying "computation was success"). This needs to be done once for the entire dataset, and not for individual partition, and thus needs to happen on the driver.
However, the IO on executors takes a long time, and I do not want to wait for that to finish before publishing.
Is there a way for the executors to send a 'message' back to the driver while in middle of processing the tasks?
(Something like Accumulators comes to mind, however, afaik they will be usable only once the final action finishes on the executors)
Spark evaluates lazily and needs a complete job (from reading to writing) to execute; it cannot execute only part of one.
To publish before the IO without reprocessing, you can cache the DataFrame after the calculation, check it with an action, and only then run the IO, something like this:
val calculatedDF = readSomeDataset()
  .repartition()
  .mapPartitions { /* do some calculation */ }
  .cache() // or persist if it can't fit in the memory of the executors
// a condition to see if the calculations are ok, and an action (reduce) to launch them
if (calculatedDF.map(checkEachIsOK).reduce(_ && _)) {
  println("correct calculation")
  calculatedDF
    .mapPartitions { /* do some IO */ }
    .collect()
} else {
  println("incorrect calculation")
}

Spark and isolating time taken for tasks

I recently began to use Spark to process huge amount of data (~1TB). And have been able to get the job done too. However I am still trying to understand its working. Consider the following scenario:
0. Set reference time (say tref)
1. Do any one of the following two tasks:
   a. Read large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs (OR)
   b. Read data as above and do additional preprocessing work and store the results in a DataFrame
2. Print the size of the RDD or DataFrame as applicable and the time difference with respect to tref (i.e., t0a/t0b)
3. Do some computation
4. Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = time required for preprocessing? Where can I find a reliable reference for the same?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy computation approach and its capability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that can be computed while thousands of input files are being read? The origin of the suspicion is the unbelievable performance (with results verified okay) I see, which looks too fantastic to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
  val start = System.currentTimeMillis
  val res = op
  val end = System.currentTimeMillis
  (end - start) / 1000f
}
def XYZ = {
  val r00 = sc.parallelize(0 to 999999)
  val r01 = r00.map(x => (x,(x,x,x,x,x,x,x)))
  r01.join(r01).count()
}
val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with Actions, since only they cause actual execution. Because of lazy evaluation, this approach has its limitations.
On the other hand, Spark Web UI records every Action, and records Stage duration for the Action.
In general, performance measuring in shared environments is difficult. Dynamic allocation in Spark in a noisy cluster means that you hold on to acquired resources during the Stage, but upon successive runs of the same or the next Stage you may get fewer resources. Still, such timings are at least indicative, and you can run in a less busy period.
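The point about lazy evaluation can be illustrated without a cluster. The sketch below uses java.util.stream as an analogy for Spark's lazy transformations (it is not Spark, and LazyTimingSketch and its method names are made up for this example): defining the map step runs nothing, and the work only happens when a terminal operation, the counterpart of an Action, is invoked. Any elapsed time therefore attaches to the action, not to the line that defines the transformation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Analogy only: Java streams stand in for lazy Spark transformations.
final class LazyTimingSketch {
    // Runs op once and returns the elapsed wall-clock time in seconds.
    static <T> double timeIt(Supplier<T> op) {
        long start = System.nanoTime();
        op.get();
        return (System.nanoTime() - start) / 1e9;
    }

    // Builds a lazy pipeline, then forces it.
    // Returns {map calls before the terminal operation, map calls after it}.
    static int[] callsBeforeAndAfterAction(int n) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = IntStream.range(0, n).boxed()
            .map(x -> { calls.incrementAndGet(); return x * 2; });
        int before = calls.get();         // nothing has executed yet
        pipeline.reduce(0, Integer::sum); // terminal operation: the map runs now
        return new int[] { before, calls.get() };
    }

    public static void main(String[] args) {
        int[] calls = callsBeforeAndAfterAction(1000);
        System.out.println("map calls before action: " + calls[0]
            + ", after action: " + calls[1]);
        double seconds = timeIt(() -> callsBeforeAndAfterAction(1_000_000));
        System.out.println("forcing the pipeline took " + seconds + " seconds");
    }
}
```

Timing the definition of the pipeline alone would report close to zero regardless of how expensive the mapped function is, which is the same trap as timing a Spark transformation without an action.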

In which scenario Object from driver node is serialized and sent to workers node in apache spark

Let's say I declare a variable and use it inside a map/filter function in Spark. Is that variable sent from the driver to the workers each time, for each value processed by map/filter?
Is my helloVariable sent to the worker node for each value of consumerRecords? If so, how can I avoid it?
String helloVariable = "hello testing"; //or some config/json object
JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
    consumerRecord -> {
        return consumerRecord.value()+" --- "+helloVariable;
    } );
Yep. When you pass functions to Spark, such as to map() or filter(), these functions can use variables defined outside them in the driver program, but each task running on the cluster gets a new copy of each variable (serialized and sent over the network), and updates to these copies are not propagated back to the driver.
So the common case for this scenario is to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. If you are interested in the broadcasting mechanism, here you can read a very good short explanation.
According to the Spark documentation, this process can be shown graphically (figure not reproduced here).
Broadcast variables can be used, for example, to give every node a copy of a large dataset (for example, a dictionary with a list of keywords) in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
So in your case your code might look like this:
Broadcast<String> broadcastVar = sc.broadcast("hello testing");
JavaDStream<String> javaDStream = consumerRecordJavaInputDStream.map(
    consumerRecord -> {
        return consumerRecord.value() + " --- " + broadcastVar.value();
    });
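To see what "a new copy of each variable" means mechanically, the following plain-Java sketch serializes a lambda that captures a string and deserializes it again, roughly the way a closure has to travel to another JVM. It does not use Spark; ClosureCopySketch, SerFn and the helper names are assumptions made for this illustration. The captured value is part of the serialized bytes, so a larger captured object means a larger payload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

// Illustration only: shows that a captured variable travels with the closure.
final class ClosureCopySketch {
    // A function type that can be written to a byte stream, like a task closure.
    interface SerFn extends Function<String, String>, Serializable {}

    // Returns a closure that captures suffix from the enclosing scope.
    static SerFn appendSuffix(String suffix) {
        return value -> value + " --- " + suffix;
    }

    // Java-serializes an object into a byte array.
    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Serializes and deserializes the closure, yielding an independent copy.
    static SerFn roundTrip(SerFn fn) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(serialize(fn)))) {
            return (SerFn) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SerFn copy = roundTrip(appendSuffix("hello testing"));
        System.out.println(copy.apply("record"));
        System.out.println("payload bytes: " + serialize(appendSuffix("hello testing")).length);
    }
}
```

For a short string like the one in the question the extra bytes are negligible; the cost of shipping the captured value grows with its size, which is the situation broadcast variables are meant for.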

Replaying an RDD in spark streaming to update an accumulator

I am actually running out of options.
In my Spark Streaming application, I want to keep state on some keys. I am getting events from Kafka, and I extract keys from each event, say userID. When there are no events coming from Kafka, I want to keep updating a counter for each user ID every 3 seconds, since I configured the batch duration of my StreamingContext to 3 seconds.
Now the way I am doing it might be ugly, but at least it works: I have an accumulableCollection like this:
val userID = ssc.sparkContext.accumulableCollection(new mutable.HashMap[String,Long]())
Then I create a "fake" event and keep pushing it to my spark streaming context as the following:
val rddQueue = new mutable.SynchronizedQueue[RDD[String]]()
for ( i <- 1 to 100) {
  rddQueue += ssc.sparkContext.makeRDD(Seq("FAKE_MESSAGE"))
  Thread.sleep(3000)
}
val inputStream = ssc.queueStream(rddQueue)
inputStream.foreachRDD( UPDATE_MY_ACCUMULATOR )
This lets me access my accumulableCollection and update all the counters of all userIDs. Up to now everything works fine; however, when I change my loop from:
for ( i <- 1 to 100) {} // This is for test
To:
while (true) {} // This is to let me access and update my accumulator through the whole application life cycle
Then when I run my ./spark-submit, my application gets stuck on this stage:
15/12/10 18:09:00 INFO BlockManagerMasterActor: Registering block manager slave1.cluster.example:38959 with 1060.3 MB RAM, BlockManagerId(1, slave1.cluster.example, 38959)
Any clue on how to resolve this? Is there a straightforward way to update the values of my userIDs (rather than creating a useless RDD and pushing it periodically to the queueStream)?
The reason why the while (true) ... version does not work is that the control never returns to the main execution line and therefore nothing below that line gets executed. To solve that specific problem, we should execute the while loop in a separate thread. Future { while () ...} should probably work.
Also, the Thread.sleep(3000) when populating the QueueDStream in the example above is not needed. Spark Streaming will consume one message from the queue on each streaming interval.
A better way to trigger that inflow of 'tick' messages would be with the ConstantInputDStream that plays back the same RDD at each streaming interval, therefore removing the need to create the RDD inflow with the QueueDStream.
That said, it looks to me that the current approach seems fragile and would need revision.
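The "run the loop in a separate thread" fix can be sketched in plain Java. This is not Spark Streaming code: BackgroundTickSketch, the bounded queue and the "tick-producer" thread name are assumptions for illustration. The endless producer loop runs on a worker thread, so control returns to the caller immediately and the code after it, which in the question would set up and start the stream, still executes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only: an endless producer loop moved off the main thread.
final class BackgroundTickSketch {
    // Starts the endless loop on a daemon worker thread and returns immediately.
    static ExecutorService startProducer(BlockingQueue<String> queue) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "tick-producer");
            t.setDaemon(true); // do not keep the JVM alive for this loop
            return t;
        });
        worker.execute(() -> {
            try {
                while (true) {
                    queue.put("FAKE_MESSAGE"); // blocks while the bounded queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdownNow() ends the loop
            }
        });
        return worker;
    }

    // The caller keeps running after startProducer and consumes n ticks.
    static List<String> takeTicks(int n) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1);
        ExecutorService worker = startProducer(queue);
        List<String> ticks = new ArrayList<>();
        try {
            for (int i = 0; i < n; i++) {
                ticks.add(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            worker.shutdownNow(); // interrupt the producer once we are done
        }
        return ticks;
    }

    public static void main(String[] args) {
        System.out.println(takeTicks(3));
    }
}
```

The blocking put on a bounded queue also removes the need for an explicit sleep in the producer: it simply waits until the consumer has taken the previous message.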

When are accumulators truly reliable?

I want to use an accumulator to gather some stats about the data I'm manipulating in a Spark job. Ideally, I would do that while the job computes the required transformations, but since Spark may re-compute tasks in various cases, the accumulators would not reflect true metrics. Here is how the documentation describes this:
For accumulator updates performed inside actions only, Spark
guarantees that each task’s update to the accumulator will only be
applied once, i.e. restarted tasks will not update the value. In
transformations, users should be aware of that each task’s update may
be applied more than once if tasks or job stages are re-executed.
This is confusing since most actions do not allow running custom code (where accumulators can be used), they mostly take the results from previous transformations (lazily). The documentation also shows this:
val acc = sc.accumulator(0)
data.map { x => acc += x; f(x) }
// Here, acc is still 0 because no actions have caused the `map` to be computed.
But if we add data.count() at the end, would this be guaranteed to be correct (have no duplicates) or not? Clearly acc is not used "inside actions only", as map is a transformation. So it should not be guaranteed.
On the other hand, discussions on related Jira tickets talk about "result tasks" rather than "actions" (for instance here and here). This seems to indicate that the result would indeed be guaranteed to be correct, since we are using acc immediately before an action and thus it should be computed as a single stage.
I'm guessing that this concept of a "result task" has to do with the type of operations involved, being the last one that includes an action, like in this example, which shows how several operations are divided into stages (in magenta, image taken from here):
So hypothetically, a count() action at the end of that chain would be part of the same final stage, and I would be guaranteed that accumulators used on the last map will not include any duplicates?
Clarification around this issue would be great! Thanks.
To answer the question "When are accumulators truly reliable?"
Answer: when they are used inside an action.
As per the documentation, for updates made inside an action, Spark applies each task's update to the accumulator only once, even if tasks are restarted:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
And actions do allow running custom code.
For example:
val accNotEmpty = sc.accumulator(0)
ip.foreach(x => {
  if (x != "") {
    accNotEmpty += 1
  }
})
But why are Map+Action (i.e. result task) operations not reliable for an accumulator?
1. Task failure: a task fails due to some exception in the code. Spark will try 4 times (the default number of tries). If the task fails every time, Spark gives an exception. If by chance it succeeds, Spark continues and updates the accumulator value only for the successful attempt; the accumulator values of the failed attempts are ignored. Verdict: handled properly.
2. Stage failure: an executor node crashes through no fault of the user, for example a hardware failure, and the node goes down in a shuffle stage. As shuffle output is stored locally, if a node goes down, that shuffle output is gone. So Spark goes back to the stage that generated the shuffle output, looks at which tasks need to be rerun, and executes them on one of the nodes that is still alive. After the missing shuffle output is regenerated, the stage which generated the map output has executed some of its tasks multiple times, and Spark counts accumulator updates from all of them. Verdict: not handled in a result task. The accumulator will give wrong output.
3. Speculative execution: if a task is running slowly, Spark can launch a speculative copy of that task on another node. Verdict: not handled. The accumulator will give wrong output.
4. Cache eviction: a cached RDD is too large to stay in memory. Whenever the RDD is used, Spark reruns the map operation to rebuild it, and the accumulator is updated again. Verdict: not handled. The accumulator will give wrong output.
So the same function may run multiple times on the same data, and Spark does not provide any guarantee for accumulators updated inside a map operation.
So it is better to use accumulators inside actions in Spark.
To know more about accumulators and their issues, refer to this blog post by Imran Rashid.
Accumulator updates are sent back to the driver when a task is successfully completed. So your accumulator results are guaranteed to be correct when you are certain that each task will have been executed exactly once and each task did as you expected.
I prefer relying on reduce and aggregate instead of accumulators because it is fairly hard to enumerate all the ways tasks can be executed.
An action starts tasks.
If an action depends on an earlier stage and the results of that stage are not (fully) cached, then tasks from the earlier stage will be started.
Speculative execution starts duplicate tasks when a small number of slow tasks are detected.
That said, there are many simple cases where accumulators can be fully trusted.
val acc = sc.accumulator(0)
val rdd = sc.parallelize(1 to 10, 2)
val accumulating = rdd.map { x => acc += 1; x }
accumulating.count
assert(acc.value == 10)
Would this be guaranteed to be correct (have no duplicates)?
Yes, if speculative execution is disabled. The map and the count will be a single stage, so like you say, there is no way a task can be successfully executed more than once.
But an accumulator is updated as a side-effect. So you have to be very careful when thinking about how the code will be executed. Consider this instead of accumulating.count:
// Same setup as before.
accumulating.mapPartitions(p => Iterator(p.next)).collect
assert(acc.value == 2)
This will also create one task for each partition, and each task will be guaranteed to execute exactly once. But the code in map will not get executed on all elements, just the first one in each partition.
The accumulator is like a global variable. If you share a reference to the RDD that can increment the accumulator then other code (other threads) can cause it to increment too.
// Same setup as before.
val x = new X(accumulating) // We don't know what X does.
                            // It may trigger the calculation
                            // any number of times.
accumulating.count
assert(acc.value >= 10)
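The difference between a side-effect counter and a returned result can be modelled in a few lines of plain Java. This is a deliberately naive model, not Spark: it assumes every attempt's side effect is applied (the case the documentation warns about for transformations), and AccumulatorReplaySketch with its methods is invented for this illustration. Re-running a partition inflates the counter, while the value returned from the last attempt stays correct, which is why reduce and aggregate are easier to reason about.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Naive model: every attempt's side effect is counted, results are taken once.
final class AccumulatorReplaySketch {
    // Side-effecting "map" step: bumps the shared counter once per element
    // and returns the partition's sum as the task result.
    static int sumPartition(List<Integer> partition, AtomicLong acc) {
        int sum = 0;
        for (int x : partition) {
            acc.incrementAndGet();
            sum += x;
        }
        return sum;
    }

    // Runs every partition `attempts` times, as if each were re-executed.
    // Only the last attempt's return value is kept.
    // Returns {counter value, total of the kept results}.
    static long[] run(List<List<Integer>> partitions, int attempts) {
        AtomicLong acc = new AtomicLong();
        long total = 0;
        for (List<Integer> partition : partitions) {
            int result = 0;
            for (int i = 0; i < attempts; i++) {
                result = sumPartition(partition, acc);
            }
            total += result;
        }
        return new long[] { acc.get(), total };
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2, 3, 4, 5),
            Arrays.asList(6, 7, 8, 9, 10));
        long[] once = run(partitions, 1);
        long[] twice = run(partitions, 2);
        System.out.println("one attempt:  counter=" + once[0] + " total=" + once[1]);
        System.out.println("two attempts: counter=" + twice[0] + " total=" + twice[1]);
    }
}
```

With the two partitions of 1 to 10 used above, one attempt per partition gives a counter of 10 and a total of 55, while two attempts give a counter of 20 and the same total of 55.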
I think Matei answered this in the referred documentation:
As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard to provide good semantics for in the general case (accumulator updates inside non-result stages), for the following reasons:
1. An RDD may be computed as part of multiple stages. For example, if you update an accumulator inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map() again on the MappedRDD, and shuffle the result of that, you get a second stage where that map is pipeline. Do you want to count this accumulator update twice or not?
2. Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or are lost due to a node failure, so anything that tracks RDDs would need to do so for long periods of time (as long as the RDD is referenceable in the user program), which would be pretty complicated to implement.
So I'm going to mark this as "won't fix" for now, except for the part for result stages done in SPARK-3628.

Resources