Internal working of Spark - Communication/Synchronization - apache-spark

I am quite new to Spark but already have programming experience in the BSP model. In the BSP model (e.g. Apache Hama), we have to handle all the communication and synchronization of nodes on our own, which is good on one hand because we have finer control over what we want to achieve, but on the other hand it adds more complexity.
Spark, on the other hand, takes all the control and handles everything on its own (which is great), but I don't understand how it works internally, especially in cases where we have a lot of data and message passing between nodes. Let me give an example:
zb = sc.broadcast(z)                                      // broadcast the current z to every node
r_i = x_i.map(x => Math.pow(norm(x - zb.value), 2))       // per-entry squared distance from z
r_i.checkpoint()
u_i = u_i.zip(x_i).map(ux => ux._1 + ux._2 - zb.value)    // per-entry update using the previous u_i and x_i
u_i.checkpoint()
x_i = f.prox(u_i.map(ui => {zb.value - ui}), rho)         // per-entry update of x_i
x_i.checkpoint()
x = x_i.reduce(_+_) / f.numSplits.toDouble                // global averages gathered via reduce
u = u_i.reduce(_+_) / f.numSplits.toDouble
z = g.prox(x+u, f.numSplits*rho)                          // new z computed from the averages
r = Math.sqrt(r_i.reduce(_+_))                            // global residual
This is a method taken from here, which runs in a loop (let's say 200 times). x_i contains our data (let's say 100,000 entries).
In a BSP-style program, if we had to perform this map operation, we would partition the data and distribute it over multiple nodes. Each node processes its sub-part of the data (the map operation) and returns its result to the master (after a barrier synchronization). Since the master wants to process each individual result (a centralized master; see the figure below), the result for every entry is sent to the master (the reduce operator in Spark). So the master alone receives 100,000 messages after each iteration. It processes this data and sends the new values back to the slaves, which then start processing the next iteration.
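To make the comparison concrete, here is a toy, single-JVM sketch of the BSP-style loop I describe above (the names, data, and update rule are only illustrative, not my real program):
// Toy sketch of the BSP-style flow: each "node" maps its own partition,
// collecting all the results plays the role of the barrier, and the "master"
// does the centralized reduce and produces the next broadcast value.
var z = 0.0
val partitions: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))  // data split across "nodes"
for (k <- 1 to 3) {
  val perNode  = partitions.map(_.map(x => math.pow(x - z, 2)))       // each node's map step
  val received = perNode.flatten                                      // master receives every per-entry result
  z = received.sum / received.size                                    // master's reduce + update, then "rebroadcast"
}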
Now, since Spark takes control away from the user and does everything internally, I am unable to understand how Spark collects all the data after the map operations (asynchronous message passing? I have heard it uses P2P message passing. What about synchronization between map tasks? If it does synchronize, is it fair to say that Spark is actually a BSP model?). Then, in order to apply the reduce function, does it collect all the data on a central machine (if yes, does a single machine receive 100,000 messages?), or does it reduce in a distributed fashion (if yes, how is that performed?)
The following figure shows my reduce function on the master. x_i^(k-1) represents the value calculated (in the previous iteration) for the i-th data entry x_i of my input, and x_i^k represents the value of x_i calculated in the current iteration. Clearly, this equation needs the results to be collected in one place.
I actually want to compare both styles of distributed programming to understand when to use Spark and when to move to BSP. Furthermore, I have searched a lot on the internet; all I find is how map/reduce works, but nothing useful on the actual communication/synchronization. Any helpful material would be useful as well.

Related

How does Hazelcast Jet assign task-to-CPU priority?

If I have the following code and let's say I'm running on 10 nodes of 32 cores each:
IList<...> ds = ....; //large collection, eg 1e6 elements
ds
.map() //expensive computation
.flatMap()//generates 10,000x more elements for every 1 incoming element
.rebalance()
.map() //expensive computation
....//other transformations (ie can be a sink, keyby, flatmap, map etc)
What will Hazelcast do with respect to task-to-CPU assignment priority when the SECOND map operation wants to process the 10,000 elements that were generated from the 1st original element? Will it devote the 320 CPU cores (from 10 nodes) to processing the 1st original element's 10,000 generated elements? If so, will it "boot off" already running tasks? Or will it wait for already running tasks to complete, and then give priority to the 10,000 elements resulting from the output of the flatmap-rebalance operations? Or would the 10,000 elements be forced to run on a single core, since the remaining 319 cores are already being consumed by the output of the ds operation (i.e. the input of the 1st map)? Or is there some random competition for access to the CPU cores?
What I would ideally like to happen is that Hazelcast does NOT boot off running tasks (it lets them complete), but when deciding which tasks get priority to run on a core, it chooses the path that would lead to the lowest latency, i.e. it would process all 10,000 elements that result from the output of the flatmap-rebalance operation on all 320 cores.
Note: I asked a virtually identical question to Flink a few weeks ago, but have since switched to trying out Hazelcast: How does Flink (in streaming mode) assign task-to-CPU priority?
First, IList is a non-distributed data structure; all of its data is stored on a single node. The IList source therefore produces all of the data on that node, so the 1st expensive map will be done entirely on that member. But map is backed, by default, by as many workers as there are cores, so 32 workers in your case.
The rebalance stage will cause the 2nd map to run on all members. Each of the 10,000 elements produced by the 1st map is handled separately, so if you have 1 element in your IList, the 10k elements produced from it will be processed concurrently by 320 workers.
The workers backing the different stages of the pipeline compete for cores normally. There will be 96 workers in total for the 1st map, the 2nd map, and the flatMap together. Jet uses cooperative scheduling for these workers, which means it cannot preempt a computation that is taking too long; one item that takes a long time to process will therefore block other workers.
Also keep in mind that the map and flatMap functions must be cooperative; that means they must not block (by waiting on I/O, sleeping, or waiting on monitors). If they block, you'll see less than 100% CPU utilization. Check out the documentation for more information.

How to reduce white space in the task stream?

I have obtained the task stream using distributed computing in Dask for different numbers of workers. I observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computation per worker), I see a similar trend. Can anyone suggest how to reduce the white space?
PS: I need to scale the computation to thousands of workers, so reducing the number of workers is not an option for me.
(Task stream screenshots attached for 16, 32, and 64 workers.)
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can be caused by many reasons. I'll list a few below:
Very short tasks (sub millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images I don't think that any of these apply to you.
Instead, I see that there are gaps of inactivity followed by bursts of activity. My guess is that this is caused by some code that you are running locally; your code probably looks something like the following:
for i in ...:
    results = dask.compute(...)  # do some dask work
    next_inputs = ...            # do some local work
So you're being blocked by doing some local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or maybe it's the fault of your code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html

Spark and isolating time taken for tasks

I recently began to use Spark to process a huge amount of data (~1TB), and I have been able to get the job done. However, I am still trying to understand how it works. Consider the following scenario:
Set a reference time (say tref)
Do any one of the following two tasks:
1a. Read a large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs, OR
1b. Read the data as above, do additional preprocessing work, and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable, and the time difference w.r.t. tref (i.e., t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = time required for preprocessing? Where can I find a reliable reference for this?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy computation approach and its capability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that can be computed while thousands of input files are being read? The origin of the suspicion is the unbelievable performance I see (with results verified as okay), which looks too fantastic to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
  val start = System.currentTimeMillis
  val res = op
  val end = System.currentTimeMillis
  (end - start) / 1000f
}

def XYZ = {
  val r00 = sc.parallelize(0 to 999999)
  val r01 = r00.map(x => (x, (x, x, x, x, x, x, x)))
  r01.join(r01).count()
}

val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with actions that cause actual execution. This approach has its limitations, though, because of lazy evaluation and the like.
On the other hand, the Spark Web UI records every action, and records the stage duration for each action.
In general, performance measuring in shared environments is difficult. With dynamic allocation in Spark on a noisy cluster, you hold on to acquired resources during a stage, but on successive runs of the same or the next stage you may get fewer resources. Still, this is at least indicative, and you can run in a less busy period.
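Applied to your 1a/1b comparison, a rough sketch using the timeIt helper above might look like the following (readFiles and preprocess are hypothetical stand-ins for your SciSpark read and your preprocessing step, not real APIs):
// Hypothetical stand-ins, just so the sketch is concrete; substitute your real code.
def readFiles() = sc.parallelize(1 to 1000000)                        // stands in for reading the ~1TB of files
def preprocess(rdd: org.apache.spark.rdd.RDD[Int]) = rdd.map(_ * 2)  // stands in for the extra work in 1b

// An action (count) in each branch forces execution; without one, lazy
// evaluation means timeIt measures almost nothing.
val t0a = timeIt { readFiles().count() }
val t0b = timeIt { preprocess(readFiles()).count() }
println(s"approx. preprocessing time: ${t0b - t0a} seconds")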

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of the resulting DStreams, windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same kafkaDStream? Or would there be duplication of data?
Also, are there more efficient ways to handle such a use case?
Thanks in advance
Your (i) streams look sensible: they will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note that your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for computing an average is small (just 2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
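For the average, a minimal sketch of carrying just that (sum, count) pair through the 5-minute window (statA is a hypothetical extractor for stat 'A' from a Kafka record, not your actual code):
// Only two numbers of state per window: the running sum and the count.
val avgA = kafkaDStream
  .map(record => (statA(record), 1L))                                        // (value, 1) per record
  .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Minutes(5), Minutes(1))  // sum both components over the window
  .map { case (sum, count) => sum / count }                                  // final division per window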
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB derived from it (with window(Hours(1)), since 60 is a multiple of 5).
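A minimal sketch of that variant (using a stand-in socket source in place of your Kafka stream, and Minutes(60) since that helper definitely exists):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Batch interval of 5 minutes: streamA is simply the incoming batches,
// and streamB is derived from the same stream with a 1-hour window
// (60 is a multiple of 5, so B reuses A's batches).
val conf = new SparkConf().setAppName("multi-window-sketch")
val ssc = new StreamingContext(conf, Minutes(5))
val input = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka DStream

val streamA = input
val streamB = input.window(Minutes(60))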

When are accumulators truly reliable?

I want to use an accumulator to gather some stats about the data I'm manipulating in a Spark job. Ideally, I would do that while the job computes the required transformations, but since Spark re-computes tasks in various cases, the accumulators would not reflect true metrics. Here is how the documentation describes this:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
This is confusing since most actions do not allow running custom code (where accumulators can be used); they mostly take the results of previous transformations (lazily). The documentation also shows this:
val acc = sc.accumulator(0)
data.map { x => acc += x; f(x) }
// Here, acc is still 0 because no actions have caused the `map` to be computed.
But if we add data.count() at the end, would this be guaranteed to be correct (have no duplicates) or not? Clearly acc is not used "inside actions only", since map is a transformation, so it should not be guaranteed.
On the other hand, discussion on related Jira tickets talks about "result tasks" rather than "actions", for instance here and here. This seems to indicate that the result would indeed be guaranteed to be correct, since we are using acc immediately before an action, and it should thus be computed in a single stage.
I'm guessing that this concept of a "result task" has to do with the type of operations involved, namely the last stage, which includes an action, as in this example showing how several operations are divided into stages (in magenta, image taken from here):
So, hypothetically, would a count() action at the end of that chain be part of the same final stage, and would I then be guaranteed that accumulators used in the last map will not include any duplicates?
Clarification around this issue would be great! Thanks.
To answer the question "When are accumulators truly reliable?":
Answer: when they are present in an action operation.
As per the documentation, even if restarted tasks are present, the accumulator is updated only once for an action (result) task.
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
And actions do allow running custom code.
For example:
val accNotEmpty = sc.accumulator(0)
ip.foreach(x => {
  if (x != "") {
    accNotEmpty += 1
  }
})
But why are Map + Action (viz. result task) operations not reliable for an accumulator?
Task failure: a task fails due to some exception in the code. Spark will try it 4 times (the default number of attempts). If the task fails every time, the job fails with an exception. If one attempt succeeds, Spark continues and updates the accumulator value only for the successful attempt; the accumulator values from failed attempts are ignored. Verdict: handled properly.
Stage failure: an executor node crashes, through no fault of the user but a hardware failure, and the node goes down during the shuffle stage. Since shuffle output is stored locally, when a node goes down that shuffle output is gone. Spark goes back to the stage that generated the shuffle output, looks at which tasks need to be rerun, and executes them on one of the nodes that is still alive. After the missing shuffle output is regenerated, the stage that produced the map output has executed some of its tasks multiple times, and Spark counts the accumulator updates from all of them. Verdict: not handled in a result task; the accumulator will give wrong output.
Speculative execution: if a task is running slowly, Spark can launch a speculative copy of that task on another node. Verdict: not handled; the accumulator will give wrong output.
Evicted cache: a cached RDD is huge and cannot reside in memory, so whenever the RDD is used its map operation is re-run to recompute it, and the accumulator is updated by it again. Verdict: not handled; the accumulator will give wrong output.
So the same function may run multiple times on the same data, and Spark provides no guarantee that the accumulator is updated correctly by a map operation.
It is therefore better to use accumulators in action operations in Spark.
To know more about accumulators and their issues, refer to this blog post by Imran Rashid.
Accumulator updates are sent back to the driver when a task is successfully completed. So your accumulator results are guaranteed to be correct when you are certain that each task will have been executed exactly once and each task did what you expected.
I prefer relying on reduce and aggregate instead of accumulators because it is fairly hard to enumerate all the ways tasks can be executed (see the sketch after the list below).
An action starts tasks.
If an action depends on an earlier stage and the results of that stage are not (fully) cached, then tasks from the earlier stage will be started.
Speculative execution starts duplicate tasks when a small number of slow tasks are detected.
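As an illustration of the reduce/aggregate alternative, here is a minimal sketch of counting non-empty strings with aggregate rather than an accumulator (toy data, counting logic analogous to the accNotEmpty example above):
// Counting without a side-effecting accumulator: retried or speculative tasks
// cannot inflate the result, because the count is part of the job's result itself.
val strings = sc.parallelize(Seq("a", "", "b", "", "c"))
val nonEmpty = strings.aggregate(0)(
  (count, s) => if (s != "") count + 1 else count,  // fold within each partition
  _ + _                                             // merge per-partition counts
)
// nonEmpty == 3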
That said, there are many simple cases where accumulators can be fully trusted.
val acc = sc.accumulator(0)
val rdd = sc.parallelize(1 to 10, 2)
val accumulating = rdd.map { x => acc += 1; x }
accumulating.count
assert(acc.value == 10)
Would this be guaranteed to be correct (have no duplicates)?
Yes, if speculative execution is disabled. The map and the count will be a single stage, so like you say, there is no way a task can be successfully executed more than once.
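(Speculation is off by default; for completeness, a minimal sketch of pinning it off explicitly — the app name is just a placeholder:)
import org.apache.spark.SparkConf

// spark.speculation defaults to false; set it explicitly if you want to be
// certain no speculative duplicate of a task can run.
val conf = new SparkConf()
  .setAppName("accumulator-example")
  .set("spark.speculation", "false")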
But an accumulator is updated as a side-effect. So you have to be very careful when thinking about how the code will be executed. Consider this instead of accumulating.count:
// Same setup as before.
accumulating.mapPartitions(p => Iterator(p.next)).collect
assert(acc.value == 2)
This will also create one task for each partition, and each task will be guaranteed to execute exactly once. But the code in map will not get executed on all elements, just the first one in each partition.
The accumulator is like a global variable. If you share a reference to the RDD that can increment the accumulator then other code (other threads) can cause it to increment too.
// Same setup as before.
val x = new X(accumulating) // We don't know what X does.
                            // It may trigger the calculation
                            // any number of times.
accumulating.count
assert(acc.value >= 10)
I think Matei answered this in the referred documentation:
As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard to provide good semantics for in the general case (accumulator updates inside non-result stages), for the following reasons:
An RDD may be computed as part of multiple stages. For example, if you update an accumulator inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map() again on the MappedRDD, and shuffle the result of that, you get a second stage where that map is pipelined. Do you want to count this accumulator update twice or not?
Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or are lost due to a node failure, so anything that tracks RDDs would need to do so for long periods of time (as long as the RDD is referenceable in the user program), which would be pretty complicated to implement.
So I'm going to mark this as "won't fix" for now, except for the part for result stages done in SPARK-3628.

Resources