Confusion about Spark accumulator updates performed inside actions - apache-spark

From the official docs we can see that:
For accumulator updates performed inside actions only, Spark
guarantees that each task’s update to the accumulator will only be
applied once, i.e. restarted tasks will not update the value. In
transformations, users should be aware of that each task’s update may
be applied more than once if tasks or job stages are re-executed.
I think this means that accumulator updates should be performed inside actions only, such as rdd.foreachPartition().
Looking at rdd.foreachPartition's API code in PySpark, I find that rdd.foreachPartition(accum_func) is equivalent to:
rdd.mapPartitions(accum_func).mapPartitions(lambda i: [sum(1 for _ in i)]).mapPartitions(lambda x: [sum(x)]).mapPartitions(some_add_func).collect()
It seems that accum_func can run inside transformations (rdd.mapPartitions)?
Thanks a lot for any explanation.

If the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a "speculative" copy of the task on another node and take its result if that finishes. Even if no nodes fail, Spark may have to rerun a task to rebuild a cached value that falls out of memory. The net result is therefore that the same function may run multiple times on the same data, depending on what happens on the cluster.
For accumulators used in actions, Spark applies each task's update to each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of failures or multiple evaluations, we must put it inside an action like foreach().
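
To make the difference concrete, here is a minimal PySpark sketch (the data and accumulator names are made up for illustration) that updates one accumulator inside a map() transformation and another inside the foreach() action; only the second is guaranteed to be applied exactly once per task, even under retries or speculative execution:

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-demo")
rdd = sc.parallelize(range(100), 4)

map_acc = sc.accumulator(0)     # updated inside a transformation
action_acc = sc.accumulator(0)  # updated inside an action

def tag(x):
    map_acc.add(1)              # may be applied more than once if the task is re-run
    return x

rdd.map(tag).count()                        # transformation update: no exactly-once guarantee
rdd.foreach(lambda x: action_acc.add(1))    # action update: applied once per successful task

print(map_acc.value, action_acc.value)      # both are 100 on a healthy run; only the second is guaranteed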

Related

What does Spark do if a node running a .foreach() fails?

We have a large RDD with millions of rows. Each row needs to be processed with a third-party optimizer that is licensed (Gurobi). We have a limited number of licenses.
We have been calling the optimizer in the Spark .map() function. The problem is that Spark will run many more mappers than it needs and throw away the results. This causes a problem with license exhaustion.
We're looking at calling Gurobi inside the Spark .foreach() method. This works, but we have two problems:
Getting the data back from the optimizer into another RDD. Our tentative plan for this is to write the results into a database (e.g. MongoDB or DynamoDB).
What happens if the node on which the .foreach() method is running dies? Spark guarantees that each foreach only runs once. Does it detect the failure and restart it elsewhere? Or does something else happen?
In general, if a task executed with foreachPartition dies, the whole job dies (once any configured retries are exhausted).
This means that, if no additional steps are taken to ensure correctness, a partial result might already have been acknowledged by an external system, leading to an inconsistent state.
Considering the limited number of licenses, map or foreachPartition shouldn't make any difference. Without going into whether using Spark makes sense in this case at all, the best way to resolve it is to limit the number of executor cores to the number of licenses you own.
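
A minimal sketch of that suggestion, assuming 4 licenses (the app name and numbers are made up; spark.executor.instances takes effect on YARN, for example): with 4 executors of 1 core each, at most 4 tasks, and therefore at most 4 optimizer calls, can run at a time.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gurobi-job")                 # hypothetical app name
        .set("spark.executor.instances", "4")     # one executor per license (assumed: 4 licenses)
        .set("spark.executor.cores", "1"))        # one concurrent task per executor
sc = SparkContext(conf=conf)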
If the goal here is just to limit the number of concurrent calls to X, I would repartition the RDD into X partitions and then run a partition-level operation, as sketched below. I think that should keep you from exhausting the licenses.
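
A rough sketch of the repartitioning idea, assuming X = 4 licenses and a hypothetical solve_with_gurobi(row) helper; with only 4 partitions, at most 4 optimizer calls run concurrently, and the results stay in an RDD rather than going through an external store:

NUM_LICENSES = 4                               # assumed license count

def solve_partition(rows):
    # each partition makes one sequential stream of optimizer calls
    for row in rows:
        yield solve_with_gurobi(row)           # hypothetical third-party call

results = (rdd                                 # rdd is your existing RDD of rows
           .repartition(NUM_LICENSES)
           .mapPartitions(solve_partition))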

Does count() cause map() code to execute in Spark?

So, I know that Spark is a lazy executor.
For example, if I call
post = pre.filter(lambda x: some_condition(x)).map(lambda x: do_something(x))
I know that it won't immediately execute.
But what happens to the above code when I call post.count()? I imagine the filtering would be forced into execution, since pre and post will likely not have the same number of rows because of the filter condition. However, map is a 1-to-1 relationship, so the count would not be affected by it. Would the map command be executed here given the count()?
Follow up: When I want to force execution of map statements (assuming count() doesn't work), what can I call to force execution? I'd prefer to not have to use saveAsTextFile().
count will execute all transformations in the lineage unless some stages can be fetched from cache. It means that every transformation will be executed at least once, so as long as you don't depend on some kind of side effect triggered by some_condition or do_something, it should work just fine.
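
One way to convince yourself of this (a sketch assuming an existing SparkContext sc; the predicate and mapping are stand-ins for your some_condition and do_something) is to attach an accumulator to each step and check that count() bumps both:

filter_calls = sc.accumulator(0)
map_calls = sc.accumulator(0)

def some_condition(x):
    filter_calls.add(1)
    return x % 2 == 0

def do_something(x):
    map_calls.add(1)
    return x * 10

pre = sc.parallelize(range(10))
post = pre.filter(some_condition).map(do_something)

post.count()
print(filter_calls.value, map_calls.value)   # both are non-zero: the map ran as well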

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that after partition 1 has been computed, partition 2 can be computed, and after 2 is computed, finally, partition 3 can be computed. The reason I need this synchronization is that each partition has a dependency on some computation of the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition( partition => {
  // 1. open jdbc connection
  // 2. poll database for the completion of the dependent partition
  // 3. read the dependent edge-case value from the computed dependent partition
  // 4. compute this partition
  // 5. write this partition's edge-case result to the database
  // 6. close connection
})
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really an issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get none of the benefits of in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:
1. Create the first RDD with data from the first partition. Process it in parallel and optionally push results outside.
2. Compute the differential buffer.
3. Create the second RDD with data from the second partition. Merge it with the differential buffer from 2, process, and optionally push results to the database.
4. Go back to 2. and repeat.
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
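
A rough driver-side sketch of that loop, where load_chunk, process, compute_buffer, and num_chunks are hypothetical stand-ins for your domain logic:

buffer = None                                       # edge-case state carried over from the previous chunk

for i in range(num_chunks):
    chunk_rdd = load_chunk(sc, i)                   # hypothetical: build an RDD for chunk i
    # bind the current buffer explicitly so the closure doesn't pick up a later value
    result_rdd = chunk_rdd.mapPartitions(
        lambda rows, b=buffer: process(rows, b))    # hypothetical per-partition logic
    result_rdd.cache()
    # ... optionally write result_rdd out here ...
    buffer = compute_buffer(result_rdd)             # small summary collected back to the driver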

Concurrent operations in spark streaming

I wanted to understand something about the internals of spark streaming executions.
If I have a stream X, and in my program I send stream X to function A and function B:
In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Now I do a forEach Operation on Z and print the output to a file.
Then in function B, I reduce stream X -> X2 (say min value of each RDD), and print the output to file
Are both functions being executed for each RDD in parallel? How does it work?
Thanks
--- Comments from Spark Community ----
I am adding comments from the spark community -
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
@Eswara's answer seems right, but it does not apply to your use case, as your separate transformation DAGs (X->Y->Z and X->X2) have a common DStream ancestor in X. This means that when the actions are run to trigger each of these flows, the transformation X->Y and the transformation X->X2 cannot happen at the same time. What will happen is that the partitions for RDD X will be either computed or loaded from memory (if cached) for each of these transformations separately, in a non-parallel manner.
Ideally what would happen is that the transformation X->Y would resolve and then the transformations Y->Z and X->X2 would finish in parallel as there is no shared state between them. I believe Spark's pipelining architecture would optimize for this. You can ensure faster computation on X->X2 by persisting DStream X so that it can be loaded from memory rather than being recomputed or being loaded from disk. See here for more information on persistence.
What would be interesting is if you could provide the replication storage levels *_2 (e.g. MEMORY_ONLY_2 or MEMORY_AND_DISK_2) to be able to run transformations concurrently on the same source. I think those storage levels are currently only useful against lost partitions, as the duplicate partition will be processed in place of the lost one.
Yes.
It's similar to spark's execution model which uses DAGs and lazy evaluation except that streaming runs the DAG repeatedly on each fresh batch of data.
In your case, since the DAGs (or sub-DAGs of a larger DAG, if one prefers to call them that) required to finish each action (each of the 2 foreachs you have) do not have common links all the way back to the source, they run completely in parallel. The streaming application as a whole gets X executors (JVMs) and Y cores (threads) per executor allotted at the time of application submission to the resource manager. At any time, a given task (i.e., thread) among the X*Y tasks will be executing a part or whole of one of these DAGs. Note that any 2 given threads of an application, whether in the same executor or otherwise, can execute different actions of the same application at the same time.
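
For concreteness, a minimal PySpark Streaming sketch of the X -> (Y -> Z, X2) shape discussed above; the socket source, port, and output path are placeholders, and persisting X follows the earlier suggestion so the second flow can reuse X's partitions instead of recomputing them:

from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="two-flows")
ssc = StreamingContext(sc, 10)                      # 10-second batches

x = ssc.socketTextStream("localhost", 9999)         # stream X (source is a placeholder)
x.persist(StorageLevel.MEMORY_ONLY)                 # let both flows reuse X's partitions

# Flow A: X -> Y -> Z, then an output operation
y = x.filter(lambda line: line)                     # stand-in transformation
z = y.map(lambda line: line.upper())                # stand-in transformation
z.saveAsTextFiles("out/z")                          # placeholder output path

# Flow B: X -> X2 (min line length per batch), a separate output operation
x2 = x.map(len).reduce(min)
x2.pprint()

ssc.start()
ssc.awaitTermination()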

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to mappers and the driver is (somehow) similar to a reducer. As there is only one driver program, there will be one reducer. If so, how can simple programs like word count for very big data sets get done in Spark? Because the driver can simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle via the stage splits, either in the UI or via a toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
In the above, this will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping (map) can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it will need to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason. Each node will perform the operation defined in reduceByKey locally, only merging the final data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data is NOT pulled back to the driver; it merely moves to nodes that share the same keys so that the final value can be merged there.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
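
To see the stage split described above for yourself, a PySpark version of the same word count (the input path is a placeholder) can print its lineage with toDebugString(); the indentation change marks the shuffle introduced by reduceByKey, and nothing is pulled back to the driver until you run an action like take():

counts = (sc.textFile("hdfs:///path/to/input.txt")   # placeholder path
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.toDebugString().decode())               # indentation level changes at the shuffle
top10 = counts.take(10)                              # only a small sample is returned to the driver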

Resources