We have a large RDD with millions of rows. Each row needs to be processed with a third-party optimizer that is licensed (Gurobi). We have a limited number of licenses.
We have been calling the optimizer in the Spark .map() function. The problem is that Spark will run many more mappers than it needs and throw away the results. This causes a problem with license exhaustion.
We're looking at calling Gurobi inside the Spark .foreach() method. This works, but we have two problems:
Getting the data back from the optimizer into another RDD. Our tentative plan for this is to write the results into a database (e.g. MongoDB or DynamoDB).
What happens if the node on which the .foreach() method is running dies? Spark guarantees that each foreach only runs once. Does it detect that the node died and restart the work elsewhere? Or does something else happen?

In general, if a task executed with foreachPartition dies, the whole job dies.
This means that, if no additional steps are taken to ensure correctness, partial results might already have been acknowledged by an external system, leading to an inconsistent state.
Given the limited number of licenses, choosing map or foreachPartition shouldn't make any difference. Leaving aside whether using Spark makes sense in this case, the best way to resolve it is to limit the number of executor cores to the number of licenses you own.

If the goal here is just to limit the number of concurrent calls to X, I would repartition the RDD into X partitions and then run a partition-level operation. I think that should keep you from exhausting the licenses.
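A minimal sketch of the partition-level approach suggested above. The names optimizer_session and solve_row are hypothetical stand-ins for acquiring a license and calling the optimizer; they do not come from the question. The function is exercised on a plain iterator so it runs without a cluster.

```python
from contextlib import contextmanager


# Hypothetical stand-in: a real job would acquire and release a license here.
@contextmanager
def optimizer_session():
    session = {"solved": 0}  # pretend this holds one license
    try:
        yield session
    finally:
        session["closed"] = True  # release the license


def solve_row(session, row):
    # Hypothetical placeholder for the actual optimizer call.
    session["solved"] += 1
    return row * 2


def solve_partition(rows):
    """Partition-level function: one session (license) per partition."""
    with optimizer_session() as session:
        for row in rows:
            yield solve_row(session, row)


# With Spark this would be wired up roughly as:
#   results = rdd.repartition(num_licenses).mapPartitions(solve_partition)
# Here the function is exercised on a plain iterator instead.
partition_output = list(solve_partition(iter([1, 2, 3])))
print(partition_output)
```

With X partitions, that stage runs at most X tasks at a time, so at most X sessions should be open at once. Task retries and speculative execution (spark.speculation, off by default) can still start extra attempts, so this is a cap under normal operation rather than a hard guarantee.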


Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect if code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible, it would relieve application developers of the requirement to write deterministic code, which might sometimes be challenging, and in any case this requirement can easily be forgotten.
No. Spark will not be able to handle non-deterministic code in case of failures. The fundamental data structure of Spark, the RDD, is not only immutable but should also be a deterministic function of its input. This is necessary because otherwise the Spark framework cannot recompute part of an RDD (a partition) in case of failure. If the recomputed partition is not deterministic, Spark would have to re-run the transformation on the full RDDs in the lineage. I don't think Spark is the right framework for non-deterministic code.
If Spark has to be used for such a use case, the application developer has to take care of keeping the output consistent by writing code carefully. This can be done by using RDDs only (no DataFrame or Dataset) and persisting the output after every transformation that executes non-deterministic code. If performance is a concern, the intermediate RDDs can be persisted on Alluxio.
A long-term approach would be to open a feature request in the Apache Spark JIRA, but I am not too optimistic about the feature being accepted. A small hint in the syntax could indicate whether code is deterministic, and the framework could then switch between recovering the RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in Spark 2.4.1, which I'm using).
I have encountered issues with this a few times in Spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time the function is run. If a Spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.
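The effect of ties in the ordering column can be reproduced without Spark. The sketch below uses made-up rows and a plain Python sort as a stand-in for the window ordering. It shows that the "first" row depends on arrival order when field_3 ties, and that adding a unique tie-breaker to the ordering removes the ambiguity.

```python
# Rows are (field_1, field_2, field_3); the first two rows tie on field_3.
rows = [("a", "g1", 10), ("b", "g1", 10), ("c", "g1", 20)]


def first_value(rows, order_key):
    """Return field_1 of the first row after ordering by order_key."""
    return sorted(rows, key=order_key)[0][0]


# Ordering by field_3 alone: the winner depends on the arrival order of
# the rows, which a shuffle or a recomputed task does not guarantee.
run_1 = first_value(rows, order_key=lambda r: r[2])
run_2 = first_value(list(reversed(rows)), order_key=lambda r: r[2])

# Adding a unique tie-breaker (here field_1) makes the ordering total.
stable_1 = first_value(rows, order_key=lambda r: (r[2], r[0]))
stable_2 = first_value(list(reversed(rows)), order_key=lambda r: (r[2], r[0]))

print(run_1, run_2, stable_1, stable_2)
```

In the SQL form, the equivalent fix would be to extend the order by clause with a column (or combination of columns) that is unique within each field_2 partition.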

Spark performs poorly when generating non-associative features

I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource1: RDD1 = [(key, (time, quantity, user-id, ...)j)], where "..." stands for a bunch of other attributes such as transaction-id, etc.
Datasource2: RDD2 = [(key, (t1, t2)j)]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features as follows:
associative features: e.g. number of items
non-associative features: e.g. number of unique users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
The initial value for my feature would be featureObj=(0,set([])) where the first argument keeps number of items and the second stores number of unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
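A rough sketch of what such an aggregation could look like, written in plain Python so it runs without a cluster. The sample data, the open interval test, and the choice to sum the quantity field as "number of items" are all assumptions for illustration; the original job's code is not shown in the post.

```python
from functools import reduce

# Hypothetical sample data; the field layout follows the description above.
events = [  # (key, (time, quantity, user_id))
    ("k1", (5, 2, "u1")),
    ("k1", (7, 1, "u2")),
    ("k1", (8, 4, "u1")),
    ("k1", (50, 9, "u3")),  # outside the interval, must be ignored
]
intervals = {"k1": (0, 10)}  # key -> (t1, t2)


def seq_op(feature, event):
    """Fold one event into the running (item_count, user_ids) pair."""
    count, users = feature
    _time, quantity, user_id = event
    return (count + quantity, users | {user_id})


def comb_op(left, right):
    """Merge two partial results, e.g. from different partitions."""
    return (left[0] + right[0], left[1] | right[1])


def in_interval(key, event):
    t1, t2 = intervals[key]
    return t1 < event[0] < t2


matching = [e for k, e in events if in_interval(k, e)]
# Split the events in two to mimic partial aggregation on two partitions.
part_a = reduce(seq_op, matching[:2], (0, set()))
part_b = reduce(seq_op, matching[2:], (0, set()))
count, users = comb_op(part_a, part_b)
print(count, len(users))
```

A pair of functions of this shape is what aggregateByKey(zeroValue, seqFunc, combFunc) expects in PySpark. Note that the set of user ids grows with the data, so every partial result that crosses a shuffle carries the whole set, which may be part of why each additional feature of this kind costs more than a simple counter.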
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge instances, in only 3 minutes. The minute I add the second one, the computation time jumps to 5 minutes. I tried adding a couple of other non-associative features and, every time, the run time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of which are non-associative. I also tried using KryoSerializer and persisting the RDDs in serialized form, but nothing special happened. Since I will be implementing more features, this issue looks set to become a bottleneck.
PS. I tried to do the same task on a single big host (128 GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is the join. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows "2.6GB" shuffle read and "15.6GB" shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for a wrong application? Thanks for the comments in advance.
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations, even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak in order to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in Netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs became much more stable.
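As an illustration, the settings mentioned above could be passed on the command line as spark-submit --conf arguments. The small helper below is hypothetical and only renders the flags. The -XX:+UseG1GC option is an assumption about how G1 was selected, since the answer does not list the exact GC options used.

```python
# Settings discussed above, expressed as Spark configuration properties.
# The G1 flag is an assumption; the answer does not give the GC options.
shuffle_tuning = {
    "spark.shuffle.io.preferDirectBufs": "false",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
}


def to_submit_args(conf):
    """Render a dict of properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(shuffle_tuning)))
```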

Spark join always stuck on the same task, how can I debug?

I am using pyspark to run a join of this sort:
rdd1=sc.textFile(hdfs_dir1).map(lambda row: (getKey1(row),getData1(row)))
rdd2=sc.textFile(hdfs_dir2).map(lambda row: (getKey2(row),getData2(row)))
The job executes the first 300 tasks quite fast (~seconds each), and hangs when reaching task 301/308, even when I let it run for days.
I tried to run the pyspark shell with different configuration (number of workers, memory, cpus, cores, shuffle rates) and the result is always the same.
What can be the cause, and how can I debug it?
Has anyone been able to solve this problem? My guess is that the issue is caused by shuffling data between executors. I used two ridiculously small datasets (10 records) with no missing keys, and the join operation still got stuck. I eventually had to kill the instance. The only thing that helped in my case was cache().
If we take above example
rdd1=sc.textFile(hdfs_dir1).map(lambda row: (getKey1(row),getData1(row)))
rdd2=sc.textFile(hdfs_dir2).map(lambda row: (getKey2(row),getData2(row)))
# cache it
rdd1.cache()
rdd2.cache()
# I also tried rdd1.collect() and rdd2.collect() to get data cached
# then try the joins
result = rdd1.join(rdd2)
# I would get the answer
result.collect() # it works
I am not able to find out why caching works, though (apparently, it should have worked without cache() too).
Collect will try to fetch the result of your join in the application driver node and you will run into memory issues.
The join operation will cause a lot of shuffling, but you can reduce this by using Bloom filters. You construct a Bloom filter for the keys of one RDD, broadcast it, and use it to filter the other RDD. After applying these operations you should expect smaller RDDs (unless both contain exactly the same keys), and your join operation should be much faster.
The Bloom filter can be collected efficiently, since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.
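A small, self-contained sketch of the idea, not a production implementation. The BloomFilter class, its parameters, and the sample keys are all made up for illustration. It shows the two properties the answer relies on: filters merge with bitwise OR, and the merged filter can pre-filter the other side of the join.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def union(self, other):
        # Bitwise OR merges two filters built with the same parameters.
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


# Build one filter per (pretend) partition of the first RDD's keys, then merge.
part_1, part_2 = BloomFilter(), BloomFilter()
for key in ["k1", "k2"]:
    part_1.add(key)
for key in ["k3"]:
    part_2.add(key)
merged = part_1.union(part_2)

# Pre-filter the other side before the join; false positives are possible,
# false negatives are not.
other_side = [("k1", "x"), ("k3", "y"), ("k9", "z")]
candidates = [(k, v) for k, v in other_side if merged.might_contain(k)]
print(candidates)
```

In a Spark job, the per-partition filters could be built with mapPartitions, merged with reduce using union, and shipped to the executors with sc.broadcast before filtering the second RDD. Because of false positives, the real join is still needed afterwards.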
You can narrow down whether this is a problem with the collect() call by calling count() on the joined RDD instead, to see whether the issue is pulling the results into the driver.
If the count works, it might be best to add a sample or limit before calling collect() if you are only trying to view the results.
You can also look at the task in the Spark UI to see if a task has been assigned to a particular executor and use the UI again to look at the executor logs. Within the executors tab, you can take a thread dump of the executor that is handling the task. If you take a few thread dumps and compare them, check to see if there's a thread that's hung.
Look at the driver log4j logs, stdout / stderr logs for any additional errors.

Concurrent operations in spark streaming

I wanted to understand something about the internals of spark streaming executions.
If I have a stream X, and in my program I send stream X to function A and function B:
In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Then I do a foreach operation on Z and print the output to a file.
Then in function B, I reduce stream X -> X2 (say, the min value of each RDD) and print the output to a file.
Are both functions being executed for each RDD in parallel? How does it work?
I am adding comments from the spark community -
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
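The threading pattern described above can be sketched without Spark. The two functions below are hypothetical stand-ins for the driver-side code that would trigger each action. The semaphore only enforces that A is submitted before B, as the comment suggests.

```python
import threading

execution_order = []
first_submitted = threading.Semaphore(0)  # released once job A is submitted


def run_job_a():
    # Stand-in for triggering the foreach action on stream Z.
    execution_order.append("A")
    first_submitted.release()


def run_job_b():
    # Stand-in for triggering the reduce on stream X; waits for A first.
    first_submitted.acquire()
    execution_order.append("B")


# Start B first on purpose: the semaphore still forces A to go first.
threads = [threading.Thread(target=run_job_b), threading.Thread(target=run_job_a)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(execution_order)
```

Without the semaphore, whichever thread reaches its submission first would win, which matches the "whichever gets submitted first gets executed first" behaviour described in the comment.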
@Eswara's answer seems right, but it does not apply to your use case, as your separate transformation DAGs (X->Y->Z and X->X2) have a common DStream ancestor in X. This means that when the actions are run to trigger each of these flows, the transformation X->Y and the transformation X->X2 cannot happen at the same time. What will happen is that the partitions for RDD X will be either computed or loaded from memory (if cached) for each of these transformations separately, in a non-parallel manner.
Ideally, the transformation X->Y would resolve first and then the transformations Y->Z and X->X2 would finish in parallel, as there is no shared state between them. I believe Spark's pipelining architecture would optimize for this. You can ensure faster computation of X->X2 by persisting DStream X so that it can be loaded from memory rather than being recomputed or loaded from disk. See the Spark documentation on persistence for more information.
What would be interesting is if the replicated storage levels *_2 (e.g. MEMORY_ONLY_2 or MEMORY_AND_DISK_2) could be used to run transformations concurrently on the same source. I think those storage levels are currently only useful against lost partitions, as the duplicate partition will be processed in place of the lost one.
It's similar to Spark's execution model, which uses DAGs and lazy evaluation, except that streaming runs the DAG repeatedly on each fresh batch of data.
In your case, since the DAGs (or sub-DAGs of a larger DAG, if one prefers to call them that) required to finish each action (each of the 2 foreachs you have) do not have common links all the way back to the source, they run completely in parallel. The streaming application as a whole gets X executors (JVMs) and Y cores (threads) per executor, allotted at the time the application is submitted to the resource manager. At any time, a given task (i.e., thread) among the X*Y tasks will be executing part or all of one of these DAGs. Note that any 2 given threads of an application, whether in the same executor or not, can execute different actions of the same application at the same time.

Does it make sense to run Spark job for its side effects?

I want to run a Spark job, where each RDD is responsible for sending certain traffic over a network connection. The return value from each RDD is not very important, but I could perhaps ask them to return the number of messages sent. The important part is the network traffic, which is basically a side effect for running a function over each RDD.
Is it a good idea to perform the above task in Spark?
I'm trying to simulate network traffic from multiple sources to test the data collection infrastructure on the receiving end. I could instead manually setup multiple machines to run the sender, but I thought it'd be nice if I could take advantage of Spark's existing distributed framework.
However, it seems like Spark is designed for programs to "compute" and then "return" something, not for programs to run for their side effects. I'm not sure if this is a good idea, and would appreciate input from others.
To be clear, I'm thinking of something like the following
from operator import add

IDs = sc.parallelize(range(0, n))

def f(x):
    for i in range(0, 100):
        message = make_message(x, i)
        SEND_OVER_NETWORK(message)  # the side effect: send the message
    return (x, 100)

IDsOne = IDs.map(f)
counts = IDsOne.reduceByKey(add)
for (ID, count) in counts.collect():
    print("%i ran %i times" % (ID, count))
Generally speaking, it doesn't make sense:
1. Spark is a heavyweight framework. At its core there is huge machinery which ensures that data is properly distributed and collected, that recovery is possible, and so on. It has a significant impact on overall performance and latency but doesn't provide any benefits for side-effects-only tasks.
2. Spark concurrency has relatively coarse granularity, with the partition being the main unit of concurrency. At this level processing becomes synchronous: you cannot move on to the next partition before you finish the current one.
   Let's say in your case there is a single slow SEND_OVER_NETWORK. If you use map, you pretty much block processing of the whole partition. You can go to a lower level with mapPartitions, make SEND_OVER_NETWORK asynchronous, and return only when the whole partition has been processed. That is better but still suboptimal.
   You can increase the number of partitions, but that means higher bookkeeping overhead, so at the end of the day you may make the situation worse, not better.
3. The Spark API is designed mostly for side-effect-free operations. This makes it hard to express operations which don't fit into this model.
4. What is arguably more important is that Spark guarantees only that each operation is executed at least once (let's ignore zero times if the RDD is never materialized). If the application requires, for example, exactly-once semantics, things become tricky, especially when you consider point 2.
   It is possible to keep track of local state for each partition outside the main Spark logic, but if you get there, it is a really good sign that Spark is not the right tool.
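A sketch of the asynchronous mapPartitions variant mentioned above, under stated assumptions: send_over_network is a hypothetical stand-in for the real sender, and a thread pool is just one possible way to make the sends concurrent. The function is written as a generator so it could be passed to mapPartitions, and it is exercised here on a plain iterator.

```python
from concurrent.futures import ThreadPoolExecutor


def send_over_network(message):
    # Hypothetical stand-in for the real sender; it only reports success.
    return True


def send_partition(messages, max_in_flight=8):
    """Send all messages of one partition concurrently, then report a count."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() submits every message up front; consuming the results
        # blocks until each send has finished.
        outcomes = list(pool.map(send_over_network, messages))
    yield sum(1 for ok in outcomes if ok)


# With Spark: sent_per_partition = messages_rdd.mapPartitions(send_partition)
sent = list(send_partition(iter(["m1", "m2", "m3"])))
print(sent)
```

Because Spark only guarantees at-least-once execution, a retried task would call send_partition again and resend every message in that partition, so the receiver would need to tolerate duplicates.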
