Spark tasks blockes randomly on standalone cluster - apache-spark

We are having a quite complex application that runs on Spark Standalone.
In some cases the tasks from one of the workers blocks randomly for an infinite amount of time in the RUNNING state.
Extra info:
there aren't any errors in the logs
ran with logger in debug and i didn't saw any relevant messages (i see when the tasks starts but then there is not activity for it)
the jobs are working ok if i have just only 1 worker
the same job may execute the second time without any issues, in a proper amount of time
i don't have any really big partitions that could cause delays for some of the tasks.
in spark 2.0 i've moved from RDD to Datasets and i have the same issue
in spark 1.4 i was able to overcome the issue by turning on speculation, but in spark 2.0 the blocking tasks are from different workers (while in 1.4 i have blocking tasks on only 1 worker) so speculation isn't fixing my issue.
i have the issue on more environments so i don't think it's hardware related.
Did anyone experienced something similar? Any suggestions on how could i identify the issue?
Thanks a lot!
Later Edit: I think i'm facing the same issue described here: Spark Indefinite Waiting with "Asked to send map output locations for shuffle" and here: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-td6067.html but both are without a working solution.
The last thing in the log repeated infinitely is: [dispatcher-event-loop-18] DEBUG org.apache.spark.scheduler.TaskSchedulerImpl - parentName: , name: TaskSet_2, runningTasks: 6

The issue was fixed for me by allocating just one core per executor. If I have executors with more then 1 core the issue appears again. I didn't yet understood why is this happening but for the ones having similar issue they can try this.

Related

Unexplainable delay in a few executors' task execution time (logs attached)

We are facing a rather unexplainable behaviour in Spark.
Some facts:
The spark streaming is running for hours without any issues.
All of a sudden, a particular section of the code starts to take longer
(data size has not increased) When we look into the execution, we
noticed that the delay is due to a few executors where the processing
takes multipleĀ folds longer than on all the others (the data per task
is the same, with no GC increase according to Spark UI)
See the logs below. If we compare a 'normal' executor log with a 'stuck' executor
log we can see that two log lines take a minute longer than on a
normal one
A restart usually solves the issue for some hours, and then it starts occuring again
Version PySpark 2.4.4.Spark Streaming.
We are really lost, and can't figure out what's going on. Does anyone have any suggestions?
Log example:
'Normal'
'stuck':

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
Dataset<Row> aHugeDf = spark.read.json(hundredsOfPaths);
List<Future<Void>> futures = listOfSeveralHundredThings
.stream()
.map(aThing -> ecs.submit(() -> {
df
.filter(col("some_column").equalTo(aThing))
.write()
.format("org.apache.hudi")
.options(writeOptions)
.save(outputPathFor(aThing));
return null;
}))
.collect(Collectors.toList());
IntStream.range(0, futures.size()).forEach(i -> ecs.poll(30, TimeUnit.MINUTES));
exec.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job ID's
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible write the output of the jobs to AWS Elastic MapReduce hdfs (to leverage on the almost instantaneous renames and better file IO of local hdfs) and add a dstcp step to move the files to S3, to save yourself all the troubles of handling the innards of an object store trying to be a filesystem. Also writing to local hdfs will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommiter.
If you must use S3 as the output directory ensure that the following Spark configurations are set
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter is removed from Spark 2.0 due to the chance of data loss. Unfortunately until we have improved consistency from S3a we have to work with the workarounds. Things are improving with Hadoop 2.8
Avoid keynames in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around.The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names due to DNS issues.
Enabling fs.s3a.fast.upload upload parts of a single file to Amazon S3 in parallel
Refer these articles for more detail-
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/#subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're likely not going to get much performance improvement by parallelizing 100s of jobs at once. Your cluster can only support 300 tasks at once, assuming you're using the default parallelism of 200 thats only 1.5 jobs. I'd suggest rewriting your code to cap max concurrent queries at 10. I highly suspect that you have 300 queries with only a single task of several hundred actually running. Most OLTP data processing system intentionally have a fairly low level of concurrent queries compared to more traditional RDS systems for this reason.
also
Apache Hudi has a default parallelism of several hundred FYI.
Why don't you just partition based on your filter column?
I would start by eliminating possible causes. Are you sure its spark that is not able to submit many jobs? Is it spark or is it YARN? If it is the later, you might need to play with the YARN scheduler settings. Could it be something to do with ExecutorService implementation that may have some limitation for the scale you are trying to achieve? Could it be hudi? With the snippet thats hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics / monitoring on the cluster or any logs that point to the problem as you say it?
If it is to do with scaling, is is possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools are always problematic and error prone.
I had similar problem in processing spark jobs in one of Internet of things application. I resolved using fair scheduling.
Suggestions :
Use fair scheduling (fairscheduler.xml) instead of yarn capacity scheduler
how to ? see this by using dedicated resource pools one per module. when used it will look like below spark ui
See that unit of parllelism (number of partitions ) are correct for data frames you use by seeing spark admin ui. This is spark native way of using parllelism.

Apache Spark Executors Dead - is this the expected behaviour?

I am running a pipeline to process my data on Spark. It seems like my Executors die every now and then when they reach near the Storage Memory limit. The job continues and eventually finishes but is this the normal behaviour? Is there something I should be doing to prevent this from happening? Every time this happens the job hangs for some time until (and I am guessing here) YARN provides some new executors for the job to continue.
I think this turned out to be related with a Yarn bug. It doesn't happen anymore after I set the following YARN options like suggested in section 4. of this blog post
Best practice 5: Always set the virtual and physical memory check flag
to false.
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"

Spark Streaming - Jobs run concurrently with default spark.streaming.concurrentJobs setting

I have come across a wierd behaviour in Spark Streaming Job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job was running for more than a day properly with the batch interval being set to 10 Minutes.
Suddenly that same job has started running concurrently for all the batches that come in without putting them in Queue.
Has anyone faced this before?
This would be of great help!
This kind of behavior seems to be curious but I believe seems to happen when there is only 1 job running at a time and if batch processing time < batch interval, then the system seems to be stable.
Spark Streaming creator Tathagata hs mentioned about this: How jobs are assigned to executors in Spark Streaming?.

Spark Performance issue while adding more worker nodes

I am being new on Spark. I am facing performance issue when the number of worker nodes are increased. So to investigate that, I have tried some sample code on spark-shell.
I have created a Amazon AWS EMR with 2 worker nodes (m3.xlarge). I have used the following code on spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 mins. But when I have added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on master node, the time increased to 10 mins.
Here is the issue, I think the time should be decreased, not by half, but I should decrease. I have no idea why on increasing worker node same spark job is taking more time. Any idea why this is happening? Am I missing any thing?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is a heavy one, doing that with 2 nodes will take more time then with one.
I would start by comparing some simpler transformations, running a more asynchronous code, as without any sync points (such as group by key), and see if you get the same issue.
#z-star, yes an algorithm might b slow when distributed. I found the solution by using Spark Dynamic Allocation. This enable spark to use only required executors. While the static allocation runs a job on all executors, which was increasing the execution time with more nodes.

Resources