We are facing rather inexplicable behaviour in Spark.
Some facts:
The Spark Streaming job runs for hours without any issues.
All of a sudden, a particular section of the code starts to take longer (the data size has not increased). When we look into the execution, we notice that the delay is due to a few executors where the processing takes several times longer than on all the others (the data per task is the same, and there is no GC increase according to the Spark UI).
See the logs below. If we compare a 'normal' executor log with a 'stuck' executor log, we can see that two log lines take a minute longer than on a normal executor.
A restart usually solves the issue for some hours, and then it starts occurring again.
Version: PySpark 2.4.4, Spark Streaming.
We are really lost, and can't figure out what's going on. Does anyone have any suggestions?
Log example:
'Normal':
'Stuck':
Related
This might be a very generic question, but I hope someone can offer a hint. I found that sometimes my Spark job seems to hit a "pause" many times:
The nature of the job is: read ORC files (from a Hive table), filter by certain columns, no joins, then write out to another Hive table.
There were 64K tasks in total for my job/stage (FileScan orc, followed by Filter and Project).
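For reference, the job described above is essentially of the following shape (a minimal sketch in spark-shell Scala; the table and column names are made up, since the real ones are not given):

    // Hypothetical table and column names; the real ones are not shown in the question.
    // In spark-shell, spark and $ are already available.
    val filtered = spark.table("source_db.events_orc")               // FileScan orc
      .filter($"event_date" === "2021-01-01" && $"status" === "OK")  // Filter
      .select("user_id", "event_type", "event_ts")                   // Project

    filtered.write.mode("overwrite").saveAsTable("target_db.events_filtered")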
The application has 500 executors, each with 4 cores. Initially, about 2000 tasks were running concurrently, and things looked good.
After a while, I noticed the number of running tasks dropped all the way down to around 100. Many cores/executors were just waiting with nothing to do. (I checked the logs from these waiting executors; there were no errors. All the tasks assigned to them were done, they were just waiting.)
After about 3-5 minutes, these waiting executors suddenly got tasks assigned and were working happily again.
Is there any particular reason this can happen? The application is run from spark-shell (--master yarn --deploy-mode client, with the number of executors/sizes etc. specified).
Thanks!
We have a Spark Structured Streaming stream which is using mapGroupsWithState. After running stably for some time, suddenly each mini-batch starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each time. Before this, the batches were taking less than a second.
Looking at the details for a particular task, most partitions are processed really quickly but a few take exactly 40 seconds:
GC was looking OK as the data was being processed quickly, but suddenly the full GCs etc. stop (at the same time as the 40-second issue):
I have taken a thread dump from one of the executors while this issue is happening, but I cannot see any resource they are blocked on:
Are we hitting a GC problem and why is it manifesting in this way? Is there another resource that is blocking and what is it?
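For context, the pipeline described above is roughly of the following shape (a minimal sketch only; the socket source, the Event/RunningCount types and all names are hypothetical, since the real job's code is not shown):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    // Hypothetical event and state types; the real schema is not shown in the question.
    case class Event(key: String, value: Long)
    case class RunningCount(count: Long)

    val spark = SparkSession.builder.appName("mapGroupsWithState-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical source; the real job presumably reads from Kafka or similar.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .as[String]
      .map(line => Event(line, 1L))

    // Per-key state update: keep a running count per key.
    val counts = events
      .groupByKey(_.key)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
        (key: String, values: Iterator[Event], state: GroupState[RunningCount]) =>
          val previous = state.getOption.map(_.count).getOrElse(0L)
          val updated  = RunningCount(previous + values.size)
          state.update(updated)
          (key, updated.count)
      }

    counts.writeStream
      .outputMode(OutputMode.Update)
      .format("console")
      .start()
      .awaitTermination()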
Try giving more heap space to see if GC is still so overwhelming; if so, you very likely have a memory leak issue.
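As a minimal sketch of that suggestion (the size here is hypothetical and depends on your cluster; executor heap is normally set on the command line via spark-submit --executor-memory):

    import org.apache.spark.sql.SparkSession

    // Hypothetical size: give the executors more heap to rule out GC pressure.
    val spark = SparkSession.builder()
      .appName("stateful-stream")
      .config("spark.executor.memory", "8g")  // more heap per executor
      .getOrCreate()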
What Spark version were you using? If it's Spark 2.3.1, there was a known FD leakage issue when reading data from Kafka (which is extremely common). To figure out whether your job is leaking FDs, take a look at the FD usage of the container process on the slave; usually it should stay fairly consistently around 100 to 200. Simply upgrading to Spark 2.3.2 will fix this issue. I'm surprised that such a fundamental issue never got enough visibility.
I have come across weird behaviour in a Spark Streaming job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job had been running properly for more than a day with the batch interval set to 10 minutes.
Suddenly, the same job has started running all incoming batches concurrently instead of putting them in the queue.
Has anyone faced this before?
This would be of great help!
This kind of behavior seems curious, but I believe it comes down to this: when there is only 1 job running at a time and the batch processing time < batch interval, the system is stable.
Spark Streaming creator Tathagata has commented on this: How jobs are assigned to executors in Spark Streaming?
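For reference, a minimal sketch of the configuration being discussed (the default spark.streaming.concurrentJobs = 1 with a 10-minute batch interval, as in the question; the source and output here are hypothetical placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-job")
      // Default value: with 1, batches are queued and run one at a time
      // as long as processing time < batch interval.
      .set("spark.streaming.concurrentJobs", "1")

    val ssc = new StreamingContext(conf, Minutes(10))  // 10-minute batch interval

    // Hypothetical source and output, just to make the sketch complete.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()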
We have a quite complex application that runs on Spark Standalone.
In some cases the tasks from one of the workers block randomly for an infinite amount of time in the RUNNING state.
Extra info:
there aren't any errors in the logs
ran with the logger in debug and I didn't see any relevant messages (I see when the task starts, but then there is no activity for it)
the jobs work OK if I have only 1 worker
the same job may execute a second time without any issues, in a proper amount of time
I don't have any really big partitions that could cause delays for some of the tasks.
in Spark 2.0 I've moved from RDDs to Datasets and I have the same issue
in Spark 1.4 I was able to overcome the issue by turning on speculation, but in Spark 2.0 the blocked tasks are from different workers (while in 1.4 I had blocked tasks on only 1 worker), so speculation isn't fixing my issue.
I have the issue in multiple environments, so I don't think it's hardware-related.
Has anyone experienced something similar? Any suggestions on how I could identify the issue?
Thanks a lot!
Later edit: I think I'm facing the same issue described here: Spark Indefinite Waiting with "Asked to send map output locations for shuffle" and here: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-td6067.html, but neither has a working solution.
The last thing in the log repeated infinitely is: [dispatcher-event-loop-18] DEBUG org.apache.spark.scheduler.TaskSchedulerImpl - parentName: , name: TaskSet_2, runningTasks: 6
The issue was fixed for me by allocating just one core per executor. If I have executors with more than 1 core, the issue appears again. I don't yet understand why this is happening, but anyone having a similar issue can try this.
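A minimal sketch of that workaround (the master URL and memory size are hypothetical; the same limit can also be passed as spark-submit --executor-cores 1):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // Spark Standalone master (hypothetical)
      .setAppName("complex-app")
      .set("spark.executor.cores", "1")       // one core per executor, the workaround above
      .set("spark.executor.memory", "4g")     // hypothetical size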
When I run a Spark Streaming application, the processing time shows strange behavior, even when there is no incoming data. Processing times are not near zero, and steadily increase until they reach the batch interval value of 10 seconds. They then suddenly drop to a minimum.
Is there an explanation for this strange behavior? I am aware of this question, but I am not using Mesos; I am using YARN. I have seen similar behavior multiple times with multiple applications.