When I run a Spark Streaming application, the processing time shows strange behavior, even when there is no incoming data. Processing times are not near zero; they steadily increase until they reach the batch interval value of 10 seconds, and then suddenly drop to a minimum.
Is there an explanation for this strange behavior? I am aware of this question, but I am using YARN, not Mesos. I have seen similar behavior multiple times with multiple applications.
We are facing a rather unexplainable behaviour in Spark.
Some facts:
The Spark Streaming job runs for hours without any issues.
All of a sudden, a particular section of the code starts to take longer, even though the data size has not increased.
Looking into the execution, we noticed that the delay is caused by a few executors where processing takes several times longer than on all the others (the data per task is the same, and there is no GC increase according to the Spark UI).
See the logs below: comparing a 'normal' executor log with a 'stuck' executor log, we can see that two log lines take a minute longer than on a normal one.
A restart usually solves the issue for a few hours, and then it starts occurring again.
Version: PySpark 2.4.4, Spark Streaming.
We are really lost, and can't figure out what's going on. Does anyone have any suggestions?
Log example: the 'normal' and 'stuck' executor logs referred to above are not reproduced here.
We have a Spark Structured Streaming job which is using mapGroupsWithState. After some time of processing in a stable manner, suddenly each mini batch starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each time. Before this the batches were taking less than a second.
Looking at the details for a particular task, most partitions are processed really quickly, but a few take exactly 40 seconds.
GC looked fine while the data was being processed quickly, but then the full GCs etc. stop (at the same time as the 40-second issue appears).
I have taken a thread dump from one of the executors while this issue is happening, but I cannot see any resource the threads are blocked on.
Are we hitting a GC problem and why is it manifesting in this way? Is there another resource that is blocking and what is it?
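For reference, the stateful step is of this general shape; this is only a minimal sketch, and the Event and SessionState types, field names, and the timeout value are illustrative rather than our actual job:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._ // spark is the SparkSession

// Illustrative types only; the real job uses its own case classes.
case class Event(userId: String, value: Long)
case class SessionState(count: Long)

val updated = events // events: Dataset[Event]
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (userId: String, rows: Iterator[Event], state: GroupState[SessionState]) =>
      val previous = state.getOption.getOrElse(SessionState(0L))
      val next = SessionState(previous.count + rows.size)
      state.update(next)
      state.setTimeoutDuration("30 minutes") // state expiry; arbitrary value here
      (userId, next.count)
  }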
Try giving more heap space to see if GC is still so overwhelming; if so, you very likely have a memory leak issue.
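For example, one way to test this (shown with the Scala API; the memory value is just a placeholder, not a recommendation) is to raise the executor heap and turn on GC logging in the SparkConf used to create the context:

import org.apache.spark.SparkConf

// Larger executor heap plus GC logging, to see whether GC pressure is really the cause.
// 8g is an arbitrary example value; the GC details end up in the executor stderr logs.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCDateStamps")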
What Spark version are you using? If it's Spark 2.3.1, there was a known FD (file descriptor) leak issue when reading data from Kafka (which is extremely common). To figure out whether your job is leaking FDs, take a look at the FD usage of the container process on the worker node; it should usually stay very consistently around 100 to 200. Simply upgrading to Spark 2.3.2 will fix this issue. I'm surprised that such a fundamental issue never got enough visibility.
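If it helps, here is a rough way to sample the FD count inside the executor JVMs themselves; this is only a sketch, assuming Linux executors where /proc/self/fd is readable and that sc is your SparkContext:

import java.io.File
import java.net.InetAddress

// Run a tiny job whose tasks count the open file descriptors of their own executor JVM.
// A leaking job shows this number growing batch after batch instead of staying ~100-200.
val fdCounts = sc
  .parallelize(1 to sc.defaultParallelism * 4, sc.defaultParallelism * 4)
  .map { _ =>
    val host = InetAddress.getLocalHost.getHostName
    val fds = Option(new File("/proc/self/fd").list()).map(_.length).getOrElse(-1)
    (host, fds)
  }
  .collect()
  .distinct
  .sorted

fdCounts.foreach { case (host, fds) => println(s"$host: $fds open file descriptors") }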
I have come across a weird behaviour in a Spark Streaming job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job was running for more than a day properly with the batch interval being set to 10 Minutes.
Suddenly, that same job started running all incoming batches concurrently instead of putting them in a queue.
Has anyone faced this before?
This would be of great help!
This kind of behavior is curious, but as I understand it, with the default setting only one job runs at a time, and as long as the batch processing time stays below the batch interval, the system remains stable.
Spark Streaming creator Tathagata has commented on this: How jobs are assigned to executors in Spark Streaming?
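For reference, this is the setting in question, shown here as a sketch with the Scala API; spark.streaming.concurrentJobs is an undocumented knob, so raise it with care:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// With the default of 1, batch jobs are queued and run one at a time;
// a new batch job only starts once the previous one has finished.
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Minutes(10)) // 10-minute batch interval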
Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these graphs.
This happens even though the program could keep up, and it slows down execution considerably. I believe the drop happens when there is not much unprocessed data left, but because the rate is so low, these final records take up most of the time needed to run the job. Is there any way to avoid this and speed things up?
I am using PySpark with Spark 1.6.2 and using the direct approach for Kafka streaming. Backpressure is turned on and there is a maxRatePerPartition of 100.
Setting backpressure is more meaningful for older Spark Streaming versions, where you need receivers to consume the messages from a stream. Since Spark 1.3 there is the receiver-less “direct” approach, which ensures stronger end-to-end guarantees, so you do not need to worry about backpressure as much; Spark does most of the fine-tuning.
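For context, these are the settings the question refers to, sketched with the Scala API of Spark 1.6's direct Kafka stream; the config keys are identical from PySpark, and the broker and topic names below are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-less "direct" Kafka stream with backpressure and a per-partition rate cap.
val conf = new SparkConf()
  .setAppName("direct-kafka-stream")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "100") // max records/sec/partition

val ssc = new StreamingContext(conf, Seconds(10))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"), // placeholder broker
  Set("events")                                  // placeholder topic
)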
I use Spark with the Cassandra Spark connector and direct Kafka streaming.
I see batch processing time increasing slowly over time, even when there is nothing incoming from Kafka to process.
It is only a few milliseconds per batch, but after a long time a batch can take several seconds more, until it reaches the batch interval and the application finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than linearly.
I don't really know whether it is the stages that become longer and longer, or the latency between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
I took a closer look at the evolution of the processing time of each batch, comparing it with the total job processing time.
It appears that even though the batch processing time increases, the job processing times are not increasing.
Example: for a batch that takes 7 s, the sum of the job processing times is 1.5 s (as shown in the image below).
Is it because the computing time on the driver side increases, and not the computing time on the executor side?
And is this driver computing time not shown in the job processing UI?
If that is the case, how can I correct it?
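One rough way I could check this (only a sketch: dstream and the foreachRDD body stand in for the job's real per-batch driver-side setup) would be to time the driver-side code that runs for every batch and compare it with the job durations shown in the UI:

// Time the code that runs on the driver for every batch; the Spark UI's job
// durations only cover the executor-side work triggered by RDD actions.
dstream.foreachRDD { rdd =>
  val start = System.nanoTime()
  // ... per-batch driver-side setup goes here (building filters, registrations, ...)
  val driverMillis = (System.nanoTime() - start) / 1000000
  println(s"driver-side setup for this batch took $driverMillis ms")
  // ... RDD actions (the executor-side work) go here
}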
I finally found the solution to my problem.
I had this code in the function that adds the filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this is called at each batch, the same converter objects end up registered inside TypeConverter many times over.
I don't know exactly how the Cassandra Spark converter works internally, but it looks like it uses reflection on these objects.
Doing that slow reflection over an ever-growing set of registered converters makes the processing time of each batch increase.
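In other words, the fix was to register the converters exactly once, at application start-up, instead of inside the per-batch code. A sketch of how that can look, reusing the converter names from the snippet above:

import com.datastax.spark.connector.types.TypeConverter

// Register the custom converters a single time for the whole application,
// not once per batch, so TypeConverter does not accumulate duplicate entries.
object ConverterRegistration {
  @volatile private var registered = false

  def ensureRegistered(): Unit = synchronized {
    if (!registered) {
      TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
      TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
      registered = true
    }
  }
}

// Call once on the driver before starting the StreamingContext,
// rather than from the function that runs for every batch.
ConverterRegistration.ensureRegistered()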