Spark Kafka 0-10 consumer (DirectStream) hangs - apache-spark

It seems that we are having the same issue as described here:
https://issues.apache.org/jira/browse/SPARK-20780
I am already aware that it is a Kafka issue rather than a Spark one, but I would still like some advice on how to act until it is resolved by the Kafka community.
Increasing request.timeout.ms does not help much, because it can produce a large queue: if the micro-batch interval is 10 seconds and the Kafka request.timeout.ms is 20 seconds, each time the issue occurs it adds a delay of around 20 seconds, which means roughly 2 micro-batches queuing up. Obviously, the more often it happens, the more delay it causes, which eventually leads to quite a large queue.
Any best practices / workarounds / tips for how to work around this issue until it is resolved?
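For reference, a minimal PySpark Structured Streaming sketch (not from the original post; broker, topic, and timeout values are placeholders) showing where the Kafka consumer timeouts are passed. Structured Streaming forwards any option prefixed with kafka. to the consumer, and the DirectStream API takes the same keys in its kafkaParams map:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('kafka-timeout-sketch').getOrCreate()

stream = (spark.readStream
          .format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9092')  # placeholder
          .option('subscribe', 'my_topic')                     # placeholder
          .option('kafka.request.timeout.ms', '20000')         # the timeout discussed above
          .option('kafka.session.timeout.ms', '10000')         # commonly kept below request.timeout.ms
          .load())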

Related

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribe to 7 topics with spark.readStream in a single running Spark app.
After transforming the event payloads, we save them with spark.writeStream to our database.
For one of the topics, the data is inserted only batch-wise (once a day) with a very high load. This delays our reading from all other topics, too. For example (Grafana), the delay between a produced and a consumed record stays below 1 minute across all topics for most of the day. But when the bulk topic receives its events, our delay increases to up to 2 hours on all (!) topics.
How can we solve this? We already tried 2 successive readStreams (the bulk topic separately), but it didn't help.
Further info: we use 6 executors with 2 executor cores. The topics have different numbers of partitions (3 to 30). Structured Streaming Kafka Integration v0.10.0.
General question: how can we scale the consumers in Spark Structured Streaming? Is 1 readStream equal to 1 consumer, or to 1 executor, or to something else?
Partitions are the main source of parallelism in Kafka, so I suggest you increase the number of partitions (at least for the topic that has performance issues). You may also tweak some of the consumer caching options mentioned in the docs. Try to keep the number of partitions at a power of two. Finally, you may increase the size of the driver machine if possible.
I'm not completely sure, but I think Spark will try to keep the same number of consumers as the number of partitions per topic. I also think that the stream is actually always fetched by the Spark driver (not by the workers).
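A hedged sketch of the kind of knobs meant above; the exact option names are version-dependent assumptions (spark.sql.kafkaConsumer.cache.capacity is the Spark 2.x name, renamed to spark.kafka.consumer.cache.capacity in 3.x, and minPartitions requires Spark 2.4+), and the numeric values are placeholders:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('kafka-scaling-sketch')
         # size of the cached Kafka consumer pool per executor (Spark 2.x name; assumption, check your version)
         .config('spark.sql.kafkaConsumer.cache.capacity', '128')
         .getOrCreate())

stream = (spark.readStream
          .format('kafka')
          .option('kafka.bootstrap.servers', 'broker1:9092')  # placeholder
          .option('subscribe', 'topic_a,topic_b')              # placeholder
          # ask Spark to split the Kafka partitions into at least this many input tasks (Spark 2.4+)
          .option('minPartitions', '60')
          .load())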
We found a solution to our problem:
After the change, our Grafana shows that the bulk topic still peaks, but without blocking consumption on the other topics.
What we did:
We still have 1 Spark app. We used 2 separate spark.readStreams, but also added a separate sink for each.
In code:
# One query per stream: the priority topics and the bulk topic each get their own
# readStream source and their own writeStream sink (connection options elided as in the original).
priority_topic_stream = (spark.readStream.format('kafka')
                         .options(..).option('subscribe', ','.join([T1, T2, T3])).load())
bulk_topic_stream = (spark.readStream.format('kafka')
                     .options(..).option('subscribe', BULK_TOPIC).load())

priority_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
bulk_topic_stream.writeStream.foreachBatch(..).trigger(..).start()

# Block until any of the running queries terminates.
spark.streams.awaitAnyTermination()
To minimize the peak on the bulk stream, we will try increasing its number of partitions, as advised by #partlov. But that would only have sped up consumption on the bulk stream, not resolved the issue of it blocking our reads from the priority topics.
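If the peak itself also needs flattening, one more Kafka-source knob worth trying (a hedged addition, not part of the original answer) is maxOffsetsPerTrigger, which caps how many offsets a single micro-batch reads, so the daily dump is spread over several smaller batches. The broker, topic name, and limit below are placeholders:
bulk_topic_stream = (spark.readStream
                     .format('kafka')
                     .option('kafka.bootstrap.servers', 'broker1:9092')  # placeholder
                     .option('subscribe', 'bulk_topic')                   # placeholder
                     .option('maxOffsetsPerTrigger', '500000')            # cap per micro-batch
                     .load())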

What problems will occur when events suddenly increase to a very high rate, e.g. 10 times, in Spark Streaming?

Someone actually asked this question in an interview, and I want to know what answer or what type of explanation would satisfy the interviewer. The question was: how would you handle a Spark Streaming application that is processing 10 million records if the number of records suddenly increases threefold? What changes would you need to make in your Spark application?
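The question is left unanswered above; for illustration only, here is a hedged sketch of the rate-limiting / backpressure settings that are usually the first line of defence, so that a spike queues up in Kafka instead of overwhelming the job (the config keys are standard Spark settings; the numbers are placeholders):
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName('spike-tolerant-streaming')
        # let Spark adapt the ingestion rate to the observed processing rate (DStream API)
        .set('spark.streaming.backpressure.enabled', 'true')
        # hard cap on records pulled per Kafka partition per second (DStream API)
        .set('spark.streaming.kafka.maxRatePerPartition', '10000'))
# For Structured Streaming, the equivalent cap is the Kafka source option maxOffsetsPerTrigger.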

Spark Structured Streaming resource contention / memory issue

We have a Spark Structured Streaming query which uses mapGroupsWithState. After some time of stable processing, each mini-batch suddenly starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each time. Before this, the batches were taking less than a second.
Looking at the details for a particular task, most partitions are processed really quickly, but a few take exactly 40 seconds.
GC was looking OK while the data was being processed quickly, but the full GCs etc. suddenly stop (at the same time as the 40-second issue appears).
I have taken a thread dump from one of the executors while the issue is happening, but I cannot see any resource the threads are blocked on.
Are we hitting a GC problem, and why is it manifesting in this way? Or is another resource blocking, and if so, which one?
Try giving it more heap space to see whether GC is still so overwhelming; if so, you very likely have a memory leak.
What Spark version are you using? If it is Spark 2.3.1, there was a known file-descriptor (FD) leak when reading data from Kafka (which is extremely common). To figure out whether your job is leaking FDs, look at the FD usage of the container process on the worker: it should usually stay very consistently around 100 to 200. Simply upgrading to Spark 2.3.2 will fix this issue. I'm surprised that such a fundamental issue never got enough visibility.
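A small helper for the FD check described above (a hedged sketch, not from the original answer; Linux-only, and the pid is a placeholder to be replaced with the executor JVM's process id on the worker host):
import os

def open_fd_count(pid):
    # Count the open file descriptors of `pid` by listing /proc/<pid>/fd (Linux only).
    return len(os.listdir('/proc/{}/fd'.format(pid)))

# A count that climbs steadily while the job reads from Kafka suggests the 2.3.1 leak.
print(open_fd_count(12345))  # placeholder pid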

Spark tasks block randomly on standalone cluster

We have a quite complex application that runs on Spark Standalone.
In some cases the tasks from one of the workers block randomly for an infinite amount of time in the RUNNING state.
Extra info:
there aren't any errors in the logs
I ran with the logger in debug and didn't see any relevant messages (I see when the task starts, but then there is no activity for it)
the jobs work OK if I have only 1 worker
the same job may execute a second time without any issues, in a reasonable amount of time
I don't have any really big partitions that could cause delays for some of the tasks
in Spark 2.0 I moved from RDDs to Datasets and I have the same issue
in Spark 1.4 I was able to overcome the issue by turning on speculation, but in Spark 2.0 the blocking tasks come from different workers (while in 1.4 I had blocking tasks on only 1 worker), so speculation isn't fixing my issue
I have the issue in multiple environments, so I don't think it's hardware-related
Has anyone experienced something similar? Any suggestions on how I could identify the issue?
Thanks a lot!
Later edit: I think I'm facing the same issue described here: Spark Indefinite Waiting with "Asked to send map output locations for shuffle" and here: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-td6067.html, but neither has a working solution.
The last thing in the log, repeated infinitely, is: [dispatcher-event-loop-18] DEBUG org.apache.spark.scheduler.TaskSchedulerImpl - parentName: , name: TaskSet_2, runningTasks: 6
The issue was fixed for me by allocating just one core per executor. If I have executors with more than 1 core, the issue appears again. I haven't yet understood why this happens, but those with a similar issue can try this.
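A hedged sketch of that workaround in config form (spark.executor.cores and spark.cores.max are standard settings for standalone mode; the totals below are placeholders, not from the original answer):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('single-core-executors')
         .config('spark.executor.cores', '1')  # one task slot per executor
         .config('spark.cores.max', '6')       # placeholder: total cores for the app on standalone
         .getOrCreate())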

Spark Streaming processing time "sawtooth"

When I run a Spark Streaming application, the processing time shows strange behavior, even when there is no incoming data. Processing times are not near zero; they steadily increase until they reach the batch interval value of 10 seconds, and then suddenly drop to a minimum.
Is there an explanation for this strange behavior? I am aware of this question, but I am using YARN, not Mesos. I have seen similar behavior multiple times with multiple applications.
