Heartbeat, poll interval and session timeout for Spark Streaming with Kafka - apache-spark

Using Spark Streaming with Kafka - Direct Approach - Doc
Spark version - 2.3.2
Spark Streaming version - spark-streaming-kafka-0-10_2.11
Problem: I need to run the streaming application with a batch interval of 10 minutes, but the default timeouts are much shorter than 10 minutes. How should the following parameters be configured:
heartbeat.interval.ms
session.timeout.ms
group.max.session.timeout.ms
group.min.session.timeout.ms
max.poll.interval.ms
Given that the batch interval is 10 minutes.
Also, does setting these to certain values affect all consumer groups (existing ones and any added in the future)?
If yes, how can these parameters be configured for a specific consumer group only?
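A minimal sketch of how these might be wired up for a 10-minute batch interval with the 0-10 direct stream; the broker address, group id, topic name and the exact timeout values are placeholders, not tested recommendations. The consumer-side settings live in kafkaParams and therefore only affect this application's consumer group, whereas group.min.session.timeout.ms and group.max.session.timeout.ms are broker configs (server.properties) that apply cluster-wide, so for batches beyond 5 minutes the broker's group.max.session.timeout.ms would also need to be raised.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("ten-minute-batches")
val ssc  = new StreamingContext(conf, Minutes(10))          // 10-minute batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers"     -> "broker1:9092",                // placeholder
  "key.deserializer"      -> classOf[StringDeserializer],
  "value.deserializer"    -> classOf[StringDeserializer],
  "group.id"              -> "my-group",                    // placeholder
  // illustrative values: heartbeat well below the session timeout,
  // session and poll timeouts comfortably above the 10-minute batch
  "heartbeat.interval.ms" -> (60000: java.lang.Integer),
  "session.timeout.ms"    -> (660000: java.lang.Integer),
  "max.poll.interval.ms"  -> (720000: java.lang.Integer)
)

// session.timeout.ms must fall between the broker-side
// group.min.session.timeout.ms and group.max.session.timeout.ms,
// which are cluster-wide settings changed in server.properties.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))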

Related

Parameter to control the handshake interval between Kafka and spark

While the Kafka brokers are up and running, the Spark process running in cluster mode is able to read messages from the Kafka topic. But when the brokers were shut down intentionally, the Spark consumer stayed in RUNNING status.
Is there any parameter to control the handshake interval between the Spark consumer and the ZooKeeper process, so that the Spark process can fail if the brokers are not reachable? Or is there any alternative way to fail the consumer? Please suggest.
No, there is not.
Kafka and Spark Structured Streaming (SSS) are loosely coupled; given high-availability scenarios, failures, etc., SSS just waits and will process the rebalanced topics when Kafka rebalances the load.
The whole premise is that Kafka will do something to alleviate the situation if a broker goes down. Even if there are zero brokers after a while, SSS will wait, as you have noted already. It of course knows nothing about the cause; it just waits.
As long as the topics still exist, and "fail on data loss" is not set to true in case a topic is deleted, the SSS apps will go on.
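For Structured Streaming specifically, the "fail on data loss" behaviour mentioned above is the failOnDataLoss source option. A minimal sketch (broker and topic are placeholders) of leaving the query running when offsets go missing, e.g. after a topic is deleted:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sss-kafka").getOrCreate()

// With failOnDataLoss set to false the query keeps running when offsets
// are no longer available; the default (true) fails the query instead.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "my-topic")                     // placeholder
  .option("failOnDataLoss", "false")
  .load()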

Kafka consumer request timeout

I have a Spark streaming (Scala) application running in CDH 5.13 consuming messages from Kafka using client 0.10.0. My Kafka cluster contains 3 brokers. Kafka topic is divided into 12 partitions evenly distributed between these 3 brokers. My Spark streaming consumer has 12 executors with 1 core each.
Spark streaming starts reading millions of messages from Kafka in each batch, but reduces the number to thousands because Spark cannot cope with the load and a queue of unprocessed batches builds up. That is fine, and my expectation is that Spark processes the small batches very quickly and returns to normal; however, I see that from time to time one of the executors that processes only a few hundred messages gets a 'request timeout' error just after reading the last offset from Kafka:
DEBUG org.apache.kafka.clients.NetworkClient Disconnecting from node 12345 due to request timeout
After this error, the executor sends several RPC requests to the driver that take ~40 seconds, and after that time the executor reconnects to the same broker from which it disconnected.
My question is: how can I prevent this request timeout, and what is the best way to find its root cause?
Thank you
The root cause of the disconnection was that the response to the data request arrived from Kafka too late, i.e. after the request.timeout.ms interval, which was set to the default of 40000 ms. The disconnection problem was fixed when I increased this value.
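A sketch of where that change goes with the 0-10 direct stream: request.timeout.ms is just another entry in kafkaParams (the value below is illustrative, not a tuned recommendation):

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",                  // placeholder
  "group.id"           -> "my-group",                      // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "request.timeout.ms" -> (120000: java.lang.Integer)      // raised from the 40000 ms default
)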

Spark streaming task shutdown gracefully when Kafka client sends messages asynchronously

I am building a Spark streaming application that reads input messages from a Kafka topic, transforms them, and outputs the result messages into another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Can setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or elsewhere that allows fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit offsets, you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that is only with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then only commit after you've processed and written the data to its destination, as in the sketch below.
If you're just moving data between Kafka topics, Kafka Streams is the Kafka library included for exactly that, and it doesn't require YARN or a cluster scheduler.
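A sketch of that manual offset management, assuming a stream created with KafkaUtils.createDirectStream (0-10 integration) and enable.auto.commit set to false in kafkaParams; offsets are committed back to Kafka only after the batch has been written out:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // capture the offset ranges of this batch before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd and write the results to the output topic / sink here ...

  // commit only after the writes have succeeded, so a restart replays
  // at most the current batch instead of silently losing it
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}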

Achieving concurrency through FAIR scheduling in Spark

My Environment:
I'm trying to connect to Cassandra through the Spark Thrift Server. I create a meta-table in the Hive metastore which holds the Cassandra table data. In a web application I connect to this meta-table through the JDBC driver. I have enabled fair scheduling for the Spark Thrift Server.
Issue:
When I perform a concurrency load test through JMeter with 100 users for a duration of 300 seconds, I get sub-second response times for the initial requests (say, the first 30 seconds). Then the response time gradually increases (to around 2 to 3 seconds). When I check the Spark UI, all the jobs execute in less than 100 milliseconds. I also notice that jobs and tasks are in a pending state when requests are received. So I assume that even though the tasks take sub-seconds to execute, they are submitted with a latency by the scheduler. How can I fix this latency in job submission?
Following are my configuration details,
Number of Workers - 2
Number of Executors per Worker - 1
Number of cores per Executor - 14
Total core of workers - 30
Memory per Executor - 20 GB
Total Memory of worker - 106 GB
Configuration in Fair Schedule XML
<pool name="default">
<schedulingMode>FAIR</schedulingMode>
<weight>2</weight>
<minShare>15</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>2</weight>
<minShare>3</minShare>
</pool>
I'm executing in Spark Standalone mode.
Is it not simply the case that queries are pending in the queue while others are running? Try reducing spark.locality.wait to, say, 1s.
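A sketch of where such a setting would go; for the Spark Thrift Server these are typically passed as --conf options or through spark-defaults.conf rather than in application code, and the allocation file path below is a placeholder:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("thrift-fair-scheduling")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // the pool XML above
  .set("spark.locality.wait", "1s")  // don't wait long for a data-local slot before scheduling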

Spark Streaming from Apache Kafka

I came across the following
For possible kafkaParams, see Kafka consumer config docs. If your
Spark batch duration is larger than the default Kafka heartbeat
session timeout (30 seconds), increase heartbeat.interval.ms and
session.timeout.ms appropriately. For batches larger than 5 minutes,
this will require changing group.max.session.timeout.ms on the broker
on this link
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Does this apply if I have the below property set in my Spark conf?
conf.set("spark.streaming.kafka.consumer.poll.ms", "5000")
Also, what is the rationale behind setting heartbeat.interval.ms and session.timeout.ms larger than the Spark streaming batch duration? Won't heartbeats to Kafka piggyback on the consumer poll requests?
Also, I was running the Spark streaming application and Kafka on my local machine. My batch size was 1 minute, and my Kafka configuration was as follows:
heartbeat.interval.ms = 3000
session.timeout.ms = 30000
However, I did not really see any problems when running with a batch duration of 1 minute and the above values for heartbeat interval and session timeout. Am I missing something here?
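For what it's worth, the two kinds of settings live in different places: spark.streaming.kafka.consumer.poll.ms is a Spark configuration used when the cached consumers poll on the executors, while heartbeat.interval.ms and session.timeout.ms are Kafka consumer properties passed through kafkaParams. A sketch (values copied from above, purely illustrative):

import org.apache.spark.SparkConf

// Spark-side: how long the executor-side consumer waits when polling Kafka
val conf = new SparkConf()
  .setAppName("kafka-timeouts")
  .set("spark.streaming.kafka.consumer.poll.ms", "5000")

// Kafka-side: group membership timeouts, passed via kafkaParams
val kafkaParams = Map[String, Object](
  "heartbeat.interval.ms" -> (3000: java.lang.Integer),
  "session.timeout.ms"    -> (30000: java.lang.Integer)
)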

Resources