Limit Kafka batch size when using Spark Streaming - apache-spark

Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them.

I think your problem can be solved by Spark Streaming Backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
From Apache Spark Kafka configuration
spark.streaming.backpressure.enabled:
This enables the Spark Streaming to control the receiving rate based
on the current batch scheduling delays and processing times so that
the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.
spark.streaming.backpressure.initialRate:
This is the initial maximum receiving rate at which each receiver will
receive data for the first batch when the backpressure mechanism is
enabled.
This one is useful when your Spark job (or rather your Spark workers) can process, say, 10,000 messages from Kafka, but the Kafka brokers hand your job 100,000 messages.
You may also be interested in spark.streaming.kafka.maxRatePerPartition, and in the research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
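As a rough sketch (app name, rate values, and batch duration below are placeholders, not recommendations), these are ordinary Spark configuration entries set on the SparkConf before the StreamingContext is created:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Placeholder rates -- tune them to what your job can actually process.
conf = (SparkConf()
    .setAppName('kafka-rate-limited')
    .set('spark.streaming.backpressure.enabled', 'true')
    .set('spark.streaming.backpressure.initialRate', '10000')   # cap on the very first batch
    .set('spark.streaming.kafka.maxRatePerPartition', '1000'))  # ceiling per Kafka partition per second

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batch interval (example value)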

Apart from the above answers: batch size is the product of 3 parameters:
batchDuration: the time interval at which streaming data will be divided into batches (in seconds).
spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages per partition per second. Combined with batchDuration, this controls the batch size. You want maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job), and batchDuration to be very small.
Number of partitions in the Kafka topic.
For a better explanation of how this product behaves when backpressure is enabled/disabled, see "Set spark.streaming.kafka.maxRatePerPartition for createDirectStream".
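As a concrete illustration of that product (the numbers below are arbitrary examples):
# Upper bound on records per micro-batch with the direct approach:
#   cap = maxRatePerPartition * number_of_partitions * batchDuration
max_rate_per_partition = 1000  # spark.streaming.kafka.maxRatePerPartition (records/partition/second)
num_partitions = 32            # partitions in the Kafka topic
batch_duration = 10            # seconds
batch_cap = max_rate_per_partition * num_partitions * batch_duration
print(batch_cap)  # 320000 records at most per batch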

Limiting the max batch size will greatly help to control the processing time; however, it increases the processing latency of messages.
By setting the properties below, we can control the batch size:
spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=
You can even have the batch size adjusted dynamically based on processing time, by enabling backpressure:
spark.streaming.backpressure.enabled:true
spark.streaming.backpressure.initialRate:

Related

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribed to 7 topics with spark.readStream in a single running Spark app.
After transforming the event payloads, we save them with spark.writeStream to our database.
For one of the topics, the data is inserted only batch-wise (once a day) with a very high load. This delays our reading from all other topics, too. For example (per Grafana), the delay between a produced and consumed record over all topics stays below 1 minute the whole day. When the bulk topic receives its events, our delay increases up to 2 hours on all (!) topics.
How can we solve this? We already tried 2 successive readStreams (the bulk topic separately), but it didn't help.
Further info: We use 6 executors, 2 executor-cores. The topics have a different number of partitions (3 to 30). Structured Streaming Kafka Integration v0.10.0.
General question: How can we scale the consumers in Spark Structured Streaming? Is 1 readStream equal to 1 consumer? Or 1 executor? Or something else?
Partitions are the main source of parallelism in Kafka, so I suggest you increase the number of partitions (at least for the topic that has performance issues). You may also tweak some of the consumer caching options mentioned in the docs. Try to keep the number of partitions at 2^n. Finally, you may increase the size of the driver machine if possible.
I'm not completely sure, but I think Spark will try to keep the same number of consumers as the number of partitions per topic. I also think the stream is actually always fetched from the Spark driver (not from the workers).
We found a solution for our problem:
After the change, our Grafana shows that the batch-data topic still peaks, but without blocking consumption on the other topics.
What we did:
We still have 1 spark app. We used 2 successive spark.readStreams but also added a sink for each.
In code:
priority_topic_stream = (spark.readStream.format('kafka')
    .options(..)
    .option('subscribe', ','.join([T1, T2, T3]))
    .load())
bulk_topic_stream = (spark.readStream.format('kafka')
    .options(..)
    .option('subscribe', BULK_TOPIC)
    .load())
priority_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
bulk_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
spark.streams.awaitAnyTermination()
To minimize the peak on the bulk stream, we will try increasing its partitions as advised by #partlov. But that would only have sped up consumption on the bulk stream, not resolved the issue of it blocking our reads from the priority topics.

Increase the output size of Spark Structured Streaming job

Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The outputs in S3 are again picked up as input by other MapReduce jobs.
I, therefore, want to increase the output size of the files on S3 so that the MapReduce job works efficiently.
Currently, because of small input size, the MapReduce jobs are taking way too long to complete.
Is there a way to configure the streaming job to wait for at least 'X' number of records to process?
You probably want the micro-batch trigger to wait until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to wait for sufficient data to be available in Kafka.
Make sure to set a sufficient maxTriggerDelay as per your application's needs.
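A rough sketch (broker, topic, paths, and thresholds are placeholders; minOffsetsPerTrigger and maxTriggerDelay are only available in recent Spark 3.x Kafka sources):
df = (spark.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'broker1:9092')  # placeholder brokers
    .option('subscribe', 'events')                       # placeholder topic
    .option('minOffsetsPerTrigger', 1000000)             # wait for ~1M offsets before triggering
    .option('maxTriggerDelay', '30m')                    # ...but never wait longer than 30 minutes
    .load())
(df.writeStream
    .format('parquet')
    .option('path', 's3a://my-bucket/output/')           # placeholder S3 paths
    .option('checkpointLocation', 's3a://my-bucket/checkpoints/')
    .start())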
No, there is not, in reality.
No for Spark prior to 3.x.
Yes and no for Spark 3.x, which effectively equates to no.
minOffsetsPerTrigger was introduced, but it has a catch, as per below. That means the overall answer still remains no.
From the manuals:
Minimum number of offsets to be processed per trigger interval. The
specified total number of offsets will be proportionally split across
topicPartitions of different volume. Note, if the maxTriggerDelay is
exceeded, a trigger will be fired even if the number of available
offsets doesn't reach minOffsetsPerTrigger.

Increase number of partitions in DStream to be greater than Kafka partitions in Direct approach

There are 32 Kafka partitions and 32 consumers, as per the direct approach.
But the data processing for the 32 consumers is slower than the Kafka rate (1.5x), which creates a backlog of data in Kafka.
I want to increase the number of partitions of the DStream received by each consumer.
I would like the solution to be something along the lines of increasing partitions on the consumer side rather than increasing partitions in Kafka.
In the direct stream approach, at most you can have #consumers = #partitions; Kafka does not allow more than one consumer per partition per group.id. By the way, if you are asking for more partitions per consumer, it will not help, since your consumers are already running at full capacity and are still insufficient.
A few technical changes you can try to reduce the data backlog in Kafka:
Increase the number of partitions - although you do not want to do this, it is still the easiest approach. Sometimes the platform just needs more hardware.
Optimize processing on the consumer side - check the possibility of record de-duplication before processing, reduce disk I/O, apply loop-unrolling techniques, etc., to reduce the time taken by consumers.
(higher difficulty) Controlled data distribution - it is often found that some partitions process better than others. It may be worth checking whether this is the case on your platform. Kafka's data distribution policy has some preferences (as does the message key), which often cause uneven load inside the cluster: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:
spark.streaming.kafka.maxRatePerPartition
It sets the number of records you consume from a single Kafka partition per second.
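For example (a sketch only: the rate and batch interval are placeholders, the 32 partitions come from the question), set it on the SparkConf before building the streaming context:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
    .setAppName('direct-stream-capped')
    .set('spark.streaming.kafka.maxRatePerPartition', '500'))  # placeholder: records per partition per second

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batches (example)
# With 32 partitions: at most 500 * 32 * 10 = 160,000 records per batch.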

How to rate limit a Spark map operation?

I have an S3 json dataset that is a dump of a KMS client-side encrypted DynamoDB (i.e each record is KMS client-side encrypted independently).
I would like to use Spark to load that dataset to perform some analysis which means I have to call KMS to decrypt each record. Having a udf that simply decrypts each line works but hits the KMS API limit of 100 calls/sec
I am wondering if there is someway to rate limit these Spark map operations?
I think this can be handled by a Spark Streaming application.
Check spark.streaming.backpressure.enabled and spark.streaming.receiver.maxRate:
Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And spark.streaming.receiver.maxRate is what you want when you need to cap the stream at 100 calls/sec:
Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See "Deploying Applications" in the Spark Streaming programming guide for more details.
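A minimal sketch under that answer's assumption of moving the work into a receiver-based streaming job (app name and batch interval are illustrative; only the 100 calls/sec limit comes from the question):
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
    .setAppName('kms-decrypt-stream')
    .set('spark.streaming.backpressure.enabled', 'true')
    .set('spark.streaming.receiver.maxRate', '100'))  # stay under the 100 KMS calls/sec limit

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)  # 1-second batches, so at most ~100 records are ingested per batch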

Dynamically update batch size of input for spark kafka consumer

I use createDirectStream in my Spark Streaming application. I set the batch interval to 7 seconds and most of the time the batch job can finish within about 5 seconds. However, in very rare cases, a batch job takes around 60 seconds, and this delays the batches that follow.
To cut down the total delay, I hope I can process the streaming data that has piled up across the delayed jobs in one go. This would help the stream return to normal as soon as possible.
So I want to know whether there is some method to dynamically update/merge the input batch size for Spark and Kafka when a delay appears.
You can set the "spark.streaming.backpressure.enabled" option to true.
If batch delay occurs while the backpressure option is true, Spark initially starts with a small batch size and then dynamically grows it to a larger batch size.
See the Spark configuration documentation; the relevant description is below.
Enables or disables Spark Streaming's internal backpressure mechanism
(since 1.5). This enables the Spark Streaming to control the receiving
rate based on the current batch scheduling delays and processing times
so that the system receives only as fast as the system can process.
Internally, this dynamically sets the maximum receiving rate of
receivers. This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see below).
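A minimal sketch of that setup (the 7-second interval comes from the question; the per-partition ceiling, topic, and broker are placeholders, and KafkaUtils.createDirectStream assumes the legacy pyspark.streaming.kafka module from Spark 2.x, which was removed in 3.x):
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 2.x only

conf = (SparkConf()
    .setAppName('adaptive-batches')
    .set('spark.streaming.backpressure.enabled', 'true')
    .set('spark.streaming.kafka.maxRatePerPartition', '2000'))  # placeholder ceiling per partition

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 7)  # 7-second batch interval, as in the question

stream = KafkaUtils.createDirectStream(
    ssc, ['my_topic'], {'metadata.broker.list': 'broker1:9092'})  # placeholder topic/brokers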

Resources