set spark.streaming.kafka.maxRatePerPartition for createDirectStream - apache-spark

I need to increase the input rate per partition for my application, and I have used .set("spark.streaming.kafka.maxRatePerPartition", 100) in the config. The stream duration is 10s, so I expect to process 5*100*10 = 5000 messages for this batch. However, the input rate I receive is only about 500. Can you suggest any modifications to increase this rate?

The stream duration is 10s, so I expect to process 5*100*10 = 5000 messages for this batch.
That's not what the setting means. It means "how many elements each partition can have per batch", not per second. I'm going to assume you have 5 partitions, so you're getting 5 * 100 = 500. If you want 5000, set maxRatePerPartition to 1000.
From "Exactly-once Spark Streaming From Apache Kafka" (written by the Cody, the author of the Direct Stream approach, emphasis mine):
For rate limiting, you can use the Spark configuration variable
spark.streaming.kafka.maxRatePerPartition to set the maximum number of
messages per partition per batch.
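As a minimal sketch of wiring that setting into a direct stream (the app name, broker address, and topic name below are placeholders, and it assumes the spark-streaming-kafka 0.8 createDirectStream API):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("rate-limited-stream")  // placeholder
  // per the reasoning above: 5 partitions * 1000 = 5000 messages per batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // placeholder
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))  // placeholder topic

Note that SparkConf.set takes the value as a String, hence "1000" rather than the bare 100 in the question.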
Edit:
After @avrs's comment, I looked inside the code which defines the max rate. As it turns out, the heuristic is a bit more complex than stated in both the blog post and the docs.
There are two branches. If backpressure is enabled alongside maxRate, the effective rate is the minimum of the current backpressure rate calculated by the RateEstimator and the user-set maxRate. If backpressure isn't enabled, the user-set maxRate is taken as is.
Now, after selecting the rate it always multiplies by the total batch seconds, effectively making this a rate per second:
if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
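To make that concrete, here is a small worked example of the branch logic above (all values assumed for illustration):

val maxRatePerPartition = 100L  // user-set limit, messages/partition/second
val backpressureRate = 50L      // hypothetical rate from the RateEstimator
// with backpressure enabled, the effective rate is the minimum of the two
val effectiveRate = math.min(backpressureRate, maxRatePerPartition)  // 50
val secsPerBatch = 10.0         // 10s batch interval
// the per-second rate is then scaled to a per-batch cap
val maxPerPartitionPerBatch = (secsPerBatch * effectiveRate).toLong  // 500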

The property fetches N messages from each partition per second. If you have M partitions and a batch interval of B seconds, the total number of messages you can see in a batch is N * M * B (see the sketch after the checklist below).
There are a few things you should verify:
Is your input rate actually greater than 500 messages per 10s?
Is the Kafka topic properly partitioned?
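Plugging the question's numbers into that formula (assuming 5 partitions, as in the answer above):

val n = 100            // maxRatePerPartition, messages/partition/second
val m = 5              // number of Kafka partitions
val b = 10             // batch interval in seconds
val total = n * m * b  // = 5000 messages per batch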

Related

Does the Spark config spark.streaming.receiver.maxRate have any effect in a Kafka Beam pipeline?

I was wondering if somebody has any experience with rate limiting in the Beam KafkaIO component when the runner is a SparkRunner. The versions I am using are: Beam 2.29, Spark 3.2.0 and Kafka client 2.5.0.
I have the Beam parameter maxRecordsPerBatch set to a large number, 100000000. But even when the pipeline stops for 45 minutes, this value is never hit. When there is a high burst of data above the normal, the Kafka lag increases until it eventually catches up. In the SparkUI I see that the parameter batchIntervalMillis=300000 (5 min) is not reached; batches take a maximum of 3 min. It looks like KafkaIO stops reading at some point, even when the lag is very large. My Kafka parameters --fetchMaxWaitMs=1000 and --maxPollRecords=5000 should be able to bring plenty of data, especially because KafkaIO creates one consumer per partition. In my system there are multiple topics with a total of 992 partitions, and my spark.default.parallelism=600. Some partitions have very little data, while others have a large amount. Topics are per region, and when a region goes down the data is sent through another region/topic. That is when the lag happens.
Do the configuration values spark.streaming.receiver.maxRate and spark.streaming.receiver.maxRatePerPartition plus spark.streaming.backpressure.enabled play any role at all?
From what I have seen, it looks like Beam controls the whole reading from Kafka with the KafkaIO operator. This component creates its own consumers, therefore the rate of the consumer can only be set by using consumer configs, which include fetchMaxWaitMs and maxPollRecords.
The only way those Spark parameters could have any effect is in the rest of the pipeline after the IO source. But I am not sure.
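As a rough sketch of what consumer-level tuning looks like (Scala calling Beam's Java KafkaIO builder; the broker and topic names are placeholders, and withConsumerConfigUpdates is assumed available as in Beam releases around 2.29):

import org.apache.beam.sdk.io.kafka.KafkaIO
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

val read = KafkaIO.read[String, String]()
  .withBootstrapServers("broker:9092")  // placeholder
  .withTopic("my-topic")                // placeholder
  .withKeyDeserializer(classOf[StringDeserializer])
  .withValueDeserializer(classOf[StringDeserializer])
  // the question's settings expressed as Kafka consumer properties
  .withConsumerConfigUpdates(Map[String, AnyRef](
    "fetch.max.wait.ms" -> Integer.valueOf(1000),
    "max.poll.records"  -> Integer.valueOf(5000)
  ).asJava)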
So I finally figured out how it all works. First I want to state that the Spark configuration values spark.streaming.receiver.maxRate, spark.streaming.receiver.maxRatePerPartition, and spark.streaming.backpressure.enabled do not play a factor in Beam, because they only work if you are using the source operators from Spark to read from Kafka. Since Beam has its own operator, KafkaIO, they do not play a role.
Beam has a set of parameters defined in the class SparkPipelineOptions that are used by the SparkRunner to set up reading from Kafka. Those parameters are:
#Description("Minimum time to spend on read, for each micro-batch.")
#Default.Long(200)
Long getMinReadTimeMillis();
#Description(
"A value between 0-1 to describe the percentage of a micro-batch dedicated to reading from UnboundedSource.")
#Default.Double(0.1)
Double getReadTimePercentage();
Beam creates a SourceDStream object that it passes to Spark to use as the source for reading from Kafka. In this class, the method boundReadDuration returns the larger of two reading values: proportionalDuration and lowerBoundDuration. The first is calculated by multiplying BatchIntervalMillis by readTimePercentage. The second is just the value in millis from minReadTimeMillis. Below is the code from SourceDStream. The duration returned by this function is used exclusively to read from Kafka; the rest of the time is allocated to the other tasks in the pipeline.
Last but not least, the parameter maxRecordsPerBatch also controls how many records are processed during a batch; the pipeline will not process more than that many records in a single batch.
private Duration boundReadDuration(double readTimePercentage, long minReadTimeMillis) {
  long batchDurationMillis = ssc().graph().batchDuration().milliseconds();
  Duration proportionalDuration = new Duration(Math.round(batchDurationMillis * readTimePercentage));
  Duration lowerBoundDuration = new Duration(minReadTimeMillis);
  Duration readDuration =
      proportionalDuration.isLongerThan(lowerBoundDuration) ? proportionalDuration : lowerBoundDuration;
  LOG.info("Read duration set to: " + readDuration);
  return readDuration;
}
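Plugging the question's 5-minute batch interval and Beam's defaults into that logic (a sketch; values taken from the question and the defaults above):

val batchIntervalMillis = 300000L  // 5 min batches, as in the question
val readTimePercentage = 0.1       // Beam default (getReadTimePercentage)
val minReadTimeMillis = 200L       // Beam default (getMinReadTimeMillis)
val proportionalMillis = Math.round(batchIntervalMillis * readTimePercentage)  // 30000
// the larger of the two bounds wins: 30s of each 5-min batch is spent reading
val readDurationMillis = math.max(proportionalMillis, minReadTimeMillis)  // 30000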

Spark Structured Streaming metrics: Why process rate can be greater than input rate?

How come the process rate can be greater than the input rate?
From my understanding, the process rate is the rate at which Spark can process arriving data, i.e., the processing capacity. If so, the process rate must on average be lower than or equal to the input rate. If it is lower, we know we need more processing power, or need to rethink the trigger time.
I am basing my understanding on this blog post and common sense, but I might be wrong. I was also looking for the formal formula in the source code while writing this question.
This is an example where the process rate is consistently greater than the input rate:
You can see that on average we have 200-300 records being processed per second, whereas only 80-120 records arrive per second.
Setup background: Spark 3.x reading from Kafka and writing to Delta.
Thank you all.
A process rate greater than the input rate means Spark is processing data faster than it arrives, i.e. it could process 300-400 records per second even though the event rate is only 100 per second. For example, say the input rate is ~100 records per second and Spark can process those 100 records within half a second; it could then process 100 more in the second half of that second, which on average yields a process rate of ~200 per second.
The attached screenshot can be interpreted as follows: with ~15s of processing time per batch (based on the ~15000 ms seen in the latency chart), Spark could process ~3000 records per batch (~200 * ~15s), but it is actually processing only around ~1000 records per batch.
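A tiny sketch of the arithmetic behind that interpretation (numbers assumed for illustration):

val recordsInBatch = 1000.0
val batchIntervalSec = 10.0    // how often batches arrive
val processingTimeSec = 5.0    // the batch finished in half the interval
val inputRate = recordsInBatch / batchIntervalSec     // 100 records/s arriving
val processRate = recordsInBatch / processingTimeSec  // 200 records/s while actually processing

The process rate divides by the time spent processing, not by the batch interval, which is why it can exceed the input rate whenever batches finish early.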

Kafka Direct API Batch Input Size

As per the Kafka Direct API, the number of input records is calculated as
maxInputSize = maxRatePerPartition * numOfPartitions * batchDurationSeconds
I really fail to understand why the input size is determined like this. Suppose my job processes 100 files in 5 minutes.
If I set maxRatePerPartition = 1 and my topic has 6 partitions, what should the batch duration be? If I set it to 300 seconds, I will be fetching 1800 files as input, there will be a long queue of batches waiting to be processed, and those 1800 files will take about half an hour to process, let alone memory issues and other constraints.
How can I address this issue? I should be able to control the number of records in my input: if I can only process 10 records in 5 minutes, I should be able to load only that many records (see the sketch below).
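For what it's worth, plugging the question's numbers into the formula shows why such a small batch is out of reach (a sketch; values taken from the question):

// maxInputSize = maxRatePerPartition * numOfPartitions * batchDurationSeconds
val numOfPartitions = 6
val batchDurationSeconds = 300
val smallestBatch = 1 * numOfPartitions * batchDurationSeconds  // 1800 records
// even at the smallest rate (1 message/partition/second), a 300s batch
// already pulls 1800 records; a 10-record batch would need a much shorter
// batch duration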

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with Spark 1.5.2. I understand what a batch interval is: essentially the interval after which processing should start on the data received from the receiver.
But I do not understand what spark.streaming.receiver.maxRate is. From some research, it is apparently an important parameter.
Let's consider a scenario: my batch interval is set to 60s and spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load? What would happen? Would the additional 60*1000 records be dropped, or would the processing happen twice during that batch interval?
The property spark.streaming.receiver.maxRate applies to the number of records per second.
The receiver max rate is applied when receiving data from the stream, which means it kicks in even before the batch interval applies. In other words, you will never get more records per second than the value set in spark.streaming.receiver.maxRate. The additional records simply "stay" in the stream (e.g. Kafka, a network buffer, ...) and get processed in the next batch.
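A quick sketch of the cap implied by that scenario (values taken from the question):

val maxRatePerSecond = 60L * 1000L  // spark.streaming.receiver.maxRate = 60000
val batchIntervalSec = 60L
val maxRecordsPerBatch = maxRatePerSecond * batchIntervalSec  // 3,600,000 cap per batch
// 60*2000 records arriving in 60s is only 2,000 records/s, well under the
// 60,000/s cap; anything above the cap is not dropped but waits in the
// source for a later batch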

Apache Spark Streaming: Why is the amount of Events per Second increasing?

I found that the number of events/sec in my streaming application suddenly increases (see screenshot). I made sure that the sender doesn't send more data at these moments (it always sends 800-900 messages per second over TCP). I guess it has to do with the other timings below (processing / delay), but why does it show up in the events-count graph? Can someone explain what exactly happens there? Thanks!
The number of events is directly related to the producer, which means your producer is increasing its send rate over time. At the moment of the screenshot, your mean received-message throughput was 953 msgs/s, which is higher than the expected 850 (taking the 800-900 range mentioned in the question as normally distributed).
Check the producer for the particular reasons for this increase.
