Kafka Direct API Batch Input Size - apache-spark

As per Kafka Direct API number of input records is calculated as
maxInputSize = maxRatePerPartition * #numOfPartitions# * #BATCH_DURATION_SECONDS#
I am really failed to understand why input size is determined like this. Suppose my job processes 100 files in 5 minutes.
if I set maxRatePerPartition = 1, numOfPartitions in my topic are 6, what should be batch duration because if I set batch duration seconds to 300, I will be fetching 1800 files as input and there will be a long queue of batches waiting to be processed and 1800 files will take about half hour to process let aside memory issues and other constraints.
How can I cater this issue. I should be able to control records in my input. I can process 10 records in 5 minutes, I should be able to load these many records only.

Related

Does IoTHub delay messages by a batching interval in a custom endpoint to Azure Storage?

I am sending some messages in a pipeline using Azure IoT Edge. There is a custom endpoint (say, GenericEndpoint) that I have set up, which will send/put the messages to Azure Blob storage. I am using a route to push the device messages to the specific endpoint GenericEndpoint.
The batch frequency of GenericEndpoint is set at 60 seconds. So 1 batch creates 1 single file with some messages, in the container specified.
Lets say, there are N messages in a single blob batch file (say, blobX) in the specific container. If I take the average of the difference between the IoTHub.EnqueuedTime(i) of each message i, in blobX and the 'Creation Time' of blobX, and call it AVG, I get:
I think, this essentially gives me the average time that those N messages spent in iothub before being written in the blob storage. Now what I observe here is that, if p and q are respectively the first and last message written in blobX, then
But since the batching interval was set to 60 seconds, I would expect this average or AVG to be approximately near 30 seconds. Because, if the messages are written as soon as they arrive, then the average for each batch file would be near 30 seconds.
But in my case, AVG ≈ 90 seconds, which suggests the messages wait for atleast approximately one batching interval (60 seconds in this case) before being considered for a particular batch.
Assumption: When a batch of messages are written in a blob file, they are written all at once.
My question:
Is this delay of one batch interval or 60 seconds intentional? If yes, then I assume it will change on changing the batching interval to say 100 seconds.
If, no, then, does it usually take 60 seconds to process a message in iothub and then send it through a route to a custom endpoint? Or am I looking at this from a completely wrong angle?
I apologize beforehand if my question seems confusing.

set spark.streaming.kafka.maxRatePerPartition for createDirectStream

I need to increase the input rate per partition for my application and I have use .set("spark.streaming.kafka.maxRatePerPartition",100) for the config. The stream duration is 10s so I expect process 5*100*10=5000 messages for this batch. However, the input rate I received is just about 500. Can You suggest any modifications to increase this rate?
The stream duration is 10s so I expect process 5*100*10=5000 messages
for this batch.
That's not what the setting means. It means "how many elements each partition can have per batch", not per second. I'm going to assume you have 5 partitions, so you're getting 5 * 100 = 500. If you want 5000, set maxRatePerPartition to 1000.
From "Exactly-once Spark Streaming From Apache Kafka" (written by the Cody, the author of the Direct Stream approach, emphasis mine):
For rate limiting, you can use the Spark configuration variable
spark.streaming.kafka.maxRatePerPartition to set the maximum number of
messages per partition per batch.
Edit:
After #avrs comment, I looked inside the code which defines the max rate. As it turns out, the heuristic is a bit more complex than stated in both the blog post and the docs.
There are two branches. If backpressure is enabled alongside maxRate, then the maxRate is the minimum between the current backpressure rate calculated by the RateEstimator object and maxRate set by the user. If it isn't enabled, it takes the maxRate defined as is.
Now, after selecting the rate it always multiplies by the total batch seconds, effectively making this a rate per second:
if (effectiveRateLimitPerPartition.values.sum > 0) {
val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
Some(effectiveRateLimitPerPartition.map {
case (tp, limit) => tp -> (secsPerBatch * limit).toLong
})
} else {
None
}
Property fetches N messages from a partition per seconds. If I have M partition and batch interval is B, then total messages I can see in batch is N * M * B.
There are few things you should verify
Is your input rate is >500 for 10s.
Is kafka topic is properly partitioned.

Spark Streaming Processing Time vs Total Delay vs Processing Delay

I am trying to understand what the different metrics that Spark Streaming outputs mean and I am slightly confused what is the difference between the Processing Time, Total Delay and Processing Delay of the last batch ?
I have looked at the Spark Streaming guide which mentions the Processing Time as a key metric for figuring if the system is falling behind, but other places such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark" speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with explanation what each one of them means.
I would appreciate if someone can outline what each of these three metrics means or point me to any resources that can help me understand that.
Let's break down each metric. For that, let's define a basic streaming application which reads a batch at a given 4 second interval from some arbitrary source, and computes the classic word count:
inputDStream.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.saveAsTextFile("hdfs://...")
Processing Time: The time it takes to compute a given batch for all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFile, and assumes as a prerequisite that the job has been submitted.
Scheduling Delay: The time taken by Spark Streaming scheduler to submit the jobs of the batch. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, thus making the scheduling delay 4 seconds long.
Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch took another 8 seconds to compute, this means that the total delay is now 8 + 4 = 12 seconds long.
A live example from a working Streaming application:
We see that:
The bottom job took 11 seconds to process. So now the next batches scheduling delay is 11 - 4 = 7 seconds.
If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay, in that case (rounding 0.9 to 1) 7 + 1 = 8.
We're experiencing stable processing time, however increasing scheduling delay.
Based on the answer, the scheduling delay should be influenced only by processing time of previous runs.
Spark is running only streaming, nothing else.
Time window is 1 minute, processing 120K records.
If your window is 1 minute, and the average processing time is 1 minute 7 seconds, you have a problem : each batch will delay the next one by 7 seconds.
Your processing time graph shows a stable processing time, but always higher than batch time.
I think after a given amount of time, your driver will crash on GC overhead limit exceeded, as it will be full of pending batch waiting to be excecuted.
You can change this by reducing the processing time so that it goes under the expected microbatch max duration (requires code and/or resources allocation changes), or increase the microbatch size, or go to continuous streaming.
Rgds

Spark: grouping of data stream based on cycle time

I need your inputs regarding grouping of data stream within spark streaming on the basis of cycle time.
We are receiving input data in this formats {Object_id:"vm123", time:"1469077478" , metric :"cpu.usage" , value :"50.8"}.
Data frames are getting ingested very fast at the average rate of 10 seconds.We have a use case to create bins of data based on cycle time .
Suppose Spark bin/batch time is 1 minute for processing of data. Cycle time should be based on the message time-stamp. for example if we receive first packet at 11:30am then we would have to aggregate all messages of that metrics which are received in between 11:30am to 11:31am(1 minute) and send it for processing with cycle time 11.31am.
As per Spark documentation , we only have a support to bin data based on a fix batch duration , for example if we define the batch duration for 1 minute , it will hold the data for 1 minute and send that as a batch , where we have an option to aggregate the data received during this one minuted duration . But this approach does not follows the notion of aggregating the data based on the Cycle Time as defined above.
Please let us know if we have a way to achieve the above use case through Spark or some other tool.
Added Details::
In Our Use case data frames are getting ingested in each 10 seconds for different entities and each object has few metrics.We need to create bins of data before processing based on cycle time interval (like 5 mins) and the start time of that interval should start with message time-stamp.
for example:
We have messages of an object 'vm123' in kafka queue like following:
message1= {Object_id:"vm123", time:"t1" , metric :"m1" , value :"50.8"}.
message2 = {Object_id:"vm123", time:"t1" , metric :"m2" , value :"55.8"}..............................................................
Cycle time interval = 5 minutes.
so a first bin for entity 'VM123' should have all messages of range of t1 to (t1+5*60) times and final group of messages with 5 minute cycle time for ob1 should be like following:
{Object_id:"ob1" , time:"t5" , metrics : [{"name": "m1", value:"average of (v1,v2,v3,v4,v5)},{"name": "m2", value:"average of (v1,v2,v3,v4,v5)}"] }
Thanks

What is spark.streaming.receiver.maxRate? How does it work with batch interval

I am working with spark 1.5.2. I understand what a batch interval is, essentially the interval after which the processing part should start on the data received from the receiver.
But I do not understand what is spark.streaming.receiver.maxRate. From some research it is apparently an important parameter.
Lets consider a scenario. my batch interval is set to 60s. And spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load. What would happen? Will the additional 60*1000 records be dropped? Or would the processing happen twice during that batch interval?
Property spark.streaming.receiver.maxRate applies to number of records per second.
The receiver max rate is applied when receiving data from the stream - that means even before batch interval applies. In other words you will never get more records per second than set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.

Resources