I need your input on grouping a data stream in Spark Streaming on the basis of cycle time.
We are receiving input data in this format: {Object_id:"vm123", time:"1469077478", metric:"cpu.usage", value:"50.8"}.
Data frames are being ingested very quickly, on average every 10 seconds. We have a use case to create bins of data based on cycle time.
Suppose the Spark bin/batch time is 1 minute for processing the data. The cycle time should be based on the message timestamp: for example, if we receive the first packet at 11:30 AM, then we have to aggregate all messages of that metric received between 11:30 AM and 11:31 AM (1 minute) and send them for processing with a cycle time of 11:31 AM.
As per the Spark documentation, we only have support for binning data based on a fixed batch duration: for example, if we define the batch duration as 1 minute, it will hold the data for 1 minute and send it as a batch, and we have the option to aggregate the data received during that one-minute duration. But this approach does not follow the notion of aggregating the data based on the cycle time as defined above.
Please let us know if there is a way to achieve the above use case through Spark or some other tool.
Added details:
In our use case, data frames are ingested every 10 seconds for different entities, and each object has a few metrics. We need to create bins of data before processing, based on a cycle-time interval (like 5 minutes), and the start of that interval should be the message timestamp.
For example:
We have messages for an object 'vm123' in a Kafka queue like the following:
message1 = {Object_id:"vm123", time:"t1", metric:"m1", value:"50.8"}
message2 = {Object_id:"vm123", time:"t1", metric:"m2", value:"55.8"}
...
Cycle time interval = 5 minutes.
So the first bin for entity 'vm123' should have all messages in the time range t1 to (t1 + 5*60), and the final group of messages with a 5-minute cycle time for 'vm123' should look like the following:
{Object_id:"vm123", time:"t5", metrics: [{"name":"m1", "value":"average of (v1,v2,v3,v4,v5)"}, {"name":"m2", "value":"average of (v1,v2,v3,v4,v5)"}]}
Thanks
Related
I have events being pushed to Kafka from the App.
The business design is such that for one interaction in the App a maximum of 3 events can be generated.
The events for one interaction share a common ID.
My goal is to combine those three events into one row in a Delta table.
I'm doing a 15-minute timestamp-based window aggregation.
The issue is the following.
Let's say it's the period of 12:00 - 12:15.
Event A - timestamp = 12:14:30
Event B - timestamp = 12:14:50
Event C - timestamp = 12:15:50
So Event C is in a different time frame and hence is not combined with A and B.
For each batch I can check whether the ID already exists in the Delta table, but I wonder if there is a more elegant way to solve these cases.
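For reference, a minimal sketch of the current 15-minute window aggregation (the column names interaction_id, event_time, and payload, as well as the watermark value, are assumptions). Because the tumbling windows are clock-aligned, Event C at 12:15:50 lands in the next window:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class InteractionWindowSketch {
  // 'events' is assumed to be a streaming DataFrame with columns:
  // interaction_id (string), event_time (timestamp), payload (string).
  static Dataset<Row> combinePerWindow(Dataset<Row> events) {
    return events
        .withWatermark("event_time", "10 minutes")
        // Tumbling 15-minute windows (12:00-12:15, 12:15-12:30, ...): an event at
        // 12:15:50 lands in 12:15-12:30 even if its siblings landed in 12:00-12:15.
        .groupBy(window(col("event_time"), "15 minutes"), col("interaction_id"))
        .agg(collect_list(col("payload")).alias("events"));
  }
}
```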
We have a Spark Structured Streaming query that counts the number of input rows received in the last hour, updating every minute, performing the aggregation with a temporal window (windowDuration="1 hour", slideDuration="1 minute"). The query is configured to use a processingTime trigger with a duration of 30 seconds: trigger(processingTime="30 seconds"). The outputMode of the query is append.
This query produces results as long as new rows are received, which is consistent with the behaviour that the documentation describes for fixed-interval micro-batches:
If no new data is available, then no micro-batch will be kicked off.
However, we would like the query to produce results even when there are NO input rows: our use case is related to monitoring, and we would like to trigger alerts when there are no input messages in the monitored system for a period of time.
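For reference, a minimal sketch of the query described above (the JSON file source, path, schema, and console sink are placeholders for our real source and sink):

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SlidingCountSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("sliding-count").getOrCreate();

    StructType schema = new StructType()
        .add("event_time", DataTypes.TimestampType)
        .add("event_id", DataTypes.StringType);
    Dataset<Row> events = spark.readStream().schema(schema).json("/tmp/events-input");

    Dataset<Row> counts = events
        // A watermark is required for append mode on a streaming aggregation.
        .withWatermark("event_time", "1 hour")
        // 1-hour window sliding every minute.
        .groupBy(window(col("event_time"), "1 hour", "1 minute"))
        .count();

    StreamingQuery query = counts.writeStream()
        .outputMode("append")
        .trigger(Trigger.ProcessingTime("30 seconds"))
        .format("console")
        .start();
    query.awaitTermination();
  }
}
```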
For example, for the following input:
event_time | event_id
00:02      | 1
00:05      | 2
01:00      | 3
03:00      | 4
At processingTime=01:01, we would expect the following output row to be produced:
window.start | window.end | count
00:00        | 01:00      | 3
However, from this point there are no input rows until 03:00, and therefore no micro-batch will be executed until then, missing the opportunity to produce output rows such as:
window.start | window.end | count
01:01        | 02:01      | 0
This row would otherwise produce a monitoring alert in our system.
Is there any workaround for this behaviour that allows empty micro-batches to execute when there are no input rows?
You cannot ask for behaviour the software does not provide, and there are no workarounds as such. There was even an issue, which may still exist, in which the last set of micro-batch data is not processed.
I read through the Spark Structured Streaming documentation and I wonder: how does Spark Structured Streaming determine that an event has arrived late? Does it compare the event time with the processing time?
Taking the above picture as an example, does the bold right-arrow line labelled "Time" represent processing time? If so:
1) Where does this processing time come from? Since it's streaming, is it assumed that the upstream source carries a processing timestamp, or does Spark add a processing-timestamp field? For example, when reading messages from Kafka we do something like (see the sketch after question 2 below):
Dataset<Row> kafkadf = spark.readStream().format("kafka").load();
This DataFrame has a timestamp column by default, which I am assuming is the processing time. Correct? If so, does Kafka or Spark add this timestamp?
2) I can see there is a time comparison between the bold right-arrow line and the time in the message. Is that how Spark determines that an event is late?
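For reference, a compilable version of the Kafka read from question 1 (the broker address and topic name are placeholders); printSchema() shows the fixed set of columns the Kafka source exposes, including the timestamp column in question:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaSourceSchemaSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kafka-source-schema").getOrCreate();

    Dataset<Row> kafkadf = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
        .option("subscribe", "my-topic")                       // placeholder
        .load();

    // Prints the Kafka source's fixed schema:
    // key, value, topic, partition, offset, timestamp, timestampType.
    // The timestamp column is populated from the Kafka record's own timestamp metadata.
    kafkadf.printSchema();
  }
}
```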
The processing time of an individual job (one RDD of the DStream) is what dictates the processing time in general. It is not when the actual processing of that RDD happens, but when the RDD's job was scheduled to be processed.
To clearly understand what the above statement means, create a Spark Streaming application where the batch interval is 60 seconds and make sure each batch takes 2 minutes to process. Eventually you will see that a job is scheduled to be processed at a certain time but has not been picked up because the previous job has not finished.
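A minimal sketch of such an experiment (the local master and socket source are stand-ins for your real source); each batch is forced to take about two minutes so that jobs queue up behind one another:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SlowBatchDemo {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("slow-batch-demo").setMaster("local[2]");
    // 60-second batch interval.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

    // Hypothetical socket source; any receiver-based source works for this experiment.
    jssc.socketTextStream("localhost", 9999).foreachRDD(rdd -> {
      long count = rdd.count();
      // Artificially make each batch take ~2 minutes so that batches queue up.
      Thread.sleep(120_000);
      System.out.println("processed " + count + " records");
    });

    jssc.start();
    jssc.awaitTermination();
  }
}
```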
Next:
Out-of-order data can be dealt with in two different ways.
The first is to create a high-water mark.
It is explained on the same Spark user guide page you got your picture from.
It is easiest to understand with a key-value pair where the key is the timestamp. By setting .withWatermark("timestamp", "10 minutes") we are essentially saying that if we have received a message for 10:00 AM, then we will allow messages a little older than that (up to 9:50 AM). Any message older than that gets dropped.
The other way out-of-order data can be handled is within the mapGroupsWithState or mapWithState functions.
Here you can decide what to do when you get a bunch of values for a key: drop anything before time X, or go even fancier than that (for example, if the data is from source A then allow a 20-minute delay, and for the rest allow a 30-minute delay, etc.).
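A minimal sketch of that second approach, assuming a streaming Dataset with hypothetical columns source, event_time, and value, where the allowed delay depends on which source the data came from:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public class CustomLatenessSketch {
  // 'events' is assumed to have columns: source (string), event_time (timestamp), value.
  static Dataset<String> acceptWithCustomLateness(Dataset<Row> events) {
    return events
        .groupByKey((MapFunction<Row, String>) r ->
            r.getString(r.fieldIndex("source")), Encoders.STRING())
        .mapGroupsWithState(
            (MapGroupsWithStateFunction<String, Row, Long, String>) (source, rows, state) -> {
              // State keeps the max event time (epoch millis) seen so far for this source.
              long maxSeen = state.exists() ? state.get() : Long.MIN_VALUE;
              // Per-source lateness: e.g. 20 minutes for source "A", 30 minutes otherwise.
              long allowedLagMs = ("A".equals(source) ? 20 : 30) * 60_000L;
              long accepted = 0;
              while (rows.hasNext()) {
                Row r = rows.next();
                long ts = r.getTimestamp(r.fieldIndex("event_time")).getTime();
                if (ts >= maxSeen - allowedLagMs) {
                  accepted++;  // keep this record; anything further behind is dropped
                }
                maxSeen = Math.max(maxSeen, ts);
              }
              state.update(maxSeen);
              return source + " accepted " + accepted + " records in this batch";
            },
            Encoders.LONG(), Encoders.STRING());
  }
}
```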
This document from Databricks explains it pretty clearly:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Basically it comes down to the watermark (late data threshold) and the order in which data records arrive. Processing time does not play into it at all. You set a watermark on the column representing your event time. If a record R1 with event time T1 arrives after a record R2 with event time T2 has already been seen, and T2 > T1 + Threshold, then R1 will be discarded.
For example, suppose T1 = 09:00, T2 = 10:00, Threshold = 61 min. If a record R1 with event time T1 arrives before a record R2 with event time T2 then R1 is included in calculations. If R1 arrives after R2 then R1 is still included in calculations because T2 < T1 + Threshold.
Now suppose Threshold = 59 min. If R1 arrives before R2 then R1 is included in calculations. If R1 arrives after R2, then R1 is discarded because we've already seen a record with event time T2 and T2 > T1 + Threshold.
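In code, that threshold is simply the watermark delay. A minimal sketch (the column name and the hourly count around it are assumptions):

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WatermarkThresholdSketch {
  // 'records' is assumed to be a streaming DataFrame with a timestamp column "event_time".
  static Dataset<Row> hourlyCounts(Dataset<Row> records) {
    return records
        // Threshold = 59 min: once an event time of 10:00 has been seen, records with
        // event time earlier than 09:01 may be dropped, as in the example above.
        .withWatermark("event_time", "59 minutes")
        .groupBy(window(col("event_time"), "1 hour"))
        .count();
  }
}
```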
As per the Kafka Direct API, the number of input records is calculated as:
maxInputSize = maxRatePerPartition * numOfPartitions * batchDurationSeconds
I really fail to understand why the input size is determined like this. Suppose my job processes 100 files in 5 minutes.
If I set maxRatePerPartition = 1 and my topic has 6 partitions, what should the batch duration be? If I set the batch duration to 300 seconds, I will be fetching 1800 files as input, there will be a long queue of batches waiting to be processed, and 1800 files will take about half an hour to process, leaving aside memory issues and other constraints.
How can I deal with this issue? I should be able to control the number of records in my input: if I can process 10 records in 5 minutes, I should be able to load only that many records.
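To make the arithmetic concrete, a small sketch using the numbers above and the corresponding Direct API setting (the app name is a placeholder):

```java
import org.apache.spark.SparkConf;

public class KafkaRateSketch {
  public static void main(String[] args) {
    int maxRatePerPartition = 1;     // records per second per partition
    int numOfPartitions = 6;         // partitions in the topic
    int batchDurationSeconds = 300;  // 5-minute batches

    // maxInputSize = maxRatePerPartition * numOfPartitions * batchDurationSeconds
    long maxInputSize = (long) maxRatePerPartition * numOfPartitions * batchDurationSeconds;
    System.out.println("records per batch: " + maxInputSize);  // 1 * 6 * 300 = 1800

    // The per-partition, per-second cap is what the Direct API lets you configure:
    SparkConf conf = new SparkConf()
        .setAppName("kafka-rate-sketch")
        .set("spark.streaming.kafka.maxRatePerPartition", String.valueOf(maxRatePerPartition));
    System.out.println(conf.get("spark.streaming.kafka.maxRatePerPartition"));
  }
}
```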
I am working with Spark 1.5.2. I understand what a batch interval is: essentially, the interval after which processing should start on the data received from the receiver.
But I do not understand what spark.streaming.receiver.maxRate is. From some research, it is apparently an important parameter.
Let's consider a scenario: my batch interval is set to 60s, and spark.streaming.receiver.maxRate is set to 60*1000. What if I get 60*2000 records in 60s due to some temporary load? What would happen? Would the additional 60*1000 records be dropped, or would the processing happen twice during that batch interval?
The property spark.streaming.receiver.maxRate applies to the number of records per second.
The receiver max rate is applied when receiving data from the stream, which means even before the batch interval applies. In other words, you will never get more records per second than the value set in spark.streaming.receiver.maxRate. The additional records will just "stay" in the stream (e.g. Kafka, network buffer, ...) and get processed in the next batch.
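A minimal sketch of the setup described in the question (the local master and socket source are placeholders), with the per-batch consequence of the per-second cap spelled out in a comment:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ReceiverMaxRateSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf()
        .setAppName("receiver-max-rate-sketch")
        .setMaster("local[2]")
        // Cap each receiver at 60,000 records per second.
        .set("spark.streaming.receiver.maxRate", "60000");

    // With a 60-second batch interval, a single receiver can therefore contribute at most
    // 60 * 60,000 records to one batch; anything arriving faster stays in the source
    // (e.g. Kafka, network buffer) and is consumed by later batches.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));
    jssc.socketTextStream("localhost", 9999).print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```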