How does Spark Structured Streaming determine an event has arrived late? - apache-spark

I read through the Spark Structured Streaming documentation and I wonder: how does Spark Structured Streaming determine that an event has arrived late? Does it compare the event time with the processing time?
Taking the above picture as an example, does the bold right-arrow line labeled "Time" represent processing time? If so:
1) Where does this processing time come from? Since it is streaming, is it assumed that the upstream source carries a processing timestamp, or does Spark add a processing-timestamp field? For example, when reading messages from Kafka we do something like
Dataset<Row> kafkadf = spark.readStream().format("kafka").load();
This DataFrame has a timestamp column by default, which I am assuming is the processing time. Correct? If so, does Kafka or Spark add this timestamp?
2) I can see there is a time comparison between the bold right-arrow line and the time in the message. Is that how Spark determines an event is late?

The processing time of an individual job (one RDD of the DStream) is what dictates the processing time in general. It is not when the actual processing of that RDD happens, but when the RDD job was allocated to be processed.
To clearly understand what the above statement means, create a Spark Streaming application with a batch interval of 60 seconds and make sure each batch takes 2 minutes to process, as in the sketch below. Eventually you will see a job that was allocated to be processed at a given time but has not been picked up, because the previous job has not finished.
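A minimal sketch of such an experiment (the socket source, app name and master here are placeholders, not from the original answer): 60-second batches combined with a stage that sleeps for about two minutes, so later jobs queue up behind the running one.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 60-second batch interval; each batch deliberately takes ~2 minutes to "process".
val conf = new SparkConf().setAppName("SlowBatchDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.socketTextStream("localhost", 9999) // any input source works for this experiment
lines.foreachRDD { rdd =>
  Thread.sleep(2 * 60 * 1000L) // simulate a slow batch
  println(s"Processed ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()
In the Spark UI you should see the scheduling delay grow, because each job is allocated at its batch time but waits for the previous one to finish.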
Next:
Out-of-order data can be dealt with in two different ways.
The first is to set a watermark (a high-water mark).
It is explained on the same Spark user guide page your picture comes from.
It is easiest to understand when we have a key-value pair where the key is the timestamp. By setting .withWatermark("timestamp", "10 minutes") we are essentially saying: if I have received a message for 10 AM, then I will still accept messages a little older than that (up to 9:50 AM). Any message older than that gets dropped.
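As a minimal sketch (a streaming DataFrame named events with timestamp and key columns is assumed here for illustration):
import org.apache.spark.sql.functions.{col, window}

// Accept events up to 10 minutes older than the newest event time seen so far;
// anything older than that is dropped from the windowed aggregation.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes"), col("key"))
  .count()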
The other way out-of-order data can be handled is with the mapGroupsWithState or mapWithState functions.
Here you can decide what to do when you get a bunch of values for a key: drop anything before time X, or go even fancier than that (for example, if the data is from source A then allow a 20-minute delay, for the rest allow a 30-minute delay, and so on).
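A rough sketch of that idea with mapGroupsWithState (the Event fields, the per-source delays and the running sum are made up for illustration; a Dataset[Event] named events and spark.implicits._ in scope are assumed):
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, source: String, eventTime: Timestamp, value: Double)
case class KeyState(maxEventTimeMs: Long, sum: Double)

val perKey = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (id: String, batch: Iterator[Event], state: GroupState[KeyState]) =>
      val old = state.getOption.getOrElse(KeyState(0L, 0.0))
      // Custom lateness policy: allow a 20-minute delay for source "A", 30 minutes for the rest.
      val kept = batch.filter { e =>
        val allowedMs = if (e.source == "A") 20 * 60 * 1000L else 30 * 60 * 1000L
        e.eventTime.getTime >= old.maxEventTimeMs - allowedMs
      }.toSeq
      val newMax = (old.maxEventTimeMs +: kept.map(_.eventTime.getTime)).max
      val updated = KeyState(newMax, old.sum + kept.map(_.value).sum)
      state.update(updated)
      (id, updated.sum)
  }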

This blog post from Databricks explains it pretty clearly:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Basically it comes down to the watermark (late data threshold) and the order in which data records arrive. Processing time does not play into it at all. You set a watermark on the column representing your event time. If a record R1 with event time T1 arrives after a record R2 with event time T2 has already been seen, and T2 > T1 + Threshold, then R1 will be discarded.
For example, suppose T1 = 09:00, T2 = 10:00, Threshold = 61 min. If a record R1 with event time T1 arrives before a record R2 with event time T2 then R1 is included in calculations. If R1 arrives after R2 then R1 is still included in calculations because T2 < T1 + Threshold.
Now suppose Threshold = 59 min. If R1 arrives before R2 then R1 is included in calculations. If R1 arrives after R2, then R1 is discarded because we've already seen a record with event time T2 and T2 > T1 + Threshold.
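To make that arithmetic concrete, a tiny standalone check (just the rule restated, not Spark's internal code):
// A record is dropped iff the latest event time already seen exceeds
// the record's event time plus the threshold (all values in minutes here).
def isDropped(recordEventTime: Int, maxEventTimeSeen: Int, threshold: Int): Boolean =
  maxEventTimeSeen > recordEventTime + threshold

isDropped(9 * 60, 10 * 60, 61) // false: 600 > 540 + 61 = 601 does not hold, so R1 is kept
isDropped(9 * 60, 10 * 60, 59) // true:  600 > 540 + 59 = 599 holds, so R1 is dropped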

Related

Does the Spark config spark.streaming.receiver.maxRate have any effect in a Kafka Beam pipeline?

I was wondering if somebody has any experience with rate limiting in the Beam KafkaIO component when the runner is a SparkRunner. The versions I am using are: Beam 2.29, Spark 3.2.0 and Kafka client 2.5.0.
I have the Beam parameter maxRecordsPerBatch set to a large number, 100000000, but even when the pipeline stops for 45 minutes this value is never hit. When there is a high burst of data above the normal, the Kafka lag increases until it eventually catches up. In the Spark UI I see that the parameter batchIntervalMillis=300000 (5 min) is not reached; batches take a maximum of 3 min. It looks like KafkaIO stops reading at some point, even when the lag is very large. My Kafka parameters --fetchMaxWaitMs=1000 and --maxPollRecords=5000 should be able to bring plenty of data, especially because KafkaIO creates one consumer per partition. In my system there are multiple topics with a total of 992 partitions, and my spark.default.parallelism=600. Some partitions have very little data, while others have a large amount. Topics are per region, and when a region goes down the data is sent through another region/topic. That is when the lag happens.
Do the configuration values spark.streaming.receiver.maxRate and spark.streaming.receiver.maxRatePerPartition, plus spark.streaming.backpressure.enabled, play any role at all?
From what I have seen, it looks like Beam controls the whole reading from Kafka with the KafkaIO operator. This component creates its own consumers, therefore the rate of the consumer can only be set by using consumer configs, which include fetchMaxWaitMs and maxPollRecords.
The only way those Spark parameters could have any effect is in the rest of the pipeline, after the IO source. But I am not sure.
So I finally figured out how it all works. First I want to state that the Spark configuration values spark.streaming.receiver.maxRate, spark.streaming.receiver.maxRatePerPartition and spark.streaming.backpressure.enabled do not play a factor in Beam, because they only work if you are using Spark's own source operators to read from Kafka. Since Beam has its own operator, KafkaIO, they do not play a role.
Beam has a set of parameters defined in the class SparkPipelineOptions that are used by the SparkRunner to set up reading from Kafka. Those parameters are:
#Description("Minimum time to spend on read, for each micro-batch.")
#Default.Long(200)
Long getMinReadTimeMillis();
#Description(
"A value between 0-1 to describe the percentage of a micro-batch dedicated to reading from UnboundedSource.")
#Default.Double(0.1)
Double getReadTimePercentage();
Beam creates a SourceDStream object that it passes to Spark to use as the source to read from Kafka. In this class, the method boundReadDuration returns the larger of two values: proportionalDuration and lowerBoundDuration. The first is calculated by multiplying batchIntervalMillis by readTimePercentage. The second is just the value in milliseconds from minReadTimeMillis. Below is the code from SourceDStream. The duration returned by this function will be used for reading from Kafka alone; the rest of the time will be allocated to the other tasks in the pipeline.
Last but not least, the parameter maxRecordsPerBatch also controls how many records are processed during a batch; the pipeline will not process more than that many records in a single batch.
private Duration boundReadDuration(double readTimePercentage, long minReadTimeMillis) {
  long batchDurationMillis = ssc().graph().batchDuration().milliseconds();
  Duration proportionalDuration = new Duration(Math.round(batchDurationMillis * readTimePercentage));
  Duration lowerBoundDuration = new Duration(minReadTimeMillis);
  Duration readDuration = proportionalDuration.isLongerThan(lowerBoundDuration) ? proportionalDuration : lowerBoundDuration;
  LOG.info("Read duration set to: " + readDuration);
  return readDuration;
}

Streaming data processing with Spark Structured Streaming

I have events being pushed to Kafka from the App.
The business design is such that for one interaction in the App a maximum of 3 events can be generated.
The events for one interaction share a common ID.
My goal is to combine those three events into one row in a Delta table.
I'm doing a 15-minute timestamp-based window aggregation.
The issue is the following.
Let's say it's the period of 12:00 - 12:15.
Event A - timestamp = 12:14:30
Event B - timestamp = 12:14:50
Event C - timestamp = 12:15:50
So Event C is in a different time frame and hence is not combined with A and B.
I can check for each batch whether the ID already exists in the Delta table, but I wonder if there is a more elegant way to solve these cases.
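For context, a minimal sketch of the kind of aggregation described above (the events DataFrame and the column names timestamp, interactionId and eventType are assumptions):
import org.apache.spark.sql.functions.{col, collect_list, window}

// 15-minute tumbling window keyed by the shared interaction ID; events whose
// timestamps land in different windows (like Event C) end up in different rows.
val combined = events
  .withWatermark("timestamp", "15 minutes")
  .groupBy(window(col("timestamp"), "15 minutes"), col("interactionId"))
  .agg(collect_list(col("eventType")).as("events"))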

Hazelcast Jet stream processing end window emission

I've stumbled across an interesting observation while trying to cross-check the results of aggregation for my stream processing. I've created a test case in which a pre-defined data set was fed into a journaled map and the aggregation was supposed to produce one result, as that was in line with the window size/sliding and the amount of data with pre-determined timestamps. However, the result was never published. The window was not emitted, although a few accumulate/combine operations were executed. It works differently with real data, but the result of the aggregation is always 'behind' the amount of data drawn from the source. I guess this has something to do with watermarks? How can I make sure in my test case that it doesn't wait for more data to come? Will allowed lateness help?
First, I'll refer you to the two sections in the manual which describe how watermarks work and also talk about the concept of stream skew:
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#unbounded-stream-processing
http://docs.hazelcast.org/docs/jet/0.6.1/manual/#stream-skew
The concept of "current time" in Jet only advances as long as there's events with advancing timestamps. There's typically several factors at play here:
Allowed lateness: This defines your lag per partition, assuming you are using a partitioned source like Kafka. This describes the tolerable degree of out of orderness in terms of timestamps in a single partition. If allowed lateness is 2 sec, the window will only close when you have received an event at N + 2 seconds across all input partitions.
Stream skew: This can happen when for example you have 10 Kafka partitions but only 3 are producing any events. As Jet coalesces watermarks from all partitions, this will cause the stream to wait until the other 7 partitions have some data. There's a timeout after which these partitions are considered idle, but this is by default 60 sec and currently not configurable in the pipeline API. So in this case you won't have any output until these partitions are marked as idle.
When using test data, it's quite common to have very low volume of events and many partitions, which can make it a challenge to advance the time correctly.
The points in Can Gencer's answer are valid. But for a test, you can also use a batch source, such as Sources.list. By adding timestamps to a BatchStage you convert it to a StreamStage, on which you can do window aggregation. The aggregate transform will emit pending windows at the end of the batch.
JetInstance inst = Jet.newJetInstance();
IListJet<TimestampedEntry<String, Integer>> list = inst.getList("data");
list.add(new TimestampedEntry(1, "a", 1));
list.add(new TimestampedEntry(1, "b", 2));
list.add(new TimestampedEntry(1, "a", 3));
list.add(new TimestampedEntry(1, "b", 4));
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<TimestampedEntry<String, Integer>>list("data"))
 .addTimestamps(TimestampedEntry::getTimestamp, 0)
 .groupingKey(TimestampedEntry::getKey)
 .window(tumbling(1))
 .aggregate(AggregateOperations.summingLong(TimestampedEntry::getValue))
 .drainTo(Sinks.logger());
inst.newJob(p).join();
inst.shutdown();
The above code prints:
TimestampedEntry{ts=01:00:00.002, key='a', value='4'}
TimestampedEntry{ts=01:00:00.002, key='b', value='6'}
Remember to keep your data in the list ordered by time as we use allowedLag=0.
Answer is valid for Jet 0.6.1.

Spark: grouping of data stream based on cycle time

I need your input regarding grouping of a data stream within Spark Streaming on the basis of cycle time.
We are receiving input data in this format: {Object_id:"vm123", time:"1469077478", metric:"cpu.usage", value:"50.8"}.
Data frames are getting ingested very fast, at an average rate of one every 10 seconds. We have a use case to create bins of data based on cycle time.
Suppose the Spark bin/batch time is 1 minute for processing of data. The cycle time should be based on the message timestamp. For example, if we receive the first packet at 11:30 am then we would have to aggregate all messages of that metric received between 11:30 am and 11:31 am (1 minute) and send them for processing with cycle time 11:31 am.
As per the Spark documentation, we only have support for binning data based on a fixed batch duration. For example, if we define the batch duration as 1 minute, it will hold the data for 1 minute and send that as a batch, where we have the option to aggregate the data received during this one-minute duration. But this approach does not follow the notion of aggregating the data based on the cycle time as defined above.
Please let us know if we have a way to achieve the above use case through Spark or some other tool.
Added details:
In our use case, data frames are getting ingested every 10 seconds for different entities, and each object has a few metrics. We need to create bins of data before processing, based on a cycle time interval (like 5 minutes), and the start time of that interval should be the message timestamp.
For example:
We have messages for an object 'vm123' in the Kafka queue like the following:
message1 = {Object_id:"vm123", time:"t1", metric:"m1", value:"50.8"}
message2 = {Object_id:"vm123", time:"t1", metric:"m2", value:"55.8"}
...
Cycle time interval = 5 minutes.
So the first bin for entity 'vm123' should have all messages in the time range t1 to (t1 + 5*60), and the final group of messages with a 5-minute cycle time for ob1 should look like the following:
{Object_id:"ob1" , time:"t5" , metrics : [{"name": "m1", value:"average of (v1,v2,v3,v4,v5)},{"name": "m2", value:"average of (v1,v2,v3,v4,v5)}"] }
Thanks

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the KafkaDStream? Or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is if you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5), as in the sketch below.
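A sketch of that setup, in the same pseudo-code spirit as the question (the socket source stands in for the Kafka stream; Minutes(60) is used for the hour-long window):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Batch interval of 5 minutes: streamA is then just the raw stream, and streamB
// is derived from it with a 1-hour window (60 is a multiple of 5, so this lines up).
val conf = new SparkConf().setAppName("MultiWindowSketch")
val ssc = new StreamingContext(conf, Minutes(5))

val kafkaDStream = ssc.socketTextStream("localhost", 9999) // placeholder for the Kafka stream
val streamA = kafkaDStream                // last 5 minutes, once every 5 minutes
val streamB = streamA.window(Minutes(60)) // last hour, sliding once per batch (5 minutes)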
