How does Spark Structured Streaming calculate the watermark? - apache-spark

However, to run this query for days, it’s necessary for the system to
bound the amount of intermediate in-memory state it accumulates. This
means the system needs to know when an old aggregate can be dropped
from the in-memory state because the application is not going to
receive late data for that aggregate any more. To enable this, in
Spark 2.1, we have introduced watermarking, which lets the engine
automatically track the current event time in the data and attempt to
clean up old state accordingly. You can define the watermark of a
query by specifying the event time column and the threshold on how
late the data is expected to be in terms of event time. For a specific
window starting at time T, the engine will maintain state and allow
late data to update the state until (max event time seen by the engine
- late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start
getting dropped (see later in the section for the exact guarantees).
Let’s understand this with an example. We can easily define
watermarking on the previous example using withWatermark() as shown
below.
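For context, the snippet the quoted passage refers to with "as shown below" is roughly the following Scala version of the guide's word-count example. A local rate source stands in for the guide's streaming `words` DataFrame so the sketch is self-contained; the column names mirror the guide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("WatermarkExample").master("local[*]").getOrCreate()
import spark.implicits._

// The guide assumes a streaming DataFrame of schema { timestamp: Timestamp, word: String };
// here a rate source with a derived "word" column plays that role.
val words = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .selectExpr("timestamp", "concat('word', CAST(value % 3 AS STRING)) AS word")

// Tolerate data up to 10 minutes late, then window-and-count by word.
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
```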
The passage above is quoted from http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
According to the above documentation description ("For a specific window starting at time T"), T is the starting time of a given window.
I think the documentation is wrong; it should be the ending time of a given window.

I confirmed this by investigating the Spark code: the documentation is wrong, and T is the ending time of the window.
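To make the claim concrete, here is a tiny illustrative sketch (not the actual Spark internals, just the rule the answer describes): state for a window becomes eligible for cleanup once the watermark, i.e. the maximum event time seen minus the lateness threshold, passes the window's end time, not its start time.

```scala
import java.sql.Timestamp

// Simplified model of the eviction rule (illustration only, not Spark source code).
case class TimeWindow(start: Timestamp, end: Timestamp)

def canDropState(window: TimeWindow, maxEventTimeMs: Long, lateThresholdMs: Long): Boolean = {
  val watermarkMs = maxEventTimeMs - lateThresholdMs
  // The comparison is against the window END: a window whose end has fallen
  // behind the watermark can no longer receive accepted late data.
  window.end.getTime <= watermarkMs
}
```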

Related

Does Watermark in Update output mode clean the stored state in Spark Structured Streaming?

I am working on a Spark Streaming application, and while trying to understand sinks and watermarking logic, I couldn't find a clear answer to this: if I use a watermark with, say, a 10-minute threshold while outputting the aggregations in update output mode, will the intermediate state maintained by Spark be cleared off after the 10-minute threshold has expired?
Watermarking allows late-arriving data to be considered for inclusion against already-computed results for a period of time, using windows. Its premise is that it tracks back to a point in time (the threshold) before which it is assumed no more late events are supposed to arrive; if they do arrive, they are discarded.
As a consequence, one needs to maintain the state of windows / aggregates already computed in order to handle these potential late updates based on event time. However, this costs resources, and if done indefinitely it would blow up a Structured Streaming app.
Will the intermediate state maintained by Spark be cleared off after the 10-minute threshold has expired? Yes, it will. This is by design: there is no point holding on to state that can no longer be updated because the threshold has expired.
You need to run through some simple examples, as it is easy to forget the subtleties of the output modes.
See
Why does streaming query with update output mode print out all rows?
which gives an excellent example of update mode output as well. Also this gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9
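As a minimal end-to-end sketch of the scenario in the question (the socket source, port, and input format are assumptions for illustration): a windowed count with a 10-minute watermark written in update mode. You can observe the state actually shrinking by inspecting `query.lastProgress.stateOperators`, whose `numRowsTotal` drops once windows fall behind the watermark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("UpdateModeWatermark").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed input: lines of the form "2024-01-01 10:00:00,word" on a local socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val events = lines.as[String]
  .map { line =>
    val Array(ts, word) = line.split(",", 2)
    (java.sql.Timestamp.valueOf(ts), word)
  }
  .toDF("timestamp", "word")

// 10-minute watermark: window state is kept only until the watermark
// (max event time seen - 10 minutes) passes the window's end, then it is dropped.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "10 minutes"), $"word")
  .count()

// Update mode: each trigger emits only the rows whose counts changed.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .start()

// After feeding data and advancing event time past a window's end + 10 minutes,
// numRowsTotal for the state operator decreases, confirming the cleanup.
// println(query.lastProgress.stateOperators.map(_.numRowsTotal).mkString(","))

query.awaitTermination()
```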

Watermarking in Windowed Grouped Aggregation

I was reading the Spark Streaming Programming Guide documentation and have a question about processing data based on event time. I have attached a screenshot from the documentation link which seems to show data being processed even before the event has occurred: the 12:21 event is processed in the 12:10 - 12:20 window. Is the image right, or am I wrong?
There's a late event at 12:13 which is also "owl". I think that is what is being shown in the 12:10 - 12:20 time range on the cut taken at 12:20.
I would expect the 12:21 owl event to show up in the 12:20 - 12:30 or 12:15 - 12:25 time range; however, these are not shown in the graph.
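For the sliding windows in that figure (10-minute windows sliding every 5 minutes), a tiny helper (illustration only, not a Spark API) shows which windows a given event-time minute lands in, which matches the expectation above:

```scala
// For 10-minute windows sliding every 5 minutes, list the (start, end) minute
// pairs of the windows that contain a given event-time minute.
def windowsFor(eventMinute: Int, windowLen: Int = 10, slide: Int = 5): Seq[(Int, Int)] = {
  val lastStart = (eventMinute / slide) * slide
  (lastStart to (lastStart - windowLen + slide) by -slide)
    .map(start => (start, start + windowLen))
    .filter { case (start, end) => start <= eventMinute && eventMinute < end }
}

// windowsFor(21)  // => Seq((20, 30), (15, 25)) : 12:21 belongs to 12:20-12:30 and 12:15-12:25
// windowsFor(13)  // => Seq((10, 20), (5, 15))  : the late 12:13 owl lands in 12:10-12:20
```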
When you work with real-time data there can be scenarios of late-arriving data, and the computation for this data has to be performed against an earlier window. In this scenario, the result of the earlier window is kept in memory and then aggregated with the late-arriving data. This can cause higher memory consumption, because historical state is stored in memory until the missing data arrives, which may lead to memory accumulation. For these scenarios, Spark Streaming has the watermarking feature, which discards late-arriving data once it crosses the threshold value.
In some cases, business results might not match because of discarding these values. To avoid this type of issue, instead of applying the watermarking feature, custom functionality has to be implemented to check the timestamp of the data and then store it in HDFS or any cloud-native object storage system to perform batch computations on it. This implementation adds complexity.

Spark 2.2 Structured Streaming - duplicate check without watermark - controlling the duplicates state size

We are implementing a real-time aggregation process with Spark 2.2 structured streaming.
The windows calculated can be based on event time, which comes as an attribute of the object being processed in the stream, or can be calculated according to the processing time of the system, as was possible in previous versions.
In addition we need to implement duplicate check to prevent the processing of duplicate events.
In the documentation it is stated that
Without watermark - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state. (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication)
The problem is that it looks like this state can grow forever and cause performance and memory-size issues, and there seems to be no way in the API to control it.
The question is: is there a way to implement a duplicate check on events that do not have a notion of event time and still maintain control over the duplicate-check state, like clearing it, etc.?
Thanks!
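For reference, a minimal sketch of the two deduplication variants the quoted documentation contrasts (the Kafka source options and the `guid` / `eventTime` column names are assumptions for illustration): without a watermark the key state is unbounded; with one, old keys can be evicted.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DedupSketch").getOrCreate()

// Assumed source and mapping: a Kafka topic whose value carries the event id.
val streamingDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS guid", "timestamp AS eventTime")

// Without a watermark: state for every guid ever seen is kept forever.
val dedupUnbounded = streamingDf.dropDuplicates("guid")

// With a watermark: duplicates arriving more than 10 minutes late are no longer
// matched, so the engine can drop old keys from state and bound its size.
val dedupBounded = streamingDf
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")
```

This does not by itself answer the "no event time" case; a commonly suggested direction there is arbitrary stateful processing (mapGroupsWithState / flatMapGroupsWithState with state timeouts), which lets you expire keys on your own terms.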

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented as follows (https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html):
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
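A minimal sketch of the incremental reduceByKeyAndWindow variant the quote describes (the socket source, port, and checkpoint path are placeholders): counts are "added" for batches entering the window and "subtracted" for batches leaving it, and checkpointing is required for this variant.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DayLongWindowedCounts").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")   // required for the inverse-reduce variant

val words = ssc.socketTextStream("localhost", 9999)   // placeholder source
val pairs = words.map(word => (word, 1L))

val dailyCounts = pairs.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // reduce: add counts for data entering the window
  (a: Long, b: Long) => a - b,   // inverse reduce: subtract counts for data leaving it
  Minutes(24 * 60),              // window length: one day
  Seconds(10)                    // slide interval: every batch
)

dailyCounts.print()
ssc.start()
ssc.awaitTermination()
```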
After researching this on the MapR forums, it seems that it does use a constant amount of memory, making a daily window possible, assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all your data has arrived. A 1-day streaming window would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.

cassandra kafka connect source and eventual consistency

I am thinking about using Kafka Connect to stream updates from Cassandra to a Kafka topic. The existing connector from StreamReactor seems to use a timestamp or timeuuid column to extract new changes since the last poll. The value of the timestamp is inserted using now() in the insert statement. The connector then saves the maximum timestamp it received last time.
Since Cassandra is eventually consistent, I am wondering what actually happens when doing repeated queries using a time range to get new changes. Is there not a risk of missing rows inserted into Cassandra because they "arrived late" at the queried node, when using WHERE create >= maxTimeFoundSoFar?
Yes, it might happen that you have newer data in front of your "cursor" when you have already moved on with processing, if you are using consistency level ONE for reading and writing; but even if you use a higher consistency level you might run into "problems" depending on your setup. Basically, there are a lot of things that can go wrong.
You can improve your chances of avoiding this by using the old Cassandra formula NUM_NODES_RESPONDING_TO_READ + NUM_NODES_RESPONDING_TO_WRITE > REPLICATION_FACTOR, but since you are using now() from Cassandra, the node clocks might have millisecond offsets between them, so you might still miss data if you have high-frequency data. I know of some systems where people are actually using Raspberry Pis with GPS modules to keep the clock skew really tight :)
You would have to provide more about your use case, but in reality, yes, you can totally skip some inserts if you are not "careful"; and even then there is no 100% guarantee, other than processing the data with some offset that is enough for the new data to come in and settle.
Basically you would have to keep a moving time window in the past and then move it along, while making sure that you don't take into account anything newer than, say, the last minute. That way you are making sure the data has "settled".
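As a sketch of that "settling" idea (all names here are made up for illustration; this is not the connector's code): compute the poll bounds so that the upper bound always lags now() by the settle period, and only advance the saved cursor to that settled bound.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical helper: never query right up to now(); leave a lag so
// late-arriving writes have time to settle before the cursor moves past them.
final case class PollBounds(from: Instant, to: Instant)

def nextPollBounds(lastMaxSeen: Instant, settleLag: java.time.Duration): PollBounds = {
  val settledUpperBound = Instant.now().minus(settleLag)
  PollBounds(from = lastMaxSeen, to = settledUpperBound)
}

// Example: with a one-minute lag the generated CQL range would look like
//   SELECT ... WHERE create >= <from> AND create < <to>
// where <to> is always at least one minute in the past.
val bounds = nextPollBounds(
  lastMaxSeen = Instant.now().minus(10, ChronoUnit.MINUTES),
  settleLag   = java.time.Duration.ofMinutes(1)
)
```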
I had some use cases where we processed sensor data that came in with multiple days of delay. On some projects we simply ignored it; on others the data was for month-level reporting, so we always processed the old data and added it to the reporting database, i.e. we kept a time window going 3 days back in history.
It just depends on your use case.
