Does Watermark in Update output mode clean the stored state in Spark Structured Streaming? - apache-spark

I am working on a Spark streaming application and, while reading about sinks and watermarking, I couldn't find a clear answer to this: if I use a watermark with, say, a 10 minute threshold while outputting the aggregations in update output mode, will the intermediate state maintained by Spark be cleared once the 10 minute threshold has expired?

A watermark allows late-arriving data to be considered for inclusion against already computed results for a period of time, using windows. Its premise is that it tracks back to a point in time (the threshold) before which it is assumed no more late events are supposed to arrive; if they do arrive, they are discarded.
As a consequence, one needs to maintain the state of the windows / aggregates already computed in order to handle these potential late updates based on event time. However, this costs resources, and if done indefinitely it would blow up a Structured Streaming app.
Will the intermediate state maintained by Spark be cleared off after the 10 min threshold has expired? Yes, it will. This is by design: there is no point holding on to state that can no longer be updated once the threshold has expired.
You need to run through some simple examples, as it is easy to forget the subtleties of the output modes.
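For instance, here is a minimal sketch (the socket source, column names and console sink are my own illustration, not taken from the question, and a SparkSession named spark is assumed): a 5-minute windowed count with a 10-minute watermark written in update mode. Only aggregates changed in a trigger are emitted, and the state of a window is dropped once the watermark passes the window's end.

import org.apache.spark.sql.functions.{col, window}

// Illustrative source: any streaming source with an event-time column works.
// In practice the event time would be parsed from the payload rather than generated.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .selectExpr("CAST(value AS STRING) AS word", "current_timestamp() AS eventTime")

val counts = events
  .withWatermark("eventTime", "10 minutes")                      // the 10 min threshold from the question
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))   // windowed aggregation keeps state per window
  .count()

counts.writeStream
  .outputMode("update")    // only rows updated since the last trigger are written
  .format("console")
  .start()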
See Why does streaming query with update output mode print out all rows?, which gives an excellent example of update mode output as well. This post also gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better, this blog comes with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9

Related

How spark structured streaming calculate watermark

However, to run this query for days, it's necessary for the system to bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees). Let's understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.
The above is copied from http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
According to the documentation's description ("For a specific window starting at time T"), T is the starting time of a given window.
I think the documentation is wrong; it should be the ending time of a given window.
I confirmed this by investigating the Spark code: the documentation is wrong, T is the ending time of the window.
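A small numeric illustration of the difference (my own example, not taken from the Spark code or documentation):

// Times in minutes since midnight. Window [12:00, 12:10), threshold 10 minutes,
// latest event time seen so far 12:15, so the watermark is 12:05.
val windowStart  = 12 * 60        // 12:00
val windowEnd    = 12 * 60 + 10   // 12:10
val maxEventTime = 12 * 60 + 15   // 12:15
val watermark    = maxEventTime - 10   // 12:05

// If T were the window START, the state would already be droppable here,
// and a late event for 12:00-12:10 with event time 12:03 would be lost:
val dropIfStart = watermark > windowStart   // true
// With T as the window END, the state is kept until the watermark passes 12:10,
// so late events within the threshold can still update the window:
val dropIfEnd   = watermark > windowEnd     // false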

Spark 2.2 Structured Streaming - duplicate check without watermark - controlling the duplicates state size

We are implementing a real-time aggregation process with Spark 2.2 structured streaming.
The windows calculated can be based on event time, which comes as an attribute on the object being processed in the stream, or on the processing time of the system, as was possible in previous versions.
In addition we need to implement duplicate check to prevent the processing of duplicate events.
In the documentation it is stated that
Without watermark - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state. (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication)
The problem is that it looks like this state can grow forever and cause performance and memory issues, and there seems to be no way in the API to control it.
The question is: is there a way to implement a duplicate check on events that have no notion of event time and still keep control over the duplicate-check state, e.g. clearing it?
Thanks!
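For reference, here is a sketch of the two situations the documentation describes (the eventId/eventTime column names are made up, and events is assumed to be a streaming Dataset); it does not solve the no-event-time case, it only shows why the state is bounded in one case and unbounded in the other:

// Bounded state: with a watermark, duplicates only need to be remembered
// for as long as the 10-minute threshold allows late data to arrive.
val dedupedBounded = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")

// Unbounded state: without a watermark, every eventId ever seen is kept.
val dedupedUnbounded = events.dropDuplicates("eventId")

For events without any usable event time, the workaround usually discussed is arbitrary stateful processing (mapGroupsWithState / flatMapGroupsWithState, available from Spark 2.2) with a processing-time timeout, so the deduplication state can be expired explicitly.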

cassandra kafka connect source and eventual consistency

I am thinking about using Kafka Connect to stream updates from Cassandra to a Kafka topic. The existing connector from StreamReactor seems to use a timestamp or timeuuid column to extract new changes since the last poll. The value of the timestamp is inserted using now() in the insert statement. The connector then saves the maximum timestamp it received on the last poll.
Since Cassandra is eventually consistent, I am wondering what actually happens when doing repeated queries with a time range to get new changes. Is there not a risk of missing rows inserted into Cassandra because they "arrived late" at the queried node when using WHERE create >= maxTimeFoundSoFar?
Yes, it might happen that newer data turns up in front of your "cursor" when you have already moved on with processing, if you are using consistency level ONE for reading and writing; but even with higher consistency levels you might run into "problems", depending on your setup. Basically, there are a lot of things that can go wrong.
You can reduce the chances of this by using the old Cassandra formula NUM_NODES_RESPONDING_TO_READ + NUM_NODES_RESPONDING_TO_WRITE > REPLICATION_FACTOR, but since you are using now() from Cassandra, the node clocks might have millisecond offsets between them, so you might still miss data if you have high-frequency writes. I know of some systems where people actually use Raspberry Pis with GPS modules to keep the clock skew really tight :)
You would have to tell us more about your use case, but in reality, yes, you can totally skip some inserts if you are not "careful". Even then there is no 100% guarantee, other than processing the data with an offset large enough for new data to come in and settle.
Basically you would have to keep a moving time window in the past and advance it, while making sure you never take into account anything newer than, say, the last minute. That way you make sure the data has "settled".
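A rough sketch of that idea (the events table name is made up, the create column comes from the question, and the one-minute buffer is arbitrary):

import java.time.Instant
import java.time.temporal.ChronoUnit

// Poll only up to (now - settleBuffer) so rows that reach the queried node
// late still have time to appear before the cursor moves past them.
def buildPollQuery(lastMaxSeen: Instant, settleBufferMinutes: Long = 1): String = {
  val upperBound = Instant.now().minus(settleBufferMinutes, ChronoUnit.MINUTES)
  s"SELECT * FROM events WHERE create >= '$lastMaxSeen' AND create < '$upperBound'"
}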
I had some use cases where we processed sensor data that came in with multiple days of delay. On some projects we simply ignored it; on others the data was for month-level reporting, so we always processed the old data and added it to the reporting database, i.e. we kept a time window reaching 3 days back in history.
It just depends on your use case.

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to Spark Streaming and I need some basic clarification that I couldn't fully get from reading the documentation.
The use case is that I have a set of files containing dumped EVENTS, and each event already has a TIMESTAMP field inside it.
At the moment I'm loading these files and extracting all the events into a JavaRDD, and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is whether it is possible to process these events using the EVENT TIMESTAMP as the temporal reference instead of the actual time of the machine (sorry for the silly question).
If that is possible, will plain Spark Streaming be enough, or do I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
Yes, you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, and how much of that complexity you address depends on the cost of inaccuracy in the streaming job's output.
Spark Streaming makes no guarantee that events will be processed in any particular order. To impose ordering, you will need to set up a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter, how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event time with Spark SQL (in either batch or Structured Streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket your data into 1-minute intervals and give you the count for each bucket.
import org.apache.spark.sql.functions.window   // for the window() grouping expression
// the $"..." column syntax requires import spark.implicits._ from your SparkSession

df.groupBy(window($"timestamp", "1 minute") as 'time)
  .count()
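If the replay is done as a streaming query instead of a batch job, roughly the same code applies. A sketch (eventsStream is assumed to be a streaming DataFrame with a timestamp column; the watermark lets Spark eventually drop state for old buckets):

import spark.implicits._                         // assumes a SparkSession named spark
import org.apache.spark.sql.functions.window

val counts = eventsStream
  .withWatermark("timestamp", "10 minutes")      // tolerate up to 10 minutes of lateness
  .groupBy(window($"timestamp", "1 minute") as 'time)
  .count()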

Spark streaming with Checkpoint

I am a beginner with Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduceByKey over a window for this, where my window duration is 24 hours and the slide duration is 5 minutes. I am writing the processed record to MongoDB, currently replacing the existing record each time. But I see the memory slowly increasing over time until the process is killed after about 1.5 hours (on an AWS small instance). The DB write after the restart clears all the old data. So I understand checkpointing is the solution for this, but my doubts are:
What should my checkpoint duration be? As per the documentation it should be 5-10 times the slide duration, but I need the data for the entire day. So is it OK to keep it at 24 hours?
Where ideally should the checkpoint be? Right after I receive the stream, just before the window operation, or after the data reduction has taken place?
Appreciate your help.
Thank you
In streaming scenarios, holding 24 hours of data is usually too much. To solve that, you use probabilistic methods instead of exact measures for the streaming computation and perform a later batch computation to get the exact numbers (if needed).
In your case, to get a distinct count you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called Algebird) from Spark Streaming here.
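Roughly, the Algebird-based approach looks like this (a sketch only, assuming Algebird on the classpath and a DStream[String] of user ids called userIds):

import com.twitter.algebird.HyperLogLogMonoid

// 12 bits gives roughly a 1-2% standard error while using only a few KB of state.
val hll = new HyperLogLogMonoid(12)

val approxUsers = userIds
  .map(id => hll(id.getBytes("UTF-8")))   // one tiny HLL sketch per user id
  .reduce(_ + _)                          // merge the sketches within each batch

approxUsers.foreachRDD { rdd =>
  rdd.collect().foreach(sketch => println(s"Approx. unique users: ${sketch.estimatedSize}"))
}

To cover a whole day, you would merge these per-batch sketches (they combine with +) into a daily sketch stored, for example, in your MongoDB record, instead of holding 24 hours of raw ids in a window.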
