I was reading the Spark Streaming Programming Guide and have a question about processing data based on event time. I have attached a screenshot from the documentation which appears to show data being processed even before the event has occurred: the 12:21 event is processed in the 12:10 - 12:20 window. Is the image right, or am I wrong?
There's a late event at 12:13 which is also owl. I think that is what is being shown in the 12:10 - 12:20 time range on the cut taken at 12:20.
I would expect the 12:21 owl event to show up in the 12:20 - 12:30 or 12:15 - 12:25 time range; however, those windows are not shown in the graph.
When you work with real-time data there can be late-arriving records, and their computation has to be applied against an earlier window. In this scenario, the result of the earlier window is kept in memory and then aggregated with the late-arriving data. This can cause higher memory consumption, because the historical state is held in memory until the missing data arrives, which may lead to memory accumulation. For these scenarios, Spark Streaming has a watermarking feature which discards late-arriving data once it crosses the threshold value.
In some cases, business results may not match because these values are discarded. To avoid such issues, instead of applying the watermarking feature, you have to implement custom functionality that checks the timestamp of the data and stores it in HDFS or any cloud-native object storage system, so that batch computations can be performed on it later. This implementation adds complexity.
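For illustration, here is a minimal Structured Streaming sketch of the watermarking approach described above. It assumes a streaming DataFrame `events` with an event-time column `eventTime` and a string column `word`; the names are illustrative, not from the original question.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, window}

    // Sketch only: rows arriving more than 10 minutes behind the maximum event
    // time seen so far are dropped, and state for windows older than that is
    // cleaned up instead of accumulating in memory.
    def windowedWordCounts(events: DataFrame): DataFrame =
      events
        .withWatermark("eventTime", "10 minutes")
        .groupBy(
          window(col("eventTime"), "10 minutes", "5 minutes"), // 10-minute windows, sliding every 5 minutes
          col("word"))
        .count()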
Related
I am working on a Spark streaming application, and while studying the sinks and watermarking logic I couldn't find a clear answer to this: if I use a watermark with, say, a 10-minute threshold while outputting the aggregations in update output mode, will the intermediate state maintained by Spark be cleared after the 10-minute threshold has expired?
Watermarking allows late-arriving data to be considered for inclusion against already computed results for a period of time, using windows. Its premise is that it tracks back to a point in time (the threshold) before which it is assumed no more late events are supposed to arrive; if they do arrive, they are discarded.
As a consequence, one needs to maintain the state of the window / aggregate already computed in order to handle these potential late updates based on event time. However, this costs resources, and if done indefinitely it would blow up a Structured Streaming app.
Will the intermediate state maintained by Spark be cleared after the 10-minute threshold has expired? Yes, it will. This is by design: there is no point holding on to state that can no longer be updated because the threshold has expired.
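As a rough sketch of that setup (the column names are illustrative; `events` is assumed to be a streaming DataFrame with an event-time column `timestamp`), the state for a window is dropped once the watermark, i.e. the maximum event time seen minus the 10-minute threshold, moves past the window's end:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, window}
    import org.apache.spark.sql.streaming.StreamingQuery

    // Sketch only: 5-minute tumbling windows with a 10-minute watermark, written
    // out in update mode so only rows changed since the last trigger are emitted.
    def startCounts(events: DataFrame): StreamingQuery =
      events
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(col("timestamp"), "5 minutes"))
        .count()
        .writeStream
        .outputMode("update")
        .format("console")
        .start()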
You need to run through some simple examples, as I note it is easy to miss the subtleties of the output.
See
Why does streaming query with update output mode print out all rows?
which gives an excellent example of update mode output as well. Also this gives an even better update example: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Even better - this blog with some good graphics: https://towardsdatascience.com/watermarking-in-spark-structured-streaming-9e164f373e9
In my Structured Streaming job, I set the watermark to 1 hour.
I am doing a window operation for each 10 minutes.
I received an event that arrived 20 minutes late.
Will the corresponding window be calculated or not?
Watermarking allows late-arriving data to be considered for inclusion against already computed results for a period of time, using windows. Its premise is that it tracks to a point in time before which it is assumed no more late events are supposed to arrive; if they do, they are nonetheless discarded. There are various modes of operation.
Excellent examples on https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time with nice diagrams to complement.
Your question: yes, the example you quote will include the late data, as in this case, as you describe it, the event falls within the 1-hour threshold.
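A minimal sketch of that scenario (the column names are illustrative): with a 1-hour watermark and 10-minute windows, an event arriving 20 minutes late is still within the threshold, so the state for its window is still held and gets updated.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, window}

    // Sketch only: events up to 1 hour late still update their 10-minute window.
    def lateTolerantCounts(events: DataFrame): DataFrame =
      events
        .withWatermark("eventTime", "1 hour")            // tolerate data up to 1 hour late
        .groupBy(window(col("eventTime"), "10 minutes")) // tumbling 10-minute windows
        .count()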
However, to run this query for days, it’s necessary for the system to bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees).

Let’s understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.
Above is copied from http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
From the above documentation description ("For a specific window starting at time T"), T is the starting time of a given window.
I think the document is wrong; it should be the ending time of a given window.
I have confirmed by investigating the Spark code that the documentation is wrong: T is the ending time of the window.
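For illustration, here is a small worked example of the cleanup condition under that reading (the timestamps are made up): take a 10-minute late threshold and a window covering 12:00-12:10, so T = 12:10 is the window's end.

    import java.time.{Duration, LocalTime}

    // Sketch only: state for a window ending at T is kept while
    // (max event time seen - threshold) <= T, and dropped afterwards.
    val threshold = Duration.ofMinutes(10)
    val windowEnd = LocalTime.of(12, 10) // window 12:00-12:10, so T = 12:10

    def stateKept(maxEventTime: LocalTime): Boolean =
      !maxEventTime.minus(threshold).isAfter(windowEnd)

    println(stateKept(LocalTime.of(12, 15))) // true: watermark 12:05 <= 12:10, state kept
    println(stateKept(LocalTime.of(12, 21))) // false: watermark 12:11 > 12:10, state dropped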
I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented at https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html:
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
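For concreteness, here is a minimal sketch of the incremental variant described above, under assumed names (a DStream of (key, count) pairs called `pairs` and a hypothetical checkpoint directory):

    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    // Sketch only: counts over a 1-day window sliding every 5 minutes, maintained
    // incrementally by adding data entering the window and subtracting data leaving it.
    def dailyWindowedCounts(ssc: StreamingContext, pairs: DStream[(String, Long)]): DStream[(String, Long)] = {
      ssc.checkpoint("hdfs:///tmp/checkpoints") // checkpointing must be enabled; the path is illustrative
      pairs.reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b, // reduce: add counts for new data entering the window
        (a: Long, b: Long) => a - b, // inverse reduce: subtract counts for old data leaving the window
        Minutes(24 * 60),            // window length: one day
        Minutes(5)                   // slide interval: 5 minutes
      )
    }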
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly a day. If your data is tagged with dates, then you need to wait until all your data arrives. A one-day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.
I am a beginner with Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users per day. I am using reduceByKeyAndWindow for this, where my window duration is 24 hours and the slide duration is 5 minutes. I am writing the processed record to MongoDB, currently replacing the existing record each time. But I see that memory slowly increases over time and kills the process after about 1.5 hours (on an AWS small instance). The DB write after the restart clears all the old data, so I understand checkpointing is the solution for this. But my doubts are:
What should my checkpoint duration be? As per the documentation it should be 5-10 times the slide duration, but I need the data for the entire day. So is it OK to keep it at 24 hours?
Where, ideally, should the checkpoint be: when I first receive the stream, just before the window operation, or after the data reduction has taken place?
Appreciate your help.
Thank you
In streaming scenarios, holding 24 hours of data is usually too much. To solve that, you use probabilistic methods instead of exact measures for streaming and perform a later batch computation to get the exact numbers (if needed).
In your case, to get a distinct count you can use an algorithm called HyperLogLog. You can see an example of using Twitter's implementation of HyperLogLog (part of a library called Algebird) with Spark Streaming here
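As a rough sketch of that approach (assuming a DStream of user-ID strings called `userIds`; the names and the 12-bit precision are illustrative): each ID becomes a small HyperLogLog sketch, and the sketches are merged over the 24-hour window instead of keeping every ID in memory.

    import com.twitter.algebird.{HLL, HyperLogLogMonoid}
    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    // Sketch only: approximate distinct users over a sliding 24-hour window,
    // re-evaluated every 5 minutes. Merging two HLL sketches with `+` is
    // associative and commutative, so the window reduce stays cheap.
    def approxDailyUniques(userIds: DStream[String]): DStream[Long] = {
      val hllMonoid = new HyperLogLogMonoid(12) // 2^12 registers, roughly 1-2% error
      userIds
        .map(id => hllMonoid.create(id.getBytes("UTF-8")))
        .reduceByWindow(_ + _, Minutes(24 * 60), Minutes(5))
        .map(_.estimatedSize.toLong)
    }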