I know that Spark supports windowing operations based on window length and sliding interval. But does it also support aggregating data based on event count (rather than time)? For example, finding the minimum value over 1000 events of each kind?
Thanks
I have a streaming dataframe and I want to calculate some daily counters.
So far, I have been using tumbling windows with a watermark, as follows:
# assuming `df` is the streaming DataFrame with a `timestamp` column
from pyspark.sql.functions import window
daily_counts = df \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "1 day")) \
    .count()
My question is whether this is the best way (resource-wise) to do this daily aggregation, or whether I should instead perform a series of aggregations on smaller windows (say, hourly or even less) and then aggregate those hourly counters to obtain the daily count.
Moreover, if I try the second approach, meaning the smaller windows, how can I do this?
I cannot perform both aggregations (the hourly and the daily) within the same Spark Structured Streaming application; I keep getting the following error:
Multiple streaming aggregations are not supported with streaming
DataFrames/Datasets.
Should I therefore use one Spark application to post the hourly aggregations to a Kafka topic, read that stream from another Spark application, and perform the daily sum-up there?
If yes, then how should I handle the "update" output mode in the producer? The second application would keep receiving updated values for the same window from the first application, and so the sum-up would be wrong.
Moreover, adding a trigger does not help with the watermark either, since any late events that arrive will cause an update to a previous counter, and I would run into the same problem again.
I think you should perform the aggregation on the shortest time span required and then perform a secondary aggregation on those primary aggregates. Performing a single aggregation over a full day could OOM your job, if not now then definitely in the future.
Perform primary aggregations (hourly or even 5-minute counts) and record them in a time-series DB like Prometheus or Graphite (see the sketch below).
Use Grafana to plot those metrics and perform secondary aggregations, like the daily count, on top of the primary aggregations.
This adds some DevOps effort, but in return you can visually monitor your application in real time.
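A minimal Scala sketch of such a primary aggregation, assuming a streaming DataFrame with a timestamp column; the rate source, the application name, the 5-minute window and the console sink are all placeholders (in practice the sink would be Kafka or a metrics exporter feeding the time-series DB):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("primary-agg").getOrCreate()
import spark.implicits._

// placeholder source for the sketch; replace with your real event stream
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// 5-minute tumbling-window counts with a 10-minute watermark for late data
val fiveMinuteCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

// console sink as a stand-in for Kafka / a metrics exporter
fiveMinuteCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()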
Say I have streaming data with the following schema:
uid: string
ts: timestamp
Now assume the data has been partitioned by uid (within each partition the data volume is minimal, e.g. less than 1 row/sec).
I would like to put the data (in each partition) into windows based on the event time ts, then sort all the elements within each window (based on ts as well), and finally apply a custom transformation to each element in the window, in order.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize all the elements in a window into a list?
Q2: If Q1 is possible, I would like to set a watermark and trigger combination that fires once at the end of the window, and then either fires periodically or fires every time late data arrives. Is that possible?
Before I answer the questions, let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation (with explicit state logic), which gives you the most control for manual streaming state management.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize all the elements in a window into a list?
That sounds like a streaming join, with the input stream on the left and the aggregated stream (the streaming aggregation) on the right. That should be doable (though I'm leaving out example code as I'm still not sure I understood the question correctly).
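That said, one straightforward reading of Q1 can be sketched with the standard collect_list aggregate, which materializes each window's rows into an array. This is my own minimal sketch, assuming a streaming Dataset named events with the uid and ts columns from the question; the window length and watermark delay are placeholders:
import org.apache.spark.sql.functions.{window, collect_list, struct}
import spark.implicits._   // `spark` is the active SparkSession

val windowed = events
  .withWatermark("ts", "10 minutes")                        // allowed lateness is an assumption
  .groupBy($"uid", window($"ts", "1 minute"))               // per-uid event-time windows
  .agg(collect_list(struct($"ts", $"uid")).as("elements"))  // all rows of the window as a list
// The "elements" array can then be sorted by ts and transformed element by element downstream.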
Q2: If Q1 is possible, I would like to set a watermark and trigger combination that fires once at the end of the window, and then either fires periodically or fires every time late data arrives. Is that possible?
Use the window standard function to define the window and a watermark to "close" windows at the proper times. That is also doable (but again no example, as I'm not sure about the merit of the question).
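For what it's worth, a minimal sketch of that combination, reusing the windowed Dataset from the sketch above (the trigger interval is an assumption): in append output mode a window is emitted exactly once, after the watermark passes the window's end, while update mode re-emits a window's row whenever data, including late data, changes it.
import org.apache.spark.sql.streaming.Trigger

windowed.writeStream
  .outputMode("append")                         // each window is emitted once, when the watermark closes it
  .trigger(Trigger.ProcessingTime("1 minute"))  // check for closable windows every minute
  .format("console")
  .start()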
Is the watermark in Structured Streaming always set using processing time, or event time, or both?
In Structured Streaming 2.2, the streaming watermark is tracked based on event time, as defined by the eventTime column in the Dataset.withWatermark operator.
withWatermark Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.
That gives you event time watermark by default.
But your initial Dataset may have no event time column at all, in which case you can auto-generate one using the current_date or current_timestamp functions (or some other way) at processing time. That would give you a processing-time watermark (based on the custom-generated column).
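A minimal sketch of that processing-time variant, assuming a streaming DataFrame named inputStream with no event time column (the column name and watermark delay are mine):
import org.apache.spark.sql.functions.current_timestamp

// generate a "processing time" column at ingestion and use it as the watermark column
val withProcTime = inputStream.withColumn("proc_time", current_timestamp())
val watermarked  = withProcTime.withWatermark("proc_time", "10 minutes")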
In the most generic solution, using KeyValueGroupedDataset.flatMapGroupsWithState, you can use pre-defined strategies or write a custom one. That's why it's called the solution for arbitrary stateful aggregations in Structured Streaming.
flatMapGroupsWithState Applies the given function to each group of data, while maintaining a user-defined per-group state.
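To show the shape of that API, here is a small, self-contained sketch of flatMapGroupsWithState with an event-time timeout; the Event case class, the per-key running count and all names are my assumptions, not code from the question:
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._   // `spark` is the active SparkSession

case class Event(uid: String, ts: Timestamp)

// per-key state: a running count of events seen so far
def updateState(uid: String, events: Iterator[Event],
                state: GroupState[Long]): Iterator[(String, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  Iterator((uid, newCount))
}

val counts = eventStream              // a streaming DataFrame with uid and ts columns (assumed)
  .withWatermark("ts", "10 minutes")
  .as[Event]
  .groupByKey(_.uid)
  .flatMapGroupsWithState(
    OutputMode.Update(), GroupStateTimeout.EventTimeTimeout())(updateState _)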
I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be possible and viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
The function (reduceByKeyAndWindow with an inverse reduce function, invFunc) is documented at https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html as:
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
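A hedged Scala sketch of that inverse-reduce variant, assuming a StreamingContext named ssc and a DStream of (key, count) pairs named pairs; the checkpoint directory, window length and slide interval are placeholders:
import org.apache.spark.streaming.Minutes

ssc.checkpoint("/tmp/checkpoints")   // checkpointing must be enabled for the inverse-reduce variant

val dailyCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce: add counts entering the window
  (a: Int, b: Int) => a - b,   // "inverse reduce": subtract counts leaving the window
  Minutes(24 * 60),            // window length: one day
  Minutes(5)                   // slide interval
)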
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly a day. If your data is tagged with dates, then you need to wait until all of it arrives. A 1-day window in streaming would only be good if you were literally just analyzing the last 24 hours of data, regardless of its content.
I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continuous stream of events and are looking for patterns like: event A (device disconnect) is followed by event B (device reconnect) within 10 seconds, or event A (disconnect) is not followed by event B (reconnect) within 10 seconds.
I was thinking of using the window function to group the dataset into 10-second window buckets and checking for the pattern every time the window values are updated. It looks like the window function is really just used as a groupBy in Structured Streaming, which forces me to use aggregate functions to get high-level aggregates on column values.
I am wondering if there is a way to loop through all values of a column when using the window function in Structured Streaming.
You might want to try using mapGroupsWithState (Structured Streaming) or mapWithState (DStreams); it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state every time an update comes. You can also set a timeout for each key, after which its state will be removed. For your use case, you could store the initial state for event A as the timestamp of when A arrived, and when event B comes you can check whether its timestamp is within 10s of A. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the initial state when A comes, set the timeout to 10s, and if A's state is still around when B comes, then generate an event.
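A hedged sketch of the first approach with mapGroupsWithState: remember the disconnect timestamp per device and check whether a reconnect follows within 10 seconds. The DeviceEvent/Reconnected case classes, field names and event-type strings are my assumptions, not the asker's schema:
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._   // `spark` is the active SparkSession

case class DeviceEvent(deviceId: String, eventType: String, ts: Timestamp)
case class Reconnected(deviceId: String, withinTenSeconds: Boolean)

// state per device: epoch millis of the last "disconnect" event
def detect(deviceId: String, events: Iterator[DeviceEvent],
           state: GroupState[Long]): Reconnected = {
  var result = Reconnected(deviceId, withinTenSeconds = false)
  events.toSeq.sortBy(_.ts.getTime).foreach { e =>
    e.eventType match {
      case "disconnect" => state.update(e.ts.getTime)
      case "reconnect" if state.exists =>
        result = Reconnected(deviceId, e.ts.getTime - state.get <= 10000L)
        state.remove()
      case _ =>   // ignore other event types
    }
  }
  result
}

val matches = deviceEvents.as[DeviceEvent]   // a streaming DataFrame of device events (assumed)
  .groupByKey(_.deviceId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(detect _)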
There is a good blog post on the differences between mapGroupsWithState and mapWithState.