Apache Spark Structured Streaming for Window Aggregation and Custom Triggering

Apache Spark Structured Streaming for Window Aggregation and Custom Triggering - apache-spark

Say I have some a streaming data of the schema as follows:
uid: string
ts: timestamp
Now assuming the data has been partitioned by uid (in each partition, the data is minimal, e.g. less than 1 row/sec).
I would like to put the data (in each partition) into windows based on event time ts, then sort all the elements within each window (based on ts as well), at last apply a custom transformation on each of the element in the window in order.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize the all the elements in a window into a list?
Q2: If Q1 is possible, I would like to set a watermark and trigger combination, which triggers once at the end of the window, then either trigger periodically or trigger every time late data arrives. Is it possible?

Before I answer the questions let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation (with explicit state logic) that gives you the most for a manual streaming state management.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize the all the elements in a window into a list?
That sounds like a streaming join where you have the input stream on your left and the aggregated stream (streaming aggregation) on your right. That should be doable (but leaving it with no example code as I'm still unsure if I understood your question right).
Q2: If Q1 is possible, I would like to set a watermark and trigger combination, which triggers once at the end of the window, then either trigger periodically or trigger every time late data arrives. Is it possible?
Use window standard function to define the window and a watermark to "close" windows at proper times. That is also doable (but no example again as I'm not sure of the merit of the question).

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using structured streaming in spark but I'm struggeling to understand the data kept in memory. Currently I'm running Spark 2.4.7 which says (Structured Streaming Programming Guide)
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
Which I understand as that Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinetly.
I understand the concept and why it is good, for example when I want to aggregaet based on event-time I can use withWatermarkto tell spark which column that is the event-time and then specify how late I want to receive data, and let spark know to throw everything older than that.
However lets say I want to aggregate on something that is not event-time. I have a usecase where each message in kafka contains an array of datapoints. So, I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg e.t.c.). So my question is, will Spark keep all "old" data since that how Structured Streaming work which will lead to OOM-issues? And is the only way to prevent this to add a "fictional" withWatermark on for example the time i received the message and include this in my groupByas well?
And the other usecase, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along, I only care about the current "batch". Will spark in that case also keep all old messages forcing me to to a"fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking for example max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message seperatly, but then I loose all the good things about Strucutred Streaming.

Yes watermarking is necessary to bound the result table and to add event time in groupby.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that ?
And watermarking is "strictly" required only if you have aggregation or join to avoid late events being missed in the aggregation/join(and affect the output) but not for events which just needed to transform and flow since output will not have any effect by late events but if you want very late events to be dropped you might want to add watermarking. Some links to refer.
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba

How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Is there anyone who knows how to achieve this? I tried to use the window functions examples of the Structured Streaming documentation but it was useless.
Thank you very much

Since we're speaking about Structured Streaming and "every line and for every user" that tells me that you should use a streaming query with some sort of streaming aggregation (groupBy and groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That gives that records for a single user could be part of two different micro-batches. That gives that you need a state.
That all together gives that you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
A state would be per user with the last record found. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in a stream)? (which would keep the state of a reasonable size)

Is watermark based on processing time or event time or both?

Is watermark in Structured Streaming always set using processing time or event time or both?

In Structured Streaming 2.2 streaming watermark is tracked based on event time as defined by eventTime column in Dataset.withWatermark operator.
withWatermark Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.
That gives you event time watermark by default.
But your initial Dataset can have no event time column initially and thus you can auto-generate one using current_date or current_timestamp functions or some other way at processing time. That would give you processing time watermark (based on the custom-generated column).
In the most generic solution using KeyValueGroupedDataset.flatMapGroupsWithState, you can pre-define the strategies or write a custom one. That's why they call it a solution for Arbitrary Stateful Aggregations in Structured Streaming.
flatMapGroupsWithState Applies the given function to each group of data, while maintaining a user-defined per-group state.

Get all rows of a window in Spark structured streaming

I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continues stream of events and are looking for patterns like event A (device disconnect) is followed by event B (device reconnect) within 10 seconds. or event A (disconnect) is not followed by event B (reconnect) within 10 seconds.
I was thinking of using a window function grouping dataset into 10 seconds window buckets and checking for the pattern every time the window values are updated. It looks like the window function is really used as a groupBy in structured streaming which forces me to use aggregate functions to get high level agg on column values.
I am wondering if there is a way to loop through all values of the column when using window function in structured streaming.

You might want to try using mapGroupsWithState (structured streaming) or mapWithState (DStreams), it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state everytime an update comes. You can also set a timeout for each key after which its state will get removed. For your use case, you could store the initial state for event A as the timestamp of when A arrived, and when event B comes you can check if the timestamp is within 10s of A. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the initial state when A comes, set the timeout to 10s, and if A is still around when B comes then generate an event.
Good blog post on the differences b/w mapGroupsWithState and mapWithState

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to spark streaming and I need some basic clarification that I couldn't fully understand reading the documentation.
The use case is that I have a set of files containing dumping EVENTS, and each events has already inside a field TIMESTAMP.
At the moment I'm loading this file and extracting all the events in a JavaRDD and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is if it is possible to process these event using the EVENT TIMESTAMP as temporal reference instead of the actual time of the machine (sorry for the silly question).
In case it is possible, will I need simply spark streaming or I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance

TL;DR
yes you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, how much of the complexity you address depends on the cost of inaccuracy in the Streaming job output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to setup a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.

It is easy to do aggregations based on event-time with Spark SQL (in either batch or structured streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket you data into 1 minute intervals and give you the count for each bucket.
df.groupBy(window($"timestamp", "1 minute") as 'time)
.count()

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string