Get all rows of a window in Spark structured streaming - apache-spark

I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continues stream of events and are looking for patterns like event A (device disconnect) is followed by event B (device reconnect) within 10 seconds. or event A (disconnect) is not followed by event B (reconnect) within 10 seconds.
I was thinking of using a window function grouping dataset into 10 seconds window buckets and checking for the pattern every time the window values are updated. It looks like the window function is really used as a groupBy in structured streaming which forces me to use aggregate functions to get high level agg on column values.
I am wondering if there is a way to loop through all values of the column when using window function in structured streaming.

You might want to try using mapGroupsWithState (structured streaming) or mapWithState (DStreams), it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state everytime an update comes. You can also set a timeout for each key after which its state will get removed. For your use case, you could store the initial state for event A as the timestamp of when A arrived, and when event B comes you can check if the timestamp is within 10s of A. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the initial state when A comes, set the timeout to 10s, and if A is still around when B comes then generate an event.
Good blog post on the differences b/w mapGroupsWithState and mapWithState

Related

How to prevent Spark from keeping old data leading to out of memory in Spark Structured Streaming

I'm using structured streaming in spark but I'm struggeling to understand the data kept in memory. Currently I'm running Spark 2.4.7 which says (Structured Streaming Programming Guide)
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.
Which I understand as that Spark appends all incoming data to an unbounded table, which never gets truncated, i.e. it will keep growing indefinetly.
I understand the concept and why it is good, for example when I want to aggregaet based on event-time I can use withWatermarkto tell spark which column that is the event-time and then specify how late I want to receive data, and let spark know to throw everything older than that.
However lets say I want to aggregate on something that is not event-time. I have a usecase where each message in kafka contains an array of datapoints. So, I use explode_outer to create multiple rows for each message, and for these rows (within the same message) I would like to aggregate based on message-id (getting max, min, avg e.t.c.). So my question is, will Spark keep all "old" data since that how Structured Streaming work which will lead to OOM-issues? And is the only way to prevent this to add a "fictional" withWatermark on for example the time i received the message and include this in my groupByas well?
And the other usecase, where I do not even want to do a groupBy, I simply want to do some transformation on each message and then pass it along, I only care about the current "batch". Will spark in that case also keep all old messages forcing me to to a"fictional" withWatermark along with a groupBy (including message-id in the groupBy and taking for example max of all columns)?
I know I can move to the good old DStreams to eliminate my issue and simply handle each message seperatly, but then I loose all the good things about Strucutred Streaming.
Yes watermarking is necessary to bound the result table and to add event time in groupby.
https://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Any reason why you want to avoid that ?
And watermarking is "strictly" required only if you have aggregation or join to avoid late events being missed in the aggregation/join(and affect the output) but not for events which just needed to transform and flow since output will not have any effect by late events but if you want very late events to be dropped you might want to add watermarking. Some links to refer.
https://medium.com/#ivan9miller/spark-streaming-joins-and-watermarks-2cf4f60e276b
https://blog.clairvoyantsoft.com/watermarking-in-spark-structured-streaming-a1cf94a517ba

How to specify retention time for stream-stream joins?

I'd like to understand which is the retention time for structured streaming in spark.
I've different spark structured streaming streams:
Stream A: it arrives every 10 seconds, starting from time t0;
Stream B: it arrives every 10 seconds, starting from time t0;
Stream C: it arrives every 10 seconds, starting from time t1;
I need to apply a machine learning model using a pandas udf on these data. Stream A and stream B go indipendentely.
Data from stream C need to be joined with Stream A and B, before being processed.
My question is: how I ensure that data that are processed in Stream A and Stream B are not thrown away? Just using watermark is sufficient to achieve this?
how I ensure that data that are processed in Stream A and Stream B are not thrown away? Just using watermark is sufficient to achieve this?
That's right. The state of a stream-stream join is kept forever so the first question of yours is handled out of the box while the second requires a watermark and "additional join conditions".
Quoting Inner Joins with optional Watermarking:
Inner joins on any kind of columns along with any kind of join conditions are supported. However, as the stream runs, the size of streaming state will keep growing indefinitely as all past input must be saved as any new input can match with any input from the past. To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input.

How to compute difference between timestamps with PySpark Structured Streaming

I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Is there anyone who knows how to achieve this? I tried to use the window functions examples of the Structured Streaming documentation but it was useless.
Thank you very much
Since we're speaking about Structured Streaming and "every line and for every user" that tells me that you should use a streaming query with some sort of streaming aggregation (groupBy and groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That gives that records for a single user could be part of two different micro-batches. That gives that you need a state.
That all together gives that you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
A state would be per user with the last record found. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in a stream)? (which would keep the state of a reasonable size)

Structured streaming multiple watermarks

I use Spark 2.3.0 if that matters.
According to the Structured Streaming documentation, it handles late data using watermarks. It also mentions that streaming deduplication is also achieved by using watermarking to keep a limit on how much of enter link description here an intermediate state is stored.
So, my question is if these watermarks can have different values or is the watermark specified only once? I ask this because I will be deduplicating values after aggregation so the tolerance for handling late data is different.
From the Policy for handling multiple watermarks:
A streaming query can have multiple input streams that are unioned or joined together. Each of the input streams can have a different threshold of late data that needs to be tolerated for stateful operations. You specify these thresholds using withWatermarks("eventTime", delay) on each of the input streams.
While executing the query, Structured Streaming individually tracks the maximum event time seen in each input stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark with them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for example, one of the streams stop receiving data due to upstream failures). In other words, the global watermark will safely move at the pace of the slowest stream and the query output will be delayed accordingly.
Since Spark 2.4, you can set the multiple watermark policy to choose the maximum value as the global watermark by setting the SQL configuration spark.sql.streaming.multipleWatermarkPolicy to max (default is min).
In fact, this also applies to any watermark-sensitive operator.

Apache Spark Structured Streaming for Window Aggregation and Custom Triggering

Say I have some a streaming data of the schema as follows:
uid: string
ts: timestamp
Now assuming the data has been partitioned by uid (in each partition, the data is minimal, e.g. less than 1 row/sec).
I would like to put the data (in each partition) into windows based on event time ts, then sort all the elements within each window (based on ts as well), at last apply a custom transformation on each of the element in the window in order.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize the all the elements in a window into a list?
Q2: If Q1 is possible, I would like to set a watermark and trigger combination, which triggers once at the end of the window, then either trigger periodically or trigger every time late data arrives. Is it possible?
Before I answer the questions let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation (with explicit state logic) that gives you the most for a manual streaming state management.
Q1: Is there any way to get an aggregated view of the window, but keep each element, e.g. materialize the all the elements in a window into a list?
That sounds like a streaming join where you have the input stream on your left and the aggregated stream (streaming aggregation) on your right. That should be doable (but leaving it with no example code as I'm still unsure if I understood your question right).
Q2: If Q1 is possible, I would like to set a watermark and trigger combination, which triggers once at the end of the window, then either trigger periodically or trigger every time late data arrives. Is it possible?
Use window standard function to define the window and a watermark to "close" windows at proper times. That is also doable (but no example again as I'm not sure of the merit of the question).

Resources