I'd like to understand what the retention time is for Structured Streaming in Spark.
I have several Spark Structured Streaming streams:
Stream A: arrives every 10 seconds, starting from time t0;
Stream B: arrives every 10 seconds, starting from time t0;
Stream C: arrives every 10 seconds, starting from time t1;
I need to apply a machine learning model to these data using a pandas UDF. Stream A and Stream B are processed independently.
Data from Stream C need to be joined with Streams A and B before being processed.
My question is: how do I ensure that the data processed in Stream A and Stream B are not thrown away? Is using a watermark alone sufficient to achieve this?
How do I ensure that the data processed in Stream A and Stream B are not thrown away? Is using a watermark alone sufficient to achieve this?
That's right. By default, the state of a stream-stream join is kept forever, so your first question is handled out of the box, while the second requires a watermark and "additional join conditions".
Quoting Inner Joins with optional Watermarking:
Inner joins on any kind of columns along with any kind of join conditions are supported. However, as the stream runs, the size of streaming state will keep growing indefinitely as all past input must be saved as any new input can match with any input from the past. To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input.
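To make those two conditions concrete, here is a minimal PySpark sketch along the lines of the impressions/clicks example in the Spark docs; the rate source and the column names are placeholders rather than anything from the question, and spark is assumed to be an active SparkSession:

from pyspark.sql.functions import expr

# Both inputs are watermarked, and the join condition bounds how far apart the
# two event times may be, so old state can eventually be dropped.
impressions = spark.readStream.format("rate").load() \
    .selectExpr("value AS ad_id", "timestamp AS impression_time") \
    .withWatermark("impression_time", "2 hours") \
    .alias("impressions")

clicks = spark.readStream.format("rate").load() \
    .selectExpr("value AS ad_id", "timestamp AS click_time") \
    .withWatermark("click_time", "3 hours") \
    .alias("clicks")

joined = impressions.join(
    clicks,
    expr("""
        impressions.ad_id = clicks.ad_id AND
        click_time >= impression_time AND
        click_time <= impression_time + interval 1 hour
    """),
    "inner"
)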
Related
I have two streaming datasets, let's call them fastStream and slowStream.
The fastStream is a streaming dataset that I am consuming from Kafka via the structured streaming API. I am expecting to receive potentially thousands of messages a second.
The slowStream is actually a reference (or lookup) table that is being 'upserted' by another stream and contains data that I want to join onto each message in the fastStream before I save the records to a table. The slowStream is only updated when someone changes the metadata, which can happen at any time, but we would expect it to change maybe once every few days.
Each record in the fastStream will have exactly one corresponding message in the slowStream and I essentially want to make that join happen immediately with whatever data is in the slowStream table. I don't want to wait to see if a potential match could occur if new data arrives in the slowStream.
The problem that I have is that according to the Spark docs:
Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results.
I have tried adding a watermark to the fastStream, but I think it has no effect, since the docs indicate that the watermarked columns need to be referenced in the join condition.
Ideally I would write something like:

from pyspark.sql.functions import expr

# Apply a watermark to the fast stream
fastStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/fastStream") \
    .withWatermark("timestamp", "1 hour") \
    .alias("fastStream")

# The slowStream cannot be watermarked since it is only slowly changing
slowStream = spark.readStream \
    .format("delta") \
    .load("dbfs:/mnt/some_file/slowStream") \
    .alias("slowStream")

# Prevent the join from buffering the fast stream by 'telling' Spark that
# there will never be new matches.
fastStream.join(
    slowStream,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > watermark
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")
But I don't think you can reference the watermark in the SQL expression.
Essentially, while I'm happy to have the slowStream buffered (so the whole table is in memory), I can't have the fastStream buffered, as that table would quickly consume all memory. Instead, I would simply like to drop messages from the fastStream that aren't matched, rather than retaining them to see if they might match in the future.
Any help very gratefully appreciated.
For inner stream-stream joins, watermarking and event-time constraints (time range join conditions) are optional.
If unbounded state is not an issue for you in terms of volume, you can choose not to specify them. In that case, all data will be buffered and your data from the fastStream will immediately be joined with all the data from the slowStream.
Only when both parameters are specified will your state be cleaned up. Note the purpose of those two parameters:
Event-time constraint (time range join condition): What is the maximum time range between the generation of the two events at their respective sources?
Watermark: What is the maximum duration an event can be delayed in transit between the source and the processing engine?
To define the two parameters you first need to answer the above questions (which are quoted from the book "Learning Spark, 2nd Edition", published by O'Reilly).
Regarding your code comment:
"Prevent the join from buffering the fast stream by 'telling' spark that there will never be new matches."
Remember that buffering in a stream-stream join is necessary; otherwise you would only be able to join the data available within the current micro-batch. As the slowStream does not receive regular updates while the fastStream updates quite fast, you would probably never get any join matches at all without buffering the data.
Overall, for the use case you are describing ("join fast-changing data with slow-changing metadata"), it is usually better to use a stream-static join approach where the slow-changing data becomes the static part.
In a stream-static join, every row in the stream data is joined with the full static data, and the static table is loaded in every single micro-batch. If loading the static table reduces your performance, you may consider caching it and refreshing it regularly, as described in Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically.
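As a rough sketch of that stream-static approach (the Delta paths and column names are carried over from the question, and spark is assumed to be an active session):

from pyspark.sql.functions import expr

# Static side: a plain (non-streaming) read of the slowly changing Delta table.
# As noted above, it is loaded for every micro-batch, so metadata updates are picked up.
slow_static = spark.read.format("delta") \
    .load("dbfs:/mnt/some_file/slowStream") \
    .alias("slowStream")

# Streaming side: no watermark needed, because an inner stream-static join is stateless.
fast_stream = spark.readStream.format("delta") \
    .load("dbfs:/mnt/some_file/fastStream") \
    .alias("fastStream")

joined = fast_stream.join(
    slow_static,
    expr("fastStream.slow_id = slowStream.id"),
    "inner"
).select("fastStream.*", "slowStream.metadata")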
Answering my own question with what I ended up going with. It's certainly not ideal but for all my searching, there doesn't seem to be the control within Spark structured streaming to address this use case.
So my solution was to read the dataset and conduct the join inside a foreachBatch. This way I prevent Spark from storing a ton of unnecessary state and get the joins conducted immediately. On the downside, there seems to be no way to incrementally read a streamed table, so instead I am re-reading the entire table every time...
from pyspark.sql.functions import expr

def join_slow_stream(df, batch_id):
    # Re-alias the micro-batch so the qualified column names below resolve
    df = df.alias("fastStream")

    # Read the slow table as a static table rather than a stream
    slowdf = spark.read \
        .format("delta") \
        .load("dbfs:/mnt/some_file/slowStream") \
        .alias("slowStream")

    out_df = df.join(
        slowdf,
        expr("fastStream.slow_id = slowStream.id"),
        "inner"
    ).select("fastStream.*", "slowStream.metadata")

    # write data to the database (db_con is my own connector)
    db_con.write(out_df)

fastStream.writeStream.foreachBatch(join_slow_stream).start()
If you are interested in referencing the "time that was watermarked", i.e. the 1 hour, you may replace watermark in the expression with current_timestamp - interval '1' hour.
Since you are attempting to join two streams, Spark will insist that both use watermarks.
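A rough, untested sketch of what that suggestion might look like (it assumes the slowStream also carries an event-time column named timestamp, and whether this actually lets Spark purge the fastStream state should be verified):

from pyspark.sql.functions import expr

fast = fastStream.withWatermark("timestamp", "1 hour").alias("fastStream")
slow = slowStream.withWatermark("timestamp", "1 hour").alias("slowStream")

joined = fast.join(
    slow,
    expr("""
        fastStream.slow_id = slowStream.id
        AND fastStream.timestamp > current_timestamp - interval 1 hour
    """),
    "inner"
).select("fastStream.*", "slowStream.metadata")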
Reference
Spark Stream to Stream Joins
I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Does anyone know how to achieve this? I tried the window function examples from the Structured Streaming documentation, but they didn't help.
Thank you very much
Since we're speaking about Structured Streaming and "every line and for every user", that tells me you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That means records for a single user could be part of two different micro-batches, which in turn means you need state.
Altogether, that means you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
The state would be kept per user, holding the last record seen. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in a stream)? (which would keep the state of a reasonable size)
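The mapGroupsWithState / flatMapGroupsWithState operators are Scala/Java APIs; since the question is about PySpark, here is a rough sketch of the same idea using applyInPandasWithState, which is only available from Spark 3.4 onwards. The column names (user, timestamp) and the events DataFrame are assumptions:

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout
from pyspark.sql.types import LongType, StringType, StructField, StructType, TimestampType

output_schema = StructType([
    StructField("user", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("interval_seconds", LongType()),
])
# State: epoch seconds of the user's previous event
state_schema = StructType([StructField("last_epoch", LongType())])

def interval_per_user(key, pdf_iter, state: GroupState):
    (user,) = key
    last_epoch = state.get[0] if state.exists else None
    rows = []
    for pdf in pdf_iter:
        for ts in sorted(pdf["timestamp"]):
            epoch = int(ts.timestamp())
            # -1 marks the first event ever seen for this user
            interval = epoch - last_epoch if last_epoch is not None else -1
            rows.append((user, ts, interval))
            last_epoch = epoch
    state.update((last_epoch,))
    yield pd.DataFrame(rows, columns=["user", "timestamp", "interval_seconds"])

intervals = events \
    .groupBy("user") \
    .applyInPandasWithState(
        interval_per_user,
        outputStructType=output_schema,
        stateStructType=state_schema,
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )

Note that this only orders events within a micro-batch; late or out-of-order data arriving in a later micro-batch is not corrected, which ties back to the state clean-up concerns above.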
I use Spark 2.3.0 if that matters.
According to the Structured Streaming documentation, it handles late data using watermarks. It also mentions that streaming deduplication is achieved by using watermarking to keep a limit on how much intermediate state is stored.
So, my question is whether these watermarks can have different values or whether the watermark is specified only once. I ask because I will be deduplicating values after aggregation, so the tolerance for handling late data is different.
From the Policy for handling multiple watermarks:
A streaming query can have multiple input streams that are unioned or joined together. Each of the input streams can have a different threshold of late data that needs to be tolerated for stateful operations. You specify these thresholds using withWatermarks("eventTime", delay) on each of the input streams.
While executing the query, Structured Streaming individually tracks the maximum event time seen in each input stream, calculates watermarks based on the corresponding delay, and chooses a single global watermark with them to be used for stateful operations. By default, the minimum is chosen as the global watermark because it ensures that no data is accidentally dropped as too late if one of the streams falls behind the others (for example, one of the streams stops receiving data due to upstream failures). In other words, the global watermark will safely move at the pace of the slowest stream and the query output will be delayed accordingly.
Since Spark 2.4, you can set the multiple watermark policy to choose the maximum value as the global watermark by setting the SQL configuration spark.sql.streaming.multipleWatermarkPolicy to max (default is min).
In fact, this also applies to any watermark-sensitive operator.
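For illustration, a hedged sketch (stream and column names are placeholders) of two inputs with different lateness tolerances; note that the max policy needs Spark 2.4+, so it is not available on the 2.3.0 version mentioned in the question:

# Two input streams, each with its own tolerated delay
left = stream_a.withWatermark("eventTime", "10 minutes")
right = stream_b.withWatermark("eventTime", "3 hours")
unioned = left.union(right)

# Spark 2.4+ only: make the global watermark follow the fastest stream ("max")
# instead of the default slowest-stream behaviour ("min")
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")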
I have the following scenario:
A mobile app produces events that are sent to an Event Hub, which is the input stream source for a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into two streams based on criteria, evaluates other conditions, and decides whether or not to let the event keep flowing through the pipeline (if not, it is simply discarded). You could classify what we are doing as noise reduction/event filtering. Basically, if A just happened, don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside of a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down in the chain. It appears that join only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: an output going to an Azure Queue, then an Azure Function triggered by events on that queue that wakes up and posts them to a different Event Hub, which is mapped as a recirculation feed input back into my queries that I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but I'm thinking there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which gets the previous event in a data stream.
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream. For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)
I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continuous stream of events and are looking for patterns such as event A (device disconnect) being followed by event B (device reconnect) within 10 seconds, or event A (disconnect) not being followed by event B (reconnect) within 10 seconds.
I was thinking of using a window function to group the dataset into 10-second window buckets and to check for the pattern every time the window values are updated. It looks like the window function is really used as a groupBy in Structured Streaming, which forces me to use aggregate functions to get high-level aggregations on column values.
I am wondering if there is a way to loop through all values of the column when using a window function in Structured Streaming.
You might want to try using mapGroupsWithState (Structured Streaming) or mapWithState (DStreams); it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state every time an update comes. You can also set a timeout for each key, after which its state will be removed. For your use case, you could store the initial state for event A as the timestamp of when A arrived, and when event B comes you can check whether its timestamp is within 10s of A. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the initial state when A comes, set the timeout to 10s, and if A is still around when B comes then generate an event.
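A rough PySpark sketch of that idea: the Scala mapGroupsWithState API is not exposed in Python, so applyInPandasWithState (Spark 3.4+) with a processing-time timeout is used instead, and the device_id / event_type / timestamp columns and the 'disconnect' / 'reconnect' values are assumptions:

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout
from pyspark.sql.types import LongType, StringType, StructField, StructType

output_schema = StructType([
    StructField("device_id", StringType()),
    StructField("outcome", StringType()),   # "reconnected" or "not_reconnected"
])
state_schema = StructType([StructField("disconnect_epoch", LongType())])

def detect_pattern(key, pdf_iter, state: GroupState):
    (device_id,) = key
    if state.hasTimedOut:
        # No reconnect arrived within the timeout window
        state.remove()
        yield pd.DataFrame([(device_id, "not_reconnected")],
                           columns=["device_id", "outcome"])
        return
    rows = []
    for pdf in pdf_iter:
        for _, row in pdf.sort_values("timestamp").iterrows():
            epoch = int(row["timestamp"].timestamp())
            if row["event_type"] == "disconnect":
                state.update((epoch,))
                state.setTimeoutDuration(10 * 1000)   # 10 seconds (processing time)
            elif row["event_type"] == "reconnect" and state.exists:
                if epoch - state.get[0] <= 10:
                    rows.append((device_id, "reconnected"))
                state.remove()
    yield pd.DataFrame(rows, columns=["device_id", "outcome"])

patterns = events \
    .groupBy("device_id") \
    .applyInPandasWithState(
        detect_pattern,
        outputStructType=output_schema,
        stateStructType=state_schema,
        outputMode="update",
        timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
    )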
A good blog post on the differences between mapGroupsWithState and mapWithState: