Azure Stream Analytics Get Previous Output Row for Join to Input

I have the following scenario:
Mobile app produces events that are sent to Event Hub, which is the input stream source for a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into 2 streams based on criteria, evaluates other conditions, and decides whether or not to let the event keep flowing through the pipeline (if it doesn't, it is simply discarded). You could classify what we are doing as noise reduction/event filtering: basically, if A just happened, don't let A happen again unless B and C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside of a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down in the chain. It appears that join only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: I have an output defined going to an Azure Queue, then an Azure Function triggered by events on that queue wakes up and posts the event to a different Event Hub that is mapped as a recirculation-feed input back into my queries, which I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but I'm thinking there has to be a better option for this relatively common pattern.

The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which returns the previous event in a data stream.
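For illustration, here is a minimal sketch of that combination; the names (InputStream, EventType, Accepted) and the one-hour lookback are assumptions, and it presumes the "chosen" rule can be phrased as a predicate over the input itself. LAG's WHEN clause then looks back to the last event satisfying that predicate, which approximates "the previous chosen event":
WITH Candidates AS (
    SELECT
        EventType,
        LAG(EventType) OVER (LIMIT DURATION(hour, 1) WHEN Accepted = 1) AS PreviousChosenType
    FROM InputStream
)
SELECT *
INTO [output]
FROM Candidates
WHERE PreviousChosenType IS NULL
   OR PreviousChosenType <> EventType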
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics

AFAIK, the Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate it with your data stream (see the sketch after the links below). For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)
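As a minimal sketch (input aliases and column names are hypothetical), a reference data input is joined to the stream with a plain equality JOIN; unlike a stream-to-stream join, no DATEDIFF condition is required because reference data is not temporal:
SELECT
    s.DeviceId,
    s.Temperature,
    r.DeviceName
INTO [output]
FROM [input-stream] s
JOIN [device-catalog] r
    ON s.DeviceId = r.DeviceId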

Related

Is an Azure Stream Analytics job with sparse/no output possible?

Not the best question, I'm aware, but I couldn't find much information on this and currently lack the time to test it.
Is it in principle possible with Azure Stream Analytics to only forward an output if some condition is met? In the docs it states: "You can use a single output per job, or multiple outputs per streaming job (if you need them) by adding multiple INTO clauses to the query."
So, for example, would it be possible to do something like INSERT INTO a destination only if some condition is met, and not produce an output otherwise (or would this raise an error)?
Well, whenever some input event does not match the WHERE condition that you specify in your query, the event will just be discarded.
https://learn.microsoft.com/en-us/stream-analytics-query/where-azure-stream-analytics
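For example, a query shaped like the following (input, destination, and the condition are hypothetical) writes a row only when the WHERE clause holds and silently produces no output otherwise; no error is raised:
SELECT *
INTO [destination]
FROM [input]
WHERE Temperature > 100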

Azure Stream Analytics - no input events when using reference data

My Azure Stream Analytics Job does not detect any input events if I use reference data in the query. When I'm using only streaming data it works well.
Here is my query:
SELECT
    v.localization AS Station,
    v.lonn AS Station_Longitude,
    v.latt AS Station_Latitude,
    d.lat AS My_Latitude,
    d.lon AS My_Longitude
INTO [closest-station]
FROM eventhub d
CROSS JOIN [stations] v
WHERE ST_DISTANCE(CreatePoint(d.lat, d.lon), CreatePoint(v.latt, v.lonn)) < 300
I used Event Hub and blob as the input and the result was the same: it works only without reference data.
Before you ask:
When I'm testing the query with sample reference data (I'm uploading the exact same file as stored in the reference data location) it returns expected values
I've tested both inputs and tests were conducted successfully
The data comes from a Logic App which copies it from Dropbox to the Event Hub or storage account (I've tested both scenarios) that is used as the input in Azure Stream Analytics. Even though I can see this runs successfully, still no input events appear in ASA.
The idea is to get the coordinates of the stations closer than 300 m to my location.
Solved - you have to explicitly specify the reference file in the reference data input path pattern. Specifying the container only doesn't work, even if there is only one file inside; the Stream Analytics job will wait indefinitely for the blob to become available.
As described here: Use reference data for lookups in Stream Analytics
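For illustration, assuming the file is stations.csv in a container named reference-data (both names hypothetical), the reference data input configuration should point at the blob itself:
Container: reference-data
Path pattern: stations.csv
A path pattern that names only the container, or is left empty, leaves the job waiting for a blob it never resolves, as described above.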

How to specify retention time for stream-stream joins?

I'd like to understand what the retention time is for Structured Streaming in Spark.
I have several Spark Structured Streaming streams:
Stream A: it arrives every 10 seconds, starting from time t0;
Stream B: it arrives every 10 seconds, starting from time t0;
Stream C: it arrives every 10 seconds, starting from time t1;
I need to apply a machine learning model using a pandas UDF on these data. Streams A and B are processed independently.
Data from stream C needs to be joined with streams A and B before being processed.
My question is: how do I ensure that the data processed in streams A and B are not thrown away? Is just using a watermark sufficient to achieve this?
how do I ensure that the data processed in streams A and B are not thrown away? Is just using a watermark sufficient to achieve this?
That's right. The state of a stream-stream join is kept forever, so your first question is handled out of the box, while the second requires a watermark and "additional join conditions".
Quoting Inner Joins with optional Watermarking:
Inner joins on any kind of columns along with any kind of join conditions are supported. However, as the stream runs, the size of streaming state will keep growing indefinitely as all past input must be saved as any new input can match with any input from the past. To avoid unbounded state, you have to define additional join conditions such that indefinitely old inputs cannot match with future inputs and therefore can be cleared from the state.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input.
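To make the two bullets concrete, here is a hedged Scala sketch of a watermarked stream-stream inner join; the rate sources, the column names (id, eventTimeA, eventTimeC), and the delay values are all assumptions, not from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

// Stand-in sources; replace with the real streams A and C.
val streamA = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTimeA")
  .withColumnRenamed("value", "id")
val streamC = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTimeC")
  .withColumnRenamed("value", "id")

// Watermarks bound how late each side may be ...
val a = streamA.withWatermark("eventTimeA", "30 seconds").as("a")
val c = streamC.withWatermark("eventTimeC", "30 seconds").as("c")

// ... and the event-time range constraint lets the engine age rows out
// of the join state instead of keeping them forever.
val joined = c.join(a, expr("""
  a.id = c.id AND
  eventTimeA BETWEEN eventTimeC - INTERVAL 10 SECONDS
                 AND eventTimeC + INTERVAL 10 SECONDS
"""))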

Apache Spark Structured Streaming for Window Aggregation and Custom Triggering

Say I have some streaming data with the following schema:
uid: string
ts: timestamp
Now assuming the data has been partitioned by uid (in each partition, the data is minimal, e.g. less than 1 row/sec).
I would like to put the data (in each partition) into windows based on the event time ts, then sort all the elements within each window (based on ts as well), and finally apply a custom transformation to each element in the window, in order.
Q1: Is there any way to get an aggregated view of the window but keep each element, e.g. materialize all the elements in a window into a list?
Q2: If Q1 is possible, I would like to set a watermark and trigger combination which triggers once at the end of the window, and then either triggers periodically or triggers every time late data arrives. Is that possible?
Before I answer the questions, let me point out that Spark Structured Streaming offers KeyValueGroupedDataset.flatMapGroupsWithState (after Dataset.groupByKey) for arbitrary stateful streaming aggregation (with explicit state logic), which gives you the most control for manual streaming state management.
Q1: Is there any way to get an aggregated view of the window but keep each element, e.g. materialize all the elements in a window into a list?
That sounds like a streaming join where you have the input stream on your left and the aggregated stream (streaming aggregation) on your right. That should be doable (though I'm leaving it without example code as I'm still unsure I understood your question correctly).
Q2: If Q1 is possible, I would like to set a watermark and trigger combination which triggers once at the end of the window, and then either triggers periodically or triggers every time late data arrives. Is that possible?
Use the window standard function to define the window and a watermark to "close" windows at the proper times. That is also doable (but again no example, as I'm not sure of the merit of the question).
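For a concrete starting point, here is a hedged sketch of Q1 under the stated schema (uid: string, ts: timestamp); the rate source, the 10-minute window, and the watermark delay are assumptions. collect_list materializes every element of a window into a list, and sort_array orders the collected structs by their first field (ts), which also covers the in-window sort:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sort_array, struct, window}

val spark = SparkSession.builder.appName("windowed-elements").getOrCreate()
import spark.implicits._

// Stand-in source producing (uid, ts); replace with the real stream.
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "ts")
  .withColumn("uid", ($"value" % 10).cast("string"))

val windowed = events
  .withWatermark("ts", "10 minutes")               // lets the engine close windows
  .groupBy($"uid", window($"ts", "10 minutes"))
  .agg(sort_array(collect_list(struct($"ts", $"uid"))) as "elements")
Regarding Q2: in append output mode each window is emitted exactly once, when the watermark passes its end; re-emitting on late arrivals would instead call for update mode.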

Get all rows of a window in Spark structured streaming

I have a use case where we need to find patterns in data within a window. We are experimenting with Structured Streaming. We have a continuous stream of events and are looking for patterns like: event A (device disconnect) is followed by event B (device reconnect) within 10 seconds, or event A (disconnect) is not followed by event B (reconnect) within 10 seconds.
I was thinking of using a window function to group the dataset into 10-second window buckets and checking for the pattern every time the window values are updated. It looks like the window function is really used as a groupBy in Structured Streaming, which forces me to use aggregate functions to get high-level aggregations on column values.
I am wondering if there is a way to loop through all values of a column when using the window function in Structured Streaming.
You might want to try using mapGroupsWithState (Structured Streaming) or mapWithState (DStreams); it sounds like it could work well for your case.
You can keep arbitrary state for any key and update the state every time an update comes. You can also set a timeout for each key, after which its state will be removed. For your use case, you could store the timestamp of when event A arrived as the initial state, and when event B comes you can check whether its timestamp is within 10s of A. If it is, generate an event.
You might also be able to use timeouts for this, e.g. set the state when A comes and set the timeout to 10s; since the timeout clears the state, if A's state is still around when B arrives then B came within 10 seconds, and you can generate an event.
A good blog post on the differences between mapGroupsWithState and mapWithState
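To make that concrete, here is a hedged Scala sketch using mapGroupsWithState with a processing-time timeout; the Event and Alert types, the field names, and the 10-second constant are assumptions, not from the question (a production version would more likely use GroupStateTimeout.EventTimeTimeout with a watermark):
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(deviceId: String, eventType: String, ts: Timestamp)
case class Alert(deviceId: String, reconnected: Boolean)

// Per device: remember when the disconnect arrived; the timeout fires if no
// reconnect shows up in time, yielding a "not followed by B" alert.
def trackReconnect(
    deviceId: String,
    events: Iterator[Event],
    state: GroupState[Timestamp]): Option[Alert] = {
  if (state.hasTimedOut) {
    state.remove()
    Some(Alert(deviceId, reconnected = false))              // A not followed by B
  } else {
    var alert: Option[Alert] = None
    events.toSeq.sortBy(_.ts.getTime).foreach { e =>
      e.eventType match {
        case "disconnect" =>
          state.update(e.ts)                                // remember when A arrived
          state.setTimeoutDuration("10 seconds")
        case "reconnect" if state.exists =>
          if (e.ts.getTime - state.get.getTime <= 10000)
            alert = Some(Alert(deviceId, reconnected = true)) // B within 10s of A
          state.remove()
        case _ =>                                           // ignore other events
      }
    }
    alert
  }
}

// Wiring it up, assuming events is a streaming Dataset[Event]:
// events.groupByKey(_.deviceId)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(trackReconnect)
//   .flatMap(_.toSeq)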
