Not the best question, I'm aware, but I couldn't find much information on this and I currently lack the time to test it.
Is it in principle possible with Azure Stream Analytics to only forward an output if some condition is met? In the docs it states: "You can use a single output per job, or multiple outputs per streaming job (if you need them) by adding multiple INTO clauses to the query."
So, for example, would it be possible to do something like insert INTO destination only if some condition is met, and not produce an output otherwise (or would this raise an error)?
Well, whenever some input event does not match the WHERE condition that you specify in your query, the event will just be discarded.
https://learn.microsoft.com/en-us/stream-analytics-query/where-azure-stream-analytics
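For illustration, a minimal sketch of such a query; the alias names input and destination and the eventType column are placeholders, not anything from your job:

```sql
-- Only events satisfying the WHERE clause reach the output; the rest are
-- simply discarded and no error is raised.
-- "input", "destination" and "eventType" are placeholder names.
SELECT *
INTO destination
FROM input
WHERE eventType = 'alert'
```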
Related
I'm using Azure Data Factory to build some file-to-database imports, and one of the requirements is that if a file isn't valid, e.g. a column is missing or contains incorrect data (wrong data type, a lookup value that doesn't exist in the database), then an alert is sent detailing the errors. The errors should be plain human readable, so rather than a SQL error saying the insert would violate a foreign key, it should say that an incorrect value was entered for x.
This doc (https://learn.microsoft.com/en-us/azure/data-factory/how-to-data-flow-error-rows) describes a way of using conditional splits to add custom validation, which would certainly let me import the good data and write the bad data to another file with custom error messages. But how can I then trigger an alert from this? As far as I can tell, the data flow will still report success, and something like calling a Logic App to send an email needs to be done in the pipeline rather than in the data flow.
That's a good point, but couldn't you write the bad records to an error table/file and then produce an aggregated summary (how many records erred, counts of the specific errors) that is passed to a Logic App or the SendGrid API to alert interested parties of the status? It would be a post-data-flow completion activity that checks whether there is an error file or error records in the table; if so, it aggregates and classifies them, then alerts.
I have a similar notification process that gives me successful/erred pipeline notifications, as well as 30 day pipeline statistics... % pipeline successful, average duration, etc.
I’m not at my computer right now, otherwise I’d give more detail with examples.
To catch the scenario where the rows copied and rows written are not equal, maybe you can use the output of the Copy activity, and if the difference is not 0, send an alert.
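For reference, a hypothetical If Condition expression over the Copy activity's output could look like the following; the activity name 'Copy data1' is a placeholder, and this assumes the rowsRead/rowsCopied properties of the Copy activity output:

```
@not(equals(activity('Copy data1').output.rowsRead, activity('Copy data1').output.rowsCopied))
```

When the expression evaluates to true, the True branch can call a Logic App or Web activity that sends the alert.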
I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference of the timestamps.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Does anyone know how to achieve this? I tried the window-function examples from the Structured Streaming documentation, but they didn't help.
Thank you very much
Since we're speaking about Structured Streaming and "every line and for every user", that tells me you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).
For streaming aggregation in Structured Streaming you can only rely on micro-batch stream execution. That means records for a single user could be part of two different micro-batches, which in turn means you need state.
All together, that means you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
A state would be per user with the last record found. That looks doable.
My concerns would be:
How many users is this streaming query going to deal with? (the more the bigger the state)
When to clean up the state (of users that are no longer expected in the stream)? (which would keep the state at a reasonable size)
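Putting that together, a minimal sketch in Scala (the API quoted above is the Scala one; newer PySpark versions expose a similar arbitrary stateful operation, applyInPandasWithState). The socket source, the line format and the column names are assumptions:

```scala
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, eventTime: Timestamp)
case class EventWithInterval(userId: String, eventTime: Timestamp, intervalSeconds: Long)

val spark = SparkSession.builder.appName("user-intervals").getOrCreate()
import spark.implicits._

// Assumption: events arrive on a socket as lines like "userA,2017-01-01 08:00:00".
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(user, ts) = line.split(",", 2)
    Event(user.trim, Timestamp.valueOf(ts.trim))
  }

// State per user: the timestamp of that user's previous event.
val withIntervals = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.NoTimeout) {
    (userId: String, rows: Iterator[Event], state: GroupState[Timestamp]) =>
      var previous = state.getOption
      val out = rows.toSeq.sortBy(_.eventTime.getTime).map { e =>
        // 0 for a user's very first event; otherwise seconds since the previous one.
        val interval = previous.map(p => (e.eventTime.getTime - p.getTime) / 1000L).getOrElse(0L)
        previous = Some(e.eventTime)
        EventWithInterval(e.userId, e.eventTime, interval)
      }
      previous.foreach(state.update)
      out.iterator
  }

withIntervals.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()
```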
Not sure the title is well suited to what I'm trying to achieve, so bear with me.
I'll start with defining my use case:
Many (say millions of) IoT devices are sending data to my Spark stream. These devices send their current temperature level every 10 seconds.
The owner of all of these IoT devices has the ability to define preset rules, for example: if temperature > 50 then do something.
I'm trying to figure out if I can output how many of these devices have met this "if > 50" criterion in some time period. The catch is that the rules are defined in real time and should be applied to the Spark job in real time.
How would I do that? Is Spark the right tool for the job?
Many thanks
Is Spark the right tool for the job?
I think so.
the rules are defined in real time and should be applied to the Spark job at real time.
Let's assume the rules are in a database, so every batch interval Spark would fetch them and apply them one by one. They could also be in a file or any other storage; that's orthogonal to the main requirement.
How would I do that?
The batch interval would be "some time period". I assume that the payload has deviceId and temperature. With that you can just use a regular filter over temperature and get the deviceId back. You don't need a stateful pipeline for this unless you want to accumulate data over a period longer than your batch interval.
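A minimal sketch of that idea in Scala, using Structured Streaming's foreachBatch so the rules are re-read every micro-batch; the Kafka source, the JDBC rules table and all connection details are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("iot-rules").getOrCreate()

// Assumption: readings arrive as JSON events with deviceId and temperature.
val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "temperatures")                // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .selectExpr("from_json(json, 'deviceId STRING, temperature DOUBLE') AS e")
  .select("e.*")

val processBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  // Re-read the rules every micro-batch so rules defined "in real time" take effect.
  val rules = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db/rules")      // placeholder
    .option("dbtable", "rules")                       // placeholder: (ruleId, threshold)
    .load()
    .collect()

  rules.foreach { r =>
    val threshold = r.getAs[Double]("threshold")
    val matched = batch.filter(col("temperature") > threshold)
      .select("deviceId").distinct().count()
    println(s"batch $batchId, rule ${r.getAs[Long]("ruleId")}: $matched devices over $threshold")
  }
}

readings.writeStream
  .foreachBatch(processBatch)
  .start()
  .awaitTermination()
```

Re-reading the rules on every micro-batch is the simplest way to pick up rules defined in real time; if the rules table grows large, caching it and refreshing it on an interval would be the obvious optimization.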
I have the following scenario:
A mobile app produces events that are sent to an Event Hub, which is the input stream source for a Stream Analytics query. From there they pass through a sequential flow of queries that splits the stream into two streams based on criteria, evaluates other conditions, and decides whether or not to let the event keep flowing through the pipeline (if not, it is simply discarded). You could classify what we are doing as noise reduction/event filtering: basically, if A just happened, don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down in the chain. It appears that join only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production: I have an output defined going to an Azure Queue, then an Azure Function triggered by events on that queue wakes up and posts them to a different Event Hub, which is mapped as a recirculation-feed input back into my queries and which I can join to. I'm still wiring all of that up, so I'm not 100% sure it will work, but I'm thinking there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, that gets the previous event in a data stream.
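A minimal sketch of what that could look like; the column names deviceId and eventType are assumptions about your payload, and the one-hour look-back is arbitrary:

```sql
-- Compare each event to the previous event in the stream using LAG inside a WITH step.
WITH Flagged AS (
    SELECT
        deviceId,
        eventType,
        LAG(eventType) OVER (PARTITION BY deviceId LIMIT DURATION(hour, 1)) AS previousEventType
    FROM input
)
SELECT deviceId, eventType
INTO output
FROM Flagged
WHERE previousEventType IS NULL OR previousEventType != eventType
```

Note that LAG also accepts a WHEN clause, so it can look back to the most recent event that satisfied a condition, which may get you closer to "the previous chosen event" if being chosen can be expressed as a predicate on the event itself.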
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream (a minimal sketch follows the links below). For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)
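For illustration, a minimal sketch of a reference data join; rules as the reference input alias and the column names are placeholders:

```sql
-- "input" is the data stream input, "rules" is a reference data input (e.g. a blob).
SELECT
    i.deviceId,
    i.temperature,
    r.threshold
INTO output
FROM input i
JOIN rules r
    ON i.deviceId = r.deviceId
```

Unlike a stream-to-stream join, a join against reference data does not require a DATEDIFF time bound.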
Is it possible to get multiple DStreams out of a single DStream in Spark?
My use case is as follows: I am getting a stream of log data from an HDFS file.
Each log line contains an id (id=xyz).
I need to process log lines differently based on the id.
So I was trying to create a different DStream for each id from the input DStream.
I couldn't find anything related in the documentation.
Does anyone know how this can be achieved in Spark, or can point me to a relevant link?
Thanks
You cannot split a single DStream into multiple DStreams.
The best you can do is:
1. Modify your source system to produce different streams for different IDs; then you can have different jobs process the different streams.
2. If your source cannot change and gives you a stream that is a mix of IDs, then you need to write custom logic to identify the ID and perform the appropriate operation (see the sketch below).
I would always prefer #1 as that is the cleaner solution, but there are exceptions for which #2 needs to be implemented.
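A minimal sketch of option #2 in Scala with the DStream API; the HDFS directory, the id=xyz log format and the specific ids are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("split-by-id")
val ssc = new StreamingContext(conf, Seconds(10))

// Assumption: new log files land in this HDFS directory.
val lines = ssc.textFileStream("hdfs:///logs/incoming")

// Extract the id from each line; lines without an id are dropped.
val idPattern = "id=(\\w+)".r
val withId = lines.flatMap { line =>
  idPattern.findFirstMatchIn(line).map(m => (m.group(1), line))
}

// Not separate DStreams from Spark's point of view, but each filter yields a
// DStream you can process independently.
val xyzStream = withId.filter(_._1 == "xyz").map(_._2)
val abcStream = withId.filter(_._1 == "abc").map(_._2)

// Placeholder actions: replace with whatever per-id processing you need.
xyzStream.foreachRDD(rdd => rdd.take(10).foreach(line => println(s"xyz: $line")))
abcStream.foreachRDD(rdd => rdd.take(10).foreach(line => println(s"abc: $line")))

ssc.start()
ssc.awaitTermination()
```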