Concatenating IoT messages' data with Azure Stream Analytics

An IoT sensor takes 1000 measurements (at 1 kHz) every 10 minutes and sends the values in ten separate messages to Azure IoT Hub. I need to concatenate those ten messages back into one for further processing, e.g. calculating RMS and FFT.
The messages have the following structure:
{
  "SampleID": 12344,
  "PartitionIdx": 2,
  "NbrPartitions": 10,
  "Values": [12, 13, 14, 13, 12, 11, 10, 9]
}
So the values of all messages sharing the same SampleID should be concatenated together in PartitionIdx order, once all ten have been received. I tried to do this with Stream Analytics but failed.
Is this task too complex for Stream Analytics? If so, are there any other options besides coding a WebJob that does the concatenation?

There are two aspects to the question:
1. Knowing when all the values have arrived.
2. Concatenating them together.
For 1: if the events for a particular device ID always arrive in order to Event Hub and to the same Event Hub partition, or if you have an idea of how out of order the data can be, you can use the TIMESTAMP BY OVER() clause to create a separate timeline for every device. This will hold back output for a device ID until enough data from that partition has been received.
For 2: you can use Collect() as @js-azure mentioned. If you want the data formatted in a specific way instead of as an array of records, you can use JavaScript user-defined aggregates. A sketch combining both ideas is shown below.
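For illustration, here is a minimal sketch combining the two ideas. The input and output names, the DeviceId column, and the timestamp field are assumptions, and it assumes all ten parts of a sample arrive within the same 10-minute window:

WITH PerDevice AS (
    -- Substreams: give each device its own timeline so one device's output
    -- is not held back while waiting on data from other devices.
    SELECT *
    FROM [iothub-input] TIMESTAMP BY EventEnqueuedUtcTime OVER DeviceId
)
SELECT
    DeviceId,
    SampleID,
    -- Collect the ten partial messages ordered by PartitionIdx; the result is an
    -- array of records that can be flattened downstream or reshaped with a JavaScript UDA.
    CollectTop(10) OVER (ORDER BY PartitionIdx ASC) AS OrderedParts
INTO [concatenated-output]
FROM PerDevice
GROUP BY DeviceId, SampleID, TumblingWindow(minute, 10)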

Related

Correlating Events in Stream Analytics

I have a number of events that are based on values from devices. They are read in intervals, e.g. every hour. The events are delivered to an Event Hub, which is used as an input to a Stream Analytics (SA) job.
I want to aggregate and calculate an average value in SA. Currently, I aggregate and group the events in SA using an origin id and other properties to create the correct groups and averages. The problem is that the averages are not correct. I think the events are either not complete and/or not correlated correctly.
Using a TumblingWindow will produce a number of static windows based on time, but the events I need to aggregate might come across two or more windows.
Using a SlidingWindow, as I understand it, will trigger output upon a specific condition and then "look back" for a specified interval. Is this correct? If it is, I could attach the same id, like a JobId, to each event that I need aggregated and a value indicating whether it is the last event. When the last event enters SA, the SlidingWindow is triggered and we can "look back" for all the events with the same id. Is this possible?
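To make the idea concrete, here is roughly what I imagine that query would look like. All names (JobId, Value, IsLast, the input) are invented, and whether a SlidingWindow can actually be made to emit only when the last event arrives is exactly my question:

SELECT
    JobId,
    AVG(Value) AS AvgValue
FROM [input] TIMESTAMP BY ReadingTime
GROUP BY JobId, SlidingWindow(hour, 24)
-- assumes IsLast is 0/1; only keep windows that already contain the event flagged as last
HAVING MAX(IsLast) = 1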
Are there other options in this case? Basically I need to correlate a number of events based on other characteristics than time.
I hope you can help me.

Create multiple Event hub receiver to process huge volume of data concurrently

Hi, I have an Event Hub with two consumer groups.
Many devices are sending data to my Event Hub and I want to save all messages to my database.
Because data is sent by many devices the ingress rate is high, so in order to process those messages I have written one Event Hub-triggered WebJob that processes each message and saves it to the database.
But saving these messages to my database is a time-consuming task; in other words, the receiver is slower than the sender.
So is there any way to process these messages faster by creating multiple receivers, or something similar?
I have created two event receivers with different consumer groups, but I found that the same message is processed by both trigger functions, so duplicate data is being saved to my database.
Please help me understand how I can create multiple receivers that each process unique messages in parallel.
Creating multiple consumer groups won't help you out, as you found out yourself. Different consumer groups all read the same data, but each at its own pace.
In order to increase the processing speed there are just two options:
1. Make the process/code itself faster, so try to optimize the code that saves the data to the database.
2. Increase the number of partitions so more consumers can read the data in parallel, one consumer per partition. This means, however, that you will have to recreate the Event Hub, as you cannot increase/decrease the partition count after the Event Hub is created. See the docs for guidance.
About 2: the number of concurrent data consumers is equal to the number of partitions created. For example, if you have 4 partitions you can have up to 4 concurrent data readers processing the data.
I do not know your situation, but if you have certain peaks in which the processing is too slow and it catches up during quieter hours, you might be able to live with the current situation. If not, you have to do something like I outlined.

Azure Stream Analytics: "Output contains multiple rows …" warning

We're using a Stream Analytics component in Azure to send data (log messages from different web apps) to a Table Storage account. The messages are retrieved from an Event Hub, but I think this doesn't matter here.
Within the Stream Analytics component we defined an output for the table storage account including partition and row key settings. As of now the partition key will be the name of the app that sent the log message in the first place. This might not be ideal, but I'm lacking experience in choosing the right values here. However, I think this is a whole different topic. The row key will be a unique id of the specific log message.
Now when I watch the Stream Analytics Output within the Azure portal the following warning message pops up very frequently (and sometimes disappears for a couple of seconds):
Warning: Output contains multiple rows and just one row per partition key. If the output latency is higher than expected, consider choosing a partition key that splits output into multiple partitions while maintaining about 100 records per partition.
Regarding this message I have two questions:
What does this exactly mean or why does it happen? I can see that a single new log message will always qualify as "just one row per partition key", simply because it's just one row. But looking at maybe hundreds of rows sent within a short period of time they all share just three partition keys (three apps logging to the Event Hub), pretty much equally divided. That's why I don't get the whole "Output contains multiple rows and just one row per partition key" thing.
Does this in any way affect the performance or overall functionality of the Stream Analytics component or the table storage?
I also played with the "Batch size" setting of the table storage output, but this didn't change anything.
Thanks in advance for reading and trying to help.
What does this exactly mean or why does it happen?
It is a warning, not an error. It means that each row in your output has a unique partition key.
I can see that a single new log message will always qualify as "just one row per partition key", simply because it's just one row.
The warning doesn't really apply to a single message. I suggest you post feedback on the Azure feedback site, which is used for collecting user voice and bug reports.
https://feedback.azure.com/forums/34192--general-feedback
Does this in any way affect the performance or overall functionality of the Stream Analytics component or the table storage?
No, you could just ignore the warning.
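If you did want to follow the warning's suggestion and spread the output over more partitions, one option is to compute a composite partition key in the query. This is only a sketch; the column and input/output names are invented, and the Table Storage output's partition key / row key settings would need to point at these columns:

SELECT
    -- e.g. app name plus an hour bucket, so rows spread over more partitions
    CONCAT(AppName, '-', DATENAME(hour, EventEnqueuedUtcTime)) AS CompositePartitionKey,
    LogId AS LogRowKey,
    Message
INTO [table-storage-output]
FROM [eventhub-input]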

Azure Stream Analytics Get Previous Output Row for Join to Input

I have the following scenario:
Mobile app produces events that are sent to Event Hub which is input stream source to a Stream Analytics query. From there they are passed through a sequential flow of queries that splits the stream into 2 streams based on criteria, evaluates other conditions and decides whether or not to let the event keep flowing through the pipeline (if it doesn't it is simply discarded). You could classify what we are doing is noise reduction/event filtering. Basically if A just happened don't let A happen again unless B & C happened or X time passes. At the end of the query gauntlet the streams are merged again and the "selected" events are propagated as "chosen" outputs.
My problem is that I need the ability to compare the current event to the previous "chosen" event (not just the previous input event), so in essence I need to join my input stream to my output stream. I have tried various ways to do this and so far none have worked; I know that other CEP engines support this concept. My queries are mostly all defined as temporary result sets inside of a WITH statement (that's where my initial input stream is pulled into the first query, and each following query depends on the one above it), but I see no way to either join my input to my output or to join my input to another temporary result set that is further down the chain. It appears that JOIN only supports inputs?
For the moment I am attempting to work around this limitation with something I really don't want to do in production, but I actually have an output defined going to an Azure Queue then an Azure Function triggered by events on that queue that wakes up and posts it to a different Event hub that is mapped as a recirc feed input back into my queries which I can join to. Still wiring all of that up so not 100% sure it will work but thinking there has to be a better option for this relatively common pattern?
The WITH statement is indeed the right way to get a previous input joined with some other data.
You may need to combine it with the LAG operator, which gets the previous event in a data stream.
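A minimal sketch of what that can look like, with invented input and column names. Note that LAG looks back over the events of that step's input, not over the final "chosen" output:

WITH Deltas AS (
    SELECT
        DeviceId,
        EventType,
        -- previous event for the same device within the last hour;
        -- LAG also accepts a WHEN clause to look back only at events matching a condition
        LAG(EventType) OVER (PARTITION BY DeviceId LIMIT DURATION(hour, 1)) AS PreviousEventType
    FROM [eventhub-input] TIMESTAMP BY EventTime
)
SELECT *
INTO [chosen-output]
FROM Deltas
WHERE EventType != PreviousEventType  -- e.g. drop A if the previous event was also A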
Let us know if it works for you.
Thanks,
JS - Azure Stream Analytics
AFAIK, a Stream Analytics job supports two distinct data input types: data stream inputs and reference data inputs. Per my understanding, you could leverage reference data to perform a lookup or to correlate with your data stream (a small sketch of such a join follows the links below). For more details, you could refer to the following tutorials:
Data input types: Data stream and reference data
Configuring reference data
Tips on refreshing your reference data
Reference Data JOIN (Azure Stream Analytics)
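As an illustration, a reference data join has no time window, because the reference input is static or slowly changing. All names here are placeholders:

SELECT
    s.DeviceId,
    s.Reading,
    r.DeviceGroup        -- looked up from the reference data
INTO [output]
FROM [eventhub-input] s TIMESTAMP BY EventTime
JOIN [device-reference-data] r
    ON s.DeviceId = r.DeviceId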

Azure Eventhub: de-mux and aggregate events from all partitions of eventhub using IEventProcessor

I have one Event Hub with 2 partitions. I want to aggregate my data per minute and save it to a database, and I am using IEventProcessor to read events from the Event Hub.
I am able to save the data to the database as-is, but when I aggregate the data I get 2 entries per minute instead of 1. I think the reason is that the IEventProcessor runs twice, i.e. once for each partition in the Event Hub.
Is there any way I can aggregate the streaming data per minute while reading from the Event Hub and then save it to the database? (I can't use Stream Analytics, since my data is in protobuf format.)
You can use the Azure IoTHub React Java and Scala API; it provides a merged reactive stream with events from all Event Hub partitions.
From your perspective you'll see only one stream of data, regardless of the number of partitions in the Event Hub, and you can select a subset of partitions too if you need to.
These samples show how the API works; it should make your task very simple. You need to define your "Sink", which is going to be a method writing events to a database, and link it to the provided "Source", something like:
val eventHubRecords = IoTHub().source(java.time.Instant.now())

val myDatabase = Sink.foreach[MessageFromDevice] {
  m ⇒ MyDB.writeRecord(m)
}

eventHubRecords.to(myDatabase).run()
Here are the configuration settings; checkpointing supports Cassandra and Azure Blob storage.
Note: the project is named after Azure IoT, but you can use it for Event Hub too. Let me know if you have any questions.
You can use Stream Analytics and its GROUP BY clause. As long as all the rows are unique it won't summarize them. You can then push that output onto another Event Hub for your IEventProcessor to handle, or write it directly to storage.
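For example, a one-minute tumbling-window aggregation might look like the sketch below (input/output names, columns, and the timestamp field are assumptions):

SELECT
    DeviceId,
    AVG(Value) AS AvgValue,
    COUNT(*) AS EventCount
INTO [aggregated-output]   -- e.g. another Event Hub, or storage
FROM [eventhub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 1)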
