From IoT Hub to multiple tables in an Azure SQL database

I have an IoT hub with devices that push their sensor data to it, to be stored in a SQL database. This seems to be quite easy to do by means of a Stream Analytics job.
However, the tricky part is as follows. The data I'm pushing is not normalized, and since I'm using a SQL database I would like to structure it across multiple tables. This does not seem to be an easy task with Stream Analytics.
This is an example of the payload I'm pushing to the IOT hub:
{
  "timestamp": "2019-01-10 12:00",
  "section": 1,
  "measurements": {
    "temperature": 28.7,
    "height": 280,
    "ec": 6.8
  },
  "pictures": [
    "101_a.jpg",
    "102_b.jpg",
    "103_c.jpg"
  ]
}
My database has the tables Measurement, MeasurementItem and Picture. I would like to store the timestamp and section in a Measurement record, the temperature, height and ec in a MeasurementItem record, and the pictures in the Picture table.
Filling one table is easy, but to fill the second table I need the generated auto-increment ID of the previous record to keep the relation intact.
Is that actually possible with Stream Analytics, and if not, how should I do it?

You shouldn't try this with Stream Analytics (SA). It isn't designed for workloads like this; keeping its job that narrow is what lets SA perform as well as it does. It simply sends data to one or more sinks depending on the input data.
I would suggest passing the data to a component that can run logic on the output side. There are a few options for this; two examples are:
An Azure Function (using an Event Hub trigger pointed at the IoT Hub's built-in endpoint, as described here)
An Event Grid-based trigger on a storage account you write the IoT data to (so again an Azure Function, but triggered by an event from the storage account); a minimal trigger sketch follows this list
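As a minimal sketch of the first option, assuming the in-process Azure Functions model with the older Microsoft.Azure.EventHubs-based binding and an app setting named IoTHubConnection holding the built-in endpoint's connection string (names here are illustrative, not from the question):

using System.Text;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class IoTHubIngest
{
    // "messages/events" is the IoT Hub built-in, Event Hub-compatible endpoint.
    [FunctionName("IoTHubIngest")]
    public static void Run(
        [EventHubTrigger("messages/events", Connection = "IoTHubConnection")] EventData[] events,
        ILogger log)
    {
        foreach (var eventData in events)
        {
            var json = Encoding.UTF8.GetString(
                eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
            log.LogInformation("Received payload: {payload}", json);
            // Deserialize the JSON and write it to the database here.
        }
    }
}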
These solutions come at the price that each incoming data package invokes a unit of logic that you pay for additionally. Be aware that there are billing options for Azure Functions that do not depend on the number of calls but instead provide the compute in a more App Service-like model.
If you have huge amounts of data to process, you might consider an architecture using a Data Lake Storage account in combination with Data Lake Analytics instead. The latter can also collect, aggregate and distribute your incoming data into different data stores.

I ended up with an Azure Function with an IoT Hub trigger. The function uses EF Core to store the JSON messages in the SQL database, spread over multiple tables. I was a bit reluctant about this approach, as it introduces extra logic and I expected to pay extra for it.
The opposite turned out to be true. On the Azure Functions consumption plan, the first 400,000 GB-s of execution and 1,000,000 executions per month are free. Moreover, this solution gives extra flexibility and control, because the single-table limitation no longer applies.
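As a rough sketch of that approach: the Measurement, MeasurementItem and Picture tables come from the question, while DevicePayload, TelemetryContext and MeasurementStore are hypothetical helper types. EF Core fills in the auto-increment keys and foreign keys itself when the whole object graph is saved in a single SaveChangesAsync call.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class Measurement
{
    public int Id { get; set; }                 // auto-increment primary key
    public DateTime Timestamp { get; set; }
    public int Section { get; set; }
    public List<MeasurementItem> Items { get; set; } = new List<MeasurementItem>();
    public List<Picture> Pictures { get; set; } = new List<Picture>();
}

public class MeasurementItem
{
    public int Id { get; set; }
    public int MeasurementId { get; set; }      // FK filled in by EF Core on SaveChanges
    public string Name { get; set; }
    public double Value { get; set; }
}

public class Picture
{
    public int Id { get; set; }
    public int MeasurementId { get; set; }
    public string FileName { get; set; }
}

public class TelemetryContext : DbContext
{
    public TelemetryContext(DbContextOptions<TelemetryContext> options) : base(options) { }
    public DbSet<Measurement> Measurements { get; set; }
    public DbSet<MeasurementItem> MeasurementItems { get; set; }
    public DbSet<Picture> Pictures { get; set; }
}

// Shape of the deserialized device message (hypothetical helper type).
public class DevicePayload
{
    public DateTime Timestamp { get; set; }
    public int Section { get; set; }
    public Dictionary<string, double> Measurements { get; set; }
    public List<string> Pictures { get; set; }
}

public static class MeasurementStore
{
    public static async Task StoreAsync(TelemetryContext db, DevicePayload payload)
    {
        var measurement = new Measurement
        {
            Timestamp = payload.Timestamp,
            Section = payload.Section,
            Items = payload.Measurements
                .Select(kv => new MeasurementItem { Name = kv.Key, Value = kv.Value })
                .ToList(),
            Pictures = payload.Pictures
                .Select(p => new Picture { FileName = p })
                .ToList()
        };

        db.Measurements.Add(measurement);
        await db.SaveChangesAsync();   // one transaction; IDs and FKs resolved automatically
    }
}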

Related

Cosmos Write Returning 429 Error With Bulk Execution

We have a solution using a micro-service approach. One of our micro-services is responsible for pushing data to Cosmos. Our Cosmos database uses serverless provisioning, which has a 5,000 RU/s limit.
The data we are inserting into Cosmos looks like the below. There are 10 columns and we are pushing a batch containing 5,807 rows of this data.
Id | CompKey | Primary Id | Secondary Id | Type | DateTime | Item | Volume | Price | Fee
1 | Veg_Buy | csd2354csd | dfg564dsfg55 | Buy | 30/08/21 | Leek | 10 | 0.75 | 5.00
2 | Veg_Buy | sdf15s1dfd | sdf31sdf654v | Buy | 30/08/21 | Corn | 5 | 0.48 | 3.00
We are retrieving data from multiple sources, normalizing it, and sending out the data as one bulk execution to Cosmos. The retrieval process happens every hour. We understand that we are spiking the Cosmos database once per hour with the data that has been retrieved and then stop sending data until the next retrieval cycle. So if this high peak is the problem, what remedies exist for such a scenario?
Can anyone shed some light on what we should/need to do to overcome this issue? Perhaps we are missing a setting when creating the Cosmos database or possibly this has something to do with partitioning?
You can mostly determine these things by looking at the metrics published in the Azure Portal. This doc is a good place to start: Monitor and debug with insights in Azure Cosmos DB.
In particular I would look at the section titled "Determine the throughput consumption by a partition key range".
If you are not dealing with a hot partition key, you may want to look at options to throttle your writes. This could include reducing your batch size and putting the write operations in a loop with a one-second delay so that the RU/s consumed stay at or below the 5,000 RU/s limit. You could also look at queue-based load leveling: put the writes on a queue in front of Cosmos and stream them in.
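A rough sketch of the throttled-batch idea, assuming the .NET Cosmos SDK v3 with bulk execution enabled, a hypothetical Order item type, and CompKey as the partition key; the batch size and one-second delay are only starting points to tune against the 5,000 RU/s limit:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class Order
{
    public string id { get; set; }       // Cosmos requires a lowercase "id" property (or a JSON mapping)
    public string CompKey { get; set; }
    // ... remaining columns from the table above
}

public static class ThrottledWriter
{
    // Client configured for bulk execution, e.g.:
    // var client = new CosmosClient(connectionString,
    //     new CosmosClientOptions { AllowBulkExecution = true });

    public static async Task WriteInBatchesAsync(Container container, IReadOnlyList<Order> rows)
    {
        const int batchSize = 500;   // tune so one batch stays well under the RU/s budget
        for (int i = 0; i < rows.Count; i += batchSize)
        {
            var tasks = rows.Skip(i).Take(batchSize)
                .Select(row => container.CreateItemAsync(row, new PartitionKey(row.CompKey)));
            await Task.WhenAll(tasks);                  // bulk mode packs these into fewer requests
            await Task.Delay(TimeSpan.FromSeconds(1));  // spread the hourly spike over time
        }
    }
}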

Is there a way to find out which partition a message was written to when using EventHub.SendAsync(EventData)?

Is there a way to find out which partition a message was written to when using EventHub.SendAsync(EventData) from the Azure Event Hub client SDK?
We intentionally do not provide a partition key so that the Event Hubs service can do its internal load balancing, but we want to find out which partition the data eventually lands in, for diagnosing issues with the end-to-end data flow.
Ivan's answer is correct in the context of the legacy SDK (Microsoft.Azure.EventHubs), but the current generation (Azure.Messaging.EventHubs) is slightly different. You don't mention a specific language, but conceptually the answer is the same across them. I'll use .NET to illustrate.
If you're not using a call that requires specifying a partition directly when reading events, then you'll always have access to an object that represents the partition that an event was read from. For example, if you're using the EventHubConsumerClient method ReadEventsAsync to explore, you'll see a PartitionEvent whose Partition property tells you the partition that the Data was read from.
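A minimal sketch of that with the current .NET SDK (Azure.Messaging.EventHubs), assuming a connection string and hub name are available; it reads from all partitions of the default consumer group and prints where each event came from:

using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs.Consumer;

public static class PartitionInspector
{
    public static async Task DumpPartitionsAsync(string connectionString, string eventHubName)
    {
        await using var consumer = new EventHubConsumerClient(
            EventHubConsumerClient.DefaultConsumerGroupName, connectionString, eventHubName);

        // Stop reading after 30 seconds; this is only a diagnostic sketch.
        using var cancellation = new CancellationTokenSource(TimeSpan.FromSeconds(30));

        try
        {
            await foreach (PartitionEvent partitionEvent in consumer.ReadEventsAsync(cancellation.Token))
            {
                // Partition identifies where the event was stored; Data is the event itself.
                Console.WriteLine(
                    $"Partition {partitionEvent.Partition.PartitionId}: " +
                    $"sequence {partitionEvent.Data.SequenceNumber}");
            }
        }
        catch (OperationCanceledException)
        {
            // Expected when the time window elapses.
        }
    }
}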
When using the EventProcessorClient, your ProcessEventAsync handler will be invoked with a set of ProcessEventArgs where the Partition property tells you the partition that the Data was read from.
There is no direct way to do this, but there are two workarounds.
1. Use Event Hubs Capture to store the incoming events, then check the events in the specified blob storage. When the events are stored in blob storage, the path contains the partition id, so you can read it from there.
2. Use code. Create a new consumer group and follow this article to read events. In that section there is a method public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages); you can use the PartitionContext parameter to get the event's partition id (via context.PartitionId). A small sketch of that handler follows.
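For illustration only, a minimal processor with the legacy Microsoft.Azure.EventHubs.Processor package (the class name is made up):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.EventHubs.Processor;

public class PartitionLoggingProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public Task ProcessErrorAsync(PartitionContext context, Exception error) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var message in messages)
        {
            // context.PartitionId tells us which partition this batch was read from.
            Console.WriteLine($"Partition {context.PartitionId}, offset {message.SystemProperties.Offset}");
        }
        await context.CheckpointAsync();
    }
}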

How to ensure idempotency on event hub on consumers that only stores aggregated information?

I'm working on an event-driven micro-services architecture, using Event Hubs to send a lot of data (around 20-30k events per minute) to multiple consumer groups, and using the Azure Functions EventHubTrigger to process these events.
The data I'm passing around has a unique identifier, and my other consumers can guarantee idempotency since the identifiers are stored in their data stores upon processing - so if a unique event identifier already exists, I can skip processing for that specific event.
I do, however, have one service that only does data aggregation for reporting to a relational database - counts, sums, and the like, pretty much upserts - so that I can run queries against it to produce reports, and I have seen quite a few events that were processed multiple times.
So an idea I had was to keep some sort of processed-event store: Redis with a TTL, Azure Table Storage, or even a table in my relational database that contains only a single column with a unique constraint, so that the whole event processing can run in one transaction. A rough sketch of that last option is below.
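This is just a sketch of that unique-constraint idea, assuming a hypothetical ProcessedEvents table with a unique EventId column and Microsoft.Data.SqlClient; the aggregation work is passed in as a callback so it commits or rolls back together with the dedup insert:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class AggregationConsumer
{
    // ProcessedEvents is a table whose only column, EventId, has a unique constraint.
    public static async Task<bool> TryProcessAsync(
        string connectionString, Guid eventId, Func<SqlConnection, SqlTransaction, Task> applyAggregates)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();
        using var transaction = connection.BeginTransaction();

        var insert = new SqlCommand(
            "INSERT INTO ProcessedEvents (EventId) VALUES (@id)", connection, transaction);
        insert.Parameters.AddWithValue("@id", eventId);

        try
        {
            await insert.ExecuteNonQueryAsync();   // fails if this event was already processed
        }
        catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
        {
            transaction.Rollback();
            return false;                          // duplicate event: skip the aggregation
        }

        await applyAggregates(connection, transaction);  // counts/sums/upserts in the same transaction
        transaction.Commit();
        return true;
    }
}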
Is there a better way to do this?

Can a date and time be specified when sending data to Azure event hub?

Here's the scenario. I'm not working with real-time data. Instead, I get data from my electric company for the past day's electric usage. Specifically, each day I can get the number of kWh for each hour of the previous day.
So, I'd like to load this past information into Event Hub each following day. Is this doable? Does Event Hub support loading past information, or is it only and forever about real-time streaming data, with no ability to load past data in?
I'm afraid this is the case, as I've not seen any date specification in what limited API documentation I could find for it. I'd like to confirm, though...
Thanks,
John
An Azure Event Hub is really meant for short-term storage. By default you may only retain data for up to 7 days, after which the data is deleted based on an append timestamp created when the message first entered the Event Hub. It is therefore not practical to use an Azure Event Hub for data that's older than 7 days.
An Azure Event Hub is meant for message/event management, not long-term storage. A possible solution would be to write the Event Hub data to an Azure SQL database or blob storage for long-term storage, and then use Azure Stream Analytics (an event processor) to join the active stream with the legacy data that has accumulated in the SQL database. Also note that you can reference the appended attribute; it's called "EventEnqueuedUtcTime". Keep in mind that it reflects server time, whose clock may differ from the date/time of the actual measurement.
As for including a date/time: if you are sending the message as JSON, simply add it as a key and value. Example message with a time: { "Time": "My UTC time here" }. A short sending sketch is below.
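A small sketch of sending such a message with the current .NET SDK (Azure.Messaging.EventHubs); the reading shape and values are made up, and the timestamp simply travels inside the payload, which Event Hubs treats as opaque bytes:

using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

public static class UsageUploader
{
    public static async Task SendReadingAsync(string connectionString, string eventHubName)
    {
        await using var producer = new EventHubProducerClient(connectionString, eventHubName);

        // The measurement time is just part of the payload; Event Hubs does not interpret it.
        var reading = new
        {
            Time = new DateTime(2021, 8, 30, 14, 0, 0, DateTimeKind.Utc),
            Kwh = 1.2
        };

        using EventDataBatch batch = await producer.CreateBatchAsync();
        batch.TryAdd(new EventData(JsonSerializer.SerializeToUtf8Bytes(reading)));

        await producer.SendAsync(batch);
    }
}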
A streaming system of this type doesn't care about times a particular application may wish to apply to the items. There simply isn't any processing that happens based on a time field unless your code does it.
Each message sent is an EventData, which carries an arbitrary set of bytes. You can easily include a date/time in that serialized data structure, but Event Hubs won't care about it. There is no sorting performed or fixed ordering other than insertion order within a partition, which is defined by the sequence number. While the enqueued time is available, it's mostly useful for monitoring how far behind in processing you are.
As to the rest of your problem, I'd agree with the comment that Event Hubs may not really be the best choice. You can certainly load data into it once per day, but if it's really only 24 data points per day, it's not really the appropriate technology choice unless it's a prototype/tech demo for a system that's eventually supposed to have a whole load of smart meters reporting to it with fair frequency. (Note also that Event Hubs cost $11/month minimum, a Service Bus queue $10/month minimum, and AWS SQS has a $0 minimum.)

Azure event hubs and multiple consumer groups

I need help using Azure Event Hubs in the following scenario. I think consumer groups might be the right option for this scenario, but I was not able to find a concrete example online.
Here is a rough description of the problem and the proposed solution using Event Hubs (I am not sure this is the optimal solution; I will appreciate your feedback).
I have multiple event sources that generate a lot of event data (telemetry from sensors), which needs to be saved to our database, and some analysis (such as running averages and min/max) should be performed in parallel.
The sender can only send data to a single endpoint, but the event hub should make this data available to both data handlers.
I am thinking about using two consumer groups: the first will be a cluster of worker role instances that take care of saving the data to our key-value store, and the second consumer group will be an analysis engine (likely Azure Stream Analytics).
Firstly, how do I set up the consumer groups, and is there something I need to do on the sender/receiver side so that copies of events appear in all consumer groups?
I did read many examples online, but they either use client.GetDefaultConsumerGroup(); and/or have all partitions processed by multiple instances of the same worker role.
For my scenario, when an event is triggered, it needs to be processed by two different worker roles in parallel (one that saves the data and a second one that does some analysis).
Thank You!
TLDR: Looks reasonable, just make two Consumer Groups by using different names with CreateConsumerGroupIfNotExists.
Consumer Groups are primarily a concept so exactly how they work depends on how your subscribers are implemented. As you know, conceptually they are a group of subscribers working together so that each group receives all the messages and under ideal (won't happen) circumstances probably consumes each message once. This means that each Consumer Group will "have all partitions processed by multiple instances of the same worker role." You want this.
This can be implemented in different ways. Microsoft has provided two ways to consume messages from Event Hubs directly plus the option to use things like Streaming Analytics which are probably built on top of the two direct ways. The first way is the Event Hub Receiver, the second which is higher level is the Event Processor Host.
I have not used the Event Hub Receiver directly, so this particular comment is based on the theory of how these sorts of systems work and on speculation from the documentation. While receivers are created from EventHubConsumerGroups, this serves little purpose, as these receivers do not coordinate with one another. If you use them you will need to (and can!) do all the coordination and committing of offsets yourself, which has advantages in some scenarios, such as writing the offset to a transactional DB in the same transaction as computed aggregates. Using these low-level receivers, having different logical consumer groups use the same Azure consumer group probably shouldn't (normative, not practical, advice) be particularly problematic, but you should use different names in case it either does matter or you switch to EventProcessorHosts.
Now onto more useful information: EventProcessorHosts are probably built on top of EventHubReceivers. They are a higher-level thing and there is support to enable multiple machines to work together as a logical consumer group. Below I've included a lightly edited snippet from my code that makes an EventProcessorHost, with a bunch of comments left in explaining some choices.
//Requires the legacy Service Bus SDK:
//using Microsoft.ServiceBus;
//using Microsoft.ServiceBus.Messaging;

//We need an identifier for the lease. It must be unique across concurrently
//running instances of the program. There are three main options for this. The
//first is a static value from a config file. The second is the machine's NETBIOS
//name, i.e. System.Environment.MachineName. The third is a random value unique per run,
//which we have chosen here; if our VMs have very weak randomness bad things may happen.
string hostName = Guid.NewGuid().ToString();

//It's not clear if we want this here long term or if we prefer that the Consumer
//Groups be created out of band. Nor are there necessarily good tools to discover
//existing consumer groups.
NamespaceManager namespaceManager =
    NamespaceManager.CreateFromConnectionString(eventHubConnectionString);
EventHubDescription ehd = namespaceManager.GetEventHub(eventHubPath);
namespaceManager.CreateConsumerGroupIfNotExists(ehd.Path, consumerGroupName);

host = new EventProcessorHost(hostName, eventHubPath, consumerGroupName,
    eventHubConnectionString, storageConnectionString, leaseContainerName);

//Call something like this when you want it to start
host.RegisterEventProcessorFactoryAsync(factory);
You'll notice that I told Azure to make a new Consumer Group if it doesn't exist; you'll get a lovely error message if it doesn't. I honestly don't know what the purpose of this entity is, because it doesn't include the Storage connection string, which needs to be the same across instances in order for the EventProcessorHost's coordination (and presumably commits) to work properly.
Here I've provided a picture from Azure Storage Explorer of the leases, and presumably offsets, from a Consumer Group I was experimenting with in November. Note that while I have a testhub and a testhub-testcg container, this is due to manually naming them. If they were in the same container it would be things like "$Default/0" vs "testcg/0".
As you can see there is one blob per partition. My assumption is that these blobs are used for two things. The first is the blob leases for distributing partitions amongst instances (see here); the second is storing the committed offsets within the partition.
Rather than the data being pushed to the Consumer Groups, the consuming instances ask the Event Hub for data at some offset within a partition. EventProcessorHosts are a nice high-level way of having a logical consumer group where each partition is only being read by one consumer at a time, and where the progress the logical consumer group has made in each partition is not forgotten.
Remember that throughput is metered per throughput unit at roughly 1 MB/s of ingress and 2 MB/s of egress, so if you're maxing out ingress you can only have two logical consumers that are all up to speed. As such you'll want to make sure you have enough partitions, and throughput units, that you can:
Read all the data you send.
Catch up within the 24 hour retention period if you fall behind for a few hours due to issues.
In conclusion: consumer groups are what you need. The examples you read that use a specific consumer group are good; within each logical consumer group use the same name for the Azure Consumer Group, and have different logical consumer groups use different ones.
I haven't yet used Azure Stream Analytics, but at least during the preview release you are limited to the default consumer group. So don't use the default consumer group for something else, and if you need two separate lots of Azure Stream Analytics you may need to do something nasty. But it's easy to configure!
