Azure Event Hub Consumer

Why do we need a blob container on an Azure storage account for an Event Hub consumer client (I'm using Python)? Why can't we consume messages from the event hub (a topic in Kafka terminology) directly, as we do in Kafka, or can it be done some other way?
I'm following the official Azure documentation linked below:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send

You are consuming the messages directly from the event hub. The storage account is not used as an intermediate step in any way. Instead, it is used for checkpointing:
Checkpointing is a process by which readers mark or commit their position within a partition event sequence. Checkpointing is the responsibility of the consumer and occurs on a per-partition basis within a consumer group. This responsibility means that for each consumer group, each partition reader must keep track of its current position in the event stream, and can inform the service when it considers the data stream complete.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group. When the reader connects, it passes the offset to the event hub to specify the location at which to start reading. In this way, you can use checkpointing to both mark events as "complete" by downstream applications, and to provide resiliency if a failover between readers running on different machines occurs. It's possible to return to older data by specifying a lower offset from this checkpointing process. Through this mechanism, checkpointing enables both failover resiliency and event stream replay.
So, summarized: the storage account is used to store information about the readers and their positions within the partitions.
You can write your own custom checkpoint storage implementation, see this question: Is there a way to store the azure Eventhub checkpoint to a remote bucket such as Google cloud bucket?
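For reference, here is a minimal sketch of the documented Python flow (the connection strings, container, and hub names below are placeholders). Passing a checkpoint_store is what makes the blob container necessary; without it the client still receives events, it just keeps its position in memory only.

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholders: substitute your own connection strings and names.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", "<blob-container-name>")

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<eventhub-name>",
    checkpoint_store=checkpoint_store)  # omit this to skip blob checkpointing

def on_event(partition_context, event):
    print(event.body_as_str())
    # Persist this reader's position for the partition to the blob container.
    partition_context.update_checkpoint(event)

with client:
    client.receive(on_event=on_event, starting_position="-1")  # "-1" = from start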

Related

Event Hub handling faults

With Event Hubs, if we face a fault and the consumer crashes, then the next time it comes up, how does it query the storage for the checkpoint of the partition it takes hold of, so that it can compare the reference sequence ID of that message with incoming messages and process only the ones that come after that sequence ID?
There is an API to save the checkpoint, but how do we retrieve it?
As you know, Event Hubs checkpointing is purely client-side, i.e., you can store the current offset in the storage account linked with your event hub using the method
await context.CheckpointAsync();
in your client code. This translates into a storage account call; it does not involve any Event Hubs service call.
Whenever there is a failure, you can read the latest (updated) offset from the storage account to avoid duplicating events. This must be handled in your client-side code; it will not be handled by the event hub on its own.
Moreover, failures in an event hub are rare and duplicate events are infrequent. For more details on building a workflow with no duplicate events, refer to this Stack Overflow answer.
The details of the checkpoint are saved in the storage account linked to the event hub and can be read back with a storage client (for example WindowsAzure.Storage) to do custom validation of the sequence number of the last event received, as sketched below.
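For illustration, a hedged sketch of reading those checkpoint details back with the Python storage SDK. This assumes the blob layout used by the current Python BlobCheckpointStore, which keeps one blob per partition under <namespace>/<eventhub>/<consumer-group>/checkpoint/<partition-id> with the offset and sequence number in blob metadata; the older EventProcessorHost stored a JSON document in the blob body instead.

from azure.storage.blob import ContainerClient

# Placeholders: your storage connection string and checkpoint container.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>", "<checkpoint-container>")

# One checkpoint blob per partition; the position is kept in blob metadata.
prefix = "<namespace>/<eventhub>/$Default/checkpoint/"
for blob in container.list_blobs(name_starts_with=prefix, include=["metadata"]):
    meta = blob.metadata or {}
    print(blob.name, meta.get("offset"), meta.get("sequencenumber"))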

Azure Event Hubs - custom consumer with SQL checkpoints

We are currently looking at Azure Event Hubs as a mechanism to dispatch messages to background processors. At the moment we use queue-based systems.
Most processors are writing data to SQL Server databases, and the writes are wrapped in transactions.
Event Hubs is positioned as an at-least-once communication channel, so duplicate messages should be expected. EventProcessorHost is the recommended API on the read side; it automates lease management and checkpointing using Azure Blob Storage.
But we have an idea, for the most critical processors, to implement checkpointing ourselves using a SQL Server table inside the same database, and to write the checkpoint inside the same transaction as the processor. This should give us a strong guarantee of exactly-once delivery when needed.
Ignoring lease management for now (just run one processor per partition), is SQL-based checkpointing a good idea? Are there other drawbacks, besides the need to work at a lower level of the API and handle checkpoints ourselves?
As per Fred's advice, we implemented our own Checkpoint Manager based on a table in SQL Server. You can find the code sample here.
This implementation plugs nicely into EventProcessorHost. We also had to implement ILeaseManager, because they are highly coupled in the default implementation.
In my blog post I've described my motivation for such a SQL-based implementation and given a high-level view of the overall solution.
Azure Storage is the built-in solution, but we are not limited to it. If most of your processors write data to SQL Server databases and you do not want EventProcessorHost to store checkpoints in Azure Storage (which requires a storage account), then storing checkpoints in your SQL database is a good solution: it provides an easy way to process events and manage checkpoints transactionally.
You could write your own checkpoint manager using the ICheckpointManager interface to store checkpoints in your SQL database.
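The core of the idea is that the data write and the checkpoint update share one transaction. A minimal sketch, using sqlite3 only so it runs self-contained (in the real system this is the SQL Server database and the processor's own transaction; the table names and schema are illustrative):

import sqlite3

conn = sqlite3.connect("processor.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (device TEXT, temperature REAL)")
conn.execute("CREATE TABLE IF NOT EXISTS checkpoints "
             "(partition_id TEXT PRIMARY KEY, sequence_number INTEGER)")
conn.commit()

def process_event(partition_id, sequence_number, device, temperature):
    # At-least-once delivery means replays happen; skip anything at or
    # below the checkpoint already stored for this partition.
    row = conn.execute(
        "SELECT sequence_number FROM checkpoints WHERE partition_id = ?",
        (partition_id,)).fetchone()
    if row is not None and sequence_number <= row[0]:
        return  # duplicate from a replay, already processed
    conn.execute("INSERT INTO readings VALUES (?, ?)", (device, temperature))
    conn.execute(
        "INSERT INTO checkpoints VALUES (?, ?) ON CONFLICT(partition_id) "
        "DO UPDATE SET sequence_number = excluded.sequence_number",
        (partition_id, sequence_number))
    conn.commit()  # data and checkpoint commit atomically: exactly-once effect

process_event("0", 1, "room-1", 21.5)
process_event("0", 1, "room-1", 21.5)  # replay of the same event is ignored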

How can we get messages from a particular Event Hub partition into an Azure Function, and how can we automatically scale out the number of Azure Functions?

I can get messages from all the partitions of an event hub in an Azure Function, but I want to get messages from one particular partition. Is there a way to do that? The other thing I want to do is increase (scale out) the number of Azure Functions to process messages when there is a large backlog of messages to process. How can I do that? Is there a formula to solve my second problem?
In the Azure Functions Consumption plan, scale out is handled automatically for you. If we see that your function is not keeping up with the event stream, we'll add new instances. Those instances will cooperate to process the event stream in parallel.
For reading the event stream, we rely on the Event Hubs EventProcessorHost, as described in their documentation here. This host manages coordination of partition leases with other instances when the Function App starts; this isn't something you can (or should want to) control.
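The Functions trigger does not expose partition pinning, but if a single-partition reader is a hard requirement, the plain Event Hubs SDK allows it. A sketch with the Python client (the connection string and names are placeholders):

from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<eventhub-name>")

def on_event(partition_context, event):
    print(partition_context.partition_id, event.body_as_str())

with client:
    # partition_id pins this receiver to one partition instead of all of them.
    client.receive(on_event=on_event, partition_id="0", starting_position="-1")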

Does Microsoft Azure IoT Hub store data?

I have just started learning Azure IoT and it's quite interesting. I am confused about whether IoT Hub stores data somewhere.
For example, suppose I am passing room temperature to IoT Hub and want to store it in a database for further use. How is that possible?
I am clear on how device-to-cloud and cloud-to-device messaging works with IoT Hub.
IoT Hub exposes device-to-cloud messages through an Event Hubs-compatible endpoint. Event Hubs has a retention time expressed in days. It's a stream of data that the reading client can re-read multiple times, because the cursor is on the client side (not on the server side, as with queues and topics). With IoT Hub, the retention time is 1 day by default, but you can change it.
If you want to store the messages received from devices, you need a client reading from the exposed Event Hubs endpoint (for example, an Event Processor Host) that has the business logic to process the messages and store them, for example, into a database.
Of course, you could use another decoupling layer, so that the client reads from the Event Hubs endpoint and stores messages into queues. Then another client reads from the queues at its own pace and stores into the database. In this way you have a fast path reading from Event Hubs, as in the sketch below.
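A sketch of that decoupling layer with the Python SDKs, assuming placeholder connection strings and an existing storage queue named "telemetry": the reader does nothing but hand messages off, and a separate worker drains the queue into the database at its own pace.

from azure.eventhub import EventHubConsumerClient
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    "<storage-connection-string>", "telemetry")

# The IoT Hub built-in endpoint is Event Hubs-compatible; its connection
# string comes from the hub's "Built-in endpoints" settings.
client = EventHubConsumerClient.from_connection_string(
    "<event-hubs-compatible-connection-string>",
    consumer_group="$Default",
    eventhub_name="<iothub-name>")

def on_event(partition_context, event):
    # Fast path: enqueue and return; no database work in this hot loop.
    queue.send_message(event.body_as_str())

with client:
    client.receive(on_event=on_event, starting_position="-1")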
This is pretty much the use case for all IoT scenarios.
Step 1: High-scale data ingestion via Event Hubs.
Step 2: Create and use a stream-processing engine (Stream Analytics or HDInsight/Storm). You can run conditions (SQL-like queries) to filter and store the appropriate data in either a cold or a hot store for further analytics.
Step 3: Storage for cold-path analytics can be Azure Blob storage. Stream Analytics can be configured to write the data directly into it. The cold store can contain all the other data that doesn't require frequent querying, and it is cheap.
Step 4: Processing for hot-path analytics. This is data that is queried more regularly, or data on which real-time analytics needs to be carried out, like in your case checking for temperature values going beyond a threshold, which needs an urgent trigger. (A plain-code stand-in for this filter is sketched below.)
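This is not Stream Analytics itself, but a plain-Python stand-in showing the shape of that hot/cold split (the 30 °C threshold and the reading format are assumptions for illustration):

HOT_THRESHOLD = 30.0  # assumed alert threshold

cold_store = []  # stand-in for cheap bulk storage such as Blob storage

def alert(reading):
    print("ALERT:", reading)  # stand-in for the hot-path trigger

def route(reading):
    cold_store.append(reading)             # step 3: everything goes cold
    if reading["temperature"] > HOT_THRESHOLD:
        alert(reading)                     # step 4: urgent hot-path trigger

route({"device": "room-1", "temperature": 31.2})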
Let me know if you face any challenges while configuring the Stream Analytics job! :)
If you take a look at the IoT Suite remote monitoring preconfigured solution (https://azure.microsoft.com/documentation/articles/iot-suite-remote-monitoring-sample-walkthrough/) you'll see that it persists telemetry in blob storage and maintains device status information in DocumentDb. This preconfigured solution gives you a working illustration of the points made in the previous answers.

What is Partition ID, Offset, Host Name in Azure Event Hub Receiver?

I am working with Azure Event Hubs. I have some doubts.
What is the partition ID in an Azure Event Hub receiver? Is this ID the same as the partition key in an Azure Event Hub publisher?
What is an offset, and what is its use in an Azure Event Hub consumer?
Can I consume messages without using a consumer group?
Can I consume messages with a single receiver?
What is the use of a blob in an Event Hub consumer? I only want to view the messages I sent.
This article Event Hubs Overview should answer your questions in detail, but to summarize:
When you create a new Event Hub in the portal, you specify how many partitions you need. The Publisher hashes the partition key of an event to determine which partition to send the event to. An event hub receiver receives events from those partitions.
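A short sketch of the publisher side with the Python SDK (placeholder connection details): events sharing a partition_key hash to the same partition, which is what a receiver then reads from.

from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<eventhub-connection-string>", eventhub_name="<eventhub-name>")

with producer:
    batch = producer.create_batch(partition_key="room-1")
    batch.add(EventData("21.5"))
    batch.add(EventData("21.7"))
    producer.send_batch(batch)  # both events land on the same partition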
An event hub consumer tracks which events it has received by using an offset into each partition. By changing the offset you can, for example, re-read events from a partition.
You must have at least one consumer group (there is a default one). Each consumer group has its own view of the partitions (different offset values), which lets it read events from the partitions independently of the other consumer groups.
Typically, you have one receiver per partition to enable scale-out. An event hub can have up to 32 partitions in the standard tier; the count is chosen at creation.
Offset values are managed by the client. You can checkpoint your latest position in each partition to enable you to restart at the latest event if the client restarts. The checkpoint mechanism writes the latest offset values to blob storage.
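On the consumer side, offsets and sequence numbers arrive on each event, and a saved offset can be fed back in to resume. A sketch with the Python client (connection details and the saved offsets are placeholders; the blob checkpoint mechanism automates exactly this bookkeeping):

from azure.eventhub import EventHubConsumerClient

last_offsets = {"0": "12345", "1": "67890"}  # hypothetical persisted offsets

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<eventhub-name>")

def on_event(partition_context, event):
    # event.offset and event.sequence_number are what you would persist.
    print(partition_context.partition_id, event.offset, event.sequence_number)

with client:
    # A per-partition dict resumes each partition just after the given offset.
    client.receive(on_event=on_event, starting_position=last_offsets)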
