Event Hub handling faults - Azure

With Event Hubs, if a fault occurs and the consumer crashes, how does the consumer, when it comes back up, query from storage the checkpoint it had reached for the partition it takes ownership of, so that it can compare that reference sequence number against incoming messages and process only the ones that come after it?
There is an API to save the checkpoint, but how do you retrieve it?

As you know, Event Hub checkpointing is purely client side, i.e., you can store the current offset in the storage account linked with your event hub using the method
await context.CheckpointAsync();
in your client code. This is translated into a storage account call; it is not related to any Event Hubs service call.
Whenever there is a failure, you can read the latest (updated) offset from the storage account to avoid duplicate processing of events. This must be handled by you in your client-side code; it will not be handled by the event hub on its own.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group. When the reader connects, it passes the offset to the event hub to specify the location at which to start reading. In this way, you can use checkpointing to both mark events as "complete" by downstream applications, and to provide resiliency if a failover between readers running on different machines occurs. It is possible to return to older data by specifying a lower offset from this checkpointing process. Through this mechanism, checkpointing enables both failover resiliency and event stream replay.
Moreover, failures in an event hub are rare and duplicate events are infrequent. For more details on building a workflow with no duplicate events, refer to this Stack Overflow answer.
The details of the checkpoint are saved in the storage account linked to the event hub and can be read using a storage client (for example WindowsAzure.Storage) to do custom validation of the sequence number of the last event received.
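For illustration, here is a minimal sketch of reading those checkpoint details back with the Python storage SDK. It assumes the blob layout used by the newer blob-based checkpoint store, where each partition's checkpoint is a small blob whose metadata carries the offset and sequence number (the older EventProcessorHost keeps similar information as JSON in the blob content); the connection string and container name are placeholders:

from azure.storage.blob import ContainerClient

# Placeholders: the storage account linked to the event hub and the
# container configured as the checkpoint store.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="<checkpoint-container>")

# With the blob-based checkpoint store the blobs are typically named
# <namespace>/<eventhub>/<consumer-group>/checkpoint/<partition-id>.
for blob in container.list_blobs(include=["metadata"]):
    if "/checkpoint/" in blob.name:
        metadata = blob.metadata or {}
        print(blob.name, "offset:", metadata.get("offset"),
              "sequencenumber:", metadata.get("sequencenumber"))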

Related

Azure Eventhub Consumer

Why do we need a blob container on an Azure storage account for an Event Hub consumer client (I'm using Python)? Why can't we consume the messages from the Event Hub (topics in Kafka terminology) directly, like we do in Kafka, or can it be done in some other way?
I'm following the official Azure documentation linked below:
https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-python-get-started-send
You are consuming the messages directly from the event hub. The storage account is not in any way used as an intermediate step or something like that. Instead, the storage account is used for checkpointing:
Checkpointing is a process by which readers mark or commit their position within a partition event sequence. Checkpointing is the responsibility of the consumer and occurs on a per-partition basis within a consumer group. This responsibility means that for each consumer group, each partition reader must keep track of its current position in the event stream, and can inform the service when it considers the data stream complete.
If a reader disconnects from a partition, when it reconnects it begins reading at the checkpoint that was previously submitted by the last reader of that partition in that consumer group. When the reader connects, it passes the offset to the event hub to specify the location at which to start reading. In this way, you can use checkpointing to both mark events as "complete" by downstream applications, and to provide resiliency if a failover between readers running on different machines occurs. It's possible to return to older data by specifying a lower offset from this checkpointing process. Through this mechanism, checkpointing enables both failover resiliency and event stream replay.
So summarized: the storage account is used to store information about the readers and their position within a partition.
You can write your own custom checkpoint storage implementation, see this question: Is there a way to store the azure Eventhub checkpoint to a remote bucket such as Google cloud bucket?
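For reference, here is a minimal sketch of this checkpointing setup with the Python SDK (azure-eventhub plus the blob checkpoint store extension); the connection strings, hub name, and container name are placeholders:

from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# The storage container only holds checkpoint/ownership blobs; events still
# come straight from the event hub.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", container_name="<checkpoint-container>")

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<eventhub-name>",
    checkpoint_store=checkpoint_store)

def on_event(partition_context, event):
    print("Received:", event.body_as_str())
    # Commit the position so a restarted reader resumes after this event.
    partition_context.update_checkpoint(event)

with client:
    # "-1" means "from the beginning" when no checkpoint exists yet.
    client.receive(on_event=on_event, starting_position="-1")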

Is it possible to configure Azure Event Hub to retain a message if an Azure Function fails to process it?

I have an Azure Function that listens for messages in an Event Hub. The function takes messages from the Event Hub, processes them, and passes them to another Hub. At this point the messages are removed from the Event Hub.
If the Function fails processing the message for whatever reason, is it possible to tell the Event Hub to not remove the message, and to try to deliver it to the Function again at some point in the future?
I understand that the Event Hubs have a maximum retention period of 7 days. I would like for the Event Hub & Function to continue trying during that period.
Readers never "remove" messages from an Event Hub. In this respect Event Hubs differ from Service Bus topics and queues.
Event Hubs rely on clients to maintain their own bookmarks for each partition. The high-level API EventProcessorHost does this for you:
The EventProcessorHost class also implements an Azure storage-based checkpointing mechanism. This mechanism stores the offset on a per partition basis, so that each consumer can determine what the last checkpoint from the previous consumer was.
But the lower-level EventHubReceiver exposes the StartingSequenceNumber property for you to control this explicitly.
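That answer refers to the older .NET client; as a rough equivalent, the current Python SDK lets you pass an explicit starting position per partition. In this sketch the hub name, partition id, and tracked sequence number are hypothetical:

from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<eventhub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<eventhub-name>")

def on_event(partition_context, event):
    print(event.sequence_number, event.body_as_str())

last_processed_sequence_number = 41  # hypothetical value tracked by your own code

with client:
    # Re-read partition "0" from the position you tracked yourself.
    client.receive(on_event=on_event,
                   partition_id="0",
                   starting_position=last_processed_sequence_number)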
However, a need for guaranteed delivery strongly suggests copying the messages that require it from the Event Hub into a Service Bus topic or queue, or perhaps an Azure SQL Database table, for processing.

Diagnosing failures in Azure Event Grid?

I did not find much in the way of troubleshooting a lost-events scenario in Azure Event Grid.
Hence I am asking this question in relation to the following scenario:
Our code publishes the events to the domain.
The events are delivered to the configured web hook in the subscription.
This works for a while.
The consumer (who owns the web hook endpoint) complains that he is not receiving some events but most are coming through.
We look in the configured dead-letter queue and find that there are no events. It has been more than a day and hence all retries are already exhausted.
Hence we assume that all events are being delivered because there are no failed delivery events in the metrics.
We also make sure that we indeed submitted these mysterious events to the grid.
But the consumer insists that the problem is real and demonstrates that there is nothing wrong on his side.
Now we need to figure out if some of these events are being swallowed by the event grid.
How do I go about troubleshooting this scenario?
The current version of AEG is not yet integrated with the Diagnostic settings feature, which would help greatly with streaming the metrics and logs.
For your scenario, which is based on Event Domains (still in public preview, see limits), the Azure Monitor REST API can help: it lets you see all metrics for your specific Event Domain.
The valid metrics are:
PublishSuccessCount, PublishFailCount, PublishSuccessLatencyInMs, MatchedEventCount, DeliveryAttemptFailCount, DeliverySuccessCount, DestinationProcessingDurationInMs, DroppedEventCount, DeadLetteredCount
The following example is a REST GET request to obtain all metric values for your event domain over a specific timespan and interval:
https://management.azure.com/subscriptions/{mySubId}/resourceGroups/{myRG}/providers/Microsoft.EventGrid/domains/{myDomain}/providers/Microsoft.Insights/metrics?api-version=2018-01-01&interval=PT1H&aggregation=count,total&timespan=2019-02-06T07:58:12Z/2019-02-07T08:58:12Z&metricnames=PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
Based on the response values, you can see how AEG behaved on the publisher side and on event delivery to the subscriber. For production, I recommend polling these metrics from AEG and pushing them to an Event Hub for stream analysis, alerting, etc. Depending on the query parameters (timespan, interval, etc.), this can be close to real time. Once Diagnostic settings are supported by AEG, this polling and publishing of metrics becomes obsolete, and only a small modification of the analyzing stream job is needed.
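As a hedged sketch of that polling technique, the same metrics GET request can be issued from code; the subscription id, resource group, domain name, and bearer token are placeholders (the token could be obtained via azure-identity, for example):

import requests
from datetime import datetime, timedelta

subscription_id = "<mySubId>"
resource_group = "<myRG>"
domain = "<myDomain>"
token = "<bearer-token>"  # e.g. from DefaultAzureCredential().get_token("https://management.azure.com/.default")

end = datetime.utcnow()
start = end - timedelta(hours=24)

url = (f"https://management.azure.com/subscriptions/{subscription_id}"
       f"/resourceGroups/{resource_group}/providers/Microsoft.EventGrid/domains/{domain}"
       f"/providers/Microsoft.Insights/metrics")

params = {
    "api-version": "2018-01-01",
    "interval": "PT1H",
    "aggregation": "count,total",
    "timespan": f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}",
    "metricnames": "PublishSuccessCount,PublishFailCount,MatchedEventCount,"
                   "DeliveryAttemptFailCount,DeliverySuccessCount,"
                   "DroppedEventCount,DeadLetteredCount",
}

response = requests.get(url, params=params,
                        headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
for metric in response.json().get("value", []):
    print(metric["name"]["value"], metric["timeseries"])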
The other point is to extend your eventing model with an auditing part. I recommend the following:
Add a domain-scope subscription to capture all events in the event domain and push them to the Event Hub for streaming purposes. Note that any event published within that event domain should then appear in this stream pipeline.
Add a storage subscription for dead-letter messages and push them to the same Event Hub for streaming purposes.
(optional) Add the Diagnostic settings (some metrics) of the dead-letter storage to the same Event Hub for streaming purposes. Note that a dead-letter message is dropped after 4 hours of trying to store it in the blob container; there is no log message for that failed process, only a metric counter.
On the consumer side, I recommend that each subscriber create a log message (AEG headers + event message) for auditing and troubleshooting purposes. It can be stored in a blob container, or locally and then uploaded, etc. The point is that this reference can be very useful for the analyzing stream job to quickly figure out where the problem is.
In addition, your publisher should periodically (for instance once per hour) probe the event domain endpoint by sending a probe event message to a dedicated probe topic for test purposes. The event subscription for that probe topic should configure a dead-lettering option. The subscriber webhook handler should always fail the probe with error code HttpStatusCode.BadRequest, so that no retry takes place. Note that there is a delay of about 300 seconds before the dead-letter message is stored in the storage; in other words, roughly 5 minutes after the probe event, the dead-lettered message should be in the stream pipeline. This probe scenario exercises the functionality of AEG from both the publisher and the delivery point of view.
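A minimal sketch of such a probe webhook handler follows (Flask is used purely for illustration): it completes the Event Grid subscription validation handshake and then rejects every probe event with 400 Bad Request so the event is dead-lettered rather than retried.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/probe", methods=["POST"])
def probe_handler():
    for event in request.get_json():
        # Complete the subscription validation handshake when the subscription is created.
        if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
            return jsonify({"validationResponse": event["data"]["validationCode"]})
    # Always fail probe events with 400 so Event Grid dead-letters them instead of retrying.
    return "probe event rejected", 400

if __name__ == "__main__":
    app.run(port=8080)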

When to use EventGrid and when to use ServiceBus / Storage Queue?

In Azure, we have two separate messaging technologies and it's not very well documented when to use which. While Event Grid is really cool, I did not come across guidance on when to use Event Grid (scenarios) vs. the Storage/Service Bus queues. Can someone help?
E.g. if I have the following scenario :
A status of a flag changes and based on that, I want to trigger an algorithm that would do recalculations, few inserts/updates etc. in the database.
For implementing this - I can either use EventGrid or Storage Queue. How do we figure what to use in such scenario? I was looking for some kind of guidance.
Basically, Azure Event Grid handles events and Azure Service Bus handles messages. A message is raw data produced by a service to be consumed or stored. Events are also messages (lightweight), but they don't generally convey a publisher intent, other than to inform.
1) If the purpose is just to store the information, Service Bus can be used.
2) If the information received is used to trigger another service, Azure Event Grid can be used.
Find more info here
https://learn.microsoft.com/en-us/azure/event-grid/compare-messaging-services
https://azure.microsoft.com/en-us/blog/events-data-points-and-messages-choosing-the-right-azure-messaging-service-for-your-data/
Events are like notifications from a service to inform the world that something happened in the domain of the publisher (similar to an email notification). There is no expectation from the publisher that any action be taken. A message is a command you send to a specific receiver with the expectation that the message will be processed (like an asynchronous POST request).
Events work in a pub/sub pattern and multiple subscribers can be configured for the events. The service that needs to react to an event gets notified by Event Grid when an event occurs (an HTTP call from Event Grid to the receiver). The event remains in Event Grid until deletion (cleanup) and there is no guarantee of keeping the original order (no FIFO).
On the other hand, messages are added to a queue and are deleted once the "message processor" is done with them. The messages in the queue keep the original order (FIFO). The message processor has to pull messages from the queue.
In your scenario, you could use a combination of both. Service A sends a "StatusChanged" event, you configure a subscription to that event that sends a message to a queue, and then your logic processes that message. This ends up as a fully asynchronous communication pattern, which is ideal for scenarios where your processor is down or too busy: incoming messages simply accumulate in the queue and are eventually processed once the service is back up and running, without affecting the original service that sent the "StatusChanged" event.
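As a hedged sketch of that combination, an Azure Function with an Event Grid trigger (binding configuration omitted) could forward the "StatusChanged" event into a Storage queue for the processor to pull from; the queue name and connection string below are placeholders:

import json
import azure.functions as func
from azure.storage.queue import QueueClient

def main(event: func.EventGridEvent):
    # Forward the event into a queue so the processor can work through it
    # asynchronously, in order, at its own pace.
    queue = QueueClient.from_connection_string(
        "<storage-connection-string>", queue_name="status-changed")
    queue.send_message(json.dumps({
        "id": event.id,
        "subject": event.subject,
        "data": event.get_json(),
    }))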

How to ensure that message transmission is reliable - ASB

I am using Azure Service Bus to implement message communication between separate bounded contexts. I am curious about what techniques people use to ensure that domain events raised in one bc are guaranteed to be received by another consuming bc.
For example, say the "orders" bc raises an "orderPlaced" event, how can I ensure that this event is received by a "shipping" bc? I understand that two-phase commit is not advisable in the cloud, so what is the alternative? How do I mitigate against the order being placed but the message failing to be sent to the service bus in the event of a network failure?
Thoughts would be welcomed. Thanks.
If you send a BrokeredMessage to a Service Bus Queue and receive an acknowledgement, the message has been successfully stored in the queue. You don't have to worry about the message dying in transit due to a network error after you've been told it is persisted.
What you can worry about is a Service Bus Queue falling offline for a period of time and being unavailable. During an outage, your orderPlaced message wouldn't be able to get into the queue in the first place, and your shipping logic wouldn't be able to receive orders that are already persisted in your queue.
Note that Service Bus Queue outages are transient and the Queue recovers and returns to normal service. At that time, your shipping app could drain the queue of existing messages, and your ordering app could once again insert orderPlaced messages. I don't actually recall the last time I've seen one of my Service Bus Queues go down - it's a rare event.
If you are super-concerned about never ever ever EVER dropping a message, look at paired namespaces. Basically, this allows for failover to standby queues so that you can insert messages while your primary is down. Automatic detection checks to see when your primary queue comes back online. And a siphon process sucks messages that were inserted into the failover queue during the outage back into the primary once the primary comes back online.
Edit: When sending, there is still the chance that even though you had a valid Service Bus Queue connection in your QueueClient or MessagingFactory, the underlying Service Bus Queue just went down like a glass-jawed prizefighter. The vast majority of the time, these errors are transient. To handle them, set the RetryPolicy property of your MessagingFactory or QueueClient. Off the top of my head, I think that the only policy currently available is the RetryExponential policy. This will perform a back-off that will retry sending the message until the specified number of attempts are exhausted. This is the easy-peasy way to handle transient errors that pop up in your Service Bus Queue connection.
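That answer refers to the older .NET MessagingFactory API; as a rough equivalent with the current Python SDK, the send call returns only after the broker has acknowledged (persisted) the message, and transient faults are retried according to client-level retry settings. The queue name and connection string are placeholders and the retry values are illustrative:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

# retry_total / retry_backoff_factor configure the built-in retry policy
# for transient failures.
client = ServiceBusClient.from_connection_string(
    "<servicebus-connection-string>",
    retry_total=5,
    retry_backoff_factor=0.8)

with client:
    sender = client.get_queue_sender(queue_name="orders")
    with sender:
        # send_messages returns only once the broker has acknowledged the
        # message, i.e. it has been durably stored in the queue.
        sender.send_messages(ServiceBusMessage('{"event": "orderPlaced", "orderId": 42}'))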

Resources