How do you maintain idempotency with Azure EventGrid webhooks? - azure

I have configured an EventGrid subscription to initiate a web hook call for events in a resource group when a resource is created.
The web hook call is successfully handled, and I return a 200 OK. To maintain idempotency, I store all events that have occurred in a webhook_events table with the id of the event. Any new events are checked to see if they exist in that table by their id.
Azure EventGrid attempts to remove the event from the retry queue after returning a 200 OK. No matter how quickly I respond with a 200 OK, EventGrid reliably retries sending.
I am receiving the same event multiple times (as I said, EventGrid always retries, as it cannot remove the event from the retry queue fast enough). This however is not the focus of my question; rather, the issue exists in the fact that each of these retries presents me with a different id for the event. This means that I cannot logically determine the uniqueness of an event, and my application code is not being executed in an idempotent fashion.
How can I maintain idempotency between my application and Azure despite there being no unique identifier between event retries?

It's the way EventGrid is implemented if you look at the documentation
If the endpoint responds within 3 minutes, Event Grid will attempt to
remove the event from the retry queue on a best effort basis but
duplicates may still be received.
you can use back-end code to clean up logs and stored data, using event and message IDs to identify duplicates.

The id field is in fact unique per event and kept identical between retries & therefore can be used for dedupe.
What you're running into is a specific issue with some events generated by Azure Resource Manager (ARM). Specifically, the two events you are seeing are in fact distinct events, not duplicates, generated by ARM at different stages of the creative flow for some resource types.
ARM is acting as the API front door to the various Azure services and emits a set of events for that are generalized and often to get the details of what has occurred, you need to look in the data payload. For example, ARM will emit a success event for each 2xx status code it receives from an Azure service, so a 202 accepted and a 201 created can result in two events being emitted and the only way to see the difference would be in the data payload.
This is a known pain point, and we are working to emit more high-fidelity events that will be clearer and easier to react to in these scenarios. The ideal state will be a change-feed of sorts for the Azure control plane.

Related

Azure Service Bus Topic Subscriber receiving messages not in order

I am using azure service bus topic and I have enable session for it's subscription.
With in the my logic app i am inserting data using sql transaction which are coming from topic
I am using topic Subscription(peek-lock) and
in subscriber level concurrency is set to default as follows
According to my Understanding, my logic app(subscriber) should read ALL the messages and have to processed in FIFO
my logic app is like
which means it should insert data to the table in ordered manner
however when i checked in the trigger log it shows correct order but in database level you can see the order is not happening
Message ordering is a delicate business. You can have either message ordering or concurrent processing, but not both. The moment you make message ordering a must, you lose the ability to have concurrent processing. This is correct for both, Azure Service Bus Sessions and Logic Apps concurrency control. You could process multiple sessions, but each session would be still restricted to a single processor. Here's a post about it.

Diagnosing failures in azure event grid?

I did not find much in the way of troubleshooting events lost scenario in the azure event grid.
Hence I am asking question in relation to following scenario:
Our code publishes the events to the domain.
The events are delivered to the configured web hook in the subscription.
This works for a while.
The consumer (who owns the web hook endpoint) complains that he is not receiving some events but most are coming through.
We look in the configured dead-letter queue and find that there are no events. It has been more than a day and hence all retries are already exhausted.
Hence we assume that all events are being delivered because there are no failed delivery events in the metrics.
We also make sure that we indeed submitted these mysterious events to the grid.
But consumer insists about the problem and proves that there is nothing wrong with his side.
Now we need to figure out if some of these events are being swallowed by the event grid.
How do I go about troubleshooting this scenario?
The current version of the AEG is not integrated for Diagnostic settings feature which can be help very well for streaming the metrics and logs.
For your scenario which is based on the Event Domains (still in the public preview, see limits) can help an Azure Monitoring REST API, to see all metrics in the specific your Event Domain.
The valid metrics are:
PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
The following example is a REST GET request to obtain all metrics values within your event domain for specific timespan and interval:
https://management.azure.com/subscriptions/{mySubId}/resourceGroups/{myRG}/providers/Microsoft.EventGrid/domains/{myDomain}/providers/Microsoft.Insights/metrics?api-version=2018-01-01&interval=PT1H&aggregation=count,total&timespan=2019-02-06T07:58:12Z/2019-02-07T08:58:12Z&metricnames=PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
Based on the response values, you can see metrics of the AEG behavior from the publisher side and the event delivery to the subscriber. For your production version, I do recommend to use a polling technique to obtain all metrics from AEG and pushing them to the Event Hub for a streaming analyzing, alerting, etc. Based on the query parameters (such as timespan, interval, etc.), it can be close to the real-time. When the Diagnostic settings will be supported by AEG, than this polling and publishing all metrics is obsoleted and small modification at the analyzing stream job can be continued.
The other point is to extend your eventing model for auditing part. I do recommend the following:
Add a domain scope subscription to capture all events in the event domain and push them to the Event Hub for streaming purposes. Note, that any published event within that event domain should be in this published stream pipeline.
Add a storage subscription for dead-letter messages and push them to the same Event Hub for streaming purposes.
(optional) Add the Diagnostic settings (some metrics) of the dead-letter storage to the same Event Hub for streaming purposes. Note, that the dead-letter message is dropped after 4 hours trying to store it in the blob container. There is no any log message for that failed process, just only metric counter.
For the customer side, I do recommend that each subscriber will create a log message (aeg headers + event message) for auditing and troubleshooting purposes. It should be stored in the blob container or locally and then uploaded, etc. The point is, that this reference can be very useful for analyzing stream job to quickly figure out where is the problem.
In addition to your eventing model, your publisher should periodically (for instance once per hour) probes the event domain endpoint and also should send a probe event message to the probe topic for test purposes. The event subscription for that probe topic will configure a deadlettering option. The subscriber webhook handler should be always failed with a error code = HttpStatusCode.BadRequest such as no retrying action. Note, that there is a 300 seconds delay time, when the deadletter message will be stored in the storage. In other words, after probe event + 5 minutes, the deadlettering message should be in the stream pipeline. This probe scenario in your eventing model will probe a functionality of the AEG from the publisher and delivery point of the view.
The above described solution is shown in the following screen snippet:

When to use EventGrid and when to use ServiceBus / Storage Queue?

In Azure, we have two separate messaging technologies and it's not very well documented when to use what? While EventGrid is really cool, I did not come across when to use EventGrid(scenarios) vs the Storage/ServiceBus queue? Can someone help?
E.g. if I have the following scenario :
A status of a flag changes and based on that, I want to trigger an algorithm that would do recalculations, few inserts/updates etc. in the database.
For implementing this - I can either use EventGrid or Storage Queue. How do we figure what to use in such scenario? I was looking for some kind of guidance.
Basically, Azure Event Grid handles events and Azure ServiceBus handles messages.A message is raw data produced by a service to be consumed or stored. Events are also messages (lightweigth), but they don’t generally convey a publisher intent, other than to inform.
1) If the purpose is to just to store the information ServiceBus can be used.
2) If the information received is used to trigger another service Azure Event Grid can be used.
Find more info here
https://learn.microsoft.com/en-us/azure/event-grid/compare-messaging-services
https://azure.microsoft.com/en-us/blog/events-data-points-and-messages-choosing-the-right-azure-messaging-service-for-your-data/
Events are like notifications from a service to inform the world that something happened in the domain of the publisher (similar to an email notification). There is no expectations from the publisher to have any actions taken. A message is a command you send to a specific receiver with the expectation of the message to be processed (like an asynchronous post request).
Events will work in pub/sub pattern and multiple subscribers could be configured to the events. The service that needs to react to an event will get notified by the event grid when an event occurs (http call from event grid to the receiver). The event will remain in the event grid until deletion (cleanup) and there is no garantie of keeping the original order (no FIFO).
In the other hand, messages will be added to a queue and will be deleted once the “message processor” is done with it. The messages in the queue will keep the original order (FIFO). The message processor has to pull messages from the queue.
In your scenario, you could use a combination of both. Service A sends an event “StatusChanged”, then you can configure a subscription to that event and send a message to a queue, then have your logic to process that message. This will end up with a fully async communication pattern. This is ideal to support scenarios where you processor is down or too busy. The incoming messages will simply get accumulated in the queue and eventually being processed once the service is back up and running. And without affecting the original service that sent the “StatusChanged” event..

Does Azure client .OnMessage generate billable request for empty queues?

You can subscribe to asynchronous updates from Azure topics and queues by using SubscriptionClient/QueueClient's .OnMessage call which will presumably create a separate thread polling the topic/queue with default settings and calling a defined callback if it receives anything.
Azure website says that receiving a message is a billable action, which is understandable. However, it isn't clear enough if each those poll requests are considered billable even when they do not return anything, i.e. the queue in question has no pending messages.
Based on the Azure Service Bus Pricing FAQ - the answer to your question is yes
In general, management operations and “control messages,” such as
completes and deferrals, are not counted as billable messages. There
are two exceptions:
Null messages delivered by the Service Bus in
response to requests against an empty queue, subscription, or message
buffer, are also billable. Thus, applications that poll against
Service Bus entities will effectively be charged one message per poll.
Setting and getting state on a MessageSession will also result in
billable messages, using the same message size-based calculation
described above.
Given the price is $0.01 per 10,000 messages, I don't think you should worry too much about that.

Azure Loosely Coupled / Scalable

I have been struggling with this concept for a while. I am attempting to come up with a loosely coupled Azure component design that is completely scalable using Queues and worker roles, which dequeue and process the items. I can scale the worker roles at will, and publishing to the queue is never an issue. So far so good, but, it seems that the only real world model this could work in is fire and forget. It would work fantastic for logging and other one way operations, but let's say I want to up load a file using queues/worker roles, save it to blob, then get a response back once it is complete. Or should this type of model not be used for online apps? What is the best way to send a notification back once an operation is completed? Do I create a response Q, then (somehow) retrieve the associated response? Any help is greatly appreciated!!!!!
I usually do a polling model.
Client (usually a browser) sends a request to do some work.
Front-end (web role) enqueues the work and replies with an ID.
Back-end (worker role) processes the queue and stores the result in a blob or table entity named .
Client polls ("Is done yet?") at some interval.
Front-end checks to see if the blob or table entity is there and replies accordingly.
See http://blog.smarx.com/posts/web-page-image-capture-in-windows-azure for one example of this pattern.
you could also look into the servicebus appfabric instead of using queues. with the servicebus you can send messages, use queues etc all from the servicebus appfabric. you could go to publish and subscribe instead of polling then!

Resources