Dead-lettering in Azure Event Grid does not work

As per the Microsoft documentation, Event Grid does not retry delivery for specific errors such as 400, 413, and 401. In such cases, Event Grid will either dead-letter the event or drop it if dead-lettering isn't configured.
So I enabled dead-lettering on my webhook (receiver endpoint) subscription and produced a 400 error, but the event is still not captured in the dead-letter container.
Is there something I am missing?

Please have a look at the Event Grid message delivery and retry - Dead-letter events documentation and see if your configuration is correct.
By default, Event Grid doesn't turn on dead-lettering. To enable it, you must specify a storage account to hold undelivered events when creating the event subscription. You pull events from this storage account to resolve deliveries.
[...]
Before setting the dead-letter location, you must have a storage account with a container. You provide the endpoint for this container when creating the event subscription. The endpoint is in the format of: /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Storage/storageAccounts/<storage-name>/blobServices/default/containers/<container-name>
Also, take into account that
There's a five-minute delay between the last attempt to deliver an event and when it's delivered to the dead-letter location. This delay is intended to reduce the number of Blob storage operations. If the dead-letter location is unavailable for four hours, the event is dropped.
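For reference, here is a minimal sketch of creating an event subscription with a dead-letter destination using the azure-mgmt-eventgrid Python SDK. The resource names, scope, and webhook URL below are placeholders, and the exact method name can differ between SDK versions.

# Minimal sketch: create an event subscription with a dead-letter destination.
# Assumes the azure-mgmt-eventgrid and azure-identity packages; all names below
# (subscription, resource group, storage account, container, webhook URL) are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventgrid import EventGridManagementClient
from azure.mgmt.eventgrid.models import (
    EventSubscription,
    WebHookEventSubscriptionDestination,
    StorageBlobDeadLetterDestination,
)

subscription_id = "<subscription-id>"
client = EventGridManagementClient(DefaultAzureCredential(), subscription_id)

# The scope is the resource you subscribe to (topic, domain, storage account, ...).
scope = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>"
         "/providers/Microsoft.EventGrid/topics/<topic-name>")

event_subscription = EventSubscription(
    destination=WebHookEventSubscriptionDestination(
        endpoint_url="https://example.com/api/events"),
    dead_letter_destination=StorageBlobDeadLetterDestination(
        # Resource ID of the storage account that holds the dead-letter container.
        resource_id=("/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>"
                     "/providers/Microsoft.Storage/storageAccounts/<storage-name>"),
        blob_container_name="<container-name>",
    ),
)

# Newer SDK versions use begin_create_or_update; older ones use create_or_update.
poller = client.event_subscriptions.begin_create_or_update(
    scope, "my-dead-letter-subscription", event_subscription)
print(poller.result().provisioning_state)

The important part is that the storage account and container must already exist before you set the dead-letter location.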

Related

Getting event notifications for Azure Blob objects created in the past

I plan on using Azure Event Grid to get notifications when Blob objects are created. If my service goes down and comes back up, will I get notifications of Blob objects that were created while my service was down? I would prefer not to get notifications of Blob objects that were created while my service was down.
The Retry schedule and duration documentation describes the types of destination endpoints and errors for which retry doesn't happen.
As you can see in the failure-codes table, for a 503 Service Unavailable response the message delivery is retried after 30 seconds or more.
It would be nice if each subscription had the capability to control its retry policy based on the status code, but for now I can see the following workaround:
Use Azure API Management (APIM) to return a proper status code back to AEG; in other words, APIM maps the 503 status code from your service to a 403 status code back to AEG. Note that this APIM mediator will also handle the webhook validation for the AEG subscription.
I also recommend enabling the dead-lettering feature to audit messages that weren't delivered while your service was down.
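APIM policies themselves are XML, but the idea behind the workaround is simply status-code mapping. As an illustration only (not APIM itself), here is a small standard-library Python relay that forwards the delivery to the real handler and reports a downstream 503 back to AEG as 403; the downstream URL and listening port are hypothetical.

# Illustration of the status-code mapping idea behind the APIM workaround.
# Forwards Event Grid deliveries to the real handler and reports a downstream
# 503 back to AEG as 403, so AEG stops retrying and dead-letters the event.
# The downstream URL and listening port are hypothetical.
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DOWNSTREAM = "https://my-real-handler.example.com/api/events"  # hypothetical

class MappingRelay(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            DOWNSTREAM, data=body, headers={"Content-Type": "application/json"})
        try:
            with urllib.request.urlopen(req) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code
        except urllib.error.URLError:
            status = 503
        # Map "service unavailable" to a non-retryable code for AEG.
        self.send_response(403 if status == 503 else status)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MappingRelay).serve_forever()

A real relay would also have to answer the subscription-validation handshake, which is what the APIM mediator mentioned above handles for you.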

Azure storage account queue get new messages arrived (Ideally Logic App)

I am trying to design a Logic App that posts to a Slack channel when a message arrives in or is received by a poison queue (located in an Azure Storage account).
I know how to do the Slack part, and I can get the trigger to fire on queues located in my Service Bus resource because Azure Logic Apps includes triggers for that.
However, I don't see Azure Logic App triggers for queues in the storage account that fire when a message arrives or is received in a queue.
My question is: Is there a trigger or other process that I can tap into when messages arrive in those storage account queues? If not, what is the best way to get this data (the message and its content) when it arrives in my poison queue in the storage account, and then push that info into the Slack channel?
Is there a trigger (for a Logic App) on the "storage account" queues that fires when a message is received? (I only see queue length, and scanning all messages in the queue.)
I have found triggers for this on the “service bus” queues.
These are the trigger options I have found for "service bus queues": https://i.stack.imgur.com/nS0nu.png
These are the trigger options I have found for "storage account queues": https://i.stack.imgur.com/AqxB4.png
Ideally, I just want to configure a simple Logic App that fires when a queue message arrives in the poison queue in the storage account and passes the info to Slack. I know how to push info into Slack via a Logic App from a Service Bus queue, but I haven't been able to figure out how to set up the action to pass the message to Slack when a message is received in a storage queue. I can set it up as a job and grab the queue length and all items in the queue, but I can't fire on arrival because I can't find a trigger for that.
If I can't figure this out, or if it can't be done, I will probably just initiate the Logic App from an Azure Function with an Azure Queue storage trigger; this seems to be how Microsoft wants us to do it anyway.
https://learn.microsoft.com/en-us/azure/app-service/webjobs-sdk-how-to
ctrl+f "Binding reference information" then see sub heading "Usage: The types you can bind to and information about how the binding works. For example: polling algorithm, poison queue processing."
thanks again for the input folks, appreciate ya!
Based on my test, it works fine on my side with the "When a specified number of messages are in a given queue" trigger. When I add a message to my queue named myqueue-items-poison, the Logic App is triggered successfully.
So please check whether you have configured the trigger correctly.
Also check whether you set the trigger interval too long; I set it to 1 minute.

Diagnosing failures in azure event grid?

I did not find much in the way of troubleshooting a lost-events scenario in Azure Event Grid.
Hence I am asking question in relation to following scenario:
Our code publishes the events to the domain.
The events are delivered to the configured web hook in the subscription.
This works for a while.
The consumer (who owns the web hook endpoint) complains that he is not receiving some events but most are coming through.
We look in the configured dead-letter queue and find that there are no events. It has been more than a day and hence all retries are already exhausted.
Hence we assume that all events are being delivered because there are no failed delivery events in the metrics.
We also make sure that we indeed submitted these mysterious events to the grid.
But consumer insists about the problem and proves that there is nothing wrong with his side.
Now we need to figure out if some of these events are being swallowed by the event grid.
How do I go about troubleshooting this scenario?
The current version of AEG is not integrated with the Diagnostic settings feature, which would help a lot with streaming metrics and logs.
For your scenario, which is based on Event Domains (still in public preview, see limits), the Azure Monitor REST API can help you see all the metrics of your specific Event Domain.
The valid metrics are:
PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
The following example is a REST GET request to obtain all metric values within your event domain for a specific timespan and interval:
https://management.azure.com/subscriptions/{mySubId}/resourceGroups/{myRG}/providers/Microsoft.EventGrid/domains/{myDomain}/providers/Microsoft.Insights/metrics?api-version=2018-01-01&interval=PT1H&aggregation=count,total&timespan=2019-02-06T07:58:12Z/2019-02-07T08:58:12Z&metricnames=PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
Based on the response values, you can see metrics for AEG behavior on the publisher side and for event delivery to the subscriber. For your production version, I recommend using a polling technique to obtain all metrics from AEG and push them to an Event Hub for streaming analysis, alerting, etc. Depending on the query parameters (such as timespan, interval, etc.), this can be close to real time. Once Diagnostic settings are supported by AEG, this polling and publishing of metrics becomes obsolete, and the analyzing stream job can continue with only a small modification.
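A minimal sketch of that polling approach in Python (assuming the requests and azure-identity packages; the subscription, resource group, and domain names are placeholders):

# Minimal sketch of polling the Azure Monitor REST API for the Event Domain
# metrics listed above. The subscription, resource group, and domain names
# are placeholders; requests and azure-identity are assumed to be installed.
import datetime
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, DOMAIN = "<mySubId>", "<myRG>", "<myDomain>"
METRICS = ("PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,"
           "MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,"
           "DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount")

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=24)

url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.EventGrid/domains/{DOMAIN}"
       f"/providers/Microsoft.Insights/metrics")
params = {
    "api-version": "2018-01-01",
    "interval": "PT1H",
    "aggregation": "count,total",
    "timespan": f"{start:%Y-%m-%dT%H:%M:%SZ}/{end:%Y-%m-%dT%H:%M:%SZ}",
    "metricnames": METRICS,
}
resp = requests.get(url, params=params,
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for metric in resp.json().get("value", []):
    print(metric["name"]["value"], metric.get("timeseries", []))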
The other point is to extend your eventing model with an auditing part. I recommend the following:
Add a domain-scope subscription to capture all events in the event domain and push them to the Event Hub for streaming purposes. Note that every event published within that event domain should end up in this published stream pipeline.
Add a storage subscription for dead-letter messages and push them to the same Event Hub for streaming purposes.
(optional) Add the Diagnostic settings (some metrics) of the dead-letter storage to the same Event Hub for streaming purposes. Note that a dead-letter message is dropped after 4 hours of trying to store it in the blob container. There is no log message for that failed process, only a metric counter.
On the consumer side, I recommend that each subscriber create a log message (AEG headers + event message) for auditing and troubleshooting purposes. It can be stored in a blob container, or locally and then uploaded, etc. The point is that this reference can be very useful for the analyzing stream job to quickly figure out where the problem is.
In addition to your eventing model, your publisher should periodically (for instance, once per hour) probe the event domain endpoint by sending a probe event message to a probe topic for test purposes. The event subscription for that probe topic should have the dead-lettering option configured. The subscriber webhook handler should always fail with error code HttpStatusCode.BadRequest, so that no retrying takes place. Note that there is a 300-second delay before the dead-letter message is stored in the storage account; in other words, roughly probe event + 5 minutes later, the dead-lettered message should be in the stream pipeline. This probe scenario in your eventing model exercises the functionality of AEG from both the publisher and the delivery point of view.
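A rough sketch of publishing such a probe event to the domain endpoint in the Event Grid schema (the domain endpoint, access key, and probe topic name are placeholders):

# Rough sketch: publish a probe event to an Event Grid domain endpoint in the
# Event Grid schema. The domain endpoint, access key, and probe topic name
# below are placeholders.
import datetime
import json
import urllib.request
import uuid

DOMAIN_ENDPOINT = "https://<my-domain>.<region>-1.eventgrid.azure.net/api/events"
DOMAIN_KEY = "<domain-access-key>"

probe_event = [{
    "id": str(uuid.uuid4()),
    "topic": "probe",  # domain topic reserved for probing
    "subject": "probe/heartbeat",
    "eventType": "Probe.Heartbeat",
    "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
    "data": {"note": "probe event; the subscriber returns 400 so it dead-letters"},
    "dataVersion": "1.0",
}]

req = urllib.request.Request(
    DOMAIN_ENDPOINT,
    data=json.dumps(probe_event).encode(),
    headers={"aeg-sas-key": DOMAIN_KEY, "Content-Type": "application/json"},
)
urllib.request.urlopen(req)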
The above described solution is shown in the following screen snippet:

When to use EventGrid and when to use ServiceBus / Storage Queue?

In Azure, we have two separate messaging technologies and it's not very well documented when to use what? While EventGrid is really cool, I did not come across when to use EventGrid(scenarios) vs the Storage/ServiceBus queue? Can someone help?
E.g. if I have the following scenario :
A status of a flag changes and based on that, I want to trigger an algorithm that would do recalculations, few inserts/updates etc. in the database.
For implementing this - I can either use EventGrid or Storage Queue. How do we figure what to use in such scenario? I was looking for some kind of guidance.
Basically, Azure Event Grid handles events and Azure Service Bus handles messages. A message is raw data produced by a service, to be consumed or stored. Events are also messages (lightweight), but they don't generally convey publisher intent other than to inform.
1) If the purpose is just to store the information, Service Bus can be used.
2) If the information received is used to trigger another service, Azure Event Grid can be used.
Find more info here
https://learn.microsoft.com/en-us/azure/event-grid/compare-messaging-services
https://azure.microsoft.com/en-us/blog/events-data-points-and-messages-choosing-the-right-azure-messaging-service-for-your-data/
Events are like notifications from a service to inform the world that something happened in the domain of the publisher (similar to an email notification). There is no expectation from the publisher that any action will be taken. A message is a command you send to a specific receiver with the expectation that the message will be processed (like an asynchronous POST request).
Events work in a pub/sub pattern, and multiple subscribers can be configured for the events. The service that needs to react to an event gets notified by Event Grid when the event occurs (an HTTP call from Event Grid to the receiver). The event remains in Event Grid until deletion (cleanup), and there is no guarantee of keeping the original order (no FIFO).
On the other hand, messages are added to a queue and deleted once the "message processor" is done with them. The messages in the queue keep the original order (FIFO). The message processor has to pull messages from the queue.
In your scenario, you could use a combination of both. Service A sends a "StatusChanged" event; you configure a subscription to that event that sends a message to a queue, and then have your logic process that message. This ends up as a fully asynchronous communication pattern, which is ideal for scenarios where your processor is down or too busy. The incoming messages simply accumulate in the queue and are eventually processed once the service is back up and running, without affecting the original service that sent the "StatusChanged" event.
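As an illustration of the processor side of that pattern, here is a minimal sketch that pulls from an Azure Storage queue (assuming the azure-storage-queue package; the connection string and queue name are placeholders):

# Minimal sketch of the "message processor" side, pulling from an Azure
# Storage queue. The connection string and queue name are placeholders;
# azure-storage-queue is assumed to be installed.
import json
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",
    queue_name="status-changed",  # placeholder queue fed by the event subscription
)

for msg in queue.receive_messages(visibility_timeout=60):
    # Depending on how the message was enqueued, the content may be plain JSON
    # or Base64-encoded; plain JSON is assumed here.
    event = json.loads(msg.content)
    # ... run the recalculations / inserts / updates here ...
    print("processed", event.get("eventType"))
    # Delete only after successful processing; otherwise the message becomes
    # visible again after the visibility timeout and is retried.
    queue.delete_message(msg)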

Azure Queue - how to implement

The first service adds a message to the queue if the user does not exist in the DB; the second service gets the message from the queue and creates the user. A possible situation: the first service adds 2 create-user messages before the second service gets them. How do I resolve that? As I understand it, there is no way to review the queue...
I use Azure Storage queues
Azure Storage queue messages don't support a peek-lock receive mode; once a message is read, it becomes invisible (for the duration of the visibility timeout). You need to look into Azure Service Bus, as it allows you to control messages one by one, and in order if required.
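As a rough sketch of that with the azure-servicebus package (the connection string, queue name, and the two helper functions are placeholders for your own logic):

# Rough sketch of peek-lock processing with Azure Service Bus
# (azure-servicebus v7 assumed). The connection string, queue name, and the
# two helper functions are placeholders for your own application logic.
import json
from azure.servicebus import ServiceBusClient, ServiceBusReceiveMode

def user_already_exists(email: str) -> bool:
    # Placeholder for your DB lookup.
    return False

def create_user(user: dict) -> None:
    # Placeholder for your DB insert.
    print("creating user", user)

with ServiceBusClient.from_connection_string("<servicebus-connection-string>") as client:
    receiver = client.get_queue_receiver(
        queue_name="create-user",
        receive_mode=ServiceBusReceiveMode.PEEK_LOCK,
    )
    with receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            user = json.loads(str(msg))
            if user_already_exists(user["email"]):
                receiver.complete_message(msg)  # drop the duplicate
            else:
                create_user(user)
                receiver.complete_message(msg)

Service Bus also offers duplicate detection based on the MessageId, which helps if both create-user messages are sent with the same id.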
