Is it somehow possible to access the messages that Azure EventGrid has failed to deliver and are waiting to be retried? Alternatively, is it possible to get the count of messages currently waiting or other metrics about them?
With default retry policy EventGrid tries to deliver messages for 24 hours. It would be good to get some information that there are messages waiting for delivery. Eventually they will be available in the dead-letter storage, but 24 hours is a long time waiting.
I tried looking at portal, but could not find a clue where those messages might be kept. Nothing relevant in the metrics either. I also tried to search for an appropriate command in Azure CLI as well as in Powershell Az module. Couldn't find anything there either. Any hints on where to search next?
After having a better look at the metrics, I found out that it actually should be possible to get the number of events currently waiting for delivery. EventGrid Topic metrics include Matched Events (Sum), Delivered Events (Sum), Dead Letter Events (Sum) and Delivery Failed Events (Sum). The last one shows straight away that there are events that have not been delivered successfully. Then, a subtraction of the sum of Delivered Events (Sum) and Dead Letter Events (Sum) from Matched Events (Sum) should give the number of events currently waiting for re-trial.
There is also an option to add Diagnostics settings to an EventGrid Topic. That will initiate a collection of delivery failure data and start storing it to e.g. Log Analytics workspace, where it will be available in a table named AegDeliveryFailureLogs. Those logs will show each delivery failure identifying the topic, event type and subject. I have not yet figured out how to get the unique id of the event message.
Related
I have a request to implement a dashboard with the information about which message in Azure Service Bus queue was completed when (with some info about message parameters). Unfortunately we do not have an access to the reciever's code and cannot change the code to log the time of the message delivery. So, we need to subscribe somehow to a moment when reciever takes away the message.
I have already investigated Azure portal API in order to find something, but there is no such a possibility, I have tried to find something on stackoverflow and in Google, but no results.
There is 1 idea: use 2 queues and azure function between them. Put all messages to the first queue, azure function recieves a message, logs the info about the message and puts it to the second queue and waits until other services takes the message away from the second queue. Second queue will always have only 1 message and this way we will be able to understand what message was for sure delivered and when.
However what I do not like is the second message queue executes not the role of the real queue (it means something is wrong here and I need to use something else), performance of such a system can be not high enough...
Any help is appreciated (articles, videos, ideas). Thank you.
Pardon if my terminology is a little off; I'm new to this.
I have created an Azure Event Grid subscription which triggers an event whenever I upload a file to blob storage. I have an Azure Function which responds to this event. I've got this all working finally, but I have a slough of left-over messages from previous (bad) uploads that are failing periodically (as viewed from the Logs window in the Azure portal for the associated Azure Function). It's as if they're stored in a queue somewhere and retried periodically, though I'm not sure if that's how it works.
In any case, what I want to be able to do is purge any in-transit or queued events, but I don't know where to find them to do this. As far as I know they're just floating about in the ether.
How can I purge these events so they don't keep triggering my Azure Function at random times?
Event Grid will automatically retry delivery of the message if anything other then a 200 or 202 (OK/Accepted) is returned when a delivery attempt is made. By default it will try again for 24 hours and it uses an exponential backup that adds additional time in between each request until it gives up. What you're seeing is that default process running. (You can also configure a dead letter handling with a storage account so the undelivered messages get stored somewhere if it eventually fails).
What you are likely looking for is the Retry Policy you can create when creating a subscription. Pretty sure you can set the number of maximum delivery attempts to 1 so it won't retry (and without dead letter support turned on the message would essentially be dropped). More details on this can be found at https://learn.microsoft.com/en-us/azure/event-grid/manage-event-delivery#set-retry-policy
I'm not aware of any way to "dequeue" already submitted messages without that retry policy already in place - you may have to delete and recreate the subscription to that event grid topic.
In addition to #JoshCarlisle's answer and more clear to the Event Grid message delivery and retry document:
The dead-lettering enables a special case in the retry policy logic.
In the case of the dead-lettering is turn-on and subscriber failed with a HttpStatusCode.BadRequest, the Event Grid will stop a retrying process and the event is sent to the dead-letter endpoint. This error code indicates, that the delivery will never succeed.
the following code snippet shows some properties in the dead-letter message:
"deadLetterReason": "UndeliverableDueToHttpBadRequest",
"deliveryAttempts": 1,
"lastDeliveryOutcome": "BadRequest",
"lastHttpStatusCode": 400,
the following list shows some of the status codes where the Event Grid will continue in the retrying process:
HttpStatusCode.ServiceUnavailable
HttpStatusCode.InternalServerError
HttpStatusCode.RequestTimeout
HttpStatusCode.NotFound
HttpStatusCode.Conflict
HttpStatusCode.Forbidden
HttpStatusCode.Unauthorized
HttpStatusCode.NotImplemented
HttpStatusCode.Gone
Example of the some dead-letter properties, when the HttpStatusCode.RequestTimeout:
"deadLetterReason":"MaxDeliveryAttemptsExceeded",
"deliveryAttempts":3,
"lastDeliveryOutcome":"TimedOut",
"lastHttpStatusCode":408,
Now, you can see the above two difference cases described in the deadLetterReason property such as "UndeliverableDueToHttpBadRequest" vs "MaxDeliveryAttemptsExceeded"
One more thing:
When the dead-lettering is turn-on, the Event Grid will NOT deliver a dead-letter message to the dead-letter endpoint immediately, but after ~300 seconds. I hope this is a bug and it will be fix soon.
Practically, if the subscriber failed for instance HttpStatusCode.BadRequest, we can not wait for 5 minutes the event from the container storage, it must be an event driven close to the real-time.
My service consumes messages from an Azure Service Bus subscription. A dependency of my service was down for a while, which caused a lot of messages to end up in the deadletter queue (DLQ). Now that the service is back up, I want to reprocess all messages from the DLQ. How can I move/resubmit all messages from the DLQ back in to the main queue.
Restrictions:
It's thousands of messages, so manually handling them isn't feasible.
The topic has about ten subscriptions. I don't want to resubmit the messages to the topic, because then all subscriptions would receive the messages, leading to double-processing.
I don't want to run the service against the DLQ directly, because some messages are broken and cause permanent errors, i.e. they would end up in the DLQ again, which would lead to an infinite loop. Moreover, the broken messages are put back at the front of the queue, effectively starving healthy messages that come after the broken ones.
I realize this is a while after the original post but if anyone else stumbles on this problem, there is a fairly handy solution baked into the Service Bus Explorer (which I have found to be incredibly handy with ASB development).
After connecting to your Service Bus and finding the needed namespace, find the desired topic and subscription with the deadletters in it. From there Right Click and Receive Deadletter Queue Messages and hit OK.
From there, highlight which you would like to send back to the main queue and hit Resubmit Selected Messages in Batch Mode.
Thomas, you probably already found your answer since this is quite awhile ago. think of DLQ (or any existing queue that you have) as just another collection variable like in a PC app, but residing on the cloud. just like a PC-app or in-memory collection variable from your tool-kit, you have many ways of utilising it. off course there are limitations and differences between these 2 types of collection variables, but that's how you design your solution as though the DLQ is just another collection variable by knowing those limitations and differences.
For some queuing implementations, one of the solutions would be to have another instance of the same app pointing to the DLQ, but with a fairly long visibility timeout (e.g. 6 or 12 or even 24 hours depending on your SLA), since you don't want to repeat them too often. However, this is not applicable to Azure service bus, as it limits the visibility timeout to at most 5 minutes.
if the DLQ contains broken un-recoverable jobs, you should fix the app to delete them based on the error messages when the unknown exception occurred. once the fix is deployed, such broken un-recoverable jobs would have been removed by your app and never get sent to the DLQ in the first place. and those already in the DLQ will be removed by the fixed app.
The only option to replay DLQ messages is to receive them from DLQ, create new message with same content and send it again to the topic. They will end up at the end of subscription queue.
You can't send messages directly to the subscription. There is a trick to add a metadata property to the message, and then adjust all except one subscription to filter out such messages. It's up to you to decide if it's going to help in your scenario.
As for tooling, we always did that with custom code, because we always needed some extra work to be done, like logging each replayed message for further analysis.
The quick answer is that you cannot directly move messages back to the main queue of a subscription. This is by design with how Microsoft implemented their topics and subscriptions.
Option #1
There is the option to use Azure Service Bus topic filters https://learn.microsoft.com/en-us/azure/service-bus-messaging/topic-filters and define/tag your messages in a manner that would only allow them to be received on the targeted subscription.
Option #2
The other option would be to change your current implementation. You would set up "delivery queues" (regular service bus queues) and configure each corresponding subscription to auto forward its messages to these delivery queues. Your message processing logic would then listen on these "delivery queues" vs the subscription. Any failures would then result in DLQ messages on these associated "delivery queues" which could then be handled outside of the topic/subscriptions.
I was hoping if someone can clarify a few things regarding Azure Storage Queues and their interaction with WebJobs:
To perform recurring background tasks (i.e. add to queue once, then repeat at set intervals), is there a way to update the same message delivered in the QueueTrigger function so that its lease (visibility) can be extended as a way to requeue and avoid expiry?
With the above-mentioned pattern for recurring background jobs, I'm also trying to figure out a way to delete/expire a job 'on demand'. Since this doesn't seem possible outside the context of WebJobs, I was thinking of maybe storing the messageId and popReceipt for the message(s) to be deleted in Table storage as persistent cache, and then upon delivery of message in the QueueTrigger function do a Table lookup to perform a DeleteMessage, so that the message is not repeated any more.
Any suggestions or tips are appreciated. Cheers :)
Azure Storage Queues are used to store messages that may be consumed by your Azure Webjob, WorkerRole, etc. The Azure Webjobs SDK provides an easy way to interact with Azure Storage (that includes Queues, Table Storage, Blobs, and Service Bus). That being said, you can also have an Azure Webjob that does not use the Webjobs SDK and does not interact with Azure Storage. In fact, I do run a Webjob that interacts with a SQL Azure database.
I'll briefly explain how the Webjobs SDK interact with Azure Queues. Once a message arrives to a queue (or is made 'visible', more on this later) the function in the Webjob is triggered (assuming you're running in continuous mode). If that function returns with no error, the message is deleted. If something goes wrong, the message goes back to the queue to be processed again. You can handle the failed message accordingly. Here is an example on how to do this.
The SDK will call a function up to 5 times to process a queue message. If the fifth try fails, the message is moved to a poison queue. The maximum number of retries is configurable.
Regarding visibility, when you add a message to the queue, there is a visibility timeout property. By default is zero. Therefore, if you want to process a message in the future you can do it (up to 7 days in the future) by setting this property to a desired value.
Optional. If specified, the request must be made using an x-ms-version of 2011-08-18 or newer. If not specified, the default value is 0. Specifies the new visibility timeout value, in seconds, relative to server time. The new value must be larger than or equal to 0, and cannot be larger than 7 days. The visibility timeout of a message cannot be set to a value later than the expiry time. visibilitytimeout should be set to a value smaller than the time-to-live value.
Now the suggestions for your app.
I would just add a message to the queue for every task that you want to accomplish. The message will obviously have the pertinent information for processing. If you need to schedule several tasks, you can run a Scheduled Webjob (on a schedule of your choice) that adds messages to the queue. Then your continuous Webjob will pick up that message and process it.
Add a GUID to each message that goes to the queue. Store that GUID in some other domain of your application (a database). So when you dequeue the message for processing, the first thing you do is check against your database if the message needs to be processed. If you need to cancel the execution of a message, instead of deleting it from the queue, just update the GUID in your database.
There's more info here.
Hope this helps,
As for the first part of the question, you can use the Update Message operation to extend the visibility timeout of a message.
The Update Message operation can be used to continually extend the
invisibility of a queue message. This functionality can be useful if
you want a worker role to “lease” a queue message. For example, if a
worker role calls Get Messages and recognizes that it needs more time
to process a message, it can continually extend the message’s
invisibility until it is processed. If the worker role were to fail
during processing, eventually the message would become visible again
and another worker role could process it.
You can check the REST API documentation here: https://msdn.microsoft.com/en-us/library/azure/hh452234.aspx
For the second part of your question, there are really multiple ways and your method of storing the id/popReceipt as a lookup is a possible option, you can actually have a Web Job dedicated to receive messages on a different queue (e.g plz-delete-msg) and you send a message containing the "messageId" and this Web Job can use Get Message operation then Delete it. (you can make the job generic by passing the queue name!)
https://msdn.microsoft.com/en-us/library/azure/dd179474.aspx
https://msdn.microsoft.com/en-us/library/azure/dd179347.aspx
I have an Azure queue (ServiceBus Topic Subscription DeadLetter queue) with ~900 messages. I would like to list them for debug purposes. I call PeekBatch method with parameter messageCount set to 100. I expect to get 100 messages but I only receive 69. Why is that???
I ran into a similar issue recently where the reported Active Message Count number was higher than the actual number of messages that could be read. After some investigation, I contacted Microsoft's Azure support team. Here's what they told me:
After researching I see that there is bug about this where the count is not showing correct value. The bug is active so it’s not resolved. Product group is tracking it but I don’t have any ETA on it.
Although in my case it's the Active Message Count that was off, this could very well be the same issue.
I found a way (that only works some of the times - I don't know why) to correct the message count. Use Service Bus Explorer, select the topic that is displaying the issue, hit Update (without having changed any of the settings) and then hit Refresh.
I also had the same issue, when I tried to peek messages from queue which is enabled partitioning. if I use PeekBatch method with passing message count as parameter, I can read only some of the messages even the actual message count was higher than that.
Partition queue - Partitioned queue consists of multiple fragments handled by different message brokers. when a message is sent to the partitioned queue, Service Bus assigns the message to one of the fragments. When a client wants to receive a message from a queue, Service Bus checks all fragments for messages. if it finds any, it picks one and passes to the receiver.
So, When we try to get messages using PeekBatch() method, it checks the fragments for messages and picks the messages from the fragment which it found at first, even if it is not matched with the given message count.