Azure Event Grid Function Trigger - Probation - azure

We have an Azure setup with an Azure Event Grid Topic, and an Azure Function Service with about 15 functions that subscribe to the topic via different prefix filters. The Azure Function Service is set up as a consumption-based resource and should be able to scale as it prefers.
Each subscription is set up to attempt delivery up to 10 times within a maximum of 4 hours before dropping the event. So far so good, and the setup works as expected most of the time.
In certain situations, for reasons unknown to us, it seems that the Event Grid Topic cannot deliver events to the different functions. What we can see is that our dead letter storage fills up with events that have not been delivered.
Now to my question
From the logs we can see the reason for various events not being delivered. The reason is most often Outcome: Probation. We can not find any information from Microsoft on what this actually means.
In addition, the Grid gives up and adds the event to the dead letter log before either the timeout policy (4 hours) or the delivery attempts policy (10 retries) has been exceeded. Sometimes the Function Service is idling and does not receive any events from the Grid.
Do any of you good people have ideas for how we can proceed with troubleshooting this? What has happened between the Grid and the Function App when the outcome Probation occurs? One thing we have noticed is that the number of connections from the Grid to our function app is quite high compared to the number of events delivered.
There are no other incoming connections to the Function App besides the Event Grid.
Example of a dead letter message
[{
    "id": "a40a1f02-5ec8-46c3-a349-aea6aaff646f",
    "eventTime": "2020-06-02T17:45:09.9710145Z",
    "eventType": "mitbalAdded",
    "dataVersion": "1",
    "metadataVersion": "1",
    "topic": "/subscriptions/XXXXXXX/resourceGroups/XXXX_STAGING/providers/Microsoft.EventGrid/topics/XXXXXstaging",
    "subject": "odl/type/mitbal/v1",
    "deadLetterReason": "TimeToLiveExceeded",
    "deliveryAttempts": 6,
    "lastDeliveryOutcome": "Probation",
    "publishTime": "2020-06-02T17:45:10.1869491Z",
    "lastDeliveryAttemptTime": "2020-06-02T19:30:10.5756332Z",
    "data": "<?xml version=\"1.0\" encoding=\"utf-8\"?><Stock><Action>ADD</Action><Id>123456</Id><Store>123</Store><Shelf>1</Shelf></Stock>"
}]
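For anyone triaging a similar dead letter container, here is a minimal Python sketch that compares a record like the one above against the subscription's retry policy (4 hours, 10 attempts). The helper names parse_event_grid_time and summarize_dead_letter are hypothetical, and the record is a trimmed copy of the message shown above.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_event_grid_time(value: str) -> datetime:
    """Parse a UTC timestamp such as 2020-06-02T17:45:10.1869491Z."""
    # The timestamps carry 7 fractional digits; strptime accepts at most 6,
    # so pad or trim the fraction to microsecond precision first.
    head, _, fraction = value.rstrip("Z").partition(".")
    fraction = (fraction + "000000")[:6]
    parsed = datetime.strptime(f"{head}.{fraction}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def summarize_dead_letter(record: dict, ttl: timedelta, max_attempts: int) -> dict:
    """Compare one dead-letter record against the configured retry policy."""
    published = parse_event_grid_time(record["publishTime"])
    last_attempt = parse_event_grid_time(record["lastDeliveryAttemptTime"])
    elapsed = last_attempt - published
    return {
        "id": record["id"],
        "reason": record["deadLetterReason"],
        "outcome": record["lastDeliveryOutcome"],
        "attempts": record["deliveryAttempts"],
        "elapsed": elapsed,
        "ttl_reached": elapsed >= ttl,
        "attempts_exhausted": record["deliveryAttempts"] >= max_attempts,
    }


# Trimmed copy of the dead letter message from the question.
blob = """[{
  "id": "a40a1f02-5ec8-46c3-a349-aea6aaff646f",
  "deadLetterReason": "TimeToLiveExceeded",
  "deliveryAttempts": 6,
  "lastDeliveryOutcome": "Probation",
  "publishTime": "2020-06-02T17:45:10.1869491Z",
  "lastDeliveryAttemptTime": "2020-06-02T19:30:10.5756332Z"
}]"""

for record in json.loads(blob):
    print(summarize_dead_letter(record, ttl=timedelta(hours=4), max_attempts=10))
```

For this record the last attempt came roughly 1 hour 45 minutes after publishing, with 6 attempts, so neither configured limit had been reached at that point.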
Function Service Metrics
Blue = Connections (count)
Red = Function Executions (count)
White = Requests (count)

I'm not sure if you have figured out the issue here, but here are some insights for others in a comparable situation.
Firstly, Probation is the outcome reported when the destination is considered unhealthy; Event Grid still attempts deliveries in that state.
Based on the graph, it looks like the functions hit the 100 executions mark and then took a while to scale out for the next 100. You could get better results by tweaking the host.json settings depending on what each function execution does.
Including scale controller logs could shed more light on what is happening internally when scaling out.
Also, another option would be to send events into service bus or event hubs first and then have a function run from there.

Related

Where does Azure EventGrid keep messages waiting to be delivered?

Is it somehow possible to access the messages that Azure EventGrid has failed to deliver and are waiting to be retried? Alternatively, is it possible to get the count of messages currently waiting or other metrics about them?
With default retry policy EventGrid tries to deliver messages for 24 hours. It would be good to get some information that there are messages waiting for delivery. Eventually they will be available in the dead-letter storage, but 24 hours is a long time waiting.
I tried looking at the portal, but could not find a clue where those messages might be kept. Nothing relevant in the metrics either. I also tried to search for an appropriate command in Azure CLI as well as in the PowerShell Az module. Couldn't find anything there either. Any hints on where to search next?
After having a better look at the metrics, I found out that it actually should be possible to get the number of events currently waiting for delivery. EventGrid Topic metrics include Matched Events (Sum), Delivered Events (Sum), Dead Letter Events (Sum) and Delivery Failed Events (Sum). The last one shows straight away that there are events that have not been delivered successfully. Then, subtracting the sum of Delivered Events (Sum) and Dead Letter Events (Sum) from Matched Events (Sum) should give the number of events currently waiting for retry.
There is also an option to add Diagnostics settings to an EventGrid Topic. That will initiate a collection of delivery failure data and start storing it in e.g. a Log Analytics workspace, where it will be available in a table named AegDeliveryFailureLogs. Those logs will show each delivery failure, identifying the topic, event type and subject. I have not yet figured out how to get the unique id of the event message.
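The subtraction can be written down as a small Python sketch. The function name events_awaiting_retry and the sample numbers are made up for illustration; the three inputs are assumed to be the metric sums read over the same time window.

```python
def events_awaiting_retry(matched: int, delivered: int, dead_lettered: int) -> int:
    """Estimate events still in retry: matched - (delivered + dead-lettered)."""
    pending = matched - (delivered + dead_lettered)
    # Sums read over slightly different windows can briefly disagree,
    # so clamp at zero rather than report a negative backlog.
    return max(pending, 0)


# Illustrative numbers only, not taken from a real topic.
print(events_awaiting_retry(matched=1200, delivered=1150, dead_lettered=20))  # 30
```

This is only an estimate: it assumes every matched event ends up either delivered or dead-lettered.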

Routing messages: Topic to serverless to multi-queue (maybe) to serverless to multi rest

Sorry if this is long and/or open ended... there are just so many options in Azure that I'm struggling to choose which pieces to use where, so I'm hoping someone can help point the way.
I am trying to build something as described in the image below. I am kind of "routing" messages from a multi-tenant service bus topic (top of diagram) to tenant specific REST endpoints (bottom of diagram).
The hardest part that I am facing so far is the concept of a function with dynamic queue/event triggers that change over time, but I'm looking for advice on any part of the solution.
I have 4 constraints to deal with:
Those messages HAVE to eventually reach the corresponding rest endpoint
Those REST endpoint nodes get added and removed dynamically, but I do know (via event grid message) when that happens.
Those REST endpoint nodes can also be taken offline anytime, or for long periods of time (days), and all messages should eventually be delivered
I have a mostly working solution for F1, though I'm open to better ideas; it's really F2 that I am struggling with.
F1
Currently what I have is:
1 queue per tenant server, created/deleted based on event grid message
An Azure Function that can return a queue name based on service bus message contents
A logic app (WIP) that will take messages off the topic subscription, use the function to determine the destination queue name, add the webhook URI to the message properties, and forward the message to that queue.
I think that F1 will eventually work correctly :D... but F2...
F2
This is the tricky part: now I have N queues that come and go over time. I can't figure out how to listen to all the queues via 1 function, so I'm thinking of maybe rolling out 1 function per queue. It would be responsible for pulling messages off the queue and sending them to the REST webhook URI.
But when the REST endpoint is down it should pause and not drain the queue. I'm also not sure how to do this efficiently; maybe another logic app with polling?
I'm thinking of maybe using event grid instead of queues, because I think they have more serverless support in general, but I'm not sure this will solve the problem either.
Appreciate any thoughts

Where are "queued" Azure Event Grid Blob trigger event messages stored and how can I clear them?

Pardon if my terminology is a little off; I'm new to this.
I have created an Azure Event Grid subscription which triggers an event whenever I upload a file to blob storage. I have an Azure Function which responds to this event. I've got this all working finally, but I have a slew of left-over messages from previous (bad) uploads that are failing periodically (as viewed from the Logs window in the Azure portal for the associated Azure Function). It's as if they're stored in a queue somewhere and retried periodically, though I'm not sure if that's how it works.
In any case, what I want to be able to do is purge any in-transit or queued events, but I don't know where to find them to do this. As far as I know they're just floating about in the ether.
How can I purge these events so they don't keep triggering my Azure Function at random times?
Event Grid will automatically retry delivery of the message if anything other than a 200 or 202 (OK/Accepted) is returned when a delivery attempt is made. By default it will try again for 24 hours, and it uses an exponential backoff that adds additional time in between each request until it gives up. What you're seeing is that default process running. (You can also configure dead-letter handling with a storage account so the undelivered messages get stored somewhere if delivery eventually fails.)
What you are likely looking for is the Retry Policy you can create when creating a subscription. Pretty sure you can set the number of maximum delivery attempts to 1 so it won't retry (and without dead letter support turned on the message would essentially be dropped). More details on this can be found at https://learn.microsoft.com/en-us/azure/event-grid/manage-event-delivery#set-retry-policy
I'm not aware of any way to "dequeue" already submitted messages without that retry policy already in place - you may have to delete and recreate the subscription to that event grid topic.
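To picture how a retry policy bounds that process, here is a generic exponential backoff sketch in Python. It is not Event Grid's actual schedule: the base delay, growth factor and cap are assumptions chosen for illustration, and backoff_schedule is a hypothetical helper. Only the two limits (time to live and maximum attempts) correspond to what the retry policy lets you configure.

```python
from datetime import timedelta
from typing import List


def backoff_schedule(
    base: timedelta,
    factor: float,
    cap: timedelta,
    ttl: timedelta,
    max_attempts: int,
) -> List[timedelta]:
    """Return the offsets from the first attempt at which deliveries are tried."""
    offsets = [timedelta(0)]  # the first attempt happens immediately
    delay = base
    while len(offsets) < max_attempts:
        next_offset = offsets[-1] + delay
        if next_offset > ttl:
            break  # the event would have expired before this attempt
        offsets.append(next_offset)
        delay = min(delay * factor, cap)  # grow the wait, up to the cap
    return offsets


# Assumed parameters for illustration only.
schedule = backoff_schedule(
    base=timedelta(seconds=10),
    factor=2,
    cap=timedelta(hours=1),
    ttl=timedelta(hours=24),
    max_attempts=30,
)
for attempt, offset in enumerate(schedule, start=1):
    print(attempt, offset)
```

With max_attempts=1 the schedule contains only the initial attempt, which matches the suggestion above of limiting delivery attempts so that nothing is retried.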
In addition to @JoshCarlisle's answer, and to clarify the Event Grid message delivery and retry document:
Dead-lettering enables a special case in the retry policy logic.
When dead-lettering is turned on and the subscriber fails with HttpStatusCode.BadRequest, Event Grid stops retrying and sends the event to the dead-letter endpoint. This status code indicates that the delivery will never succeed.
The following snippet shows some properties of the dead-letter message:
"deadLetterReason": "UndeliverableDueToHttpBadRequest",
"deliveryAttempts": 1,
"lastDeliveryOutcome": "BadRequest",
"lastHttpStatusCode": 400,
The following list shows some of the status codes for which Event Grid will continue retrying:
HttpStatusCode.ServiceUnavailable
HttpStatusCode.InternalServerError
HttpStatusCode.RequestTimeout
HttpStatusCode.NotFound
HttpStatusCode.Conflict
HttpStatusCode.Forbidden
HttpStatusCode.Unauthorized
HttpStatusCode.NotImplemented
HttpStatusCode.Gone
Example of some dead-letter properties for HttpStatusCode.RequestTimeout:
"deadLetterReason":"MaxDeliveryAttemptsExceeded",
"deliveryAttempts":3,
"lastDeliveryOutcome":"TimedOut",
"lastHttpStatusCode":408,
Now you can see that the two different cases above are distinguished by the deadLetterReason property: "UndeliverableDueToHttpBadRequest" vs "MaxDeliveryAttemptsExceeded".
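The two cases can be summarized in a small Python sketch. It encodes only what this thread states (200/202 count as success, 400 with dead-lettering enabled is not retried, the listed codes are retried) and makes no claim about any other status code; expected_handling is a hypothetical helper, not part of any Azure SDK.

```python
from http import HTTPStatus

# Success codes mentioned in the earlier answer.
SUCCESS = {HTTPStatus.OK, HTTPStatus.ACCEPTED}

# Codes listed above as ones for which Event Grid keeps retrying.
RETRIED = {
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.REQUEST_TIMEOUT,
    HTTPStatus.NOT_FOUND,
    HTTPStatus.CONFLICT,
    HTTPStatus.FORBIDDEN,
    HTTPStatus.UNAUTHORIZED,
    HTTPStatus.NOT_IMPLEMENTED,
    HTTPStatus.GONE,
}


def expected_handling(status_code: int, dead_lettering_enabled: bool) -> str:
    """Classify a subscriber response according to the behavior described above."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        return "unclassified"  # not a standard HTTP status code
    if status in SUCCESS:
        return "delivered"
    if status == HTTPStatus.BAD_REQUEST and dead_lettering_enabled:
        return "dead-letter without retry"
    if status in RETRIED:
        return "retry"
    return "unclassified"  # the thread does not cover this case


print(expected_handling(400, dead_lettering_enabled=True))
print(expected_handling(408, dead_lettering_enabled=True))
```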
One more thing:
When dead-lettering is turned on, Event Grid will NOT deliver a dead-letter message to the dead-letter endpoint immediately, but only after ~300 seconds. I hope this is a bug and it will be fixed soon.
In practice, if the subscriber fails with, for instance, HttpStatusCode.BadRequest, we cannot wait 5 minutes for the event to appear in the storage container; an event-driven flow must be close to real time.

Delay in Azure Function triggering off IoT Hub

I have data going from my system to an Azure IoT Hub. I timestamp each data packet when I send it. Then I have an Azure Function that is triggered by the IoT Hub. In the function I read the message, extract the timestamp, and record how long the data took to reach the function. I also have another program running on my system that listens for data on the IoT Hub and records that time too.
Most of the time the latency measured in the Azure Function is in milliseconds, but sometimes the function takes a long time to be triggered. (I conclude the delay is in the trigger because the program that reads from the IoT Hub shows that the data reached the hub quickly, with no delay.)
Would anybody know why the Azure Function might be triggering late?
Is this the same question that was asked here? https://github.com/Azure/Azure-Functions/issues/711
I'll copy/paste my answer for others to see:
Based on what I see in the logs and your description, I think the latency can be explained as being caused by a cold-start of your function app process. If a function app goes idle for approximately 20 minutes, then it is unloaded from memory and any subsequent trigger will initiate a cold start.
Basically, the following sequence of events takes place:
The function app goes idle and is unloaded (this happened about 5 minutes before the trigger you mentioned).
You send the new event.
The event eventually gets noticed by our scale controller, which polls for events on a 10 second interval.
Our scale controller initiates a cold-start of your function app. This can add a few more seconds depending on the content of your function app (it was about 6 seconds in this case).
So unfortunately this is a known behavior with the consumption plan. You can read up on this issue here: https://blogs.msdn.microsoft.com/appserviceteam/2018/02/07/understanding-serverless-cold-start/. The blog post also discusses some ways you can work around this if it's problematic for your scenario.
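To check whether slow triggers line up with idle periods, one could flag every trigger that follows a long gap. The Python sketch below is a rough heuristic: the 20 minute threshold is the approximate figure quoted above, and likely_cold_starts is a hypothetical helper working on whatever trigger timestamps you have recorded.

```python
from datetime import datetime, timedelta
from typing import List

# Approximate idle period after which the app is unloaded, per the answer above.
IDLE_UNLOAD_AFTER = timedelta(minutes=20)


def likely_cold_starts(
    trigger_times: List[datetime],
    idle_limit: timedelta = IDLE_UNLOAD_AFTER,
) -> List[datetime]:
    """Return triggers that arrived after the app had been idle for idle_limit or longer."""
    ordered = sorted(trigger_times)
    flagged = []
    # The first trigger is skipped because its preceding idle time is unknown.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous >= idle_limit:
            flagged.append(current)
    return flagged


# Illustrative timestamps only.
times = [
    datetime(2018, 3, 1, 12, 0),
    datetime(2018, 3, 1, 12, 5),
    datetime(2018, 3, 1, 12, 40),
]
print(likely_cold_starts(times))
```

If the slow executions coincide with the flagged triggers, cold start is a plausible explanation; if they do not, the cause is probably elsewhere.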

Controlling queue polling times

I have a piece of code that pushes a message to a service bus queue every time a new article is added on my web app. This then gets picked up by a ServiceBusTrigger with SendGrid output in my functions app that sends me an email that a new article has been added by someone.
This doesn't happen often at all, and the only reason I decided to make it behave this way is to get my feet wet with some of the awesome Azure services.
My question is: since I don't really care about receiving these notification emails in real time, how can I reduce the frequency with which the trigger checks the queue?
In my functions app's host.json I've already reduced maxConcurrentCalls to 1 (default is 16).
"serviceBus": {
    "maxConcurrentCalls": 1,
    "prefetchCount": 100,
    "autoRenewTimeout": "00:05:00"
}
Is there a way to also set it so that my trigger only checks the queue every 30 minutes or something like that?
No. Message retrieval is managed by Scaling Controller, which you don't have much influence on, apart from host.json parameters that you have already seen.
To implement your scenario, you would need to switch to Timer trigger running every 30 minutes and retrieve messages manually from Service Bus, arguably losing many benefits of Azure Functions.
Update: You can now integrate your Service Bus to Azure Event Grid and then use Event Grid triggered Function. Unfortunately, as of today it only works for Premium Service Bus namespace, so you'd most probably have to wait until they expand the feature to lower tiers.
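A minimal Python sketch of the Timer-trigger alternative: a function scheduled every 30 minutes (NCRONTAB "0 */30 * * * *") would call something like drain_queue below. The drain logic only assumes a receiver exposing receive_messages and complete_message, which is how the azure-servicebus v7 ServiceBusReceiver is used; the InMemoryReceiver class is a hypothetical stand-in so the sketch runs without an Azure namespace.

```python
from typing import Any, Callable, List


def drain_queue(
    receiver: Any,
    handle: Callable[[Any], None],
    batch_size: int = 20,
    max_wait_seconds: float = 5.0,
) -> int:
    """Pull and settle messages until the queue looks empty; return the count."""
    processed = 0
    while True:
        batch = receiver.receive_messages(
            max_message_count=batch_size, max_wait_time=max_wait_seconds
        )
        if not batch:
            return processed  # nothing arrived within the wait time
        for message in batch:
            handle(message)  # e.g. send the notification email
            receiver.complete_message(message)  # settle only after success
            processed += 1


class InMemoryReceiver:
    """Hypothetical stand-in for a Service Bus receiver, for local runs only."""

    def __init__(self, messages: List[Any]) -> None:
        self._messages = list(messages)
        self.completed: List[Any] = []

    def receive_messages(self, max_message_count: int = 1, max_wait_time: Any = None) -> List[Any]:
        batch = self._messages[:max_message_count]
        self._messages = self._messages[max_message_count:]
        return batch

    def complete_message(self, message: Any) -> None:
        self.completed.append(message)


# In a real timer-triggered function the receiver would come from, for example,
# ServiceBusClient.from_connection_string(...).get_queue_receiver(queue_name=...).
demo = InMemoryReceiver(["article-1", "article-2", "article-3"])
print(drain_queue(demo, handle=print, batch_size=2))
```

Because a message is completed only after handle returns, a failure in handle leaves it unsettled; with the default peek-lock mode it should become available again once its lock expires.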

Resources