Azure Function Event Hub Trigger reliability

I'm a bit confused regarding the EventHubTrigger for Azure functions.
I've got an IoT Hub, and am using its eventhub-compatible endpoint to trigger an Azure function that is going to process and store the received data.
However, if my function fails (= throws an exception), the message (or messages) being processed during that function call will get lost. I would actually expect the Azure Functions runtime to process the messages again at a later time. Specifically, I would expect this behavior because the EventHubTrigger keeps checkpoints in the Function App's storage account in order to keep track of where in the event stream it has to continue.
The documentation of the EventHubTrigger even states that:
"If all function executions succeed without errors, checkpoints are added to the associated storage account."
But still, even when I deliberately throw exceptions in my function, the checkpoints will get updated and the messages will not get received again.
Is my understanding of the EventHubTrigger's documentation wrong, or is the EventHubTrigger's implementation (or its documentation) wrong?

This piece of documentation does seem confusing. I guess they mean errors of the Function App host itself, not of your code. An exception inside a function execution doesn't stop processing or checkpointing.
The fact is that Event Hubs are not designed for individual message retries. The processor works in batches, and it can either mark the whole batch as processed (i.e. create a checkpoint after it), or retry the whole batch (e.g. if the process crashed).
See this forum question and answer.
If you still need to re-process failed events from Event Hub (and errors don't happen too often), you could implement such a mechanism yourself, e.g.:
Add an output Queue binding to your Azure Function.
Add try-catch around processing code.
If exception is thrown, add the problematic event to the Queue.
Have another Function with Queue trigger to process those events.
Note that the downside of this is that you will lose the ordering guarantee provided by Event Hubs (since the Queue message will be processed later than its neighbors).
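The four steps above can be sketched without any Azure SDK. This is only an in-memory illustration, not the real binding code: `process_batch`, `BatchResult`, `store` and the `retry_queue` list (a stand-in for the output Queue binding) are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BatchResult:
    processed: List[str] = field(default_factory=list)
    # Stand-in for the output Queue binding (assumption, not a real binding).
    retry_queue: List[str] = field(default_factory=list)


def process_batch(events: List[str], handler: Callable[[str], None]) -> BatchResult:
    """Process every event; park failures on a retry queue instead of raising."""
    result = BatchResult()
    for event in events:
        try:
            handler(event)
            result.processed.append(event)
        except Exception:
            # The batch gets checkpointed regardless, so keep the failed event ourselves.
            result.retry_queue.append(event)
    return result


def store(event: str) -> None:
    # Hypothetical handler that rejects payloads marked "bad".
    if event.startswith("bad"):
        raise ValueError(f"cannot store {event}")


result = process_batch(["e1", "bad-e2", "e3"], store)
print(result.processed)    # ['e1', 'e3']
print(result.retry_queue)  # ['bad-e2']
```

Here "e3" is handled before "bad-e2" is ever retried, which is exactly the loss of ordering mentioned above.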

A quick fix, since a retry policy will not help if the downstream system is down for a few hours: you can call Process.GetCurrentProcess().Kill(); in your exception handling. This stops the checkpoint from moving forward. I have tested this with a Consumption-plan function app. You will not see anything in the logs, but I added an email notification so that I know something went wrong and that I killed the function instance to avoid data loss.
Hope this helps.
I plan to write a blog post about this and about the other part of the workflow, where I use a Logic App to stop the function in case of continuous failure of the downstream system.

Related

What happens to the messages being processed on functions running when we disable the function?

We are working with Azure Functions that are triggered on every message in a Service Bus queue. We are trying to solve a problem where we need to dynamically disable a function in the function app so that it stops processing messages, without losing any message in the process.
We can disable the functions in multiple ways (see the link), but the problem remains the same: we are unable to figure out what happens to function executions that are already running when we disable the function.
Since the function is Service Bus triggered, there is always a possibility that it is processing a message when we disable it. Does the message still get processed, is some sort of cancellation raised, or does the execution just die with an exception?
It would be great if someone could direct me to some documentation or something. Thanks.
An Azure Service Bus triggered function will already have a lock on the message that's being processed. If the function is terminated and the message has not been completed or otherwise settled, the lock will expire and the message will reappear on the queue. That's because the Functions runtime receives messages in PeekLock mode.
One factor to consider is the queue's MaxDeliveryCount. If a function is terminated upon the last processing attempt, the message will be dead-lettered as all processing attempts have been exhausted. That's a standard Azure Service Bus behaviour.
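The interplay of lock expiry and MaxDeliveryCount can be illustrated with a small simulation. This is a simplified in-memory model, not the Service Bus client API: `drain`, `handler` and the message bodies are made up for the example, and a failed attempt stands in for a function that was terminated before completing the message.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain(queue_messages: List[str], handler: Callable[[str], None],
          max_delivery_count: int) -> Tuple[List[str], List[str]]:
    """Deliver each message until the handler succeeds or deliveries run out."""
    pending: Deque[Tuple[str, int]] = deque((m, 0) for m in queue_messages)
    completed: List[str] = []
    dead_lettered: List[str] = []
    while pending:
        body, delivery_count = pending.popleft()
        delivery_count += 1  # every delivery under a lock counts as an attempt
        try:
            handler(body)
            completed.append(body)  # completed: the message leaves the queue
        except Exception:
            if delivery_count >= max_delivery_count:
                dead_lettered.append(body)  # all attempts exhausted
            else:
                pending.append((body, delivery_count))  # lock released, message reappears
    return completed, dead_lettered


attempts: Dict[str, int] = {}


def handler(body: str) -> None:
    # Hypothetical handler: "poison" always fails, "flaky" fails only once.
    attempts[body] = attempts.get(body, 0) + 1
    if body == "poison" or (body == "flaky" and attempts[body] < 2):
        raise RuntimeError("processing interrupted")


completed, dead_lettered = drain(["ok", "flaky", "poison"], handler, max_delivery_count=3)
print(completed)      # ['ok', 'flaky']
print(dead_lettered)  # ['poison']
```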

Time triggered Azure Function and queue processing

I have a Function called once a day processing all messages in a queue. But I would like to also have the retries and poison messages logic as with the Queue Trigger. Is this somehow possible?
So at that point your function is purely a timer triggered function, and from there you are no different from a console app in terms of how you would have to process messages from a queue: with your own client connection, message loop, retrying and dead lettering (poison) logic. It's honestly just not the right tool for that job.
One approach I suppose you could consider if you wanted to be creative so that you could benefit from using an Azure Function's built in queue trigger behavior while still controlling what time the queue is processed is actually starting and stopping the function instance itself via something like Azure Scheduler. Scheduling the starting of the function is pretty straightforward and, once started, it will immediately begin draining the queue. The challenge is knowing how to stop it. The Azure Function runtime with the queue binding won't ever stop on its own as it's reading off the queue using a pull model so it's just gonna sit there waiting for new messages to arrive, right? So stopping it is really a question of understanding the need of the business. Do you stop once there's no more messages left for that day? Do you stop at a specific time of day? Etc. There's no correct answer here, it's totally domain specific, but whatever that answer is will dictate the exact approach taken.
Honestly, as I said earlier on, I'm not sure this is the right tool for the job. Yeah, it's nice that you get the retry and poison handling, but you're really going against the grain of the runtime. I would personally suggest you look into scheduling this simple console executable, executed in a task-like fashion using Azure Container Instances (some docs on that approach here). If you were counting on the auto-scale of Azure Functions, that's totally something you can get from Azure Container Instances as well.

Can I configure azure function to peek and read message in service bus queue but not delete it?

Per Azure Functions Service Bus bindings:
Trigger behavior
...
PeekLock behavior - The Functions runtime receives a message in PeekLock mode and calls Complete on the message if the function finishes successfully, or calls Abandon if the function fails. If the function runs longer than the PeekLock timeout, the lock is automatically renewed.
I am assuming that when azure function calls Complete on the message, it will be removed from the queue.
What should I do in my function if I want my function to spy on the message but never delete it?
Unsuccessful processing of a message resulting in function throwing an exception or an explicit abandon operation on the message will not complete the message.
That said, I see a problem with this approach. You're not truly "spying" on the messages, but actively processing them. This means a given message will be re-delivered and will eventually end up in the dead letter queue. If you want to spy, you should peek at the messages, but the Azure Service Bus trigger doesn't do that.
If you need a wiretap implementation, it's probably not a bad idea to use a topic with two subscriptions: one to consume the messages and another that receives a copy of every message for your wiretap function (which perhaps does some sort of analysis or logging). Without understanding the full scope of what you're doing, it's hard to provide a definitive answer.
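To make the wiretap idea concrete, here is a tiny in-memory model of a topic that copies every published message to each subscription. `SimulatedTopic` and the subscription names are hypothetical and only mimic the fan-out behavior; this is not the Service Bus SDK.

```python
from collections import deque
from typing import Deque, Dict


class SimulatedTopic:
    """Every subscription receives its own copy of each published message."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> Deque[str]:
        # Create the subscription on first use and return its message buffer.
        return self.subscriptions.setdefault(name, deque())

    def publish(self, body: str) -> None:
        for messages in self.subscriptions.values():
            messages.append(body)


topic = SimulatedTopic()
consumer = topic.subscribe("consumer")
wiretap = topic.subscribe("wiretap")

topic.publish("m1")
topic.publish("m2")

consumer.popleft()     # the real consumer completes "m1"
print(list(consumer))  # ['m2']
print(list(wiretap))   # ['m1', 'm2']
```

Completing a message on one subscription leaves the other subscription's copy untouched, so the wiretap never interferes with the real consumer.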

Azure Function and storage queue, what to do if function fails

I'm working out a scenario where I post a message to an Azure Storage Queue. For testing purposes I've developed a console app, where I get the message and I'm able to update it with a try count, and when the logic is done, I delete the message.
Now I'm trying to port my code to an Azure Function. One thing that seems to be very different is, when the Azure Function is called, the message is deleted from the queue.
I find it hard to find any documentation on this specific subject and I feel I'm missing something with regard to the concept of combining these two.
My questions:
1. Am I right that when you trigger a function on a new queue item, the function takes the message and deletes it from the queue, even if the function fails?
2. If 1 is correct, how do you make sure that the message is retried and posted to a dead queue for later processing?
The runtime only deletes the queue message when your Function successfully processes it (i.e. no error has occurred). When the message is dequeued and passed to your function, it becomes invisible for a period of time (10 minutes). While your function is running, this invisibility is maintained. If your function fails, the message is not deleted; it remains in the queue in an invisible state. After the visibility timeout expires, the message becomes visible in the queue again for reprocessing.
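The visibility behavior described above can be modeled with an explicit clock. This is a toy in-memory queue, not the Storage Queue SDK: `SimulatedQueue`, `QueueMessage` and the `now` parameter are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueMessage:
    body: str
    visible_at: float = 0.0
    dequeue_count: int = 0


class SimulatedQueue:
    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self.messages: List[QueueMessage] = []

    def put(self, body: str) -> None:
        self.messages.append(QueueMessage(body))

    def get(self, now: float) -> Optional[QueueMessage]:
        """Return the first visible message and hide it for the timeout period."""
        for message in self.messages:
            if message.visible_at <= now:
                message.visible_at = now + self.visibility_timeout
                message.dequeue_count += 1
                return message
        return None

    def delete(self, message: QueueMessage) -> None:
        self.messages.remove(message)


queue = SimulatedQueue(visibility_timeout=600.0)  # 10 minutes, as in the answer
queue.put("order-1")

first = queue.get(now=0.0)      # function picks the message up, then fails: no delete
hidden = queue.get(now=300.0)   # still invisible, nothing to process
second = queue.get(now=600.0)   # timeout expired, message is visible again
queue.delete(second)            # this time the function succeeds
```

The dequeue count tracked here is the kind of counter that poison message handling relies on to decide when to give up on a message.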
The details of how core WebJobs SDK queue processing works can be found here. On that page, see the section "How to handle poison messages" which addresses your question. Basically you'll get all the right behaviors for free - retry handling, poison message handling, etc. :)

NServiceBus and Azure long running handler pattern

We are using Azure service bus via NServiceBus and I am facing a problem with deciding the correct architecture for dealing with long running tasks as a result of messages.
As is good practice, we don't want to block the message handler from returning by making it wait for long running processes (downloading a large file from a remote server), and actually doing so will cause the lock on the message to be lost with Azure SB. The plan is to respond by spawning a separate task and allow the message handler to return immediately.
However this means that the handler is now immediately available for the next message which will cause another task to be spawned and so on until the message queue is empty. What I'd like is some way to stop taking messages while we are processing (a limited number of) earlier messages. Is there an accepted pattern for this with NServiceBus and Azure Service Bus?
The following is what I'd kind of do if I was programming directly against the Azure SB
{
    while(true)
    {
        var message = bus.Next();
        message.Complete();
        // Do long running stuff here
    }
}
The verbs Next and Complete are probably wrong, but what happens under Azure is that Next gets a temporary lock on the message so that other consumers can no longer see the message. Then you can decide if you really want to process the message and, if so, call Complete. That removes the message from the queue entirely; failing to do so will cause the message to appear back on the queue after a period of time, as Azure assumes you crashed. As dirty as this code looks, it would achieve my goals (so why not do it?), as my consumer is only going to consume the next message when I'm available (after the long running task). Other consumers (other instances) can jump in if necessary.
The problem is that NServiceBus adds a level of abstraction so that now handling a message is via a method on a handler class.
void Handle(NewFileMessage message)
{
    // Do work here
}
The problem is that Azure does not get the call to message.Complete() until after your work is done and the Handle method exits. This is why you need to keep the work short. However, if you exit you also signal that you are ready to handle another message. This is my Catch-22.
Downloading on a background thread is a good idea. You don't want to increase the lock duration, because that treats a symptom, not the problem. Your download can easily take longer than the maximum lock duration (5 minutes), and then you're back to square one.
What you can do is have an orchestrating saga for the download. The saga can monitor the download process, and when the download is completed, the background process signals completion to the saga. If the download never finishes, you can have a timeout (or multiple timeouts) to indicate that and trigger a compensating action or a retry, whatever works for your business case.
Documentation on Sagas should get you going: http://docs.particular.net/nservicebus/sagas/
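As a rough illustration of the saga idea (start, completion signal, timeout leading to a retry or a compensating action), here is a plain state machine. It deliberately does not use the NServiceBus API; `DownloadSaga`, its fields and the returned action names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DownloadSaga:
    timeout_seconds: float
    max_retries: int
    started_at: float = 0.0
    retries: int = 0
    state: str = "idle"  # idle -> downloading -> completed | failed

    def start(self, now: float) -> None:
        self.started_at = now
        self.state = "downloading"

    def download_completed(self) -> None:
        # Signal sent by the background process when the download finishes.
        if self.state == "downloading":
            self.state = "completed"

    def timeout_elapsed(self, now: float) -> str:
        """Called when a timeout message arrives; returns the action to take."""
        if self.state != "downloading" or now - self.started_at < self.timeout_seconds:
            return "ignore"
        if self.retries < self.max_retries:
            self.retries += 1
            self.started_at = now  # restart the clock for the retried download
            return "retry"
        self.state = "failed"
        return "compensate"


saga = DownloadSaga(timeout_seconds=60, max_retries=1)
saga.start(now=0)
actions = [saga.timeout_elapsed(30), saga.timeout_elapsed(60), saga.timeout_elapsed(120)]
print(actions)     # ['ignore', 'retry', 'compensate']
print(saga.state)  # failed
```

The message handler itself returns immediately in this design; only the small saga state has to survive between messages.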
In Azure Service Bus you can increase the lock duration of a message (default set to 30 seconds) in case the handling will take a long time.
But even though you can increase the lock duration, needing to do so is generally an indication that your handler takes on too much work, which could be divided among different handlers.
If it is critical that the file is downloaded, I would keep the download operation in the handler. That way, if the download fails, the message can be handled again and the download can be retried. If, however, you want to free up the handler instantly to handle more messages, I would suggest that you scale out the workers that perform the download task so that the system can cope with the demand.
