Cancelling an Azure Function with a queue trigger without the poison queue

We're using the following setup:
An Azure Function with a queue trigger processes a queue of JSON messages.
Each message is just forwarded to an API endpoint via HTTP POST for further processing.
The API can return three possible HTTP status codes: 200 (OK), 400 (Bad Request), or 500 (Internal Server Error).
If the API returns 200, the message was processed properly and everything is fine. The queue trigger function appears to automatically delete the queue message and that's fine by us.
If the API returns 400, the API has logic which takes the message and adds it to a table with a status indicating that it was malformed or otherwise couldn't be processed. We are therefore fine with the message being automatically deleted from the queue, and the Azure Function can end normally.
If the API returns 500, we make sure the function retries posting the message to the API, until the status code is 200 or 400 (because there's likely a problem with the API and we don't want lost messages). We're using Polly to achieve this. We have it set up so it's essentially going to keep retrying forever on an exponential backoff.
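For reference, here's a minimal sketch of that retry setup, assuming Polly 7 and a plain HttpClient; the endpoint URL and the backoff cap are illustrative, not our real values:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public static class ApiForwarder
{
    private static readonly HttpClient Client = new HttpClient();

    // Retry forever on 500, with exponential backoff capped (here) at 5 minutes.
    private static readonly AsyncRetryPolicy<HttpResponseMessage> RetryForeverOn500 =
        Policy
            .HandleResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.InternalServerError)
            .WaitAndRetryForeverAsync(attempt => TimeSpan.FromSeconds(Math.Min(Math.Pow(2, attempt), 300)));

    public static Task<HttpResponseMessage> ForwardAsync(string json) =>
        RetryForeverOn500.ExecuteAsync(() =>
            Client.PostAsync("https://example.com/api/messages",            // illustrative endpoint
                new StringContent(json, Encoding.UTF8, "application/json")));
}
```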
We recently encountered this problem however:
There are certain situations where the API will return 500 for certain messages. This error is completely transient and will come and go unpredictably. Retrying forever using Polly would be fine except not all messages cause this error and essentially the "bad" messages are blocking "good" messages from being processed.
Let's say for example I have 50 messages in the queue. The first 32 messages at the front of the queue are "bad" and will sometimes return 500 from the API. These messages are picked up by the Azure Function and worked on concurrently. The other 18 messages are "good" and will return 200. These "good" messages will not be processed until the "bad" ones have been successfully processed. Essentially the bad ones cause a traffic jam for the good ones.
My solution was to try to cancel execution of the Azure Function if the current message has been retried a certain number of times. I thought maybe the message would then be made visible after some time, but in that time it gives the good messages time to be processed. However, I have no idea how to cancel execution of the function without either causing the queue message to be completely deleted or pushed onto a poison queue.
Am I able to achieve this using a queue trigger function? Is this something I can maybe do using a timer trigger instead?
Thanks very much!

As you've mentioned, you can't effectively cancel execution, so I'd suggest finishing the function and moving the message to a queue where it will be processed later.
A few suggestions:
Throw an error to use the poison queue, and handle the logic there.
Push these messages to another 'long running' queue of your choice, with an output binding.
Use an output binding of type CloudQueue that connects to your input queue. When you encounter a problematic message, add it to the output queue using the initialVisibilityDelay parameter to push it to the back of the queue: https://learn.microsoft.com/en-us/dotnet/api/microsoft.windowsazure.storage.queue.cloudqueue.addmessage?view=azure-dotnet
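A rough sketch of that third option, assuming the Storage queue bindings and a hypothetical CallApiAsync helper; the queue name and the 10-minute delay are placeholders:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage.Queue;

public static class ForwardToApi
{
    [FunctionName("ForwardToApi")]
    public static async Task Run(
        [QueueTrigger("incoming-messages")] string message,
        [Queue("incoming-messages")] CloudQueue sameQueue)   // output binding back to the input queue
    {
        HttpStatusCode status = await CallApiAsync(message); // hypothetical helper that POSTs to the API

        if (status == HttpStatusCode.InternalServerError)
        {
            // Re-enqueue a copy at the back of the queue, hidden for 10 minutes,
            // then return normally so the original message is deleted.
            await sameQueue.AddMessageAsync(
                new CloudQueueMessage(message),
                timeToLive: null,
                initialVisibilityDelay: TimeSpan.FromMinutes(10),
                options: null,
                operationContext: null);
        }
    }

    private static Task<HttpStatusCode> CallApiAsync(string message) =>
        Task.FromResult(HttpStatusCode.OK); // placeholder for the real HTTP POST
}
```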
Edit: Here's a cheatsheet of parameter binding types

Related

What happens to the messages being processed on functions running when we disable the function?

We are working with Azure Functions that are triggered on every message in a Service Bus queue. We are trying to solve a problem whereby we need to dynamically disable a function on the function app so that it stops processing messages, without losing any messages in the process.
We can disable the functions in several ways (referring to the linked article), but the problem remains the same: we can't figure out what happens to function invocations that have already been spawned when we disable the function.
Since the function is Service Bus triggered, there is always a possibility that it is in the middle of processing a message when we disable it. Does that message still get processed? Is some sort of cancellation raised? Does the invocation just die with an exception?
It would be great someone could direct me to some documentation or something. Thanks.
An Azure Service Bus triggered function will already have a lock on the message being processed. If the function is terminated and the message was not completed or otherwise dispositioned, the lock will expire and the message will reappear on the queue. That's because the Functions runtime receives messages in PeekLock mode.
One factor to consider is the queue's MaxDeliveryCount. If a function is terminated upon the last processing attempt, the message will be dead-lettered as all processing attempts have been exhausted. That's a standard Azure Service Bus behaviour.

SQS Lambda - retry logic?

A message is added to an SQS queue that is configured to trigger a Lambda function (Node.js).
When the Lambda function is triggered, I may want to retry the same message after 5 minutes without deleting it from the queue. The reason I want to do this: if the Lambda could not connect to an external host (e.g. an API), I'd like to try again after 5 minutes, for 3 attempts only.
How can that be written in node js?
For example, Laravel has a "specifying max job attempts" feature: the number of times the job may be attempted is set with public $tries = 5;
Source: https://laravel.com/docs/5.7/queues#max-job-attempts-and-timeout
How can we do something similar in Node.js?
I am thinking of adding the message to another queue (for retries). A Lambda function would read the messages from that queue after 5 minutes and send them back to the main queue, which would trigger the Lambda function again.
Retries and the retry "timeout" can both be configured directly on the SQS queue.
When you create a queue, set up the following attributes:
The Default Visibility Timeout will be the time that the message will be hidden once it has been received by your application. If the message fails during the lambda run and an exception is thrown, lambda will not delete any of the messages in the batch and all of them will eventually re-appear in the queue.
If you only want to try 3 times, you must set the SQS re-drive policy (AKA Dead Letter Queue)
The re-drive policy will enable your queue to redirect messages to a Dead Letter Queue (DLQ) after the message has re-appeared in the queue N number of times, where N is a number between 1 and 1000.
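For illustration, here's one way those attributes could be set programmatically. The original question is about Node.js, but the attributes themselves are SDK-agnostic; this sketch uses the AWS SDK for .NET, and the queue URL, DLQ ARN, and numbers are placeholders:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class QueueSetup
{
    public static async Task ConfigureAsync()
    {
        var sqs = new AmazonSQSClient();

        const string queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"; // placeholder
        const string dlqArn   = "arn:aws:sqs:us-east-1:123456789012:work-queue-dlq";           // placeholder

        await sqs.SetQueueAttributesAsync(new SetQueueAttributesRequest
        {
            QueueUrl = queueUrl,
            Attributes = new Dictionary<string, string>
            {
                // Hide a received message for 5 minutes before it can reappear.
                ["VisibilityTimeout"] = "300",
                // After 3 receives, move the message to the dead letter queue.
                ["RedrivePolicy"] = "{\"deadLetterTargetArn\":\"" + dlqArn + "\",\"maxReceiveCount\":\"3\"}"
            }
        });
    }
}
```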
It is essential to understand that lambda will continue to process a failed message (a message that generates an exception in the code) until:
It is processed without any errors (lambda deletes the message)
The Message Retention Period expires (SQS deletes the message)
It is sent to the DLQ set in the SQS queue re-drive policy (SQS "moves" the message to the DLQ)
You delete the message from the queue directly in your code (User deletes the message)
Lambda will not dispose of this bad message otherwise.
Important observations
Lambda will not deal with failed messages
This is based on several experiments I ran to understand the behavior of the SQS integration (the documentation on retries can be ambiguous).
Lambda will not delete failed messages and will continue to re-try them. Even if you have a Lambda DLQ setup, failed messages will not be sent to the lambda DLQ. Lambda fully relies on the configuration of the SQS queue for this purpose as stated in the lambda DLQ documentation.
Recommendation:
Always use a re-drive policy in your SQS queue.
Exceptions will fail a whole batch of messages
As I stated earlier, if there is an exception in your code while processing a message, the whole batch of messages is retried; it doesn't matter that some of the messages were processed correctly. If a downstream service is failing for some reason, you may end up with correctly processed messages in the DLQ.
Recommendation:
Manually delete messages that have been processed correctly
Ensure that your lambda function can process the same message more than once
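A sketch of those two recommendations, again using the AWS SDK for .NET for illustration; the queue URL and the HandleMessageAsync helper are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;
using Amazon.SQS;

public class BatchHandler
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();
    private const string QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"; // placeholder

    public async Task ProcessAsync(SQSEvent evt, ILambdaContext context)
    {
        var anyFailed = false;

        foreach (var record in evt.Records)
        {
            try
            {
                await HandleMessageAsync(record.Body); // placeholder business logic (should be idempotent)

                // Delete the message ourselves so a later retry of the batch won't reprocess it.
                await _sqs.DeleteMessageAsync(QueueUrl, record.ReceiptHandle);
            }
            catch (Exception ex)
            {
                context.Logger.LogLine($"Failed {record.MessageId}: {ex.Message}");
                anyFailed = true;
            }
        }

        // Throwing makes the remaining (undeleted) messages visible again for a retry.
        if (anyFailed) throw new Exception("One or more messages failed; retry the rest of the batch.");
    }

    private Task HandleMessageAsync(string body) => Task.CompletedTask; // placeholder
}
```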
Lambda concurrency limits and SQS side effects
The blog post "Lambda Concurrency Limits and SQS Triggers Don’t Mix Well (Sometimes)" describes how, if your concurrency limit is set too low, Lambda may cause batches of messages to be throttled and their receive count to be incremented without the messages ever being processed.
Recommendation:
The post and Amazon's recommendations are:
Set the queue’s visibility timeout to at least 6 times the timeout that you configure on your function.
The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
Set the maxReceiveCount on the queue’s re-drive policy to at least 5. This will help avoid sending messages to the dead-letter queue due to throttling.
Configure the dead-letter queue to retain failed messages long enough that you can move them back later to be reprocessed.
Here is how I did it.
Create Normal Queues (Immediate Delivery), Q1
Create Delay Queues (5 mins delay), Q2
Create DLQ (After retries), DLQ1
(Q1/Q2) SQS Trigger --> Lambda L1 (if failed, delete on (Q1/Q2), drop it on Q2) --> On Failure DLQ
When a message arrives on Q1 it triggers Lambda L1; if it succeeds, processing continues from there. If it fails, drop the message onto Q2 (which is a delay queue). Every message that arrives on Q2 has a delay of 5 minutes.
If your initial message can tolerate a 5-minute delay, you might not need two queues; one queue should be enough. If the initial delay is not acceptable, you need two queues. Another reason to have two queues is that new messages always have an undelayed path in.
If you have a code failure while handling Q1/Q2, the AWS infrastructure will retry immediately, three times, before sending the message to DLQ1. If you handle the error in your code, you can get the pipeline to work with the timings you mentioned; see the sketch below.
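A sketch of the L1 handler in that pipeline, shown in C# purely for illustration (the flow itself is language-agnostic); the Q2 URL and the CallExternalApiAsync helper are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.SQSEvents;
using Amazon.SQS;
using Amazon.SQS.Model;

public class L1Handler
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();
    private const string Q2Url = "https://sqs.us-east-1.amazonaws.com/123456789012/q2-delay"; // placeholder delay queue

    public async Task ProcessAsync(SQSEvent evt, ILambdaContext context)
    {
        foreach (var record in evt.Records)
        {
            try
            {
                await CallExternalApiAsync(record.Body); // placeholder call that may fail
            }
            catch (Exception ex)
            {
                context.Logger.LogLine($"Deferring {record.MessageId}: {ex.Message}");

                // Drop a copy onto the delay queue (Q2); returning without throwing lets the
                // SQS trigger delete the original copy from Q1/Q2.
                await _sqs.SendMessageAsync(new SendMessageRequest
                {
                    QueueUrl = Q2Url,
                    MessageBody = record.Body
                });
            }
        }
    }

    private Task CallExternalApiAsync(string body) => Task.CompletedTask; // placeholder
}
```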
SQS Delay Queues:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
SQS Lambda Architecture:
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
Hope it helps.
Fairly simple (if you execute the Lambda asynchronously) and without the need to do any coding. First of all: if your code throws an error, AWS Lambda will retry executing your code 3 more times. In this case, if the external API was not accessible, there is a big chance that by the third retry the API will be working again. Plus, the delay between the retries is somewhat random, meaning there is a gap between attempts.
If the worst happens and the external API is still not up, you can take advantage of the dead-letter queue (DLQ) feature that each Lambda has, which will push a message to SQS saying what went wrong so you can take additional action. In this case, keep retrying until you make it.
You can read more here: https://docs.aws.amazon.com/lambda/latest/dg/dlq.html
According to this blog:
https://www.lucidchart.com/blog/cloud/5-reasons-why-sqs-lambda-triggers-are-a-big-deal
Leverage existing retry logic and dead letter queues. If the Lambda function does not return success, the message will not be deleted from the queue and will reappear after the visibility timeout has expired.

Azure Service Bus queue message handling

So I have an Azure Function with a queue trigger that calls an internally hosted API.
There doesn't seem to be a definitive answer online on how to handle a message that could not be processed due to issues other than being poisonous.
An example:
My message is received and the function attempts to call the API. The message payload is correct and could be handled, however the API/service is down for whatever reason (this time will likely be upwards of 10 minutes). Currently what happens is that the message delivery count reaches its max (10) and the message gets pushed to the dead-letter queue, which in turn happens for each message after it.
I need a way to either not increment the delivery count or reset it upon reaching the max. Alternatively, I could abandon the peek lock on the message without incrementing the delivery count, as I want to stop processing any messages on the queue until the API/service is back up and running.
This way I would ensure that all messages that can be processed will be and will not fall on the dead letter because of connection issues between services.
Any ideas on how to achieve this?
Currently what happens is the message delivery count is reaching its max(10) and then getting pushed to the dead letter queue, which in turn happens for each message after.
As this document states about Exceeding MaxDeliveryCount:
Queues and subscriptions have a QueueDescription.MaxDeliveryCount/SubscriptionDescription.MaxDeliveryCount setting; the default value is 10. Whenever a message has been delivered under a lock (ReceiveMode.PeekLock), but has been either explicitly abandoned or the lock has expired, the message's BrokeredMessage.DeliveryCount is incremented. When the DeliveryCount exceeds the MaxDeliveryCount, the message gets moved to the DLQ specifying the MaxDeliveryCountExceeded reason code.
This behavior cannot be turned off, but the MaxDeliveryCount can be set to a very large number.
According to your requirement, I assumed that you could follow the approaches below to achieve your purpose:
For receiving messages under ReceiveMode.PeekLock
You could specify the Maximum Delivery Count between 1 and 2147483647 under the "SETTINGS > Properties" of your service bus queue on Azure Portal.
For receiving messages under ReceiveMode.ReceiveAndDelete
You could try-catch the exception when your API/service is down, then you could re-send the message to your queue.
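A minimal sketch of that second approach, assuming the older WindowsAzure.ServiceBus SDK (QueueClient/BrokeredMessage, as referenced elsewhere on this page); the queue name and the CallApi helper are placeholders:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

public static class Receiver
{
    public static void ProcessOne(string connectionString)
    {
        var client = QueueClient.CreateFromConnectionString(
            connectionString, "my-queue", ReceiveMode.ReceiveAndDelete);

        BrokeredMessage message = client.Receive();
        if (message == null) return;

        string body = message.GetBody<string>();
        try
        {
            CallApi(body); // placeholder call to the internally hosted API
        }
        catch (Exception)
        {
            // The message was already removed (ReceiveAndDelete), so re-send a copy;
            // the new message starts with a fresh delivery count.
            client.Send(new BrokeredMessage(body));
        }
    }

    private static void CallApi(string body) { /* placeholder */ }
}
```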

Azure Function and storage queue, what to do if function fails

I'm working out a scenario where I post a message to an Azure Storage Queue. For testing purposes I've developed a console app, where I get the message, update it with a try count, and delete the message when the logic is done.
Now I'm trying to port my code to an Azure Function. One thing that seems to be very different is, when the Azure Function is called, the message is deleted from the queue.
I find it hard to find any documentation on this specific subject and I feel I'm missing something with regard to the concept of combining these two.
My questions:
Am I right, that when you trigger a function on a new queue item, the function takes the message and deletes it from the queue, even if the function fails?
If 1 is correct, how do you make sure that the message is retried and posted to a dead queue for later processing?
The runtime only deletes the queue message when your function successfully processes it (i.e. no error has occurred). When the message is dequeued and passed to your function, it becomes invisible for a period of time (10 minutes). While your function is running, this invisibility is maintained. If your function fails, the message is not deleted - it remains in the queue in an invisible state. After the visibility timeout expires, the message becomes visible in the queue again for reprocessing.
The details of how core WebJobs SDK queue processing works can be found here. On that page, see the section "How to handle poison messages" which addresses your question. Basically you'll get all the right behaviors for free - retry handling, poison message handling, etc. :)
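A small sketch of what that looks like in practice, with placeholder queue names; dequeueCount is bound from the message metadata, and a second function handles the poison queue:

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class QueueFunctions
{
    [FunctionName("ProcessItem")]
    public static void ProcessItem(
        [QueueTrigger("work-items")] string message,
        long dequeueCount,                      // how many times this message has been dequeued
        ILogger log)
    {
        log.LogInformation($"Attempt {dequeueCount} for message: {message}");

        // Throwing here leaves the message in the queue (invisible until the visibility
        // timeout expires); after the maximum number of attempts (5 by default) the runtime
        // moves it to the "work-items-poison" queue.
        DoWork(message); // placeholder processing that may throw
    }

    [FunctionName("ProcessPoisonItem")]
    public static void ProcessPoisonItem(
        [QueueTrigger("work-items-poison")] string message,
        ILogger log)
    {
        log.LogError($"Message moved to poison queue: {message}");
    }

    private static void DoWork(string message) { /* placeholder */ }
}
```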

Azure Request-Response Session Timeout handling

We are using the azure service bus to facilitate the parallel processing of messages through workers listening to a queue.
First an aggregated message is received and then this message is split in thousands of individual messages which are posted through a request-response pattern since we need to know when all messages have been completed to run a separate process.
Our issue is that the request-response method has a timeout which is causing the following issue:
Let's say we post 1000 messages to be processed and there is only one worker listening. Messages left in the queue after the timeout expires are discarded, which is something we do not want. If we set the expiry time to a large value that guarantees all messages will be processed, then we run the risk of a message failing and having to wait out the timeout to understand that something has gone wrong.
Is there a way to dynamically change the expiration of a single message in a request-response scenario or any other pattern that we should consider?
Thanks!
You've got this slightly wrong. The time to live of an Azure Service Bus message (https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.brokeredmessage.timetolive.aspx) is the time the message will remain on the queue, whether or not it is consumed.
It is not the request timeout. If you post a message with a larger time to live, the message will stay on the queue for a long time, but if you fail to consume it you should warn the other end that you failed to consume that message.
You can do this by using another queue and putting a message on that other queue containing the id of the failed message and the error.
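For example, a sketch of posting such a notification, assuming the same Service Bus SDK; the queue name and property names are purely illustrative:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

public static class FailureNotifier
{
    // Tell the orchestrating side that a particular message could not be processed,
    // instead of letting it wait for the request-response timeout.
    public static void ReportFailure(string connectionString, string failedMessageId, Exception error)
    {
        var client = QueueClient.CreateFromConnectionString(connectionString, "processing-errors");

        var notification = new BrokeredMessage("processing-failed");
        notification.Properties["FailedMessageId"] = failedMessageId; // illustrative property names
        notification.Properties["Error"] = error.Message;

        client.Send(notification);
    }
}
```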
This is an asynchronous process so you should not be holding requests based on that but work with the asynchronous nature of the problem.
