SQS Lambda - retry logic? - node.js

A message is added to an SQS queue, and the queue is configured to trigger a Lambda function (Node.js).
When the Lambda function is triggered, I may want to retry the same message after 5 minutes without deleting it from the queue. The reason: if the Lambda cannot connect to an external host (e.g. an API), I would like to try again after 5 minutes, for 3 attempts only.
How can that be written in Node.js?
For example, in Laravel we can specify the maximum number of times a job may be attempted using public $tries = 5;
Source: https://laravel.com/docs/5.7/queues#max-job-attempts-and-timeout
How can we do something similar in Node.js?
I am thinking of adding the message to another queue (for retries). A Lambda function would read the messages from that queue after 5 minutes and send them back to the main queue, where they would trigger the Lambda function again.

Retries and the retry "timeout" can all be configured directly on the SQS queue.
When you create a queue, set up the following attributes:
The Default Visibility Timeout will be the time that the message will be hidden once it has been received by your application. If the message fails during the lambda run and an exception is thrown, lambda will not delete any of the messages in the batch and all of them will eventually re-appear in the queue.
If you only want to try 3 times, you must set the SQS re-drive policy (AKA Dead Letter Queue)
The re-drive policy will enable your queue to redirect messages to a Dead Letter Queue (DLQ) after the message has re-appeared in the queue N number of times, where N is a number between 1 and 1000.
It is essential to understand that lambda will continue to process a failed message (a message that generates an exception in the code) until:
It is processed without any errors (lambda deletes the message)
The Message Retention Period expires (SQS deletes the message)
It is sent to the DLQ set in the SQS queue re-drive policy (SQS "moves" the message to the DLQ)
You delete the message from the queue directly in your code (User deletes the message)
Lambda will not dispose of this bad message otherwise.
Important observations
Lambda will not deal with failed messages
Based on several experiments I ran to understand the behavior of the SQS integration (the documentation on retries can be ambiguous): Lambda will not delete failed messages and will continue to retry them. Even if you have a Lambda DLQ set up, failed messages will not be sent to the Lambda DLQ; Lambda fully relies on the configuration of the SQS queue for this purpose, as stated in the Lambda DLQ documentation.
Recommendation:
Always use a re-drive policy in your SQS queue.
Exceptions will fail a whole batch of messages
As I stated earlier, if there is an exception in your code while processing a message, the whole batch of messages is retried; it doesn't matter if some of the messages were processed correctly. If a downstream service is failing, you may end up with messages in the DLQ that were actually processed correctly.
Recommendation:
Manually delete messages that have been processed correctly (see the sketch after this list)
Ensure that your lambda function can process the same message more than once
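A minimal sketch of those two recommendations in a Node.js handler, assuming the AWS SDK v2; the queue URL and processRecord() are placeholders, not part of the original answer:

    // Delete each successfully processed record yourself, and only throw at the
    // end if something failed, so already-processed messages are not retried.
    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS();

    const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/main-queue'; // placeholder

    exports.handler = async (event) => {
      let failures = 0;

      for (const record of event.Records) {
        try {
          // Your business logic; make it idempotent so a redelivered
          // message does not cause harm.
          await processRecord(JSON.parse(record.body));

          // Delete the message ourselves so a later failure in the batch
          // does not make this message reappear in the queue.
          await sqs.deleteMessage({
            QueueUrl: QUEUE_URL,
            ReceiptHandle: record.receiptHandle
          }).promise();
        } catch (err) {
          console.error('Failed to process message', record.messageId, err);
          failures += 1;
        }
      }

      if (failures > 0) {
        // Throwing leaves the remaining (undeleted) messages in the queue so SQS
        // can retry them or eventually move them to the DLQ.
        throw new Error(`Failed to process ${failures} message(s)`);
      }
    };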
Lambda concurrency limits and SQS side effects
The blog post "Lambda Concurrency Limits and SQS Triggers Don’t Mix Well (Sometimes)" describes how, if your concurrency limit is set too low, Lambda may cause batches of messages to be throttled and their receive count to be incremented without the messages ever being processed.
Recommendation:
The post and Amazon's recommendations are (a configuration sketch follows the list):
Set the queue’s visibility timeout to at least 6 times the timeout that you configure on your function.
The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
Set the maxReceiveCount on the queue’s re-drive policy to at least 5. This will help avoid sending messages to the dead-letter queue due to throttling.
Configure the dead-letter queue to retain failed messages long enough that you can move them back later to be reprocessed.
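As a rough sketch of those settings with the AWS SDK for Node.js (queue names and URLs are placeholders, and a 30-second function timeout is assumed):

    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS();

    async function applyRecommendedSettings() {
      // Keep failed messages in the DLQ for 14 days so they can be moved back later.
      const dlq = await sqs.createQueue({
        QueueName: 'main-queue-dlq',                          // placeholder
        Attributes: { MessageRetentionPeriod: '1209600' }     // 14 days, in seconds
      }).promise();

      const { Attributes } = await sqs.getQueueAttributes({
        QueueUrl: dlq.QueueUrl,
        AttributeNames: ['QueueArn']
      }).promise();

      // Visibility timeout of at least 6x the function timeout, maxReceiveCount of at least 5.
      await sqs.setQueueAttributes({
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/main-queue', // placeholder
        Attributes: {
          VisibilityTimeout: '180',                           // 6 x 30-second function timeout
          RedrivePolicy: JSON.stringify({
            deadLetterTargetArn: Attributes.QueueArn,
            maxReceiveCount: '5'
          })
        }
      }).promise();
    }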

Here is how I did it.
Create a normal queue (immediate delivery), Q1
Create a delay queue (5-minute delay), Q2
Create a DLQ (used after the retries are exhausted), DLQ1
(Q1/Q2) SQS Trigger --> Lambda L1 (if failed, delete on Q1/Q2, drop it on Q2) --> On Failure DLQ
When a message arrives on Q1 it triggers Lambda L1; if it succeeds, processing continues from there. If it fails, drop the message onto Q2 (which is a delay queue). Every message that arrives on Q2 will be delayed by 5 minutes.
If your initial message can tolerate a 5-minute delay, you might not need two queues; one queue should be good. If the initial delay is not acceptable, then you need two queues. Another reason to have two queues is that new messages always have an undelayed path in while retries go through the delay queue.
If you have an unhandled code failure while processing Q1/Q2, the AWS infrastructure will retry immediately, 3 times, before it sends the message to DLQ1. If you handle the error in the code instead, you can get the pipeline to work with the timings you mentioned (a rough sketch follows).
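Here is a rough Node.js sketch of what Lambda L1 could look like under this design; the queue URLs, the 3-attempt limit, and handleMessage() are assumptions for illustration rather than the exact code used:

    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS();

    const Q2_URL  = 'https://sqs.us-east-1.amazonaws.com/123456789012/q2-delay'; // placeholder
    const DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/dlq1';     // placeholder
    const MAX_ATTEMPTS = 3;

    exports.handler = async (event) => {
      for (const record of event.Records) {
        // Track our own attempt counter in a message attribute.
        const attrs = record.messageAttributes || {};
        const attempt = Number((attrs.attempt || {}).stringValue || '1');

        try {
          await handleMessage(JSON.parse(record.body)); // your business logic (placeholder)
        } catch (err) {
          // Forward the failed message: back onto the 5-minute delay queue Q2,
          // or onto DLQ1 once the attempts are exhausted.
          const target = attempt >= MAX_ATTEMPTS ? DLQ_URL : Q2_URL;
          await sqs.sendMessage({
            QueueUrl: target,
            MessageBody: record.body,
            MessageAttributes: {
              attempt: { DataType: 'Number', StringValue: String(attempt + 1) }
            }
          }).promise();
        }
        // Returning without throwing lets Lambda delete the record from Q1/Q2.
      }
    };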
SQS Delay Queues:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
SQS Lambda Architecture:
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
Hope it helps.

Fairly simple (if you invoke the Lambda asynchronously) and without the need to do any coding. First of all: if your code throws an error, AWS Lambda will retry executing your code up to two more times. In this case, if the external API was not accessible, there is a good chance that by the last retry the API will be working again. Plus, there is a back-off delay between the retries.
If the worst happens and the external API is still not up, you can take advantage of the dead-letter queue (DLQ) feature that each Lambda has, which will push a message to SQS saying what went wrong so you can take additional action. In this case, keep retrying until you make it.
You can read more here: https://docs.aws.amazon.com/lambda/latest/dg/dlq.html
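A minimal sketch of attaching such a DLQ to a function with the AWS SDK for Node.js (the function name and queue ARN are placeholders):

    // Point the Lambda's dead-letter queue at an SQS queue so failed asynchronous
    // invocations end up there once the automatic retries are exhausted.
    const AWS = require('aws-sdk');
    const lambda = new AWS.Lambda();

    lambda.updateFunctionConfiguration({
      FunctionName: 'my-api-caller',                                   // placeholder
      DeadLetterConfig: {
        TargetArn: 'arn:aws:sqs:us-east-1:123456789012:my-lambda-dlq'  // placeholder
      }
    }).promise()
      .then(() => console.log('DLQ configured'))
      .catch(console.error);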

According to this blog:
https://www.lucidchart.com/blog/cloud/5-reasons-why-sqs-lambda-triggers-are-a-big-deal
Leverage existing retry logic and dead letter queues. If the Lambda function does not return success, the message will not be deleted from the queue and will reappear after the visibility timeout has expired.

Related

AWS: inconsistency between SQS and lambda

I want to trigger a Lambda from a websocket. I have deployed an EC2 instance running a websocket producer which pushes all its data through an SQS FIFO queue, and SQS triggers the Lambda with the same messageGroupId. But sometimes the Lambda executes concurrently, whereas I expect it to execute sequentially because the data comes through a queue. Since it is a cryptocurrency exchange websocket, the data frequency is really high, and I checked that one message from the websocket takes 3 ms to be processed in the Lambda.
I was expecting the Lambda to run as only one process, not concurrently (the concurrency is causing wrong data calculations). Can anyone tell me what queue configuration I should use, or whether there is another method to achieve this goal?
Thanks
Edit: Attaching config for fifo
There are two types of Amazon SQS queues: first-in, first-out (FIFO) and standard queues.
In FIFO queues, messages remain in the same order in which the original messages were sent and received. FIFO queues support up to 300 send, receive, or delete operations per second.
Standard queues attempt to keep messages in the same order in which they were originally sent, but processing requirements may change the original order or sequence of messages. For example, standard queues can be used to batch messages for future processing or allocate tasks to multiple worker nodes.
The frequency of message delivery differs between standard and FIFO queues, as FIFO messages are delivered exactly once, while in standard queues, messages are delivered at least once.
Suggestion: check your queue type and change it to FIFO.
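If you do end up (re)creating the queue as FIFO, here is a minimal sketch with the AWS SDK for Node.js (the queue name is a placeholder; FIFO queue names must end in .fifo):

    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS();

    sqs.createQueue({
      QueueName: 'exchange-feed.fifo',            // placeholder; must end in .fifo
      Attributes: {
        FifoQueue: 'true',
        ContentBasedDeduplication: 'true'         // deduplicate by a hash of the body
      }
    }).promise()
      .then(({ QueueUrl }) => console.log('Created', QueueUrl))
      .catch(console.error);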
You need to set the maximum lambda concurrency to 1.
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
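A minimal sketch of setting that limit with the AWS SDK for Node.js (the function name is a placeholder):

    // Reserve a concurrency of 1 so only one instance of the function runs at a time.
    const AWS = require('aws-sdk');
    const lambda = new AWS.Lambda();

    lambda.putFunctionConcurrency({
      FunctionName: 'my-consumer-function',       // placeholder
      ReservedConcurrentExecutions: 1
    }).promise()
      .then(() => console.log('Reserved concurrency set to 1'))
      .catch(console.error);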
To process each message sequentially:
When sending a message to the Amazon SQS queue, specify the same MessageGroupId for each message
In the Lambda function, configure the SQS Trigger to have a Batch size of 1
When using a FIFO queue, if a message is 'in-flight' then SQS will not permit another message with the same MessageGroupId to be processed. It will, however, allow multiple messages with the same MessageGroupId to be sent to a Lambda function, which is why you should set the Batch Size to 1.
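As a rough Node.js sketch of the sending side (the queue URL and payload shape are assumptions), every message uses the same MessageGroupId so SQS preserves ordering within the group:

    const AWS = require('aws-sdk');
    const sqs = new AWS.SQS();

    async function publish(payload) {
      await sqs.sendMessage({
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/exchange-feed.fifo', // placeholder
        MessageBody: JSON.stringify(payload),
        MessageGroupId: 'exchange-feed',            // same group id for every message
        MessageDeduplicationId: String(payload.id)  // or enable content-based deduplication on the queue
      }).promise();
    }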
See:
Using the Amazon SQS message group ID - Amazon Simple Queue Service
New for AWS Lambda – SQS FIFO as an event source | AWS Compute Blog

Azure - Storage queue message renewal

We are using an Azure WebJob to process the messages in an Azure Storage queue. After 5 unsuccessful attempts, messages are moved to the poison queue. Instead of that, I want to keep processing the message until it has been processed successfully.
Kindly assist me with this.
You can configure the maximum number of retries (the default is 5) before a message is sent to the poison queue. You can also add an 'int dequeueCount' parameter to your method to check the number of times the message has been dequeued and base your decisions on that.
Having said that, you should definitely have an error-handling strategy in place. Just retrying indefinitely until you succeed is a recipe for failure.

Cancelling an Azure Function with a queue trigger without the poison queue

We're using the following setup:
We're using an Azure Function with a queue trigger to process a queue of JSON messages.
These messages are each just forwarded to an API endpoint via HTTP POST, for further processing.
The API can return 3 possible HTTP status codes; 200 (OK), 400 (Bad Request), 500 (Internal Server Error).
If the API returns 200, the message was processed properly and everything is fine. The queue trigger function appears to automatically delete the queue message and that's fine by us.
If the API returns 400, the API has logic which takes the message and adds it to a table with a status indicating that it was malformed or otherwise couldn't be processed. We are therefore fine with the message being automatically being deleted from the queue and the Azure Function can end normally.
If the API returns 500, we make sure the function retries posting the message to the API, until the status code is 200 or 400 (because there's likely a problem with the API and we don't want lost messages). We're using Polly to achieve this. We have it set up so it's essentially going to keep retrying forever on an exponential backoff.
We recently encountered this problem however:
There are certain situations where the API will return 500 for certain messages. This error is completely transient and will come and go unpredictably. Retrying forever using Polly would be fine except not all messages cause this error and essentially the "bad" messages are blocking "good" messages from being processed.
Let's say for example I have 50 messages in the queue. The first 32 messages at the front of the queue are "bad" and will sometimes return 500 from the API. These messages are picked up by the Azure Function and worked on concurrently. The other 18 messages are "good" and will return 200. These "good" messages will not be processed until the "bad" ones have been successfully processed. Essentially the bad ones cause a traffic jam for the good ones.
My solution was to try to cancel execution of the Azure Function if the current message has been retried a certain number of times. I thought the message would then become visible again after some time, and in the meantime the good messages would have time to be processed. However, I have no idea how to cancel execution of the function without either causing the queue message to be completely deleted or pushed onto a poison queue.
Am I able to achieve this using a queue trigger function? Is this something I can maybe do using a timer trigger instead?
Thanks very much!
As you've mentioned, you can't effectively cancel execution, so I'd suggest finishing the function and moving the message to a queue where it will be processed later.
A few suggestions:
Throw an error to use the poison queue, and handle the logic there.
Push these messages to another 'long running' queue of your choice, with an output binding.
Use an output binding of type CloudQueue that connects to your input queue. When you encounter a problematic message, add it to the output queue using the initialVisibilityDelay parameter to push it to the back of the queue (a rough sketch of the same idea in Node.js follows): https://learn.microsoft.com/en-us/dotnet/api/microsoft.windowsazure.storage.queue.cloudqueue.addmessage?view=azure-dotnet
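For reference, a rough sketch of the same idea from a Node.js function using the @azure/storage-queue package instead of the .NET CloudQueue binding (the connection-string variable, queue name, and delay are assumptions):

    // Re-enqueue the problematic message with a visibility delay so it goes to the
    // back of the queue and the good messages get processed first.
    const { QueueClient } = require('@azure/storage-queue');

    async function requeueLater(messageText) {
      const queueClient = new QueueClient(
        process.env.AzureWebJobsStorage,   // storage connection string (assumed env var)
        'incoming-messages'                // placeholder queue name
      );
      // Depending on how your queue trigger decodes messages you may need to
      // base64-encode messageText before sending it.
      await queueClient.sendMessage(messageText, {
        visibilityTimeout: 300             // keep the copy hidden for 5 minutes (seconds)
      });
    }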
Edit: Here's a cheatsheet of parameter binding types

Azure Function and storage queue, what to do if function fails

I'm working out a scenario where I post a message to an Azure Storage queue. For testing purposes I've developed a console app, where I get the message, update it with a try count, and delete the message when the logic is done.
Now I'm trying to port my code to an Azure Function. One thing that seems to be very different is that when the Azure Function is called, the message is deleted from the queue.
I find it hard to find any documentation on this specific subject and I feel I'm missing something with regard to the concept of combining these two.
My questions:
Am I right, that when you trigger a function on a new queue item, the function takes the message and deletes it from the queue, even if the function fails?
If 1 is correct, how do you make sure that the message is retried and posted to a dead queue for later processing?
The runtime only deletes the queue message when your Function successfully processes it (i.e. no error has occurred). When the message is dequeued and passed to your function, it becomes invisible for a period of time (10 minutes). While your function is running, this invisibility is maintained. If your function fails, the message is not deleted - it remains in the queue in an invisible state. After the visibility timeout expires, the message will become visible in the queue again for reprocessing.
The details of how core WebJobs SDK queue processing works can be found here. On that page, see the section "How to handle poison messages" which addresses your question. Basically you'll get all the right behaviors for free - retry handling, poison message handling, etc. :)

Azure Request-Response Session Timeout handling

We are using the azure service bus to facilitate the parallel processing of messages through workers listening to a queue.
First an aggregated message is received and then this message is split in thousands of individual messages which are posted through a request-response pattern since we need to know when all messages have been completed to run a separate process.
Our issue is that the request-response method has a timeout which is causing the following issue:
Let's say we post 1000 messages to be processed and there is only one worker listening. Messages left in the queue after the timeout expires are discarded, which is something that we do not want. If we set the expiry time to a large value that guarantees all messages will be processed, then we run the risk of a message failing and having to wait for the timeout to understand that something has gone wrong.
Is there a way to dynamically change the expiration of a single message in a request-response scenario or any other pattern that we should consider?
Thanks!
You have got things a bit wrong. The TimeToLive of an Azure Service Bus message (https://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.brokeredmessage.timetolive.aspx) is the time the message will remain on the queue, whether it is consumed or not.
It is not the timeout: if you post a message with a larger time to live, the message will stay on the queue for a long time, but if you fail to consume it you should warn the other end that you failed to consume this message.
You can do this using another queue: put a message on that other queue carrying the id of the message that failed and the error.
This is an asynchronous process, so you should not be holding requests open based on it; work with the asynchronous nature of the problem.
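A minimal sketch of that failure-notification pattern with the @azure/service-bus Node.js SDK (the connection string, queue name, and message shape are assumptions):

    // When a message cannot be processed, put a small message on a separate queue
    // carrying the id of the failed message and the error, instead of letting the
    // other side wait for the request-response timeout.
    const { ServiceBusClient } = require('@azure/service-bus');

    async function reportFailure(failedMessageId, error) {
      const client = new ServiceBusClient(process.env.SERVICE_BUS_CONNECTION_STRING); // assumed env var
      const sender = client.createSender('processing-failures');                      // placeholder queue
      try {
        await sender.sendMessages({
          body: { failedMessageId, error: error.message },
          correlationId: failedMessageId   // lets the waiting side match the failure to its request
        });
      } finally {
        await sender.close();
        await client.close();
      }
    }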
