SQS FIFO with dead-letter queue and Lambda failing to process messages within the same group after a runtime exception

I'm using an SQS FIFO queue with a dead-letter queue and Lambda. My maxReceiveCount is 1 and the visibility timeout is 15 minutes.
When one message fails with a runtime exception, it is not moved to the dead-letter queue immediately; it waits 15 minutes. I am also noticing that other messages with the same group ID are not processed until the visibility timeout passes, which is an even bigger issue. Is there a way to make a failing message go to the dead-letter queue immediately upon failure, and to immediately process valid messages from the same group? I cannot afford to wait for the visibility timeout (15 minutes) to pass before processing other messages from the same group.

Related

SQS Lambda - retry logic?

A message is added to an SQS queue, and the queue is configured to trigger a Lambda function (Node.js).
When the Lambda function is triggered, I may want to retry the same message after 5 minutes without deleting it from the queue. The reason: if the Lambda could not connect to an external host (e.g. an API), I'd like to try again after 5 minutes, for 3 attempts only.
How can that be written in Node.js?
For example, Laravel provides a way of specifying max job attempts, i.e. the number of times the job may be attempted, using public $tries = 5;
Source: https://laravel.com/docs/5.7/queues#max-job-attempts-and-timeout
How can we do something similar in Node.js?
I am thinking of adding the message to another queue (for retry). A Lambda function would read the messages from that queue after 5 minutes and send them back to the main queue, which would trigger the Lambda function again.
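If the retry-queue idea fits your case, a simpler variant is to have the failing Lambda re-send the message itself with an SQS delivery delay. Below is a minimal TypeScript sketch using the AWS SDK v3; the QUEUE_URL environment variable, the attempt message attribute, and MAX_ATTEMPTS are illustrative assumptions, not anything SQS prescribes:

    // Hypothetical helper: re-queue a failed message with a 5-minute delay,
    // carrying an attempt counter so we stop after 3 tries.
    import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

    const sqs = new SQSClient({});
    const QUEUE_URL = process.env.QUEUE_URL!; // assumed to be configured
    const MAX_ATTEMPTS = 3;

    export async function retryLater(body: string, attempt: number): Promise<void> {
      if (attempt >= MAX_ATTEMPTS) {
        // Give up: let a re-drive policy / DLQ take over, or just log it.
        console.error("Max attempts reached, not re-queueing:", body);
        return;
      }
      await sqs.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: body,
        DelaySeconds: 300, // 5 minutes; SQS allows up to 900 (15 minutes)
        MessageAttributes: {
          attempt: { DataType: "Number", StringValue: String(attempt + 1) },
        },
      }));
    }

Note that per-message DelaySeconds only works on standard queues; FIFO queues support a queue-level delay only.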
Retries and the retry "timeout" can all be configured directly on the SQS queue.
When you create a queue, set up the following attributes:
The Default Visibility Timeout is the time a message stays hidden once it has been received by your application. If the message fails during the Lambda run and an exception is thrown, Lambda will not delete any of the messages in the batch, and all of them will eventually reappear in the queue.
If you only want to try 3 times, you must set the SQS re-drive policy (a.k.a. dead-letter queue).
The re-drive policy will make your queue redirect messages to a dead-letter queue (DLQ) after a message has reappeared in the queue N times, where N is a number between 1 and 1000.
It is essential to understand that lambda will continue to process a failed message (a message that generates an exception in the code) until:
It is processed without any errors (lambda deletes the message)
The Message Retention Period expires (SQS deletes the message)
It is sent to the DLQ set in the SQS queue re-drive policy (SQS "moves" the message to the DLQ)
You delete the message from the queue directly in your code (User deletes the message)
Lambda will not dispose of this bad message otherwise.
Important observations
Lambda will not deal with failed messages
These observations are based on several experiments I ran to understand the behavior of the SQS integration (the documentation on retries can be ambiguous).
Lambda will not delete failed messages and will continue to retry them. Even if you have a Lambda DLQ set up, failed messages will not be sent to it; Lambda relies entirely on the configuration of the SQS queue for this purpose, as stated in the Lambda DLQ documentation.
Recommendation:
Always use a re-drive policy in your SQS queue.
Exceptions will fail a whole batch of messages
As stated earlier, if there is an exception in your code while processing a message, the whole batch of messages is retried; it doesn't matter that some of the messages were processed correctly. So if a downstream service is failing, you may end up with correctly processed messages in the DLQ.
Recommendation:
Manually delete messages that have been processed correctly (see the sketch below)
Ensure that your Lambda function can process the same message more than once
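To make those two recommendations concrete, here is a minimal TypeScript sketch of an SQS-triggered handler that deletes each successfully processed message itself before failing the batch; QUEUE_URL and processRecord are placeholders:

    import { SQSClient, DeleteMessageCommand } from "@aws-sdk/client-sqs";
    import type { SQSEvent } from "aws-lambda";

    const sqs = new SQSClient({});
    const QUEUE_URL = process.env.QUEUE_URL!; // assumed to be configured

    // Placeholder for your own (idempotent!) business logic.
    async function processRecord(body: string): Promise<void> { /* ... */ }

    export async function handler(event: SQSEvent): Promise<void> {
      let failed = false;
      for (const record of event.Records) {
        try {
          await processRecord(record.body);
          // Delete each successfully processed message ourselves so a later
          // failure in the batch does not cause this one to be retried.
          await sqs.send(new DeleteMessageCommand({
            QueueUrl: QUEUE_URL,
            ReceiptHandle: record.receiptHandle,
          }));
        } catch {
          failed = true; // keep going, but remember the batch must fail
        }
      }
      // Re-throwing makes Lambda leave the remaining (undeleted) messages
      // in the queue to be retried.
      if (failed) throw new Error("One or more messages failed");
    }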
Lambda concurrency limits and SQS side effects
The blog post "Lambda Concurrency Limits and SQS Triggers Don’t Mix Well (Sometimes)" describes how, if your concurrency limit is set too low, lambda may cause batches of messages to be throttled and the received attempt to be incremented without ever being processed.
Recommendation:
The post and Amazon's recommendations are:
Set the queue’s visibility timeout to at least 6 times the timeout that you configure on your function.
The extra time allows for Lambda to retry if your function execution is throttled while your function is processing a previous batch.
Set the maxReceiveCount on the queue’s re-drive policy to at least 5. This will help avoid sending messages to the dead-letter queue due to throttling.
Configure the dead-letter queue to retain failed messages long enough that you can move them back later to be reprocessed.
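For illustration, those settings could be applied with the AWS SDK v3 roughly as follows; the queue URL, the DLQ ARN, and the assumed 60-second function timeout are placeholders:

    import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

    const sqs = new SQSClient({});

    await sqs.send(new SetQueueAttributesCommand({
      QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
      Attributes: {
        // 6x an assumed 60-second function timeout, per the recommendation.
        VisibilityTimeout: "360",
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:my-dlq",
          maxReceiveCount: 5, // at least 5, to ride out throttling
        }),
      },
    }));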
Here is how I did it.
Create a normal queue (immediate delivery), Q1
Create a delay queue (5-minute delay), Q2
Create a DLQ (after retries), DLQ1
(Q1/Q2) SQS trigger --> Lambda L1 (if failed, delete on Q1/Q2, drop it on Q2) --> on failure, DLQ1
When a message arrives on Q1 it triggers Lambda L1; on success the flow proceeds from there. On failure, the message is dropped onto Q2 (the delay queue), so every retried message gets a 5-minute delay.
If your initial message can tolerate a 5-minute delay, then you might not need two queues; one queue is good enough. If an initial delay is not acceptable, you need two queues. Another reason to have two queues is that new messages always have a clear path in.
If you have a code failure while handling Q1/Q2, the AWS infrastructure will retry it up to 3 times before sending it to DLQ1. If you handle the error in the code instead, you can get the pipeline to work with the timings you mentioned.
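A rough TypeScript sketch of what Lambda L1's error handling could look like under this design; Q2_URL and handleBusinessLogic are placeholders, and Q2 is assumed to have a 5-minute default delivery delay:

    import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
    import type { SQSEvent } from "aws-lambda";

    const sqs = new SQSClient({});
    const Q2_URL = process.env.Q2_URL!; // the delay queue (assumed)

    async function handleBusinessLogic(body: string): Promise<void> { /* ... */ }

    export async function handler(event: SQSEvent): Promise<void> {
      for (const record of event.Records) {
        try {
          await handleBusinessLogic(record.body);
        } catch {
          // Failed: hand the message to the delay queue instead of letting
          // the invocation fail. Because we return without throwing, Lambda
          // deletes the original message from Q1/Q2.
          await sqs.send(new SendMessageCommand({
            QueueUrl: Q2_URL,
            MessageBody: record.body,
          }));
        }
      }
    }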
SQS Delay Queues:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
SQS Lambda Architecture:
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
Hope it helps.
Fairly simple (if you execute the Lambda asynchronously), and without the need to do any coding. First of all: if your code throws an error, AWS Lambda will retry the execution up to two more times. In this case, if the external API was not accessible, there is a good chance that by the final attempt the API will be working again. Plus, the delay between the retries is somewhat randomized, so there is a gap between attempts.
If the worst happens and the external API is still not up, you can take advantage of the dead-letter queue (DLQ) feature that each Lambda has. It will push a message to SQS describing what went wrong, so you can take additional action; in this case, keep retrying until you make it.
You can read more here: https://docs.aws.amazon.com/lambda/latest/dg/dlq.html
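For reference, that Lambda (async-invocation) DLQ can be attached through the function configuration; a minimal AWS SDK v3 sketch with placeholder names and ARNs:

    import {
      LambdaClient,
      UpdateFunctionConfigurationCommand,
    } from "@aws-sdk/client-lambda";

    const lambda = new LambdaClient({});

    // The DLQ target can be an SQS queue or an SNS topic.
    await lambda.send(new UpdateFunctionConfigurationCommand({
      FunctionName: "my-function",
      DeadLetterConfig: {
        TargetArn: "arn:aws:sqs:us-east-1:123456789012:my-lambda-dlq",
      },
    }));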
According to this blog:
https://www.lucidchart.com/blog/cloud/5-reasons-why-sqs-lambda-triggers-are-a-big-deal
Leverage existing retry logic and dead letter queues. If the Lambda function does not return success, the message will not be deleted from the queue and will reappear after the visibility timeout has expired.

Azure Service Bus queue message handling

So I have an Azure Function acting as a queue trigger that calls an internally hosted API.
There doesn't seem to be a definitive answer online on how to handle a message that could not be processed for reasons other than being poisonous.
An example:
My message is received and the function attempts to call the API. The message payload is correct and could be handled; however, the API/service is down for whatever reason (this time it will likely be upwards of 10 minutes). Currently what happens is that each message's delivery count reaches its max (10) and the message is pushed to the dead-letter queue, which then happens for every message after it.
I need a way to either not increment the delivery count or reset it upon reaching the max. Alternatively, I could abandon the peek lock on the message without incrementing the delivery count, as I want to stop processing any message on the queue until the API/service is back up and running.
That way I would ensure that every message that can be processed will be, and none will land in the dead-letter queue because of connection issues between services.
Any ideas on how to achieve this?
Currently what happens is the message delivery count is reaching its max(10) and then getting pushed to the dead letter queue, which in turn happens for each message after.
As this document states about exceeding the MaxDeliveryCount:
Queues and subscriptions have a QueueDescription.MaxDeliveryCount / SubscriptionDescription.MaxDeliveryCount setting; the default value is 10. Whenever a message has been delivered under a lock (ReceiveMode.PeekLock), but has been either explicitly abandoned or the lock has expired, the message's BrokeredMessage.DeliveryCount is incremented. When the DeliveryCount exceeds the MaxDeliveryCount, the message gets moved to the DLQ with the MaxDeliveryCountExceeded reason code.
This behavior cannot be turned off, but the MaxDeliveryCount can be set to a very large number.
Given your requirement, I think you could follow one of the approaches below:
For receiving messages under ReceiveMode.PeekLock
You could set the Maximum Delivery Count to anything between 1 and 2147483647 under "SETTINGS > Properties" of your Service Bus queue in the Azure Portal.
For receiving messages under ReceiveMode.ReceiveAndDelete
You could catch the exception thrown when your API/service is down and then re-send the message to your queue.
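A minimal TypeScript sketch of that second approach with the @azure/service-bus package; the connection string, queue name, and callApi are placeholders:

    import { ServiceBusClient } from "@azure/service-bus";

    const client = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION!);
    const queueName = "my-queue"; // placeholder
    const receiver = client.createReceiver(queueName, {
      receiveMode: "receiveAndDelete",
    });
    const sender = client.createSender(queueName);

    async function callApi(body: unknown): Promise<void> { /* your API call */ }

    const messages = await receiver.receiveMessages(10);
    for (const message of messages) {
      try {
        await callApi(message.body);
      } catch {
        // In receiveAndDelete mode the message is already gone from the
        // queue, so on failure we re-send a copy; this also resets the
        // delivery count. Risk: if the process dies here, the message is lost.
        await sender.sendMessages({ body: message.body });
      }
    }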

Hidden messages in Azure storage queue

Sometimes there are messages in our Azure queues that are not picked up by our Azure Functions and are also not visible in Storage Explorer.
These messages are created without any visibility delay.
Is there any way to know what those messages contain, and why they are not processed by our Azure Functions?
We can see that there is a message in the queue, but it is not visible in the list, and it has been there for hours.
The Azure Queue API currently has no way to check invisible messages.
There are several situations in which a message will become invisible:
The message was added with a VisibilityTimeout in the Put Message request. The message will be invisible until this initial timeout expires.
The message has been retrieved (dequeued). Whenever a message is retrieved it will be invisible for the duration of the VisibilityTimeout specified by the Get Messages request, or 30 seconds by default.
The message has expired. Messages expire after 7 days by default, or after the MessageTTL specified in the Put Message request. Note: after a while these messages are deleted automatically, but until then they are present as invisible messages.
Use cases
Initial VisibilityTimeout
Messages are created with an initial VisibilityTimeout so that the message can be created now, but processed later (after the timeout expires), for whatever reason the creator has for wanting to delay this processing.
VisibilityTimeout on retrieving
The intended process for processing queue messages is:
The application dequeues one or more messages, optionally specifying the next VisibilityTimeout. This timeout should be bigger than the time it takes to process the message(s).
The application processes the message(s).
The application deletes the message(s). When processing fails, the message(s) are not deleted.
Message(s) whose processing failed become visible again as soon as their VisibilityTimeout expires, so that they can be retried. To prevent endless retries, step 2 should start by checking the message's DequeueCount: if it is bigger than the desired retry count, the message should be deleted instead of processed. It is good practice to copy such messages to a dead-letter / poison queue (for example, a queue with the original queue name plus a -poison suffix).
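A minimal TypeScript sketch of that DequeueCount check with the @azure/storage-queue package; the queue names, the retry limit, and the process function are illustrative assumptions:

    import { QueueServiceClient } from "@azure/storage-queue";

    const service = QueueServiceClient.fromConnectionString(
      process.env.AZURE_STORAGE_CONNECTION!); // assumed to be configured
    const queue = service.getQueueClient("work");         // placeholder name
    const poison = service.getQueueClient("work-poison"); // -poison convention

    const MAX_DEQUEUE_COUNT = 5; // illustrative retry limit

    async function process(text: string): Promise<void> { /* your logic */ }

    const { receivedMessageItems } = await queue.receiveMessages({
      numberOfMessages: 16,
      visibilityTimeout: 60, // seconds; should exceed your processing time
    });
    for (const msg of receivedMessageItems) {
      if (msg.dequeueCount > MAX_DEQUEUE_COUNT) {
        // Retried too often: copy to the poison queue instead of processing.
        await poison.sendMessage(msg.messageText);
        await queue.deleteMessage(msg.messageId, msg.popReceipt);
        continue;
      }
      await process(msg.messageText);
      await queue.deleteMessage(msg.messageId, msg.popReceipt);
    }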
MessageTTL
By default messages have a time-to-live of 7 days. If the processing application cannot handle the number of messages being added, a backlog can build up; adjusting the TTL determines what happens to that backlog.
Alternatively, the application could crash, so that the backlog builds up until the application is started again.
It seems that the message has expired. The following steps reproduce the issue, so you can test it:
Add a message with a short TTL.
After the message has expired, it becomes invisible but remains in the queue until it is automatically deleted.
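A small TypeScript sketch of that repro with @azure/storage-queue; the queue name and the timings are arbitrary:

    import { QueueServiceClient } from "@azure/storage-queue";

    const queue = QueueServiceClient
      .fromConnectionString(process.env.AZURE_STORAGE_CONNECTION!) // assumed
      .getQueueClient("ttl-test"); // placeholder

    // Enqueue a message that expires after 10 seconds.
    await queue.sendMessage("hello", { messageTimeToLive: 10 });

    // Wait past the TTL, then peek: the expired message no longer shows up,
    // even though it may still linger invisibly until it is garbage-collected.
    await new Promise((resolve) => setTimeout(resolve, 15_000));
    const peeked = await queue.peekMessages();
    console.log(peeked.peekedMessageItems.length); // expected: 0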

Azure ServiceBus Retry Delay

I am using Microsoft Azure Service Bus for queue messages, with WCF for the subscriptions. I am trying to implement retry logic. I use Peek/Lock to view the message and then have to do some local processing on it. If that processing fails, I unlock the message so I can try processing it again. The problem is that I need a delay between processing attempts: currently the message pops back into the queue and is processed again almost immediately, but there needs to be about 2 minutes between attempts.
If you always have to wait 2 minutes before re-processing the message of that particular queue, you could try to configure the lock-timeout on the queue to be 2 minutes (plus the time you expect it will take you to process the message) and then just let the lock expire, instead of unlocking it. This has the downside that you would need to keep an eye on your processing time, and extend the lock's timeout if needed.
Another option could be to receive and complete the message, set a scheduled delivery 2 minutes into the future, and re-send the message. This has the downside that you need to consume and acknowledge the message, which involves certain risks (e.g. your process dies before you get a chance to re-send it).
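A minimal sketch of that receive-complete-reschedule option with @azure/service-bus; the queue name and attemptProcessing are placeholders:

    import { ServiceBusClient } from "@azure/service-bus";

    const client = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION!);
    const queueName = "my-queue"; // placeholder
    const receiver = client.createReceiver(queueName); // default peekLock mode
    const sender = client.createSender(queueName);

    async function attemptProcessing(body: unknown): Promise<void> { /* ... */ }

    const [message] = await receiver.receiveMessages(1);
    if (message) {
      try {
        await attemptProcessing(message.body);
        await receiver.completeMessage(message);
      } catch {
        // Schedule a copy 2 minutes out, then complete the original.
        // Risk: if the process dies between these two calls, the message
        // is duplicated (or lost, if the calls are ordered the other way).
        await sender.scheduleMessages(
          { body: message.body },
          new Date(Date.now() + 2 * 60 * 1000),
        );
        await receiver.completeMessage(message);
      }
    }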
"If the message is Peeked in Peek Lock mode from a Queue then you don't have the receive context in the message. You can receive the message in Peek Lock mode, which will lock the message for the interval specified for the 'lock duration' property of the queue. Locked messages cannot be received until its lock expires. Thus, by setting the lock duration to 2 minutes and Receiving messages in Peek Lock mode will solve this issue.
You can either write custom code to update the Lock Duration property. Tools like Service Bus Explorer, Serverless360 etc provides options to update property using graphical user interface."

Locking messages in queue with Windows Azure Queues

I am working with Windows Azure message queues. I want to know whether there is a method to lock messages in the queue when I get them.
When you retrieve a message from the queue, it's marked as invisible until you delete it (or until the timeout period is reached). While it's marked as invisible, nobody else sees the message. I guess that's as close to "locked" as you're going to get.
If, while processing, you feel you need more time, you can update the message and extend its invisibility timeout.
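A minimal TypeScript sketch of extending the timeout with @azure/storage-queue's updateMessage; the queue name and timings are placeholders:

    import { QueueServiceClient } from "@azure/storage-queue";

    const queue = QueueServiceClient
      .fromConnectionString(process.env.AZURE_STORAGE_CONNECTION!) // assumed
      .getQueueClient("work"); // placeholder

    const { receivedMessageItems: [msg] } = await queue.receiveMessages({
      visibilityTimeout: 30, // initial invisibility window, in seconds
    });
    if (msg) {
      // Processing is taking longer than expected: extend the invisibility
      // window by another 60 seconds. updateMessage returns a new popReceipt
      // that must be used for any later delete or update.
      const updated = await queue.updateMessage(
        msg.messageId, msg.popReceipt, msg.messageText, 60);
      await queue.deleteMessage(msg.messageId, updated.popReceipt);
    }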
You do need to focus on idempotent operations with Windows Azure queues: Assume that any given message may be processed more than once:
Processing goes beyond the invisibility timeout, so some other worker gets the message
A VM instance crashes while processing the message, causing it to reappear in the queue and get processed again
