AWS SQS FIFO message doesn't seem to be retrying - node.js

I've set up a function to hit an API endpoint for a newly created entity that isn't immediately available. If the endpoint returns a status of "pending", the function throws an error. If the endpoint returns a status of "active", the function then deletes the SQS message and triggers several other microservices to do their things using SNS. The SQS queue that triggers the function has a visibility timeout of 2 minutes, and the function itself has a 1 minute timeout.
What I'm expecting is that if the endpoint returns a "pending" status and the function throws an error, then after the 2 minute visibility timeout the message triggers the function again. This should happen every 2 minutes until the API call returns an "active" status and the message is deleted, or until the message retention period is surpassed (currently 1 hour). This seemed like a nice serverless way to poll my newly created entity to check if it was ready for other post-processing.
What's actually happening after adding a message to the SQS queue is that the CloudWatch logs show the function throwing an error like I'd expect, but the function is only triggered once. I can't tell whether the message is just not visible for some reason or whether it was somehow deleted. I'm new to using SQS as a Lambda trigger; am I thinking about this wrong?

A few possible causes here:
1) Your Lambda function handler did not actually throw an exception to the Lambda runtime environment, so Lambda treated the message as successfully processed and deleted it from the queue (so that it would not be processed again).
2) Your SQS queue has a configured DLQ with maximum receives set to 1, so the message is delivered once, the Lambda function fails, and the message is then moved to the DLQ.
3) The SQS message was redelivered to the Lambda function and logged, but the log entries went to an earlier log stream (because the invocation was warm), so it wasn't obvious that the function had been invoked multiple times with the same failed message.
To verify this all works normally, I set up a simple test with both FIFO and non-FIFO queues and configured the queues to trigger a Lambda function that simply logged the SQS message and then threw an exception. As expected, I saw the same SQS message delivered to the Lambda function every 2 minutes (which is the queue's message visibility timeout). That continued until it hit the max receive count on the SQS redrive policy (defaults to 10 attempts) at which point the failed message was correctly moved to the associated DLQ.
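The first cause is easy to hit in Node.js when the error is caught, or thrown somewhere the runtime never sees it. Here is a minimal sketch of a handler that lets the failure reach Lambda. It assumes a hypothetical fetchStatus helper that resolves to "pending" or "active", a hypothetical onActive helper that would publish to SNS, and a message body containing an entityId field; none of these names come from the original post.

```javascript
// Sketch only: fetchStatus and onActive are hypothetical, injected helpers.
function makeHandler({ fetchStatus, onActive }) {
  return async function handler(event) {
    for (const record of event.Records) {
      const { entityId } = JSON.parse(record.body); // assumed message shape
      const status = await fetchStatus(entityId);
      if (status !== 'active') {
        // Rejecting the async handler is what tells Lambda the batch failed.
        // Catching this error and returning normally would let Lambda delete the message.
        throw new Error(`Entity ${entityId} is still ${status}`);
      }
      await onActive(entityId);
    }
    // Returning normally: Lambda deletes the messages of this batch from the queue.
    return { processed: event.Records.length };
  };
}
```

With an SQS event source, the function does not have to delete the message itself on success; returning without an error is enough.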

Related

AWS Lambda returns 200 to SQS when function times out

Resolved
Human error. Even though status 200 was reported, the messages were not deleted when the function timed out. I had a side effect that deleted single messages from the batch.
I have a Lambda function that is invoked by a SQS Message. Sometimes the function takes a long time and then it times out.
CloudWatch reports: Task timed out after 30.54 seconds
That is fine, and the SQS messages should then be retried because of this timeout/error. But in X-Ray I see that Lambda has error=true and response status=200. That means the SQS messages are deleted.
I could do timing in the Lambda function and return an error if the code takes longer than the timeout, but is there a way to make Lambda return an error (and not status 200) when it times out?
The functions are set up with the Serverless framework:
# the lambda function in serverless.yml
receiver:
  handler: handler.receiver
  events:
    - sqs:
        batchSize: 10
        arn:
          Fn::GetAtt:
            - ReceiverQueue
            - Arn
  memorySize: 2048
  timeout: 30
We were struggling with the same problem. Here is how we solved it:
1) The Lambda timeout is handled with middleware (How to log timed out Lambda invocations, middy).
2) We process the batch of messages from SQS, which leads to the following cases:
a. All messages were processed successfully: return {statusCode: 200}, and the messages will be deleted from SQS.
b. Some messages failed:
   we delete the successful messages from SQS ourselves (deleteMessages)
   we throw the error in the Lambda function, so the failed messages are not deleted from SQS and are retried automatically, depending on the SQS configuration.
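The two steps above could be sketched roughly as follows. This is not the middy middleware itself, only a hand-rolled illustration under assumptions: processMessage and deleteMessages are hypothetical injected helpers (deleteMessages would wrap an SQS batch delete in real code), and safetyMarginMs is an illustrative parameter. The only Lambda API used is context.getRemainingTimeInMillis().

```javascript
// Sketch only: processMessage and deleteMessages are hypothetical, injected helpers.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Gave up after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

function makeBatchHandler({ processMessage, deleteMessages, safetyMarginMs = 1000 }) {
  return async function handler(event, context) {
    // Fail on our own terms shortly before Lambda would kill the invocation.
    const budget = context.getRemainingTimeInMillis() - safetyMarginMs;
    const results = await withDeadline(
      Promise.allSettled(event.Records.map((record) => processMessage(record))),
      budget
    );
    const succeeded = event.Records.filter((_, i) => results[i].status === 'fulfilled');
    const failedCount = event.Records.length - succeeded.length;
    if (failedCount === 0) {
      return { statusCode: 200 }; // case a: Lambda deletes the whole batch
    }
    if (succeeded.length > 0) {
      await deleteMessages(succeeded); // case b: remove the successes ourselves
    }
    // case b: throwing leaves the failed messages on the queue for a retry
    throw new Error(`${failedCount} of ${event.Records.length} messages failed`);
  };
}
```

If the deadline fires, nothing is deleted and the whole batch is retried, so the processing step should tolerate seeing the same message twice.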

webjob QueueTrigger does not delete message from the queue

.NET Core 2.2, WebJobs SDK 3.0
I have a webjob that takes the messages from a queue. A standard QueueTrigger like this
public void ProcessQueueMessage(
[QueueTrigger("%WebJobs:WorkerQueueName%")] CloudQueueMessage queueMessage,
ILogger log)
At the end of the process I write the message to another queue (archive).
The function finishes successfully, but the message is kept in the source queue.
In Storage Explorer I can still see it (in this example I had 3 messages pending),
and the message is dequeued once again after 10 minutes.
How can I make it so the message is dequeued when my function is successful?
Btw my queue config is
BatchSize 1
MaxDequeueCount 2
MaxPollingInterval 00:00:04
VisibilityTimeout 02:00:00
The SDK should already handle that.
The message will be leased (or become invisible) for 30 seconds by default. If the job takes longer than that, then the lease will be renewed. The message will not become available to another instance of the function unless the host crashes or the function throws an exception. When the function completes successfully, the message is deleted by the SDK.
When a message is retrieved from the queue, the response includes the message and a pop receipt value, which is required to delete the message. The message is not automatically deleted from the queue, but after it has been retrieved, it is not visible to other clients for the time interval specified by the visibilitytimeout parameter. The client that retrieves the message is expected to delete the message after it has been processed, and before the time specified by the TimeNextVisible element of the response, which is calculated based on the value of the visibilitytimeout parameter. The value of visibilitytimeout is added to the time at which the message is retrieved to determine the value of TimeNextVisible.
So you shouldn't need to write any special code for deleting the message from the queue.
For more details you could refer to this article and this one.
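As a small worked example of the TimeNextVisible calculation quoted above, here is the arithmetic in JavaScript. The function name is made up for illustration, and the 2 hour value simply mirrors the VisibilityTimeout from the question's config.

```javascript
// TimeNextVisible = retrieval time + visibilitytimeout (hypothetical helper name).
function timeNextVisible(retrievedAt, visibilityTimeoutSeconds) {
  return new Date(retrievedAt.getTime() + visibilityTimeoutSeconds * 1000);
}

const retrievedAt = new Date('2020-01-01T00:00:00Z');
console.log(timeNextVisible(retrievedAt, 2 * 60 * 60).toISOString());
// prints "2020-01-01T02:00:00.000Z"
```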
It turns out that I was taking the queueMessage object I received as a parameter and putting it directly into another queue, which probably confused the SDK.
public void ProcessQueueMessage(
[QueueTrigger("%WebJobs:WorkerQueueName%")] CloudQueueMessage queueMessage,
ILogger log)
So I changed that: I now create a new CloudQueueMessage object before putting it in the other queue.
var newMessage = new CloudQueueMessage(queueMessage.AsString);
Now the message is properly deleted from the queue when my function returns.

Will Lambda receive SQS messages that have been consumed?

I have a system design like this:
SQS -> trigger -> Lambda -> if fails -> DLQ
Preconditions:
The Lambda function uses a try/catch block, so it won't throw any errors.
The Lambda function never runs out of memory or times out (according to Lambda monitoring).
The error count is 0 in Lambda monitoring.
We never use the SQS console to view messages.
Lambda SQS batchSize is set to 1.
DLQ Maximum Receives is set to 1.
Lambda invocations: about 60k.
After running for a while:
We found a few messages in the DLQ.
Messages in the DLQ have an ApproximateReceiveCount attribute of 2 or more.
Is this expected?
In my opinion, if no error is thrown in Lambda, the number of DLQ messages should always be zero.
SQS is designed to provide at-least-once delivery. There is a chance that your message could be received by more than one Lambda invocation.
If you need exactly-once delivery, consider using a FIFO queue.
Regarding the second part of your question (why messages are being written to the DLQ): the easiest way to troubleshoot that is to look for the Lambda invocation logs that match some uniquely identifying aspect of the SQS message in the DLQ.
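Because delivery is at least once, one common way to cope on a standard queue is to make the consumer idempotent. The sketch below is only an illustration: the in-memory Set is an assumption standing in for a durable store, and processMessage is a hypothetical helper. record.messageId is the field Lambda puts on each SQS record.

```javascript
// Sketch only: the in-memory Set stands in for a durable store (an assumption);
// it does not survive cold starts and is not shared between concurrent Lambda instances.
function makeIdempotentProcessor(processMessage, seenIds = new Set()) {
  return async function processOnce(record) {
    if (seenIds.has(record.messageId)) {
      return 'skipped'; // duplicate delivery of a message that was already handled
    }
    await processMessage(record);
    seenIds.add(record.messageId); // mark as done only after the work succeeded
    return 'processed';
  };
}
```

Logging record.messageId on every invocation also gives you the "uniquely identifying aspect" to search for when a message shows up in the DLQ.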

Azure Function calls itself after 3 minutes

I have the following code in my Azure Function, with a 5-minute manual timeout.
When I run the function in Azure, I see that it creates a new instance after 3 minutes.
Both instances complete successfully but return Status: 504 Gateway Timeout, which in turn fails my function execution.
I have hosted the function in an App Service Plan and also increased the timeout in the host.json file to 10 minutes:
"functionTimeout": "00:10:00"
Several questions in here:
Timeouts - The function timeout in host.json applies to the underlying function runtime, not the HTTP pipeline. You should not have an HTTP function running longer than a minute. The HTTP calls will time out independently of the runtime (as you see with the 504). However, you could use that timeout for a long-running (i.e., 60 minutes on an App Service plan) queue trigger. If you need a long-running function, the HTTP call could queue a message for a queue trigger, or you could use the Durable Functions support.
Why is it invoking again? The simplest interpretation is that your function is just receiving a second HTTP request. Do you have evidence that's not the case? You could bind to the HttpRequestMessage and log additional HTTP request properties to track this down.
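The "HTTP call queues a message for a queue trigger" idea can be sketched independently of any framework. The code below is not Azure Functions binding code; enqueue and generateId are hypothetical injected helpers, and the request/response shapes are assumptions made for illustration.

```javascript
// Sketch only: enqueue is a hypothetical helper that writes to a storage queue;
// generateId is injected so the example stays deterministic.
function makeHttpStarter({ enqueue, generateId }) {
  return async function onHttpRequest(request) {
    const jobId = generateId();
    await enqueue({ jobId, payload: request.body });
    // Respond right away; the long-running work happens in the queue-triggered function.
    return { status: 202, body: { jobId } };
  };
}
```

The HTTP function then finishes in milliseconds, and the functionTimeout setting only has to cover the queue-triggered worker.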

How to stop an Azure WebJobs queue message from being deleted from an Azure Queue?

I'm using Azure WebJobs to poll a queue and then process the message.
Part of the message processing includes a hit to 3rd party HTTP endpoint. (e.g. a Weather api or some Stock market api).
Now, if the hit to the api fails (network error, 500 error, whatever) I try/catch this in my code, log whatever and then ... what??
If I continue .. then I assume the message will be deleted by the WebJobs SDK.
How can I:
1) Say to the SDK - please don't delete this message (so it will be retried automatically at the next queue poll and when the message is visible again).
2) Set the invisibility time value, when the SDK pops a message off the queue for processing.
Thanks!
Now, if the hit to the api fails (network error, 500 error, whatever) I try/catch this in my code, log whatever and then ... what??
The WebJobs SDK behaves like this: if your method throws an uncaught exception, the message is returned to the queue with its dequeueCount property incremented by 1. Otherwise, the message is considered successfully processed and is deleted from the queue, i.e. queue.DeleteMessage(retrievedMessage);
So don't gracefully catch the HTTP 500; throw an exception so the SDK gets the hint.
If I continue .. then I assume the message will be deleted by the WebJobs SDK.
From https://github.com/Azure/azure-content/blob/master/articles/app-service-web/websites-dotnet-webjobs-sdk-get-started.md#contosoadswebjob---functionscs---generatethumbnail-method:
If the method fails before completing, the queue message is not deleted; after a 10-minute lease expires, the message is released to be picked up again and processed. This sequence won't be repeated indefinitely if a message always causes an exception. After 5 unsuccessful attempts to process a message, the message is moved to a queue named {queuename}-poison. The maximum number of attempts is configurable.
If you really dislike the hardcoded 10-minute visibility timeout (the time the message stays hidden from consumers), you can change it. See this answer by @mathewc:
From https://stackoverflow.com/a/34093943/4148708:
In the latest v1.1.0 release, you can now control the visibility timeout by registering your own custom QueueProcessor instances via JobHostConfiguration.Queues.QueueProcessorFactory. This allows you to control advanced message processing behavior globally or per queue/function.
https://github.com/Azure/azure-webjobs-sdk-samples/blob/master/BasicSamples/MiscOperations/CustomQueueProcessorFactory.cs#L63
protected override async Task ReleaseMessageAsync(CloudQueueMessage message, FunctionResult result, TimeSpan visibilityTimeout, CancellationToken cancellationToken)
{
    // demonstrates how visibility timeout for failed messages can be customized
    // the logic here could implement exponential backoff, etc.
    visibilityTimeout = TimeSpan.FromSeconds(message.DequeueCount);
    await base.ReleaseMessageAsync(message, result, visibilityTimeout, cancellationToken);
}
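The sample above uses a linear delay (one second per dequeue). The exponential backoff mentioned in its comment could compute the delay roughly like this; the arithmetic is shown in JavaScript only for illustration, and the base and cap values are assumptions, not SDK defaults.

```javascript
// Sketch only: baseSeconds and maxSeconds are illustrative assumptions.
function backoffSeconds(dequeueCount, baseSeconds = 2, maxSeconds = 600) {
  const attempts = Math.max(1, dequeueCount); // treat 0 or negative counts as a first attempt
  return Math.min(maxSeconds, baseSeconds * 2 ** (attempts - 1));
}

console.log([1, 2, 3, 4].map((n) => backoffSeconds(n)));
// prints [ 2, 4, 8, 16 ]
```

The result would take the place of message.DequeueCount inside TimeSpan.FromSeconds(...), with the cap keeping a frequently failing message from disappearing for hours.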

Resources