I simply want to disable Lambda retries when the function is launched by a Kinesis trigger. If the Lambda fails or exits, I don't want it to retry.
From AWS Lambda Retry Behavior - AWS Lambda:
Poll-based (or pull model) event sources that are stream-based: These consist of Kinesis Data Streams or DynamoDB. When a Lambda function invocation fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days.
The exception is treated as blocking, and AWS Lambda will not read any new records from the shard until the failed batch of records either expires or is processed successfully. This ensures that AWS Lambda processes the stream events in order.
There do not appear to be any configuration options to change this behaviour.
How about handling your error properly so that the invocation will still succeed and Lambda will not retry it anymore?
In NodeJS, it would be something like this...
export const handler = (event, context) => {
  return doWhateverAsync()
    .then(() => someSuccessfulValue)
    .catch((err) => {
      // Log the error at least.
      console.log(err)
      // But still return something so Lambda won't retry.
      return someSuccessfulValue
    })
}
If you are using a Lambda event source mapping to trigger your Lambda with batches of records from a Kinesis stream shard, you can configure the maximum number of retries that the event source mapping will make.
Another option is to configure the maximum age of a record that is sent to the function.
Retry attempts – The maximum number of times that Lambda retries when the function returns an error. This doesn't apply to service errors or throttles where the batch didn't reach the function.
Maximum age of record – The maximum age of a record that Lambda sends to your function.
A good practice is to configure a failure destination. This is usually an SQS queue or SNS topic. Details of the batch that caused the invocation to fail are stored there.
See https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html#services-kinesis-errors for more info.
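As a rough sketch of the settings described above, the function below builds the parameters one might pass to the Lambda UpdateEventSourceMapping API (for example through UpdateEventSourceMappingCommand in @aws-sdk/client-lambda). The helper name, the UUID, the ARN and the 60-second record age are placeholders chosen for illustration, not values from this thread. Nothing is sent to AWS here.

```javascript
// Hypothetical helper: builds an UpdateEventSourceMapping request that turns
// off retries for a Kinesis trigger. It only constructs the parameter object.
function buildNoRetryMappingParams(mappingUuid, failureDestinationArn) {
  const params = {
    UUID: mappingUuid,             // ID of the existing event source mapping
    MaximumRetryAttempts: 0,       // 0 = never retry a batch after a function error
    MaximumRecordAgeInSeconds: 60, // assumed value: skip records older than 60 s
  };
  if (failureDestinationArn) {
    // Optional SQS queue or SNS topic that receives details of the failed batch.
    params.DestinationConfig = { OnFailure: { Destination: failureDestinationArn } };
  }
  return params;
}

// Placeholder identifiers, for illustration only.
const params = buildNoRetryMappingParams(
  'example-mapping-uuid',
  'arn:aws:sqs:us-east-1:123456789012:example-failed-batches'
);
console.log(JSON.stringify(params, null, 2));
```

The same three settings are also exposed in the console and in infrastructure templates, so the SDK call is only one of several ways to apply them.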
Related
I've set up a function to hit an API endpoint for a newly created entity that isn't immediately available. If the endpoint returns a status of "pending", the function throws an error. If the endpoint returns a status of "active", the function then deletes the SQS message and triggers several other microservices to do their things using SNS. The SQS queue that triggers the function has a visibility timeout of 2 minutes, and the function itself has a 1 minute timeout.
What I'm expecting to happen is that if the endpoint returns a "pending" status, and the function throws an error, then after the 2 minute visibility timeout, the message would trigger the function again. This should happen every 2 minutes until the api call returns an "active" status and the message is deleted, or until the message retention period is surpassed (currently 1 hour). This seemed like a nice serverless way to poll my newly created entity to check if it was ready for other post-processing.
What's actually happening after adding a message to the SQS queue is that the CloudWatch logs show the function throwing an error like I'd expect, but the function is only being triggered one time. I can't tell if the message is just not visible for some reason, or if it was somehow deleted. I don't know. I'm new to using SQS as a Lambda trigger; am I thinking about this wrong?
A few possible causes here:
your Lambda function handler did not actually throw an exception to the Lambda runtime environment, so Lambda thought the function had successfully processed the message and the Lambda service then deleted the message from the queue (so that it would not get processed again)
your SQS queue has a configured DLQ with maximum receives set to 1, so the message is delivered once, the Lambda function fails, and the message is subsequently moved to the DLQ
the SQS message was re-delivered to the Lambda function and was logged but the logs were made to an earlier log stream (because this invocation was warm) and so it wasn't obvious that the Lambda function had actually been invoked multiple times with the same failed message
To verify this all works normally, I set up a simple test with both FIFO and non-FIFO queues and configured the queues to trigger a Lambda function that simply logged the SQS message and then threw an exception. As expected, I saw the same SQS message delivered to the Lambda function every 2 minutes (which is the queue's message visibility timeout). That continued until it hit the max receive count on the SQS redrive policy (defaults to 10 attempts) at which point the failed message was correctly moved to the associated DLQ.
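For the first possible cause, here is a minimal sketch of a handler that really does fail the invocation while the entity is pending. It assumes each SQS message body is JSON with an entityId field, and fetchStatus is a hypothetical stand-in for the API call; neither comes from the question.

```javascript
// Hypothetical factory: fetchStatus(entityId) stands in for the real API call
// and is expected to resolve to a status string such as 'pending' or 'active'.
function makeHandler(fetchStatus) {
  return async function handler(event) {
    for (const record of event.Records) {
      const { entityId } = JSON.parse(record.body); // assumed message shape
      const status = await fetchStatus(entityId);
      if (status !== 'active') {
        // A rejected promise fails the invocation, so Lambda does not delete
        // the message and it becomes visible again after the visibility timeout.
        throw new Error(`Entity ${entityId} is ${status}, not active yet`);
      }
      console.log(`Entity ${entityId} is active`);
    }
    // Returning normally signals success, and Lambda deletes the batch.
  };
}

// Local demo with a fake status lookup that always reports 'pending'.
const sampleEvent = { Records: [{ body: JSON.stringify({ entityId: 'e-1' }) }] };
makeHandler(async () => 'pending')(sampleEvent)
  .catch((err) => console.log('invocation failed:', err.message));
```

If the error is caught and swallowed inside the handler, or the promise is never awaited, the invocation counts as a success and the message is removed from the queue.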
Let's suppose I have a database on DynamoDB, and I am currently using streams and lambda functions to send that data to Elasticsearch.
Here's the thing, supposing the data is saved successfully on DynamoDB, is there a way for me to be 100% sure that the data has been saved on Elasticsearch as well?
Considering I have a function to save that data on DDB is there a way for me communicate with the lambda function triggered by DDB before returning a status code answer, so I can receive confirmation before returning?
I want to do that in order to return ok both from my function and the lambda function at the same time.
This doesn't look like the correct approach for this problem. We generally use DynamoDB Streams + Lambda for operations that are async in nature and when we don't have to communicate the status of this Lambda execution to the client.
So I suggest the following two approaches that are the closest to what you are trying to achieve -
Make the operation completely synchronous. i.e., do the DynamoDB insert and ElasticSearch insert in the same call (without any Ddb Stream and Lambda triggers). This will ensure that you return the correct status of both writes to the client. Also, in case the ES insert fails, you have an option to revert the Ddb write and then return the complete status as failed.
The first approach obviously adds to the latency of the function, so you can continue with your original approach but let the client know about it. It will work as follows -
Client calls your API.
API inserts record into Ddb and returns to the client.
The client receives the status and displays a message to the user that their request is being processed.
The client then starts polling for the status of the ES insert via another API.
Meanwhile, the Ddb stream triggers the ES insert Lambda fn and completes the ES write.
The poller on the client comes to know about the successful insert into ES and displays a final success message to the user.
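The client-side polling step in the flow above could be sketched like this. checkIndexed is a hypothetical function that calls the status API and resolves to true once the ES write is visible; the interval and attempt limit are arbitrary defaults.

```javascript
// Hypothetical client-side poller. checkIndexed() stands in for a call to the
// status API and should resolve to true once the Elasticsearch write is done.
async function pollUntilIndexed(checkIndexed, { intervalMs = 1000, maxAttempts = 10 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await checkIndexed()) {
      return { indexed: true, attempts: attempt };
    }
    if (attempt < maxAttempts) {
      // Wait before asking again so the status API is not hammered.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  // Give up after maxAttempts; the caller decides what to tell the user.
  return { indexed: false, attempts: maxAttempts };
}

// Demo with a fake status check that succeeds on the third call.
let calls = 0;
pollUntilIndexed(async () => ++calls >= 3, { intervalMs: 10 })
  .then((result) => console.log(result));
```

Capping the number of attempts matters here: if the stream-triggered Lambda keeps failing, the client should eventually show an error instead of polling forever.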
Resolved
Human error. Even though status 200 was reported, the messages were not deleted when the function timed out. I had a side effect that deleted single messages from the batch.
I have a Lambda function that is invoked by a SQS Message. Sometimes the function takes a long time and then it times out.
Cloudwatch reports: Task timed out after 30.54 seconds
That is fine, but the SQS messages should then be retried because of this timeout/error, but in X-Ray I see that Lambda has error=true, but response status=200. That means the SQS messages are deleted.
I could do timing in the Lambda function and return an error if the code takes longer than the timeout, but is there a way to make Lambda return an error (and not status 200) when it times out?
The functions are set up with the Serverless framework:
# the lambda function in serverless.yml
receiver:
  handler: handler.receiver
  events:
    - sqs:
        batchSize: 10
        arn:
          Fn::GetAtt:
            - ReceiverQueue
            - Arn
  memorySize: 2048
  timeout: 30
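The "do timing in the Lambda function" idea mentioned in the question could look roughly like the sketch below. context.getRemainingTimeInMillis() is part of the Node.js Lambda context object; withDeadline, the 500 ms safety margin and the fake context used in the demo are assumptions for illustration.

```javascript
// Hypothetical wrapper: rejects shortly before the Lambda timeout so the
// function fails with its own error instead of being killed mid-flight.
function withDeadline(work, context, safetyMarginMs = 500) {
  const budgetMs = context.getRemainingTimeInMillis() - safetyMarginMs;
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Gave up after ${budgetMs} ms to stay under the Lambda timeout`)),
      budgetMs
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([work(), deadline]).finally(() => clearTimeout(timer));
}

// Local demo: a stub context with 600 ms left and work that never finishes.
const fakeContext = { getRemainingTimeInMillis: () => 600 };
withDeadline(() => new Promise(() => {}), fakeContext)
  .catch((err) => console.log(err.message));
```

In a real handler this might be used as `(event, context) => withDeadline(() => processBatch(event), context)`, where processBatch is whatever the function normally does. The rejected promise fails the invocation, so the SQS messages stay on the queue.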
We were struggling with the same problem. Here is how we solved it:
The lambda timeout is handled with middleware (How to log timed out Lambda invocations, middy)
We process the batch of messages from SQS, and the following cases come up:
a. all messages were processed successfully - return {statusCode: 200}, and the messages will be deleted from SQS
b. some messages failed:
we delete the successful messages from SQS (deleteMesages)
throw an error in the Lambda function, so the failed messages are not deleted from SQS and are retried automatically, depending on the SQS configuration.
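A minimal sketch of cases a and b, assuming two injected helpers that are not from this answer: processMessage(record) does the real work for one message, and deleteMessages(records) wraps the SQS delete call.

```javascript
// Hypothetical factory: processMessage(record) and deleteMessages(records)
// are stand-ins for the business logic and the SQS delete call.
function makeBatchHandler(processMessage, deleteMessages) {
  return async function handler(event) {
    // Run every message and collect the outcomes instead of stopping at the
    // first failure. The async wrapper also turns synchronous throws into rejections.
    const results = await Promise.allSettled(
      event.Records.map(async (record) => processMessage(record))
    );
    const succeeded = event.Records.filter((_, i) => results[i].status === 'fulfilled');
    const failedCount = results.length - succeeded.length;

    if (failedCount === 0) {
      return { statusCode: 200 }; // case a: Lambda deletes the whole batch
    }
    // case b: delete the successes ourselves, then fail the invocation so
    // only the failed messages become visible again and are retried.
    if (succeeded.length > 0) {
      await deleteMessages(succeeded);
    }
    throw new Error(`${failedCount} of ${results.length} messages failed`);
  };
}

// Local demo with fake helpers: the message with body 'bad' fails.
const demoEvent = { Records: [{ messageId: '1', body: 'ok' }, { messageId: '2', body: 'bad' }] };
const demoHandler = makeBatchHandler(
  async (record) => { if (record.body === 'bad') throw new Error('boom'); },
  async (records) => console.log('deleting', records.map((r) => r.messageId))
);
demoHandler(demoEvent).catch((err) => console.log('invocation failed:', err.message));
```

Lambda's SQS trigger can also report partial batch failures directly (the ReportBatchItemFailures response type), which avoids the manual delete. That is a different technique from the one described in this answer.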
I have a system design like this
SQS -> trigger -> Lambda -> if fails -> DLQ
Preconditions
The Lambda function uses a try/catch block, so it won't throw any errors.
The Lambda function never runs out of memory or times out (according to Lambda monitoring).
The error count is 0 in Lambda monitoring.
We never use the SQS console to view messages.
Lambda SQS batchSize is set to 1.
DLQ Maximum Receives is set to 1.
Lambda invocations: about 60k.
After running for a while
we found a few messages in the DLQ.
The messages in the DLQ have an ApproximateReceiveCount attribute of 2 or more.
Is this expected?
In my opinion, if no error is thrown in the Lambda, the number of messages in the DLQ should always be zero.
SQS is designed to provide at least once delivery. There is a chance that your message could be received by more than one lambda invocation.
If you need exactly-once delivery, consider using a FIFO queue.
Regarding the second part of your question - why are messages being written to the DLQ. The easiest way to troubleshoot that is to look for the lambda invocation logs which match some uniquely identifying aspect of the SQS message in the DLQ.
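Because a standard queue can deliver the same message more than once, the consumer usually has to tolerate duplicates. The sketch below logs the message ID and receive count (useful when matching logs against a DLQ message) and skips IDs it has already handled. The in-memory Set is an assumption made only to keep the example runnable; it lives in a single warm container, so a real setup would need a shared store.

```javascript
// Hypothetical de-duplicating wrapper around processMessage(record).
// seenIds is an in-memory Set for illustration only; it is not shared
// between Lambda containers and is lost on a cold start.
function makeIdempotentHandler(processMessage, seenIds = new Set()) {
  return async function handler(event) {
    let processed = 0;
    for (const record of event.Records) {
      const receiveCount = record.attributes && record.attributes.ApproximateReceiveCount;
      console.log(`message ${record.messageId}, receive count ${receiveCount}`);
      if (seenIds.has(record.messageId)) {
        continue; // duplicate delivery: already handled, nothing to do
      }
      await processMessage(record);
      seenIds.add(record.messageId); // mark as done only after success
      processed++;
    }
    return { processed };
  };
}

// Local demo: the same message delivered twice is processed once.
const duplicateEvent = { Records: [{ messageId: 'm-1', attributes: { ApproximateReceiveCount: '1' }, body: '{}' }] };
const dedupHandler = makeIdempotentHandler(async () => {});
dedupHandler(duplicateEvent)
  .then(() => dedupHandler(duplicateEvent))
  .then((second) => console.log('second delivery processed:', second.processed));
```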
I have a NodeJS endpoint that receives requests to gather data from a reporting engine.
To keep the request endpoint light and because some of the reports generated have a few steps (Gather data -> assemble report -> convert to PDF -> Email to relevant person) I want to separate the inbound request from the job itself.
Using AWS.SQS I can accept the request, put the variables into SQS and then respond with a 200 / 201.
What are some of the better practices around picking this job up on the other end?
If I were to trigger a lambda function would I have to wait for that function to complete before 200 / 201 can be sent? or can I:
Accept Request ->
Job to SQS ->
Initiate Lambda function ->
200 Response.
Alternatively what other options would be available to decouple the inbound request from the processing itself?
Here are a few options:
1. Insert the request in your SQS queue and return a 200 response immediately. Have a process on an EC2 server polling the SQS queue and performing the query when it gets a message out of SQS.
2. Invoke a Lambda function asynchronously, passing it the properties needed to perform the query, and return a 200 response immediately. Since you invoked the Lambda function asynchronously, your NodeJS code that invoked the Lambda function doesn't wait for the function to complete.
3. An alternative to #2 is to send the request to an SNS topic, and have the SNS topic configured to invoke the Lambda function. This is probably the best method if you are using Lambda, because SNS will retry if the Lambda function fails for some reason.
I don't recommend combining SQS with Lambda because those two services don't integrate very well. SNS on the other hand does integrate very well with Lambda.
Also, you need to make sure your Lambda function invocations can be completed in under 15 minutes since that's currently the maximum time a Lambda function can execute. If you need individual steps to run for longer than that, you will need to use EC2 or ECS.
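Whichever of these options is used, the endpoint side has the same shape: hand the job off, then respond without waiting for the report. A minimal sketch, where enqueue is a hypothetical stand-in for the SQS send, SNS publish or asynchronous Lambda invoke, and the reportType and recipient fields are assumed request properties:

```javascript
// Hypothetical request handler factory. enqueue(messageBody) stands in for
// the call that hands the job off (SQS send, SNS publish, async invoke).
function makeReportRequestHandler(enqueue) {
  return async function handleRequest(requestBody) {
    const job = {
      reportType: requestBody.reportType, // assumed request field
      recipient: requestBody.recipient,   // assumed request field
      requestedAt: new Date().toISOString(),
    };
    // Only the hand-off is awaited, not the report generation itself.
    await enqueue(JSON.stringify(job));
    return { statusCode: 200, body: JSON.stringify({ queued: true }) };
  };
}

// Local demo with an in-memory array acting as the queue.
const fakeQueue = [];
const handleRequest = makeReportRequestHandler(async (body) => { fakeQueue.push(body); });
handleRequest({ reportType: 'monthly', recipient: 'someone@example.com' })
  .then((response) => console.log(response.statusCode, fakeQueue.length));
```

If the hand-off itself fails, the promise rejects and the endpoint can return an error instead of a 200, so the caller knows the job was never queued.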
I think AWS Step Functions may be a good fit for your use case.