How do I fail a specific SQS message in a batch from a Lambda? - node.js

I have a Lambda with an SQS trigger. When it gets hit, a batch of records from SQS comes in (usually about 10 at a time, I think). If I return a failed status code from the handler, all 10 messages will be retried. If I return a success code, they'll all be removed from the queue. What if 1 out of those 10 messages failed and I want to retry just that one?
exports.handler = async (event) => {
  for (const e of event.Records) {
    try {
      let body = JSON.parse(e.body);
      // do things
    }
    catch (e) {
      // one message failed, i want it to be retried
    }
  }
  // returning this causes ALL messages in
  // this batch to be removed from the queue
  return {
    statusCode: 200,
    body: 'Finished.'
  };
};
Do I have to manually re-add that one message back to the queue? Or can I return a status from my handler that indicates that one message failed and should be retried?

As per the AWS documentation, the SQS event source mapping now supports handling partial failures out of the box. The gist of the linked article is as follows:
Include ReportBatchItemFailures in your EventSourceMapping configuration
The response syntax in case of failures has to be modified to have:
{
  "batchItemFailures": [
    { "itemIdentifier": "id2" },
    { "itemIdentifier": "id4" }
  ]
}
where id2 and id4 are the failed messageIds in the batch.
Quoting the documentation as is:
Lambda treats a batch as a complete success if your function returns any of the following:
An empty batchItemFailures list
A null batchItemFailures list
An empty EventResponse
A null EventResponse
Lambda treats a batch as a complete failure if your function returns any of the following:
An invalid JSON response
An empty string itemIdentifier
A null itemIdentifier
An itemIdentifier with a bad key name
An itemIdentifier value with a message ID that doesn't exist
According to the documentation, SAM support is not yet available for the feature, but one of the AWS Labs examples points to its usage in SAM, and it worked for me when tested.
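For reference, a rough sketch of what enabling this can look like in a SAM template (the function, queue, and runtime names here are placeholders):
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: index.handler
    Runtime: nodejs18.x
    Events:
      MySQSEvent:
        Type: SQS
        Properties:
          Queue: !GetAtt MyQueue.Arn
          BatchSize: 10
          FunctionResponseTypes:
            - ReportBatchItemFailures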

Yes, you have to manually re-add the failed messages to the queue.
What I suggest is keeping a fail count: if all messages failed, you can simply return a failed status so the whole batch is retried; otherwise, if the fail count is less than 10, you can individually send the failed messages back to the queue.
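A minimal sketch of that fail-count idea, assuming the aws-sdk v2 SQS client and that the source queue URL is available in an environment variable:
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
  const failed = [];
  for (const record of event.Records) {
    try {
      const body = JSON.parse(record.body);
      // do things with body
    } catch (err) {
      failed.push(record);
    }
  }

  if (failed.length === event.Records.length) {
    // everything failed: throw so Lambda retries the whole batch
    throw new Error('Entire batch failed');
  }

  // otherwise re-queue only the failed messages, then return success
  // so the original batch is removed from the queue
  for (const record of failed) {
    await sqs.sendMessage({
      QueueUrl: process.env.QUEUE_URL, // assumed environment variable
      MessageBody: record.body
    }).promise();
  }

  return { statusCode: 200, body: 'Finished.' };
};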

You have to programmatically delete each message from the queue after processing it successfully.
You can set a flag to true if any of the messages failed and, based on it, raise an error after processing the whole batch; the successfully processed messages have already been deleted, and the remaining messages will be reprocessed according to the retry policy.
So, as per the logic below, only failed and unprocessed messages will get retried.
import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    # queue URL; recommended to set it as an environment variable
    queue_url = "queue url"
    any_failed = False
    for message in event["Records"]:
        try:
            message_body = message["body"]
            print("do some processing :)")
            # delete the message only after it was processed successfully
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["receiptHandle"]
            )
        except Exception as error:
            print("processing failed:", error)
            any_failed = True
    if any_failed:
        # raising here makes Lambda return the undeleted messages for retry
        raise Exception("one or more messages in the batch failed")
There is also another way: collect the IDs and receipt handles of the successfully processed messages and perform a single batch delete based on them:
response = sqs.delete_message_batch(
    QueueUrl='string',
    Entries=[
        {
            'Id': 'string',
            'ReceiptHandle': 'string'
        },
    ]
)

You need to design your app in a different way. Here are a few ideas; not the best, but they will solve your problem.
Solution 1:
Create an SQS delivery queue: sq1
Create a delay queue as per the delay requirement: sq2
Create a dead-letter queue: sdl
Now, inside the Lambda function, if a message from sq1 fails, delete it from sq1 and drop it onto sq2 for retry. (Any Lambda function invoked asynchronously is retried twice before the event is discarded.)
If it fails again after the given retries, move it into the dead-letter queue sdl.
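A rough sketch of that forwarding step, assuming the aws-sdk v2 SQS client and placeholder queue URLs in environment variables:
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// called when processing a message from sq1 has failed
async function forwardToRetryQueue(record) {
  // drop the message onto the delay queue sq2 for a later retry
  await sqs.sendMessage({
    QueueUrl: process.env.SQ2_URL, // assumed environment variable
    MessageBody: record.body,
    DelaySeconds: 60 // per-message delay, up to 900 seconds
  }).promise();

  // remove it from the delivery queue sq1
  await sqs.deleteMessage({
    QueueUrl: process.env.SQ1_URL, // assumed environment variable
    ReceiptHandle: record.receiptHandle
  }).promise();
}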
AWS Lambda - processing messages in Batches
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
Note: When an SQS event source mapping is initially created and enabled, or when messages first appear after a period with no traffic, the Lambda service begins polling the SQS queue using five parallel long-polling connections. As per the AWS documentation, the default duration for a long poll from AWS Lambda to SQS is 20 seconds.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-sqs
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
Solution 2:
Use AWS StepFunction
https://aws.amazon.com/step-functions/
Step Functions will call the Lambda and handle the retry logic on failure, with configurable exponential back-off if needed (a sketch follows the links below).
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
https://cloudacademy.com/blog/aws-step-functions-a-serverless-orchestrator/
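For illustration, a hedged sketch of what a Retry block in an Amazon States Language definition might look like (the state name and function ARN are placeholders):
"ProcessMessage": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-worker",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "End": true
}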
Solution 3:
CloudWatch scheduled event to trigger a Lambda function that polls for FAILED.
Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events invokes your Lambda function asynchronously.
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
https://engineering.opsgenie.com/aws-lambda-performance-series-part-2-an-analysis-on-async-lambda-fail-retry-behaviour-and-dead-b84620af406
https://dzone.com/articles/asynchronous-retries-with-aws-sqs
https://medium.com/@ron_73212/how-to-handle-aws-lambda-errors-like-a-pro-e5455b013d10

AWS supports partial batch responses (with ReportBatchItemFailures enabled on the event source mapping). Here is an example in TypeScript:
type Result = {
  itemIdentifier: string
  status: 'failed' | 'success'
}

const isFulfilled = <T>(
  result: PromiseFulfilledResult<T> | PromiseRejectedResult
): result is PromiseFulfilledResult<T> => result.status === 'fulfilled'

const isFailed = (
  result: PromiseFulfilledResult<Result>
): result is PromiseFulfilledResult<
  Omit<Result, 'status'> & { status: 'failed' }
> => result.value.status === 'failed'

const results = await Promise.allSettled(
  sqsEvent.Records.map(async (record): Promise<Result> => {
    try {
      // do things with record.body here
      return { status: 'success', itemIdentifier: record.messageId }
    } catch (e) {
      console.error(e)
      return { status: 'failed', itemIdentifier: record.messageId }
    }
  })
)

// report only the failed message IDs back to Lambda
return {
  batchItemFailures: results
    .filter(isFulfilled)
    .filter(isFailed)
    .map((result) => ({
      itemIdentifier: result.value.itemIdentifier,
    })),
}

Related

AWS SQS messages do not become available again after visibility timeout

This is most likely something really simple, but for some reason my SQS messages do not become available again after the visibility timeout. At least this is what I figured, since the consuming Lambda has no log entries indicating that any retries have been triggered.
My use case is that another Lambda is feeding the SQS queue with JSON entities that then need to be sent forward. Sometimes there's so much data to be sent that the receiving end responds with HTTP 429.
My sending Lambda (JSON body over HTTPS) deletes the messages from the queue only when the service responds with HTTP 200; otherwise I do nothing with the receiptHandle, which I think should keep the message in the queue.
Anyway, when a request is rejected by the service, the message does not become available again, so it is never retried and is lost forever.
Incoming SQS has been setup as follows:
Visibility timeout: 3min
Delivery delay: 0s
Receive message wait time: 1s
Message retention period: 1d
Maximum message size: 256Kb
Associated DLQ has Maximum receives of 100
The consuming lambda is configured as
Memory: 128Mb
Timeout: 10s
Triggers: The source SQS queue, Batch size: 10, Batch window: None
And the actual logic I have in my Lambda is quite simple, really. The event it receives contains the Records from the queue. The Lambda might get more than one record at a time, but all records are handled separately.
console.log('Response', response);
if (response.status === 'CREATED') {
  /* some code here */
  const deleteParams = {
    QueueUrl: queueUrl, /* required */
    ReceiptHandle: receiptHandle /* required */
  };
  console.log('Removing record from ', queueUrl, 'with params', deleteParams);
  await sqs.deleteMessage(deleteParams).promise();
} else {
  /* any record that ends up here, are never seen again :( */
  console.log('Keeping record in', eventSourceARN);
}
What do :( ?!?!11
otherwise I do nothing with the receiptHandle which I think should then keep the message in the queue
That's not how it works:
Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.
When an AWS Lambda function is triggered from an Amazon SQS queue, all activities related to SQS are handled by the Lambda service. Your code should not call any Amazon SQS functions.
The messages will be provided to the AWS Lambda function via the event parameter. When the function successfully exits, the Lambda service will delete the messages from the queue.
Your code should not be calling DeleteMessage().
If you wish to signal that some of the messages were not successfully processed, you can use a partial batch response to indicate which messages were successfully processed. The AWS Lambda service will then make the unsuccessful messages available on the queue again.
Thanks to everyone who answered. So I got this "problem" solved just by going through the documents.
I'll provide a more detailed answer to my own question here in case someone besides me didn't get it on the first go :)
So the function should return a batchItemFailures object containing the message IDs of the failed records (this is honored only when ReportBatchItemFailures is enabled on the event source mapping).
So, for example, one can have a Lambda handler such as
/**
 * Handler
 *
 * @param {*} event SQS event
 * @returns {Object} batch item failures
 */
exports.handler = async (event) => {
  console.log('Starting ' + process.env.AWS_LAMBDA_FUNCTION_NAME);
  console.log('Received event', event);

  event = typeof event === 'object'
    ? event
    : JSON.parse(event);

  const batchItemFailures = await execute(event.Records);

  if (batchItemFailures.length > 0) {
    console.log('Failures', batchItemFailures);
  } else {
    console.log('No failures');
  }

  return {
    batchItemFailures: batchItemFailures
  };
};
and an execute function that handles the messages
/**
 * Execute
 *
 * @param {Array} records SQS records
 * @returns {Promise<*[]>} batch item failures
 */
async function execute (records) {
  let batchItemFailures = [];

  for (let index = 0; index < records.length; index++) {
    const record = records[index];
    // ...some async stuff here

    if (someSuccessCondition) {
      console.log('Life is good');
    } else {
      batchItemFailures.push({
        itemIdentifier: record.messageId
      });
    }
  }

  return batchItemFailures;
}

Google App Engine Node Task Queue Retries before Finished

I am running a slow operation via a cloud task queue to delete objects from Google Cloud Storage. I have noticed that the task queue retries the task after two minutes have passed, even though the running task has not yet finished or errored.
What is the best strategy to trigger valid retries, but not retry while the task is still running?
Here's my task creator:
router.get('/start-delete-old', async (req, res) => {
  const task = {
    appEngineHttpRequest: {
      httpMethod: 'POST',
      relativeUri: `/videos/delete-old`,
    },
  };
  const request = {
    parent: taskClient.parent,
    task: task,
  };
  const [response] = await taskClient.queue.createTask(request);
  res.send(response);
});
Here's my task handler:
router.post('/delete-old', async (req, res) => {
  let cameras = await knex('cameras')
  let date = moment().subtract(365, 'days').format('YYYY-MM-DD');
  for (let i = 0; i < cameras.length; i++) {
    let camera = cameras[i]
    let prefix = `${camera.id}/${date}/`
    try {
      await bucket.deleteFiles({ prefix: prefix, force: true })
      await knex.raw(`delete from videos where camera_id = ${camera.id} and cast(start_time as date) = '${date}'`)
    }
    catch (e) {
      console.log('error deleting ' + e)
    }
  }
  res.send({});
});
As per the documentation, a timeout in a task can vary depending on the environment you are using:
Standard environment
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
Flexible environment
For worker services running in the flex environment: all types have a 60 minute timeout.
So, if your handler misses the deadline, the queue assumes the task failed and retries it.
Also, the task queue expects to receive a status code between 200 and 299; if not, it will assume the running task has failed. Quoting the documentation:
Upon successful completion of processing, the handler must send an HTTP status code between 200 and 299 back to the queue. Any other value indicates the task has failed and the queue retries the task.
I believe that both the bucket deleteFiles call and the knex raw query are taking a long time to process, and this is causing the handler to miss the deadline without returning a 200-299 status.
A good way to troubleshoot is using Stackdriver logs: you will be able to gather more information about the ongoing processes and see whether any of them returns an error.

ScriptHost error occurred despite catching an exception

I have a simple function that takes a message from a queue and saves it to a storage table. I expect that in some cases a table entity with the same data can already exist. Because of that, I added exception handling to skip this type of situation and mark the queue message as processed. Despite the fact that the exception is handled now, the ScriptHost informs me about an error and the message is still in the queue.
I suppose it is caused by the fact that I'm using a table binding that sits on the edge between the host and my code. Am I right? Should I use a table client within my code instead of the binding? Is there a different approach?
Sample code to generate this situation:
[FunctionName("MyFunction")]
public static async Task Run(
    [QueueTrigger("myqueue", Connection = "Conn")] string msg,
    [Table("mytable", Connection = "Conn")] IAsyncCollector<DataEntity> dataEntity,
    TraceWriter log)
{
    try
    {
        await dataEntity.AddAsync(new DataEntity()
        {
            PartitionKey = "1",
            RowKey = "1",
            Data = msg
        });
        await dataEntity.FlushAsync();
    }
    catch (StorageException e)
    {
        // when it is an exception that informs "entity already exists", skip it
    }
}
When a queue trigger function fails, Azure Functions retries the function up to five times for a given queue message, including the first try.
If all five attempts fail, the functions runtime adds a message to a queue named <originalqueuename>-poison.
You can write a function to process messages from the poison queue by logging them or sending a notification that manual attention is needed.
The host.json file contains settings that control queue trigger behavior:
{
  "queues": {
    "maxPollingInterval": 2000,
    "visibilityTimeout": "00:00:30",
    "batchSize": 16,
    "maxDequeueCount": 1,
    "newBatchThreshold": 8
  }
}
Note: maxDequeueCount defaults to 5; it is the number of times to try processing a message before moving it to the poison queue. For your need, you could set "maxDequeueCount": 1, as in the sample above.
Also, these settings are host-wide and apply to all functions; you can't control them per function currently.

Can I filter an Azure ServiceBusService using node.js SDK?

I have millions of messages in a queue and the first ten million or so are irrelevant. Each message has a sequential ActionId so ideally anything < 10000000 I can just ignore or better yet delete from the queue. What I have so far:
let azure = require("azure");

function processMessage(sb, message) {
  // Deserialize the JSON body into an object representing the ActionRecorded event
  var actionRecorded = JSON.parse(message.body);
  console.log(`processing id: ${actionRecorded.ActionId} from ${actionRecorded.ActionTaken.ActionTakenDate}`);
  if (actionRecorded.ActionId < 10000000) {
    // When done, delete the message from the queue
    console.log(`Deleting message: ${message.brokerProperties.MessageId} with ActionId: ${actionRecorded.ActionId}`);
    sb.deleteMessage(message, function(deleteError, response) {
      if (deleteError) {
        console.log("Error deleting message: " + message.brokerProperties.MessageId);
      }
    });
  }
  // immediately check for another message
  checkForMessages(sb);
}

function checkForMessages(sb) {
  // Checking for messages
  sb.receiveQueueMessage("my-queue-name", { isPeekLock: true }, function(receiveError, message) {
    if (receiveError && receiveError === "No messages to receive") {
      console.log("No messages left in queue");
      return;
    } else if (receiveError) {
      console.log("Receive error: " + receiveError);
    } else {
      processMessage(sb, message);
    }
  });
}

let connectionString = "Endpoint=sb://<myhub>.servicebus.windows.net/;SharedAccessKeyName=KEYNAME;SharedAccessKey=[mykey]"
let serviceBusService = azure.createServiceBusService(connectionString);
checkForMessages(serviceBusService);
I've tried looking at the docs for withFilter but it doesn't seem like that applies to queues.
I don't have access to create or modify the underlying queue aside from the operations mentioned above since the queue is provided by a client.
Can I either
Filter my results that I get from the queue
speed up the queue processing somehow?
Filter my results that I get from the queue
As you found, filters as a feature are only applicable to Topics & Subscriptions.
speed up the queue processing somehow
If you were to use the #azure/service-bus package which is the newer, faster library to work with Service Bus, you could receive the messages in ReceiveAndDelete mode until you reach the message with ActionId 9999999, close that receiver and then create a new receiver in PeekLock mode. For more on these receive modes, see https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-transfers-locks-settlement#settling-receive-operations
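A rough sketch of that approach with @azure/service-bus (v7), assuming the same connection string and queue name as above; note that messages received in ReceiveAndDelete mode are removed immediately, so the final batch that crosses the threshold is consumed as well:
const { ServiceBusClient } = require("@azure/service-bus");

async function drainOldMessages(connectionString, queueName) {
  const client = new ServiceBusClient(connectionString);

  // ReceiveAndDelete: messages are deleted as soon as they are received
  const drainReceiver = client.createReceiver(queueName, {
    receiveMode: "receiveAndDelete"
  });

  let done = false;
  while (!done) {
    const messages = await drainReceiver.receiveMessages(100, { maxWaitTimeInMs: 5000 });
    if (messages.length === 0) break;
    for (const message of messages) {
      const body = typeof message.body === "string" ? JSON.parse(message.body) : message.body;
      if (body.ActionId >= 10000000) {
        // reached the relevant messages; stop draining
        done = true;
        break;
      }
    }
  }
  await drainReceiver.close();

  // from here on, use a PeekLock receiver for the real processing
  const receiver = client.createReceiver(queueName, { receiveMode: "peekLock" });
  // ... process messages, then receiver.completeMessage(message) on success
}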

preventing race conditions with nodejs

I'm writing an application using Node.js 6.3.0 and AWS DynamoDB.
The DynamoDB table holds statistics that are added from 10 different functions (10 different statistic measures). The interval is set to 10 seconds, which means that every 10 seconds, 10 calls to my function are made to add all the relevant information.
The putItem function:
function putItem(tableName, itemData, callback) {
  var params = {
    TableName: tableName,
    Item: itemData
  };
  docClient.put(params, function(err, data) {
    if (err) {
      logger.error(params, "putItem failed in dynamodb");
      callback(err, null);
    } else {
      callback(null, data);
    }
  });
}
now... I created a queue.
var queue = require('./dynamoDbQueue').queue;
that implements a simple queue with fixed size that I took from http://www.bennadel.com/blog/2308-creating-a-fixed-length-queue-in-javascript-using-arrays.htm.
The idea is that if there is a network problem, let's say for a minute, I want all the events to be pushed to the queue, and when the problem is resolved, to send the queued information to DynamoDB and free the queue.
So I modified my original function to the following code:
function putItem(tableName, itemData, callback) {
  var params = {
    TableName: tableName,
    Item: itemData
  };
  if (queue.length > 0) {
    queue.push(params);
    callback(null, null);
  } else {
    docClient.put(params, function (err, data) {
      if (err) {
        queue.push(params);
        logger.error(params, "putItem failed in dynamodb");
        handleErroredQueue(); // imaginary function that i need to implement
        callback(err, null);
      } else {
        callback(null, data);
      }
    });
  }
}
But since I have 10 insert functions that run in the same second, there is a chance of race conditions, which means that...
execute1 - one function validated that the queue is empty and is about to execute the docClient.put() function.
execute2 - at the same time, another function returned from docClient.put() with an error and as a result added the first row to the queue.
execute1 - by the time the first function calls docClient.put(), the problem has been resolved and it successfully inserts its data into DynamoDB, which leaves the queue holding earlier data that will only be released in the next iteration.
So, for example, if I inserted 4 rows with IDs 1, 2, 3, 4, the order of rows inserted into DynamoDB could be 1, 2, 4, 3.
Is there a way to resolve that?
Thanks!
I think you are on the right track, but instead of checking for an error and then adding to the queue, what I would suggest is adding every operation to the queue first and then always reading the data from the queue.
For instance, in your case you call functions 1, 2, 3, 4 and it results in 1, 2, 4, 3 because you only use the queue at the time of an error/abrupt operation.
Step 1: All your functions make an entry in the queue -> 1, 2, 3, 4
Step 2: Read your queue and make an insert; on success remove the element, else redo the operation. This way it will insert in the desired sequence.
Another advantage is that because you are using a queue you don't have to provision very high throughput for the table.
Edit:
I guess you just need to ensure that you perform the next operation only after the first one has completed, not before.
e.g.: fn 1 -> read from queue (don't delete from the queue yet) -> if the operation completed, delete from the queue, if not perform it again -> then perform the next operation.
You just have to make sure you read from the queue and wait until you get a response from DynamoDB. A minimal sketch follows.
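A minimal sketch of that pattern, assuming the aws-sdk v2 DocumentClient, a simple in-memory array as the queue, and a Node.js version with async/await support:
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const queue = [];     // every putItem call lands here first
let draining = false; // ensures only one drain loop works the queue at a time

function putItem(tableName, itemData, callback) {
  queue.push({ params: { TableName: tableName, Item: itemData }, callback: callback });
  drainQueue();
}

async function drainQueue() {
  if (draining) return; // a drain loop is already running
  draining = true;
  while (queue.length > 0) {
    const head = queue[0]; // read but do not remove yet
    try {
      const data = await docClient.put(head.params).promise();
      queue.shift();       // remove only after DynamoDB confirmed the write
      head.callback(null, data);
    } catch (err) {
      // leave the item at the head of the queue and retry after a short pause
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
  draining = false;
}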
Hope this helps.
