AWS SQS messages do not become available again after visibility timeout - node.js

This is most likely something really simple, but for some reason my SQS messages do not become available again after the visibility timeout. At least this is what I figured, since the consuming lambda has no log entries indicating that any retries have been triggered.
My use case is that another lambda is feeding an SQS queue with JSON entities that then need to be sent forward. Sometimes there's so much data to be sent that the receiving end responds with HTTP 429.
My sending lambda (JSON body over HTTPS) deletes the messages from the queue only when the service responds with HTTP 200; otherwise I do nothing with the receiptHandle which I think should then keep the message in the queue.
Anyway, when a request is rejected by the service, the message does not become available again, so it is never retried and is lost forever.
The incoming SQS queue has been set up as follows:
Visibility timeout: 3min
Delivery delay: 0s
Receive message wait time: 1s
Message retention period: 1d
Maximum message size: 256 KB
Associated DLQ has Maximum receives of 100
The consuming lambda is configured as
Memory: 128 MB
Timeout: 10s
Triggers: The source SQS queue, Batch size: 10, Batch window: None
And the actual logic I have in my lambda is quite simple, really. The event it receives contains the Records from the queue. The lambda might get more than one record at a time, but all records are handled separately.
console.log('Response', response);
if (response.status === 'CREATED') {
/* some code here */
const deleteParams = {
QueueUrl: queueUrl, /* required */
ReceiptHandle: receiptHandle /* required */
};
console.log('Removing record from ', queueUrl, 'with params', deleteParams);
await sqs.deleteMessage(deleteParams).promise();
} else {
/* any record that ends up here, are never seen again :( */
console.log('Keeping record in', eventSourceARN);
}
What do :( ?!?!11

otherwise I do nothing with the receiptHandle which I think should then keep the message in the queue
That's not how it works:
Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.

When an AWS Lambda function is triggered from an Amazon SQS queue, all activities related to SQS are handled by the Lambda service. Your code should not call any Amazon SQS functions.
The messages will be provided to the AWS Lambda function via the event parameter. When the function successfully exits, the Lambda service will delete the messages from the queue.
Your code should not be calling DeleteMessage().
If you wish to signal that some of the messages were not successfully processed, you can use a partial batch response to indicate which messages were successfully processed. The AWS Lambda service will then make the unsuccessful messages available on the queue again.

Thanks to everyone who answered. So I got this "problem" solved just by going through the documents.
I'll provide a more detailed answer to my own question here in case someone besides me didn't get it on the first go :)
The function should return an object whose batchItemFailures list contains the message IDs of the failures.
So, for example, one can have the Lambda handler as
/**
* Handler
*
* @param {*} event SQS event
* @returns {Object} batch item failures
*/
exports.handler = async (event) => {
console.log('Starting ' + process.env.AWS_LAMBDA_FUNCTION_NAME);
console.log('Received event', event);
event = typeof event === 'object'
? event
: JSON.parse(event);
const batchItemFailures = await execute(event.Records);
if (batchItemFailures.length > 0) {
console.log('Failures', batchItemFailures);
} else {
console.log('No failures');
}
return {
batchItemFailures: batchItemFailures
}
}
and execute function, that handles the messages
/**
* Execute
*
* @param {Array} records SQS records
* @returns {Promise<*[]>} batch item failures
*/
async function execute (records) {
let batchItemFailures = [];
for (let index = 0; index < records.length; index++) {
const record = records[index];
// ...some async stuff here
if (someSuccessCondition) {
console.log('Life is good');
} else {
batchItemFailures.push({
itemIdentifier: record.messageId
});
}
}
return batchItemFailures;
}

Related

How do I fail a specific SQS message in a batch from a Lambda?

I have a Lambda with an SQS trigger. When it gets hit, a batch of records from SQS comes in (usually about 10 at a time, I think). If I return a failed status code from the handler, all 10 messages will be retried. If I return a success code, they'll all be removed from the queue. What if 1 out of those 10 messages failed and I want to retry just that one?
exports.handler = async (event) => {
for(const e of event.Records){
try {
let body = JSON.parse(e.body);
// do things
}
catch(e){
// one message failed, i want it to be retried
}
}
// returning this causes ALL messages in
// this batch to be removed from the queue
return {
statusCode: 200,
body: 'Finished.'
};
};
Do I have to manually re-add that one message back to the queue? Or can I return a status from my handler that indicates that one message failed and should be retried?
As per AWS documentation, SQS event source mapping now supports handling of partial failures out of the box. Gist of the linked article is as follows:
Include ReportBatchItemFailures in your EventSourceMapping configuration
The response syntax in case of failures has to be modified to have:
{
"batchItemFailures": [
{ "itemIdentifier": "id2" },
{ "itemIdentifier": "id4" }
]
}
Where id2 and id4 are the failed messageIds in a batch.
Quoting the documentation as is:
Lambda treats a batch as a complete success if your function returns
any of the following
An empty batchItemFailure list
A null batchItemFailure list
An empty EventResponse
A null EventResponse
Lambda treats a batch as a complete failure if your function returns
any of the following:
An invalid JSON response
An empty string itemIdentifier
A null itemIdentifier
An itemIdentifier with a bad key name
An itemIdentifier value with a message ID that doesn't exist
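Applied to the handler from the question, the response shape above could be produced roughly as follows. This is only a sketch: processRecord and collectFailures are hypothetical names, the "fail" flag in the body exists purely to simulate an error, and it assumes ReportBatchItemFailures is already enabled on the event source mapping.

```javascript
// Hypothetical per-message worker: throws when a message cannot be processed.
async function processRecord(record) {
  const body = JSON.parse(record.body); // throws on malformed JSON
  if (body.fail) {
    throw new Error('simulated downstream failure');
  }
}

// Collects the messageId of every record whose processing threw.
async function collectFailures(records, worker) {
  const batchItemFailures = [];
  for (const record of records) {
    try {
      await worker(record);
    } catch (err) {
      // Only this message is reported as failed; the rest count as done.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return batchItemFailures;
}

// Lambda entry point; export it the same way as in the question,
// e.g. exports.handler = handler.
async function handler(event) {
  return {
    batchItemFailures: await collectFailures(event.Records, processRecord),
  };
}
```

An empty batchItemFailures list is returned when nothing threw, which matches the "complete success" case quoted above.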
SAM support is not yet available for the feature as per the documentation. But one of the AWS labs examples points to its usage in SAM, and it worked for me when tested.
Yes you have to manually re-add the failed messages back to the queue.
What I suggest doing is setting up a fail count, so that if all messages failed you can simply return a failed status for all messages, otherwise if the fail count is < 10 then you can individually send back the failed messages to the queue.
You have to programmatically delete each message after processing it successfully.
So you can have a flag that is set to true if any one of the messages failed and, depending on it, raise an error after processing all the messages in the batch, so that successful messages are deleted and the other messages are reprocessed based on the retry policies.
With the logic below, only failed and unprocessed messages will get retried.
import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    # The SQS event delivers the batch under the "Records" key
    for message in event['Records']:
        queue_url = "form queue url recommended to set it as env variable"
        message_body = message["body"]
        print("do some processing :)")
        # Delete the message only after it was processed successfully
        message_receipt_handle = message["receiptHandle"]
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message_receipt_handle
        )
There is also another way: save the IDs and receipt handles of the successfully processed messages in a variable and perform a single batch delete operation with them
response = sqs.delete_message_batch(
QueueUrl='string',
Entries=[
{
'Id': 'string',
'ReceiptHandle': 'string'
},
]
)
You need to design your app in a different way. Here are a few ideas; they are not the best, but they will solve your problem.
Solution 1:
Create an SQS delivery queue - sq1
Create delay queues as per the delay requirement - sq2
Create a dead-letter queue - sdl
Now, inside the Lambda function, if a message from sq1 fails, delete it from sq1 and drop it on sq2 for retry. Any Lambda function invoked asynchronously is retried twice before the event is discarded.
If it fails again after the given retries, move it into the dead-letter queue sdl.
AWS Lambda - processing messages in Batches
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
Note: When an SQS event source mapping is initially created and enabled, or when messages first appear after a period with no traffic, the Lambda service begins polling the SQS queue using five parallel long-polling connections. As per the AWS documentation, the default duration for a long poll from AWS Lambda to SQS is 20 seconds.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-sqs
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
Solution 2:
Use AWS Step Functions
https://aws.amazon.com/step-functions/
Step Functions will call the Lambda and handle the retry logic on failure, with configurable exponential back-off if needed.
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
https://cloudacademy.com/blog/aws-step-functions-a-serverless-orchestrator/
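To make the retry idea in Solution 2 a bit more concrete, here is a rough sketch of a Task state with a Retry block, written as a JavaScript object, plus a small helper that lists the resulting waits. The ARN is a placeholder, the numbers are arbitrary, and sendTaskState and retryDelays are names made up for this illustration.

```javascript
// Amazon States Language fragment, written as a JS object for illustration.
// The function ARN is a placeholder and the numbers are arbitrary.
const sendTaskState = {
  Type: 'Task',
  Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME',
  Retry: [
    {
      ErrorEquals: ['States.TaskFailed'],
      IntervalSeconds: 2, // wait before the first retry
      MaxAttempts: 4,     // number of retries
      BackoffRate: 2,     // multiplier applied to the wait on each retry
    },
  ],
  End: true,
};

// Wait before each retry: IntervalSeconds * BackoffRate^(retry index).
function retryDelays(retrier) {
  const delays = [];
  for (let attempt = 0; attempt < retrier.MaxAttempts; attempt++) {
    delays.push(retrier.IntervalSeconds * Math.pow(retrier.BackoffRate, attempt));
  }
  return delays;
}

console.log(retryDelays(sendTaskState.Retry[0])); // [ 2, 4, 8, 16 ]
```

With these values a throttled call would be retried after 2, 4, 8 and 16 seconds before the state fails.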
Solution 3:
CloudWatch scheduled event to trigger a Lambda function that polls for FAILED.
Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events invokes your Lambda function asynchronously.
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
https://engineering.opsgenie.com/aws-lambda-performance-series-part-2-an-analysis-on-async-lambda-fail-retry-behaviour-and-dead-b84620af406
https://dzone.com/articles/asynchronous-retries-with-aws-sqs
https://medium.com/@ron_73212/how-to-handle-aws-lambda-errors-like-a-pro-e5455b013d10
AWS supports partial batch responses. Here is an example in TypeScript:
type Result = {
itemIdentifier: string
status: 'failed' | 'success'
}
const isFulfilled = <T>(
result: PromiseFulfilledResult<T> | PromiseRejectedResult
): result is PromiseFulfilledResult<T> => result.status === 'fulfilled'
const isFailed = (
result: PromiseFulfilledResult<Result>
): result is PromiseFulfilledResult<
Omit<Result, 'status'> & { status: 'failed' }
> => result.value.status === 'failed'
const results = await Promise.allSettled(
sqsEvent.Records.map(async (record) => {
try {
return { status: 'success', itemIdentifier: record.messageId }
} catch(e) {
console.error(e);
return { status: 'failed', itemIdentifier: record.messageId }
}
})
)
return results
.filter(isFulfilled)
.filter(isFailed)
.map((result) => ({
itemIdentifier: result.value.itemIdentifier,
}))

Google Cloud PubSub not ack messages

We have a system of publishers and subscribers based on GCP Pub/Sub. A subscriber takes quite a long time to process a single message, about 1 minute. We already set the subscribers' ack deadline to 600 seconds (10 minutes), the maximum, to make sure that Pub/Sub will not start redelivery too early, as we basically have a long-running operation here.
I'm seeing the following behavior from Pub/Sub: the code sends the ack, and the monitor confirms that the acknowledgement request has been accepted and that the acknowledgement itself completed with a success status, yet the total number of unacked messages stays the same.
Metrics on the charts show the same for the sum, count and mean aggregation aligners. On the picture above the aligner is mean and no reducers are enabled.
I'm using the @google-cloud/pubsub Node.js library. Different versions have been tried (0.18.1, 0.22.2, 0.24.1), but I guess the issue is not in them.
The following class can be used to check.
TypeScript 3.1.1, Node 8.x.x - 10.x.x
import { exponential, Backoff } from "backoff";
const pubsub = require("@google-cloud/pubsub");
export interface IMessageHandler {
handle (message): Promise<void>;
}
export class PubSubSyncListener {
private readonly client;
private listener: Backoff;
private runningOperations: Promise<unknown>[] = [];
constructor (
private readonly handler: IMessageHandler,
private readonly options: {
/**
* Maximal number of messages to be processed simultaneously.
* Listener will try to keep processing number as close to provided value
* as possible.
*/
maxMessages: number;
/**
* Formatted full subscription name /projects/{projectName}/subscriptions/{subscriptionName}
*/
subscriptionName: string;
/**
* In milliseconds
*/
minimalListenTimeout?: number;
/**
* In milliseconds
*/
maximalListenTimeout?: number;
}
) {
this.client = new pubsub.v1.SubscriberClient();
this.options = Object.assign({
minimalListenTimeout: 300,
maximalListenTimeout: 30000
}, this.options);
}
public async listen () {
this.listener = exponential({
maxDelay: this.options.maximalListenTimeout,
initialDelay: this.options.minimalListenTimeout
});
this.listener.on("ready", async () => {
if (this.runningOperations.length < this.options.maxMessages) {
const [response] = await this.client.pull({
subscription: this.options.subscriptionName,
maxMessages: this.options.maxMessages - this.runningOperations.length
});
for (const m of response.receivedMessages) {
this.startMessageProcessing(m);
}
this.listener.reset();
this.listener.backoff();
} else {
this.listener.backoff();
}
});
this.listener.backoff();
}
private startMessageProcessing (message) {
const index = this.runningOperations.length;
const removeFromRunning = () => {
this.runningOperations.splice(index, 1);
};
this.runningOperations.push(
this.handler.handle(this.getHandlerMessage(message))
.then(removeFromRunning, removeFromRunning)
);
}
private getHandlerMessage (message) {
message.message.ack = async () => {
const ackRequest = {
subscription: this.options.subscriptionName,
ackIds: [message.ackId]
};
await this.client.acknowledge(ackRequest);
};
return message.message;
}
public async stop () {
this.listener.reset();
this.listener = null;
await Promise.all(
this.runningOperations
);
}
}
This is basically a partial implementation of async pulling of the messages with immediate acknowledgment, because one of the proposed solutions was to use synchronous pulling.
I found a similar reported issue in the Java repository, if I'm not mistaken about the symptoms of the issue.
https://github.com/googleapis/google-cloud-java/issues/3567
The last detail here is that acknowledgment seems to work with a low number of requests. If I fire a single message into Pub/Sub and then immediately process it, the number of undelivered messages decreases (drops to 0, as only one message was there before).
The question itself: what is happening, and why is the number of unacked messages not decreasing as it should when the ack has been received?
To quote from the documentation, the subscription/num_undelivered_messages metric that you're using is the "Number of unacknowledged messages (a.k.a. backlog messages) in a subscription. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds."
You should not expect this metric to decrease immediately upon acking a message. In addition, it sounds as if you are trying to use pubsub for an exactly once delivery case, attempting to ensure the message will not be delivered again. Cloud Pub/Sub does not provide these semantics. It provides at least once semantics. In other words, even if you have received a value, acked it, received the ack response, and seen the metric drop from 1 to 0, it is still possible and correct for the same worker or another to receive an exact duplicate of that message. Although in practice this is unlikely, you should focus on building a system that is duplicate tolerant instead of trying to ensure your ack succeeded so your message won't be redelivered.
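A duplicate-tolerant subscriber could look roughly like the sketch below. It assumes each delivered message carries a stable messageId property, and it keeps the seen IDs in an in-memory Set purely for illustration; makeIdempotent is a hypothetical helper, and a real deployment would need shared, durable storage to deduplicate across workers and restarts.

```javascript
// Wraps a handler so that each message ID is processed at most once
// within this process. Returns true if the work ran, false if skipped.
function makeIdempotent(handle) {
  const seen = new Set(); // assumption: in-memory only, lost on restart

  return async function handleOnce(message) {
    const id = message.messageId;
    if (seen.has(id)) {
      return false; // duplicate delivery: skip the work
    }
    seen.add(id); // claim the ID first so concurrent duplicates are skipped
    try {
      await handle(message);
    } catch (err) {
      seen.delete(id); // the work failed, so allow a later redelivery to retry
      throw err;
    }
    return true;
  };
}
```

The caller would still ack a skipped duplicate, so that it is not delivered yet again.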

can I limit consumption of kafka-node consumer?

It seems like my kafka node consumer:
var kafka = require('kafka-node');
var consumer = new kafka.Consumer(client, [], {
...
});
is fetching far more messages than I can handle in certain cases.
Is there a way to limit it (for example, accept no more than 1000 messages per second, possibly using the pause API)?
I'm using kafka-node, which seems to have a limited API compared to the Java version
In Kafka, poll and process should happen in a coordinated/synchronized way. I.e., after each poll, you should process all received data first, before you do the next poll. This pattern will automatically throttle the number of messages to the max throughput your client can handle.
Something like this (pseudo-code):
while(isRunning) {
messages = poll(...)
for(m : messages) {
process(m);
}
}
(That is the reason why there is no parameter "fetch.max.messages": you just do not need it.)
I had a similar situation where I was consuming messages from Kafka and had to throttle the consumption because my consumer service was dependent on a third party API which had its own constraints.
I used async/queue along with a wrapper of async/cargo called asyncTimedCargo for batching purpose.
The cargo gets all the messages from the kafka-consumer and sends it to queue upon reaching a size limit batch_config.batch_size or timeout batch_config.batch_timeout.
async/queue provides saturated and unsaturated callbacks which you can use to stop the consumption if your queue task workers are busy. This would stop the cargo from filling up and your app would not run out of memory. The consumption would resume upon unsaturation.
//cargo-service.js
module.exports = function(key){
return new asyncTimedCargo(function(tasks, callback) {
var length = tasks.length;
var postBody = [];
for(var i=0;i<length;i++){
var message ={};
var task = JSON.parse(tasks[i].value);
message = task;
postBody.push(message);
}
var postJson = {
"json": {"request":postBody}
};
sms_queue.push(postJson);
callback();
}, batch_config.batch_size, batch_config.batch_timeout)
};
//kafka-consumer.js
var cargo = require('./cargo-service')();
consumer.on('message', function (message) {
if(message && message.value && utils.isValidJsonString(message.value)) {
var msgObject = JSON.parse(message.value);
cargo.push(message);
}
else {
logger.error('Invalid JSON Message');
}
});
// sms-queue.js
var sms_queue = queue(
retryable({
times: queue_config.num_retries,
errorFilter: function (err) {
logger.info("inside retry");
console.log(err);
if (err) {
return true;
}
else {
return false;
}
}
}, function (task, callback) {
// your worker task for queue
callback()
}), queue_config.queue_worker_threads);
sms_queue.saturated = function() {
consumer.pause();
logger.warn('Queue saturated Consumption paused: ' + sms_queue.running());
};
sms_queue.unsaturated = function() {
consumer.resume();
logger.info('Queue unsaturated Consumption resumed: ' + sms_queue.running());
};
From FAQ in the README
Create a async.queue with message processor and concurrency of one (the message processor itself is wrapped with setImmediate function so it will not freeze up the event loop)
Set the queue.drain to resume() the consumer
The handler for consumer's message event to pause() the consumer and pushes the message to the queue.
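The same pause/resume idea can be sketched without the async library, using a plain promise chain in place of async.queue. consumeSequentially and processMessage are hypothetical names; the only things assumed about the consumer are on('message'), pause() and resume().

```javascript
// Processes messages strictly one at a time and keeps the consumer paused
// while any work is still queued.
function consumeSequentially(consumer, processMessage) {
  let pending = 0;
  let chain = Promise.resolve();

  consumer.on('message', (message) => {
    pending += 1;
    consumer.pause(); // stop fetching while work is queued

    // Messages that were already fetched may still be emitted after pause(),
    // so each one is appended to the chain rather than dropped.
    chain = chain
      .then(() => processMessage(message))
      .catch((err) => console.error('processing failed', err))
      .then(() => {
        pending -= 1;
        if (pending === 0) {
          consumer.resume(); // queue drained: fetch again
        }
      });
  });
}
```

Because resume() is only called once the chain is empty, the fetch rate follows the processing rate instead of the other way round.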
As far as I know, the API does not have any kind of throttling. But both consumers (Consumer and HighLevelConsumer) have a pause() function, so you could stop consuming if you get too many messages. Maybe that already offers what you need.
Please keep in mind what's happening. You send a fetch request to the broker and get a batch of message back. You can configure the min and max size of the messages (according to the documentation not the number of messages) you want to fetch:
{
....
// This is the minimum number of bytes of messages that must be available to give a response, default 1 byte
fetchMinBytes: 1,
// The maximum bytes to include in the message set for this partition. This helps bound the size of the response.
fetchMaxBytes: 1024 * 1024,
}
I was facing the same issue. Initially the fetchMaxBytes value was
fetchMaxBytes: 1024 * 1024 * 10 // 10MB
I just changed it to
fetchMaxBytes: 1024
It worked very smoothly after the change.

Amazon SQS with aws-sdk receiveMessage Stall

I'm using the aws-sdk node module with the (as far as I can tell) approved way to poll for messages.
Which basically sums up to:
sqs.receiveMessage({
QueueUrl: queueUrl,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20
}, function(err, data) {
if (err) {
logger.fatal('Error on Message Recieve');
logger.fatal(err);
} else {
// all good
if (undefined === data.Messages) {
logger.info('No Messages Object');
} else if (data.Messages.length > 0) {
logger.info('Messages Count: ' + data.Messages.length);
var delete_batch = new Array();
for (var x=0;x<data.Messages.length;x++) {
// process
receiveMessage(data.Messages[x]);
// flag to delete
var pck = new Array();
pck['Id'] = data.Messages[x].MessageId;
pck['ReceiptHandle'] = data.Messages[x].ReceiptHandle;
delete_batch.push(pck);
}
if (delete_batch.length > 0) {
logger.info('Calling Delete');
sqs.deleteMessageBatch({
Entries: delete_batch,
QueueUrl: queueUrl
}, function(err, data) {
if (err) {
logger.fatal('Failed to delete messages');
logger.fatal(err);
} else {
logger.debug('Deleted recieved ok');
}
});
}
} else {
logger.info('No Messages Count');
}
}
});
receiveMessage is my "do stuff with collected messages if I have enough collected messages" function
Occasionally my script stalls because I don't get a response from Amazon at all. For example, when there are no messages in the queue to consume, instead of hitting the WaitTimeSeconds and sending a "no messages object", the callback isn't called.
(I'm writing this up to Amazon Weirdness)
What I'm asking is what's the best way to detect and deal with this, as I have some code in place to stop concurrent calls to receiveMessage.
The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted it's only fetching one message a time)
I do have the whole thing wrapped in
var running = false;
runMonitorJob = setInterval(function() {
if (running) {
} else {
running = true;
// call SQS.receive
}
}, 500);
(With a running = false after the delete loop (not in its callback))
My solution would be
watchdogTimeout = setTimeout(function() {
running = false;
}, 30000);
But surely this would leave a pile of floating sqs.receive's lurking about and thus much memory over time?
(This job runs all the time, and I left it running on Friday, it stalled Saturday morning and hung till I manually restarted the job this morning)
Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages BUT with a wait time of 20 seconds it should throw a "no messages" after 20 seconds. So a WatchDog of ~10 minutes might be more practical (depending on the rest of ones business logic)
Edit: Yes Long Polling is already configured Queue Side.
Edit: This is under (latest) v2.3.9 of aws-sdk and NodeJS v4.4.4
I've been chasing this (or a similar) issue for a few days now and here's what I've noticed:
The receiveMessage call does eventually return although only after 120 seconds
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error - in fact after the 120 seconds have passed, it may contain messages.
What can be done about this? This sort of thing can happen for a number of reasons and some/many of these things can't necessarily be fixed. The answer is to run multiple services each calling receiveMessage and processing the messages as they come - SQS supports this. At any time, one of these services may hit this 120 second lag but the other services should be able to continue on as normal.
My particular problem is that I have some critical singleton services that can't afford 120 seconds of down time. For this I will look into either 1) use HTTP instead of SQS to push messages into my service or 2) spawn slave processes around each of the singletons to fetch the messages from SQS and push them into the service.
I also ran into this issue, but not when calling receiveMessage but sendMessage. I also saw hangups of exactly 120 seconds. I also saw it with a few other services, like Firehose.
That led me to this line in the AWS SDK:
SQS Constructor
httpOptions:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
To implement a fix, I overrode the timeout for the SQS client that performs the sendMessage so that it times out after 10 seconds, and created another client with a 25 second timeout for receiving (where I long poll for 20 seconds):
var sendClient = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});
I've had this out in production for a week now and I've noticed that all of my SQS stalling issues have been eliminated.
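With a socket timeout in place, the setInterval plus running flag from the question could be replaced by a loop that waits for each receive to settle before issuing the next. The following is only a sketch: pollLoop and its parameters are hypothetical, and the receive function is injected so it could wrap something like receiveClient.receiveMessage(params).promise() from aws-sdk v2.

```javascript
// Calls `receive` repeatedly, one request at a time. A rejected receive
// (for example a socket timeout) is logged and followed by a short pause
// instead of stalling the loop.
async function pollLoop(receive, handle, shouldStop, retryDelayMs = 1000) {
  while (!shouldStop()) {
    try {
      const data = await receive();
      // data.Messages is absent when the long poll returned nothing
      for (const message of data.Messages || []) {
        await handle(message);
      }
    } catch (err) {
      console.error('receive failed, retrying:', err.message);
      await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
    }
  }
}
```

Since the next request only starts after the previous one has resolved or rejected, no separate watchdog timer is needed to reset a flag.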

Azure nodejs sdk: long polling for queue message listener

Is it possible to create a message listener to receive messages from a service bus queue (not storage queue) only when messages are available?
Currently my implementation consists of a setInterval function calling the receive operation:
var service = azure.createServiceBusService( azureEnpoint );
var repeat = function() {
service.receiveQueueMessage(me.name, function (error, receivedMessage) {
if (!error) {
logger.debug(receivedMessage, "Received message from queue "+ me.name);
callback(error, receivedMessage);
}
});
}
setInterval(repeat, me.pollingInterval);
Thanks
Sort of: you use long polling, whereby you check for a message and wait a long time for a response. The upside is that you only get charged for a single request. Replace seconds with how long you want to wait for a response. The maximum is 24 days.
service.receiveQueueMessage(me.name, { timeoutIntervalInS: seconds },
function (error, receivedMessage) {
// handle the received message (or the error) here
});
There's a full example here: https://msdn.microsoft.com/en-us/magazine/dn802604.aspx
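To turn a single long poll into a listener, the next poll can be started from the callback of the previous one rather than from setInterval. A rough sketch, assuming the same receiveQueueMessage(queueName, options, callback) call as in the question; listen, onMessage and shouldStop are hypothetical names.

```javascript
// Re-arms a long poll as soon as the previous one has returned.
function listen(service, queueName, timeoutSeconds, onMessage, shouldStop) {
  function next() {
    if (shouldStop()) {
      return;
    }
    service.receiveQueueMessage(
      queueName,
      { timeoutIntervalInS: timeoutSeconds },
      (error, receivedMessage) => {
        if (!error) {
          onMessage(receivedMessage);
        }
        // An error may simply mean that no message arrived within the
        // timeout, so the loop polls again either way.
        setImmediate(next);
      }
    );
  }
  next();
}
```

Only one request is outstanding at a time, and the callback fires as soon as a message becomes available.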
