Why are jobs not waiting in the waiting queue in Bull? - node.js

I am using Bull as a job queue and have set this rate limit on the queue:
const opts = {
limiter: {
max: 1,
duration: 100,
bounceBack: false
}
};
let queue = new Queue('FetchQueue', opts);
My understanding of bounceBack: false is that it puts a job into the delayed or waiting queue if it can't be processed immediately. But I am getting this error intermittently.
StatusCodeError: 403 - {"error":{"code":403,"message":"User Rate Limit Exceeded","errors":[{"message":"User Rate Limit Exceeded","domain":"usageLimits","reason":"userRateLimitExceeded"}]}}
What is the correct setting for bounceBack to make jobs wait in the job queue?
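For reference, and as an assumption about the cause rather than a confirmed fix: Bull's limiter counts max jobs per duration milliseconds, so the settings above still allow roughly one job every 100 ms (about ten per second), which may exceed the remote API's per-user quota regardless of bounceBack. bounceBack only chooses where rate-limited jobs wait: false (the default) parks them in the delayed set, true keeps them in the waiting list. A minimal sketch with a slower, purely illustrative rate:
const Queue = require('bull');

// Process at most one job per second; rate-limited jobs sit in the
// delayed set (bounceBack: false) until the limiter lets them through.
const queue = new Queue('FetchQueue', {
  limiter: {
    max: 1,         // jobs per window
    duration: 1000, // window length in milliseconds
    bounceBack: false
  }
});

queue.process(async (job) => {
  // call the rate-limited API here
});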

Related

Camel error handling fixed redelivery delay for azure storage queue not working correctly

I have an Azure App Service that reads from an Azure Storage queue through a Camel route (version 3.14.0). Below is my code:
queue code:
QueueServiceClient client = new QueueServiceClientBuilder()
.connectionString(storageAccountConnectionString)
.buildClient();
getContext().getRegistry().bind("client", client);
errorHandler(deadLetterChannel(SEND_TO_POISON_QUEUE)
.useOriginalBody()
.log("Message sent to poison queue for handling")
.retryWhile(method(new RetryRuleset(), "shouldRetry"))
.maximumRedeliveries(24)
.asyncDelayedRedelivery()
.redeliveryDelay(3600 * 1000L) // initial delay
);
// Route to retrieve a message from storage queue.
from("azure-storage-queue:" + storageAccountName + "/" + QUEUE_NAME + "?serviceClient=#client&maxMessages=1&visibilityTimeout=P2D")
.id(QUEUE_ROUTE_CONSUMER)
.log("Message received from queue with messageId: ${headers.CamelAzureStorageQueueMessageId} and ${headers.CamelAzureStorageQueueInsertionTime} in UTC")
.bean(cliFacilityService, "processMessage(${body}, ${headers.CamelAzureStorageQueueInsertionTime})")
.end();
RetryRuleset code:
public boolean shouldRetry(@Header(QueueConstants.MESSAGE_ID) String messageId,
@Header(Exchange.REDELIVERY_COUNTER) Integer counter,
@Header(QueueConstants.INSERTION_TIME) OffsetDateTime insertionTime) {
OffsetDateTime futureRetryOffsetDateTime = OffsetDateTime.now(Clock.systemUTC()).plusHours(1); //because redelivery delay is 1hr
OffsetDateTime insertionTimePlus24hrs = insertionTime.plusHours(24);
if (futureRetryOffsetDateTime.isAfter(insertionTimePlus24hrs)) {
log.info("Facility queue message: {} done retrying because next time to retry {}. Redelivery count: {}, enqueue time: {}",
messageId, futureRetryOffsetDateTime, counter, insertionTime);
return false;
}
return true;
}
The redeliveryDelay is 1 hour and maximumRedeliveries is 24, because I want to retry roughly once an hour for about 24 hours. It doesn't necessarily need to be 24 attempts, just as many as fit within 24 hours; once 24 hours have passed, the message should be sent to the poison queue (this check is in the retry ruleset).
The problem is that the app service retries normally, about once an hour, for the first (let's say) 2-5 attempts, but after that the next retry happens 2 days later. By then the message has expired, so the ruleset stops retrying and the message is sent to the poison queue. Sometimes the app service does the first read from the queue and the next retry is 2 days later, so it is very unstable. In total it retries at most 1-10 times, and the last retry is always 2 days later, at the same time of day as the first retry.
Is there anything I am doing wrong?
Thank you for your help!

AWS SQS messages do not become available again after visibility timeout

This is most likely something really simple, but for some reason my SQS messages do not become available again after the visibility timeout. At least that is what I figured, since the consuming Lambda has no log entries indicating that any retries have been triggered.
My use case is that another Lambda is feeding the SQS queue with JSON entities that then need to be sent forward. Sometimes there's so much data to send that the receiving end responds with HTTP 429.
My sending Lambda (JSON body over HTTPS) deletes messages from the queue only when the service responds with HTTP 200; otherwise I do nothing with the receiptHandle, which I thought should keep the message in the queue.
Anyway, when a request is rejected by the service, the message never becomes available again, so it is never retried and is lost forever.
Incoming SQS has been setup as follows:
Visibility timeout: 3min
Delivery delay: 0s
Receive message wait time: 1s
Message retention period: 1d
Maximum message size: 256 KB
Associated DLQ has Maximum receives of 100
The consuming lambda is configured as
Memory: 128 MB
Timeout: 10s
Triggers: The source SQS queue, Batch size: 10, Batch window: None
And the actual logic I have in my Lambda is quite simple, really. The event it receives contains the Records from the queue. The Lambda might get more than one record at a time, but all records are handled separately.
console.log('Response', response);
if (response.status === 'CREATED') {
/* some code here */
const deleteParams = {
QueueUrl: queueUrl, /* required */
ReceiptHandle: receiptHandle /* required */
};
console.log('Removing record from ', queueUrl, 'with params', deleteParams);
await sqs.deleteMessage(deleteParams).promise();
} else {
/* any record that ends up here, are never seen again :( */
console.log('Keeping record in', eventSourceARN);
}
What do :( ?!?!11
otherwise I do nothing with the receiptHandle which I think should then keep the message in the queue
That's not how it works:
Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.
When an AWS Lambda function is triggered from an Amazon SQS queue, all activities related to SQS are handled by the Lambda service. Your code should not call any Amazon SQS functions.
The messages will be provided to the AWS Lambda function via the event parameter. When the function successfully exits, the Lambda service will delete the messages from the queue.
Your code should not be calling DeleteMessage().
If you wish to signal that some of the messages were not successfully processed, you can use a partial batch response to indicate which messages were successfully processed. The AWS Lambda service will then make the unsuccessful messages available on the queue again.
Thanks to everyone who answered. I got this "problem" solved just by going through the documentation.
I'll provide a more detailed answer to my own question here, in case someone besides me doesn't get it on the first go :)
So the function should return a batchItemFailures object containing the message IDs of the failures.
For example, one can have a Lambda handler like this:
/**
* Handler
*
* @param {*} event SQS event
* @returns {Object} batch item failures
*/
exports.handler = async (event) => {
console.log('Starting ' + process.env.AWS_LAMBDA_FUNCTION_NAME);
console.log('Received event', event);
event = typeof event === 'object'
? event
: JSON.parse(event);
const batchItemFailures = await execute(event.Records);
if (batchItemFailures.length > 0) {
console.log('Failures', batchItemFailures);
} else {
console.log('No failures');
}
return {
batchItemFailures: batchItemFailures
}
}
and an execute function that handles the messages:
/**
* Execute
*
* @param {Array} records SQS records
* @returns {Promise<*[]>} batch item failures
*/
async function execute (records) {
let batchItemFailures = [];
for (let index = 0; index < records.length; index++) {
const record = records[index];
// ...some async stuff here
if (someSuccessCondition) {
console.log('Life is good');
} else {
batchItemFailures.push({
itemIdentifier: record.messageId
});
}
}
return batchItemFailures;
}
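One detail worth flagging (an assumption about the setup, since the post doesn't show it): partial batch responses only take effect if the SQS event source mapping is configured with the ReportBatchItemFailures function response type; without it, Lambda ignores the returned batchItemFailures and deletes the whole batch when the function exits successfully. A sketch of enabling it with a reasonably recent aws-sdk (the UUID is a placeholder):
// Hypothetical one-off script; the mapping UUID comes from
// lambda.listEventSourceMappings() or the console.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.updateEventSourceMapping({
  UUID: 'your-event-source-mapping-uuid',
  FunctionResponseTypes: ['ReportBatchItemFailures']
}).promise()
  .then(() => console.log('Partial batch responses enabled'))
  .catch(console.error);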

Google App Engine Node Task Queue Retries before Finished

I am running a slow operation via a Cloud Tasks queue to delete objects from Google Cloud Storage. I have noticed that the task queue retries the task after two minutes have passed, even though the running task has neither finished nor errored.
What is the best strategy to trigger valid retries, but not retry while the task is still running?
Here's my task creator:
router.get('/start-delete-old', async (req, res) => {
const task = {
appEngineHttpRequest: {
httpMethod: 'POST',
relativeUri: `/videos/delete-old`,
},
};
const request = {
parent: taskClient.parent,
task: task,
};
const [response] = await taskClient.queue.createTask(request);
res.send(response);
});
Here's my task handler:
router.post('/delete-old', async (req, res) => {
let cameras = await knex('cameras')
let date = moment().subtract(365, 'days').format('YYYY-MM-DD');
for (let i = 0; i < cameras.length; i++) {
let camera = cameras[i]
let prefix = `${camera.id}/${date}/`
try {
await bucket.deleteFiles({ prefix: prefix, force: true })
await knex.raw(`delete from videos where camera_id = ${camera.id} and cast(start_time as date) = '${date}'`)
}
catch (e) {
console.log('error deleting ' + e)
}
}
res.send({});
});
As per the documentation, the timeout for a task varies depending on the environment you are using:
Standard environment
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
Flexible environment
For worker services running in the flex environment: all types have a 60 minute timeout.
So, if your handler misses the deadline, the queue assumes the task failed and retries it.
Also, the task queue expects to receive a status code between 200 and 299; if not, it will assume the running task has failed. Quoting the documentation:
Upon successful completion of processing, the handler must send an HTTP status code between 200 and 299 back to the queue. Any other value indicates the task has failed and the queue retries the task.
I believe that both the bucket deleteFiles call and the raw knex query are taking a lot of time to process, and this is causing the handler to return a status other than 200-299.
One good way to troubleshoot is with Stackdriver logs: you will be able to gather more information about the ongoing processes and see whether any of them is returning an error.
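If the total work simply cannot finish inside the deadline, one common pattern is to fan the loop out into one task per camera so each handler invocation stays small. A hedged sketch reusing the poster's taskClient and knex objects (the per-camera route and its handler are hypothetical, not code from the question):
// Hypothetical fan-out: enqueue one small task per camera instead of
// deleting everything in a single long-running handler.
router.get('/start-delete-old', async (req, res) => {
  const cameras = await knex('cameras');
  for (const camera of cameras) {
    const task = {
      appEngineHttpRequest: {
        httpMethod: 'POST',
        relativeUri: `/videos/delete-old/${camera.id}`, // needs its own handler
      },
    };
    await taskClient.queue.createTask({ parent: taskClient.parent, task });
  }
  res.send({ enqueued: cameras.length });
});
Each per-camera handler then deletes that camera's files and rows and returns a 200 well within the deadline.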

can I limit consumption of kafka-node consumer?

It seems like my kafka node consumer:
var kafka = require('kafka-node');
var Consumer = kafka.Consumer;
var consumer = new Consumer(client, [], {
...
});
is fetching far more messages than I can handle in certain cases.
Is there a way to limit it (for example, accept no more than 1000 messages per second, possibly using the pause API)?
I'm using kafka-node, which seems to have a limited API compared to the Java version.
In Kafka, poll and process should happen in a coordinated/synchronized way. That is, after each poll you should process all received data before you do the next poll. This pattern automatically throttles the number of messages to the maximum throughput your client can handle.
Something like this (pseudo-code):
while(isRunning) {
messages = poll(...)
for(m : messages) {
process(m);
}
}
(That is the reason why there is no parameter "fetch.max.messages" -- you simply do not need it.)
I had a similar situation where I was consuming messages from Kafka and had to throttle the consumption because my consumer service was dependent on a third party API which had its own constraints.
I used async/queue along with a wrapper around async/cargo called asyncTimedCargo for batching purposes.
The cargo gets all the messages from the kafka consumer and sends them to the queue upon reaching the size limit batch_config.batch_size or the timeout batch_config.batch_timeout.
async/queue provides saturated and unsaturated callbacks which you can use to stop consumption when your queue's task workers are busy. This stops the cargo from filling up, so your app does not run out of memory. Consumption resumes upon unsaturation.
//cargo-service.js
module.exports = function(key){
return new asyncTimedCargo(function(tasks, callback) {
var length = tasks.length;
var postBody = [];
for(var i=0;i<length;i++){
var message ={};
var task = JSON.parse(tasks[i].value);
message = task;
postBody.push(message);
}
var postJson = {
"json": {"request":postBody}
};
sms_queue.push(postJson);
callback();
}, batch_config.batch_size, batch_config.batch_timeout)
};
//kafka-consumer.js
var cargo = require('./cargo-service')();
consumer.on('message', function (message) {
if(message && message.value && utils.isValidJsonString(message.value)) {
var msgObject = JSON.parse(message.value);
cargo.push(message);
}
else {
logger.error('Invalid JSON Message');
}
});
// sms-queue.js
var async = require('async');
var queue = async.queue;
var retryable = async.retryable;
var sms_queue = queue(
retryable({
times: queue_config.num_retries,
errorFilter: function (err) {
logger.info("inside retry");
console.log(err);
if (err) {
return true;
}
else {
return false;
}
}
}, function (task, callback) {
// your worker task for queue
callback()
}), queue_config.queue_worker_threads);
sms_queue.saturated = function() {
consumer.pause();
logger.warn('Queue saturated Consumption paused: ' + sms_queue.running());
};
sms_queue.unsaturated = function() {
consumer.resume();
logger.info('Queue unsaturated Consumption resumed: ' + sms_queue.running());
};
From the FAQ in the README:
Create an async.queue with a message processor and a concurrency of one (the message processor itself is wrapped with setImmediate so it will not freeze up the event loop).
Set queue.drain to resume() the consumer.
In the handler for the consumer's message event, pause() the consumer and push the message to the queue.
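A minimal sketch of that FAQ pattern (handleMessage is a stand-in for whatever processing you do, not part of the kafka-node API; `consumer` is the kafka-node Consumer from above):
var async = require('async');

// Concurrency of one: each message is fully processed before the next,
// and setImmediate keeps long processing from starving the event loop.
var messageQueue = async.queue(function (message, done) {
  setImmediate(function () {
    handleMessage(message, done); // stand-in processing function
  });
}, 1);

// When the backlog is empty, start fetching again (async v2-style drain).
messageQueue.drain = function () {
  consumer.resume();
};

consumer.on('message', function (message) {
  consumer.pause();           // stop fetching while the backlog drains
  messageQueue.push(message);
});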
As far as I know the API does not have any kind of throttling. But both consumers (Consumer and HighLevelConsumer) have a pause() function, so you could stop consuming if you get too many messages. Maybe that already offers what you need.
Please keep in mind what's happening: you send a fetch request to the broker and get a batch of messages back. You can configure the minimum and maximum size of the fetched data (in bytes, not the number of messages, according to the documentation):
{
....
// This is the minimum number of bytes of messages that must be available to give a response, default 1 byte
fetchMinBytes: 1,
// The maximum bytes to include in the message set for this partition. This helps bound the size of the response.
fetchMaxBytes: 1024 * 1024,
}
I was facing the same issue; initially the fetchMaxBytes value was
fetchMaxBytes: 1024 * 1024 * 10 // 10 MB
I just changed it to
fetchMaxBytes: 1024
It worked very smoothly after the change.

Amazon SQS with aws-sdk receiveMessage Stall

I'm using the aws-sdk node module with the (as far as I can tell) approved way to poll for messages.
Which basically sums up to:
sqs.receiveMessage({
QueueUrl: queueUrl,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20
}, function(err, data) {
if (err) {
logger.fatal('Error on Message Recieve');
logger.fatal(err);
} else {
// all good
if (undefined === data.Messages) {
logger.info('No Messages Object');
} else if (data.Messages.length > 0) {
logger.info('Messages Count: ' + data.Messages.length);
var delete_batch = new Array();
for (var x=0;x<data.Messages.length;x++) {
// process
receiveMessage(data.Messages[x]);
// flag to delete
var pck = {
Id: data.Messages[x].MessageId,
ReceiptHandle: data.Messages[x].ReceiptHandle
};
delete_batch.push(pck);
}
if (delete_batch.length > 0) {
logger.info('Calling Delete');
sqs.deleteMessageBatch({
Entries: delete_batch,
QueueUrl: queueUrl
}, function(err, data) {
if (err) {
logger.fatal('Failed to delete messages');
logger.fatal(err);
} else {
logger.debug('Deleted recieved ok');
}
});
}
} else {
logger.info('No Messages Count');
}
}
});
receiveMessage is my "do stuff with collected messages if I have enough collected messages" function
Occasionally, my script stalls because I don't get a response from Amazon at all. For example, when there are no messages in the queue to consume, instead of hitting WaitTimeSeconds and returning a "no messages object", the callback simply isn't called.
(I'm chalking this up to Amazon weirdness.)
What I'm asking is: what's the best way to detect and deal with this, given that I have some code in place to stop concurrent calls to receiveMessage?
The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted, it's only fetching one message at a time).
I do have the whole thing wrapped in
var running = false;
runMonitorJob = setInterval(function() {
if (running) {
} else {
running = true;
// call SQS.receive
}
}, 500);
(With running = false after the delete loop (not in its callback))
My solution would be
watchdogTimeout = setTimeout(function() {
running = false;
}, 30000);
But surely this would leave a pile of floating sqs.receive calls lurking about, and thus eat up memory over time?
(This job runs all the time. I left it running on Friday; it stalled Saturday morning and hung until I manually restarted it this morning.)
Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, BUT with a wait time of 20 seconds it should return "no messages" after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one's business logic).
Edit: Yes, long polling is already configured queue-side.
Edit: This is under (latest) v2.3.9 of aws-sdk and Node.js v4.4.4.
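For what it's worth, a hedged sketch of the watchdog idea that avoids leaving abandoned receives around: in aws-sdk v2, receiveMessage returns an AWS.Request, and calling abort() on it ends the in-flight HTTP request (the callback then fires with a RequestAbortedError), so nothing keeps piling up:
var inFlight = sqs.receiveMessage(params, function (err, data) {
  clearTimeout(watchdog);   // got an answer (or an abort), cancel the timer
  running = false;
  if (err) {
    logger.fatal(err);      // includes RequestAbortedError after an abort
    return;
  }
  // ... existing message handling ...
});

// If nothing has come back after 10 minutes, kill the request ourselves.
var watchdog = setTimeout(function () {
  inFlight.abort();
}, 10 * 60 * 1000);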
I've been chasing this (or a similar) issue for a few days now and here's what I've noticed:
The receiveMessage call does eventually return, although only after 120 seconds.
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error -- in fact, after the 120 seconds have passed, it may contain messages.
What can be done about this? This sort of thing can happen for a number of reasons, and some/many of them can't necessarily be fixed. The answer is to run multiple services, each calling receiveMessage and processing messages as they come -- SQS supports this. At any time one of these services may hit the 120-second lag, but the others should be able to continue on as normal.
My particular problem is that I have some critical singleton services that can't afford 120 seconds of downtime. For this I will look into either 1) using HTTP instead of SQS to push messages into my service, or 2) spawning slave processes around each of the singletons to fetch the messages from SQS and push them into the service.
I also ran into this issue, but not when calling receiveMessage -- when calling sendMessage. I also saw hang-ups of exactly 120 seconds, and I saw it with a few other services, like Firehose.
That led me to this line in the AWS SDK documentation:
SQS Constructor
httpOptions:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
To implement a fix, I override the timeout for the SQS client that performs the sendMessage to time out after 10 seconds, and use another with 25 seconds for receiving (where I long-poll for 20 seconds):
var sendClient = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});
I've had this out in production for a week now and I've noticed that all of my SQS stalling issues have been eliminated.
