I am calling a Lambda function via the aws-sdk with an explicit RequestResponse invocation type:
const Lambda = require("aws-sdk/clients/lambda");
const lambda = new Lambda({region: "region"});
const params = {
    FunctionName: "myFunctionArn",
    InvocationType: "RequestResponse",
    Payload: JSON.stringify({
        //... my payload
    })
};

lambda.invoke(params).promise().then(data => {
    console.log(data);
});
It is a long-running task (>~5 minutes) and my timeout is set to 10 minutes. I download an mp3, compress it, save it to S3 and then return the URL to the client.
There are no errors in CloudWatch and the process goes smoothly: the mp3 is stored to S3 at lower quality. However, the function gets executed twice.
If the mp3 file is small enough (~8 MB) there is only one execution; if it is a big file (~100 MB) it gets executed twice and the function of course times out. I am using the /tmp folder to store the file temporarily, and I remove it after the mp3 has been safely stored in S3.
I have scattered logging throughout my function and there is absolutely no sign of errors, and this happens every single time, not sporadically.
These are my CloudWatch logs:
Thanks
EDIT_1: I have tried adding some options to the Lambda client:
const Lambda = require("aws-sdk/clients/lambda");
const lambda = new Lambda({
    region: "ap-southeast-1",
    httpOptions: { timeout: 10 * 60 * 1000, connectTimeout: 10 * 60 * 1000 }
});
But it looks like nothing has changed.
EDIT_2: It seems like this is now also happening for short-running tasks and is completely random. I am at a loss here; I really don't know what to do.
Setting maxRetries to 0 in the SDK config params will turn off the retries performed by the SDK itself. With a synchronous RequestResponse invocation, the SDK's default socket timeout of 120 seconds (httpOptions.timeout) means a call that runs longer than that can time out on the client side and be retried transparently, which is one way duplicate executions can appear.
const Lambda = require("aws-sdk/clients/lambda");
const lambda = new Lambda({
    maxRetries: 0,
    region: "ap-southeast-1",
    httpOptions: { timeout: 10 * 60 * 1000, connectTimeout: 10 * 60 * 1000 }
});
After spending countless hours debugging, I just discovered that the cause of this weird behavior was the function running out of memory.
Incredibly enough, nothing showed up in CloudWatch.
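For anyone hitting the same wall: the per-invocation REPORT line in the CloudWatch log stream does include a Max Memory Used figure you can compare against the configured memory size, and it is also easy to log memory from inside the handler. A minimal sketch (the handler shape and log labels are mine, not the original function):

// Minimal sketch: log resident memory around the heavy step and compare it
// against the configured limit exposed on the context object.
exports.handler = async (event, context) => {
    const logMem = (label) => {
        const rssMb = Math.round(process.memoryUsage().rss / 1024 / 1024);
        console.log(`${label}: rss=${rssMb}MB of ${context.memoryLimitInMB}MB`);
    };

    logMem("before compression");
    // ...download, compress and upload steps would go here...
    logMem("after compression");
};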
Related
This is most likely something really simple, but for some reason my SQS messages do not become available again after the visibility timeout. At least that is what I figure, since the consuming Lambda has no log entries indicating that any retries have been triggered.
My use case is that another Lambda feeds the SQS queue with JSON entities that then need to be sent forward. Sometimes there is so much data to send that the receiving end responds with HTTP 429.
My sending Lambda (JSON body over HTTPS) deletes messages from the queue only when the service responds with HTTP 200; otherwise I do nothing with the receiptHandle, which I think should keep the message in the queue.
Anyway, when a request is rejected by the service, the message never becomes available again, so it is never retried and is lost forever.
The incoming SQS queue has been set up as follows:
Visibility timeout: 3 min
Delivery delay: 0 s
Receive message wait time: 1 s
Message retention period: 1 day
Maximum message size: 256 KB
The associated DLQ has a Maximum receives setting of 100.
The consuming Lambda is configured as:
Memory: 128 MB
Timeout: 10 s
Triggers: the source SQS queue, Batch size: 10, Batch window: None
And the actual logic I have in my Lambda is quite simple, really. The event it receives contains the Records from the queue. The Lambda might get more than one record at a time, but all records are handled separately.
console.log('Response', response);
if (response.status === 'CREATED') {
    /* some code here */
    const deleteParams = {
        QueueUrl: queueUrl, /* required */
        ReceiptHandle: receiptHandle /* required */
    };
    console.log('Removing record from', queueUrl, 'with params', deleteParams);
    await sqs.deleteMessage(deleteParams).promise();
} else {
    /* any record that ends up here is never seen again :( */
    console.log('Keeping record in', eventSourceARN);
}
What do :( ?!?!11
otherwise I do nothing with the receiptHandle which I think should then keep the message in the queue
That's not how it works:
Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.
When an AWS Lambda function is triggered from an Amazon SQS queue, all activities related to SQS are handled by the Lambda service. Your code should not call any Amazon SQS functions.
The messages will be provided to the AWS Lambda function via the event parameter. When the function successfully exits, the Lambda service will delete the messages from the queue.
Your code should not be calling DeleteMessage().
If you wish to signal that some of the messages were not successfully processed, you can use a partial batch response to indicate which messages were successfully processed. The AWS Lambda service will then make the unsuccessful messages available on the queue again.
Thanks to everyone who answered. I got this "problem" solved just by going through the documentation.
I'll provide a more detailed answer to my own question here in case someone besides me didn't get it on the first go :)
So the function should return a batchItemFailures object containing the message IDs of the failed records.
So, for example, one can have a Lambda handler like:
/**
 * Handler
 *
 * @param {*} event SQS event
 * @returns {Object} batch item failures
 */
exports.handler = async (event) => {
    console.log('Starting ' + process.env.AWS_LAMBDA_FUNCTION_NAME);
    console.log('Received event', event);

    event = typeof event === 'object'
        ? event
        : JSON.parse(event);

    const batchItemFailures = await execute(event.Records);

    if (batchItemFailures.length > 0) {
        console.log('Failures', batchItemFailures);
    } else {
        console.log('No failures');
    }

    return {
        batchItemFailures: batchItemFailures
    };
};
and an execute function that handles the messages:
/**
 * Execute
 *
 * @param {Array} records SQS records
 * @returns {Promise<*[]>} batch item failures
 */
async function execute (records) {
    let batchItemFailures = [];

    for (let index = 0; index < records.length; index++) {
        const record = records[index];
        // ...some async stuff here
        if (someSuccessCondition) {
            console.log('Life is good');
        } else {
            batchItemFailures.push({
                itemIdentifier: record.messageId
            });
        }
    }

    return batchItemFailures;
}
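One detail worth calling out: returning batchItemFailures only has an effect if the SQS trigger (the event source mapping) has the ReportBatchItemFailures response type enabled; otherwise Lambda treats a successful exit as "delete the whole batch". A sketch of enabling it with the SDK (the UUID and region are placeholders, not values from the question):

// Sketch: enable partial batch responses on an existing SQS event source mapping.
// Look the UUID up with listEventSourceMappings first; the values below are placeholders.
const Lambda = require('aws-sdk/clients/lambda');
const lambda = new Lambda({ region: 'eu-west-1' });

lambda.updateEventSourceMapping({
    UUID: '00000000-0000-0000-0000-000000000000',
    FunctionResponseTypes: ['ReportBatchItemFailures']
}).promise().then((mapping) => console.log('Updated mapping', mapping.UUID));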
I have an 85 MB data file with 110k text records in it. I need to parse each of these records and publish an SNS message to a topic for each record. I am doing this successfully, but the Lambda function requires a lot of time to run, as well as a large amount of memory. Consider the following:
const parse = async (key) => {
    // get the 85 MB file from S3. this takes 3 seconds
    // I could probably do this via a stream to cut down on memory...
    let file = await getFile( key );

    // parse the data by new line
    const rows = file.split("\n");

    // free some memory now
    // this free'd up ~300mb of memory in my tests
    file = null;

    // collect the publish promises
    const requests = [];
    for( let i = 0; i < rows.length; i++ ) {
        // ... parse the row and build a small JS object from it
        // publish to SNS. assume publishMsg returns a promise after a successful SNS push
        requests.push( publishMsg(data) );
    }

    // wait for all to finish
    await Promise.all(requests);
    return 1;
};
The Lambda function times out with this code at 90 seconds (the current limit I have set). I could raise this limit, as well as the memory (currently 1024 MB), and likely solve my issue. But none of the SNS publish calls take place when the function hits the timeout. Why?
Let's say 10k rows are processed before the function hits the timeout. Since I am submitting the publishes asynchronously, shouldn't several of these complete regardless of the timeout? It seems they only run if the entire function completes.
I have run a test where I cut the data down to 15k rows, and it runs without any issue in roughly 15 seconds.
So the question: why are the async calls not completing prior to the function timeout, and is there any input on how I can optimize this without moving away from Lambda?
Lambda config: Node.js 10.x, 1024 MB, 90-second timeout
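Not part of the original question, but since the code above only awaits Promise.all once at the very end, one common way to keep memory bounded and to make sure publishes actually complete along the way is to await them in fixed-size chunks. A sketch, assuming the publishMsg helper from the question; the chunk size is an arbitrary choice:

// Sketch: publish in chunks so only CHUNK_SIZE requests are in flight at a time.
// Assumes publishMsg(row) returns a promise, as described in the question.
const CHUNK_SIZE = 100;

const publishAll = async (rows) => {
    for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
        const chunk = rows.slice(i, i + CHUNK_SIZE);
        // each chunk finishes (or fails) before the next one starts
        await Promise.all(chunk.map((row) => publishMsg(row)));
    }
};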
UPDATE
The original test code below is largely correct, but in Node.js the various AWS service clients should be set up a bit differently, as per the SDK link provided by @Michael-sqlbot:
// manager
const AWS = require("aws-sdk");
const https = require('https');

const agent = new https.Agent({
    maxSockets: 498 // workers hit this level; expect plus 1 for the manager instance
});

const lambda = new AWS.Lambda({
    apiVersion: '2015-03-31',
    region: 'us-east-2',   // Initial concurrency burst limit = 500
    httpOptions: {         // <--- replace the default of 50 (https) by
        agent: agent       // <--- plugging the modified Agent into the service
    }
});

// NOW begin the manager handler code
In planning for a new service, I am doing some preliminary stress testing. After reading about the 1,000 concurrent execution limit per account and the initial burst rate (which in us-east-2 is 500), I was expecting to achieve at least the 500 burst concurrent executions right away. The screenshot below of CloudWatch's Lambda metric shows otherwise. I cannot get past 51 concurrent executions no matter what mix of parameters I try. Here's the test code:
// worker
exports.handler = async (event) => {
    // declare sleep promise
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // return after one second
    let nStart = new Date().getTime();
    await sleep(1000);
    return new Date().getTime() - nStart; // report the exact ms the sleep actually took
};

// manager
exports.handler = async (event) => {
    const invokeWorker = async () => {
        try {
            let lambda = new AWS.Lambda(); // NO! DO NOT DO THIS, SEE UPDATE ABOVE
            var params = {
                FunctionName: "worker-function",
                InvocationType: "RequestResponse",
                LogType: "None"
            };
            return await lambda.invoke(params).promise();
        }
        catch (error) {
            console.log(error);
        }
    };

    try {
        let nStart = new Date().getTime();
        let aPromises = [];

        // invoke workers
        for (var i = 1; i <= 3000; i++) {
            aPromises.push(invokeWorker());
        }

        // record time to complete spawning
        let nSpawnMs = new Date().getTime() - nStart;

        // wait for the workers to ALL return
        let aResponses = await Promise.all(aPromises);

        // sum all the actual sleep times
        const reducer = (accumulator, response) => { return accumulator + parseInt(response.Payload) };
        let nTotalWorkMs = aResponses.reduce(reducer, 0);

        // show me
        let nTotalET = new Date().getTime() - nStart;
        return {
            jobsCount: aResponses.length,
            spawnCompletionMs: nSpawnMs,
            spawnCompletionPct: `${Math.floor(nSpawnMs / nTotalET * 10000) / 100}%`,
            totalElapsedMs: nTotalET,
            totalWorkMs: nTotalWorkMs,
            parallelRatio: Math.floor(nTotalET / nTotalWorkMs * 1000) / 1000
        };
    }
    catch (error) {
        console.log(error);
    }
};
Response:
{
    "jobsCount": 3000,
    "spawnCompletionMs": 1879,
    "spawnCompletionPct": "2.91%",
    "totalElapsedMs": 64546,
    "totalWorkMs": 3004205,
    "parallelRatio": 0.021
}
Request ID:
"43f31584-238e-4af9-9c5d-95ccab22ae84"
Am I hitting a different limit that I have not mentioned? Is there a flaw in my test code? I was attempting to hit the limit here with 3,000 workers, but there was NO throttling encountered, which I guess is due to the Asynchronous invocation retry behaviour.
Edit: There is no VPC involved on either Lambda; the setting in the select input is "No VPC".
Edit: Showing CloudWatch before and after the fix.
There were a number of potential suspects, particularly due to the fact that you were invoking Lambda from Lambda, but your focus on consistently seeing a concurrency of 50 — a seemingly arbitrary limit (and a suspiciously round number) — reminded me that there's an anti-footgun lurking in the JavaScript SDK:
In Node.js, you can set the maximum number of connections per origin. If maxSockets is set, the low-level HTTP client queues requests and assigns them to sockets as they become available.
Here of course, "origin" means any unique combination of scheme + hostname, which in this case is the service endpoint for Lambda in us-east-2 that the SDK is connecting to in order to call the Invoke method, https://lambda.us-east-2.amazonaws.com.
This lets you set an upper bound on the number of concurrent requests to a given origin at a time. Lowering this value can reduce the number of throttling or timeout errors received. However, it can also increase memory usage because requests are queued until a socket becomes available.
...
When using the default of https, the SDK takes the maxSockets value from the globalAgent. If the maxSockets value is not defined or is Infinity, the SDK assumes a maxSockets value of 50.
https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
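For completeness, the same cap can also be raised globally through AWS.config instead of passing a custom Agent to each service client, which is equivalent in spirit to the UPDATE at the top of the question (a sketch; 500 simply matches the burst limit mentioned there):

// Sketch: raise maxSockets for every SDK client via the global config,
// rather than plugging an Agent into each service constructor.
const AWS = require('aws-sdk');
const https = require('https');

AWS.config.update({
    httpOptions: { agent: new https.Agent({ maxSockets: 500 }) }
});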
Lambda concurrency is not the only factor that decides how scalable your functions are. If your Lambda function is running within a VPC, it requires an ENI (Elastic Network Interface), which allows network traffic to and from the container running the Lambda function.
It's possible your throttling occurred because too many ENIs were being requested (50 at a time). You can check this by viewing the logs of the manager Lambda function and looking for an error message when it tries to invoke one of the child containers. If the error looks something like the following, you'll know ENIs are your issue:
Lambda was not able to create an ENI in the VPC of the Lambda function because the limit for Network Interfaces has been reached.
I'm a beginner at Node.js and Azure.
I'm trying to use the wav-encoder npm module in my program:
wav-encoder
So I wrote code like the below,
var WavEncoder = require('wav-encoder');
const whiteNoise1sec = {
    sampleRate: 40000,
    channelData: [
        new Float32Array(40000).map(() => Math.random() - 0.5),
        new Float32Array(40000).map(() => Math.random() - 0.5)
    ]
};

WavEncoder.encode(whiteNoise1sec).then((buffer) => {
    console.log(whiteNoise1sec);
    console.log(buffer);
});
It runs on my local machine in less than 2 seconds.
But if I upload similar code to Azure Functions, it takes more than 2 minutes.
Below is the code in my Function. It is triggered by an HTTP REST call.
var WavEncoder = require('wav-encoder');

module.exports = function (context, req) {
    context.log('JavaScript HTTP trigger function processed a request.');

    const whiteNoise1sec = {
        sampleRate: 40000,
        channelData: [
            new Float32Array(40000).map(() => Math.random() - 0.5),
            new Float32Array(40000).map(() => Math.random() - 0.5)
        ]
    };

    WavEncoder.encode(whiteNoise1sec).then((buffer) => {
        context.res = {
            // status: 200, /* Defaults to 200 */
            body: whiteNoise1sec
        };
        context.done();
    });
};
Do you know how I can improve the performance on Azure?
Update
context.res = {
    // status: 200, /* Defaults to 200 */
    body: whiteNoise1sec
};
context.done();
I found that these lines cause the slow performance.
If I give a large array to context.res.body, it takes a long time when I call context.done().
Isn't a large JSON response appropriate for Azure Functions?
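For what it's worth, if the caller really needs the encoded WAV rather than the raw samples, returning the binary buffer avoids serializing two 40,000-element Float32Arrays into JSON. The sketch below assumes a binary response is acceptable; the isRaw/header usage is my assumption, not code from the original Function:

// Sketch: return the encoded WAV bytes instead of the large JSON object.
WavEncoder.encode(whiteNoise1sec).then((buffer) => {
    context.res = {
        status: 200,
        isRaw: true, // skip JSON serialization of the body
        headers: { 'Content-Type': 'audio/wav' },
        body: Buffer.from(buffer) // encode() resolves with an ArrayBuffer
    };
    context.done();
});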
It's a bit hard to analyze performance issues like this, but there are a few things to consider and a few things to look at.
Cold functions vs warm functions performance
If the function hasn't been invoked in a while, or never (I think the idle window is around 10 or 20 minutes), it goes idle, meaning it gets deprovisioned. The next time you hit that function it has to be loaded from storage. Due to the architecture and its reliance on a certain type of storage, IO for lots of small files is currently slow, though there is work in progress to improve it; a large npm tree can cause more than a minute of loading time just to fetch all the small .js files. If the function is warm, however, it should respond in the millisecond range (depending on the work your function is doing; see below for more).
Workaround: use this to pack your function: https://github.com/Azure/azure-functions-pack
Slower CPU on the Consumption SKU
On the Consumption SKU you are scaled out to many instances (in the hundreds), but each instance is affinitized to a single core. That is fine for IO-bound operations, regular Node functions (since they are single-threaded anyway), etc. But if your function tries to use the CPU for CPU-bound workloads, it is not going to perform as you expect.
Workaround: you can use Dedicated SKUs for CPU-bound workloads.
I'm using the aws-sdk Node module with (as far as I can tell) the approved way to poll for messages.
Which basically boils down to:
sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20
}, function(err, data) {
    if (err) {
        logger.fatal('Error on Message Receive');
        logger.fatal(err);
    } else {
        // all good
        if (undefined === data.Messages) {
            logger.info('No Messages Object');
        } else if (data.Messages.length > 0) {
            logger.info('Messages Count: ' + data.Messages.length);
            var delete_batch = [];
            for (var x = 0; x < data.Messages.length; x++) {
                // process
                receiveMessage(data.Messages[x]);
                // flag to delete
                var pck = {
                    Id: data.Messages[x].MessageId,
                    ReceiptHandle: data.Messages[x].ReceiptHandle
                };
                delete_batch.push(pck);
            }
            if (delete_batch.length > 0) {
                logger.info('Calling Delete');
                sqs.deleteMessageBatch({
                    Entries: delete_batch,
                    QueueUrl: queueUrl
                }, function(err, data) {
                    if (err) {
                        logger.fatal('Failed to delete messages');
                        logger.fatal(err);
                    } else {
                        logger.debug('Deleted received ok');
                    }
                });
            }
        } else {
            logger.info('No Messages Count');
        }
    }
});
receiveMessage is my "do stuff with collected messages if I have enough collected messages" function
Occasionally, my script stalls because I don't get a response from Amazon at all. Say, for example, there are no messages in the queue to consume: instead of hitting the WaitTimeSeconds and returning a "no messages" response, the callback is never called.
(I'm putting this down to Amazon weirdness.)
What I'm asking is: what's the best way to detect and deal with this, given that I have some code in place to stop concurrent calls to receiveMessage?
The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted, it's only fetching one message at a time).
I do have the whole thing wrapped in:
var running = false;
runMonitorJob = setInterval(function() {
    if (running) {
        // a receive is already in flight; skip this tick
    } else {
        running = true;
        // call SQS.receive
    }
}, 500);
(With running = false after the delete loop, not in its callback.)
My proposed solution would be:
watchdogTimeout = setTimeout(function() {
    running = false;
}, 30000);
But surely this would leave a pile of floating sqs.receive calls lurking about, and thus consume more and more memory over time?
(This job runs all the time. I left it running on Friday, it stalled on Saturday morning, and it hung until I manually restarted the job this morning.)
Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, BUT with a wait time of 20 seconds it should return a "no messages" response after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one's business logic).
Edit: Yes, long polling is already configured on the queue side.
Edit: This is with (the latest) v2.3.9 of the aws-sdk and Node.js v4.4.4.
I've been chasing this (or a similar) issue for a few days now and here's what I've noticed:
The receiveMessage call does eventually return, although only after 120 seconds.
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error; in fact, after the 120 seconds have passed, it may contain messages.
What can be done about this? This sort of thing can happen for a number of reasons, and some or many of them can't necessarily be fixed. The answer is to run multiple services, each calling receiveMessage and processing the messages as they come; SQS supports this. At any time one of these services may hit this 120-second lag, but the other services should be able to continue on as normal.
My particular problem is that I have some critical singleton services that can't afford 120 seconds of downtime. For this I will look into either 1) using HTTP instead of SQS to push messages into my service, or 2) spawning worker processes around each of the singletons to fetch the messages from SQS and push them into the service.
I also ran into this issue, though not when calling receiveMessage but sendMessage. I also saw hang-ups of exactly 120 seconds, and I saw the same with a few other services, like Firehose.
That led me to this line in the AWS SDK documentation:
SQS Constructor
httpOptions:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
To implement a fix, I override the timeout for the SQS client that performs the sendMessage so it times out after 10 seconds, and use another client with a 25-second timeout for receiving (where I long-poll for 20 seconds):
var sendClient = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});
I've had this in production for a week now, and all of my SQS stalling issues have been eliminated.
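If you still want a belt-and-braces watchdog on top of the shorter httpOptions timeout, note that the AWS.Request object returned by receiveMessage() can be aborted explicitly, so the in-flight call doesn't have to be left lurking. A sketch in the same callback style as the question (error handling trimmed; not from the original post):

// Sketch: keep a handle on the in-flight receive so a watchdog can cancel it.
var request = receiveClient.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20
});

var watchdog = setTimeout(function() {
    request.abort(); // invokes the callback with a RequestAbortedError
    running = false;
}, 10 * 60 * 1000); // the ~10 minute watchdog suggested in the question

request.send(function(err, data) {
    clearTimeout(watchdog);
    // ...existing message handling / delete logic goes here...
    running = false;
});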