We've got an occasional weird problem in our AWS lambda function where sendMessage() only gets completed in the /next/ function invocation. Our code looks like:
await sqsClient.sendMessage(params).promise()
As you can see, we're awaiting on the promise... but occasionally it looks like the promise doesn't get completed until the next lambda invocation comes into the same warm container.
Any theories on what might cause that?
So after hours of debugging, I eventually tracked it down to us having a line of code in our codebase that called context.done() but then letting code continue to execute.
The next time our code calls await ..., I assume that node's async handling kicks in, realises that it is marked as done, so then just abruptly ends the lambda execution... even though the promise hasn't yet completed.
Related
Let's say I've got a queue of requests for my Lambda, and inside the lambda might be an external service call that takes 500ms, which is wrapped in async await like
async callSlowService(serializedObject: string) Promise<void>{
await slowServiceClient.post(serializedObject);
}
Should I expect that my Lambda instance will pick up another request off the queue while awaiting the slow call? I know it'll also spin up new Lambda instances but that's not what I'm talking about interleaving requests on a single instance.
I'm asking because I would think that it should do this, however I'm testing with a sleep function and a load generator and it's not happening. My code actually looks like this:
async someCoreFunction() Promise<void>{
// Business logic
console.log("Before wait");
await sleep(2000);
console.log("After wait");
}
}
const sleep = (milliseconds) => {
return new Promise(resolve => setTimeout(resolve, milliseconds))
};
And while it definitely is taking 2 seconds between the "Before wait" and "After wait" statements, there's no new logs being written in that time.
No.
Lambda as a service is largely unaware of what your code is doing. It simply takes a request, invokes your code and then waits for it to return.
I would not expect AWS to implement a feature like interleaving any time soon. It would require the lambda runtime to have substantial knowledge of how your code behaves (for example, you may be awaiting two concurrent long asynchronous calls within one invocation- so simply interrupting when you hit your first await would be incorrect). It would also cause no end of issues for people using the shared scope outside of the handler for common setup/teardown.
As you pay per invocation and time, I don't really see that there is much difference between interleaving and processing the queue in parallel (which lambda natively supports); considering that time spent awaiting still requires some compute. If interleaving ever happens I'd expect it to be a way for AWS to reduce the drain on their own resources.
n.b. If you are awaiting for a long time in a lambda function then there is probably a better way of doing things. For example, Step Functions provide a great way to kick off and poll long running tasks. Similarly, the pattern of using a session variable in your payload is a good way of allowing a long service to callback into lambda without having the lambda idling.
I was following the guide here for setting up a presignup trigger.
However, when I used callback(null, event) my lambda function would never actually return and I would end up getting an error
{ code: 'UnexpectedLambdaException',
name: 'UnexpectedLambdaException',
message: 'arn:aws:lambda:us-east-2:642684845958:function:proj-dev-confirm-1OP5DB3KK5WTA failed with error Socket timeout while invoking Lambda function.' }
I found a similar link here that says to use context.done().
After switching it works perfectly fine.
What's the difference?
exports.confirm = (event, context, callback) => {
event.response.autoConfirmUser = true;
context.done(null, event);
//callback(null, event); does not work
}
Back in the original Lambda runtime environment for Node.js 0.10, Lambda provided helper functions in the context object: context.done(err, res) context.succeed(res) and context.fail(err).
This was formerly documented, but has been removed.
Using the Earlier Node.js Runtime v0.10.42 is an archived copy of a page that no longer exists in the Lambda documentation, that explains how these methods were used.
When the Node.js 4.3 runtime for Lambda was launched, these remained for backwards compatibility (and remain available but undocumented), and callback(err, res) was introduced.
Here's the nature of your problem, and why the two solutions you found actually seem to solve it.
Context.succeed, context.done, and context.fail however, are more than just bookkeeping – they cause the request to return after the current task completes and freeze the process immediately, even if other tasks remain in the Node.js event loop. Generally that’s not what you want if those tasks represent incomplete callbacks.
https://aws.amazon.com/blogs/compute/node-js-4-3-2-runtime-now-available-on-lambda/
So with callback, Lambda functions now behave in a more paradigmatically correct way, but this is a problem if you intend for certain objects to remain on the event loop during the freeze that occurs between invocations -- unlike the old (deprecated) done fail succeed methods, using the callback doesn't suspend things immediately. Instead, it waits for the event loop to be empty.
context.callbackWaitsForEmptyEventLoop -- default true -- was introduced so that you can set it to false for those cases where you want the Lambda function to return immediately after you call the callback, regardless of what's happening in the event loop. The default is true because false can mask bugs in your function and can cause very erratic/unexpected behavior if you fail to consider the implications of container reuse -- so you shouldn't set this to false unless and until you understand why it is needed.
A common reason false is needed would be a database connection made by your function. If you create a database connection object in a global variable, it will have an open socket, and potentially other things like timers, sitting on the event loop. This prevents the callback from causing Lambda to return a response, until these operations are also finished or the invocation timeout timer fires.
Identify why you need to set this to false, and if it's a valid reason, then it is correct to use it.
Otherwise, your code may have a bug that you need to understand and fix, such as leaving requests in flight or other work unfinished, when calling the callback.
So, how do we parse the Cognito error? At first, it seemed pretty unusual, but now it's clear that it is not.
When executing a function, Lambda will throw an error that the tasked timed out after the configured number of seconds. You should find this to be what happens when you test your function in the Lambda console.
Unfortunately, Cognito appears to have taken an internal design shortcut when invoking a Lambda function, and instead of waiting for Lambda to timeout the invocarion (which could tie up resources inside Cognito) or imposing its own explicit timer on the maximum duration Cognito will wait for a Lambda response, it's relying on a lower layer socket timer to constrain this wait... thus an "unexpected" error is thrown while invoking the timeout.
Further complicating interpreting the error message, there are missing quotes in the error, where the lower layer exception is interpolated.
To me, the problem would be much more clear if the error read like this:
'arn:aws:lambda:...' failed with error 'Socket timeout' while invoking Lambda function
This format would more clearly indicate that while Cognito was invoking the function, it threw an internal Socket timeout error (as opposed to Lambda encountering an unexpected internal error, which was my original -- and incorrect -- assumption).
It's quite reasonable for Cognito to impose some kind of response time limit on the Lambda function, but I don't see this documented. I suspect a short timeout on your Lambda function itself (making it fail more promptly) would cause Cognito to throw a somewhat more useful error, but in my mind, Cognito should have been designed to include logic to make this an expected, defined error, rather than categorizing it as "unexpected."
As an update the Runtime Node.js 10.x handler supports an async function that makes use of return and throw statements to return success or error responses, respectively. Additionally, if your function performs asynchronous tasks then you can return a Promise where you would then use resolve or reject to return a success or error, respectively. Either approach simplifies things by not requiring context or callback to signal completion to the invoker, so your lambda function could look something like this:
exports.handler = async (event) => {
// perform tasking...
const data = doStuffWith(event)
// later encounter an error situation
throw new Error('tell invoker you encountered an error')
// finished tasking with no errors
return { data }
}
Of course you can still use context but its not required to signal completion.
My route calls a function startJob() that immediately returns a response "Job started." It also calls a function runJob(), which continues with the job flow (async).
Obviously, I can't just do const jobResults = await startJob(), since that function just returns "Job started." The actual results of the job are not available until all the subsequent functions complete.
So I need a way to check the results of the test after all the functions complete, meaning I need to know when all the functions complete.
The last function in the flow logResults() is not going to work to spy on, because I need it to call through (including an async call). Jasmine doesn't allow me to call it through and then call another function that checks the results.
My idea was to add a function at the end of logResults() called finishJob and spy on that. finishJob() wouldn't do anything, so I wouldn't need to call it through.
Pseudo-code:
spyOn(finishJob).and.callFake( [function that checks the results] )
However, it seems like bad practice to add a function to the code purely for testing reasons.
Other possible solutions would be using some kind of node watchers, a state change machine, or just doing a timeout (also bad practice? but quick implementation...)
I found a solution I'm happy with. I decided to start the testing at a different point in the flow.
Instead calling await startJob() to start the test, I called await runJob(), which kicks off the asynchronous process and gets the results.
That way, I could know when the process finishes (using await keyword obvs) and check the results at that time.
When we are using Promise in nodejs, given a Promise p, we can't know when the Promise p is actually resolved by logging the currentTime in the "then" callback.
To prove that, I wrote the test code below (using CoffeeScript):
# the procedure we want to profile
getData = (arg) ->
new Promise (resolve) ->
setTimeout ->
resolve(arg + 1)
, 100
# the main procedure
main = () ->
beginTime = new Date()
console.log beginTime.toISOString()
getData(1).then (r) ->
resolveTime = new Date()
console.log resolveTime.toISOString()
console.log resolveTime - beginTime
cnt = 10**9
--cnt while cnt > 0
return cnt
main()
When you run the above code, you will notice that the resolveTime (the time your code run into the callback function) is much later than 100ms from the beginTime.
So If we want to know when the Promise is actually resolved, HOW?
I want to know the exact time because I'm doing some profiling via logging. And I'm not able to modify the Promise p 's implementation when I'm doing some profiling outside of the black box.
So, Is there some function like promise.onUnderlyingConditionFulfilled(callback) , or any other way to make this possible?
This is because you have a busy loop that apparently takes longer than your timer:
cnt = 10**9
--cnt while cnt > 0
Javascript in node.js is single threaded and event driven. It can only do one thing at a time and it will finish the current thing it's doing before it can service the event posted by setTimeout(). So, if your busy loop (or any other long running piece of Javascript code) takes longer than you've set your timer for, the timer will not be able to run until this other Javascript code is done. "single threaded" means Javascript in node.js only does one thing at a time and it waits until one thing returns control back to the system before it can service the next event waiting to run.
So, here's the sequence of events in your code:
It calls the setTimeout() to schedule the timer callback for 100ms from now.
Then you go into your busy loop.
While it's in the busy loop, the setTimeout() timer fires inside of the JS implementation and it inserts an event into the Javascript event queue. That event can't run at the moment because the JS interpreter is still running the busy loop.
Then eventually it finishes the busy loop and returns control back to the system.
When that is done, the JS interpreter then checks the event queue to see if any other events need servicing. It finds the timer event and so it processes that and the setTimeout() callback is called.
That callback resolves the promise which triggers the .then() handler to get called.
Note: Because of Javascript's single threaded-ness and event-driven nature, timers in Javascript are not guaranteed to be called exactly when you schedule them. They will execute as close to that as possible, but if other code is running at the time they fire or if their are lots of items in the event queue ahead of you, that code has to finish before the timer callback actually gets to execute.
So If we want to know when the Promise is actually resolved, HOW?
The promise is resolved when your busy loop is done. It's not resolved at exactly the 100ms point (because your busy loop apparently takes longer than 100ms to run). If you wanted to know exactly when the promise was resolved, you would just log inside the setTimeout() callback right where you call resolve(). That would tell you exactly when the promise was resolved though it will be pretty much the same as where you're logging now. The cause of your delay is the busy loop.
Per your comments, it appears that you want to somehow measure exactly when resolve() is actually called in the Promise, but you say that you cannot modify the code in getData(). You can't really do that directly. As you already know, you can measure when the .then() handler gets called which will probably be no more than a couple ms after resolve() gets called.
You could replace the promise infrastructure with your own implementation and then you could instrument the resolve() callback directly, but replacing or hooking the promise implementation probably influences the timing of things even more than just measuring from the .then() handler.
So, it appears to me that you've just over-constrained the problem. You want to measure when something inside of some code happens, but you don't allow any instrumentation inside that code. That just leaves you with two choices:
Replace the promise implementation so you can instrument resolve() directly.
Measure when .then() is triggered.
The first option probably has a heisenberg uncertainty issue in that you've probably influenced the timing by more than you should if you replace or hook the promise implementation.
The second option measures an event that happens just slightly after the actual .resolve(). Pick which one sounds closest to what you actually want.
I'm using two lambda functions with Javascript's 4.3 runtime. I run the first and it calls the second synchronously (sync is the intent). Problem is the second one times out (at 60sec) but it actually reaches a successful finish after only 22 seconds.
Here's the flow between the two Lambda functions:
Lamda function A I am no longer getting CloudWatch logs for but the real problem (I think) is function B which times out for no reason.
Here's some CloudWatch logs to illustrate this:
The code in Function B at the end -- which includes the "Success" log statement see in picture above -- is included below:
Originally I only had the callback(null, 'successful ...') line and not the nodejs 0.10.x way where you called succeed() off of context. In desperation I added both but the result is the same.
Anyone have an idea what's going on? Any way in which I can debug this?
In case the invocation logic between A and B makes a difference in the state that B starts in, here's the invocation:
As Michael - sqlbot said; the issue seems to be that as long as there is an open connection, because of the non empty event loop, calling the callback doesn't terminate the function. Had the same problem with a open Redis connection; solution as stated is context.callbackWaitsForEmptyEventLoop = false;
At least for redis conenctions it helps to quit the connection to redis in order to let Lambda finish the job properly.