What is causing intermittent long execution times for DynamoDB operations - node.js

We are experiencing intermittently long DynamoDB execution times, consistently taking just over 10 seconds. This happens roughly once every one to three hours.
The DynamoDB operations are called from a Node.js Lambda (either getItem or updateItem)
The AWS SDK DynamoDB instance instantiation happens once outside of the Lambda handler
The Lambda and DynamoDB are in the same VPC with a DynamoDB endpoint enabled
The Lambda is triggered by a DynamoDB stream from another DynamoDB table which can happen multiple times a second
The tables are set to PAY_PER_REQUEST
The Partition Key is unique and we are getting concurrency of 4 Lambdas
We have given the Lambda 2560MB of memory
We have tried setting the BATCH_SIZE of the stream from 1 to 10 but it never prevents the intermittent long execution time
We have enabled AWS SDK logging and can see on the long spike times that the DynamoDB call has been retried once - the duration on these occasions is consistently just over 10 seconds
We have set the AWS SDK timeout from 5 secs to 2 secs and still see the 10 second delay
We have maxRetries on the table set to 10
The cause for the consistent 10 second timing of the long spikes is hard to figure out. If anyone has any other idea to try out or advice it would be much appreciated. We are also liaising with DynamoDB, Lambda and Networking AWS support departments but so far they haven't figured out what is going on.
Much thanks,
Sam
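The client settings listed in the question (one client created outside the handler, a 2 second timeout, maxRetries of 10) might look roughly like the sketch below for the AWS SDK for JavaScript v2. This is only an illustration under assumptions, not a confirmed fix: the post does not say which SDK version or which timeout option was changed, buildClientOptions is a hypothetical helper name, and the separate connectTimeout value and the keep-alive agent are additions that the post does not mention. In SDK v2, httpOptions.timeout and httpOptions.connectTimeout are separate settings, so it may be worth checking that both are set.

```javascript
const https = require('https');

// Hypothetical helper (not from the original post): builds the options
// object that would be passed to the AWS SDK v2 DynamoDB constructor.
function buildClientOptions({ timeoutMs = 2000, connectTimeoutMs = 1000, maxRetries = 10 } = {}) {
  return {
    // Maximum number of retries the SDK performs per call.
    maxRetries,
    httpOptions: {
      // Time allowed to establish the connection (assumed value).
      connectTimeout: connectTimeoutMs,
      // Socket inactivity timeout once connected (2 s, as in the post).
      timeout: timeoutMs,
      // Assumption: reuse TCP/TLS connections between invocations.
      agent: new https.Agent({ keepAlive: true }),
    },
  };
}

const clientOptions = buildClientOptions();
console.log(clientOptions.maxRetries, clientOptions.httpOptions.timeout);
```

In the Lambda module scope (outside the handler) this would be used as new AWS.DynamoDB(buildClientOptions()), so that the same client and connection pool are shared by every invocation of a warm container.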

Related

Azure Timer Function connection limits

I have an Azure Timer function that executes every 15 minutes. The function compiles data from 3 data sources (a WCF service, a REST endpoint and Table Storage) and inserts the data into CosmosDb. Where I am running into an issue is that after 7 or 8 executions of the function I get the "Host thresholds exceeded: [Connections]" error. Here is what is really strange: the function takes about 2 minutes to execute, but the error doesn't show in the logs until well after the function is done executing.
I have gone through all the connection limits documentation and understand it. Where I am a bit confused is when the limits matter. A single execution of my function does not come anywhere close to hitting the 600 active connection limit. Do the connection limits apply to an individual execution of the timer function, or are the limits cumulative over multiple executions?
Here is the real kicker, this function was running fine for two weeks until 07/22/2012. Nothing in the code has changed and it has not been redeployed.
Runtime is 3.1.3
Is your function on a Consumption Plan or in an App Service Plan?
From your description it just sounds like your code may be leaking connections and building up a large number of them over time.
Maybe this blog post can help in ensuring the right usage patterns? https://4lowtherabbit.github.io/blogs/2019/10/SNAT/

DynamoDB queries take 2000ms for no obvious reason

I am using DynamoDB to query some data. At the bottom you can see the number of milliseconds for a given percentage of requests.
Most of the time DynamoDB works fine with about 100ms response time (there are only queries on the Primary Key or Indexes). About 0.4% of requests take more than 800ms (which is the limit within which the service needs to provide a response) even at "calm" times. It's not perfect, but good enough.
However, a certain load (which is not even big) triggers DynamoDB behaviour that causes about 5% of requests to have 2000ms response times. There are several queries in one request, and if one of these queries takes more than 800ms, the whole request is terminated by the caller.
We are using Node.js. This is how we initiate and use the AWS library:
const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB.DocumentClient({
  region: config.server.region,
});
// ...
dynamodb.query(params).promise()
Example of query
{
  "TableName": "prod-UserDeviceTable",
  "KeyConditionExpression": "#userId = :userId",
  "ExpressionAttributeNames": {
    "#userId": "userId"
  },
  "ExpressionAttributeValues": {
    ":userId": "e3hs8etjse3t13se8h7eh4"
  },
  "ConsistentRead": false
}
Few more notes:
CPU of the service is currently autoscaled to stay below 10%. Higher load or CPU usage does not make the service work worse; we even tried reaching 50% with the same response times
There is no gradual growth of response time. It's usually less than 100ms and then out of nowhere it's 2000ms for 1-5% of requests, as you can see in the graph
The queries do not use ConsistentRead
We are using Global Tables in 4 regions. This happens in all regions (the service is deployed independently in all 4 regions as well)
DynamoDB metrics for tables used do not show any spikes in response times
Service is deployed in ECS Fargate, but it was deployed before in ECS EC2 with same results
All tables in all regions are on-demand
The 2000ms delay can happen anywhere in the service during the request lifetime when querying DynamoDB (it happens for different queries, even for different tables)
When I repeat the same request that took more than 2000ms again, it takes only 100ms or less
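Since the caller terminates the request after 800ms and a repeated query returns in about 100ms, one possible workaround (not a root-cause fix) is to give each query attempt its own short deadline and retry once. The sketch below is an assumption-laden illustration: withDeadline and queryWithRetry are hypothetical helper names, and the 300ms per-attempt budget is an invented value chosen so that two attempts fit inside 800ms. Note that an attempt that misses its deadline is only abandoned, not cancelled, and that this version retries on any error, not just on timeouts.

```javascript
// Rejects if `operation` does not settle within `ms` milliseconds.
function withDeadline(operation, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`deadline of ${ms} ms exceeded`)), ms);
  });
  // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
  return Promise.race([Promise.resolve().then(operation), deadline])
    .finally(() => clearTimeout(timer));
}

// Runs `operation` up to `attempts` times, each attempt with its own deadline.
// Defaults are assumed values: 2 x 300 ms stays under the 800 ms budget.
async function queryWithRetry(operation, { perAttemptMs = 300, attempts = 2 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i += 1) {
    try {
      return await withDeadline(operation, perAttemptMs);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

With the client from the question this would be called as queryWithRetry(() => dynamodb.query(params).promise()). Passing a function rather than a promise matters, because each attempt has to start a fresh request.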

Lambda starts timing out randomly when communicating with DynamoDB

I have a Node.js Lambda code base that talks to a tiny dataset in DynamoDB (items of less than 400 bytes each). Every now and then the function will time out after more than 5 minutes whilst doing a get() request to DynamoDB (via dynamoDb = new AWS.DynamoDB.DocumentClient();).
The problem is that it's completely random as to when this issue will occur. When it works it takes ~2 seconds from a cold start, so taking over 5 minutes to run, and at random points, makes no sense.
It's a dev environment so only myself is using this, and I'm doing maybe 10 requests a day
context.callbackWaitsForEmptyEventLoop = false; has been set
Memory allocation never exceeds 45MB (128MB set)
I'm testing directly in Lambda
The code is deployed via Serverless
When testing locally with Serverless it works, whilst the deployed Lambda fails
I've inherited this project but have a good understanding of the architecture around it and it's fairly simple but I've not done much work with Lambda before.
Any ideas what I should look for or any known issues will be a massive help.
It sounds like one (or more) of the VPC subnets the Lambda function is configured to run in doesn't have a route to a NAT Gateway (or an AWS PrivateLink configuration). So whenever that subnet is used by the Lambda function it is unable to access the AWS API.
If the Lambda function doesn't actually need to access any resources in the VPC then it is much better to not configure it to use the VPC at all.

AWS SDK calls from a Lambda take longer than 30 seconds

I have a Node.js Lambda function in AWS which needs to read some data. As a data source we've tried two options: S3 and DynamoDB. Both of them have the same issue: when we conduct load testing (10 req/sec for 100 sec), some requests to S3/DynamoDB fail to complete in 30 sec, which is our Lambda timeout. The requests themselves are very light: for S3 it is a 1KB file and for DynamoDB it is a table with only one record in it. On average those requests take less than 100ms, but sometimes we get the very long peaks I'm talking about.
The rate of such long requests is quite small - less than 1%, but this is still not acceptable for us. Moreover, I don't see any reasons why we have such long responses.
Another thing we've noticed is that those 30sec+ requests usually happen after long periods (4h or more) of not calling those S3/DynamoDB resources.
The only reason I can think of is that after long inactivity periods the AWS infrastructure is unable to create the required number of ENIs fast enough. ENIs are needed because both S3 and DynamoDB are called via HTTP by aws-sdk. But this is just a guess which I don't know how to validate.
Currently, I'm thinking of warming-up ENIs by making requests to S3/DynamoDB, but I haven't tried it yet.
If anybody has had similar issues I would appreciate any suggestions on how to fix the issue.
P.S. Increasing the Lambda timeout is not an option for us. 30 secs is more than enough to make such simple calls.

How to optimize AWS Lambda?

I'm currently building web API using AWS Lambda with Serverless Framework.
In my lambda functions, each of them connects to Redis (elasticache) and RDB (Aurora, RDS) or DynamoDB to retrieve data or write new data.
And all my lambda functions are running in my VPC.
Everything works fine except that when a lambda function is first executed, or executed a while after its last execution, it takes quite a long time (1-3 seconds) to run, or sometimes it even responds with a gateway timeout error (around 30 seconds), even though my lambda functions are configured with a 60 second timeout.
As stated in here, I assume the 1-3 seconds is for initializing a new container. However, I wonder if there is a way to reduce this time, because 1-3 seconds or a gateway timeout is not really ideal for production use.
You've got two issues:
The 1-3 second delay. This is expected and well-documented when using Lambda. As @Nick mentioned in the comments, the only way to prevent your container from going to sleep is using it. You can use Lambda Scheduled Events to execute your function as often as every minute using the rate expression rate(1 minute). If you add some parameters to your function to help you distinguish between a real request and one of these ping requests, you can return immediately on the ping requests and then you've worked around your problem. It will cost you more, but we're probably talking pennies per month if anything. Lambda has a generous free tier.
The 30 second delay is unusual. I would definitely check your CloudWatch logs. If you see logs from when your function is working normally but no logs from when you see the 30 second timeout then I would assume the problem is with API Gateway and not with Lambda. If you do see logs then maybe they can help you troubleshoot. Another place to check is the AWS Status Page. I've seen sometimes where Lambda functions timeout and respond intermittently and I pull my hair out only to realize that there's a problem on Amazon's end and they're working on it.
Here's a blog post with additional information on Lambda Container Reuse that, while a little old, still has some good information.
