Connecting to public Redshift DB from Lambda with no VPC - node.js

On AWS, I have an API Gateway setup that calls a Lambda function, which in turn accesses a Redshift database. All of these services are within the same VPC and work. The only problem is that every API call takes a minimum of 10 seconds just to spin up the Lambda function inside the VPC.
From what I've read, moving the Lambda function outside of the VPC should avoid that 10-second startup. However, is it still possible to connect to the Redshift DB at that point? The Redshift DB is publicly accessible, but does the Lambda function need a VPC in order to access the internet and the public Redshift DB?

As others suggested in the comments, I would look into your Lambda code and check whether the dependencies are so complex that initialization takes this long.
As far as I understand, it's going to take the same time irrespective of whether it's inside the VPC or outside.
There is something called a "cold start" versus a "warm start" in AWS Lambda. A cold start is the time during which initialization takes place: downloading the code, bringing up the container, initializing it and eventually executing the code.
It's nicely explained here:
https://blog.octo.com/en/cold-start-warm-start-with-aws-lambda/
"The initialization time of a Lambda represents a significant part of the total time. After a cold start, the Lambda will remain instantiated for a while (5 minutes) allowing any other call not to have to wait for this initialization to be done each time."
Regarding your second question, whether you should put the Lambda outside the VPC: best practice says "don't put Lambda inside the VPC unless you have to".
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

So it turns out I was having a timeout issue with the Lambda connecting to the Redshift DB because the zone in the VPC that the Redshift DB lives in didn't have a route table with an IGW (internet gateway) route associated with it. I fixed that, and then all I had to do was remove the Lambda from its VPC and things just worked.
Long story short: make sure your Redshift DB has public internet access.

Related

Aws lambda node and concurrency

I'm developing on AWS Lambda with Serverless for the first time.
I know that my Node.js code is non-blocking, so a Node.js server can handle several requests simultaneously.
My question: does Lambda create an instance for each call? For example, if there are 10 simultaneous connections, will Lambda create 10 instances of Node.js?
Currently, in my tests, I have the impression that Lambda creates an instance for each call, because on each call my code creates a new connection to my database, while locally my code keeps the connection to my database in memory.
Yes, this is a fundamental feature of AWS Lambda (and "serverless" functions in general). Concurrent requests are each handled by a separate instance.
If you have multiple parallel executions, all will be separate instances (and thus each would use its own connection to the DB).
Now, if you are invoking the Lambda function multiple times one after another, that's a bit different. It is possible that subsequent invocations of the Lambda function reuse the context. That means there is a possibility of reusing some things, like a DB connection, in subsequent calls to the Lambda function.
There is no exact information about how long a Lambda function keeps the previous context alive. Also, in order to reuse things like a DB connection, you must define and obtain the connection outside of your handler function. If you create it inside the handler function, it will not be reused.
When the context is reused, you have something called a "warm" start, and the Lambda function starts quicker. If some time has passed and the context cannot be reused anymore, you have a "cold" start, meaning the Lambda function will take more time to start its execution (it needs to pull all the dependencies when doing the cold start).

Lambda starts timing out randomly when communicating with DynamoDB

I have a Node.js Lambda code base that talks to a tiny dataset in DynamoDB (items of less than 400 bytes each). Every now and then the function will time out after more than 5 minutes whilst doing a get() request to DynamoDB (via new AWS.DynamoDB.DocumentClient()).
The problem is that it's completely random as to when this issue occurs. When it works it takes ~2 seconds from a cold start, so taking over 5 minutes to run, and only at random points, makes no sense.
It's a dev environment, so I'm the only one using it, and I'm doing maybe 10 requests a day
context.callbackWaitsForEmptyEventLoop = false; has been set
Memory allocation never exceeds 45MB (128MB set)
I'm testing directly in Lambda
The code is deployed via Serverless
When testing locally using Serverless it works, whilst the deployed Lambda fails
I've inherited this project but have a good understanding of the architecture around it and it's fairly simple but I've not done much work with Lambda before.
Any ideas what I should look for or any known issues will be a massive help.
It sounds like one (or more) of the VPC subnets the Lambda function is configured to run in doesn't have a route to a NAT Gateway (or an AWS PrivateLink configuration). So whenever that subnet is used by the Lambda function it is unable to access the AWS API.
If the Lambda function doesn't actually need to access any resources in the VPC then it is much better to not configure it to use the VPC at all.
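While tracking this down, it can help to make the SDK fail fast instead of hanging until the Lambda timeout. Below is a minimal sketch assuming the AWS SDK for JavaScript v2 (the aws-sdk package); the specific timeout and retry values are arbitrary assumptions, and createDocumentClient is a hypothetical helper name. It takes the SDK as a parameter so that the snippet stays self-contained.

```javascript
// Fail-fast settings for the AWS SDK for JavaScript v2.
// The numbers are illustrative assumptions, not recommendations.
const FAIL_FAST_OPTIONS = {
  maxRetries: 2, // retry a couple of times, then give up
  httpOptions: {
    connectTimeout: 1000, // ms allowed to establish the connection
    timeout: 3000, // ms of socket inactivity before the request fails
  },
};

// Hypothetical helper: pass in require('aws-sdk') as AWS.
function createDocumentClient(AWS) {
  const service = new AWS.DynamoDB(FAIL_FAST_OPTIONS);
  return new AWS.DynamoDB.DocumentClient({ service });
}

// Usage inside a Lambda (aws-sdk v2 assumed to be available):
//   const AWS = require('aws-sdk');
//   const dynamoDb = createDocumentClient(AWS);
```

With settings like these, a subnet that has no route to the DynamoDB API should surface as a quick, logged error rather than a multi-minute hang.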

Lambda lost connection to RDS at 01:00 2019-01-12 (EU/London)

I have a set of Lambda functions that process messages from an SQS queue. They take data sets, process them and store the results in an RDS MySQL database, which they connect to via VPC. Both the Lambda functions and the RDS database are in the same availability zone.
This has been working for the last couple of months without any issues, but early this morning (2019-01-12) at 01:00 I started seeing lambda timeouts and messages being moved into the dead letter queue.
I've done some troubleshooting and confirmed the reason for the timeouts is the inability for Lambda to establish a connection to the database server.
The RDS server is public, but locked down to allow access only through VPC and 2 public IPs.
I've taken the following steps so far to try and resolve the issue:
Given the lambda service role admin rights to rule out IAM issues
Unassigned VPC from the Lambda functions and opened up RDS inbound access from 0.0.0.0/0 to rule out VPC issues.
Restarted the RDS hosts, the good ol' off'n'on again.
Used serverless to invoke the lambda functions locally with test data (worked). My local machine connects to the public RDS IP, not through VPC.
Changed the runtime environment from 3.6 to 3.7
It doesn't appear to be a code issue, as it's been working flawlessly for the past couple of months and I can invoke locally without issue and my Elastic Beanstalk instance, which sits on the same VPC subnet continues to connect through VPC without issue.
Here's the code I'm using to connect:
import os
from sqlalchemy import create_engine, MetaData
from sqlalchemy.pool import NullPool
connectionString = 'mysql+pymysql://{0}:{1}@{2}/{3}'.format(os.environ['DB_USER'], os.environ['DB_PASSWORD'], os.environ['DB_HOST'], os.environ['DB_SCHEMA'])
engine = create_engine(connectionString, poolclass=NullPool)
with engine.connect() as con:  # <--- breaking here
    meta = MetaData(engine, reflect=True)  # <-- never gets to here
I double checked the connection string & user accounts, both are correct/working locally.
If someone could point me in the right direction, I'd be grateful!
My first guess is that you've hit a connection limit on the RDS database. Because Lambdas can be executed concurrently (this could easily be the case if there were suddenly a lot of messages in your SQS queue), and each execution opens a new connection to your DB, the connection pool can get saturated.
If this is the case, you can set a concurrent execution limit on your Lambda function to prevent this.
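As a rough sketch of setting that limit programmatically: the AWS SDK for JavaScript v2 exposes putFunctionConcurrency on AWS.Lambda. The function name and the limit of 5 below are made-up examples, and buildConcurrencyParams and limitConcurrency are hypothetical helper names. The client is passed in, so the snippet itself does not depend on the SDK being installed.

```javascript
// Builds the request parameters for Lambda's PutFunctionConcurrency API.
function buildConcurrencyParams(functionName, limit) {
  if (!Number.isInteger(limit) || limit < 0) {
    throw new RangeError('limit must be a non-negative integer');
  }
  return { FunctionName: functionName, ReservedConcurrentExecutions: limit };
}

// lambdaClient is expected to be an AWS.Lambda instance (aws-sdk v2).
function limitConcurrency(lambdaClient, functionName, limit) {
  return lambdaClient
    .putFunctionConcurrency(buildConcurrencyParams(functionName, limit))
    .promise();
}

// Usage (function name and limit are hypothetical):
//   const AWS = require('aws-sdk');
//   limitConcurrency(new AWS.Lambda(), 'process-messages', 5);
```

Pick a limit comfortably below the database's maximum connection count, since each concurrent execution may hold one connection.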
A side note - it is not recommended to use a database with a persistent connection in a serverless architecture exactly for this reason. AFAIK, AWS is working on a better solution to use RDS from Lambda, but it's not available yet.
So...
I was changing security groups and it was having no effect on the RDS host. At one point I removed all access and could still connect, which is crazy. At this point I started to think the outage on Friday night had put the underlying RDS host into a weird state. I put the security groups back to the way they should be, then stopped and started the RDS host (a restart had no effect), and everything started to work again.
Very frustrating, but happy it's finally resolved.

Connecting from AWS Lambda to MongoDB

I'm working on a Node.js project and using what seems to be a pretty common AWS setup. My API Gateway receives a call and triggers Lambda A, then Lambda A triggers other Lambdas, say B or C, depending on the params passed from API Gateway.
Lambda A needs to access MongoDB, and to avoid the hassle of running MongoDB myself I decided to use mLab. ATM Lambda A is accessing MongoDB using the Node.js driver.
Now, so as not to open a connection on every Lambda A execution, I use a connection pool. I keep it inside the Lambda A code but outside of the handler, which allows me to reuse connections when Lambda A is invoked multiple times.
This seems to work fine.
However, I'm not sure how to deal with connections when Lambda A is invoking Lambda B and Lambda B needs to access mLab's MongoDB database. I see two options:
1. Is it possible to pass the connection pool somehow, or would Lambda B have to keep its own connection pool?
2. I was thinking of using mLab's Data API, which exposes most of the operations of the MongoDB driver, so I could use HTTP calls, e.g. GET and POST, to run commands against the database. It seems similar to RESTHeart.
I'm leaning towards option 2, but mLab's Data API documentation clearly states to avoid using the REST API unless you cannot connect using the MongoDB driver directly:
The first method—the one we strongly recommend whenever possible for
added performance and functionality—is to connect using one of the
available MongoDB drivers. You do not need to use our API if you use
the driver. The second method, documented in this article, is to
connect via mLab’s RESTful Data API. Use this method only if you
cannot connect using a MongoDB driver.
Given all this how would it be best to approach it? 1 or 2 or is there any other option I should consider?
Unfortunately you won't be able to 'share' a mongo connection across lambdas because ultimately there's a 'physical' socket to the connection which is specific to that instance.
I think both of your solutions are good depending on usage.
If you tend to have steady average concurrency on both Lambda A and B across an hour period (which is a bit of a rule of thumb as to how long AWS keeps a Lambda instance alive), then having them both own their own static connections is a good solution. This is because the chances are that a request will reach an already started and connected Lambda. I would also guess that Node drivers for 'vanilla' Mongo are more mature than those for the RESTful Data API.
However, if you get spiky or uneven load, then you might use the RESTful Data API. This is because you'll be centralising the responsibility for managing the number of open connections to your instances in a single point, which under these conditions means you're less likely to be opening unneeded connections, or using all of your current capacity and having to wait for a new connection to be established.
Ultimately it's a game of probabilistic load balancing: either you 'pool' all your connections in a central place (the Data API) and become less affected by the usage of a single function at the expense of greater latency on individual operations, or you pool at a function level but are more exposed to cold starts opening connections under uneven concurrency.

Connecting Lambda to both RDS and S3

I have a Node.js task that converts values from my database to MP3 files, then uploads them to s3 storage. The code works beautifully when executing it on my laptop. I decided to migrate it to Lambda so I can run it automatically every couple hours. I made a few minor modifications, and again, it works great. But here's the catch: it's only working when my RDS instance is set to allow connections from ANY IP. Obviously, that's an unacceptable security risk.
I put my database and Lambda code in the same VPC and security group, but even so, my code wouldn't connect to S3. Then, I added an endpoint for S3, and it looked like everything was working per my console logs. However, the file in S3 storage is empty (0 bytes).
What do I need to change? I've heard that I might need to configure my VPC to have internet access, but I'm not sure if that's what I need to do. And honestly, those tutorials seem confusing to me.
Can someone point me in the right direction?
It is a known problem (known by users, not really acknowledged by AWS that I've seen). The Lambda VPC docs say:
http://docs.aws.amazon.com/lambda/latest/dg/vpc.html
"When a Lambda function is configured to run within a VPC, it incurs
an additional ENI start-up penalty. This means address resolution may
be delayed when trying to connect to network resources."
And
"If your Lambda function accesses a VPC, you must make sure that your
VPC has sufficient ENI capacity to support the scale requirements of
your Lambda function. "
Source: https://forums.aws.amazon.com/thread.jspa?messageID=767285
This means running Lambda in a VPC has serious drawbacks that make it unworkable:
speed penalty
have to manually set up scaling
have to pay for a NAT gateway at $0.059 per hour (https://aws.amazon.com/vpc/pricing/)
