Using mysql pool on amazon lambda - node.js

I am trying to use mysql pool in my NodeJS service that is running on Amazon Lambda.
This is the beginning of my module that works with database:
console.log('init database module ...');
var settings = require('./settings.json');
var mysql = require('mysql');
var pool = mysql.createPool(settings);
As the logs in the Amazon console show, this piece of code is executed very often:
If I have just deployed the service and execute 10 requests simultaneously, all 10 requests execute this piece of code.
If I execute another 10 requests simultaneously, immediately after the first series, they don't execute this code.
If some time has passed since the last query, then some of the requests re-execute that code.
Even if I use a global, this reduces but does not eliminate the duplication:
if (!global.pool) {
  console.log('init database module ...');
  var settings = require('./settings.json');
  var mysql = require('mysql');
  global.pool = mysql.createPool(settings);
}
Moreover, if the request execution hits an error, this piece of code is executed again after the request, and global.pool is null at that moment.
So, does this mean that using a pool in Amazon Lambda is not possible?
Is there any way to make Amazon use the same pool instance every time?

Each time a Lambda function is invoked, it runs in its own, independent container. If no idle containers are available, a new one is automatically created by the service. Hence:
If I have just deployed the service and execute 10 requests simultaneously, all 10 requests execute this piece of code.
If a container is available, it may be, and very likely will be, reused. When that happens, the process is already running, so the global section doesn't run again -- the invocation starts with the handler. Therefore:
If I execute another 10 requests simultaneously, immediately after the first series, they don't execute this code.
After each invocation is complete, the container that was used is frozen, and will ultimately be either thawed and reused for a subsequent invocation, or if it isn't needed after a few minutes, it is destroyed. Thus:
If some time has passed since the last query, then some of the requests re-execute that code.
Makes sense, now, right?
The only "catch" is that the amount of time that must elapse before a container is destroyed is not a fixed value. Anecdotally, it appears to be about 15 minutes, but I don't believe it's documented, since most likely the timer is adaptive... the service can (at its descretion) use heuristics to predict whether recent activity was a spike or likely to be sustained, and probably considers other factors.
(Lambda@Edge, which is Lambda integrated with CloudFront for HTTP header manipulation, seems to operate with different timing. Idle containers seem to persist much longer, at least in small quantities, but this makes sense because they are always very small containers... and again, this observation is anecdotal.)
The global section of your code only runs when a new container is created.
Pooling doesn't make sense because nothing is shared during an invocation -- each invocation is the only one running in its container -- one per process -- at any one time.
What you will want to do, though, is lower the wait_timeout on the connections. MySQL Server doesn't have an effective way to "discover" that an idle connection has gone away entirely, so when your connection disappears along with its destroyed container, the server just sits there, and the connection remains in the Sleep state until the default wait_timeout expires. The default is 28800 seconds, or 8 hours, which is far too long. You can change this on the server, or send the query SET @@wait_timeout = 900 on each new connection (though you'll need to experiment with an appropriate value).
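For example, using the pool from the question (a sketch relying on the mysql package's pool "connection" event; the 900-second value is only illustrative):

pool.on('connection', function (connection) {
  // runs once for every new physical connection the pool opens, so connections
  // abandoned by destroyed containers are cleaned up by the server much sooner
  connection.query('SET SESSION wait_timeout = 900');
});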
Or, you can establish and destroy the connection inside the handler for each invocation. This will take a little bit more time, of course, but it's a sensible approach if your function isn't going to be running very often. The MySQL client/server protocol's connection/handshake sequence is reasonably lightweight, and frequent connect/disconnect doesn't impose as much load on the server as you might expect... although you would not want to do that on an RDS server that uses IAM token authentication, which is more resource-intensive.
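A minimal sketch of the connect-per-invocation approach, reusing the mysql module and settings.json from the question (the SELECT query is just a placeholder):

var mysql = require('mysql');
var settings = require('./settings.json');

exports.handler = function (event, context, callback) {
  // a fresh connection for this invocation only; a little slower, but nothing is left behind
  var connection = mysql.createConnection(settings);

  connection.query('SELECT 1 + 1 AS result', function (err, rows) {
    // always close the connection before ending the invocation
    connection.end(function () {
      if (err) return callback(err);
      callback(null, rows[0].result);
    });
  });
};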

Related

Can PouchDB be hosted as an idle Lambda?

I am curious if PouchDB can be used within an AWS Lambda.
The key question (in my mind) is whether PouchDB ever empties its NodeJS Event Loop (allowing the Lambda function to return and be suspended). This would be critical to getting the benefit of the Lambda.
Does anyone know if PouchDB can be configured to run only when it is actually handling a request? For example, if it schedules timers for occasional polling, this would keep the event loop busy and make it infeasible to host on a Lambda, as the execution time would be 15 minutes instead of a few hundred ms.
The purpose is to host a richly indexed database that is only available ad hoc (without a cloud instance dedicated to hosting it).
In the ideal configuration, a Lambda execution context would be brought into existence for 15 minutes when requested, and only occasionally actually run code specifically to handle incoming requests (and to carry out the replication from those requests to a persistent store), before going idle again. At some unknown point, AWS will garbage-collect the instance. Any subsequent request would essentially re-launch PouchDB from scratch.
PouchDB in a lambda would give me the benefit of incrementally updated MapReduce views on a dataset stored fast in-memory. I expect to have live replication (write-only) to a second PouchDB whose indexes were loaded on-demand from S3 (via a LevelDown S3 adapter). Between the two, this would give me persistence of the indexes, in-memory fast access, but on-demand availability.
Theoretically, this is looking positive (you can run it in a Lambda) from my investigations so far.
I used pouchdb from within Node to replicate from a couchdb, which ran to completion synchronously.
I followed up with a dump of current IO handles through wtfnode. The only outstanding IO seemed to be the REPL itself. This suggests that there are no live connections or outstanding events after a replication is complete.
Next step is to actually run it in a lambda and replicate data to pouchdb from a couchdb within the Lambda, proving that its execution will actually complete.
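A minimal sketch of that experiment moved into a Lambda handler, assuming the default Node adapter writing under /tmp (the only writable path in a Lambda container) and a placeholder remote URL:

const PouchDB = require('pouchdb');

const REMOTE_URL = 'https://couch.example.com/mydb'; // placeholder source database

exports.handler = async () => {
  const local = new PouchDB('/tmp/local-cache');

  // one-shot (non-live) replication resolves when it finishes, so nothing is
  // left on the event loop and the invocation can return promptly
  const result = await local.replicate.from(REMOTE_URL);

  return { docs_read: result.docs_read, docs_written: result.docs_written };
};

A live replication, by contrast, would keep listeners active and hold the invocation open until the Lambda timeout, which is exactly the situation the question wants to avoid.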

What can cause "idle in transaction" for "BEGIN" statements

We have a node.js application that connects via pg-promise to a Postgres 11 server - all processes are running on a single cloud server in docker containers.
Sometimes we hit a situation where the application does not react anymore.
The last time this happened, I had a little time to check the DB via pgAdmin, and it showed that the connections were "idle in transaction" with the statement BEGIN and an exclusive lock on a virtualxid.
I think the situation is like this:
1. the application has started a transaction by sending the BEGIN SQL command to the db
2. the db got this command and started a new transaction, and thus acquired an exclusive lock of mode virtualxid
3. now the db waits for the application to send the next statement(s) (until it receives COMMIT or ROLLBACK) - and then it will release the exclusive lock of mode virtualxid
4. but for some reason it does not get any more statements:
I think that the node.js event loop is blocked, because at the time when we see these locks, the node.js application does not log any more statements. But the web server still gets requests and reported some upstream timed-out requests.
Does this make sense (I'm really not sure about 2. and 3.)?
Why would all transactions block at the beginning? Is this just coincidence or is the displayed SQL maybe wrong?
BTW: In this answer I found that we can set idle_in_transaction_session_timeout so that these transactions will be released after a timeout - which is great, but I'm trying to understand what's causing this issue.
The transactions are not blocking at all. The database is waiting for the application to send the next statement.
The lock on the transaction ID is just a technique for transactions to block each other, even if they are not contending for a table lock (for example, if they are waiting for a row lock): each transaction holds an exclusive lock on its own transaction ID, and if it has to wait for a concurrent transaction to complete, it can just request a lock on that transaction's ID (and be blocked).
If all transactions look like this, then the lock must be somewhere in your application; the database is not involved.
When looking for processes blocked in the database, look for rows in pg_locks where granted is false.
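For example, with pg-promise (the db object and connection string here are illustrative):

const pgp = require('pg-promise')();
const db = pgp(process.env.DATABASE_URL); // connection string assumed to be in an env var

async function showBlocked() {
  // only rows with granted = false represent sessions actually waiting for a lock
  const blocked = await db.any(
    'SELECT pid, locktype, mode, relation::regclass AS relation FROM pg_locks WHERE NOT granted'
  );
  console.log(blocked);
}

showBlocked().finally(() => pgp.end());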
Your interpretation is correct. As for why it is happening, that is hard to say. It seems like there is some kind of bug (maybe an undetected deadlock) in your application, or maybe in node.js or pg-promise. You will have to debug at that level.
As expected the problems were caused by our application code. Transactions were used incorrectly:
One of the REST endpoints started a new transaction right away, using Database.tx().
This transaction was passed down multiple levels, but one function in the chain had an error and passed undefined instead of the transaction to the next level
the lowest repository level function started a new transaction (because the transaction parameter was undefined), by using Database.tx() a second time
This started to fail under heavy load:
The connection pool size was set to 10
When there were many simultaneous requests for this endpoint, we had a situation where 10 of the requests had started (opened the outer transaction) but had not yet reached the repository code that would request the 2nd transaction.
When these requests reached the repository code, they requested a new (2nd) connection from the connection pool. But this call blocked, because all connections were already in use.
So we had a nasty application-level deadlock.
So the solution was to fix the application code (the intermediate function must pass down the transaction correctly). Then everything works.
Moreover, I strongly recommend setting a sensible idle_in_transaction_session_timeout and connection timeout. Then, even if such an application deadlock is introduced again in future versions, the application can recover automatically after the timeout.
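For example, a sketch of applying the server-wide setting via pg-promise (requires superuser rights; the 30-second value is only illustrative and should be larger than your longest legitimate transaction):

const pgp = require('pg-promise')();
const db = pgp(process.env.DATABASE_URL); // connection string assumed to be in an env var

async function applyTimeout() {
  // kill sessions that sit idle inside an open transaction for too long
  await db.none("ALTER SYSTEM SET idle_in_transaction_session_timeout = '30s'");
  await db.none('SELECT pg_reload_conf()'); // make the new setting take effect
}

applyTimeout().finally(() => pgp.end());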
Notes:
pg-promise before v10.3.4 contained a small bug #682 related to the connection timeout
pg-promise before version 10.3.5 could not recover from an idle-in-transaction timeout and left the connection in a broken state: see pg-promise #680
Basically there was another issue: there was no need to use a transaction - because all functions were just reading data: so we can just use Database.task() instead of Database.tx()
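A minimal sketch of the corrected pattern (function and table names are made up for illustration): one Database.task() at the top, and the same context t passed all the way down, so the repository level never asks the pool for a second connection:

const pgp = require('pg-promise')();
const db = pgp(process.env.DATABASE_URL); // connection string assumed to be in an env var

// REST handler level: one task is enough, since everything below only reads data
function getUserReport(userId) {
  return db.task(t => intermediateLevel(t, userId)); // pass the context down
}

// intermediate level: forward the same context, never undefined
function intermediateLevel(t, userId) {
  return repositoryLevel(t, userId);
}

// repository level: run queries on the context it was given,
// never open another connection/transaction from the pool here
function repositoryLevel(t, userId) {
  return t.any('SELECT * FROM reports WHERE user_id = $1', [userId]);
}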

Loading up clients in Azure Functions

I'm creating an Azure Function that will run in consumption mode and will get triggered by messages in a queue.
The function will typically need to make a database call when it gets triggered. I "assume" the function gets launched and loaded into memory when it gets triggered, and that when it's idle it gets terminated, because it's running in consumption mode.
Based on this assumption, I don't think I can load up a singleton instance of my back-end client which includes the logic for making database calls.
Is new'ing up my back-end client every time I need to perform some back-end operations the right approach, then?
This is a wrong assumption. Your function will be loaded during the first call, and will be unloaded only after an idle timeout (5 or 10 minutes).
You will not pay for idling, but you will pay for the whole time that your function was running, including the wait time during the database calls (or other IO).
Singletons and statics work just fine, and you should reuse instances like HttpClient between calls.
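A minimal Node.js sketch of that idea for a queue-triggered function; the backendClient module and its methods are hypothetical stand-ins for the question's back-end client:

// module scope: runs once per function host process, not once per invocation
const { BackendClient } = require('./backendClient'); // hypothetical client module
const client = new BackendClient(process.env.DB_CONNECTION_STRING);

// the same client (and any connection pool it holds) is reused for every
// message processed while the function host stays warm
module.exports = async function (context, queueItem) {
  const result = await client.handleMessage(queueItem); // hypothetical method
  context.log('processed queue item', result);
};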

How to get the status of all requests to one API in nodejs

I want to get the API server status in nodejs. I'm using nodejs to expose an interface: "api/request?connId=50&timeout=90". This API keeps the request running for the provided time on the server side, and after that time has elapsed it should return status OK. When there are multiple connection ids and timeouts, we want the API to return all the running requests on the server with their time left for completion, something like below, where 4 and 8 are the connIds and 25 and 15 are the times remaining (in seconds) for the requests to complete:
{"4":"25","8":"15"}
Please help.
A Node.js server uses an async model on a single thread, which means that at any time only one request (connId) is being executed by Node (unless you run multiple Node.js instances, but let's keep the scenario simple and ignore that case).
When one request is processed (its handler code is running), it may start an async task such as reading a file, and then continue executing. The handler code does not wait for the async task, and once the handler code finishes running, from Node.js's point of view the handling of that request is done -- handling the async task's result is a separate thing at a separate time, and Node does not track its progress for you.
Thus, in order to return the remaining time of all requests (I assume this means the remaining time of each request's async task, because the remaining time of another request's handler code execution does not make sense), there must be some place to store the information of all requests, including:
request's connId and startTime (the time when request is received).
request's timeout value, which is passed as parameter in URL.
request's estimated remaining time; this information is task-specific and must be retrieved from the services behind the async task (you can poll it periodically using setInterval, or have those services push the latest remaining time). Node.js itself doesn't know the remaining time of any async task.
In this way, you can track all running requests and their remaining time. Before one request is returned, you can check the above "some place" to calculate all requests' remaining time. This "some place" could be global variable, memory database such as Redis, or even a plain database such as MySQL.
Please note: the calculated remaining time will not be perfectly accurate, as the reads and calculations themselves take time and introduce error.
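A rough sketch of that idea using an in-process global as the "some place" (Express is used here only for brevity, and the /api/status route name is made up):

const express = require('express');
const app = express();

// global registry of running requests: connId -> { start, timeout }
const running = {};

// api/request?connId=50&timeout=90 : hold the request for <timeout> seconds, then answer OK
app.get('/api/request', (req, res) => {
  const connId = req.query.connId;
  const timeout = parseInt(req.query.timeout, 10); // seconds
  running[connId] = { start: Date.now(), timeout };

  setTimeout(() => {
    delete running[connId];
    res.json({ status: 'OK' });
  }, timeout * 1000);
});

// status endpoint: remaining seconds for every request still being held
app.get('/api/status', (req, res) => {
  const result = {};
  for (const id of Object.keys(running)) {
    const elapsed = (Date.now() - running[id].start) / 1000;
    result[id] = String(Math.max(0, Math.round(running[id].timeout - elapsed)));
  }
  res.json(result); // e.g. {"4":"25","8":"15"}
});

app.listen(3000);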

Run Node JS on a multi-core cluster cloud

Is there a service or framework or any way that would allow me to run Node JS for heavy computations letting me choose the number of cores?
I'll be more specific: let's say I want to run some expensive computation for each of my users and I have 20000 users.
So I want to run the expensive computation for each user on a separate thread/core/computer, so I can finish the computation for all users faster.
But I don't want to deal with low level server configuration, all I'm looking for is something similar to AWS Lambda but for high performance computing, i.e., letting me scale as I please (maybe I want 1000 cores).
I did simulate this with AWS Lambda by having a "master" lambda that receives the data for all 20000 users and then calls a "computation" lambda for each user. Problem is, with AWS Lambda I can't make 20000 requests and wait for their callbacks at the same time (I get a request limit exceeded error).
With some setup I could use Amazon HPC, Google Compute Engine or Azure, but they only go up to 64 cores, so if I need more than that, I'd still have to set up all the machines I need separately and orchestrate the communication between them with something like Open MPI, handling the different low-level setups for master and compute instances (accessing via ssh, etc.).
So is there any service I can just paste my Node JS code, maybe choose the number of cores and run (not having to care about OS, or how many computers there are in my cluster)?
I'm looking for something that can take that code:
var users = [...];

function expensiveCalculation(user) {
  // ...
  return ...;
}

users.forEach(function(user) {
  Thread.create(function() {
    save(user.id, expensiveCalculation(user));
  });
});
And run each thread on a separate core so they can run simultaneously (therefore finishing faster).
I think that your problem is that you feel the need to process 20000 inputs at once on the same machine. Have you looked into SQS from Amazon? Maybe you push those 20000 inputs into SQS and then have a cluster of servers pull from that queue and process each one individually.
With this approach you could add as many servers, processes, or AWS Lambda invocations as you want. You could even use a combination of the three to see what's cheaper or faster. Adding resources will only reduce the amount of time it takes to complete the computations, and you wouldn't have to wait for 20000 requests or anything to complete. The process could tell you when it completes a computation by sending a notification after it finishes.
So basically, you could have a simple application that just grabbed 10 of these inputs at a time and ran your computation on them. After it finishes you could then have this process delete them from SQS and send a notification somewhere (Maybe SNS?) to notify the user or some other system that they are done. Then it would repeat the process.
After that you could scale the process horizontally, and you wouldn't need a supercomputer in order to process this. So you could either get a cluster of EC2 instances that each run several of these applications, or have a Lambda function invoked periodically to pull items out of SQS and process them.
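A rough sketch of the queue-consumer idea with the AWS SDK; the queue URL is a placeholder, and save() and expensiveCalculation() are the functions from the question's snippet:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });

const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/user-computations'; // placeholder

async function poll() {
  // pull up to 10 inputs at a time, using long polling
  const { Messages = [] } = await sqs.receiveMessage({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20,
  }).promise();

  for (const msg of Messages) {
    const user = JSON.parse(msg.Body);
    save(user.id, expensiveCalculation(user));
    // delete the message only after the result has been saved
    await sqs.deleteMessage({ QueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle }).promise();
  }

  setImmediate(poll); // keep pulling
}

poll();

Run as many copies of this consumer (EC2 instances, containers, or periodically invoked Lambdas) as your time budget requires; each one works through the queue independently.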
EDIT:
To get started using an EC2 instance, I would look at the docs here. To start with, I would pick the smallest, cheapest instance (t2.micro, I think) and leave everything at its defaults. There's no need to open any port other than the one for SSH.
Once it's set up and you log in, the first thing to do is run aws configure to set up your profile so that you can access AWS resources from the instance. After that, install Node and get your application on there using git or something. Once that's done, go to the EC2 console, and in the Actions menu there will be an option to create an image from the instance.
Once you create an image, then you can go to Auto Scaling groups and create a launch configuration using that AMI. Then it'll let you specify how many instances you want to run.
I feel like this could also be done more easily using their container service, but honestly I don't know how to use it yet.
