Can PouchDB be hosted as an idle Lambda? - node.js

I am curious if PouchDB can be used within an AWS Lambda.
The key question (in my mind) is whether PouchDB ever empties its NodeJS Event Loop (allowing the Lambda function to return and be suspended). This would be critical to getting the benefit of the Lambda.
Does anyone know if PouchDB can be configured to only run when it is actually handling a request. For example if it schedules timers for occasional polling this would keep the eventloop full and make it infeasible to host on a Lambda as the execution time would be 15 minutes instead of a few hundred ms.
The purpose is to host a richly-indexed database but which is only available ad-hoc (without a cloud instance dedicated to hosting it).
In the ideal configuration, a Lambda execution context would be brought into existence for 15 minutes when requested, and only occasionally actually run code specifically to handle incoming requests, (and to carry out the replication from those requests to a persistent store), before going idle again. At some unknown point, AWS will garbage collect the instance. Any subsequent request would essentially re-lauch the PouchDB from scratch.
PouchDB in a lambda would give me the benefit of incrementally updated MapReduce views on a dataset stored fast in-memory. I expect to have live replication (write-only) to a second PouchDB whose indexes were loaded on-demand from S3 (via a LevelDown S3 adapter). Between the two, this would give me persistence of the indexes, in-memory fast access, but on-demand availability.

Theoretically, this is looking positive (you can run it in a Lambda) from my investigations so far.
I used pouchdb from within Node to replicate from a couchdb, which ran to completion synchronously.
I followed up with a dump of current IO handles through wtfnode. The only outstanding IO seemed to be the REPL itself. This suggests that there are no live connections or outstanding events after a replication is complete.
Next step is to actually run it in a lambda and replicate data to pouchdb from a couchdb within the Lambda, proving that its execution will actually complete.

Related

Is there a way to read a database link in Cosmos DB Java V4 API?

For example, reading "dbs/colls/document" instead of getting a container, then calling read on the container.
I've been having an issue where the first readItem on a container (after calling database.getContainer(x)) is extremely slow (like 1 second or longer) and was thinking using a database link could be faster.
I'm guessing a read after getting the container is slow because it doesn't make a service call until I call read.
Is there a way I can have this preloaded when reading in a database?
I have an application with a read(collectionName, key) method, and my approach was to use getContainer(collectionName) and then call read on that, but this method needs to be fast.
As discussed, the best practice is to keep an instance of your container alive between requests and call readItem on each request. This should resolve the primary issue.
As for the secondary concern, the "high latency every 50 requests or so", this is a known issue however it should only occur in the first minute or so of operation. If you can tolerate the initial slow requests, the solution is to wait for performance to stabilize. How long do you have to run your app for before you no longer see these high-latency requests?
FYI, if latency is a concern, run your client application in a geographically colocated Azure VM. Also a good rule of thumb is to allocate client CPU cores such that CPU utilization is not more than 40% or 50%.

How to avoid database from being hit hard when API is getting bursted?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API has resulted in it making many queries to the database. The rate can burst beyond 30 queries per sec and there can be a few hundred thousands of requests to fulfil. The queries are mostly the same, except for the product ID in the where clause. The column has been indexed and it takes an average of only 5-8ms to complete. Still, the connection to the database occasionally time out when the rate gets too high.
I'm using Sequelize as my ORM and the error I get when it time out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and it max'ed out the pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most
of the time, 90% of the product IDs in the requests are not repeated.
This would mean that 90% of the time, it would be a cache miss and it
will still query against the database.
Auto scale up the database. But because the calls are bursty and I don't
know when they may come, the autoscaling won't complete in time to
avoid the time out. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts. So I doubt scaling it would do much too.
What other techniques can I do to avoid the database from being hit hard when the API is getting burst calls which are mostly unique and difficult to cache?
Use cache in the boot time
You can load all necessary columns into an in-memory data storage (redis). Every update in database (cron job) will affect cached data.
Problems: memory overhead of updating cache
Limit db calls
Create a buffer for ids. Store n ids and then make one query for all of them. Or empty the buffer every m seconds!
Problems: client response time extra process for query result
Change your database
Use NoSql database for these data. According to this article and this one, I think choosing NoSql database is a better idea.
Problems: multiple data stores
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS lambda throttling to handle this problem. But, for that to work the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
Sorry to say, you may need to stop using lambda. Ordinary web servers have good stuff in them to manage burst workload.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accept the connection. When the server is busy requests wait in that queue. When there's a high load the requests wait for a bit longer in that queue. In nodejs's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections it it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available the request-handling pauses until one is available, then handles it. This too smooths out the requests to the DBMS. (Be aware that each process in a nodejs cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.

Connecting from AWS Lambda to MongoDB

I'm working on a NodeJS project and using pretty common AWS setup it seems. My ApiGateway receives call, triggers lambda A, then this lambda A triggers other lambdas, say B or C depending on params passed from ApiGateway.
Lambda A needs to access MongoDB and to avoid hassle with running MongoDB myself I decided to use mLab. ATM Lambda A is accessing MongoDB using NodeJS driver.
Now, not to start connection with every Lambda A execution I use connection pool, again, inside of Lambda A code, outside of handler I keep connection pool that allows me to reuse connections when Lambda A is invoked multiple times.
This seems to work fine.
However, I'm not sure how to deal with connections when Lambda A is invoking Lambda B and Lambda B needs to access mLab's MongoDB database.
Is it possible to pass connection pool somehow or Lambda B would have to keep its own connection pool?
I was thinking of using mLab's Data API that exposes most of the operations of MongoDB driver and so I could use HTTP calls e.g. GET and POST to run commands against database. It seems similar to RESTHeart it seems.
I'm leaning towards option 2 but on mLab's Data API it clearly states to avoid using REST api unless cannot connect using MongoDB driver directly:
The first method—the one we strongly recommend whenever possible for
added performance and functionality—is to connect using one of the
available MongoDB drivers. You do not need to use our API if you use
the driver. The second method, documented in this article, is to
connect via mLab’s RESTful Data API. Use this method only if you
cannot connect using a MongoDB driver.
Given all this how would it be best to approach it? 1 or 2 or is there any other option I should consider?
Unfortunately you won't be able to 'share' a mongo connection across lambdas because ultimately there's a 'physical' socket to the connection which is specific to that instance.
I think both of your solutions are good depending on usage.
If you tend to have steady average concurrency on both lambda A and B across an hour period (which is a bit of a rule of thumb as to how long AWS keeps a lambda instance alive), then having them both own their own static connections is a good solution. This is because the chances are that a request will reach an already started and connected lambda. I would also guess that node drivers for 'vanilla' mongo are more mature than those for the RESTFul Data API.
However if you get spikey or uneven load, then you might use the RESTFul Data API. This is because you'll be centralising the responsibility for managing the number of open connections to your instances to a single point, which under these conditions means you're less likely to be opening unneeded connections, or using all of your current capacity and having to wait for a new connection to be established.
Ultimately it's a game of probabilistic load balancing- either you 'pool' all your connections in a central place (the Data API) and become less affected by the usage of a single function at the expense of greater latency on individual operations, or you pool at a function level but are more exposed to cold-starts opening connections under uneven concurrency.

Server constantly running a function to update a cache, will it block all other server functions?

About once a minute, I need to cache all orderbooks from various cryptocurrency exchanges. There are hundreds of orderbooks, so this update function will likely never stop running.
My question is: If my server is constantly running this orderbook update function, will it block all other server functionality? Will users ever be able to interact with my server?
Do I need to create a separate service to perform the updating, or can Node somehow prioritize API requests and pause the caching function?
My question is: If my server is constantly running this orderbook
update function, will it block all other server functionality? Will
users ever be able to interact with my server?
If you are writing asynchronously, these actions will go into your eventloop and your node server would pick next event from eventloop while these actions are being performed. If you have too many events like this, your event queue would be long and user would face really slow response or may even get a timeout
Do I need to create a separate service to perform the updating, or can
Node somehow prioritize API requests and pause the caching function?
Node only consumes event from the event queue. There are no priorities.
From the design perspective, you should look for options which can reduce this write load like bulkCreate/edit or if you are using redis for cache, consider redis pipeline
This is a very open ended question much of which depends on your system. In general your server should be able to handle concurrent requests, but there are some things to watch out for.
Performance costs. If the operation to retrieve and store data requires too much computational power, then it will cause strain on all requests processed by the server.
Database connections. The server spends a lot of time waiting for database queries to complete. If you have one database connection for the entire application, and this connection is busy, they will have to wait until the database connection is free. You may want to look into database connection 'pooling'.

Connection pool using pg-promise

I'm using Node js and Postgresql and trying to be most efficient in the connections implementation.
I saw that pg-promise is built on top of node-postgres and node-postgres uses pg-pool to manage pooling.
I also read that "more than 100 clients at a time is a very bad thing" (node-postgres).
I'm using pg-promise and wanted to know:
what is the recommended poolSize for a very big load of data.
what happens if poolSize = 100 and the application gets 101 request simultaneously (or even more)?
Does Postgres handles the order and makes the 101 request wait until it can run it?
I'm the author of pg-promise.
I'm using Node js and Postgresql and trying to be most efficient in the connections implementation.
There are several levels of optimization for database communications. The most important of them is to minimize the number of queries per HTTP request, because IO is expensive, so is the connection pool.
If you have to execute more than one query per HTTP request, always use tasks, via method task.
If your task requires a transaction, execute it as a transaction, via method tx.
If you need to do multiple inserts or updates, always use multi-row operations. See Multi-row insert with pg-promise and PostgreSQL multi-row updates in Node.js.
I saw that pg-promise is built on top of node-postgres and node-postgres uses pg-pool to manage pooling.
node-postgres started using pg-pool from version 6.x, while pg-promise remains on version 5.x which uses the internal connection pool implementation. Here's the reason why.
I also read that "more than 100 clients at a time is a very bad thing"
My long practice in this area suggests: If you cannot fit your service into a pool of 20 connections, you will not be saved by going for more connections, you will need to fix your implementation instead. Also, by going over 20 you start putting additional strain on the CPU, and that translates into further slow-down.
what is the recommended poolSize for a very big load of data.
The size of the data got nothing to do with the size of the pool. You typically use just one connection for a single download or upload, no matter how large. Unless your implementation is wrong and you end up using more than one connection, then you need to fix it, if you want your app to be scalable.
what happens if poolSize = 100 and the application gets 101 request simultaneously
It will wait for the next available connection.
See also:
Chaining Queries
Performance Boost
what happens if poolSize = 100 and the application gets 101 request simultaneously (or even more)? Does Postgres handles the order and makes the 101 request wait until it can run it?
Right, the request will be queued. But it's not handled by Postgres itself, but by your app (pg-pool). So whenever you run out of free connections, the app will wait for a connection to release, and then the next pending request will be performed. That's what pools are for.
what is the recommended poolSize for a very big load of data.
It really depends on many factors, and no one will really tell you the exact number. Why not test your app under huge load and see in practise how it performs, and find the bottlenecks.
Also I find the node-postgres documentation quite confusing and misleading on the matter:
Once you get >100 simultaneous requests your web server will attempt to open 100 connections to the PostgreSQL backend and 💥 you'll run out of memory on the PostgreSQL server, your database will become unresponsive, your app will seem to hang, and everything will break. Boooo!
https://github.com/brianc/node-postgres
It's not quite true. If you reach the connection limit at Postgres side, you simply won't be able to establish a new connection until any previous connection is closed. Nothing will break, if you handle this situation in your node app.

Resources