In Node.js, why would one want to use pools when connecting through node-postgres?

With the node-postgres npm package, I'm given two connection options: using a Client or using a Pool.
What would be the benefit of using a Pool instead of a Client? What problem does it solve for me in the context of Node.js, which is a) async, and b) won't die and disconnect from Postgres after every HTTP request (as classic PHP would, for example)?
What would be the technicalities of using a single instance of Client vs using a Pool from within a single container running a node.js server? (e.g. Next.js, or Express, or whatever).
My understanding is that with server-side languages like PHP (classic sync php), Pool would benefit me by saving time on multiple re-connections. But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?

PostgreSQL's architecture practically calls for pooling: its developers decided that forking a process for each connection to the database was the safest choice, and this hasn't changed since the start.
Modern middleware that sits between the application and the database (in your case node-postgres) opens and closes virtual connections while managing the underlying "physical" connections to the Postgres database as efficiently as possible.
This means connection time can be reduced a lot: "closed" connections are not really closed but returned to the pool, and "opening" a new connection hands out one of those already-established physical connections, reducing the actual forking going on on the database side.
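As a rough sketch of what that checkout/return cycle looks like with node-postgres (the connection settings and the pool size of 10 are illustrative assumptions, not recommendations):

```js
const { Pool } = require('pg');

// All settings here are placeholders for illustration.
const pool = new Pool({
  host: 'localhost',
  database: 'mydb',
  max: 10, // at most 10 physical connections to Postgres
});

async function example() {
  // Checkout: reuses an idle physical connection if one exists,
  // only forking a new backend on the Postgres side if none is free.
  const client = await pool.connect();
  try {
    const res = await client.query('SELECT NOW()');
    console.log(res.rows[0]);
  } finally {
    // "Close": the connection goes back to the pool, not to Postgres.
    client.release();
  }
}

example().catch(console.error);
```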
Node-postgres themselves write about the pros on their website, and they recommend you always use pooling:
Connecting a new client to the PostgreSQL server requires a handshake
which can take 20-30 milliseconds. During this time passwords are
negotiated, SSL may be established, and configuration information is
shared with the client & server. Incurring this cost every time we
want to execute a query would substantially slow down our application.
The PostgreSQL server can only handle a limited number of clients at a
time. Depending on the available memory of your PostgreSQL server you
may even crash the server if you connect an unbounded number of
clients. note: I have crashed a large production PostgreSQL server
instance in RDS by opening new clients and never disconnecting them in
a python application long ago. It was not fun.
PostgreSQL can only process one query at a time on a single connected
client in a first-in first-out manner. If your multi-tenant web
application is using only a single connected client all queries among
all simultaneous requests will be pipelined and executed serially, one
after the other. No good!
https://node-postgres.com/features/pooling
I think that snippet expresses it clearly.
"But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?"
Yes, but the number of simultaneous connections to the database itself is limited, and when too many clients try to connect at the same time, the database does not handle it elegantly. A pool mitigates this by taking the queuing and error handling that no database is specialized in out of the database itself and handling them on the application side.
"What exactly is not elegant and how is it more elegant with pooling?"
Without pooling, a database stops responding or a connection times out without any feedback to the end user (and often with few clues even for the server admin). The database also depends on hardware to a greater extent than a JavaScript program does, so its risk of failure is higher. Those are my main "not elegant" arguments.
Pooling is better because:
a) As node-postgres wrote in my link above: "Incurring the cost of a db handshake every time we want to execute a query would substantially slow down our application."
b) Postgres can only process one query at a time on a single connected client (which is what Node would do without the pool) in a first-in first-out manner. All queries among all simultaneous requests will be pipelined and executed serially, one after the other. Recipe for disaster.
c) A node-based pooling component is in my opinion a better interface for enhancements, like request queuing, logging and error handling compared to a single-threaded connection.
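As a minimal sketch of how this looks in practice, assuming an Express app and a hypothetical users table (with no pool arguments, node-postgres falls back to the standard PG* environment variables):

```js
const express = require('express');
const { Pool } = require('pg');

const pool = new Pool({ max: 10 }); // other settings come from PG* env vars

// (c) the pool is a natural central place for logging and error handling:
pool.on('error', (err) => console.error('idle client error', err));

const app = express();

app.get('/users/:id', async (req, res) => {
  // pool.query checks a connection out, runs the query and releases it.
  // Concurrent requests each get their own connection, avoiding both the
  // per-query handshake (a) and the serial one-client pipeline (b).
  const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [
    req.params.id,
  ]);
  res.json(rows[0]);
});

app.listen(3000);
```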
Background:
According to the PostgreSQL developers themselves, pooling IS needed, but was deliberately not built into Postgres itself. They write:
"If you look at any graph of PostgreSQL performance with number of connections on the x axis and tps on the y access (with nothing else changing), you will see performance climb as connections rise until you hit saturation, and then you have a "knee" after which performance falls off. A lot of work has been done for version 9.2 to push that knee to the right and make the fall-off more gradual, but the issue is intrinsic -- without a built-in connection pool or at least an admission control policy, the knee and subsequent performance degradation will always be there.
The decision not to include a connection pooler inside the PostgreSQL server itself has been taken deliberately and with good reason:
In many cases you will get better performance if the connection pooler is running on a separate machine;
There is no single "right" pooling design for all needs, and having pooling outside the core server maintains flexibility;
You can get improved functionality by incorporating a connection pool into client-side software; and finally
Some client side software (like Java EE / JPA / Hibernate) always pools connections, so built-in pooling in PostgreSQL would then be wasteful duplication.
Many frameworks do the pooling in a process running on the database server machine (to minimize latency effects from the database protocol) and accept high-level requests to run a certain function with a given set of parameters, with the entire function running as a single database transaction. This ensures that network latency or connection failures can't cause a transaction to hang while waiting for something from the network, and provides a simple way to retry any database transaction which rolls back with a serialization failure (SQLSTATE 40001 or 40P01).
Since a pooler built in to the database engine would be inferior (for the above reasons), the community has decided not to go that route."
And they continue with their top reasons why performance degrades with many connections to Postgres:
Disk contention. If you need to go to disk for random access (ie your data isn't cached in RAM), a large number of connections can tend to force more tables and indexes to be accessed at the same time, causing heavier seeking all over the disk. Seeking on rotating disks is massively slower than sequential access so the resulting "thrashing" can slow systems that use traditional hard drives down a lot.
RAM usage. The work_mem setting can have a big impact on performance. If it is too small, hash tables and sorts spill to disk, bitmap heap scans become "lossy", requiring more work on each page access, etc. So you want it to be big. But work_mem RAM can be allocated for each node of a query on each connection, all at the same time. So a big work_mem with a large number of connections can cause a lot of the OS cache to be periodically discarded, forcing more accesses to disk; or it could even put the system into swapping. So the more connections you have, the more you need to make a choice between slow plans and thrashing cache/swapping.
Lock contention. This happens at various levels: spinlocks, LW locks, and all the locks that show up in pg_locks. As more processes compete for the spinlocks (which protect LW locks acquisition and release, which in turn protect the heavyweight and predicate lock acquisition and release) they account for a high percentage of CPU time used.
Context switches. The processor is interrupted from working on one query and has to switch to another, which involves saving state and restoring state. While the core is busy swapping states it is not doing any useful work on any query. Context switches are much cheaper than they used to be with modern CPUs and system call interfaces but are still far from free.
Cache line contention. One query is likely to be working on a particular area of RAM, and the query taking its place is likely to be working on a different area; causing data cached on the CPU chip to be discarded, only to need to be reloaded to continue the other query. Besides that the various processes will be grabbing control of cache lines from each other, causing stalls. (Humorous note, in one oprofile run of a heavily contended load, 10% of CPU time was attributed to a 1-byte noop; analysis showed that it was because it needed to wait on a cache line for the following machine code operation.)
General scaling. Some internal structures allocated based on max_connections scale at O(N^2) or O(N*log(N)). Some types of overhead which are negligible at a lower number of connections can become significant with a large number of connections.
Source: the PostgreSQL wiki page "Number Of Database Connections".
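If you want to see where your own setup sits on that curve, a small sketch like this can watch the backend count while you load-test (it assumes standard PG* environment variables; the query itself is plain Postgres):

```js
const { Pool } = require('pg');
const pool = new Pool(); // settings come from PG* env vars

// Counts the backends currently connected to this database.
async function connectionCount() {
  const { rows } = await pool.query(
    'SELECT count(*) AS n FROM pg_stat_activity WHERE datname = current_database()'
  );
  return Number(rows[0].n); // count(*) arrives as a string (bigint)
}

connectionCount().then((n) => console.log(`${n} open connections`));
```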

Related

Is using Pool instead of Client in node-postgres useful despite Nodejs being single threaded?

I am using Node.js with Express to build a REST API backed by a Postgres database, using the node-postgres package.
My question is whether I should use Client or Pool? I found this answer:
How can I choose between Client or Pool for node-postgres
but I don't understand what the use of a Pool would be, since Node.js is single-threaded, and there won't be two attempts to use a single connection at the same time even when concurrent requests occur.
Also, by using a single connection I can benefit from prepared statements much more efficiently: I can prepare them during the initialization phase of my app and then execute them whenever a request arrives.
Yes, since PostgreSQL itself can still work in parallel: it runs one backend process per connection.
When making a database request, your process spends 0% of its CPU time executing code. Yes, you've read that right: zero.
The computer does not execute code in order to wait. Instead it sets up interrupt handlers and tells the hardware (the ethernet card or wifi module) to send it an interrupt when there is data. Regardless of the number of requests you make to your database, you still have only ONE ethernet card in your PC (some servers have several and gain bandwidth by trunking, but the number of network cards you have bears no relationship to the number of threads you are running; it is more related to the amount of money you are willing to spend). Your hardware still basically sends all the requests out one bit at a time.
A traditional multi-threaded server therefore spends exactly the same amount of CPU time as node.js waiting for responses from the database: zero. Node.js improves efficiency simply by not needing to allocate a lot of RAM for each of those waiting threads, since it has only one thread.
Even when you are running your database on the same computer as your node process, communication with the database is not overly parallel: the TCP/IP stack itself more or less serializes the communication, and while it does not go through the networking hardware, the OS still schedules the responses using OS-level events (instead of hardware interrupts).
So yes, it makes sense for your node.js process to make multiple parallel connections to the database even though node is single-threaded: it allows the database to process requests in multiple backend processes. You are making use of your database's parallelism instead of forcing the database to process everything over node's single connection, one query at a time.
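A sketch that makes the difference visible, using pg_sleep to simulate a slow query (timings are approximate, and the connection settings are assumed to come from the standard PG* environment variables):

```js
const { Client, Pool } = require('pg');

async function withSingleClient() {
  const client = new Client();
  await client.connect();
  console.time('single client');
  // One connection: node-postgres queues these internally and Postgres
  // runs them first-in first-out, so this takes ~5 seconds.
  await Promise.all(
    Array.from({ length: 5 }, () => client.query('SELECT pg_sleep(1)'))
  );
  console.timeEnd('single client');
  await client.end();
}

async function withPool() {
  const pool = new Pool({ max: 5 });
  console.time('pool');
  // Five connections: five backend processes work at once, ~1 second.
  await Promise.all(
    Array.from({ length: 5 }, () => pool.query('SELECT pg_sleep(1)'))
  );
  console.timeEnd('pool');
  await pool.end();
}

withSingleClient().then(withPool).catch(console.error);
```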

How to find optimal size of connection pool for single mongo nodejs driver

I am using the official mongo nodejs driver with default settings, but digging deeper into the options today I noticed maxPoolSize, which is set to 100 by default.
My understanding of this is that a single nodejs process can establish up to 100 connections, thus allowing Mongo to handle up to 100 reads/writes simultaneously, in parallel?
If so, it seems that setting this number higher could only benefit the performance, but I am not sure hence decided to ask here.
Assuming default setup with no indexes, is there a way to determine (based on cpu's and memory of the db) what the optimal connection number for pool should be?
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
Good question =)
it seems that setting this number higher could only benefit the performance
It does indeed seem that way, and it would be the case for an abstract nodejs process in a vacuum with unlimited resources. But connections are not free, so there are things to consider (see the sketch after this list):
Limited connection quota on the server. Atlas in particular, but even a self-hosted cluster has only ~65k sockets. Remember the driver keeps them open for reuse, and the default timeout per cursor is 30 minutes of inactivity.
A single thread client-side. BSON serialisation blocks the event loop and is quite expensive; see for example the flamechart in this answer: https://stackoverflow.com/a/72264469/1110423. By blocking the loop, you increase the time that cursors from the previous point remain open, and in the worst case you get performance degradation.
Limited RAM. Each connection requires ~1 MB server-side.
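For reference, a minimal sketch of tuning these knobs on the official driver (the URI and all values are illustrative placeholders, not recommendations):

```js
const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017', {
  maxPoolSize: 100,      // the default; upper bound of sockets per client
  minPoolSize: 10,       // keep a few connections warm
  maxIdleTimeMS: 60_000, // close sockets idle for more than a minute
});
```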
Assuming default setup with no indexes
You always have at least the _id index, and you should have more if we are talking about performance.
is there a way to determine what the optimal connection number for pool should be?
I'd love to know that too. There are too many factors to consider, not only CPU/RAM but also data shape, query patterns, etc. This is what dbops are for. A Mongo cluster requires some attention, monitoring and adjustment for optimal operation. In many cases it's more cost-efficient to scale up the cluster than to optimise the app.
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
This is quite a wild assumption. A process cannot scale horizontally; that's decided at the OS level, and once you have a process, the pool is locked to it until it dies. You can use the Node cluster module to utilise all CPU cores, and you can even have multiple servers running the same nodejs app and balance the load across them, but none of these processes will share connections from the pool. The pool is local to a single nodejs process.
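A sketch of that locality, assuming the built-in cluster module and a placeholder URI: every worker constructs its own client, and therefore its own pool; nothing is shared between them.

```js
const cluster = require('node:cluster');
const os = require('node:os');
const { MongoClient } = require('mongodb');

if (cluster.isPrimary) {
  // One worker per core; each is a separate OS process.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  // This pool belongs to this worker alone and dies with it.
  const client = new MongoClient('mongodb://localhost:27017', {
    maxPoolSize: 20, // total connections = 20 * number of workers
  });
  // ... serve requests using this worker's own connections
}
```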

Any benefits when clustering node.js application server in single core computer

This question is meant for single core only therefore multiple cores are out.
I am running a Node.js application as an HTTP server on a single-core computer, using Express.js. Assuming that my server is able to handle 1000 concurrent requests, would clustering bring about any improvement in response speed?
Does process context switching have much impact on performance in this case?
I wouldn't expect improvements in speed, but you might get some other benefits.
For example, if one process crashes, the other nodejs instance can still serve requests.
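A minimal sketch of that benefit with Node's built-in cluster module: even on one core, a second worker plus a respawn handler keeps the HTTP server answering through a crash.

```js
const cluster = require('node:cluster');
const http = require('node:http');

if (cluster.isPrimary) {
  cluster.fork();
  cluster.fork(); // two workers: one can keep serving if the other dies
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died, respawning`);
    cluster.fork();
  });
} else {
  // Workers share port 3000; cluster distributes incoming connections.
  http.createServer((req, res) => res.end('ok')).listen(3000);
}
```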
would clustering brings about any better in response speed ?
You are aware of the fact that in clustered mode, your application will be running behind a load balancer, that will in turn take some CPU and memory to manage and forward the network traffic. Then, what's left of the resources, will be used to distribute the network load.
Apart from a few, rare and easily avoidable cases such as #Cristyan mentions—in which case your load balancer can be an orchestrator managing most of the stuff, like Kubernetes—running a Node.js app in cluster does not make sense to me, on a single CPU core. If the process has to wait for an item, it has to wait for it! Asynchronously you can make it work on other requests, but even in this case, other processes would want to take a share of CPU too.

Loading Streaming Data from RabbitMQ to Postgres in Parallel

I'm still somewhat new to Node.js, so I'm not as conversant in how parallelism works with concurrent I/O operations as I'd like to be.
I'm planning a Node.js application to load streaming data from RabbitMQ to Postgres. These loads will happen during system operation, so it is not a bulk load.
I expect throughput requirements to be fairly low to start (maybe 50-100 records per minute). But I'd like to plan the application so it can scale up to higher volumes as the requirements emerge.
I'm trying to think through how parallelism would work. My first impression of the flow, and of where parallelism would be introduced, is:
Message read from the queue
Query to load the data into Postgres is kicked off, which registers a callback to run when it completes
Event loop free to read another message from the queue, if available, which will launch another query
Repeat
I believe the queries kicked off in this fashion will run in parallel up to the number of connections in my PG connection pool. Is this a good assumption?
With this simple flow, the limit on parallel queries would seem to be the size of the Postgres connection pool. I could make that as big as required for throughput (and as big as the server and backend database can handle), and that would be the limiting factor on how many messages I could process in parallel. Does that sound right?
I haven't located a great reference on how many parallel I/Os Node will instantiate. Will Node eventually block as my event loop generates too many I/O requests that aren't yet resolved (if not, I assume pg will put my query on the callback stack when I have to wait for a connection)? Are there dials I can turn to affect these limits by setting switches when I launch Node? Am I assuming correctly that libuv and the "pg" lib will in fact run these queries in parallel within one Node.js process? If those assumptions are correct, I'd think I'd hit connection pool size limits before I'd run into libuv parallelism limits (or possibly at the same time if I size my connection pool to the number of cores on the server).
Also, related to the discussion above about Node launching parallel I/O requests, how do I prevent Node from pulling messages off the queue as quickly as they come in and queuing up I/O requests? I'd think at some point this could cause problems with memory consumption. This relates back to my question about startup parameters to limit the amount of parallel I/O requests created. I don't understand this too well at this point, so maybe it's not a concern (maybe by default Node won't create more parallel I/O requests than cores, providing a natural limit?).
The other thing I'm wondering is when/how running multiple copies of this program in parallel would help? Does it even matter on one host since the Postgres connection pool seems to be the driver of parallelism here? If that's the case, I'd probably only run one copy per host and only run additional copies on other hosts to spread the load.
As you can see, I'm trying to get some basic assumptions right before I start down this road. Insight and pointers to good reference doc would be appreciated.
I resolved this with a test of the prototype I wrote. A few observations:
If I don't set pre-fetch on the RabbitMQ channel, Node will pull ALL the messages off the queue in seconds. I did a test with 100K messages in the queue, and Node pulled all 100K off in seconds, though it took many minutes to actually process them.
The behavior mentioned in #1 above is not desirable, because Node must then cache all the messages in memory. In my test, Node took up 2 GB when pulling down all those messages quickly, whereas if I set pre-fetch to match the number of database connections, Node took up only 80 MB and drained the queue slowly, as it finished processing the messages and sent back ACKs.
A single instance of Node running this program kept my CPUs 100% utilized.
So, the morals of the story seem to be (see the sketch after this list):
Node can spawn any number of async I/O handlers (limited by available memory)
In a case like this, you want to limit how many async I/O requests Node spawns to avoid excessive memory usage.
Creating additional child processes for this workload made no difference. The unit of parallelism was the size of the database connection pool. If my workload did more in JavaScript instead of just delegating to Postgres, additional child processes would help. But in this case, it's all I/O (and thankfully I/O that doesn't need the Node threadpool), so the additional child processes do nothing.
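A consumer sketch putting those morals together, assuming amqplib, a queue named events, and a hypothetical events table; prefetch is matched to the pool size so Node never buffers more messages than Postgres can work on at once:

```js
const amqp = require('amqplib');
const { Pool } = require('pg');

const POOL_SIZE = 10;

async function main() {
  const pool = new Pool({ max: POOL_SIZE });
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('events');

  // Without this, Node would pull the whole queue into memory at once.
  await ch.prefetch(POOL_SIZE);

  await ch.consume('events', async (msg) => {
    if (msg === null) return; // consumer was cancelled
    try {
      const { id, payload } = JSON.parse(msg.content.toString());
      await pool.query(
        'INSERT INTO events (id, payload) VALUES ($1, $2)',
        [id, payload]
      );
      ch.ack(msg); // frees a prefetch slot; RabbitMQ sends the next message
    } catch (err) {
      console.error(err);
      ch.nack(msg, false, true); // requeue the message on failure
    }
  });
}

main().catch(console.error);
```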

How exactly does NodeJS handle high concurrent requests?

I was trying to understand how nodejs can achieve higher concurrency compared to thread-based approaches such as Servlet servers.
I already know that in nodejs "everything runs in parallel except your code", and also there is a backend thread pool in libuv to handle File IO or database calls which are usually the bottlenecks.
So here is my question: if nodejs uses a thread pool to handle database calls, how can it service more concurrent requests than Servlet servers such as Tomcat, given that Tomcat can also use NIO backed by epoll/kqueue to achieve high concurrency?
For example, if there are 100k concurrent requests coming in and each requires database operations, then to service these 100k requests concurrently, nodejs would still end up creating 100k threads, which might cause memory exhaustion, just as Tomcat does. Yes, the 100k threads are just a thought experiment, because (I know) nodejs has a fixed thread pool and queues operations on the event loop, but Tomcat handles things the same way: we can also configure the thread pool size in Tomcat, and it also queues requests.
Or, am I wrong to say that "nodejs uses backend thread pool in libuv to handle File IO or database calls"? Does nodejs use epoll/kqueue to handle database io without a separate thread?
I was reading this similar question but still didn't get the answer.
if nodejs uses thread pool to handle database calls
That's a wrong assumption. nodejs will typically use networking to talk to a local database running in a different process or on a different host. Networking in node.js does not use threads of any kind; it uses event-driven I/O. What the database does with threads is up to the database and independent of node.js, since it would be the same no matter which server environment you were using.
node.js does use a thread pool for local disk access, but high-scale applications usually put the crux of their disk access through a database, which runs in a separate process and has its own I/O optimizations to handle lots of requests. How a given database does it is up to that implementation, but it will not be using a nodejs thread per request.
I was trying to understand how nodejs can achieve higher concurrency compared to thread-based approaches such as Servlet servers.
The general concept is that a properly written server app in node.js uses async I/O for all I/O (except perhaps startup code that only runs during server startup). This means it can have a lot of requests in flight at the same time, with only a single Javascript thread, while most of them are waiting on some type of I/O. If you're going to have a lot of requests in flight at once, it can be much more efficient to do it the node.js way, where all the requests are cooperatively switched on one thread, than to use OS threads, where every thread carries OS overhead and every pre-emptive thread switch costs OS and CPU time.
In node.js, there is no pre-emptive switching between the active requests. Only one runs at a time, and it runs until it either finishes or hits an asynchronous operation and has nothing else to do until that async I/O operation completes. At that point, the JS engine goes back to the event queue and picks out an event (probably for one of the other requests). This kind of cooperative switching can be significantly faster and more efficient than OS-level threads. The trade-off is that a node.js developer has to code with async I/O to take advantage of it, which has a learning curve to get proficient at writing good, clean code with proper error handling, and a learning curve for debugging too.
"For example, if there are 100k concurrent requests coming in and each requires database operations, then to service these 100k requests concurrently, nodejs would still end up creating 100k threads, which might cause memory exhaustion, just as Tomcat does."
No, you will not be creating 100k threads. A node.js database interface layer that sits between node.js and the actual database code (in another process or on another host) may be written entirely in node.js, using TCP networking to talk to the database, and introduce no new threads at all; or it may include some native code and use a small number of threads for its own native operations, but it will likely be a small number of threads and nothing even close to one per request.
Or, am I wrong to say that "nodejs uses backend thread pool in libuv to handle File IO or database calls"? Does nodejs use epoll/kqueue to handle database io without a separate thread?
For file I/O, yes, it uses a thread pool in libuv. For database calls, no: while the details depend entirely upon the database implementation, there is usually not a thread per database call. The database is typically in another process, and the nodejs interface library for the DB either uses nodejs TCP directly to talk to the database (which uses no threads) or has its own native add-on that talks to the database, which probably uses a small number of threads for its work, but typically not a thread per request.
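A toy illustration of that last point: from Node's perspective, a database connection is just an event-driven TCP socket, so waiting costs no thread. This only opens a raw socket to Postgres' default port; a real client would speak the Postgres wire protocol on top of it.

```js
const net = require('node:net');

// No thread is created or blocked while we wait for the server.
const socket = net.createConnection(5432, 'localhost', () => {
  console.log('connected; the event loop stays free while we wait');
});

socket.on('data', (chunk) => {
  // Delivered through libuv's epoll/kqueue loop, on the main thread.
  console.log('received', chunk.length, 'bytes');
});

socket.on('error', console.error);
```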
