While performance tuning a Node.js application, I've determined that the application is blocking because there are too few database connections in the MongoDB connection pool.
I want to know the high-water mark for in-flight queries in my application. How can I find this out, preferably from within Node.js?
Because I'm using a shared database, I can't use the database's 'serverStatus' for this purpose.
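One driver-independent way to measure this from inside the application is to count queries yourself. The following is a minimal sketch, assuming every query can be routed through a single wrapper; `createInFlightTracker` and `track` are hypothetical names and not part of the MongoDB driver, and the timers stand in for real queries.

```javascript
// Hypothetical helper: counts calls that have started but not yet settled.
function createInFlightTracker() {
  let current = 0;
  let highWater = 0;
  return {
    // fn is any function that returns a promise (e.g. a driver call).
    track(fn) {
      current += 1;
      if (current > highWater) highWater = current;
      const done = () => { current -= 1; };
      let pending;
      try {
        pending = Promise.resolve(fn());
      } catch (err) {
        done(); // fn threw synchronously, so nothing is in flight
        throw err;
      }
      return pending.finally(done);
    },
    stats() {
      return { current, highWater };
    },
  };
}

// Demo with timers standing in for real queries (assumption).
const tracker = createInFlightTracker();
const fakeQuery = (ms) => () => new Promise((resolve) => setTimeout(resolve, ms));
Promise.all([
  tracker.track(fakeQuery(20)),
  tracker.track(fakeQuery(10)),
  tracker.track(fakeQuery(5)),
]).then(() => console.log(tracker.stats())); // { current: 0, highWater: 3 }
```

If the high-water mark regularly exceeds the configured pool size, queries are waiting for a connection at those moments.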
Related
I'm somewhat new to the Node ecosystem and am trying to instrument some of our Node services to provide better metrics about their internal state and one of the glaring blindspots we currently have is around our database connection pools.
Coming from a Java background, I've primarily relied on libraries like Hikari which exposes key metrics like total connections, active connections, idle connections, and threads queued and waiting for a connection from the pool (https://github.com/brettwooldridge/HikariCP/wiki/MBean-(JMX)-Monitoring-and-Management). These are all critical metrics to understand to ensure your connection pool is properly sized and your application is functioning as expected.
In our current Node services we are using TypeORM and connecting to a Postgres DB. I'd like to find a way to access and expose these same core metrics but I'm finding little to no information about the best way to do it so I have two questions:
1) With TypeORM and Postgres, is there a way to get a handle on the connection pool internals? It looks like I may be able to get part way there with something like getConnectionManager().connections but I'm not seeing any way to get more detailed information like distinguishing between active and idle connections.
2) Is there a standard mechanism to expose internal application metrics for a Node service that's somewhat comparable to JMX on the JVM?
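For the Postgres side, a node-postgres `Pool` exposes the counters `totalCount`, `idleCount`, and `waitingCount`, which map roughly onto the Hikari metrics mentioned above. The sketch below only shows how those counters could be turned into a metrics snapshot; `snapshotPoolMetrics` is a hypothetical helper, the demo uses a plain object instead of a real pool so it runs without a database, and how to reach the underlying pool from TypeORM is not shown because it depends on TypeORM internals.

```javascript
// Hypothetical helper: accepts any object shaped like a node-postgres Pool.
function snapshotPoolMetrics(pool) {
  const total = pool.totalCount; // clients currently held by the pool
  const idle = pool.idleCount;   // clients not checked out
  return {
    total,
    idle,
    active: total - idle,        // checked-out clients
    waiting: pool.waitingCount,  // callers queued for a client
  };
}

// Stand-in object (assumption); a real pool would come from `new Pool()` in `pg`.
const fakePool = { totalCount: 5, idleCount: 2, waitingCount: 1 };
console.log(snapshotPoolMetrics(fakePool));
// { total: 5, idle: 2, active: 3, waiting: 1 }
```

A snapshot like this could be taken on a timer and handed to whatever metrics reporter the service already uses.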
With the node-postgres npm package, I'm given two connection options: using a Client or using a Pool.
What would be the benefit of using a Pool instead of a Client? What problem will it solve for me in the context of Node.js, which a) is async, and b) won't die and disconnect from Postgres after every HTTP request (as PHP would do, for example)?
What would be the technicalities of using a single instance of Client vs using a Pool from within a single container running a node.js server? (e.g. Next.js, or Express, or whatever).
My understanding is that with server-side languages like PHP (classic sync php), Pool would benefit me by saving time on multiple re-connections. But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?
PostgreSQL's architecture is specifically built for pooling. Its developers decided that forking a process for each connection to the database was the safest choice and this hasn't been changed since the start.
Modern middleware that sits between the client and the database (in your case node-postgres) opens and closes virtual connections while managing the "physical" connections to the Postgres database so that they are used as efficiently as possible.
This means connection time can be reduced a lot: closed connections are not really closed but returned to a pool, and opening a new connection hands out an existing physical connection from that pool, which reduces the actual forking going on on the database side.
Node-postgres themselves write about the pros on their website, and they recommend you always use pooling:
Connecting a new client to the PostgreSQL server requires a handshake which can take 20-30 milliseconds. During this time passwords are negotiated, SSL may be established, and configuration information is shared with the client & server. Incurring this cost every time we want to execute a query would substantially slow down our application.
The PostgreSQL server can only handle a limited number of clients at a time. Depending on the available memory of your PostgreSQL server you may even crash the server if you connect an unbounded number of clients. note: I have crashed a large production PostgreSQL server instance in RDS by opening new clients and never disconnecting them in a python application long ago. It was not fun.
PostgreSQL can only process one query at a time on a single connected client in a first-in first-out manner. If your multi-tenant web application is using only a single connected client all queries among all simultaneous requests will be pipelined and executed serially, one after the other. No good!
https://node-postgres.com/features/pooling
I think it was clearly expressed in this snippet.
"But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?"
Yes, but the number of simultaneous connections to the database itself is limited, and when too many clients try to connect at the same time, the database does not handle it elegantly. A pool mitigates this by moving the queuing and error handling, which databases are not specialized in, out of the database and into the application layer.
"What exactly is not elegant and how is it more elegant with pooling?"
A database stops responding, a connection times out, without any feedback to the end user (and even often with few clues to the server admin). The database is dependent on hardware to a higher extent than a javascript program. The risk of failure is higher. Those are my main "not elegant" arguments.
Pooling is better because:
a) As node-postgres wrote in my link above: "Incurring the cost of a db handshake every time we want to execute a query would substantially slow down our application."
b) Postgres can only process one query at a time on a single connected client (which is what Node would do without the pool) in a first-in first-out manner. All queries among all simultaneous requests will be pipelined and executed serially, one after the other. Recipe for disaster.
c) In my opinion, a Node-based pooling component is a better place for enhancements such as request queuing, logging, and error handling than a single connection.
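Point b) can be illustrated without a database. The following is a minimal sketch of the queuing a pool performs, not node-postgres's actual implementation; `TinyPool` is a hypothetical class and the timers stand in for queries. With one slot (the single Client case) two of three simultaneous queries must wait; with three slots none do.

```javascript
// Hypothetical, simplified pool: `size` slots plus a FIFO queue of waiters.
class TinyPool {
  constructor(size) {
    this.size = size;
    this.busy = 0;
    this.waiters = [];
  }

  acquire() {
    if (this.busy < this.size) {
      this.busy += 1;
      return Promise.resolve();
    }
    // No free slot: wait until release() hands one over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // pass the slot directly to the oldest waiter; busy is unchanged
    } else {
      this.busy -= 1;
    }
  }

  async run(task) {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Timers stand in for real queries (assumption).
const slowQuery = () => new Promise((resolve) => setTimeout(resolve, 10));
const single = new TinyPool(1); // behaves like one shared Client
const pooled = new TinyPool(3);
for (let i = 0; i < 3; i += 1) {
  single.run(slowQuery);
  pooled.run(slowQuery);
}
console.log(single.busy, single.waiters.length); // 1 2
console.log(pooled.busy, pooled.waiters.length); // 3 0
```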
Background:
According to Postgres themselves pooling IS needed, but deliberately not built into Postgres itself. They write:
"If you look at any graph of PostgreSQL performance with number of connections on the x axis and tps on the y access (with nothing else changing), you will see performance climb as connections rise until you hit saturation, and then you have a "knee" after which performance falls off. A lot of work has been done for version 9.2 to push that knee to the right and make the fall-off more gradual, but the issue is intrinsic -- without a built-in connection pool or at least an admission control policy, the knee and subsequent performance degradation will always be there.
The decision not to include a connection pooler inside the PostgreSQL server itself has been taken deliberately and with good reason:
In many cases you will get better performance if the connection pooler is running on a separate machine;
There is no single "right" pooling design for all needs, and having pooling outside the core server maintains flexibility;
You can get improved functionality by incorporating a connection pool into client-side software; and finally
Some client side software (like Java EE / JPA / Hibernate) always pools connections, so built-in pooling in PostgreSQL would then be wasteful duplication.
Many frameworks do the pooling in a process running on the database server machine (to minimize latency effects from the database protocol) and accept high-level requests to run a certain function with a given set of parameters, with the entire function running as a single database transaction. This ensures that network latency or connection failures can't cause a transaction to hang while waiting for something from the network, and provides a simple way to retry any database transaction which rolls back with a serialization failure (SQLSTATE 40001 or 40P01).
Since a pooler built in to the database engine would be inferior (for the above reasons), the community has decided not to go that route."
They continue with their top reasons for performance failure with many connections to Postgres:
Disk contention. If you need to go to disk for random access (ie your data isn't cached in RAM), a large number of connections can tend to force more tables and indexes to be accessed at the same time, causing heavier seeking all over the disk. Seeking on rotating disks is massively slower than sequential access so the resulting "thrashing" can slow systems that use traditional hard drives down a lot.
RAM usage. The work_mem setting can have a big impact on performance. If it is too small, hash tables and sorts spill to disk, bitmap heap scans become "lossy", requiring more work on each page access, etc. So you want it to be big. But work_mem RAM can be allocated for each node of a query on each connection, all at the same time. So a big work_mem with a large number of connections can cause a lot of the OS cache to be periodically discarded, forcing more accesses to disk; or it could even put the system into swapping. So the more connections you have, the more you need to make a choice between slow plans and trashing cache/swapping.
Lock contention. This happens at various levels: spinlocks, LW locks, and all the locks that show up in pg_locks. As more processes compete for the spinlocks (which protect LW locks acquisition and release, which in turn protect the heavyweight and predicate lock acquisition and release) they account for a high percentage of CPU time used.
Context switches. The processor is interrupted from working on one query and has to switch to another, which involves saving state and restoring state. While the core is busy swapping states it is not doing any useful work on any query. Context switches are much cheaper than they used to be with modern CPUs and system call interfaces but are still far from free.
Cache line contention. One query is likely to be working on a particular area of RAM, and the query taking its place is likely to be working on a different area; causing data cached on the CPU chip to be discarded, only to need to be reloaded to continue the other query. Besides that the various processes will be grabbing control of cache lines from each other, causing stalls. (Humorous note, in one oprofile run of a heavily contended load, 10% of CPU time was attributed to a 1-byte noop; analysis showed that it was because it needed to wait on a cache line for the following machine code operation.)
General scaling. Some internal structures allocated based on max_connections scale at O(N^2) or O(N*log(N)). Some types of overhead which are negligible at a lower number of connections can become significant with a large number of connections.
Source
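To put the RAM usage point in numbers: since work_mem can be allocated once per query node on every connection at the same time, a rough upper bound is the product of the three. The sketch below is only that back-of-the-envelope calculation; the function name and the example values (64 MB, 3 nodes per query, 20 or 200 connections) are hypothetical.

```javascript
// Rough upper bound: work_mem * nodes per query * concurrent connections.
function worstCaseWorkMemMb(workMemMb, nodesPerQuery, connections) {
  return workMemMb * nodesPerQuery * connections;
}

// Hypothetical settings: 64 MB work_mem, 3 sort/hash nodes per query.
console.log(worstCaseWorkMemMb(64, 3, 20));  // 3840
console.log(worstCaseWorkMemMb(64, 3, 200)); // 38400
```

The same work_mem that is harmless behind a small pool becomes a multi-gigabyte worst case once the connection count grows tenfold.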
I need a local DB on a Pi Zero, with multiple processes running that need to write and read data. That kind of rules SQLite out (I think). In my experience SQLite only allows one connection at a time and is tricky with multiple processes trying to do database work. All of my data transmission would be JSON-driven, so NoSQL makes sense, but I need something lightweight to store a few configs and to store data that will be synced up to the server. Which NoSQL options with good Node support would be best to run on a Pi?
SQLite is generally fine when using it with multiple concurrent processes. From the SQLite FAQ:
We are aware of no other embedded SQL database engine that supports as much concurrency as SQLite. SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
For the majority of applications, that should be fine. If only one of your processes is doing writes, and the other only reads, it should have no impact at all.
If you're looking for something that's NoSQL-specific, you can also consider LevelDB, which is used in Google Chrome. With Node, the best way to access it is through the levelup library.
I am currently developing a real-time app with RethinkDB and Node, and there are many different RethinkDB queries to run in different classes. So, my question is: does it make more sense for every query to open and close its own RethinkDB connection, or to have a single, statically available connection on which every query is run?
From this issue I deduce that parallelization is already an option, so this is a matter of what is more efficient.
It's best to have a pool of open connections to your RethinkDB server. For example rethinkdbdash (which I recommend you use) opens a pool of 50 connections that are available for your queries.
We're planning to deploy a web application with Amazon OpsWorks, and I wanted to check whether our architecture has any design flaws.
We have four components:
A load balancer (Amazon preferably)
Express based on Node.js
MongoDB
ElasticSearch
Here's a communication diagram of our components:
At the front is a load balancer which distributes http requests to multiple web servers.
The web server is stateless and therefore can be cloned each time the load requires it. All web server instances are equal. Session information is saved in the MongoDB.
In the "backend" we're planning to use the built-in cluster functionality of MongoDB and ElasticSearch. Therefore each web server instance only connects to a single MongoDB and ElasticSearch master instance. MongoDB and ElasticSearch then scale accordingly. Furthermore, the ElasticSearch master speaks to the MongoDB master to retrieve data for building the index.
As we see it, the most challenging part of setting up such a system is configuring OpsWorks with the MongoDB and ElasticSearch clusters.
Many thanks in advance!
if our architecture might have any design flaws.
Well, keep in mind that we can't tell much from a generic diagram. But here are some notes:
1) MongoDB isn't as easy to scale as other databases such as DynamoDB, Riak or Cassandra. For example, if you ever exceed the capacity of a single master (no matter how many slaves you have, all writes go to the single master), you'll have to shard. But switching to sharding is very disruptive and very tedious to set up.
If you don't expect to exceed the write capacity of one node, then you'll be fine on MongoDB.
2) What will you do for async tasks such as sending emails, creating long reports, etc?
It's possible to do these things in the request loop, and that's probably a fine way to get started. But as you have more boxes, the chances of failure go up. When a box dies, all the async tasks go away and nobody will know what they were. You also can have problems where one box gets heavily loaded with async tasks (using too much CPU or memory), and the problem will get worse and worse as it gets more tasks and completes them more slowly.
Also, a front-end like ELB will have a 60-second limit, which can cause problems if some of your requests could take longer. (Spin them off into async jobs with polling or something.)
3) ELB doesn't support web sockets. Consider that if you think you might want websockets down the road.
There's no such thing as a master in elastic search. You have master copies of shards and replicas of shards but they are basically moved around through your cluster by elastic search. Nodes might be master for one shard and a replica for another. So, you could simply put a load balancer in front of it.
However, you can specialize nodes to be data nodes or routing nodes as explained here: http://www.elasticsearch.org/guide/reference/modules/node/
The routing nodes effectively become load balancers. You could have a few of those (redundancy) and distribute load between those. Alternatively, you could run a dedicated router node on each web server. Basically routing nodes are pretty light and you save a bit of bandwidth/latency since your web server talks to localhost and from there it is all elastic search internal cluster traffic.
I'd recommend replacing MongoDB with Amazon DynamoDB (it has a Node.js SDK).