Loading Streaming Data from RabbitMQ to Postgres in Parallel - node.js

I'm still somewhat new to Node.js, so I'm not as conversant in how parallelism works with concurrent I/O operations as I'd like to be.
I'm planning a Node.js application to load streaming data from RabbitMQ to Postgres. These loads will happen during system operation, so it is not a bulk load.
I expect throughput requirements to be fairly low to start (maybe 50-100 records per minute). But I'd like to plan the application so it can scale up to higher volumes as the requirements emerge.
I'm trying to think through how parallelism would work. My first impression of the flow, and of how parallelism would be introduced, is:
1. Message read from the queue
2. Query to load data into Postgres kicked off, which pushes callback to the Node stack
3. Event loop free to read another message from the queue, if available, which will launch another query
4. Repeat
I believe the queries kicked off in this fashion will run in parallel up to the number of connections in my PG connection pool. Is this a good assumption?
With this simple flow, the limit on parallel queries would seem to be the size of the Postgres connection pool. I could make that as big as required for throughput (and that the server and backend database can handle) and that would be the limiting factor on how many messages I could process in parallel. Does that sound right?
I haven't located a great reference on how many parallel I/Os Node will instantiate. Will Node eventually block as my event loop generates too many I/O requests that aren't yet resolved (if not, I assume pg will put my query on the callback stack when I have to wait for a connection)? Are there dials I can turn to affect these limits by setting switches when I launch Node? Am I assuming correctly that libuv and the "pg" lib will in fact run these queries in parallel within one Node.js process? If those assumptions are correct, I'd think I'd hit connection pool size limits before I'd run into libuv parallelism limits (or possibly at the same time if I size my connection pool to the number of cores on the server).
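The flow described above can be tried out without a database. What follows is only a minimal sketch: `TinyPool`, `fakeQuery` and `simulate` are hypothetical names made up for illustration, and the class is a stand-in for a client-side connection pool, not how the pg library is actually implemented. It shows that the loop which launches the "queries" never blocks, that at most `poolSize` of them run at once, and that everything else waits in memory.

```javascript
// Hypothetical stand-in for a connection pool: at most `max` tasks run at
// once, and the rest wait in a FIFO queue held in memory.
class TinyPool {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.waiting = [];
    this.peakActive = 0;
    this.peakWaiting = 0;
  }

  // Runs `task` as soon as a slot is free and resolves with its result.
  run(task) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ task, resolve, reject });
      this.peakWaiting = Math.max(this.peakWaiting, this.waiting.length);
      this._drain();
    });
  }

  // Starts waiting tasks while there are free slots.
  _drain() {
    while (this.active < this.max && this.waiting.length > 0) {
      const { task, resolve, reject } = this.waiting.shift();
      this.active += 1;
      this.peakActive = Math.max(this.peakActive, this.active);
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.active -= 1;
          this._drain();
        });
    }
  }
}

// Simulated query: resolves after `ms` milliseconds without blocking the loop.
const fakeQuery = (id, ms = 5) =>
  new Promise((resolve) => setTimeout(() => resolve(id), ms));

async function simulate(messageCount, poolSize) {
  const pool = new TinyPool(poolSize);
  const jobs = [];
  for (let i = 0; i < messageCount; i += 1) {
    jobs.push(pool.run(() => fakeQuery(i))); // returns immediately
  }
  const results = await Promise.all(jobs);
  return {
    done: results.length,
    peakActive: pool.peakActive,
    peakWaiting: pool.peakWaiting,
  };
}

simulate(100, 10).then((stats) => console.log(stats));
```

With 100 messages and a pool of 10, the sketch reports a peak of 10 running tasks and 90 waiting ones. The waiting queue is the part that grows with the number of messages pulled in, which is where the memory concern comes from.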
Also, related to the discussion above about Node launching parallel I/O requests, how do I prevent Node from pulling messages off the queue as quick as they come in and queuing up I/O requests? I'd think at some point this could cause problems with memory consumption. This relates back to my question about startup parameters to limit the amount of parallel I/O requests created. I don't understand this too well at this point, so maybe it's not a concern (maybe by default Node won't create more parallel I/O requests than cores, providing a natural limit?).
The other thing I'm wondering is when/how running multiple copies of this program in parallel would help? Does it even matter on one host since the Postgres connection pool seems to be the driver of parallelism here? If that's the case, I'd probably only run one copy per host and only run additional copies on other hosts to spread the load.
As you can see, I'm trying to get some basic assumptions right before I start down this road. Insight and pointers to good reference doc would be appreciated.

I resolved this with a test of the prototype I wrote. A few observations:
1. If I don't set pre-fetch on the RabbitMQ channel, Node will pull ALL the messages off the queue in seconds. In a test with 100K messages on the queue, Node pulled all 100K off in seconds, though it took many minutes to actually process them.
2. The behavior mentioned in #1 above is not desirable, because then Node must cache all the messages in memory. In my test, Node took up 2 GB when pulling down all those messages quickly, whereas if I set pre-fetch to match the number of database connections, Node took up only 80 MB and drained the queue slowly, as it finished processing the messages and sent back ACKs.
3. A single instance of Node running this program kept my CPUs 100% utilized.
So, the morals of the story seem to be:
1. Node can spawn any number of async I/O handlers (limited by available memory).
2. In a case like this, you want to limit how many async I/O requests Node spawns to avoid excessive memory usage.
3. Creating additional child processes for this workload made no difference. The unit of parallelism was the size of the database connection pool. If my workload did more in JavaScript instead of just delegating to Postgres, additional child processes would help. But in this case, it's all I/O (and thankfully I/O that doesn't need the Node threadpool), so the additional child processes do nothing.
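The effect of pre-fetch can be illustrated with a small in-memory simulation. This is only a sketch: `FakeChannel` and `drain` are hypothetical names, and the class merely imitates a broker that stops delivering once `prefetch` messages are unacknowledged. With the amqplib client, the equivalent limit would be set with `channel.prefetch(count)` before consuming, and each message confirmed with `channel.ack(msg)`.

```javascript
// Hypothetical in-memory stand-in for a broker channel with a prefetch limit:
// it delivers a message only while fewer than `prefetch` are unacknowledged.
class FakeChannel {
  constructor(messages, prefetch = Infinity) {
    this.queue = [...messages];
    this.prefetch = prefetch;
    this.unacked = 0;
    this.peakUnacked = 0;
    this.handler = null;
  }

  consume(handler) {
    this.handler = handler;
    this._deliver();
  }

  ack() {
    this.unacked -= 1;
    this._deliver();
  }

  _deliver() {
    while (this.queue.length > 0 && this.unacked < this.prefetch) {
      const msg = this.queue.shift();
      this.unacked += 1;
      this.peakUnacked = Math.max(this.peakUnacked, this.unacked);
      this.handler(msg);
    }
  }
}

// Drains `count` messages (count >= 1); each simulated insert takes about 2 ms.
// Resolves with the largest number of messages held in the process at once.
function drain(count, prefetch) {
  return new Promise((resolve) => {
    const messages = Array.from({ length: count }, (_, i) => i);
    const channel = new FakeChannel(messages, prefetch);
    let processed = 0;
    channel.consume(() => {
      setTimeout(() => {
        processed += 1;
        channel.ack();
        if (processed === count) resolve(channel.peakUnacked);
      }, 2);
    });
  });
}

drain(1000, Infinity).then((peak) => console.log('no prefetch, peak held:', peak));
drain(1000, 10).then((peak) => console.log('prefetch 10, peak held:', peak));
```

Without a limit, all 1000 messages are held by the consumer at once; with a limit of 10, never more than 10 are.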

Related

Is using Pool instead of Client in node-postgres useful despite Nodejs being single threaded?

I am using Node.js express for building REST api with postgres database using node-postgres package.
My question is whether I should use Client or Pool? I found this answer:
How can I choose between Client or Pool for node-postgres
but I don't understand what would be the use of Pool connection, since Nodejs is single-threaded and there won't be an attempt to use a single connection at the same time even if there are concurrent requests occurring.
Also by using a single connection, I can benefit from the prepared statements much more efficiently. I can prepare them at the initialization phase of my app, and then execute it whenever a request arrives.
Yes, since PostgreSQL still handles its connections concurrently (it runs a separate backend process for each connection).
When making a database request your process spends 0% CPU time executing code. Yes, you've read that right, zero.
The computer does not execute code in order to wait. Instead it sets up interrupt handlers and tells the hardware (ethernet card or wifi module) to send it an interrupt when there is data. Regardless of the number of requests you make to your database you still only have ONE ethernet card in your PC (well, some servers can have multiple and have increased bandwidth by trunking but I think you can see that the number of PCI cards you have does not have any relationship with the number of threads you are running - rather it is more related with the amount of $money you are willing to spend). Your hardware still basically sends all the requests out one bit at a time.
A traditional multi-threaded server therefore spends exactly the same amount of CPU time as node.js waiting for responses from the database: zero. Which means node.js improves efficiency by not needing to malloc a lot of RAM for each thread since node only has one thread.
Even when you are running your database on the same computer as your node process, communication with the database is not overly parallel. And the TCP/IP stack itself sort of serializes the communication. And while it does not go through the networking hardware the OS still schedules the responses using OS level events (instead of hardware interrupts).
So yes, it makes sense for your node.js process to make multiple parallel connections to the database even when node is single-threaded: it allows the database to process requests concurrently, one per connection. You are making use of your database's own concurrency instead of forcing it to work through everything serially over node's single connection.
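A back-of-the-envelope model makes the benefit concrete. The sketch below assumes that each connection runs one query at a time and that the database has enough capacity to run every connection's query in parallel without contention, which real systems only approximate. `totalTime` is a hypothetical helper, not part of node-postgres.

```javascript
// Back-of-the-envelope model: each connection runs one query at a time, and
// each query goes to whichever connection becomes free first.
function totalTime(queryDurationsMs, connections) {
  const freeAt = new Array(connections).fill(0); // when each connection is next idle
  for (const duration of queryDurationsMs) {
    const earliest = freeAt.indexOf(Math.min(...freeAt));
    freeAt[earliest] += duration;
  }
  return Math.max(...freeAt);
}

const queries = new Array(8).fill(100); // eight queries of 100 ms each
console.log(totalTime(queries, 1)); // 800
console.log(totalTime(queries, 4)); // 200
```

In this idealized model, eight 100 ms queries take 800 ms over a single connection and 200 ms over a pool of four. In practice the gain flattens once the database itself becomes the bottleneck.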

How to find optimal size of connection pool for single mongo nodejs driver

I am using official mongo nodejs driver with default settings, but was digging deeper into options today and apparently there is an option of maxPoolSize that is set to 100 by default.
My understanding of this is that a single nodejs process can establish up to 100 connections, thus allowing mongo to handle 100 reads/writes simultaneously in parallel?
If so, it seems that setting this number higher could only benefit the performance, but I am not sure hence decided to ask here.
Assuming default setup with no indexes, is there a way to determine (based on cpu's and memory of the db) what the optimal connection number for pool should be?
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
Good question =)
it seems that setting this number higher could only benefit the performance
It does indeed. I mean it seems, and it would be the case for an abstract nodejs process in a vacuum with unlimited resources. Connections are not free, so there are things to consider:
limited connection quota on the server. Atlas in particular, but even self-hosted cluster has only 65k sockets. Remember the driver keeps them open to reuse, and the default timeout per cursor is 30 minutes of inactivity.
single thread clientside. BSON serialisation blocks event loop and is quite expensive, e.g. see the flamechart in this answer https://stackoverflow.com/a/72264469/1110423 . Blocking the loop, you increase time cursors from the previous point remain open, and in worst case get performance degradation.
limited RAM. Each connection requires ~1 MB serverside.
Assuming default setup with no indexes
You have at least _id, and you should have more if we are talking about performance
is there a way to determine what the optimal connection number for pool should be?
I'd love to know that too. There are too many factors to consider, not only CPU/RAM, but also data shape, query patterns, etc. This is what dbops are for. Mongo cluster requires some attention, monitoring and adjustments for optimal operations. In many cases it's more cost efficient to scale up the cluster than optimise the app.
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
This is quite a wild assumption. The process itself cannot scale horizontally. That happens at the OS level: once you have a process descriptor, the process is locked to it till its death. You can use a node cluster to utilise all CPU cores, and can even have multiple servers running the same nodejs app and balance the load, but none of them will share connections from the pool. The pool is local to each nodejs process.
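Because the pool is local to each process, the connections the server may see add up across processes. The arithmetic can be sketched as follows; both helpers are hypothetical names and the numbers are purely illustrative, with `maxPoolSize` being the driver option discussed in the question.

```javascript
// Worst-case number of connections opened by the whole fleet, assuming every
// process fills its own pool at the same time.
function worstCaseConnections(processCount, maxPoolSize) {
  return processCount * maxPoolSize;
}

// Hypothetical helper: the largest per-process pool that keeps the fleet under
// a server-side connection limit, leaving `reserved` connections for other uses.
function maxPoolSizePerProcess(serverConnectionLimit, processCount, reserved = 0) {
  if (processCount < 1) throw new RangeError('processCount must be at least 1');
  return Math.max(0, Math.floor((serverConnectionLimit - reserved) / processCount));
}

console.log(worstCaseConnections(8, 100)); // 800
console.log(maxPoolSizePerProcess(500, 8, 20)); // 60
```

Eight processes with the default pool size of 100 can open up to 800 connections between them, so a server limit of 500 with 20 connections held back would call for a per-process pool of at most 60.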

What is the meaning of I/O intensive in Node.js

I was learning Node.js and also found out that Node.js is best to be used with I/O intensive tasks which confused me a bit. So, after some research I found this statement: "An application that reads and/or writes a large amount of data". So, does it mean that Node.js is best to be used with data, that is, read big data, take necessary data from that and send back to client?
A nodejs application can be architected just fine to include non-I/O things and is not just suited for big data applications (in fact big data has nothing to do with it at all).
A default, simple implementation of Node.js performs best when your application is not CPU intensive and instead spends most of its time doing I/O (input/output) tasks such as reading/writing to a database, reading/writing files, reading/sending network data and so on. It's not about big data, it's about what the server spends most of its time doing.
Surprisingly enough (to some) since a web server's primary job is responding to http requests which are usually requests for data, most web servers spend most of their time fetching things, reading and writing things and sending things which are all I/O tasks. In the node.js design, all these I/O tasks happen asynchronously in a non-blocking fashion and they use events to signal when those operations complete. This is where the phrase "event-driven design" comes from when describing node.js. It so happens that this makes node.js very efficient at handling things that involve primarily I/O. This is what a simple implementation of node.js does best. And, it generally does it better than a purely threaded server design that devotes an OS thread to every currently in-flight I/O operation (the original design for many server frameworks).
If you do have CPU intensive things (major calculations, image processing, heavy crypto operations, etc...) and you do them very often or they take very long, then you will be best served if you put those tasks in a Worker Thread or in another process and communicate back and forth between the main process in node.js and this worker to get that CPU-intensive work done. It used to be that node.js didn't have Worker Threads which made this task a little more complicated where you often had to use one or more additional processes (either via clustering or additional dedicated processes) in order to handle this CPU-intensive work, but now you can use Worker Threads which can be a bit more convenient.
For example, I have a server task that requires a very heavy amount of crypto (performing a billion crypto operations). If I put that in the main node.js thread, that essentially blocks the event loop so my server can't process other requests while that heavy duty crypto operation is running which would ruin the responsiveness of my server.
But, I was able to move the crypto work to a worker thread (actually to several worker threads) and then can crunch away on the crypto while my main thread stays nice and lively to handle other, unrelated incoming requests in a timely fashion.
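Moving crypto work off the main thread might look roughly like the sketch below. It is not the code from the answer above: `deriveKeyInWorker` is a hypothetical helper, the inline worker source passed with `eval: true` is used only to keep the example in one file, and PBKDF2 stands in for whatever heavy crypto operation is actually needed.

```javascript
const { Worker } = require('node:worker_threads');

// Code that runs inside the worker thread. It performs the CPU-heavy key
// derivation synchronously, which is fine there because it only blocks the
// worker's own event loop.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const { pbkdf2Sync } = require('node:crypto');
  const { password, salt, iterations } = workerData;
  const key = pbkdf2Sync(password, salt, iterations, 32, 'sha256');
  parentPort.postMessage(key.toString('hex'));
`;

// Resolves with the derived key as a hex string; the main thread stays free
// to handle other events while the worker computes.
function deriveKeyInWorker(password, salt, iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { password, salt, iterations },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

deriveKeyInWorker('secret', 'salt', 100000).then((hex) => {
  console.log('derived key length in hex characters:', hex.length); // 64
});
```

A real server would more likely keep a few long-lived workers and send them jobs, since starting a new thread for every request has its own cost.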
First of all, Big Data has nothing to do with Node.js.
I/O intensive means that the given task often waits for I/O. The best examples for these are file operations, networking.
If the processor has to regularly wait for data to arrive, the task is said to be I/O intensive.
Node.js's asynchronous nature however makes it really good at I/O intensive tasks, as it can keep doing other work while it waits for the data to arrive asynchronously.
For example, if you have 10 clients connected to the server and one of them requests data or a task that is heavy to process, the server should not get stuck waiting until this task is finished, as that would mean longer response times and a bad user experience for the other 9 clients. Rather, the server should let the other 9 clients keep making requests, and send each response back to its client when the respective task finishes.
PS: You can study about Event loop in Node.js
What Node.js is great at is serving as the middle layer between clients and data sources, i.e. the inputs and outputs.
The reason Node.js is great at this is in the non-blocking event-driven approach it takes.
For example, when you make a request to a Node.js app that asks for some data from a database, Node.js will request that data and immediately return to other requests without being blocked by the database request.
Once the database sends the data back, Node.js triggers the callback (or resolves the promise) with that data and continues onwards.
There's no race condition between these input and output events because their synchronization is done in a single threaded mechanism called the Event Loop. Only one event gets processed at a time.
We can think of the Event Loop as a single-seat rollercoaster ride in an amusement park that has many lines of people waiting to go on the ride, one by one. When you get to go depends on when you got in a line, how important you are or if a friend saved you a spot but nevertheless only one person at a time will be able to partake.
This non-blocking event-driven approach allows Node.js to very efficiently react to input and output events and process many read/write operations because it's not really doing much processing, the CPU work is quite low. It's just serving as the middle layer between you and the data.
On the other hand, if these events lead to some intense CPU operations, Node.js used to perform quite poorly because the Event Loop can process only one event at a time.
To use the rollercoaster analogy from above, a CPU-intensive task would be as if one person is taking a really long ride while all others have to wait for them to be done.
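The long ride in the analogy can be reproduced in a few lines. This is a minimal sketch in which a busy-wait loop stands in for CPU-intensive work; `measureTimerDelay` is a hypothetical helper written for this illustration.

```javascript
const { performance } = require('node:perf_hooks');

// Measures how late a 0 ms timer fires when synchronous work hogs the thread.
function measureTimerDelay(busyMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);

    // CPU-bound stand-in: spin until `busyMs` milliseconds have elapsed.
    while (performance.now() - scheduledAt < busyMs) {
      // busy-wait; nothing else can run on this thread in the meantime
    }
  });
}

measureTimerDelay(50).then((delay) => {
  console.log(`0 ms timer fired after ${delay.toFixed(1)} ms`);
});
```

The timer was due immediately, but its callback cannot run until the synchronous loop returns control to the event loop, so the measured delay is at least the 50 ms of busy work.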
Newer versions of Node.js did get some tools to allow it to do more than one thing at a time (parallelism) by using workers. The trick here is that every worker thread has its own Event Loop, which allows applications to move the intense work into a different thread and run it in parallel with the rest of the application. Do note that this will only actually help if you run on a machine with more than 1 core. If your machine has 1 core, no matter what tool you use, you're gonna have a bad time because nothing can actually be done in parallel on a single core machine.
In the case of I/O-intensive tasks, the majority of the time is spent waiting for network, filesystem and perhaps database I/O to complete. Increasing hard disk speed or network bandwidth improves the overall performance.
In its most basic form Node.js is best suited for this type of computing. All I/O in Node.js is non-blocking and it allows other requests to be served while waiting for a particular read or write to complete.

How exactly does NodeJS handle high concurrent requests?

I was trying to understand how nodejs can achieve higher concurrency compared to thread-based approaches such as Servlet servers.
I already know that in nodejs "everything runs in parallel except your code", and also there is a backend thread pool in libuv to handle File IO or database calls which are usually the bottlenecks.
So here is my question: if nodejs uses thread pool to handle database calls, how can it service more concurrent requests than Servlet servers such as Tomcat, given that Tomcat can also use NIO backed by epoll/kqueue to achieve high concurrency?
For example, if there are 100k concurrent requests coming in and each requires database operations, if these 100k requests are to be serviced concurrently, with nodejs we still end up creating 100k threads which might cause memory exhaustion as Tomcat does. Yes, the 100k threads are just imaginary, because (I know) nodejs has a fixed thread pool and different operations are queued in the event loop, but Tomcat handles things in the same way: we can also configure the thread pool size in Tomcat, and it also queues requests.
Or, am I wrong to say that "nodejs uses backend thread pool in libuv to handle File IO or database calls"? Does nodejs use epoll/kqueue to handle database io without a separate thread?
I was reading this similar question but still didn't get the answer.
if nodejs uses thread pool to handle database calls
That's a wrong assumption. nodejs will typically use networking to talk to a local database running in a different process or on a different host. Networking in node.js does not use threads of any kind - it uses event driven I/O. What the database does for threads is up to the database and independent of node.js since it would be the same no matter which server environment you were using.
node.js does use a thread pool for local disk access, but high scale applications are usually using a database for the crux of their disk access which run in a separate process and have their own I/O optimizations to handle lots of requests. How a given database does it is up to that implementation, but it will not be using a nodejs thread per request.
I was trying to understand how nodejs can achieve higher concurrency compared to thread-based approaches such as Servlet servers.
The general concept is that a properly written server app in node.js uses async I/O for all I/O (except perhaps startup code that only runs during server startup). This means that it can have a lot of requests in-flight at the same time with only a single Javascript thread while most of them are waiting on some type of I/O. If you're going to have a lot of requests in-flight at the same time, it can be a lot more efficient for the system to do it the node.js way of a single thread where all the requests are cooperatively switched vs. using OS threads where every thread has OS overhead associated with it and every pre-emptive thread switch has OS and CPU overhead associated with it.
In node.js, there is no pre-emptive switching between the active requests. Only one runs at a time and it runs until it either finishes or hits an asynchronous operation and has nothing else to do until that async I/O operation completes. At that point, the JS engine goes back to the event queue and picks out an event (probably for one of the other requests). This type of cooperative switching can be significantly faster and more efficient than OS-level threads. There is sometimes a programming cost, in that a node.js developer has to code with async I/O in order to take advantage of this. That has a learning curve, both for writing good, clean code with proper error handling and for debugging it.
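The run-until-it-awaits behaviour can be observed directly. In this minimal sketch, `handleRequest` and `demo` are hypothetical names, and `setImmediate` stands in for an asynchronous I/O operation.

```javascript
// Each handler runs without interruption until it reaches an `await`.
async function handleRequest(name, log) {
  log.push(`${name}: start`);
  log.push(`${name}: still running`); // never interleaved with another handler
  await new Promise((resolve) => setImmediate(resolve)); // simulated async I/O
  log.push(`${name}: resumed`);
}

async function demo() {
  const log = [];
  await Promise.all([handleRequest('A', log), handleRequest('B', log)]);
  return log;
}

demo().then((log) => console.log(log.join('\n')));
```

Handler A logs both of its first two lines before B logs anything, because nothing can take the thread away from A until it awaits. Only then does B start, and both resume later when their simulated I/O completes.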
For example, if there are 100k concurrent requests coming in and each requires database operations, if these 100k requests are to be serviced concurrently, with nodejs we still end up creating 100k threads which might cause memory exhaustion as Tomcat does.
No, you will not be creating 100k threads. A node.js database interface layer that interfaces between node.js and the actual database code in another process or on another host may be written entirely in node.js (using TCP networking to talk to the database) and introduce no new threads at all or it may have some native code and use a small number of threads for its own native code operations, but it will likely be a small number of threads and nothing even close to one per request.
Or, am I wrong to say that "nodejs uses backend thread pool in libuv to handle File IO or database calls"? Does nodejs use epoll/kqueue to handle database io without a separate thread?
For file I/O, yes it uses a thread pool in libuv. For database calls, no - While the details depend entirely upon the database implementation, usually there is not a thread per database call. The database is typically in another process and the nodejs interface library for the DB either directly uses nodejs TCP to talk to the database (which uses no threads) or it has its own native code add-on that talks to the database which probably uses a small number of threads for its work, but typically not a thread per request.

Controlling the flow of requests without dropping them - NodeJS

I have a simple nodejs webserver running, it:
Accepts requests
Spawns separate thread to perform background processing
Background thread returns results
App responds to client
Using Apache benchmark "ab -r -n 100 -c 10", performing 100 requests with 10 at a time.
Average response time of 5.6 seconds.
My logic for using nodejs is that is typically quite resource efficient, especially when the bulk of the work is being done by another process. Seems like the most lightweight webserver option for this scenario.
The Problem
With 10 concurrent requests my CPU was maxed out, which is no surprise since there is CPU intensive work going on the background.
Scaling horizontally is an easy thing to do, although I want to make the most out of each server for obvious reasons.
So how, with nodejs (either raw or with some framework), can one keep that under control so as not to overload the CPU?
Potential Approach?
Could I accept the request, store it in a db or some persistent storage, and have a separate process that uses an async library to process x at a time?
In your potential approach, you're basically describing a queue. You can store incoming messages (jobs) there and have each process get one job at a time, only getting the next one when processing of the previous job has finished. You could spawn a number of processes working in parallel, for example as many as there are cores in your system. Spawning more won't help performance, because multiple processes sharing a core will just run slower. Keeping one core free might be preferred to keep the system responsive for administrative tasks.
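The scheduling pattern described above (a fixed number of workers, each taking the next job only after finishing the previous one) can be sketched in-process as follows. `runQueue` is a hypothetical helper and the jobs here are simulated with timers. For real CPU-bound jobs, each worker would hand its job to a child process or worker thread, and the queue would live in persistent storage as suggested.

```javascript
const os = require('node:os');

// Runs `jobs` (functions returning promises) with `workerCount` workers, each
// taking the next job only after finishing its previous one.
async function runQueue(jobs, workerCount) {
  const queue = [...jobs];
  const results = [];
  let running = 0;
  let peakRunning = 0;

  async function worker() {
    while (queue.length > 0) {
      const job = queue.shift();
      running += 1;
      peakRunning = Math.max(peakRunning, running);
      try {
        results.push(await job());
      } finally {
        running -= 1;
      }
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return { results, peakRunning };
}

// Leave one core free, as suggested above, but always keep at least one worker.
const workerCount = Math.max(1, os.cpus().length - 1);

// Simulated jobs: each resolves with its own index after about 2 ms.
const jobs = Array.from({ length: 20 }, (_, i) => () =>
  new Promise((resolve) => setTimeout(() => resolve(i), 2)));

runQueue(jobs, workerCount).then(({ results, peakRunning }) => {
  console.log(`finished ${results.length} jobs, at most ${peakRunning} at a time`);
});
```

No request is dropped in this scheme: jobs that cannot start yet simply stay in the queue until a worker is free.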
Many different queues exist. A node-based one using redis for persistence that seems to be well supported is Kue (I have no personal experience using it). I found a tutorial for building an implementation with Kue here. Depending on the software your environment is running in though, another choice might make more sense.
Good luck and have fun!

Resources