How to process a huge array of objects in Node.js

I want to process an array of around 100,000 objects without putting too much load on the CPU. I researched streams and stumbled upon highland.js, but I am unable to make it work.
I also tried using promises and processing in chunks, but it still puts a lot of load on the CPU. The program can be slow if needed, but it should not overload the CPU.

Node.js runs your JavaScript single-threaded, so if you want your server to stay maximally responsive to incoming requests, you need to remove any CPU-intensive code from the main HTTP server process. This means doing the CPU-intensive work in some other process or thread.
There are a bunch of different approaches to doing this:
1. Use the child_process module to launch another Node.js app that is purpose-built for doing your CPU-intensive work.
2. Cluster your app so that you have N different processes that can each do both CPU-intensive work and handle requests.
3. Create a work queue and a number of worker processes that handle the CPU-intensive work.
4. Use the newer Worker Threads to move CPU-intensive work to separate Node.js threads (stable and non-experimental in Node v12+).
If you don't do this CPU-intensive work very often, then #1 is probably simplest.
If you need to scale for other reasons (like handling lots of incoming requests) and you don't do the CPU-intensive work very often, then #2 is a good fit.
If you do the CPU-intensive work regularly, want incoming request processing to always have the highest priority, and are willing to let the CPU-intensive work take longer, then #3 (work queue) or #4 (threads) is probably best, and you can tune the number of workers to optimize your result.
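For example, here is a minimal sketch of option #4 using the built-in worker_threads module, runnable as a single file. The per-item transform is a placeholder for your real work:

const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // main thread: hand the array to a worker so the heavy loop
  // never blocks the event loop
  const items = Array.from({ length: 100000 }, (_, i) => ({ i }));
  const worker = new Worker(__filename, { workerData: items });
  worker.once('message', (result) => console.log('processed', result.length, 'items'));
  worker.once('error', (err) => console.error(err));
} else {
  // worker thread: the CPU-intensive processing happens here
  const result = workerData.map((item) => item); // placeholder for real per-item work
  parentPort.postMessage(result);
}

Note that workerData is copied to the worker (structured clone), so for very large arrays you may want to look at transferable objects such as ArrayBuffer to avoid the copy cost.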

Related

What is the meaning of I/O intensive in Node.js

I was learning Node.js and found out that Node.js is best used for I/O-intensive tasks, which confused me a bit. So, after some research I found this statement: "An application that reads and/or writes a large amount of data". Does that mean Node.js is best used with data, that is, reading big data, taking the necessary parts from it, and sending them back to the client?
A Node.js application can be architected just fine to include non-I/O things, and it is not only suited for big-data applications (in fact, big data has nothing to do with it at all).
A default, simple implementation of Node.js performs best when your application is not CPU-intensive and instead spends most of its time doing I/O (input/output) tasks such as reading/writing a database, reading/writing files, reading/sending network data and so on. It's not about big data; it's about what the server spends most of its time doing.
Surprisingly enough (to some) since a web server's primary job is responding to http requests which are usually requests for data, most web servers spend most of their time fetching things, reading and writing things and sending things which are all I/O tasks. In the node.js design, all these I/O tasks happen asynchronously in a non-blocking fashion and they use events to signal when those operations complete. This is where the phrase "event-driven design" comes from when describing node.js. It so happens that this makes node.js very efficient at handling things that involve primarily I/O. This is what a simple implementation of node.js does best. And, it generally does it better than a purely threaded server design that devotes an OS thread to every currently in-flight I/O operation (the original design for many server frameworks).
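As a minimal illustration of that non-blocking, event-driven style (reading this script's own file just to have something to read):

const fs = require('fs');

fs.readFile(__filename, 'utf8', (err, contents) => {
  // this callback is the "event" that fires when the read completes
  if (err) throw err;
  console.log(`read finished: ${contents.length} characters`);
});

console.log('read started; the event loop keeps going'); // prints first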
If you do have CPU-intensive things (major calculations, image processing, heavy crypto operations, etc.) and you do them very often or they take very long, then you will be best served by putting those tasks in a Worker Thread or in another process and communicating back and forth between the main Node.js process and this worker to get that CPU-intensive work done. Node.js didn't always have Worker Threads, which made this a little more complicated: you often had to use one or more additional processes (either via clustering or additional dedicated processes) to handle the CPU-intensive work. Now you can use Worker Threads, which can be a bit more convenient.
For example, I have a server task that requires a very heavy amount of crypto (performing a billion crypto operations). If I put that in the main node.js thread, that essentially blocks the event loop so my server can't process other requests while that heavy duty crypto operation is running which would ruin the responsiveness of my server.
But, I was able to move the crypto work to a worker thread (actually to several worker threads) and then can crunch away on the crypto while my main thread stays nice and lively to handle other, unrelated incoming requests in a timely fashion.
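A rough sketch of that fan-out pattern (the worker file name and the chunking strategy here are illustrative):

const { Worker } = require('worker_threads');
const os = require('os');

// run one chunk of work in its own worker thread
function runChunk(chunk) {
  return new Promise((resolve, reject) => {
    const w = new Worker('./crypto-worker.js', { workerData: chunk });
    w.once('message', resolve);
    w.once('error', reject);
  });
}

// split the items across one worker per CPU core and merge the results
async function runAll(items) {
  const cores = os.cpus().length;
  const size = Math.ceil(items.length / cores);
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  const results = await Promise.all(chunks.map(runChunk));
  return results.flat(); // the main thread stayed responsive the whole time
}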
First of all, Big Data has nothing to do with Node.js.
I/O-intensive means that the given task often waits for I/O; the best examples are file operations and networking.
If the processor regularly has to wait for data to arrive, the task is said to be I/O-intensive.
Node.js's asynchronous nature, however, makes it really good at I/O-intensive tasks, as it can keep doing other work while it waits for the data to arrive.
For example, if you have 10 clients connected to the server and one of them requests data or a task that is heavy to process, the server should not get stuck waiting until that task is finished, as that would mean longer response times and a bad user experience for the other 9 clients. Rather, the server should keep serving the other 9 clients' requests, and send each response back as the corresponding task finishes.
PS: You can read up on the event loop in Node.js.
What Node.js is great at is serving as the middle layer between clients and data sources, i.e. the inputs and outputs.
The reason Node.js is great at this is in the non-blocking event-driven approach it takes.
For example, when you make a request to a Node.js app that asks for some data from a database, Node.js will request that data and immediately return to other requests without being blocked by the database request.
Once the database sends the data back, Node.js triggers the callback (or resolves the promise) with that data and continues onwards.
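A small sketch of that flow, with the database call simulated by a timer as a stand-in for a real driver that returns a promise:

const http = require('http');

function fakeQuery() {
  // simulates a database round-trip; a real driver would also return a promise
  return new Promise((resolve) => setTimeout(() => resolve({ ok: true }), 50));
}

http.createServer(async (req, res) => {
  const data = await fakeQuery(); // non-blocking: other requests are served meanwhile
  res.end(JSON.stringify(data));
}).listen(3000);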
There's no race condition between these input and output events because their synchronization is done by a single-threaded mechanism called the Event Loop. Only one event gets processed at a time.
We can think of the Event Loop as a single-seat rollercoaster ride in an amusement park that has many lines of people waiting to go on the ride, one by one. When you get to go depends on when you got in a line, how important you are or if a friend saved you a spot but nevertheless only one person at a time will be able to partake.
This non-blocking, event-driven approach allows Node.js to react very efficiently to input and output events and to process many read/write operations, because it's not really doing much processing itself; the CPU work is quite low. It's just serving as the middle layer between you and the data.
On the other hand, if these events lead to some intense CPU operations, Node.js used to perform quite poorly because the Event Loop can process only one event at a time.
To use the rollercoaster analogy from above, a CPU-intensive task would be as if one person is taking a really long ride while all others have to wait for them to be done.
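Concretely, here is a tiny self-contained demonstration of one CPU-bound task hogging the ride:

setTimeout(() => console.log('timer fired'), 0);

const start = Date.now();
while (Date.now() - start < 2000) {
  // CPU-bound busy loop: the event loop cannot run the timer callback
}
console.log('busy loop done'); // prints first; 'timer fired' only appears after ~2s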
Newer versions of Node.js did get some tools that allow it to do more than one thing at a time (parallelism) by using workers. The trick here is that each worker has its own Event Loop, which allows an application to move the intense work into a different thread and run it in parallel with the rest of the application. Do note that this only actually helps if you run on a machine with more than one core. If your machine has one core, no matter what tool you use, you're going to have a bad time, because nothing can actually be done in parallel on a single-core machine.
In the case of I/O-intensive tasks, the majority of the time is spent waiting for network, filesystem, and perhaps database I/O to complete; faster disks or network connections improve the overall performance.
In its most basic form, Node.js is best suited for this type of computing. All I/O in Node.js is non-blocking, which allows other requests to be served while waiting for a particular read or write to complete.

Does Node.js require a multi-core VPS

I want to develop a website with Nuxt.js or Next.js on a single-core 2.4GHz CPU with 1GB RAM.
Can my website run fast as a start?
Roughly how many requests per second can it handle?
Whether a Node.js application benefits from multiple cores is application-dependent.
Generally, if the child_process or cluster modules are not involved, there is no need for multiple cores on your system, because Node.js will only use one core: the request handler always runs on the same event loop, which runs on a single thread.
How to achieve concurrency and high throughput:
Because JavaScript execution in Node.js is single-threaded, a good rule of thumb for keeping your Node server speedy is to avoid blocking the event loop. You can read about this in the official documentation referenced below.
Simple Illustration:
Consider a case where each request to a web server takes 50ms to complete and 45ms of that 50ms is database I/O that can be done asynchronously.
Choosing non-blocking asynchronous operations frees up that 45ms per request to handle other requests.
This is a significant difference in your application capacity and processing speed just by choosing to use non-blocking methods instead of blocking methods.
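To put rough numbers on it (a back-of-the-envelope sketch that ignores all other overhead): with blocking calls, each request occupies the thread for the full 50ms, capping a single process at about 1000 / 50 = 20 requests per second. With non-blocking calls, only the ~5ms of actual CPU work occupies the event loop, so the same process can theoretically approach 1000 / 5 = 200 requests per second.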
Reference:
https://nodejs.org/en/docs/guides/dont-block-the-event-loop/
https://nodejs.org/en/docs/guides/blocking-vs-non-blocking/
I hope this helps.

Node.js thread pool and core usage

I've read tons of articles and Stack Overflow questions, and I saw a lot of information about the thread pool, but nothing about physical CPU core usage. I believe this question is not a duplicate.
Given that I have a quad-core computer and a libuv thread pool size of 4, will Node.js utilize all 4 of those cores when processing lots of I/O requests (maybe thousands or more)?
I'm also curious which I/O requests use the thread pool; no one gives a clear and full list. I know that the Node.js event loop is single-threaded but uses a thread pool to handle I/O such as disk and database access.
I'm also curious which I/O requests use the thread pool.
Disk I/O uses the thread pool.
Network I/O is async from the beginning and does not use threads.
With disk I/O, the individual disk I/O calls still present to JavaScript as non-blocking and asynchronous, even though they use threads in their native-code implementation. When the process has more disk I/O calls in flight than the size of the thread pool, the extra calls are queued; when one of the threads frees up, the next disk I/O call in the queue runs on that now-available thread. Since the JavaScript for disk I/O is all non-blocking and assumes a completion callback will be called sometime in the future, queuing requests while the thread pool is busy just means it takes longer to get to the later I/O requests; the JavaScript programming interface is otherwise not affected.
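For illustration, the libuv pool defaults to 4 threads, and its size can be raised with the UV_THREADPOOL_SIZE environment variable. It must be set before the pool is first used, so setting it at the very top of the main script (or in the shell before launching node) works:

process.env.UV_THREADPOOL_SIZE = '8'; // must run before any thread-pool work

const fs = require('fs');

// fire 8 reads at once; with the default pool of 4, only 4 would run
// concurrently and the rest would wait in the queue
for (let i = 0; i < 8; i++) {
  fs.readFile(__filename, () => console.log(`read ${i} done`));
}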
Given that I have a quad-core computer and a libuv thread pool size of 4, will Node.js utilize all 4 of those cores when processing lots of I/O requests (maybe thousands or more)?
This is not up to node.js, which makes it hard to answer definitively. The first referenced article below says that on Linux the I/O thread pool will use multiple cores, and it offers a small demo app that shows that.
This is up to the specific OS implementation and the thread scheduler that it uses. node.js just happily creates the threads and uses them and the OS then decides how to make use of the CPU given what it is being asked to do overall on the system. Since threads in the same process often have to communicate with one another in some way, using a separate CPU for different threads in the same process is a lot more complicated.
There are a couple of node.js design patterns that are guaranteed to take advantage of multiple cores (in any modern OS):
Cluster your app and create as many clustered processes as you have processor cores (a minimal sketch follows this list). This also has the advantage that each clustered process has its own I/O thread pool that can work independently, and each can execute its own JavaScript independently. With only one node.js process, no matter how many cores you have, you never get more than one thread of JavaScript execution (this is why node.js is referred to as single-threaded, even though it does use threads in its library implementations). With clustering, you get independent JavaScript execution for each clustered server process.
For individual tasks that might be CPU-intensive (for example, image processing), you can create a work queue and a pool of child worker processes that you hand work off to. This has some benefits in common with clustering, but it is more special-purpose, for when you know exactly where the CPU bottleneck is and you want to attack it specifically.
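A minimal clustering sketch using the built-in cluster module (one worker process per core; the port and response body are illustrative):

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isPrimary) { // cluster.isMaster on Node versions before 16
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork(); // each fork is a full Node.js process with its own event loop
  }
} else {
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(3000); // the cluster module shares this port across the workers
}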
Other related answers/articles:
how libuv threads in nodejs utilize multi core cpu
Node.js on multi-core machines
Taking Advantage of Multi-Processor Environments in node.js
When is the thread pool used?

How many cluster processes of a Node.js app can be initialized per core?

I googled how many cluster processes can be initialized per core, and all the articles I read said that one process per core is good enough, but I haven't found a good explanation of why.
I have an app that does a lot of I/O and only a few computations, like pushing to/removing from a queue and then scheduling a task against the database. For this app I initialized 5 cluster processes per core, and it increased the number of requests I could handle. However, the load average on the server is very high, sometimes around 30.
The point is: what are the side effects of using this approach (more than one process per core)?
A single core can only run one process at a time.
Most processes that run on your computer are mostly idle (they sit around and wait for something to happen); that's how you can have dozens, or hundreds, of processes running on your computer without much of a problem.
However, active processes that do a lot of stuff (like yours do; and in this case, a lot of I/O also counts) keep a CPU core pretty busy. So it's useless to start more than one such process per core, because they will all be contending for time slices on that core.
That's also why you get a high load average, which is an indication of how many processes are either using the CPU, or are waiting to use the CPU.

Node.js asynchronous call handling and multi-core scaling

It is known that node.js internally handles asynchronous calls and the programmer never needs to care about what is going on behind the scenes. As far as I know, even though everyone says that node.js is single-threaded, internally the V8/libuv libraries spawn threads to handle the execution of the async parts of the program.
My question is: if those threads are spawned, do they scale across multi-core architectures? I mean, if I have a CPU with 4 cores and my main Node.js thread is running on one of those cores, will those internally spawned threads spread to the other three cores rather than remaining on the same one? Theoretically they should scale, but since everyone says node.js out of the box does not use multiple cores, I thought this was worth asking.
Node.js runs your JavaScript in one thread per process. To make it scale out to multiple cores, you need to run multiple Node.js processes, one per core, and split request traffic between them (the cluster module shown above is one way to do this).
