Threads in an event-driven vs. non-event-driven web server - node.js

The following two diagrams show my understanding of how threads work in an event-driven web server (like Node.js + JavaScript) compared to a non-event-driven web server (like IIS + C#).
From the diagrams it is easy to tell that on a traditional web server the number of threads used to perform 3 long-running operations is larger than on an event-driven web server (3 vs. 1).
I think I got the "traditional web server" count correct (3), but I wonder about the event-driven one (1). Here are my questions:
Is it correct to assume that only one thread was used in the event-driven scenario? That can't be right; something must have been created to handle the I/O tasks. Right?
How did the evented server handle the I/O? Let's say the I/O was a read from a database. I suspect the web server had to create a thread to hand off the job of connecting to the database. Right?
If the event-driven web server indeed created threads to handle the I/O, where is the gain?
A possible explanation for my confusion could be that in both scenarios, traditional and event-driven, three separate threads were indeed created to handle the I/O (not shown in the pictures), but the difference is really in the number of threads on the web server per se, not in the I/O threads. Is that accurate?

Node may use threads for IO. The JS code runs in a single thread, but all the IO requests run in parallel threads. If you want some JS code to run in parallel threads, use threads-a-gogo or some other package that works around that limitation.
Same as 1.: threads are created by Node for IO operations.
You don't have to handle threading unless you want to. It's easier to develop. At least that's my point of view.
A Node application can be coded to run like any other web server. Typically, JS code runs in a single thread, but there are ways to make it behave differently.
Personally, I recommend threads-a-gogo (the package name isn't that revealing, but it is easy to use) if you want to experiment with threads. It's faster.
Node also supports multiple processes; you may run a completely separate process if you also want to try that out.
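
As a rough sketch of points 1 and 2 (assuming a Unix-like system where /etc/hosts exists): the single JS thread dispatches both reads and moves on, the underlying I/O proceeds off the main thread, and the callbacks fire whenever each read completes.

const fs = require('fs');

console.log('start');

fs.readFile('/etc/hosts', 'utf8', function (err, data) {
  if (err) throw err;
  console.log('first read done:', data.length, 'characters');
});

fs.readFile('/etc/hosts', 'utf8', function (err, data) {
  if (err) throw err;
  console.log('second read done:', data.length, 'characters');
});

console.log('both reads dispatched; the JS thread is already free');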

The best way to picture NodeJS is like a furious squirrel (i.e. your thread) running in a wheel with an infinite number of pigeons (your I/O) available to pass messages around.
I/O in node is "free". Your squirrel works to set up the connection and send the pigeon off, then can go on to do other things while the pigeon retrieves the data, only dealing with the data when the pigeon returns.
If you write bad code, you can end up having the squirrel waiting for each pigeon.
So always write non-blocking I/O code.
If you can, encourage your pigeons to promise to come back ;)
Promises and generators are probably the best approach you can take to this.
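
A minimal sketch of the promise style (assuming a Node version that ships fs.promises): the await suspends only this function while the pigeon is away, not the whole process.

const fs = require('fs').promises;

async function main() {
  // the read is dispatched and this function is suspended; the event loop keeps spinning
  const data = await fs.readFile('/etc/hosts', 'utf8');
  console.log('the pigeon is back with', data.length, 'characters');
}

main().catch(console.error);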
However, you can always use Node's cluster module to establish a master squirrel that will procreate child squirrels based on the number of CPUs the master squirrel can find, and dole out the work among them.
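
A sketch of that pattern with the built-in cluster module (port 8000 is an arbitrary choice): the master forks one worker per core, and each worker runs its own event loop.

const cluster = require('cluster');
const http = require('http');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // the master squirrel procreates one child squirrel per core
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  // each child squirrel runs its own wheel (event loop)
  http.createServer(function (req, res) {
    res.end('handled by worker ' + process.pid + '\n');
  }).listen(8000);
}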
Hope this helps and note the complete lack of a car analogy.

Related

What is the meaning of I/O intensive in Node.js

I was learning Node.js and found out that Node.js is best used for I/O-intensive tasks, which confused me a bit. So, after some research I found this statement: "An application that reads and/or writes a large amount of data". Does that mean Node.js is best used with data, that is, reading big data, taking the necessary parts from it, and sending them back to the client?
A nodejs application can be architected just fine to include non-I/O things, and it is not just suited for big data applications (in fact, big data has nothing to do with it at all).
A default, simple implementation of Node.js performs best when your application is not CPU-intensive and instead spends most of its time doing I/O (input/output) tasks such as reading/writing to a database, reading/writing files, reading/sending network data and so on. It's not about big data; it's about what the server spends most of its time doing.
Surprisingly enough (to some), since a web server's primary job is responding to HTTP requests, which are usually requests for data, most web servers spend most of their time fetching things, reading and writing things and sending things, which are all I/O tasks. In the node.js design, all these I/O tasks happen asynchronously in a non-blocking fashion, and they use events to signal when those operations complete. This is where the phrase "event-driven design" comes from when describing node.js. It so happens that this makes node.js very efficient at handling things that involve primarily I/O. This is what a simple implementation of node.js does best. And it generally does it better than a purely threaded server design that devotes an OS thread to every currently in-flight I/O operation (the original design for many server frameworks).
If you do have CPU-intensive things (major calculations, image processing, heavy crypto operations, etc...) and you do them very often or they take very long, then you will be best served by putting those tasks in a Worker Thread or in another process and communicating back and forth between the main node.js process and this worker to get that CPU-intensive work done. It used to be that node.js didn't have Worker Threads, which made this a little more complicated: you often had to use one or more additional processes (either via clustering or additional dedicated processes) to handle the CPU-intensive work. Now you can use Worker Threads, which can be a bit more convenient.
For example, I have a server task that requires a very heavy amount of crypto (performing a billion crypto operations). If I put that in the main node.js thread, it essentially blocks the event loop, so my server can't process other requests while that heavy-duty crypto operation is running, which would ruin the responsiveness of my server.
But I was able to move the crypto work to a worker thread (actually to several worker threads), and now I can crunch away on the crypto while my main thread stays nice and lively, handling other, unrelated incoming requests in a timely fashion.
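
A sketch of that setup with the worker_threads module (the file names and the SHA-256 loop are illustrative stand-ins, not the author's actual crypto workload):

// main.js
const { Worker } = require('worker_threads');

const worker = new Worker('./crypto-worker.js', { workerData: { rounds: 1000000 } });
worker.on('message', function (digest) {
  console.log('crypto done:', digest); // the main event loop stayed responsive meanwhile
});
worker.on('error', console.error);

// crypto-worker.js
const { parentPort, workerData } = require('worker_threads');
const crypto = require('crypto');

let digest = 'seed';
for (let i = 0; i < workerData.rounds; i++) {
  // CPU-bound loop: blocks this worker's thread, not the main one
  digest = crypto.createHash('sha256').update(digest).digest('hex');
}
parentPort.postMessage(digest);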
First of all, Big Data has nothing to do with Node.js.
I/O-intensive means that the given task often waits for I/O. The best examples are file operations and networking.
If the processor has to regularly wait for data to arrive, the task is said to be I/O-intensive.
Node.js's asynchronous nature, however, makes it really good at I/O-intensive tasks, as it can keep doing other work while it waits for the data to arrive.
For example, if you have 10 clients connected to the server and one of them requests data or a task that is heavy to process, the server should not get stuck until that task is finished, as that would increase response times for the other 9 clients and make for a bad user experience. Rather, the server should keep serving the other 9 clients' requests, and send the response for the heavy task back once it finishes.
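
A minimal sketch of that scenario (assuming the heavy task is I/O-bound; setTimeout stands in for the wait - a CPU-bound task would need a worker instead): the slow request does not stall the other nine clients.

const http = require('http');

http.createServer(function (req, res) {
  if (req.url === '/slow') {
    // simulated 5-second I/O wait; the event loop stays free during it
    setTimeout(function () {
      res.end('slow task done\n');
    }, 5000);
  } else {
    res.end('fast response\n'); // the other clients are answered immediately
  }
}).listen(3000);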
PS: You can read up on the event loop in Node.js.
What Node.js is great at is serving as the middle layer between clients and data sources, i.e. the inputs and outputs.
The reason Node.js is great at this is in the non-blocking event-driven approach it takes.
For example, when you make a request to a Node.js app that asks for some data from a database, Node.js will request that data and immediately return to other requests without being blocked by the database request.
Once the database sends the data back, Node.js triggers the callback (or resolves the promise) with that data and continues onwards.
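
As a sketch of that flow (db.query here is a hypothetical stand-in for a real database driver, simulated with a timer):

const db = {
  query: function (sql, params, cb) {
    setTimeout(function () { cb(null, [{ id: params[0], name: 'alice' }]); }, 100);
  }
};

db.query('SELECT * FROM users WHERE id = ?', [1], function (err, rows) {
  if (err) throw err;
  console.log('rows arrived:', rows); // runs later, when the "database" answers
});

console.log('query dispatched; Node has already moved on to other requests');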
There's no race condition between these input and output events because their synchronization is done in a single threaded mechanism called the Event Loop. Only one event gets processed at a time.
We can think of the Event Loop as a single-seat rollercoaster ride in an amusement park that has many lines of people waiting to go on the ride, one by one. When you get to go depends on when you got in line, how important you are, or whether a friend saved you a spot, but nevertheless only one person at a time will be able to partake.
This non-blocking, event-driven approach allows Node.js to react very efficiently to input and output events and to process many read/write operations, because it's not really doing much processing; the CPU work is quite low. It's just serving as the middle layer between you and the data.
On the other hand, if these events lead to some intense CPU operations, Node.js used to perform quite poorly because the Event Loop can process only one event at a time.
To use the rollercoaster analogy from above, a CPU-intensive task would be as if one person is taking a really long ride while all others have to wait for them to be done.
Newer versions of Node.js did get some tools that allow it to do more than one thing at a time (parallelism) by using workers. The trick here is that each worker has its own Event Loop, which allows applications to move the intense work into a different thread and run it in parallel with the rest of the application. Do note that this will only actually help if you run on a machine with more than 1 core. If your machine has 1 core, no matter what tool you use, you're gonna have a bad time, because nothing can actually run in parallel on a single-core machine.
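
A tiny sketch of that last caveat, using the built-in os module to count cores:

const os = require('os');
const cores = os.cpus().length;

console.log(cores > 1
  ? 'workers can truly run in parallel across ' + cores + ' cores'
  : 'single core: workers will interleave, nothing runs in parallel');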
In the case of I/O-intensive tasks, the majority of the time is spent waiting for network, filesystem and perhaps database I/O to complete. Increasing hard disk speed or network bandwidth improves the overall performance.
In its most basic form, Node.js is best suited for this type of computing. All I/O in Node.js is non-blocking, which allows other requests to be served while waiting for a particular read or write to complete.

What exactly are the implications of the fact that nodejs is single threaded?

The NodeJS website says the following. Emphasis is mine.
Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.
Even though I love NodeJS, I don't see why it is better for scalable applications compared to existing technologies such as Python, Java or even PHP.
As I understand it, the JavaScript runtime always runs as a single thread on the CPU. The IO, however, probably uses underlying kernel methods which might rely on thread pools provided by the kernel.
So the real questions that need to be answered are:
Because all JS code runs in a single thread, is NodeJS unsuitable for applications where there is little IO and lots of computation?
If I am writing a web application using nodejs and there are 100 open connections, each performing a pure computation requiring 100 ms, will at least one of them take 10 s to finish?
If your machine has 10 cores but you are running just one nodeJS instance, are your other 9 CPUs sitting ducks?
I would appreciate it if you also posted how other technologies perform vis-à-vis NodeJS in these cases.
I haven't done a ton of node, but I have some opinions on this. Please correct me if I am mistaken, SO.
Because all JS code runs in a single thread, is NodeJS unsuitable for applications where there is little IO and lots of computation?
Yeah. Single-threaded means that if you are crunching lots of data hard in your JS code, you are blocking everything else. And that sucks. But this isn't typical for most web applications.
If I am writing a web application using nodejs and there are 100 open connections, each performing a pure computation requiring 100 ms, will at least one of them take 10 s to finish?
Yep: 100 connections × 100 ms each is 10 seconds of CPU time, and since the computations run one after another on the single thread, the last one finishes after 10 s.
If your machine has 10 cores but you are running just one nodeJS instance, are your other 9 CPUs sitting ducks?
That I'm not sure about. The V8 engine might have some optimizations that take advantage of multiple cores, transparently to the programmer. But I doubt it.
The thing is, most of the time a web application isn't calculating. If your app is engineered well, a single request can be responded to very quickly. And if you have to fetch things to do that (db, files, remote services) you shouldn't have to wait for that fetch to return before processing the next request.
So you may have many requests in flight at the same time, in various stages of completion, depending on when their I/O callbacks happen. Even though only one request runs JS code at a time, that code should do what it needs to do quickly, exit the run loop, and await the next event callback.
If your JS can't run quickly, then this model does pose a problem. As you note, things will get hung as the CPU churns. So don't build a node web application that does lots of intense calculation on the fly.
However, you can refactor things to be asynchronous instead. Maybe you have a standalone node script that can do the calculation for you, with a callback when it's done. Your web application can then boot up that script as a child process, tell it to do stuff, and provide a callback to run when it's done. You now have sort of faked threads, in a roundabout way.
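
A sketch of that child-process pattern (web.js and calc.js are hypothetical file names; the sum stands in for a real heavy calculation):

// web.js
const { fork } = require('child_process');

const child = fork('./calc.js');
child.send({ numbers: [1, 2, 3, 4] });   // tell it to do stuff
child.on('message', function (result) {  // the "callback" for when it's done
  console.log('calculation finished:', result);
});

// calc.js
process.on('message', function (msg) {
  const sum = msg.numbers.reduce(function (a, b) { return a + b; }, 0); // the heavy work
  process.send(sum);
  process.exit(0);
});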
In virtually all web application technologies, you do not want to be doing complex and intense calculation on the fly. Even with proper threading, it's a losing battle. Instead you have to strategize. Do the calculations in the background, or at regular intervals on a cron job, outside of the main web application process itself.
The things you point out are flaws in theory, but in practice it really only becomes an issue if you aren't doing it right.
Node.js is single threaded. This means anything that would block the main thread needs to be done outside the main thread.
In practice this just means using callbacks for heavy computations the same way you use callbacks for I/O.
For instance, here's the API for node bcrypt:
var bcrypt = require('bcrypt');

bcrypt.genSalt(10, function(err, salt) {
    bcrypt.hash("B4c0/\/", salt, function(err, hash) {
        // Store hash in your password DB.
    });
});
Which Mozilla Persona uses in production. See their code here.

If nodejs is multithreaded, why should I use the cluster module to utilize a multicore CPU?

If nodejs is multithreaded (see this article), and threads are managed by the OS, which can schedule them on the same core or on another core of a multicore CPU (see this question), then nodejs should automatically utilize a multicore CPU.
So why should I use cluster.fork to spawn separate node processes to utilize multiple cores, as shown in this example in the node docs?
I know that multiple processes have the advantage that when one process dies there is still another process to respond to requests, unlike with threads. What I need to know is whether multiple cores can be utilized just by spawning a process for each core, or whether that is an OS task that I can't control.
It depends.
Work that happens asynchronously and is performed by Node itself, such as IO operations, is multithreaded. Your JavaScript application runs in a single thread.
In my opinion, the only time you need to fire off multiple processes is if the vast majority of your work is done in straight JavaScript. Node was designed around the fact that this is rarely the case, and is built for applications that primarily block on disk and network.
So, if you have a typical Node application where your JavaScript isn't the bulk of the work, then firing off multiple processes will not help you utilize multiple CPUs/cores.
However, if you have a special application where you do lots of work in your main loop, then multiple processes may be for you.
The easiest way to know is to monitor CPU utilization while your application runs. You will have to decide on a per-application basis what is best.
Node is not multi-threaded from the developer's point of view. Threads are used in a very different way than they are used by, for example, Apache's worker MPM.
I believe this answer will clear things up.

"Everything runs in parallel except your code".. wait what?

I am trying to learn Node.js, and these are some of the points that I understand:
Node.js doesn't create a separate process for each request; instead, it is just one process which handles all requests.
It is asynchronous, which means you can attach a callback to a long-lasting process and continue with the rest of your work without waiting for it to finish.
What I really don't understand is the author's point in Understanding node.js - "Everything runs in parallel except your code". I have understood the analogy and the code that explains it, but I still don't get the distinction between "everything" and "your code". I have often heard this about node.js.
Also, people praise node.js for its efficiency, since the memory overhead for one concurrent connection may be as low as 8 KB, but what about CPU load? Does node.js make it way lower than PHP+Apache?
Node.js uses a single thread any time it is running the JavaScript in your application. Tasks that are asynchronous (network, filesystem, etc.) are all handled on separate threads automatically for you. This means that you get much of the usefulness of a multithreaded application without having to worry about all of the trouble that comes with locking resources and what not.
Node is not a tool for every job. It is ideal for applications that are IO bound. For example, if your application required a ton of work to process templates and what not, Node probably isn't for you. If instead you're just shuffling data around, Node can be very effective.
The reason Node is often quoted as being faster than servers like Apache is that it doesn't create a thread, with all of its resources, to handle each request. In Apache, most of the time, the thread handling a request is waiting on network or filesystem data. While it does this, it is wasting resources. With Node, only one thread processes those requests (in your application). Again, this is great for some things, but if you have a lot of processing to do, Node will not be effective, as it can really only handle a single request at a time in these situations.
This video does a pretty good job of explaining: http://www.youtube.com/watch?v=F6k8lTrAE2g&feature=youtube_gdata
Everything runs in parallel except your code.
It means if you do
while(true){}
anywhere in your code, the entire node application will stop. While the code you write executes, nothing else does: requests will not be handled, responses won't be returned, nothing. You have to be extremely careful not to hog the CPU in node.
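
A runnable sketch of exactly that effect: while the loop below spins, the already-due timer callback cannot fire.

setTimeout(function () {
  console.log('delayed: this could not run while the loop was spinning');
}, 0);

const end = Date.now() + 3000;
while (Date.now() < end) {} // hogs the CPU for 3 seconds; the whole app is stopped

console.log('loop done; only now does the event loop get to proceed');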
but what about CPU load?
That completely depends on the nature of your application and the load. If your app is busy, it'll use more cpu.
Imagine a busy intersection with a traffic cop in the middle. When the cop is doing his job properly, hundreds of cars can pass through the intersection in a very fast and efficient way.
If the cop starts receiving and answering SMS messages on his cell while doing traffic, then things might go out of hand really quickly.
The traffic cop is your node.js app, and the time he spends doing SMS is what the author refers to as "your code".
In other words: node.js performance will shine the more you use it as a traffic cop. The more you start using it to do things other than pulling and pushing data (i.e.: sorting a list of numbers, rendering an html template, etc.), the more your capacity to accept and process new connections quickly will suffer.
"Everything" refers to everything else besides your code. For example, the stuff that handles HTTP. Another way to say the same thing is "your code doesn't wait for node.js to do stuff, like send data over TCP, because that's done asynchronously."
To answer your second question: I don't know which has less CPU load; I'm guessing they're similar. Node.js' touted advantage is that the CPU is better utilized due to the aforementioned asynchronicity.

What is different about the way NodeJS handles requests as opposed to a setup like Rails / Passenger?

My understanding is that Node is an 'event-driven' as opposed to a sequentially driven server application. I've come to understand this to mean that for event-driven software, the user is in command: he can create an event at any time, and the server is in a state such that it can respond; whereas with sequential software (like a DOS prompt), the application tells the user when it's 'ok' to respond, and may at any given time be unavailable (due to some other process).
Further, my understanding is that applications like Node and EventMachine use a reactor of sorts: they wait for an 'event' to occur, and using a callback they delegate the task to some other worker. OK... so then, what about Rails & Passenger?
Rails might use a server like NGINX with Passenger to spawn new processes when 'events' are received by the system. Is this not conceptually the same idea? If it is, is it just the processing overhead that really separates the two, where Passenger would potentially need to spawn a new Rails instance while Node is already waiting to handle the request?
Node.js is an event-driven, non-blocking runtime. The key is the non-blocking part. Node doesn't spawn other processes per request. It runs in one thread (this is for starters... you can actually spawn processes now through some modules, but that's another talk).
Anyway, this is different from other typical setups, where you receive a request and a thread is locked until it has an answer. If you assign it to another thread, that thread is still locked...
In node you never lock. You receive a request and the thread continues to receive requests. When a request has been processed, the callback is called.
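
A minimal sketch of that flow (simulateWork is a hypothetical stand-in for a DB call or similar): the thread returns to accepting requests immediately, and each callback fires when its work completes.

const http = require('http');

function simulateWork(cb) {
  setTimeout(function () { cb('done'); }, 200); // stands in for real I/O
}

http.createServer(function (req, res) {
  simulateWork(function (result) { // callback runs when the request is processed
    res.end(result + '\n');
  });
  // execution reaches here immediately; the thread keeps receiving requests
}).listen(8080);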
Hope I made myself understood, and that I used the right terms ;)
Anyway, if you're interested, this video is nice: http://www.youtube.com/watch?v=jo_B4LTHi3I
The non-blocking/evented I/O as jribeiro described is one part of the answer. Ruby applications tend to be written using blocking I/O, and using processes and threads for concurrency.
However, non-blocking and evented I/O are not inherent to Node. You can achieve the same thing in Ruby by using EventMachine, and an in-process evented server like Thin. If you program directly against EventMachine and Thin, then it is architecturally almost the same as Node. That being said, the Ruby ecosystem does not have as many event-friendly libraries and documentation as Node does, so it takes a bit of skill to achieve the same thing in Ruby.
Conversely, the way Phusion Passenger manages processes - i.e. by spawning multiple processes and load balancing requests between them, and supervising processes - is not unique to Ruby. In fact, Phusion Passenger introduced Node.js support recently. The Node.js support was open sourced today.
