Node Background Threads - When Do These Get Created?

I've been doing a fair amount of work with Node lately, trying to build a system which has certain characteristics, one of which is non-blocking / parallelism - a Node strong suit, as I understand it.
What I don't fully understand is when a separate thread is spun off to handle some processing. I'm pretty sure this happens on a function call/callback, but certainly not on all of them.
In my specific case, it's an Express-based app. At app start-up it does several things, including instantiating a RabbitMQ-based "bus", an object with a method which will write to the bus (objA), and an object which will subscribe to the bus and process messages coming across it (objB).
objA will write to the bus inside an express callback
app.put((req, res) => {
  objA.methodWhichWritesToBus();
});
I believe at this point, that objA.methodWhichWritesToBus is executed in a background/worker thread - whatever you call it, not on the main event loop.
Is that the only point at which this sort of thing happens? methodWhichWritesToBus is IO intensive (it calls an Elasticsearch service on another box and brings back tens to hundreds of thousands of records) with lots of chained promises etc., but none of that gets split off, does it?
How about the fact that the object on which the method is called is instantiated outside the Express callback - does that affect the parallelism?
Finally, are there ways to force a method etc. to "run in the background"?
I've been noodling on this and testing it for a while now, but all on one machine, so it's difficult to tell what's going on.
Who can clarify this for me?

Pre-answer: this is a topic best learned by going and reading, doing coding exercises to solidify your understanding, and working with the technology in a significant way. You're not going to "get it" based on a Q&A format. That said...
What I don't fully understand is when a separate thread is spun off to handle some processing.
Never, sort of. "Processing" as in the computation that happens in your javascript program, happens in the main event loop thread. End of story. However, waiting on I/O to come back from the OS is not considered "processing" so there are various queues managed by node and the OS to track pending I/O requests and invoke callbacks when data is ready. There are a handful of threads node uses internally to manage this stuff with the OS, but from your program's perspective, those threads are irrelevant. Your program can ask node to do some IO, then your program keeps running in parallel, and when the I/O is done, node will eventually invoke the callback in the main event loop and you can process the results.
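A minimal sketch of what that looks like in practice (the file name is just a placeholder): the read is requested, your code keeps going, and the callback runs later on the main event loop once the data is ready.
const fs = require('fs');

// Ask node to start the I/O; this call returns immediately.
fs.readFile('data.txt', 'utf8', (err, contents) => {
  // This runs later, on the main event loop, once the OS has delivered the data.
  if (err) throw err;
  console.log('file length:', contents.length);
});

// This line runs before the callback above, while the read is still pending.
console.log('read requested, still running other code');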
I believe at this point, that objA.methodWhichWritesToBus is executed in a background/worker thread - whatever you call it, not on the main event loop.
You call it "asynchronously" and it happens whenever you do IO, including filesystem calls, networking, or child processes. Which is to say, quite a lot.
How about the fact that the object on which the method is called is instantiated outside the Express callback - does that affect the parallelism?
Nope.
Finally, are there ways to force a method etc. to "run in the background"?
Generally I/O is done asynchronously by default, so no, you don't normally need to force anything to run in the background. It's baked into the node design by way of the node core APIs themselves. However, there are ways to defer synchronous processing to a future turn of the event loop using setImmediate, setTimeout, or process.nextTick. I explain these in some detail in my blog post setTimeout and friends.
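Here's a tiny sketch of those three deferral mechanisms side by side; the exact ordering between setTimeout(0) and setImmediate is an implementation detail you shouldn't rely on, but the general behaviour is as commented.
process.nextTick(() => console.log('nextTick: runs right after the current operation completes'));
setImmediate(() => console.log('setImmediate: runs on a following turn of the event loop'));
setTimeout(() => console.log('setTimeout(0): runs once the timer phase sees the delay has elapsed'), 0);

console.log('synchronous code runs first');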
More precisely, all networking is asynchronous. End of story. Specifically, the APIs in node core that are available are all asynchronous, and there's simply no synchronous API available in node. For filesystem IO and child processes, there are both synchronous and asynchronous APIs, but the synchronous APIs must only be used under special limited circumstances, and if you don't know confidently that it's OK in this specific case to make a synchronous IO API call, you should use the asynchronous API so you don't break the lynchpin that makes node perform as it does.
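To make the filesystem point concrete, here is the same read done both ways (a sketch; the path is a placeholder). Prefer the asynchronous form unless you know blocking is acceptable, e.g. in startup code or a one-off script.
const fs = require('fs');

// Asynchronous: the event loop stays free while the OS does the work.
fs.readFile('config.json', 'utf8', (err, data) => {
  if (err) throw err;
  console.log('async read finished');
});

// Synchronous: blocks the one and only JavaScript thread until the read completes.
// Acceptable only in limited cases (startup, CLI scripts), never per-request in a server.
const config = fs.readFileSync('config.json', 'utf8');
console.log('sync read finished', config.length);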

Related

Can the same line of Node.js code run at the same time?

I'm pretty new to node and am trying to setup an express server. The server will handle various requests and if any of them fail call a common failure function that will send e-mail. If I was doing this in something like Java I'd likely use something like a synchronized block and a boolean to allow the first entrance into the code to send the mail.
Is there anything like a synchronized block in Node? I believe node is single threaded and has a few helper threads to handle asynchronous/callback code. Is it at all possible that the same line of code could run at exactly the same time in Node?
Thanks!
Can the same line of Node.js code run at the same time? Is it at all possible that the same line of code could run at exactly the same time in Node?
No, it is not. Your Javascript in node.js is entirely single threaded. An event is pulled from the event queue. That calls a callback associated with that event. That callback runs until it returns. No other events can be processed until that first one returns. When it returns, the interpreter pulls the next event from the event queue and then calls the callback associated with it.
This does not mean that there are no concurrency issues in node.js. There can be, but they are not caused by code running at the same physical time and creating conflicting access to shared variables (as can happen in threaded languages like Java). Concurrency issues can be caused by the asynchronous nature of I/O in node.js. In the asynchronous case, you call an asynchronous function and pass it a callback (or expect a promise in return). Your code then continues on and returns to the interpreter. Some time later an event will occur inside of node.js native code that will add something to the event queue. When the interpreter is free from running other Javascript, it will process that event and then call your callback, which will cause more of your code to run.
While all this is "in process", other events are free to run and other parts of your Javascript can run. So, the exposure to concurrency issues comes not from simultaneous running of two pieces of code, but from one piece of your code running while another piece of your code is waiting for a callback to occur. Both pieces of code are "in process". They are not "running" at the same time, but both operations are waiting for something else to occur in order to complete. If these two operations access variables in ways that can conflict with each other, then you can have a concurrency issue in Javascript. Because none of this is pre-emptive (like threads in Java), it's all very predictable and is much easier to design for.
Is there anything like a synchronized block in Node?
No, there is not. It is simply not needed in your Javascript code. If you wanted to protect something from some other asynchronous operation modifying it while your own asynchronous operation was waiting to complete, you can use simple flags or variables in your code. Because there's no preemption, a simple flag will work just fine.
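For instance, a sketch of the "send the failure e-mail only once" case using nothing but an ordinary variable (sendAlertEmail here is a hypothetical stand-in for whatever mailer you use):
let emailSent = false; // plain variable; no lock needed because nothing preempts this code

function reportFailure(err) {
  if (emailSent) return; // every later caller sees the flag already set
  emailSent = true;      // set it before starting the async send, so two requests can't both pass the check
  sendAlertEmail(err, (mailErr) => {
    if (mailErr) console.error('could not send the alert e-mail', mailErr);
  });
}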
I believe node is single threaded and has a few helper threads to handle asynchronous/callback code.
Node.js runs your Javascript as single threaded. Internally in its own native code, it does use threads in order to do its work. For example, the asynchronous file system access code internal to node.js uses threads for disk I/O. But these threads are only internal and the result of these threads is not to call Javascript directly, but to insert events in the event queue and all your Javascript is serialized through the event queue. Pull event from event queue, run callback associated with the event. Wait for that callback to return. Pull next event from the event queue, repeat...
The server will handle various requests and if any of them fail call a common failure function that will send e-mail. If I was doing this in something like Java I'd likely use something like a synchronized block and a boolean to allow the first entrance into the code to send the mail.
We'd really have to see what your code looks like to understand what exact problem you're trying to solve. I'd guess that you can just use a simple boolean (regular variable) in node.js, but we'd have to see your code to really understand what you're doing.

Channels in Go, and emitters in node.js?

Does Go have an equivalent of node.js' "emitter"?
I'm teaching myself Go by porting over a node.js library I wrote. In the node version, the library emits an event once something happens (e.g. it listens on UDP port 1234 and when "ABC" is received, "abcreceived" is emitted so the calling code can respond as necessary, e.g. by sending back "DEF").
I've seen channels in Go (and am currently reading up on them), but as I'm still new to this language, I don't know if (or how, for that matter) that can be used to communicate with whatever code is using my library.
I've also seen https://github.com/chuckpreslar/emission, but am not sure if this is acceptable, or if there's a better ("Best practice") way of doing things.
Go and Node.js are very different. Node.js supports concurrency only via callbacks. There might be various ways of dressing them up, but they're fundamentally callbacks.
In Node.js, there is no parallelism; Node.js has a single-threaded runtime. When Node.js async is used to achieve what is called 'parallel' execution, it isn't parallel in the sense used in Go, but concurrent.
Concurrency is not parallelism in the Go world.
Go has explicit concurrency based on Communicating Sequential Processes (CSP), a mathematical basis conceived by Tony Hoare at Oxford. The runtime interleaves cooperating processes called goroutines by time-slicing them onto the available CPU cores. Within each goroutine, the code is single threaded, so is easy to write. In the simple case, no data is shared between goroutines; instead messages pass between them along channels. In this way, there is no need for callbacks.
When goroutines get blocked waiting for I/O, that's OK because they don't use any CPU time until they're unblocked. Their memory footprint is slight and you can have very large numbers of them. So callbacks are not needed for I/O operations either.
Because the execution models of Go and Node.js are about as different as they could be, attempting to port code from one to the other is very likely to lead to very clumsy solutions. It's better to start from the original requirements and implement from scratch.
It would be possible to distort the Go concurrency model using function arguments to behave like callbacks. This would be a bad idea because it would not be idiomatic and would lose the benefits that CSP gives.
So by reading others' Go code and some links in the comments to my question, I think channels are the way to go.
In my library code (semi pseudo-code):
// Make a new channel called "Events"
var Events = make(chan string)

func doSomething() {
	// ...
	Events <- "abcreceived" // Add "abcreceived" to the Events channel
}
And in the code that will use my library:
evt := <-mylib.Events
switch evt {
case "abcreceived":
	sendBackDEF()
// ...
}
I still prefer node.js' EventEmitter (because you can transfer data back easily) but for simple things, this should suffice.
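For comparison, the Node side of that pattern is only a few lines using the built-in events module (a sketch mirroring the example above; sendBackDEF is the same hypothetical responder as in the Go snippet):
const EventEmitter = require('events');

const bus = new EventEmitter();

// Library code: emit the event, passing data along with it.
function doSomething() {
  bus.emit('abcreceived', { from: '127.0.0.1', payload: 'ABC' });
}

// Calling code: subscribe and react; the data arrives directly in the listener.
bus.on('abcreceived', (info) => {
  console.log('got', info.payload, 'from', info.from);
  sendBackDEF(); // same hypothetical responder as in the Go snippet
});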

Understanding the Event-Loop in node.js

I've been reading a lot about the Event Loop, and I understand the abstraction provided whereby I can make an I/O request (let's use fs.readFile(foo.txt)) and just pass in a callback that will be executed once a particular event indicating completion of the file read is fired. However, what I do not understand is where the function that is doing the work of actually reading the file is being executed. Javascript is single-threaded, but there are two things happening at once: the execution of my node.js file and of some program/function actually reading data from the hard drive. Where does this second function take place in relation to node?
The Node event loop is truly single threaded. When we start up a program with Node, a single instance of the event loop is created and placed into one thread.
However, for some standard library function calls, the node C++ side and libuv decide to do expensive work outside of the event loop entirely, so they do not block the main event loop. Instead they make use of something called a thread pool. The thread pool is a set of (by default) four threads that can be used for running computationally intensive tasks. There are ONLY FOUR kinds of things that use this thread pool - DNS lookup, fs, crypto and zlib. Everything else executes in the main thread.
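A quick way to see that pool in action (a sketch; timings vary by machine): fire off five expensive crypto.pbkdf2 calls at once, and with the default pool of four threads the fifth call has to wait for a free thread, so it finishes noticeably later than the first four.
const crypto = require('crypto');

const start = Date.now();

for (let i = 1; i <= 5; i++) {
  // pbkdf2 runs in the libuv thread pool, not on the event loop thread.
  crypto.pbkdf2('password', 'salt', 100000, 64, 'sha512', () => {
    console.log(`call ${i} done after ${Date.now() - start} ms`);
  });
}
// With the default pool size of 4, calls 1-4 finish at roughly the same time
// and call 5 finishes roughly one "batch" later.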
"Of course, on the backend, there are threads and processes for DB access and process execution. However, these are not explicitly exposed to your code, so you can’t worry about them other than by knowing that I/O interactions e.g. with the database, or with other processes will be asynchronous from the perspective of each request since the results from those threads are returned via the event loop to your code. Compared to the Apache model, there are a lot less threads and thread overhead, since threads aren’t needed for each connection; just when you absolutely positively must have something else running in parallel and even then the management is handled by Node.js." via http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
It's like using setTimeout(function(){/*file reading code here*/},1000);. JavaScript can have multiple things pending at once, like three setInterval(function(){/*code to execute*/},1000); timers running side by side. So in a way JavaScript appears to do several things at once, even though the callbacks are all interleaved on one thread. And for actually reading from or writing to the hard drive in NodeJS, you can use:
var child = require("child_process");
var fs = require("fs");

function put_text(file, text) {
  // shell out to write the file
  child.exec("echo " + text + " > " + file);
}

function get_text(file, callback) {
  // read the file asynchronously with the fs module (not jQuery, which is browser-only)
  fs.readFile(file, "utf8", callback);
}
These can also be used for reading and writing to/from the hard drive using NodeJS.

Does Go have callback concept?

I found many talks saying that Node.js is bad because of callback hell and Go is good because of its synchronous model.
What I feel is that Go can also do callbacks the same as Node.js, but in a synchronous way, since we can pass anonymous functions and use closures.
So why do they compare Go and Node.js from a callback perspective, as if Go cannot end up in callback hell?
Or do I misunderstand the meaning of callbacks and anonymous functions in Go?
A lot of things take time, e.g. waiting on a network socket, a file system read, a system call, etc. Therefore, a lot of languages, or more precisely their standard libraries, include asynchronous versions of their functions (often in addition to the synchronous version), so that your program is able to do something else in the meantime.
In node.js things are even more extreme. They use a single-threaded event loop and therefore need to ensure that your program never blocks. They have a very well written standard library that is built around the concept of being asynchronous and they use callbacks in order to notify you when something is ready. The code basically looks like this:
doSomething1(arg1, arg2, function() {
  doSomething2(arg1, arg2, function() {
    doSomething3(function() {
      // done
    });
  });
});
somethingElse();
doSomething1 might take a long time to execute (because it needs to read from the network for example), but your program can still execute somethingElse in the mean time. After doSomething1 has been executed, you want to call doSomething2 and doSomething3.
Go on the other hand is based around the concept of goroutines and channels (google for "Communicating Sequential Processes", if you want to learn more about the abstract concept). Goroutines are very cheap (you can have several thousands of them running at the same time) and therefore you can use them everywhere. The same code might look like this in Go:
go func() {
	doSomething1(arg1, arg2)
	doSomething2(arg1, arg2)
	doSomething3()
	// done
}()
somethingElse()
Whereas node.js focuses on providing only asynchronous APIs, Go usually encourages you to write only synchronous APIs (without callbacks or channels). The call to doSomething1 will block the current goroutine and doSomething2 will only be executed after doSomething1 has finished. But that's not a problem in Go, since there are usually other goroutines available that can be scheduled to run on the system thread. In this case, somethingElse is part of another goroutine and can be executed in the meantime, just like in the node.js example.
I personally prefer the Go code, since it's easier to read and reason about. Another advantage of Go is that it also works well with computation-heavy tasks. If you start a heavy computation in node.js that doesn't need to wait for network or filesystem calls, this computation basically blocks your event loop. Go's scheduler, on the other hand, will do its best to dispatch the goroutines onto a small number of system threads, and the OS might run those threads in parallel if your CPU supports it.
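That "heavy computation blocks your event loop" point is easy to see in node itself (a JavaScript sketch; the busy loop stands in for any CPU-bound work):
const start = Date.now();

// A timer that should fire after ~100 ms.
setTimeout(() => {
  console.log(`timer fired after ${Date.now() - start} ms`); // prints far more than 100 ms
}, 100);

// CPU-bound work on the single JavaScript thread: nothing else can run until it finishes.
let sum = 0;
for (let i = 0; i < 2e9; i++) {
  sum += i;
}
console.log('busy loop done', sum);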
What I feel is that Golang can also do callbacks the same as Node.js, but in a synchronous way, since we can pass anonymous functions and use closures.
So why do they compare Golang and Node.js from a callback perspective, as if Golang cannot end up in callback hell?
Yes, of course it is possible to mess things up in Go as well. The reason why you don't see as many callbacks as in node.js is that Go has channels for communication, which allow for a way of structuring your code without using callbacks.
So, since there are channels, callbacks are not used as often, and therefore it is unlikely that you will stumble over callback-infested code. Of course this doesn't mean that you cannot write scary code with channels as well...

How the single threaded non blocking IO model works in Node.js

I'm not a Node programmer, but I'm interested in how the single-threaded non-blocking IO model works.
After I read the article understanding-the-node-js-event-loop, I'm really confused about it.
It gave an example for the model:
c.query(
  'SELECT SLEEP(20);',
  function (err, results, fields) {
    if (err) {
      throw err;
    }
    res.writeHead(200, {'Content-Type': 'text/html'});
    res.end('<html><head><title>Hello</title></head><body><h1>Return from async DB query</h1></body></html>');
    c.end();
  }
);
Question: when there are two requests, A (which comes first) and B, then since there is only a single thread, the server-side program will handle request A first: the SQL query is a sleep statement standing in for an I/O wait. The program is stuck at the I/O wait and cannot execute the code which renders the web page afterwards. Will the program switch to request B during the wait? In my opinion, because of the single-thread model, there is no way to switch from one request to another. But the title of the example code says that everything runs in parallel except your code.
(P.S. I'm not sure if I misunderstand the code or not since I have never used Node.) How does Node switch from A to B during the wait? And can you explain the single-threaded non-blocking IO model of Node in a simple way? I would appreciate it if you could help me. :)
Node.js is built upon libuv, a cross-platform library that abstracts apis/syscalls for asynchronous (non-blocking) input/output provided by the supported OSes (Unix, OS X and Windows at least).
Asynchronous IO
In this programming model, open/read/write operations on devices and resources (sockets, the filesystem, etc.) don't block the calling thread (as in the typical synchronous C-like model); they just mark the process (in kernel/OS-level data structures) to be notified when new data or events are available. In the case of a web-server-like app, the process is then responsible for figuring out which request/context the notified event belongs to and proceeding with processing the request from there. Note that this will necessarily mean you'll be on a different stack frame from the one that originated the request to the OS, as the latter had to yield to the process' dispatcher in order for a single-threaded process to handle new events.
The problem with the model I described is that it's not familiar and is hard to reason about for the programmer, as it's non-sequential in nature. "You need to make a request in function A and handle the result in a different function where your locals from A are usually not available."
Node's model (Continuation Passing Style and Event Loop)
Node tackles the problem by leveraging javascript's language features to make this model a little more synchronous-looking by inducing the programmer to employ a certain programming style. Every function that requests IO has a signature like function (... parameters ..., callback) and needs to be given a callback that will be invoked when the requested operation is completed (keep in mind that most of the time is spent waiting for the OS to signal the completion - time that can be spent doing other work). Javascript's support for closures allows you to use variables you've defined in the outer (calling) function inside the body of the callback - this allows you to keep state between the different functions that will be invoked by the node runtime independently. See also Continuation Passing Style.
Moreover, after invoking a function spawning an IO operation the calling function will usually return control to node's event loop. This loop will invoke the next callback or function that was scheduled for execution (most likely because the corresponding event was notified by the OS) - this allows the concurrent processing of multiple requests.
You can think of node's event loop as somewhat similar to the kernel's dispatcher: the kernel would schedule for execution a blocked thread once its pending IO is completed, while node will schedule a callback when the corresponding event has occurred.
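A small illustration of that style (a sketch; the path is just a placeholder): the callback closes over the calling function's locals, so that state is still available when node invokes the callback later, from the event loop, in a different stack frame.
const fs = require('fs');

function handleRequest(requestId) {
  // Local state defined in the calling function...
  const startedAt = Date.now();

  fs.readFile('/etc/hosts', 'utf8', (err, contents) => {
    // ...is still visible here thanks to the closure, even though this callback
    // runs later and on a different stack frame.
    if (err) return console.error(`request ${requestId} failed`, err);
    console.log(`request ${requestId} read ${contents.length} bytes in ${Date.now() - startedAt} ms`);
  });

  // Control returns to the event loop here; other requests can be processed meanwhile.
}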
Highly concurrent, no parallelism
As a final remark, the phrase "everything runs in parallel except your code" does a decent job of capturing the point that node allows your code to handle requests from hundreds of thousands of open sockets with a single thread concurrently, by multiplexing and sequencing all your js logic in a single stream of execution (even though saying "everything runs in parallel" is probably not correct here - see Concurrency vs Parallelism - What is the difference?). This works pretty well for webapp servers, as most of the time is actually spent waiting for the network or disk (database / sockets) and the logic is not really CPU intensive - that is to say: this works well for IO-bound workloads.
Well, to give some perspective, let me compare node.js with apache.
Apache is a multi-threaded HTTP server, for each and every request that the server receives, it creates a separate thread which handles that request.
Node.js on the other hand is event driven, handling all requests asynchronously from single thread.
When A and B are received on apache, two threads are created which handle the requests. Each handles its query separately, each waiting for the query results before serving the page. The page is only served once the query is finished. The query fetch is blocking because the server cannot execute the rest of the thread until it receives the result.
In node, c.query is handled asynchronously, which means that while c.query fetches the results for A, node jumps to handle c.query for B, and when the results for A arrive it sends them back to the callback which sends the response. Node.js knows to execute the callback when the fetch finishes.
In my opinion, because of the single-thread model, there is no way to switch from one request to another.
Actually the node server does exactly that for you all the time. To make these switches (the asynchronous behavior), most functions that you would use will take callbacks.
Edit
The SQL query is taken from the mysql library. It implements callback style as well as an event emitter to queue SQL requests. It does not execute them asynchronously itself; that is done by the internal libuv threads that provide the abstraction of non-blocking I/O. The following steps happen when making a query:
Open a connection to the db; the connection itself can be made asynchronously.
Once the db is connected, the query is passed on to the server. Queries can be queued.
The main event loop gets notified of the completion with a callback or event.
The main loop executes your callback/event handler.
The incoming requests to the http server are handled in a similar fashion. The internal thread architecture is something like this:
The C++ threads are the libuv ones which do the asynchronous I/O (disk or network). The main event loop continues to execute after dispatching the request to the thread pool. It can accept more requests as it does not wait or sleep. SQL queries/HTTP requests/file system reads all happen this way.
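Put together, the usual shape of such a query with the mysql package looks roughly like this (a sketch; the connection details are placeholders), with the steps above marked in the comments:
const mysql = require('mysql');

const c = mysql.createConnection({
  host: 'localhost',   // placeholder connection details
  user: 'app',
  password: 'secret',
  database: 'test'
});

c.connect();           // step 1: the connection is made asynchronously

c.query('SELECT SLEEP(20);', (err, results, fields) => {
  // steps 3-4: the event loop is notified of completion and runs this callback
  if (err) throw err;
  console.log('query finished', results);
  c.end();
});

// step 2: the query has been queued/sent; execution continues here immediately
console.log('query dispatched, the event loop is free');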
Node.js uses libuv behind the scenes. libuv has a thread pool (of size 4 by default). Therefore Node.js does use threads to achieve concurrency.
However, your code runs on a single thread (i.e., all of the callbacks of Node.js functions will be called on the same thread, the so called loop-thread or event-loop). When people say "Node.js runs on a single thread" they are really saying "the callbacks of Node.js run on a single thread".
Node.js is based on the event loop programming model. The event loop runs in single thread and repeatedly waits for events and then runs any event handlers subscribed to those events. Events can be for example
timer wait is complete
next chunk of data is ready to be written to this file
there's a fresh new HTTP request coming our way
All of this runs in a single thread and no JavaScript code is ever executed in parallel. As long as these event handlers are small and wait for yet more events themselves, everything works out nicely. This allows multiple requests to be handled concurrently by a single Node.js process.
(There's a little bit of magic under the hood as to where the events originate. Some of it involves low-level worker threads running in parallel.)
In this SQL case, there's a lot of things (events) happening between making the database query and getting its results in the callback. During that time the event loop keeps pumping life into the application and advancing other requests one tiny event at a time. Therefore multiple requests are being served concurrently.
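A toy server makes that interleaving visible (a sketch; setTimeout stands in for the database wait):
const http = require('http');

http.createServer((req, res) => {
  console.log('request received:', req.url);

  // Stand-in for the async DB query: the event loop stays free during the wait.
  setTimeout(() => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end(`finished ${req.url}\n`);
  }, 2000);
}).listen(8080);

// Hit this with two requests (A then B): both "request received" lines print
// immediately, and both responses come back about 2 seconds later, because the waits overlap.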
According to: "Event loop from 10,000ft - core concept behind Node.js".
The function c.query() takes two arguments:
c.query("Fetch Data", "Post-Processing of Data")
The operation "Fetch Data" in this case is a DB-Query, now this may be handled by Node.js by spawning off a worker thread and giving it this task of performing the DB-Query. (Remember Node.js can create thread internally). This enables the function to return instantaneously without any delay
The second argument "Post-Processing of Data" is a callback function, the node framework registers this callback and is called by the event loop.
Thus the statement c.query (paramenter1, parameter2) will return instantaneously, enabling node to cater for another request.
P.S: I have just started to understand node, actually I wanted to write this as comment to #Philip but since didn't have enough reputation points so wrote it as an answer.
if you read a bit further - "Of course, on the backend, there are threads and processes for DB access and process execution. However, these are not explicitly exposed to your code, so you can’t worry about them other than by knowing that I/O interactions e.g. with the database, or with other processes will be asynchronous from the perspective of each request since the results from those threads are returned via the event loop to your code."
about - "everything runs in parallel except your code" - your code is executed synchronously, whenever you invoke an asynchronous operation such as waiting for IO, the event loop handles everything and invokes the callback. it just not something you have to think about.
in your example: there are two requests A (comes first) and B. you execute request A, your code continue to run synchronously and execute request B. the event loop handles request A, when it finishes it invokes the callback of request A with the result, same goes to request B.
Okay, most things should be clear so far... the tricky part is the SQL: if it is not in reality running in another thread or process in its entirety, the SQL execution has to be broken down into individual steps (by an SQL processor made for asynchronous execution!), where the non-blocking ones are executed, and the blocking ones (e.g. the sleep) actually can be transferred to the kernel (as an alarm interrupt/event) and put on the event list for the main loop.
That means, e.g., the interpretation of the SQL etc. is done immediately, but during the wait (stored by the kernel as an event to come in the future in some kqueue, epoll, ... structure, together with the other IO operations) the main loop can do other things and eventually check whether any of those IOs have completed.
So, to rephrase it again: the program is never (allowed to get) stuck, sleeping calls are never executed. Their duty is done by the kernel (write something, wait for something to come over the network, waiting for time to elapse) or another thread or process. – The Node process checks if at least one of those duties is finished by the kernel in the only blocking call to the OS once in each event-loop-cycle. That point is reached, when everything non-blocking is done.
Clear? :-)
I don’t know Node. But where does the c.query come from?
The event loop is what allows Node.js to perform non-blocking I/O operations, despite the fact that JavaScript is single-threaded, by offloading operations to the system kernel whenever possible. Think of the event loop as the manager.
New requests are sent into a queue and watched by the synchronous event demultiplexer. Each operation's handler is also registered.
Then those requests are sent to the thread pool (Worker Pool) to be executed. JavaScript cannot perform asynchronous I/O operations itself. In a browser environment, the browser handles the async operations. In a node environment, async operations are handled by libuv using C++. The thread pool's default size is 4, but it can be changed at startup time by setting the UV_THREADPOOL_SIZE environment variable to any value (the maximum is 128). A thread pool size of 4 means 4 requests can be executed at a time; if the event demultiplexer has 5 requests, 4 are passed to the thread pool and the 5th waits. Once each request is executed, the result is returned to the event demultiplexer.
When a set of I/O operations completes, the Event Demultiplexer pushes a set of corresponding events into the Event Queue.
The handler is the callback. Now the event loop keeps an eye on the event queue; if there is something ready, it is pushed onto the stack to execute the callback. Remember that callbacks eventually get executed on the stack. Note that some callbacks have priority over others; the event loop picks callbacks based on their priorities.
For those who seek short answer and don't want to go to the deepest levels of Node.js internals.
Node.js is not single threaded; it runs on 5 threads by default (the single JavaScript thread plus a libuv thread pool of 4).
Yes, the only single thread is for actual JavaScript processing, but it always switches from function to function.
It sends the SQL query to a database and lets it wait in another thread, while the single-threaded JavaScript side of Node.js continues to run whatever other code is ready to be computed.
If you want more explanation, there are good articles about the Event Loop and the Worker Pool, and the whole libuv documentation.
