Well, I've been searching for this topic but couldn't find it, so I'll ask you:
I need to write to and read from an array on the server frequently, so I decided not to use a database, but I don't know what the best practice is. It will be a lot of data; could it be a JavaScript array? Is it possible to read and write in a non-blocking way while avoiding concurrency problems?
It is an MMORPG (an online multiplayer game), and the data is all the players online. The process will write to it on almost every step a player takes and will read it after that. I was thinking about a child process, or something to make the processing quicker and non-blocking, but I don't even know what a child process is, haha!
Thank you
Since Node.js is single threaded, anytime your code is doing something it is technically blocking the process from doing anything else. Once it hits a point where it is waiting for a callback, Node will start processing other requests until your callback comes back. How much data is 'a lot'? What do you need to do with the data?
If you're not doing much processing on the data, and there is a lot of it, a database solution wouldn't be a bad idea. Node drivers for databases (MongoDB, Redis, etc) are async and non-blocking so Node does a great job of interleaving the calls resulting in the ability to handle lots and lots of calls concurrently. Using storage like this (instead of just in memory) also means you could use Node cluster to spin up multiple node processes to use more than one core on your machine (as well as using multiple machines) to respond to requests.
If you're not doing much processing on the data, and the data set is pretty small, and you don't care about sharing the data among Node processes, then sure, just keep it in memory in whatever data structure you want. Arrays, dictionaries, or something like an LRU cache.
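For the game described in the question, the in-memory option could look something like the sketch below. It assumes a single Node process, and the names (`players`, `movePlayer`, `removePlayer`, `playersNear`) are hypothetical, made up for illustration. Because JavaScript in Node runs on one thread, each of these synchronous functions runs to completion before any other callback gets a turn, so two updates cannot interleave halfway through.

```javascript
// Hypothetical in-memory store of online players, keyed by player id.
const players = new Map();

// Record a player's latest position. Runs synchronously, so no other
// callback can observe a half-finished update.
function movePlayer(id, x, y) {
  const player = players.get(id) || { id, x: 0, y: 0 };
  player.x = x;
  player.y = y;
  players.set(id, player);
  return player;
}

// Remove a player who has disconnected.
function removePlayer(id) {
  return players.delete(id);
}

// Return the ids of all players within `radius` of the point (x, y).
function playersNear(x, y, radius) {
  const ids = [];
  for (const p of players.values()) {
    if (Math.hypot(p.x - x, p.y - y) <= radius) ids.push(p.id);
  }
  return ids;
}

movePlayer('alice', 1, 1);
movePlayer('bob', 50, 50);
movePlayer('alice', 2, 3);
console.log(playersNear(0, 0, 10)); // [ 'alice' ]
```

The trade-off is the one already mentioned: this data disappears on restart and is not visible to other Node processes.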
If you are doing lots of processing on the data, and there is a lot of it, then you'll need to do a bit more work, since this isn't Node's greatest strength (processing blocks the one and only thread, which means it can't handle additional requests). I would suggest something like a PubSub model with a non-blocking queue and worker processes handling the processing.
Related
I'm now a couple of weeks into my Node deep dive. I've learned a lot from Anthony's excellent course on Udemy and I'm currently going through the book "Node.js the Right Way". I've also gone through quite a few articles that brought up some very good points about real-world scenarios with Node and coupling it with other technologies out there.
HOWEVER, it seems to be accepted as law that you don't perform computationally heavy tasks with Node, as it's a single-threaded architecture. I get the idea of the event loop, async callbacks, etc. In fact, Node's strength stems from handling tons of concurrent I/O connections, if I understand correctly. No matter where I'm reading, though, the source warns against tying up that thread executing a task. I can't seem to find any rule of thumb for which things to avoid using Node's process for. I've seen a solution saying that Node should pass computationally heavy tasks to a message service like RabbitMQ, which a dedicated app server can churn through (any suggestions on what pairs well with Node for this task? I read something about an N-tier architecture). The reason I'm so confused is because I see node being used for reading and writing files to highlight the usage of streams but in my mind fetching/reading/writing files is an expensive task (I feel mistaken).
TL;DR: What kind of tasks should Node pass off to a workhorse server? What material can I read that explains the paradigm in detail?
Edit: it seems my lack of understanding stemmed from not knowing what would even halt a thread in the first place, outside of an obviously synchronous I/O request. So, if I understand correctly, reading and writing data is I/O, whereas mutating that data or doing mathematical computations is computationally expensive (at varying levels depending on the task, of course). Thanks for all the answers!
If you're using node.js as a server, then running a long running synchronous computational task ties up the one thread and during that computation, your server is non-responsive to other requests. That's generally a bad situation for servers.
So, the general design principles for a node.js server are these:
1. Use only asynchronous I/O functions. For example, use fs.readFile(), not fs.readFileSync().
2. If you have a computationally intense operation, then move it to a child process. If you do a lot of these, then have several child processes that can process these long running operations. This keeps your main thread free so it can be responsive to I/O requests.
3. If you want to increase the overall scalability and responsiveness of your server, you can implement clustering with a server process per CPU. This isn't really a substitute for #2 above, but can also improve scalability and responsiveness.
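As a rough illustration of moving a computation into a child process, here is a minimal sketch using the built-in `child_process` module. The helper name `sumInChild` is hypothetical, and passing the child's program inline with `-e` is only a way to keep the example in one file; a real project would more likely fork a separate module.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: sum the integers 0..n-1 in a separate Node process
// so the main thread stays free while the loop runs.
function sumInChild(n) {
  // The child's program is passed inline with `-e` to keep this sketch
  // self-contained.
  const script = `
    let total = 0;
    for (let i = 0; i < ${Number(n)}; i++) total += i;
    console.log(total);
  `;
  return new Promise((resolve, reject) => {
    execFile(process.execPath, ['-e', script], (err, stdout) => {
      if (err) return reject(err);
      resolve(Number(stdout.trim()));
    });
  });
}

sumInChild(1e7).then((total) => console.log('child finished:', total));
console.log('main thread is still free to handle requests');
```

Starting a process has a cost of its own, so for many short jobs it would probably make more sense to keep a few long-lived children around, as the answer suggests.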
The reason I'm so confused is because I see node being used for
reading and writing files to highlight the usage of streams but in my
mind fetching/reading/writing files is an expensive task (I feel
mistaken).
If you use the asynchronous versions of the I/O functions, then read/writing from the disk does not block the main JS thread as they present an asynchronous interface and the main thread can do other things while the system is fetching data from the disk.
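A small sketch of what that looks like in practice, assuming a throwaway file in the OS temp directory (the file name is made up for the example). The line after the `fs.writeFile()` call runs before either callback, because the disk work happens off the main JS thread.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const events = [];
// Hypothetical scratch file used only for this demonstration.
const file = path.join(os.tmpdir(), `async-io-demo-${process.pid}.txt`);

// Write, then read back, using only the asynchronous APIs.
fs.writeFile(file, 'hello', (writeErr) => {
  if (writeErr) throw writeErr;
  fs.readFile(file, 'utf8', (readErr, text) => {
    if (readErr) throw readErr;
    events.push(`read: ${text}`);
    console.log(events);
    fs.unlink(file, () => {}); // best-effort cleanup
  });
});

// This runs first: the calls above only *start* the I/O.
events.push('main thread continues');
```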
What kind of tasks should Node pass off to a workhorse server?
It depends a bit on the server load that you are trying to support, what you're asking it to do and your tolerance for responsiveness delays. The higher the load you're aiming for, then the more you need to get any computationally intensive task off the main JS thread and into some other process. At a medium number of long running transactions and a modest server load, you may just be able to use clustering to reach your scalability and responsiveness goal, but at some threshold of either length of the transaction or the load you're trying to support, you have to get the computationally intensive stuff out of the main JS thread.
HOWEVER, it seems to be accepted as law that you don't perform computationally heavy tasks with Node, as it's a single-threaded architecture.
I would reword this:
don't perform computationally heavy tasks unless you need to with Node
Sometimes, you need to crunch through a bunch of data. There are times when it's faster or better to do that in-process than it is to pass it around.
A practical example:
I have a Node.js server that reads in raw log data from a bunch of servers. No standard logging utilities could be used as I have some custom processing being done, as well as custom authentication schemes for getting the log data. The whole thing is HTTP requests, and then parsing and re-writing the data.
As you can imagine, this uses a ton of CPU. Here's the thing though... is that CPU wasted? Am I doing anything in JS that I could do faster had I written it in another language? Often times the CPU is busy for a real reason, and the benefit of switching to something more native might be marginal. And then, you have to factor in the overhead of switching.
Remember that with Node.js, you can compile native extensions, so it's possible to have the best of both worlds in a well established framework.
For me, the human trade-offs came in. I'm a far more efficient Node.js developer than I am with anything that runs natively. Even if my Node.js app were to prove 5x slower than something native (which I'd imagine would be on the extreme end), I could just buy 5 more servers to run it, at much less cost than it would take for me to develop and maintain the native solution.
Use what you need. If you need to burn a lot of CPU in Node.js, just make sure you're doing it as efficiently as you can. If you find that you could optimize something with native code, consider making an extension and be sure to measure the performance differences afterwards. If you feel the desire to throw out the whole stack... reconsider your approach, as there might be something you're not considering.
Reading and writing files are I/O operations, so they are NOT CPU intensive. You can do a fair amount of concurrent I/O with Node without tying up any single request (in a Node HTTP server for example).
But people use Node for CPU-intensive tasks all the time, and it's fine. You just have to realize that if it uses all of the CPU for any significant amount of time, then you will block all other requests to that server, which generally won't be acceptable if you need the server to stay available. But there are lots of times when your Node process is not trying to be a server firing back responses to many requests, such as when you have a Node program that just processes data and isn't a server at all.
Also, using another process is not the only way to do background tasks in Node. There is also webworker-threads which allows you to use threads if that is more convenient (you do have to copy the data in and out).
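webworker-threads is a third-party package; newer Node versions also ship a built-in `worker_threads` module that covers the same need. The sketch below assumes that module is available, and the helper name `fibInWorker` is hypothetical. The value passed as `workerData` and the result sent with `postMessage` are copied between threads, which is the "copy the data in and out" cost mentioned above.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical helper: compute the n-th Fibonacci number on a worker thread.
function fibInWorker(n) {
  // Worker source passed inline (eval: true) to keep the sketch in one file.
  const source = `
    const { parentPort, workerData } = require('worker_threads');
    function fib(k) { return k < 2 ? k : fib(k - 1) + fib(k - 2); }
    parentPort.postMessage(fib(workerData));
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData: n });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

fibInWorker(30).then((value) => console.log('fib(30) =', value));
console.log('main thread is not blocked while the worker computes');
```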
I would stop reading and do some targeted experiments so you can see what they are talking about. Try to create and test three different programs: 1) an HTTP server that handles lots of requests and always returns immediately with file contents; 2) an HTTP server that handles lots of requests, but where a certain request causes a large math computation that takes many seconds to return (which will block all the other requests, a big problem); 3) a Node program that is not an HTTP server, which does that large math computation and spits out the result in the terminal (even though it takes a while to work, it is not handling other requests since it's not a server, so it's fine for it to block).
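If you want to see the blocking effect before building the servers, here is a smaller stand-in for experiment 2 that needs no HTTP at all. It is only a sketch: a timer plays the role of "another request", and `busyWait` is a hypothetical helper that simulates the large math computation.

```javascript
// Block the thread for roughly `ms` milliseconds by spinning.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* burn CPU */ }
}

const scheduledAt = Date.now();
let observedDelay = null;

// Ask for a callback after 10 ms...
setTimeout(() => {
  observedDelay = Date.now() - scheduledAt;
  console.log(`timer fired after ${observedDelay} ms (asked for 10 ms)`);
}, 10);

// ...then hog the thread for 200 ms. The timer cannot fire until this returns.
busyWait(200);
console.log('busy work done');
```

The timer reports a delay of at least 200 ms, which is the same thing that happens to every other request while one request is doing heavy computation.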
I know this question has been discussed in the past in great detail (How is Node.js inherently faster when it still relies on Threads internally?), but I still fail to properly understand the node.js event loop model and how, being single-threaded, it handles concurrent requests.
Up to now my understanding is: we receive an I/O request --> a thread is spawned internally by node.js and the I/O request is handed to it --> since this is an I/O request, the CPU hands it to the DMA controller and frees this thread --> this thread goes back into the thread pool to serve a different request --> DMA is still doing the I/O; once DMA gets all the data, a sort of event is fired --> this event is captured by the node.js system, which puts the supplied callback function on the event loop --> whenever the event loop gets the opportunity, it executes the callback on the data fetched by the I/O --> thanks to closures, the callback function operates only on the data fetched for its own request
So this process goes on repeatedly. Could someone please correct my understanding where it is wrong and provide some more information?
There is only one thread (the main thread) for dealing with network I/O (file I/O is a slightly different story because not all platforms provide usable asynchronous, non-blocking file I/O APIs, so the synchronous file I/O APIs are used on those platforms in a threadpool).
So when network requests come in, they're all handled by the main thread which uses (indirectly via libuv) epoll/kqueue/IOCP/etc. for detecting (in a non-blocking way) when data is available (or when there is an incoming TCP connection for example). If there is data available, it calls out appropriately to javascript as needed, passing the socket data. If there is no data on the socket (and there's nothing else for the event loop to do, e.g. firing timers), then execution proceeds to the next iteration of the event loop where the process starts all over again.
As far as associating socket data with socket javascript objects goes, it's the combination of C++ wrapper objects (e.g. tcp_wrap, udp_wrap, etc.) and javascript objects that makes sure the data gets to the appropriate place.
Here's a slightly older diagram that explains what happens in a single cycle of node's event loop. Some of it may have changed slightly since node v0.9, but it gets you the general idea:
node.js has a single threaded model which eliminates the need for locks and semaphores (used in the traditional multithreaded model). Locks and semaphores can add some costs in terms of performance and, more importantly, can provide a lot of rope to hang yourself with (in other words, many pitfalls). IO operations happen in parallel and because work between IOs is typically very small, this single threaded model usually works quite nicely.
(side note: if you have an app that does a lot of work between IO operations, i.e. a CPU-intensive app, that is a case where node does not scale well)
I like to think that the argument for why node's model scales well is the same as the argument for why people think NoSQL scales better than SQL databases. Obviously Java (multi-threaded) and SQL scale; big companies like Facebook and Twitter have proven that. However, as with SQL, there are a lot of things you could do incorrectly that slow down your performance. Node.js doesn't eliminate all potential problems; it just does a good job of restricting many of the common causes.
I'm just starting out with Node.js and I understand that most operations must work with callbacks to be non-blocking. My question pertains to the methods that Underscore.js exposes. For example
_.shuffle([1, 2, 3, 4, 5, 6]);
Wouldn't this be considered synchronous code given that no callback is provided? Consider a large list to shuffle.
Trying to come to grips in terms of what libraries I can use with node without impacting the fundamentals of using node.
Thanks!
Node is single threaded so any work that needs to get done will eventually be done by that thread. The async nature of Node means that it always tries to keep itself busy with work instead of waiting around for data to be returned (things like database calls, networks calls, disk access, etc). When you read things about making sure code is asynchronous, these are the types of operations that people are talking about.
Shuffling a bunch of numbers is a bunch of work that has to be done by the single Node thread, so making this type of call async wouldn't do anything. So yes, that call is synchronous and will block the thread, but there really isn't an alternative (without spawning worker threads or additional Node processes). This is one of the reasons that Node really isn't the best option if you have a lot of heavy computations to do, since they will block the single thread. Node is best at doing lots and lots of short duration tasks quickly.
Note that shuffling a million numbers will probably still be faster than a single database call, so this particular operation wouldn't impact overall performance that much. If you need to shuffle 100 million numbers, Node probably isn't the right platform.
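You can check the cost on your own machine rather than take it on faith. The sketch below times a plain Fisher-Yates shuffle, which stands in for `_.shuffle` here (it is not Underscore's actual source); the elapsed time will vary by hardware, so no figure is claimed.

```javascript
// Fisher-Yates shuffle; returns a shuffled copy and leaves the input untouched.
function shuffle(items) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Time a synchronous shuffle of one million numbers.
const numbers = Array.from({ length: 1000000 }, (_, i) => i);
const start = process.hrtime.bigint();
const shuffled = shuffle(numbers);
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`shuffled ${shuffled.length} numbers in ${elapsedMs.toFixed(1)} ms`);
```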
Yes, it's synchronous, but that's not a problem in this example (or really any of Underscore's methods).
The reason many node APIs are asynchronous is because they perform potentially long operations. In order to do so, the work is offloaded to native OS asynchronous facilities (sockets) or done on a separate thread. Only when the work is complete is the data marshaled back into JS land and a callback invoked.
In this case, you're dealing strictly with JavaScript-managed memory. Only one thread has access to JS memory; you can't share memory between threads. This means you must do your work (shuffling the array) synchronously.
Unless you're dealing with a truly large array, this won't be a problem.
The "never make synchronous calls in node" rule really applies to I/O and computationally expensive operations. That's why all the network, filesystem, crypto, and zlib APIs are asynchronous. You'll notice that other APIs like the URL/path parsing modules are synchronous calls.
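The split is visible in the core modules themselves. A minimal sketch, with a made-up file path and payload: `path.extname()` is cheap and returns directly, while `zlib.gzip()` is potentially expensive and hands its result to a callback.

```javascript
const path = require('path');
const zlib = require('zlib');

// Cheap, pure string work: synchronous, returns its result directly.
const ext = path.extname('/var/log/app/server.log'); // hypothetical path
console.log(ext); // .log

// Potentially expensive work: asynchronous, result arrives via callback.
const input = Buffer.from('hello '.repeat(10000));
zlib.gzip(input, (err, compressed) => {
  if (err) throw err;
  console.log(`gzip: ${input.length} -> ${compressed.length} bytes`);
});
console.log('gzip requested; the main thread carries on');
```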
I have read about node.js and other servers such as Apache, where the threading is different. I simply do not understand what the threading means.
If I have a webpage that runs SQL to hit a database, say three different databases in the one server side page, what does that mean for threading in node.js? Apache? What does "thread" mean here?
Or as an article I saw, "start a new thread to handle each request."
What does it mean to say Apache spawns a thread per request, but node.js does not?
EDIT: I am hoping for an example that I can grasp. I'm used to having a server side page that hits a database(s). Several connections inside that file.
A thread is a context of program execution. Programs that are single-threaded can only do one thing at once, where multi-threaded programs can do many things at once.
Think of it like a kitchen at a restaurant. A single chef can really only do one task at a time, be that chopping onions or putting something in an oven. If an order comes in that requires lots of work from the chef (such as making salads vs. putting stuff in the oven and waiting) some meals may get delayed because that chef is busy. On the other hand, if that chef just has to bake a bunch of stuff, there isn't much work for him to do and he can make other meals while waiting for the food in the oven to be done.
With multiple chefs, many of these tasks can be done simultaneously. Many meals can be prepared simultaneously.
Apache's threading model is like hiring a fixed number of chefs (regardless of how many customers your restaurant has that night) and each chef can only work on one meal at a time. That means that if a meal order comes in, a dedicated chef is assigned to that meal. There will be times when that chef is busy chopping up ingredients and mixing cake batter, but there will also be times when he's just standing around waiting for the potatoes to boil. At any given time, you could have most of your chefs sitting idle, waiting on potatoes to boil and cake to bake and no more orders will be worked on, since each chef is dedicated to one order at a time.
To make matters worse, your kitchen is only as big as you can afford to make it. Each chef takes up space and resources, and you may have a situation where a bunch of chefs standing around holding the only spoons available are preventing other chefs from getting their food made.
Nginx is another web server (often used as a proxy) that you didn't ask about, but I'm including it to explain another threading model. It also hires a fixed number of chefs, but it hires fewer of them. Each chef can work on multiple meals at a time. So, if they're waiting on potatoes to boil while an order comes in for a chopped salad, they can go work on that salad instead of standing around idle. You can have a smaller kitchen (relative to the size of restaurant/number of customers) and get the same number of meals out, or more. It's a tight crew that is effective at not wasting time and resources.
Node.js is a bit different. It is single-threaded from a JavaScript perspective, but other tasks like disk and network IO are handled on separate threads automatically. It's like having a kitchen with only one chef, but that makes sense in some cases. If your kitchen has a lot of busy work for that chef, perhaps it makes sense to hire more chefs to do work. (To do this in Node.js, you can only spawn more processes, which is effectively like building a bunch of small kitchens right next to each other. You can have one guy standing out front coordinating the orders for all those kitchens.) However, if you're just a bakery (mainly just IO, with little busy-work for the chef), maybe you only need one chef.
To sum all this up, different threading models are used to divide work and process it effectively. Which threading model makes sense depends on your needs, and the other characteristics of the server you are choosing.
Node.js is single threaded in that it can only do one thing at once. You can run multiple instances of the node process on pretty much all cloud service providers, though. The apache process can multi-task on threads.
If the node process hangs for some reason, nothing else can happen. That's why it's important to write node code in an asynchronous way so that if a database query hangs, node can still take requests.
Without getting too technical, a thread can be thought of as a lane in the highway of the program. It's a specific channel of execution. In the lifetime of a request, a lot of things have to happen. All of those things are in one box.
Node doesn't have threads (as far as your JavaScript is concerned)! You can think of it like a one-lane road. But the way node is deployed, you get many instances of that one-lane road. They don't share anything, though. If a value gets added to an array in one, it's not in the other. Anything that needs to be shared has to be shared in a cache or database layer.
What people tend to confuse are threads, processes, and async, non-blocking I/O.
Threads are child-level 'runnables' of a process. The whole execution environment is set up for a thread: everything from the stack to addressable memory locations is allocated to it. If a child-level thread has to communicate back to the main process thread, it has to use safe messaging or notification models. There are multiple ways to do this, depending on the language.
Node.js is single-threaded and, obviously, single-process. It's not meant for CPU-intensive blocking calls. But if you still want to use it for those, you could consider Node clustering: instead of creating threads, it creates multiple processes that work like threads.
Async - Not all code that takes a callback function is actually async.
Okay, in other words: literally speaking, callbacks are asynchronous when they don't block the call.
But in the Node.js context, when someone says Node is async, it's completely linked to the OS interfacing. The capability of Node depends on the non-blocking I/O capabilities of the underlying OS. So for whatever objects the OS supports non-blocking I/O on (for example sockets, files, pipes), Node utilizes them to the maximum.
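A quick way to see that a callback alone does not make something async: `Array.prototype.forEach` takes a callback but runs it immediately, while `setImmediate` defers its callback to a later turn of the event loop. This is only an illustrative sketch.

```javascript
const log = [];

// Takes a callback, but calls it synchronously, right now.
[1, 2, 3].forEach((n) => log.push(`forEach ${n}`));
log.push('after forEach');

// Takes a callback and really does defer it until the current code has finished.
setImmediate(() => {
  log.push('setImmediate callback');
  console.log(log);
});
log.push('after setImmediate');
```

The `forEach` entries land before 'after forEach', whereas the `setImmediate` entry lands last, after 'after setImmediate'.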
And btw, when you talk about Apache, you should ideally be comparing it with Nginx, not Node.js.
Node.js is not meant to serve as a web server. It's basically a process that makes effective use of async I/O.
How do you make your application multithreaded?
Do you use async functions?
Or do you spawn a new thread?
I think that async functions are already spawning a thread, so if your job is just doing some file reading, being lazy and just spawning your job on a thread would just "waste" resources...
So is there some kind of design guideline for choosing between threads and async functions?
If you are talking about .NET, then don't forget the ThreadPool. The thread pool is also what async functions often use. Spawning too many threads can actually hurt your performance. A thread pool is designed to spawn just enough threads to do the work the fastest. So do use a thread pool instead of spawning your own threads, unless the thread pool doesn't meet your needs.
PS: And keep an eye out on the Parallel Extensions from Microsoft
Spawning threads is only going to waste resources if you start spawning tons of them; one or two extra threads aren't going to affect the platform's performance. In fact, System currently has over 70 threads for me, and MSN is using 32 (I really have no idea how a messenger can use that many threads, especially when it's minimised and not really doing anything...)
Usually a good time to spawn a thread is when something will take a long time, but you need to keep doing something else.
E.g. say a calculation will take 30 seconds. The best thing to do is spawn a new thread for the calculation, so that you can continue to update the screen and handle any user input, because users will hate it if your app freezes until it's finished doing the calculation.
On the other hand, creating threads to do something that can be done almost instantly is nearly pointless, since the overhead of creating (or even just passing work to an existing thread using a thread pool) will be higher than just doing the job in the first place.
Sometimes you can break your app into a couple of separate parts which run in their own threads. For example, in games the updates/physics etc. may be one thread, while graphics are another, sound/music is a third, and networking is another. The problem here is you really have to think about how these parts will interact, or else you may have worse performance, bugs that happen seemingly "randomly", or it may even deadlock.
I'll second Fire Lancer's answer - creating your own threads is an excellent way to process big tasks or to handle a task that would otherwise be "blocking" to the rest of a synchronous app, but you have to have a clear understanding of the problem that you must solve, and develop in a way that clearly defines the task of a thread and limits the scope of what it does.
For an example I recently worked on - a Java console app runs periodically to capture data by essentially screen-scraping urls, parsing the document with DOM, extracting data and storing it in a database.
As a single threaded application, it, as you would expect, took an age, averaging around 1 url a second for a 50kb page. Not too bad, but when you scale out to needing to process thousands of urls in a batch, it's no good.
Profiling the app showed that most of the time the active thread was idle - it was waiting for I/O operations - opening of a socket to the remote URL, opening a connection to the database etc. It's this sort of situation that can easily be improved with multithreading. Rewriting to be multi-threaded and with just 5 threads instead of one, even on a single core cpu, gave an increase in throughput of over 20 times.
In this example, each "worker" thread was explicitly limited in what it did - open the remote url, parse the data, store it in the db. All the "high level" processing - generating the list of urls to parse, working out which to do next, handling errors - remained under the control of the main thread.
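The answer's app was Java with real threads. Purely as an analogue, the same "a few workers pulling I/O-bound jobs from a list" shape can be sketched in Node with async functions instead of threads. Everything here is hypothetical: `runPool`, `fakeScrape`, and the example.com URLs are stand-ins, and the fake task just waits instead of fetching anything.

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at once.
// Results come back in the same order as `items`.
async function runPool(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each "worker" claims items one at a time until none are left.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Stand-in for "fetch a URL, parse it, store it": just waits 20 ms.
const fakeScrape = (url) =>
  new Promise((resolve) => setTimeout(() => resolve(`parsed ${url}`), 20));

const urls = Array.from({ length: 20 }, (_, i) => `http://example.com/page/${i}`);
runPool(urls, 5, fakeScrape).then((pages) => console.log(pages.length, 'pages done'));
```

As in the original, the workers do only the narrow per-URL job, while building the list and collecting the results stay with the caller.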
The use of threads makes you think more about the way your application needs threading and can in the long run make it easier to improve / control your performance.
Async methods are faster to use but they are a bit magic - a lot of things happen to make them possible - so it's probable that at some point you will need something that they can't give you. Then you can try and roll some custom threading code.
It all depends on your needs.
The answer is "it depends".
It depends on what you're trying to achieve. I'm going to assume that you're aiming for more performance.
The simplest solution is to find another way to improve your performance. Run a profiler. Look for hot spots. Reduce unnecessary IO.
The next solution is to break your program into multiple processes, each of which can run in their own address space. This is easiest because there is no chance of the individual processes messing each other up.
The next solution is to use threads. At this point you're opening a major can of worms, so start small, and only multi-thread the critical path of the code.
The next solution is to use async IO. It's generally only recommended for people writing some kind of very heavily loaded server, and even then I would rather re-use one of the existing frameworks that abstract away the details, e.g. the C++ framework ICE, or an EJB server under Java.
Note that each of these solutions has multiple sub-solutions - there are different breeds of threads and different kinds of asynch IO, each with slightly different performance characteristics, but again, it's generally best to let the framework handle it for you.