Why this question?
I’m learning about JS performance and web rendering. This post was very useful.
If you follow some links you will land here and read this:
User code in Node.js runs in a single thread, so for compute operations (as opposed to I/O), you can execute them concurrently, but not in parallel.
So I’ve read up on concurrency and parallelism in Node.js. What I’ve learned is that Node.js is:
Parallel for I/O-bound tasks, since those are handled by libuv
Concurrent for CPU-bound tasks
This explains why renderToString is a slow operation: it is CPU-bound. But it seems there is a way to enable parallelism for CPU-bound tasks in Node.js: clustering.
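To be explicit, the clustering I'm referring to looks roughly like this (a minimal sketch of the built-in cluster module; the port and worker count are arbitrary):

```js
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // one worker per core, each with its own event loop
  os.cpus().forEach(() => cluster.fork());
} else {
  http.createServer((req, res) => {
    // CPU-bound work here runs in parallel with the other workers
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(3000); // incoming connections are shared between the workers
}
```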
The question
This is why I’m here. Do you know why renderToString isn’t clustered (don’t know if this is valid English)?
Maybe it’s too complicated?
Maybe it just can’t be done in parallel?
Maybe it doesn't improve performance for some reason?
I would like to understand why, because after those readings I tend to think that Node.js is very performant when it comes to dealing with I/O, but it seems it can also be performant for CPU-bound tasks since you can create clusters. Nevertheless, that doesn't seem to be trivial, and it's a choice to consider only in specific cases.
So this leads to one bonus question: what are the limitations/drawbacks of Node.js clusters? (Other than the fact that they seem complicated to set up and maintain on large projects?)
It would not make sense to place the abstraction at this level.
It is not hard to run renderToString() in a cluster of workers. For instance, you can easily use the worker-farm library.
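A minimal sketch of what that could look like with worker-farm (the file names and the App component are made up, and error handling is minimal):

```js
// render-worker.js -- runs in a separate child process
const React = require('react');
const { renderToString } = require('react-dom/server');
const App = require('./App'); // hypothetical root component

module.exports = (props, callback) => {
  callback(null, renderToString(React.createElement(App, props)));
};

// server.js -- farm the render calls out to a pool of workers
const workerFarm = require('worker-farm');
const renderers = workerFarm(require.resolve('./render-worker'));

const props = { url: '/' }; // whatever per-request data the tree needs
renderers(props, (err, html) => {
  // props and html cross a process boundary, so both must be serializable
  if (err) throw err;
  console.log(html.length);
});
```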
The problem is this becomes hard to use in a beneficial way, because the "store" of data built for each incoming request must be in scope for the entire component tree that renderToString() works on.
Perhaps, though, with the experimental worker_threads Node.js module, we might get a multi-threaded renderToString. But the work on SSR (server-side rendering) of React is not nearly as active as the work on the client.
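A rough sketch of what the worker_threads approach might look like (the API was still experimental, and the file names and App component are made up):

```js
// render-thread.js -- does the actual renderToString() call off the main thread
const { parentPort, workerData } = require('worker_threads');
const React = require('react');
const { renderToString } = require('react-dom/server');
const App = require('./App'); // hypothetical root component

parentPort.postMessage(renderToString(React.createElement(App, workerData)));

// server.js -- main thread; a real setup would pool threads instead of
// spawning one per render
const { Worker } = require('worker_threads');

function renderInThread(props) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./render-thread.js', { workerData: props });
    worker.once('message', resolve); // the rendered HTML string
    worker.once('error', reject);
  });
}
```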
Maybe with the work to allow React to suspend a tree it is rendering and then start it up again, we'll eventually have a thread that can continue rendering while the primary thread acts on incoming requests/actions.
Related
I am learning Erlang and trying to understand how its socket handling works, as it is meant to be one of the strongest parts of the language and OTP.
I have experience with Node.js, and I wonder how applications made with Node.js and Erlang differ with regard to how multiple socket connections are managed.
As I understand it, while JavaScript is single-threaded, V8 manages all the simultaneous connections for it, whereas Erlang can manage multiple connections itself.
So I wonder: if Erlang has excellent support for managing multiple connections at a time, how is it different from other technologies for a programmer? I mean, when I write an app in Node.js, it can have as many connections open and well-managed as if I wrote the code in Erlang, can't it?
Please share your thoughts; links to articles about the specialties of Erlang in this context are welcome too.
I am by no means an expert in Erlang, but I think I know Erlang and Node.js at about the same level.
The things you say are all correct. Both can handle multiple connections very efficiently and, as you say, keep them well-managed.
But the thing is, the problem is not only handling multiple concurrent connections. The problems Erlang tries to solve very well are fault tolerance and distribution. I don't think Node.js, as it is now, is as good at those.
Don't take this the wrong way; I'm not saying no one can write a distributed app in Node.js, but considering the tools Erlang gives you, it may be the better choice.
For fault tolerance, as an example, Erlang lets you link your processes, so when one fails, the other also fails or gets notified. That is not very useful by itself, but when you look at it alongside supervisors and shared-nothing processes, it is a great tool.
For distribution, Erlang lets you link nodes together. Linked nodes can talk to each other as if they were on the same machine, and they can spawn processes on the other side too. Consider that, combined with the ability to restart an app from a failed node on another, healthy node: it gives you great uptime.
Not to mention that these tools have years of experience behind them.
Just try to solve these issues in another ecosystem. I say ecosystem because Erlang as a language is not the whole picture; the tools and frameworks (mostly OTP) have to be considered too. Then you can also say that Erlang really shines in these areas.
But Erlang is also not very good when it comes to linear processing, number crunching, image/sound processing, etc. Those would be better implemented in another system.
I think, in these areas, the big difference between Node.js and Erlang is their runtime model. Node.js has one process with one thread working asynchronously on I/O-related tasks. Of course, you can run multiple processes, but that is the basic model. On the other hand, Erlang has a VM called BEAM. Erlang uses special processes inside this VM, very lightweight ones. BEAM schedules them itself, because they are not OS processes. This gives BEAM the advantage of having hundreds of thousands of processes running at the same time, each doing a task, be it I/O or anything else.
You see the difference now, I think. Erlang is more battle-tested and better when fault tolerance or distribution is a must. Node.js may be better when you need faster development and deployment.
We have an app that uses server-side rendering for SEO purposes using EJS templating.
I am well-versed in Node.js and know that it's probably possible to tap into the Node.js threadpool for asynchronous I/O, for whatever purpose you want, whether it's a good idea or a bad one. Currently I am wondering if it is possible to run ejs.render() or res.render() on a thread in the threadpool instead of the main thread in Node.js.
We are doing a lot of heavy computational lifting in the render functions, and we definitely want that off the main thread; otherwise we will be paying $$$ for more servers.
Is it just the rendering that is concerning you? There are other template engines which should produce better results; being that template rendering should be an idempotent operation, you could additionally distribute across a cluster.
V8 will compile your code down to machine code and, if you're not hitting any deoptimizations or getting stalled by the garbage collector, I believe you should be in the neighborhood of your network I/O limits. I would definitely recommend trying other template engines, adding a caching HTTP reverse proxy in front, and running some benchmarks first.
EJS is known to be synchronous, and that's not going to change, so it's basically an inefficient rendering engine for Node.js: it blocks the JS thread whenever it renders a view, which degrades your overall throughput, especially if your rendering is CPU-heavy.
You should definitely think about some other options. E.g. https://github.com/ericf/express-handlebars
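For example, switching from EJS to express-handlebars looks roughly like this (a sketch; the exact registration call depends on the library version, and the view names are made up):

```js
const express = require('express');
const exphbs = require('express-handlebars');

const app = express();

// older versions export a callable; newer ones export an engine() factory instead
app.engine('handlebars', exphbs({ defaultLayout: 'main' }));
app.set('view engine', 'handlebars');

app.get('/', (req, res) => {
  // expects views/layouts/main.handlebars and views/home.handlebars
  res.render('home', { title: 'Hello' });
});

app.listen(3000);
```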
If you really have CPU-heavy computation in your web server, then Node.js is definitely not the right tool for the job anyway. There are much better servers for handling multi-threading and parallel processing. You could just set up Node as a controller and forward your CPU-heavy requests to a backend service/server that can do the heavy lifting.
It would be helpful to see what kind of computation you are doing during render to provide a better answer.
Tapping into the thread pool (which is handled by libuv) would probably be a bad idea, but it is possible, of course; you just need some C++ skills and the uv_queue_work() function from the libuv library to schedule work on a worker thread.
I have experimented with building a scripting engine that runs in a forked process (read about Node's child process module here). I find that to be an attractive proposition for implementing rendering engines. Yes, there are issues with passing parameters (POST/GET query strings, session state, etc.), but they are easy to deal with, especially if you use fork (as opposed to exec or spawn). There are standard messaging methods for communicating between the child and parent.
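A minimal sketch of that parent/child messaging with a long-lived child (the same idea works with one fork per request; the file names and render function are made up):

```js
// parent.js -- keep the web server's event loop free
const { fork } = require('child_process');
const renderer = fork('./renderer.js'); // long-lived rendering process

function render(params) {
  return new Promise((resolve) => {
    renderer.once('message', resolve); // fine for one render at a time
    renderer.send(params);             // params go over IPC, so keep them serializable
  });
}

// renderer.js -- does the heavy rendering work
process.on('message', (params) => {
  const html = doExpensiveRender(params); // hypothetical rendering function
  process.send(html);
});
```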
The only overhead is spawning the additional instance of Node (the rendering engine itself). If you are doing extensive computation in the scripting engine, then this constant, one-time-per-request overhead of forking a new process will be minor compared to the time taken to render.
If EJS rendering blocks the main node thread, then that alone is sufficient reason NOT to use it if you are doing any significant computation during rendering.
If I were writing an application using the MEAN stack, and the database were optimized well enough to almost never be the bottleneck, could Node.js itself become a bottleneck due to site traffic and/or the number of concurrent users? This is purely from the perspective of Node.js being an asynchronous, single-threaded event loop. One of the first tenets of Node.js development is to avoid writing code that performs CPU-intensive tasks.
Say I had to post-process the data returned from MongoDB and that was even moderately CPU-intensive; it sounds like that should be handled by a service layer sitting between Node.js and MongoDB, one that is not pounding the same CPU dedicated to Node.js. Techniques such as process.nextTick() are harder to comprehend, and more importantly, harder to know when to use.
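To give an idea of the kind of technique I mean: splitting the post-processing into chunks that yield back to the event loop (a sketch; processItem is hypothetical, and setImmediate is used rather than process.nextTick so pending I/O still gets a turn):

```js
function postProcess(items, done) {
  const results = [];
  let i = 0;
  (function nextChunk() {
    const end = Math.min(i + 1000, items.length); // ~1000 items per event-loop turn
    for (; i < end; i++) {
      results.push(processItem(items[i])); // hypothetical CPU-bound step
    }
    if (i < items.length) {
      setImmediate(nextChunk); // yield so other requests can be served
    } else {
      done(null, results);
    }
  })();
}
```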
Forgive me for this borderline rant, but I really do want to have a better idea of Node.js' strengths and weaknesses.
As I've done some more research into web server software, I've begun to question whether Apache's thread/process-based method is the way to go vs. the asynchronous request handling provided by servers like Nginx and Lighttpd, which tend to scale better under heavier loads.
I understand there are many other differences between these latter two and Apache. My question is: under what circumstances would I pick a thread/process-based method over asynchronous handling?
Are there any features/technologies that I can't use with an asynchronous method (or would function poorly/not as well)?
What situations would cause the performance of an asynchronous method to perform worse than a thread/process based approach? Are these common or rare cases, and how big is the difference?
Are there any other factors I should take into consideration when comparing the two? Keep in mind I'm focusing mainly on the thread/process based method vs. asynchronous, not any particular server software which happens to utilize one of these methods. These concerns might be difficulty of managing/debugging, security issues, etc.
This is old, but worth answering. Let's start by describing how each model works.
In the threaded model, a request comes in to a handler, the handler spawns a new OS thread to handle that request, and all work for that request happens on that thread until a response is sent and the thread ends. This model supports as many concurrent requests as your server can spawn threads (but threads can be somewhat heavyweight).
In the async model, a request comes in to a handler, but instead of creating a thread to deal with it, the handler adds the connection to what's known as an event loop. The event loop listens for data/state changes on the connection and fires callbacks each time "something" happens. Once the connection is added to the event loop, the handler immediately goes back to listening for new connections. This allows you to have many (sometimes 100K) concurrent connections at the same time.
Are there any features/technologies that I can't use with an asynchronous method (or would function poorly/not as well)?
Yes, when you're doing number crunching. The architecture of an async (or "evented") system is such that it is great at passing data around but not processing data. It can handle thousands of concurrent operations, but because it only runs on one OS thread, the callbacks it fires need to do as little as possible to get the most throughput. This is because if one of your callbacks does some number crunching that takes 5 seconds, your entire server is frozen for 5 seconds until that operation completes. The idea is to get data, send it to where it's going (database, API, etc) and send a response all with minimal processing.
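To make that concrete, here is a sketch of how a single slow callback stalls every other connection on an evented server:

```js
const http = require('http');

http.createServer((req, res) => {
  if (req.url === '/crunch') {
    let total = 0;
    for (let i = 0; i < 5e9; i++) total += i; // several seconds of synchronous work
    res.end(String(total));
  } else {
    res.end('hello\n'); // normally instant, but queued behind any /crunch in progress
  }
}).listen(3000);
```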
Async is good for network I/O: passing data between multiple sources/destinations (and also user interfaces, but that's beyond this post).
What situations would cause the performance of an asynchronous method to perform worse than a thread/process based approach? Are these common or rare cases, and how big is the difference?
See above, but any time you're doing more CPU work than network I/O, you should switch to a threaded model. However, there are architectural workarounds; for instance, you could have an async app that, any time it needs to do real work, sends a job to a worker queue. However, if every request requires CPU processing, then that architecture is overkill and you might as well just use a threaded server.
Are there any other factors I should take into consideration when comparing the two? Keep in mind I'm focusing mainly on the thread/process based method vs. asynchronous, not any particular server software which happens to utilize one of these methods. These concerns might be difficulty of managing/debugging, security issues, etc.
Programming in async is generally more complicated than threaded. That said, if you're not doing the programming yourself (i.e. you're choosing between Nginx and Apache), then I usually recommend you go async (Nginx), because you'll generally be able to squeeze more juice out of your server that way. I'm always in favor of using as much async in the stack as possible.
That said, if you're programming an app and trying to decide whether to use a threaded or async model, you will have to take developer time into account. Unless you're using a language that has green threads over an event loop (like Scheme), expect to tear your hair out quite a bit over rogue exceptions crashing your entire app, and in general over wrapping your head around CPS and using callbacks for everything. Futures/promises are your friend, but are only a band-aid to make async nicer.
TL;DR
Async, when used in a server, can squeeze out (a lot) more concurrent operations than threading if you're doing network I/O and nothing else.
If you're doing any kind of number crunching, either use a threaded app server or use an async app with a background queuing system.
Async is a lot harder to program in unless your language supports "fake" threading over it (i.e. green threads). Once you get past the initial hump, you're fine, generally. If you don't have green threads, use promises.
If you have the choice between threaded and async as a component in your stack (apache vs nginx), and they provide the exact same features, slightly favor async. Don't just pick it because you think it will make everything 20x faster though.
Processes have several advantages compared to threads and async models related to security and reliability. Most websites don't need these particular advantages, but sometimes they're indispensable.
Security: you can run your worker processes in a sandbox, as a low privileged user, and handle only one request per worker process. This mitigates against some kinds of security vulnerabilities: even if an attacker takes over your entire worker process, as long as you sandboxed it tightly based on request metadata (i.e. it doesn't have write access to all your data), then it can't harm system stability or affect the responses made to requests.
Security #2: sometimes you need to sandbox untrusted code, or to enforce segregation between different code or different requests, and the only way to do this is with a separate one-shot process. (Think running user-provided code.)
Reliability: memory leaks and memory corruption are much less severe if you tear down and replace worker processes regularly (or after each request).
It's easy to enforce hard limits on CPU time, disk and network quota, etc. spent on handling a user request in a separate process. Even if the request-handling code goes into an infinite loop, the master process (or the OS) can enforce a timeout.
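In Node.js terms, for instance, that enforcement can be as simple as the master killing a forked worker that exceeds a time budget (a sketch; the worker script and timeout are made up):

```js
const { fork } = require('child_process');

function handleInWorker(payload, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const worker = fork('./handle-request.js'); // one-shot worker process
    const timer = setTimeout(() => {
      worker.kill('SIGKILL');                   // even an infinite loop dies here
      reject(new Error('request timed out'));
    }, timeoutMs);

    worker.once('message', (result) => {
      clearTimeout(timer);
      worker.kill();
      resolve(result);
    });
    worker.send(payload);
  });
}
```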
We are planning to start a fairly complex web portal which is expected to attract good local traffic, and I've been told by my boss to consider/analyse Node.js for the server side.
I think scalability and multi-core support can be handled with Nginx or Cherokee in front.
1) Is Node.js ready for serious/big business?
2) Does this 'event/asynchronous' paradigm on the server side have the potential to support heavy traffic and data operations, considering the fact that 'everything' is processed in a single thread and all the live connections would be lost if it crashed (though it's easy to restart)?
3) What are the advantages of event-based programming compared to the thread-based style, and vice versa?
(I know of the higher cost associated with thread switching, but more can be squeezed out of the hardware with the event model.)
The following are interesting but (to some extent) contradictory papers:
1) http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html
2) http://pdos.csail.mit.edu/~rtm/papers/dabek:event.pdf
Node.js is developing extremely rapidly, and most of its functionality is sturdy and ready for business. However, there are a lot of places where it's lacking, like database drivers, jQuery and the DOM, multiple HTTP headers, etc. There are plenty of modules coming up to tackle every aspect, but for a production environment you'll have to be careful to pick ones that are stable.
It's actually much, MUCH more efficient to use a single thread than a thousand (or even fifty) from an operating-system perspective, and benchmarks I've read (sorry, I don't have them on hand -- I'll try to find and link them later) show that it's able to support heavy traffic -- not sure about file-system access, though.
Event-based programming is:
Cleaner-looking code than threaded code (in JavaScript, that is)
The JavaScript engine is extremely efficient at processing events and handling callbacks, and it's easily one of the languages seeing the most runtime optimization right now.
Harder to fit when you are thinking in terms of control flow. With events, you can never be sure of the flow. However, you can also come to think of it as more dynamic programming: you can treat each event being fired as independent.
It forces you to be more security-conscious when programming, for the above reason. In that sense, it's better than linear systems, where sometimes you take sanitized input for granted.
As for the two papers, both are relatively old. The first benchmarks against this, which, as you can see, has a more recent note about these studies:
http://www.eecs.harvard.edu/~mdw/proj/seda/
It also cites the second paper you linked about what they have done, but refuses to comment on its relevance to the comparison between event-based systems and thread-based ones :)
Try it yourself to discover the truth.
See What is Node.js? where we cover exactly that:
Node in production is definitely possible, but far from the "turn-key" deployment seemingly promised by the docs. With Node v0.6.x, "cluster" has been integrated into the platform, providing one of the essential building blocks, but my "production.js" script is still ~150 lines of logic to handle stuff like creating the log directory, recycling dead workers, etc. For a "serious" production service, you also need to be prepared to throttle incoming connections and do all the stuff that Apache does for PHP. To be fair, Rails has this exact problem. It is solved via two complementary mechanisms:
1) Putting Rails/Node behind a dedicated web server (written in C and tested to hell and back) like Nginx (or Apache/Lighttpd). The web server can efficiently serve static content, do access logging, rewrite URLs, terminate SSL, enforce access rules, and manage multiple sub-services. For requests that hit the actual Node service, the web server proxies the request through.
2) Using a framework like "Unicorn" that will manage the worker processes, recycle them periodically, etc. I've yet to find a Node serving framework that seems fully baked; it may exist, but I haven't found it yet and still use ~150 lines in my hand-rolled "production.js".
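The worker-recycling piece of such a script boils down to something like this (a heavily simplified sketch, not the actual production.js; ./server is a hypothetical module that starts the HTTP server):

```js
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  os.cpus().forEach(() => cluster.fork());

  // recycle dead workers so one crash never takes the whole service down
  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died (${signal || code}), restarting`);
    cluster.fork();
  });
} else {
  require('./server');
}
```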