Node.js event vs thread programming on server side - node.js

We are planning to start a fairly complex web portal that is expected to attract good local traffic, and my boss has asked me to consider/analyse node.js for the server side.
I think scalability and multi-core support can be handled with an Nginx or Cherokee in front.
1) Is node.js ready for serious/big business?
2) Does this 'event/asynchronous' paradigm on the server side have the potential to support heavy traffic and data operations, considering that 'everything' is processed in a single thread and all live connections would be lost if it crashed (though it's easy to restart)?
3) What are the advantages of event-based programming compared to the thread-based style, or vice versa?
(I know there is a higher cost associated with thread switching, and that the hardware can be squeezed harder with the event model.)
The following papers are interesting but contradict each other to some extent:
1) http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html
2) http://pdos.csail.mit.edu/~rtm/papers/dabek:event.pdf

Node.js is developing extremely rapidly, and most of its functionality is sturdy and ready for business. However, there are a lot of places where it's lacking, like database drivers, jQuery and DOM, multiple HTTP headers, etc. There are plenty of modules coming up tackling every aspect, but for a production environment you'll have to be careful to pick ones that are stable.
It's actually much, MUCH more efficient to use a single thread than a thousand (or even fifty) from an operating system perspective, and benchmarks I've read (sorry, don't have them on hand; will try to find them and link them later) show that it's able to support heavy traffic. I'm not sure about file-system access, though.
Event-based programming:
Gives cleaner-looking code than threaded code (in JavaScript, that is).
Runs on a JavaScript engine that is extremely efficient at processing events and handling callbacks; JavaScript is easily one of the languages seeing the most runtime optimization right now.
Is harder to fit when you are thinking in terms of control flow. With events, you can never be sure of the flow. However, you can also come to think of it as a more dynamic style of programming. You can treat each event being fired as independent.
Forces you to be more security-conscious when programming, for the above reason. In that sense, it's better than linear systems, where sometimes you take sanitized input for granted.
As for the two papers, both are relatively old. The first benchmarks against the project linked below (SEDA), whose page, as you can see, has a more recent note about these studies:
http://www.eecs.harvard.edu/~mdw/proj/seda/
It also cites the second paper you linked about what they have done, but refuses to comment on its relevance to the comparison between event-based systems and thread-based ones :)

Try it yourself to discover the truth.

See What is Node.js? where we cover exactly that:
Node in production is definitely possible, but far from the "turn-key" deployment seemingly promised by the docs. With Node v0.6.x, "cluster" has been integrated into the platform, providing one of the essential building blocks, but my "production.js" script is still ~150 lines of logic to handle stuff like creating the log directory, recycling dead workers, etc. For a "serious" production service, you also need to be prepared to throttle incoming connections and do all the stuff that Apache does for PHP. To be fair, Rails has this exact problem. It is solved via two complementary mechanisms:
1) Putting Rails/Node behind a dedicated webserver (written in C and tested to hell and back) like Nginx (or Apache / lighttpd). The webserver can efficiently serve static content, handle access logging, rewrite URLs, terminate SSL, enforce access rules, and manage multiple sub-services. For requests that hit the actual Node service, the webserver proxies the request through.
2) Using a framework like "Unicorn" that will manage the worker processes, recycle them periodically, etc. I've yet to find a Node serving framework that seems fully baked; it may exist, but I haven't found it yet and still use ~150 lines in my hand-rolled "production.js".
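To give a rough idea of what the "recycling dead workers" part of such a script can look like, here is a minimal sketch built on the built-in cluster module. It is not the quoted author's production.js; the port number and the names handleRequest and startCluster are assumptions made up for illustration.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

const PORT = 8000; // assumed port, not from the original post

// Hypothetical request handler; every worker runs the same one.
function handleRequest(req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('handled by worker ' + process.pid + '\n');
}

function startCluster() {
  // cluster.isPrimary exists on newer Node versions, cluster.isMaster on older ones.
  const isPrimary = cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;

  if (isPrimary) {
    // One worker per CPU core.
    os.cpus().forEach(() => cluster.fork());

    // When a worker dies, log it and start a replacement.
    cluster.on('exit', (worker, code, signal) => {
      console.error('worker %d died (%s), starting a replacement',
        worker.process.pid, signal || code);
      cluster.fork();
    });
  } else {
    // Workers share the listening port; the cluster module distributes connections.
    http.createServer(handleRequest).listen(PORT);
  }
}
```

Calling startCluster() at the bottom of the file starts the whole thing. A real script would probably also throttle restarts, since a worker that crashes immediately on startup would otherwise be re-forked in a tight loop.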

Related

Sockets server made with Erlang vs others

I am learning Erlang and trying to understand how its sockets work as it is meant to be one of the strongest parts of the language and OTP.
I have experience with NodeJS, and wonder how applications made with NodeJS and Erlang differ in regard to how multiple socket connections are managed.
As I understand it, while JavaScript is single-threaded, V8 manages all the simultaneous connections for it, whereas Erlang can manage multiple connections itself.
So, I wonder, if Erlang has excellent support for managing multiple connections at a time, how is it different from other technologies for a programmer? I mean, when I write an app for NodeJS, it can have as many connections open and well-managed as if I wrote the code in Erlang, can't it?
Please share your thoughts, links to some articles about the specialties of Erlang in this context are welcome too.
I am by no means an expert in Erlang, but I think I know Erlang and NodeJs on the same level.
The things you say are all correct. Both can handle multiple connections very efficiently, and keep them "well-managed", as you put it.
But the problem is not only handling multiple concurrent connections. The problems Erlang solves very well are fail safety and distribution. I don't think NodeJs is as good at those, as it stands now.
Don't get me wrong, I'm not saying no one can code a distributed app in NodeJs, but considering the tools Erlang gives you, it may be the better choice.
For fail safety, as an example, Erlang lets you link your processes, so when one fails, the others also fail or get notified. That is not very practical by itself, but when you look at it alongside supervisors and shared-nothing processes, it is a great tool.
For distribution, Erlang lets you link nodes together. Linked nodes can talk to each other as if they were on the same machine, and they can spawn processes on the other side too. Combine this with the ability to restart a failed app from a failed node on another, healthy node, and you get great uptime.
And not to mention that these tools have years of experience behind them.
Just try to solve these issues in another ecosystem. I say ecosystem because Erlang as a language is not the complete picture; the tools and frameworks (mostly OTP) have to be considered too. Then you can also say that Erlang really shines in these areas.
But Erlang is also not very good when it comes to linear processing, number crunching, image/sound processing, etc. Those would be better implemented in another system.
I think, in these areas, the big difference between NodeJs and Erlang is their runtime model. NodeJs has one process with one thread that works asynchronously on io-related tasks. Of course, you can run multiple processes, but that is the basic model. On the other hand, Erlang has a VM called BEAM. Erlang uses special, very light processes inside this VM. BEAM schedules them itself, because they are not OS processes. This lets BEAM run hundreds of thousands of processes at the same time, each doing a task, be it io or anything else.
You see the difference now, I think. Erlang is more battle-tested, and better when fail safety or distribution is a must. NodeJs may be better when you need faster development and deployment.

hesitation between two technologies for a little program

I want to make a program (more precisely, a service) that periodically scans directories to find video files (.avi, .mkv, etc.) and automatically downloads some associated files (mostly subtitles) from one or several websites.
This program should run on Linux as well as Windows.
On one hand, I have known Qt well for a long time and I know all its benefits, but on the other hand, I'm attracted by node.js and its extreme flexibility and liveliness.
I need to offer some interactivity to the end user of my program (for instance, choosing the directories to scan, etc.).
What would be the best choice in your opinion in 2013?
I advise against Node.js for "small tools and programs". Especially for iterative tasks.
The long story
The reason is quite simply the way Node.js works. Its asynchronous model makes simple tasks unnecessarily convoluted. Additionally, because many callbacks are called from the Node.js event loop, you can't just wrap the calling code in try/catch structures, so any error thrown inside a callback and not caught there will crash your whole application.
Of course there are ways to catch those errors or work with them, but the docs advise you against all of them and advise you to restart the application gracefully in any case to prevent memory leaks. This means you have to implement yet another piece of code.
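To make the try/catch point concrete, here is a small sketch. The function name parseJsonLater is made up for illustration, and setImmediate merely stands in for real asynchronous I/O. A try/catch around the call site would not help, because the callback runs later from the event loop; the try/catch has to live inside the callback, and the error is handed on as the first callback argument.

```javascript
// A try/catch around the *call* to parseJsonLater() cannot catch parse errors,
// because the parsing happens later, on a fresh stack started by the event loop.
// The error therefore has to be caught inside the deferred function itself.
function parseJsonLater(text, callback) {
  setImmediate(() => {
    let value;
    try {
      value = JSON.parse(text);
    } catch (err) {
      // Report the failure through the callback instead of throwing.
      return callback(err);
    }
    callback(null, value);
  });
}

// Usage: errors arrive as the first callback argument.
parseJsonLater('{"subtitles": true}', (err, config) => {
  if (err) {
    console.error('could not parse config:', err.message);
    return;
  }
  console.log('parsed config:', config);
});
```

If the JSON.parse call were left unguarded, the SyntaxError would surface as an uncaught exception and take the process down, which is the behaviour described above.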
The only real solution in Node.js would be writing your Application as a Cluster, which is a great concept but of course would require you to use some kind of IPC to get your data back to a process that can handle it.
Also, since you wrote about "periodically scan"ning a directory, I want to point out that you should...
Use file system watchers for services
Almost every language toolkit has those now, and I strongly suggest using them and keeping a full scan only as a fallback.
In Qt there is a system-independent class QFileSystemWatcher that provides a handy notification whenever specified files are changed.
In Java there is java.nio.file.FileSystem.newWatchService().
Node.js has the fs.watch function, if you really want to go for it.
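For the Node.js option, a watcher along these lines might be a starting point. This is only a sketch: the helper names isVideoFile and watchForVideos are made up, and the extension list beyond .avi and .mkv is an assumption.

```javascript
const fs = require('fs');
const path = require('path');

// .avi and .mkv come from the question; .mp4 is an assumed extra.
const VIDEO_EXTENSIONS = new Set(['.avi', '.mkv', '.mp4']);

function isVideoFile(filename) {
  return VIDEO_EXTENSIONS.has(path.extname(filename).toLowerCase());
}

// Calls onVideo(fullPath, eventType) whenever a video file in `directory`
// is created, renamed or changed. Returns the watcher so it can be closed.
function watchForVideos(directory, onVideo) {
  return fs.watch(directory, (eventType, filename) => {
    // filename is not guaranteed on every platform, so guard against null.
    if (filename && isVideoFile(filename)) {
      onVideo(path.join(directory, filename), eventType);
    }
  });
}
```

Something like watchForVideos('/some/dir', (file) => console.log('found', file)) would then replace most of the periodic scanning. fs.watch behaves somewhat differently across operating systems, so keeping an occasional full scan as a fallback still seems wise.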

Pros & Cons of Running More Than One Node App Instance For A CodeBase

We can run more than one node app from a single code base; all we need to do is start each one on a different port. But I am not sure whether doing so is a good idea.
I can see the following pros & cons of this approach
Pros:
Multiple domains like sub1.domain.com, sub2.domain.com and so on can share the same code base.
Code is updated in a single place.
Any other pros you would like to mention?
Cons:
It may cause deadlocks when reading some files, or other multi-process issues.
Any other cons you like to mention?
Is it a good move to share code base?
Please share your experience.
Thank You
You are essentially spawning several instances of your application, which is neither a bad nor a good thing in itself; it depends on what your application does. If the application does not access any resources that will be shared with other instances of itself, it is not a problem, and you can spawn as many instances as you like, for whatever purpose you see fit.
BUT if your application uses any shared resources such as a database or flat files, you need to take race conditions and deadlocks into account. This is very well handled by ACID-compliant databases; on document-oriented databases it is not as mature and requires you to have a good grasp of the techniques and languages used.
If there is no obvious reason to run multiple instances of your application, do not do it.
Once you start going down the route of multiple instances, you have to design around bottlenecks, network traffic, backups and a lot of other things that give people headaches. Do not do it just because you can.
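As one illustration of the kind of care shared flat files need, here is a sketch of a simple lock file that several instances on the same machine could use before touching a shared file. It relies on the 'wx' open flag, which makes the open fail if the path already exists. The helper names tryAcquireLock and releaseLock are made up, and this is only one possible approach, not something the answer above prescribes.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Tries to create the lock file exclusively. Returns true if this process
// now holds the lock, false if another instance already holds it.
function tryAcquireLock(lockPath) {
  try {
    // 'wx' = open for writing, fail with EEXIST if the file already exists.
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err; // anything else (permissions, missing directory) is a real error
  }
}

function releaseLock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Hypothetical lock location; each app would pick its own path.
const exampleLockPath = path.join(os.tmpdir(), 'my-app-' + process.pid + '.lock');
```

An instance would call tryAcquireLock, do its work on the shared file only if that returned true, and call releaseLock afterwards. Note that a crashed instance leaves a stale lock behind, which is exactly the sort of extra headache the answer warns about.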

What are common development issues, pitfalls and suggestions?

I've been developing in Node.js for only 2 weeks and have started re-creating a web site previously written in PHP. So far so good, and it looks like I can do the same things in Node (with Express) that were done in PHP in the same or less time.
I have run into things you just have to get used to, such as using modules, modules not sharing a common environment, and getting into the habit of using callbacks for file system and database operations, etc.
But is there anything that a developer might discover a lot later that is pretty important to development in Node? Issues that everyone else developing in Node has, but that don't surface until later? Pitfalls? Anything that pros know and noobs don't?
I would appreciate any suggestions and advice.
Here are the things you might not realize until later:
Node will pause execution to run the garbage collector eventually/periodically. Your server will pause for a hiccup when this happens. For most people, this issue is not a significant problem, but it can be a barrier for building near-time systems. See Does Node.js scalability suffer because of garbage collection when under high load?
Node is single process and thus by default will only use 1 CPU. There is built-in clustering support to run multiple processes (typically 1 per CPU), and for the most part the Node community believes this to be a solid approach. You might be surprised by this reality, though.
Stack traces are often lost due to the event queue, so your logging and debugging methodology needs to change significantly.
Here are some minor stumbling blocks you may run into for a while (I still bump up against these):
Remembering to do callback(null, value) on a successful callback. Passing null as a first parameter is weird and thus I forget to do it. Instead I accidentally do callback(value), which is interpreted as an error by the caller until I debug into it for a while and slap my forehead.
Forgetting to use return when you invoke the callback in a guard clause and don't want a function to continue to execute past that point. Sometimes this results in the callback getting invoked twice, which causes all manner of misbehavior.
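The two stumbling blocks above can be shown side by side in a small sketch. The USERS table and the function names are made up for illustration, and the callbacks are invoked synchronously only to keep the example short; real code would call them after some I/O.

```javascript
// Hypothetical in-memory data standing in for a database.
const USERS = { 1: 'alice' };

// Buggy version: shows both mistakes.
function findUserBuggy(id, callback) {
  if (!USERS[id]) {
    callback(new Error('no such user: ' + id)); // missing `return` ...
  }
  // ... so execution continues and the callback fires a second time.
  // The value is also passed in the error position by mistake.
  callback(USERS[id]);
}

// Fixed version: return from the guard clause, and pass null as the error.
function findUser(id, callback) {
  if (!USERS[id]) {
    return callback(new Error('no such user: ' + id));
  }
  callback(null, USERS[id]);
}

// Usage: the caller always checks the first argument for an error.
findUser(1, (err, name) => {
  if (err) return console.error(err.message);
  console.log('found', name);
});
```

With the buggy version, a caller looking up user 1 receives 'alice' as its err argument and treats the success as a failure, and a caller looking up a missing user has its callback run twice.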
Here are some NICE things you might not realize initially
It is much easier in node.js, using one of the awesome flow control libraries, to do complex operations like loading 3 network resources in parallel, then making 2 DB calls in serial, then writing to 2 log files in parallel, then sending an HTTP response. This stuff is trivial and beautiful in node and damn near impossible in many synchronous environments.
ALL of node's modules are new and modern, and for the most part, you can find a beautifully-designed module with a great API to do what you need. Python has great libraries by now, too, but compare Node's cheerio or jsdom module to python's BeautifulSoup and see what I mean. Compare python's requests module to node's superagent.
There's a community benefit that comes from working with a modern platform where people are focused on modern web development. The contrast between the node community and the PHP community cannot be overstated.
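To illustrate the flow-control point from the first item above, here is a sketch of what such helpers do under the hood. The parallel, series and runPipeline functions are hand-written stand-ins, not the API of any particular library, and the real libraries are considerably richer.

```javascript
// Runs all tasks at once; calls done(err) on the first error,
// or done(null, results) with results in task order once all have finished.
function parallel(tasks, done) {
  const results = new Array(tasks.length);
  let pending = tasks.length;
  let finished = false;
  if (pending === 0) return done(null, results);
  tasks.forEach((task, i) => {
    task((err, value) => {
      if (finished) return; // ignore late callbacks after an error
      if (err) {
        finished = true;
        return done(err);
      }
      results[i] = value;
      pending -= 1;
      if (pending === 0) {
        finished = true;
        done(null, results);
      }
    });
  });
}

// Runs tasks one after another, stopping at the first error.
function series(tasks, done) {
  const results = [];
  function next(i) {
    if (i === tasks.length) return done(null, results);
    tasks[i]((err, value) => {
      if (err) return done(err);
      results.push(value);
      next(i + 1);
    });
  }
  next(0);
}

// Example composition: fetch several resources in parallel, then make DB calls in series.
function runPipeline(fetchTasks, dbTasks, done) {
  parallel(fetchTasks, (err, fetched) => {
    if (err) return done(err);
    series(dbTasks, (err2, stored) => {
      if (err2) return done(err2);
      done(null, { fetched, stored });
    });
  });
}
```

Each task is a function that takes an error-first callback, so three network fetches and two DB calls are just five such functions passed to runPipeline; the logging and HTTP response steps from the description would chain on in the same way.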

Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis?

I'm writing a piece to a project that's responsible for processing tasks outside of the main application facing data server, which is written in javascript using Node.js. It needs to handle tasks which are scheduled in the future and potentially handle tasks that are "right now". The "right now" just means the next time a worker becomes available it will operate on that task, so that bit might not matter. The workers are going to all talk to external resources, an example job would be to send an email. We are a small shop and we don't have a ton of resources so one thing I don't want to do is start mixing languages at this point in the process, and I already see that Node can do this for us pretty easily, so that's what we're going to go with unless I see a compelling reason not to before I start coding, which is soon.
All that said, I can't tell if there is a compelling reason to use an AMQP based server, like OpenAMQ or RabbitMQ over something like Kue or Beanstalkd with a node client. So, here we go:
Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis with Kue? If yes, which AMQP based server would fit best with the architecture that I laid out? If no, which nosql solution (beanstalkd, redis/Kue) would be easiest to set up and fastest to deploy?
FWIW, I'm not accepting my answer yet, I'm going to explain what I've decided and why. If I don't get any answers that appear to be better than what I've decided, I'll accept my own later.
I decided on Kue. It supports multiple workers running asynchronously, and with cluster it can take advantage of multicore systems. It is easily extended to provide security. It's backed with Redis, which is used all over for this exact thing, so I know I'm not backing my job process server with unproven software (that's not to say that any of the others are unproven.)
The most compelling reason I picked Kue is that it provides a JSON API, so that the client applications (the first client is going to be a web based application, but we're planning on making smartphone apps also) can add jobs easily without going through the main application facing node instance, so I can be totally out of the way of the rest of my team as I write this. I don't need a route, I don't need anything, and it's all provided for me so I don't need to write anything to support this. This has another advantage: with an extension to provide l/p security, only authorized clients can add jobs, so I don't have to expose my redis server to client applications directly. It also has a built-in web console, and the API allows the client to pull back lists of jobs associated with a given user very easily, so we can show the user all of their scheduled tasks in a nifty calendar view with 0 effort on my part.
The other compelling reason is the lack of steep learning curve associated with getting redis and Kue going for me. I've set up redis before, and Kue is simple and effective.
Yes, I'm a lazy developer, but I'm the good kind of lazy developer.
UPDATE:
I have it working and doing jobs, and the throughput is amazing. I split out the task marshaling logic into its own node instance; basically all I have to do is deploy my repo to a new machine and run node task-server.js to scale out my workers. I may need to add in some more job searching calls to Kue, because of how I implemented a few things, but that will be easy.
