hesitation between two technologies for a little program - node.js

I want to make a program (more precisely, a service) that periodically scans directories to find some video files (.avi, .mkv, etc) and automatically download some associated files (mostly subtitles) from one or several websites.
This program could run on linux or windows as well.
On one hand, I know well Qt from a long time and I know all its benefits, but on the other hand, I'm attracted by node.js and it extreme flexibility and liveliness.
I need to offer some interactivity with the end user of my program (for instance, chose the scans directories, etc).
What would be the best choice in your opinion in 2013?

I advise against Node.js for "small tools and programs". Especially for iterative tasks.
The long story
The reason is quite simply the way Node.js works. Its asynchronous model makes simple tasks unnecessarily convoluted. Additionally, because many callbacks are called from the Node.js event loop, you can't just use try/catch structures so every tiny error will crash your whole Application.
Of course there are ways to catch those errors or work with them, but the docs advise you against all of them and advise you to restart the application gracefully in any case to prevent memory leaks. This means you have to implement yet another piece of code.
The only real solution in Node.js would be writing your Application as a Cluster, which is a great concept but of course would require you to use some kind of IPC to get your data back to a process that can handle it.
Also, since you wrote about "periodically scan"ning a directory, I want to point out that you should...
Use file system watchers for services
Almost every language kit has those now and I strongly suggest using those and only use a fallback full-scan.
In Qt there is a system-independent class QFileSystemWatcher that provides a handy callback whenever specified files are changed
In Java there is the java.nio.file.FileSystem.getWatchService()
Node.js has the fs.watch function, if you really want to go for it

Related

Sockets server made with Erlang vs others

I am learning Erlang and trying to understand how its sockets work as it is meant to be one of the strongest parts of the language and OTP.
I have experience with NodeJS, and wonder, how the applications made with NodeJS and Erlang differ in regards on how multiple sockets connections are managed.
As I understand, while JavaScript is single-threaded, V8 manages all the multiple simultaneous connections for it, though Erlang can manage multiple connections itself.
So, I wonder, if Erlang has excellent support for managing multiple connections at a time, how is it different from other technologies for a programmer? I mean, when I write an app for NodeJS, it can have as many connections open and well-managed as if I wrote code in Erlang, isn't it?
Please share your thoughts, links to some articles about the specialties of Erlang in this context are welcome too.
I am by no means an expert in Erlang, but I think I know Erlang and NodeJs on the same level.
The things you say, are all correct. Bot can handle multiple connections very efficiently, well well-managed you say.
But the thing is, the problems are not only handling multiple concurrent connections. The problems Erlang tries to solve very good, are fail safety, and distribution. I don't think NodeJs will be as good at it, as it is now.
Don't take it wrong, I'm not saying no one can code a distributed app in NodeJs, but considering the tools Erlang gives you, it maybe is a better choice.
For fail safety, as an example, Erlang let's you link your processes, so when one fails, other also fails or gets notified. That is not very practical by itself, but when you look at it alongside supervisors and shared-nothing processes, it is a great tool.
For distribution, Erlang let's you link nodes together. Linked nodes can talk together as if they were on the same machine, and they can spawn processes on other side too. Consider this, with the ability to start a failed app from a failed node on another node that is healthy. Gives you a great uptime.
And not to mention that these tools have years of experience behind them.
Just try to solve these issues on another ecosystem. I say ecosystem, because Erlang as a language is not complete, but the tools and frameworks (mostly OTP) have to be considered too. Then you can also say that Erlang really shines in this areas.
But Erlang also is not very good when it comes to linear processing, number crunching, image/sound processing, etc. That would be better implemented in another system.
I think, in this areas, the big difference between NodeJs and Erlang is their runtime model. NodeJs has one process, one thread that is working async on io-related tasks. Of course, you can run multiple processes, but that is the basic thing. On the other hand, Erlang has a VM called BEAM. Erlang uses special processes inside this VM, very light processes. BEAM schedules them itself, because they are not OS processes. This gives BEAM the advantage to have hundreds of thousands processes at the same time, each doing a task, be it io or anything else.
You see the difference now, I think. Erlang is more battle-tested, more better when fail safety or distribution is a must. NodeJs maybe better when you need faster development, and deployment.

When do I need to use worker processes in Heroku

I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job as suggested in Worker Dynos, Background Jobs and Queueing. The article gives as a rule of thumb that tasks that take longer than 500 ms be moved to background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced that it's worth the time to set everything up.
So, my questions are:
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
this is more a question of preference, than anything.
in general i say no - it's not ok... but that's based on experience in building rabbitmq services that run in heroku workers, and not seeing this as a difficult thing to do.
with a little practice, you may find that this is the simpler solution, as I have (it allows simpler code, and more robust code, as it splits the web away from the background processor - allowing each to run without knowing about each other directly)
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
are you missing something? not really
as long as you write your current in-the-web-process code in a well structured and modular fashion, moving it to a background process is not usually a big deal
most of the panic that people get from having to move code into the background, comes from having their code tightly coupled to the HTTP request / response process (i know from personal experience how painful it can be)
Is there a way to do this that is easier than implementing a message queue?
there are many options for distributed computing and background processing. i personally like RabbitMQ and the messaging patterns that it uses.
i would suggest giving it a try and seeing if it's something that can work well for you.
other options include redis with pub/sub libraries on top of it, using direct HTTP API calls to another web server, or just using a timer in your background process to check database tables on a given frequency and having the code run based on the data it finds.
p.s. you may find my RabbitMQ For Developers course of interest, if you are wanting to dig deeper into RMQ w/ node: http://rabbitmq4devs.com

What are common development issues, pitfalls and suggestions?

I've been developing in Node.js for only 2 weeks and started re-creating a web site previously written in PHP. So far so good, and looks like I can do same thing in Node (with Express) that was done in PHP in same or less time.
I have ran into things you just have to get used to such as using modules, modules not sharing common environment, and getting into a habit of using callbacks for file system and database operations etc.
But is there anything that developer might discover a lot later that is pretty important to development in node? Issues that everyone else developing in Node has but they don't surface until later? Pitfalls? Anything that pros know and noobs don't?
I would appreciate any suggestions and advice.
Here are the things you might not realize until later:
Node will pause execution to run the garbage collector eventually/periodically. Your server will pause for a hiccup when this happens. For most people, this issue is not a significant problem, but it can be a barrier for building near-time systems. See Does Node.js scalability suffer because of garbage collection when under high load?
Node is single process and thus by default will only use 1 CPU. There is built-in clustering support to run multiple processes (typically 1 per CPU), and for the most part the Node community believes this to be a solid approach. You might be surprised by this reality, though.
Stack traces are often lost due to the event queue, so your logging and debugging methodology needs to change significantly
Here are some minor stumbling blocks you may run into for a while (I still bump up against these)
Remembering to do callback(null, value) on a successful callback. Passing null as a first parameter is weird and thus I forget to do it. Instead I accidentally do callback(value), which is interpreted as an error by the caller until I debug into it for a while and slap my forehead.
forgetting to use return when you invoke the callback in a guard clause and don't want a function to continue to execute past that point. Sometimes this results in the callback getting invoked twice, which causes all manner of misbehavior.
Here are some NICE things you might not realize initially
It is much easier in node.js, using one of the awesome flow control libraries, to do complex operations like loading 3 network resources in parallel, then making 2 DB calls in serial, then writing to 2 log files in parallel, then sending an HTTP response. This stuff is trivial and beautiful in node and damn near impossible in many synchronous environments.
ALL of node's modules are new and modern, and for the most part, you can find a beautifully-designed module with a great API to do what you need. Python has great libraries by now, too, but compare Node's cheerio or jsdom module to python's BeautifulSoup and see what I mean. Compare python's requests module to node's superagent.
There's a community benefit that comes from working with a modern platform where people are focused on modern web development. The contrast between the node community and the PHP community cannot be overstated.

nodejs job server (multiple purpose)

I'm fairly new and just getting to know node.js (background as PHP developer). I've seen some nodeJs examples and the video on nodejs website.
Currently I'm running a video site and in the background a lot of tasks have to be executed. Currently this is done by cronjobs that call php scripts. The downsite of this approach is when an other process is started when the previous is still working you get a high load on the servers etc.
The jobs that needs to be done on the server are the following:
Scrape feeds from websites and insert them in mysql database
Fetch data from websites (scraping) (upon request)
Generate data for reporting. These are mostly mysql queries that need to be executed.
Tasks that need to be done in the future
Log video views (when a user visits a video page) (this will also be logged to mysql)
Log visitors in general
Show ads based on searched video
I want to be able to call an url so that a job can be queued and also be able to schedule jobs by time or they can run constantly.
I don't know if node.js is the path to follow that's why I'm asking it here. What are the advantages of doing this in node? The downsites?
What are the pro's here with node.js?
Thanks for the response!
While traditionally used for web/network tasks (web servers, IRC chat servers, etc.), Node.js shines when you give it any kind of IO bound (as opposed to CPU bound) task, since it uses completely asynchronous IO (that is, all IO happens outside of the main event loop). For example, Node can easily hold open many sockets, waiting for data on each, or stream data to and from files very efficiently.
It really sounds like you're just looking for a job queue; a popular one is Resque, and though it's written for Ruby, there are versions for PHP, Node.js, and more. There are also job queues built specifically for PHP; if you want to stick to PHP, a Google search for "PHP job queue" make take you far.
Now, one advantage to using Node.js is, again, its ability to handle a lot of IO very easily. Of course I'm just guessing, but based on your requirements, it could be a good tool for the job:
Scrape data/feeds from websites - mostly waiting on network IO
Insert data into MySQL - mostly waiting on network IO
Reporting - again, Node is good at MySQL queries, but probably not so good at analyzing data
Call a URL to schedule a job - Node's built-in HTTP handling and excellent web libraries make this a cinch
So it's entirely possible you may want to experiment with Node for these tasks. If you do, take a look at Resque for Node or another job system like Kue. It's also not very hard to build your own, if you don't need something complicated--Redis is a good tool for this.
There are a few reasons you might not want to use Node. If you're not familiar with JavaScript and evented and continuation-passing style programming, Node.js may have a bit of a learning curve, as you have to force yourself to stop thinking synchronously. Furthermore, if you do have a lot of heavy non-IO tasks in your program, such as analyzing data, Node will not excel as those calculations will block the main event loop and keep Node from handling callbacks, etc. for your asynchronous IO. Finally, if you have a lot of logic already in PHP or another language, it may be easier and/or quicker to find a solution in your language of choice.
I second the above answers. You don't necessarily need a full-service job queue, however: you can use flow-control modules like async to run tasks in parallel or series, as fast as they'll go or with controlled concurrency.
Node.js has many powerful scraping/parsing tools. This post mentions a few; I just heard about Trumpet recently; there are probably dozens of options. Node.js has a Stream module in core and Request makes HTTP interactions extremely easy.
For timed tasks, the simplest approach is a basic setTimeout/setInterval. Or you could write the scraper as a script that's called on cron. Or have it triggered on some event using the EventEmitter module in core. etc...
Uncontrolled amount of node js parallel jobs may lay down your server. You will need to control processes or in better way put them in queue for each task
For this needs and if you know php I suggest to use gearman and add jobs by needs or by small php scripts

Node.js event vs thread programming on server side

We are planning to start a fairly complex web-portal which is expected to attract good local traffic and I've been told by my boss to consider/analyse node.js for the serve side.
I think scalability and multi-core support can be handled with an Nginx or Cherokee in front.
1) Is this node.js ready for some serious/big business?
2) Does this 'event/asynchronous' paradigm on server side has the potential to support the heavy traffic and data operation ? considering the fact that 'everything' is being processed in a single thread and all the live connections would be lost if it got crashed (though its easy to restart).
3) What are the advantages of event based programming compared to thread based style ? or vice-versa.
(I know of higher cost associated with thread switching but hardware can be squeezed with event model.)
Following are interesting but contradicting (to some extent) papers:-
1) http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html
2) http://pdos.csail.mit.edu/~rtm/papers/dabek:event.pdf
Node.js is developing extremely rapidly, and most of its functionality is sturdy and ready for business. However, there are a lot of places where its lacking, like database drivers, jquery and DOM, multiple http headers, etc. There are plenty of modules coming up tackling every aspect, but for a production environment you'll have to be careful to pick ones that are stable.
Its actually much MUCH more efficient using a single thread than a thousand (or even fifty) from an operating system perspective, and benchmarks I've read (sorry, don't have them on hand -- will try to find them and link them later) show that it's able to support heavy traffic -- not sure about file-system access though.
Event based programming is:
Cleaner-looking code than threaded code (in JavaScript, that is)
The JavaScript engine is extremely efficient with processing events and handling callbacks, and its easily one of the languages seeing the most runtime optimization right now.
Harder to fit when you are thinking in terms of control flow. With events, you can never be sure of the flow. However, you can also come to think of it as more dynamic programming. You can treat each event being fired as independent.
It forces you to be more security-conscious when programming, for the above reason. In that sense, its better than linear systems, where sometimes you take sanitized input for granted.
As for the two papers, both are relatively old. The first benchmarks against this, which as you can see, has a more recent note about these studies:
http://www.eecs.harvard.edu/~mdw/proj/seda/
It also cites the second paper you linked about what they have done, but refuses to comment on its relevance to the comparison between event-based systems and thread-based ones :)
Try yourself to discover the truth
See What is Node.js? where we cover exactly that:
Node in production is definitely possible, but far from the "turn-key" deployment seemingly promised by the docs. With Node v0.6.x, "cluster" has been integrated into the platform, providing one of the essential building blocks, but my "production.js" script is still ~150 lines of logic to handle stuff like creating the log directory, recycling dead workers, etc. For a "serious" production service, you also need to be prepared to throttle incoming connections and do all the stuff that Apache does for PHP. To be fair, Rails has this exact problem. It is solved via two complementary mechanisms: 1) Putting Rails/Node behind a dedicated webserver (written in C and tested to hell and back) like Nginx (or Apache / Lighttd). The webserver can efficiently serve static content, access logging, rewrite URLs, terminate SSL, enforce access rules, and manage multiple sub-services. For requests that hit the actual node service, the webserver proxies the request through. 2) Using a framework like "Unicorn" that will manage the worker processes, recycle them periodically, etc. I've yet to find a Node serving framework that seems fully baked; it may exist, but I haven't found it yet and still use ~150 lines in my hand-rolled "production.js".

Resources