How to run long running synchronous operation in nodejs - node.js

I am writing payroll management web application in nodejs for my organisation. In many cases application shall involve cpu intensive mathematical calculation for calculating the figures and that too with many users trying to do this simulatenously.
If i plainly write the logic (setting aside the fact that i already did my best from algorithm and data structure point of view to contain the complexity) it will run synchronously blocking the event loop and make request, response slow.
How to resolve this scenario? What are the possible options to do this asynchronously? I also want to mention that this calculation stuff can be let to run in the background and later i can choose to tell user via notification about the status. I have searched for the solution all over this places and i found some solutions but only in theory & i haven't tested them all by implementing. Mentioning below:
Clustering the node server
Use worker threads
Use an alternate server and do some load balancing.
Use a message queue and couple it with worker thread to do backgound tasks.
Can someone suggest me some tried and battle tested advice on this scenario? and also some tutorial links associated with that.

You might wanna try web workers,easy to use and documented.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers

Related

How to distribute NodeJS requests to several servers and merge the results

I have a simple NodeJS web app that calls several apis asynchronously and merges the results to return one big result. Now let's say that I want to optimize this. How do I do this?
I am new to NoeJS and also the concept of scaling systems. I have been reading about load balancing, distributed systems, etc... I think this is the right way to go, but honestly I don't know.
I was thinking of doing something like this -
Set up a system that has several servers, and each has an instance of a NodeJS webapp that makes an api call given a path, and returns the result.
Have a master server that grabs the result from each of these servers, and merge the result and return it to the client.
Is this right way to go? What technologies do I use? Thank you for your help.
I am guessing you are trying to setup web-crawling or api-crawling, to grab data from 3rd party end point. If that is true, you would have a list of users / IDs or something like that that you pass to the web service you call and grab the data.
First of making a large number of requests very fast and in a stable way is tricky and depends on several factors to be stable and robust.
Is the 3rd party API rate limited.
Network connection on the client machine making the requests.
Error handling for both API and client errors like connection reset etc.
Sheer volume of data you are fetching back, like if you are trying to crawl data on millions of users from 3rd party API as fast as possible.
Your instinct is correct that you would have to scale this over several servers or at-least several parallel node processes on machine with lot of resources, however start small, test, and then scale would be my recommendation. Here are a few steps.
Use a good robust node http client like axios
If you are dealing with huge number of items (username, ids. emails etc) you will need stable way of iterating over them. Put them in a database like PostgreSQL or MySQL.
From here on figure out what's the fastest rate at which your API supports calling. And write stable function to iterate over your 'input' and call the API.
Then you have a couple of options. If data you are collecting is separate for each request you make. You can save it back in the database for each input. If you literally want to merge the data from multiple API calls, you can use a key-value storage like Redis. You can give an ID to each call and create a combination key for input+request_id format, then when all requests are done, you can merge them.
When you a small scale model in place you can now add a good job manager like Kue or Bull to the mix, and split the set of inputs in database from point (2) over several jobs that can be run in parallel.
Once you have a stable job-manager for that can repeat this node process for a range of inputs , now you are at a point where you can scale.
Deploy this same code on multiple servers that all talk to same Database and Redis. Install the Node process to run using a process manager like PM2.
Finally the way setup works is, each copy of same node program fetches a different set of inputs (usernames/IDs etc) form the source database, and writes the results back to the database or Redis depending on how you want to handle the output.
Optional post processing on redis to fetch the key value pairs and merge the responses grouped by input.
Some important things you have to be hyper aware of when coding this issues are:
Memory Management: Use design patterns/code/libraries that saves you most memory. Load absolutely minimum of what you need to in memory. Eg: iterating on an array of 1 millions usernames in memory is more expensive than keeping them in database and paging over them.
Error Handing: There will be lots of them. API errors, unforeseen exceptions, memory leaks, network drops etc. Having robust error handling and recovery mechanism will save the day.
Logging: Good quality logging will be critical to keep a check on how different parts of system are doing. Look at winston.
Throttling API calls: Remember making 10,000 API calls at the same minute will likely crash your machine or even most APIs.At the very least go very slow due to memory overloads. However adding a slight delay (like 10 milliseconds) between every 10 parallel calls will be HUGE boost in speed and make the calls much more stable. This strategy is called throttling or rate-limiting the API calls. Finding a sweet spot that works for your problem is important. Yes going slow can actually make you reach goal faster!
Your question was quite broad without specific code question, this is a general strategy and hopefully will give you a good starting point and links to reference materials so you can start building your solution.

When do I need to use worker processes in Heroku

I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job as suggested in Worker Dynos, Background Jobs and Queueing. The article gives as a rule of thumb that tasks that take longer than 500 ms be moved to background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced that it's worth the time to set everything up.
So, my questions are:
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
this is more a question of preference, than anything.
in general i say no - it's not ok... but that's based on experience in building rabbitmq services that run in heroku workers, and not seeing this as a difficult thing to do.
with a little practice, you may find that this is the simpler solution, as I have (it allows simpler code, and more robust code, as it splits the web away from the background processor - allowing each to run without knowing about each other directly)
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
are you missing something? not really
as long as you write your current in-the-web-process code in a well structured and modular fashion, moving it to a background process is not usually a big deal
most of the panic that people get from having to move code into the background, comes from having their code tightly coupled to the HTTP request / response process (i know from personal experience how painful it can be)
Is there a way to do this that is easier than implementing a message queue?
there are many options for distributed computing and background processing. i personally like RabbitMQ and the messaging patterns that it uses.
i would suggest giving it a try and seeing if it's something that can work well for you.
other options include redis with pub/sub libraries on top of it, using direct HTTP API calls to another web server, or just using a timer in your background process to check database tables on a given frequency and having the code run based on the data it finds.
p.s. you may find my RabbitMQ For Developers course of interest, if you are wanting to dig deeper into RMQ w/ node: http://rabbitmq4devs.com

nodejs job server (multiple purpose)

I'm fairly new and just getting to know node.js (background as PHP developer). I've seen some nodeJs examples and the video on nodejs website.
Currently I'm running a video site and in the background a lot of tasks have to be executed. Currently this is done by cronjobs that call php scripts. The downsite of this approach is when an other process is started when the previous is still working you get a high load on the servers etc.
The jobs that needs to be done on the server are the following:
Scrape feeds from websites and insert them in mysql database
Fetch data from websites (scraping) (upon request)
Generate data for reporting. These are mostly mysql queries that need to be executed.
Tasks that need to be done in the future
Log video views (when a user visits a video page) (this will also be logged to mysql)
Log visitors in general
Show ads based on searched video
I want to be able to call an url so that a job can be queued and also be able to schedule jobs by time or they can run constantly.
I don't know if node.js is the path to follow that's why I'm asking it here. What are the advantages of doing this in node? The downsites?
What are the pro's here with node.js?
Thanks for the response!
While traditionally used for web/network tasks (web servers, IRC chat servers, etc.), Node.js shines when you give it any kind of IO bound (as opposed to CPU bound) task, since it uses completely asynchronous IO (that is, all IO happens outside of the main event loop). For example, Node can easily hold open many sockets, waiting for data on each, or stream data to and from files very efficiently.
It really sounds like you're just looking for a job queue; a popular one is Resque, and though it's written for Ruby, there are versions for PHP, Node.js, and more. There are also job queues built specifically for PHP; if you want to stick to PHP, a Google search for "PHP job queue" make take you far.
Now, one advantage to using Node.js is, again, its ability to handle a lot of IO very easily. Of course I'm just guessing, but based on your requirements, it could be a good tool for the job:
Scrape data/feeds from websites - mostly waiting on network IO
Insert data into MySQL - mostly waiting on network IO
Reporting - again, Node is good at MySQL queries, but probably not so good at analyzing data
Call a URL to schedule a job - Node's built-in HTTP handling and excellent web libraries make this a cinch
So it's entirely possible you may want to experiment with Node for these tasks. If you do, take a look at Resque for Node or another job system like Kue. It's also not very hard to build your own, if you don't need something complicated--Redis is a good tool for this.
There are a few reasons you might not want to use Node. If you're not familiar with JavaScript and evented and continuation-passing style programming, Node.js may have a bit of a learning curve, as you have to force yourself to stop thinking synchronously. Furthermore, if you do have a lot of heavy non-IO tasks in your program, such as analyzing data, Node will not excel as those calculations will block the main event loop and keep Node from handling callbacks, etc. for your asynchronous IO. Finally, if you have a lot of logic already in PHP or another language, it may be easier and/or quicker to find a solution in your language of choice.
I second the above answers. You don't necessarily need a full-service job queue, however: you can use flow-control modules like async to run tasks in parallel or series, as fast as they'll go or with controlled concurrency.
Node.js has many powerful scraping/parsing tools. This post mentions a few; I just heard about Trumpet recently; there are probably dozens of options. Node.js has a Stream module in core and Request makes HTTP interactions extremely easy.
For timed tasks, the simplest approach is a basic setTimeout/setInterval. Or you could write the scraper as a script that's called on cron. Or have it triggered on some event using the EventEmitter module in core. etc...
Uncontrolled amount of node js parallel jobs may lay down your server. You will need to control processes or in better way put them in queue for each task
For this needs and if you know php I suggest to use gearman and add jobs by needs or by small php scripts

Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis?

I'm writing a piece to a project that's responsible for processing tasks outside of the main application facing data server, which is written in javascript using Node.js. It needs to handle tasks which are scheduled in the future and potentially handle tasks that are "right now". The "right now" just means the next time a worker becomes available it will operate on that task, so that bit might not matter. The workers are going to all talk to external resources, an example job would be to send an email. We are a small shop and we don't have a ton of resources so one thing I don't want to do is start mixing languages at this point in the process, and I already see that Node can do this for us pretty easily, so that's what we're going to go with unless I see a compelling reason not to before I start coding, which is soon.
All that said, I can't tell if there is a compelling reason to use an AMQP based server, like OpenAMQ or RabbitMQ over something like Kue or Beanstalkd with a node client. So, here we go:
Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis with Kue? If yes, which AMPQ based server would fit best with the architecture that I laid out? If no, which nosql solution (beanstalkd, redis/Kue) would be easiest to set up and fastest to deploy?
FWIW, I'm not accepting my answer yet, I'm going to explain what I've decided and why. If I don't get any answers that appear to be better than what I've decided, I'll accept my own later.
I decided on Kue. It supports multiple workers running asynchronously, and with cluster it can take advantage of multicore systems. It is easily extended to provide security. It's backed with Redis, which is used all over for this exact thing, so I know I'm not backing my job process server with unproven software (that's not to say that any of the others are unproven.)
The most compelling reasons that I picked Kue is that it provides a JSON api so that the client applications (The first client is going to be a web based application, but we're planning on making smartphone apps also) can add jobs easily without going through the main application facing node instance, so I can be totally out of the way of the rest of my team as I write this. I don't need a route, I don't need anything, and it's all provided for me so I don't need to write anything to support this. This has another advantage, with an extention to provide l/p security only authorized clients can add jobs, so I don;t have to expose my redis server to client applications directly. It also has a built in web console and the API allows the client to pull back lists of jobs associated with a given user very easily, so we can show the user all of their scheduled tasks in a nifty calendar view with 0 effort on my part.
The other compelling reason is the lack of steep learning curve associated with getting redis and Kue going for me. I've set up redis before, and Kue is simple and effective.
Yes, I'm a lazy developer, but I'm the good kind of lazy developer.
UPDATE:
I have it working and doing jobs, the throughput is amazing. I split out the task marshaling logic into it's own node instance, basically all I have to do is deploy my repo to a new machine and run node task-server.js to scale out my workers. I may need to add in some more job searching calls to Kue, because of how I implimented a few things, but that will be easy.

communication between two processes running node.js

I'm writing a server, and decided to split up the work between different processes running node.js, because I heard node.js was single threaded and figured this would parallize better. The application is going to be a game. I have one process serving html pages, and then other processes dealing with the communication between clients playing the game. The clients will be placed into "rooms" and then use sockets to talk to each other relayed through the server. The problem I have is that the html server needs to be aware of how full the different rooms are to place people correctly. The socket servers need to update this information so that an accurate representation of the various rooms is maintained. So, as far as I see it, the html server and the room servers need to share some objects in memory. I am planning to run it on one (multicore) machine. Does anyone know of an easy way to do this? Any help would be greatly appreciated
Node currently doesn't support shared memory directly, and that's a reflection of JavaScript's complete lack of semantics or support for threading/shared memory handling.
With node 0.7, only recently usable even experimentally, the ability to run multiple event loops and JS contexts in a single process has become a reality (utilizing V8's concept of isolates and large changes to libuv to allow multiple event loops per process). In this case it's possible, but still not directly supported or easy, to have some kind of shared memory. In order to do that you'd need to use a Buffer or ArrayBuffer (both which represent a chunk of memory outside of JavaScript's heap but accessible from it in a limited manner) and then some way to share a pointer to the underlying V8 representation of the foreign object. I know it can be done from a minimal native node module but I'm not sure if it's possible from JS alone yet.
Regardless, the scenario you described is best fulfilled by simply using child_process.fork and sending the (seemingly minimal) amount of data through the communication channel provided (uses serialization).
http://nodejs.org/docs/latest/api/child_processes.html
Edit: it'd be possible from JS alone assuming you used node-ffi to bridge the gap.
You may want to try using a database like Redis for this. You can have a process subscribed to a channel listening new connections and publishing from the web server every time you need.
You can also have multiple processes waiting for users and use a list and BRPOP to subscribe to wait for players.
Sounds like you want to not do that.
Serving and message-passing are both IO-bound, which Node is very good at doing with a single thread. If you need long-running calculations about those messages, those might be good for doing separately, but even so, you might be surprised at how well you do with a single thread.
If not, look into Workers.
zeromq is also becomming quite popular as a process comm method. Might be worth a look. http://www.zeromq.org/ and https://github.com/JustinTulloss/zeromq.node

Resources