When do I need to use worker processes in Heroku - node.js

I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job, as suggested in Worker Dynos, Background Jobs and Queueing. That article gives a rule of thumb that tasks taking longer than 500 ms should be moved to a background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced it's worth the time to set everything up.
So, my questions are:
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?

For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
This is more a question of preference than anything.
In general I say no, it's not OK... but that's based on my experience building RabbitMQ services that run in Heroku workers, and not finding this a difficult thing to do.
With a little practice, you may find that this is the simpler solution, as I have. It allows simpler, more robust code, because it splits the web process away from the background processor, letting each run without knowing about the other directly.
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Are you missing something? Not really.
As long as you write your current in-the-web-process code in a well-structured and modular fashion, moving it to a background process is not usually a big deal.
Most of the panic people feel about moving code into the background comes from having that code tightly coupled to the HTTP request/response cycle (I know from personal experience how painful that can be).
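For what it's worth, here is a minimal sketch of what I mean by modular: keep the fan-out work in its own function so the after save trigger only decides where that function runs. The module name, callExternalApi, and the items column are all made up for illustration.

    // lib/process-record.js (hypothetical module)
    function callExternalApi(item) {
      // placeholder for the real external API call
      return Promise.resolve(item);
    }

    async function processRecord(record) {
      // one external API call per element of the record's array column
      await Promise.all(record.items.map(callExternalApi));
    }

    module.exports = { processRecord };

    // in the after save trigger, today: run it inline in the web process
    const { processRecord } = require('./lib/process-record');

    function afterSave(record) {
      processRecord(record).catch(err => console.error('post-save work failed', err));
    }

Later, the only thing that changes is the body of afterSave: publish the record to a queue instead, and have the worker dyno require the same module and call processRecord there.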
Is there a way to do this that is easier than implementing a message queue?
There are many options for distributed computing and background processing. I personally like RabbitMQ and the messaging patterns it uses.
I would suggest giving it a try and seeing if it's something that can work well for you.
Other options include Redis with pub/sub libraries on top of it, direct HTTP API calls to another web server, or a timer in your background process that checks database tables at a given frequency and runs code based on the data it finds.
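To make the Redis pub/sub option concrete, here is a rough sketch assuming the classic node_redis (v3-style) callback API and a made-up "record-saved" channel; it is a sketch, not a drop-in implementation.

    const redis = require('redis');

    // web process: publish a message instead of doing the slow IO inline
    const publisher = redis.createClient(process.env.REDIS_URL);

    function afterSave(record) {
      publisher.publish('record-saved', JSON.stringify(record));
    }

    // background process: subscribe and do the slow work there
    const subscriber = redis.createClient(process.env.REDIS_URL);
    subscriber.subscribe('record-saved');
    subscriber.on('message', (channel, message) => {
      const record = JSON.parse(message);
      // fan out the external API calls here, off the web dyno
    });

The trade-off with plain pub/sub is that messages published while the worker is down are simply lost, which is exactly where a real queue (RabbitMQ, Kue, etc.) earns its keep.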
P.S. You may find my RabbitMQ For Developers course of interest if you want to dig deeper into RMQ with Node: http://rabbitmq4devs.com

Related

How to run long running synchronous operation in nodejs

I am writing a payroll management web application in Node.js for my organisation. In many cases the application will involve CPU-intensive mathematical calculations to produce the figures, with many users trying to do this simultaneously.
If I plainly write the logic (setting aside the fact that I have already done my best, from an algorithm and data structure point of view, to contain the complexity), it will run synchronously, blocking the event loop and making requests and responses slow.
How do I resolve this scenario? What are the possible options for doing this asynchronously? I should also mention that this calculation work can be left to run in the background, and I can later choose to tell the user its status via a notification. I have searched for solutions all over the place and found some, but only in theory; I haven't tested them all by implementing them. Mentioning them below:
Clustering the Node server
Use worker threads
Use an alternate server and do some load balancing
Use a message queue and couple it with worker threads to do background tasks
Can someone offer some tried and battle-tested advice on this scenario, along with some associated tutorial links?
You might want to try web workers; they are easy to use and well documented.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
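That MDN page covers the browser Web Workers API; the built-in Node.js analogue is the worker_threads module (Node 10.5+). Below is a rough sketch only, with a hypothetical calculatePayroll() standing in for the heavy math.

    // main.js - keep the event loop free by pushing the math onto a worker thread
    const { Worker } = require('worker_threads');

    function runPayrollJob(input) {
      return new Promise((resolve, reject) => {
        const worker = new Worker('./payroll-worker.js', { workerData: input });
        worker.on('message', resolve); // the result arrives when the worker finishes
        worker.on('error', reject);
      });
    }

    // payroll-worker.js - runs on its own thread, so blocking here is fine
    const { parentPort, workerData } = require('worker_threads');

    function calculatePayroll(data) {
      // placeholder for the real CPU-intensive calculation
      return data;
    }

    parentPort.postMessage(calculatePayroll(workerData));

You could then notify the user when the promise resolves, which fits the "run it in the background and tell the user later" requirement.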

Options for getting a CPU intensive job off my web server?

I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture prevents me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both of these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?
It sounds like you want one process sitting somewhere that crunches the data and publishes it to a stream, and clients can then subscribe to the stream as and when they like. Redis handles streams nicely: you could process your data and push it into a Redis stream, then create a small Node service that subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or scale up your subscribed process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets, and it comes with its own pros and cons. But importantly, you'll have separated your data crunching from your result publishing, which isolates your issue to just the load balancing.
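As a rough sketch of that split (assumptions on my part: the ioredis client for the stream calls, Socket.io on the publishing side, and a made-up stream name), the two halves might look something like this:

    const Redis = require('ioredis');

    // number-crunching process: push each finished update onto a stream
    const producer = new Redis(process.env.REDIS_URL);

    async function publishUpdate(update) {
      await producer.xadd('updates', '*', 'payload', JSON.stringify(update));
    }

    // websocket process: block until a new entry arrives, then push it to clients
    const consumer = new Redis(process.env.REDIS_URL);

    async function forwardUpdates(io) {   // io = the Socket.io server instance
      let lastId = '$';                   // start with only new entries
      while (true) {
        const [[, entries]] = await consumer.xread(
          'BLOCK', 0, 'STREAMS', 'updates', lastId
        );
        for (const [id, fields] of entries) {
          lastId = id;
          io.emit('update', JSON.parse(fields[1])); // fields = ['payload', '<json>']
        }
      }
    }

Scaling then becomes a matter of running more of whichever half is the bottleneck, as described above.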
I have done pretty much the same thing to check resources on some of our servers.
I have a C# service collecting the information on each server that we manage and sending it to a queue (Amq).
From there, I have a STOMP client fetching data from Amq and emitting it to a websocket.
My main microservice fetches the data to save it into a db.
My visualisation web app is connected to the same websocket and fetches the data as it is sent, in order to display it.
The Amq step isn't mandatory at all; it's just something I had to work with (historical reasons).
I don't know what type of data you are working with, so I don't know whether my solution applies to you.
Don't hesitate to ask if I'm not clear or if you have any questions.
This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, reloading your whole dataset, which sounds like it's probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system; obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time, and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
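To make that concrete, a scheduled Lambda version of the update job might look roughly like the sketch below. This is only a guess at the shape of it: mineData, processData, saveToDynamo, and notifyWebServer are hypothetical stand-ins for your existing steps, and the trigger would be a CloudWatch/EventBridge schedule rather than the web server.

    // update-job/index.js - runs on Lambda on a schedule, completely off the web server
    const mineData = async () => [];            // placeholder: fetch the raw data
    const processData = raw => raw;             // placeholder: the CPU-heavy processing
    const saveToDynamo = async items => {};     // placeholder: DynamoDB writes
    const notifyWebServer = async items => {};  // placeholder: POST the result to the web tier

    exports.handler = async () => {
      const raw = await mineData();
      const processed = processData(raw);
      await saveToDynamo(processed);
      await notifyWebServer(processed); // the web tier then pushes it out over Socket.io as before
    };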
You might also consider looking into Amazon Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations (as these move load to DynamoDB from your server; see the sketch after this list)
Tuning keys, indexes and database access
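For the batch-operations point above, a sketch with the v2 AWS SDK's DocumentClient might look like the following; the table name is made up, and BatchWrite accepts at most 25 items per request.

    const AWS = require('aws-sdk');
    const docClient = new AWS.DynamoDB.DocumentClient();

    // write up to 25 items in one round trip instead of 25 separate put() calls
    async function saveBatch(items) {
      const result = await docClient.batchWrite({
        RequestItems: {
          Updates: items.slice(0, 25).map(item => ({ PutRequest: { Item: item } })),
        },
      }).promise();
      // a real version should also retry anything left in result.UnprocessedItems
      return result;
    }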

How would you do heavy processing in Meteor?

I have a Meteor app that is currently pulling data from Twitter, doing some manipulation, and then inserting the documents into a collection. Let's say I run this process forever but don't want to block the event loop; is there any solution for this?
Note: I know Node.js is single-threaded, and Meteor doesn't support packages such as cluster because it requires sticky sessions. The only solution I can think of is adding a server dedicated to processing the data coming in from Twitter and forwarding the requests to that server, but then I no longer have a case for using Meteor or Node.
Help would be appreciated.
The truth here is that while JavaScript/Node/Meteor might be capable of doing this processing, you really don't want to do it there. Let me give some observations and a personal example:
Your app is all about latency. If one of your requests takes a long time to complete because it is stuck in a tight loop, it affects every other client connected to your server at that moment. Everybody's latency will increase if this happens. (This is the case for making sure you have no tight loops in your code.)
JavaScript (the language) has very unsophisticated support for numeric values (you basically get a double). Types like float, long, int, and byte are all meant to let you run tight loops as fast as possible. If you can represent a value in the primitive type most closely matched to it, you will get a lot of improvement. (This is the case for extracting your data processing to a language suited for data processing.)
I was prototyping an app that had to do some aggregations over data. I ran it in Meteor using a setInterval callback and it took about 2 seconds to complete each time. On my own development machine I didn't really notice it (because Meteor apps hide latency issues very effectively). As soon as I deployed it and started looking at the logs, I realized that not a single user had latency below 4 seconds on any request. That is a horrible client experience.
I extracted the number crunching to a small Clojure app. All integration happens via records inserted into and read from the Mongo db, and the Clojure code has timed events firing every couple of seconds doing exactly the same calculations as were previously done in Meteor.
In Clojure those calculations now take less than 100 ms in total (compared to 2-4 seconds in Meteor).
To come back to your question: it doesn't sound like your application has a user interface. If it does, you would do well to keep that part in Meteor, because it's excellent for web UIs. But it's not the right technology for headless apps, which is what it sounds like you have.
You can call this.unblock() at the beginning of the method that does the heavy processing. Meteor will then start another fiber, carry on processing your method, and fire the callback when it is done. More info here: http://docs.meteor.com/#method_unblock
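A minimal sketch of that, with a made-up method name, collection, and transform:

    // server-side Meteor code
    const Tweets = new Mongo.Collection('tweets');  // hypothetical collection
    const transformTweet = tweet => tweet;          // placeholder for the real manipulation

    Meteor.methods({
      processTweets(batch) {
        // let other method calls from this client run while this one works
        this.unblock();
        const docs = batch.map(transformTweet);
        docs.forEach(doc => Tweets.insert(doc));
        return docs.length;
      },
    });

Note that this only stops the method from blocking other calls from the same connection; genuinely CPU-heavy work still ties up the server, which is the point the answer above makes about extracting it.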

nodejs job server (multiple purpose)

I'm fairly new and just getting to know Node.js (my background is as a PHP developer). I've seen some Node.js examples and the video on the Node.js website.
Currently I'm running a video site, and in the background a lot of tasks have to be executed. Currently this is done by cron jobs that call PHP scripts. The downside of this approach is that when another process starts while the previous one is still working, you get a high load on the servers, etc.
The jobs that need to be done on the server are the following:
Scrape feeds from websites and insert them into a MySQL database
Fetch data from websites (scraping) (upon request)
Generate data for reporting. These are mostly MySQL queries that need to be executed.
Tasks that need to be done in the future
Log video views (when a user visits a video page) (this will also be logged to MySQL)
Log visitors in general
Show ads based on searched video
I want to be able to call a URL so that a job is queued, and also be able to schedule jobs by time or have them run constantly.
I don't know if Node.js is the path to follow; that's why I'm asking here. What are the advantages of doing this in Node? The downsides?
What are the pros here with Node.js?
Thanks for the response!
While traditionally used for web/network tasks (web servers, IRC chat servers, etc.), Node.js shines when you give it any kind of IO bound (as opposed to CPU bound) task, since it uses completely asynchronous IO (that is, all IO happens outside of the main event loop). For example, Node can easily hold open many sockets, waiting for data on each, or stream data to and from files very efficiently.
It really sounds like you're just looking for a job queue; a popular one is Resque, and though it's written for Ruby, there are versions for PHP, Node.js, and more. There are also job queues built specifically for PHP; if you want to stick with PHP, a Google search for "PHP job queue" may take you far.
Now, one advantage to using Node.js is, again, its ability to handle a lot of IO very easily. Of course I'm just guessing, but based on your requirements, it could be a good tool for the job:
Scrape data/feeds from websites - mostly waiting on network IO
Insert data into MySQL - mostly waiting on network IO
Reporting - again, Node is good at MySQL queries, but probably not so good at analyzing data
Call a URL to schedule a job - Node's built-in HTTP handling and excellent web libraries make this a cinch
So it's entirely possible you may want to experiment with Node for these tasks. If you do, take a look at Resque for Node or another job system like Kue. It's also not very hard to build your own if you don't need something complicated; Redis is a good tool for this.
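As a very small sketch of what using Kue looks like (the queue name, payload, and scraper are invented for illustration):

    const kue = require('kue');
    const queue = kue.createQueue(); // backed by Redis

    // producer (e.g. in the web process): queue a scrape when a URL is hit
    queue.create('scrape-feed', { url: 'http://example.com/feed' })
      .attempts(3) // retry failed jobs a few times
      .save();

    // worker process: pull jobs off the queue, up to 5 at a time
    const scrapeAndInsert = async url => {}; // placeholder: scrape + MySQL insert

    queue.process('scrape-feed', 5, (job, done) => {
      scrapeAndInsert(job.data.url).then(() => done(), done);
    });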
There are a few reasons you might not want to use Node. If you're not familiar with JavaScript and evented and continuation-passing style programming, Node.js may have a bit of a learning curve, as you have to force yourself to stop thinking synchronously. Furthermore, if you do have a lot of heavy non-IO tasks in your program, such as analyzing data, Node will not excel as those calculations will block the main event loop and keep Node from handling callbacks, etc. for your asynchronous IO. Finally, if you have a lot of logic already in PHP or another language, it may be easier and/or quicker to find a solution in your language of choice.
I second the above answers. You don't necessarily need a full-service job queue, however: you can use flow-control modules like async to run tasks in parallel or series, as fast as they'll go or with controlled concurrency.
Node.js has many powerful scraping/parsing tools. This post mentions a few; I just heard about Trumpet recently; there are probably dozens of options. Node.js has a Stream module in core and Request makes HTTP interactions extremely easy.
For timed tasks, the simplest approach is a basic setTimeout/setInterval. Or you could write the scraper as a script that's called on cron. Or have it triggered on some event using the EventEmitter module in core. etc...
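Tying those pieces together, a timed scraper with controlled concurrency might look roughly like this; the feed URLs, parser, and MySQL insert are placeholders, and async and request are the modules mentioned above:

    const async = require('async');
    const request = require('request');

    const feeds = ['http://example.com/a.xml', 'http://example.com/b.xml'];
    const parseFeed = body => [];                   // placeholder for the real parser
    const insertIntoMysql = (rows, done) => done(); // placeholder for the real insert

    function scrapeAll() {
      // fetch at most 5 feeds at a time
      async.eachLimit(feeds, 5, (url, done) => {
        request(url, (err, res, body) => {
          if (err) return done(err);
          insertIntoMysql(parseFeed(body), done);
        });
      }, err => {
        if (err) console.error('scrape run failed:', err);
      });
    }

    setInterval(scrapeAll, 15 * 60 * 1000); // every 15 minutes, instead of cron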
An uncontrolled number of parallel Node.js jobs may bring down your server. You will need to control the processes or, better, put them in a queue per task.
For these needs, and if you know PHP, I suggest using Gearman and adding jobs as needed or via small PHP scripts.

Is there a compelling reason to use an AMQP based server over something like beanstalkd or redis?

I'm writing a piece of a project that's responsible for processing tasks outside of the main application-facing data server, which is written in JavaScript using Node.js. It needs to handle tasks which are scheduled in the future and potentially handle tasks that are "right now". The "right now" just means the next time a worker becomes available it will operate on that task, so that bit might not matter. The workers are all going to talk to external resources; an example job would be sending an email. We are a small shop and we don't have a ton of resources, so one thing I don't want to do is start mixing languages at this point in the process, and I already see that Node can do this for us pretty easily, so that's what we're going to go with unless I see a compelling reason not to before I start coding, which is soon.
All that said, I can't tell if there is a compelling reason to use an AMQP-based server, like OpenAMQ or RabbitMQ, over something like Kue or Beanstalkd with a Node client. So, here we go:
Is there a compelling reason to use an AMQP-based server over something like beanstalkd or redis with Kue? If yes, which AMQP-based server would fit best with the architecture I laid out? If no, which NoSQL solution (beanstalkd, redis/Kue) would be easiest to set up and fastest to deploy?
FWIW, I'm not accepting my own answer yet; I'm going to explain what I've decided and why. If I don't get any answers that appear better than what I've decided, I'll accept my own later.
I decided on Kue. It supports multiple workers running asynchronously, and with cluster it can take advantage of multicore systems. It is easily extended to provide security. It's backed by Redis, which is used all over for exactly this kind of thing, so I know I'm not backing my job-processing server with unproven software (that's not to say that any of the others are unproven).
The most compelling reason I picked Kue is that it provides a JSON API, so client applications (the first client is going to be a web-based application, but we're planning on making smartphone apps as well) can add jobs easily without going through the main application-facing Node instance, and I can stay totally out of the way of the rest of my team as I write this. I don't need a route, I don't need anything; it's all provided for me, so I don't need to write anything to support this. It has another advantage: with an extension to provide login/password security, only authorized clients can add jobs, so I don't have to expose my Redis server to client applications directly. It also has a built-in web console, and the API allows the client to pull back lists of jobs associated with a given user very easily, so we can show the user all of their scheduled tasks in a nifty calendar view with zero effort on my part.
The other compelling reason is the lack of a steep learning curve in getting Redis and Kue going. I've set up Redis before, and Kue is simple and effective.
Yes, I'm a lazy developer, but I'm the good kind of lazy developer.
UPDATE:
I have it working and doing jobs, and the throughput is amazing. I split the task-marshaling logic out into its own Node instance; basically all I have to do is deploy my repo to a new machine and run node task-server.js to scale out my workers. I may need to add some more job-searching calls to Kue because of how I implemented a few things, but that will be easy.
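A stripped-down version of this kind of setup looks roughly like the following; this is a simplified sketch rather than the real file, and the job type and sendEmail are placeholders.

    // task-server.js (simplified sketch)
    const kue = require('kue');
    const queue = kue.createQueue();

    // Kue's JSON API and web console: clients can POST /job to add work
    // without going through the main application-facing node instance
    kue.app.listen(3000);

    // worker side: deploy this same repo to another box and run `node task-server.js`
    // to scale out
    const sendEmail = async data => {}; // placeholder for the external-resource job

    queue.process('send-email', (job, done) => {
      sendEmail(job.data).then(() => done(), done);
    });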
