I'm writing a simple image upload site as a learning project.
It's written in Node.js with MongoDB, and it's deployed on the Heroku Cedar stack.
I'd like to implement a Node script that runs, say, once an hour, applies the Reddit ranking algorithm to the images, and stores the score against each image in MongoDB.
How can I achieve this, bearing in mind that I'm on Heroku and have file-system limitations? Given the Cedar architecture, it seems best to hand off to a separate worker, but if there's a faster/simpler/easier approach I'd be happy to hear it. The Heroku Dev Center article on workers/background jobs unfortunately doesn't yet list any tutorials for such a system.
My previous experience of background processing on Heroku was with Rails: the scheduled-tasks add-on plus delayed_job, and it's very straightforward.
An extremely simple approach is to use setInterval or node-cron, and to spawn or fork a child process for the periodic work so it doesn't block your web process.
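For illustration, a minimal sketch with node-cron and a forked child might look like this (the scoring script name is a placeholder):

```js
// clock.js: runs alongside (or inside) the web process
const cron = require('node-cron');
const { fork } = require('child_process');

// At the top of every hour, fork a child so the scoring work
// doesn't block the event loop serving HTTP requests.
cron.schedule('0 * * * *', () => {
  const child = fork('./score-images.js'); // hypothetical scoring script
  child.on('exit', (code) => console.log('scoring exited with code', code));
});
```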
I'm building my app backend as a node/express API to be deployed on Heroku.
I am new to implementing cron jobs, and I found the npm library node-cron, which seems very straightforward.
Is it as simple as setting up the cron job in my app's runtime code? I know it won't run while a Heroku free dyno is in "sleep mode" (based on other Stack Overflow answers), but I plan to use paid dynos in production, so that's not an issue.
My main concern is when I scale up on Heroku and run multiple dynos: will this cause weird interactions? Will every instance of my app on a separate dyno run its own crons independently, causing duplication of work?
I know Heroku provides a free Scheduler add-on that spins up one-off dynos for this, but if the duplication above won't happen, the Scheduler add-on seems like unneeded overhead in my case.
Note: my cron will be very simple, just cleaning up old database records. I didn't want to do it in the database layer, to keep things simple, since scheduling jobs in Postgres doesn't seem particularly easy.
Any insights will be much appreciated. Thank you.
Everything you suspect above is correct, as per my past experience with the same situation. The npm package node-cron works fine only if you have a single dyno; otherwise the job will be triggered once per dyno. If you want the cron to execute exactly once, without taking any risk (no matter how many dynos you have), I suggest you use the Heroku Scheduler add-on.
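That said, if you do keep node-cron inside the app, one common workaround is to let only one dyno schedule the job, using the DYNO environment variable that Heroku sets on each dyno (a sketch; the cleanup function is a placeholder):

```js
const cron = require('node-cron');

// Heroku sets DYNO to e.g. "web.1", "web.2" on each dyno,
// so only the first web dyno schedules the cleanup job.
if (process.env.DYNO === 'web.1') {
  cron.schedule('0 3 * * *', () => {
    // cleanUpOldRecords is a hypothetical function doing the DB cleanup
    cleanUpOldRecords().catch(console.error);
  });
}
```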
My node application currently has two main modules:
a scraper module
an express server
The former is a very resource-intensive task that runs indefinitely in a loop. It scrapes information from more than 100 URLs, crunches the data, and puts it into a MongoDB database (using Mongoose). This process runs over and over and over. :P
The latter, my Express server, responds to HTTP/socket GET requests and returns to the requesting client the crunched data that the scraper wrote to the DB.
I'd like to optimize the performance of my server so that Express requests and responses get prioritized over the resource-intensive task(s). A client should be able to get the requested data ASAP, without the scraper eating up all of my server resources.
I thought about putting the resource-intensive task or the Express server into its own thread, but then I stumbled upon cluster and child processes, and now I'm totally confused about which approach is right for my situation.
One advantage I have is that there is a clear separation between the writing part of my application and the reading part: the scraper writes to the DB, and Express only reads from it (no POST/PUT/DELETE calls are exposed). So I guess I won't run into problems with different threads trying to write to the same DB.
Any good suggestions? Thanks in advance!
Resources like the CPU and memory that processes require are managed by the operating system, so you shouldn't waste your time writing that logic into your source code. Look at the problem from outside your source files: once they run, they are processes, and processes are managed, as I said, by the OS.
First, I would split this into two separate commands:
One being the scraper module (e.g. npm run scraper, which runs something like node scraper.js).
The other being your Express server (e.g. npm start, which runs something like node server.js).
This approach lets you configure each process within your OS or your cluster.
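For illustration, the split might look like this in package.json (file names are just examples):

```json
{
  "scripts": {
    "start": "node server.js",
    "scraper": "node scraper.js"
  }
}
```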
A quick approach for this is to use Docker: two Docker containers running your processes, each with a CPU usage limit. This is fairly easy to do, doesn't require you to set up a new server, and at the same time provides the isolation level you need to scale to many servers in the future.
Steps to do this:
Learn a little about Docker and Docker Compose, and install them on your server.
Build a Docker image for your application (you can upload it to the free private repository that Docker Hub gives you).
Write a Docker Compose file for your two services using that image, with the CPU configuration you need (you can easily set both CPU and memory limits).
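A minimal docker-compose.yml along these lines could work (a sketch; the image name and limits are placeholders, and cpus/mem_limit are supported in Compose file format 2.2+):

```yaml
version: "2.4"
services:
  web:
    image: myuser/myapp        # hypothetical image pushed to Docker Hub
    command: npm start
    ports:
      - "3000:3000"
    cpus: 0.5                  # give the API server half a core
    mem_limit: 512m
  scraper:
    image: myuser/myapp
    command: npm run scraper
    cpus: 0.25                 # keep the scraper from starving the server
    mem_limit: 512m
```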
An alternative would be to run the two commands (npm run scraper and npm start) under a tool like cpulimit, nice/renice, or ionice, or to set up namespaces and cgroups manually (though Docker does that for you).
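For example (sketches; exact flags depend on your cpulimit version):

```sh
nice -n 19 node scraper.js        # lowest CPU scheduling priority
cpulimit -l 25 node scraper.js    # cap the scraper at ~25% of one core
```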
PS: I would also recommend rethinking your backend process. It may be better to run it every 12 hours or so, instead of all the time, and to launch it from cron instead of an endless loop.
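A crontab entry along these lines (paths are placeholders) would run it twice a day:

```
0 */12 * * * cd /path/to/app && node scraper.js >> scraper.log 2>&1
```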
I would be very grateful if someone could answer my question. I'm new to Node.js and I'm building an app in Meteor. Everything is fine so far (Mongo, etc.), but once I finish the main CRUD work I'll need to parse some XML APIs and scrape some websites, all as backend tasks run via cron or similar. My question: I don't see any examples of such backends in Meteor, only people using npm libraries. Is that the only path to follow? Also, Meteor writes Mongo _ids as strings, while PHP writes them as ObjectIds. If I use an npm library, will it write ObjectIds, and will that do harm? Overall: for a parsing backend in Meteor, are npm packages the right path?
I'll quote @Dan Dascalescu's excellent answer to another question:
There are several packages to run background tasks in Meteor. From the simplest to the most involved:
- cron, easycron: super basic cron packages.
- percolatestudio:synced-cron: cron jobs distributed across multiple app servers.
- queue-async: minimalistic async (see below), written by D3 author Mike Bostock.
- peerlibrary:async: wrapper for the popular async package for Node.js and the browser. Offers over 20 functions (map, reduce, every, filter, etc.) and supports powerful control flow (serial, parallel, waterfall, etc.); see also this example.
- artwells:queue: priorities, scheduling, logging, re-queuing. Queue backed by MongoDB.
- vsivsi:jobCollection: schedule persistent jobs to be run anywhere (servers, clients). I used this to power the RSS feed aggregation at a financial news aggregator startup (StockBase.com).
- differential:workers: spawn headless worker Meteor processes to work on async jobs.

Packages I would recommend caution with:

- PowerQueue: queue async tasks, throttle resource usage, retry failed tasks. Supports sub-queues. No scheduling. No tests, but a nifty demo. Not suitable for running for a long while, due to using recursive calls.
- Kue: the priority job queue for Node.js, backed by Redis. Not updated for Meteor 0.9+.
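As a taste of the multi-server-safe option above, a percolatestudio:synced-cron job looks roughly like this (the job name and body are placeholders); the package coordinates through a MongoDB collection, so only one app server runs each scheduled job:

```js
SyncedCron.add({
  name: 'Parse XML feeds',               // hypothetical job name
  schedule(parser) {
    return parser.text('every 2 hours'); // later.js text schedule
  },
  job() {
    // placeholder: parse XML APIs / scrape sites here
  },
});

Meteor.startup(() => SyncedCron.start());
```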
Our Node app has grown quite big, and one job takes a long time to execute. We run this job via a cron job that calls a URL. Heroku has a problem with this: because the job takes more than 30 seconds to finish, we get a timeout, after which it immediately tries to execute the job again, and again, until our memory quota is at about 300% and the app crashes.
Now I want to fix this. Locally we have no problems running this script at all. It takes about a minute (for now; in the future, with more users, it may take longer) and memory stays stable.
Running this script in the background should fix the problem, according to https://devcenter.heroku.com/articles/request-timeout#debugging-request-timeouts
Over at https://devcenter.heroku.com/articles/asynchronous-web-worker-model-using-rabbitmq-in-node#getting-started I read about jackrabbit, but it seems to be meant for systems like RabbitMQ: https://github.com/hunterloftis/jackrabbit
So my question: does anyone have experience with background tasks in Node? Can and should I use jackrabbit for my background tasks, or are there better solutions? My background task is just a very complex Express.js task that takes some time to execute.
I'm the Node.js platform owner at Heroku (and I actually wrote the web worker article you referenced).
Your use case sounds like it may fit the scheduler very well:
https://devcenter.heroku.com/articles/scheduler
It's a great replacement for cron-type jobs.
I'm planning to host an express app on Heroku (I'm already experimenting with a single dyno).
I want to use node-cron for maintenance tasks (some MongoDB updates). The question is: what's the simplest way to make sure the maintenance only runs once? With multiple dynos, every dyno would try to run the maintenance at the same time.
My current approach uses MongoDB's atomic upserts as a sort of semaphore (every dyno tries to set the flag for the current maintenance run, and only the one that succeeds does the work), but that's kind of ugly.
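For reference, that upsert-as-semaphore idea, sketched with the official mongodb driver (collection, field, and function names are made up):

```js
// Each dyno tries to upsert a lock document for today's run;
// only the dyno whose upsert actually inserts does the maintenance.
async function runOncePerDay(db) {
  const today = new Date().toISOString().slice(0, 10); // e.g. "2024-06-01"
  const result = await db.collection('maintenanceLocks').updateOne(
    { _id: `cleanup-${today}` },                  // one lock per day
    { $setOnInsert: { startedAt: new Date() } },
    { upsert: true }
  );
  if (result.upsertedCount === 1) {
    await runMaintenance(); // hypothetical maintenance function
  }
}
```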
I'd prefer not to have a separate worker instance, since it's really just a simple task that needs to run once a day.
I think that's what the Heroku Scheduler is good for (https://devcenter.heroku.com/articles/scheduler). If your update doesn't take too long to execute, it's the way to go. You write a JS script (e.g. schedule.js) that knows how to update your MongoDB, put it in your project root, and schedule it with the Heroku Scheduler (it comes with a trivial frontend), which invokes it (node schedule.js) at the desired time.
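A minimal schedule.js might look something like this (a sketch assuming a MONGODB_URI config var and the official mongodb driver; the collection and query are placeholders):

```js
// schedule.js: invoked by the Heroku Scheduler as a one-off dyno
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect(process.env.MONGODB_URI);
  try {
    const db = client.db();
    // Placeholder maintenance: delete records older than 30 days
    const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    const { deletedCount } = await db
      .collection('records')
      .deleteMany({ createdAt: { $lt: cutoff } });
    console.log(`Deleted ${deletedCount} old records`);
  } finally {
    await client.close(); // let the one-off dyno exit cleanly
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```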