Cron job on NodeJS server runs multiple times simultaneously due to load balancers

Cron job on NodeJS server runs multiple times simultaneously due to load balancers - node.js

I have cron job services on my nodeJS server (part of a React app) that I deploy using Convox to AWS, which has 4 load balancer servers. This means my cron job runs 4 times simultaneously on each server, when I only want it to run once. How can I stop this from happening and have my cron jobs run only once? As far as I know, there is no reliable way to lock my cron to a specific instance, since instances are volatile and may be deleted/recreated as needed.
The cron job services conduct tasks such as querying and updating our database, sending out emails and texts to users, and conducting external API calls. The services are run using the cron npm package, upon the server starting (after server.listen).

Can you expose these tasks via url? That way you can have an external cron service that requests each job via url against the ELB.
See https://cron-job.org/en/
Another advantage of this approach is you get error reports if a url does not return a 200 status. This could simplify error tracking across all jobs.
Also this provides better redudency and load balancing, as opposed to having a single instance where you run all jobs.

I had the same issue. Se my solution here. Two emails was sent because of two instances on AWS. I lock each sending by unique random number.
My example based on MongoDB.
https://forums.meteor.com/t/help-email-sends-emails-twice/50624

Related

How to synchronize background tasks across multiple FastAPI processes?

I have a FastAPI application which sends emails when the users register on the website.
The app is implemented in such a way that there is a cron task (scheduled every minute) which checks the database and if a flag is set, it will try to send an email.
Deployment: Two instances of the FastAPI application are running connected to a locally hosted MYSQL database
Problem: Since, there are two instances of the Application running, each one will trigger a cron job every minute and this results in sending an email twice.
How to synchronise between multiple processes? Please help me with this issue. Thanks

Google Cloud Platform : Running several hours scraping script

I have a NodeJS script, that scrapes URLs everyday.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. And because it was previously done in cron, I naturally had a look at how to have a cronjob running on Google Cloud. However, according to the docs, the script has to be exposed as an API and http calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O question, which recommends to use a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server monitoring job described there.
Has anyone experience in doing this on GCP ?
N.B : To clarify, I want to to avoid deploying it on a VPS.
Edit :
I reached out to google, here is their reply :
Thank you for your patience. Currently, it is not possible to run cron
script for 6 to 7 hours in a row since the current limitation for cron
in App Engine is 60 minutes per HTTP
request.
If it is possible for your use case, you can spread the 7 hours to
recurrring tasks, for example, every 10 minutes or 1 hour. A cron job
request is subject to the same limits as those for push task
queues. Free
applications can have up to 20 scheduled tasks. You may refer to the
documentation
for cron schedule format.
Also, it is possible to still use Postgres and Redis with this.
However, kindly take note that Postgres is still in beta.
As I a can't spread the task, I had to keep on managing a dokku VPS for this.

I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE Cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet, and can simple publish all chunks of works to the Cloud Task queue, and consider itself finished when completed.
Follow that up with a Task Queue, and use the queue rate limiting configuration option as the method of limiting the overall request rate to the endpoint you're scraping from. If you need less than 1 qps add a sleep statement in your code directly. If you're really queueing millions or billions of jobs follow their advice of having one queue spawn to another.
Large-scale/batch task enqueues
When a large number of tasks, for
example millions or billions, need to be added, a double-injection
pattern can be useful. Instead of creating tasks from a single job,
use an injector queue. Each task added to the injector queue fans out
and adds 100 tasks to the desired queue or queue group. The injector
queue can be sped up over time, for example start at 5 TPS, then
increase by 50% every 5 minutes.
That should be pretty hands off, and only require you to think through the process of how the cron job pulls the next desired sites and pages, and how small it should break down the work loads into.

I'm also working on this task. I need to crawl website and have the same problem.
Instead of running the main crawler task on the VM, I move the task to Google Cloud Functions. The task is consist of add get the target url, scrape the web, and save the result to Datastore, then return the result to caller.
This is how it works, I have a long run application that call be called a master. The master know what URL we are going to access in to. But instead of access the target website by itself, it sends the url to a crawler function in GCF. Then the crawling tasked is done and send result back to the master. In this case, the master only request and get a small amount of data and never touch the target website, let the rest to GCF. You can off load your master and crawl the website in parallel via GCF. Or you can use other method to trigger GCF instead of HTTP request too.

No Mongo Query gets result when cron is running in background

I have been NodeJS as server side and MongoDB as our database. It really works great together.
Now I have added node-schedule library into our system , to call a function like a cron-job.
The process takes around hours to complete.
My issue is whenever cron is running , all users to my site gets No response fro server i.e database gets locked.
Stuck on the issue from a week , needs good solution to run cron , without affecting users using the site.

Typically you will want to write a worker and run the worker in a different entry point that is not part of your server. There are multiple ways you could achieve this.
1) Write a worker on another server to interact with your database
2) Write a service worker on another server that interacts with your api
3) Use the same server but setup a cronjob to execute the file that does the work at a specified time.
But you should not do this from the same entry point that your server is running on. You need a different execution file.
There is one thing you can do to run this where it will not bog down your server and that would be for your trigger for node-schedule to run a child process. https://nodejs.org/api/child_process.html

How to fail over node.js timer on amazon load balancer?

I have setup 2 instance under aws load balancer. I have deployed node.js web services + mongodb in both instance. load balancer works fine with web services.
But, Problem is I have one timer service (node.js service only). the behavior of this timer is updating my mongodb based on some calculation.
My problem is, I must need to run this timer service (timer.js) at only one aws instance (out of 2) at same time. and expected that if one aws instance goes down then timer service at other instance will come up.
i know elb not providing this kind of facility.Can any one please help me to make it done ?
Condition : At a time only one timer service must be run with amazon load balancer.
Thanks.

You would have to implement this yourself using a locking algorithm using a shared data store that supports atomic operations
Alternatively, consider starting a "timer" server in an Auto Scale Group of Min:1, Max: 1 so Amazon keeps it running. This instance can be a t2.micro which is very cheap. It can either run the job itself, or just make an http request to your load balancer to run the job at the desired internal. If you so that, only one of your servers will run each job

Wouldn't it make more sense to handle this like any other "service" that needs to keep running?
upstart service
running node.js server using upstart causes 'terminated with status 127' on 'ubuntu 10.04'
This guy had a bad path in his file but his upstart script looks okay
monit
Node.js (sudo) and monit

How to run cronjobs on local files on cloudControl PaaS?

On cloudControl, I can either run a local task via a worker or I can run a cronjob.
What if I want to perform a local task on a regular basis (I don't want to call a publicly accessible website).
I see possible solutions:
According to the documentation,
"cronjobs on cloudControl are periodical calls to a URL you specify."
So calling the file locally is not possible(?). So I'd have to create a page I can call via URL. And I have to perform checks, if the client is on localhost (=the server) -- I would like to avoid this way.
I make the worker sleep() for the desired amount of time and then make it re-run.
// do some arbitrary action
Foo::doSomeAction();
// e.g. sleep 1 day
sleep(86400);
// restart worker
exit(2);
Which one is recommended?
(Or: Can I simply call a local file via cron?)

The first option is not possible, because the url request is made from a seperate webservice.
You could either use HTTP authentication in the cron task, but the worker solution is also completely valid.
Just keep in mind that the worker can get migrated to a different server (in case of software updates or hardware failure), so do SomeAction() may get executed more often than once per day from time to time.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string