App Engine Flexible cron is terminated after 120 seconds - node.js

My App Engine Flexible cron sometimes takes more than 120 seconds, and whenever it does, App Engine throws a 502 error. It doesn't terminate my Node.js task; it only terminates the HTTP request started by the App Engine cron job.
There is also a value of 240 seconds, and I don't understand where it comes from. I guess this is a retry request. It would be helpful if anyone could clarify this as well.
As per the App Engine documentation, a cron job can run for an hour. Is this true for HTTP requests started by the cron job as well?
To be clear, I want my cron to run for more than 120 seconds and the HTTP request to stay active for up to an hour.

Even though you have switched to Kubernetes Engine, I would like to take this chance to clarify the purpose of cron jobs here.
As it is stated in the official documentation, cron jobs are used to perform jobs at regular time intervals. They involve invoking a URL through an HTTP request and run for up to 60 minutes while respecting the request's own limitations.
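For reference, these schedules live in a cron.yaml file deployed alongside the app; a minimal sketch (the handler URL and description are illustrative, not from the question):

# cron.yaml: App Engine invokes the URL on the given schedule
cron:
- description: daily summary job
  url: /tasks/summary
  schedule: every 24 hours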
Some good uses for cron jobs: sending report emails on a daily basis, updating cached data at regular intervals, or updating summary information every hour. When a task involves fetching external information, especially when the number of operations involved may exceed the time an HTTP connection stays open, or when different kinds of data are coming from the external application, I would not consider it a good use of cron jobs.
If you are using Kubernetes now, and consider it to be more useful for the tasks you need to perform, go ahead and continue with it.

Related

Node/Bull/Throng background jobs. Super slow, and many processes are being used?

I've changed a long-running process on a Node/Express route to use Bull/Redis.
I've pretty much copied this tutorial from the Heroku docs.
The gist of that tutorial is: the Express route schedules the job, immediately returns 200 to the client, and the browser long-polls for the job status (a ping route on Express). When the client gets a completed status, it displays it in the UI. The worker is a separate file and is run with an additional yarn run worker.js.
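For context, the enqueue-then-poll pattern from that tutorial looks roughly like this (a sketch; the queue name and routes are mine, not from the tutorial):

const Queue = require('bull');
const express = require('express');

const workQueue = new Queue('work', process.env.REDIS_URL);
const app = express();

// Schedule the job and return 200 immediately; the browser then polls /job/:id
app.post('/job', async (req, res) => {
  const job = await workQueue.add({ /* payload */ });
  res.status(200).json({ id: job.id });
});

// The "ping" route the client long-polls for job status
app.get('/job/:id', async (req, res) => {
  const job = await workQueue.getJob(req.params.id);
  if (!job) return res.sendStatus(404);
  res.json({ state: await job.getState(), result: job.returnvalue });
});

app.listen(process.env.PORT || 3000);
// The actual processing lives in worker.js, started as a separate process.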
Notice the end where it recommends using Throng for clustering your workers.
I'm using this Bull Dashboard to monitor jobs/queues. This dashboard shows the available Workers, and their status (Idle when not running/ Not in Idle when running).
I've got the MVP working, but the operation is super slow. The average time to complete is 1 minute 30 seconds, whereas before adding Bull it took only seconds to complete.
Another strange thing: it seems to take at least 30 seconds for a worker's status to change from Idle to not Idle. It looks like a lot of the latency is spent waiting for the worker.
Given that the new operation is a separate file (worker.js) and Throng enables clustering, I was expecting this operation to be super fast, but it is just the opposite.
Does anyone have any experience with this? Or pointers to help figure out what is causing this to be so slow?

Cron job on NodeJS server runs multiple times simultaneously due to load balancers

I have cron job services on my Node.js server (part of a React app) that I deploy using Convox to AWS, which runs 4 load-balanced servers. This means my cron job runs 4 times simultaneously, once on each server, when I only want it to run once. How can I stop this from happening and have my cron jobs run only once? As far as I know, there is no reliable way to lock my cron to a specific instance, since instances are volatile and may be deleted/recreated as needed.
The cron job services perform tasks such as querying and updating our database, sending out emails and texts to users, and making external API calls. The services are run using the cron npm package and start when the server starts (after server.listen).
Can you expose these tasks via URL? That way you can have an external cron service that requests each job via URL against the ELB.
See https://cron-job.org/en/
Another advantage of this approach is you get error reports if a url does not return a 200 status. This could simplify error tracking across all jobs.
Also, this provides better redundancy and load balancing, as opposed to having a single instance where you run all jobs.
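A sketch of what exposing a job as a URL could look like (the route and runDailyReport are hypothetical names, standing in for your existing cron task body):

const express = require('express');
const app = express();

// The external cron service hits this URL; the ELB routes each request to exactly one instance
app.post('/jobs/daily-report', async (req, res) => {
  try {
    await runDailyReport(); // hypothetical: your existing cron task body
    res.sendStatus(200);    // non-200 responses show up in the cron service's error reports
  } catch (err) {
    console.error(err);
    res.sendStatus(500);
  }
});

app.listen(process.env.PORT || 3000);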
I had the same issue. See my solution here. Two emails were sent because of two instances on AWS. I lock each send with a unique random number.
My example is based on MongoDB.
https://forums.meteor.com/t/help-email-sends-emails-twice/50624
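A rough sketch of that random-number locking idea, assuming the official mongodb driver (collection and field names are illustrative):

const { MongoClient } = require('mongodb');

// Each instance writes a random token for this job run; after a short wait,
// only the instance whose token survived the concurrent writes does the work.
async function shouldThisInstanceRun(db, jobName) {
  const myToken = Math.random().toString(36).slice(2);
  await db.collection('cronLocks').updateOne(
    { _id: jobName },
    { $set: { token: myToken } },
    { upsert: true }
  );
  await new Promise((resolve) => setTimeout(resolve, 2000)); // let all instances write
  const doc = await db.collection('cronLocks').findOne({ _id: jobName });
  return doc.token === myToken;
}

// Usage inside the cron callback (connection string is a placeholder):
// const db = (await MongoClient.connect(process.env.MONGO_URL)).db();
// if (await shouldThisInstanceRun(db, 'daily-email')) { /* send emails */ }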

run tasks on google app engine that runs my app server

I am using Node.js on Google App Engine with an endpoint for a cron job. When the REST endpoint is called, I want to proceed with my cron job after returning the response to the caller. The cron task will continue for about an hour. Will GAE terminate the task if it runs for an hour or more? I suppose GAE should not kill my Node.js server process, because that would stop my application. I want to know if there is any possibility of the task ending prematurely due to some restriction on GAE.
It depends on which type of scaling you have selected: https://cloud.google.com/appengine/docs/standard/java/an-overview-of-app-engine
Requests under basic and manual scaling can run indefinitely; automatic scaling has a 60-second deadline for HTTP requests and 10 minutes for task queue requests. If you're not sure which type of scaling you have, you probably have automatic.
You could set up a micro-service with basic scaling specifically for tasks like this, so that your primary service can stay on automatic scaling.
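For example, the worker service's app.yaml might look like this (a sketch; the service name, runtime, and limits are placeholders):

# app.yaml for a separate worker service on basic scaling
service: worker
runtime: nodejs20
basic_scaling:
  max_instances: 1
  idle_timeout: 15m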
You could also split your cron task into several tasks and daisy-chain them using push queues (i.e. your cron task launches, does some work, then launches task2 and dies; task2 launches, does some work, launches task3 and dies; etc.).
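With the Cloud Tasks client (the modern push-queue API), the hand-off at the end of each task could look roughly like this (a sketch; project, location, queue, and handler names are placeholders):

const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

// Called at the end of taskN's handler to launch taskN+1 before returning
async function enqueueNextStep(payload) {
  const parent = client.queuePath('my-project', 'us-central1', 'cron-steps');
  await client.createTask({
    parent,
    task: {
      appEngineHttpRequest: {
        httpMethod: 'POST',
        relativeUri: '/tasks/next-step', // the handler that continues the work
        body: Buffer.from(JSON.stringify(payload)).toString('base64'),
      },
    },
  });
}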

Google Cloud Platform : Running several hours scraping script

I have a NodeJS script that scrapes URLs every day.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run via cron, I naturally looked at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit :
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished when done.
Follow that up with a task queue, and use the queue's rate limiting configuration option as the method of limiting the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement directly in your code. If you're really queueing millions or billions of jobs, follow their advice of having one queue spawn another:
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example starting at 5 TPS and then increasing by 50% every 5 minutes.
That should be pretty hands-off, and only requires you to think through how the cron job pulls the next desired sites and pages, and how small the chunks of work should be.
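The rate limiting mentioned above is set in the queue configuration; a sketch (the queue name and numbers are illustrative):

# queue.yaml: throttle dispatches so the scrape stays 'kind' to the target
queue:
- name: scrape-queue
  rate: 1/s                    # at most one dispatch per second
  bucket_size: 1
  max_concurrent_requests: 1   # never hit the target site in parallel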
I'm also working on this kind of task. I need to crawl websites and have the same problem.
Instead of running the main crawler task on the VM, I moved the task to Google Cloud Functions. The task consists of getting the target URL, scraping the page, and saving the result to Datastore, then returning the result to the caller.
This is how it works: I have a long-running application that can be called a master. The master knows which URL we are going to access, but instead of accessing the target website itself, it sends the URL to a crawler function in GCF. The crawling task is done there, and the result is sent back to the master. In this case, the master only requests and receives a small amount of data and never touches the target website, leaving the rest to GCF. You can offload your master and crawl websites in parallel via GCF. Or you can use another method to trigger GCF instead of an HTTP request.
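A rough sketch of such a crawler function (HTTP-triggered; axios and all names here are my assumptions, not the poster's code):

const axios = require('axios');

// Receives a URL from the master, fetches the page, returns a small result
exports.crawl = async (req, res) => {
  const { url } = req.body;
  try {
    const page = await axios.get(url, { timeout: 30000 });
    // Parse page.data and/or save to Datastore here; return only a summary
    res.status(200).json({ url, length: page.data.length });
  } catch (err) {
    res.status(500).json({ url, error: err.message });
  }
};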

How to handle cron execution time and user wait

I have scheduled a cron job that is executed every minute.
This cron job generates a PDF file using a remote web service. This operation alone takes a few seconds (around 3 seconds), which means the cron job can generate approximately 20 PDF files per minute.
If a visitor requests 60 documents, it will take the server 3 minutes to generate all the PDF files.
Executing parallel cron jobs for this task is not possible, as each file request must be handled individually for database relationship and integrity reasons. Basically, each file can only be handled one at a time.
Therefore, is there any logic I could apply in order to:
execute multiple occurrences of the same cron job to speed up the process and decrease the user's waiting time,
while making sure each file creation is handled by one cron job only, so that a specific creation process is not picked up by another cron job doing the same task?
Thank you
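One pattern that would satisfy both points (a sketch, not from the thread; MongoDB and all names are assumptions): let every cron occurrence atomically claim one pending document at a time, so concurrent occurrences speed things up without ever touching the same file.

const { MongoClient } = require('mongodb');

// findOneAndUpdate is atomic, so two concurrent cron runs can never claim
// the same pending document (driver v5 shape: the document is in res.value)
async function claimNextPdfRequest(db) {
  const res = await db.collection('pdfRequests').findOneAndUpdate(
    { status: 'pending' },
    { $set: { status: 'processing', claimedAt: new Date() } },
    { sort: { createdAt: 1 }, returnDocument: 'after' }
  );
  return res.value; // null when nothing is pending
}

// Each cron occurrence loops: claim, generate the PDF, mark it done, repeat.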
