I have a cron job that processes hundreds of thousands (lakhs) of records with API calls and runs very frequently. The issue is that sometimes the API calls time out and the cron job gets stuck in a running state. When the next trigger time comes, a new cron job is started on a different thread. This results in duplicate data and multiple instances of the same job running on multiple threads. How do I stop this?
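One common pattern for this, shown below as a minimal sketch assuming a Node.js scheduler such as node-cron (the question doesn't say which scheduler or runtime is in use, and processRecords is a hypothetical stand-in for the API-calling work), is to guard each run with a flag and skip a trigger while the previous run is still in flight:

const cron = require('node-cron');

let isRunning = false; // simple in-process overlap guard

cron.schedule('*/5 * * * *', async () => {
  if (isRunning) {
    console.log('Previous run still in progress, skipping this trigger');
    return;
  }
  isRunning = true;
  try {
    await processRecords(); // hypothetical: the batch of API calls
  } finally {
    isRunning = false; // always release, even if an API call times out or throws
  }
});

If the job can run on more than one process or machine, an in-memory flag is not enough; a shared lock (for example a database row or a Redis key with an expiry) would be needed instead, and the API calls themselves should have explicit timeouts so a run cannot hang forever.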
I have a timer-triggered Function App ("version": "2.0") in Azure which runs every 5 minutes. Cron expression used: 0 */5 * * * *
It works as expected, but sometimes it suddenly stops running. If I disable the function app and re-enable it, it starts working again.
As you can see in the screenshot below, it stopped working from 2021-04-14 16:54:59.998 to 2021-04-14 20:55:12.139.
Any help will be appreciated.
There could be different reasons for this issue. I suggest reviewing the documents linked below to troubleshoot it and see if you can find the root cause.
A timer-triggered function app uses TimerTriggerAttribute. This attribute includes a singleton lock feature which ensures that only a single instance of the function is running at any given time. If a run takes longer than the scheduled interval, the new incoming run waits for the older one to finish and then uses the same instance. If you are using the same storage account across different timer-triggered functions, this could be one of the reasons, as mentioned in the links below.
Another possible reason is a restart; I suggest checking the 'Web App Restart detection' section.
https://github.com/Azure/azure-functions-host/wiki/Investigating-and-reporting-issues-with-timer-triggered-functions-not-firing
https://github.com/Azure/azure-webjobs-sdk-extensions/wiki/TimerTrigger#troubleshooting
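For reference, this is roughly what such a timer-triggered function looks like on the Node.js runtime (a sketch only; the question doesn't show the code or state the language). The singleton lock mentioned above is taken per function via blob leases in the storage account configured by AzureWebJobsStorage:

// index.js, bound by a function.json entry of type "timerTrigger" with
// "schedule": "0 */5 * * * *" and "name": "myTimer"
module.exports = async function (context, myTimer) {
  // isPastDue is set when the runtime fires a missed occurrence, e.g. after
  // a restart or after another run held the singleton lock past the schedule.
  if (myTimer.isPastDue) {
    context.log('Timer is running late');
  }
  context.log('Timer trigger fired at', new Date().toISOString());
};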
I am using PM2 in cluster mode and have 2 instances of my Node.js application running. I have some long-running cron jobs (about 30 seconds) that I am trying to run. I am placing an if statement before the execution of the cron jobs to ensure that they only run on the first process, via:
if (process.env.NODE_APP_INSTANCE === '0') { // process.env values are strings, so compare against '0'
  myCronFunction()
}
The goal was that, since there are two processes and PM2 should be load balancing between them, if the cron job executes on process one, process two would still be available to respond to requests. I'm not sure what's going on, whether PM2 is failing to load balance them or something else. But when my cron job executes on instance one, instance two still doesn't respond to requests until the job on instance one finishes executing.
I'm not sure why that is. It is my understanding that they are supposed to be completely independent of one another.
Anyone have any ideas?
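For reference, the setup described above roughly corresponds to a PM2 ecosystem file like this (a sketch; the actual file isn't shown in the question, and the app name and entry point are made up):

// ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'my-app',         // assumed app name
      script: './server.js',  // assumed entry point
      instances: 2,           // two instances, as described above
      exec_mode: 'cluster',   // PM2 cluster mode
    },
  ],
};

In cluster mode PM2 exposes NODE_APP_INSTANCE to each process (as a string in process.env), which is what the if check above relies on.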
My App Engine Flexible cron sometimes takes more than 120 seconds. Whenever it exceeds 120 seconds, App Engine throws a 502 error. It doesn't terminate my Node.js task; it only terminates the HTTP request started by the App Engine cron job.
There is also a value of 240 seconds that I don't understand where it comes from. I guess this is a retry request. It would be helpful if anyone could clarify this as well.
As per the App Engine documentation, a cron job can run for an hour. Is this true for the HTTP requests started by the cron job as well?
To be clear, I want to run my cron job for more than 120 seconds and keep the HTTP request active for 1 hour.
Even though you have switched to Kubernetes Engine, I would like to take the chance to clarify the purpose of cron jobs here.
As stated in the official documentation, cron jobs are used to perform work at regular time intervals. They involve invoking a URL through an HTTP request and can run for up to 60 minutes, while respecting the request's own limitations.
Some good uses for cron jobs: sending report emails on a daily basis, updating cached data at regular intervals, or updating summary information every hour. When a task involves obtaining external information, especially when there is a large number of operations that may exceed the time an HTTP connection remains open, or when different types of data come from the external application, I would not consider it a good use of cron jobs.
If you are using Kubernetes now, and consider it to be more useful for the tasks you need to perform, go ahead and continue with it.
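As a minimal sketch of that "invoke a URL" model (assuming an Express app on App Engine; the route path and handler are made up for illustration), the schedule simply issues an HTTP request to an endpoint you expose, and App Engine marks such requests with the X-Appengine-Cron header:

const express = require('express');
const app = express();

app.get('/tasks/daily-report', (req, res) => {
  // App Engine sets this header on requests it issues for cron jobs and
  // strips it from external traffic, so it can gate cron-only endpoints.
  if (req.get('X-Appengine-Cron') !== 'true') {
    return res.status(403).send('Cron-only endpoint');
  }
  // ... do the periodic work here, within the request's own limits ...
  res.status(200).send('OK');
});

app.listen(process.env.PORT || 8080);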
I have a Node.js script that scrapes URLs every day.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run with cron, I naturally looked at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires much more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit:
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread out the task, I had to keep on managing a Dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished when that's done.
Follow that up with a task queue, and use the queue's rate-limiting configuration as the way to limit the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement in your code directly. If you're really queueing millions or billions of jobs, follow their advice of having one queue fan out into another:
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
That should be pretty hands-off, and it only requires you to think through how the cron job pulls the next desired sites and pages, and how small the chunks of work should be.
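As a sketch of the cron-side "injector" (assuming the @google-cloud/tasks Node.js client; the project, location, queue name, and the /tasks/scrape worker endpoint are placeholders for illustration), the cron handler would just enqueue one task per URL and exit:

const {CloudTasksClient} = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueScrapeTasks(urls) {
  const parent = client.queuePath('my-project', 'us-central1', 'scrape-queue');
  for (const url of urls) {
    await client.createTask({
      parent,
      task: {
        appEngineHttpRequest: {
          httpMethod: 'POST',
          relativeUri: '/tasks/scrape', // worker endpoint that fetches a single page
          headers: {'Content-Type': 'application/json'},
          body: Buffer.from(JSON.stringify({url})).toString('base64'),
        },
      },
    });
  }
}

The queue's dispatch rate and concurrency settings then enforce being 'kind' to the scraped site, rather than the code that enqueues the work.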
I'm also working on this kind of task. I need to crawl websites and have the same problem.
Instead of running the main crawler task on the VM, I moved that task to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and then returning the result to the caller.
This is how it works: I have a long-running application that can be called a master. The master knows which URLs we are going to access, but instead of accessing the target website itself, it sends the URL to a crawler function in GCF. The crawling task is done there, and the result is sent back to the master. In this setup, the master only requests and receives a small amount of data and never touches the target website; the rest is left to GCF. You can offload your master and crawl websites in parallel via GCF, or you can use another method to trigger GCF instead of an HTTP request.
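As a sketch of the GCF side of this setup (assuming an HTTP-triggered function and the @google-cloud/datastore client; scrapePage is a hypothetical helper that does the actual fetching and parsing):

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

exports.crawl = async (req, res) => {
  const url = req.body.url;               // the master sends the target URL
  const result = await scrapePage(url);   // hypothetical: fetch and parse the page
  await datastore.save({
    key: datastore.key(['CrawlResult']),  // incomplete key, Datastore assigns the ID
    data: { url, result, crawledAt: new Date() },
  });
  res.status(200).json({ url, result }); // only a small payload goes back to the master
};

The master then only needs to POST { url } to the function's trigger URL and collect the small JSON responses.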