How to handle cron execution time and user wait - linux

I have scheduled a cron job that is executed every minute.
This cron job generates a PDF file using a remote web service. That operation alone takes a few seconds (roughly 3 seconds), which means the cron job can generate approximately 20 PDF files per minute.
If a visitor requests 60 documents, it will take the server about 3 minutes to generate all the PDF files.
Running parallel cron jobs for this task is not possible as-is, because each file request must be handled individually for database relationship and integrity reasons. Basically, the files can only be handled one by one.
Therefore, is there any logic I could apply in order to:
execute multiple occurrences of the same cron job to speed up the process and decrease the user's waiting time,
and make sure each file creation is handled by exactly one cron job, so that a given creation process is never picked up by another cron job doing the same task? (One possible locking approach is sketched below.)
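A common pattern for this is to have each cron occurrence atomically claim one pending row before processing it, so several occurrences can run in parallel without ever touching the same file. A minimal sketch in Node.js with mysql2 (the pdf_requests table, its claimed_by and done columns, and the generatePdf helper are all hypothetical):

    // each cron occurrence gets a unique id
    const workerId = `${process.pid}-${Date.now()}`;

    // atomically claim one unclaimed request; two concurrent
    // occurrences can never claim the same row
    await db.query(
      `UPDATE pdf_requests SET claimed_by = ?
        WHERE claimed_by IS NULL
        ORDER BY id LIMIT 1`,
      [workerId]
    );

    // process whatever this worker managed to claim
    const [rows] = await db.query(
      'SELECT * FROM pdf_requests WHERE claimed_by = ? AND done = 0',
      [workerId]
    );
    for (const row of rows) {
      await generatePdf(row); // hypothetical: calls the remote web service
      await db.query('UPDATE pdf_requests SET done = 1 WHERE id = ?', [row.id]);
    }

With this, you can schedule the same cron job several times per minute; the atomic UPDATE guarantees each file is handled by exactly one occurrence.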
Thank you

Related

How to make a job wait for the cluster to become available

I have a workflow in Databricks called "score-customer", which I can run with a parameter called "--start_date". I want to run the job for each date this month, so I manually create 30 runs, passing a different date parameter to each. However, after 5 concurrent runs, the rest of the runs fail with:
Unexpected failure while waiting for the cluster (1128-195616-z656sbvv) to be ready.
I want my jobs to wait for the cluster to become available instead of failing. How can this be achieved?
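One option is to throttle the submissions yourself instead of firing all 30 runs at once. A minimal sketch against the Jobs 2.1 REST API in Node.js 18+ (the job id, the date list, and passing the date via python_params are assumptions; adjust to how your job actually receives --start_date):

    const HOST = process.env.DATABRICKS_HOST;   // e.g. https://<workspace>.cloud.databricks.com
    const TOKEN = process.env.DATABRICKS_TOKEN; // personal access token
    const JOB_ID = 123;                         // hypothetical id of the "score-customer" job
    const headers = { Authorization: `Bearer ${TOKEN}`, 'Content-Type': 'application/json' };

    async function isFinished(runId) {
      const res = await fetch(`${HOST}/api/2.1/jobs/runs/get?run_id=${runId}`, { headers });
      const { state } = await res.json();
      return ['TERMINATED', 'SKIPPED', 'INTERNAL_ERROR'].includes(state.life_cycle_state);
    }

    async function main() {
      // one run per day of the month (assumed date format)
      const dates = [...Array(30)].map((_, i) => `2023-06-${String(i + 1).padStart(2, '0')}`);
      let active = [];
      for (const date of dates) {
        while (active.length >= 5) {            // keep at most 5 runs in flight
          await new Promise((r) => setTimeout(r, 30000));
          const done = await Promise.all(active.map(isFinished));
          active = active.filter((_, i) => !done[i]);
        }
        const res = await fetch(`${HOST}/api/2.1/jobs/run-now`, {
          method: 'POST',
          headers,
          body: JSON.stringify({ job_id: JOB_ID, python_params: ['--start_date', date] }),
        });
        active.push((await res.json()).run_id);
      }
    }

    main();

Depending on your workspace, you may also be able to raise the job's max_concurrent_runs setting or enable job queueing so that extra runs wait instead of failing.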

Should I create a single Azure WebJob (Node.js)?

I need to create a WebJob that runs 2 processes (maybe more), all the time.
Process 1 (continuous):
Get messages from the queue.
For each message, connect to the db and update a value.
Repeat step 1.
Process 2 (scheduled - every day early in the morning):
Go to the db and move records to a tmp table.
Send each record via HTTP.
If a record can't be sent, retry all day.
If all records were sent, run again tomorrow.
Given these 2 processes (there could be more), can I create one single WebJob for all of them, or should I create a separate job for each process?
I was thinking about this implementation, but I don't know how accurate it is:
cronjobs: 1
Type: Continuous

    const cron = require('node-cron');

    async function process1() {
      // do stuff: get messages from the queue and update db values
      // (the awaited queue I/O keeps this loop from spinning)
    }

    async function process2() {
      // do stuff: move records to a tmp table and send each via HTTP
    }

    // scheduled part: every day early in the morning (node-cron)
    cron.schedule('0 5 * * *', process2);

    // continuous part: keep polling the queue
    (async () => {
      while (true) {
        await process1();
      }
    })();
In short, you need to create a separate WebJob for each process.
When you run a WebJob as continuous or scheduled, that type applies to every task inside it, so you cannot create one single WebJob that is both continuous and scheduled. For more details, you could refer to this article.
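As an illustration (a sketch, not the article's exact example): the continuous WebJob would hold the queue-polling loop, while the scheduled one can declare its timing in a settings.job file deployed next to its script. The six-field NCRONTAB below (seconds, minutes, hours, day, month, day-of-week) fires at 5:00 AM daily; the exact hour is an assumption:

    {
      "schedule": "0 0 5 * * *"
    }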

Nested Cron Job in AWS Lambda Function

Requirement: to send reminders to n users, each at their appropriate time. E.g. user 1 at 9:10 AM, user 2 at 10:50 PM, user 3 at 4:20 AM, and so on.
Solution in Node.js
I have a Node.js cron job which runs at minute 55 of every hour (i.e. 9:55, 10:55, 11:55). It first deletes all the child cron jobs, then fetches data from the database and checks the reminder settings for each user. Based on the reminder settings in the database, it creates child cron jobs for all users to send the reminders.
Solution in AWS Lambda
I created a Lambda function and scheduled it the same way. Inside the Lambda I do the same thing as in Node.js, but since the Lambda's execution finishes, the child cron jobs never get executed.
I thought about Step Functions, but I'm not sure how to achieve this since it is dynamic. Someone also suggested triggering SNS, but that will not work in my scenario either.
Could someone please help me achieve this with AWS Lambda?
Why not have one cron job that runs every minute and sends all reminders that are due according to the database? I don't really see why you need nested cron jobs.
In any case, you could also use DynamoDB's time-to-live (TTL) attribute and a stream that triggers a Lambda function. Create a record meaning "send a reminder at X every Y", with X being the expiration time. The Lambda function triggers, and when it is done you create a new DDB record with X+Y as the expiration time. You might not even need a cron job Lambda in this case.
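A minimal sketch of that stream handler (the table name, attribute names, and the sendReminder helper are all hypothetical; the stream must be configured to include old images):

    const AWS = require('aws-sdk');
    const ddb = new AWS.DynamoDB.DocumentClient();

    exports.handler = async (event) => {
      for (const record of event.Records) {
        // TTL deletions arrive on the stream as REMOVE events
        if (record.eventName !== 'REMOVE') continue;
        const old = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.OldImage);
        await sendReminder(old.userId);           // hypothetical helper
        // re-schedule: same record, expiring Y seconds later
        await ddb.put({
          TableName: 'reminders',                 // assumed table name
          Item: { ...old, expiresAt: old.expiresAt + old.intervalSeconds },
        }).promise();
      }
    };

One caveat: TTL deletion timing is approximate (it can lag behind the expiration time), so this pattern suits coarse-grained reminders better than to-the-minute ones.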
I suppose you could use the aws-sdk to create CloudWatch Events rules dynamically in your Node.js cron job.
For better contrast, create 2 separate functions:
Main cron job (deletes the child rules in CloudWatch, retrieves data from the database, and creates CloudWatch rules that invoke the child cron job at the specific times)
Child cron job (only sends the reminders)
To read more: Nodejs Create CloudWatch
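A sketch of the rule-creation side with the v2 aws-sdk (the rule naming scheme, environment variable, and payload shape are assumptions; the child Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it):

    const AWS = require('aws-sdk');
    const events = new AWS.CloudWatchEvents();

    async function scheduleReminder(userId, cronExpr) {
      const ruleName = `reminder-${userId}`;      // hypothetical naming scheme
      await events.putRule({
        Name: ruleName,
        ScheduleExpression: `cron(${cronExpr})`,  // e.g. 'cron(10 9 * * ? *)' for 9:10 AM UTC
      }).promise();
      await events.putTargets({
        Rule: ruleName,
        Targets: [{
          Id: 'child-reminder-lambda',
          Arn: process.env.CHILD_LAMBDA_ARN,      // ARN of the child reminder function
          Input: JSON.stringify({ userId }),      // payload the child receives
        }],
      }).promise();
    }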

App Engine Flexible cron is terminated after 120 seconds

My App Engine Flexible cron sometimes takes more than 120 seconds, and whenever it exceeds 120 seconds, App Engine throws a 502 error. It doesn't terminate my Node.js task; it only terminates the HTTP request started by the App Engine cron job.
There is one value, 240 seconds, whose origin I didn't understand. I guess this is a retry request. It would be helpful if anyone could clarify this as well.
As per the App Engine documentation, a cron can run for an hour. Is this true for HTTP requests started by the cron job as well?
To be clear, I want to run my cron for more than 120 seconds and keep the HTTP request active for up to 1 hour.
Even though you have switched to Kubernetes Engine, I would like to take this chance to clarify the purpose of cron jobs here.
As stated in the official documentation, cron jobs are used to perform jobs at regular time intervals. They invoke a URL through an HTTP request and can run for up to 60 minutes, while still respecting the request's own limitations.
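For reference, such a job is declared in cron.yaml; a minimal sketch (the URL and schedule are placeholders):

    cron:
    - description: "hourly summary update"
      url: /tasks/update-summary
      schedule: every 1 hours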
Some good uses for cron jobs: sending report emails on a daily basis, updating cached data at regular intervals, or updating summary information every hour. When a task involves obtaining external information, especially when there is a large number of operations that may exceed the time an HTTP connection remains open, or when different types of data are coming from the external application, I would not consider it a good use of cron jobs.
If you are using Kubernetes now, and consider it to be more useful for the tasks you need to perform, go ahead and continue with it.

Google Cloud Platform: Running a several-hour scraping script

I have a Node.js script that scrapes URLs every day.
The requests are throttled to be kind to the server, which results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run from cron, I naturally looked at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would suit my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit :
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all the chunks of work to the Cloud Tasks queue and consider itself finished when done.
Follow that up with a task queue, and use the queue's rate-limiting configuration as the method of limiting the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement directly in your code. If you're really queueing millions or billions of jobs, follow their advice of having one queue spawn another:
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
That should be pretty hands-off, and only requires you to think through how the cron job picks the next desired sites and pages, and how small to break the workloads down.
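For the queue rate-limiting mentioned above, a minimal queue.yaml sketch (the queue name is hypothetical, and the exact knobs differ slightly between the legacy Task Queue and Cloud Tasks):

    queue:
    - name: scrape-queue
      rate: 1/s
      bucket_size: 1
      max_concurrent_requests: 1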
I'm also working on a task like this; I need to crawl websites and have the same problem.
Instead of running the main crawler task on the VM, I moved it to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and returning the result to the caller.
This is how it works: I have a long-running application that can be called the master. The master knows which URLs we are going to access, but instead of accessing the target website itself, it sends each URL to a crawler function in GCF. The crawling is done there and the result is sent back to the master. In this case, the master only requests and receives a small amount of data and never touches the target website; GCF does the rest. You can offload your master and crawl the websites in parallel via GCF. You can also trigger GCF by other methods instead of an HTTP request.
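A minimal sketch of such a crawler function (the request shape, the Datastore kind, and the returned payload are assumptions):

    const fetch = require('node-fetch');
    const { Datastore } = require('@google-cloud/datastore');
    const datastore = new Datastore();

    // HTTP-triggered function: the master posts { url }, the function
    // scrapes it, stores the result, and returns a small summary
    exports.crawl = async (req, res) => {
      const { url } = req.body;               // hypothetical request shape
      const html = await (await fetch(url)).text();
      await datastore.save({
        key: datastore.key(['Page']),         // assumed Datastore kind
        data: { url, fetchedAt: new Date(), size: html.length },
      });
      res.json({ url, size: html.length });   // small payload back to the master
    };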
