I have a webapp with two instances and some cronjobs that run from time to time. When only one scheduler instance is running, the cronjobs are triggered one after another, each waiting for the previous one to finish, and that's exactly how I want it to work, because almost every cronjob accesses the same data.
However, if I start the scheduler in the second instance as well, the second cronjob starts before the first one has finished.
I found a few "solutions" and implemented them, but they did not work.
My quartz.properties:
# Use the MongoDB store
org.quartz.jobStore.class=com.novemberain.quartz.mongodb.MongoDBJobStore
# MongoDB URI (optional if 'org.quartz.jobStore.addresses' is set)
org.quartz.jobStore.mongoUri=
# comma separated list of mongodb hosts/replica set seeds (optional if 'org.quartz.jobStore.mongoUri' is set)
org.quartz.jobStore.addresses=mongodb://localhost:27017
# database name
org.quartz.jobStore.dbName=mytDb
# Will be used to create collections like quartz_jobs, quartz_triggers, quartz_calendars, quartz_locks
org.quartz.jobStore.collectionPrefix=quartz
# The thread count setting is ignored by the MongoDB store, but Quartz requires it
org.quartz.threadPool.threadCount=1
# Set to true to enable the clustering features
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
org.quartz.scheduler.instanceId=AUTO
As you see I have a thread count of 1, and what I think happens is that there are two threads when two instances are running.
So how can I configure Quartz so that two scheduler instances are running but only one does the work, with the second one taking over if the first instance crashes?
I also added @DisallowConcurrentExecution to my Job classes, but this does not seem to work either.
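For reference, this is how I have the annotation applied; the class name is a placeholder, and this is only a minimal sketch of the intended setup:

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// With a clustered job store, @DisallowConcurrentExecution is supposed to
// prevent two scheduler instances from firing the same JobDetail concurrently.
@DisallowConcurrentExecution
public class MyCronJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // job logic here
    }
}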
I currently have a script inside a Linux EC2 instance that processes some documents. This script gets called from AWS Lambda using an SSM send_command. It works fine when it processes one or two documents, but past that I get empty responses. I'm assuming the system bottlenecks, as there is essentially no limit to the number of calls I can send to the instance. So is there a way to set the concurrency level on the instance so it only processes, say, 2 commands at a time?
I know I can set the concurrency level on the Lambdas, but their execution time is usually less than 200 ms, while the processing time on the instance is about 5 to 15 seconds.
Ultimately, I could have the Lambdas wait for the job to be completed, but that would be expensive, as I need to process thousands of documents.
Thank you!
I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (licappts and agents) should be affected by that stream's failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example, I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time, even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard-coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way is a good idea, because then, for example, agentpolicy_emr_task_1 has to wait for the other two to finish before it can start, and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whichever sourceType_emr_task_1 task is ready first to start, and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of: if there is a way for me to check on the status of another task, I could create a script for sourceType_emr_task_1 that checks to see if either of the other two sourceType_emr_task_1 tasks has a status of running. If so, it'll sleep and periodically check whether none of the others are running, in which case it'll start its process. I'm not a big fan of this approach though, because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set the pool kwarg to something like pool=emr_task, as shown in the sketch below.
Then just go into the webserver -> admin -> pools -> create:
Set the Pool name to match the pool used in your operator, and set Slots to 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.
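A minimal sketch of the operator side, assuming a BashOperator and an already-defined dag object (the task id and command are illustrative):

from airflow.operators.bash_operator import BashOperator

# All three *_emr_task_1 tasks point at the same pool, so once the pool
# has 1 slot the scheduler will only let one of them run at a time.
licappts_emr_task_1 = BashOperator(
    task_id='licappts_emr_task_1',
    bash_command='echo "run the licappts EMR step"',
    pool='emr_task',
    dag=dag,
)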
I'm running scripts that require a different thread for each user account I pull from a database. So the script starts by running a JDBC processor to get all the accounts and store them (using the "Variable Names" field) in "accounts". Then I run a BeanShell PreProcessor to convert the variable "accounts_#" to a property:
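// accounts_# is set by the JDBC sampler to the number of rows returned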
props.put("p_accounts_#",vars.get("accounts_#"));
Then, I have a thread group start. Under "Number of Threads (users)", I have
${__P(p_accounts_#)}
The FIRST time I run this script (after launching JMeter), I only get a SINGLE thread. Every subsequent time I run it, it runs for all accounts.
It seems like, for some reason, the property is not being saved until the end of the first execution. This is a very big problem, as when JMeter is launched without the UI, it only runs a single thread every time.
Am I setting the property incorrectly? I also tried it with a Beanshell Assertion with the same result.
Just as a test, I created a new test with the bare minimum I needed to reproduce this. Here's the script (images): http://imgur.com/a/WB5J2
It's a Beanshell PreProcessor with "props.put("accounts","12");"
Then a Thread group using "${__P(accounts)}" as the Number of Threads
Then inside that thread group is a Debug Sampler outputting the JMeter properties.
At the end is a View Results Tree.
When I run it the first time, there's only one output: "Thread 1 Running".
When I run it again, there are 12 outputs: "Thread 1 Running", "Thread 2 Running", etc.
I can see that for both Debug Samplers (for the first run and the second run), the "accounts" property is set to 12. But the Thread Group needed to execute TWICE before it would work.
Any ideas?
This can be solved by adding another Thread Group, called a 'setUp Thread Group', to contain the setup portion. If you put all of your staging steps into this type of thread group, it will run prior to any other thread groups. You can then have your PreProcessor there, or move the logic to a BeanShell Sampler if you'd like, and set the property from it.
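For example, with the JDBC step moved into the setUp Thread Group, a BeanShell Sampler placed after it only needs the one line from above; since the setUp Thread Group finishes before the main Thread Group starts, the property is in place before the thread count is read:

// BeanShell Sampler inside the setUp Thread Group:
// copy the per-thread variable into a global property before
// the main Thread Group evaluates ${__P(p_accounts_#)}
props.put("p_accounts_#", vars.get("accounts_#"));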
I'm building a CRON-like module into my service (using node-schedule) that will get required into each instance of my multi-core setup, and I'm wondering: since they are all running their own threads and they are all scheduled to run at the same time, will they get called for every single thread, or just once because they're all loading the same module?
If they do get called multiple times, then what is the best way to make sure the desired actions only get called once?
If you are using pm2 with cluster mode, then you can use process.env.NODE_APP_INSTANCE to detect which instance is running. You can use the following code so your cron jobs will be called only once.
// run cron jobs only for first instance
if(process.env.NODE_APP_INSTANCE === '0'){
// cron jobs
}
node-schedule runs inside a given node process and it schedules things that that particular node process asked it to schedule.
If you are running multiple node processes and each is using node-schedule, then all the node-schedule instances within those separate node processes are independent (there is no cooperation or coordination between them). If each node process asks its own node-schedule instance to run a particular task at 3pm on the first Wednesday of the month, then all the node processes will start running that task at that time.
If you only want the action carried out once, then you have to coordinate among your node instances so that the action is scheduled in only one node process, not in all of them.
The best way to handle this generically is to have a shared database that you write a "lock" entry to. As in, let's say all the tasks wrote a DB entry such as {instanceId: "a", taskId: "myTask", timestamp: "2021-12-22:10:35"}.
All the tasks would submit the same thing, except with their own instanceId. You then have a unique index on 'timestamp' so that only one insert gets accepted.
Then they all do a query and see if their node was the one that was accepted to do the cron.
You could do the same thing but also add a "random" field that generates a random number and the task with the lowest number wins.
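A minimal sketch of that idea in Node, assuming MongoDB with the official mongodb driver; the collection name is illustrative, and the unique index here covers the taskId/timestamp pair (slightly stricter than above) so different tasks don't collide:

// Create once: a unique index so only one lock document per task per time slot
// await db.collection('cronLocks').createIndex({ taskId: 1, timestamp: 1 }, { unique: true });

// Each instance tries to claim the slot; exactly one insert succeeds.
async function tryAcquireLock(db, instanceId, taskId, timestamp) {
  try {
    await db.collection('cronLocks').insertOne({ instanceId, taskId, timestamp });
    return true; // our insert was accepted, so this instance runs the job
  } catch (err) {
    if (err.code === 11000) return false; // duplicate key: another instance won
    throw err;
  }
}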
I have an application that has to launch jobs repeatedly. But (yes, that would have been too easy without a but...) I would like users to define their backup frequency in the application.
In the worst case, they would have to choose between:
weekly,
daily,
every 12 hours,
every 6 hours,
hourly
In the best case, they should be able to use crontab expressions (see the documentation for examples).
How to do this? Do I launch a job every minute that checks the last execution time and the frequency, and then launches another job if needed? Do I create a sort of queue that will be executed by a master job?
Any clues, ideas, opinions, best practices, experiences are welcome!
EDIT: Solved this problem using the Akka scheduler. OK, this is a technical solution, not a design answer, but still, everything works great.
Each user-defined repetition is an actor that sends a message every period to a new actor that executes the actual job.
There may be two ways to do this depending on your requirements/architecture:
If you can only use Play:
The user creates the job and the frequency it will run (crontab, whatever).
On saving the job, you calculate the first time it will have to be run. You then add an entry to a table JOBS with the execution time, job id, and any other information required. This is required as Play is stateless and information must be stored in the DB for later retrieval.
You have a job that queries the table for entries whose execution date is earlier than now. It retrieves the first one, runs it, removes it from the table, and adds a new entry for the next execution. You should keep some execution counter, so if a task fails (which means the entry is not removed from the DB) it won't block execution of the other tasks by the job trying it again and again.
The frequency of this job is set to run every second. That way, while there is information in the table, the tasks should execute roughly as often as required. As Play won't spawn a new instance of the job while the current one is working, if you have enough tasks this one job will serve them all. If not, it will be killed at some point and restored when required.
Of course, the users' crons will not be too precise, as you have to account for your own cron delays plus execution delays of all the tasks in the queue, which will be run sequentially. Not the best approach, unless you somehow disallow crons which run every second, or (to be safe) more often than every minute. Doing a check on the execution time of the crons to kill them if they run over a certain amount of time would be a good idea.
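A rough sketch of such a polling job, assuming Play 1.x's play.jobs API; the class name is a placeholder, and the persistence calls are left out since they depend on your model layer:

import play.jobs.Every;
import play.jobs.Job;

// Runs every second; Play will not start a new run while one is in progress.
@Every("1s")
public class ScheduledTaskRunner extends Job {
    @Override
    public void doJob() {
        // Fetch entries from the JOBS table whose execution time has passed,
        // run each one, remove it, and insert an entry for its next occurrence.
    }
}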
If you can use more than Play:
The better alternative, I believe, is to use Quartz (see this) to create a future execution when the user creates the job, and reprogram it once the execution is over.
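A minimal Quartz 2.x sketch of scheduling one future execution; BackupJob, the identity string, and nextExecutionTime (a java.util.Date you compute from the user's chosen frequency) are placeholders:

import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

import java.util.Date;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.impl.StdSchedulerFactory;

public class BackupScheduling {
    // Schedule a single run; the job re-schedules its next run when it finishes.
    public static void scheduleNextRun(Date nextExecutionTime) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = newJob(BackupJob.class).withIdentity("backup-user42").build();
        Trigger trigger = newTrigger().startAt(nextExecutionTime).build();
        scheduler.scheduleJob(job, trigger);
    }
}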
There was a discussion on google-groups about it. As far as I remember, you must define a job which starts every 6 hours and checks which backups must be done. So you must remember when the last backup job finished and do the check yourself. I'm unsure if Quartz can handle such a requirement.
I looked in the source code (always a good source ;-)) and found a method every, which I think should do what you want. However, I'm unsure if this is a clever design, because if you have 1000 users you will then have 1000 jobs. I'm unsure if Play was built to handle such a large number of jobs.
[Update] For cron-expressions you should have a look into JobPlugin.scheduleForCRON()
There are several ways to solve this.
If you don't have a really huge load of jobs, I'd just persist them to a table with the required flexibility. Then check all of them every hour (or the lowest interval you support) and run those that are eligible. Simple.
Or, if you prefer to use cron syntax anyway, just write (export) the jobs to a user crontab using a wrapper which calls back to your running app, or which starts the job in a standalone process if that's possible.