How to set agenda job concurrency properly - node.js

Here's an example job:
const Agenda = require('agenda')
const agenda = new Agenda({db: {address: process.env.MONGO_URL}})
agenda.define('example-job', (job) => {
  console.log('took a job -', job.attrs._id)
})
So now, let's say I queue up 11 agenda jobs like this:
const times = require('lodash/times')
times(11, () => agenda.now('example-job'))
Now if I look in the DB I can see that there are 11 jobs queued and ready to go (like I would expect).
So now I start one worker process:
agenda.on('ready', () => {
  require('./jobs/example_job')
  agenda.start()
})
When that process starts, I see 5 jobs get pulled off the queue. This makes sense because the defaultConcurrency is 5: https://github.com/agenda/agenda#defaultconcurrencynumber
So far so good, but if I start another worker process (same as the one above) I would expect 5 more jobs to be pulled off the queue so there would be a total of 10 running (5 per process), and one left on the queue.
However, when the second worker starts, it doesn't pull down any more jobs, it just idles.
I would expect that defaultConcurrency is the number of jobs that can run at any given moment per process, but it looks like it is a setting that applies to the number of jobs at any moment in aggregate, across all agenda processes.
What am I missing here? What is the correct way to specify how many jobs can run per process, without putting a limit on the number of jobs that can run across all the processes?

The problem is that defaultLockLimit needs to be set.
By default, the lock limit is 0, which means no limit, so a single worker will lock up all the available jobs, leaving none for other workers to claim.
Setting defaultLockLimit to the same value as defaultConcurrency ensures that a worker only locks the jobs it is actively processing.
See: https://github.com/agenda/agenda/issues/412#issuecomment-374430070
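A minimal sketch of that fix, assuming the constructor options documented in the Agenda README (they can also be set via the chainable defaultConcurrency()/defaultLockLimit() methods):

const Agenda = require('agenda')

// Each worker locks at most 5 jobs and runs at most 5 at a time,
// leaving the rest of the queue free for other worker processes.
const agenda = new Agenda({
  db: { address: process.env.MONGO_URL },
  defaultConcurrency: 5,
  defaultLockLimit: 5
})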

Related

How to limit/set parallelism of QueueTriggers that get executed with Azure WebJobs

I have 5 QueueTrigger jobs within a single Function.cs file. 3 jobs must execute sequentially (synchronously) and 2 can process up to 16 items at a time.
From what I can decode from the documentation, the AddAzureStorage queue configuration method only supports setting this parallelism for all the jobs:
.AddAzureStorage(queueConfig =>
{
    queueConfig.BatchSize = 1;
});
The above sets all jobs to process only one item at a time. If I set it to 16, then all jobs will run in parallel, which is not what I want either.
Is there a way to set the BatchSize per QueueTrigger WebJob, or will I have to set it to 16 and use locks on the ones I don't want to run in parallel to achieve the desired behaviour?

creating recurring events in nodejs to update or insert MySQL table

I have a MySQL table tasks. In tasks, we can create a normal task or a recurring task that will automatically create a new task in the MySQL tasks table and send an email notification to the user that a task has been created. After a lot of research, I found out that you can do it in three ways:
MySQL events
Kue, bull, agenda (node.js scheduling libraries)
Using a cron job to monitor every day for tasks
The recurring tasks would repeat weekly, daily, monthly, or yearly.
We must have an option to remove the recurring event at any time. What would be a nice and clean solution?
As you've identified, there are a number of ways of going about this. Here's how I would do it, though I'm making a number of assumptions, such as how many tasks you're likely to have and how flexible the system needs to be going forward.
If you're unlikely to change the task time options (daily, weekly, monthly, yearly), I would give each task the fields last_run_date and next_run_date. Every time a task is run I would update these fields and create an entry in a log table such as task_run_log, which would also store the date/time the task was run at.
I would then have a cron job which fires an HTTP request to a Node.js service. This web service would look through the table of tasks, find which ones need to be executed that day, and dispatch a message for each task into some sort of queue (AWS SQS, GCP Pub/Sub, Apache Kafka, etc). Each message in the queue would represent a single task that needs to be carried out; workers subscribe to this queue and process the tasks themselves. Once a worker has processed a job, it makes the log entry and updates the last_run_date and next_run_date fields. If a task fails, the worker moves that message into an error queue and logs a failed task in the task log.
This system would be robust, as any failed jobs would be recorded in your database and would appear in an error queue (which you can either drain to discard the failed jobs, or replay into the normal queue once the worker is fixed). It would also scale to many tasks per day, since you can scale up your workers. You also won't be flooding cron: your cron job just sends a single HTTP request each day to your HTTP service, which kicks off the processing.
You can also set up alerts based on whether the cron job runs or not, to make sure the process gets kicked off properly.
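A rough sketch of that dispatch service, using Express, mysql2 and AWS SQS purely for illustration (the tasks table and next_run_date column follow the scheme above; the endpoint name and environment variables are made up):

const express = require('express')
const mysql = require('mysql2/promise')
const AWS = require('aws-sdk')

const pool = mysql.createPool(process.env.MYSQL_URL)
const sqs = new AWS.SQS({ region: process.env.AWS_REGION })
const app = express()

// cron fires a single HTTP request a day at this endpoint
app.post('/dispatch', async (req, res) => {
  // find every task that is due today
  const [tasks] = await pool.query(
    'SELECT id FROM tasks WHERE next_run_date <= CURDATE()'
  )
  // publish one message per task; queue workers process each task,
  // write to task_run_log and update last_run_date / next_run_date
  for (const task of tasks) {
    await sqs.sendMessage({
      QueueUrl: process.env.TASK_QUEUE_URL,
      MessageBody: JSON.stringify({ taskId: task.id })
    }).promise()
  }
  res.json({ dispatched: tasks.length })
})

app.listen(3000)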
I had to do something very similar; you can use the npm module node-schedule.
node-schedule has many features. You first create a rule, which determines when the job runs, and then schedule the job, which defines what the job does and activates it. The example below, taken from my code, sets a job to run at midnight every day.
var rule = new schedule.RecurrenceRule();
rule.dayOfWeek = [0, new schedule.Range(1, 6)]; // every day of the week
rule.hour = 0;   // at midnight
rule.minute = 0;
var j = schedule.scheduleJob(rule, function(){
  sqlUpdate(server);
});
This may not exactly fit all of your requirements alone, but there are other features and setups you can use.
For example, you can cancel any job with the cancel function:
j.cancel()
You can also set start and end times, as shown on the npm page:
let startTime = new Date(Date.now() + 5000);
let endTime = new Date(startTime.getTime() + 5000);
var j = schedule.scheduleJob({ start: startTime, end: endTime, rule: '*/1 * * * * *' }, function(){
  console.log('Time for tea!');
});
There are also other options for scheduling the date and time, as node-schedule also follows the cron format, meaning you can set dynamic times:
var j = schedule.scheduleJob('42 * * * *', function(){
  console.log('The answer to life, the universe, and everything!');
});
As such, this would allow node.js to handle everything you need. You would likely need to set up a system to keep track of the scheduled jobs (var j), but it would allow you to cancel and reschedule them as you wish.
It additionally allows you to reschedule jobs, retrieve the next scheduled invocation, and use multiple date formats.
If you need the jobs to persist after the process is turned off and on, or reset, you will need to save the details of each job; a MySQL database would make sense here. Upon startup, the code could do a quick pull and restart all of the created tasks based on the data from the database, and when you cancel a job you just delete it from the database. It should be noted that the process needs to be running for this to work: a job will not run if the process is turned off.
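A rough sketch of that persistence step, assuming a hypothetical scheduled_jobs table that stores each job's schedule as a cron expression (mysql2 is used here just for illustration):

const schedule = require('node-schedule')
const mysql = require('mysql2/promise')

const pool = mysql.createPool(process.env.MYSQL_URL)
const activeJobs = {} // keep the job handles so they can be cancelled later

async function restoreJobs() {
  // re-register every persisted job with node-schedule on startup
  const [rows] = await pool.query('SELECT id, cron_expression FROM scheduled_jobs')
  for (const row of rows) {
    activeJobs[row.id] = schedule.scheduleJob(row.cron_expression, () => {
      runTask(row.id) // runTask is a placeholder for the actual work
    })
  }
}

// cancelling a job drops both the in-memory handle and the database row
async function cancelJob(id) {
  if (activeJobs[id]) activeJobs[id].cancel()
  delete activeJobs[id]
  await pool.query('DELETE FROM scheduled_jobs WHERE id = ?', [id])
}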

Agenda Job: now() makes the job run multiple times

I am scheduling an Agenda Job as below:
await agenda.now("xyz");
But the above command makes my job run almost every minute. However, when I change it to
await agenda.every('5 minutes', "xyz");
the above works as expected, i.e. it runs the job every 5 minutes.
But I don't want a recurring job; I want it to run once.
The issue was with the concurrency of the job definition: it was set to 10, which allowed several instances of the same job to run in parallel.
Changing the concurrency to 1 solved the issue.
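For reference, a sketch of the fix: concurrency is a per-definition option in Agenda ('xyz' stands in for the real job name, and the job body is a placeholder):

agenda.define('xyz', { concurrency: 1 }, async (job) => {
  // the actual work goes here; at most one instance runs at a time
})

await agenda.now('xyz') // now fires a single run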

Execute (or queue) a set of tasks simultaneously

I have a situation like this:
between 5 and 20 test environments, separated into groups of 5 VMs (1 set = 5 VMs usually)
hundreds of test cases that should be executed simultaneously on 1 VM set
Celery with 5 workers (one worker per VM in the set: alpha, beta, charlie, delta, echo)
Test sets can run in a different order and take different amounts of time to execute.
Each worker should execute only one test case without overlapping or concurrency.
Each worker run tasks only from its own queue/consumer.
In a previous version I had a solution with multiprocessing and it worked fine. But with Celery I can't add all 100 test cases for all 5 VMs from one set; it only starts adding tasks for VM alpha and waits until they have all finished before starting tasks for the next VM, beta, and so on.
Now, when I tried to use multiprocessing to create separate threads for each worker, I got: AssertionError: daemonic processes are not allowed to have children
The problem is: how do I add 100 tests for 5 workers at the same time, so that each worker (alpha, beta, ...) runs its own set of 100 test cases simultaneously?
This problem can be solved using task keys based on each consumer, like:
app.control.add_consumer(
    queue='alpha',
    exchange='local',
    exchange_type='direct',
    routing_key='alpha.*',
    destination=['worker_for_alpha@HOSTNAME'])
Now you can send any task to this consumer's worker using the routing key and queue name:
@app.task(queue='alpha', routing_key='alpha.task_for_something')
def any_task(arg_1, arg_2):
    # do something with arg_1 and arg_2
    pass
Now you can scale this to any number of workers, or consumers for a single worker. Just make a collection of them and iterate over it one by one for multiple workers/consumers.
Another issue can be solved with the --concurrency option of each worker.
You can set concurrency to 5 to have 5 simultaneous threads on one worker, or break the task flow into separate threads for each worker with a unique key and consumer (queue).

How do you implement a periodically executing job?

I am implementing a system that automatically replies to tweets containing arbitrary hashtags. This system consists of a process that periodically crawls Twitter and a process that periodically replies to these tweets. Following my company's tradition, these periodic jobs are implemented with working tables in an RDBMS that have a status column with values like "waiting", "processing" or "succeed". To ensure redundancy, I run multiple identical processes, coordinated with low-level locks.
My question is: I'm implementing periodic jobs with working tables in an RDBMS, but how are such jobs generally implemented?
There's a node package cron which allows you to execute code at some specified interval, just like crontab. Here's a link to the package: https://www.npmjs.org/package/cron
For example:
var cronJob = require("cron").CronJob;
// Run this cron job every Sunday (0) at 7:00:00 AM
new cronJob("00 00 7 * * 0", function() {
  // insert code to run here...
}, null, true);
You might be able to use that module to run some job periodically, which crawls twitter or replies to tweets.
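If you keep the working-table approach from the question, the usual way to let several identical processes share the table safely is to claim a row with a single atomic UPDATE. A sketch using mysql2 (the jobs table, status values, and claimed_by column are assumptions based on the question's description):

const crypto = require('crypto')
const mysql = require('mysql2/promise')

// Atomically claim one waiting row; even with several identical
// worker processes polling, each row is picked up by exactly one.
async function claimNextJob(pool) {
  const token = crypto.randomUUID()
  const [result] = await pool.query(
    `UPDATE jobs SET status = 'processing', claimed_by = ?
     WHERE status = 'waiting' ORDER BY id LIMIT 1`,
    [token]
  )
  if (result.affectedRows === 0) return null // nothing waiting
  const [rows] = await pool.query(
    'SELECT * FROM jobs WHERE claimed_by = ?', [token]
  )
  return rows[0]
}

A cron-style schedule (for example, via the cron package above) could then call claimNextJob on each process without two processes ever picking up the same row.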
