Workflow System with Azure Table Storage - azure

I have a system where we need to run a simple workflow.
Example:
On Jan 1st 08:15 trigger task A for object Z
When triggered then run some code (implementation details not important)
Schedule task B for object Z to run at Jan 3rd 10:25 (and so on)
The workflow itself is simple, but I need to run 500.000+ instances and that's the tricky part.
I know Windows Workflow Foundation and for that very same reason I have chosen not to use that.
My initial design would be to use Azure Table Storage and I would really appreciate some feedback on the design.
The system will consist of two tables
Table "Jobs"
PartitionKey: ObjectId
Rowkey: ProcessOn (UTC Ticks in reverse so that newest are on top)
Attributes: State (Pending, Processed, Error, Skipped), etc...
Table "Timetable"
PartitionKey: YYYYMMDD
Rowkey: YYYYMMDDHHMM_<GUID>
Attributes: Job_PartitionKey, Job_RowKey
The idea is that the Jobs table will hold the complete history of jobs per object, and the Timetable will hold a list of all jobs to run in the future.
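As a rough illustration of the two key schemes described above, here is a minimal sketch in JavaScript. The function names are made up for this example; the tick constants are the standard .NET values (ticks at the Unix epoch and DateTime.MaxValue.Ticks), and it assumes all timestamps are handled in UTC.

```javascript
// .NET ticks are 100 ns units counted from 0001-01-01T00:00:00Z.
const TICKS_AT_UNIX_EPOCH = 621355968000000000n;
const MAX_TICKS = 3155378975999999999n; // DateTime.MaxValue.Ticks in .NET
const TICKS_PER_MS = 10000n;

function pad(n, width = 2) {
  return String(n).padStart(width, '0');
}

// RowKey for "Jobs": reversed ticks, zero-padded to a fixed width so that
// string ordering matches numeric ordering (newest first).
function jobsRowKey(processOn) {
  const ticks = BigInt(processOn.getTime()) * TICKS_PER_MS + TICKS_AT_UNIX_EPOCH;
  return (MAX_TICKS - ticks).toString().padStart(19, '0');
}

// Keys for "Timetable": PartitionKey YYYYMMDD, RowKey YYYYMMDDHHMM_<GUID>.
function timetableKeys(processOn, guid) {
  const day = `${processOn.getUTCFullYear()}${pad(processOn.getUTCMonth() + 1)}${pad(processOn.getUTCDate())}`;
  const minute = `${day}${pad(processOn.getUTCHours())}${pad(processOn.getUTCMinutes())}`;
  return { partitionKey: day, rowKey: `${minute}_${guid}` };
}
```

Because the reversed value is padded to a fixed width, a later ProcessOn always yields a RowKey that sorts before an earlier one within the same partition.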
Some assumptions:
A job will never span more than one Object
There will only ever be one pending job per Object
The "job" is very lightweight e.g. posting a message to a queue
The system must be able to perform these tasks:
Execute pending jobs
Query for all records in "Timetable" with a "partition <= Today" and "RowKey <= today"
For each record (in parallel)
Lookup job in Jobs table via PartitionKey and RowKey
If "not exists" or State != Pending then skip
Execute "logic". If fails => log and maybe do some retry logic
Submit "Next run date in Timetable"
Submit "Update State = Processed" and "New Job Record (next run)" as a single transaction
When all are finished => Delete all processed Timetable records
Concern: Only two of the three record modifications are in a transaction. Could this be overcome in any way?
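One detail of the "RowKey <= today" query is worth sketching. Keys are compared as strings, so a RowKey such as 202401031025_<GUID> sorts after the bare prefix 202401031025, and a plain "less than or equal" against the current minute would miss the rows for that minute. The sketch below (function names are illustrative, not part of any SDK) builds an OData filter string that uses "less than the next minute" instead.

```javascript
function pad(n) {
  return String(n).padStart(2, '0');
}

// Formats a Date as YYYYMMDDHHMM in UTC.
function minuteStamp(d) {
  return `${d.getUTCFullYear()}${pad(d.getUTCMonth() + 1)}${pad(d.getUTCDate())}` +
    `${pad(d.getUTCHours())}${pad(d.getUTCMinutes())}`;
}

// Filter for all Timetable rows that are due at or before "now".
function dueTimetableFilter(now) {
  const partition = minuteStamp(now).slice(0, 8); // YYYYMMDD
  const nextMinute = minuteStamp(new Date(now.getTime() + 60000));
  return `PartitionKey le '${partition}' and RowKey lt '${nextMinute}'`;
}
```

For 2024-01-03 10:25:30 UTC this produces PartitionKey le '20240103' and RowKey lt '202401031026', which includes every row stamped 10:25 regardless of its GUID suffix.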
Stop workflow
Stop/pause workflow for Object Z
Query top 1 jobs in Jobs table by PartitionKey
If any AND State == Pending then update to "Cancelled"
(No need to bother cleaning Timetable it will clean itself up "when time comes")
Start workflow
Create Pending record in Jobs table
Create record in Timetable
In terms of "executing the thing", I would use an Azure Function or a scheduler of some kind to execute the pending jobs every 5 minutes or so.
Any comments or suggestions would be highly appreciated.
Thanks!

How about using Service Bus instead? The BrokeredMessage class has a property called ScheduledEnqueueTimeUtc. You can just schedule when you want your jobs to run via the ScheduledEnqueueTimeUtc property, and then fuggedabouddit. You can then have a triggered webjob that monitors the Service Bus messaging queue, and will be triggered very near when the job message is enqueued. I'm a big fan of relying on existing services to minimize the coding needed.


Azure Data Factory - Tumbling Window Trigger - Limit hours it is running

With an Azure Data Factory "Tumbling Window" trigger, is it possible to limit the hours of each day that it triggers during (adding a window you might say)?
For example I have a Tumbling Window trigger that runs a pipeline every 15 minutes. This is currently running 24/7 but I'd like it to only run during business hours (0700-1900) to reduce costs.
Edit:
I played around with this, and found another option which isn't ideal from a monitoring perspective, but it appears to work:
Create a new pipeline with a single "If Condition" step with a dynamic Expression like this:
@and(greater(int(formatDateTime(utcnow(),'HH')),6),less(int(formatDateTime(utcnow(),'HH')),20))
In the true case activity, add an Execute Pipeline step executing your original pipeline (with "Wait on completion" ticked)
In the false case activity, add a wait step which sleeps for X minutes
The longer you sleep for, the longer you can possibly encroach on your window, so adjust that to match.
I need to give it a couple of days before I check the billing on the portal to see if it has reduced costs. At the moment I'm assuming a job which just sleeps for 15 minutes won't incur the costs that one running and processing data would.
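To see which hours the If Condition expression lets through, here is the same predicate written as a small JavaScript function. This is only a model of the expression for checking the boundaries, not something Data Factory runs; the function name is made up.

```javascript
// Mirrors and(greater(hour, 6), less(hour, 20)), where hour is the UTC hour.
function inWindow(date) {
  const hour = date.getUTCHours();
  return hour > 6 && hour < 20;
}

// Collect every UTC hour of the day for which the predicate is true.
const allowedHours = [];
for (let h = 0; h < 24; h++) {
  if (inWindow(new Date(Date.UTC(2024, 0, 1, h, 0)))) {
    allowedHours.push(h);
  }
}
```

The allowed hours come out as 7 through 19, so runs still start between 19:00 and 19:59. If the window should close at 19:00 sharp, the upper bound would be 19 rather than 20. Note also that utcnow() is UTC, so local business hours need an offset.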
There is no easy way, but you can create two deployment pipelines for the same job in Azure DevOps, and as soon as your 0700 to 1900 window expires, replace that job with a dummy job using an Azure DevOps pipeline.

Nodejs agenda job scheduler - How to group multiple jobs that will run in the same database scan into 1 batch job?

Problem
I have a purchase service that users can use to buy or rent digital assets such as games, media, and movies. When a purchase event happens, I create a job and schedule it to run at the calculated expiry date to remove the key for that asset.
Everything works. But it would be better if I can group those jobs that will run in the same agenda db scan into 1 batch job to remove multiple keys.
This would significantly reduce the number of db read/write/delete operations on both the keys and agenda collections. It would also free up memory most of the time, because instead of storing 100+ jobs to run in a scan, it stores just 1 job that removes 100+ keys.
Research
The closest feature I found in Agenda repo is unique(). Which allows user to find and modify the existing job that matches the fields defined in unique(). If it can concat new jobs to the existing job, that will solve my case.
Implementation
Before diving in and modifying the package, I want to check whether anyone has already solved this problem and has some thoughts to share.
Another solution without touching the package is to create an in-memory dictionary to accumulate jobs for a specific db scan with this strategy:
dict = {}
//if key expires in 1597202228 then put to dict slot:
dict = {
1597300000: [jobA]
}
//another key expires in 1597202238 then put to the same slot:
dict = {
1597300000: [jobA,jobB]
}
//the latch condition to put job batch into agenda:
if dict_size == dict_allocated_memory then put the whole dict into db.
if a batch_size = batch_limit then put the batch into db and remove the batch in dict.
if the batch is going to expire in the next db scan then put the batch (it may be empty, has a few jobs...) into db and remove the batch in dict.
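The strategy above could be sketched roughly as follows. This is an in-memory illustration only: the class and option names are invented, the flush callback stands in for "put the batch into agenda", and the slot width of 100000 seconds is inferred from the example values (1597202228 landing in slot 1597300000).

```javascript
class BatchAccumulator {
  // flush(slot, jobs) is a hypothetical callback that persists one batch.
  constructor({ scanIntervalSec, batchLimit, flush }) {
    this.scanIntervalSec = scanIntervalSec;
    this.batchLimit = batchLimit;
    this.flush = flush;
    this.slots = new Map(); // slot timestamp -> array of jobs
  }

  // Round the expiry up to the next scan boundary.
  slotFor(expiresAtSec) {
    return Math.ceil(expiresAtSec / this.scanIntervalSec) * this.scanIntervalSec;
  }

  add(expiresAtSec, job) {
    const slot = this.slotFor(expiresAtSec);
    const batch = this.slots.get(slot) || [];
    batch.push(job);
    this.slots.set(slot, batch);
    // Latch: a full batch is written out immediately.
    if (batch.length >= this.batchLimit) this.flushSlot(slot);
  }

  flushSlot(slot) {
    const batch = this.slots.get(slot);
    if (!batch) return;
    this.slots.delete(slot);
    this.flush(slot, batch);
  }

  // Latch: write out every batch that falls due by the next scan.
  flushDue(nowSec) {
    for (const slot of [...this.slots.keys()]) {
      if (slot <= nowSec + this.scanIntervalSec) this.flushSlot(slot);
    }
  }
}
```

The memory-based latch from the list above is left out here, since measuring the dictionary's size depends on how jobs are represented.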

Running a repetitive task in Node.js for each row in a postgres table on a different interval for each row

What would be a good approach to running a repetitive task for each row in a large postgres db table on a different per row interval in Node.js.
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a postgres table with no upper limit on the number of rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) Running a repetitive task in Node can be done with the built-in JavaScript function 'setInterval'.
// run intervalFnc() every 5 seconds
const timerId = setInterval(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to stop running it:
clearInterval(timerId);
Then your interval function can do the actual work. An alternative would be to use cron (linux), or some OS process scheduler to trigger the function. I would use this method if you want to do it every minute, and a cron job if you want to do it every hour (in between these times becomes more debatable).
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
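To iterate over the whole table in blocks rather than loading it at once, the limit/offset idea can be wrapped in an async generator. This is a minimal sketch: eachRow and fetchPage are names invented for the example, and fetchPage is assumed to be a function you supply that returns a Promise of rows.

```javascript
// fetchPage(limit, offset) is a hypothetical function, for example:
//   (limit, offset) => knex('teams').orderBy('id').limit(limit).offset(offset)
// A stable ORDER BY is assumed so that pages do not overlap.
async function* eachRow(fetchPage, pageSize = 100) {
  for (let offset = 0; ; offset += pageSize) {
    const rows = await fetchPage(pageSize, offset);
    yield* rows;
    // A short (or empty) page means the end of the table was reached.
    if (rows.length < pageSize) return;
  }
}
```

Usage would look like: for await (const team of eachRow(fetchPage, 500)) { ... }. For very large tables, large offsets get slow, so paging by "id greater than the last id seen" may be a better fit.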
B-2) Indexed access is important for performance if your tables are very large. I would recommend including a status flag field in your table to note which records are 'in-process', plus a "next-review-timestamp" field, with both fields indexed. Retrieve the records that have status_flag='in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
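The two conditions from the question (messages waiting, and N seconds since the last notification) can also be expressed as a small pure function, which makes the rule easy to test separately from the database. The field names below are hypothetical and would need to match your schema.

```javascript
// team: { waitingMessages, lastNotifiedAtMs, intervalSec } (hypothetical shape)
function isNotificationDue(team, nowMs) {
  const elapsedMs = nowMs - team.lastNotifiedAtMs;
  return team.waitingMessages > 0 && elapsedMs >= team.intervalSec * 1000;
}

// Value to store in next_review_timestamp after a notification is sent.
function nextReviewTimestamp(team, nowMs) {
  return nowMs + team.intervalSec * 1000;
}
```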
Hope this helps!

creating recurring events in nodejs to update or insert MySQL table

I have a MySQL table tasks. In tasks, we can create a normal task or a recurring task that will automatically create a new task in the MySQL tasks table and send an email notification to the user that a task has been created. After a lot of research, I found that there are several ways to do it:
MySQL events
Kue, bull, agenda(node.js scheduling libraries)
Using a cron job to monitor every day for tasks
The recurring tasks would repeat daily, weekly, monthly, or yearly.
We must put an option to remove the recurring event at any time. What would be a nice and clean solution?
As you've identified there are a number of ways of going about this, here's how I would do it but I'm making a number of assumptions such as how many tasks you're likely to have and how flexible the system is going forward.
Assuming you're unlikely to change the task time options (daily, weekly, monthly, yearly), each task would have the fields last_run_date and next_run_date. Every time a task is run, I would update these fields and create an entry in a log table such as task_run_log, which also stores the date/time the task was run.
I would then have a cron job which fires an HTTP message to a nodejs service. This web service would look through the table of tasks, find the ones that need to be executed that day, and dispatch a message for each task into some sort of queue (AWS SQS, GCP Pub/Sub, Apache Kafka, etc). Each message in the queue represents a single task that needs to be carried out; workers subscribe to this queue and process the tasks themselves. Once a worker has processed a job, it makes the log entry and updates the last_run_date and next_run_date fields. If a task fails, the worker moves that message into an error queue and logs a failed task in the task log.
This system would be robust as any failed jobs would exist as failed jobs in your database and would appear in an error queue (which you can either drain to remove the failed jobs, or you can replay them into the normal queue when the worker is fixed). It would also scale to many tasks that have to happen each day as you can scale up your workers. You also won't be flooding cron, your cron job will just send a single HTTP request each day to your HTTP service which kicks off the processing.
You can also setup alerts based on whether the cron job runs or not to make sure the process gets kicked off properly.
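Computing next_run_date for the four fixed options could look roughly like this. It is a sketch under a few assumptions: dates are handled in UTC, the function names are invented, and month-end dates are clamped (a task created on Jan 31 runs on the last day of February), which is one possible policy rather than the only one.

```javascript
function daysInMonth(year, month) {
  // Day 0 of the following month is the last day of this month.
  return new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
}

function addMonthsClamped(date, months) {
  const total = date.getUTCMonth() + months;
  const year = date.getUTCFullYear() + Math.floor(total / 12);
  const month = ((total % 12) + 12) % 12;
  const day = Math.min(date.getUTCDate(), daysInMonth(year, month));
  return new Date(Date.UTC(year, month, day,
    date.getUTCHours(), date.getUTCMinutes(), date.getUTCSeconds()));
}

const DAY_MS = 24 * 60 * 60 * 1000;

// frequency is one of 'daily', 'weekly', 'monthly', 'yearly'.
function nextRunDate(lastRun, frequency) {
  switch (frequency) {
    case 'daily': return new Date(lastRun.getTime() + DAY_MS);
    case 'weekly': return new Date(lastRun.getTime() + 7 * DAY_MS);
    case 'monthly': return addMonthsClamped(lastRun, 1);
    case 'yearly': return addMonthsClamped(lastRun, 12);
    default: throw new Error(`Unknown frequency: ${frequency}`);
  }
}
```

A worker would call nextRunDate(task.next_run_date, task.frequency) after a successful run and write the result back together with last_run_date. Note that clamping can drift (Jan 31, then Feb 29, then Mar 29), so storing the original day of month is worth considering if that matters.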
I had to do something very similar, you can use the npm module node-schedule
node-schedule has many features. You first create a rule, which determines when the job runs, and then schedule the job, which is where you define what the job does and activate it. Below is an example from my code that sets a job to run at midnight every day.
const schedule = require('node-schedule');
var rule = new schedule.RecurrenceRule();
rule.dayOfWeek = [0, new schedule.Range(1, 6)]; // every day of the week
rule.hour = 0;   // an unset hour or minute matches any value,
rule.minute = 0; // so both are pinned to run at midnight only
var j = schedule.scheduleJob(rule, function(){
sqlUpdate(server);
});
This may not exactly fit all of your requirements alone but there are other features and setups you can do.
For example you can cancel any job with the cancel function
j.cancel()
You can also set start times and end times like so as shown in the npm page
let startTime = new Date(Date.now() + 5000);
let endTime = new Date(startTime.getTime() + 5000);
var j = schedule.scheduleJob({ start: startTime, end: endTime, rule: '*/1 * * * * *' }, function(){
console.log('Time for tea!');
});
There are also other options for scheduling the date and time as this also follows the cron format. Meaning you can set dynamic times
var j = schedule.scheduleJob('42 * * * *', function(){
console.log();
});
As such this would allow node.js to handle everything you need. You would likely need to set up a system to keep track of the scheduled jobs (var j) But it would allow you to cancel it and schedule it to your desire.
It additionally can allow you to reschedule, retrieve the next scheduled event and you can have multiple date formats.
If you need the jobs to persist after the process is turned off and on again or reset, you will need to save the details of each job. A MySQL database would make sense here: upon startup, the code can make a quick pull and restart all of the created tasks based on the data from the database, and when you cancel a job you just delete it from the database. Note that the process needs to be running for this to work; a job will not run while the process is turned off.
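Keeping track of the scheduled jobs and restoring them on startup could be sketched as a small registry. Everything here is illustrative: JobRegistry is an invented name, scheduleFn is assumed to behave like schedule.scheduleJob (it takes a spec and a callback and returns an object with cancel()), and the task rows are assumed to carry an id and a cron string.

```javascript
class JobRegistry {
  constructor(scheduleFn) {
    this.scheduleFn = scheduleFn; // e.g. schedule.scheduleJob from node-schedule
    this.jobs = new Map();        // task id -> scheduled job handle
  }

  // Schedule (or reschedule) a task; task is { id, cron } (hypothetical shape).
  add(task, handler) {
    this.cancel(task.id);
    const job = this.scheduleFn(task.cron, () => handler(task));
    this.jobs.set(task.id, job);
    return job;
  }

  // Cancel a task's job; the caller also deletes the row from the database.
  cancel(taskId) {
    const job = this.jobs.get(taskId);
    if (!job) return false;
    job.cancel();
    this.jobs.delete(taskId);
    return true;
  }

  // On startup: re-create a job for every task row loaded from the database.
  restore(tasks, handler) {
    for (const task of tasks) this.add(task, handler);
  }
}
```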

How to process scheduled, recurring jobs with Kue?

In my webapp, users can create recurring invoices that need to be generated and sent out at certain dates each month. For example, an invoice might need to be sent out on the 5th of every month.
I am using Kue to process all my background jobs so I want to do it in this case as well.
My current solution is to use setInterval() to create a processRecurringInvoices job every hour. This job will then find all recurring invoices from database and create a separate generateInvoice job for each recurring invoice.
The generateInvoice job will then actually generate the invoice, and if needed, will also in turn create a sendInvoiceToEmail job that will email the invoice.
At the moment this solution looks good to me, because it has a nice separation of concerns, however, I have the following questions:
I am not sure if I should wait for all the 'child' jobs to complete before I call done() on the main processRecurringInvoices job?
Where should I handle errors? Should I pass them back to the processRecurringInvoices job or should I handle them separately for each job?
How can I make sure that if processing takes an extra long time (more than an hour), and either processRecurringInvoices or any of the child jobs are still running, the processRecurringInvoices job is not created again? Kind of like a unique job, or mutual exclusion?
Instead of "processRecurringInvoices" it might be easier to think of it as a job that initiates other, separate invoice-processing jobs. Thinking of it this way, once the invoice processing jobs have been enqueued, you can safely call done() on the job that kicks them all off.
Thinking of the problem in the way described in question 1, errors should be handled within each of the individual invoice processing jobs. If an error occurred finding potential invoice jobs, then that would probably be handled in the processRecurringInvoices jobs.
You can use kue.Job.rangeByType() to search for currently active jobs. If a job is active, you can skip kicking it off again.
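The "skip if already active" check could be wrapped in a small helper like the one below. It is a sketch only: enqueueUnlessBusy, isBusy and enqueue are invented names, and isBusy is assumed to be something you write yourself, for instance a promisified wrapper around kue.Job.rangeByType for the 'active' state.

```javascript
// isBusy: hypothetical async function resolving to true while a previous
//   processRecurringInvoices job (or one of its children) is still active.
// enqueue: hypothetical async function that creates the new job.
async function enqueueUnlessBusy(isBusy, enqueue) {
  if (await isBusy()) {
    return false; // previous run still in progress, skip this tick
  }
  await enqueue();
  return true;
}
```

The check and the enqueue are two separate steps, so two processes running this at the same moment could both pass the check. With a single setInterval in one process that is unlikely to matter, but it is not a strict mutual exclusion.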
