Run a cron job every 30 minutes after an onCreate Firestore event - node.js

I want a cron job/scheduler that runs every 30 minutes after an onCreate event occurs in Firestore. The cron job should trigger a cloud function that picks up the documents created in the last 30 minutes, validates them against a JSON schema, and saves them in another collection. How do I achieve this by programmatically writing such a scheduler?
What would also be a fail-safe mechanism, with some sort of queuing/tracking of the documents created before the cron job runs, so they can be pushed to another collection?

Building a queue with Firestore is simple and fits your use case perfectly. The idea is to write tasks to a queue collection with a due date; they are then processed when they become due.
Here's an example.
Whenever your initial onCreate event for your collection occurs, write a document with the following data to a tasks collection:
duedate: new Date() + 30 minutes
type: 'yourjob'
status: 'scheduled'
data: '...' // <-- put whatever data here you need to know when processing the task
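For illustration, the enqueueing trigger could look like this (a minimal sketch; the items collection name and the initialization are assumptions, adjust to your schema):
import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

// Sketch: on every create in the watched collection, enqueue a task due in 30 minutes.
export const enqueueTask = functions.firestore
  .document('items/{itemId}')
  .onCreate((snapshot, context) => {
    const due = admin.firestore.Timestamp.fromMillis(Date.now() + 30 * 60 * 1000);
    return db.collection('tasks').add({
      duedate: due,
      type: 'yourjob',
      status: 'scheduled',
      data: { itemId: context.params.itemId }, // whatever the worker needs
    });
  });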
Have a worker pick up available work regularly - e.g. every minute depending on your needs
// Define what happens for each task type
type Workers = { [type: string]: (data: any) => Promise<unknown> };

const workers: Workers = {
  yourjob: (data) => db.collection('xyz').add({ foo: data }),
};

// The following needs to be scheduled
export const checkQueue = functions.https.onRequest(async (req, res) => {
  // Consistent timestamp
  const now = admin.firestore.Timestamp.now();
  // Check which tasks are due
  const query = db.collection('tasks')
    .where('duedate', '<=', now)
    .where('status', '==', 'scheduled');
  const tasks = await query.get();
  // Process tasks and mark them in the queue as done
  const jobs: Promise<unknown>[] = [];
  tasks.forEach((snapshot) => {
    const { type, data } = snapshot.data();
    console.info('Executing job for task ' + JSON.stringify(type) + ' with data ' + JSON.stringify(data));
    const job = workers[type](data)
      // Update task doc with status or error
      .then(() => snapshot.ref.update({ status: 'complete' }))
      .catch((err) => {
        console.error('Error when executing worker', err);
        return snapshot.ref.update({ status: 'error' });
      });
    jobs.push(job);
  });
  return Promise.all(jobs).then(() => {
    res.send('ok');
    return true;
  }).catch((onError) => {
    console.error('Error', onError);
    res.status(500).send('error'); // still respond so the caller isn't left hanging
  });
});
You have different options for triggering the queue check when a task is due:
Using an HTTP function as in the example above. This requires you to call the function regularly so it executes and checks whether there is a task to be done. Depending on your needs you could do it from your own server or use a service like cron-job.org to perform the calls. Note that the HTTP function will be publicly available, so others could potentially call it as well. However, if you make your check code idempotent, that shouldn't be an issue.
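If two overlapping runs could pick up the same task, one way to keep things idempotent is to claim each task in a transaction before running its worker. A sketch (same db as above):
// Sketch: atomically flip status from 'scheduled' to 'processing';
// only the run that wins the transaction executes the worker.
async function claimTask(ref: FirebaseFirestore.DocumentReference): Promise<boolean> {
  return db.runTransaction(async (tx) => {
    const snap = await tx.get(ref);
    if (!snap.exists || snap.data()!.status !== 'scheduled') {
      return false; // already claimed or finished by another run
    }
    tx.update(ref, { status: 'processing' });
    return true;
  });
}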
Use the Firebase "internal" cron option that uses Cloud Scheduler internally. Using that you can directly trigger the queue checking:
export const scheduledFunctionCrontab =
  functions.pubsub.schedule('* * * * *').onRun((context) => {
    console.log('This will be run every minute!');
    // Include the checkQueue code from above here
  });
Using such a queue also makes your system more robust: if something goes wrong in between, you will not lose tasks that would otherwise only exist in memory; as long as they are not marked as processed, a fixed worker will pick them up and reprocess them. This of course depends on your implementation.

You can trigger a cloud function on the Firestore onCreate event and have it schedule a Cloud Task to run 30 minutes later. Cloud Tasks gives you queuing and retry mechanisms out of the box.
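A minimal sketch of that trigger with the @google-cloud/tasks client (project, location, queue name, and handler URL are placeholders):
import { CloudTasksClient } from '@google-cloud/tasks';
import * as functions from 'firebase-functions';

const tasksClient = new CloudTasksClient();
// Placeholders: substitute your own project, location and queue.
const queuePath = tasksClient.queuePath('my-project', 'us-central1', 'validate-queue');

export const scheduleValidation = functions.firestore
  .document('items/{itemId}')
  .onCreate((snapshot, context) => {
    return tasksClient.createTask({
      parent: queuePath,
      task: {
        // Deliver the task 30 minutes from now.
        scheduleTime: { seconds: Math.floor(Date.now() / 1000) + 30 * 60 },
        httpRequest: {
          httpMethod: 'POST',
          url: 'https://<region>-<project>.cloudfunctions.net/validateDoc', // your handler
          headers: { 'Content-Type': 'application/json' },
          body: Buffer.from(JSON.stringify({ itemId: context.params.itemId })).toString('base64'),
        },
      },
    });
  });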

An easy way is to add a created field with a timestamp and then have a scheduled function run at a predefined interval (say, once a minute) that executes certain code for all records where created >= NOW - 31 mins AND created <= NOW - 30 mins (pseudocode). If your time-precision requirements are not extremely high, that should work for most cases.
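A sketch of that windowed check as a scheduled function (the items collection and created field names are assumptions):
export const processRecent = functions.pubsub.schedule('every 1 minutes').onRun(async () => {
  const now = admin.firestore.Timestamp.now();
  const from = admin.firestore.Timestamp.fromMillis(now.toMillis() - 31 * 60 * 1000);
  const to = admin.firestore.Timestamp.fromMillis(now.toMillis() - 30 * 60 * 1000);
  const snap = await db.collection('items')
    .where('created', '>=', from)
    .where('created', '<=', to)
    .get();
  snap.forEach((doc) => {
    // validate against your JSON schema and copy to the target collection
  });
});
Note that if a run is skipped or delayed, documents can fall out of the window and be missed.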
If this doesn't suit your needs, you can add a Cloud Task (Google Cloud product). The details are specified in this good article.

Related

Prevent the same job from running twice with node-cron

I'm working with Node and TypeScript, using node-cron 2.0 to schedule a background operation to run every hour.
cron.schedule("0 0 * * * *", () => {
purgeResponsesSurveys();
});
I'm concerned about what happens if the method doesn't finish within 1 hour since I don't want two instances of my method to run at the same time.
What are the best practices to prevent the scheduler from invoking the function purgeResponsesSurveys if it is already running from the previous hourly invocation?
You can use a semaphore to prevent parallel calls.
You will need to know when purgeResponsesSurveys is done. So if it's asynchronous, it will need to return a Promise or accept a callback that is called when purgeResponsesSurveys finishes.
I used the semaphore npm package.
Here is a small example/simulation.
const semaphore = require('semaphore');
const sem = semaphore(1); // capacity 1 == mutex

simulateCron(function () {
  console.log('cron was triggered');
  // wrap the task with the mutex
  sem.take(function () {
    longTask(function () {
      sem.leave();
    });
  });
});

function longTask(cb) {
  console.log('Start longTask');
  setTimeout(function () {
    cb();
    console.log('Done longTask');
  }, 3000);
}

function simulateCron(cb) {
  setInterval(cb, 500);
}
// output
cron was triggered
Start longTask
cron was triggered
cron was triggered
cron was triggered
cron was triggered
cron was triggered
Done longTask
Start longTask
cron was triggered
...
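If purgeResponsesSurveys returns a Promise, a plain boolean guard achieves the same mutual exclusion without an extra dependency (a sketch, assuming the function is async):
let running = false;
cron.schedule("0 0 * * * *", async () => {
  if (running) return; // previous run still in progress, skip this tick
  running = true;
  try {
    await purgeResponsesSurveys();
  } finally {
    running = false; // always release, even on error
  }
});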

Bull queue: block the worker from processing a job fetched with getJob in a different function

I need an approach to block the worker from processing a job while that job is being handled via getJob in a different function. I've looked around but couldn't find a solution for this.
I have the following setup.
In Node.js with Express, I have a worker node.
A job is created in the delayed state.
The job is accessed in a different function:
async function jobReader(id) {
  const job = await queue.getJob(id);
  /* do some stuff */
  await job.remove();
}
A worker node independently processes jobs. A job will only be processed once its delay has elapsed.
queue.process(async (job) => {
  /* do some stuff */
})
queue.getJob(id) doesn't block the worker from processing the job, so the worker and jobReader race to process it. I write results to the DB according to the job status, so this race condition is not acceptable.
Apparently getJob does not block the worker. Is there any way to lock or block the worker from working on a job while it is being read by some other function via getJob?
Any help or documentation will be appreciated.
Thanks
I guess you should change your architecture a little. The worker node does exactly what it is intended for: it takes jobs and runs them. So instead of blocking the queue in some way, you should only add the job to the queue once the user has approved/canceled/failed it (or did not send a response within 120 seconds).
If I understood you right, this should give you an idea of how to keep control over jobs between different requests:
// this is YOUR queue object. I don't know the implementation, but think
// of it like this..
const queue = new Queue();

// a map holding the pending jobs which are not yet timed out
// or explicitly approved/canceled/failed by the user
const waitingJobs = {};

// This could be the location where the user calls the API to create a job.
app.post('/job', (req, res) => {
  // create the job as the user requested it
  const job = createJob(req);
  // Add a timeout of 120 seconds. If the user does not respond
  // within that time, the job is added to the queue.
  const timeout = setTimeout(() => {
    queue.add(job);
    // remove the reference after adding, garbage collection..
    waitingJobs[job.id] = null;
    // job is added to the queue automatically after 120 seconds
  }, 120 * 1000);
  // store the timeout in the job object!
  job.timeout = timeout;
  // store the waiting job!
  waitingJobs[job.id] = job;
  // respond to the user, send back the id so the client can do another
  // request if wanted.
  res.status(200).json({ message: 'Job created!', id: job.id });
});

app.post('/job/:id', (req, res) => {
  const id = req.params.id;
  if (!id) {
    res.status(400).json('bad job id provided');
    return;
  }
  // get the waiting job:
  const job = waitingJobs[id];
  if (!job) {
    res.status(400).json('Job not found OR job already processed. Job id: ' + id);
    return;
  }
  // now the user responded to a specific job; clear the
  // timeout first so the job won't be added to the queue!
  if (job.timeout) {
    clearTimeout(job.timeout);
  }
  // Now the job won't be processed somewhere else!
  // you can do whatever you want...
  // example:
  // get the action
  const action = req.query.action;
  if (!action) {
    res.status(400).json('Bad action provided: ' + action);
    return;
  }
  if (action === 'APPROVE') {
    // job approved!, add it to the queue so the worker node
    // can process it..
    queue.add(job);
  }
  if (action === 'CANCEL') {
    // do something else...
  }
  // etc..
  // ofc clear the job reference after you did something..
  waitingJobs[job.id] = null;
  // since everything worked, inform the user the job will now be processed!
  res.status(200).json('Job ' + job.id + ' will now be processed');
});

Do not process next job until previous job is completed (BullJS/Redis)?

Basically, each of the clients (each has a clientId associated with it) can push messages, and it is important that a second message from the same client isn't processed until the first one has finished processing, even though the client can send multiple messages in a row. The messages are ordered, and multiple clients sending messages should ideally not interfere with each other. Importantly, a job shouldn't be processed twice.
I thought that using Redis I might be able to fix this issue. I started with some quick prototyping using the bull library, but I am clearly not doing it well; I was hoping someone would know how to proceed.
This is what I tried so far:
Create jobs and add them to the same queue name for one process, using the clientId as the job name.
Consume jobs while waiting random amounts of time, in 2 separate processes.
I tried adding the default locking provided by the library that I am using (bull), but it locks on the jobId, which is unique for each job, not on the clientId.
What I would want to happen:
One of the consumers can't take the job from the same clientId until the previous one is finished processing it.
They should be able to, however, get items from different clientIds in parallel without problem (asynchronously). (I haven't gotten this far, I am right now simply dealing with only one clientId)
What I get:
Both consumers consume as many items as they can from the queue without waiting for the previous item for the clientId to be completed.
Is Redis even the right tool for this job?
Example code
// ./setup.ts
import Queue from 'bull';
import * as uuid from 'uuid';

// Check that when a message is taken from a place, no other message is taken
// To do that test, have two processes that process messages and one that sets messages, and make the job take a long time
// queue for each room https://stackoverflow.com/questions/54178462/how-does-redis-pubsub-subscribe-mechanism-works/54243792#54243792
// https://groups.google.com/forum/#!topic/redis-db/R09u__3Jzfk
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353

export async function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

export interface JobData {
  id: string;
  v: number;
}

export const queue = new Queue<JobData>('messages', 'redis://127.0.0.1:6379');

queue.on('error', (err) => {
  console.error('Uncaught error on queue.', err);
  process.exit(1);
});

export function clientId(): string {
  return uuid.v4();
}

export function randomWait(minms: number, maxms: number): Promise<void> {
  const ms = Math.random() * (maxms - minms) + minms;
  return sleep(ms);
}

// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
// eslint-disable-next-line @typescript-eslint/ban-ts-comment
// @ts-ignore
queue.LOCK_RENEW_TIME = 5 * 60 * 1000;
// ./create.ts
import { queue, randomWait } from './setup';

const MIN_WAIT = 300;
const MAX_WAIT = 1500;

async function createJobs(n = 10): Promise<void> {
  await randomWait(MIN_WAIT, MAX_WAIT);
  // always same Id
  const clientId = Math.random() > 1 ? 'zero' : 'one';
  for (let index = 0; index < n; index++) {
    await randomWait(MIN_WAIT, MAX_WAIT);
    const job = { id: clientId, v: index };
    await queue.add(clientId, job).catch(console.error);
    console.log('Added job', job);
  }
}

export async function create(nIds = 10, nItems = 10): Promise<void> {
  const jobs = [];
  await randomWait(MIN_WAIT, MAX_WAIT);
  for (let index = 0; index < nIds; index++) {
    await randomWait(MIN_WAIT, MAX_WAIT);
    jobs.push(createJobs(nItems));
    await randomWait(MIN_WAIT, MAX_WAIT);
  }
  await randomWait(MIN_WAIT, MAX_WAIT);
  await Promise.all(jobs);
  process.exit();
}

(function mainCreate(): void {
  create().catch((err) => {
    console.error(err);
    process.exit(1);
  });
})();
// ./consume.ts
import { queue, randomWait, clientId } from './setup';

function startProcessor(minWait = 5000, maxWait = 10000): void {
  queue
    .process('*', 100, async (job) => {
      console.log('LOCKING: ', job.lockKey());
      await job.takeLock();
      const name = job.name;
      const processingId = clientId().split('-', 1)[0];
      try {
        console.log('START: ', processingId, '\tjobName:', name);
        await randomWait(minWait, maxWait);
        const data = job.data;
        console.log('PROCESSING: ', processingId, '\tjobName:', name, '\tdata:', data);
        await randomWait(minWait, maxWait);
        console.log('PROCESSED: ', processingId, '\tjobName:', name, '\tdata:', data);
        await randomWait(minWait, maxWait);
        console.log('FINISHED: ', processingId, '\tjobName:', name, '\tdata:', data);
      } catch (err) {
        console.error(err);
      } finally {
        await job.releaseLock();
      }
    })
    .catch(console.error); // Catches initialization errors
}

startProcessor();
This is run using 3 different processes, which you might start like this (although I use different tabs for a clearer view of what is happening):
npx ts-node consume.ts &
npx ts-node consume.ts &
npx ts-node create.ts &
I'm not familiar with node.js. But for Redis, I would try this.
Let's say you have client_1 and client_2; they are both publishers of events.
You have three machines: consumer_1, consumer_2, consumer_3.
Establish a list of tasks in Redis, e.g., JOB_LIST.
Clients put (LPUSH) jobs into this JOB_LIST in a specific form, like "CLIENT_1:[jobcontent]", "CLIENT_2:[jobcontent]".
Each consumer takes out jobs with a blocking pop (the BRPOP command of Redis) and processes them.
For example, consumer_1 takes out a job whose content is CLIENT_1:[jobcontent]. It parses the content and recognizes it's from CLIENT_1. Then it checks whether some other consumer is already processing CLIENT_1; if not, it locks the key to indicate that it's processing CLIENT_1.
It does so by setting a key "CLIENT_1_PROCESSING" with content "consumer_1", using the Redis SETNX command (set if the key does not exist), with an appropriate timeout. For example, if the task normally takes one minute to finish, set the key's timeout to five minutes, just in case consumer_1 crashes and holds the lock indefinitely.
If the SETNX returns 0, it means it failed to acquire the lock on CLIENT_1 (someone is already processing a job of client_1). Then it returns the job (the value "CLIENT_1:[jobcontent]") to the left side of JOB_LIST using the Redis LPUSH command. Then it might wait a bit (sleep a few seconds) and RPOP another task from the right side of the LIST. If this time SETNX returns 1, consumer_1 acquires the lock. It goes on to process the job; after it finishes, it deletes the key "CLIENT_1_PROCESSING", releasing the lock. Then it goes on to RPOP another job, and so on.
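A sketch of that consumer loop in Node with ioredis (processJob is a hypothetical stand-in for your job logic; SET ... NX EX is the modern form of SETNX with a timeout):
import Redis from 'ioredis';
const redis = new Redis('redis://127.0.0.1:6379');

async function consumeLoop(consumerName: string): Promise<void> {
  while (true) {
    // BRPOP blocks until a job is available on JOB_LIST.
    const [, raw] = (await redis.brpop('JOB_LIST', 0))!;
    const sep = raw.indexOf(':');
    const client = raw.slice(0, sep); // e.g. 'CLIENT_1'
    const jobcontent = raw.slice(sep + 1);
    // Try to acquire the per-client lock with a 5-minute timeout.
    const locked = await redis.set(`${client}_PROCESSING`, consumerName, 'EX', 300, 'NX');
    if (!locked) {
      // someone else is processing this client; put the job back and wait a bit
      await redis.lpush('JOB_LIST', raw);
      await new Promise((r) => setTimeout(r, 2000));
      continue;
    }
    try {
      await processJob(client, jobcontent);
    } finally {
      await redis.del(`${client}_PROCESSING`); // release the lock
    }
  }
}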
Some things to consider:
The JOB_LIST is not fair; e.g., earlier jobs might be processed later.
The locking part is a bit rudimentary, but will suffice.
----------update--------------
I've figured out another way to keep tasks in order.
For each client (producer), build a list, like "client_1_list", and push jobs into the left side of the list.
Save all the client names in a list "client_names_list", with values "client_1", "client_2", etc.
For each consumer (processor), iterate over the "client_names_list". For example, consumer_1 gets "client_1" and checks whether the key of client_1 is locked (someone is already processing a task of client_1); if not, it right-pops a value (job) from client_1_list and locks client_1. If client_1 is locked, it sleeps (probably one second) and iterates to the next client, "client_2" for example, checks the keys, and so on.
This way, each client's (task producer's) tasks are processed in the order they were submitted.
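A short sketch of this variant, reusing the redis client and the hypothetical processJob from the previous sketch:
// Round-robin over client lists; skip clients that are locked.
async function pollClients(consumerName: string): Promise<void> {
  const clients = await redis.lrange('client_names_list', 0, -1);
  for (const client of clients) {
    const locked = await redis.set(`${client}_PROCESSING`, consumerName, 'EX', 300, 'NX');
    if (!locked) continue; // busy, try the next client
    try {
      const job = await redis.rpop(`${client}_list`);
      if (job) await processJob(client, job);
    } finally {
      await redis.del(`${client}_PROCESSING`);
    }
  }
}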
EDIT: I found the problem regarding BullJS starting jobs in parallel on one processor: we were using named jobs and defining many named process functions on one queue/processor. The default concurrency factor for a queue/processor is 1, so the queue should not process any jobs in parallel.
The problem with our setup is that if you define many (named) process-handlers on one queue, the concurrency is added up with each process-handler function: if you define three named process-handlers, you get a concurrency factor of 3 on that queue across all the defined named jobs.
So just define one named job per queue for queues where parallel processing should not happen, and all jobs will run sequentially one after the other (see the sketch below).
That can be important, e.g., when pushing a high number of jobs onto the queue where the processing involves API calls that would give errors if handled in parallel.
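In Bull that boils down to registering a single named processor on such a queue, e.g. (sequentialQueue and handle are placeholders):
// One named process-handler with concurrency 1: jobs run strictly one after the other.
sequentialQueue.process('my-sequential-job', 1, async (job) => {
  await handle(job.data);
});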
The following text is my first approach at answering the OP's question and describes just a workaround to the problem. So better just go with my edit :) and configure your queues the right way.
I found an easy solution to the OP's question.
In fact BullJS was processing many jobs in parallel on one worker instance:
Let's say you have one worker instance up and running and push 10 jobs onto the queue; then possibly that worker starts all processes in parallel.
My research on BullJS queues suggested that this is not intended behavior: one worker (also called processor by BullJS) should only start a new job from the queue when it's in an idle state, i.e., not processing a former job.
Nevertheless BullJS kept starting jobs in parallel on one worker.
In our implementation that led to big problems during API calls, most likely caused by too many API calls at a time. Tests showed that when starting only one worker the API calls finished just fine and returned status 200.
So how do you process one job after the other, once the previous one is finished, if BullJS does not do that for us (just what the OP asked)?
We first experimented with delays and other BullJS options, but that's kind of a workaround and not the exact solution to the problem we were looking for. At least we did not get BullJS to stop processing more than one job at a time.
So we did it ourselves and started one job after the other.
The solution was rather simple for our use case after looking into the BullJS API reference (BullJS API Ref).
We just used a for-loop to start the jobs one after another. The trick was to use BullJS's job.finished method to get a Promise that resolves once the job is finished. By using await inside the for-loop, the next job gets started immediately after the job.finished Promise resolves. That's the nice thing about for-loops: await works in them!
Here is a small code example of how to achieve the intended behavior:
Here a small code example on how to achieve the intended behavior:
for (let i = 0; i < theValues.length; i++) {
  jobCounter++;
  const job = await this.processingQueue.add(
    'update-values',
    {
      value: theValues[i],
    },
    {
      // delay: i * 90000,
      // lifo: true,
    }
  );
  this.jobs[job.id] = {
    jobType: 'socket',
    jobSocketId: BackgroundJobTasks.UPDATE_VALUES,
    data: {
      value: theValues[i],
    },
    jobCount: theValues.length,
    jobNumber: jobCounter,
    cumulatedJobId,
  };
  await job.finished()
    .then((val) => {
      console.log('job finished:: ', val);
    });
}
The important part is really
await job.finished()
inside the for loop. theValues.length jobs get started, all just one after the other, as intended.
That way, horizontally scaling jobs across more than one worker is no longer possible. Nevertheless this workaround is okay for us at the moment.
I will get in contact with optimalbits - the maker of BullJS - to clear things up.

Agenda jobs, is it possible to schedule a job to be triggered after the previous one finished?

I'm using Agenda jobs in Node.js. I was wondering if it's possible to define a job with a dependency.
In my case I have 3 jobs; each one executes a part of a whole logic, which is quite big, which is why they are separated.
In Job 1, I collect info from the database, transform it, and perform a few tasks.
In Job 2, I get all the data previously processed by Job 1 and perform new transformations.
In Job 3, I get again all the data processed in Job 2 and send a few reports.
I have those tasks scheduled to be executed every 5 minutes. But the thing is that the 3 of them are scheduled at the same time, because they are scheduled dynamically.
job1.schedule("now");
job1.repeatEvery('5 minutes');
job2.schedule("now");
job2.repeatEvery('5 minutes');
job3.schedule("now");
job3.repeatEvery('5 minutes');
As it's configured currently, to process one instance that needs to go through the 3 jobs, in the worst case the user will need to wait 15 minutes, which is not ideal.
I was wondering if there is an option to define that a task should be executed as soon as the previous task has finished. I'm aware that a possible workaround would be to schedule the jobs with a few minutes of difference, but given that the jobs can require more or less time depending on the number of instances, this doesn't work for me.
The way I do a pretty similar thing is: I trigger the next job when the main function of the preceding job is done. This way I can feed data from one job to another and run them in sequence.
So I trigger job3 in the definition of job2 and trigger job2 in the definition of job1:
// DEFINE JOBS
agenda.define('job1', (job) => {
  return myMainJob1Function(job.attrs.data)
    .then((data) => {
      for (const element of data) {
        agenda.now('job2', { // runs many instances of the 'job2' job with distinct data
          // use output data from the "parent" job
          arg1: element.arg1,
          arg2: element.arg2,
          arg3: element.arg3,
          arg4: job.attrs.data.someArg, // this will propagate an argument we set for the "parent" job
        });
      }
    });
});

agenda.define('job2', (job) => {
  return myMainJob2Function(job.attrs.data) // data supplied from 'job1'
    .then((data) => {
      for (const element of data) {
        agenda.now('job3', { // runs many instances of the 'job3' job with distinct data
          // use output data from the "parent" job
          arg1: element.arg1,
          arg2: element.arg2,
          arg3: element.arg3,
          arg4: job.attrs.data.someArg, // this will propagate an argument we set for the "parent" job
        });
      }
    });
});

agenda.define('job3', (job) => {
  return myMainJob3Function(job.attrs.data) // data supplied from 'job2'
    .then((data) => {
      // save data to database or do something else
    });
});
// TRIGGER 'job1', WHICH TRIGGERS CHILD JOBS
agenda.now('job1', { someArg: 5 }); // we're providing some initial argument, that gets passed onto child jobs
const schedule = '*/5 * * * *'; // cron expression for every 5 minutes
agenda.every(schedule, 'job1', { someArg: 5 });
One other thing that comes to mind is using triggers for the completion or success of job1, but when you run many of the same jobs with different input data you can't (I haven't found a way to) listen to those events with regard to the specific job instance you are interested in.
That being said you can do:
agenda.on('success:job1', (job) => {
agenda.now('job2', { someArg: job.attrs.data.someArg });
});

Google App Engine Node Task Queue Retries before Finished

I am running a slow operation via a cloud task queue to delete objects from Google Cloud Storage. I have noticed that the task queue retries the task after two minutes have passed, even though the running task has neither finished nor errored.
What is the best strategy to trigger valid retries, but not retry while the task is still running?
Here's my task creator:
router.get('/start-delete-old', async (req, res) => {
  const task = {
    appEngineHttpRequest: {
      httpMethod: 'POST',
      relativeUri: `/videos/delete-old`,
    },
  };
  const request = {
    parent: taskClient.parent,
    task: task,
  };
  const [response] = await taskClient.queue.createTask(request);
  res.send(response);
});
Here's my task handler:
router.post('/delete-old', async (req, res) => {
  const cameras = await knex('cameras');
  const date = moment().subtract(365, 'days').format('YYYY-MM-DD');
  for (let i = 0; i < cameras.length; i++) {
    const camera = cameras[i];
    const prefix = `${camera.id}/${date}/`;
    try {
      await bucket.deleteFiles({ prefix: prefix, force: true });
      // use parameter bindings rather than string interpolation
      await knex.raw(
        'delete from videos where camera_id = ? and cast(start_time as date) = ?',
        [camera.id, date]
      );
    } catch (e) {
      console.log('error deleting ' + e);
    }
  }
  res.send({});
});
As per the documentation, a timeout in a task can vary depending on the environment you are using:
Standard environment
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
Flexible environment
For worker services running in the flex environment: all types have a 60 minute timeout.
So, if your handler misses the deadline, the queue assumes the task failed and retries it.
Also, the Task Queue expects to receive a status code between 200 and 299; if it doesn't, it assumes the running task has failed. Quoting the documentation:
Upon successful completion of processing, the handler must send an HTTP status code between 200 and 299 back to the queue. Any other value indicates the task has failed and the queue retries the task.
I believe that both the bucket file deletion and the knex raw query are taking a lot of time to process, and this is causing the handler to miss the deadline or return a status other than 200-299.
One good way to troubleshoot is using Stackdriver logs; you will be able to gather more information about the ongoing processes and see whether any of them is returning an error.
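To trigger valid retries, the handler should also report failures explicitly rather than just logging them; a sketch (deleteOldVideos is a hypothetical wrapper around the loop above):
router.post('/delete-old', async (req, res) => {
  try {
    await deleteOldVideos(); // the actual work, kept within the deadline
    res.status(200).send({}); // 2xx tells Cloud Tasks the task succeeded
  } catch (e) {
    console.error('delete-old failed', e);
    res.status(500).send({}); // non-2xx tells Cloud Tasks to retry
  }
});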
