I need an approach to block worker to process a job while I called getJob on different function. I've looked around but couldn't find a solution for that.
I have following setup.
In nodeJS with express, I have worker node.
Job created with delayed state.
Job is being accessed in different function
async function jobReader(id) {
const job = await queue.getJob(id);
/* do some stuff */
await job.remove();
}
Worker node that independently processes jobs. Job will be only processed if the delayed time is finishes.
queue.process(async (job) => {
/* do some stuff */
})
queue.getJob(id) doesn't block the worker to process the job. So there's race on worker processing the job and jobReader processing the job. I am writing some result to DB according to job status. So the race condition is not acceptable.
Apparently, getJob is not blocking the worker to process the job. Is there any way to lock or block to worker work on the job, if the job is read by some other function with getJob function.
Any help or documentation will be appreciated.
Thanks
I guess you should change your architecture a little. Worker Node does exactly what it is intended for, it takes jobs and runs them. So instead of blocking the queue in some way, you should only add the job to the queue when the user approved/canceled/failed it (or did not sent a response after 120 seconds).
If I understood you right, this should give you an idea how to have control over jobs between different requests:
// this is YOUR queueu object. I don't now implentation but think
// of it like this..
const queue = new Queue()
// a variable holding the pending jobs which are not timeouted
// or explicitly approved/canceled/failed by user
const waitingJobs = {
}
// This could be your location where the user calls the api for creating a job.
app.post('/job', (req, res) => {
// create the job as the user requested it
const job = createJob(req)
// Add a timeout for 120 seconds into your waitingJobs array.
// So if the user does not respond after that time, the job will
// be added to queue! .
const timeout = setTimeout(() => {
queue.add(job)
// remove the reference after adding, garbage collection..
waitingJobs[job.id] = null
// job is added to queue automatically after 120 seconds
}, 120 * 1000)
// store the timeout in the job object!
job.timeout = timeout
// store the waiting job!
waitingJobs[job.id] = job
// respond to user, send back id so client can do another
// request if wanted.
req.status(200).json({ message: 'Job created!', id: job.id })
})
app.post('/job/:id', (req, res) => {
const id = req.params.id
if (!id) {
req.status(400).json('bad job id provided')
return
}
// get the queued job:
const job = waitingJobs[id]
if (!job) {
req.status(400).json('Job nod found OR job already processed. Job id: ' + id)
return
}
// now the user responded to a specific job, clean the
// timeout first, so it won't be added to queue!
if (job.timeout) {
clearTimeout(job.timeout)
}
// Now the job won't be processed somewhere else!
// you can do whatever you want...
// example:
// get the action
const action = req.query.action
if(!action) {
res.status(400).json('Bad action provided: ' + action)
return
}
if(action === 'APPROVE') {
// job approved! , add it to queue so worker node
// can process it..
queue.add(job)
}
if(action === 'CANCEL') {
// do something else...
}
/// etc..
// ofc clear the job reference after you did something..
waitingJobs[job.id] = null
// since everything worked, inform user the job will now be processed!
res.status(200).json('Job ' + job.id + 'Will now be processed')
})
I want to have a cron job/scheduler that will run every 30 minutes after an onCreate event occurs in Firestore. The cron job should trigger a cloud function that picks the documents created in the last 30 minutes-validates them against a json schema-and saves them in another collection.How do I achieve this,programmatically writing such a scheduler?
What would also be fail-safe mechanism and some sort of queuing/tracking the documents created before the cron job runs to push them to another collection.
Building a queue with Firestore is simple and fits perfectly for your use-case. The idea is to write tasks to a queue collection with a due date that will then be processed when being due.
Here's an example.
Whenever your initial onCreate event for your collection occurs, write a document with the following data to a tasks collection:
duedate: new Date() + 30 minutes
type: 'yourjob'
status: 'scheduled'
data: '...' // <-- put whatever data here you need to know when processing the task
Have a worker pick up available work regularly - e.g. every minute depending on your needs
// Define what happens on what task type
const workers: Workers = {
yourjob: (data) => db.collection('xyz').add({ foo: data }),
}
// The following needs to be scheduled
export const checkQueue = functions.https.onRequest(async (req, res) => {
// Consistent timestamp
const now = admin.firestore.Timestamp.now();
// Check which tasks are due
const query = db.collection('tasks').where('duedate', '<=', new Date()).where('status', '==', 'scheduled');
const tasks = await query.get();
// Process tasks and mark it in queue as done
tasks.forEach(snapshot => {
const { type, data } = snapshot.data();
console.info('Executing job for task ' + JSON.stringify(type) + ' with data ' + JSON.stringify(data));
const job = workers[type](data)
// Update task doc with status or error
.then(() => snapshot.ref.update({ status: 'complete' }))
.catch((err) => {
console.error('Error when executing worker', err);
return snapshot.ref.update({ status: 'error' });
});
jobs.push(job);
});
return Promise.all(jobs).then(() => {
res.send('ok');
return true;
}).catch((onError) => {
console.error('Error', onError);
});
});
You have different options to trigger the checking of the queue if there is a task that is due:
Using a http callable function as in the example above. This requires you to perform a http call to this function regularly so it executes and checks if there is a task to be done. Depending on your needs you could do it from an own server or use a service like cron-job.org to perform the calls. Note that the HTTP callable function will be available publicly and potentially, others could also call it. However, if you make your check code idempotent, it shouldn't be an issue.
Use the Firebase "internal" cron option that uses Cloud Scheduler internally. Using that you can directly trigger the queue checking:
export scheduledFunctionCrontab =
functions.pubsub.schedule('* * * * *').onRun((context) => {
console.log('This will be run every minute!');
// Include code from checkQueue here from above
});
Using such a queue also makes your system more robust - if something goes wrong in between, you will not loose tasks that would somehow only exist in memory but as long as they are not marked as processed, a fixed worker will pick them up and reprocess them. This of course depends on your implementation.
You can trigger a cloud function on the Firestore Create event which will schedule the Cloud Task after 30 minutes. This will have queuing and retrying mechanism.
An easy way is that you could add a created field with a timestamp, and then have a scheduled function run at a predefined period (say, once a minute) and execute certain code for all records where created >= NOW - 31 mins AND created <= NOW - 30 mins (pseudocode). If your time precision requirements are not extremely high, that should work for most cases.
If this doesn't suit your needs, you can add a Cloud Task (Google Cloud product). The details are specified in this good article.
Basically, each of the clients ---that have a clientId associated with them--- can push messages and it is important that a second message from the same client isn't processed until the first one is finished processing (Even though the client can send multiple messages in a row, and they are ordered, and multiple clients sending messages should ideally not interfere with each other). And, importantly, a job shouldn't be processed twice.
I thought that using Redis I might be able to fix this issue, I started with some quick prototyping using the bull library, but I am clearly not doing it well, I was hoping someone would know how to proceed.
This is what I tried so far:
Create jobs and add them to the same queue name for one process, using the clientId as the job name.
Consume jobs while waiting large random amounts of random time on 2 separate process.
I tried adding the default locking provided by the library that I am using (bull) but it locks on the jobId, which is unique for each job, not on the clientId .
What I would want to happen:
One of the consumers can't take the job from the same clientId until the previous one is finished processing it.
They should be able to, however, get items from different clientIds in parallel without problem (asynchronously). (I haven't gotten this far, I am right now simply dealing with only one clientId)
What I get:
Both consumers consume as many items as they can from the queue without waiting for the previous item for the clientId to be completed.
Is Redis even the right tool for this job?
Example code
// ./setup.ts
import Queue from 'bull';
import * as uuid from 'uuid';
// Check that when a message is taken from a place, no other message is taken
// TO do that test, have two processes that process messages and one that sets messages, and make the job take a long time
// queue for each room https://stackoverflow.com/questions/54178462/how-does-redis-pubsub-subscribe-mechanism-works/54243792#54243792
// https://groups.google.com/forum/#!topic/redis-db/R09u__3Jzfk
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
export async function sleep(ms: number): Promise<void> {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
export interface JobData {
id: string;
v: number;
}
export const queue = new Queue<JobData>('messages', 'redis://127.0.0.1:6379');
queue.on('error', (err) => {
console.error('Uncaught error on queue.', err);
process.exit(1);
});
export function clientId(): string {
return uuid.v4();
}
export function randomWait(minms: number, maxms: number): Promise<void> {
const ms = Math.random() * (maxms - minms) + minms;
return sleep(ms);
}
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
// eslint-disable-next-line #typescript-eslint/ban-ts-comment
//#ts-ignore
queue.LOCK_RENEW_TIME = 5 * 60 * 1000;
// ./create.ts
import { queue, randomWait } from './setup';
const MIN_WAIT = 300;
const MAX_WAIT = 1500;
async function createJobs(n = 10): Promise<void> {
await randomWait(MIN_WAIT, MAX_WAIT);
// always same Id
const clientId = Math.random() > 1 ? 'zero' : 'one';
for (let index = 0; index < n; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
const job = { id: clientId, v: index };
await queue.add(clientId, job).catch(console.error);
console.log('Added job', job);
}
}
export async function create(nIds = 10, nItems = 10): Promise<void> {
const jobs = [];
await randomWait(MIN_WAIT, MAX_WAIT);
for (let index = 0; index < nIds; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
jobs.push(createJobs(nItems));
await randomWait(MIN_WAIT, MAX_WAIT);
}
await randomWait(MIN_WAIT, MAX_WAIT);
await Promise.all(jobs)
process.exit();
}
(function mainCreate(): void {
create().catch((err) => {
console.error(err);
process.exit(1);
});
})();
// ./consume.ts
import { queue, randomWait, clientId } from './setup';
function startProcessor(minWait = 5000, maxWait = 10000): void {
queue
.process('*', 100, async (job) => {
console.log('LOCKING: ', job.lockKey());
await job.takeLock();
const name = job.name;
const processingId = clientId().split('-', 1)[0];
try {
console.log('START: ', processingId, '\tjobName:', name);
await randomWait(minWait, maxWait);
const data = job.data;
console.log('PROCESSING: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('PROCESSED: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('FINISHED: ', processingId, '\tjobName:', name, '\tdata:', data);
} catch (err) {
console.error(err);
} finally {
await job.releaseLock();
}
})
.catch(console.error); // Catches initialization
}
startProcessor();
This is run using 3 different processes, which you might call like this (Although I use different tabs for a clearer view of what is happening)
npx ts-node consume.ts &
npx ts-node consume.ts &
npx ts-node create.ts &
I'm not familir with node.js. But for Redis, I would try this,
Let's say you have client_1, client_2, they are all publisher of events.
You have three machines, consumer_1,consumer_2, consumer_3.
Establish a list of tasks in redis, eg, JOB_LIST.
Clients put(LPUSH) jobs into this JOB_LIST, in a specific form, like "CLIENT_1:[jobcontent]", "CLIENT_2:[jobcontent]"
Each consumer takes out jobs blockingly (RPOP command of Redis) and process them.
For example, consumer_1 takes out a job, content is CLIENT_1:[jobcontent]. It parses the content and recognize it's from CLIENT_1. Then it wants to check if some other consumer is processing CLIENT_1 already, if not, it will lock the key to indicate that it's processing CLIENT_1.
It goes on to set a key of "CLIENT_1_PROCESSING" , with content as "consumer_1", using the Redis SETNX command (set if the key not exists), with an appropriate timeout. For example, the task norally takes one minute to finish, you set a timeout of the key of five minutes, just in case consumer_1 crashes and holds on the lock indefinitely.
If the SETNX returns 0, it means it fails to acquire the lock of CLIENT_1 (someone is already processing a job of client_1). Then it returns the job (a value of "CLIENT_1:[jobcontent]")to the left side of JOB_LIST, by using Redis LPUSH command.Then it might wait a bit (sleep a few seconds), and RPOP another task from the right side of the LIST. If this time SETNX returns 1, consumer_1 acquires the lock. It goes on to process job, after it finishes, it deletes the key of "CLIENT_1_PROCESSING", releasing the lock. Then it goes on to RPOP another job, and so on.
Some things to consider:
The JOB_LIST is not fair,eg, earlier jobs might be processed later
The locking part is a bit rudimentary, but will suffice.
----------update--------------
I've figured another way to keep tasks in order.
For each client(producer), build a list. Like "client_1_list", push jobs into the left side of the list.
Save all the client names in a list "client_names_list", with values "client_1", "client_2", etc.
For each consumer(processor), iterate the "client_names_list", for example, consumer_1 get a "client_1", check if the key of client_1 is locked(some one is processing a task of client_1 already), if not, right pop a value(job) from client_1_list and lock client_1. If client_1 is locked, (probably sleep one second) and iterate to the next client, "client_2", for example, and check the keys and so on.
This way, each client(task producer)'s task is processed by their order of entering.
EDIT: I found the problem regarding BullJS is starting jobs in parallel on one processor: We are using named jobs and where defining many named process functions on one queue/processor. The default concurrency factor for a queue/processor is 1. So the queue should not process any jobs in parallel.
The problem with our mentioned setup is if you define many (named) process-handlers on one queue the concurrency is added up with each process-handler function: So if you define three named process-handlers you get a concurrency factor of 3 for given queue for all the defined named jobs.
So just define one named job per queue for queues where parallel processing should not happen and all jobs should run sequentially one after the other.
That could be important e.g. when pushing a high number of jobs onto the queue and the processing involves API calls that would give errors if handled in parallel.
The following text is my first approach of answering the op's question and describes just a workaround to the problem. So better just go with my edit :) and configure your queues the right way.
I found an easy solution to operators question.
In fact BullJS is processing many jobs in parallel on one worker instance:
Let's say you have one worker instance up and running and push 10 jobs onto the queue than possibly that worker starts all processes in parallel.
My research on BullJS-queues gave that this is not intended behavior: One worker (also called processor by BullJS) should only start a new job from the queue when its in idle state so not processing a former job.
Nevertheless BullJS keeps starting jobs in parallel on one worker.
In our implementation that lead to big problems during API calls that most likely are caused by t00 many API calls at a time. Tests gave that when only starting one worker the API calls finished just fine and gave status 200.
So how to just process one job after the other once the previous is finished if BullJS does not do that for us (just what the op asked)?
We first experimented with delays and other BullJS options but thats kind of workaround and not the exact solution to the problem we are looking for. At least we did not get it working to stop BullJS from processing more than one job at a time.
So we did it ourself and started one job after the other.
The solution was rather simple for our use case after looking into BullJS API reference (BullJS API Ref).
We just used a for-loop to start the jobs one after another. The trick was to use BullJS's
job.finished
method to get a Promise.resolve once the job is finished. By using await inside the for-loop the next job gets just started immediately after the job.finished Promise is awaited (resolved). Thats the nice thing with for-loops: Await works in it!
Here a small code example on how to achieve the intended behavior:
for (let i = 0; i < theValues.length; i++) {
jobCounter++
const job = await this.processingQueue.add(
'update-values',
{
value: theValues[i],
},
{
// delay: i * 90000,
// lifo: true,
}
)
this.jobs[job.id] = {
jobType: 'socket',
jobSocketId: BackgroundJobTasks.UPDATE_VALUES,
data: {
value: theValues[i],
},
jobCount: theValues.length,
jobNumber: jobCounter,
cumulatedJobId
}
await job.finished()
.then((val) => {
console.log('job finished:: ', val)
})
}
The important part is really
await job.finished()
inside the for loop. leasingValues.length jobs get started all just one after the other as intended.
That way horizontally scaling jobs across more than one worker is not possible anymore. Nevertheless this workaround is okay for us at the moment.
I will get in contact with optimalbits - the maker of BullJS to clear things out.
If I publish 50K messages using Promise.all like below:
const pubsub = new PubSub({ projectId: PUBSUB_PROJECT_ID });
const topic = pubsub.topic(topicName, {
batching: {
maxMessages: 1000,
maxMilliseconds: 100,
},
});
const n = 50 * 1000;
const dataBufs: Buffer[] = [];
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
dataBufs.push(dataBuffer);
}
const tasks = dataBufs.map((d, idx) =>
topic.publish(d).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
})
);
// publish messages concurrencly
await Promise.all(tasks);
// send response to front-end
res.json(data);
I will hit this issue: pubsub-emulator throw error and publisher throw "Retry total timeout exceeded before any response was received" when publish 50k messages
If I use for loop and async/await. The issue is gone.
const n = 50 * 1000;
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
const messageId = await topic.publish(dataBuffer)
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${i}`)
}
// some logic ...
// send response to front-end
res.json(data);
But it will block the execution of subsequent logic because of async/await until all messages have been published. It takes a long time to post 50k messages.
Any suggestions about how to publish a huge amount of messages(about 50k) without blocking the execution of subsequent logic? Do I need to use child_process or some queue like bull to publish the huge amount of messages in the background without blocking request/response workflow of the API? This means I need to respond to the front-end as soon as possible, the 50k messages should be the background tasks.
It seems there is a memory queue inside #google/pubsub library. I am not sure if I should use another queue like bull again.
The time it will take to publish large amounts of data depends on a lot of factors:
Message size. The larger the messages, the longer it takes to send them.
Network capacity (both of the connection between wherever the publisher is running and Google Cloud and, if relevant, of the virtual machine itself). This puts an upper bound on the amount of data that can be transmitted. It is not atypical to see smaller virtual machines with limits in the 40MB/s range. Note that if you are testing via Wifi, the limits could be even lower than this.
Number of threads and number of CPU cores. When having to run a lot of asynchronous callbacks, the ability to schedule them to run can be limited by the parallel capacity of the machine or runtime environment.
Typically, it is not good to try to send 50,000 publishes simultaneously from one instance of a publisher. It is likely that the above factors will cause the client to get overloaded and result in deadline exceeded errors. The best way to prevent this is to limit the number of messages that can be outstanding for publish at one time. Some of the libraries like Java support this natively. The Node.js library does not yet support this feature, but likely will in the future.
In the meantime, you'd want to keep a counter of the number of messages outstanding and limit it to whatever the client seems to be able to handle. Start with 1000 and work up or down from there based on the results. A semaphore would be a pretty standard way to achieve this behavior. In your case the code would look something like this:
var sem = require('semaphore')(1000);
var publishes = []
const tasks = dataBufs.map((d, idx) =>
sem.take(function() => {
publishes.push(topic.publish(d).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
sem.leave();
}));
})
);
// Await the start of publishing all messages
await Promise.all(tasks);
// Await the actual publishes
await Promise.all(publishes);
I'm using Nodejs cluster module to have multiple workers running.
I created a basic Architecture where there will be a single MASTER process which is basically an express server handling multiple requests and the main task of MASTER would be writing incoming data from requests into a REDIS instance. Other workers(numOfCPUs - 1) will be non-master i.e. they won't be handling any request as they are just the consumers. I have two features namely ABC and DEF. I distributed the non-master workers evenly across features via assigning them type.
For eg: on a 8-core machine:
1 will be MASTER instance handling request via express server
Remaining (8 - 1 = 7) will be distributed evenly. 4 to feature:ABD and 3 to fetaure:DEF.
non-master workers are basically consumers i.e. they read from REDIS in which only MASTER worker can write data.
Here's the code for the same:
if (cluster.isMaster) {
// Fork workers.
for (let i = 0; i < numCPUs - 1; i++) {
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
}
cluster.on('exit', function(worker) {
console.log(`Worker ${worker.process.pid}::type(${worker.type}) died`);
ClusteringUtil.removeWorkerFromList(worker.type);
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
});
// Start consuming on server-start
ABCConsumer.start();
DEFConsumer.start();
console.log(`Master running with process-id: ${process.pid}`);
} else {
console.log('CLUSTER type', cluster.worker.process.env.type, 'running on', process.pid);
if (
cluster.worker.process.env &&
cluster.worker.process.env.type &&
cluster.worker.process.env.type === ServerTypeEnum.EXPRESS
) {
// worker for handling requests
app.use(express.json());
...
}
{
Everything works fine except consumers reading from REDIS.
Since there are multiple consumers of a particular feature, each one reads the same message and start processing individually, which is what I don't want. If there are 4 consumers, 1 is marked as busy and can not consumer until free, 3 are available. Once the message for that particular feature is written in REDIS by MASTER, the problem is all 3 available consumers of that feature start consuming. This means that the for a single message, the job is done based on number of available consumers.
const stringifedData = JSON.stringify(req.body);
const key = uuidv1();
const asyncHsetRes = await asyncHset(type, key, stringifedData);
if (asyncHsetRes) {
await asyncRpush(FeatureKeyEnum.REDIS.ABC_MESSAGE_QUEUE, key);
res.send({ status: 'success', message: 'Added to processing queue' });
} else {
res.send({ error: 'failure', message: 'Something went wrong in adding to queue' });
}
Consumer simply accepts messages and stop when it is busy
module.exports.startHeartbeat = startHeartbeat = async function(config = {}) {
if (!config || !config.type || !config.listKey) {
return;
}
heartbeatIntervalObj[config.type] = setInterval(async () => {
await asyncLindex(config.listKey, -1).then(async res => {
if (res) {
await getFreeWorkerAndDoJob(res, config);
stopHeartbeat(config);
}
});
}, HEARTBEAT_INTERVAL);
};
Ideally, a message should be read by only one consumer of that particular feature. After consuming, it is marked as busy so it won't consume further until free(I have handled this). Next message could only be processed by only one consumer out of other available consumers.
Please help me in tacking this problem. Again, I want one message to be read by only one free consumer and rest free consumers should wait for new message.
Thanks
I'm not sure I fully get your Redis consumers architecture, but I feel like it contradicts with the use case of Redis itself. What you're trying to achieve is essentially a queue based messaging with an ability to commit a message once its done.
Redis has its own pub/sub feature, but it is built on fire and forget principle. It doesn't distinguish between consumers - it just sends the data to all of them, assuming that its their logic to handle the incoming data.
I recommend to you use Queue Servers like RabbitMQ. You can achieve your goal with some features that AMQP 0-9-1 supports: message acknowledgment, consumer's prefetch count and so on. You can set up your cluster with very agile configs like ok, I want to have X consumers, and each can handle 1 unique (!) message at a time and they will receive new ones only after they let the server (rabbitmq) know that they successfully finished message processing. This is highly configurable and robust.
However, if you want to go serverless with some fully managed service so that you don't provision like virtual machines or anything else to run a message queue server of your choice, you can use AWS SQS. It has pretty much similar API and features list.
Hope it helps!