Bull.js is duplicating jobs on Redis - node.js

I have a Node.js app running whose only job is to process queues from Redis using Bull.js. Each job performs a task and then adds another entry to Redis with the same data, like this:
queue.process(async ({ data: { dataId } }) => {
  // ...perform the task...
  await queue.add({ dataId }, { delay }); // delay is randomized on every run
});
There should only ever be one job per dataId; it's built this way because I need the queue delay to be random every time the job is processed.
The problem:
After the app has been running for a while, the jobs get duplicated: more than one job with the same dataId ends up being processed by Bull.js. I only realized that when I used Taskforce to check the queue status.
I could check whether there is already a job with that dataId before adding a new one to the queue, but that definitely would not be optimal, so I'd like to know what's causing the duplication so I can prevent it from happening.
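For reference, a minimal sketch of that "check before adding" idea, using Bull's getJob together with a deterministic jobId (the id scheme and option choices here are illustrative, not from the original question):

// Give every dataId a deterministic jobId; Bull ignores queue.add() calls
// whose jobId still exists in the queue, so a second add becomes a no-op.
async function scheduleOnce(dataId, delay) {
  const jobId = `data-${dataId}`; // hypothetical id scheme
  const existing = await queue.getJob(jobId);
  if (existing && !(await existing.isCompleted())) return; // already queued or running
  await queue.add({ dataId }, { jobId, delay, removeOnComplete: true });
}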

Related

How to run multiple instances without duplicate jobs in Node.js

I have a problem when I scale my project (NestJS) to multiple instances. The project has a crawler service that runs every 10 minutes. When 2 instances are running, the crawler runs on both, so the data gets duplicated. Does anyone know how to handle this?
It looks like it could be handled with a queue, but I don't have a solution yet.
Jobs aren't the right construct in this case.
Instead, use a job Queue: https://docs.nestjs.com/techniques/queues
You won't even need to set up a separate worker server to handle the jobs. Instead, add Redis (or similar) to your setup, configure a queue to use it, then set up 1) a producer module that adds jobs to the queue whenever they need to be run, and 2) a consumer module that pulls jobs off the queue and processes them. Add logic to the producer module to ensure that duplicate jobs aren't created, since that logic will be running on both your machines.
Alternatively, it may be easier to move job production/processing onto a separate server altogether.
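To make the producer/consumer split concrete, here is a minimal sketch using plain Bull (the queue name, Redis URL and cron schedule are illustrative); in NestJS the same pieces map onto @nestjs/bull producers and processors:

const Queue = require('bull');
const crawlerQueue = new Queue('crawler', 'redis://127.0.0.1:6379');

// Producer: safe to run on every instance. Bull stores repeatable jobs keyed by
// their repeat options, so registering the same schedule twice is a no-op.
async function registerCrawl() {
  await crawlerQueue.add({}, { repeat: { cron: '*/10 * * * *' } });
}

// Consumer: each queued job is delivered to exactly one worker, no matter how
// many instances are listening, so the crawl runs once per tick.
crawlerQueue.process(async () => {
  // ...crawl...
});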

How to perform long event processing in Node JS with a message queue?

I am building an email processing pipeline in Node JS with Google Pub/Sub as a message queue. The message queue has a limitation where it needs an acknowledgment for a sent message within 10 minutes. However, the jobs it's sending to the Node JS server might take an hour to complete. So the same job might run multiple times till one of them finishes. I'm worried that this will block the Node JS event loop and slow down the server too.
My questions are:
Should I be using a message queue to start this long-running job given that the message queue expects a response in 10 mins or is there some other architecture I should consider?
If multiple such jobs start, should I be worried about the Node JS event loop being blocked? Each job is basically iterating through a MongoDB cursor creating hundreds of thousands of emails.
Well, it sounds like you either should not be using that queue (with the timeout you can't change) or you should break your jobs up into pieces that easily finish long before the timeout. It sounds like a case of needing to match the tool to the requirements of the job; if that queue doesn't match your requirements, you probably need a different mechanism. I don't fully understand what you need from Google's Pub/Sub, but creating a queue of your own or finding a generic queue on NPM is generally fairly easy if you just want to serialize access to a bunch of jobs (a minimal sketch follows below).
I rather doubt you have Node.js event-loop blockage issues as long as all your I/O uses asynchronous methods. Nothing you're doing sounds CPU-heavy, and that's what blocks the event loop (long-running CPU-heavy operations). Your whole project is probably limited by both MongoDB and whatever you're using to send the emails, so you should make sure you're not overwhelming either of those to the point where they become sluggish and lose throughput.
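As an illustration of the "queue of your own" idea mentioned above, a minimal serializing queue might look like this (the class name and API are made up for the sketch):

// Jobs are plain async functions; they run strictly one at a time, in order.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve();
  }
  push(job) {
    // Start the job only after the previous one has settled, even if it failed.
    this.tail = this.tail.then(job, job);
    return this.tail;
  }
}

// Usage: serialQueue.push(() => processOneMailingJob(cursor));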
To answer the original question:
Should I be using a message queue to start this long-running job given that the message queue expects a response in 10 mins, or is there some other architecture I should consider?
Yes, a message queue works well for dealing with these kinds of events. The important thing is to make sure the final action is idempotent, so that even if you process duplicate events by accident, the final result is applied once. This guide from Google Cloud is a helpful resource on making your subscriber idempotent.
To get around the 10 min limit of Pub/Sub, I ended up creating an in-memory table that tracked active jobs. If a job was actively being processed and Pub/Sub sent the message again, it would do nothing. If the server restarts and loses the job, the in-memory table also disappears, so the job can be processed once again if it was incomplete.
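A minimal sketch of that in-memory tracking, assuming the @google-cloud/pubsub client, a subscription called email-jobs, a jobId message attribute and a runLongJob() helper (all illustrative):

const { PubSub } = require('@google-cloud/pubsub');

// In-memory table of jobs currently being processed. It is deliberately not
// persisted: if the server restarts mid-job, the table is lost and the next
// Pub/Sub redelivery starts the job over.
const activeJobs = new Set();

const subscription = new PubSub().subscription('email-jobs');

subscription.on('message', async (message) => {
  const jobKey = message.attributes.jobId || message.id;

  if (activeJobs.has(jobKey)) {
    // Redelivery while the job is still running: do nothing and rely on the
    // idempotent final action to absorb any duplicate processing.
    return;
  }

  activeJobs.add(jobKey);
  try {
    await runLongJob(jobKey); // placeholder for the hour-long processing
    message.ack();            // acknowledge only once the work is done
  } finally {
    activeJobs.delete(jobKey);
  }
});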
If multiple such jobs start, should I be worried about the Node JS event loop being blocked? Each job is basically iterating through a MongoDB cursor creating hundreds of thousands of emails.
I have ignored this for now as per the comment left by jfriend00. You can also rate-limit the number of jobs being processed.

Should I create a single Azure WebJob (Node.js)?

I need to create a WebJob that runs 2 processes (maybe more), all the time.
Process 1 (Continuous)
Get messages from the queue.
For each message, connect to the db and update a value.
Repeat from step 1.
Process 2 (scheduled - every day, early in the morning)
Go to the db and move the records to a tmp table.
Send each record via HTTP.
If a record can't be sent, keep retrying for the rest of the day.
If all records were sent, run again tomorrow.
Given these 2 processes (there will be more), can I create one single WebJob for all of them, or should I create a separate WebJob for each process?
I was thinking about this implementation, but I don't know how appropriate it is.
WebJobs: 1
Type: Continuous

while (true) {
    process1();
    process2();
}

async function process1() {
    // do stuff: read messages from the queue and update the db for each one
}

async function process2() {
    // do stuff: move records to the tmp table and send them via HTTP
    // scheduled internally with the node-cron lib (every day, early in the morning)
}
Given these 2 processes (there will be more), can I create one single WebJob for all of them, or should I create a separate WebJob for each process?
In short, you need to create a separate WebJob for each process.
When you run a WebJob as continuous or scheduled, that type applies to everything inside it, so you cannot create one single WebJob that contains both a continuous process and a scheduled process. For more details, you could refer to this article.
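For illustration, a rough sketch of how the two processes could be split into two separate WebJob entry points (the helper functions are placeholders, not real APIs):

// webjob-1/run.js -- deployed as a Continuous WebJob
(async function pollQueue() {
  while (true) {
    const message = await getNextQueueMessage(); // placeholder helper
    if (message) await updateDbValue(message);   // placeholder helper
  }
})();

// webjob-2/run.js -- deployed as a Triggered (scheduled) WebJob; the
// early-morning schedule lives in its settings.job file as a CRON expression.
(async function dailyExport() {
  const records = await moveRecordsToTmpTable(); // placeholder helper
  for (const record of records) {
    await sendViaHttpWithRetry(record);          // placeholder helper, retries all day
  }
})();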

How to release a batch of celery backend resources without stopping/pausing producer script?

I was going through the Celery documentation and I ran across this warning:
Warning: Backends use resources to store and transmit results. To ensure that resources are released, you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
My celery application is indeed using the backend to store task results in the public.celery_taskmeta table in Postgres and I am sure that this warning is relevant to me. I currently have a producer that queues up a bunch of tasks for my workers every X minutes and moves on and performs other stuff. The producer is a long-running script that will eventually queue up a bunch of new tasks in RabbitMQ. The workers will usually take 5-20 minutes to finish executing a task because it pulls data from Kafka, Postgres/MySQL, processes those data and inserts them into Redshift. So for example, this is what my producer is doing
import celery_workers

for task in task_list:  # task_list can hold up to 100s of tasks
    async_result = celery_workers.delay(task)

# move on and do other stuff
Now my question is: how do I go back and release the backend resources by calling async_result.get() (as stated in the celery docs warning) without having my producer pause/wait for the workers to finish?
Using Python 3.6 and celery 4.3.0

SQS: Know remaining jobs

I'm creating an app that uses a job queue built on Amazon SQS.
Every time a user logs in, I create a bunch of jobs for that specific user, and I want them to wait until all their jobs have been processed before taking them to a specific screen.
My problem is that I don't know how to query the queue to see whether there are still pending jobs for a specific user, or what the correct way to implement such a solution is.
Everything regarding the queue (job creation and processing) is working as expected, but I am missing that final step.
Just for the record:
In my previous implementation I was using Redis + Kue: I created a key with the user id holding a job count, incremented that count every time a job was added, and decremented it every time a job finished or failed. But now I want to move away from Redis + Kue and I'm not sure how to implement this step.
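For reference, a minimal sketch of that counter pattern (assuming the ioredis client; the key scheme is illustrative):

const Redis = require('ioredis');
const redis = new Redis();

const pendingKey = (userId) => `jobs:pending:${userId}`; // illustrative key scheme

// Increment when a job is enqueued, decrement when it finishes or fails.
const jobAdded = (userId) => redis.incr(pendingKey(userId));
const jobSettled = (userId) => redis.decr(pendingKey(userId));

// The login flow can poll this until it reaches zero before moving on.
async function hasPendingJobs(userId) {
  return Number(await redis.get(pendingKey(userId))) > 0;
}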
Amazon SQS is not the ideal tool for the scenario you describe. A queueing system is normally used in a "Send and Forget" situation, where the sending system doesn't remain interested in later processing.
You could investigate Amazon Simple Workflow (SWF), which allows work to be monitored as it goes through several processes. Your existing code could mostly be re-used, just with the SWF framework added. Or even power it from Lambda, since you are already using node.js.
