Currently I am using the queue option to specify the worker for a Celery task, as follows:
celery.signature("cropper.run", args=["/ngpointdata/NG94_118.ngt", 1, "NGdata_V4_ML_Test", True], priority=3, queue="crop").delay()
But due to the needs of the pipeline I am working on, I have multiple workers consuming the same queue, so I would like to know: is it possible to send a task to one specific worker that shares its queue name with the others but has a different node name?
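For what it's worth, Celery does have a setting aimed at exactly this, if I remember the routing docs correctly: with worker_direct = True every worker also consumes a dedicated queue named after its node (with a .dq suffix), so routing by that queue name reaches one specific worker. The idea can be sketched with stdlib queues; the broker dict, node names and task below are made up for illustration:

```python
import queue

# Hypothetical sketch of per-worker ("direct") queues, the idea behind
# Celery's worker_direct setting: every worker consumes the shared queue
# plus its own node-named queue, so routing to "node1.dq" reaches node1 only.
broker = {
    "crop": queue.Queue(),      # shared queue; any crop worker may take these
    "node1.dq": queue.Queue(),  # dedicated queue for worker node1
    "node2.dq": queue.Queue(),  # dedicated queue for worker node2
}

def send_task(name, args, route):
    """Route a task by queue name, like signature(...).apply_async(queue=...)."""
    broker[route].put((name, args))

def worker_poll(node):
    """A worker drains its dedicated queue first, then the shared one."""
    for q in (f"{node}.dq", "crop"):
        try:
            return broker[q].get_nowait()
        except queue.Empty:
            continue
    return None

send_task("cropper.run", ["/ngpointdata/NG94_118.ngt"], route="node2.dq")
print(worker_poll("node1"))  # node1 never sees node2's direct queue
print(worker_poll("node2"))
```

The shared queue keeps normal load balancing working, while the direct queue gives you a targeted escape hatch for the few tasks that must land on one node.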
Related
I have a problem when I scale my project (NestJS) to multiple instances. The project has a crawler service that runs every 10 minutes. When 2 instances are running, the crawler runs on both, so data gets duplicated. Does anyone know how to handle this?
It looks like it could be handled with a queue, but I don't have a solution yet.
Jobs aren't the right construct in this case.
Instead, use a job Queue: https://docs.nestjs.com/techniques/queues
You won't even need to set up a separate worker server to handle the jobs. Instead, add Redis (or similar) to your setup, configure a queue to use it, then set up 1) a producer module that adds jobs to the queue whenever they need to be run, and 2) a consumer module that pulls jobs off the queue and processes them. If the producer logic runs on both of your machines, add logic to the producer module to ensure that duplicate jobs aren't created.
Conversely, it may be easier just to move job production/processing into a separate server.
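The duplicate check mentioned above can be sketched in a few lines. In a real multi-instance setup you would use an atomic Redis operation (SET with NX) so both instances share the same "already queued" state; here a dict plus a lock stands in for that, and the job list stands in for the queue:

```python
import threading

jobs = []                      # stand-in for the job queue
seen = {}                      # stand-in for Redis; SET key NX is the atomic version
seen_lock = threading.Lock()   # Redis gives you this atomicity for free

def enqueue_once(job_id, payload):
    """Add a job only if no instance has queued it yet (idempotent producer)."""
    with seen_lock:
        if job_id in seen:     # duplicate from the other instance
            return False
        seen[job_id] = True
    jobs.append((job_id, payload))
    return True

# Both instances fire the same scheduled crawl; only one copy is queued.
enqueue_once("crawl:2024-01-01T10:00", {})
enqueue_once("crawl:2024-01-01T10:00", {})
```

The key design choice is deriving the job id from the schedule slot (e.g. the timestamp rounded to 10 minutes), so both instances compute the same id and only one wins.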
I am working on a Node.js project where I need to implement task queueing. I have picked the bullmq and redis packages for this, following the documentation here:
import { Queue, Worker } from 'bullmq'
// Create a new connection in every instance
const myQueue = new Queue('myqueue', { connection: {
host: "myredis.taskforce.run",
port: 32856
}});
const myWorker = new Worker('myworker', async (job)=>{}, { connection: {
host: "myredis.taskforce.run",
port: 32856
}});
After digging deeper into the documentation, I ended up with some questions:
Do I need one worker and queue instance for the whole app? (I think this depends on the kind of tasks and operations you need.) I need one task queue to process payments and another task queue to work on marketing emails. I can't figure out how this would work if we had only one instance of worker and queue. It would work, but it requires setting up identifiers for every kind of operation and acting on each accordingly.
If I were to have many queue and worker instances, how would a worker know which queue it should listen to for tasks? From the code sample above, the worker seems to be named myworker and the queue is called myqueue. How are these two connected? How does the worker know it should listen for jobs from that specific queue without colliding with other queues and workers?
I'm quite new to tasks and queues; any help will be appreciated.
How does a worker know which queue to get tasks from?
The first argument to the Worker is supposed to be the name of the queue that you want it to pull messages from. The code you show is not doing that properly (it passes 'myworker' instead of 'myqueue'). But the doc here explains that.
Do I need one worker and queue instance for the whole app? (I think this depends on the kind of tasks and operations you need.) I need one task queue to process payments and another task queue to work on marketing emails. I can't figure out how this would work if we had only one instance of worker and queue. It would work, but it requires setting up identifiers for every kind of operation and acting on each accordingly.
This really depends upon your design. You could have one queue that holds multiple types of jobs and one worker that processes whatever it finds in the queue.
Or, if you want jobs to be processed concurrently, you can create more than one worker, and those additional workers can even be in different processes.
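There is no collision between the extra workers because a queue hands each job to exactly one consumer. A minimal stdlib sketch, with threads standing in for separate worker processes and queue.Queue standing in for the Redis-backed queue:

```python
import queue
import threading

q = queue.Queue()
processed = []                 # records which worker handled which job
processed_lock = threading.Lock()

def worker(name):
    while True:
        try:
            job = q.get_nowait()   # each job is handed to exactly one worker
        except queue.Empty:
            return                 # queue drained, this worker is done
        with processed_lock:
            processed.append((name, job))

for i in range(10):
    q.put(f"job-{i}")

threads = [threading.Thread(target=worker, args=(f"w{n}",)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the 10 jobs are split across the 3 workers, every job is processed exactly once; adding workers only changes how the load is shared, not what gets processed.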
If I were to have many queue and worker instances, how would a worker know which queue it should listen to for tasks? From the code sample above, the worker seems to be named myworker and the queue is called myqueue. How are these two connected? How does the worker know it should listen for jobs from that specific queue without colliding with other queues and workers?
As explained above, the first argument to the Worker is supposed to be the name of the queue that you want it to pull messages from: in your sample that means new Worker('myqueue', ...), not 'myworker'.
My situation requires that the next message only starts processing after the previous one finishes processing. (The message-processing function is an async function.)
RabbitMQ fits my needs with its Single Active Consumer functionality combined with setting prefetch to 1. About Single Active Consumer:
A queue is declared and some consumers register to it at roughly the same time.
The very first registered consumer becomes the single active consumer: messages are dispatched to it and the other consumers are ignored.
The single active consumer is cancelled for some reason or simply dies. One of the registered consumers becomes the new single active consumer and messages are now dispatched to it. In other terms, the queue fails over automatically to another consumer.
However, RabbitMQ itself cannot be installed via npm.
Are there any alternative queue systems that do this and can be installed via npm?
As in my reply to paulsm4: the reason I'm not considering RabbitMQ at this moment is that I plan to deploy my app to Heroku and would like to keep it as simple as possible, without the need for a third-party RabbitMQ add-on. However, I will use Redis anyway, so any library that depends on Redis is fine.
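The behaviour being asked for (the next message only starts after the previous async handler resolves) is really just a consumer loop with concurrency 1; as far as I can tell from the BullMQ docs, a single Worker with concurrency set to 1 gives you this over Redis. The pattern itself, sketched with stdlib asyncio (no external package assumed):

```python
import asyncio

events = []  # records start/finish order to show handlers never overlap

async def handle(msg):
    """The async message-processing function."""
    events.append(("start", msg))
    await asyncio.sleep(0)      # the async work for this message
    events.append(("finish", msg))

async def consumer(q):
    """Pull the next message only after the previous handler has resolved."""
    while True:
        msg = await q.get()
        if msg is None:         # shutdown sentinel
            return
        await handle(msg)       # awaiting here is what enforces one-at-a-time

async def main():
    q = asyncio.Queue()
    for m in ("a", "b", "c"):
        q.put_nowait(m)
    q.put_nowait(None)
    await consumer(q)

asyncio.run(main())
```

Running a single consumer loop like this is the asyncio equivalent of prefetch 1; the single-active-consumer part (failover between instances) would still need a lock in Redis or similar.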
I was going through the Celery documentation and ran across this:
Warning: Backends use resources to store and transmit results. To ensure that resources are released, you must eventually call get() or forget() on EVERY AsyncResult instance returned after calling a task.
My celery application is indeed using the backend to store task results in the public.celery_taskmeta table in Postgres, and I am sure this warning is relevant to me. I currently have a producer that queues up a bunch of tasks for my workers every X minutes and then moves on to other stuff. The producer is a long-running script that will eventually queue up a bunch of new tasks in RabbitMQ. The workers usually take 5-20 minutes to finish executing a task, because each pulls data from Kafka, Postgres/MySQL, processes that data and inserts it into Redshift. So, for example, this is what my producer is doing:
import celery_workers

for task in task_list:  # task_list can hold up to 100s of tasks
    async_result = celery_workers.delay(task)

# move on and do other stuff
Now my question is: how do I go back and release the backend resources by calling async_result.get() or async_result.forget() (as stated in the Celery docs warning) without having my producer pause/wait for the workers to finish?
Using Python 3.6 and celery 4.3.0
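One pattern that satisfies the warning without blocking the producer: keep the AsyncResult handles in a list and periodically sweep it between other work, calling forget() only on results that are already ready(). Sketched here with stdlib futures standing in for AsyncResult, since the shape of the code is the same:

```python
import concurrent.futures
import time

pending = []  # in the Celery version, AsyncResult handles go here

def sweep(handles):
    """Release finished results; return the still-running ones. Never blocks."""
    still_running = []
    for h in handles:
        if h.done():        # AsyncResult equivalent: result.ready()
            h.result()      # AsyncResult equivalent: result.forget()
        else:
            still_running.append(h)
    return still_running

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for n in range(5):
        pending.append(pool.submit(time.sleep, 0.01))  # stand-ins for tasks
    while pending:          # producer keeps doing other work between sweeps
        pending = sweep(pending)
        time.sleep(0.01)
```

Because ready() is a non-blocking check, the sweep costs almost nothing per iteration and the producer never waits on an unfinished worker.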
Every 10 minutes, several worker roles in Azure are set to process a set of jobs (100+). Some jobs are independent, but others are not. For a (simple) example, a job A must be processed, sent and acknowledged by a receiver before a job B can be sent.
Independent jobs can be put on queues for distribution to worker roles. I wonder if queues could also work for dependent jobs, in order to make a consistent solution.
Edit: I used too simplistic an example. Jobs A and B each consist of several related messages. These messages will be distributed to n worker roles and sent separately, so job A is finished when the n worker roles get acks; only then can the messages of job B (distributed to and processed by m worker roles) be sent.
I think in this case the only option would be to let a single worker role process both jobs A and B; otherwise a complex inter-worker-role synchronization mechanism is needed.
I think you can use queues to facilitate this. One possible solution is to have the worker write another message to the same (or another) queue once job A is finished. The worker picks up the message for job A, processes the job, writes another message saying job A is done, and deletes the original message. Another thread then picks up that message and starts working on job B. If the completion message is posted to the same queue, it needs to convey that it is part of a multi-job chain and which steps have been completed. If it is posted to another queue (e.g. a queue specific to job B), your code knows the message is for job B and should process it accordingly.
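The single-queue variant of the chain described above can be sketched with a stdlib queue, where the completion message for job A is what releases job B; message kinds and job names here are made up for illustration:

```python
import queue

q = queue.Queue()
log = []

q.put(("job", "A"))   # only job A is enqueued up front; B is not yet eligible

while not q.empty():
    kind, name = q.get()
    if kind == "job":
        log.append(f"processed {name}")
        q.put(("done", name))       # write the completion message
    elif kind == "done" and name == "A":
        q.put(("job", "B"))         # job A's completion releases job B
    q.task_done()                   # i.e. delete the original message
```

The same idea scales to the n-message version in the edit: instead of reacting to one "done" message, the dispatcher counts acks and only enqueues job B's messages once all n have arrived.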