How to run multiple instances without duplicate jobs in Node.js

I have a problem when I scale my project (NestJS) to multiple instances. The project has a crawler service that runs every 10 minutes. When 2 instances are running, the crawler runs on both of them, so the data is duplicated. Does anyone know how to handle this?
It looks like this could be handled with a queue, but I don't have a solution yet.

Jobs aren't the right construct in this case.
Instead, use a job queue: https://docs.nestjs.com/techniques/queues
You won't even need to set up a separate worker server to handle the jobs. Instead, add Redis (or similar) to your setup, configure a queue to use it, then set up 1) a producer module that adds jobs to the queue whenever they need to be run, and 2) a consumer module that pulls jobs off the queue and processes them. Since the producer logic runs on both of your machines, add logic to it to ensure duplicate jobs aren't created.
Alternatively, it may be easier to move job production/processing onto its own server.
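To make that concrete, here is a minimal sketch with @nestjs/bull (the 'crawler' queue, the class names, and the empty payload are all invented for illustration). The dedupe trick is that Bull silently ignores add() when a job with the same jobId already exists, so both instances can try to enqueue the same run and only one job is created:

import { Injectable } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';
import { InjectQueue, Processor, Process } from '@nestjs/bull';
import { Queue, Job } from 'bull';

@Injectable()
export class CrawlerProducer {
  constructor(@InjectQueue('crawler') private readonly queue: Queue) {}

  @Cron('*/10 * * * *') // fires every 10 minutes on every instance
  async enqueueCrawl() {
    // Both instances compute the same jobId for the same 10-minute window,
    // so the second add() is a no-op.
    const window = Math.floor(Date.now() / (10 * 60 * 1000));
    await this.queue.add('crawl', {}, { jobId: `crawl-${window}` });
  }
}

@Processor('crawler')
export class CrawlerConsumer {
  @Process('crawl')
  async handleCrawl(job: Job) {
    // the actual crawling work; whichever instance's consumer is free picks it up
  }
}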

Related

Difference between Process managers (pm2), Job Queues (bull) and Message Brokers (RabbitMQ)

I'd like clarification on the differences between process managers (pm2), job queue managers (bull, agenda), and message brokers (RabbitMQ), and their use cases.
Before we start:
I am not asking which is better; I am asking in which cases each of these modules is appropriate and in which cases I should not use them at all.
My own thoughts and sub-questions that I'd like answered are marked in bold after each use case. I understand the whole question is complex, so feel free to write as much as you need. You'll receive my reaction and an upvote if your answer is useful.
As I understand it:
the message broker is responsible only for delivering messages between various microservices (which could be written in different frameworks and languages), but it doesn't have any impact on the processes/tasks themselves.
process and queue managers are responsible for the processes themselves, and if a job has logic like process.on("message", ...), the process can react to input commands. (But if they are also capable of communicating with each other via Node.js worker_threads, we don't need RabbitMQ at all.)
For example, imagine my use cases are the following:
Queue 10 workers, 2 at a time in parallel.
I have one worker.js file with a long and resource-heavy task. I want to run 10 instances of it, but with only 2 running in parallel at any time:
=1=>
=2=>
=3=>
=4=>
=..n1=>
=..n2=>
In that case, as I understand it, pm2 won't help me, but job queue managers like bull will. Am I right here?
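To make the use case concrete, here is roughly what I mean in plain bull (queue name and payloads invented; as I understand it, process() takes a concurrency argument that caps how many jobs one worker runs at once):

import Queue from 'bull';

const heavy = new Queue('heavy-tasks'); // assumes Redis on localhost:6379

async function main() {
  // queue 10 jobs...
  for (let i = 1; i <= 10; i++) {
    await heavy.add({ taskNo: i });
  }
  // ...and work through them at most 2 at a time
  heavy.process(2, async (job) => {
    console.log(`running task ${job.data.taskNo}`);
    // the long, resource-heavy work goes here
  });
}

main();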
Selectively send a command to workers/jobs
But let's imagine that, for now, I want to run all the worker.js tasks in parallel. Great, now they are running but waiting for my input command (it doesn't matter how exactly: via HTTP/WS/CLI/etc.).
What if I want to select only the processes with certain PIDs (or select them by my own criteria) and send commands only to them, and not to the other worker.js instances?
Like this:
User:
run all threads PID1, PID3 print "hello"
Threads:
==1=wait_for_input======="hello"=======>
==2=wait_for_input=====================>
==3=wait_for_input======="hello"=======>
In that case, do I need a message broker? Or will job managers like bull or agenda and their UIs (like bull-board) allow me to selectively send commands only to the necessary jobs in the queue?
In that case, our worker.js instances would be able to react to incoming "messages" via a process.on handler.
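To illustrate, this is the kind of selective messaging I mean, done with bare child_process (all names invented):

// parent.ts: fork 3 workers, then message only the selected ones
import { fork, ChildProcess } from 'child_process';

const workers: ChildProcess[] = [];
for (let i = 0; i < 3; i++) {
  workers.push(fork('./worker.js'));
}

// pick targets by PID (or any criterion of my own)
const targets = new Set([workers[0].pid, workers[2].pid]);
for (const w of workers) {
  if (targets.has(w.pid)) {
    w.send({ cmd: 'print', text: 'hello' }); // only the selected workers print
  }
}

// worker.js just waits for input:
//   process.on('message', (msg) => {
//     if (msg.cmd === 'print') console.log(msg.text);
//   });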
Role of process managers (like pm2 and nodemon)
And if queue managers are capable of running tasks separately / in parallel, why do we need things like pm2 at all? Are process managers replaceable by job managers? Or do we need something like pm2 only to start bull, which will then maintain the queues?

How can I prevent similar queues from running at the same time?

We currently process a set of tasks using queue workers in Laravel. When I run multiple php artisan queue:work processes, jobs end up running concurrently. We are using Beanstalkd as the queue driver.
The issue is that in the queued jobs we are polling an API that only allows one concurrent session for a particular agent_id. That is, only one API call with the same agent_id can run at a time.
We thought of spinning up multiple php artisan queue:work processes, each filtering on a queue name matching an agent_id, but we have over 500 agents, so we would need 500 processes; this is not ideal.
Is there any way to implement a lock-style feature for each agent_id so that if a job is already running for a particular agent_id, the new job is sent back to the queue? Or are there any features of Beanstalkd that would allow for this?
The other option could be to gracefully handle the rejection from the API when the user is already logged in (and send the job back to the queue). But this could get messy and clutter the logs.
You could either run only a single worker that is capable of running the fetch-from-API job, or use some sort of external marshalling/lock service.
The options for that are either an internal rate-limiting system or some kind of shared atomic locking system: a Memcached or Redis server where each worker tries to set a lock key, and only the worker that successfully sets it gets to work on the task. An advantage of that approach is that as soon as the API request has completed, you can remove the lock, and while that worker processes the results, a different worker can make a new request.
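The question is Laravel/Beanstalkd-specific, but the locking pattern itself is stack-agnostic; here is a minimal sketch in Node.js with ioredis (key naming invented). SET ... EX ... NX is atomic, so only one worker per agent_id can hold the lock:

import Redis from 'ioredis';

const redis = new Redis(); // assumes Redis on localhost:6379

// Returns false (caller should requeue the job) if another worker
// already holds the lock for this agent.
async function withAgentLock(agentId: string, fn: () => Promise<void>): Promise<boolean> {
  const key = `lock:agent:${agentId}`;
  // NX: set only if the key doesn't exist; EX 60: auto-expire as a safety net
  const acquired = await redis.set(key, '1', 'EX', 60, 'NX');
  if (acquired !== 'OK') return false;
  try {
    await fn(); // the single-session API call for this agent
  } finally {
    await redis.del(key); // release as soon as the call completes
  }
  return true;
}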

Node/Express: running specific CPU-intensive tasks in the background

I have a site that makes the standard data-bound calls, but also has a few CPU-intensive tasks which are run a few times per day, mainly by the admin.
These tasks involve grabbing data from the DB, running a few different time-consuming algorithms, then re-uploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server, so web workers wouldn't work here. Would a child process be enough? Or should I have a separate thread running in the background handling all /api/admin calls?
The basic answer to this scenario in Node.js land is to use the core cluster module: https://nodejs.org/docs/latest/api/cluster.html
It is an accessible API that lets you:
easily launch worker Node.js instances on the same machine (each instance has its own event loop)
keep a live communication channel for short messages between instances
This way, any work done in a child instance will not block your master event loop.
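A minimal sketch of that shape (Node 16+, where cluster.isPrimary replaced isMaster; the route and message format are invented):

import cluster from 'cluster';
import http from 'http';

if (cluster.isPrimary) {
  const heavyWorker = cluster.fork(); // its own process, its own event loop

  http.createServer((req, res) => {
    if (req.url === '/api/admin/recalculate') {
      heavyWorker.send({ task: 'recalculate' }); // hand off; don't block
      res.end('job started\n');
    } else {
      res.end('ok\n'); // normal data-bound requests stay responsive
    }
  }).listen(3000);
} else {
  process.on('message', (msg) => {
    // the CPU-intensive algorithms run here, then write results back to the DB
  });
}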

gearman and retrying workers with unreliable external dependencies

I'm using gearman to queue a variety of different jobs, some which can always be serviced immediately, and some which can "fail", because they require an unreliable external service. (For example, sending email might require an SMTP server that's frequently unavailable.)
If an external service goes down, I'd like to keep all jobs which require that service on the queue, and retry one job occasionally (every few minutes, say) until the service becomes available again. (Perhaps optionally sending email if the service has not been available for hours.)
However I'd like jobs that don't require a failed service to be passed on to workers as soon as possible. How can this be achieved? (I'm happy to put some of the logic in the workers if necessary, although it seems to be a bit "late" to throttle on the worker side.)
Gearman should already handle this, as long as you have some workers which specialise in handling jobs with unreliable dependencies and don't handle other jobs, along with some workers that either do all jobs or only jobs without unreliable dependencies.
All you would need to do is add some code to the unreliable-dependency workers so that they only accept jobs once they have checked that the dependent service is running; if the service is down, have them wait a bit and retest the service (and continue ad infinitum); once the service is up, have them rejoin the gearmand server, do a job, return the work, retest the service, and so on.
While the dependent service is down, the workers that don't handle jobs needing that service will keep trundling through the job queue for the other jobs. Gearmand won't block an entire job queue (or worker) on one job type if there are workers available to handle other job types.
The key is to be sensible about how you define your job types and workers.
EDIT:
Ah-ha, I knew my thinking was a little off (I wrote my gearman system about a year ago and haven't really touched it since). My solution to this type of issue was to have all the workers that normally handle the dependent job unregister their dependent-job handling capability with the gearmand server once a failure is detected in the dependent service (and any workers currently trying to complete such a job should return a failure). Once the service is back up, have those same workers re-register their ability to handle that job. Do note that this requires another channel of communication for the workers to be notified of the status of the dependent services.
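A sketch of that register/unregister dance (the client interface here is hypothetical; adapt it to whichever Gearman library you use, and checkService() stands in for your own health probe):

interface GearmanWorkerLike {
  register(fn: string, handler: (payload: unknown) => Promise<void>): void;
  unregister(fn: string): void;
}

async function checkService(): Promise<boolean> {
  // e.g. connect to the SMTP server and issue a NOOP
  return true; // stub
}

async function handleEmailJob(payload: unknown): Promise<void> {
  // ...send the email via the (now healthy) SMTP server
}

function supervise(worker: GearmanWorkerLike): void {
  let registered = false;
  setInterval(async () => {
    const up = await checkService();
    if (up && !registered) {
      worker.register('send_email', handleEmailJob); // start accepting again
      registered = true;
    } else if (!up && registered) {
      worker.unregister('send_email'); // jobs simply wait in gearmand meanwhile
      registered = false;
    }
  }, 60_000); // retest every minute
}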
Hope this helps

How to run a sub-task only on one Worker Role instance

I have two instances of a worker role.
I want to run a sub-task (on a Thread Pool thread) only on one of the Worker Role instances.
My initial idea was to do something like this:
ThreadPool.QueueUserWorkItem((o) =>
{
    if (RoleEnvironment.CurrentRoleInstance.Id ==
        RoleEnvironment.Roles[RoleEnvironment.CurrentRoleInstance.Role.Name].Instances.First().Id)
    {
        emailWorker.Start();
    }
});
However, the above code relies on the Role.Instances collection always returning the instances in the same order. Is this the case? Or can the items be returned in any order?
Is there another approved way of running a task on one role instance only?
Joe, the solution you are looking for typically relies on:
either acquiring a lease (similar to a lock, but with an expiration) on a specific blob, using Blob Storage as a synchronization point between your role instances,
or queuing/dequeuing a message from Queue Storage, which is usually the suggested pattern to delay long-running operations such as sending an email.
Either way, you need to go through Azure Storage to make it work. I suggest having a look at Lokad.Cloud, as we designed this open-source framework precisely to handle this sort of situation.
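The original is .NET, but the lease pattern looks the same from any SDK; a sketch with the Node @azure/storage-blob package (container/blob names invented, and the lock blob is assumed to already exist):

import { BlobServiceClient } from '@azure/storage-blob';

async function tryRunExclusive(task: () => Promise<void>): Promise<void> {
  const service = BlobServiceClient.fromConnectionString(
    process.env.AZURE_STORAGE_CONNECTION_STRING!,
  );
  const blob = service.getContainerClient('locks').getBlobClient('email-worker');
  const lease = blob.getBlobLeaseClient();
  try {
    await lease.acquireLease(60); // 15-60 seconds; renew if the task runs longer
  } catch {
    return; // another instance holds the lease; do nothing
  }
  try {
    await task(); // e.g. start the email worker
  } finally {
    await lease.releaseLease();
  }
}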
If they need to be doing different things, then it sounds to me like you don't have 2 instances of a single worker role; in reality you have 2 different worker roles.
Especially when looking at the scalability of your application, processes need to be able to run on more than one instance. What happens when the task that you only want to run on one role instance grows large enough that it needs to scale to 2 or more instances?
One of the benefits of developing for Azure is that you get scalability almost automatically if you design your app properly. It makes you do extra work to get something that's not scalable, which is what you're trying to do.
What triggers this task to be started? If you use a message on Queue Storage (as suggested by Joannes), then only one worker role instance will pick up the message and process it, and it doesn't matter which instance of your worker role does that.
So for now, if you've got one worker role that's doing the sub-task and another worker role that's doing everything else, just add 2 worker roles to your Azure solution. However, even if you do that, the worker role that processes the sub-task should be written in such a way that if you ever scale it to more than a single instance, it will still run properly. In that case, you might as well stick with the single worker role and process messages off the queue to start your sub-task.
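For instance, with Azure Storage queues (sketched here with the Node @azure/storage-queue package; the queue name is invented), every instance can poll the same queue, yet each message is delivered to only one of them:

import { QueueClient } from '@azure/storage-queue';

const queue = new QueueClient(
  process.env.AZURE_STORAGE_CONNECTION_STRING!,
  'email-tasks',
);

async function pollOnce(): Promise<void> {
  const { receivedMessageItems } = await queue.receiveMessages({
    numberOfMessages: 1,
    visibilityTimeout: 300, // hidden from other instances while we work
  });
  for (const msg of receivedMessageItems) {
    // ...run the sub-task described by msg.messageText
    await queue.deleteMessage(msg.messageId, msg.popReceipt); // done; remove it
  }
}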
