Job processing with a traditional worker or using a Node server?

I want to be able to process background jobs, each of which may have multiple tasks associated with it. Tasks may consist of launching API requests (blocking operations) and manipulating and persisting the responses. Some of these tasks could also have subtasks that must be executed asynchronously.
In a language such as Ruby, I might use a worker to execute the jobs. As I understand it, every time a new job arrives in the queue, a new thread executes it. As I mentioned, a task could contain a series of subtasks to be executed asynchronously, so as I see it, I have two options:
Add the subtask executions to the worker queue (but a job could easily have lots of subtasks, which would fill the queue quickly and block new jobs from being processed).
Use an event-driven Node server to handle job execution. I would not need to add subtasks to a queue, since a single Node server should be able to handle a job's entire execution asynchronously (see the sketch after this question). Is there something wrong with doing this?
This is the first time I have encountered this kind of problem, and I want to know which approach is better suited to solve my issue.
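
A minimal sketch of what that second option could look like, assuming I/O-bound subtasks and hypothetical callApi/persist stubs:

// Sketch of option 2: a single Node process drives a job's entire
// execution, fanning its I/O-bound subtasks out concurrently on the
// event loop. callApi and persist are hypothetical stubs.
const callApi = async (id: string) => ({ subtasks: [] as string[], body: id });
const persist = async (result: unknown) => { /* write to storage */ };

async function runJob(taskIds: string[]): Promise<void> {
  for (const taskId of taskIds) {
    const response = await callApi(taskId); // waits, but only blocks this job
    // Subtasks fan out concurrently; nothing extra goes onto a worker queue.
    await Promise.all(
      response.subtasks.map(async (subId) => persist(await callApi(subId))),
    );
  }
}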

Solution for user-specific background job queues

I have been researching how to efficiently solve the following use case and I am struggling to find the best solution.
Basically I have a Node.js REST API which handles requests for users from a mobile application. We want some requests to launch background tasks outside of the req/res flow, because they are CPU-intensive or might just take a while to execute. We are trying to implement, or use an existing framework, that can handle different job queues in the following way (or at least be compatible with the use case):
Every user has their own set of job queues (there are different kinds of jobs).
The jobs within one specific queue have to be executed sequentially, one job at a time, but everything else can be executed in parallel (preferably with no queue hogging the workers, so that all queues get more or less the same priority).
Some queues might fill up with hundreds of tasks at a given time but most likely they will be empty a lot of the time.
Queues need to be persistent.
We currently have a solution with RabbitMQ, with one queue for every kind of task, shared by all the users. The users dump tasks into the same queues, which results in the queues filling up with tasks from a single user for long stretches while the rest of the users wait for those tasks to be done before their own start to be consumed. We have looked into priority queues, but we don't think that's the way to go for our use case.
The first somewhat logical solution we thought of is to create temporary queues whenever a user needs to run background jobs and have them be deleted when empty. Nevertheless, we are not sure that having that many queues is scalable, and we are also struggling with dynamically creating RabbitMQ queues, exchanges, etc. (we have even read somewhere that it might be an anti-pattern?).
We have been doing some more research, and maybe the way to go would be something like Kafka, or a Redis-based solution such as BullMQ.
What would you recommend?
If you're on AWS, have you considered SQS? There is no limit on the number of standard queues you can create, and the number of in-flight messages can reach up to 120k. That would seem to satisfy your requirements above.
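For context, a rough sketch of the kind of per-user SQS setup that suggestion implies (AWS SDK for JavaScript v3; the queue-naming scheme is made up for illustration):

// Sketch: one standard SQS queue per user, created on demand.
// Queue naming and payload shape are illustrative assumptions.
import {
  SQSClient,
  CreateQueueCommand,
  SendMessageCommand,
  ReceiveMessageCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

async function enqueueForUser(userId: string, payload: object) {
  // CreateQueue returns the existing queue's URL if it already exists
  // with the same attributes, so this doubles as "get or create".
  const { QueueUrl } = await sqs.send(
    new CreateQueueCommand({ QueueName: `jobs-${userId}` }),
  );
  await sqs.send(
    new SendMessageCommand({ QueueUrl, MessageBody: JSON.stringify(payload) }),
  );
}

async function pollUserQueue(queueUrl: string) {
  // Long polling (20s) reduces empty responses; fetching one message at a
  // time keeps a user's jobs sequential, but every queue must be polled.
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    }),
  );
  return Messages ?? [];
}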
While the suggested SQS solution did prove to be very scalable, the amount of polling we would need to do (or the use of SNS) kept it from being optimal for us. On the other hand, a self-made solution based on database polling was too much for our use case, and we did not have the time or computational resources to consider adding a new database to our stack.
Luckily, we ended up finding that the Pro version of BullMQ has a "Groups" feature that applies a round-robin strategy to different groups of tasks within a single queue. This fit our use case perfectly and is what we ended up using (see the sketch below).
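
For reference, a minimal sketch of what the groups setup looks like (this assumes the paid @taskforcesh/bullmq-pro package; treat the exact option names as approximate, per its docs):

// Sketch of BullMQ Pro "groups": one queue, one group per user,
// groups consumed round-robin so no single user starves the rest.
import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };
const queue = new QueuePro('user-jobs', { connection });

async function enqueue(userId: string, payload: object) {
  // The group id routes the job into that user's group.
  await queue.add('process', payload, { group: { id: userId } });
}

const worker = new WorkerPro(
  'user-jobs',
  async (job) => {
    // ... actual work on job.data here ...
  },
  {
    connection,
    concurrency: 10,           // parallelism across groups
    group: { concurrency: 1 }, // but strictly sequential within a group
  },
);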

Single job store, single started scheduler and multiple read-only worker processes

I ran into this FAQ indicating that sharing a persistent job store among two or more processes will lead to incorrect scheduler behavior:
How do I share a single job store among one or more worker processes?
My question is: if there's only a single worker scheduler that has been started via .start(), and another scheduler process is initialized on the same persistent sqlite jobstore only to print the trigger of a certain job_id (it never invokes .start()), could that lead to incorrect scheduler behavior?
Using apscheduler 3.6.3
Yes. First of all, the scheduler has to be started for it to return the list of persistently stored jobs. Another potential issue is that the current APScheduler version deletes, on retrieval, any job for which it cannot find the corresponding task function. This behavior was initially added to clear out obsolete jobs, but it was in retrospect ill-conceived and will be removed in v4.0.
On the upside, it is possible to start the scheduler in paused mode so it won't try to run any jobs but will still give you the list of jobs, so long as all the target functions are importable.

Handle long-running processes in NodeJS?

I've seen some older posts touching on this topic but I wanted to know what the current, modern approach is.
The use case is: (1) assume you want to run a long task, say processing a video file or running jspm install, something that can take up to 60 seconds; (2) you can NOT subdivide the task.
Other requirements include:
need to know when a task finishes
nice to be able to stop a running task
stability: if one task dies, it doesn't bring down the server
needs to be able to handle 100s of simultaneous requests
I've seen these solutions mentioned:
nodejs child process
webworkers
fibers - not suited to CPU-bound tasks
generators - not suited to CPU-bound tasks
https://adambom.github.io/parallel.js/
https://github.com/xk/node-threads-a-gogo
any others?
Which is the modern, standard-based approach? Also, if nodejs isn't suited for this type of task, then that's also a valid answer.
The short answer is: Depends
If you mean a Node.js server, then the answer is no for this use case. Node.js's single-threaded event loop can't handle CPU-bound tasks, so it makes sense to outsource the work to another process or thread. However, for this use case, where the CPU-bound task runs for a long time, it also makes sense to find some way of queueing tasks, i.e., to use a worker queue.
However, for this particular use case of running JS code (the jspm API), it makes sense to use a worker queue that itself runs on Node.js. Hence the solution: (1) use a Node.js server that does nothing but put tasks on the worker queue, and (2) use a Node.js worker queue (like kue) to do the actual work, with cluster to spread the work across different CPUs (see the sketch after the notes below). The result is a simple, single server that can handle hundreds of requests without choking. (Well, almost; see the note below...)
Note:
the above solution uses processes. I did not investigate thread solutions, because threads seem to have fallen out of favor for Node.
the worker queue + cluster give you the equivalent of a thread pool.
yeah, in the worst case the 100th parallel request will take 25 minutes to complete on a 4-core machine. The solution is to spin up another worker-queue server (if I'm not mistaken, with a db-backed worker queue like kue this is trivial: just point each server at the same db).
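
As an illustration of the process-based approach (a sketch only; it assumes a hypothetical ./task-runner.js script that does the heavy work, not kue's actual API):

// Sketch: run each CPU-bound task in its own child process so the main
// event loop stays responsive. ./task-runner.js is a hypothetical script
// that receives the payload over IPC, does the work, and exits.
import { fork } from 'node:child_process';

function runTask(payload: object): Promise<number | null> {
  return new Promise((resolve, reject) => {
    const child = fork('./task-runner.js');
    child.send(payload);                       // hand the task over
    child.on('exit', (code) => resolve(code)); // know when it finishes
    child.on('error', reject);                 // a crash won't take down the server
    // child.kill() gives you a way to stop a running task.
  });
}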
You're mentioning a CPU-bound task, and a long-running one, that's definitely not a node.js thing. You also mention hundreds of simultaneous tasks.
You might take a look at something like Gearman job server for things like that - it's a dedicated solution.
Alternatively, you can still have Node.js manage the requests, just not do the actual job execution.
If it's relatively acceptable to have lower than optimal performance, and you want to keep your code in JavaScript, you can still do it, but you should have some sort of job queue; something like Redis or RabbitMQ comes to mind (a Redis-backed example is sketched after this answer).
I think a job queue will be a must-have for long-running tasks arriving at hundreds per second, regardless of your runtime. The exception is if you can spawn the jobs on other servers/services/machines: then you don't care, your Node.js API is just a front and management layer for the job cluster, Node.js is perfectly OK for that job, and you should focus on the job cluster instead (and could then ask a more specific question).
Now, Node.js can still be useful here: it can help manage and hold those hundreds of tasks, depending on where they come from (i.e., you might only allow requests through to your job server for certain users, or limit the "pause" functionality to others, etc.).
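To make that concrete, here is a minimal Redis-backed sketch using the open-source BullMQ package mentioned earlier on this page (names and numbers are illustrative):

// Sketch: the Node.js API enqueues jobs and responds immediately; a
// separate worker process consumes the queue and does the heavy lifting.
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// API side: enqueue the job and return.
const jobs = new Queue('long-tasks', { connection });
async function submit(payload: object) {
  await jobs.add('run', payload);
}

// Worker side (ideally a separate process): consume the queue.
const worker = new Worker(
  'long-tasks',
  async (job) => {
    // ... the actual 60-second work on job.data goes here ...
  },
  { connection, concurrency: 4 }, // at most 4 jobs at once per worker
);
worker.on('completed', (job) => console.log(`job ${job.id} finished`));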
You can easily run long-running processes concurrently using a simple concurrent queue: create your own custom ConcurrentExecutor with a concurrency limit, and all your long-running processes run in concurrent mode. Feel free to improve it and share feedback.
For an overview, you can have a look at:
Concurrent Process Executor Queue
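
A minimal sketch of the idea behind such an executor (the class and names here are made up for illustration, not the linked project's actual API):

// Sketch of a hand-rolled concurrency-limited executor: at most `limit`
// tasks run at once; the rest wait in FIFO order for a free slot.
class ConcurrentExecutor {
  private waiters: Array<() => void> = [];
  private running = 0;

  constructor(private limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.running < this.limit) {
      this.running++; // claim a free slot immediately
    } else {
      // wait for a finishing task to hand over its slot
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next();    // pass our slot to the next waiter
      else this.running--; // or release it
    }
  }
}

// Usage: at most 4 long-running tasks in flight at once.
const executor = new ConcurrentExecutor(4);
// executor.run(() => someLongRunningPromise()) resolves with the task's result.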

using .NET 4 Tasks instead of ThreadPool.QueueUserWorkItem

I've been reading a bunch of articles about the new TPL in .NET 4. Most of them recommend using Tasks as a replacement for ThreadPool.QueueUserWorkItem. But from what I understand, tasks are not threads. So what happens in the following scenario, where I want a producer/consumer queue using the new BlockingCollection class in .NET 4:
The queue is initialized with a parameter (say 100) to indicate the number of worker tasks, and Task.Factory.StartNew() is called to create them.
When a new work item is added to the queue, a consumer takes it and executes it.
Based on the above, there seems to be a limit on how many tasks you can execute at the same time, whereas with ThreadPool.QueueUserWorkItem the CLR uses the thread pool with its default pool size.
Basically, what I'm trying to figure out is whether using Tasks with a BlockingCollection is appropriate in a scenario where I want to create a Windows service that polls a database for jobs that are ready to run. If a job is ready to be executed, a timer in the Windows service (my only producer) adds a new work item to the queue, where the work is then picked up and executed by a worker task.
Does it make sense to use a producer/consumer queue in this case? And how many worker tasks should there be?
I am not sure whether a producer/consumer queue is the best pattern to use here, but with respect to the threads issue:
As I understand it, .NET 4 Tasks do still run on threads; however, you do not have to worry about scheduling those threads yourself, as .NET 4 provides a nice interface for it.
The main advantages of using tasks are:
You can queue up as many of these as you want without the overhead of 1 MB of memory for each queued work item that you would pass to ThreadPool.QueueUserWorkItem.
It also manages which threads and processors your tasks will run on to improve data flow and caching.
You can build in a hierarchy of dependencies for your tasks.
It will automatically use as many of the cores available on your machine as possible.

Eclipse RCP: Only one Job runs at a time?

The Jobs API in Eclipse RCP apparently works much differently than I expected. I thought that creating and scheduling multiple Jobs would actually cause multiple worker threads to be created, executing the Jobs in parallel unless there was an ISchedulingRule conflict.
I went back and read the documentation more closely, and also discovered this comment in the JobManager class:
/**
* Returns a running or blocked job whose scheduling rule conflicts with the
* scheduling rule of the given waiting job. Returns null if there are no
* conflicting jobs. A job can only run if there are no running jobs and no blocked
* jobs whose scheduling rule conflicts with its rule.
*/
Now it looks to me like the Job manager will only ever attempt to use one background worker thread. Am I completely wrong about this? If I'm right,
what is the point of scheduling rules and locks? If there is only one worker thread, Jobs can never preempt each other. Wouldn't these only ever be used in case a Job's sleep() method is called (e.g. sleeping while holding a Lock)?
does any part of the platform allow two Jobs to actually run concurrently, on multiple worker threads, thus making the above features useful somehow?
What am I missing here?
Take a look at the run method in the documentation, specifically this part:
Jobs can optionally finish their execution asynchronously (in another thread) by returning a result status of ASYNC_FINISH. Jobs that finish asynchronously must specify the execution thread by calling setThread, and must indicate when they are finished by calling the method done.
ASYNC_FINISH there looks interesting.
AFAIK, creating and scheduling multiple Jobs DOES actually cause multiple worker threads to be created and the Jobs to be executed in parallel.
However, if you specify an optional scheduling rule for your job (using the setRule() method) and that rule conflicts with another job's scheduling rule, then those two jobs can't run simultaneously.
This Eclipse Corner article provides a good description as well as a few code samples for the Eclipse Job API.
The IJobManager API is only needed for advanced job manipulation, e.g. when you need to use locks, synchronize between several jobs, terminate jobs, etc.
Note: Eclipse 4.5M4 (Q4 2014) will now include support for Job Groups with throttling.
See bug 432049:
Eclipse provides a simple Jobs API to perform different tasks in parallel and in an asynchronous fashion. One limitation of Eclipse Jobs is that there is no easy way to limit the number of worker threads being used to execute jobs.
This may lead to a thread pool explosion when many jobs are scheduled in quick succession. Because of that, it's easy to use Jobs to perform different unrelated tasks in parallel, but hard to implement thousands of Jobs co-operating to complete a single large task.
Eclipse currently supports the concept of Job Families, which provides one way of grouping, with support for join, cancel, sleep, and wakeup operations on the whole family.
To address all these issues we would like to propose a simple way to group a set of Eclipse Jobs that are responsible for pieces of the same large task.
The API would support throttling, join, cancel, combined progress and error reporting for all of the jobs in the group, and the job grouping functionality could be used to rewrite performance-critical algorithms to use parallel execution of cooperating jobs.
You can see the implementation in this commit 26471fa
