Does anyone have suggestions or resources on how to scale parallel task processing with fair dispatch and consumption using RabbitMQ? There can be over 200 users with 2,000+ tasks each, and I want the tasks to be processed and consumed fairly across users while each user's tasks can still run in parallel. Each task takes about 1s.
Below is how the app currently handles this, but I feel like there is a much better way.
A quick overview of what the app currently does: a user can create lists of todo tasks, and each task takes about 1.5s at most. We process these tasks in parallel using RabbitMQ. There are 6 workers per server, each with a prefetch value of 1, and about 200 queues (task_queue_1, task_queue_2, and so on), where each queue is dedicated to a range of todo tasks.
For example, task_queue_1 is dedicated to the first 10 items in a list, and so on. Items 1-10 are assigned message priorities from 10 down to 1 to ensure the first task gets processed first, and so on.
The queues themselves also have priorities. Workers pull from task_queue_1 first, but if messages in task_queue_2 have been waiting for more than one minute, that queue gets a higher priority, and so on. This is handled through our own stack and ordering logic.
This has been working fine, but I've realized it isn't really scalable. If a bunch of users submit lists with over 2k tasks, some users are forced to wait for their tasks to be processed because the workers are busy with task_queue_2, task_queue_3, etc., whose wait times have exceeded one minute.
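For reference, each worker in this setup is essentially RabbitMQ's standard fair-dispatch consumer. A simplified sketch with the .NET RabbitMQ client (the queue name, the x-max-priority argument, and ProcessTask stand in for the real code):

    using System;
    using System.Collections.Generic;
    using RabbitMQ.Client;
    using RabbitMQ.Client.Events;

    class Worker
    {
        static void Main()
        {
            var factory = new ConnectionFactory { HostName = "localhost" };
            using var connection = factory.CreateConnection();
            using var channel = connection.CreateModel();

            // Priority queue: messages within it carry priorities 10..1.
            channel.QueueDeclare("task_queue_1", durable: true, exclusive: false,
                autoDelete: false,
                arguments: new Dictionary<string, object> { ["x-max-priority"] = 10 });

            // prefetch = 1: a worker is only handed the next message after
            // acking the current one, which spreads the ~1s tasks across workers.
            channel.BasicQos(prefetchSize: 0, prefetchCount: 1, global: false);

            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, ea) =>
            {
                ProcessTask(ea.Body.ToArray());                  // ~1-1.5s of work
                channel.BasicAck(ea.DeliveryTag, multiple: false);
            };
            channel.BasicConsume("task_queue_1", autoAck: false, consumer);

            Console.ReadLine(); // keep the worker alive
        }

        static void ProcessTask(byte[] body) { /* the actual todo-task work */ }
    }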
In my backend, I have about 5k users. Every day at 12pm UTC, I need to run a background Node task that carries out some real-time calculations on each user's data and sends the results to the user by email/SMS/notification, etc. This calculation is a bit intensive and takes a few seconds per user.
Since 5k is a large number, I've created a schedule such that an API endpoint is called every minute and processes 5 users per call. The problem with processing 5 users per minute is that, at this speed, it takes about 16 hours to get through all 5,000 users. At the same time, the number of users is growing; soon enough, even 24 hours will not be enough.
What's the alternative? How can I process all these 5k users or even 10k users in a much shorter duration?
Without seeing the relevant code, or a detailed description of exactly what this background Node task is doing, we can't advise specifically on which approach or approaches to take; the answer depends entirely on understanding that task in detail.
If the Node.js background task is CPU-bound, then you will need to involve more CPUs in the processing, either with the child_process module or with worker_threads.
If the Node.js background task is database-limited, then you will need to scale your database to handle more requests in less time, or redesign how the relevant data is stored to make access more efficient.
If nearly all the processing is asynchronous, and neither CPU-bound nor database-limited, then you probably need to process N users in parallel.
As with any performance problem, it's not good to guess where the biggest bottleneck is. You need to instrument and measure! I would suggest that you start by instrumenting the processing of one user to find out exactly where all the time is going. Then start with the longest pole in the tent and dive into exactly why it's taking that much time. See how much you can improve that one item, then move on to the next one.
Once you've made the processing of a single user as fast as it can be, work on ways to process N users at a time and make sure your database can scale with that. By dividing the users across more processes and processing N at a time in each process, you can scale the calculation work as much as you want, but you will have to make sure your database can keep up with the additional load from all those separate processes.
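As a sketch of that last step, here is one way to cap the work at N users in flight at once (shown in C# for concreteness; the same shape works in Node with a pool of promises). ProcessUserAsync is a hypothetical stand-in for your per-user calculation:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    class BatchProcessor
    {
        // Run the per-user calculation for every user, at most n at a time.
        static async Task ProcessAllAsync(IReadOnlyList<int> userIds, int n)
        {
            using var gate = new SemaphoreSlim(n);
            var tasks = userIds.Select(async id =>
            {
                await gate.WaitAsync();
                try { await ProcessUserAsync(id); }     // hypothetical per-user work
                finally { gate.Release(); }
            }).ToArray();
            await Task.WhenAll(tasks);
        }

        static Task ProcessUserAsync(int userId)
        {
            // placeholder for the real calculation + email/sms/notification
            return Task.Delay(TimeSpan.FromSeconds(2));
        }
    }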
In a project I'm currently working on, I issue Celery tasks every once in a while. These tasks are for specific clients, so there are tasks for, e.g., clientA, clientB, and clientC. There are some additional conditions:
Tasks for the same client may never be executed in parallel.
Tasks for the same client must be executed in sequence, i.e. message queue order.
The Celery cookbook (see also this article) shows a locking mechanism that ensures a single task can only be executed one at a time. This mechanism could easily be adapted to ensure that only one task per client executes at a time, which satisfies the first condition.
The second condition is harder to ensure. Since tasks are generated from different processes, I can't use task chaining. Perhaps I could modify the locking mechanism to retry tasks while they are waiting for the lock, but this still cannot guarantee order (due to retry timeouts, and also due to a race condition in acquiring the lock).
For now, I have limited my concurrency to 1 to ensure order, but some of the tasks take a long time and this scales quite badly.
a) I have a task which I want the server to do every X hours for every user (~5000 users). Is it better to:
1 - Create a worker thread for each user that does the task, sleeps for X hours, then starts again, with each thread starting at a random time (so that most threads are sleeping at any given moment).
2 - Create one thread that loops through the users, does the task for each user, then starts again (even if one pass takes more than X hours).
b) If plan 1 is used, do sleeping threads affect the performance of the server?
c) If they do, does a sleeping thread have the same effect as a thread that is actively doing the task?
Note that this server is not only used for this task. It is used for all the communications with the ~5000 clients.
Sleeping threads generally do not affect CPU usage, but each one consumes about 1 MB of stack memory. That is not a big deal for dozens of threads; it is a big deal for 5,000 threads.
Have one thread or timer dedicated to triggering the periodic work. Once per interval, process the users. You can use parallelism if you want: process the users using Parallel.ForEach or any other technique you like.
Whether you choose a thread or a timer doesn't matter for CPU usage in any meaningful way. Do what fits your app best.
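A minimal sketch of that shape, where GetAllUsers and DoTaskForUser are hypothetical stand-ins for your own code:

    using System;
    using System.Threading.Tasks;

    class HourlyWork
    {
        // Called once per interval by the single dedicated thread/timer.
        static void ProcessAllUsers()
        {
            var options = new ParallelOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount
            };
            Parallel.ForEach(GetAllUsers(), options, DoTaskForUser);
        }

        static int[] GetAllUsers() => Array.Empty<int>(); // placeholder
        static void DoTaskForUser(int userId)
        {
            // the actual per-user work goes here
        }
    }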
There are not enough details about your issue for a complete answer. However, based on the information you provided, I would:
create a timer (e.g. System.Threading.Timer)
set an interval, which will be the time between runs over a "batch" of 5,000 users
Then, say the method/task you want to perform is called UpdateUsers:
when timer "ticks", in the UpdateUsers method (callback):
1. stop timer
2. loop and perform task for each user 3. start timer
This way you ensure that the task is performed for each user and there is no overlap if a full pass takes more than X hours. The updates will happen every Y, where Y is the interval you set on your timer. Also, this uses at most one thread, depending on how your server/service is coded.
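A minimal sketch of that stop/process/restart pattern with System.Threading.Timer, where GetAllUsers and DoTask are placeholders:

    using System;
    using System.Threading;

    class UserUpdater
    {
        private readonly Timer _timer;
        private readonly TimeSpan _interval = TimeSpan.FromHours(1); // "Y"

        public UserUpdater()
        {
            // One-shot timer: an Infinite period means it won't fire again
            // until we explicitly re-arm it at the end of the callback.
            _timer = new Timer(UpdateUsers, null, _interval, Timeout.InfiniteTimeSpan);
        }

        private void UpdateUsers(object state)
        {
            // 1. the timer is effectively stopped, so runs can't overlap
            foreach (var user in GetAllUsers())
                DoTask(user);                            // 2. per-user work

            // 3. restart the timer only once the whole batch is done
            _timer.Change(_interval, Timeout.InfiniteTimeSpan);
        }

        private int[] GetAllUsers() => Array.Empty<int>(); // placeholder
        private void DoTask(int user) { }                  // placeholder
    }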
I've got a service that runs scans of various servers. The networks in question can be huge (hundreds of thousands of network nodes).
The current version of the software uses a queueing/threading architecture we designed ourselves. It works, but isn't as efficient as it could be (not least because jobs can spawn child jobs, which isn't handled well).
V2 is coming up and I'm considering using the TPL. It seems like it should be ideally suited.
I've seen this question, the answer to which implies there's no limit to the number of tasks the TPL can handle. In my simple tests (spin up 100,000 tasks and hand them to the TPL), the TPL barfed fairly early on with an out-of-memory exception (fair enough, especially on my dev box).
The scans take a variable length of time, but 5 minutes per task is a good average.
As you can imagine, scans for huge networks can take a considerable length of time, even on beefy servers.
I've already got a framework in place which allows the scan jobs (stored in a Db) to be split between multiple scan servers, but the question is how exactly I should pass work to the TPL on a specific server.
Can I monitor the size of TPL's queue and (say) top it up if it falls below a couple of hundred entries? Is there a downside to doing this?
I also need to handle the situation where a scan needs to be paused. This seems easier to do by not giving the work to the TPL in the first place than by cancelling/resetting tasks which may already be partially processed.
All of the initial tasks can be run in any order. Children must run after their parent has started executing, but since the parent spawns them, this should never be a problem. Children can also run in any order. Because of this, I'm currently envisioning that child tasks be written back to the Db rather than spawned directly into the TPL. This would allow other servers to "work steal" if required.
Has anyone had any experience with using the TPL in this way? Are there any considerations I need to be aware of?
TPL is about starting small units of work and running them in parallel. It is not about monitoring, pausing, or throttling this work.
You should see TPL as a low-level tool to start "work" and to synchronize threads.
Key point: TPL tasks != logical tasks. In your case, the logical tasks are scan tasks ("scan an IP range from x to y"). Such a logical task should not correspond one-to-one to a physical System.Threading.Tasks.Task, because the two are different concepts.
You need to schedule, orchestrate, monitor and pause the logical tasks yourself because TPL does not understand them and cannot be made to.
Now the more practical concerns:
TPL can certainly start 100k tasks without OOM. The OOM happened because your tasks' code exhausted memory.
Scanning networks sounds like a great case for asynchronous code because while you are scanning you are likely to wait on results while having a great degree of parallelism. You probably don't want to have 500 threads in your process all waiting for a network packet to arrive. Asynchronous tasks fit well with the TPL because every task you run becomes purely CPU-bound and small. That is the sweet spot for TPL.
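To make "orchestrate the logical tasks yourself" concrete, here is one rough shape for it: a pump that keeps about maxInFlight scans running, tops itself up from your Db (your "monitor the queue and top it up" idea), and pauses simply by not dequeuing new jobs. FetchNextJobAsync and RunScanAsync are placeholders for your own job store and scan logic:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ScanJob { public int Id; }

    class ScanOrchestrator
    {
        public volatile bool Paused; // flip to stop handing new work to the TPL

        public async Task RunAsync(int maxInFlight)
        {
            var inFlight = new HashSet<Task>();
            while (true)
            {
                // Top up while below the cap; pausing just stops dequeuing,
                // it never cancels work that is already running.
                while (!Paused && inFlight.Count < maxInFlight)
                {
                    ScanJob job = await FetchNextJobAsync(); // hypothetical Db read
                    if (job == null) break;                  // queue empty for now
                    inFlight.Add(RunScanAsync(job));
                }

                if (inFlight.Count == 0)
                {
                    await Task.Delay(TimeSpan.FromSeconds(5)); // idle poll
                    continue;
                }

                Task finished = await Task.WhenAny(inFlight);
                inFlight.Remove(finished);
            }
        }

        Task<ScanJob> FetchNextJobAsync() =>
            Task.FromResult<ScanJob>(null);              // placeholder

        async Task RunScanAsync(ScanJob job) =>
            await Task.Delay(TimeSpan.FromMinutes(5));   // placeholder scan
    }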
My question might sound a bit naive but I'm pretty new with multi-threaded programming.
I'm writing an application which processes incoming external data. For each piece of data that arrives, a new task is created in the following way:
System.Threading.Tasks.Task.Factory.StartNew(() => methodToActivate(data));
The data items arrive very fast (every second, half second, etc.), so many tasks are created. Handling each task might take around a minute. When testing, I saw that the number of threads keeps increasing. How can I limit the number of tasks created so that the number of actual worker threads stays stable and efficient? My computer is only dual core.
Thanks!
One of your issues is that the default scheduler sees tasks that last for a minute and assumes they are blocked on other tasks that have yet to be executed. To try to unblock things, it schedules more pending tasks, hence the thread growth. There are a couple of things you can do here:
Make your tasks shorter (probably not an option).
Write a scheduler that deals with this scenario and doesn't add more threads.
Use SetMaxThreads to prevent unbounded thread pool growth.
See the section on Thread Injection here:
http://msdn.microsoft.com/en-us/library/ff963549.aspx
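For example (a sketch; note the pool will refuse a cap below the machine's processor count):

    using System;
    using System.Threading;

    class PoolCap
    {
        static void Main()
        {
            // Keep the existing I/O completion cap; lower only the worker cap.
            ThreadPool.GetMaxThreads(out int workers, out int io);
            ThreadPool.SetMaxThreads(Environment.ProcessorCount, io);
        }
    }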
You should look into using the producer/consumer pattern with a BlockingCollection<T> around a ConcurrentQueue<T> where you set the BoundedCapacity to something that makes sense given the characteristics of your workload. You can make your BoundedCapacity configurable and then tweak as you run through some profiling sessions to find the sweet spot.
While it's true that the TPL will take care of queueing up the tasks you create, creating too many tasks does not come without penalties. Also, what's the point in producing more work than you can consume? You want to produce enough work that the consumers are never starved, but you don't want to get too far ahead of yourself, because that just wastes resources and potentially steals those very same resources from your consumers.
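A minimal sketch of that producer/consumer shape, with a hypothetical MethodToActivate standing in for your handler and the capacity hard-coded at 100 for illustration:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ProducerConsumer
    {
        static void Main()
        {
            // Bounded queue: Add blocks once 100 items are waiting, which
            // throttles the producer instead of spawning unbounded tasks.
            var queue = new BlockingCollection<string>(
                new ConcurrentQueue<string>(), boundedCapacity: 100);

            // One long-running consumer per core on a dual-core box.
            var consumers = new Task[Environment.ProcessorCount];
            for (int i = 0; i < consumers.Length; i++)
            {
                consumers[i] = Task.Run(() =>
                {
                    foreach (var item in queue.GetConsumingEnumerable())
                        MethodToActivate(item);    // your ~1 minute handler
                });
            }

            foreach (var data in IncomingData())   // hypothetical data source
                queue.Add(data);

            queue.CompleteAdding();                // let consumers drain and exit
            Task.WaitAll(consumers);
        }

        static IEnumerable<string> IncomingData() => new[] { "a", "b", "c" };
        static void MethodToActivate(string data) { }
    }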
You can create a custom TaskScheduler for the Task Parallel Library and then schedule tasks on it by passing an instance to the TaskFactory constructor.
Here's one example of how to do that: Task Scheduler with a maximum degree of parallelism.
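Usage then looks something like this, assuming you've copied the LimitedConcurrencyLevelTaskScheduler class from that linked example (it is not part of the framework itself):

    using System;
    using System.Threading.Tasks;

    class SchedulerUsage
    {
        static void Main()
        {
            // At most ProcessorCount tasks execute at once through this
            // factory, no matter how many have been queued.
            var scheduler = new LimitedConcurrencyLevelTaskScheduler(Environment.ProcessorCount);
            var factory = new TaskFactory(scheduler);

            // queue work exactly as before, just through the bounded factory
            factory.StartNew(() => MethodToActivate("incoming data"));
        }

        static void MethodToActivate(string data) { /* ~1 minute of work */ }
    }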