In my backend, I have about 5k users. Every day at 12pm UTC, I need to run a background Node task that carries out some real-time calculations on each user's data and sends the result to the user by email, SMS, notification, etc. The calculation is fairly intensive and takes a few seconds per user.
Since 5k is a large number, I've created a schedule that calls an API endpoint every minute, and each call processes 5 users. The problem with processing 5 users per minute is that, at this speed, it takes more than 16 hours (1,000 minutes) to process all 5,000 users. Meanwhile, the number of users is growing; soon enough, even 24 hours will not be enough.
What's the alternative? How can I process all these 5k users or even 10k users in a much shorter duration?
Without seeing the relevant code, or a detailed description of what exactly this background Node task does, we can't advise specifically on which approach to take; the answer depends entirely on understanding that task in detail.
If the Node.js background task is CPU-bound, then you will need to involve more CPUs in the processing, either with child processes (the child_process module) or with Worker Threads.
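As a minimal sketch of the Worker Threads option, assuming the per-user calculation can be expressed as a self-contained function, the snippet below moves a CPU-heavy loop off the main thread. The loop, the `iterations` parameter, and the `runInWorker` helper are made up for illustration; a real task would more likely load a separate worker file than an inline source string.

```javascript
const { Worker } = require("worker_threads");

// Source of the worker, inlined so the sketch is self-contained.
// The loop is a hypothetical stand-in for the real per-user calculation.
const workerSource = `
  const { parentPort, workerData } = require("worker_threads");
  let sum = 0;
  for (let i = 0; i < workerData.iterations; i++) sum += i % 7;
  parentPort.postMessage(sum);
`;

// Run the calculation on a separate thread and resolve with its result.
function runInWorker(iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true, // treat the first argument as code, not a file path
      workerData: { iterations },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// Four calculations run on four threads instead of queuing on the main one.
Promise.all([1e7, 1e7, 1e7, 1e7].map(runInWorker))
  .then((sums) => console.log("worker results:", sums))
  .catch(console.error);
```

Starting a worker has a cost of its own, so for thousands of users it is probably better to keep a small pool of workers (roughly one per core) and feed each of them many users.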
If the Node.js background task is database-limited, then you will need to scale your database to handle more requests in less time, or redesign how the relevant data is stored to be more efficient.
If nearly all the processing is asynchronous and not CPU bound and not database-limited, then you perhaps need to be processing N users in parallel.
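Processing N users in parallel could look roughly like the sketch below, which keeps at most `limit` asynchronous calls in flight. `mapWithConcurrency` and `processUser` are hypothetical names; `processUser` only simulates the real calculate-and-send step with a timer.

```javascript
// Run `worker` over `items`, keeping at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each runner claims items one at a time until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Hypothetical stand-in for the real per-user work (calculation + send).
function processUser(userId) {
  return new Promise((resolve) =>
    setTimeout(() => resolve(`done:${userId}`), 10)
  );
}

async function main() {
  const userIds = Array.from({ length: 20 }, (_, i) => i + 1);
  const results = await mapWithConcurrency(userIds, 5, processUser);
  console.log(`${results.length} users processed`);
}

main().catch(console.error);
```

The right value for `limit` is something to find by measurement; it is bounded by whatever the database and the email/SMS providers can absorb.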
As with any performance problem, it's not good to guess where the biggest bottleneck is. You need to instrument and measure! I would suggest that you start by instrumenting the processing of one user to find out exactly where all the time is going. Then, start with the longest pole in the tent and dive into exactly why it's taking that much time. See how much you can improve that one item, then move on to the next one.
Once you've made the processing of a single user as fast as it can be, work on ways to process N users at a time and make sure your database can scale with that. By dividing the users across more processes and processing N at a time in each process, you can scale the calculation work as much as you want, provided your database can keep up with the additional load from all the separate processes.
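The "instrument one user" step might be sketched as follows. The stage names (`loadData`, `calculate`, `send`) and the sleeps standing in for them are assumptions for illustration only; the point is to accumulate wall-clock time per stage and sort the stages slowest first.

```javascript
// Accumulates wall-clock time per named stage.
class StageTimer {
  constructor() {
    this.totalsMs = new Map();
  }

  // Await `fn`, adding its elapsed time to the stage's running total.
  async time(stage, fn) {
    const start = process.hrtime.bigint();
    try {
      return await fn();
    } finally {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      this.totalsMs.set(stage, (this.totalsMs.get(stage) || 0) + elapsedMs);
    }
  }

  // [stage, totalMs] pairs, slowest stage first.
  report() {
    return [...this.totalsMs.entries()].sort((a, b) => b[1] - a[1]);
  }
}

// Hypothetical stages; the sleeps stand in for real work.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processOneUser(timer) {
  await timer.time("loadData", () => sleep(20));
  await timer.time("calculate", () => sleep(60));
  await timer.time("send", () => sleep(5));
}

const timer = new StageTimer();
processOneUser(timer)
  .then(() => console.log(timer.report()))
  .catch(console.error);
```

Whichever stage tops the report is the "longest pole" to investigate first.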
Let's say I have a GUI application that performs long computations for the user (for example, video processing). Should I increase the priority of my computation process/thread to make it run faster? What would be the harm in doing so?
Generally no. It's rare that someone would prefer a non-responsive system during a long calculation to a responsive system where the calculation takes a tiny bit longer.
There are also a variety of reasons that increased priority can result in the calculation actually taking longer. For example, increased pre-emption of other tasks can blow out the CPU caches, making those tasks take longer. This can slow down the task you care about for a variety of reasons, including increased inter-core contention.
You raise priority when you can't tolerate unpredictable or increased latency.
If you have a piece of processing that requires five minutes of processor time, it is going to take five minutes to process. On a non-busy system, where that is the only application running, raising the priority is unlikely to make much difference.
However, on a system that has other tasks, then you're going to steal processing time from those, and the user is likely to think that your system is making their computer hang.
If you have a long-running task, you're actually better off reducing the priority, so that the user can get on with doing other things. This is the principle behind multitasking...
Well, if you spend more time processing video, then there may be fewer resources available for reading it from disk, for example.
The only real way is to test...
I've got a service that runs scans of various servers. The networks in question can be huge (hundreds of thousands of network nodes).
The current version of the software uses a queueing/threading architecture designed by us, which works but isn't as efficient as it could be (not least because jobs can spawn children, which isn't handled well).
V2 is coming up and I'm considering using the TPL. It seems like it should be ideally suited.
I've seen this question, the answer to which implies there's no limit to the tasks TPL can handle. In my simple tests (Spin up 100,000 tasks and give them to TPL), TPL barfed fairly early on with an Out-Of-Memory exception (fair enough - especially on my dev box).
The Scans take a variable length of time but 5 mins/task is a good average.
As you can imagine, scans for huge networks can take a considerable length of time, even on beefy servers.
I've already got a framework in place which allows the scan jobs (stored in a Db) to be split between multiple scan servers, but the question is how exactly I should pass work to the TPL on a specific server.
Can I monitor the size of TPL's queue and (say) top it up if it falls below a couple of hundred entries? Is there a downside to doing this?
I also need to handle the situation where a scan needs to be paused. This seems easier to do by not giving the work to TPL than by cancelling/resetting tasks which may already be partially processed.
All of the initial tasks can be run in any order. Children must be run after the parent has started executing, but since the parent spawns them, this shouldn't ever be a problem. Children can be run in any order. Because of this, I'm currently envisioning that child tasks be written back to the Db rather than spawned directly into TPL. This would allow other servers to "work steal" if required.
Has anyone had any experience with using the TPL in this way? Are there any considerations I need to be aware of?
TPL is about starting small units of work and running them in parallel. It is not about monitoring, pausing, or throttling this work.
You should see TPL as a low-level tool to start "work" and to synchronize threads.
Key point: TPL tasks != logical tasks. Logical tasks are in your case scan-tasks ("scan an ip-range from x to y"). Such a task should not correspond to a physical task "System.Threading.Tasks.Task" because the two are different concepts.
You need to schedule, orchestrate, monitor and pause the logical tasks yourself because TPL does not understand them and cannot be made to.
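To make "schedule and pause the logical tasks yourself" concrete, here is a rough sketch of such a dispatcher. It is written in JavaScript rather than C#, so it is an illustrative analogue and not TPL code; `JobDispatcher` and its methods are invented names. It caps the number of running jobs, supports pause/resume, and lets a running parent enqueue children.

```javascript
// Dispatcher for logical jobs: bounded in-flight count, pause/resume, and
// dynamic enqueueing (e.g. child jobs spawned by a running parent).
class JobDispatcher {
  constructor(maxInFlight) {
    this.maxInFlight = maxInFlight;
    this.pending = []; // logical jobs not yet started
    this.inFlight = 0;
    this.paused = false;
    this.idleWaiters = [];
  }

  // `job` is a function (dispatcher) => Promise.
  enqueue(job) {
    this.pending.push(job);
    this.pump();
  }

  // Pausing only stops new jobs from starting; running jobs finish normally.
  pause() {
    this.paused = true;
  }

  resume() {
    this.paused = false;
    this.pump();
  }

  // Start pending jobs until the limit is reached or the dispatcher is paused.
  pump() {
    while (
      !this.paused &&
      this.inFlight < this.maxInFlight &&
      this.pending.length > 0
    ) {
      const job = this.pending.shift();
      this.inFlight++;
      Promise.resolve()
        .then(() => job(this))
        .catch((err) => console.error("job failed:", err))
        .finally(() => {
          this.inFlight--;
          this.pump();
          this.notifyIfIdle();
        });
    }
  }

  notifyIfIdle() {
    if (this.inFlight === 0 && this.pending.length === 0) {
      this.idleWaiters.splice(0).forEach((resolve) => resolve());
    }
  }

  // Resolves once nothing is pending or running.
  whenIdle() {
    if (this.inFlight === 0 && this.pending.length === 0) {
      return Promise.resolve();
    }
    return new Promise((resolve) => this.idleWaiters.push(resolve));
  }
}
```

Because not-yet-started jobs live in `pending` rather than inside the runtime, pausing is just a flag, and the same list could equally be backed by the database so that other servers can pick up work.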
Now the more practical concerns:
TPL can certainly start 100k tasks without OOM. The OOM happened because your tasks' code exhausted memory.
Scanning networks sounds like a great case for asynchronous code because while you are scanning you are likely to wait on results while having a great degree of parallelism. You probably don't want to have 500 threads in your process all waiting for a network packet to arrive. Asynchronous tasks fit well with the TPL because every task you run becomes purely CPU-bound and small. That is the sweet spot for TPL.
My question might sound a bit naive but I'm pretty new with multi-threaded programming.
I'm writing an application which processes incoming external data. For each item of data that arrives, a new task is created in the following way:
System.Threading.Tasks.Task.Factory.StartNew(() => methodToActivate(data));
The items of data arrive very fast (every second, half second, etc.), so many tasks are created. Handling each task might take around a minute. When testing it, I saw that the number of threads keeps increasing. How can I limit the number of tasks created, so that the number of actual working threads is stable and efficient? My computer is only dual core.
Thanks!
One of your issues is that the default scheduler sees tasks that last for a minute and assumes that they are blocked on other tasks that have yet to be executed. To try to unblock things it schedules more pending tasks, hence the thread growth. There are a couple of things you can do here:
Make your tasks shorter (probably not an option).
Write a scheduler that deals with this scenario and doesn't add more threads.
Use SetMaxThreads to prevent unbounded thread pool growth.
See the section on Thread Injection here:
http://msdn.microsoft.com/en-us/library/ff963549.aspx
You should look into using the producer/consumer pattern with a BlockingCollection<T> around a ConcurrentQueue<T> where you set the BoundedCapacity to something that makes sense given the characteristics of your workload. You can make your BoundedCapacity configurable and then tweak as you run through some profiling sessions to find the sweet spot.
While it's true that the TPL will take care of queueing up the tasks you create, creating too many tasks does not come without penalties. Also, what's the point in producing more work than you can consume? You want to produce enough work that the consumers will never be starved, but you don't want to get too far ahead of yourself because that's just wasting resources and potentially stealing those very same resources from your consumers.
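The bounded producer/consumer idea can be sketched independently of .NET. The example below is JavaScript, not C#, so `BoundedQueue` is an illustrative analogue of a bounded blocking collection and not the `BlockingCollection<T>` API; the capacity of 3, the two consumers, and the timer standing in for work are all assumptions. The producer waits whenever the queue is full, which is what keeps it from running ahead of the consumers.

```javascript
// Async queue with a fixed capacity: put() waits while full, take() waits while empty.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.waitingPutters = []; // resolvers for producers waiting on a full queue
    this.waitingTakers = []; // resolvers for consumers waiting on an empty queue
  }

  async put(item) {
    while (this.items.length >= this.capacity) {
      await new Promise((resolve) => this.waitingPutters.push(resolve));
    }
    this.items.push(item);
    const taker = this.waitingTakers.shift();
    if (taker) taker(); // wake one waiting consumer
  }

  async take() {
    while (this.items.length === 0) {
      await new Promise((resolve) => this.waitingTakers.push(resolve));
    }
    const item = this.items.shift();
    const putter = this.waitingPutters.shift();
    if (putter) putter(); // wake one waiting producer
    return item;
  }
}

const DONE = Symbol("done"); // sentinel telling a consumer to stop

async function demo() {
  const queue = new BoundedQueue(3); // capacity is a tunable assumption
  const processed = [];

  async function producer() {
    for (let i = 1; i <= 10; i++) await queue.put(i);
    await queue.put(DONE); // one sentinel per consumer
    await queue.put(DONE);
  }

  async function consumer(name) {
    for (;;) {
      const item = await queue.take();
      if (item === DONE) return;
      await new Promise((r) => setTimeout(r, 5)); // stand-in for real work
      processed.push(`${name}:${item}`);
    }
  }

  await Promise.all([producer(), consumer("A"), consumer("B")]);
  return processed;
}

demo()
  .then((processed) => console.log(`${processed.length} items processed`))
  .catch(console.error);
```

Making the capacity a configuration value, as suggested above, allows it to be tuned across profiling runs without code changes.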
You can create a custom TaskScheduler for the Task Parallel library and then schedule tasks on that by passing an instance of it to the TaskFactory constructor.
Here's one example of how to do that: Task Scheduler with a maximum degree of parallelism.
I am using a third party API which performs what I would assume are expensive operations in terms of time/resources used (image recognition, etc). What tell-tale signs are there that the code under test should be made to use threads to increase performance?
I have a profiler and will be profiling the code I write which will rely on this API.
Thanks
If you have two distinct sequences of events that don't depend on one another, then consider it. If you have to write bunches of logic just to make sure that two operations aren't getting in each other's way, it pays off by making the two pieces of code clearer.
If on the other hand you find that, in attempting to make something multithreaded, you have to add gobs of code to communicate results between the threads, because one (or both) can't proceed without some information from the other, that's a good sign that you are trying to make threads where they don't make sense.
One case where it makes sense to go multi-threaded, even when you have to add communication to do it, is when you have one task that needs to stay available for input, and another to do heavy computing. One thread may poll for input from somewhere, blocking when none is available, so that when input is available it is responded to in a timely manner, and feed jobs to another 'worker' thread, so that processing continues at all times, not just when there's input.
One other thing to consider is that even when a job is 'embarrassingly parallel' (i.e., requiring little or no communication between the parallelized parts), there are cases where multithreading may not be worthwhile. If your CPU can assign different threads to different cores, multithreading will give you a speed-up by allowing multiple cores to chew through the work simultaneously. But on a single-core processor, or even a multi-core one with an unfortunate OS, having multiple threads will not speed things up, as the one core will still have to get through all the work.
Image processing is often CPU-bound. However, if your image-processing API is already designed to leverage multiple CPUs, multi-threading probably won't help you. The strategy I usually consider for quickly determining whether multi-threading will help is to write a simple program that does the relevant processing over and over again. I run it on a set of data, then run two instances of the process simultaneously, each on half of the data. There is no need to ensure the data is split equally for such a test; if one process finishes early, the other just runs alone for whatever is left. Timing is done via wall-clock time. I mean this literally: pick a data set large enough that it takes at least a full minute to run, ideally 5 minutes or longer.
If running two copies at the same time improves throughput significantly, multi-threading is probably a good idea. Obviously this strategy is only practical in certain instances and in some cases multi-threading can involve leveraging shared output in ways this trick can't emulate. But, it's an absurdly easy test to run, and rarely requires much, if any, code to be written.
Consider a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment; I just made up the data to make the question clear. My current thinking is that there should be no difference because the apps are 100% CPU-bound, and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given that there are other applications running on the machine and one application might not always be given 100% CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-processor VM since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-processor PC, with lots of threads that are doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
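The arithmetic behind that reading can be checked with a small model. This is only a sketch under idealised assumptions: one CPU, perfectly equal time slices per thread, zero switching cost, and both programs needing a round 50 seconds when run alone (the measured figures were 50.4 and 49.8). `fairShareFinishTimes` is a made-up helper name.

```javascript
// Finish times when every runnable thread gets an equal share of one CPU.
// singleWork / multiWork: CPU-seconds each program needs when run alone.
// Assumes the single-threaded program is still running when the
// multi-threaded one finishes (true for the figures used here).
function fairShareFinishTimes(singleWork, multiWork, multiThreads) {
  const totalThreads = multiThreads + 1;
  const perThreadWork = multiWork / multiThreads;

  // Phase 1: all threads compete; each receives 1/totalThreads of the CPU,
  // so a thread needing perThreadWork seconds finishes after this long.
  const multiFinish = perThreadWork * totalThreads;

  // CPU time the single thread received during phase 1.
  const singleDoneInPhase1 = multiFinish / totalThreads;

  // Phase 2: the single thread has the CPU to itself for the remainder.
  const singleFinish = multiFinish + (singleWork - singleDoneInPhase1);

  return { multiFinish, singleFinish };
}

const predicted = fairShareFinishTimes(50, 50, 10);
console.log(predicted); // { multiFinish: 55, singleFinish: 100 }
```

The model's 55 and 100 seconds sit close to the measured 56.9 and 99.5 seconds, which is consistent with per-thread fairness.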
The result: Windows' scheduler is fair on a thread level, but not on a process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on another. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.