CPU usage vs Number of threads - multithreading

In general what is the relation between CPU usage and number of threads in a program.
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)

It depends on the nature of the application.
An application that mostly do calculations - a ratio of 1 thread per
core is a reasonable decision, since you don't want to spawn too many threads due to overhead, and you want to take advantage of all your cores.
An application that mostly do IO operations (like http requests) can spawn much more threads then the #cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want to gain as much information as possible in each time you need to wait.
That said, the CPU-usage you are going to get is still dependent on many factors (IO, synchronization, non parallel parts in your program).
If you are interested in the speed the application will take - always remember Amdahl's law, which gives you a strict bound on the time (speed-up) your application is going to take, even when having infinite number of working cores.

There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application depends mostly on the nature of the application, and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.

I think there is no relation or not easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of CPU and a program with lots of threads can consume less.
If you are looking for an optimized relation between threads and job done, you must study your case, and possibly found an empiric solution.

As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples what you need to consider when parallezing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.

Related

Is there every a reason to use thread affinity when there are more threads being used than ones specified/reserved?

I am working with Rust but this question would also apply to many other situations.
Suppose you have M available vCPUs and N threads (including the main thread) to schedule, and that N > M. Each thread does approximately equal amounts of work.
Is there any good reason then to pin threads to specific cores? I've written a number of toy benchmarks and it seems like the answer is no, as I cannot make a program under these assumptions that performs better with thread affinity; in fact, it always does much worse.
if your application is working on a system with a lot of cores and heavily relies on the core cache, a context switch will be too expensive, so pinning tasks to cores reduces the context switches and improves throughput.
but in an "average pc" running plain RAM-bound tasks then your OS scheduler will be much better at load balancing the cores than you ever will.
pinning threads to cores is also useful if you care about latency instead of throughput, in a heavily loaded system if you have a time-critical task then you want it to have its own core which won't be contented by other tasks on the system, hence it makes sense to pin it to a certain core, an example will be an in-memory Database that needs to responds to request in under a millisecond latency.
so the answer is, it's only useful for certain apps.

Performance of multi-threading exceeding cores

If I have a process that starts X amount of threads, will there ever be a performance gain having X higher than the number of CPU cores (assuming all the threads are working synchronously without async calls to storage/network)?
E.G. If I have a two cores CPU, will I just slow down the application starting 3+ constantly working threads?
It really depends on what your code does. it is too broad.
Having more threads than cores might speed up the program for example if some of the threads sleep or try to block on a lock. in this case, the OS scheduler can wake different thread and that thread will work while the other thread is sleeping.
Having more threads than the number of cores may also decrease the program execution time because the OS scheduler has to do more work to switch between the threads execution and that scheduling might be a heavy operation.
As always, benchmarking your application with different amount of threads is the best way to achieve maximum performance. there are also algorithms (like Hill-Climbing) which may help the application fine tune the best number of threads on runtime.
It is possible that such a thing happens.
Both Intel and AMD currently implement forms of SMT in their CPUs. This means that, in general, one single thread of execution may not be able to exploit 100% of the computing resources.
This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less stuff gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So, they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline, in order to fill such gaps.
Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.
It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.
If your threads are pure computation threads, adding more of them will slow down system because of context switches.
If you still need more threads by design, you might want to look into the cooperative multitasking. Both Windows and Linux have API for that and that will work faster than the context switches. In Windows it called fibers:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx
In Linux it is a set of functions make/get/swapcontext():
http://man7.org/linux/man-pages/man3/makecontext.3.html
This question: Optimal number of threads per core might help you.
In the thread I wrote an answer describing a scenario when having higher number of threads than the available number of cores boosts performance.

Why is optimal thread count of a program related to number of cores when there are thousands of background threads

I've been reading about multi-threaded programming and number of optimal threads. I understand that it is very subjective, varies case by case basis, and the real optimal can be found only through trial-and-error.
However, I've found so many posts saying that if the task is non-I/O-bound, then
Optimal: numberOf(threads) ~= numberOf(cores)
Please take a look at Optimal number of threads per core
Q) How can the above equation be valid if hundreds/thousands of background (OS/other stuff) threads are already fighting to get their turn?
Q) Doesn't having a bit more number of threads increase the probability of being allotted with a core?
The "optimal" only applies to threads that are executing full throttle. The 1000+ threads you can see in use in, say, the Windows Task Manager are threads that are not executing. They are waiting for a notification, blocking on a synchronization object's wait() call.
Which includes I/O but can also be a timer, a driver event, a process interop synch object, an UI thread waiting for a message, etcetera. The latter are much less visible since they are usually wrapped by a friendly api.
Writing a program that has as many threads as the machine has cores, all burning 100% core, is not actually that common. You'd have to solve the kind of problem that requires pure calculation. Real programs are typically bogged down by the need to read/write the data to perform an operation or are throttled by the rate at which data arrives.
Overscheduling the processor is not a good strategy if you have threads burning 100% core. They'll start to fight with each other, the context switching overhead causes less work to be done. It is fine when they block. Blocking automatically makes a core available to do something else.

Why would I have to use multiple threads for one processing task if i can turn up the priority of the program?

Earlier I asked about processing a datastream and someone suggested to put data in a queue and processing this data on a different thead. If this was to slow, I should use multiple threads.
However, i'm using a system that has one core.
So my question is: why not up the prio of my app, so it gets more CPU time from the OS?
I'm writing a server based app and it will be the only big thing running on there.
What would be the pro's and con's of putting the prio up?:)
If you have only one core, then the only way that multi-threading can help you is if chunks of that work depends on something other than CPU, so one thread can get some work done while another is waiting for data from a disk or network connection.
If your application has a GUI, then it can benefit from multi-threading in that while it would be no quicker to do the processing (slower in fact, though probably negligibly so if the task is very long), it can still react to user input in the meantime.
If you have two or more cores, then you can also gain in CPU-bound operations though doing so varies from trivial to impossible depending on just what that operation is. This is irrelevant to your case, but worth considering generally if code you write could later be run on a multi-core system.
Upping the priority is probably a bad idea though, especially if you have only one core (one advantage of multi-core systems is that people who up priorities can't do as much damage).
All threads have priorities which is a factor of both their process' priority and their priority within that process. A low-priority thread in a high priority process trumps a high-priority thread in a low-priority process.
The scheduler doles out CPU slices in a round-robin fashion to the highest priority threads that have work to do. If there are CPUs left over (which in your case means if there are zero threads at that priority that need to run), then it doles out slices to the next lowest priority, and so on.
Most of the time, most threads aren't doing much anyway, which can be seen from the fact that most of the time CPU usage on most systems is below the 100% mark (hyperthreading skews this, the internal scheduling within the cores means a hyperthreaded system can be fully saturated and seem to be only running at as little as 70%). Anyway, generally stuff gets done and a thread that suddenly has lots to do will do so at normal priority in pretty much the same time it would at a higher.
However, while the benefit to that busy thread of higher priority is generally little or nothing, the decrement is great. Since it's the only thread that gets any CPU time, all other threads are stuck. All other processes therefore hang for a while. Eventually the scheduler notices that they've all been waiting for around 3seconds, and fixes this by boosting them all to highest priority and giving them larger slices than normal. Now we have a burst of activity as threads that got no time are all suddenly highest-priority threads that all want CPU time. There's a spurt of every thread except the high-priority one running, and the system stops from keeling over, though there's likely still a lot of applications showing "Not Responding" in their title bars. It's far from ideal, but it is an effective way to deal with a thread of higher than usual priority grabbing the core for so long.
The threads gradually drop down in priority, and eventually we're back to the situation where the single higher priority thread is the only one that can work.
For extra fun, if our high priority thread in any way depended upon services provided by the lower priority threads, it would have ended up being stuck waiting on them. Hopefully in a way that made it block and stopped itself from doing any damage, but probably not.
In all, thread priorities are to be approached with great caution, and process priorities even more so. They're only really valid if they'll yield quickly and are either essential to the workings of other threads (e.g. some OS processes will be done at a higher priority, finaliser threads in .NET will be higher than the rest of the process, etc) or if sub-millisecond delays can mess things up (some intensive media work requires this).
If you have multiple cores/processors in your system, upping the priority of a single threaded program will not improve your performance by much, because the other cores would still be unused.
The only way to take advantage of multiple processing units is to write your program using multiple threads/processes.
Having said this, setting your multithreaded application to very high priority may lead to some performance improvement, but I really never saw it to be significant, at least in my own tests.
Edit: I see now that you are using only one core. Basically your program will be able to run more often on the CPU than the rest of the processes that are of lower priority. This may bring you a marginal improvement, but not a dramatic one. Since we cannot know what other applications are running at the same time on your system, the golden rule here is to try it yourself with various priority levels and see what happens. It's the only valid way to see if things will be faster or not.
It all depends on why the data processing is slow.
If the data processing is slow because it is a genuinely cpu intensive operation then splitting it out into multiple threads on a single core system is not going to get you any benefit. In this case increasing the task priority would provide some benefit, assuming that there is (user) cpu time being used by other processes.
However, if the data processing operation is slow because of some non-cpu restriction (eg. if it is I/O bound, or relying on another process), then:
Increasing the task priority is going to have negligible impact. Task priority won't affect I/O times and if there is a dependency on another process on the system you may actually harm performance.
Splitting the data processing out into multiple threads can allow the cpu intensive areas to continue processing while waiting for the non-cpu intensive (eg. I/O) areas to complete.
Increasing the priority of a single-threaded process just gives you more (or bigger) time slices on the one core the process is running on. The core can still only do one thing at a time.
If you spin off a thread to handle the data processing, it can run on a different processor core (assuming a multi-core system), and it and your main thread are actually executing at the same time. Much more efficient.
If you use only one thread your server app will only be able to service one request at a time, no matter what its priority. If you use multiple threads you could service many at the same time.

How many simultaneous threads in an application is a lot?

5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine at how many threads the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i. e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise to minimise the number of threads because of the added complexity. To me it seems that threads can also help to keep things sorted more orderly and actually reduce interference. Isn't that correct?
I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal amount of threads for performance of application is (n+1), where n is the amount of processors/cores your computer/claster has.
The more your actual thread amount differs from n+1, the less optimal it gets and gets your system resources wasted on thread calculations.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.
Actually Ajmastrean is a little out of date. Quoting from his own link
The thread pool has a default size of
250 worker threads per available
processor, and 1000 I/O completion
threads. The number of threads in the
thread pool can be changed by using
the SetMaxThreads method.
But generally I think 25 is really where the law of diminishing returns (and programmers abilities to keep track of what is going on) starts coming into effect. Although Max is right, as long as all of the threads are performing non-blocking calculations n+1 is the optimal number, in the real world most of the threading tasks I perform tend to be done on stuff with some kind of IO.
Also depends on your architecture. E.g. in NVIDIA GPGPU lib CUDA you can put on an 8 thread multiprocessor 512 threads simoultanously. You may ask why assign each of the scalar processors 64 threads? The answer is easy: If the computation is not compute bound but memory IO bound, you can hide the mem latencies by executing other threads. Similar applies to normal CPUs. I can remember that a recommendation for the parallel option for make "-j" is to use approx 1.5 times the number of cores you got. Many of the compiling tasks are heavy IO burden and if a task has to wait for harddisk, mem ... whatever, CPU could work on a different thread.
Next you have to consider, how expensive a task/thread switch is. E.g. it is comes free, while CPU has to perform some work for a context switch. So in general you have to estimate if the penalty for two task switches is longer than the time the thread would block (which depends heavily on your applications).
Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.

Resources