How many simultaneous threads in an application is a lot? - multithreading

5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine at how many threads the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i. e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise to minimise the number of threads because of the added complexity. To me it seems that threads can also help to keep things sorted more orderly and actually reduce interference. Isn't that correct?

I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal amount of threads for performance of application is (n+1), where n is the amount of processors/cores your computer/claster has.
The more your actual thread amount differs from n+1, the less optimal it gets and gets your system resources wasted on thread calculations.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.

Actually Ajmastrean is a little out of date. Quoting from his own link
The thread pool has a default size of
250 worker threads per available
processor, and 1000 I/O completion
threads. The number of threads in the
thread pool can be changed by using
the SetMaxThreads method.
But generally I think 25 is really where the law of diminishing returns (and programmers abilities to keep track of what is going on) starts coming into effect. Although Max is right, as long as all of the threads are performing non-blocking calculations n+1 is the optimal number, in the real world most of the threading tasks I perform tend to be done on stuff with some kind of IO.

Also depends on your architecture. E.g. in NVIDIA GPGPU lib CUDA you can put on an 8 thread multiprocessor 512 threads simoultanously. You may ask why assign each of the scalar processors 64 threads? The answer is easy: If the computation is not compute bound but memory IO bound, you can hide the mem latencies by executing other threads. Similar applies to normal CPUs. I can remember that a recommendation for the parallel option for make "-j" is to use approx 1.5 times the number of cores you got. Many of the compiling tasks are heavy IO burden and if a task has to wait for harddisk, mem ... whatever, CPU could work on a different thread.
Next you have to consider, how expensive a task/thread switch is. E.g. it is comes free, while CPU has to perform some work for a context switch. So in general you have to estimate if the penalty for two task switches is longer than the time the thread would block (which depends heavily on your applications).

Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.

Related

Is there every a reason to use thread affinity when there are more threads being used than ones specified/reserved?

I am working with Rust but this question would also apply to many other situations.
Suppose you have M available vCPUs and N threads (including the main thread) to schedule, and that N > M. Each thread does approximately equal amounts of work.
Is there any good reason then to pin threads to specific cores? I've written a number of toy benchmarks and it seems like the answer is no, as I cannot make a program under these assumptions that performs better with thread affinity; in fact, it always does much worse.
if your application is working on a system with a lot of cores and heavily relies on the core cache, a context switch will be too expensive, so pinning tasks to cores reduces the context switches and improves throughput.
but in an "average pc" running plain RAM-bound tasks then your OS scheduler will be much better at load balancing the cores than you ever will.
pinning threads to cores is also useful if you care about latency instead of throughput, in a heavily loaded system if you have a time-critical task then you want it to have its own core which won't be contented by other tasks on the system, hence it makes sense to pin it to a certain core, an example will be an in-memory Database that needs to responds to request in under a millisecond latency.
so the answer is, it's only useful for certain apps.

Performance of multi-threading exceeding cores

If I have a process that starts X amount of threads, will there ever be a performance gain having X higher than the number of CPU cores (assuming all the threads are working synchronously without async calls to storage/network)?
E.G. If I have a two cores CPU, will I just slow down the application starting 3+ constantly working threads?
It really depends on what your code does. it is too broad.
Having more threads than cores might speed up the program for example if some of the threads sleep or try to block on a lock. in this case, the OS scheduler can wake different thread and that thread will work while the other thread is sleeping.
Having more threads than the number of cores may also decrease the program execution time because the OS scheduler has to do more work to switch between the threads execution and that scheduling might be a heavy operation.
As always, benchmarking your application with different amount of threads is the best way to achieve maximum performance. there are also algorithms (like Hill-Climbing) which may help the application fine tune the best number of threads on runtime.
It is possible that such a thing happens.
Both Intel and AMD currently implement forms of SMT in their CPUs. This means that, in general, one single thread of execution may not be able to exploit 100% of the computing resources.
This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less stuff gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So, they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline, in order to fill such gaps.
Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.
It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.
If your threads are pure computation threads, adding more of them will slow down system because of context switches.
If you still need more threads by design, you might want to look into the cooperative multitasking. Both Windows and Linux have API for that and that will work faster than the context switches. In Windows it called fibers:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx
In Linux it is a set of functions make/get/swapcontext():
http://man7.org/linux/man-pages/man3/makecontext.3.html
This question: Optimal number of threads per core might help you.
In the thread I wrote an answer describing a scenario when having higher number of threads than the available number of cores boosts performance.

Cost of a thread

I understand how to create a thread in my chosen language and I understand about mutexs, and the dangers of shared data e.t.c but I'm sure about how the O/S manages threads and the cost of each thread. I have a series of questions that all relate and the clearest way to show the limit of my understanding is probably via these questions.
What is the cost of spawning a thread? Is it worth even worrying about when designing software? One of the costs to creating a thread must be its own stack pointer and process counter, then space to copy all of the working registers to as it is moved on and off of a core by the scheduler, but what else?
Is the amount of stack available for one program split equally between threads of a process or on a first come first served?
Can I somehow check the hardware on start up (of the program) for number of cores. If I am running on a machine with N cores, should I keep the number of threads to N-1?
then space to copy all of the working registeres to as it is moved on
and off of a core by the scheduler, but what else?
One less evident cost is the strain imposed on the scheduler which may start to choke if it needs to juggle thousands of threads. The memory isn't really the issue. With the right tweaking you can get a "thread" to occupy very little memory, little more than its stack. This tweaking could be difficult (i.e. using clone(2) directly under linux etc) but it can be done.
Is the amount of stack available for one program split equally between
threads of a process or on a first come first served
Each thread gets its own stack, and typically you can control its size.
If I am running on a machine with N cores, should I keep the number of
threads to N-1
Checking the number of cores is easy, but environment-specific. However, limiting the number of threads to the number of cores only makes sense if your workload consists of CPU-intensive operations, with little I/O. If I/O is involved you may want to have many more threads than cores.
You should be as thoughtful as possible in everything you design and implement.
I know that a Java thread stack takes up about 1MB each time you create a thread. , so they add up.
Threads make sense for asynchronous tasks that allow long-running activities to happen without preventing all other users/processes from making progress.
Threads are managed by the operating system. There are lots of schemes, all under the control of the operating system (e.g. round robin, first come first served, etc.)
It makes perfect sense to me to assign one thread per core for some activities (e.g. computationally intensive calculations, graphics, math, etc.), but that need not be the deciding factor. One app I develop uses roughly 100 active threads in production; it's not a 100 core machine.
To add to the other excellent posts:
'What is the cost of spawning a thread? Is it worth even worrying about when designing software?'
It is if one of your design choices is doing such a thing often. A good way of avoiding this issue is to create threads once, at app startup, by using pools and/or app-lifetime threads dedicated to operations. Inter-thread signaling is much quicker than continual thread creation/termination/destruction and also much safer/easier.
The number of posts concerning problems with thread stopping, terminating, destroying, thread count runaway, OOM failure etc. is ledgendary. If you can avoid doing it at all, great.

CPU usage vs Number of threads

In general what is the relation between CPU usage and number of threads in a program.
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly do calculations - a ratio of 1 thread per
core is a reasonable decision, since you don't want to spawn too many threads due to overhead, and you want to take advantage of all your cores.
An application that mostly do IO operations (like http requests) can spawn much more threads then the #cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want to gain as much information as possible in each time you need to wait.
That said, the CPU-usage you are going to get is still dependent on many factors (IO, synchronization, non parallel parts in your program).
If you are interested in the speed the application will take - always remember Amdahl's law, which gives you a strict bound on the time (speed-up) your application is going to take, even when having infinite number of working cores.
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application depends mostly on the nature of the application, and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation or not easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of CPU and a program with lots of threads can consume less.
If you are looking for an optimized relation between threads and job done, you must study your case, and possibly found an empiric solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples what you need to consider when parallezing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.

How to do the same calculations faster on 4-core CPU: 4 threads or 50 threads?

Lets assume we have fixed amount of calculation work, without blocking, sleeping, i/o-waiting. The work can be parallelized very well - it consists of 100M small and independent calculation tasks.
What is faster for 4-core CPU - to run 4 threads or... lets say 50? Why second variant should be slover and how much slover?
As i assume: when you run 4 heavy threads on 4-core CPU without another CPU-consuming processes/threads, scheduler is allowed to not move the threads between cores at all; it has no reason to do that in this situation. Core0 (main CPU) will be responsible for executing interruption handler for hardware timer 250 times per second (basic Linux configuration) and other hardware interruption handlers, but another cores may not feel any worries.
What is the cost of context switching? The time for store and restore CPU registers for different context? What about caches, pipelines and various code-prediction things inside CPU? Can we say that each time we switch context, we hurt caches, pipelines and some code-decoding facilities in CPU? So more threads executing on a single core, less work they can do together in comparison to their serial execution?
Question about caches and another hardware optimization in multithreading environment is the interesting question for me now.
As #Baile mentions in the comments, this is highly application, system, environment-specific.
And as such, I'm not going to take the hard-line approach of mentioning exactly 1 thread for each core. (or 2 threads/core in the case of Hyperthreading)
As an experienced shared-memory programmer, I have seen from my experience that the optimal # of threads (for a 4 core machine) can range anywhere from 1 to 64+.
Now I will enumerate the situations that can cause this range:
Optimal Threads < # of Cores
In certain tasks that are very fine-grained paralleled (such as small FFTs), the overhead of threading is the dominant performance factor. In some cases, it's it not helpful to parallelize at all. In some cases, you get speedup with 2 threads, but backwards scaling at 4 threads.
Another issue is resource contention. Even if you have a highly parallelizable task that can easily split across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. So often, you find that 2 threads will be just as fast as 4 threads. (as if often the case with very large FFTs)
Optimal Threads = # of Cores
This is the optimal case. No need to explain here - one thread per core. Most embarrassingly parallel applications that are not memory or I/O bound fit right here.
Optimal Threads > # of Cores
This is where it gets interesting... very interesting. Have you heard about load-imbalance? How about over-decomposition and work-stealing?
Many parallelizable applications are irregular - meaning that the tasks do not split into sub-tasks of equal size. So if you may end up splitting a large task into 4 unequal sizes, assign them to 4 threads and run them on 4 cores... the result? Poor parallel performance because 1 thread happened to get 10x more work than the other threads.
A common solution here is to over-decompose the task into many sub-tasks. You can either create threads for each one of them (so now you get threads >> cores). Or you can use some sort of task-scheduler with a fixed number of threads. Not all tasks are suited for both, so quite often, the approach of over-decomposing a task to 8 or 16 threads for a 4-core machine gives optimal results.
Although spawning more threads can lead to better load-balance, the overhead builds up. So there's typically an optimal point somewhere. I've seen as high as 64 threads on 4 cores. But as mentioned, it's highly application specific. And you need to experiment.
EDIT : Expanding answer to more directly answer the question...
What is the cost of context switching? The time for store and restore
CPU registers for different context?
This is very dependent on the environment - and is somewhat difficult to measure directly. Short answer: Very Expensive This might be a good read.
What about caches, pipelines and various code-prediction things inside
CPU? Can we say that each time we switch context, we hurt caches,
pipelines and some code-decoding facilities in CPU?
Short answer: Yes When you context switch out, you likely flush your pipeline and mess up all the predictors. Same with caches. The new thread is likely to replace the cache with new data.
There's a catch though. In some applications where the threads share the same data, it's possible that one thread could potentially "warm" the cache for another incoming thread or another thread on a different core sharing the same cache. (Although rare, I've seen this happen before on one of my NUMA machines - superlinear speedup: 17.6x across 16 cores!?!?!)
So more threads executing on a single core, less work they can do
together in comparison to their serial execution?
Depends, depends... Hyperthreading aside, there will definitely be overhead. But I've read a paper where someone used a second thread to prefetch for the main thread... Yes it's crazy...
Creating 50 threads will actually hurt performance, not improve it. It just doesn't make any sense.
Ideally you should make the 4 threads, not more, not less. There will be some overhead because of context switching, but that is unavoidable. The OS/services/other applications threads should too be executed. But nowadays with so powerful and lighting-fast CPUs this is of no concern since those OS threads will only take less that 2 % of the CPU's time. Almost all of them will be in blocked state while your program is running.
You might think that, since performance is of critical importance, you should code those small critical areas in low-level assembly language. Modern programming languages allow this.
But seriously... compilers and, in case of Java, the JVM, will optimize those portions so well that it just isn't worth it (unless you actually want to exercise something like this). Instead of your calculations finishing in 100 seconds, they'll finish in 97 or 98. The question you must ask yourself is: is it worth all those hours of coding and debugging ?
You asked about the time cost of context switching. These days, these are extremely low. Look at modern day dual-core CPUs that run Windows 7 for example. If you start an Apache web server on that machine and a MySQL database server, you will easily go over 800 threads. The machine just doesn't feel it. To see how low this cost is, read here: How to estimate the thread context switching overhead? . To spare you the searching/reading part: context switching can be done hundreds of thousands of times per second.
4 threads are faster if you can program your 40 tasks switching better than Operating System does.
If you can use 4 threads, use them. There's no way 50 will go faster than 4 on a 4-core machine. All you get is more overhead.
Of course, you're describing an ideal non-real-world situation, so whatever you are actually building, you'll need to measure in order to understand how the performance is affected.
There is Hyperthreading technology which can handle more that one thread per CPU, but it is hardly dependent on type of calculation you want to do. Consider using of GPU or very low assembly language to achieve maximum power.

Resources