Optimal number of threads when hyperthreading is off - multithreading

I have a question which I tried to find an answer for, but got more confused from all the information I found - unfortunately, couldn't gain a clear answer
So, let's say I have a coomputer with hyperthreading turned OFF.
What is the optimal number of threads I should use at a program I wrote?
I understand that if my program is NOT 100% CPU bound (deals with IO), so the optimal number of threads will be MORE than one thread per core - since I will have multi threads which are waiting, and having more (not too much due to context switching overhead) will be better for these kind of programs.
BUT, In case my program is 100% CPU bound - one thread per core is the optimal?
I'm confused since having more threads, meaning maybe getting a bigger slice time for each thread - which can improve the performance.
Thanks!

With purely CPU-bound loads without hyper-threading the answer always is 1 thread per core.
With HT turned on it can be less than one thread per HT-core because the threads fight over the same cache. But usually, even here one thread per HT core is best.
With IO workloads it's much more complicated but this does not apply here.
since having more threads, meaning maybe getting a bigger slice time for each thread
Not sure I follow the reasoning. The OS will hand out time slices to threads approximately in a round robin way. Time slices are 4-40ms and their size does not change depending on the count of threads.
Ideally, when the number of threads is exactly right, there are no context switches to speak of. The more threads you add the more context switches will there be.

Related

How to optimize number of threads per number of cores

I'm trying to get a better idea of how many threads should run on n number of cores. I know this a complicated question in which the answer depends on a number of factors, such as how much shared state there is, and how much sleeping and waiting on resources each thread does.
To simplify things, let's say we have 2 cores and just one process that can divide its work into threads with no shared state. Let's say each thread just performs computation after computation with no sleeping and no waiting on resources. Would the ideal number of threads in this case be 2?
Let's complicate things a bit and say that the threads have to do some sort of disk I/O. How does this change our answer? I would think that we could have more than 2 cores in this case.
Or let's say they don't do any sleeping or waiting on resources, but instead there's some memory that they both have access to that requires synchronization. How does this change our answer? I would think that in this case we may actually prefer 1 thread over 2, depending on how much synchronization is required.
This is a hard question to answer in the general matter. It really depends on more specifics in the case. What to remember is that it costs to do context-switches - if you're only doing computations it would be wasteful to have 2 threads running on one core (as you wouldn't really gain anything - only lose in the context switches). On the other hand if you are waiting for resources and at the same time can continue with other calculations, it is a good idea to have a thread wait for those resources to not make the entire execution lag behind.
When it comes to IO you don't think in terms of threads per core. You think in terms of threads per physical device. Each individual device has different optimal DOP's (magnetic disk = 1, SSD at least 4, network much higher).
For CPU bound work the optimal number is 1 (per core).
In mixed cases or cases where it is more complicated than that no general answer can be given. The system can behave in surprising ways (like collapsing under load!). The approach here is to test different DOPs and use the best. Generally, there will be exactly one optimum while both 1 and infinity perform much worse. So you only need to find the single maximum which is quite easy.

CPU usage vs Number of threads

In general what is the relation between CPU usage and number of threads in a program.
Assumptions:
Multi-core CPU
Threads do the exact same job (assume they fetch identical work items from a queue and process them)
It depends on the nature of the application.
An application that mostly do calculations - a ratio of 1 thread per
core is a reasonable decision, since you don't want to spawn too many threads due to overhead, and you want to take advantage of all your cores.
An application that mostly do IO operations (like http requests) can spawn much more threads then the #cores and still increase efficiency, since the bottleneck is the waiting time per IO request, and you want to gain as much information as possible in each time you need to wait.
That said, the CPU-usage you are going to get is still dependent on many factors (IO, synchronization, non parallel parts in your program).
If you are interested in the speed the application will take - always remember Amdahl's law, which gives you a strict bound on the time (speed-up) your application is going to take, even when having infinite number of working cores.
There is no such general relationship, except for the obvious ones:
an application can't use more CPU time (CPU seconds) than the number of available cores multiplied by the number of (wall clock) seconds that it runs, and
a single thread can't use more than one CPU second per second.
The actual amount of CPU that a multi-threaded application depends mostly on the nature of the application, and the way that you've implemented it:
If the computation performed by each thread does not generate contention with other threads for locks, memory access and so on, then you should be able to approach the theoretical limit of available CPU resources.
Contention is liable to reduce effective CPU usage, sometimes dramatically.
But there are no general formulae that will tell you how much speed-up you can get.
I think there is no relation or not easy one. It depends on the jobs the threads are doing. A program with one thread can consume 100% of CPU and a program with lots of threads can consume less.
If you are looking for an optimized relation between threads and job done, you must study your case, and possibly found an empiric solution.
As the other answers already state, "it depends". In an ideal world, for n cores, you would get a throughput of factor n, given that you do the same job in a separate thread on each core (which already contains a false assumption, since you need to somehow synchronize the threads when they read from the same queue).
Understanding the Disruptor, a Beginner's Guide to Hardcore Concurrency gives some nice examples what you need to consider when parallezing tasks, and also shows some cases where the attempt to parallelize leads to a longer execution time.

How to do the same calculations faster on 4-core CPU: 4 threads or 50 threads?

Lets assume we have fixed amount of calculation work, without blocking, sleeping, i/o-waiting. The work can be parallelized very well - it consists of 100M small and independent calculation tasks.
What is faster for 4-core CPU - to run 4 threads or... lets say 50? Why second variant should be slover and how much slover?
As i assume: when you run 4 heavy threads on 4-core CPU without another CPU-consuming processes/threads, scheduler is allowed to not move the threads between cores at all; it has no reason to do that in this situation. Core0 (main CPU) will be responsible for executing interruption handler for hardware timer 250 times per second (basic Linux configuration) and other hardware interruption handlers, but another cores may not feel any worries.
What is the cost of context switching? The time for store and restore CPU registers for different context? What about caches, pipelines and various code-prediction things inside CPU? Can we say that each time we switch context, we hurt caches, pipelines and some code-decoding facilities in CPU? So more threads executing on a single core, less work they can do together in comparison to their serial execution?
Question about caches and another hardware optimization in multithreading environment is the interesting question for me now.
As #Baile mentions in the comments, this is highly application, system, environment-specific.
And as such, I'm not going to take the hard-line approach of mentioning exactly 1 thread for each core. (or 2 threads/core in the case of Hyperthreading)
As an experienced shared-memory programmer, I have seen from my experience that the optimal # of threads (for a 4 core machine) can range anywhere from 1 to 64+.
Now I will enumerate the situations that can cause this range:
Optimal Threads < # of Cores
In certain tasks that are very fine-grained paralleled (such as small FFTs), the overhead of threading is the dominant performance factor. In some cases, it's it not helpful to parallelize at all. In some cases, you get speedup with 2 threads, but backwards scaling at 4 threads.
Another issue is resource contention. Even if you have a highly parallelizable task that can easily split across 4 cores/threads, you may be bottlenecked by memory bandwidth and cache effects. So often, you find that 2 threads will be just as fast as 4 threads. (as if often the case with very large FFTs)
Optimal Threads = # of Cores
This is the optimal case. No need to explain here - one thread per core. Most embarrassingly parallel applications that are not memory or I/O bound fit right here.
Optimal Threads > # of Cores
This is where it gets interesting... very interesting. Have you heard about load-imbalance? How about over-decomposition and work-stealing?
Many parallelizable applications are irregular - meaning that the tasks do not split into sub-tasks of equal size. So if you may end up splitting a large task into 4 unequal sizes, assign them to 4 threads and run them on 4 cores... the result? Poor parallel performance because 1 thread happened to get 10x more work than the other threads.
A common solution here is to over-decompose the task into many sub-tasks. You can either create threads for each one of them (so now you get threads >> cores). Or you can use some sort of task-scheduler with a fixed number of threads. Not all tasks are suited for both, so quite often, the approach of over-decomposing a task to 8 or 16 threads for a 4-core machine gives optimal results.
Although spawning more threads can lead to better load-balance, the overhead builds up. So there's typically an optimal point somewhere. I've seen as high as 64 threads on 4 cores. But as mentioned, it's highly application specific. And you need to experiment.
EDIT : Expanding answer to more directly answer the question...
What is the cost of context switching? The time for store and restore
CPU registers for different context?
This is very dependent on the environment - and is somewhat difficult to measure directly. Short answer: Very Expensive This might be a good read.
What about caches, pipelines and various code-prediction things inside
CPU? Can we say that each time we switch context, we hurt caches,
pipelines and some code-decoding facilities in CPU?
Short answer: Yes When you context switch out, you likely flush your pipeline and mess up all the predictors. Same with caches. The new thread is likely to replace the cache with new data.
There's a catch though. In some applications where the threads share the same data, it's possible that one thread could potentially "warm" the cache for another incoming thread or another thread on a different core sharing the same cache. (Although rare, I've seen this happen before on one of my NUMA machines - superlinear speedup: 17.6x across 16 cores!?!?!)
So more threads executing on a single core, less work they can do
together in comparison to their serial execution?
Depends, depends... Hyperthreading aside, there will definitely be overhead. But I've read a paper where someone used a second thread to prefetch for the main thread... Yes it's crazy...
Creating 50 threads will actually hurt performance, not improve it. It just doesn't make any sense.
Ideally you should make the 4 threads, not more, not less. There will be some overhead because of context switching, but that is unavoidable. The OS/services/other applications threads should too be executed. But nowadays with so powerful and lighting-fast CPUs this is of no concern since those OS threads will only take less that 2 % of the CPU's time. Almost all of them will be in blocked state while your program is running.
You might think that, since performance is of critical importance, you should code those small critical areas in low-level assembly language. Modern programming languages allow this.
But seriously... compilers and, in case of Java, the JVM, will optimize those portions so well that it just isn't worth it (unless you actually want to exercise something like this). Instead of your calculations finishing in 100 seconds, they'll finish in 97 or 98. The question you must ask yourself is: is it worth all those hours of coding and debugging ?
You asked about the time cost of context switching. These days, these are extremely low. Look at modern day dual-core CPUs that run Windows 7 for example. If you start an Apache web server on that machine and a MySQL database server, you will easily go over 800 threads. The machine just doesn't feel it. To see how low this cost is, read here: How to estimate the thread context switching overhead? . To spare you the searching/reading part: context switching can be done hundreds of thousands of times per second.
4 threads are faster if you can program your 40 tasks switching better than Operating System does.
If you can use 4 threads, use them. There's no way 50 will go faster than 4 on a 4-core machine. All you get is more overhead.
Of course, you're describing an ideal non-real-world situation, so whatever you are actually building, you'll need to measure in order to understand how the performance is affected.
There is Hyperthreading technology which can handle more that one thread per CPU, but it is hardly dependent on type of calculation you want to do. Consider using of GPU or very low assembly language to achieve maximum power.

If 256 threads give better performance than 8 have I likely got the wrong approach?

I've just started programming with POSIX threads on dual-core x86_64 Linux system. It seems that 256 threads is about the optimum for performance with the way I've done it. I'm wondering how this could be? And if it could mean that my approach is wrong and a better approach would require far fewer threads and be just as fast or faster?
For further background (the program in question is a skeleton for a multi-threaded M-set image generator) see the following questions I've asked already:
Using threads, how should I deal with something which ideally should happen in sequential order?
How can my threaded image generating app get it’s data to the gui?
Perhaps I should mention that the skeleton (in which I've reproduced minimal functionality for testing and comparison) is now displaying the image, and the actual calculations are done almost twice as fast as the non-threaded program.
So if 256 threads running faster than 8 threads is not indicative of a poor approach to threading, how come 256 threads does outperform 8 threads?
The speed test case is a portion of the Mandelbrot Set located at:
xmin -0.76243636067708333333333328
xmax -0.7624335575810185185185186
ymax 0.077996663411458333333333929
calculated to a maximum of 30000 iterations.
On the non-threaded version rendering time on my system is around 15 seconds. On the threaded version, averages speed for 8 threads is 7.8 seconds, while 256 threads is 7.6 seconds.
Well, probably yes, you're doing something wrong.
However, there are circumstances where 256 threads would run better than 8 without you necessarily having a bad threading model. One must remember that having 8 threads does not mean all 8 threads are actually running all the time. Anytime one thread makes a blocking syscall to the operating system, the thread will stop running and wait for the result. In the meantime, another thread can often do work.
There's this myth that one can't usefully use more threads than contexts on the CPU, but that's just not true. If your threads block on a syscall, it can be critical to have another thread available to do more work. (In practice when threads block there tends to be less work to do, but this is not always the case.)
It's all very dependent on work-load and there's no one right number of threads for any particular application. Generally you never want less threads available than the OS will run, and that's the only true rule. (Unfortunately this can be very hard to find out and so people tend to just fire up as many threads as contexts and then use non-blocking syscalls where possible.)
Could it be your app is io bound? How is the image data generated?
A performance improvement gained by allocating more threads than cores suggests that the CPU is not the bottleneck. If I/O access such as disk, memory or even network access are involved your results make perfect sense.
You are probably benefitting from Simultaneous Multithreading (SMT). Your operating system schedules more threads than cores available, and will swap in and out the threads that are not stalled waiting for resources (such as a memory load). This can very effectively hide the latencies of your memory system from your program and is the technique used to great effect for massive parallelization in CUDA for general purpose GPU programming.
If you are seeing a performance increase with the jump to 256 threads, then what you are probably dealing with is a resource bottleneck. At some point, your code is waiting for some slow device (a hard disk or a network connection, for example) in order to continue. With multiple threads, waiting on this slow device isn't a problem because instead of sitting idle and twiddling its electronic thumbs, the CPU can process another thread while the first thread is waiting on the slow device. The more parallel threads that are running, the more work the CPU can do while it is waiting on something else.
If you are seeing performance improve all the way up to 256 threads, I am tempted to say that you have a major performance bottleneck somewhere and it's not the CPU. To test this, try to see if you can measure the idle time of individual threads. I suspect that you will see your threads are stuck in a "blocked" or "waiting" state for a longer portion of their lifetime than they spend in the "running" or "active" state. Some debuggers or function profiling tools will let you do this, and I think there are also Linux tools to do this on the command line.

How many simultaneous threads in an application is a lot?

5, 100, 1000?
I guess, "it depends", but on what?
What is common in applications that run as server daemons / services?
What are hard limits?
Given that the machine can handle the overall workload, how do I determine at how many threads the overhead starts to have an impact on performance?
What are important differences between OS's?
What else should be considered?
I'm asking because I would like to employ threads in an application to organize subcomponents of my application that do not share data and are designed to do their work in parallel. As the application would also use thread pools for parallelizing some tasks, I was wondering at what point I should start to think about the number of threads that's going to run in total.
I know the n+1 rule as a guideline for determining the number of threads that simultaneously work on the same task to gain performance. However, I want to use threads like one might use processes in a larger scope, i. e. to organize independent tasks that should not interfere with each other.
In this related question, some people advise to minimise the number of threads because of the added complexity. To me it seems that threads can also help to keep things sorted more orderly and actually reduce interference. Isn't that correct?
I can't answer your question about "how much is many" but I agree that you should not use threads for every task possible.
The optimal amount of threads for performance of application is (n+1), where n is the amount of processors/cores your computer/claster has.
The more your actual thread amount differs from n+1, the less optimal it gets and gets your system resources wasted on thread calculations.
So usually you use 1 thread for the UI, 1 thread for some generic tasks, and (n+1) threads for some huge-calculation tasks.
Actually Ajmastrean is a little out of date. Quoting from his own link
The thread pool has a default size of
250 worker threads per available
processor, and 1000 I/O completion
threads. The number of threads in the
thread pool can be changed by using
the SetMaxThreads method.
But generally I think 25 is really where the law of diminishing returns (and programmers abilities to keep track of what is going on) starts coming into effect. Although Max is right, as long as all of the threads are performing non-blocking calculations n+1 is the optimal number, in the real world most of the threading tasks I perform tend to be done on stuff with some kind of IO.
Also depends on your architecture. E.g. in NVIDIA GPGPU lib CUDA you can put on an 8 thread multiprocessor 512 threads simoultanously. You may ask why assign each of the scalar processors 64 threads? The answer is easy: If the computation is not compute bound but memory IO bound, you can hide the mem latencies by executing other threads. Similar applies to normal CPUs. I can remember that a recommendation for the parallel option for make "-j" is to use approx 1.5 times the number of cores you got. Many of the compiling tasks are heavy IO burden and if a task has to wait for harddisk, mem ... whatever, CPU could work on a different thread.
Next you have to consider, how expensive a task/thread switch is. E.g. it is comes free, while CPU has to perform some work for a context switch. So in general you have to estimate if the penalty for two task switches is longer than the time the thread would block (which depends heavily on your applications).
Microsoft's ThreadPool class limits you to 25 threads per processor. The limit is based on context switching between threads and the memory consumed by each thread. So, that's a good guideline if you're on the Windows platform.

Resources