Similar questions have been asked before, but I couldn't find an answer that addressed the low-level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
const int64_t N = 160000000000LL; // 160 billion points; too large for a 32-bit loop counter
for(int64_t i = 0; i < N; i++){
    physicalModel(input[i]); // linear function, just additions and subtractions
}

physicalModel(x){
    return A*x + B*x + C*x + D*x + ...; // an oversimplification, but you get the idea: a linear function
}
Given the nature of this problem, am I correct in thinking that a single thread on a single core, or one thread per core, would be the fastest way to solve this? And that using extra threads beyond the number of cores would not help me here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently; they just share processor time, which can be beneficial when one thread is waiting on, say, a socket response while other threads are processing requests. In the example I posted above, I figure the CPU could go to 100% on one thread, so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, what's the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. But that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables; no disk or network access; no serving of many remote users) and just does some arithmetic on its input, either on a single input point or on several nearby points, as in a stencil kernel:
const int64_t N = 160000000000LL; // 160 billion; needs a 64-bit loop counter
for(int64_t i = 0; i < N; i++){
    // linear function, just additions and subtractions
    output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] .. */);
}
Then you have to check how efficiently this function runs on a single CPU core. Can you (or your compiler) unroll the loop and convert it to SIMD parallelism?
for(int64_t i = 1; i < N-1; i++){
    output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}
// Unrolled 4 times; if input is float, the compiler may load 4 floats
// into a single SSE register and do 4 operations with one instruction.
for(int64_t i = 4; i < N-4; i += 4){
    output[i+0] = A*input[i-1] + B*input[i+0] + C*input[i+1];
    output[i+1] = A*input[i+0] + B*input[i+1] + C*input[i+2];
    output[i+2] = A*input[i+1] + B*input[i+2] + C*input[i+3];
    output[i+3] = A*input[i+2] + B*input[i+3] + C*input[i+4];
}
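For comparison, here is a hand-vectorized sketch of the same 3-point stencil using SSE intrinsics. This is my own illustration, not part of the original code; it assumes float data, uses unaligned loads, and falls back to a scalar tail for leftover elements:

#include <immintrin.h>

void stencil_sse(const float* input, float* output, long long n,
                 float A, float B, float C) {
    const __m128 va = _mm_set1_ps(A); // broadcast coefficients to all 4 lanes
    const __m128 vb = _mm_set1_ps(B);
    const __m128 vc = _mm_set1_ps(C);
    long long i = 1;
    for (; i + 4 <= n - 1; i += 4) {
        __m128 left   = _mm_loadu_ps(&input[i-1]); // 4 floats per load
        __m128 center = _mm_loadu_ps(&input[i]);
        __m128 right  = _mm_loadu_ps(&input[i+1]);
        __m128 res    = _mm_add_ps(_mm_mul_ps(va, left),
                        _mm_add_ps(_mm_mul_ps(vb, center),
                                   _mm_mul_ps(vc, right)));
        _mm_storeu_ps(&output[i], res); // 4 results per store
    }
    for (; i < n - 1; ++i) // scalar tail for the remainder
        output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}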
Once your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP, MPI, or another method). Under our assumptions there are no threads blocking on an external resource such as a disk or the network, so every thread you start can run at any time. You should then start no more than one thread per CPU core: if you start more, each thread will run for a time slice and then be displaced by another, giving less performance than one thread per core.
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (plus the -fopenmp/-openmp option to your compilation); the compiler and runtime library will split the for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads, or another split scheme; you can give hints with the schedule clause):
#pragma omp parallel for
for(int64_t i = 1; i < N-1; i++){
    output[i] = physicalModel(input[i]);
}
The thread count will be determined at runtime by the OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default it is one thread per logical CPU core. You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On a CPU with hardware multithreading within single cores (Intel HT and other SMT variants: you have 4 physical cores but 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), since some resources (such as the FPU units) are shared between logical cores. Do some experiments if your code will be run several (many) times.
If your threads (model) are limited by memory speed (memory bound: they load lots of data from memory and do only a very simple operation on every float), you may want to run fewer threads than there are CPU cores, since additional threads will not get additional memory bandwidth.
If your threads do a lot of computation for every element loaded from memory (compute bound), use better SIMD and more threads. When you have very good and wide SIMD (full-width AVX), you will get no speedup from HT, as the full-width AVX unit is shared between logical cores (but every physical core has one, so use them all); in this case you will also run at a lower CPU frequency, as the full-width AVX unit runs very hot under full load.
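As a rough worked example (my numbers, assuming 4-byte floats and the 3-point stencil above): each output point costs about 5 flops (3 multiplies, 2 adds) while streaming about 8 bytes (one newly fetched input float plus one output float; the two older inputs are already cached). That is an arithmetic intensity of roughly 0.6 flop/byte, far below the compute roofline of any modern CPU, so such a kernel is memory bound.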
Illustration of memory and compute limited applications: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png
Related
I have an assignment and the following was mentioned:
Parallelism adds further complications, as the random number
generator function must behave correctly when two or more threads call
it simultaneously, i.e., it must be thread-safe.
However, I can't understand why that is a problem in the first place. Random generators are usually called with a seed as a parameter and output a number by doing multiple operations on it. I understand that we need each thread to use a different seed, but other than that, where does the problem come from? I have also noticed that spawning and calling the random generator in a parallel region instead of a serial one really worsens performance, but I can't understand why this would happen, since by the looks of it the random number generator should run concurrently without any problems, as there are no dependencies.
Any help in understand the theory behind this would be appreciated.
Aside from producing wrong values because of the race condition (pointed out by @MitchWheat), the code will be less efficient because of cache-line sharing between cores on mainstream x86 processors.
Here is an example of a (pretty bad but simple) 32-bit random generator (written in C):
#include <stdint.h>
#include <stddef.h>

uint32_t seed = 0xD1263AA2; // global state shared by every caller

uint32_t customRandom() {
    uint32_t old = seed;
    seed = (uint32_t)(((uint64_t)seed * 0x2E094C40) >> 24); // advance the shared state
    return old;
}

void generateValues(uint32_t* arr, size_t size) {
    for(size_t i=0 ; i<size ; ++i)
        arr[i] = customRandom();
}
If you run this example sequentially, the state seed will be kept in the memory hierarchy by mainstream compilers (i.e. GCC and Clang). This 32-bit block of memory will be read/written in the L1 cache, which is very close to the core executing the code.
When you parallelize the loop naively, for example with #pragma omp parallel for in OpenMP, the state is read/written concurrently by multiple threads. There is a race condition: the state value seed can be read and written by multiple threads in parallel. Consequently, the same value can be generated by multiple threads, even though the results are supposed to be random. Race conditions are bad and must be fixed; you can fix this one by using a thread-local state.
Assuming you do not fix the code because you want to understand the impact of the race condition on the resulting performance, you should see a performance drop in parallel. The issue comes from the cache coherence protocol used by mainstream x86 processors. Indeed, seed is shared between the threads executing on the different cores, so the processor has to synchronize the cores' caches to keep them coherent. This process is very expensive (much slower than reading/writing the L1 cache). More specifically, when a thread on a given core writes to seed, the processor invalidates the copies of seed stored in the caches of all the other cores. Each thread must then fetch the updated seed (typically from the much slower L3 cache) whenever seed is read.
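For reference, here is a minimal sketch of the thread-local fix mentioned above (assuming C11 _Thread_local and OpenMP; the per-thread reseeding with omp_get_thread_num() is my own illustrative choice so the threads do not all produce the same sequence):

#include <stdint.h>
#include <stddef.h>
#include <omp.h>

// One generator state per thread: no shared writes, hence no race
// condition and no cache-line ping-pong between cores.
static _Thread_local uint32_t seed = 0;

uint32_t customRandom(void) {
    uint32_t old = seed;
    seed = (uint32_t)(((uint64_t)seed * 0x2E094C40) >> 24);
    return old;
}

void generateValues(uint32_t* arr, size_t size) {
    #pragma omp parallel
    {
        // Give each thread a distinct starting state (illustrative choice).
        seed = 0xD1263AA2u ^ ((uint32_t)omp_get_thread_num() * 0x9E3779B9u);
        #pragma omp for
        for (size_t i = 0; i < size; ++i)
            arr[i] = customRandom();
    }
}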
I am new to multithreading. I read code like the below:
#include <iostream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

void hello_world(string s)
{
    cout << s << endl;
}

int main()
{
    const int n = 1000;
    vector<thread> threads;
    for(int i = 0; i < n; i++){
        threads.push_back(thread(hello_world, "test"));
    }
    for(size_t i = 0; i < threads.size(); i++){
        threads[i].join();
    }
    return 0;
}
I believe the program above uses 1000 threads to speed up the program. However, this confuses me, because when I type the command lscpu it returns:
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
which I take to mean the number of hardware threads is 12. Based on the description above, I have two questions:
(1) How many threads can I call in one program?
(2) Following question 1: I believe the number of threads we can call is limited. What information should I base my decision on when choosing how many threads to call, so that other programs still run well?
How many threads can I call in one program?
You didn't specify any programming language or operating system but in most cases, there's no hard limit. The most important limit is how much memory is available for the thread stacks. It's harder to quantify how many threads it takes before they all spend more time competing with each other for resources than they spend doing any actual work.
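As an illustration of the stack-size limit, here is a C sketch (my own, assuming POSIX threads on Linux) that queries the stack reserved per thread; the default is often 8 MiB, so available memory divided by that figure gives a rough ceiling on thread count:

#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_attr_t attr;
    size_t stack_size = 0;
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stack_size); // default stack reserved per thread
    printf("default thread stack size: %zu bytes\n", stack_size);
    pthread_attr_destroy(&attr);
    return 0;
}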
the command lscpu returns: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 1
That information tells you how many threads the system can execute simultaneously. If your program has more than twelve threads that are ready to run, then at most twelve of them will actually be running at any given point in time, while the rest await their turns.
But note: in order for twelve threads to be "ready to run," they must all be doing tasks that do not interfere with each other. That mostly means doing computations on values that are already in memory. In your example program, all of the threads want to write to the same output stream. Assuming that the output stream is "thread safe," writing to it is something only one thread can do at a time, no matter how many cores your computer has.
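To query that simultaneous-execution limit from a program, here is a C sketch using the sysconf call (available on Linux/glibc; std::thread::hardware_concurrency() is the C++ equivalent):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Logical CPUs currently online: 2 threads/core x 6 cores = 12 on the machine above.
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical CPUs online\n", n);
    return 0;
}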
how many thread can I call to make sure other programs run well?
That's hard to answer unless you know what all of the other programs need to do. And what does "run well" mean, anyway? If you want to be able to use office tools or programming tools to get work done while a large, compute-intensive program runs "in the background," then you'll pretty much have to figure out for yourself just how much work the background program can do while still allowing the "foreground" tools to respond to your typing.
I recently discovered the concept of thread pools. As far as I understand, GCC, ICC, and MSVC all use thread pools with OpenMP. I'm curious to know what happens when I change the number of threads. For example, let's assume the default number of threads is eight. I create a team of eight threads, then in a later section I use four threads, and then I go back to eight.
#pragma omp parallel for
for(int i=0; i<n; i++) { /* first loop: default team of 8 threads */ }

#pragma omp parallel for num_threads(4)
for(int i=0; i<n; i++) { /* second loop: team of 4 threads */ }

#pragma omp parallel for
for(int i=0; i<n; i++) { /* third loop: back to 8 threads */ }
This is something I actually do now, because part of my code gets worse results with hyper-threading, so I lower the number of threads to the number of physical cores (for that part of the code only). What if I did the opposite (4 threads, then 8, then 4)?
Does the thread pool have to be recreated each time I change the number of threads? If not, does adding or removing threads cause any significant overhead?
What's the overhead for the thread pool, i.e. what fraction of the work per thread goes to the pool?
It might be late by now to answer this question, but I am going to do so anyway.
When you start with 8 threads from the beginning, 7 additional threads are created; together with your main thread, that gives a team of 8. The first loop in your sample code is executed by this team, so the thread pool holds 8 threads. After they are done with the region, they go to sleep until woken up again.
Now, when you reach the second parallel region, which asks for 4 threads, only 3 threads from your thread pool are woken up (3 pool threads + your current main thread), and the remaining four threads stay asleep.
Then, as in the first parallel region, all eight threads cooperate to execute the third parallel region.
On the other hand, if you start with 4 threads and the second parallel region asks for 8, the OpenMP library reacts to the change and creates 4 extra threads to meet your request (8 threads). Created threads are usually not discarded from the pool until the end of the program's life, since you might need them again. This is the general approach most OpenMP libraries follow: creating new threads is expensive, so they avoid and postpone it as much as they can.
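Here is a small sketch to observe this behavior (my own test, assuming an OpenMP-enabled compiler; the pool itself is internal to the runtime, so only the team sizes are visible from the program):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(8)
    {
        #pragma omp single
        printf("region 1 team size: %d\n", omp_get_num_threads());
    }
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        printf("region 2 team size: %d\n", omp_get_num_threads());
    }
    #pragma omp parallel num_threads(8)
    {
        #pragma omp single
        printf("region 3 team size: %d\n", omp_get_num_threads());
    }
    return 0;
}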
Hope this helps you and future visitors here.
I just bought the Matlab Parallel Computing toolbox.
The command matlabpool open opens as many parallel workers as there are cores in my CPU.
But each of my CPU cores has two hardware threads. According to Windows Task Manager, each worker can only use half the performance of one CPU core, which seems to mean one worker = one thread = "half a core".
Therefore, even after all workers are opened, only half of the total CPU capacity appears to be utilized.
Is there any other command that could help with that?
By default, the local cluster type for matlabpool considers only "real" cores when choosing the default number of workers to launch. This is because for MATLAB workloads, hyperthreading often does not provide much benefit. However, this value is only a default - you can edit the cluster type and run anything up to 12 local workers.
You need to understand HyperThreading to answer this question.
Matlab launches a worker thread for every CPU. Suppose you now use a directive like parfor to distribute computation over multiple threads. Every thread will now be crunching numbers happily.
Suppose you are doing a sum of a large vector of numbers. What actually happens is the following:
sum = sum + a[0]
    // array a is not in the CPU cache yet;
    // fetch a small part of a from main memory and put it in the CPU cache
sum = sum + a[1]
sum = sum + a[2]
...
During the fetch of a, the CPU stalls, waiting for system memory. This is called a pipeline bubble, and it is not good for performance. Sometimes part of the array a has been swapped out to the hard drive; the operating system then needs to access the drive to bring that part into main memory, after which it is transferred to the CPU cache. When this happens, your operating system will not let the CPU wait 200+ ms; it will use that time to execute another task instead (like the backup running on your system, refreshing your screen, or ...).
Switching tasks on a CPU results in a performance penalty. To switch to a different task, the operating system must save the CPU registers in main memory, and load the CPU registers of the other task back into the CPU first. This takes time.
With HyperThreading, the number of registers per CPU is doubled. This means that two processes can 'occupy' the CPU. Only one can be executed, but during a stall, the operating system can switch to the second process without any performance penalty.
Forget how Microsoft Windows reports CPU usage; it's wrong. CPU usage is a lot more complicated than a single number like 47%. The real question is rather: should Matlab run two threads per core, or only one?
Arguments pro:
During a stall, the CPU can quickly switch to the other thread and continue executing.
Arguments contra:
There are more threads, and the problem is divided into smaller pieces. This may actually reduce performance, as you need to put more pieces together to get the final result.
A context switch will still 'poison' the L1 and L2 cache, loading in pieces of memory that are of no use to the other thread on the CPU.
If there are no stalls, you have more overhead.
On a desktop, the operating system also wants to run: redrawing the screen, moving your mouse, etc. When all logical CPUs are in use, the operating system is required to do an actual (costly) context switch.
Your job is only complete when all pieces of the problem have been calculated. Using all the cores/threads increases the risk of one thread taking longer and delaying the whole result.
My guess is that the Matlab developers considered the arguments contra to be more important than the arguments pro. My own performance tests certainly suggest that there is little performance gain from HyperThreading for cpu-intensive calculations.
I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems like when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set for high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all of the 23 cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that the fact that my 24-core jobs are 32-bit and the other jobs I'm competing w/ are 64-bit is relevant?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here are a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two fewer threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try manually binding every thread to a CPU, with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with many related threads.
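For example, here is a minimal sketch of per-thread pinning with pthread_setaffinity_np (Linux-specific; pin_to_core is a hypothetical helper name, to be called from each thread you want pinned):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core so the scheduler cannot migrate it.
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}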
It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>
#define NUM_CORES 24

void* loop_forever(void* argument) {
    (void)argument;          // unused
    volatile int a = 0;      // volatile so the busy loop is not optimized away
    while (1) a++;
    return 0;                // never reached
}

int main() {
    int i;
    pthread_t threads[NUM_CORES];
    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], 0, loop_forever, 0);
    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], 0);
    return 0;
}