Number of threads in one program - multithreading

I am new to multithreading. I read some code like the below:
#include <iostream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

void hello_world(string s)
{
    cout << s << endl;
}

int main()
{
    const int n = 1000;
    vector<thread> threads;
    for (int i = 0; i < n; i++) {
        threads.push_back(thread(hello_world, "test"));
    }
    for (size_t i = 0; i < threads.size(); i++) {
        threads[i].join();
    }
    return 0;
}
I believe the program above uses 1000 threads to speed things up. However, this confuses me, because typing the command lscpu returns:
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
where I believe the number of threads is 12. Based on the description above, I have two questions:
(1) How many threads can I call in one program?
(2) Following question 1, I believe the number of threads we can create is limited. What information should I base my decision on when choosing how many threads to create, so that other programs still run well?

How many threads can I call in one program?
You didn't specify a programming language or operating system, but in most cases there's no hard limit. The most important limit is how much memory is available for the thread stacks. It's harder to quantify how many threads it takes before they all spend more time competing with each other for resources than they spend doing any actual work.
the command lscpu returns: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 1
That information tells you how many threads the system can execute simultaneously. If your program has more than twelve threads that are ready to run, then at most twelve of them will actually be running at any given point in time, while the rest of them await their turns.
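In C++ (which your example appears to be), the program can also query that number itself; a minimal sketch:

#include <iostream>
#include <thread>

int main() {
    // Number of hardware threads that can run concurrently (logical cores;
    // 12 on the machine described above). May return 0 if it cannot be determined.
    unsigned n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported\n";
    return 0;
}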
But note: in order for twelve threads to be "ready to run," they have to all be doing tasks that do not interfere with each other. That mostly means doing computations on values that are already in memory. In your example program, all of the threads want to write to the same output stream. Assuming that the output stream is "thread safe," writing to it is something that only one thread can do at a time. It doesn't matter how many cores your computer has.
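For instance, "thread safe" here typically means something like a mutex around each write, which forces the threads to take turns; a sketch, not part of your original program:

#include <iostream>
#include <mutex>
#include <string>

std::mutex cout_mutex;   // shared by all threads

void hello_world(const std::string& s) {
    std::lock_guard<std::mutex> lock(cout_mutex);   // one writer at a time;
    std::cout << s << std::endl;                    // the other 999 threads queue up here
}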
how many threads can I call to make sure other programs run well?
That's hard to answer without knowing what all of the other programs need to do. And what does "run well" mean, anyway? If you want to be able to use office tools or programming tools to get work done while a large, compute-intensive program runs "in the background," then you'll pretty much have to figure out for yourself just how much work the background program can do while still allowing the "foreground" tools to respond to your typing.

When can't computation speed be improved by parallelization?

Similar questions have been asked before, but I couldn't find an answer that was about the low-level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for(int i = 0; i < N; i++){             // N = 160,000,000,000
    physicalModel(input[i]);            // linear function, just additions and subtractions
}
function physicalModel(x){
    return A*x + B*x + C*x + D*x + ...; // an oversimplification, but you get the idea: a linear function
}
Given the nature of this problem, am I correct in thinking that a single thread on a single core, or one thread per core, would be the fastest way to solve this, and that using extra threads beyond the number of cores would not help here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently; they just share processor time, which can be beneficial when one thread is waiting on, say, a socket response while other threads are processing requests. In the example I posted above, I figure the CPU could go to 100% on one thread, so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, what's the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. But that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables, no disk/network access, no serving of many remote users) and just does some arithmetic on its input, either on a single input point or on several nearby points, as in a stencil kernel:
const long long N = 160000000000LL;  // 160 billion: too big for int, so use long long
for(long long i = 0; i < N; i++){
    // Linear function, just additions and subtractions
    output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] ... */);
}
Then you have to check how efficiently this function works on a single CPU: can you (or your compiler) unroll your loop and convert it to SIMD parallelism?
for(long long i = 0 + 1; i < N - 1; i++){
    output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}
// Unrolled 4 times; if input is float, the compiler may load 4 floats
// into a single SSE2 register and do 4 operations with one instruction
for(long long i = 0 + 4; i < N - 4; i += 4){
    output[i+0] = A*input[i-1] + B*input[i+0] + C*input[i+1];
    output[i+1] = A*input[i+0] + B*input[i+1] + C*input[i+2];
    output[i+2] = A*input[i+1] + B*input[i+2] + C*input[i+3];
    output[i+3] = A*input[i+2] + B*input[i+3] + C*input[i+4];
}
Once your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP, MPI, or another method). Under our assumptions, no thread blocks on an external resource such as reading from an HDD or from the network, so every thread you start can run at any time. Then you should start no more than one thread per CPU core: if you start more, each thread will run for some amount of time and then be displaced by another, giving less performance than one thread per core.
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (plus the -fopenmp/-openmp compilation option); the compiler and runtime library will split your for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads, or some other split scheme; you can give hints with the schedule clause):
#pragma omp parallel for
for(long long i = 0 + 1; i < N - 1; i++){
    output[i] = physicalModel(input[i]);
}
The thread count will be determined at runtime by the OpenMP library (gomp in gcc: https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default it is one thread per logical CPU core. You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On a CPU with hardware multithreading within a single core (Intel HT and other SMT variants: you have 4 physical cores but 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), since some resources (e.g. FPU units) are shared between logical cores. Do some experiments if your code will be run many times.
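With a recent OpenMP runtime (version 4.0 or later), one-thread-per-physical-core binding can be requested from the environment without touching the code. A minimal sketch, assuming a 6-core machine:

export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # bind threads to places, nearest first
export OMP_NUM_THREADS=6     # one thread per physical core
./program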
If your threads (your model) are limited by memory speed (memory bound: they load a lot of data from memory and do only a very simple operation on every float), you may want to run fewer threads than the CPU core count, as additional threads will not get additional memory bandwidth.
If your threads do a lot of computation for every element loaded from memory (compute bound), use better SIMD and more threads. When you have very good and wide SIMD (full-width AVX), you will get no speedup from HT, as the full-width AVX unit is shared between logical cores (but every physical core has one, so use it); in this case the CPU frequency will also be lower, as the full-width AVX unit runs very hot under full load.
An illustration of memory-limited versus compute-limited applications (the roofline model): https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/ (chart: https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png)

Thread pools with OpenMP: overhead and changing the number of threads

I recently discovered the concept of thread pools. As far as I understand, GCC, ICC, and MSVC all use thread pools with OpenMP. I'm curious to know what happens when I change the number of threads. For example, let's assume the default number of threads is eight: I create a team of eight threads, then in a later section I use four threads, and then I go back to eight.
#pragma omp parallel for
for(int i=0; i<n; i++) { /* ... */ }
#pragma omp parallel for num_threads(4)
for(int i=0; i<n; i++) { /* ... */ }
#pragma omp parallel for
for(int i=0; i<n; i++) { /* ... */ }
This is something I actually do now, because part of my code gets worse results with hyper-threading, so I lower the number of threads to the number of physical cores (for that part of the code only). What if I did the opposite (four threads, then eight, then four)?
Does the thread pool have to be recreated each time I change the number of threads? If not, does adding or removing threads cause any significant overhead?
What's the overhead for the thread pool, i.e. what fraction of the work per thread goes to the pool?
It might be a bit late to answer this question, but I am going to do so anyway.
When you start with 8 threads from the beginning, 7 threads are created; then, including your main thread, you have a team of 8. The first loop in your sample code is executed by this team, so the thread pool holds 8 threads. After they are done with this region, they go to sleep until they are woken up.
Now, when you reach the second parallel region, which asks for 4 threads, only 3 threads from your thread pool are woken up (3 pool threads + your current main thread) and the rest stay in sleep mode. So four of the threads are sleeping.
And then, just as in the first parallel region, all threads cooperate to execute the third parallel region.
On the other hand, if you start with 4 threads and the second parallel region asks for 8 threads, the OpenMP library reacts to the change and creates 4 extra threads to meet your request. Created threads are usually not thrown out of the pool until the end of the program's life, since you might need them again later. This is the general approach most OpenMP libraries follow: creating new threads is expensive, so they avoid and postpone it as much as they can.
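To see this in action, here is a small experiment of mine (not part of the original answer); omp_get_num_threads() reports the size of the team executing the current region:

// Compile with: gcc -fopenmp teamsize.c && OMP_NUM_THREADS=8 ./a.out
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    #pragma omp single
    printf("region 1: %d threads\n", omp_get_num_threads());

    #pragma omp parallel num_threads(4)
    #pragma omp single
    printf("region 2: %d threads\n", omp_get_num_threads());

    #pragma omp parallel
    #pragma omp single
    printf("region 3: %d threads\n", omp_get_num_threads());
    return 0;
}

With eight default threads this should print 8, 4, 8, while the pool itself keeps all eight threads alive throughout.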
Hope this helps you and future visitors here.

complex threading with OpenMP

I need to switch from boost::thread to OpenMP because my boss says so.
The problem is quite simple: the result of a simulation is written to disk every 5 iterations (it = 5, 10, 15, ...). For the sake of simplicity, suppose I have an 8-core CPU. I create 9 threads: thread 0 is used for I/O, the other 8 for computation. When (it % 5 == 0), I check thread 0 to see whether it has finished. If yes, I create another thread, assign it to slot 0, and ask it to write the result to disk. If not, all threads have to wait. Usually, the time it takes to write out a result is less than 5 iterations, so I effectively "hide" the I/O cost.
I have spent a few hours looking into OpenMP, and I guess the same algorithm can be implemented with the "task" construct, but I don't see how to synchronize the threads. OpenMP experts, please help. Thanks.
The current pseudocode looks like this:
boost::thread pool[9];
for(int it=0;it<1000;it++)
{
- simulate using pool[1,8]
- if(it%5 == 0)
+ check pool[0]
+ if finished: create new thread, assign to pool[0], write data out
+ if not, wait
}
Intel has a very nice answer to the OpenMP vs Threads dilemma, I would defer to them and ask your boss to get some education.
OpenMP is very loop oriented: you parallelize the entire loop rather than managing synchronized threads yourself.
Your design seems right to me overall: having a separate thread for I/O and a thread pool for computations is the right approach. You might replace the Boost threads in pool[1..8] with OpenMP for the computational part, but I would not go beyond that. If you can't use Boost, use POSIX threads.
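That said, if the port to OpenMP has to happen anyway, the overlap can be expressed with tasks (using OpenMP 4.5's taskloop construct). The following is only a sketch under stated assumptions: simulate_cell and write_snapshot are hypothetical placeholders, and write_snapshot must work on its own copy of the data so the next iterations can proceed:

#include <omp.h>

void simulate_cell(long long i);   // hypothetical: one unit of compute work
void write_snapshot(int it);       // hypothetical: writes a private copy of the state

int main() {
    const long long n = 1000000;
    #pragma omp parallel
    #pragma omp single             // one thread generates tasks, the whole team executes them
    {
        for (int it = 0; it < 1000; it++) {
            #pragma omp taskloop   // implicit taskgroup: waits for this step only
            for (long long i = 0; i < n; i++)
                simulate_cell(i);
            if (it % 5 == 0) {
                // The write becomes a task that any idle thread may pick up; the
                // next taskloop does not wait for it, so I/O overlaps computation.
                #pragma omp task firstprivate(it)
                write_snapshot(it);
            }
        }
        #pragma omp taskwait       // do not exit before the last write finishes
    }
    return 0;
}

The effect matches the Boost design: while one thread is busy writing, the remaining threads carry on with the next simulation steps.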

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs can scale to 24 cores when nothing else is running on the computer. However, it seems that when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I run at high nice values) only manage to get ~1800% CPU (in Linux notation). Meanwhile, about 500% of the CPU cycles (again, in Linux notation) sit idle. Can anyone explain this behavior, and what I can do about it to get all 23 of the cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Is it at all possible that it's relevant that my 24-core jobs are 32-bit while the other jobs I'm competing with are 64-bit?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here are a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two fewer threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try manually binding every thread to a CPU with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.
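On Linux, that looks roughly like this sketch (my illustration; real code should check the return value):

#define _GNU_SOURCE            // for the CPU_* macros and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one CPU; call from each worker with a distinct cpu.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}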
It might be worthwhile to use mpstat (part of the sysstat package) to figure out whether entire CPUs are sitting idle while others are fully utilized. It should give you a more detailed view of utilization than top or vmstat: run mpstat -P ALL to see one line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>

#define NUM_CORES 24

void* loop_forever(void* argument) {
    volatile int a = 0;   // volatile so the busy loop is not optimized away
    while (1) a++;
    return 0;
}

int main(void) {
    int i;
    pthread_t threads[NUM_CORES];
    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], 0, loop_forever, 0);
    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], 0);
    return 0;
}

Multithreading in XNA game

Where can I use multithreading in a simple 2D XNA game? Any suggestions would be appreciated.
Well, there are many options.
Most games use multithreading for things such as:
Physics
Networking
Resource Loading
AI/Logical updates (if you have a lot of computation in the "update" phase of your game)
You really have to think about your specific game architecture, and decide where you'd benefit the most from using multithreading.
Some games use multithreaded renderers as a core design philosophy.
For instance... thread 1 calculates all of the game logic, then sends this information to thread 2. Thread 2 precalculates a display list and passes this to the GPU. Thread 1 ends up running 2 frames behind the GPU, thread 2 runs one frame behind the GPU.
The advantage is really that you can in theory do twice as much work in a frame. Skinning can be done on the CPU and can become "free" in terms of CPU and GPU time. It does require double buffering a large amount of data and careful construction of your engine flow so that all threads stall when (and only when) necessary.
Aside from this, a pretty common technique these days is to have a number of "worker threads" running. Tasks with a common interface can be added to a shared (threadsafe) queue and executed by the worker threads. The main game thread then adds these tasks to the queue before the results are needed and continues with other processing. When the results are eventually required, the main thread has the ability to stall until the worker threads have finished processing all of the required tasks.
For instance, an expensive for loop can be changed to use tasks.
// Single-threaded method.
for (i = 0; i < numExpensiveThings; i++)
{
    ProcessExpensiveThings(expensiveThings[i]);
}

// Accomplishes the same work, using N worker threads.
for (i = 0; i < numExpensiveThings; i++)
{
    AddTask(ProcessExpensiveThingsTask, i);
}
WaitForAll(ProcessExpensiveThingsTask);
You can do this whenever you're guaranteed that ProcessExpensiveThings() is thread-safe with respect to other calls. If you have 80 things at 1ms each and 8 worker threads, you've saved yourself roughly 70ms. (Well, not really, but it's a good hand-wavy approximation.)
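If the queue mechanics are the unclear part, below is a minimal C++11 sketch of such a worker pool (my illustration, not XNA code; an XNA game would write the equivalent in C#). AddTask corresponds to AddTask above, and draining the queue in the destructor plays the role of WaitForAll:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; i++)
            workers.emplace_back([this] { Run(); });
    }
    void AddTask(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~WorkerPool() {                  // waits for all queued tasks to finish
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (tasks.empty()) return;        // done and fully drained
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                               // run outside the lock
        }
    }
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    std::vector<std::thread> workers;
    bool done = false;
};

A real engine would keep the pool alive across frames and track per-task completion rather than tearing the pool down each time.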
There are lots of places to apply it: AI, object interaction, multiplayer gaming, etc. It depends on your concrete game.
Why do you want to use multi-threading?
If it is for practice, a reasonable and easy module to put in its own thread would be the sound system, as communication is primarily one-way.
Multi-threading with GameComponents is meant to be quite straightforward; see e.g.:
http://roecode.wordpress.com/2008/02/01/xna-framework-gameengine-development-part-8-multi-threading-gamecomponents/
