Synchronization of threads - multithreading

I have a simple program, shown below. The question is whether its two threads can run in parallel in each of the following settings:
1. One core processor,
2. Two core processor,
3. Two one core processors.
Is this program at risk of a race condition?
This is what I've found so far:
1. One core processor - technically threads are not running in parallel, they only appear so, as the CPU switches between them very fast.
2. Two core processor - the number of threads that can run in parallel (simultaneously) is equal to the number of cores, therefore yes, in this case 2 threads can run in parallel.
3. Two one core processors ??
L1: Global
int t1_next=1;
int t2_next=0;
L2: Thread 1
while(1) {
    if(t1_next){
        printf("a");
        printf("b");
        printf("c");
        printf("d");
        t1_next=0;
        t2_next=1;
    }
    else sleep(10);
}
L3: Thread 2
while(1) {
    if(t2_next){
        printf("e");
        printf("f");
        printf("g");
        printf("h");
        t1_next=1;
        t2_next=0;
    }
    else sleep(10);
}
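On the race question: the two flags above are plain ints shared between threads without any synchronization, which C11 and C++11 treat as a data race. A minimal sketch of the same hand-off follows, assuming C++11 and replacing the two flags with a single std::atomic "turn" variable. This is a swapped-in technique, not the original code, and the names ping_pong and turn are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical helper (not from the original post): runs the two-thread
// hand-off `rounds` times and returns everything the threads "printed".
std::string ping_pong(int rounds) {
    assert(rounds >= 0);
    std::atomic<int> turn{1};  // 1: thread 1 may run, 2: thread 2 may run
    std::string out;           // only touched by the thread whose turn it is

    auto worker = [&](int me, int other, const char* text) {
        for (int i = 0; i < rounds; ++i) {
            // The acquire load pairs with the release store below, so the
            // other thread's writes to `out` are visible before we append.
            while (turn.load(std::memory_order_acquire) != me)
                std::this_thread::yield();
            out += text;
            turn.store(other, std::memory_order_release);
        }
    };

    std::thread t1(worker, 1, 2, "abcd");
    std::thread t2(worker, 2, 1, "efgh");
    t1.join();
    t2.join();
    return out;
}
```

Calling ping_pong(2) returns "abcdefghabcdefgh" regardless of how many cores or processors are available, because the turn variable forces strict alternation. The threads may be scheduled in parallel, but only one of them makes progress at a time.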

Related

What happens when OpenMP num_threads is more than 1 and the parallel for has only 1 iteration?

l_thread = 4;
max = 1; //4
#pragma omp parallel for num_threads(l_thread) private(i)
for(i=0;i<max;i++)
;//some operation
In this case, OpenMP will create 4 threads. Since the for loop runs only once here, I assume it will be executed by just one of the 4 threads while the others sit idle. Is that right? Yet I am seeing nearly the same CPU usage on all 4 threads. What might be the reason? Shouldn't only one thread show high usage and the others low?
Your take on this is correct. If max=1 and you have more than one thread, thread 0 will execute the single loop iteration and the other threads will wait at the end of the parallel region for thread 0 to catch up. The reason you're seeing the n-1 other threads causing load on the system is because they spin-wait at the end of regions, because that is much faster when the threads have to wake up and notice that the parallel work (or in your case: not so parallel work :-)) is completed.
You can change this behavior via the OMP_WAIT_POLICY environment variable. See the OpenMP specification for a full description.
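To see which thread actually picks up the single iteration, a small counting experiment can help. The sketch below is an illustration only: the helper name iterations_per_thread is made up, and the #ifdef _OPENMP guards are an assumption so that the code still compiles (and runs serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: runs a loop of `max_iter` iterations on a team of up
// to `n_threads` threads and returns how many iterations each thread executed.
std::vector<int> iterations_per_thread(int n_threads, int max_iter) {
    assert(n_threads >= 1);
    std::vector<int> counts(n_threads, 0);
    #pragma omp parallel for num_threads(n_threads)
    for (int i = 0; i < max_iter; ++i) {
        int tid = 0;  // stays 0 when compiled without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        counts[tid] += 1;  // each thread writes only its own slot
    }
    return counts;
}
```

With max_iter set to 1, exactly one slot of the returned vector is 1 and the rest are 0, whichever thread the runtime picks. The CPU load seen on the other threads comes from waiting at the end of the region, not from loop work.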

Number of threads in one program

I am new to multithreading, and I read code like the following:
#include <iostream>
#include <string>
#include <thread>
#include <vector>
using namespace std;
void hello_world(string s)
{
    cout << s << endl;
}
int main()
{
    const int n = 1000;
    vector<thread> threads;
    for(int i = 0; i < n; i++){
        threads.push_back(thread(hello_world,"test"));
    }
    for(size_t i = 0; i < threads.size(); i++){
        threads[i].join();
    }
    return 0;
}
I believe the program above uses 1000 threads to speed itself up. However, this confuses me, because the command lscpu returns:
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
so I believe the number of threads is 12. Based on the description above, I have two questions:
(1) How many threads can I call in one program?
(2) Following on from question 1, I believe the number of threads we can call is limited. What information can I use to decide how many threads I can call to make sure other programs run well?
How many threads can I call in one program?
You didn't specify any programming language or operating system but in most cases, there's no hard limit. The most important limit is how much memory is available for the thread stacks. It's harder to quantify how many threads it takes before they all spend more time competing with each other for resources than they spend doing any actual work.
the command lscpu returns: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 1
That information tells you how many threads the system can execute simultaneously. If your program has more than twelve threads that are ready to run, then at most twelve of them will actually be running at any given point in time, while the rest await their turns.
But note: In order for twelve threads to be "ready to run," they have to all be doing tasks that do not interfere with each other. That mostly means, doing computations on values that are already in memory. In your example program, all of the threads want to write to the same output stream. Assuming that the output stream is "thread safe," then that will be something that only one thread can do at a time. It doesn't matter how many cores your computer has.
How many threads can I call to make sure other programs run well?
That's hard to answer unless you know what all of the other programs need to do. And what does "run well" mean, anyway? If you want to be able to use office tools or programming tools to get work done while a large, compute-intensive program runs "in the background," then you'll pretty much have to figure out for yourself just how much work the background program can do while still allowing the "foreground" tools to respond to your typing.
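To make the points about hardware threads and the shared output stream concrete, here is a minimal sketch, assuming C++11. It starts one thread per hardware thread reported by std::thread::hardware_concurrency() (which may return 0 if the value is unknown) and serializes writes to a shared stream with a mutex. The helper name hello_from_each_hardware_thread is made up, and an in-memory stream stands in for cout.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: one thread per hardware thread (at least one), each
// writing a single line to a shared stream. Returns the collected output.
std::string hello_from_each_hardware_thread() {
    // hardware_concurrency() is only a hint and may be 0 when unknown.
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::ostringstream out;
    std::mutex out_mutex;
    unsigned lines = 0;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i) {
        threads.emplace_back([&out, &out_mutex, &lines, i] {
            // Only one thread at a time may write, however many cores exist.
            std::lock_guard<std::mutex> lock(out_mutex);
            out << "test from thread " << i << '\n';
            ++lines;
        });
    }
    for (auto& t : threads) t.join();
    assert(lines == n);
    return out.str();
}
```

On the machine described by the lscpu output above, this would be expected to start 12 threads, but the order of the lines is not deterministic, and the mutex means the writes themselves never overlap.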

When can't computation speed be improved by parallelization?

Similar questions have been asked before but I couldn't find an answer that was more about the low level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for(long long i=0; i < N; i++){ // N = 160,000,000,000
    physicalModel(input[i]); //Linear function, just additions and subtractions
}
function physicalModel(x){
    A*x +B*x +C*x + D*x......... //An oversimplification, but you get the idea. A linear function
}
Given the nature of this problem am I correct in thinking a single thread on a single core, or 1 thread per core, would be the fastest way to solve this? That using extra threads beyond the number of cores would not help me here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently, they just share processor time which can be beneficial when one thread is waiting on perhaps a socket response and other threads are processing requests. In the example I posted above I figure the CPU could go to 100% on one thread so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my assumption above is correct, what's the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. But that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables, no disk or network access, no serving of many remote users) and just does some arithmetic on its input, either on a single input point or on several nearby points as in a stencil kernel:
for (long long i = 0; i < 160000000000LL; i++) {
    // Linear function, just additions and subtractions
    output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1] .. */);
}
Then you have to check how efficiently this function runs on a single CPU. Can you (or your compiler) unroll your loop and convert it to SIMD parallelism?
for (long long i = 1; i < 160000000000LL - 1; i++) {
    output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}
// unrolled 4 times; if input is float, the compiler may load 4 floats
// into a single SSE2 register and do 4 operations with one instruction
for (long long i = 4; i < 160000000000LL - 4; i += 4) {
    output[i+0] = A*input[i-1] + B*input[i+0] + C*input[i+1];
    output[i+1] = A*input[i+0] + B*input[i+1] + C*input[i+2];
    output[i+2] = A*input[i+1] + B*input[i+2] + C*input[i+3];
    output[i+3] = A*input[i+2] + B*input[i+3] + C*input[i+4];
}
Once your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP/MPI or another method). Under our assumptions, no thread blocks on an external resource such as reading from an HDD or the network, so every thread you start can run at any time. We should therefore start no more than 1 thread per CPU core. If we start more, each thread will run for some amount of time and then be displaced by another, giving lower performance than with 1 thread per CPU core.
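The "one thread per core, each working on its own slice" idea can be sketched by hand with std::thread. This is only an illustration under stated assumptions: physicalModel here is a stand-in with made-up coefficients, and apply_parallel is a hypothetical helper name. Real code would operate on the 160 billion points in blocks rather than on one in-memory vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the poster's linear function; the coefficients are made up.
inline float physicalModel(float x) { return 2.0f * x + 3.0f * x - 1.0f * x; }

// Hypothetical helper: applies physicalModel to every element, splitting the
// input into at most n_threads contiguous chunks, one worker per chunk.
std::vector<float> apply_parallel(const std::vector<float>& input, unsigned n_threads) {
    std::vector<float> output(input.size());
    n_threads = std::max(1u, n_threads);
    // Ceiling division so that the chunks cover the whole input.
    const std::size_t chunk = (input.size() + n_threads - 1) / n_threads;
    assert(chunk * n_threads >= input.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(input.size(), t * chunk);
        const std::size_t end = std::min(input.size(), begin + chunk);
        if (begin == end) break;  // no elements left for further workers
        workers.emplace_back([&input, &output, begin, end] {
            // Each worker writes a disjoint range, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i)
                output[i] = physicalModel(input[i]);
        });
    }
    for (auto& w : workers) w.join();
    return output;
}
```

A caller would typically pass std::thread::hardware_concurrency() as n_threads and then measure, since a memory-bound loop like this one may stop scaling before all cores are in use.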
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (and adding the -fopenmp/-openmp option to your compilation). The compiler and library will split your for loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..N*3/4], [N*3/4..N] for 4 threads, or some other split scheme; you can give hints with the schedule option):
#pragma omp parallel for
for (long long i = 1; i < 160000000000LL - 1; i++) {
    output[i] = physicalModel(input[i]);
}
The thread count is determined at run time by the OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default, one thread per CPU is used (per logical CPU core). You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
On CPUs with hardware multithreading within each core (Intel HT or other variants of SMT, where you have, say, 4 physical cores and 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), as some resources (FPU units) are shared between logical cores. Do some experiments if your code will be run many times.
If your threads (model) are limited by memory speed (memory bound: they load a lot of data from memory and do a very simple operation on every float), you may want to run fewer threads than the CPU core count, as additional threads will not get additional memory bandwidth.
If your threads do a lot of computation for every element loaded from memory (compute bound), use better SIMD and more threads. When you have very good and wide SIMD (full-width AVX), you will get no speedup from HT, as the full-width AVX unit is shared between logical cores (but every physical core has one, so use it); in this case you will also see a lower CPU frequency, as the full-width AVX unit runs very hot under full load.
Illustration of memory and compute limited applications: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png

Linux on quad core: single executable, 4 processes

I have 4 executables that do some very complex tasks. Each of these programs alone might take nearly 100% of the power of a single core of a quad-core CPU, i.e. almost 25% of total CPU power. Since all of these programs use hardware resources that can't be shared between multiple processes, I wish to run a single executable that spawns 3 child processes which, in turn, occupy the other three cores. I'm on Linux and I'm using C++11. Most of the complex code runs in its own class, and the hardest part runs in a function that I usually call Process(), so I have 4 objects, each with its own Process() that, when running, takes 100% of a single core.
I tried using OpenMP but I don't think it's the best solution as I have no control over CPU affinity. Also using std::thread is not a good idea, because threads inherit the main process' CPU affinity. In Linux I think I can do this with fork() but I have no idea how the whole structure is made.
This might be related to my other question that was partly left unanswered, maybe because I was trying the wrong approach that works in some cases but not in my case.
An example of pseudo-code could be this:
int main()
{
// ...init everything...
// This alone takes 100% of a single core
float out1 = object1->Process();
// This should be spawned as a child process running on another core
float out2 = object2->Process();
// on another core...
float out3 ...
// yet another core...
float out4 ...
// This should still run in the parent process
float total_output = out1 + out2 + out3 + out4;
}
You can use std::thread, that's a front-end to pthread_create().
Then set its affinity with sched_setaffinity() from the thread itself as well.
As you asked, here is a working stub:
#include <sched.h>
#include <cstdlib>
#include <list>
#include <memory>
#include <thread>
void thread_func(int cpu_index) {
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(cpu_index, &cpuSet);
    // pid 0 means "the calling thread"
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuSet);
    /* the rest of the thread body here */
}
using namespace std;
int main(int argc, char **argv) {
    if (argc != 2) exit(1);
    int n_cpus = atoi(argv[1]);
    list< shared_ptr< thread > > lot;
    for (int i=0; i<n_cpus; ++i) {
        lot.push_back( shared_ptr<thread>(new thread(thread_func, i)));
    }
    for(auto tptr = lot.begin(); tptr != lot.end(); ++tptr) {
        (*tptr)->join();
    }
}
Note that for optimal behaviour on multi-processor machines, each thread should initialise its memory (that is, construct its objects) in the thread body: on a NUMA system, memory pages are allocated in memory close to the CPU that first uses them.
For example, you can have a look at this blog.
However, this is not an issue in your specific case, since you are dealing with a single-processor system, or more specifically a system with just one NUMA node (many current AMD processors do contain two NUMA nodes, even within a single physical package), and all the memory banks are attached there.
The final effect of using sched_setaffinity() in this context will be just to pin down each thread to a specific core.
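The stub pins the threads but does not show how the four results get back to the parent for the final sum. A minimal sketch of that part follows, assuming Linux and C++11. The helper names pin_current_thread and run_pinned_and_sum are made up, pinning is treated as best effort, and each job stands in for one object's Process() call.

```cpp
#include <sched.h>      // sched_setaffinity, cpu_set_t (Linux-specific)
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: best-effort pinning of the calling thread to one CPU.
// Returns false if the kernel refuses, e.g. because that CPU does not exist.
bool pin_current_thread(int cpu_index) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_index, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Hypothetical helper: runs job i on its own thread pinned to CPU i and
// returns the sum of all results, computed in the calling thread.
float run_pinned_and_sum(const std::vector<std::function<float()>>& jobs) {
    assert(!jobs.empty());
    std::vector<float> results(jobs.size(), 0.0f);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        threads.emplace_back([&jobs, &results, i] {
            pin_current_thread(static_cast<int>(i));  // failure is ignored
            results[i] = jobs[i]();  // each thread writes only its own slot
        });
    }
    for (auto& t : threads) t.join();
    float total = 0.0f;
    for (float r : results) total += r;
    return total;
}
```

Passing four callables that wrap object1->Process() through object4->Process() would give the total_output from the question. If the hardware resources really cannot be shared within one process, separate processes would still be needed instead.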
You don't have to program anything. The command taskset changes the CPU affinity of a currently running process or creates and sets it for a new process.
Running a single executable that spawns other programs is no different than executing the programs directly, except for the common initialization implied by the comments in your stub.
taskset 1 program_for_cpu_0 arg1 arg2 arg3...
taskset 2 program_for_cpu_1 arg1 arg2 arg3...
taskset 4 program_for_cpu_2 arg1 arg2 arg3...
taskset 8 program_for_cpu_3 arg1 arg2 arg3...
I am suspicious of setting CPU affinities. I have yet to find an actual use for doing so (other than satisfying some inner need for control).
Unless the CPUs are not identical in some way, there should be no need to restrict a process to a particular CPU.
The Linux kernel normally keeps a process on the same CPU unless it enters an extended wait for i/o, a semaphore, etc.
Switching a process from one CPU to another does not incur any particular overhead except in NUMA configurations with local memory per CPU. AFAIK, ARM implementations do not do that.
If a process should exhibit non-CPU bound behavior, allowing the kernel scheduler flexibility to reassign a process to a now-available CPU should improve system performance. Processes bound to a CPU cannot participate in resource availability.

Assigning tasks to threads in Cilk and assigning threads to NUMA nodes

For example, there are three threads.
Thread 1 is assigned tasks 1, 2, and 3.
Thread 2 is assigned tasks 4, 5, and 6.
Thread 3 is assigned tasks 7, 8, and 9.
Task sizes are not uniform. The tasks assigned to a thread have very similar working sets, so the cache will be used efficiently when all these three tasks are executed by the same thread. I should also note that the tasks will run on a NUMA system that has four nodes. Each one of the four threads must be assigned to a node of the system.
My problem is about load balancing. For example, I want Cilk scheduler to assign task 9 to thread 1 if thread 1 finishes its tasks before the others and task 9 is not started.
All solutions are welcome including Cilk Plus, OpenMP, or other schedulers freely available on the web.
Update: The threads must be assigned to nodes of the NUMA system and memory locations used by these threads must be allocated on specific nodes. I have been successfully using libnuma with OpenMP. However I was not able to find how to map threads to nodes using Cilk, TBB, etc. If it were possible to get thread id of a spawned worker in Cilk Plus, I would map it to a node using numa_run_on_node(nodeid).
For more information about scalability problems of Cilk on NUMA architectures: http://www.sciencedirect.com/science/article/pii/S0167739X03001845#
The correct way to do this in Cilk would be something like:
void task1_task2_task3()
{
cilk_spawn task1();
cilk_spawn task2();
task3();
}
void task4_task5_task6()
{
cilk_spawn task4();
cilk_spawn task5();
task6();
}
void task7_task8_task9()
{
cilk_spawn task7();
cilk_spawn task8();
task9();
}
int main()
{
cilk_spawn task1_task2_task3();
cilk_spawn task4_task5_task6();
task7_task8_task9();
cilk_sync;
finalize_stuff();
return 0;
}
Remember that cilk_spawn is a suggestion to the scheduler that the code after the cilk_spawn can be stolen, not a requirement. When a cilk_spawn is executed, it pushes a notation on tail of the worker's deque that the continuation is available for stealing. Thieves always steal from the head of the deque, so you're guaranteed that some worker will steal the continuation of main() before they steal the continuation of task1_task2_task3(). But since a worker chooses which worker it will steal from randomly, there's no guarantee that the final continuation of main() will be stolen before work from task1_task2_task3().
Barry Tannenbaum
Intel Cilk Development

Resources