OpenMP: splitting loop based on NUMA - multithreading

I am running the following loop using, say, 8 OpenMP threads:
float* data;
int n;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
DO SOMETHING WITH data[i]
}
Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3
and second half (i = n/2, ..., n-1) with threads 4,5,6,7.
Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.
How do I achieve this with OpenMP?
Thank you
PS: Ideally, if threads from one group are done with their half of the loop, and the other half of the loop is still not done, I'd like threads from finished group join unsfinished group processing the other half of the loop.
I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:
int n;
int i0 = 0;
int i1 = n / 2;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
int nt = omp_get_thread_num();
int j;
#pragma omp critical
{
if ( nt < 4 ) {
if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
else j = i1++; // of loop unless first half is finished
}
else {
if ( i1 < n ) j = i1++; // Second 4 threads process second half
else j = i0++; // of loop unless second half is finished
}
}
DO SOMETHING WITH data[j]
}

Probably best is to use nested parallelization, first over NUMA nodes, then within each node; then you can use the infrastructure for dynamic while still breaking the data up amongst thread groups:
#include <omp.h>
#include <stdio.h>
int main(int argc, char **argv) {
const int ngroups=2;
const int npergroup=4;
const int ndata = 16;
omp_set_nested(1);
#pragma omp parallel for num_threads(ngroups)
for (int i=0; i<ngroups; i++) {
int start = (ndata*i+(ngroups-1))/ngroups;
int end = (ndata*(i+1)+(ngroups-1))/ngroups;
#pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
for (int j=start; j<end; j++) {
printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
}
}
return 0;
}
Running this gives
$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15
But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.

Related

OpenMP Multithreads becoming one thread

I am programming using OpenMP to get to learn about multithreads. Is it possible for any thread, which is any thread of 11 in this case, to reach the return statement at the end while some threads may be still working on something in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
np = omp_get_num_threads();
iam = omp_get_thread_num();
#pragma omp for
for (int i = 2; i < 100; i++) {
std::cout << i;
doStuff(i);
}
}
} // line 13
// synchronize necessary?
return 0;
There is an implicit barrier ar the end of the parallel construct, so no synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.

Why this simple program on shared variable does not scale? (no lock)

I'm new to concurrent programming. I implement a CPU intensive work and measure how much speedup I could gain. However, I cannot get any speedup as I increase #threads.
The program does the following task:
There's a shared counter to count from 1 to 1000001.
Each thread does the following until the counter reaches 1000001:
increments the counter atomically, then
run a loop for 10000 times.
There're 1000001*10000 = 10^10 operations in total to be perform, so I should be able to get good speedup as I increment #threads.
Here's how I implemented it:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>
pthread_t workers[8];
atomic_int counter; // a shared counter
void *runner(void *param);
int main(int argc, char *argv[]) {
if(argc != 2) {
printf("Usage: ./thread thread_num\n");
return 1;
}
int NUM_THREADS = atoi(argv[1]);
pthread_attr_t attr;
counter = 1; // initialize shared counter
pthread_attr_init(&attr);
const clock_t begin_time = clock(); // begin timer
for(int i=0;i<NUM_THREADS;i++)
pthread_create(&workers[i], &attr, runner, NULL);
for(int i=0;i<NUM_THREADS;i++)
pthread_join(workers[i], NULL);
const clock_t end_time = clock(); // end timer
printf("Thread number = %d, execution time = %lf s\n", NUM_THREADS, (double)(end_time - begin_time)/CLOCKS_PER_SEC);
return 0;
}
void *runner(void *param) {
int temp = 0;
while(temp < 1000001) {
temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
for(int i=1;i<10000;i++)
temp%i; // do some CPU intensive work
}
pthread_exit(0);
}
However, as I run my program, I cannot get better performance than sequential execution!!
gcc-4.9 -std=c11 -pthread -o my_program my_program.c
for i in 1 2 3 4 5 6 7 8; do \
./my_program $i; \
done
Thread number = 1, execution time = 19.235998 s
Thread number = 2, execution time = 20.575237 s
Thread number = 3, execution time = 25.161116 s
Thread number = 4, execution time = 28.278671 s
Thread number = 5, execution time = 28.185605 s
Thread number = 6, execution time = 28.050380 s
Thread number = 7, execution time = 28.286925 s
Thread number = 8, execution time = 28.227132 s
I run the program on a 4-core machine.
Does anyone have suggestions to improve the program? Or any clue why I cannot get speedup?
The only work here that can be done in parallel is the loop:
for(int i=0;i<10000;i++)
temp%i; // do some CPU intensive work
gcc, even with the minimal optimisation level, will not emit any code for the temp%i; void expression (disassemble it and see), so this essentially becomes an empty loop, which will execute very fast - the execution time in the case with multiple threads running on different cores will be dominated by the cacheline containing your atomic variable ping-ponging between the different cores.
You need to make this loop actually do a significant amount of work before you'll see a speed-up.

Multithreading in Windows Forms

I wrote app, Caesar Cipher in Windows Forms CLI with dynamic linking libraries(in C++ and in ASM) with my alghorithms for model(eciphering and deciphering). That part of my app is working.
Here is also a multithreading from Windows Forms. User can chose number of threads(1-64). If he chose 2, message to encipher(decipher) will be divided on two substrings which will be divided on two threads. And I want to execute these threads paraller, and finally reduce cost of execution time.
When user push encipher or decipher button there will be displayed enciphered or deciphered text and time costs for execution functions in C++ and ASM. Actualy everything is alright, but times for greater threads than 1 aren't smaller, they are bigger.
There is some code:
/*Function which concats string for substrings to threads*/
array<String^>^ ThreadEncipherFuncCpp(int nThreads, string str2){
//Tablica wątków
array<String^>^ arrayOfThreads = gcnew array <String^>(nThreads);
//Przechowuje n-tą część wiadomosci do przetworzenia
string loopSubstring;
//Długość podstringa w wiadomości
int numberOfSubstring = str2.length() / nThreads;
int isModulo = str2.length() % nThreads;
array<Thread^>^ xThread = gcnew array < Thread^ >(nThreads);
for (int i = 0; i < nThreads; i++)
{
if (i == 0 && numberOfSubstring != 0)
loopSubstring = str2.substr(0, numberOfSubstring);
else if ((i == nThreads - 1) && numberOfSubstring != 0){
if (isModulo != 0)
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring + isModulo);
else
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring);
}
else if (numberOfSubstring == 0){
loopSubstring = str2.substr(0, isModulo);
i = nThreads - 1;
}
else
loopSubstring = str2.substr(numberOfSubstring*i, numberOfSubstring);
ThreadExample::inputString = gcnew String(loopSubstring.c_str());
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
arrayOfThreads[i] = ThreadExample::outputString;
}
return arrayOfThreads;
}}
Here is a fragment which is responsible for the calculation of the time for C++:
/*****************C++***************/
auto start = chrono::high_resolution_clock::now();
array<String^>^ arrayOfThreads = ThreadEncipherFuncCpp(nThreads, str2);
auto elapsed = chrono::high_resolution_clock::now() - start;
long long milliseconds = chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
double micro = milliseconds;
this->label4->Text = Convert::ToString(micro + " microseconds");
String^ str3;
String^ str4;
str4 = str3->Concat(arrayOfThreads);
this->textBox2->Text = str4;
/**********************************/
And example of working:
For input data: "Some example text. Some example text2."
Program will display: "Vrph hadpsoh whaw. Vrph hadpsoh whaw2."
Times of execution for 1 thread:
C++ time: 31231us.
Asm time: 31212us.
Times of execution for 2 threads:
C++ time: 62488us.
Asm time: 62505us.
Times of execution for 4 threads:
C++ time: 140254us.
Asm time: 124587us.
Times of execution for 32 threads:
C++ time: 1002548us.
Asm time: 1000020us.
How to solve this problem?
I need this structure of program, this is academic project.
My CPU has 4 cores.
The reason it's not going any faster is because you aren't letting your threads run in parallel.
xThread[i] = gcnew Thread(gcnew ThreadStart(&ThreadExample::ThreadEncipher));
xThread[i]->Start();
xThread[i]->Join();
These three lines create the thread, start it running, and then wait for it to finish. You're not getting any parallelism here, you're just adding the overhead of spawning & waiting for threads.
If you want to have a speedup from multithreading, the way to do it is to start all the threads at once, let them all run, and then collect up the results.
In this case, I'd make it so that ThreadEncipher (which you haven't shown us the source of, so I'm making assumptions) takes a parameter, which is used as an array index. Instead of having ThreadEncipher read from inputString and write to outputString, have it read from & write to one index of an array. That way, each thread can read & write at the same time. After you've spawned all these threads, then you can wait for all of them to finish, and you can either process the output array, or since array<String^>^ is already your return type, just return it as-is.
Other thoughts:
You've got a mix of unmanaged and managed objects here. It will be better if you pick one and stick with it. Since you're in C++/CLI, I'd recommend that you stick with the managed objects. I'd stop using std::string, and use System::String^ exclusively.
Since your CPU has 4 cores, you're not going to get any speedup by using more than 4 threads. Don't be surprised when 32 threads takes longer than 4, because you're doing 8x the string manipulation, and you've got 32 threads fighting over 4 processor cores.
Your string splitting code is more complex than it needs to be. You've got five different cases in there, I'd have to sit down and think about it for a while to be sure it's correct. Try this:
int totalLen = str2->length;
for (int i = 0; i < nThreads; i++)
{
int startIndex = totalLen * i / nThreads;
int endIndex = totalLen * (i+1) / nThreads;
int substrLen = endIndex - startIndex;
String^ substr = str2->SubString(startIndex, substrLen);
...
}

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
At first you need a loop, where the distribution makes a difference. The loop has 100 iterations, so the OpenMP schedule will only 100 times decide what is the next iteration for a thread what takes no mensurable time. The output with printf takes very long so in your code it makes no difference which schedule is used. Its better to make a loop without console output and a very high loop count like
#pragma omp parallel
{
#pragma omp for schedule(static) private(value)
for (int i = 0; i < 100000000; i++) {
value = ...
}
}
At last you have to write code in the loop which "result" is used after the loop with a printf for example. If not the body could be deleted by the compiler because of optimize the code (it is not used later so its not needed). You can concentrate the time measurings on the parallel pool without the output of the results.
If your iterations nearly takes the same time, then a static distribution should be faster. If they differ very much the dynamic and guided schedules should dominate your measurings.

Why does calculation with OpenMP take 100x more time than with a single thread?

I am trying to test Pi calculation problem with OpenMP. I have this code:
#pragma omp parallel private(i, x, y, myid) shared(n) reduction(+:numIn) num_threads(NUM_THREADS)
{
printf("Thread ID is: %d\n", omp_get_thread_num());
myid = omp_get_thread_num();
printf("Thread myid is: %d\n", myid);
for(i = myid*(n/NUM_THREADS); i < (myid+1)*(n/NUM_THREADS); i++) {
//for(i = 0; i < n; i++) {
x = (double)rand()/RAND_MAX;
y = (double)rand()/RAND_MAX;
if (x*x + y*y <= 1) numIn++;
}
printf("Thread ID is: %d\n", omp_get_thread_num());
}
return 4. * numIn / n;
}
When I compile with gcc -fopenmp pi.c -o hello_pi and run it time ./hello_pi for n = 1000000000 I get
real 8m51.595s
user 4m14.004s
sys 60m59.533s
When I run it on with a single thread I get
real 0m20.943s
user 0m20.881s
sys 0m0.000s
Am I missing something? It should be faster with 8 threads. I have 8-core CPU.
Please take a look at the
http://people.sc.fsu.edu/~jburkardt/c_src/openmp/compute_pi.c
This might be a good implementation for pi computing.
It is quite important to know that how your data spread to different threads and how the openmp collect them back. Usually, a bad design (which has data dependencies across threads) running on multiple thread will result in a slower execution than a single thread .
rand() in stdlib.h is not thread-safe. Using it in multi-thread environment causes a race condition on its hidden state variables, thus lead to poor performance.
http://man7.org/linux/man-pages/man3/rand.3.html
In fact the following code work well as an OpenMP demo.
$ gc -fopenmp -o pi pi.c -O3; time ./pi
pi: 3.141672
real 0m4.957s
user 0m39.417s
sys 0m0.005s
code:
#include <stdio.h>
#include <omp.h>
int main()
{
const int n=50000;
const int NUM_THREADS=8;
int numIn=0;
#pragma omp parallel for reduction(+:numIn) num_threads(NUM_THREADS)
for(int i = 0; i < n; i++) {
double x = (double)i/n;
for(int j=0;j<n; j++) {
double y = (double)j/n;
if (x*x + y*y <= 1) numIn++;
}
}
printf("pi: %f\n",4.*numIn/n/n);
return 0;
}
In general I would not compare times without optimization on. Compile with something like
gcc -O3 -Wall -pedantic -fopenmp main.c
The rand() function is not thread safe in Linux (but it's fine with MSVC and I guess mingw32 which uses the same C run-time libraries, MSVCRT, as MSVC). You can use rand_r with a different seed for each thread. See openmp-program-is-slower-than-sequential-one.
In general try to avoid defining the chunk sizes when you parallelize a loop. Just use #pragma omp for schedule(shared). You also don't need to specify that the loop variable in a parallelized loop is private (the variable i in your code).
Try the following code
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
int i, numIn, n;
unsigned int seed;
double x, y, pi;
n = 1000000;
numIn = 0;
#pragma omp parallel private(seed, x, y) reduction(+:numIn)
{
seed = 25234 + 17 * omp_get_thread_num();
#pragma omp for
for (i = 0; i <= n; i++) {
x = (double)rand_r(&seed) / RAND_MAX;
y = (double)rand_r(&seed) / RAND_MAX;
if (x*x + y*y <= 1) numIn++;
}
}
pi = 4.*numIn / n;
printf("asdf pi %f\n", pi);
return 0;
}
You can find a working example of this code here http://coliru.stacked-crooked.com/a/9adf1e856fc2b60d

Resources