I have a question regarding the OpenMP task pragma. Suppose we have the following code:
#pragma omp parallel
{
    x = omp_get_thread_num();
    #pragma omp task firstprivate(x)
    //do something with x
}
As far as I understand tasking, it is not guaranteed which thread executes the task.
So my question is: is "x" in the task the ID of the thread that generated the task, or of the one that executes it? E.g. if thread 0 comes across the task and thread 3 executes it, x should be 0 then, right?
Is "x" in the task now the ID of the thread that generated the task or the one that executes it?
It depends. If the data-sharing attribute of 'x' in the parallel region is shared (which it typically is by default for a variable declared outside the construct), then 'x' can be equal to any thread ID ranging from 0 to the total number of threads in the team minus 1. This is because there is a race condition on the update of the variable 'x'. This can be showcased with the following code:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int main(){
    int x;
    #pragma omp parallel
    {
        x = omp_get_thread_num();
        if(omp_get_thread_num() == 1){
            sleep(5);
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
So the thread with ID=1 creates the task; however, 'x' can have a value different from '1' and also different from the ID of the thread currently executing the task. This is because while the thread with ID=1 is waiting in sleep(5);, the remaining threads in the team can update the value of 'x'.
Typically, the canonical form in such use cases is to wrap the task creation in a single construct, as follows:
#include <omp.h>
#include <stdio.h>

int main(){
    int x;
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("I am the task creator '%d'\n", omp_get_thread_num());
            x = omp_get_thread_num();
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
And in this case, as Michael Klemm mentioned in the comments:
..., x will contain the ID of the thread that created the task. So, yes, if thread 0 created the task, x will be zero even though thread 3 is picked to execute the task.
This also applies when variable 'x' is private at the time the statement x = omp_get_thread_num(); executes.
Therefore, if you run the code above, you should always get I am the task creator with the same value as Value of x =, but ID Thread executing may differ. For example:
I am the task creator '4'
Value of x = 4 | ID Thread executing = 7
This is in accordance with the behaviour specified in the OpenMP standard, namely:
The task construct is a task generating construct. When a thread encounters a task construct, an explicit task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct, per-data environment ICVs, and any defaults that apply.
The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.
Related
I am programming using OpenMP to learn about multithreading. Is it possible for any thread (any of the 11 in this case) to reach the return statement at the end while some threads are still working on something in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    #pragma omp for
    for (int i = 2; i < 100; i++) {
        std::cout << i;
        doStuff(i);
    }
} // line 13
// synchronize necessary?
return 0;
There is an implicit barrier at the end of the parallel construct, so no synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.
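A minimal sketch of that behaviour (the output strings are illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(11)
    {
        printf("Thread %d working\n", omp_get_thread_num());
    } // implicit barrier: no thread passes this point until all are done

    // Reached once, by the master thread only, after the whole team
    // has finished the parallel region.
    printf("After the parallel region\n");
    return 0;
}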
I've got the following code:
int working_threads=1;
#pragma omp parallel
{
    int my_num=omp_get_thread_num()+1;
    int idle=false;
    while(working_threads>0) {
        if(my_num==1)
            working_threads=0;
        #pragma omp barrier
    }
}
If I run it, every now and then it hangs on the barrier. The more threads there are, the more likely this is to happen. I've tried to debug it with printf, and it seems that sometimes not all threads execute the loop body, so the barrier waits for them forever. This happens in the first iteration; the second one is obviously never run.
Is it an invalid piece of code? If so, how can I change it? I need to run a while loop in parallel; the number of iterations is not known in advance, but it is guaranteed to be the same for all threads.
Despite your attempt to synchronize with the barrier, you do have a race condition on working_threads that can easily lead to an unequal number of iterations:
thread 0                             | thread 1
...                                  | ...
while (working_threads > 0) [==true] | ...
if (my_num == 1) [==true]            | ...
working_threads = 0                  | ...
                                     | while (working_threads > 0) [==false]
[hangs waiting for barrier]          | [hangs trying to exit from parallel]
To fix your specific code, you would have to also add a barrier between the while-condition-check and working_threads = 0.
#pragma omp parallel
{
    int my_num=omp_get_thread_num()+1;
    int idle=false;
    while(working_threads>0) {
        #pragma omp barrier
        if(my_num==1)
            working_threads=0;
        #pragma omp barrier
    }
}
Note that the code is not exactly the most idiomatic or elegant solution. Depending on your specific work-sharing problem, there may be a better approach. Also, you must ensure that working_threads is written only by a single thread, or use atomics when writing.
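As a sketch of the atomic variant (restructured with an explicit flag read so that the check and the barriers stay aligned across threads; whether you need this depends on your actual problem):

#include <omp.h>

int working_threads = 1;

int main() {
    #pragma omp parallel
    {
        int my_num = omp_get_thread_num() + 1;
        for (;;) {
            int w;
            // All threads read the flag between the same pair of barriers,
            // so they all see the same value and exit together.
            #pragma omp atomic read
            w = working_threads;
            if (w <= 0)
                break;
            #pragma omp barrier
            if (my_num == 1) {
                // Single writer; the atomic write prevents a torn store.
                #pragma omp atomic write
                working_threads = 0;
            }
            #pragma omp barrier
        }
    }
    return 0;
}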
I have to calculate the sum of the elements of a two-dimensional matrix, using a separate thread to calculate the sum of each row. Then the main thread adds up these sums and prints the final result.
Can you guys see what's wrong?
(I'm all new to the threads stuff)
#include <pthread.h>
#include <stdio.h>

void sumR(void* _a,int m,int n,int sum)
{
    int i;
    int (*a)[m]=_a;
    for(i=1;i<=n;i++)
        sum+=a[n][i];
}

int main()
{
    int a[20][20],sum1,sum;
    int m=3,n=3,k=3,i,j;
    for(i=1;i<=m;i++)
    {
        k=k+3;
        for(j=1;j<=n;j++)
            a[i][j]=k;
    }
    sum1=0;
    for(i=1;i<=m;i++)
    {
        sum=0;
        pthread_t th;
        pthread_create(&th,NULL,&sumR,&a,&m,&n,&sum);
        sum1+=sum;
        pthread_join(&th,NULL);
    }
    printf("Sum of the matrix is: %d",sum1);
    return 0;
}
One problem I see is that your loop does essentially this:
for each row
start thread
add thread's sum to total
wait for thread to exit
That's not going to work because you're adding the thread's sum before the thread is done calculating it. You need to wait for the thread to finish:
start thread
wait for thread to exit
add thread's sum to total
However, that model doesn't take advantage of multiple threads. You only have one thread running at a time.
What you need to do is create all of the threads and store them in an array. Then wait for each thread to exit and add its sum to the total. Something like:
for i = 0 to num_threads-1
threads[i] = pthread_create(&threads[i], NULL, &sums[i], ...)
And then
for i = 0 to num_threads-1
pthread_join(&threads[i], ...);
sum += sums[i];
That way, all of your threads are running at the same time, and you harvest the result only when the thread is done.
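A minimal runnable sketch of that model (the matrix contents and the job struct are illustrative, not the asker's exact code):

#include <pthread.h>
#include <stdio.h>

#define N 3

struct job {
    int *row;  /* this thread's row */
    int n;     /* elements in the row */
    int sum;   /* result written by the thread */
};

static void *sum_row(void *arg) {
    struct job *j = arg;
    j->sum = 0;
    for (int i = 0; i < j->n; i++)
        j->sum += j->row[i];
    return NULL;
}

int main(void) {
    int a[N][N] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    pthread_t threads[N];
    struct job jobs[N];
    int total = 0;

    /* Start all threads first, so they run concurrently... */
    for (int i = 0; i < N; i++) {
        jobs[i].row = a[i];
        jobs[i].n = N;
        pthread_create(&threads[i], NULL, sum_row, &jobs[i]);
    }
    /* ...then harvest each sum only after its thread has finished. */
    for (int i = 0; i < N; i++) {
        pthread_join(threads[i], NULL);
        total += jobs[i].sum;
    }
    printf("Sum of the matrix is: %d\n", total);
    return 0;
}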
Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
This is what I've done so far; I have no idea if I'm on the right track:
#include <omp.h>
#include <stdio.h>

int main() {
    double start_time, run_time;
    omp_set_num_threads(4);
    start_time = omp_get_wtime();
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (int n = 0; n < 100; n++){
        printf("square of %d=%d\n", n, n*n);
        printf("cube of %d=%d\n", n, n*n*n);
        int ID = omp_get_thread_num();
        printf("Thread(%d) \n", ID);
    }
    run_time = omp_get_wtime() - start_time;
    printf("Time Elapsed (%f)", run_time);
    getchar();
}
First you need a loop where the distribution makes a difference. Your loop has only 100 iterations, so the OpenMP scheduler decides only 100 times which iteration a thread gets next, which takes no measurable time. The printf output, on the other hand, takes very long, so in your code it makes no difference which schedule is used. It is better to use a loop without console output and with a very high iteration count, like:
double value;
#pragma omp parallel
{
    #pragma omp for schedule(static) private(value)
    for (int i = 0; i < 100000000; i++) {
        value = ...
    }
}
Finally, the loop body has to produce a "result" that is used after the loop, for example with a printf. Otherwise the compiler may delete the body while optimizing the code (it is not used later, so it is not needed). That way you can concentrate the time measurement on the parallel region, without the output of the results.
If your iterations take nearly the same time, a static distribution should be fastest. If they differ greatly, the dynamic and guided schedules should dominate your measurements.
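One way to set up such a measurement, as a sketch (the work() function is a placeholder delay whose cost varies per iteration; vary the loop size and the delay as the assignment asks):

#include <omp.h>
#include <stdio.h>

/* Placeholder delay: cost depends on i, so iterations are unequal. */
static double work(int i) {
    double v = 0.0;
    for (int k = 0; k < i % 1000; k++)
        v += k * 0.5;
    return v;
}

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    omp_set_num_threads(4);

    double start = omp_get_wtime();
    /* Swap schedule(static) for schedule(dynamic) or schedule(guided)
       and compare the elapsed times. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += work(i);
    double t = omp_get_wtime() - start;

    /* Print the result so the compiler cannot delete the loop body. */
    printf("static: %f s (sum = %f)\n", t, sum);
    return 0;
}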
I'm optimizing some instrumentation for my project (Linux, ICC, pthreads), and would like some feedback on this technique to assign a unique index to a thread, so I can use it to index into an array of per-thread data.
The old technique uses a std::map based on pthread id, but I'd like to avoid locks and a map lookup if possible (it is creating a significant amount of overhead).
Here is my new technique:
static PerThreadInfo info[MAX_THREADS]; // shared, each index is per thread
// Allow each thread a unique sequential index, used for indexing into per
// thread data.
1:static size_t GetThreadIndex()
2:{
3: static size_t threadCount = 0;
4: __thread static size_t myThreadIndex = threadCount++;
5: return myThreadIndex;
6:}
later in the code:
// add some info per thread, so it can be aggregated globally
info[ GetThreadIndex() ] = MyNewInfo();
So:
1) It looks like line 4 could be a race condition if two threads were created at exactly the same time. If so, how can I avoid this (preferably without locks)? I can't see how an atomic increment would help here.
2) Is there a better way to create a per-thread index? Maybe by pre-generating the TLS index on thread creation somehow?
1) An atomic increment would help here, actually: the possible race is two threads reading and assigning the same ID to themselves, so making sure the increment (read number, add 1, store number) happens atomically fixes that race condition. On x86 a lock xadd does the trick (you need the old value back, which a plain lock; inc does not give you), or whatever your platform offers (like InterlockedIncrement() on Windows, for example).
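A sketch of that fix, assuming GCC's __sync_fetch_and_add() builtin (C11's atomic_fetch_add() would be the portable equivalent); a sentinel is used because __thread variables must be constant-initialized:

#include <stddef.h>

static size_t threadCount = 0;

static size_t GetThreadIndex()
{
    // __thread storage must be constant-initialized, so start from a
    // sentinel and claim an index on first use.
    static __thread size_t myThreadIndex = (size_t)-1;
    if (myThreadIndex == (size_t)-1)
        // Atomic read-add-store: no two threads can obtain the same value.
        myThreadIndex = __sync_fetch_and_add(&threadCount, 1);
    return myThreadIndex;
}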
2) Well, you could actually make the whole info thread-local ("__thread static PerThreadInfo info;"), provided your only aim is to be able to access the data per-thread easily and under a common name. If you actually want it to be a globally accessible array, then saving the index as you do using TLS is a very straightforward and efficient way to do this. You could also pre-compute the indexes and pass them along as arguments at thread creation, as Kromey noted in his post.
Why so averse to using locks? Solving race conditions is exactly what they're designed for...
At any rate, you can use the 4th argument of pthread_create() to pass an argument to your threads' start routine; this way, your master process can generate an incrementing counter as it launches the threads and pass this counter into each thread as it is created, giving you your unique index for each thread.
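A minimal sketch of that pattern (the names are illustrative); the counter is passed by value, cast through void*, so the creator and the thread share no storage:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg) {
    // The creator passed this thread's unique index as the start-routine argument.
    size_t index = (size_t)arg;
    printf("I am thread %zu\n", index);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (size_t i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (size_t i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}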
I know you tagged this [pthreads], but you also mentioned the "old technique" of using std::map. This leads me to believe that you're programming in C++. In C++11 you have std::thread, and you can pass out unique indexes (id's) to your threads at thread creation time through an ordinary function parameter.
Below is an example HelloWorld that creates N threads, assigning each an index of 0 through N-1. Each thread does nothing but say "hi" and give its index:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>

inline void sub_print() {}

template <class A0, class ...Args>
void
sub_print(const A0& a0, const Args& ...args)
{
    std::cout << a0;
    sub_print(args...);
}

std::mutex&
cout_mut()
{
    static std::mutex m;
    return m;
}

template <class ...Args>
void
print(const Args& ...args)
{
    std::lock_guard<std::mutex> _(cout_mut());
    sub_print(args...);
}

void f(int id)
{
    print("This is thread ", id, "\n");
}

int main()
{
    const int N = 10;
    std::vector<std::thread> threads;
    for (int i = 0; i < N; ++i)
        threads.push_back(std::thread(f, i));
    for (auto i = threads.begin(), e = threads.end(); i != e; ++i)
        i->join();
}
My output:
This is thread 0
This is thread 1
This is thread 4
This is thread 3
This is thread 5
This is thread 7
This is thread 6
This is thread 2
This is thread 9
This is thread 8