Does the atomic construct of OpenMP support data-dependent operations?
Like, I am working on an example where I have a local_sum variable and a global_sum variable.
Inside the parallel region, each thread calculates its own local_sum. Inside this parallel region, I then use
#pragma omp atomic update
global_sum += local_sum;
To update the global_sum in serial. When I print global_sum, I get "nan".
Why?
Note: I know I can do this with a critical construct
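For reference, here is a minimal sketch of the pattern described above. It assumes (the question does not say) that both sums are doubles and that both are explicitly initialized to zero; an uninitialized accumulator is one common way to end up with nan. The function name sum_array is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: sums n doubles using per-thread partial sums. */
double sum_array(const double *a, size_t n)
{
    double global_sum = 0.0;        /* shared, initialized explicitly */

    #pragma omp parallel
    {
        double local_sum = 0.0;     /* private to each thread, also initialized */

        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            local_sum += a[i];

        /* One atomic update per thread, not one per element. */
        #pragma omp atomic update
        global_sum += local_sum;
    }
    return global_sum;
}
```

The atomic update itself is a plain `x += expr` form, which is one of the statement shapes the atomic construct accepts.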
I have several questions about atomic operations and multithreading.
There is a function for which a race condition occurs (julia lang):
function counter(n)
counter = 0
for i in 1:n
counter += i
end
return counter
end
If atomic operations are used to change the global variable "counter", would that help get rid of the race condition?
Does the cache coherence protocol have any real effect on performance? Virtual machines like the JVM can use their own architectures to support parallel computing.
Do atomic arithmetic and similar operations require more or less resources than ordinary arithmetic?
It's difficult for me now. Hope for your help.
I don't quite understand your example: the variable counter seems to be local, so there are no race conditions in your example.
Anyway, yes, atomic operations will ensure that race conditions do not occur. There are 2 or 3 ways to do that.
1. Your counter can be an Atomic{Int}:
using .Threads
const counter = Atomic{Int}(0)
...
function updatecounter(i)
atomic_add!(counter, i)
end
This is described in the manual: https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations
2. You can use a field in a struct declared as @atomic:
mutable struct Counter
    @atomic c::Int
end
const counter = Counter(0)
...
function updatecounter(i)
    @atomic counter.c += i
end
This is described here: https://docs.julialang.org/en/v1/base/multi-threading/#Atomic-operations
It seems the details of the semantics haven't been written yet, but it's the same as in C++.
3. You can use a lock:
counter = 0
countlock = ReentrantLock()
...
function updatecounter(i)
    @lock countlock global counter += i
end
1. and 2. are more or less the same. The lock approach is slower, but can be used if several operations must be done serially. No matter how you do it, there will be a performance degradation relative to non-atomic arithmetic. The atomic primitives in 1. and 2. must do a memory fence to ensure the correct ordering, so cache coherence will matter, depending on the hardware.
I am going through Using OpenMP. The authors compare and contrast the following two constructs:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
}
as against
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
}
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
}
}
They state that Construct 2
has fewer implied barriers, and there might be potential for cache
data reuse between loops. The downside of this approach is that one
can no longer adjust the number of threads on a per loop basis, but
that is often not a real limitation.
I am having a difficult time understanding how Construct 2 has fewer implied barriers. Is there not an implied barrier in Construct 2 after each for loop due to #pragma omp for? So, in each case, isn't the number of implied barriers the same, N? That is, is it not the case in Construct 2 that the first loop occurs first, and so on, and then the Nth for loop is executed last?
Also, how is Construct 2 more favorable for cache reuse between loops?
I am having a difficult time understanding how Construct 2 has fewer
implied barriers. Is there not an implied barrier in Construct 2 after
each for loop due to #pragma omp for? So, in each case, isn't the
number of implied barriers the same, N? That is, is it not the case in
Construct 2 that the first loop occurs first, and so on, and then the
Nth for loop is executed last?
I did not read the book but based on what you have shown it is actually the other way around, namely:
//Construct 1
#pragma omp parallel for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp parallel for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier.
has N implicit barriers (at the end of each parallel region), whereas the second code:
//Construct 2
#pragma omp parallel
{
#pragma omp for
for( ... )
{
/* Work sharing loop 1 */
} // <-- implicit barrier
...
#pragma omp for
for( ... )
{
/* Work sharing loop N */
} // <-- implicit barrier
} // <-- implicit barrier
has N+1 barriers (at the end of each for + the parallel region).
Actually, in this case, since there is no computation between the last two implicit barriers, one can add the nowait to the last #pragma omp for to eliminate one of the redundant barriers.
One way for the second code to have fewer implicit barriers than the first would be to add a nowait clause to the #pragma omp for directives.
From the link about the book that you have shown:
Finally, Using OpenMP considers trends likely to influence OpenMP
development, offering a glimpse of the possibilities of a future
OpenMP 3.0 from the vantage point of the current OpenMP 2.5. With
multicore computer use increasing, the need for a comprehensive
introduction and overview of the standard interface is clear.
So the book is using the old OpenMP 2.5 standard, and in that standard's description of the loop construct one can read:
There is an implicit barrier at the end of a loop construct
unless a nowait clause is specified.
A nowait cannot be added to the parallel construct, but it can be added to the for construct. Therefore, the second code has the potential to have fewer implicit barriers if one can add the nowait clause to the #pragma omp for directives. However, as it is, the second code actually has more implicit barriers than the first code.
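As a rough sketch of that idea, assuming the two loops touch unrelated arrays (so dropping the barrier between them is safe), nowait can be applied like this. The function name scale_both and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical example: two independent loops inside one parallel region. */
void scale_both(double *x, double *y, size_t n, double factor)
{
    #pragma omp parallel
    {
        /* nowait removes the barrier after this loop; this is safe only
           because the second loop never reads or writes x. */
        #pragma omp for nowait
        for (long i = 0; i < (long)n; i++)
            x[i] *= factor;

        /* nowait here leaves just the barrier that ends the parallel region. */
        #pragma omp for nowait
        for (long i = 0; i < (long)n; i++)
            y[i] *= factor;
    }   /* implicit barrier at the end of the parallel region */
}
```

With both nowait clauses the region has a single implicit barrier, compared with one per loop when each loop is its own parallel for.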
Also, how is Construct 2 more favorable for cache reuse between loops?
If you are using a static distribution of the loop iterations among threads (e.g., #pragma omp for schedule(static, ...)) in the second code, the same threads will be working with the same loop iterations. For instance, take two threads, Thread A and Thread B. If we assume a static distribution with chunk=1, Thread A and Thread B will work with the odd and even iterations of each loop, respectively. Consequently, depending on the actual application code, this might mean that those threads will work with the same memory positions of a given data structure (e.g., the same array positions).
In the first code, in theory (however this will depend on the specific OpenMP implementation), since there are two different parallel regions, different threads can pick up the same loop iterations across the two loops. In other words, in our example with the two threads, there are no guarantees that the same thread that computed the even (or the odd) numbers in one loop would compute those same numbers in the other loops.
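One way to observe this is to record which thread ran each iteration. The sketch below assumes both loops have the same iteration count and use schedule(static) inside a single parallel region; the helper names are hypothetical, and the code falls back to a single thread when compiled without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the calling thread's id, or 0 when compiled without OpenMP. */
static int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Hypothetical check: records which thread ran iteration i in each loop
   and returns 1 if the owners match for every i, 0 otherwise. */
int same_owner_in_both_loops(int n, int *owner1, int *owner2)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            owner1[i] = current_thread();

        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            owner2[i] = current_thread();
    }

    for (int i = 0; i < n; i++)
        if (owner1[i] != owner2[i])
            return 0;
    return 1;
}
```

If the two loops were placed in separate parallel regions instead, the same check might still pass in practice, but nothing requires it to.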
We are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
if (omp_get_thread_num() == 0){
cblas_dgemm(...);
}else {
cblas_dgemm(...);
}
}
Here is the issue:
At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. Now, we expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm functions, we expect new threads to be spawned.
To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8
However, it doesn't seem to be working. If we measure the runtime for each of the parallel calls, the runtime values are equal, but they are equal to the runtime of a single cblas_dgemm call when nested parallelism is not used and the environment variable OPENBLAS_NUM_THREADS is set to 1.
What is going wrong, and how can we get the desired behavior?
Is there any way we could know the number of threads inside the cblas_dgemm function?
Thank you very much for your time and help
The mechanism you are trying to use is called "nesting", that is, creating a new parallel region while an outer parallel region is already active. While most implementations support nesting, it is disabled by default. Try setting OMP_NESTED=true in the environment or calling omp_set_nested(true) before the first OpenMP directive in your code.
I would also change the above code to read like this:
#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            cblas_dgemm(...);
        }
        #pragma omp section
        {
            cblas_dgemm(...);
        }
    }
}
That way, the code will also compute the correct thing with only one thread, serializing the two calls to dgemm. In your example with only one thread, the code would run but miss the second dgemm call.
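On the question of how many threads are active at the inner level: a small probe like the following can report the inner team size that the OpenMP runtime actually grants. It is only a sketch. It uses omp_set_max_active_levels(2) (available since OpenMP 3.0) rather than omp_set_nested, the function name probe_nested_team_sizes is hypothetical, and it says nothing about what a particular BLAS build does internally, since that depends on how the library was compiled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical probe: stores the inner team size seen by each of the two
   outer threads in inner_sizes[0] and inner_sizes[1]. */
void probe_nested_team_sizes(int inner_threads, int inner_sizes[2])
{
    /* Default of 1 covers builds without OpenMP and outer teams of one thread. */
    inner_sizes[0] = inner_sizes[1] = 1;
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* allow two nested active levels */

    #pragma omp parallel num_threads(2)
    {
        int outer_id = omp_get_thread_num();

        #pragma omp parallel num_threads(inner_threads)
        {
            /* One thread of each inner team records the team size. */
            #pragma omp single
            inner_sizes[outer_id] = omp_get_num_threads();
        }
    }
#else
    (void)inner_threads;            /* unused without OpenMP */
#endif
}
```

If both entries come back as 1 even though more threads were requested, nesting is most likely still disabled or limited by the runtime.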
From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team
of N threads (where N is determined at runtime, usually from the
number of CPU cores, but may be affected by a few things), all of
which execute the next statement (or the next block, if the statement
is a {…} -enclosure). After the statement, the threads join back into
one.
#pragma omp parallel
{
// Code inside this region runs in parallel.
printf("Hello!\n");
}
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
By using omp_get_thread_num() you can retrieve the thread ID which enables you to parametrize the so called "same code" with respect to that thread ID.
Take this example:
A is a 1000-dimensional integer array and you need to sum its values using 2 OpenMP threads.
You would design your code something like this:
int A_dim = 1000;
long sum[2] = {0, 0};
// Request exactly two threads so the start/end split below covers the whole array.
#pragma omp parallel num_threads(2)
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);
    int end = (threadID + 1) * (A_dim / 2);
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound which your thread will start summing from (example: thread #0 will start summing from 0, while thread #1 will start summing from 500).
end works the same way as start, but it is the upper bound: the array index up to which the thread sums (example: thread #0 stops at 500, summing values from A[0] to A[499], while thread #1 stops at 1000, summing values from A[500] to A[999]).
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates these threads. Afterwards you can distribute/share work with other constructs like omp for or omp task.
A possible benefit of this distinction is e.g. when you have to allocate memory for each thread (i.e. thread-local data).
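As an illustration of that split between creating threads and sharing work, the sketch below creates the team with parallel and then distributes a summation loop with omp for. The reduction clause is my choice for combining the per-thread results and is not taken from the answers above; the function name sum_with_reduction is hypothetical.

```c
/* Hypothetical helper: sums an int array with a work-sharing loop. */
long sum_with_reduction(const int *A, int n)
{
    long sum = 0;

    #pragma omp parallel            /* creates the team of threads */
    {
        /* Distributes the iterations; each thread accumulates into a
           private copy of sum, and the copies are combined at the end. */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += A[i];
    }
    return sum;
}
```

Any per-thread setup, such as allocating scratch memory, would go inside the parallel block before the omp for.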
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
One example: in physics, you may have a random process (collisions, an initial Maxwellian distribution, etc.) in your code, and you need to run the code many times to get averaged results. In this case you need to run the same code several times.
I want to replace:
omp_set_lock(&bestTimeSeenSoFar_lock);
temp_bestTimeSeenSoFar = bestTimeSeenSoFar; // this is a read
omp_unset_lock(&bestTimeSeenSoFar_lock);
...
omp_set_lock(&bestTimeSeenSoFar_lock);
// update/write bestTimeSeenSoFar
omp_unset_lock(&bestTimeSeenSoFar_lock);
with code that will allow multiple threads to be reading the variable at once UNLESS a thread is trying to write, in which case they wait until the write is done. Help?
What about using something like this?
#pragma omp flush(bestTimeSeenSoFar)
#pragma omp atomic read
temp_bestTimeSeenSoFar = bestTimeSeenSoFar;
...
#pragma omp atomic write
bestTimeSeenSoFar = whatever;
#pragma omp flush(bestTimeSeenSoFar)
My reading of the OpenMP standard, chapter 2.12.6, which deals with atomic, doesn't let me decide whether this will do exactly what you want, but this is the best / closest I can come up with. Moreover, even if this might work in theory, it will be highly dependent on the quality of the implementation of this feature within your compiler. So if it doesn't work for you, that won't necessarily imply that the idea is wrong.
Anyway, I would encourage you to give it a try and, please please, to report if it works for you.
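If the write is really "store this value only if it is better than the current one", an atomic write alone does not cover the compare-then-store step. One possible sketch, assuming smaller times are better, combines the atomic read with a named critical section for the update. The function names and the initial value are hypothetical.

```c
/* Shared best time; smaller is better. The initial value is a placeholder. */
static double bestTimeSeenSoFar = 1.0e300;

/* Cheap read that many threads may perform at the same time. */
double read_best_time(void)
{
    double t;
    #pragma omp atomic read
    t = bestTimeSeenSoFar;
    return t;
}

/* Hypothetical updater: stores candidate only if it improves the best time. */
void offer_time(double candidate)
{
    if (candidate >= read_best_time())
        return;                     /* common case: nothing to write */

    #pragma omp critical(best_time_update)
    {
        /* Re-check inside the critical section: another thread may have
           stored a better value since the read above. */
        if (candidate < bestTimeSeenSoFar) {
            #pragma omp atomic write
            bestTimeSeenSoFar = candidate;
        }
    }
}
```

Readers never enter the critical section, so they only contend with each other at the level of the atomic read; writers serialize among themselves.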