I'm trying to parallelize a switch ... case (C++) using OpenMP directives, but despite my best efforts, the code runs slower than normal sequential execution.
I have used #pragma omp parallel and #pragma omp sections,
and I have tried to rewrite the switch case as an if ... else statement,
but with no good result ...
switch (number) {
case 1:
f1();
break;
case 2:
f2();
break;
case 3:
f3();
break;
case 4:
fn();
break;
}
Then there is a second problem: OpenMP won't let me break or return.
A switch statement cannot be parallelized in OpenMP just by adding pragmas such as parallel or sections. The threads in a parallel region either divide the work among themselves via a loop index or all do the same work. A work-sharing construct needs to know either how many elements it has to process or a master condition that determines the start and end. What you want to parallelize is the processing of the inputs rather than the functions (f1, f2, ..., fn) themselves, so I am guessing you are processing a lot of "number" values. One way is to collect these numbers in an array or vector. Then you can run a parallel for over that array, calling the corresponding function for each element.
while(some_condition_on_numbers)
{
// Collect Numbers in a vector / some array
}
#pragma omp parallel for
for(int counter = 0; counter < elements_to_process; counter++)
{
F(array_of_number[counter]);
}
void F(int choice)
{
// Compare with ==; a single = would assign and always take the branch
if (choice == 1) { f1(); }
else if (choice == 2) { f2(); }
..
}
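The approach above can be sketched as a small self-contained example. The functions f1 to f4 here are hypothetical stand-ins that only count how often they are called, and process_all is a name made up for this illustration. The break statements are fine because they leave the switch, not the parallel loop. If the file is compiled without OpenMP support, the pragma is ignored and the loop simply runs sequentially.

```cpp
#include <atomic>
#include <vector>

// Hypothetical stand-ins for f1 .. fn: each one only counts its calls.
std::atomic<int> calls[4];

void f1() { ++calls[0]; }
void f2() { ++calls[1]; }
void f3() { ++calls[2]; }
void f4() { ++calls[3]; }

// The original switch, unchanged apart from a default case.
void dispatch(int number) {
    switch (number) {
    case 1: f1(); break;
    case 2: f2(); break;
    case 3: f3(); break;
    case 4: f4(); break;
    default: break;  // unknown numbers are ignored in this sketch
    }
}

// Parallelize over the collected inputs, not over the switch itself.
void process_all(const std::vector<int>& numbers) {
    const long n = static_cast<long>(numbers.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        dispatch(numbers[i]);
    }
}
```

This only pays off when there are many numbers to process or the called functions are expensive; for a handful of cheap calls, the cost of starting the threads is likely to dominate.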
I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP which is somehow given as a value to a mutex so that the mutex does adaptive spinning, meaning that it spins for roughly as long as an immediate wakeup through the kernel would take. But how do I utilize this configuration-macro to a thread ?
And as I've developed an improved shared readers-writer lock (it needs only one atomic operation at best in contrast to the three operations given in the Wikipedia-solution) with relative writer-priority (further readers are stalled when there's a writer and the readers before are allowed to proceed) which could also make use of adaptive spinning: how is the number of spinning-cycles calculated ?
I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP
Some pthreads implementations provide a macro PTHREAD_MUTEX_ADAPTIVE_NP (note spelling) that is one of the possible values of the kind_np mutex attribute, but neither that attribute nor the macro is standard. It looks like at least BSD and AIX have them, or at least did at one time, but this is not something you should be using in new code.
But how do I utilize this configuration-macro to a thread ?
You don't. Even if you are using a pthreads implementation that supports it, this is the value of a mutex attribute, not a thread attribute. You obtain a mutex with that attribute value by explicitly requesting it when you initialize the mutex. It would look something like this:
pthread_mutexattr_t attr;
pthread_mutex_t mutex;
int rval;
// Return-value checks omitted for brevity and clarity
rval = pthread_mutexattr_init(&attr);
rval = pthread_mutexattr_setkind_np(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
rval = pthread_mutex_init(&mutex, &attr);
There are other mutex attributes that you can set in analogous ways, which is one of the reasons I wrote this answer. Although you should not be using the kind_np attribute, you can follow this general model for other mutex attributes. There are also thread attributes, which work similarly.
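As an illustration of that general model with a standard attribute, here is a minimal sketch that requests a recursive mutex through pthread_mutexattr_settype and PTHREAD_MUTEX_RECURSIVE, both of which are part of POSIX. The function name recursive_mutex_roundtrip is made up for this example, and unlike the snippet above it checks the return values.

```cpp
#include <pthread.h>

// Creates a recursive mutex through an attribute object, locks it twice from
// the same thread, then cleans up. Returns true if every call succeeded.
bool recursive_mutex_roundtrip() {
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;

    if (pthread_mutexattr_init(&attr) != 0) return false;

    // Request the mutex type on the attribute object, then pass the
    // attribute object to pthread_mutex_init.
    bool ok = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE) == 0
           && pthread_mutex_init(&mutex, &attr) == 0;

    // The attribute object is no longer needed once the mutex exists.
    pthread_mutexattr_destroy(&attr);
    if (!ok) return false;

    // A recursive mutex may be locked again by the thread that owns it.
    ok = pthread_mutex_lock(&mutex) == 0
      && pthread_mutex_lock(&mutex) == 0
      && pthread_mutex_unlock(&mutex) == 0
      && pthread_mutex_unlock(&mutex) == 0;

    pthread_mutex_destroy(&mutex);
    return ok;
}
```

On older toolchains this may need to be built with -pthread.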
I found the code in glibc.
This is the "adaptive" mutex locking code of pthread_mutex_lock
in glibc 2.31:
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (! __is_smp)
goto simple;
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN (max_adaptive_count (),
mutex->__data.__spins * 2 + 10);
do
{
if (cnt++ >= max_cnt)
{
LLL_MUTEX_LOCK (mutex);
break;
}
atomic_spin_nop ();
}
while (LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
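So the spin limit for one lock attempt is the smaller of two values: the configured maximum (system-configurable, or 100 if there's no configuration) and twice the stored spin count plus 10. After the lock has been taken, one eighth of the difference between the actual number of spins and the stored spin count is added to the stored count, which is then used for the next attempt.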
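The arithmetic in the glibc excerpt can be isolated in two small functions to see how the values evolve. This is only a sketch of the two formulas, not glibc code; the names spin_limit and updated_spins are made up, and max_adaptive stands for whatever max_adaptive_count() returns on the system.

```cpp
#include <algorithm>

// Upper bound on spins for one lock attempt:
// MIN(max_adaptive_count(), __spins * 2 + 10) in the excerpt.
int spin_limit(int stored_spins, int max_adaptive) {
    return std::min(max_adaptive, stored_spins * 2 + 10);
}

// New stored value after an attempt that spun `cnt` times:
// __spins += (cnt - __spins) / 8 in the excerpt.
int updated_spins(int stored_spins, int cnt) {
    return stored_spins + (cnt - stored_spins) / 8;
}
```

Starting from a stored count of 0, the first contended attempt may spin at most 10 times. If it uses all 10, the stored count becomes 1, so the estimate moves only slowly toward the observed spin counts, and it shrinks again when later attempts succeed after fewer spins.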
I'm new to atomic techniques and am trying to implement a thread-safe version of the following code:
// say m_cnt is unsigned
void Counter::dec_counter()
{
if(0==m_cnt)
return;
--m_cnt;
if(0 == m_cnt)
{
// Do something
}
}
Every thread that calls dec_counter must decrement the counter by one, and "Do something" should be done only once: when the counter is decremented to 0.
After fighting with it, I wrote the following code, which does this correctly (I think), but I wonder whether this is the right way to do it or whether there is a better way. Thanks.
// m_cnt is std::atomic<unsigned>
void Counter::dec_counter()
{
// loop until decrement done
unsigned uiExpectedValue;
unsigned uiNewValue;
do
{
uiExpectedValue = m_cnt.load();
// if other thread already decremented it to 0, then do nothing.
if (0 == uiExpectedValue)
return;
uiNewValue = uiExpectedValue - 1;
// at the short time from doing
// uiExpectedValue = m_cnt.load();
// it is possible that another thread had decremented m_cnt, and it won't be equal here to uiExpectedValue,
// thus the loop, to be sure we do a decrement
} while (!m_cnt.compare_exchange_weak(uiExpectedValue, uiNewValue));
// if we are here, that means we did decrement . so if it was to 0, then do something
if (0 == uiNewValue)
{
// do something
}
}
The thing with atomic is that only that one statement is atomic.
If you write
std::atomic<int> i {20};
...
if (!--i)
...
Then just 1 thread will enter the if.
However, if you split up the change and the test, then other threads can get into the gap, and you may get strange results:
std::atomic<int> i {20};
...
--i;
// other thread(s) can modify i just here
if (!i)
...
Of course you can separate the condition test from the decrement by using a local variable:
std::atomic<int> i {20};
...
int j=--i;
// other thread(s) can modify i just here, but j keeps the value this thread produced
if (!j)
...
All the simple math operations are generally supported efficiently for small atomics in C++.
For more complex types and expressions, you need to use the read/modify/write member functions.
These allow you to read the current value, calculate the new value, and then call compare_exchange_strong or compare_exchange_weak to say "if the value has not changed, then store my new value, otherwise give me the new current value" as a single atomic operation. You can put this in a loop and keep recalculating the new value until you are lucky enough that your thread is the only writer. If there are not too many threads trying to change the value too often, this is reasonably efficient as well.
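Applied to the counter from the question, such a compare-exchange loop could look like the following sketch. The free function decrement_hits_zero is a made-up name; it returns true only for the one call that moves the counter from 1 to 0, which is where "do something" would go.

```cpp
#include <atomic>

// Decrements counter unless it is already 0.
// Returns true only for the call that took it from 1 to 0.
bool decrement_hits_zero(std::atomic<unsigned>& counter) {
    unsigned expected = counter.load();
    while (expected != 0) {
        // Succeeds only if counter still equals expected; on failure,
        // expected is overwritten with the current value and we retry.
        if (counter.compare_exchange_weak(expected, expected - 1)) {
            return expected == 1;  // expected still holds the old value
        }
    }
    return false;  // someone else already reached 0
}
```

Because compare_exchange_weak refreshes expected when it fails, no separate load is needed inside the loop.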
I want to parallelize the following loop that fills the matrix A. For every element A[i][j] that is calculated, I want the values in A[i-1][j], A[i-1][j-1] and A[i][j-1] to have been calculated first. So my thread has to wait for the threads working on these positions to have calculated their results. I've tried to achieve this as follows:
#pragma omp parallel for num_threads(threadcnt) \
default(none) shared(A, lenx, leny) private(j)
for (i=1; i<lenx; i++)
{
for (j=1; j<leny; j++)
{
do
{
} while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
My A matrix is initialized to -1, so if A[][] equals -1 the computation of that cell is not complete. It takes more time than the serial program, though. Any idea how to avoid the while loop?
The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
One alternative, especially in the context of a more general parallelization scheme would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
for (i=1; i<lenx; i++) {
for (j=1; j<leny; j++) {
#pragma omp task depend(in:A[i-1][j-1],A[i-1][j],A[i][j-1]) depend(out:A[i][j])
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
}
}
You may want to think about blocking the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea will remain the same.
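A blocked variant might look like the sketch below, which is untested under a real OpenMP runtime beyond the reasoning in its comments. It assumes row 0 and column 0 of A already hold valid values, uses std::vector instead of a raw array, and introduces a helper array of dummy bytes (tok) whose elements serve only as dependence tokens, one per tile. All of these names are made up for the illustration.

```cpp
#include <algorithm>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Fills A[i][j] for i, j >= 1 with the recurrence from the question, using
// one task per block x block tile.
void fill_blocked(Matrix& A, int block) {
    const int rows = static_cast<int>(A.size());
    const int cols = static_cast<int>(A[0].size());
    const int tiles_i = (rows - 1 + block - 1) / block;
    const int tiles_j = (cols - 1 + block - 1) / block;
    const int stride = tiles_j + 1;

    // One dummy byte per tile plus a border row and column. Only the
    // addresses matter; they are the items named in the depend clauses.
    std::vector<char> storage((tiles_i + 1) * stride);
    char* tok = storage.data();

    #pragma omp parallel
    #pragma omp single
    for (int ti = 1; ti <= tiles_i; ++ti) {
        for (int tj = 1; tj <= tiles_j; ++tj) {
            // A tile waits for the tile above and the tile to its left.
            // The diagonal tile is covered transitively through those two.
            #pragma omp task depend(in: tok[(ti-1)*stride + tj], tok[ti*stride + tj-1]) \
                             depend(out: tok[ti*stride + tj])
            {
                const int i0 = 1 + (ti - 1) * block;
                const int j0 = 1 + (tj - 1) * block;
                const int i1 = std::min(rows, i0 + block);
                const int j1 = std::min(cols, j0 + block);
                for (int i = i0; i < i1; ++i)
                    for (int j = j0; j < j1; ++j)
                        A[i][j] = std::max({A[i-1][j-1] + 1,
                                            A[i-1][j] - 1,
                                            A[i][j-1] - 1});
            }
        }
    }
}
```

The block size trades scheduling overhead against available parallelism: larger tiles mean fewer tasks but also fewer tiles that can run at the same time along each anti-diagonal.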
Another useful OpenMP feature could be the ordered construct and its ability to express exactly this kind of data dependency:
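#pragma omp parallel for ordered(2)
for (int i=1; i<lenx; i++) {
for (int j=1; j<leny; j++) {
// wait until the three neighbouring iterations have signalled completion
#pragma omp ordered depend(sink:i-1,j-1) depend(sink:i-1,j) depend(sink:i,j-1)
A[i][j] = max(A[i-1][j-1]+1,A[i-1][j]-1, A[i][j-1] -1);
// signal that iteration (i,j) is done
#pragma omp ordered depend(source)
}
}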
PS: The code above is untested, but it should get the rough idea across.
Your solution cannot work. As A is initialized to -1 and A[0][j] is never modified, for i==1 the loop tests A[1-1][j], which never changes, so the wait never ends. By the way, if A is initialized to -1, how can you get anything other than -1 from the max?
When you have a dependency problem, there are two solutions.
The first is to serialize your code. OpenMP has the ordered directive for that, but the drawback is that you lose parallelism (while still paying the thread creation cost). OpenMP 4.5 has a way to describe dependencies with the depend clause and sink/source, but I do not know how efficiently a compiler can handle that. And my compilers (gcc 7.3 or clang 6.0) do not support this feature.
The second solution is to modify your algorithm to remove the dependencies. Right now, you are computing the maximum of all values to the left of or above a given element. Let's turn it into a simpler problem: compute the maximum of all values to the left of a given element. We can easily parallelize this by processing the different rows in parallel, as there is no inter-row dependency.
// compute b[i][j] = max over k<=j of a[i][k]
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=0;j<n;j++){
// max per row
if((j!=0)) b[i][j]=max(a[i][j],b[i][j-1]);
else b[i][j]=a[i][j]; // left column initialised to value of a
}
}
Consider another simple problem: computing the prefix maximum along the columns. It is again easy to parallelize, but this time on the inner loop, as there is no inter-column dependency.
// compute c[i][j] = max over k<=i of b[k][j]
for(int i=0;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
if((i!=0)) c[i][j]=max(b[i][j],c[i-1][j]);
else c[i][j]=b[i][j]; // upper row initialised to value of b
}
}
Now you just have to chain these computations to get the expected result. Here is the final code (using a single array, with some cleanup).
#pragma omp parallel for
for(int i=0;i<n;i++){
for(int j=1;j<n;j++){
// max per row
a[i][j]=max(a[i][j],a[i][j-1]);
}
}
for(int i=1;i<n;i++){
#pragma omp parallel for
for(int j=0;j<n;j++){
// max per column
a[i][j]=max(a[i][j],a[i-1][j]);
}
}
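Wrapped into a function, the two chained passes can be tried on a small input. This is a sketch that assumes a rectangular std::vector of rows instead of the raw array used above; the name prefix_max_2d is made up.

```cpp
#include <algorithm>
#include <vector>

// After the call, a[i][j] holds the maximum of all original values a[k][l]
// with k <= i and l <= j.
void prefix_max_2d(std::vector<std::vector<int>>& a) {
    const int rows = static_cast<int>(a.size());
    const int cols = rows ? static_cast<int>(a[0].size()) : 0;

    // Pass 1: running maximum along each row; rows are independent.
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int j = 1; j < cols; ++j)
            a[i][j] = std::max(a[i][j], a[i][j-1]);

    // Pass 2: running maximum down each column; columns are independent,
    // so the inner loop is the parallel one.
    for (int i = 1; i < rows; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < cols; ++j)
            a[i][j] = std::max(a[i][j], a[i-1][j]);
    }
}
```

For the input {{1,5,2},{0,3,9},{4,1,1}}, the row pass gives {{1,5,5},{0,3,9},{4,4,4}} and the column pass then gives {{1,5,5},{1,5,9},{4,5,9}}.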
I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; //number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;
m_pool = new std::thread[m_processors + 1];
void func()
{
//code
}
for (int i = 0; i < m_processors; i++)
{
m_pool[i] = std::thread(func);
}
void reset(void)
{
{
std::lock_guard<std::mutex> lock(m_locker);
m_exit = true;
}
m_condition.notify_all();
for(int i = 0; i <= m_processors; i++)
m_pool[i].join();
delete[] m_pool;
}
After running through all tasks, the for-loop is supposed to join all running threads before delete[] is executed.
But there seems to be one last thread still running, while the m_pool does not exist anymore.
This leads to the problem, that I can't close my program anymore.
Is there any way to check if all threads are joined or wait for all threads to be joined before deleting the threadpool?
Simple typo bug, I think.
Your join loop uses the condition i <= m_processors, so it visits one more entry than the loop that started the threads. This is an off-by-one bug. Suppose m_processors is 2. You start threads at indices [0] and [1], yet the join loop also calls join() on m_pool[2]. With the allocation shown (m_processors + 1 elements), m_pool[2] is a default-constructed std::thread that was never started, and calling join() on a non-joinable thread throws std::system_error. Had the array held only m_processors elements, the access would be out of bounds and undefined behavior. Either way, you are likely to crash or hang there.
You likely intended i < m_processors.
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11 for std::thread, then you shouldn't create your thread handles using operator new[]. There are better ways of doing that with other C++ constructs, which will make everything simpler and exception safe (you don't leak memory if an unexpected exception is thrown).
Store your thread objects in a std::vector. It will manage the memory allocation and deallocation for you (no more new and delete). You can use other more flexible containers such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);
// Fill the vector
std::generate_n( std::back_inserter(m_pool), m_processors,
[](){ return std::thread(func); } );
Join all the elements using range-for loop and delete handles using container's functions.
for( std::thread& t: m_pool ) {
t.join();
}
m_pool.clear();
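On the question of checking whether a thread still needs joining: std::thread::joinable() reports whether a thread object refers to a thread that has been started and not yet joined or detached. The sketch below deliberately adds one default-constructed (never started) entry to show that guarding join() with joinable() makes the loop safe. The name run_and_join and the trivial worker are assumptions made for this example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Starts worker_count threads, joins every joinable entry, and returns how
// many workers ran to completion.
int run_and_join(int worker_count) {
    std::atomic<int> finished{0};
    std::vector<std::thread> pool;
    pool.reserve(worker_count + 1);

    for (int i = 0; i < worker_count; ++i)
        pool.emplace_back([&finished] { ++finished; });

    // A default-constructed std::thread represents no thread at all;
    // calling join() on it directly would throw std::system_error.
    pool.emplace_back();

    for (std::thread& t : pool)
        if (t.joinable())
            t.join();

    return finished.load();
}
```

After the loop, every started thread has finished, so destroying the vector is safe. On older toolchains this may need to be built with -pthread.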
I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on average. However, even when setting the number of threads to one (set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use multiple threads, and also when using a write lock when updating num_iterations.
I would assume that it has something to do with memory bandwidth or a race condition. But those problems should not appear in the case of set_num_threads(1).
Therefore, I assume that it could have something to do with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?
A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the bounds of your loop once you have entered it. Although the standard calls the behaviour "unspecified", what happens in your case is quite clear: as soon as you switch OpenMP on in your code, the loop executes the number of iterations that was computed on entry.
So you have to take another approach to this problem.
This is a possible solution (amongst many others) which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread-safe and can be called in parallel without unwanted side effects. This is a very strong assumption which you'd better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( i < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(...);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
}
}
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
int i = omp_get_thread_num();
const int nbth = omp_get_num_threads();
while ( nb_it_done < num_interation ) {
// do sth.
// update the number of required iterations
// num_interation can only become smaller over time
int new_num_interation = update_iterations(i);
#pragma omp critical
num_interation = std::min(num_interation, new_num_interation);
i += nbth;
#pragma omp single
nb_it_done += nbth;
}
}
Another open point here is that, since you didn't show what i is used for, it isn't clear whether iterating through the domain in a somewhat random order is a problem. If it isn't, the first solution should work well, even for an unbalanced //do sth. But if it is a problem, then you'd better stick with the second solution (and potentially even tighten the synchronisation).
But at the end of the day, there is no way (that I can think of, and with decent parallelism) to avoid potential extra work, since the number of iterations can change along the way.
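A different technique from the two solutions above, shown here only as a sketch, keeps the loop bounds fixed at max_iterations (so the rule quoted from the standard is respected) and lets iterations beyond the current limit return immediately. The limit lives in a std::atomic<int> that is lowered with a compare-exchange loop. The constant-returning update_iterations is a hypothetical stand-in for the real one, and atomic_min and run are names made up for this example.

```cpp
#include <atomic>

// Hypothetical stand-in for update_iterations(...): always reports that
// 500 iterations are enough.
int update_iterations(int /*i*/) { return 500; }

// Lowers target to candidate if candidate is smaller (an atomic minimum).
void atomic_min(std::atomic<int>& target, int candidate) {
    int current = target.load();
    while (candidate < current &&
           !target.compare_exchange_weak(current, candidate)) {
        // current was refreshed by the failed exchange; test again
    }
}

// Returns how many iterations actually did work.
int run(int max_iterations) {
    std::atomic<int> limit{max_iterations};
    std::atomic<int> executed{0};

    // The bounds never change; iterations past the limit are skipped.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < max_iterations; ++i) {
        if (i >= limit.load()) continue;
        // do sth.
        ++executed;
        atomic_min(limit, update_iterations(i));
    }
    return executed.load();
}
```

As with the solutions above, a few iterations beyond the final limit may still do work when several threads run, because they can pass the check before another thread lowers the limit. Run sequentially, exactly 500 iterations do work with this stand-in.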