Critical section inside for loop in OpenMP - multithreading

I have the following code:
#pragma omp parallel for private(dist,i,j)
for (k = 0; k < K; k++)
{
    // some code
    for (i = 0; i < N; i++)
    {
        #pragma omp critical
        {
            if (min_dist[i] > dist[i]) // The point i belongs to the cluster k
            {
                newmembership[i] = k;
                min_dist[i] = dist[i];
            }
        }
        dist[i] = 0;
    }
}
dist is a private variable, whereas newmembership and min_dist are shared variables. For my test case the code still works if we run it without the critical section construct. To the best of my understanding it should not, as two threads might be working on the same value of i and might modify min_dist[i] and newmembership[i], leading to a conflict.
Please explain whether it is necessary to add the critical section construct, and also whether there is a better way to implement the above, e.g. using locks or semaphores?

Removing the critical section would be a data race. Consider the following execution:
(min_dist[42] == 100)

time | Thread 0                        | Thread 1
-----+---------------------------------+--------------------------------
  0  | k = 13                          |
  1  | i = 42                          | k = 14
  2  | dist[i] = 50                    | i = 42
  3  | min_dist[i] > dist[i] ==> true  | dist[i] = 75
  4  | newmembership[i] = 13           | min_dist[i] > dist[i] ==> true
  5  | min_dist[i] = 50                | newmembership[i] = 14
  6  | ...                             | min_dist[i] = 75
So you end up with a non-minimal solution. You can even end up with conflicting min_dist / newmembership values.
An alternative is to create thread-private local_min_dist / local_newmembership arrays and merge them at the end of the execution:
#pragma omp parallel
{
    // Note: implicitly private because defined inside the parallel region
    int local_newmembership[N];
    int local_min_dist[N];

    // Initialize the per-thread copies from the shared arrays,
    // so that the first comparison below is meaningful
    for (int i = 0; i < N; i++)
    {
        local_newmembership[i] = newmembership[i];
        local_min_dist[i] = min_dist[i];
    }

    #pragma omp for private(dist,i,j)
    for (k = 0; k < K; k++)
    {
        // some code
        for (i = 0; i < N; i++)
        {
            // NOTE: No critical region necessary
            // as we operate on private variables
            if (local_min_dist[i] > dist[i]) // The point i belongs to the cluster k
            {
                local_newmembership[i] = k;
                local_min_dist[i] = dist[i];
            }
            dist[i] = 0;
        }
    }

    // Merge loop: the index is declared locally so each thread has its own copy
    for (int i = 0; i < N; i++)
    {
        // Here we have a critical region,
        // but it is outside of the k-loop
        #pragma omp critical
        if (min_dist[i] > local_min_dist[i])
        {
            newmembership[i] = local_newmembership[i];
            min_dist[i] = local_min_dist[i];
        }
    }
}
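As for locks: instead of one big critical section you can also use one OpenMP lock per array element, so that updates to different values of i never serialize against each other. This is only a rough sketch under the same assumptions as the question's code (N, K, k, i, j, dist, min_dist and newmembership declared and filled elsewhere):

#include <stdlib.h>
#include <omp.h>

// One lock per point, so only accesses to the same element i contend
omp_lock_t *locks = malloc(N * sizeof(omp_lock_t));
for (int i = 0; i < N; i++)
    omp_init_lock(&locks[i]);

#pragma omp parallel for private(dist,i,j)
for (k = 0; k < K; k++)
{
    // some code that fills dist[]
    for (i = 0; i < N; i++)
    {
        omp_set_lock(&locks[i]);       // protects only element i
        if (min_dist[i] > dist[i])
        {
            newmembership[i] = k;
            min_dist[i] = dist[i];
        }
        omp_unset_lock(&locks[i]);
        dist[i] = 0;
    }
}

for (int i = 0; i < N; i++)
    omp_destroy_lock(&locks[i]);
free(locks);

Whether this beats the thread-private merge above depends on how contended the individual elements are; the merge version avoids any locking in the inner loop.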

Related

Atomic variable vs Normal variable with locks in C++

Recently I gave an interview at a tech company, and the interviewer asked me which operation is faster: an increment on an atomic variable, or an increment on a normal variable protected by a lock. (This came up as a sub-question of an original question which is out of context here.)
As I don't know what goes on under the hood, I reasoned that it is one hardware instruction for one increment instead of three, and I claimed the atomic to be faster.
Now, after the interview, while trying to work out the answer, I found the following happening.
Code I've written to test:
#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>
#include <atomic>
#include <cmath>

using namespace std;

int iters = 1;

class Timer{
private:
    std::chrono::time_point<std::chrono::high_resolution_clock> startTime, stopTime;
    string method_name;

    // stop function
    void stop(){
        stopTime = std::chrono::high_resolution_clock::now();
        auto start = std::chrono::time_point_cast<std::chrono::microseconds>(startTime).time_since_epoch().count();
        auto stop = std::chrono::time_point_cast<std::chrono::microseconds>(stopTime).time_since_epoch().count();
        auto duration = stop - start;
        cout << "Time taken : " << duration << " μs" << endl;
    }
public:
    // constructor
    Timer(){
        startTime = std::chrono::high_resolution_clock::now();
    }
    // destructor
    ~Timer(){
        stop();
    }
};

std::mutex Lock;
long n_variable = 0;
std::atomic<long> a_variable(0);

void updateAtomicVariable(){
    for(int i = 0 ; i < iters ; i++){
        a_variable++;
    }
}

void updateNormalVariableWithLocks(){
    for(int i = 0 ; i < iters ; i++){
        Lock.lock();
        n_variable++;
        Lock.unlock();
    }
}

int main(){
    int no_th = 1;
    std::thread atomic_threads[10];
    std::thread normal_threads[10];

    // updating once
    cout << "Updating atomic variable once" << endl;
    {
        Timer timer;
        for(int i = 0 ; i < no_th ; i++){
            atomic_threads[i] = std::thread(updateAtomicVariable);
        }
        for(int i = 0 ; i < no_th ; i++){
            atomic_threads[i].join();
        }
    }

    cout << "Updating normal variable once with locks" << endl;
    {
        Timer timer;
        for(int i = 0 ; i < no_th ; i++){
            normal_threads[i] = std::thread(updateNormalVariableWithLocks);
        }
        for(int i = 0 ; i < no_th ; i++){
            normal_threads[i].join();
        }
    }

    no_th = 10;
    iters = 1e7;

    // updating multiple times
    cout << "Updating atomic variable 1e8 times with 10 threads" << endl;
    {
        Timer timer;
        for(int i = 0 ; i < no_th ; i++){
            atomic_threads[i] = std::thread(updateAtomicVariable);
        }
        for(int i = 0 ; i < no_th ; i++){
            atomic_threads[i].join();
        }
    }

    cout << "Updating normal variable 1e8 times with 10 threads and with locks" << endl;
    {
        Timer timer;
        for(int i = 0 ; i < no_th ; i++){
            normal_threads[i] = std::thread(updateNormalVariableWithLocks);
        }
        for(int i = 0 ; i < no_th ; i++){
            normal_threads[i].join();
        }
    }
}
Interestingly, I got the following output:
Updating atomic variable once
Time taken : 747 μs
Updating normal variable once with locks
Time taken : 255 μs
Updating atomic variable 1e8 times with 10 threads
Time taken : 1806461 μs
Updating normal variable 1e8 times with 10 threads and with locks
Time taken : 12974378 μs
Although the numbers vary from run to run, the relative magnitudes remain the same.
This shows that for a single increment the atomic operation is slower, whereas when incrementing many times the atomic variable is much faster, which is contrary to the first observation.
I ran this on multiple machines, multiple times, and found the same behaviour.
Could someone help me understand what's going on here?
Thanks in advance.

OpenMP in Biham-Middleton-Levine BML model

I've got a serial version of BML and I'm trying to write a parallel one with OpenMP. Basically my code has a main loop calling two functions for the horizontal and vertical moves, like this:
for (s = 0; s < nmovss; s++) {
    horizontal_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;

    vertical_movs(grid, N);
    copy_sides(grid, N);
    cur = 1-cur;
}
where cur is the current grid. The horizontal and vertical functions are similar; each has a nested loop:
for(i = 1; i <= n; i++) {
    for(j = 1; j <= n+1; j++) {
        if(grid[cur][i][j-1] == LR && grid[cur][i][j] == EMPTY) {
            grid[1-cur][i][j-1] = EMPTY;
            grid[1-cur][i][j] = LR;
        }
        else {
            grid[1-cur][i][j] = grid[cur][i][j];
        }
    }
}
The code produces a ppm image at every step, and with a certain input the serial version produces an output that we can assume to be correct. But using #pragma omp parallel for inside the two functions H and V, the ppm file ends up split into as many zones as there are threads (e.g. 4).
I suppose the problem is that every thread should complete both functions in sequence before terminating, because the movements are strictly connected. I don't know how to do that. If I put the pragma at a higher level, e.g. before the main loop, there is no speed-up. Obviously the ppm file must not be sliced like the image.
Going on, I tried this solution, which gives me a result identical to the serial code, but I don't exactly understand why:
# pragma omp parallel num_threads(thread_count) default(none) \
        shared(grid, n, cur) private(i, j)
for(i = 1; i <= n+1; i++) {
    # pragma omp for
    for(j = 1; j <= n; j++) {
        if(grid[cur][i-1][j] == TB && grid[cur][i][j] == EMPTY) {
            grid[1-cur][i-1][j] = EMPTY;
            grid[1-cur][i][j] = TB;
        }
        else {
            grid[1-cur][i][j] = grid[cur][i][j];
        }
    }
}
Moreover, if I use just one thread more than the available cores (4), the execution time "explodes" instead of staying roughly the same.

How to divide a huge loop into multiple threads and then add the results to a collection?

I am performing some task in a loop. I need to divide this loop of 1.2 million iterations among multiple threads. Each thread will put its results in a list. When all threads are completed, I need to add all the threads' list data into one common list. I cannot use ExecutorService. How can I do this?
It should be compatible with JDK 1.6.
This is what I am doing right now:
List<Thread> threads = new ArrayList<Thread>();
int elements = 1200000;

public void function1() {
    int oneTheadElemCount = 10000;
    float fnum_threads = (float)elements / (float)oneTheadElemCount;
    String s = String.valueOf(fnum_threads);
    int num_threads = Integer.parseInt(s.substring(0, s.indexOf("."))) + 1;

    for(int count = 0; count < num_threads; count++) {
        int endIndex = ((oneTheadElemCount * (num_threads - count)) + 1000);
        int startindex = endIndex - oneTheadElemCount;
        if(count == (num_threads-1))
        {
            startindex = 0;
        }
        if(startindex == 0 && endIndex > elements) {
            endIndex = elements - 1;
        }
        dothis(startindex, endIndex);
    }
    for(Thread t : threads) {
        t.run();
    }
}

public List dothis(int startindex, int endIndex) throws Exception {
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {
            for (int i = startindex;
                 (i < endIndex && (startindex < elements && elements)); i++)
            {
                // task adding elements in list
            }
        }
    });
    thread.start();
    threads.add(thread);
    return list;
}
I don't know which version of Java you are using, but in Java 7 and higher you can use the Fork/Join framework (ForkJoinPool).
Basically,
Fork/Join, introduced in Java 7, isn't intended to replace or compete
with the existing concurrency utility classes; instead it updates and
completes them. Fork/Join addresses the need for divide-and-conquer,
or recursive task-processing in Java programs (see Resources).
Fork/Join's logic is very simple: (1) separate (fork) each large task
into smaller tasks; (2) process each task in a separate thread
(separating those into even smaller tasks if necessary); (3) join the
results.
Citation.
There are various examples online that can help with it. I haven't used it myself.
I hope this helps.
For Java6, you can follow this related SO question.

Concurrency in Linux: Alternate access of two threads in critical zone

I'm new to the concepts of concurrency and threads in Linux, and I tried to solve a relatively simple problem. I create two threads which run the same function, which increments a global variable. What I really want from my program is for the threads to increment that variable alternately; that is, at each step the thread that increments the variable prints a message to the screen, so the expected output should look like:
Thread 1 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 1 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
and so on.
I tried an implementation with a semaphore which ensures mutual exclusion, but nonetheless the result resembles this:
Thread 2 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 2 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
Thread 2 is incrementing variable count... count = 5
Thread 1 is incrementing variable count... count = 6
Thread 1 is incrementing variable count... count = 7
Thread 1 is incrementing variable count... count = 8
Thread 1 is incrementing variable count... count = 9
Thread 1 is incrementing variable count... count = 10
My code looks like:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>

int count = 0;
sem_t mutex;

void function1(void *arg)
{
    int i = 0;
    int *a = (int*) arg;
    while (i < 10)
    {
        sem_wait(&mutex);
        count++;
        i++;
        printf("From the function : %d count is %d\n", *a, count);
        sem_post(&mutex);
    }
}

int main()
{
    pthread_t t1, t2;
    int a = 1;
    int b = 2;

    pthread_create(&t1, NULL, (void *)function1, &a);
    pthread_create(&t2, NULL, (void *)function1, &b);

    sem_init(&mutex, 0, 1);

    pthread_join(t2, NULL);
    pthread_join(t1, NULL);

    sem_destroy(&mutex);
    return 0;
}
My big question now is: how do I achieve this alternation between the threads? I got mutual exclusion, but the alternation is still missing. Maybe I lack a good insight into semaphore usage, but I would be very grateful if someone explained it to me. (I have read several courses on this topic, namely Linux semaphores, concurrency and threads, but the information there wasn't satisfactory enough.)
Mutex locks don't guarantee any fairness. This means that if thread 1 releases the lock and then immediately tries to acquire it again, there is no guarantee that it won't get it before thread 2 does; a mutex only guarantees that no two pieces of code are running inside that block at the same time.
EDIT: Removed previous C-style solution because it was probably incorrect. The question asks for a synchronization solution.
If you really want to do this correctly, you would use something called a monitor with a guard (or a condition variable). I'm not too familiar with C and pthreads, so you would need to look up how to do it there, but in Java (here using Guava's Monitor class) it would look something like:
public class IncrementableInteger {
    public int value;

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

@Test
public void testThreadAlternating() throws InterruptedException {
    IncrementableInteger i = new IncrementableInteger();
    Monitor m = new Monitor();
    Monitor.Guard modIsZeroGuard = new Monitor.Guard(m) {
        @Override public boolean isSatisfied() {
            return i.value % 2 == 0;
        }
    };
    Monitor.Guard modIsOneGuard = new Monitor.Guard(m) {
        @Override public boolean isSatisfied() {
            return i.value % 2 == 1;
        }
    };
    Thread one = new Thread(() -> {
        while (true) {
            m.enterWhenUninterruptibly(modIsZeroGuard);
            try {
                if (i.value >= 10) return;
                i.value++;
                System.out.println("Thread 1 inc: " + String.valueOf(i));
            } finally {
                m.leave();
            }
        }
    });
    Thread two = new Thread(() -> {
        while (true) {
            m.enterWhenUninterruptibly(modIsOneGuard);
            try {
                if (i.value >= 10) return;
                i.value++;
                System.out.println("Thread 2 inc: " + String.valueOf(i));
            } finally {
                m.leave();
            }
        }
    });
    one.start();
    two.start();
    one.join();
    two.join();
}
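Since the question is actually about C and pthreads on Linux, here is a minimal sketch of the same turn-taking idea using a mutex and a condition variable (the worker function and the parity scheme are my own illustration, not the question's code): the parity of count decides whose turn it is, and each thread waits on the condition variable until it is its turn.

#include <stdio.h>
#include <pthread.h>

int count = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t turn_changed = PTHREAD_COND_INITIALIZER;

void *worker(void *arg)
{
    int id = *(int *)arg;                  /* 1 or 2 */
    int i;
    for (i = 0; i < 10; i++)
    {
        pthread_mutex_lock(&lock);
        /* Thread 1 runs when count is even, thread 2 when count is odd */
        while (count % 2 != id - 1)
            pthread_cond_wait(&turn_changed, &lock);
        count++;
        printf("Thread %d is incrementing variable count... count = %d\n", id, count);
        pthread_cond_signal(&turn_changed); /* wake the other thread */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Compiled with -pthread, this prints the strictly alternating sequence from the question (Thread 1 on odd counts, Thread 2 on even counts), because a thread whose turn it is not blocks in pthread_cond_wait instead of racing for the mutex.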

Error while reading same mem positions on different threads

I have a problem while reading a couple of positions in a double array from different threads.
I enqueue the execution with :
nelements = nx*ny;
err = clEnqueueNDRangeKernel(queue,kernelTvl2of,1,NULL,&nelements,NULL,0,NULL,NULL);
kernelTvl2of contains (among other things) the code
size_t k = get_global_id(0);
(...)
u1_[k] = (float)u1[k];
(...)
barrier(CLK_GLOBAL_MEM_FENCE);
forwardgradient(u1_,u1x,u1y,k,nx,ny);
barrier(CLK_GLOBAL_MEM_FENCE);
and forwardgradient has the code:
void forwardgradient(global double *f, global double *fx, global double *fy, int ker, int nx, int ny){
    unsigned int rowsnotlast = ((nx)*(ny-1));
    if(ker<rowsnotlast){
        fx[ker] = f[ker+1] - f[ker];
        fy[ker] = f[ker+nx] - f[ker];
    }
    if(ker<nx*ny){
        fx[ker] = f[ker+1] - f[ker];
        if(ker==4607){
            fx[0] = f[4607];
            fx[1] = f[4608];
            fx[2] = f[4608] - f[4607];
            fx[3] = f[ker];
            fx[4] = f[ker+1];
            fx[5] = f[ker+1] - f[ker];
        }
    }
    if(ker==(nx*ny)-1){
        fx[ker] = 0;
        fy[ker] = 0;
    }
    if(ker%nx == nx-1){
        fx[ker]=0;
    }
    fx[6] = f[4608];
}
When I get the contents of the first positions of fx, they are:
-6 0 6 -6 0 6 -6
And here's my problem: when I read f[ker+1] or f[4608] on the thread with id 4607 I get a '0' (second and fifth positions of the output array), but from other threads I get a '-6' (last position of the output array).
Does anyone have a clue about what I'm doing wrong, or where I could look?
Thanks a lot,
Anton
Within a kernel, global memory consistency is only achievable within a single work-group. This means that if a work-item writes a value to global memory, a barrier(CLK_GLOBAL_MEM_FENCE) only guarantees that other work-items within the same work-group will be able to read the updated value.
If you need global memory consistency across multiple work-groups, you need to split your kernel into multiple kernels.
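A rough host-side sketch of that split (kernelCopy and kernelGradient are hypothetical kernel names, not taken from the question): put all the writes to u1_ in the first kernel and the forwardgradient call in the second, so that the second NDRange starts only after every work-group of the first has finished and its global writes are visible. This assumes queue is a default in-order command queue, as in the question's clEnqueueNDRangeKernel call.

// Pass 1: kernelCopy performs the u1_[k] = (float)u1[k] writes for every k
err = clEnqueueNDRangeKernel(queue, kernelCopy, 1, NULL, &nelements, NULL, 0, NULL, NULL);

// Pass 2: kernelGradient reads u1_ and calls forwardgradient().
// On an in-order queue this command does not start executing until the
// previous kernel has completed, so all of its global writes are visible.
err = clEnqueueNDRangeKernel(queue, kernelGradient, 1, NULL, &nelements, NULL, 0, NULL, NULL);

clFinish(queue);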
