OpenCL: global thread synchronization between two loops - multithreading

I have an OpenCL kernel that computes two global buffers in two loops.
The first loop does some computations for each global thread and writes the result to the output buffer "OutBuff". The second loop then updates the values of the global buffer "UpdateBuff" according to the results computed in "OutBuff" in the first loop (on the previous level). The problem is that the work-items are executed in parallel, so the ordering between the two loops is not preserved. In my case, I need to keep the order of execution between these two loops: the second loop must only read results that the first loop has already produced, for every global id.
For example:
__kernel void globalSynch(__global double4* input,
                          __global uint* points,
                          __global double4* OutBuff,
                          __global double4* UpdateBuff)
{
    int gid = get_global_id(0);
    uint pt;
    for (int level = 0; level < N; level++)
    {
        for (int i = 0; i < blocksize; i++)
        {
            pt = points[gid*i*level];
            OutBuff[pt] = do_some_computations(UpdateBuff, ....);
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
        for (int j = 0; j < blocksize1; j++)
        {
            pt = points[gid*j*(level+1)];
            UpdateBuff[pt] = do_some_computations(OutBuff, ...);
        }
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
Is this related to using semaphores?

This is a common OpenCL misunderstanding. A barrier only synchronizes the work-items within a work-group, not across the global work size. There is no statement for global synchronization (because of how work-groups are executed: some run to completion before others even start). The solution for global synchronization is to use separate kernels. The first will run to completion, and then the second one will.
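For example, the kernel from the question could be split into two kernels, one per loop (a rough sketch based on the question's code; do_some_computations, N, blocksize, blocksize1 and the "..." arguments are assumed to be defined as in the original):

__kernel void compute_out(__global double4* input, __global uint* points,
                          __global double4* OutBuff, __global double4* UpdateBuff,
                          int level)
{
    int gid = get_global_id(0);
    for (int i = 0; i < blocksize; i++)
    {
        uint pt = points[gid*i*level];
        OutBuff[pt] = do_some_computations(UpdateBuff, ....);
    }
}

__kernel void update_buff(__global double4* input, __global uint* points,
                          __global double4* OutBuff, __global double4* UpdateBuff,
                          int level)
{
    int gid = get_global_id(0);
    for (int j = 0; j < blocksize1; j++)
    {
        uint pt = points[gid*j*(level+1)];
        UpdateBuff[pt] = do_some_computations(OutBuff, ...);
    }
}

The host then loops over level and enqueues compute_out followed by update_buff on an in-order command queue; each launch only starts after the previous one has finished for every global id, which provides exactly the global synchronization the in-kernel barrier cannot.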

Related

What is the point of running the same code under different threads - OpenMP?

From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {…}-enclosure). After the statement, the threads join back into one.
#pragma omp parallel
{
    // Code inside this region runs in parallel.
    printf("Hello!\n");
}
I want to understand what the point of running the same code under different threads is. In what kinds of cases can it be helpful?
By using omp_get_thread_num() you can retrieve the thread ID, which enables you to parametrize the so-called "same code" with respect to that thread ID.
Take this example:
A is a 1000-element integer array and you need to sum its values using 2 OpenMP threads.
You would design your code something like this:
int A_dim = 1000;
long sum[2] = {0, 0};
#pragma omp parallel num_threads(2)   // exactly 2 threads, as assumed above
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);
    int end   = (threadID + 1) * (A_dim / 2);
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound from which your thread will start summing (for example, thread #0 starts summing from 0, while thread #1 starts from 500).
end is the counterpart of start: it is the upper bound of the array indices the thread will sum up to (for example, thread #0 sums up to 500, i.e. the values from A[0] to A[499], while thread #1 sums up to 1000, i.e. the values from A[500] to A[999]).
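As a side note, the same sum can also be expressed with OpenMP's work-sharing and reduction clauses, which avoids indexing the sum array by thread ID (a minimal sketch under the same assumptions, i.e. A is an integer array of A_dim elements filled elsewhere):

long parallel_sum(const int *A, int A_dim)
{
    long sum = 0;
    // each thread accumulates into a private copy of sum over its share
    // of the iterations; OpenMP combines the private copies at the end
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < A_dim; i++)
        sum += A[i];
    return sum;
}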
I want to understand what the point of running the same code under different threads is. In what kinds of cases can it be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
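Sketched in OpenMP terms (invert_matrix() and the Matrix type are hypothetical placeholders, not real library calls):

void invert_all(Matrix *matrices, int count)
{
    // same code (the inversion routine), different data (one matrix per iteration);
    // the iterations are independent, so OpenMP can hand them to different threads
    #pragma omp parallel for
    for (int m = 0; m < count; m++)
        invert_matrix(&matrices[m]);
}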
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates the team of threads. Afterwards you can distribute/share the work with other constructs like omp for or omp task.
A possible benefit of this separation is, for example, when you have to allocate memory for each thread (i.e. thread-local data), as sketched below.
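A minimal sketch of that pattern (do_work() and the scratch-buffer size are made up for illustration): the parallel construct creates the threads, each thread allocates its own scratch memory once, and the omp for construct then shares the loop iterations among them.

#include <stdlib.h>

void process(double *data, int n)
{
    #pragma omp parallel
    {
        // one scratch buffer per thread, allocated once and reused
        // for every iteration this thread executes
        double *scratch = malloc(1024 * sizeof(double));

        #pragma omp for
        for (int i = 0; i < n; i++)
            do_work(&data[i], scratch);

        free(scratch);
    }
}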
I want to understand what the point of running the same code under different threads is. In what kinds of cases can it be helpful?
One example: in physics you have a random process (collisions, an initial Maxwellian distribution, etc.) in your code, and you need to run the code many times to get average results; in this case you need to run the same code several times.
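A minimal sketch of that pattern (run_simulation() is a hypothetical function representing one run of the stochastic code; each thread gets its own seed so the random streams differ):

#include <omp.h>

double average_result(int runs)
{
    double total = 0.0;
    #pragma omp parallel
    {
        // per-thread seed for a thread-safe generator such as rand_r()
        unsigned int seed = 1234u + omp_get_thread_num();

        #pragma omp for reduction(+:total)
        for (int r = 0; r < runs; r++)
            total += run_simulation(&seed);
    }
    return total / runs;
}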

OpenCL - Sync and Signal?

I want to use a sync/signal construction in OpenCL to make sure that only one thread can enter a critical kernel part.
Here is the code I have so far:
void sync(int barrierID) {
    int ID = get_global_id(0);
    barrier(CLK_GLOBAL_MEM_FENCE);
    while (ID - barrierID != 0) {
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}

//critical part

void signal(int threadCount, int barrierID) {
    barrierID++;
    barrier(CLK_GLOBAL_MEM_FENCE);
    while (barrierID != threadCount) {
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    barrierID = 0;
}
Here threadCount is the number of threads that want to access the critical part, and barrierID is the counter for how many threads have passed it.
Unfortunately, this code does not work in OpenCL.
Does anyone know how to fix this code?
You are approaching GPU computing as if it were CPU multithreading, which is the wrong approach.
The reason is that in GPU computing all the "threads" (in reality they are work-items) run at the same time. A work-item cannot enter a zone and run code while the others are doing something else.
Therefore, heavy branching on the GPU is a bad idea: it slows down your application, because the GPU has to run all the branches even if some work-items do not take them.
For your specific case:
You are getting a deadlock in your kernel because you placed a barrier inside a branch. After a single work-item enters, it will wait until all the others have entered as well. If that never happens, you have a deadlock.
Check the barrier command: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/barrier.html
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
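As an illustration of that rule, the usual pattern is to keep any divergent work inside the conditional but place the barrier itself outside it, so every work-item in the group reaches it. A minimal local-memory reduction sketch (unrelated to the question's buffers, just to show the placement; it assumes the work-group size is a power of two):

__kernel void reduce_local(__global const float* in,
                           __global float* out,
                           __local float* tmp)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tmp[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // reached by every work-item

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)                     // divergent work is fine...
            tmp[lid] += tmp[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);         // ...as long as the barrier is outside the if
    }

    if (lid == 0)
        out[get_group_id(0)] = tmp[0];
}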

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
int split_bigtask(Inputs * inputs, Outputs * outputs)
{
    for (int k = 0; k < MAXNO; k++)
        outputs->solved[k].value = 0;
    #pragma omp parallel shared(inputs, outputs)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            // res is declared inside the loop, so each thread has its own copy
            int res = bigtask(inputs->values[k],
                              outputs->solved[k],
                              omp_get_thread_num()
                              );
        }
    }
    return TRUE;
}
I assume that there is no synchronization done within bigtask() (obvious, but I'd still check it first). Beyond that, two likely causes:
- A "dirty cache" (false sharing) problem: if you manipulate data that is close together (e.g. in the same cache line!) from multiple cores, each manipulation marks the cache line as dirty, which means the processor has to signal this to all other processors, which in turn involves synchronization again...
- You create too many threads (creating a thread is quite an overhead, so creating one thread per core is a lot more efficient than creating several threads per core).
I personally would assume that you have the dirty-cache case (the big global array).
Solutions to the problem (if it is indeed that case):
- Write the results to a local array which is merged into the big global array by the main thread after the work is done.
- Split the global array into several smaller arrays (and give each thread one of them).
- Ensure that the records within the structure align on cache-line boundaries (this is a bit of a hack, since cache-line boundaries may change for future processors); see the sketch after this list.
You may want to try to create a local copy of the array for each thread (at least for the results).
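As a concrete illustration of the cache-line alignment suggestion (a sketch only; the 64-byte line size and the field layout are assumptions, not taken from the question's actual struct, and MAXNO is reused from the question's code):

// pad each result record to a full cache line so that two threads
// writing neighbouring elements never touch the same line
#define CACHE_LINE 64

typedef struct {
    double value;
    char   pad[CACHE_LINE - sizeof(double)];
} PaddedResult;

PaddedResult solved[MAXNO];   // replaces the tightly packed result array

The same effect can be achieved by letting each thread write into its own local array and merging afterwards, as suggested above.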

Reading a Global Variable from a Thread and Writing to that Variable from another Thread

My program has 2 threads and an int global variable. One thread reads from that variable and the other thread writes to it. Should I use a mutex lock in this situation?
These functions are executed by the 2 threads simultaneously and repeatedly in my program.
void thread1()
{
    if ( condition1 )
        iVariable = 1;
    else if ( condition2 )
        iVariable = 2;
}

void thread2()
{
    if ( iVariable == 1 ) {
        //do something
    } else if ( iVariable == 2 ) {
        //do another thing
    }
}
If you don't use any synchronization, then it is entirely unpredictable when the 2nd thread sees the updated value. This ranges somewhere between a handful of nanoseconds and never. The never outcome is particularly troublesome, of course; it can happen on an x86 processor when you don't declare the variable volatile and you run the Release build of your program. It can take a long time on processors with a weak memory model, like ARM cores. The only thing you don't have to worry about is seeing a partially updated value; int updates are atomic.
That's about all that can be said about the posted code. Fine-grained locking rarely works well.
Yes, you should (under most circumstances). Mutexes will ensure that the data you are protecting will be correctly visible from multiple contending CPUs. Unless you have a performance problem, you should use a mutex. If performance is an issue, look into lock-free data structures.
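For the specific read-one-int / write-one-int pattern in the question, a C11 atomic also gives the required visibility without a lock (a minimal sketch; condition1 and condition2 stand in for whatever the original code tests):

#include <stdatomic.h>

atomic_int iVariable = 0;

void thread1(void)
{
    if (condition1)
        atomic_store(&iVariable, 1);   // publishes the new value to other threads
    else if (condition2)
        atomic_store(&iVariable, 2);
}

void thread2(void)
{
    int v = atomic_load(&iVariable);   // never sees a torn value, and sees updates eventually
    if (v == 1) {
        // do something
    } else if (v == 2) {
        // do another thing
    }
}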

multiple threads but only one allowed to use method

So basically the situation I am in is I have a bunch of threads each doing different calculations throughout the week. At the end of the week, every thread calls function X() and then starts calculating for the next week and repeats this cycle.
However, only one thread is allowed to actually do the operations in method X() and only when all threads have reached method X(). Furthermore, none of the threads can continue on their way until the one thread that got to use method X() is finished.
So I'm having difficulty implementing this. I feel like I need to use a condition variable but I'm still shaky with threads and whatnot.
Barriers are a useful synchronization method here.
In pthreads, you can use two barriers, each initialized to require however many threads are running. The first synchronizes the threads after they've finished calculating, and the second after one of them has called X(). Conveniently, pthread_barrier_wait will elect one and only one of your N waiting threads to actually call X():
void *my_thread(void *whatever) { // XXX error checking omitted
    while (1) {
        int rc;
        do_intense_calculations();
        // Wait for all calculations to finish
        rc = pthread_barrier_wait(&calc_barrier);
        // Am I nominated to run X() ?
        if (rc == PTHREAD_BARRIER_SERIAL_THREAD) X();
        // Wait for everyone, including whoever is doing X()
        rc = pthread_barrier_wait(&x_barrier);
    }
}
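The two barriers would be initialized once, before the worker threads are started, with the number of participating threads (a small sketch; NUM_THREADS is an assumed constant):

#include <pthread.h>

#define NUM_THREADS 8              // assumed number of worker threads

pthread_barrier_t calc_barrier, x_barrier;

void init_barriers(void)
{
    // both barriers are crossed by the same NUM_THREADS workers each cycle
    pthread_barrier_init(&calc_barrier, NULL, NUM_THREADS);
    pthread_barrier_init(&x_barrier, NULL, NUM_THREADS);
}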
Java's CyclicBarrier with a Runnable argument would let you do the same thing with but one barrier. (The Runnable is run after all parties arrive but before any are released.)
