OpenMP C++: Overlapping computations with communications - multithreading

I need to read a file in chunks, and since I am reading it in chunks, I thought I could overlap the communications with the computations.
I need three active threads, so that while T1 is reading data, T2 can be processing and T3 can be writing, all at the same time.
So far, I have managed to create three local buffers into which I read the data from the file, taking N elements at a time. Every time I reach N elements, I advance thread_id, which means that the next thread should store the next N elements in its corresponding local buffer.
The named #pragma omp critical sections make sure that while one thread is processing or writing data, no other thread is doing the same; however, while T1 is processing its chunk of data, T2 can still write its output. This ensures the input is read, and the output written, in the correct order.
What I have so far is this:
int i = 0;
int thread_id = 0;
while (input_file >> stream_element) {
    // store the element in the current thread's buffer
    if (thread_id == 0)
        input_buffer_1[i] = stream_element;
    else if (thread_id == 1)
        input_buffer_2[i] = stream_element;
    else
        input_buffer_3[i] = stream_element;
    i++;

    if (i == N) { // a full chunk has been read
        i = 0;
        if (thread_id == 0) {
            #pragma omp critical (processing_data)
            process(input_buffer_1, output_buffer_1, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_1, N);
        }
        else if (thread_id == 1) {
            #pragma omp critical (processing_data)
            process(input_buffer_2, output_buffer_2, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_2, N);
        }
        else {
            #pragma omp critical (processing_data)
            process(input_buffer_3, output_buffer_3, N);
            #pragma omp critical (writing_data)
            write(output_file, output_buffer_3, N);
        }
        // hand the next chunk to the next thread
        thread_id++;
        if (thread_id == STREAM_THREADS)
            thread_id = 0;
    }
}
What would be the best way to initiate three threads and have them do these tasks in a parallel and synchronized manner?
A simple visualization of the desired result would show T1 reading chunk k+1 while T2 processes chunk k and T3 writes chunk k-1. [original figure not available]
Best regards,
Suejb
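One possible way (a sketch only, untested) to obtain exactly this overlap is to let OpenMP tasks express the pipeline instead of manual thread IDs. Here num_chunks, NUM_BUFFERS, read_chunk() and the buffer arrays are hypothetical names; process() and write() are assumed to have the signatures used above:

#pragma omp parallel
#pragma omp single
for (int c = 0; c < num_chunks; c++) {
    int b = c % NUM_BUFFERS; // rotate through a small pool of buffers
    // reads stay in file order via the inout dependence on input_file
    #pragma omp task depend(inout: input_file) depend(out: input_buffer[b])
    read_chunk(input_file, input_buffer[b], N);
    // processing chunk c may overlap the read of chunk c+1
    #pragma omp task depend(in: input_buffer[b]) depend(out: output_buffer[b])
    process(input_buffer[b], output_buffer[b], N);
    // writes stay in chunk order via the inout dependence on output_file
    #pragma omp task depend(in: output_buffer[b]) depend(inout: output_file)
    write(output_file, output_buffer[b], N);
}

Since the stages of different chunks only conflict on the shared buffers and the two file streams, the runtime can execute a read, a process and a write at the same time, which is the T1/T2/T3 overlap described above.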

Related

How to make a thread wait for another to finish using OpenMP threads?

I want to parallelize the following loop that fills the A matrix. For every A[i][j] element that is calculated, I want the values in A[i-1][j], A[i-1][j-1] and A[i][j-1] to have been calculated first, so my thread has to wait for the threads working on those positions to have produced their results. I've tried to achieve this like this:
#pragma omp parallel for num_threads(threadcnt) \
    default(none) shared(A, lenx, leny) private(j)
for (i = 1; i < lenx; i++)
{
    for (j = 1; j < leny; j++)
    {
        // spin until the three neighbouring cells have been computed
        do
        {
        } while (A[i-1][j-1] == -1 || A[i-1][j] == -1 || A[i][j-1] == -1);
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}
My A matrix is initialized to -1, so if an element equals -1 the operation for that cell has not completed yet. It takes more time than the serial program though. Any idea how to avoid the while loop?
The waiting loop seems sub-optimal. Apart from burning cores that are spin-waiting, you will also need a plethora of well-placed flush directives to make this code work.
One alternative, especially in the context of a more general parallelization scheme, would be to use tasks and task dependences to model the dependences between the different array elements:
#pragma omp parallel
#pragma omp single
for (i = 1; i < lenx; i++) {
    for (j = 1; j < leny; j++) {
        #pragma omp task depend(in: A[i-1][j-1], A[i-1][j], A[i][j-1]) depend(out: A[i][j])
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}
You may want to think about blocking the matrix updates, so that each task receives a block of the matrix instead of a single element, but the general idea will remain the same (see the sketch below).
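For illustration, a rough (untested) sketch of such a blocked variant could look as follows. BS, NBX and NBY (the block size and the number of blocks per dimension) and the dep token grid are hypothetical additions; dep carries one dependence token per block, with a halo row and column so that boundary blocks have valid addresses to name — a token that no task ever writes simply creates no dependence:

char dep[NBX+1][NBY+1]; // one token per block, plus an unused halo

#pragma omp parallel
#pragma omp single
for (int I = 1; I <= NBX; I++) {
    for (int J = 1; J <= NBY; J++) {
        // a block waits for its upper-left, upper and left neighbours
        #pragma omp task depend(in: dep[I-1][J-1], dep[I-1][J], dep[I][J-1]) \
                         depend(out: dep[I][J])
        for (int i = 1 + (I-1)*BS; i < 1 + I*BS && i < lenx; i++)
            for (int j = 1 + (J-1)*BS; j < 1 + J*BS && j < leny; j++)
                A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
    }
}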
Another useful OpenMP feature could be the ordered construct and its ability to express exactly this kind of data dependency. Note that the loop directive needs an ordered(2) clause for the two-level dependence, each iteration first waits for its neighbours (sink) and then signals its own completion (source):
#pragma omp parallel for ordered(2)
for (int i = 1; i < lenx; i++) {
    for (int j = 1; j < leny; j++) {
        // wait for the three neighbouring iterations to complete
        #pragma omp ordered depend(sink: i-1,j-1) depend(sink: i-1,j) depend(sink: i,j-1)
        A[i][j] = max(A[i-1][j-1]+1, A[i-1][j]-1, A[i][j-1]-1);
        // signal that iteration (i,j) is done
        #pragma omp ordered depend(source)
    }
}
PS: The code above is untested, but it should get the rough idea across.
Your solution cannot work. As A is initialized to -1 and A[0][j] is never modified, for i==1 the loop will test A[0][j] and spin forever. By the way, if A is initialized to -1, how can you get anything but -1 out of the max?
When you have a dependency problem, there are two solutions.
The first one is to sequentialize your code. OpenMP has the ordered directive to do that, but the drawback is that you lose parallelism (while still paying the thread creation cost). OpenMP 4.5 has a way to describe dependencies with the depend and sink/source clauses, but I do not know how efficiently compilers deal with that, and my compilers (gcc 7.3 and clang 6.0) do not support this feature.
The second solution is to modify your algorithm to remove the dependencies. At present you are computing the maximum of all values to the left of or above a given element. Let's turn it into a simpler problem first: compute the maximum of all values to the left of a given element. We can easily parallelize that over the different rows, as there is no inter-row dependency.
// compute b[i][j] = max_{k<=j} a[i][k]
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        // running maximum along the row
        if (j != 0) b[i][j] = max(a[i][j], b[i][j-1]);
        else        b[i][j] = a[i][j]; // left column initialised to the value of a
    }
}
Now consider another simple problem: computing the prefix maximum down the different columns. It is again easy to parallelize, but this time on the inner loop, as there is no inter-column dependency.
// compute c[i][j] = max_{k<=i} b[k][j]
for (int i = 0; i < n; i++) {
    #pragma omp parallel for
    for (int j = 0; j < n; j++) {
        // running maximum down the column
        if (i != 0) c[i][j] = max(b[i][j], c[i-1][j]);
        else        c[i][j] = b[i][j]; // upper row initialised to the value of b
    }
}
Now you just have to chain these two passes to get the expected result. Here is the final code (using a single array, with some cleanup):
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    for (int j = 1; j < n; j++) {
        // prefix max per row
        a[i][j] = max(a[i][j], a[i][j-1]);
    }
}
for (int i = 1; i < n; i++) {
    #pragma omp parallel for
    for (int j = 0; j < n; j++) {
        // prefix max per column
        a[i][j] = max(a[i][j], a[i-1][j]);
    }
}

How to use multiple threads for multiple file reading/writing operations in C?

I'm trying to analyse the optimum number of threads required to perform large file reading/writing operations on multiple files.
So how do I proceed with creating multiple threads and assigning some number of files to each thread to speed up the execution time?
Language used: C
Look into https://en.wikipedia.org/wiki/OpenMP for creating threads and managing them efficiently for your analysis.
In your case, you would either assign distinct pieces of the workload to individual threads via sections, or let a team of threads receive an equal share of the work. Examples below:
#include <omp.h>
#include <stdio.h>

void function_1(void); // placeholder R/W routines
void function_2(void);

int main(int argc, char **argv)
{
    // split work into independent sections
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            function_1(); // where you perform R/W operations
        }
        #pragma omp section
        {
            function_2(); // where you perform R/W operations or other
        }
    }

    // split work among a team of N threads equally
    #pragma omp parallel
    {
        // where you perform R/W operations
    }

    printf("Number of threads available: %d\n", omp_get_max_threads());

    /*
     * Example for loop: with 2 threads, one gets a version of the loop
     * with i running from 0 to 49999 while the second gets a version
     * running from 50000 to 99999.
     */
    int arr[100000];
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        arr[i] = 2 * i;
    }

    return 0;
}
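To connect this to the actual question of assigning files to threads: a minimal sketch (the file names and the process_file() helper are made up for illustration) could distribute one file per loop iteration, letting OpenMP hand whole files out to threads:

#include <omp.h>
#include <stdio.h>

#define NUM_FILES 8

void process_file(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) return;
    /* ... read, transform, write ... */
    fclose(f);
}

int main(void)
{
    const char *files[NUM_FILES] = {
        "in0.dat", "in1.dat", "in2.dat", "in3.dat",
        "in4.dat", "in5.dat", "in6.dat", "in7.dat"
    };

    // one whole file per iteration keeps iterations independent;
    // dynamic scheduling balances unevenly sized files
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < NUM_FILES; i++)
        process_file(files[i]);

    return 0;
}

Measuring the run time with omp_get_wtime() while varying the thread count via omp_set_num_threads() would then show where I/O saturates and adding threads stops helping.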

OpenMP Multithreads becoming one thread

I am programming with OpenMP to learn about multithreading. Is it possible for any of the 11 threads in this case to reach the return statement at the end while some threads are still working on something in the for loop? Or do they become one master thread again after line 13?
int np, iam;
#pragma omp parallel private(np, iam) num_threads(11)
{
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    #pragma omp for
    for (int i = 2; i < 100; i++) {
        std::cout << i;
        doStuff(i);
    }
} // line 13
// synchronize necessary?
return 0;
There is an implicit barrier at the end of the parallel construct, so no synchronization is necessary. Any further code is executed only by the master thread (the one that had thread_num == 0 within the parallel region), and only after all threads have reached the end of the parallel region.
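A minimal toy example (not from the question) illustrating those semantics:

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        printf("worker %d\n", omp_get_thread_num());
    } // implicit barrier: all four threads must reach this point first

    // executed once, by the master thread, only after the barrier
    printf("done\n");
    return 0;
}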

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
This is what I've done so far; I have no idea if I'm on the right track.
#include <omp.h>
#include <stdio.h>

int main() {
    double start_time, run_time;
    omp_set_num_threads(4);
    start_time = omp_get_wtime();
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (int n = 0; n < 100; n++) {
        printf("square of %d=%d\n", n, n*n);
        printf("cube of %d=%d\n", n, n*n*n);
        int ID = omp_get_thread_num();
        printf("Thread(%d) \n", ID);
    }
    run_time = omp_get_wtime() - start_time;
    printf("Time Elapsed (%f)", run_time);
    getchar();
}
First you need a loop where the distribution actually makes a difference. Your loop has only 100 iterations, so the OpenMP schedule decides only 100 times which iteration a thread gets next, and that takes no measurable time. The printf output, on the other hand, takes very long, so in your code it makes no difference which schedule is used. It is better to use a loop without console output and with a very high iteration count, like:
double value;
#pragma omp parallel
{
    #pragma omp for schedule(static) private(value)
    for (int i = 0; i < 100000000; i++) {
        value = ... // some computation, no console output
    }
}
Finally, the loop body has to produce a "result" that is used after the loop, with a printf for example; otherwise the compiler may delete the body during optimization (its result is never used later, so it is not needed). That way the time measurement covers only the parallel part, without any output inside it.
If your iterations take nearly the same time, a static distribution should be faster. If they differ very much, the dynamic and guided schedules should dominate your measurements. A sketch putting these pieces together follows.
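A minimal sketch along these lines (untested; the busy-wait delay() and its amount parameter are made up to stand in for the exercise's delay function):

#include <omp.h>
#include <stdio.h>

// hypothetical delay: busy work proportional to amount
void delay(int amount)
{
    volatile double x = 0.0;
    for (int k = 0; k < amount * 1000; k++)
        x += k * 0.5;
}

int main()
{
    const int n = 100000;
    double sum = 0.0;

    omp_set_num_threads(4);
    double start = omp_get_wtime();

    // swap schedule(static) for schedule(dynamic) or schedule(guided)
    // and re-measure; vary n and the delay amount as the exercise asks
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        delay(10);
        sum += i; // result is used below, so the loop is not optimized away
    }

    double elapsed = omp_get_wtime() - start;
    printf("sum = %f, elapsed = %f s\n", sum, elapsed);
    return 0;
}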

Avoid thread creation overhead in OpenMP

I am using OpenMP to parallelize a part of the code in HEVC. The basic structure of the code is given below:
void funct()
{
    for (...)
    {
        #pragma omp parallel for private(....)
        for (...)
        {
            // do some parallel work
        } // end of inner for loop
        // other tasks
    } // end of outer for loop
} // end of function
Now I have modified the inner for loop so that the code is parallelized and every thread performs its task independently. I am not getting any errors, but the overall processing time is higher with multiple threads than it would have been with a single thread. I guess the main reason is that for every iteration of the outer loop there is thread creation overhead for the inner loop. Is there any way to avoid this issue, or any way to create the threads only once? I cannot parallelize the outer for loop since I have made modifications in the inner for loop to enable each thread to work independently. Please suggest any possible solutions.
You can use separate directives #pragma omp parallel and #pragma omp for.
#pragma omp parallel creates the parallel threads, whereas #pragma omp for distributes the work among them. For the sequential part of the outer loop you can use #pragma omp single.
Here is an example:
int n = 3, m = 10;
#pragma omp parallel
{
    for (int i = 0; i < n; i++) {
        #pragma omp single
        {
            printf("Outer loop part 1, thread num = %d\n",
                   omp_get_thread_num());
        }
        #pragma omp for
        for (int j = 0; j < m; j++) {
            int thread_num = omp_get_thread_num();
            printf("j = %d, Thread num = %d\n", j, thread_num);
        }
        #pragma omp single
        {
            printf("Outer loop part 2, thread num = %d\n",
                   omp_get_thread_num());
        }
    }
}
But I am not sure whether it will help you or not. To diagnose OpenMP performance issues, it is better to use a profiler such as Scalasca or VTune.
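Mapped onto the structure from the question, the pattern might look roughly like this sketch (n_outer and n_inner are placeholder bounds; the thread team is created once, outside the outer loop):

void funct()
{
    #pragma omp parallel // threads are created only once, here
    {
        for (int outer = 0; outer < n_outer; outer++)
        {
            #pragma omp for // add the private(...) clauses of the original loop
            for (int inner = 0; inner < n_inner; inner++)
            {
                // do some parallel work
            } // implicit barrier at the end of the for construct

            #pragma omp single
            {
                // "other tasks": the sequential part of each outer iteration
            }
        }
    }
}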
