Pass multiple args to thread using struct (pthread)

I'm learning to program with pthreads by writing an adder program. After consulting several examples, I still don't understand how to pass multiple arguments into a thread using a struct. Here is my buggy program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

typedef struct s_addition {
    int num1;
    int num2;
    int sum;
} addition;

void *thread_add_function(void *ad)
{
    printf("ad.num1:%d, ad.num2:%d\n", ad.num1, ad.num2);
    ad.sum = ad.num1 + ad.num2;
    pthread_exit(0);
}

int main()
{
    int N = 5;
    int a[N], b[N], c[N];
    srand(time(NULL));
    // fill them with random numbers
    for (int j = 0; j < N; j++) {
        a[j] = rand() % 392;
        b[j] = rand() % 321;
    }
    addition ad1;
    pthread_t thread[N];
    for (int i = 0; i < N; i++) {
        ad1.num1 = a[i];
        ad1.num2 = b[i];
        printf("ad1.num1:%d, ad1.num2:%d\n", ad1.num1, ad1.num2);
        pthread_create(&thread[i], NULL, thread_add_function, &ad1);
        pthread_join(thread[i], NULL);
        c[i] = ad.sum;
    }
    printf("This is the result of using pthread.\n");
    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
}
But when compiling I get the following error:
vecadd_parallel.c:15:39: error: member reference base type 'void *' is not a
structure or union
    printf ("ad.num1:%d, ad.num2:%d\n",ad.num1, ad.num2);
I've tried to fix it but still can't figure it out. What am I doing wrong?

Seems like you have a problem trying to access the members of a void * type.
You need to cast the parameter of thread_add_function to the correct type, e.g. addition *add = (addition *)ad;, and then use this variable in your function (note that you also have to change your .'s to ->'s, because it's a pointer).
You also should generally pass only malloc()'d data to threads, as stack-allocated data may not outlive its scope. It should be fine for the current implementation, but changes later could easily give strange, unpredictable behaviour.
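Putting the two fixes together, here is a minimal corrected sketch of your program (an illustration, not the only possible fix): the argument is cast back to addition *, each thread gets its own struct so the threads don't overwrite one another's arguments, and the joins happen only after all the creates, so the threads actually run in parallel.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

typedef struct s_addition {
    int num1;
    int num2;
    int sum;
} addition;

void *thread_add_function(void *arg)
{
    addition *ad = (addition *)arg;   /* cast the void * back to addition * */
    ad->sum = ad->num1 + ad->num2;    /* members of a pointer use ->, not . */
    pthread_exit(NULL);
}

int main(void)
{
    enum { N = 5 };
    int a[N], b[N], c[N];
    addition ad[N];                   /* one struct per thread: no shared argument */
    pthread_t thread[N];
    srand(time(NULL));
    for (int j = 0; j < N; j++) {
        a[j] = rand() % 392;
        b[j] = rand() % 321;
    }
    for (int i = 0; i < N; i++) {
        ad[i].num1 = a[i];
        ad[i].num2 = b[i];
        pthread_create(&thread[i], NULL, thread_add_function, &ad[i]);
    }
    for (int i = 0; i < N; i++) {     /* join after all creates, so the threads overlap */
        pthread_join(thread[i], NULL);
        c[i] = ad[i].sum;
    }
    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
Note that the ad[] array lives on main's stack, which is safe here only because every thread is joined before main returns; as noted above, malloc()'d storage is the safer choice if the creating function might return first.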

Related

Non-collective write using a file view

When trying to write blocks to a file, with the blocks being unevenly distributed across my processes, one can use MPI_File_write_at with the right offset. As this function is not a collective operation, this works well.
Example:
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global = 7; // prime helps have unbalanced procs
    int local = (global/size) + (global%size>rank?1:0);
    int strsize = 5;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    for (int i=0; i<local; ++i)
    {
        size_t idx = i * size + rank;
        std::string buffer = std::string(strsize, 'a' + idx);
        size_t offset = buffer.size() * idx;
        MPI_File_write_at(fh, offset, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
However, for more complex writes, particularly when writing multidimensional data like raw images, one may want to create a view on the file with MPI_Type_create_subarray. However, when using this method with a plain MPI_File_write (which is supposed to be non-collective), I run into deadlocks. Example:
#include <cstdio>
#include <cstdlib>
#include <string>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global = 7; // prime helps have unbalanced procs
    int local = (global/size) + (global%size>rank?1:0);
    int strsize = 5;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    for (int i=0; i<local; ++i)
    {
        size_t idx = i * size + rank;
        std::string buffer = std::string(strsize, 'a' + idx);

        int dim = 2;
        int gsizes[2] = { (int)buffer.size(), global };
        int lsizes[2] = { (int)buffer.size(), 1 };
        int offset[2] = { 0, (int)idx };

        MPI_Datatype filetype;
        MPI_Type_create_subarray(dim, gsizes, lsizes, offset, MPI_ORDER_C, MPI_CHAR, &filetype);
        MPI_Type_commit(&filetype);
        MPI_File_set_view(fh, 0, MPI_CHAR, filetype, "native", MPI_INFO_NULL);
        MPI_File_write(fh, buffer.c_str(), buffer.size(), MPI_CHAR, MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
How can I keep such code from deadlocking? Keep in mind that my real code will really use the multidimensional capabilities of MPI_Type_create_subarray and cannot just use MPI_File_write_at.
Also, it is difficult for me to know the maximum number of blocks on any one process, so I'd like to avoid doing an all-reduce and then looping over that maximum number of blocks with empty writes when localnb <= id < maxnb.
You don't use MPI_REDUCE when you have a variable number of blocks per node. You use MPI_SCAN or MPI_EXSCAN: MPI IO Writing a file when offset is not known
MPI_File_set_view is collective, so if 'local' is different on each processor, you'll find yourself calling a collective routine from fewer than all the processors in the communicator. If you really, really need to do so, open the file with MPI_COMM_SELF.
The MPI_SCAN approach means each process can set the file view as needed, and then you can call the collective MPI_File_write_at_all (even if some processes have zero work -- they still need to participate) and take advantage of whatever clever optimizations your MPI-IO implementation provides.
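For illustration, a minimal sketch of that approach under the question's setup (names such as global, local, and strsize follow the examples above; unlike the question's interleaved layout, this gives each rank a contiguous run of blocks, which is what the prefix sum naturally produces):
#include <mpi.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int global = 7, strsize = 5;
    int local = (global / size) + (global % size > rank ? 1 : 0);

    /* exclusive prefix sum: how many blocks the lower ranks own */
    int first = 0;
    MPI_Exscan(&local, &first, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) first = 0;   /* MPI_Exscan leaves rank 0's recvbuf undefined */

    /* every rank must agree on the iteration count for collective calls */
    int maxlocal = 0;
    MPI_Allreduce(&local, &maxlocal, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    char buf[16];
    for (int i = 0; i < maxlocal; ++i) {
        int have = (i < local);
        memset(buf, 'a' + ((first + i) % 26), strsize);
        MPI_Offset off = (MPI_Offset)(first + i) * strsize;
        /* collective: ranks past their last block write zero elements */
        MPI_File_write_at_all(fh, off, buf, have ? strsize : 0,
                              MPI_CHAR, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
Every rank loops maxlocal times so each MPI_File_write_at_all call is matched on all ranks; ranks that have run out of blocks simply write zero elements. The same pattern works with MPI_File_set_view and a subarray type in place of the explicit offsets.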

Pthread create function

I am having difficulty understanding the creation of a pthread.
This is the function I declared at the beginning of my code:
void *mini(void *numbers); //Thread calls this function
Initialization of the thread:
pthread_t minThread;
pthread_create(&minThread, NULL, (void *) mini, NULL);
void *mini(void *numbers)
{
    min = (numbers[0]);
    for (i = 0; i < 8; i++)
    {
        if (numbers[i] < min)
        {
            min = numbers[i];
        }
    }
    pthread_exit(0);
}
numbers is an array of integers
int numbers[8];
I'm not sure if I created the pthread correctly.
In the function mini, I get the following error about setting min (declared as an int) equal to numbers[0]:
Assigning to 'int' from incompatible type 'void'
My objective is to compute the minimum value of numbers[] (min) in this thread and later pass it to another thread to display it. Thanks for any help I can get.
You need to pass 'numbers' as the last argument to pthread_create(). The new thread can then call 'mini' on its own stack with 'numbers' as the argument.
In 'mini', you should cast the void * back to an integer pointer in order to dereference it correctly - you cannot dereference a void * directly; it does not point to anything :)
Also, it's very confusing to have multiple vars in different threads with the name 'numbers'.
There are some minor improprieties in this program, but it illustrates basically what you want to do. You should play around with it, break it, and improve it.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *mini(void *numbs)
{
    int *numbers = (int *) numbs;     /* cast the void * back to int * */
    int *min = malloc(sizeof(int));   /* heap storage outlives the thread */
    *min = numbers[0];
    for (int i = 0; i < 8; i++)
        if (numbers[i] < *min)
            *min = numbers[i];
    pthread_exit(min);                /* return the result via the exit value */
}

int main(int argc, char *argv[])
{
    pthread_t minThread;
    int *min;
    int numbers[8] = {28, 47, 36, 45, 14, 23, 32, 16};
    /* no cast on mini: it already has the signature pthread_create expects */
    pthread_create(&minThread, NULL, mini, numbers);
    pthread_join(minThread, (void **) &min);
    printf("min: %d\n", *min);
    free(min);
    return 0;
}

C++ 11 std::thread strange behavior

I am experimenting a bit with std::thread and C++11, and I am encountering strange behaviour.
Please have a look at the following code:
#include <cstdlib>
#include <thread>
#include <vector>
#include <iostream>

void thread_sum_up(const size_t n, size_t& count) {
    size_t i;
    for (i = 0; i < n; ++i);
    count = i;
}

class A {
public:
    A(const size_t x) : x_(x) {}

    size_t sum_up(const size_t num_threads) const {
        size_t i;
        std::vector<std::thread> threads;
        std::vector<size_t> data_vector;
        for (i = 0; i < num_threads; ++i) {
            data_vector.push_back(0);
            threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
        }
        std::cout << "Threads started ...\n";
        for (i = 0; i < num_threads; ++i)
            threads[i].join();
        size_t sum = 0;
        for (i = 0; i < num_threads; ++i)
            sum += data_vector[i];
        return sum;
    }

private:
    const size_t x_;
};

int main(int argc, char* argv[]) {
    const size_t x = atoi(argv[1]);
    const size_t num_threads = atoi(argv[2]);
    A a(x);
    std::cout << a.sum_up(num_threads) << std::endl;
    return 0;
}
The main idea here is that I want to specify a number of threads which do independent computations (in this case, simple increments).
After all threads are finished, the results should be merged in order to obtain an overall result.
Just to clarify: This is only for testing purposes, in order to get me understand how
C++11 threads work.
However, when I compile this code using the command
g++ -o threads threads.cpp -pthread -O0 -std=c++0x
on an Ubuntu box, I get very strange behaviour when I execute the resulting binary.
For example:
$ ./threads 1000 4
Threads started ...
Segmentation fault (core dumped)
(should yield the output: 4000)
$ ./threads 100000 4
Threads started ...
200000
(should yield the output: 400000)
Does anybody have an idea what is going on here?
Thank you in advance!
Your code has many problems (see even thread_sum_up, which has about 2-3 bugs), but the main bug I found by glancing at your code is here:
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
See, when you push_back into a vector (I'm talking about data_vector), it can move all the previous data around in memory. But then you take the address of (a reference to) a cell for your thread, and then push back again, making the previous reference invalid.
This will cause you to crash.
For an easy fix, add data_vector.reserve(num_threads); just after creating the vector.
Edit at your request - some bugs in thread_sum_up
void thread_sum_up(const size_t n, size_t& count) {
    size_t i;
    for (i = 0; i < n; ++i); // see that last ';' there? it makes the loop body empty - it shouldn't be there
    count = i;               // you're just setting count to i - why do that after a loop? did you mean += inside it?
}
The cause of your crash might be that std::ref(data_vector[i]) is invalidated by the next push_back into data_vector. Since you know the number of threads, do a data_vector.reserve(num_threads) before you start spawning threads, to keep the references from being invalidated.
As you grow the vector with calls to push_back, it is likely to have to reallocate its storage, invalidating references to the contained values. The threads then write through dangling references, which is undefined behavior.
Your options are to pre-allocate the size you need (vector::reserve is one option) or to choose a different container.
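For reference, a sketch with that fix applied (the class wrapper from the question is dropped for brevity; the key point is that data_vector reaches its final size before any thread starts, so no reference is ever invalidated):
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

void thread_sum_up(const size_t n, size_t& count) {
    size_t i;
    for (i = 0; i < n; ++i)
        ;                     // busy loop, as in the question
    count = i;                // i == n here
}

size_t sum_up(const size_t x, const size_t num_threads) {
    std::vector<size_t> data_vector(num_threads, 0);  // full size up front: no reallocation
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (size_t i = 0; i < num_threads; ++i)
        threads.emplace_back(thread_sum_up, x, std::ref(data_vector[i]));
    for (auto& t : threads)
        t.join();
    size_t sum = 0;
    for (size_t v : data_vector)
        sum += v;
    return sum;
}

int main() {
    std::cout << sum_up(1000, 4) << std::endl;  // prints 4000
    return 0;
}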

Why does calculation with OpenMP take 100x more time than with a single thread?

I am trying to test a Pi calculation program with OpenMP. I have this code:
#pragma omp parallel private(i, x, y, myid) shared(n) reduction(+:numIn) num_threads(NUM_THREADS)
{
    printf("Thread ID is: %d\n", omp_get_thread_num());
    myid = omp_get_thread_num();
    printf("Thread myid is: %d\n", myid);
    for (i = myid*(n/NUM_THREADS); i < (myid+1)*(n/NUM_THREADS); i++) {
        //for(i = 0; i < n; i++) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        if (x*x + y*y <= 1) numIn++;
    }
    printf("Thread ID is: %d\n", omp_get_thread_num());
}
return 4. * numIn / n;
}
When I compile with gcc -fopenmp pi.c -o hello_pi and run it with time ./hello_pi for n = 1000000000, I get
real 8m51.595s
user 4m14.004s
sys 60m59.533s
When I run it with a single thread I get
real 0m20.943s
user 0m20.881s
sys 0m0.000s
Am I missing something? It should be faster with 8 threads; I have an 8-core CPU.
Please take a look at
http://people.sc.fsu.edu/~jburkardt/c_src/openmp/compute_pi.c
This might be a good implementation of pi computation.
It is quite important to know how your data is spread across the threads and how OpenMP collects it back. Usually, a bad design (one with data dependencies across threads) running on multiple threads results in slower execution than a single thread.
rand() in stdlib.h is not thread-safe. Using it in a multithreaded environment causes a race condition on its hidden state variables, and thus leads to poor performance.
http://man7.org/linux/man-pages/man3/rand.3.html
In fact, the following code works well as an OpenMP demo.
$ gcc -fopenmp -o pi pi.c -O3; time ./pi
pi: 3.141672
real 0m4.957s
user 0m39.417s
sys 0m0.005s
code:
#include <stdio.h>
#include <omp.h>

int main()
{
    const int n = 50000;
    const int NUM_THREADS = 8;
    int numIn = 0;

    #pragma omp parallel for reduction(+:numIn) num_threads(NUM_THREADS)
    for (int i = 0; i < n; i++) {
        double x = (double)i/n;
        for (int j = 0; j < n; j++) {
            double y = (double)j/n;
            if (x*x + y*y <= 1) numIn++;
        }
    }
    printf("pi: %f\n", 4.*numIn/n/n);
    return 0;
}
In general I would not compare times without optimization turned on. Compile with something like
gcc -O3 -Wall -pedantic -fopenmp main.c
The rand() function is not thread-safe on Linux (but it's fine with MSVC, and I guess with mingw32, which uses the same C run-time library, MSVCRT, as MSVC). You can use rand_r with a different seed for each thread. See openmp-program-is-slower-than-sequential-one.
In general, try to avoid defining the chunk sizes yourself when you parallelize a loop: just use #pragma omp for and let the runtime pick the schedule. You also don't need to declare the loop variable of a parallelized loop private (the variable i in your code); it is made private automatically.
Try the following code
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i, numIn, n;
    unsigned int seed;
    double x, y, pi;

    n = 1000000;
    numIn = 0;
    #pragma omp parallel private(seed, x, y) reduction(+:numIn)
    {
        seed = 25234 + 17 * omp_get_thread_num();  /* a distinct seed per thread for rand_r */
        #pragma omp for
        for (i = 0; i < n; i++) {                  /* '<' rather than '<=': exactly n samples */
            x = (double)rand_r(&seed) / RAND_MAX;
            y = (double)rand_r(&seed) / RAND_MAX;
            if (x*x + y*y <= 1) numIn++;
        }
    }
    pi = 4.*numIn / n;
    printf("pi %f\n", pi);
    return 0;
}
You can find a working example of this code here http://coliru.stacked-crooked.com/a/9adf1e856fc2b60d

Strange behaviour in OpenMP nested loop

In the following program I get different results (serial vs. OpenMP); what is the reason? At the moment I can only think that perhaps the loop is too "large" for the threads and that perhaps I should write it some other way, but I am not sure. Any hints?
Compilation: g++-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

#define NRACK 64
#define NSTARS 1024

double mysumallatomic_serial(float rocks[NRACK][3], float moon[NSTARS][3], float qr[NRACK], float ql[NSTARS]) {
    int j, i;
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    for (j = 0; j < NRACK; j++) {
        for (i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            S2 += ql[i]*temp_div*qr[j];
        }
    }
    return S2;
}

double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3], float qr[NRACK], float ql[NSTARS]) {
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            float myterm = ql[i]*temp_div*qr[j];
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

int main(int argc, char *argv[]) {
    float rocks[NRACK][3], moon[NSTARS][3];
    float qr[NRACK], ql[NSTARS];
    int i, j;
    for (j = 0; j < NRACK; j++) {
        rocks[j][0] = j;
        rocks[j][1] = j+1;
        rocks[j][2] = j+2;
        qr[j] = j*1e-4 + 1e-3;
        //qr[j] = 1;
    }
    for (i = 0; i < NSTARS; i++) {
        moon[i][0] = 12000 + i;
        moon[i][1] = 12000 + i + 1;
        moon[i][2] = 12000 + i + 2;
        ql[i] = i*1e-3 + 1e-2;
        //ql[i] = 1;
    }
    printf(" serial: %f\n", mysumallatomic_serial(rocks, moon, qr, ql));
    printf(" openmp: %f\n", mysumallatomic(rocks, moon, qr, ql));
    return 0;
}
I think you should use a reduction instead of a shared variable and remove the #pragma omp atomic, like:
#pragma omp parallel for reduction(+:S2)
It should also work faster, because there is no need for atomic operations, which are quite painful in terms of performance and thread synchronization.
UPDATE
You can also get some difference in results because of the order of operations: in floating-point arithmetic,
\sum_{i=1}^{100} x_i \neq \sum_{i=1}^{50} x_i + \sum_{i=51}^{100} x_i
in general, because each grouping rounds differently.
You have data races on most of the temporary variables you are using in the parallel region - difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, and temp_div should all be private. You should make these variables private by using a private clause on the parallel for directive.
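Putting both answers together, a sketch of a corrected mysumallatomic (the reduction replaces the atomic, and the temporaries are declared inside the loop body, which makes them private to each thread automatically; using sqrtf for the float data is my choice, not from the original):
double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3],
                      float qr[NRACK], float ql[NSTARS]) {
    double S2 = 0.;
    #pragma omp parallel for reduction(+:S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            /* block-local variables: each thread gets its own copies */
            float difx = rocks[j][0] - moon[i][0];
            float dify = rocks[j][1] - moon[i][1];
            float difz = rocks[j][2] - moon[i][2];
            float dist = sqrtf(difx*difx + dify*dify + difz*difz);
            S2 += ql[i] * qr[j] / dist;
        }
    }
    return S2;
}
Because the parallel reduction adds the partial sums in a different order than the serial loop, small differences in the last digits relative to the serial version are still expected.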
