Why does calculation with OpenMP take 100x more time than with a single thread? - multithreading

I am trying to test Pi calculation problem with OpenMP. I have this code:
#pragma omp parallel private(i, x, y, myid) shared(n) reduction(+:numIn) num_threads(NUM_THREADS)
{
    printf("Thread ID is: %d\n", omp_get_thread_num());
    myid = omp_get_thread_num();
    printf("Thread myid is: %d\n", myid);
    for (i = myid*(n/NUM_THREADS); i < (myid+1)*(n/NUM_THREADS); i++) {
    //for (i = 0; i < n; i++) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        if (x*x + y*y <= 1) numIn++;
    }
    printf("Thread ID is: %d\n", omp_get_thread_num());
}
return 4. * numIn / n;
}
When I compile with gcc -fopenmp pi.c -o hello_pi and run it with time ./hello_pi for n = 1000000000 I get
real 8m51.595s
user 4m14.004s
sys 60m59.533s
When I run it with a single thread I get
real 0m20.943s
user 0m20.881s
sys 0m0.000s
Am I missing something? It should be faster with 8 threads. I have an 8-core CPU.

Please take a look at
http://people.sc.fsu.edu/~jburkardt/c_src/openmp/compute_pi.c
which might be a good reference implementation for computing pi.
It is quite important to know how your data is spread across the different threads and how OpenMP collects it back. Usually, a bad design (one with data dependencies across threads) running on multiple threads will result in slower execution than a single thread.
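As a concrete illustration (a minimal sketch of my own, not taken from the linked page), compare a design in which every thread synchronises on one shared counter with a design that uses a reduction; the first often runs no faster than the serial loop, while the second scales:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;
    long slow = 0, fast = 0;

    // Bad design: every iteration updates the same shared variable,
    // so the threads spend their time synchronising, not computing.
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        slow++;
    }

    // Better design: each thread accumulates a private partial sum and
    // OpenMP combines the partials once at the end.
    #pragma omp parallel for reduction(+:fast)
    for (long i = 0; i < n; i++) {
        fast++;
    }

    printf("%ld %ld\n", slow, fast);
    return 0;
}
In the question above the shared state is hidden inside rand(), which has the same effect as the first loop.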

rand() in stdlib.h is not thread-safe. Using it in a multi-threaded environment causes a race condition on its hidden state variables, and thus leads to poor performance.
http://man7.org/linux/man-pages/man3/rand.3.html
In fact the following code works well as an OpenMP demo.
$ gcc -fopenmp -o pi pi.c -O3; time ./pi
pi: 3.141672
real 0m4.957s
user 0m39.417s
sys 0m0.005s
code:
#include <stdio.h>
#include <omp.h>

int main()
{
    const int n = 50000;
    const int NUM_THREADS = 8;
    int numIn = 0;

    // Count the grid points (i/n, j/n) that fall inside the unit quarter circle;
    // no random number generator is involved, so there is no shared state.
    #pragma omp parallel for reduction(+:numIn) num_threads(NUM_THREADS)
    for (int i = 0; i < n; i++) {
        double x = (double)i/n;
        for (int j = 0; j < n; j++) {
            double y = (double)j/n;
            if (x*x + y*y <= 1) numIn++;
        }
    }
    printf("pi: %f\n", 4.*numIn/n/n);
    return 0;
}

In general I would not compare times without optimization on. Compile with something like
gcc -O3 -Wall -pedantic -fopenmp main.c
The rand() function is not thread-safe on Linux (but it is fine with MSVC and, I guess, mingw32, which uses the same C run-time library, MSVCRT, as MSVC). You can use rand_r with a different seed for each thread. See openmp-program-is-slower-than-sequential-one.
In general, try to avoid defining the chunk sizes yourself when you parallelize a loop; just use #pragma omp for schedule(static) (or simply #pragma omp for). You also don't need to specify that the loop variable of a parallelized loop is private (the variable i in your code); it is made private automatically.
Try the following code
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i, numIn, n;
    unsigned int seed;
    double x, y, pi;

    n = 1000000;
    numIn = 0;

    #pragma omp parallel private(seed, x, y) reduction(+:numIn)
    {
        // Each thread gets its own seed, so rand_r has no shared state.
        seed = 25234 + 17 * omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < n; i++) {
            x = (double)rand_r(&seed) / RAND_MAX;
            y = (double)rand_r(&seed) / RAND_MAX;
            if (x*x + y*y <= 1) numIn++;
        }
    }
    pi = 4.*numIn / n;
    printf("pi %f\n", pi);
    return 0;
}
You can find a working example of this code here http://coliru.stacked-crooked.com/a/9adf1e856fc2b60d

Related

Computing 2 independent matrix-vector multiplications with OpenMP tasks is slower than processing each one in turn

I am computing the matrix-vector product of two different matrices, i.e. I need to compute A*v and B*v. I am seeing speedup by using thread- and vector-level parallelism. I have noticed that, for the matrix sizes I have tested, the speedup curve flattens out after ~8 threads (for appropriately large matrix sizes). As I am on a 16-thread machine, I thought it natural to throw another 4 or 5 threads at the other matrix-vector multiplication and compute both at the same time by assigning each to a task. However, as the problem size increases, computing with tasks in this thread range is consistently worse than computing without. See the figure below. This is not intuitive to me; I would have expected tasks to outperform the sequential function calls.
// This is consistently worse
#pragma omp task
matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads/2);
#pragma omp task
matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads/2);
#pragma omp taskwait
// this is consistently better (this corresponds to the blue line in the figures)
matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads);
matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads);
Have I misunderstood some aspect of tasking, or is this simply the overhead of creating tasks exceeding their usefulness here? Why, for instance, is computing A*v and B*v with 5 threads each at the same time (10 threads total) no faster than computing A*v and then B*v one after the other?
This is being used as part of a larger recursive algorithm, in which each recursive call computes a matrix-vector multiplication on smaller and smaller matrices.
Full code:
#include <time.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <stdbool.h>
#include <omp.h>
#include <assert.h>
#define THREAD_RANGE 16 // Run for 1:THREAD_RANGE threads
#define NUM_AVERAGES 10 // take the average of 10 timings for each matrix size and each number of threads
#define MATRIX_SIZE 3000
// gcc -fopenmp matrix_tasking.c -o matrix_tasking -O3 -Wall -Werror
double parallelTimings[THREAD_RANGE];
void matrixVectorMultiplicationSequential(double *restrict M, double *restrict V, double *restrict results, unsigned long matrixSize)
{
int i, j;
for (i = 0; i < matrixSize; i++)
{
double *MHead = &M[i * matrixSize];
double tmp = 0;
for (j = 0; j < matrixSize; j++)
{
tmp += MHead[j] * V[j];
}
results[i] = tmp;
}
}
void matrixVectorMultiplicationParallel(double *restrict M, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for private(j)
for (i = 0; i < matrixSize; i++)
{
double tmp = 0;
double *MHead = &M[i * matrixSize];
#pragma omp simd reduction(+ : tmp)
for (j = 0; j < matrixSize; j++)
{
tmp += MHead[j] * V[j];
}
results[i] = tmp;
}
}
void doParallelComputation(double *restrict A, double *restrict B, double *restrict V, double *restrict resultsA, double *restrict resultsB, unsigned long matrixSize, int numThreads)
{
#pragma omp task
matrixVectorMultiplicationParallel(A, V, resultsA, matrixSize, numThreads/2);
#pragma omp task
matrixVectorMultiplicationParallel(B, V, resultsB, matrixSize, numThreads/2);
#pragma omp taskwait
}
void genRandVector(double *S, unsigned long size)
{
srand(time(0));
unsigned long i;
#pragma omp parallel for private(i)
for (i = 0; i < size; i++)
{
double n = rand() % 3;
S[i] = n;
}
}
void doSequentialComputation(double *restrict A, double *restrict B, double *restrict V, double *restrict resultsA, double *restrict resultsB, unsigned long matrixSize)
{
matrixVectorMultiplicationSequential(A, V, resultsA, matrixSize);
matrixVectorMultiplicationSequential(B, V, resultsB, matrixSize);
}
void genRandMatrix(double *A, unsigned long size)
{
srand(time(0));
unsigned long i, j;
for (i = 0; i < size; i++)
{
for (j = 0; j < size; j++)
{
double n = rand() % 3;
A[i * size + j] = n;
}
}
}
int main(int argc, char *argv[])
{
struct timespec start, finish;
double elapsed;
unsigned long matrixSize = 100;
double *V = (double *)malloc(matrixSize * sizeof(double));
double *seqVA = (double *)malloc(matrixSize * sizeof(double)); // Store the results of A*v in the sequential implementation here
double *parVA = (double *)malloc(matrixSize * sizeof(double)); // Store the results of A*v in the parallel implementation here
double *seqVB = (double *)malloc(matrixSize * sizeof(double)); // Store the results of B*v in the sequential implementation here
double *parVB = (double *)malloc(matrixSize * sizeof(double)); // Store the results of B*v in the parallel implementation here
double *A = (double *)malloc(matrixSize * matrixSize * sizeof(double)); // First matrix to multiply by V
double *B = (double *)malloc(matrixSize * matrixSize * sizeof(double)); // Second matrix to multiply by V
genRandVector(V, matrixSize);
genRandMatrix(A, matrixSize);
genRandMatrix(B, matrixSize);
double sequentialTiming = 0;
for (int a = 0; a < NUM_AVERAGES; a++)
{
clock_gettime(CLOCK_MONOTONIC, &start);
doSequentialComputation(A, B, V, seqVA, seqVB, matrixSize);
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
sequentialTiming += elapsed;
for (int t = 1; t <= THREAD_RANGE; t++)
{
clock_gettime(CLOCK_MONOTONIC, &start);
omp_set_num_threads(t);
#pragma omp parallel
{
#pragma omp single
doParallelComputation(A, B, V, parVA, parVB, matrixSize, t);
}
// doParallelComputation(A, B, V, parVA, parVB, matrixSize, t);
clock_gettime(CLOCK_MONOTONIC, &finish);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
parallelTimings[t - 1] += elapsed;
// parallelTiming += elapsed;
for (int i = 0; i < matrixSize; i++)
{
assert(fabs(seqVA[i] - parVA[i]) < 0.01);
}
}
}
sequentialTiming /= NUM_AVERAGES;
printf("Sequential: %f \n", sequentialTiming);
printf("Parallel: ");
for (int t = 0; t < THREAD_RANGE; t++)
{
parallelTimings[t] /= NUM_AVERAGES;
printf("%f ", parallelTimings[t]);
}
printf("\n");
free(seqVA);
free(parVA);
free(seqVB);
free(parVB);
free(A);
free(V);
free(B);
return 0;
}
I suspect the answers lie in the hardware, not so much in the thread or task level.
First, you say you have 16 threads, but are they hyperthreads? By that I mean, do you in fact have 8 cores or 16 cores in your CPU? Hyperthreads are meant to increase the utilisation of common resources in a core, such as FPUs, but if those are already saturated, then adding more threads does not increase the number of operations your machine can perform.
That would answer why your performance flattens at 8 threads, and thus also why adding more parallelism (i.e. two multiplications simultaneously) doesn't accelerate things further.
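If you are not sure, a quick check (a minimal sketch; it only queries OpenMP, the physical core count still has to come from the OS, e.g. lscpu on Linux) is:
#include <stdio.h>
#include <omp.h>

int main(void)
{
    // omp_get_num_procs() reports logical processors, so hyperthreads are
    // counted; compare this with the physical core count from lscpu.
    printf("logical processors visible to OpenMP: %d\n", omp_get_num_procs());
    printf("default OpenMP thread count: %d\n", omp_get_max_threads());
    return 0;
}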
Second, I think the rest of the difference, especially at large matrix sizes, might be down to caches and/or memory bandwidth:
for each row of the result (i.e. each inner loop on j), you need the full input vector and a full row of the matrix. If all threads are executing the same multiplication, then they all use the same input vector.
This means more pressure on the caches and more memory bandwidth required to sustain the computation if you're computing both multiplications at the same time, and both of those are finite resources. To see which one is the real bottleneck you'd need to do some profiling.
So the effect you’re seeing could be due to:
same utilisation of computing resources, contrarily to what you’re suspecting,
less efficient utilisation of caches and memory bandwidth
Don't do this at all!
Use the appropriate BLAS routine which will already be parallelised, vectorised, cache-blocked, and portable to the next machine you happen to run on.
See, for instance, the BLAS routine dgemv (documented alongside LAPACK on netlib) and then consider which of the many BLAS implementations (the reference BLAS, MKL, ...) you want to use.
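For concreteness, here is a minimal sketch of what the BLAS version of the computation could look like, using the CBLAS interface (the function name doParallelComputationBlas and the -lopenblas link flag are illustrative assumptions; any threaded BLAS is used the same way):
#include <stdio.h>
#include <cblas.h>

/* Same role as doParallelComputation in the question, but delegating the
 * two matrix-vector products to dgemv: y := alpha*A*x + beta*y. */
void doParallelComputationBlas(double *A, double *B, double *V,
                               double *resultsA, double *resultsB,
                               unsigned long matrixSize)
{
    int n = (int)matrixSize;
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, V, 1, 0.0, resultsA, 1);
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, B, n, V, 1, 0.0, resultsB, 1);
}

int main(void)
{
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, V[2] = {1, 1};
    double resultsA[2], resultsB[2];
    doParallelComputationBlas(A, B, V, resultsA, resultsB, 2);
    printf("A*V = [%g %g], B*V = [%g %g]\n",
           resultsA[0], resultsA[1], resultsB[0], resultsB[1]);
    return 0;
}
Compile with something like gcc main.c -lopenblas (or link against MKL); the library then decides how many threads to use for each call.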

Using rand_r in OpenMP 'for' is slower with 2 threads

The following code performs better with 1 thread than with 2 (using 4 threads gives speed up, though):
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(int argc, char **argv) {
int n = atoi(argv[1]);
int num_threads = atoi(argv[2]);
omp_set_num_threads(num_threads);
unsigned int *seeds = malloc(num_threads * sizeof(unsigned int));
for (int i = 0; i < num_threads; ++i) {
seeds[i] = 42 + i;
}
unsigned long long sum = 0;
double begin_time = omp_get_wtime();
#pragma omp parallel
{
unsigned int *seedp = &seeds[omp_get_thread_num()];
#pragma omp for reduction(+ : sum)
for (int i = 0; i < n; ++i) {
sum += rand_r(seedp);
}
}
double end_time = omp_get_wtime();
printf("%fs\n", end_time - begin_time);
free(seeds);
return EXIT_SUCCESS;
}
On my laptop (2 cores, HT enabled) I get the following results:
$ gcc -fopenmp test.c && ./a.out 100000000 1
0.821497s
$ gcc -fopenmp test.c && ./a.out 100000000 2
1.096394s
$ gcc -fopenmp test.c && ./a.out 100000000 3
0.933494s
$ gcc -fopenmp test.c && ./a.out 100000000 4
0.748038s
The problem persists without the reduction, drand48_r brings no difference, and dynamic scheduling makes things even worse. However, if I replace the body of the loop with something not connected with the random generator, i.e. sum += *seedp + i;, everything works as expected.
This is a textbook example of false sharing. By using an array of seeds from which each thread takes one element, you force the logically private variables to be physically located next to each other in memory, so they all end up in the same cache line. This means that although no thread ever modifies another thread's seed, the cache line itself is modified by every thread on every iteration. The real trouble is that the hardware cache-coherency protocol does not track modifications of individual variables, only of whole cache lines. Therefore, on each iteration, each thread finds that the line has been modified by another thread and is no longer valid from its point of view, so it has to be reloaded (most likely from the shared L3 cache here), which slows your code down.
Try this one instead (not tested):
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    unsigned long long sum = 0;
    double begin_time = omp_get_wtime();
    #pragma omp parallel
    {
        unsigned int seed = 42 + omp_get_thread_num();
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += rand_r(&seed);
        }
    }
    double end_time = omp_get_wtime();
    printf("%fs\n", end_time - begin_time);
    return EXIT_SUCCESS;
}
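If you do want to keep a shared array of per-thread state, an alternative sketch (my own, with an assumed 64-byte cache line; check your CPU) is to pad each seed so it occupies its own cache line:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

/* Each element is as large as one cache line, so neighbouring threads
 * never write to the same line (64 bytes is an assumption). */
#define CACHE_LINE 64
typedef struct {
    unsigned int seed;
    char pad[CACHE_LINE - sizeof(unsigned int)];
} padded_seed;

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    padded_seed *seeds = malloc(num_threads * sizeof(padded_seed));
    for (int i = 0; i < num_threads; ++i)
        seeds[i].seed = 42 + i;

    unsigned long long sum = 0;
    double begin_time = omp_get_wtime();
    #pragma omp parallel
    {
        unsigned int *seedp = &seeds[omp_get_thread_num()].seed;
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += rand_r(seedp);
        }
    }
    double end_time = omp_get_wtime();
    printf("%fs\n", end_time - begin_time);
    free(seeds);
    return EXIT_SUCCESS;
}
The private-seed version above is simpler and usually the better choice; padding mainly matters when the per-thread state has to live in a shared array anyway.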

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales up. I am running the code on a machine with 8 logical processors, and at no time do I create more than 8 threads (to avoid context switching).
With an increasing number of threads, each thread has to do a smaller amount of work. Also, it is evident from the code that there are no shared data structures between the threads which might be a bottleneck.
Can somebody tell me what I am doing wrong here?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;
void* addNum(void *data)
{
unsigned long int sum = 0;
for(unsigned long int i = 0; i < LOOP_INDEX; i++) {
sum += 100;
}
return NULL;
}
int main(int argc, char** argv)
{
NUM_THREADS = atoi(argv[1]);
pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
int rc;
clock_t start, diff;
LOOP_INDEX = COUNTER/NUM_THREADS;
start = clock();
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_create((threads + t), NULL, addNum, NULL);
if (rc) {
printf("ERROR; return code from pthread_create() is %d", rc);
exit(-1);
}
}
void *status;
for (int t = 0; t < NUM_THREADS; t++) {
rc = pthread_join(threads[t], &status);
}
diff = clock() - start;
int sec = diff / CLOCKS_PER_SEC;
printf("%d",sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test it, I commented out everything in the "addNum()" function. But after doing that, no matter how many threads I create, the time taken by the code is 0 seconds. So there is no such overhead, I think.
clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().
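A minimal sketch of the measurement change (only the timing scaffolding is shown; the pthread_create/pthread_join loop from the question goes in the middle, and on older glibc you may need to link with -lrt):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, finish;

    // CLOCK_MONOTONIC measures elapsed wall-clock time, which is what
    // should shrink when the work is spread over more threads.
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... create and join the threads here ... */

    clock_gettime(CLOCK_MONOTONIC, &finish);
    double elapsed = (finish.tv_sec - start.tv_sec)
                   + (finish.tv_nsec - start.tv_nsec) / 1e9;
    printf("wall time: %f s\n", elapsed);
    return 0;
}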

C++ 11 std::thread strange behavior

I am experimenting a bit with std::thread and C++11, and I am encountering strange behaviour.
Please have a look at the following code:
#include <cstdlib>
#include <thread>
#include <vector>
#include <iostream>
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i);
count = i;
}
class A {
public:
A(const size_t x) : x_(x) {}
size_t sum_up(const size_t num_threads) const {
size_t i;
std::vector<std::thread> threads;
std::vector<size_t> data_vector;
for (i = 0; i < num_threads; ++i) {
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
}
std::cout << "Threads started ...\n";
for (i = 0; i < num_threads; ++i)
threads[i].join();
size_t sum = 0;
for (i = 0; i < num_threads; ++i)
sum += data_vector[i];
return sum;
}
private:
const size_t x_;
};
int main(int argc, char* argv[]) {
const size_t x = atoi(argv[1]);
const size_t num_threads = atoi(argv[2]);
A a(x);
std::cout << a.sum_up(num_threads) << std::endl;
return 0;
}
The main idea here is that I want to specify a number of threads which do independent computations (in this case, simple increments).
After all threads are finished, the results should be merged in order to obtain an overall result.
Just to clarify: This is only for testing purposes, in order to get me understand how
C++11 threads work.
However, when compiling this code using the command
g++ -o threads threads.cpp -pthread -O0 -std=c++0x
on a Ubuntu box, I get very strange behaviour, when I execute the resulting binary.
For example:
$ ./threads 1000 4
Threads started ...
Segmentation fault (core dumped)
(should yield the output: 4000)
$ ./threads 100000 4
Threads started ...
200000
(should yield the output: 400000)
Does anybody have an idea what is going on here?
Thank you in advance!
Your code has many problems (see even thread_sum_up for about 2-3 bugs), but the main bug I found by glancing at your code is here:
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
See, when you push_back into a vector (I'm talking about data_vector), it can move all the previous data around in memory. But then you take the address of (a reference to) a cell for your thread, and then push back again, making the previous reference invalid.
This will cause you to crash.
For an easy fix - add data_vector.reserve(num_threads); just after creating it.
Edit at your request - some bugs in thread_sum_up
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i); // see that last ';' there? means this loop is empty. it shouldn't be there
count = i; // You're just setting count to be i. why do that in a loop? Did you mean +=?
}
The cause of your crash might be that the reference created by std::ref(data_vector[i]) is invalidated by the next push_back into data_vector. Since you know the number of threads, do a data_vector.reserve(num_threads) before you start spawning off the threads to keep the references from being invalidated.
As you resize the vector with the calls to push_back, it is likely to have to reallocate the storage space, causing the references to the contained values to be invalidated. This causes the thread to write to non-allocated memory, which is undefined behavior.
Your options are to pre-allocate the size you need (vector::reserve is one option), or choose a different container.

Strange behaviour in OpenMP nested loop

In the following program I get different results (serial vs OpenMP); what is the reason? At the moment I can only think that perhaps the loop is too "large" for the threads and perhaps I should write it in some other way, but I am not sure. Any hints?
Compilation: g++-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#define NRACK 64
#define NSTARS 1024
double mysumallatomic_serial(float rocks[NRACK][3],float moon[NSTARS][3],float qr[NRACK],float ql[NSTARS]) {
int j,i;
float temp_div=0.,temp_sqrt=0.;
float difx,dify,difz;
float mod2x, mod2y, mod2z;
double S2 = 0.;
for(j=0; j<NRACK; j++){
for(i=0; i<NSTARS;i++){
difx=rocks[j][0]-moon[i][0];
dify=rocks[j][1]-moon[i][1];
difz=rocks[j][2]-moon[i][2];
mod2x=difx*difx;
mod2y=dify*dify;
mod2z=difz*difz;
temp_sqrt=sqrt(mod2x+mod2y+mod2z);
temp_div=1/temp_sqrt;
S2 += ql[i]*temp_div*qr[j];
}
}
return S2;
}
double mysumallatomic(float rocks[NRACK][3],float moon[NSTARS][3],float qr[NRACK],float ql[NSTARS]) {
float temp_div=0.,temp_sqrt=0.;
float difx,dify,difz;
float mod2x, mod2y, mod2z;
double S2 = 0.;
#pragma omp parallel for shared(S2)
for(int j=0; j<NRACK; j++){
for(int i=0; i<NSTARS;i++){
difx=rocks[j][0]-moon[i][0];
dify=rocks[j][1]-moon[i][1];
difz=rocks[j][2]-moon[i][2];
mod2x=difx*difx;
mod2y=dify*dify;
mod2z=difz*difz;
temp_sqrt=sqrt(mod2x+mod2y+mod2z);
temp_div=1/temp_sqrt;
float myterm=ql[i]*temp_div*qr[j];
#pragma omp atomic
S2 += myterm;
}
}
return S2;
}
int main(int argc, char *argv[]) {
float rocks[NRACK][3], moon[NSTARS][3];
float qr[NRACK], ql[NSTARS];
int i,j;
for(j=0;j<NRACK;j++){
rocks[j][0]=j;
rocks[j][1]=j+1;
rocks[j][2]=j+2;
qr[j] = j*1e-4+1e-3;
//qr[j] = 1;
}
for(i=0;i<NSTARS;i++){
moon[i][0]=12000+i;
moon[i][1]=12000+i+1;
moon[i][2]=12000+i+2;
ql[i] = i*1e-3 +1e-2 ;
//ql[i] = 1 ;
}
printf(" serial: %f\n", mysumallatomic_serial(rocks,moon,qr,ql));
printf(" openmp: %f\n", mysumallatomic(rocks,moon,qr,ql));
return(0);
}
I think you should use a reduction instead of a shared variable and remove the #pragma omp atomic, like:
#pragma omp parallel for reduction(+:S2)
And it should work faster, because there is no need for atomic operations, which are quite costly in terms of performance and thread synchronization.
UPDATE
You can also see some difference in the results because of the order of operations: in floating-point arithmetic,
\sum_{i=1}^{100} x_i is not necessarily equal to \sum_{i=1}^{50} x_i + \sum_{i=51}^{100} x_i, because the partial sums are rounded differently.
You have data races on most of the temporary variables you are using in the parallel region: difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, and temp_div should all be private. You can make them private with a private clause on the parallel for directive, or simply declare them inside the loop.
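Putting both answers together, a sketch of the corrected function (it keeps the NRACK/NSTARS declarations and math.h include from the question; sqrtf replaces the sqrt/temp_sqrt pair but is otherwise equivalent):
double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3],
                      float qr[NRACK], float ql[NSTARS]) {
    double S2 = 0.;
    #pragma omp parallel for reduction(+:S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            // Temporaries declared inside the loop body are private to each
            // iteration, so no private() clause is needed and there are no
            // data races.
            float difx = rocks[j][0] - moon[i][0];
            float dify = rocks[j][1] - moon[i][1];
            float difz = rocks[j][2] - moon[i][2];
            float temp_div = 1.0f / sqrtf(difx*difx + dify*dify + difz*difz);
            S2 += ql[i] * temp_div * qr[j];
        }
    }
    return S2;
}
The result can still differ from the serial version in the last digits, for the summation-order reason mentioned above.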
