Parallel, but slower - multithreading

I'm using the Monte Carlo method to calculate pi as a basic experiment with parallel programming and OpenMP.
The problem is that running with 1 thread and x iterations is always faster than running with n threads and the same x iterations. Can anyone tell me why?
The code is run like this: "a.out 1 1000000", where 1 is the number of threads and 1000000 the number of iterations.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;

int main (int argc, char *argv[]) {
    double arrow_area_circle, pi;
    float xp, yp;
    int i, iterations;
    double pitg = atan(1.0)*4.0; // reference value of pi, used to compute the error

    cout << "Number of processors: " << omp_get_num_procs() << endl;

    // Number of iterations
    iterations = atoi(argv[2]);
    arrow_area_circle = 0.0;

    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        srandom(omp_get_thread_num());
        #pragma omp for private(xp, yp) reduction(+:arrow_area_circle)
        for (i = 0; i < iterations; i++) {
            xp = rand()/(float)RAND_MAX;
            yp = rand()/(float)RAND_MAX;
            if (pow(xp,2.0) + pow(yp,2.0) <= 1) arrow_area_circle++;
        }
    }

    pi = 4*arrow_area_circle / iterations;
    cout << setprecision(18) << "PI = " << pi << endl << endl;
    cout << setprecision(18) << "Error = " << pitg - pi << endl << endl;
    return 0;
}

A CPU-intensive task like this will be slower if you do the work in more threads than there are CPUs in the system. If you are running it on a single-CPU system, you will definitely see a slowdown with more than one thread. This is due to the OS having to switch between the various threads, which is pure overhead. You should ideally have the same number of threads as cores for a task like this.
Another issue is that arrow_area_circle is shared between threads. If you have a thread running on each core, incrementing arrow_area_circle will invalidate the copy in the caches of the other cores, forcing them to refetch it. An arrow_area_circle++ that should take a couple of cycles will instead take dozens or hundreds. Try creating an arrow_area_circle per thread and combining them at the end, as in the sketch below.
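A minimal sketch of that suggestion, using a local counter per thread (the reduction clause in the question's code does something equivalent under the hood; this just makes the per-thread accumulation explicit):

    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        long local_count = 0; // private to this thread: no cache-line ping-pong
        #pragma omp for private(xp, yp)
        for (i = 0; i < iterations; i++) {
            xp = rand()/(float)RAND_MAX;
            yp = rand()/(float)RAND_MAX;
            if (xp*xp + yp*yp <= 1.0f) local_count++;
        }
        #pragma omp atomic // one synchronized add per thread, not one per iteration
        arrow_area_circle += local_count;
    }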
EDIT: Joe Duffy just posted a blog entry on the cost of sharing data between threads.

It looks like you are using some kind of auto-parallelizing compiler. I am going to assume you have more than one core/CPU in your system (otherwise the answer would be too obvious; and no, hyperthreading on a Pentium 4 does not count as having two cores, regardless of what Intel's marketing would have you believe). There are two problems that I see. The first is trivial and probably not your problem:
If the variable arrow_area_circle is shared between your threads, then the act of executing arrow_area_circle++ will cause an interlocked instruction to be used to keep the value synchronized atomically. You should increment a "private" variable, then add its value to the shared arrow_area_circle variable just once at the end, instead of incrementing arrow_area_circle in your inner loop.
The rand() function, to operate soundly, must internally execute with a critical section. The reason is that its internal state/seed is a static shared variable; if it were not, two different threads could get the same output from rand() with unusually high probability, just because they were calling rand() at nearly the same time. That means rand() runs slowly, and especially so as more threads/processes call it at the same time. Unlike the arrow_area_circle variable (which only needs an atomic increment), a true critical section has to be invoked by rand() because its state update is more complicated. To work around this, obtain the source code for a random number generator of your own and use it with a private seed or state per thread. The source code for the standard rand() implementation in most compilers is widely available.
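A minimal sketch of that workaround using POSIX rand_r(), which keeps the generator state in a per-thread variable instead of a shared static (rand_r is my suggestion here, not something from the question's code):

    #pragma omp parallel
    {
        unsigned int seed = omp_get_thread_num() + 1; // private per-thread state
        #pragma omp for reduction(+:arrow_area_circle)
        for (int i = 0; i < iterations; i++) {
            float xp = rand_r(&seed) / (float)RAND_MAX; // no shared state, no lock
            float yp = rand_r(&seed) / (float)RAND_MAX;
            if (xp*xp + yp*yp <= 1.0f) arrow_area_circle++;
        }
    }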
I'd also like to point out that you are using the pow() function to compute what is simply x * x. The latter (e.g. xp*xp + yp*yp <= 1) is dramatically faster than the former, though this point is irrelevant to the question you are asking. :)

Context switching.

rand() is a blocking function. That means it has a critical section inside.

Just to stress the point: you have to be really careful using random numbers in a parallel setting. You should really use a parallel random number generator such as SPRNG.
Whatever you do, make sure each thread isn't using the same random numbers.

Related

Issue in parallelising inner loop of a nested for in OpenMP

I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working fine. Each thread should iterate over each of the M points, but only iterate (in the second loop) over its own chunk of the coordinates. So I want the first loop to go from 0 to M, and the second one from my_first_coord to my_last_coord. In the code I posted, the program is faster when launched with 4 threads than with 8, so there's some issue. I know one way to do this is by "manually" dividing the coordinates, meaning each thread gets its own num_of_coords / thread_count (plus the remainder); I did that with Pthreads. I would prefer to make use of pragmas in OpenMP. I'm sure I'm missing something. Let me show you the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // all threads iterate from 0 to M
    #pragma omp for nowait
    for (int coord = 0; coord < N; coord++) { // each works on its portion of coords
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
    }
}
I am including the Pthreads version too, so that you don't misunderstand what I want to achieve with the pragmas:
/* M is global,
   first_nn and last_nn are local */
for (long i = 0; i < M; i++)
    for (long coord = first_nn; coord <= last_nn; coord++)
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. Adding the -O3 flag improved the times.
With larger inputs, the difference in speedup between 4 and 8 threads is more significant.
Your comment indicates that you are worried about speedup.
1. How many physical cores does your processor have? Try every thread count from 1 to that number. Do not use hyperthreads.
2. You may find a good speedup for low thread counts, but then a leveling-off effect: that is because you have a "streaming" operation, which is limited by memory bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
3. You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
4. You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may keep your parallel code from scaling optimally.
But start with point 3 and report back. (A sketch of the equivalent in-code binding request follows below.)
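For reference, a minimal sketch of the same binding request expressed in the source itself, using OpenMP's proc_bind clause (the in-code counterpart of OMP_PROC_BIND; the loop body is just the accumulation from the question):

    // Ask the runtime to keep threads on nearby places and not migrate them,
    // similar in spirit to setting OMP_PROC_BIND in the environment.
    #pragma omp parallel proc_bind(close)
    {
        for (int i = 0; i < M; i++) {
            #pragma omp for nowait
            for (int coord = 0; coord < N; coord++)
                centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
        }
    }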

Is the following code thread-unsafe? If so, how can I make one of the possible results more likely to come out?

Is the screen output of the following program deterministic? My understanding is that it is not, as it could be either 1 or 2 depending on whether the last thread to pick up the value of i does so before or after the other thread has written 1 into it.
On the other hand, I keep seeing the same output, as if each thread waited for the previous one to finish: I get 2 on screen in this case, or 100 if I create similar threads from t1 to t100 and join them all.
If the answer is no, the result is not deterministic, is there a way, with a simple toy program, to increase the odds that one of the possible results comes out?
#include <iostream>
#include <thread>

int main() {
    int i = 0;
    std::thread t1([&i](){ ++i; });
    std::thread t2([&i](){ ++i; });
    t1.join();
    t2.join();
    std::cout << i << '\n';
}
(I'm compiling and running it like this: g++ -std=c++11 -lpthread prova.cpp -o exe && ./exe.)
You are always seeing the same result because the first thread starts and finishes its work before the second one begins. This narrows the window for a race condition to occur.
But ultimately there is still a chance that one occurs, because the ++ operation is not atomic (read the value, then increment it, then write it back).
If the two threads run at the same time (e.g. thread 1 is slowed down because the CPU is busy), they can both read the same value, and the final result will be 1.
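One way to make a losing interleaving likely (my own sketch, not part of the original answer) is to widen the race window: have each thread perform many unsynchronized increments. With counts this large, the final value is usually below the expected total:

    #include <iostream>
    #include <thread>

    int main() {
        int i = 0;
        auto work = [&i]() {
            for (int k = 0; k < 1000000; ++k)
                ++i; // non-atomic read-modify-write: increments can be lost
        };
        std::thread t1(work);
        std::thread t2(work);
        t1.join();
        t2.join();
        std::cout << i << '\n'; // 2000000 if no update were lost; typically prints less
    }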

What is the point of running same code under different threads - openMP?

From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team of N threads (where N is determined at runtime, usually from the number of CPU cores, but may be affected by a few things), all of which execute the next statement (or the next block, if the statement is a {…}-enclosure). After the statement, the threads join back into one.
#pragma omp parallel
{
    // Code inside this region runs in parallel.
    printf("Hello!\n");
}
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
By using omp_get_thread_num() you can retrieve the thread ID, which enables you to parametrize the so-called "same code" with respect to that thread ID.
Take this example:
A is a 1000-dimensional integer array and you need to sum its values using 2 OpenMP threads.
You would design you code something like this:
int A_dim = 1000;
int A[1000];          // assume A is filled elsewhere
long sum[2] = {0, 0};

#pragma omp parallel num_threads(2)  // the example assumes exactly 2 threads
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);
    int end = (threadID + 1) * (A_dim / 2);
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound the thread will start summing from (example: thread #0 starts summing from 0, while thread #1 starts from 500).
end works much the same as start, but is the upper bound on the array index the thread will sum up to (example: thread #0 sums until 500, i.e. values A[0] to A[499], while thread #1 sums until 1000 is reached, values A[500] to A[999]).
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates this number of threads. Afterwards you can distribute/share the work with other constructs like omp for or omp task.
A possible benefit of this distinction shows up, for example, when you have to allocate memory for each thread (i.e. thread-local data), as the sketch below illustrates.
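A minimal sketch of that two-step pattern, assuming nothing beyond standard OpenMP: parallel creates the team (and is where per-thread setup would go), and omp for then distributes the iterations across it:

    #include <omp.h>
    #include <stdio.h>

    int main() {
        #pragma omp parallel  // create the team of threads
        {
            // per-thread setup could go here (e.g. allocating thread-local buffers)
            #pragma omp for   // distribute the iterations across the team
            for (int i = 0; i < 8; i++)
                printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }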
I want to understand what the point is of running the same code under different threads. In what kinds of cases can it be helpful?
One example: in physics you often have a random process in your code (collisions, an initial Maxwellian distribution, etc.) and you need to run the code many times to get average results. In such cases you need to run the same code several times.

why doesn't the System Monitor show correct CPU affinity?

I have searched for questions/answers on CPU affinity and read the results, but I still cannot get my threads to pin to a single CPU.
I am working on an application that will run on a dedicated Linux box, so I am not concerned about other processes, only my own. This app currently spawns one pthread, and then the main thread enters a while loop to process control messages using POSIX message queues. This loop blocks waiting for a control message to come in and then processes it. So the main thread is very simple and non-critical. My code is working very well, in that I can send this app messages and it processes them just fine. All control messages are very small and are used only to control the functionality of the application; that is, only a few control messages are ever sent or received.
Before I enter this while loop, I use sched_getaffinity() to log all of the available CPUs. Then I use sched_setaffinity() to restrict this process to a single CPU, and call sched_getaffinity() again to verify that it is set to run on only one CPU, which it is.
The single pthread that was spawned does a similar thing. The first thing I do in the newly created pthread is call pthread_getaffinity_np() to check the available CPUs, then call pthread_setaffinity_np() to pin it to a different CPU, then call pthread_getaffinity_np() again to verify it is set as desired, which it is.
This is what is confusing. When I run the app and view the CPU History in System Monitor, I see no difference from running the app without all of this affinity code. The scheduler still runs a couple of seconds on each of the 4 CPUs of this quad-core box. So it appears that the scheduler is ignoring my affinity settings.
Am I wrong to expect some proof that the main thread and the pthread are each actually running on their own single CPU? Or have I forgotten to do something more to get this to work as I intend?
Thanks,
-Andres
You have no answers yet, so I will give you what I can: some partial help.
Assuming you checked the return values from pthread_setaffinity_np:
How you assign your cpuset is very important: create it in the main thread, with what you want, and it will propagate to the threads created afterwards. Did you check the return codes?
The cpuset you actually get will be the intersection of the hardware's available CPUs and the cpuset you define.
min.h in the code below is a generic build include file. You have to define _GNU_SOURCE; please note the comment on the last line of the code. CPU_ZERO, CPU_SET, CPU_ISSET, and CPU_SETSIZE are macros from the standard header <sched.h>, exposed when _GNU_SOURCE is defined.
#define _GNU_SOURCE
#include "min.h"     // generic build include (the author's own)
#include <pthread.h>
#include <stdio.h>   // for fprintf/perror, in case min.h does not pull them in
#include <stdlib.h>  // for exit

int
main(int argc, char **argv)
{
    int s, j;
    cpu_set_t cpuset;
    pthread_t tid = pthread_self();

    // Set affinity mask to include CPUs 0 & 1
    CPU_ZERO(&cpuset);
    for (j = 0; j < 2; j++)
        CPU_SET(j, &cpuset);
    s = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_setaffinity_np");
        exit(1);
    }

    // let's see what we really have in the actual affinity mask assigned to our thread
    s = pthread_getaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_getaffinity_np");
        exit(1);
    }

    printf("my cpuset has:\n");
    for (j = 0; j < CPU_SETSIZE; j++)
        if (CPU_ISSET(j, &cpuset))
            printf(" CPU %d\n", j);

    // #Andres note: any pthread_create call from here on creates a thread with the
    // identical cpuset - you do not have to call it in every thread.
    return 0;
}
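For the proof you are after, an alternative to watching System Monitor is to ask the thread which CPU it is actually on. A minimal sketch (sched_getcpu() is glibc-specific and also requires _GNU_SOURCE):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        // Prints the CPU the calling thread is running on at this instant;
        // call this periodically inside a pinned thread to confirm it never moves.
        printf("running on CPU %d\n", sched_getcpu());
        return 0;
    }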

TBB ThreadingBuildingBlocks strange behaviour

My question: why is my program freezing if I use "read only" const_accessors?
It seems to be locking up. From the API description it seems to be OK to have one accessor and multiple const_accessors (one writer, many readers). Maybe somebody can tell me a different story.
The goal I am trying to achieve is to use this concurrent hash map and make it available to 10-200 threads so that they can look up and add/delete information. If you have a better solution than the one I am currently using, you are also welcome to post alternatives.
tbb::size_t hashInitSize = 1200000;
concurrent_hash_map<long int, char*> hashmap(hashInitSize);
cout << hashmap.bucket_count() << std::endl;

long int l = 200;
long int c = 201;

concurrent_hash_map<long int, char*>::accessor o;        // write lock when acquired
concurrent_hash_map<long int, char*>::const_accessor t;  // read lock when acquired
concurrent_hash_map<long int, char*>::const_accessor h;

cout << "Trying to find 200: " << hashmap.find(t, 200) << std::endl;
hashmap.insert(o, l);            // o now holds a write lock on key 200
o->second = (char*)"testother";  // cast needed: string literal assigned to char*
Page 43 of the TBB Community Tutorial Guide describes the concept of accessors.
From the TBB reference manual:
An accessor acts as a smart pointer to a pair in a concurrent_hash_map. It holds an implicit lock on a pair until the instance is destroyed or method release is called on the accessor.
Accessors acquire a lock when they are used. Multiple accessors can exist at the same time. However, if a program uses multiple accessors concurrently and does not release the locks, it is likely to deadlock.
To avoid deadlock, release the lock on a hash map entry once you are done accessing it.
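A minimal sketch of that advice applied to the question's map (find, insert, and release are the documented concurrent_hash_map accessor calls; the scoping is the point):

    // Scope or release each accessor so a read lock is never still held
    // when the same thread tries to take a write lock.
    concurrent_hash_map<long int, char*>::const_accessor t;
    if (hashmap.find(t, 200)) {
        // ... read t->second here ...
        t.release();  // drop the read lock explicitly
    }

    concurrent_hash_map<long int, char*>::accessor o;
    if (hashmap.insert(o, l))  // write lock is held while o holds the entry
        o->second = (char*)"testother";
    o.release();  // or simply let o go out of scope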
