Thread and sum of bidimensional matrix - multithreading

I have to calculate the sum of the elements in a bidimensional matrix, using a separate thread to calculate the sum of each row. Then the main thread adds up these sums printing the final result.
Can you guys see what's wrong?
(I'm all new to the threads stuff)
#include <pthread.h>
#include <stdio.h>
void sumR(void* _a,int m,int n,int sum)
{
int i;
int (*a)[m]=_a;
for(i=1;i<=n;i++)
sum+=a[n][i];
}
int main()
{
int a[20][20],sum1,sum;
int m=3,n=3,k=3,i,j;
for(i=1;i<=m;i++)
{
k=k+3;
for(j=1;j<=n;j++)
a[i][j]=k;
}
sum1=0;
for(i=1;i<=m;i++)
{
sum=0;
pthread_t th;
pthread_create(&th,NULL,&sumR,&a,&m,&n,&sum);
sum1+=sum;
pthread_join(&th,NULL);
}
printf("Sum of the matrix is: %d",sum1);
return 0;
}

One problem I see is that your loop does essentially this:
for each row
start thread
add thread's sum to total
wait for thread to exit
That's not going to work because you're adding the thread's sum before the thread is done calculating it. You need to wait for the thread to finish:
start thread
wait for thread to exit
add thread's sum to total
However, that model doesn't take advantage of multiple threads. You only have one thread running at a time.
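What you need to do is create all of the threads and store their handles in an array. Then wait for each thread to exit and add its sum to the total. Note also that pthread_create passes exactly one void * argument to the thread function, so the row to sum and the place for its result have to travel together in a struct, and that pthread_join takes the pthread_t itself, not its address. Something like:
for i = 0 to num_threads-1
pthread_create(&threads[i], NULL, thread_func, &args[i]) // args[i] names the row and holds its sum
And then
for i = 0 to num_threads-1
pthread_join(threads[i], NULL)
sum += args[i].sum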
That way, all of your threads are running at the same time, and you harvest the result only when the thread is done.

Related

OpenMP task firstprivate

I have a question regarding the OpenMP task pragma, if we suppose the following code:
#pragma omp parallel
{
x = omp_get_thread_num();
#pragma omp task firstprivate(x)
//do something with x
}
as far as I understood tasking, it is not guaranteed, which thread executes the task.
So my question is, is "x" in the task now the id of thread generated the task or the one who executes it?
e.g. if thread 0 comes across the task, and thread 3 executes it: x should be 0 then, right?
So my question is, is "x" in the task now the id of thread generated
the task or the one who executes it?
It depends. If 'x' is shared inside the parallel region (which is the default for a variable declared outside of it), then:
'x' can be equal to any thread ID ranging from 0 to the total number of threads in the team - 1. This is because there is a race condition during the update of the variable 'x'.
This can be demonstrated with the following code:
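#include <omp.h>
#include <stdio.h>
#include <unistd.h> // for sleep()
int main(){
    int x; // declared outside the parallel region, so it is shared
    #pragma omp parallel
    {
        x = omp_get_thread_num(); // every thread writes the same 'x': a race
        if(omp_get_thread_num() == 1){
            sleep(5); // give the other threads time to overwrite 'x'
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}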
So the thread with ID=1 creates the task. However, 'x' can hold a value other than '1', and also other than the ID of the thread that ends up executing the task. This is because while the thread with ID=1 is waiting in sleep(5);, the remaining threads in the team can update the value of 'x'.
Typically, the canonical form in such use-cases would be to use a single pragma wrapping around the task creation as follows:
#include <omp.h>
#include <stdio.h>
int main(){
    int x;
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("I am the task creator '%d'\n", omp_get_thread_num());
            x = omp_get_thread_num();
            #pragma omp task firstprivate(x)
            {
                printf("Value of x = %d | ID Thread executing = %d\n", x, omp_get_thread_num());
            }
        }
    }
    return 0;
}
And in this case, as @Michael Klemm mentioned in the comments:
..., x will contain the ID of the thread that created the task. So, yes, if thread 0 created the task, x will be zero even though thread 3 is picked to execute the task.
The same holds when the variable 'x' is private at the point where the statement x = omp_get_thread_num(); executes.
Therefore, if you run the code above, the value printed after I am the task creator always matches the one printed after Value of x =, while the value after ID Thread executing may differ. For example:
I am the task creator '4'
Value of x = 4 | ID Thread executing = 7
This is in accordance with the behaviour specified in the OpenMP standard, namely:
The task construct is a task generating construct. When a thread encounters a task construct, an explicit task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct, per-data environment ICVs, and any defaults that apply.
The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.

What is the influence of thread competition?

As you can see, when I remove mt.lock() and mt.unlock(), the result is smaller than 50000.
Why? What actually happens? I would be very grateful if you could explain it to me.
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
using namespace std;
class counter{
public:
mutex mt;
int value;
public:
counter():value(0){}
void increase()
{
//mt.lock();
value++;
//mt.unlock();
}
};
int main()
{
counter c;
vector<thread> threads;
for(int i=0;i<5;++i){
threads.push_back(thread([&]()
{
for(int i=0;i<10000;++i){
c.increase();
}
}));
}
for(auto& t:threads){
t.join();
}
cout << c.value <<endl;
return 0;
}
++ is not a single operation. It reads the value, adds one to it, and writes the result back. Since the whole sequence isn't atomic, multiple threads running the same code at the same time can interleave these steps and lose updates.
As an example, consider three threads operating in the same region without any locking:
Threads 1 and 2 read value as 999
Thread 1 computes the incremented value as 1000 and updates the variable
Thread 3 reads 1000, increments to 1001 and updates the variable
Thread 2 computes the incremented value as 999 + 1 = 1000 and overwrites thread 3's work with 1000
Now if you were using something like the "fetch-and-add" instruction, which is atomic, you wouldn't need any locks. See std::atomic's fetch_add.

New to OpenMP and parallel programming need to partition a loop using scheduling

Ok so here's what the problem says.
Implement a simple loop that calls a function containing a delay. Partition this loop across four threads using static, dynamic and guided scheduling. Measure execution times for each type of scheduling with respect to both the size of the loop and the size of the delay.
this is what I've done so far, I have no idea if I'm on the right track
#include <omp.h>
#include <stdio.h>
int main() {
double start_time, run_time;
omp_set_num_threads(4);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp for schedule(static)
for (int n = 0; n < 100; n++){
printf("square of %d=%d\n", n, n*n);
printf("cube of %d=%d\n", n, n*n*n);
int ID = omp_get_thread_num();
printf("Thread(%d) \n", ID);
}
run_time = omp_get_wtime() - start_time;
printf("Time Elapsed (%f)", run_time);
getchar();
}
First, you need a loop in which the distribution of iterations makes a difference. Your loop has only 100 iterations, so the OpenMP scheduler decides just 100 times which iteration a thread runs next, and that takes no measurable time. The printf output, by contrast, takes very long, so in your code it makes no difference which schedule is used. It is better to write a loop without console output and with a very high iteration count, like
#pragma omp parallel
{
    #pragma omp for schedule(static) private(value)
    for (int i = 0; i < 100000000; i++) {
        value = ...
    }
}
Finally, the loop body has to compute a result that is used after the loop, for example in a printf. Otherwise the compiler may optimize the body away, because a value that is never used is not needed. You can then restrict the time measurement to the parallel region and leave the output of the results outside of it.
If your iterations all take nearly the same time, a static distribution should be fastest. If their durations differ a lot, the dynamic and guided schedules should come out ahead in your measurements.

Why does pthread_self() return the same id multiple times?

I am trying to create a number of threads (representing persons) in a for loop, and display the person id, which is passed as an argument, together with the thread id. The person id is displayed as expected, but the thread id is always the same.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void* travelers(void* arg) {
int* person_id = (int*) arg;
printf("\nPerson %d was created, TID = %d", *person_id, pthread_self());
}
int main(int argc, char** argv)
{
int i;
pthread_t th[1000];
for (i=0; i < 10; i++) {
if ((pthread_create(&th[i], NULL, travelers, &i)) != 0) {
perror("Could not create threads");
exit(2);
}
else {
// Join thread
pthread_join(th[i], NULL);
}
}
printf("\n");
return 0;
}
The output I get is something like this:
Person 0 was created, TID = 881035008
Person 1 was created, TID = 881035008
Person 2 was created, TID = 881035008
Person 3 was created, TID = 881035008
Person 4 was created, TID = 881035008
Person 5 was created, TID = 881035008
Person 6 was created, TID = 881035008
Person 7 was created, TID = 881035008
Person 8 was created, TID = 881035008
Person 9 was created, TID = 881035008
What am I doing wrong?
Since only one of the created threads runs at a time, every new one gets the same ID as the one that finished before, i.e. the IDs are simply reused. Try creating threads in a loop and then joining them in a second loop.
However, you will then have to take care that each thread independently reads the content of i, which will give you different headaches. I'd pass the index as context argument, and then cast it to an int inside the thread function.
It does that, because it re-uses thread-ids. The thread id is only unique among all running threads, but not for threads running at different times; look what your for-loop does essentially:
for (i = 0 to 10) {
start a thread;
wait for termination of the thread;
}
So the program has only one thread running at any given time: each thread is started only after the previously started thread has terminated (because of pthread_join()). To make them run at the same time, use two for loops:
for (i = 0 to 10) {
start thread i;
}
for (i = 0 to 10) {
wait until thread i is finished;
}
Then you will get different thread IDs. (Whether printf shows them correctly depends on your specific implementation and architecture, in particular on whether pthread_t is essentially an int or not. It might be a long int, for example, in which case %d is the wrong format specifier.)
if ((pthread_create(&th[i], NULL, travelers, &i)) != 0)
pthread_create returns 0 if the thread is created successfully. In that case the != 0 test is false, so the else branch runs and you call pthread_join right away. You are effectively creating one thread at a time, over and over.

Can I assign a per-thread index, using pthreads?

I'm optimizing some instrumentation for my project (Linux,ICC,pthreads), and would like some feedback on this technique to assign a unique index to a thread, so I can use it to index into an array of per-thread data.
The old technique uses a std::map based on pthread id, but I'd like to avoid locks and a map lookup if possible (it is creating a significant amount of overhead).
Here is my new technique:
static PerThreadInfo info[MAX_THREADS]; // shared, each index is per thread
// Allow each thread a unique sequential index, used for indexing into per
// thread data.
1:static size_t GetThreadIndex()
2:{
3: static size_t threadCount = 0;
4: __thread static size_t myThreadIndex = threadCount++;
5: return myThreadIndex;
6:}
later in the code:
// add some info per thread, so it can be aggregated globally
info[ GetThreadIndex() ] = MyNewInfo();
So:
1) It looks like line 4 could be a race condition if two threads were created at exactly the same time. If so - how can I avoid this (preferably without locks)? I can't see how an atomic increment would help here.
2) Is there a better way to create a per-thread index somehow? Maybe by pre-generating the TLS index on thread creation somehow?
1) An atomic increment would actually help here. The possible race is two threads reading the same count and assigning the same ID to themselves, so making sure the increment (read number, add 1, store number) happens atomically, and using the value returned by that one operation, fixes the race condition. On x86 a "lock xadd" (an atomic fetch-and-add, which also hands back the previous value) would do the trick, or use whatever your platform offers (like InterlockedIncrement() on Windows, for example).
2) Well, you could actually make the whole info thread-local ("__thread static PerThreadInfo info;"), provided your only aim is to be able to access the data per-thread easily and under a common name. If you actually want it to be a globally accessible array, then saving the index as you do using TLS is a very straightforward and efficient way to do this. You could also pre-compute the indexes and pass them along as arguments at thread creation, as Kromey noted in his post.
Why so averse to using locks? Solving race conditions is exactly what they're designed for...
At any rate, you can use the 4th argument of pthread_create() to pass an argument to your threads' start routine. That way, your main thread can keep an incrementing counter as it launches the threads and pass the current value to each thread as it is created, giving you a unique index for each thread.
I know you tagged this [pthreads], but you also mentioned the "old technique" of using std::map. This leads me to believe that you're programming in C++. In C++11 you have std::thread, and you can pass out unique indexes (id's) to your threads at thread creation time through an ordinary function parameter.
Below is an example HelloWorld that creates N threads, assigning each an index of 0 through N-1. Each thread does nothing but say "hi" and give its index:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>
inline void sub_print() {}
template <class A0, class ...Args>
void
sub_print(const A0& a0, const Args& ...args)
{
std::cout << a0;
sub_print(args...);
}
std::mutex&
cout_mut()
{
static std::mutex m;
return m;
}
template <class ...Args>
void
print(const Args& ...args)
{
std::lock_guard<std::mutex> _(cout_mut());
sub_print(args...);
}
void f(int id)
{
print("This is thread ", id, "\n");
}
int main()
{
const int N = 10;
std::vector<std::thread> threads;
for (int i = 0; i < N; ++i)
threads.push_back(std::thread(f, i));
for (auto i = threads.begin(), e = threads.end(); i != e; ++i)
i->join();
}
My output:
This is thread 0
This is thread 1
This is thread 4
This is thread 3
This is thread 5
This is thread 7
This is thread 6
This is thread 2
This is thread 9
This is thread 8

Resources