I was trying to redo an example of threads. Here are the two functions that I am running from main one after another. They are the typical increment and decrement functions.
void* increment(void *arg)
{
int incr_step = *(int*) arg;
free(arg);
unsigned long int i;
for(i=0; i<5;i++) {
//pthread_mutex_lock(&lock);
counter = counter + incr_step;
//pthread_mutex_unlock(&lock);
printf("Thread ID %lu --> counter = %d\n", pthread_self(), counter);
sleep(1);
}
return NULL;
}
void* decrement(void *arg)
{
int decr_step = *(int*)arg;
free(arg);
unsigned long int i;
for(i=0; i<5;i++) {
//pthread_mutex_lock(&lock);
counter = counter - decr_step;
//pthread_mutex_unlock(&lock);
printf("Thread ID %lu--> counter = %d\n", pthread_self(), counter);
sleep(1);
}
return NULL;
}
In main I just create two pthreads, run one of these functions in each of them, and of course join both. I have a global variable counter, which is initially 5, and I am testing with an increment value of 3 and a decrement value of 2. So if my threads were synchronized, the final value of counter would be 10 (the increment of 3 happens five times, giving 5 + 5*3 = 20, and the decrement of 2 happens five times, giving 20 - 5*2 = 10).
However, I have commented out the mutex statements, so I expect the final value of counter (10 when the threads are in sync) to be different, but I keep getting 10. Why?
Accessing a shared variable from several threads without a synchronization mechanism such as a mutex gives non-deterministic results.
It is only by chance that you see the same value as with the mutex in place. Each thread performs just five updates, separated by sleep(1), so the tiny window in which two updates could overlap is very unlikely to be hit.
Nothing guarantees that a race will not happen, though: as long as the threads access the shared variable without synchronization, a lost update is always possible.
I'm new to concurrent programming. I implemented a CPU-intensive task and measured how much speedup I could gain. However, I cannot get any speedup as I increase the number of threads.
The program does the following task:
There's a shared counter to count from 1 to 1000001.
Each thread does the following until the counter reaches 1000001:
increments the counter atomically, then
runs a loop 10000 times.
That is 1000001*10000 = 10^10 operations to be performed in total, so I should be able to get a good speedup as I increase the number of threads.
Here's how I implemented it:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>
pthread_t workers[8];
atomic_int counter; // a shared counter
void *runner(void *param);
int main(int argc, char *argv[]) {
if(argc != 2) {
printf("Usage: ./thread thread_num\n");
return 1;
}
int NUM_THREADS = atoi(argv[1]);
pthread_attr_t attr;
counter = 1; // initialize shared counter
pthread_attr_init(&attr);
const clock_t begin_time = clock(); // begin timer
for(int i=0;i<NUM_THREADS;i++)
pthread_create(&workers[i], &attr, runner, NULL);
for(int i=0;i<NUM_THREADS;i++)
pthread_join(workers[i], NULL);
const clock_t end_time = clock(); // end timer
printf("Thread number = %d, execution time = %lf s\n", NUM_THREADS, (double)(end_time - begin_time)/CLOCKS_PER_SEC);
return 0;
}
void *runner(void *param) {
int temp = 0;
while(temp < 1000001) {
temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
for(int i=1;i<10000;i++)
temp%i; // do some CPU intensive work
}
pthread_exit(0);
}
However, when I run my program, I cannot get better performance than sequential execution!
gcc-4.9 -std=c11 -pthread -o my_program my_program.c
for i in 1 2 3 4 5 6 7 8; do \
./my_program $i; \
done
Thread number = 1, execution time = 19.235998 s
Thread number = 2, execution time = 20.575237 s
Thread number = 3, execution time = 25.161116 s
Thread number = 4, execution time = 28.278671 s
Thread number = 5, execution time = 28.185605 s
Thread number = 6, execution time = 28.050380 s
Thread number = 7, execution time = 28.286925 s
Thread number = 8, execution time = 28.227132 s
I run the program on a 4-core machine.
Does anyone have suggestions to improve the program? Or any clue why I cannot get speedup?
The only work here that can be done in parallel is the loop:
for(int i=1;i<10000;i++)
temp%i; // do some CPU intensive work
gcc, even at the minimal optimisation level, will not emit any code for the temp%i; void expression (disassemble it and see), so this is essentially an empty loop that executes very fast. With multiple threads running on different cores, the execution time is then dominated by the cache line containing your atomic variable ping-ponging between the cores.
You need to make this loop actually do a significant amount of work before you'll see a speed-up.
I am running the following loop using, say, 8 OpenMP threads:
float* data;
int n;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
DO SOMETHING WITH data[i]
}
Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3
and second half (i = n/2, ..., n-1) with threads 4,5,6,7.
Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.
How do I achieve this with OpenMP?
Thank you
PS: Ideally, if the threads from one group finish their half of the loop while the other half is still not done, I'd like the threads from the finished group to join the unfinished group in processing the other half of the loop.
I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:
int n;
int i0 = 0;
int i1 = n / 2;
#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
int nt = omp_get_thread_num();
int j;
#pragma omp critical
{
if ( nt < 4 ) {
if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
else j = i1++; // of loop unless first half is finished
}
else {
if ( i1 < n ) j = i1++; // Second 4 threads process second half
else j = i0++; // of loop unless second half is finished
}
}
DO SOMETHING WITH data[j]
}
Probably the best approach is to use nested parallelism, first over NUMA nodes and then within each node; you can then still use the dynamic scheduling infrastructure while breaking the data up among thread groups:
#include <omp.h>
#include <stdio.h>
int main(int argc, char **argv) {
const int ngroups=2;
const int npergroup=4;
const int ndata = 16;
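omp_set_nested(1); /* enable nested parallelism; deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels() */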
#pragma omp parallel for num_threads(ngroups)
for (int i=0; i<ngroups; i++) {
int start = (ndata*i+(ngroups-1))/ngroups;
int end = (ndata*(i+1)+(ngroups-1))/ngroups;
#pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
for (int j=start; j<end; j++) {
printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
}
}
return 0;
}
Running this gives
$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15
But note that the nested approach may well change the thread assignments compared with one-level threading, so you will probably have to experiment with KMP_AFFINITY or other mechanisms (for example the standard OMP_PROC_BIND and OMP_PLACES environment variables) to get the bindings right again.
The code below was taken from the LLNL tutorials on pthreads, with two modifications:
the sleep(1); in function inc_count is commented out
the pthread_join(threads[i], NULL); in function main is commented out
/******************************************************************************
* FILE: condvar.c
* DESCRIPTION:
* Example code for using Pthreads condition variables. The main thread
* creates three threads. Two of those threads increment a "count" variable,
* while the third thread watches the value of "count". When "count"
* reaches a predefined limit, the waiting thread is signaled by one of the
* incrementing threads. The waiting thread "awakens" and then modifies
* count. The program continues until the incrementing threads reach
* TCOUNT. The main program prints the final value of count.
* SOURCE: Adapted from example code in "Pthreads Programming", B. Nichols
* et al. O'Reilly and Associates.
* LAST REVISED: 10/14/10 Blaise Barney
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 3
#define TCOUNT 10
#define COUNT_LIMIT 12
int count = 0;
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
void *inc_count(void *t)
{
int i;
long my_id = (long)t;
for (i=0; i < TCOUNT; i++) {
pthread_mutex_lock(&count_mutex);
count++;
/*
Check the value of count and signal waiting thread when condition is
reached. Note that this occurs while mutex is locked.
*/
if (count == COUNT_LIMIT) {
printf("inc_count(): thread %ld, count = %d Threshold reached. ",
my_id, count);
pthread_cond_signal(&count_threshold_cv);
printf("Just sent signal.\n");
}
printf("inc_count(): thread %ld, count = %d, unlocking mutex\n",
my_id, count);
pthread_mutex_unlock(&count_mutex);
/* Do some work so threads can alternate on mutex lock */
/*sleep(1);*/
}
pthread_exit(NULL);
}
void *watch_count(void *t)
{
long my_id = (long)t;
printf("Starting watch_count(): thread %ld\n", my_id);
/*
Lock mutex and wait for signal. Note that the pthread_cond_wait routine
will automatically and atomically unlock mutex while it waits.
Also, note that if COUNT_LIMIT is reached before this routine is run by
the waiting thread, the loop will be skipped to prevent pthread_cond_wait
from never returning.
*/
pthread_mutex_lock(&count_mutex);
while (count < COUNT_LIMIT) {
printf("watch_count(): thread %ld Count= %d. Going into wait...\n", my_id,count);
pthread_cond_wait(&count_threshold_cv, &count_mutex);
printf("watch_count(): thread %ld Condition signal received. Count= %d\n", my_id,count);
printf("watch_count(): thread %ld Updating the value of count...\n", my_id,count);
count += 125;
printf("watch_count(): thread %ld count now = %d.\n", my_id, count);
}
printf("watch_count(): thread %ld Unlocking mutex.\n", my_id);
pthread_mutex_unlock(&count_mutex);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
int i, rc;
long t1=1, t2=2, t3=3;
pthread_t threads[3];
pthread_attr_t attr;
/* Initialize mutex and condition variable objects */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init (&count_threshold_cv, NULL);
/* For portability, explicitly create threads in a joinable state */
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_create(&threads[0], &attr, watch_count, (void *)t1);
pthread_create(&threads[1], &attr, inc_count, (void *)t2);
pthread_create(&threads[2], &attr, inc_count, (void *)t3);
/* Wait for all threads to complete */
for (i = 0; i < NUM_THREADS; i++) {
/* pthread_join(threads[i], NULL);*/
}
printf ("Main(): Waited and joined with %d threads. Final value of count = %d. Done.\n",
NUM_THREADS, count);
/* Clean up and exit */
pthread_attr_destroy(&attr);
pthread_mutex_destroy(&count_mutex);
pthread_cond_destroy(&count_threshold_cv);
pthread_exit (NULL);
}
On Mac OS X 10.9 (Apple LLVM version 5.1 (clang-503.0.40)), this outputs:
Starting watch_count(): thread 1
Main(): Waited and joined with 3 threads. Final value of count = 0. Done.
inc_count(): thread 2, count = 1, unlocking mutex
inc_count(): thread 3, count = 2, unlocking mutex
watch_count(): thread 1 Count= 2. Going into wait...
watch_count(): thread 1 Condition signal received. Count= 2
watch_count(): thread 1 Updating the value of count...
watch_count(): thread 1 count now = 127.
watch_count(): thread 1 Unlocking mutex.
inc_count(): thread 2, count = 128, unlocking mutex
inc_count(): thread 3, count = 129, unlocking mutex
inc_count(): thread 2, count = 130, unlocking mutex
inc_count(): thread 3, count = 131, unlocking mutex
inc_count(): thread 2, count = 132, unlocking mutex
inc_count(): thread 3, count = 133, unlocking mutex
inc_count(): thread 2, count = 134, unlocking mutex
inc_count(): thread 3, count = 135, unlocking mutex
inc_count(): thread 2, count = 136, unlocking mutex
inc_count(): thread 3, count = 137, unlocking mutex
inc_count(): thread 2, count = 138, unlocking mutex
inc_count(): thread 3, count = 139, unlocking mutex
inc_count(): thread 2, count = 140, unlocking mutex
inc_count(): thread 3, count = 141, unlocking mutex
inc_count(): thread 2, count = 142, unlocking mutex
inc_count(): thread 3, count = 143, unlocking mutex
inc_count(): thread 2, count = 144, unlocking mutex
inc_count(): thread 3, count = 145, unlocking mutex
On CentOS 5 and 6 (gcc 4.1.2 and gcc 4.4.3, x86_64-redhat-linux), the output is more random. Sometimes it is:
Main(): Waited and joined with 3 threads. Final value of count = 0. Done.
Starting watch_count(): thread 1
watch_count(): thread 1 Count= 0. Going into wait...
inc_count(): thread 2, count = 1, unlocking mutex
inc_count(): thread 2, count = 2, unlocking mutex
inc_count(): thread 2, count = 3, unlocking mutex
inc_count(): thread 2, count = 4, unlocking mutex
inc_count(): thread 2, count = 6, unlocking mutex
inc_count(): thread 2, count = 7, unlocking mutex
inc_count(): thread 2, count = 8, unlocking mutex
inc_count(): thread 2, count = 9, unlocking mutex
inc_count(): thread 2, count = 10, unlocking mutex
inc_count(): thread 2, count = 11, unlocking mutex
inc_count(): thread 3, count = 5, unlocking mutex
and it hangs.
At other times it gives:
Main(): Waited and joined with 3 threads. Final value of count = 0. Done.
Starting watch_count(): thread 1
watch_count(): thread 1 Count= 0. Going into wait...
inc_count(): thread 2, count = 1, unlocking mutex
inc_count(): thread 2, count = 2, unlocking mutex
inc_count(): thread 2, count = 3, unlocking mutex
inc_count(): thread 2, count = 4, unlocking mutex
inc_count(): thread 2, count = 5, unlocking mutex
inc_count(): thread 2, count = 6, unlocking mutex
inc_count(): thread 2, count = 7, unlocking mutex
inc_count(): thread 2, count = 8, unlocking mutex
inc_count(): thread 2, count = 9, unlocking mutex
inc_count(): thread 3, count = 10, unlocking mutex
inc_count(): thread 3, count = 11, unlocking mutex
inc_count(): thread 3, count = 12 Threshold reached. inc_count(): thread 2, count = 13, unlocking mutex
and also hangs.
To summarize:
On Mac OS X, if I neither sleep in inc_count nor join the 3 threads in main, watch_count wakes up when count is 2 (apparently without being signaled by either of the 2 inc_count threads), and pthread_cond_signal never gets called.
On Linux, however, it hangs at some point.
My question is: how does pthread_join influence the behavior of the condition variable? Why does Mac OS X behave so differently from Linux?
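Taking out the pthread_join() is probably a mistake. The joins are what make main wait for the other threads to complete before it carries on. Without them, main runs straight on to pthread_mutex_destroy() and pthread_cond_destroy() while the other threads are still using the mutex and the condition variable, and destroying them while they are in use (or using them after they are destroyed) is undefined behavior. Because main ends with pthread_exit() rather than return, the other threads are not terminated; they keep running against the destroyed objects. What you observe then depends on the platform and on how far the threads got before the axe fell.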
The sleep() just slows things down so the mechanics are more obvious.
I suspect that what is wrong here is that pthread_cond_signal() has no effect if there is no waiter... the signal is not remembered. Conditions should be used together with some other state. When the state is updated, the related condition is used to signal that the state has changed, in case some thread is waiting for that. So, a wait involving a simple flag condition will be:
pthread_mutex_lock(mutex) ;
while (!whatever)
pthread_cond_wait(condition, mutex) ;
pthread_mutex_unlock(mutex) ;
Note the while: pthread_cond_signal() is allowed to wake up more than one thread, and pthread_cond_wait() may also return spuriously, so from the waiter's perspective, being woken up does not guarantee that the state has changed, or is what is waited for.
To signal this (where we are assuming that whatever is initialised to false elsewhere):
pthread_mutex_lock(mutex) ;
whatever = true ;
pthread_cond_signal(condition) ;
pthread_mutex_unlock(mutex) ;
Note, again, that the whatever flag is essential, because without it the waiter will wait forever if the signal is sent before the waiter thread gets to the wait!
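Having said all that, I suggest that the watch_count() function is broken: it adds 125 to count inside the wait loop after every wakeup, without checking that count has actually reached COUNT_LIMIT. A wakeup that was not caused by the signal (as in the Mac OS X output at count 2) therefore pushes count past the threshold, and the signal is never sent.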
I am trying to create a number of threads (representing persons) in a for loop, and display the person id, which is passed as an argument, together with the thread id. The person id is displayed as expected, but the thread id is always the same.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void* travelers(void* arg) {
int* person_id = (int*) arg;
printf("\nPerson %d was created, TID = %d", *person_id, pthread_self());
}
int main(int argc, char** argv)
{
int i;
pthread_t th[1000];
for (i=0; i < 10; i++) {
if ((pthread_create(&th[i], NULL, travelers, &i)) != 0) {
perror("Could not create threads");
exit(2);
}
else {
// Join thread
pthread_join(th[i], NULL);
}
}
printf("\n");
return 0;
}
The output I get is something like this:
Person 0 was created, TID = 881035008
Person 1 was created, TID = 881035008
Person 2 was created, TID = 881035008
Person 3 was created, TID = 881035008
Person 4 was created, TID = 881035008
Person 5 was created, TID = 881035008
Person 6 was created, TID = 881035008
Person 7 was created, TID = 881035008
Person 8 was created, TID = 881035008
Person 9 was created, TID = 881035008
What am I doing wrong?
Since only one of the created threads runs at a time, every new one gets the same ID as the one that finished before it, i.e. the IDs are simply reused. Try creating the threads in one loop and then joining them in a second loop.
However, you will then have to make sure that each thread reads its own value rather than the shared i, which main keeps changing while the threads run. I'd pass the index itself as the context argument and cast it back to an int inside the thread function.
It does that, because it re-uses thread-ids. The thread id is only unique among all running threads, but not for threads running at different times; look what your for-loop does essentially:
for (i = 0 to 10) {
start a thread;
wait for termination of the thread;
}
So the program has only one thread running at any given time: each thread is started only after the previously started thread has terminated (because of pthread_join()). To make them run at the same time, use two for loops:
for (i = 0 to 10) {
start thread i;
}
for (i = 0 to 10) {
wait until thread i is finished;
}
Then you will likely see different thread IDs. (The IDs will be different, but whether printf writes them out correctly depends on your specific implementation and architecture, in particular on whether pthread_t is essentially an int. It might be a long int, for example, in which case %d is the wrong format.)
if ((pthread_create(&th[i], NULL, travelers, &i)) != 0)
pthread_create returns 0 when the thread is created successfully. In that case the != 0 test is false, so the else branch runs and calls pthread_join right away. You are effectively creating and finishing one thread at a time, repeatedly.