Pthreads program is slower than the serial program

Thank you for being generous with your time and helping me in this matter. I am trying to calculate the sum of squared numbers using pthreads. However, it seems to be even slower than the serial implementation, and when I increase the number of threads the program becomes slower still. I made sure that each thread runs on a different core (I have 6 cores assigned to the virtual machine).
This is the serial program:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>

int main(int argc, char *argv[]) {
    struct timeval start, end;
    gettimeofday(&start, NULL); // start time of calculation
    int n = atoi(argv[1]);
    long int sum = 0;
    for (int i = 1; i < n; i++) {
        sum += (i * i);
    }
    gettimeofday(&end, NULL); // end time of calculation
    printf("The sum of squares in [1,%d): %ld | Time Taken: %ld micro seconds \n", n, sum,
           ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)));
    return 0;
}
This is the Pthreads program:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/time.h>
#include <time.h>

void *Sum(void *param);

// structure for thread arguments
struct thread_args {
    int tid;
    int a;           // start
    int b;           // end
    long int result; // partial result
};

int main(int argc, char *argv[])
{
    struct timeval start, end;
    gettimeofday(&start, NULL); // start time of calculation
    int numthreads;
    int number;
    double totalSum = 0;
    if (argc < 3) {
        printf("Usage: ./sum_pthreads <numthreads> <number> ");
        return 1;
    }
    numthreads = atoi(argv[1]);
    number = atoi(argv[2]);
    pthread_t tid[numthreads];
    struct thread_args targs[numthreads];
    printf("I am Process | range: [%d,%d)\n", 1, number);
    printf("Running Threads...\n\n");
    for (int i = 0; i < numthreads; i++) {
        // Setting up the args
        targs[i].tid = i;
        targs[i].a = (number)*(targs[i].tid)/(numthreads);
        targs[i].b = (number)*(targs[i].tid+1)/(numthreads);
        if (i == numthreads-1) {
            targs[i].b = number;
        }
        pthread_create(&tid[i], NULL, Sum, &targs[i]);
    }
    for (int i = 0; i < numthreads; i++) {
        pthread_join(tid[i], NULL);
    }
    printf("Threads Exited!\n");
    printf("Process collecting information...\n");
    for (int i = 0; i < numthreads; i++) {
        totalSum += targs[i].result;
    }
    gettimeofday(&end, NULL); // end time of calculation
    printf("Total Sum is: %.2f | Taken Time: %ld micro seconds \n", totalSum,
           ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)));
    return 0;
}

void *Sum(void *param) {
    int start = (*((struct thread_args*) param)).a;
    int end = (*((struct thread_args*) param)).b;
    int id = (*((struct thread_args*) param)).tid;
    long int sum = 0;
    printf("I am thread %d | range: [%d,%d)\n", id, start, end);
    for (int i = start; i < end; i++) {
        sum += (i * i);
    }
    (*((struct thread_args*) param)).result = sum;
    printf("I am thread %d | Sum: %ld\n\n", id, (*((struct thread_args*) param)).result);
    pthread_exit(0);
}
Results:
hamza@hamza:~/Desktop/lab4$ ./sum_serial 10
The sum of squares in [1,10): 285 | Time Taken: 7 micro seconds
hamza@hamza:~/Desktop/lab4$ ./sol 2 10
I am Process | range: [1,10)
Running Threads...
I am thread 0 | range: [0,5)
I am thread 0 | Sum: 30
I am thread 1 | range: [5,10)
I am thread 1 | Sum: 255
Threads Exited!
Process collecting information...
Total Sum is: 285.00 | Taken Time: 670 micro seconds
hamza@hamza:~/Desktop/lab4$ ./sol 3 10
I am Process | range: [1,10)
Running Threads...
I am thread 0 | range: [0,3)
I am thread 0 | Sum: 5
I am thread 1 | range: [3,6)
I am thread 1 | Sum: 50
I am thread 2 | range: [6,10)
I am thread 2 | Sum: 230
Threads Exited!
Process collecting information...
Total Sum is: 285.00 | Taken Time: 775 micro seconds
hamza@hamza:~/Desktop/lab4$

The two programs do very different things. For example, the threaded program produces much more text output and creates a bunch of threads. You're comparing very short runs (less than a thousandth of a second), so the overhead of those additional things is significant.
You have to test with much longer runs, such that the cost of producing the additional output and of creating and synchronizing threads is amortized.
To use an analogy, one person can tighten three screws faster than three people can because of the overhead of getting a tool to each person, deciding who will tighten which screw, and so on. But if you have 500 screws to tighten, then three people will get it done faster.
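To make that concrete, here is a minimal self-contained sketch (untested; the fixed thread count and the usec() and sum_range() helpers are my own names, not from the programs above) that times both versions in one process with all printing kept outside the timed regions. With n around 10^9 the parallel version should pull ahead. Note that the sums wrap modulo 2^64 for large n; that is harmless for a timing comparison, since both versions wrap identically.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

#define NTHREADS 6  // assumption: matches the 6 cores mentioned above

struct range { unsigned long long a, b, sum; };

static void *sum_range(void *p) {
    struct range *r = p;
    unsigned long long s = 0;
    for (unsigned long long i = r->a; i < r->b; i++)
        s += i * i;                 // wraps mod 2^64 for large n; fine for timing
    r->sum = s;
    return NULL;
}

static long long usec(void) {       // wall-clock microseconds
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main(int argc, char *argv[]) {
    unsigned long long n = argc > 1 ? strtoull(argv[1], NULL, 10) : 1000000000ULL;

    long long t0 = usec();
    unsigned long long serial = 0;
    for (unsigned long long i = 1; i < n; i++)
        serial += i * i;
    long long t_serial = usec() - t0;

    t0 = usec();
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        r[t].a = n * t / NTHREADS;
        r[t].b = n * (t + 1) / NTHREADS;
        pthread_create(&tid[t], NULL, sum_range, &r[t]);
    }
    unsigned long long parallel = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);  // join, then fold in the partial sum
        parallel += r[t].sum;
    }
    long long t_par = usec() - t0;

    printf("serial:   sum %llu in %lld us\n", serial, t_serial);
    printf("parallel: sum %llu in %lld us\n", parallel, t_par);
    return 0;
}

At small n the parallel time is dominated by thread creation and joining, exactly as the answer describes; only at large n does the division of work pay off.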

Related

Using rand_r in OpenMP 'for' is slower with 2 threads

The following code performs better with 1 thread than with 2 (using 4 threads does give a speedup, though):
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);
    unsigned int *seeds = malloc(num_threads * sizeof(unsigned int));
    for (int i = 0; i < num_threads; ++i) {
        seeds[i] = 42 + i;
    }
    unsigned long long sum = 0;
    double begin_time = omp_get_wtime();
    #pragma omp parallel
    {
        unsigned int *seedp = &seeds[omp_get_thread_num()];
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += rand_r(seedp);
        }
    }
    double end_time = omp_get_wtime();
    printf("%fs\n", end_time - begin_time);
    free(seeds);
    return EXIT_SUCCESS;
}
On my laptop (2 cores, HT enabled) I get the following results:
$ gcc -fopenmp test.c && ./a.out 100000000 1
0.821497s
$ gcc -fopenmp test.c && ./a.out 100000000 2
1.096394s
$ gcc -fopenmp test.c && ./a.out 100000000 3
0.933494s
$ gcc -fopenmp test.c && ./a.out 100000000 4
0.748038s
The problem persists without the reduction, using drand48_r makes no difference, and dynamic scheduling makes things even worse. However, if I replace the body of the loop with something not connected with random numbers, e.g. sum += *seedp + i;, everything works as expected.
This is a textbook example of false sharing. By using an array of seeds from which each thread takes one element, you force the logically private variables to be physically adjacent in memory, and therefore all in the same cache line. This means that although no thread ever modifies another thread's seed, the cache line itself is modified by every thread on every iteration. And the actual trouble is that cache coherency cannot track modifications of individual variables, only of whole cache lines. Therefore, on each iteration, each thread sees a cache line that has been modified by another thread and is no longer valid from the system's point of view. It has to be reloaded from memory (well, most likely from the shared L3 cache here), slowing down your code.
Try this one instead (not tested):
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);
    unsigned long long sum = 0;
    double begin_time = omp_get_wtime();
    #pragma omp parallel
    {
        unsigned int seed = 42 + omp_get_thread_num();
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += rand_r(&seed);
        }
    }
    double end_time = omp_get_wtime();
    printf("%fs\n", end_time - begin_time);
    return EXIT_SUCCESS;
}
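If the per-thread state must stay in a shared array (say, to collect the final seeds afterwards), another common remedy for false sharing is to pad each element out to a cache-line boundary so no two threads ever write within the same line. A sketch under the assumption of 64-byte cache lines (also not tested; padded_seed is a name I introduce here):

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

// Pad each seed to 64 bytes (assumed cache-line size) so that consecutive
// seeds are 64 bytes apart and can never share a cache line.
struct padded_seed {
    unsigned int seed;
    char pad[64 - sizeof(unsigned int)];
};

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);
    struct padded_seed *seeds = malloc(num_threads * sizeof *seeds);
    for (int i = 0; i < num_threads; ++i) {
        seeds[i].seed = 42 + i;
    }
    unsigned long long sum = 0;
    double begin_time = omp_get_wtime();
    #pragma omp parallel
    {
        unsigned int *seedp = &seeds[omp_get_thread_num()].seed;
        #pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += rand_r(seedp);
        }
    }
    printf("%fs\n", omp_get_wtime() - begin_time);
    free(seeds);
    return EXIT_SUCCESS;
}

The cost is a little wasted memory per thread; the private-variable version above remains the cleaner fix when nothing outside the parallel region needs the seeds.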

MPI_Recv takes a long and strange time to return

I am trying to compare the execution times of two programs:
- the first one uses the non-blocking functions MPI_Issend and MPI_Irecv, which allow some calculations to be done while messages are being sent and received;
- the other one uses blocking MPI functions.
I have no problem assessing the performance of the first program, and it looks good.
My problem is that the second program, which uses MPI_Recv, often takes a very long and rather strange time to finish: always a little more than 1 second (for example, 1.001033 seconds), as if the process had to do something that takes exactly one second.
I modified my initial code to show you an equivalent one.
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#define SIZE 5000
int main(int argc, char * argv[])
{
MPI_Request trash;
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int myVector[SIZE];
for(int i = 0 ; i < SIZE ; i++)
myVector[i] = 100;
int buff[SIZE];
double startTime = MPI_Wtime();
if(rank == 0)
MPI_Issend(myVector, SIZE, MPI_INT, 1, 1, MPI_COMM_WORLD, &trash);
if(rank == 1) {
MPI_Recv(buff, SIZE, MPI_INT, 0, 1, MPI_COMM_WORLD, NULL);
printf("Processus 1 finished in %lf seconds\n", MPI_Wtime()-startTime);
}
MPI_Finalize();
return 0;
}
Thank you.
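One measurement detail worth checking in code like this: each rank reads MPI_Wtime on its own, so a barrier before the timed region ensures both ranks start the clock at the same moment. A sketch (untested) of the timed section of the program above with that change; it also passes MPI_STATUS_IGNORE, the standard way to discard the status, instead of NULL:

/* Sketch: synchronize the ranks, then time. Names are from the program above. */
MPI_Barrier(MPI_COMM_WORLD);   // line up both ranks before starting the clock
double startTime = MPI_Wtime();
if (rank == 0)
    MPI_Issend(myVector, SIZE, MPI_INT, 1, 1, MPI_COMM_WORLD, &trash);
if (rank == 1) {
    /* MPI_STATUS_IGNORE is the portable way to say "no status wanted" */
    MPI_Recv(buff, SIZE, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 finished in %lf seconds\n", MPI_Wtime() - startTime);
}

This only tightens the measurement; it does not by itself explain the suspiciously round 1-second figure.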

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by timing the interval from just before the kernel launch to just after a cudaDeviceSynchronize (using gettimeofday). There is a cudaDeviceSynchronize before I start recording the interval. I also instrument the kernels to record a timestamp on the GPU (using clock64) at the start of the kernel: thread (0,0,0) of each block, from block (0,0,0) to block (occupancy-1,0,0), writes to an array whose size equals the number of SMs. At the end of the kernel code, every thread updates a timestamp in another array (of the same size) at the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of that measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140ms, the avg of intervals calculated from GPU timestamps is only 100ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
    if (threadIdx.x == 0){
        d_start[blockIdx.x] = clock64();
    }
    // some_kernel code
    d_end[blockIdx.x] = clock64();
}
Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see the note below regarding the per-SM restriction). Compare it to the maximum timestamp of any thread in the grid. The difference will match the execution time reported by gettimeofday() to within 2 percent, according to my testing.
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <sys/time.h>

#define LS_MAX 2000000000U
#define MAX_SM 64
#define USECPSEC 1000000ULL
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];

unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__device__ __inline__ uint32_t __mysmid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void kernel(unsigned ls){
    unsigned long long int ts = clock64();
    unsigned my_sm = __mysmid();
    atomicMin(t_start+my_sm, ts);
    // junk code to waste time
    int tv = ts&0x1F;
    for (unsigned i = 0; i < ls; i++){
        tv &= (ts+i);}
    result = tv;
    // end of junk code
    ts = clock64();
    atomicMax(t_end+my_sm, ts);
}

// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
    unsigned ls;
    if (argc > 1) ls = atoi(argv[1]);
    else ls = 1000000;
    if (ls > LS_MAX) ls = LS_MAX;
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
    cudaCheckErrors("cuda get attribute fail");
    int gpu_clk = 0;
    cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
    if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
    unsigned blks;
    if (argc > 2) blks = atoi(argv[2]);
    else blks = num_sms;
    if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %d\n", blks); return 1;}
    unsigned ntpb;
    if (argc > 3) ntpb = atoi(argv[3]);
    else ntpb = 256;
    if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %d\n", ntpb); return 1;}
    kernel<<<1,1>>>(100); // warm up
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    unsigned long long *h_start, *h_end;
    h_start = new unsigned long long[num_sms];
    h_end = new unsigned long long[num_sms];
    for (int i = 0; i < num_sms; i++){
        h_start[i] = 0xFFFFFFFFFFFFFFFFULL;
        h_end[i] = 0;}
    cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
    cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
    unsigned long long htime = dtime_usec(0);
    kernel<<<blks,ntpb>>>(ls);
    cudaDeviceSynchronize();
    htime = dtime_usec(htime);
    cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
    cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
    cudaCheckErrors("some error");
    printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
    unsigned long long max_diff = 0;
    for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_start[i]);}
    printf("\n end: ");
    for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_end[i]);}
    for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
    printf("\n max diff clks: %llu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
    return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
$
Notes:
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (a very low end cc3.5 device) and a K40c (high end), and the timing results between host and device agree to within 2% for reasonably long kernel execution times. If you pass 1 as the first command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds), whereas the host will see about 10-20us; this is kernel launch overhead being measured. So the 2% figure applies to kernels that take much longer than 20us to execute.
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because the clock64() resource is indicated to be a per-SM resource. The sm clocks are not guaranteed to be synchronized between SMs.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.

pthreads code not scaling up

I wrote the following very simple pthread code to test how it scales. I am running the code on a machine with 8 logical processors, and at no time do I create more than 8 threads (to avoid context switching).
With an increasing number of threads, each thread has to do a smaller amount of work. Also, it is evident from the code that there are no shared data structures between the threads which might be a bottleneck. But still, my performance degrades as I increase the number of threads.
Can somebody tell me what I am doing wrong here?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int NUM_THREADS = 3;
unsigned long int COUNTER = 10000000000000;
unsigned long int LOOP_INDEX;

void* addNum(void *data)
{
    unsigned long int sum = 0;
    for (unsigned long int i = 0; i < LOOP_INDEX; i++) {
        sum += 100;
    }
    return NULL;
}

int main(int argc, char** argv)
{
    NUM_THREADS = atoi(argv[1]);
    pthread_t *threads = (pthread_t*)malloc(sizeof(pthread_t) * NUM_THREADS);
    int rc;
    clock_t start, diff;
    LOOP_INDEX = COUNTER/NUM_THREADS;
    start = clock();
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create((threads + t), NULL, addNum, NULL);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d", rc);
            exit(-1);
        }
    }
    void *status;
    for (int t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(threads[t], &status);
    }
    diff = clock() - start;
    int sec = diff / CLOCKS_PER_SEC;
    printf("%d", sec);
}
Note: All the answers I found online said that the overhead of creating the threads is more than the work they are doing. To test it, I commented out everything in the "addNum()" function. But then, after doing that no matter how many threads I create, the time taken by the code is 0 seconds. So there is no overhead as such, I think.
clock() counts CPU time used, across all threads. So all that's telling you is that you're using a little bit more total CPU time, which is exactly what you would expect.
It's the total wall clock elapsed time which should be going down if your parallelisation is effective. Measure that with clock_gettime() specifying the CLOCK_MONOTONIC clock instead of clock().
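A minimal sketch of that measurement (the thread work is elided; everything else is standard POSIX):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   // wall clock: not summed across threads
    /* ... create, run, and join the worker threads here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %f s\n", elapsed);
    return 0;
}

Unlike clock(), which adds up the CPU time of every thread, this interval should shrink as the parallelisation becomes effective.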

pthread_setaffinity_np: Invalid argument?

I've managed to get my pthreads program sort of working. Basically I am trying to manually set the affinity of 4 threads such that thread 1 runs on CPU 1, thread 2 runs on CPU 2, thread 3 runs on CPU 3, and thread 4 runs on CPU 4.
After compiling, my code works for a few threads but not others (it seems like thread 1 never works), and running the same compiled program a couple of different times gives me different results.
For example:
hao@Gorax:~/Desktop$ ./a.out
Thread 3 is running on CPU 3
pthread_setaffinity_np: Invalid argument
Thread Thread 2 is running on CPU 2
hao@Gorax:~/Desktop$ ./a.out
Thread 2 is running on CPU 2
pthread_setaffinity_np: Invalid argument
pthread_setaffinity_np: Invalid argument
Thread 3 is running on CPU 3
Thread 3 is running on CPU 3
hao@Gorax:~/Desktop$ ./a.out
Thread 2 is running on CPU 2
pthread_setaffinity_np: Invalid argument
Thread 4 is running on CPU 4
Thread 4 is running on CPU 4
hao@Gorax:~/Desktop$ ./a.out
pthread_setaffinity_np: Invalid argument
My question is "Why does this happen? Also, why does the message sometimes print twice?"
Here is the code:
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sched.h>
#include <errno.h>

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

void *thread_function(char *message)
{
    int s, j, number;
    pthread_t thread;
    cpu_set_t cpuset;
    number = (int)message;
    thread = pthread_self();
    CPU_SET(number, &cpuset);
    s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        handle_error_en(s, "pthread_setaffinity_np");
    }
    printf("Thread %d is running on CPU %d\n", number, sched_getcpu());
    exit(EXIT_SUCCESS);
}

int main()
{
    pthread_t thread1, thread2, thread3, thread4;
    int thread1Num = 1;
    int thread2Num = 2;
    int thread3Num = 3;
    int thread4Num = 4;
    int thread1Create, thread2Create, thread3Create, thread4Create, i, temp;
    thread1Create = pthread_create(&thread1, NULL, (void *)thread_function, (char *)thread1Num);
    thread2Create = pthread_create(&thread2, NULL, (void *)thread_function, (char *)thread2Num);
    thread3Create = pthread_create(&thread3, NULL, (void *)thread_function, (char *)thread3Num);
    thread4Create = pthread_create(&thread4, NULL, (void *)thread_function, (char *)thread4Num);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
    pthread_join(thread4, NULL);
    return 0;
}
The first CPU is CPU 0, not CPU 1, so you'll want to change your threadNums:
int thread1Num = 0;
int thread2Num = 1;
int thread3Num = 2;
int thread4Num = 3;
You should initialize cpuset with the CPU_ZERO() macro this way:
CPU_ZERO(&cpuset);
CPU_SET(number, &cpuset);
Also, don't call exit() from a thread, as it will stop the whole process with all its threads:
exit(EXIT_SUCCESS);
return 0; // Use this instead or call pthread_exit()
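Putting those three fixes together, the thread function might look like this (a sketch, not tested; it reuses the handle_error_en macro from the program above, needs <stdint.h> for intptr_t, and expects the CPU index to be passed as (void *)(intptr_t)threadNum at creation):

void *thread_function(void *arg)
{
    int number = (int)(intptr_t)arg;   // CPU index, now 0-based
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);                 // start from an empty set
    CPU_SET(number, &cpuset);
    int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (s != 0)
        handle_error_en(s, "pthread_setaffinity_np");
    printf("Thread %d is running on CPU %d\n", number, sched_getcpu());
    return NULL;                       // ends only this thread, not the process
}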
