Using Thrust with OpenMP: no substantial speedup obtained

I am interested in porting code I had written mostly with the Thrust GPU library to multicore CPUs. Thankfully, the website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.
I wrote the simple code below, which sorts a large array, to measure the speedup on a machine that supports up to 16 OpenMP threads.
The timings obtained on this machine for sorting a random array of size 16 million are
STL : 1.47 s
Thrust (16 threads) : 1.21 s
There seems to be barely any speedup. I would like to know how to get a substantial speedup when sorting arrays with OpenMP, as I do with GPUs.
The code is below (the file sort.cu). Compilation was performed as follows:
nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
The NVCC version is 5.5
The Thrust library version being used is v1.7.0
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <time.h>
#include "thrust/sort.h"

int main(int argc, char *argv[])
{
    int N = 16000000;
    double* myarr = new double[N];
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
    }
    std::cout << "-------------\n";
    clock_t start, stop;
    start = clock();
    std::sort(myarr, myarr+N);
    stop = clock();
    std::cout << "Time taken for sorting the array with STL is " << (stop-start)/(double)CLOCKS_PER_SEC;
    //--------------------------------------------
    srand(1);
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
        //std::cout << myarr[i] << std::endl;
    }
    start = clock();
    thrust::sort(myarr, myarr+N);
    stop = clock();
    std::cout << "------------------\n";
    std::cout << "Time taken for sorting the array with Thrust is " << (stop-start)/(double)CLOCKS_PER_SEC;
    return 0;
}

The device backend refers to the behavior of operations performed on a thrust::device_vector or a similar device reference. Thrust interprets the raw array/pointer you are passing as a host pointer, and performs host-based operations on it, which are not affected by the device backend setting.
There are a variety of ways to fix this. If you read the device backend documentation you will find general and OMP-specific examples. You could even specify a different host backend, which should also give the desired behavior (OMP usage) with your code as written, I think.
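As one concrete alternative (a minimal sketch, assuming Thrust 1.7+, where explicit execution policies are available), you can dispatch the call to the OpenMP backend directly, even with a raw host pointer:

#include <thrust/system/omp/execution_policy.h>
#include <thrust/sort.h>

// Sketch: force dispatch of this call to the OpenMP system,
// independent of how the pointer would otherwise be interpreted.
thrust::sort(thrust::omp::par, myarr, myarr + N);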
Once you fix this, you may be in for another surprise: thrust appears to sort the array quickly, but reports a very long execution time. I believe this is because (on Linux, anyway) clock() measures CPU time accumulated across all threads, so with multiple OMP threads in use it reports roughly the wall-clock time multiplied by the thread count.
The following code/sample run has those issues addressed, and seems to give me a ~3x speedup for 4 threads.
$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main(int argc, char *argv[])
{
    int N = 16000000;
    double* myarr = new double[N];
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
    }
    std::cout << "-------------\n";
    timeval t1, t2;
    gettimeofday(&t1, NULL);
    std::sort(myarr, myarr+N);
    gettimeofday(&t2, NULL);
    float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
    std::cout << "Time taken for sorting the array with STL is " << et << std::endl;
    //--------------------------------------------
    srand(1);
    for (int i = 0; i < N; ++i)
    {
        myarr[i] = (1.0*rand())/RAND_MAX;
        //std::cout << myarr[i] << std::endl;
    }
    // Wrapping the raw host pointer as a device pointer makes Thrust
    // dispatch the sort to the configured device backend (OMP here).
    thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
    gettimeofday(&t1, NULL);
    thrust::sort(darr, darr+N);
    gettimeofday(&t2, NULL);
    et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
    std::cout << "------------------\n";
    std::cout << "Time taken for sorting the array with Thrust is " << et << std::endl;
    return 0;
}
$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL is 1.31956
------------------
Time taken for sorting the array with Thrust is 0.468176
$
Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads: a number of factors can prevent an OMP code from scaling beyond a certain thread count. Sorting generally tends to be a memory-bound algorithm, so you will probably observe an increase until the memory subsystem is saturated, and then no further increase from additional cores. Depending on your system, you could be in this situation already, in which case you may see no improvement from OMP-style multithreading.
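One quick way to see where scaling stops on a given machine is to sweep the thread count with the same binary (a hypothetical session; actual timings will differ per system):

$ for t in 1 2 4 8 16; do echo "threads=$t"; OMP_NUM_THREADS=$t ./t592; done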

Related

Why is the run time shorter when I use a lock in a C++ program?

I am practising multithreaded programming with C++. When I add a std::lock_guard to the same code, its run time becomes shorter than before. That's amazing; why?
The lock version:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
2000000000
time = 5.71852s
The no-lock version is:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        // std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
1010269798
time = 10.765s
I'm using Ubuntu 18.04, and the g++ version is:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In my opinion the lock is an extra operation, so of course it should cost more time.
Can someone help me? Thanks.
Modifying a variable from multiple threads causes undefined behaviour. This means the compiler and the processor are free to do whatever they want in this case (like removing the loop, for example, or not reloading the variable from memory, since it is not supposed to be modified by another thread in the first place). As a result, studying the performance of this case is not really relevant.
Assuming the compiler does not perform any such (allowed) advanced optimizations, the program contains a race condition. It is certainly slower because of a cache-line bouncing effect: multiple cores compete for the same cache line, and moving it from one core to another is very slow compared to incrementing the variable from the L1 cache (this is almost certainly the overhead you see). Indeed, on standard x86-64 platforms like mainstream Intel processors, moving a contended cache line from one core to another means invalidating the copies in the other cores' L1/L2 caches and fetching it from the L3 cache, which is much slower than the L1 (lower throughput and much higher latency). Note that this behaviour depends on the target platform (mainly the processor, besides compiler optimizations), but most platforms work similarly. For more information, read this and that about cache-coherence protocols.
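For reference, here is a minimal data-race-free variant of the counter using std::atomic (a sketch; it removes the undefined behaviour, but the two threads still contend for the same cache line, so it remains much slower than the effectively serialized lock version):

#include <atomic>
#include <iostream>
#include <thread>

struct test {
    std::atomic<int> a{0};
    void add() {
        for (int i = 0; i < 1000000000; i++) {
            // Relaxed ordering is enough for a plain counter; the
            // increment itself is still an atomic read-modify-write.
            a.fetch_add(1, std::memory_order_relaxed);
        }
    }
};

int main() {
    test t;
    std::thread t1(&test::add, &t);
    std::thread t2(&test::add, &t);
    t1.join();
    t2.join();
    std::cout << t.a.load() << std::endl;  // always prints 2000000000
    return 0;
}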

Question about this error: ‘bdget’ was not declared in this scope

I am trying to figure out why I’m getting this error about ’bdget’ (on Linux Ubuntu, info below) and how to get rid of it. Maybe you could suggest some specific steps for me to try and/or ask me to provide additional info about the error. Thank you for any help.
error: ‘bdget’ was not declared in this scope
b_dev = bdget(st.st_rdev);
The code (a short sample/tester program put together quickly to demonstrate the problem) that produces this error is below. Some other info: when I comment out the ‘bdget’ line, the program (seen below) compiles; otherwise it doesn’t, and produces the error. I understand from other stackoverflow.com questions that using bdget requires #include <linux/fs.h>, which I’ve done here, as you can see in the code below.
I think this is my first question here at stackoverflow.com, and I haven’t yet learned what question-related actions I need to take about your answers (such as accepting them), so bear with me a bit.
Regards,
Yaman Aksu, PhD
The build used this g++ version (full path: /usr/bin/g++):
g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0 <snip>
and was simple as follows:
g++ -Wall -O3 tester.cpp -o tester
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/sysmacros.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <dirent.h>
#include <linux/fs.h>

#define FNABS_SIZE 1024
char prog[FNABS_SIZE];

using namespace std;

int main(int argc, char *argv[])
{
    int rv = 0;
    struct dirent *de;
    struct stat st;
    struct block_device *b_dev;
    char direc[FNABS_SIZE], fnabs[FNABS_SIZE], fnabs_bfile[FNABS_SIZE], hd_serial[NAME_MAX], hd_model[NAME_MAX];
    int status;
    unsigned int maj_rdev, min_rdev;
    ios::sync_with_stdio();
    strcpy(prog, argv[0]); argc--;
    strcpy(direc, "/dev/disk/by-id");
    for (DIR *p = opendir(direc); (de = readdir(p)); ) {
        if (de->d_type == DT_LNK) {
            sprintf(fnabs, "%s/%s", direc, de->d_name);
            cout << "[glk] Now running 'stat' on this: " << fnabs << endl;
            status = stat(fnabs, &st);
            cout << "[glk] 'stat' results:" << endl;
            cout << "[glk]\t return value:" << status << endl;
            cout << "[glk]\t st.st_rdev:" << st.st_rdev << endl;
            cout << "[glk]\t st.st_dev=" << st.st_dev << endl;
            cout << "[glk]\t st.st_ino=" << st.st_ino << endl;
            maj_rdev = major(st.st_rdev); min_rdev = minor(st.st_rdev);
            printf("[glk]\t major:minor = %u:%u\n", maj_rdev, min_rdev);
            if ((st.st_mode & S_IFMT) == S_IFBLK) {
                printf("[glk]\t This is a 'block device' !!\n");
            }
            b_dev = bdget(st.st_rdev);   // <-- the line that triggers the error
        }
    }
}

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the number of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();
    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable with data races" — as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    std::random_device make_seed;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                     std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as #zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

SDL2 Threads C++ pointer corruption

So, I have the following problem, which may seem pretty strange or too elementary. This code snippet demonstrates my problem.
#ifdef __cplusplus
#include <cstdlib>
#else
#include <stdlib.h>
#endif
#include "SDL2/SDL.h"
#include <iostream>
using namespace std;

int doSTH(void* data) {
    int* data2 = (int*)data;
    cout << data2 << endl;
    return 0;
}

int main() {
    SDL_Init(SDL_INIT_EVERYTHING);
    int* data = new int(2);
    cout << data << endl;
    SDL_CreateThread(doSTH, "sth", (void*)data);
    SDL_Delay(1);
    delete data;
    SDL_Quit();
}
Output is
0x2479f40
0x400c05
That means that somehow the function I call doesn't get the pointer I give it. Am I missing something?
I am using Linux Ubuntu 14.04, g++ 4.8 and Code::Blocks.
Please tell me if I should give any more info.
Thanks in advance.
Never mind; somehow the build of SDL2 was screwed up. I uninstalled libx11-dev, rebooted, then reinstalled libsdl2-dev, and now it works correctly.
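Independent of the broken SDL2 build: deleting data after a fixed 1 ms delay races with the worker thread. A safer pattern (a sketch using the standard SDL2 threading API) is to join the thread with SDL_WaitThread before freeing the data:

#include "SDL2/SDL.h"
#include <iostream>

int doSTH(void* data) {
    int* value = static_cast<int*>(data);
    std::cout << *value << std::endl;  // safe: the owner joins before freeing
    return 0;
}

int main() {
    SDL_Init(SDL_INIT_EVERYTHING);
    int* data = new int(2);
    SDL_Thread* thread = SDL_CreateThread(doSTH, "sth", data);
    SDL_WaitThread(thread, NULL);  // join before releasing the data
    delete data;
    SDL_Quit();
    return 0;
}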

Strange behaviour in OpenMP nested loop

In the following program I get different results (serial vs. OpenMP); what is the reason? At the moment I can only think that perhaps the loop is too "large" for the threads and I should write it in some other way, but I am not sure. Any hints?
Compilation: g++-4.2 -fopenmp main.c functions.c -o main_elec_gcc.exe
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

#define NRACK 64
#define NSTARS 1024

double mysumallatomic_serial(float rocks[NRACK][3], float moon[NSTARS][3], float qr[NRACK], float ql[NSTARS]) {
    int j, i;
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    for (j = 0; j < NRACK; j++) {
        for (i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            S2 += ql[i]*temp_div*qr[j];
        }
    }
    return S2;
}

double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3], float qr[NRACK], float ql[NSTARS]) {
    float temp_div = 0., temp_sqrt = 0.;
    float difx, dify, difz;
    float mod2x, mod2y, mod2z;
    double S2 = 0.;
    #pragma omp parallel for shared(S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            difx = rocks[j][0] - moon[i][0];
            dify = rocks[j][1] - moon[i][1];
            difz = rocks[j][2] - moon[i][2];
            mod2x = difx*difx;
            mod2y = dify*dify;
            mod2z = difz*difz;
            temp_sqrt = sqrt(mod2x + mod2y + mod2z);
            temp_div = 1/temp_sqrt;
            float myterm = ql[i]*temp_div*qr[j];
            #pragma omp atomic
            S2 += myterm;
        }
    }
    return S2;
}

int main(int argc, char *argv[]) {
    float rocks[NRACK][3], moon[NSTARS][3];
    float qr[NRACK], ql[NSTARS];
    int i, j;
    for (j = 0; j < NRACK; j++) {
        rocks[j][0] = j;
        rocks[j][1] = j+1;
        rocks[j][2] = j+2;
        qr[j] = j*1e-4 + 1e-3;
        //qr[j] = 1;
    }
    for (i = 0; i < NSTARS; i++) {
        moon[i][0] = 12000 + i;
        moon[i][1] = 12000 + i + 1;
        moon[i][2] = 12000 + i + 2;
        ql[i] = i*1e-3 + 1e-2;
        //ql[i] = 1;
    }
    printf(" serial: %f\n", mysumallatomic_serial(rocks, moon, qr, ql));
    printf(" openmp: %f\n", mysumallatomic(rocks, moon, qr, ql));
    return 0;
}
I think you should use a reduction instead of a shared variable, and remove the #pragma omp atomic, like:
#pragma omp parallel for reduction(+:S2)
It should also run faster, because there is no need for atomic operations, which are quite painful in terms of performance and thread synchronization.
UPDATE
You can also see some difference in results because of the order of operations: in floating-point arithmetic,
\sum_{i=1}^{100} x_i \neq \sum_{i=1}^{50} x_i + \sum_{i=51}^{100} x_i
You have data races on most of the temporary variables used in the parallel region: difx, dify, difz, mod2x, mod2y, mod2z, temp_sqrt, and temp_div should all be private. You can make them private with a private clause on the parallel for directive, or simply declare them inside the loop, as in the sketch below.
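Putting both answers together, a corrected version of the parallel function might look like this (a sketch; declaring the temporaries inside the loop body makes them private to each thread automatically):

double mysumallatomic(float rocks[NRACK][3], float moon[NSTARS][3], float qr[NRACK], float ql[NSTARS]) {
    double S2 = 0.;
    // reduction(+:S2) gives every thread its own partial sum and
    // combines them once at the end, so no atomic update is needed.
    #pragma omp parallel for reduction(+:S2)
    for (int j = 0; j < NRACK; j++) {
        for (int i = 0; i < NSTARS; i++) {
            // Loop-local variables are private per thread by construction.
            float difx = rocks[j][0] - moon[i][0];
            float dify = rocks[j][1] - moon[i][1];
            float difz = rocks[j][2] - moon[i][2];
            float temp_sqrt = sqrtf(difx*difx + dify*dify + difz*difz);
            S2 += ql[i]*qr[j]/temp_sqrt;
        }
    }
    return S2;
}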
