Multithreading in MSVC is showing no improvement

I am trying to run the following code to test the speedup I can get on my system and to check that my code is actually multi-threading. Using gcc on Linux, I get a factor of about 7. Using Visual Studio on Windows, I get no improvement. In MSVS 2012 I set /Qpar and /MD ... am I missing something? What am I doing wrong?
#include <iostream>
#include <thread>
#include <chrono>

#ifdef WIN32
#include <windows.h>
double getTime() {
    // Wall-clock time in milliseconds from the high-resolution performance counter
    LARGE_INTEGER freq, val;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&val);
    return 1000 * (double)val.QuadPart / (double)freq.QuadPart;
}
#define clock_type double
#else
#include <ctime>
#define clock_type std::clock_t
#define getTime std::clock
#endif

static const int num_threads = 10;

// This function will be called from a thread
void f()
{
    volatile double d = 0;
    for (int n = 0; n < 10000; ++n)
        for (int m = 0; m < 10000; ++m)
            d += d * n * m;
}

int main()
{
    clock_type c_start = getTime();
    auto t_start = std::chrono::high_resolution_clock::now();

    std::thread t[num_threads];
    // Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(f);
    }
    // Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }

    clock_type c_end = getTime();
    auto t_end = std::chrono::high_resolution_clock::now();

    std::cout << "CPU time used: "
              << 1000.0 * (c_end - c_start) / CLOCKS_PER_SEC
              << " ms\n";
    std::cout << "Wall clock time passed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count()
              << " ms\n";
    std::cout << "Acceleration factor: "
              << (1000.0 * (c_end - c_start) / CLOCKS_PER_SEC)
                 / std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_start).count()
              << "\n";
    return 0;
}
The output using MSVS is:
CPU time used: 1003.64 ms
Wall clock time passed: 998 ms
Acceleration factor: 1.00565
In Linux, I get:
CPU time used: 5264.83 ms
Wall clock time passed: 698 ms
Acceleration factor: 7.54274
EDIT 1: increased size of matrix in f() from 1000 to 10000.
EDIT 2: added getTime() function using QueryPerformanceCounter, and included #define's to switch between std::clock() and getTime()

On MSVC, clock() returns wall-clock time rather than per-process CPU time, and is thus not standards compliant. That is why the reported "CPU time used" matches the wall time and the acceleration factor comes out at about 1. (The QueryPerformanceCounter-based getTime() added in EDIT 2 also measures elapsed wall time, so it has the same problem.)
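If you want actual per-process CPU time on Windows (so the acceleration factor can be computed the same way as with std::clock() on Linux), GetProcessTimes reports the user and kernel time consumed by the process; a minimal sketch, not taken from the original answer:

#ifdef _WIN32
#include <windows.h>

// CPU time (user + kernel) consumed so far by this process, in milliseconds.
double getCpuTimeMs() {
    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime);
    ULARGE_INTEGER k, u;
    k.LowPart = kernelTime.dwLowDateTime;  k.HighPart = kernelTime.dwHighDateTime;
    u.LowPart = userTime.dwLowDateTime;    u.HighPart = userTime.dwHighDateTime;
    // FILETIME counts 100-nanosecond intervals; 10,000 of them make one millisecond.
    return (double)(k.QuadPart + u.QuadPart) / 10000.0;
}
#endif

Using this for c_start/c_end (and dropping the division by CLOCKS_PER_SEC on that path) should make the MSVC "CPU time used" comparable to the Linux numbers.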

Related

Why the run time is shorter when I use a lock in a c++ program?

I am practising multithreaded programming with C++, and when I use std::lock_guard in the code, the run time becomes shorter than without it. That's surprising; why?
The lock version:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>

using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
2000000000
time = 5.71852s
The no-lock version is:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>

using namespace std;

class test {
    std::mutex m;
    int a;
public:
    test() : a(0) {}
    void add() {
        // std::lock_guard<std::mutex> guard(m);
        for (int i = 0; i < 1e9; i++) {
            a++;
        }
    }
    void print() {
        std::cout << a << std::endl;
    }
};

int main() {
    test t;
    auto start = clock();
    std::thread t1(&test::add, ref(t));
    std::thread t2(&test::add, ref(t));
    t1.join();
    t2.join();
    auto end = clock();
    t.print();
    cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
and the output is:
1010269798
time = 10.765s
I'm using Ubuntu 18.04; the g++ version is:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In my opinion, the lock is an extra operation, so it should cost more time, of course.
Can someone explain this? Thanks.
Modifying a variable from multiple threads without synchronization causes undefined behaviour. This means the compiler and the processor are free to do whatever they want in this case (for example, removing the loop entirely, or not reloading the variable from memory, since it is not supposed to be modified by another thread in the first place). As a result, studying the performance of this case is not really meaningful.
Assuming the compiler does not perform any (allowed) advanced optimizations, the program contains a race condition. It is almost certainly slower because of a cache-line bouncing effect: multiple cores compete for the same cache line, and moving it from one core to another is very slow compared to incrementing the variable from the L1 cache (this is certainly the overhead you see). Indeed, on standard x86-64 platforms like mainstream Intel processors, moving a contended cache line from one core to another means invalidating the copies in the other cores' L1/L2 caches and fetching it from the L3 cache, which is much slower than the L1 (lower throughput and much higher latency). Note that this behaviour depends on the target platform (mainly the processor, besides compiler optimizations), but most platforms work similarly. For more information, please read this and that about cache-coherence protocols.
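For illustration only (this is not from the original answer): one way to make the counter well-defined and avoid the cache-line bouncing is to let each thread accumulate in a thread-private variable and publish the result once at the end, for example with std::atomic:

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<long long> total{0};
    auto worker = [&total] {
        long long local = 0;          // thread-private, no cache-line contention
        for (int i = 0; i < 1000000000; i++)
            local++;
        total.fetch_add(local);       // one synchronized update per thread
    };
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << total << std::endl;  // prints 2000000000
    return 0;
}

This is both correct and fast, because the two threads no longer fight over the same cache line on every increment.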

Multiples threads running on one core instead of four depending on the OS

I am using Raspbian on Raspberry 3.
I need to divide my code in few blocks (2 or 4) and assign a thread per block to speed up calculations.
At the moment I am testing with simple loops (see the attached code) on one thread and then on 4 threads. The execution time on 4 threads is always 4 times longer, so it looks like these 4 threads are scheduled to run on the same CPU.
How can I assign each thread to run on a different CPU? Even 2 threads on 2 CPUs would make a big difference for me.
I even tried g++ 6 with no improvement, and using OpenMP in the code with "#pragma omp for" it still runs on one CPU.
I tried to run this code on Fedora Linux x86 and got the same behavior, but on Windows 8.1 with VS2015 I got different results: the time was the same on one thread and on 4 threads, so there it was running on different CPUs.
Do you have any suggestions?
Thank you.
#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <vector>

using namespace std;

float simd_dot0() {
    unsigned int i;
    unsigned long rezult;
    for (i = 0; i < 0xfffffff; i++) {
        rezult = i;
    }
    return rezult;
}

int main() {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::mutex iomutex;
    std::vector<std::thread> threads(num_cpus);

    cout << "Start Test 1 CPU" << endl;
    double t_start, t_end, scan_time;
    scan_time = 0;
    t_start = clock();
    simd_dot0();
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 1 CPU: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 1 CPU" << endl;

    cout << "Start Test 4 CPU" << endl;
    scan_time = 0;
    t_start = clock();
    for (unsigned i = 0; i < 4; ++i) {
        threads[i] = std::thread([&iomutex, i] {
            {
                simd_dot0();
                std::cout << "\nExecution time on CPU: "
                          << i << std::endl;
            }
            // Simulate important work done by the thread by sleeping for a bit...
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 4 CPUs: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 4 CPU" << endl;

    cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
    while (1);
    return 0;
}
Edit:
On Raspberry Pi 3 (Raspbian) I used g++ 4.9 and 6 with the following flags:
-std=c++11 -ftree-vectorize -Wl,--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3
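(Not part of the original question.) As a sketch of the thread-to-CPU assignment being asked about: on Linux a std::thread can be pinned to a specific core through its native handle with pthread_setaffinity_np; the work done by each thread below is just a placeholder:

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

int main() {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_cpus; ++i) {
        threads.emplace_back([] {
            volatile unsigned long r = 0;
            for (unsigned n = 0; n < 0xfffffff; ++n) r = n;  // placeholder work
        });
        // Pin thread i to core i (Linux-specific, not portable).
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        pthread_setaffinity_np(threads[i].native_handle(), sizeof(cpu_set_t), &cpuset);
    }
    for (auto &t : threads) t.join();
    return 0;
}

Note that the Linux scheduler normally spreads busy threads across cores by itself, and that clock() sums CPU time across all threads, so a clock()-based measurement will report roughly 4x the single-thread time even when the 4 threads run on 4 different cores.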

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases when increasing the thread count from 1 to the number of cores available; however, the very same code sees no performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }

    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue / iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count / threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;
    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable with data races" — as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);

    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }

    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();
    return rvalue / iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    std::random_device make_seed;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count / threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                     std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s), as @zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

zLib transparent write mode "wT" performance degradation

I would expect zlib's transparent mode ( gzprintf() ) to be as fast as regular fprintf(). I found that zlib's gzprintf() with "wT" is about 2.5x slower than fprintf(). Is there any workaround for this performance issue?
Details:
I'm using libz.so.1.2.8 on Linux (Fedora 22, kernel 4.0.5, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) to provide an output file compression option for my event trace collector. To keep legacy compatibility I need a transparent file writing mode.
As I understand it, the option "T" in gzopen() allows writing files with no compression and no gzip header record.
The problem is performance: transparent mode is ~2.5x slower than plain standard fprintf().
Here is a quick test result (values are TSC cycles):
zLib]$ ./zlib_transparent
Performance fprintf vs gzprintf (transparent):
fprintf 22883026324
zLib transp 62305122876
ratio 2.72277
The source for this test:
#include <stdio.h>
#include <zlib.h>
#include <iostream>
#include <sstream>
#include <iomanip>

#define NUMITERATIONS 10000000

static double buffer[NUMITERATIONS];

static __inline__ unsigned long long rdtsc(void) {
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

long long test_fprintf(double *buffer) {
    long long t = rdtsc();
#ifdef USE_FPRINTF
    double tmp = 0;
    FILE *file = fopen("fprintf_file.txt", "w");
    for (int i = 0; i < NUMITERATIONS; ++i) {
        fprintf(file, "[%f:%f]\n", buffer[i], buffer[i] - tmp);
        tmp = buffer[i] + i;
    }
    fclose(file);
#endif
    return rdtsc() - t;
}

long long test_zlib_transparent(double *buffer) {
    long long t = rdtsc();
#ifdef USE_ZLIB
    double tmp = 0;
    gzFile file = gzopen("zlib_file.txt.gz", "wT");
    for (int i = 0; i < NUMITERATIONS; ++i) {
        gzprintf(file, "[%f:%f]\n", buffer[i], buffer[i] - tmp);
        tmp = buffer[i] + i;
    }
    gzclose(file);
#endif
    return rdtsc() - t;
}

int main() {
    std::cout << "Performance fprintf vs gzprintf (transparent):" << std::endl;
    long long dPrint = test_fprintf(buffer);
    std::cout << "    fprintf " << dPrint << std::endl;
    long long dStream = test_zlib_transparent(buffer);
    std::cout << "zLib transp " << dStream << std::endl;
    std::cout << "      ratio " << double(dStream) / double(dPrint) << std::endl;
    return 0;
}
Build:
g++ -g -O3 -DUSE_ZLIB=1 -DUSE_FPRINTF=1 zlib_transparent.cpp -o zlib_transparent -lz
Thank you
Sergey
My bad. (I wrote gzprintf().)
write() is being called too often. You will get approximately the same performance as zlib if you replace fprintf() with snprintf() and write().
I will improve this in the next version of zlib. If you would like to try it, apply this diff. I don't know how it will perform on Linux, but on Mac OS X, gzprintf() in transparent mode is now 10% faster than fprintf(). (Wasn't expecting that.)
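To illustrate the write() frequency point above (not from the original answer; the file name is made up), this is roughly the snprintf()-plus-write() pattern being described: one write() system call per formatted record, which is effectively what gzprintf() in transparent mode does, while fprintf() is faster mainly because stdio buffers many records before each underlying write():

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

// One write() system call per formatted record -- roughly what gzprintf()
// does in transparent mode, and much slower than buffered fprintf().
static void write_records(const double *buffer, int n) {
    int fd = open("snprintf_file.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    double tmp = 0;
    char line[128];
    for (int i = 0; i < n; ++i) {
        int len = snprintf(line, sizeof(line), "[%f:%f]\n", buffer[i], buffer[i] - tmp);
        write(fd, line, len);
        tmp = buffer[i] + i;
    }
    close(fd);
}

int main() {
    static double data[1000];   // zero-initialized sample data
    write_records(data, 1000);
    return 0;
}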

Interpreting time command output on a multi threaded program

I have a multi-threaded program and I am measuring the time taken, starting before all the pthread_create calls and ending after all the pthread_join calls.
Now I find that this time, let's call it X (shown below as "Done in Xms"), actually matches the user + sys time in the time output. In my app the number argument to a.out controls how many threads to spawn: ./a.out 1 spawns 1 pthread and ./a.out 2 spawns 2 threads, where each thread does the same amount of work.
I was expecting X to be the real time instead of the user + sys time. Can someone please tell me why this is not so? At least the real time staying flat means my app is indeed running in parallel without any locking between threads.
[jithin@whatsoeverclever tests]$ time ./a.out 1
Done in 320ms
real    0m0.347s
user    0m0.300s
sys     0m0.046s
[jithin@whatsoeverclever tests]$ time ./a.out 2
Done in 450ms
real    0m0.266s
user    0m0.383s
sys     0m0.087s
[jithin@whatsoeverclever tests]$ time ./a.out 3
Done in 630ms
real    0m0.310s
user    0m0.532s
sys     0m0.105s
Code
int main(int argc, char **argv) {
    //Read the words
    getWords();

    //Set number of words to use
    int maxWords = words.size();
    if (argc > 1) {
        int numWords = atoi(argv[1]);
        if (numWords > 0 && numWords < maxWords) maxWords = numWords;
    }

    //Init model
    model = new Model(MODEL_PATH);

    pthread_t *threads = new pthread_t[maxWords];
    pthread_attr_t attr;
    void *status;

    // Initialize and set thread joinable
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    int rc;
    clock_t startTime = clock();
    for (unsigned i = 0; i < maxWords; i++) {
        //create thread
        rc = pthread_create(&threads[i], NULL, processWord, (void *)&words[i]);
        if (rc) {
            cout << "Error:unable to create thread: " << i << "," << rc << endl;
            exit(-1);
        }
    }

    // free attribute and wait for the other threads
    pthread_attr_destroy(&attr);
    for (unsigned i = 0; i < maxWords; i++) {
        rc = pthread_join(threads[i], &status);
        if (rc) {
            cout << "Error:unable to join thread: " << i << "," << rc << endl;
            exit(-1);
        }
    }
    clock_t endTime = clock();

    float diff = (((float)endTime - (float)startTime) / 1000000.0F) * 1000;
    cout << "Done in " << diff << "ms\n";

    delete[] threads;
    delete model;
}
The clock function is specifically documented to return the processor time used by the process, which is summed across all of its threads. If you want to measure elapsed wall time, it's not the right function.
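A minimal sketch of measuring elapsed wall time instead, using std::chrono::steady_clock (the threads here are only a stand-in for the pthread_create/pthread_join loops in the question):

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    auto startTime = std::chrono::steady_clock::now();

    // Stand-in for the question's pthread_create/pthread_join loops.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            volatile double d = 0;
            for (int n = 0; n < 100000000; ++n) d += n;
        });
    for (auto &w : workers)
        w.join();

    auto endTime = std::chrono::steady_clock::now();
    double diffMs = std::chrono::duration<double, std::milli>(endTime - startTime).count();
    std::cout << "Done in " << diffMs << "ms\n";  // wall-clock time, unlike clock()
    return 0;
}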
