Why is the run time shorter when I use a lock in a C++ program? - multithreading

I am practicing multithreaded programming in C++. When I add a std::lock_guard to otherwise identical code, the run time becomes shorter than before. That's surprising. Why?
The lock version:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;
class test {
std::mutex m;
int a;
public:
test() :a(0) {}
void add() {
std::lock_guard<std::mutex> guard(m);
for(int i = 0; i < 1e9; i++) {
a++;
}
}
void print() {
std::cout << a << std::endl;
}
};
int main() {
test t;
auto start = clock();
std::thread t1(&test::add, ref(t));
std::thread t2(&test::add, ref(t));
t1.join();
t2.join();
auto end = clock();
t.print();
cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
return 0;
}
and the output is:
2000000000
time = 5.71852s
the no lock version is:
#include <iostream>
#include <thread>
#include <mutex>
#include <ctime>
using namespace std;
class test {
std::mutex m;
int a;
public:
test() :a(0) {}
void add() {
// std::lock_guard<std::mutex> guard(m);
for(int i = 0; i < 1e9; i++) {
a++;
}
}
void print() {
std::cout << a << std::endl;
}
};
int main() {
test t;
auto start = clock();
std::thread t1(&test::add, ref(t));
std::thread t2(&test::add, ref(t));
t1.join();
t2.join();
auto end = clock();
t.print();
cout << "time = " << double(end - start) / CLOCKS_PER_SEC << "s" << endl;
return 0;
}
and the output is:
1010269798
time = 10.765s
I'm using Ubuntu 18.04, and the g++ version is:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
In my opinion, the lock is an extra operation, so it should of course cost more time.
Maybe someone can help me? Thanks.

Modifying a variable from multiple threads without synchronization causes undefined behaviour. This means the compiler and the processor are free to do whatever they want in this case (like removing the loop, for example, or not reloading the variable from memory, since it is not supposed to be modified by another thread in the first place). As a result, studying the performance of this case is not really relevant.
Assuming the compiler does not perform any (allowed) advanced optimizations, the program contains a race condition. It is certainly slower because of a cache-line bouncing effect: multiple cores compete for the same cache line, and moving it from one core to another is very slow compared to incrementing the variable in the L1 cache (this is certainly the overhead you see). In the lock version, the guard is held for the whole loop, so the two threads run one after the other and each keeps the cache line to itself while it counts. Indeed, on standard x86-64 platforms like mainstream Intel processors, moving a cache line from one core to another means invalidating the copies held in the L1/L2 caches of the other cores and fetching it from the L3 cache, which is much slower than the L1 (lower throughput and much higher latency). Note that this behaviour depends on the target platform (mainly the processor, besides compiler optimizations), but most platforms work similarly. For more information please read this and that about cache-coherence protocols.

Related

What is the influence of thread competition?

As you can see, when I remove mt.lock() and mt.unlock(), the result is smaller than 50000.
Why? What actually happens? I would be very grateful if you could explain it to me.
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
using namespace std;
class counter{
public:
mutex mt;
int value;
public:
counter():value(0){}
void increase()
{
//mt.lock();
value++;
//mt.unlock();
}
};
int main()
{
counter c;
vector<thread> threads;
for(int i=0;i<5;++i){
threads.push_back(thread([&]()
{
for(int i=0;i<10000;++i){
c.increase();
}
}));
}
for(auto& t:threads){
t.join();
}
cout << c.value <<endl;
return 0;
}
++ is actually several operations: reading the value, incrementing it, and writing the result back. Since it isn't an atomic operation, multiple threads operating in the same region of code will get mixed up.
As an example, consider three threads operating in the same region without any locking:
Threads 1 and 2 read value as 999
Thread 1 computes the incremented value as 1000 and updates the variable
Thread 3 reads 1000, increments to 1001 and updates the variable
Thread 2 computes the incremented value as 999 + 1 = 1000 and overwrites thread 3's work with 1000
Now if you were using something like the "fetch-and-add" instruction, which is atomic, you wouldn't need any locks. See fetch_add

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the amount of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);
double generate_randn(uint64_t iterations)
{
// Print "S" when a thread starts
std::cout << "S";
std::cout.flush();
double rvalue = 0;
for (int i = 0; i < iterations; i++)
{
rvalue += randn(gen);
}
// Print "F" when a thread finishes
std::cout << "F";
std::cout.flush();
return rvalue/iterations;
}
int main(int argc, char *argv[])
{
if (argc < 2)
return 0;
uint64_t count = 100000000;
uint32_t threads = std::atoi(argv[1]);
double total = 0;
std::vector<std::future<double>> futures;
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
// Start timing
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads; i++)
{
// Start async tasks
futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
}
for (auto &future : futures)
{
// Wait for tasks to finish
future.wait();
total += future.get();
}
// End timing
t2 = std::chrono::high_resolution_clock::now();
// Take the average of the threads' results
total /= threads;
std::cout << std::endl;
std::cout << total << std::endl;
std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable without data races", as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint> // uint64_t, uint32_t
#include <cstdlib> // std::atoi
static double generate_randn(uint64_t iterations, unsigned int seed)
{
// Print "S" when a thread starts
std::cout << "S";
std::cout.flush();
std::default_random_engine gen(seed);
std::normal_distribution<double> randn(0.0, 1.0);
double rvalue = 0;
for (int i = 0; i < iterations; i++)
{
rvalue += randn(gen);
}
// Print "F" when a thread finishes
std::cout << "F";
std::cout.flush();
return rvalue/iterations;
}
int main(int argc, char *argv[])
{
if (argc < 2)
return 0;
uint64_t count = 100000000;
uint32_t threads = std::atoi(argv[1]);
double total = 0;
std::vector<std::future<double>> futures;
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
std::random_device make_seed;
// Start timing
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads; i++)
{
// Start async tasks
futures.push_back(std::async(std::launch::async,
generate_randn,
count/threads,
make_seed()));
}
for (auto &future : futures)
{
// Wait for tasks to finish
future.wait();
total += future.get();
}
// End timing
t2 = std::chrono::high_resolution_clock::now();
// Take the average of the threads' results
total /= threads;
std::cout << '\n' << total
<< "\nFinished in "
<< std::chrono::duration_cast<
std::chrono::milliseconds>(t2 - t1).count()
<< " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as @zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

Using thrust with openmp: no substantial speed up obtained

I am interested in porting a code I had written using mostly the Thrust GPU library to multicore CPUs. Thankfully, the website says that thrust code can be used with threading environments such as OpenMP / Intel TBB.
I wrote a simple code below for sorting a large array to see the speedup using a machine which can support up to 16 OpenMP threads.
The timings obtained on this machine for sorting a random array of size 16 million are
STL : 1.47 s
Thrust (16 threads) : 1.21 s
There seems to be barely any speed-up. I would like to know how to get a substantial speed-up for sorting arrays using OpenMP like I do with GPUs.
The code is below (the file sort.cu). Compilation was performed as follows:
nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
The NVCC version is 5.5
The Thrust library version being used is v1.7.0
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <time.h>
#include "thrust/sort.h"
int main(int argc, char *argv[])
{
int N = 16000000;
double* myarr = new double[N];
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
}
std::cout << "-------------\n";
clock_t start,stop;
start=clock();
std::sort(myarr,myarr+N);
stop=clock();
std::cout << "Time taken for sorting the array with STL is " << (stop-start)/(double)CLOCKS_PER_SEC;
//--------------------------------------------
srand(1);
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
//std::cout << myarr[i] << std::endl;
}
start=clock();
thrust::sort(myarr,myarr+N);
stop=clock();
std::cout << "------------------\n";
std::cout << "Time taken for sorting the array with Thrust is " << (stop-start)/(double)CLOCKS_PER_SEC;
return 0;
}
The device backend refers to the behavior of operations performed on a thrust::device_vector or similar reference. Thrust interprets the array/pointer you are passing it as a host pointer, and performs host-based operations on it, which are not affected by the device backend setting.
There are a variety of ways to fix this issue. If you read the device backend documentation you will find general examples and omp-specific examples. You could even specify a different host backend which should have the desired behavior (OMP usage) with your code, I think.
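Once you fix this, you'll get an additional result surprise, perhaps: thrust appears to sort the array quickly, but reports a very long execution time. I believe this is due (on linux, anyway) to the clock() function being affected by the number of OMP threads in use: clock() reports the processor time consumed by the whole process, so time spent in parallel on several threads is added together rather than measured as elapsed time.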
The following code/sample run has those issues addressed, and seems to give me a ~3x speedup for 4 threads.
$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
int main(int argc, char *argv[])
{
int N = 16000000;
double* myarr = new double[N];
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
}
std::cout << "-------------\n";
timeval t1, t2;
gettimeofday(&t1, NULL);
std::sort(myarr,myarr+N);
gettimeofday(&t2, NULL);
float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
std::cout << "Time taken for sorting the array with STL is " << et << std::endl;;
//--------------------------------------------
srand(1);
for (int i = 0; i < N; ++i)
{
myarr[i] = (1.0*rand())/RAND_MAX;
//std::cout << myarr[i] << std::endl;
}
thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
gettimeofday(&t1, NULL);
thrust::sort(darr,darr+N);
gettimeofday(&t2, NULL);
et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);
std::cout << "------------------\n";
std::cout << "Time taken for sorting the array with Thrust is " << et << std::endl ;
return 0;
}
$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL is 1.31956
------------------
Time taken for sorting the array with Thrust is 0.468176
$
Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads. There may be a number of factors which prevent an OMP code from scaling beyond a certain number of threads. Sorting generally tends to be a memory-bound algorithm, so you will probably observe an increase until you have saturated the memory subsystem, and then no further increase from additional cores. Depending on your system, it's possible you could be in this situation already, in which case you may not see any improvement from OMP style multithreading.

C++ 11 std::thread strange behavior

I am experimenting a bit with std::thread and C++11, and I am encountering strange behaviour.
Please have a look at the following code:
#include <cstdlib>
#include <thread>
#include <vector>
#include <iostream>
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i);
count = i;
}
class A {
public:
A(const size_t x) : x_(x) {}
size_t sum_up(const size_t num_threads) const {
size_t i;
std::vector<std::thread> threads;
std::vector<size_t> data_vector;
for (i = 0; i < num_threads; ++i) {
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
}
std::cout << "Threads started ...\n";
for (i = 0; i < num_threads; ++i)
threads[i].join();
size_t sum = 0;
for (i = 0; i < num_threads; ++i)
sum += data_vector[i];
return sum;
}
private:
const size_t x_;
};
int main(int argc, char* argv[]) {
const size_t x = atoi(argv[1]);
const size_t num_threads = atoi(argv[2]);
A a(x);
std::cout << a.sum_up(num_threads) << std::endl;
return 0;
}
The main idea here is that I want to specify a number of threads which do independent computations (in this case, simple increments).
After all threads are finished, the results should be merged in order to obtain an overall result.
Just to clarify: this is only for testing purposes, to help me understand how C++11 threads work.
However, when compiling this code using the command
g++ -o threads threads.cpp -pthread -O0 -std=c++0x
on an Ubuntu box, I get very strange behaviour when I execute the resulting binary.
For example:
$ ./threads 1000 4
Threads started ...
Segmentation fault (core dumped)
(should yield the output: 4000)
$ ./threads 100000 4
Threads started ...
200000
(should yield the output: 400000)
Does anybody have an idea what is going on here?
Thank you in advance!
Your code has many problems (see even thread_sum_up for about 2-3 bugs) but the main bug I found by glancing at your code is here:
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
See, when you push_back into a vector (I'm talking about data_vector), it can move all previous data around in memory. But you take a reference to a cell for your thread, and then push back again, which makes the previous reference invalid.
This will cause you to crash.
For an easy fix - add data_vector.reserve(num_threads); just after creating it.
Edit at your request - some bugs in thread_sum_up
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i); // see that last ';' there? means this loop is empty. it shouldn't be there
count = i; // You're just setting count to be i. why do that in a loop? Did you mean +=?
}
The cause of your crash might be that std::ref(data_vector[i]) is invalidated by the next push_back in data_vector. Since you know the number of threads, do a data_vector.reserve(num_threads) before you start spawning off the threads to keep the references from being invalidated.
As you resize the vector with the calls to push_back, it is likely to have to reallocate the storage space, causing the references to the contained values to be invalidated. This causes the thread to write to non-allocated memory, which is undefined behavior.
Your options are to pre-allocate the size you need (vector::reserve is one option), or choose a different container.

Getting stack traces on Unix systems, automatically

What methods are there for automatically getting a stack trace on Unix systems? I don't mean just getting a core file or attaching interactively with GDB, but having a SIGSEGV handler that dumps a backtrace to a text file.
Bonus points for the following optional features:
Extra information gathering at crash time (eg. config files).
Email a crash info bundle to the developers.
Ability to add this in a dlopened shared library
Not requiring a GUI
FYI,
the suggested solution (using backtrace_symbols in a signal handler) is dangerously broken. DO NOT USE IT.
Yes, backtrace and backtrace_symbols will produce a backtrace and translate it to symbolic names, however:
backtrace_symbols allocates memory using malloc and you use free to free it. If you're crashing because of memory corruption, your malloc arena is very likely to be corrupt and cause a double fault.
malloc and free protect the malloc arena with a lock internally. You might have faulted in the middle of a malloc/free with the lock taken, which will cause these functions or anything that calls them to deadlock.
You use puts, which uses the standard stream, which is also protected by a lock. If you faulted in the middle of a printf, you once again have a deadlock.
On 32-bit platforms (e.g. your normal PC of 2 years ago), the kernel will plant a return address to an internal glibc function instead of your faulting function in your stack, so the single most important piece of information you are interested in, namely in which function the program faulted, will actually be corrupted on those platforms.
So, the code in the example is the worst kind of wrong - it LOOKS like it's working, but it will really fail you in unexpected ways in production.
BTW, interested in doing it right? check this out.
Cheers,
Gilad.
If you are on systems with the BSD backtrace functionality available (Linux, OS X 10.5, BSD of course), you can do this programmatically in your signal handler.
For example (backtrace code derived from IBM example):
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
void sig_handler(int sig)
{
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++)
{
puts(symbols[i]);;
}
free(symbols);
signal(sig, &sig_handler);
}
void h()
{
kill(0, SIGSEGV);
}
void g()
{
h();
}
void f()
{
g();
}
int main(int argc, char ** argv)
{
signal(SIGSEGV, &sig_handler);
f();
}
Output:
0 a.out 0x00001f2d sig_handler + 35
1 libSystem.B.dylib 0x95f8f09b _sigtramp + 43
2 ??? 0xffffffff 0x0 + 4294967295
3 a.out 0x00001fb1 h + 26
4 a.out 0x00001fbe g + 11
5 a.out 0x00001fcb f + 11
6 a.out 0x00001ff5 main + 40
7 a.out 0x00001ede start + 54
This doesn't get bonus points for the optional features (except not requiring a GUI), however, it does have the advantage of being very simple, and not requiring any additional libraries or programs.
Here is an example of how to get some more info using a demangler. As you can see this one also logs the stacktrace to file.
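#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <cstdlib>    // free
#include <cxxabi.h>   // abi::__cxa_demangle
#include <execinfo.h> // backtrace, backtrace_symbols
#include <signal.h>   // signal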
void sig_handler(int sig)
{
std::stringstream stream;
void * array[25];
int nSize = backtrace(array, 25);
char ** symbols = backtrace_symbols(array, nSize);
for (int i = 0; i < nSize; i++) { // nSize is the frame count returned by backtrace
int status;
char *realname;
std::string current = symbols[i];
size_t start = current.find("(");
size_t end = current.find("+");
realname = NULL;
if (start != std::string::npos && end != std::string::npos) {
std::string symbol = current.substr(start+1, end-start-1);
realname = abi::__cxa_demangle(symbol.c_str(), 0, 0, &status);
}
if (realname != NULL)
stream << realname << std::endl;
else
stream << symbols[i] << std::endl;
free(realname);
}
free(symbols);
std::cerr << stream.str();
std::ofstream file("/tmp/error.log");
if (file.is_open()) {
if (file.good())
file << stream.str();
file.close();
}
signal(sig, &sig_handler);
}
Derek's solution is probably the best, but here's an alternative anyway:
Recent Linux kernel versions allow you to pipe core dumps to a script or program. You could write a script to catch the core dump, collect any extra information you need and mail everything back.
This is a global setting though, so it'd apply to any crashing program on the system. It will also require root rights to set up.
It can be configured through the /proc/sys/kernel/core_pattern file. Set that to something like ' | /home/myuser/bin/my-core-handler-script'.
The Ubuntu people use this feature as well.

Resources