What is the difference between two join statements in the code? - multithreading

In the below code, there are two joins (of course one is commented). I would like to know what is the difference between
when join is executed before the loop and when join is executed after the loop?
#include <iostream>
#include <thread>
using namespace std;
void ThreadFunction();
int main()
{
thread ThreadFunctionObj(ThreadFunction);
//ThreadFunctionObj.join();
for (int j=0;j<10;++j)
{
cout << "\tj = " << j << endl;
}
ThreadFunctionObj.join();
return 0;
}
void ThreadFunction()
{
for (int i=0;i<10;++i)
{
cout << "i = " << i << endl;
}
}

A join() on a thread waits for it to finish execution, your code doesn't continue as long as the thread isn't done. As such, calling join() right after starting a new thread defeats the purpose of multi-threading, as it would be the same as executing those two for loops in a serial way. Calling join() after your loop in main() ensures that both for loops execute in parallel, meaning that at the end of your for loop in your main(), you wait for the ThreadFunction() loop to be done too. This is the equivalent of you and a friend going out to eat, for example. You both start eating at relatively the same time, but the first one to finish still has to wait for the other (might not be the best example, but hope it does the job).
Hope it helps

Related

Reading and writing different files in their own threads from my main loop in C++11

I am trying to understand, then, write some code that has to read from, and write to many different files and do so from the main loop of my application. I am hoping to use the C++11 model present in VS 2013.
I don't want to stall the main loop so I am investigating spinning off a thread each time a request to write or read a file is generated.
I've tried many things including using the async keyword which sounds promising. I boiled down some code to a simple example:
#include <future>
#include <iostream>
#include <string>
bool write_file(const std::string filename)
{
std::cout << "write_file: filename is " << filename << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(2000));
std::cout << "write_file: written" << std::endl;
return true;
}
int main(int argc, char* argv[])
{
const std::string filename = "foo.txt";
auto write = std::async(std::launch::async, write_file, filename);
while (true)
{
std::cout << "working..." << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(100));
std::cout << "write result is " << write.get() << std::endl;
}
}
I'm struggling to understand the basics but my expectation would be that this code would constantly print "working..." and interspersed in the output would be the write_file start and end messages. Instead, I see that the write_file thread seems to block the main loop output until the timer expires.
I realize I need to also consider mutex/locking on the code to actually write the file but I would like to understand this bit first.
Thank you if you can point me in the right direction.
Molly.
write.get() will wait for the async task to finish. You want to use wait_for() instead:
do {
std::cout << "working...\n";
} while(write.wait_for(std::chrono::milliseconds(100)) != std::future_status::ready);
std::cout << "write result is " << write.get() << "\n";

boost::thread not updating global variable

I am using a wrapper function in an external software to start a new thread, which updates a global variable, but yet this seems invisible to the main thread. I cant call join(), not to block the main thread and crash the software. boost::async, boost::thread and boost::packaged_task all behave the same way.
uint32 *Dval;
bool hosttask1()
{
while(*Dval<10)
{
++*Dval;
PlugIn::gResultOut << " within thread global value: " << *Dval << std::endl;
Sleep(500);
}
return false;
}
void SU_HostThread1(uint32 *value)
{
Dval = value;
*Dval = 2;
PlugIn::gResultOut << " before thread: " << *value << " before thread global: " << *Dval << std::endl;
auto myFuture = boost::async(boost::launch::async,&hosttask1);
//boost::thread thread21 = boost::thread(&hosttask1);
//boost::packaged_task<bool> pt(&hosttask1);
//boost::thread thread21 = boost::thread(boost::move(pt));
}
When I call the function:
number a=0
su_hostthread1(a)
sleep(2) //seconds
result(" function returned "+a+" \n")
OUTPUT:
before thread value: 2 before thread global value: 2
within thread global value: 3
within thread global value: 4
within thread global value: 5
within thread global value: 6
function returned 2
within thread global value: 7
within thread global value: 8
within thread global value: 9
within thread global value: 10
Any ideas?
Thanks in advance!
If you share data between threads, you must syncronize access to that data. The two possible ways are a mutex protecting said data and atomic operations. The simple reason is that caches and read/write reordering (both by CPU and compiler) exist. This is a complex topic though and it's nothing that can be explained in an answer here, but there are a few good books out there and also a bunch of code that gets it right.
The following code correctly reproduces what I intend to do. Mainly, the thread updates a global variable which the main thread correctly observes.
#include "stdafx.h"
#include <iostream>
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
unsigned long *dataR;
bool hosttask1()
{
bool done = false;
std::cout << "In thread global value: " << *dataR << "\n"; //*value11 << *dataL <<
unsigned long cc = 0;
boost::mutex m;
while (!done)
{
m.lock();
*dataR = cc;
m.unlock();
cc++;
std::cout << "In thread loop global value: "<< *dataR << "\n";
if (cc==5) done = true;
}
return done;
}
void SU_HostThread1(unsigned long *value)
{
dataR = value;
std::cout << "Before thread value: " << *value << " Before thread global value: " << *dataR << "\n"; //*value11 << *dataL <<
auto myFuture = boost::async(boost::launch::async, &hosttask1);
return;
}
int main()
{
unsigned long value =1;
unsigned long *value11;
value11 = &value;
SU_HostThread1(value11);
boost::this_thread::sleep(boost::posix_time::seconds(1));
std::cout << "done with end value: " << *value11 << "\n";
return 0;
}
output:
Before thread value: 1 Before thread global value: 1
In thread global value: 1
In thread loop global value: 0
In thread loop global value: 1
In thread loop global value: 2
In thread loop global value: 3
In thread loop global value: 4
done with end value: 4
Yet when I copy this exactly to the SDK of the external software, the main thread does not update global value. Any ideas how this is so?
Thanks
output in external software:
before thread value: 1 before thread global value: 1
In thread global value: 1
In thread loop global value: 0
In thread loop global value: 1
In thread loop global value: 2
In thread loop global value: 3
In thread loop global value: 4
done with end value: 1
Likely this is because the compiler doesn't generally think about multithreading when optimising your code. If has seen you code checks a value repeatedly, and it knows that in single threading that value cannot change, so it just omitted the check.
If you declare the variable as volatile, then it will probably generate less efficient code that checks more often.
Of course you have to also understand that when a value is written, there are circumstances when it may not all be written in one go, so if you are unlucky enough to read it back when it is half-written, then you get back a garbage value. The fix for that is to declare it as std::atomic (which is automatically considered volatile by the optimiser), and then even more complex code will be emitted to ensure that the write and the read cannot intersect (or different processor primitives might be used for small objects such as integers)
most variables are not shared between threads, and when they are it is up to the programmer to identify those and balance optimisation against the thread synchronisation needs during design.

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the amount of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);
double generate_randn(uint64_t iterations)
{
// Print "S" when a thread starts
std::cout << "S";
std::cout.flush();
double rvalue = 0;
for (int i = 0; i < iterations; i++)
{
rvalue += randn(gen);
}
// Print "F" when a thread finishes
std::cout << "F";
std::cout.flush();
return rvalue/iterations;
}
int main(int argc, char *argv[])
{
if (argc < 2)
return 0;
uint64_t count = 100000000;
uint32_t threads = std::atoi(argv[1]);
double total = 0;
std::vector<std::future<double>> futures;
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
// Start timing
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads; i++)
{
// Start async tasks
futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
}
for (auto &future : futures)
{
// Wait for tasks to finish
future.wait();
total += future.get();
}
// End timing
t2 = std::chrono::high_resolution_clock::now();
// Take the average of the threads' results
total /= threads;
std::cout << std::endl;
std::cout << total << std::endl;
std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable with data races" — as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
static double generate_randn(uint64_t iterations, unsigned int seed)
{
// Print "S" when a thread starts
std::cout << "S";
std::cout.flush();
std::default_random_engine gen(seed);
std::normal_distribution<double> randn(0.0, 1.0);
double rvalue = 0;
for (int i = 0; i < iterations; i++)
{
rvalue += randn(gen);
}
// Print "F" when a thread finishes
std::cout << "F";
std::cout.flush();
return rvalue/iterations;
}
int main(int argc, char *argv[])
{
if (argc < 2)
return 0;
uint64_t count = 100000000;
uint32_t threads = std::atoi(argv[1]);
double total = 0;
std::vector<std::future<double>> futures;
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
std::random_device make_seed;
// Start timing
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads; i++)
{
// Start async tasks
futures.push_back(std::async(std::launch::async,
generate_randn,
count/threads,
make_seed()));
}
for (auto &future : futures)
{
// Wait for tasks to finish
future.wait();
total += future.get();
}
// End timing
t2 = std::chrono::high_resolution_clock::now();
// Take the average of the threads' results
total /= threads;
std::cout << '\n' << total
<< "\nFinished in "
<< std::chrono::duration_cast<
std::chrono::milliseconds>(t2 - t1).count()
<< " ms\n";
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting an links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s) as #zwol posted in his answer, you will see it.
Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

std::map insert thread safe in c++11?

I have very simple code in which multiple threads are trying to insert data in std::map and as per my understanding this should led to program crash because this is data race
std::map<long long,long long> k1map;
void Ktask()
{
for(int i=0;i<1000;i++)
{
long long random_variable = (std::rand())%1000;
std::cout << "Thread ID -> " << std::this_thread::get_id() << " with looping index " << i << std::endl;
k1map.insert(std::make_pair(random_variable, random_variable));
}
}
int main()
{
std::srand((int)std::time(0)); // use current time as seed for random generator
for (int i = 0; i < 1000; ++i)
{
std::thread t(Ktask);
std::cout << "Thread created " << t.get_id() << std::endl;
t.detach();
}
return 0;
}
However i ran it multiple time and there is no application crash and if run same code with pthread and c++03 application is crashing so I am wondering is there some change in c++11 that make map insert thread safe ?
No, std::map::insert is not thread-safe.
There are many reasons why your example may not crash. Your threads may be running in a serial fashion due to the system scheduler, or because they finish very quickly (1000 iterations isn't that much). Your map will fill up quickly (only having 1000 nodes) and therefore later insertions won't actually modify the structure and reduce possibility of crashes. Or perhaps the implementation you're using IS thread-safe.
For most standard library types, the only thread safety guarantee you get is that it is safe to use separate object instances in separate threads. That's it.
And std::map is not one of the exceptions to that rule. An implementation might offer you more of a guarantee, or you could just be getting lucky.
And when it comes to fixing threading bugs, there's only one kind of luck.

What's the correct way of waiting for detached threads to finish?

Look at this sample code:
void OutputElement(int e, int delay)
{
this_thread::sleep_for(chrono::milliseconds(100 * delay));
cout << e << '\n';
}
void SleepSort(int v[], uint n)
{
for (uint i = 0 ; i < n ; ++i)
{
thread t(OutputElement, v[i], v[i]);
t.detach();
}
}
It starts n new threads and each one sleeps for some time before outputting a value and finishing. What's the correct/best/recommended way of waiting for all threads to finish in this case? I know how to work around this but I want to know what's the recommended multithreading tool/design that I should use in this situation (e.g. condition_variable, mutex etc...)?
And now for the slightly dissenting answer. And I do mean slightly because I mostly agree with the other answer and the comments that say "don't detach, instead join."
First imagine that there is no join(). And that you have to communicate among your threads with a mutex and condition_variable. This really isn't that hard nor complicated. And it allows an arbitrarily rich communication, which can be anything you want, as long as it is only communicated while the mutex is locked.
Now a very common idiom for such communication would simply be a state that says "I'm done". Child threads would set it, and the parent thread would wait on the condition_variable until the child said "I'm done." This idiom would in fact be so common as to deserve a convenience function that encapsulated the mutex, condition_variable and state.
join() is precisely this convenience function.
But imho one has to be careful. When one says: "Never detach, always join," that could be interpreted as: Never make your thread communication more complicated than "I'm done."
For a more complex interaction between parent thread and child thread, consider the case where a parent thread launches several child threads to go out and independently search for the solution to a problem. When the problem is first found by any thread, that gets communicated to the parent, and the parent can then take that solution, and tell all the other threads that they don't need to search any more.
For example:
#include <chrono>
#include <iostream>
#include <iterator>
#include <random>
#include <thread>
#include <vector>
void OneSearch(int id, std::shared_ptr<std::mutex> mut,
std::shared_ptr<std::condition_variable> cv,
int& state, int& solution)
{
std::random_device seed;
// std::mt19937_64 eng{seed()};
std::mt19937_64 eng{static_cast<unsigned>(id)};
std::uniform_int_distribution<> dist(0, 100000000);
int test = 0;
while (true)
{
for (int i = 0; i < 100000000; ++i)
{
++test;
if (dist(eng) == 999)
{
std::unique_lock<std::mutex> lk(*mut);
if (state == -1)
{
state = id;
solution = test;
cv->notify_one();
}
return;
}
}
std::unique_lock<std::mutex> lk(*mut);
if (state != -1)
return;
}
}
auto findSolution(int n)
{
std::vector<std::thread> threads;
auto mut = std::make_shared<std::mutex>();
auto cv = std::make_shared<std::condition_variable>();
int state = -1;
int solution = -1;
std::unique_lock<std::mutex> lk(*mut);
for (uint i = 0 ; i < n ; ++i)
threads.push_back(std::thread(OneSearch, i, mut, cv,
std::ref(state), std::ref(solution)));
while (state == -1)
cv->wait(lk);
lk.unlock();
for (auto& t : threads)
t.join();
return std::make_pair(state, solution);
}
int
main()
{
auto p = findSolution(5);
std::cout << '{' << p.first << ", " << p.second << "}\n";
}
Above I've created a "dummy problem" where a thread searches for how many times it needs to query a URNG until it comes up with the number 999. The parent thread puts 5 child threads to work on it. The child threads work for awhile, and then every once in a while, look up and see if any other thread has found the solution yet. If so, they quit, else they keep working. The main thread waits until solution is found, and then joins with all the child threads.
For me, using the bash time facility, this outputs:
$ time a.out
{3, 30235588}
real 0m4.884s
user 0m16.792s
sys 0m0.017s
But what if instead of joining with all the threads, it detached those threads that had not yet found a solution. This might look like:
for (unsigned i = 0; i < n; ++i)
{
if (i == state)
threads[i].join();
else
threads[i].detach();
}
(in place of the t.join() loop from above). For me this now runs in 1.8 seconds, instead of the 4.9 seconds above. I.e. the child threads are not checking with each other that often, and so main just detaches the working threads and lets the OS bring them down. This is safe for this example because the child threads own everything they are touching. Nothing gets destructed out from under them.
One final iteration can be realized by noticing that even the thread that finds the solution doesn't need to be joined with. All of the threads could be detached. The code is actually much simpler:
auto findSolution(int n)
{
auto mut = std::make_shared<std::mutex>();
auto cv = std::make_shared<std::condition_variable>();
int state = -1;
int solution = -1;
std::unique_lock<std::mutex> lk(*mut);
for (uint i = 0 ; i < n ; ++i)
std::thread(OneSearch, i, mut, cv,
std::ref(state), std::ref(solution)).detach();
while (state == -1)
cv->wait(lk);
return std::make_pair(state, solution);
}
And the performance remains at about 1.8 seconds.
There is still (sort of) an effective join with the solution-finding thread here. But it is accomplished with the condition_variable::wait instead of with join.
thread::join() is a convenience function for the very common idiom that your parent/child thread communication protocol is simply "I'm done." Prefer thread::join() in this common case as it is easier to read, and easier to write.
However don't unnecessarily constrain yourself to such a simple parent/child communication protocol. And don't be afraid to build your own richer protocol when the task at hand needs it. And in this case, thread::detach() will often make more sense. thread::detach() doesn't necessarily imply a fire-and-forget thread. It can simply mean that your communication protocol is more complex than "I'm done."
Don't detach, but instead join:
std::vector<std::thread> ts;
for (unsigned int i = 0; i != n; ++i)
ts.emplace_back(OutputElement, v[i], v[i]);
for (auto & t : threads)
t.join();

Resources