C++11 When To Use A Memory Fence?


I'm writing some threaded C++11 code, and I'm not totally sure on when I need to use a memory fence or something. So here is basically what I'm doing:
class Worker
{
std::string arg1;
int arg2;
int arg3;
std::thread thread;
public:
Worker( std::string arg1, int arg2, int arg3 )
{
this->arg1 = arg1;
this->arg2 = arg2;
this->arg3 = arg3;
}
void DoWork()
{
this->thread = std::thread( &Worker::Work, this );
}
private:
void Work()
{
// Do stuff with args
}
};
int main()
{
Worker worker( "some data", 1, 2 );
worker.DoWork();
// Wait for it to finish
return 0;
}
I was wondering, what steps do I need to take to make sure that the args are safe to access in the Work() function which runs on another thread. Is it enough that it's written in the constructor, and then the thread is created in a separate function? Or do I need a memory fence, and how do I make a memory fence to make sure all 3 args are written by the main thread, and then read by the Worker thread?
Thanks for any help!

The C++11 standard section 30.3.1.2 thread constructors [thread.thread.constr] p5 describes the constructor template <class F, class... Args> explicit thread(F&& f, Args&&... args):
Synchronization: the completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.
So everything the creating thread did before constructing the std::thread happens before the thread function is called. You don't need to do anything special to ensure that the assignments to the Worker members are complete and will be visible to the new thread.
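As a rough illustration of that guarantee, here is a minimal sketch based on the Worker class from the question. The result member, the Wait() method and the run_worker() helper are hypothetical additions made only for this example. It relies on two synchronization points: the std::thread constructor (members written earlier are visible in Work()) and join() (what the worker wrote is visible afterwards).

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for the Worker class in the question.
class Worker {
    std::string arg1;
    int arg2;
    int arg3;
    std::string result;  // written only by the worker thread
    std::thread thread;

public:
    Worker(std::string a1, int a2, int a3)
        : arg1(std::move(a1)), arg2(a2), arg3(a3) {}

    // The std::thread constructor synchronizes with the start of Work(),
    // so the members set in the constructor are visible there.
    void DoWork() { thread = std::thread(&Worker::Work, this); }

    // join() synchronizes in the other direction: once it returns,
    // everything the worker wrote is visible to the joining thread.
    std::string Wait() {
        assert(thread.joinable());
        thread.join();
        return result;
    }

private:
    void Work() { result = arg1 + ":" + std::to_string(arg2 + arg3); }
};

// Convenience wrapper for the sketch (not part of the original code).
std::string run_worker(const std::string& a1, int a2, int a3) {
    Worker w(a1, a2, a3);
    w.DoWork();
    return w.Wait();
}
```

Calling run_worker("some data", 1, 2) from main() needs no fences or atomics, because no member is touched by both threads between thread creation and join().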
In general, you should never have to use a memory fence when writing multithreaded C++11: synchronization is built into mutexes/atomics and they handle any necessary fences for you. (Caveat: you are on your own if you use relaxed atomics.)
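For the relaxed-atomics caveat, the following sketch shows one situation where an explicit fence is needed: a plain int is published through a flag that is only accessed with memory_order_relaxed. The names payload, ready, producer, consumer and run_fence_demo are made up for this illustration. Using release/acquire ordering on the flag itself would make the fences unnecessary.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag accessed with relaxed ordering only

void producer() {
    payload = 42;
    // Release fence: the write to payload cannot be reordered after
    // the relaxed store below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: pairs with the release fence in producer(), so the
    // read of payload below observes the value written before the flag.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Hypothetical driver for the sketch.
int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(seen == 42);
    return seen;
}
```

Without the two fences (and with the flag still relaxed), reading payload in consumer() would be a data race.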

Related

Is it safe to initialize a c++11 function-static variable from a linux signal handler?

2 questions (below) about the C++11 static initialization at [1] in this reference code (this is a complete tested c++11 example program).
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <string.h>
struct Foo {
/* complex member variables. */
};
void DoSomething(Foo *foo) {
// Complex, but signal safe, use of foo.
}
Foo InitFoo() {
Foo foo;
/* complex, but signal safe, initialization of foo */
return foo;
}
Foo* GetFoo() {
static Foo foo = InitFoo(); // [1]
return &foo;
}
void Handler(int sig) {
DoSomething(GetFoo());
}
int main() {
// [2]
struct sigaction act;
memset(&act, 0, sizeof(act));
act.sa_handler = Handler;
sigaction(SIGINT, &act, nullptr);
for (;;) {
sleep(1);
DoSomething(GetFoo());
}
}
Question1: Is this guaranteed safe (no deadlocks etc)? C++11 static initialization involves locks. What if the signal is delivered before/after/during the first call to GetFoo() in main?
Question2: Is this guaranteed safe if a call to GetFoo() is inserted at [2] before the signal handler is installed? (Edit:) I.e. does inserting GetFoo() at [2] ensure that, later, when a signal arrives while the loop is operating, that there will be no deadlock?
I'm assuming C++11 (g++ or clang) on recent GNU/Linux, although answers for various Unices would also be interesting. (Spoiler: I think the answer is 1:NO and 2:YES but I don't know how to prove it.)
Edit: To be clear, I can imagine static initialization could be implemented like this:
Mutex mx; // global variable
bool done = false; // global variable
...
lock(mx);
if (!done) {
foo = InitFoo();
done = true;
}
unlock(mx);
and then it would not be deadlock safe because the signal handler might lock mx while the main thread has it locked.
But there are other implementations, for example:
Mutex mx; // global variable
std::atomic<bool> done = false; // global variable
...
if (!done.load()) {
lock(mx);
if (!done.load()) {
foo = InitFoo();
done.store(true);
}
unlock(mx);
}
which would not have potential for deadlock provided the codepath was run completely at least once before a signal handler runs it.
My question is whether the c++11 (or any later) standard requires the implementation to be async-signal-safe (deadlock free, aka lock free) after the initial pass through the code has completed?

How static Foo foo = InitFoo(); gets initialized must be stated first before getting into signals.
It requires dynamic initialization, where it'll be initialized the first time GetFoo() gets called since the "complex initialization" you mention in InitFoo() can't be done at compile-time:
Dynamic initialization of a block-scope variable with static storage duration or thread storage duration is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization. [85] If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined.
[85] The implementation must not introduce any deadlock around execution of the initializer. Deadlocks might still be caused by the program logic; the implementation need only avoid deadlocks due to its own synchronization operations.
With that established, we can go to the questions.
Question1: Is this guaranteed safe (no deadlocks etc)? C++11 static initialization involves locks. What if the signal is delivered before/after/during the first call to GetFoo() in main?
No, this isn't guaranteed. Consider when GetFoo() is called the first time from inside the for loop:
GetFoo() -> a lock is taken to initialize 'foo'-> a signal arrives [control goes to signal handling function] -> blocked here for signal handling to complete
--> Handler() -> DoSomething(GetFoo()) -> GetFoo() -> waits here because the lock is unavailable.
(The signal handler has to wait here since the initialization of 'foo' isn't complete yet; see the quote above.)
So the deadlock occurs in this scenario (even without any threads) as the thread is blocked on itself.
Question2: Is this guaranteed safe if a call to GetFoo() is inserted at [2] before the signal handler is installed?
In this case, there's no signal handler established at all for SIGINT. So if SIGINT arrives, the program simply exits. The default disposition for SIGINT is to terminate the process. It doesn't matter whether the initialization in GetFoo() is in progress or not. So this is fine.
The fundamental problem with case (1) is that the signal handler Handler isn't async-signal-safe because it calls GetFoo() which isn't async-signal-safe.
Re. updated question with possible implementations of static initialization:
The C++11 standard only guarantees that the initialization of foo is done in a thread-safe manner (see the sentence about concurrent execution in the quote above). But handling signals is not "concurrent execution". It's more like "recursively re-entering", as it can happen even in a single-threaded program, and thus it'd be undefined. This is true even if static initialization is implemented like your second method, which would avoid deadlocks.
Put another way: if static initialization is implemented like your first method, does it violate the standard? The answer is no. So you can't rely on static initialization being implemented in an async-signal-safe way.
Given that you ensure "...provided the codepath was run completely at least once before a signal handler runs it.", you could introduce another check that ensures GetFoo() is async-signal-safe regardless of how static initialization is implemented:
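// Publishes the address of foo once it is initialized. A pointer is used
// instead of a bool because foo is not in scope outside the if block.
std::atomic<Foo*> foo_ptr{nullptr};
static_assert(ATOMIC_POINTER_LOCK_FREE == 2, "atomic pointers must be lock-free");
Foo* GetFoo() {
    Foo* p = foo_ptr.load();
    if (!p) {
        static Foo foo = InitFoo(); // [1]
        p = &foo;
        foo_ptr.store(p);
    }
    return p; // after the first complete call, [1] is never reached again
}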
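A different way to sidestep the problem, not taken from the answer above, is to keep the handler trivial and do the real work in the normal control flow. The sketch below assumes that a lock-free std::atomic counter may be touched from a handler; the names pending_signals, OnSignal, DrainSignals and run_signal_demo are hypothetical. The main loop would call DoSomething(GetFoo()) whenever DrainSignals() returns a non-zero count, so GetFoo() is never entered from signal context.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Counter of signals that arrived but have not been processed yet.
std::atomic<int> pending_signals{0};
static_assert(ATOMIC_INT_LOCK_FREE == 2, "std::atomic<int> must be lock-free");

// The handler only records that a signal arrived; no locks, no statics.
extern "C" void OnSignal(int) {
    pending_signals.fetch_add(1, std::memory_order_relaxed);
}

// Called from the normal main loop: returns how many signals arrived
// since the last call and resets the counter.
int DrainSignals() {
    return pending_signals.exchange(0, std::memory_order_relaxed);
}

// Hypothetical driver: installs the handler, raises one signal
// synchronously, and reports how many were recorded.
int run_signal_demo() {
    auto previous = std::signal(SIGINT, OnSignal);
    assert(previous != SIG_ERR);
    std::raise(SIGINT);              // handler runs before raise() returns
    int handled = DrainSignals();
    std::signal(SIGINT, previous);   // restore the earlier disposition
    return handled;
}
```

The trade-off is latency: the work happens at the next loop iteration rather than at the moment the signal is delivered.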

How to join a thread in Linux kernel?

The main question is: how can we wait for a thread in the Linux kernel to complete? I have seen a few posts about the proper way of handling threads in the Linux kernel, but I'm not sure how the main thread can wait for a single thread to complete (suppose we need thread[3] to be done before proceeding):
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/kthread.h>
#include <linux/slab.h>
int func(void *arg) { // kthread_run() expects int (*)(void *)
// doing something
return 0;
}
int init_module(void) {
struct task_struct* thread[5];
int i;
for(i=0; i<5; i++) {
thread[i] = kthread_run(func, NULL, "Creating thread"); // no per-thread argument here, so pass NULL
}
return 0;
}
void cleanup_module(void) {
printk("cleaning up!\n");
}

AFAIK there is no equivalent of pthread_join() in the kernel. Also, I feel like your pattern (starting a bunch of threads and waiting for only one of them) is not really common in the kernel. That being said, the kernel does have a few synchronization mechanisms that may be used to accomplish your goal.
Note that those mechanisms do not guarantee that the thread has finished; they only let the main thread know that the thread finished the work it was supposed to do. It may still take some time for the thread to really stop and free all its resources.
Semaphores
You can create a locked semaphore, then call down in your main thread. This will put it to sleep. Then you will up this semaphore inside of your thread just before exiting. Something like:
struct semaphore sem;
int func(void *arg) {
struct semaphore *sem = (struct semaphore*)arg; // you could use global instead
// do something
up(sem);
return 0;
}
int init_module(void) {
// some initialization
sema_init(&sem, 0); // count 0 = starts locked; init_MUTEX_LOCKED() was removed in kernel 2.6.37
kthread_run(&func, (void*) &sem, "Creating thread");
down(&sem); // this will block until thread runs up()
return 0;
}
This should work but is not the optimal solution. I mention it because it's a known pattern that is also used in userspace. Kernel semaphores are designed for cases where the semaphore is mostly available, whereas in this case the waiter almost always has to sleep. So a similar mechanism optimized for this case was created.
Completions
You can declare completions using:
struct completion comp;
init_completion(&comp);
or:
DECLARE_COMPLETION(comp);
Then you can use wait_for_completion(&comp); instead of down() to wait in main thread and complete(&comp); instead of up() in your thread.
Here's the full example:
#define N 5 // number of threads, 5 as in the question
DECLARE_COMPLETION(comp);
struct my_data {
int id;
struct completion *comp;
};
int func(void *arg) {
struct my_data *data = (struct my_data*)arg;
// doing something
if (data->id == 3)
complete(data->comp);
return 0;
}
int init_module(void) {
int i;
// one my_data per thread; it must outlive the threads (freeing is omitted here)
struct my_data *data = kmalloc(sizeof(struct my_data)*N, GFP_KERNEL);
if (!data)
return -ENOMEM;
// some initialization
for (i=0; i<N; i++) {
data[i].comp = &comp;
data[i].id = i;
kthread_run(func, (void*) &data[i], "my_thread%d", i);
}
wait_for_completion(&comp); // this will block until some thread runs complete()
return 0;
}
Multiple threads
I don't really see why you would start 5 identical threads and only want to wait for the 3rd one, but of course you could send different data to each thread, with a field describing its id, and then call up() or complete() only if this id equals 3. That's shown in the completion example. There are other ways to do this; this is just one of them.
Word of caution
Go read some more about those mechanisms before using any of them. There are some important details I did not write about here. Also those examples are simplified and not tested, they are here just to show the overall idea.
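For readers who know userspace threading better, the same "wait until thread 3 reports done" idea can be sketched in C++11. This is only an analogue, not kernel code: the Completion class below is a hypothetical stand-in built from a mutex and a condition variable, and run_completion_demo is an invented driver.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical userspace analogue of the kernel's struct completion.
class Completion {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;  // guarded by m

public:
    void complete() {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });
    }
};

// Starts n_threads workers and blocks only until the worker with
// awaited_id has signalled; the others are joined afterwards for cleanup.
int run_completion_demo(int n_threads, int awaited_id) {
    assert(awaited_id >= 0 && awaited_id < n_threads);
    Completion comp;
    int result = -1;  // written by the awaited worker before complete()
    std::vector<std::thread> threads;
    for (int id = 0; id < n_threads; ++id) {
        threads.emplace_back([&, id] {
            if (id == awaited_id) {
                result = id * 10;
                comp.complete();
            }
        });
    }
    comp.wait();        // returns once the awaited worker has signalled
    int seen = result;
    for (auto& t : threads) t.join();
    return seen;
}
```

As in the kernel version, the signal only says that the work is done, not that the thread itself has exited.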

kthread_stop() is the kernel's way of waiting for a thread to end.
Besides waiting, kthread_stop() also sets the should_stop flag for the target thread and wakes it up if needed. This is useful for threads that repeat some action indefinitely.
For single-shot tasks, it is usually simpler to use work items (workqueues) instead of kthreads.
EDIT:
Note: kthread_stop() may only be called while the thread's task_struct has not yet been freed.
Either the thread function should return only after it finds that kthread_should_stop() returns true, or get_task_struct() should be called before starting the thread (and put_task_struct() should be called after kthread_stop()).

thread.join does not return when called in global var destructor

I'm using the C++11 standard library with VS2013 to implement an asynchronous print class.
I can't get thread.join() to return; it blocks forever.
After debugging, I found that the issue seems to depend on whether the object is declared as a global or a local variable. Here are the details; I don't know why it happens.
#include <iostream>
#include <string>
#include <chrono>
#include <mutex>
#include <thread>
#include <condition_variable>
#include "tbb/concurrent_queue.h"
using namespace std;
class logger
{
public:
~logger()
{
fin();
}
void init()
{
m_quit = false;
m_thd = thread(bind(&logger::printer, this));
//thread printer(bind(&logger::printer, this));
//m_thd.swap(printer);
}
void fin()
{
//not needed
//unique_lock<mutex> locker(m_mtx);
if (m_thd.joinable())
{
m_quit = true;
write("fin");
//locker.unlock();
m_thd.join();
}
}
void write(const char *msg)
{
m_queue.push(msg);
m_cond.notify_one();
}
void printer()
{
string msgstr;
unique_lock<mutex> locker(m_mtx);
while (1)
{
if (m_queue.try_pop(msgstr))
cout << msgstr << endl;
else if (m_quit)
break;
else
m_cond.wait(locker);
}
cout << "printer quit" <<endl;
}
bool m_quit;
mutex m_mtx;
condition_variable m_cond;
thread m_thd;
tbb::concurrent_queue<string> m_queue;
};
For convenience, I put thread.join() in the class's destructor to ensure that m_thd exits normally.
I tested the whole class and something went wrong.
m_thd.join() never returns when logger is declared as a global variable,
like this:
logger lgg;
void main()
{
lgg.init();
for (int i = 0; i < 100; ++i)
{
char s[8];
sprintf_s(s, 8, "%d", i);
lgg.write(s);
}
//if first call lgg.fin() here, m_thd can be joined normally
//lgg.fin();
system("pause");
//dead&blocked here and I observed that printer() finished successfully
}
If logger is declared as a local variable, everything seems to work fine.
void main()
{
logger lgg;
lgg.init();
for (int i = 0; i < 100; ++i)
{
char s[8];
sprintf_s(s, 8, "%d", i);
lgg.write(s);
}
system("pause");
}
update 2015/02/27
I tried removing std::cout from printer(), but the program still blocks at the same place, so it does not seem to be a std::cout problem.
I also removed the superfluous lock in fin().

Globals and statics are constructed just before DllMain is called for DLL_PROCESS_ATTACH and destroyed just after it is called for DLL_PROCESS_DETACH. The problem is that this happens inside the loader lock, which is the most dangerous place on the planet to be if you are dealing with kernel objects, as it may cause a deadlock or make the application crash randomly. As such, you should never use threading primitives as statics on Windows, ever. Dealing with threads in the destructor of a global object is basically doing exactly the things we're warned not to do in DllMain.
To quote Raymond Chen
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
and again:
If your DllMain function creates a thread and then waits for the thread to do something (e.g., waits for the thread to signal an event that says that it has finished initializing), then you've created a deadlock. The DLL_PROCESS_ATTACH notification handler inside DllMain is waiting for the new thread to run, but the new thread can't run until the DllMain function returns so that it can send a new DLL_THREAD_ATTACH notification.
This deadlock is much more commonly seen in DLL_PROCESS_DETACH, where a DLL wants to shut down its worker threads and wait for them to clean up before it unloads itself. You can't wait for a thread inside DLL_PROCESS_DETACH because that thread needs to send out the DLL_THREAD_DETACH notifications before it exits, which it can't do until your DLL_PROCESS_DETACH handler returns.
This also occurs even when using an EXE because the visual C++ runtime cheats and registers its constructors and destructors with the C runtime to be run when the runtime is loaded or unloaded, thus ending up with the same issue:
The answer is that the C runtime library hires a lackey. The hired lackey is the C runtime library DLL (for example, MSVCR80.DLL). The C runtime startup code in the EXE registers all the destructors with the C runtime library DLL, and when the C runtime library DLL gets its DLL_PROCESS_DETACH, it calls all the destructors requested by the EXE.

I'm wondering how you're using m_mtx. The normal pattern is that both threads lock it and both threads unlock it. But fin() fails to lock it.
Similarly unexpected is m_cond.wait(m_mtx). This would release the mutex, except that it isn't locked in the first place!
Finally, as m_mtx isn't locked, I don't see how m_quit = true should become visible in m_thd.

One problem you have is that std::condition_variable::notify_one is called while the notifying thread holds the same std::mutex that the waiting thread needs (this happens when logger::write is called by logger::fin with the lock held).
This causes the notified thread to immediately block again on the mutex, and since fin() then waits in join() without releasing it, the printer thread can stay blocked indefinitely during destruction.
You should avoid notifying while holding the same mutex as the waiting thread(s), and you must not wait for the notified thread while still holding it.
Quote from en.cppreference.com:
The notifying thread does not need to hold the lock on the same mutex as the one held by the waiting thread(s); in fact doing so is a pessimization, since the notified thread would immediately block again, waiting for the notifying thread to release the lock.
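Putting these points together, a logger along the following lines keeps both the queue and the quit flag under the mutex, waits with a predicate, and notifies after releasing the lock. This is a sketch under several assumptions: a std::queue replaces tbb::concurrent_queue, messages are collected in a vector instead of being printed so the result can be inspected, and fin() is meant to be called explicitly before main() returns rather than from a global object's destructor.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class Logger {
    std::mutex mtx;
    std::condition_variable cond;
    std::queue<std::string> queue;     // guarded by mtx
    bool quit = false;                 // guarded by mtx
    std::vector<std::string> printed;  // touched only by the printer thread
    std::thread thd;

    void printer() {
        std::unique_lock<std::mutex> lock(mtx);
        for (;;) {
            // The predicate handles spurious wakeups and missed notifications.
            cond.wait(lock, [this] { return quit || !queue.empty(); });
            while (!queue.empty()) {
                printed.push_back(std::move(queue.front()));
                queue.pop();
            }
            if (quit) break;
        }
    }

public:
    void init() {
        assert(!thd.joinable());  // init() must not be called twice
        thd = std::thread(&Logger::printer, this);
    }

    void write(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.push(std::move(msg));
        }
        cond.notify_one();  // notify after the lock has been released
    }

    // Stops the printer, joins it, and returns everything it consumed.
    std::vector<std::string> fin() {
        if (thd.joinable()) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                quit = true;
            }
            cond.notify_one();
            thd.join();  // no lock is held while joining
        }
        return printed;
    }
};

// Hypothetical driver: logs the numbers 0..n-1 and returns what was printed.
std::vector<std::string> run_logger_demo(int n) {
    Logger lg;
    lg.init();
    for (int i = 0; i < n; ++i) lg.write(std::to_string(i));
    return lg.fin();
}
```

Because the quit flag is only changed under the mutex, the printer cannot miss it between checking the predicate and going to sleep.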

C++11 - Managing worker threads

I am new to threading in C++11 and I am wondering how to manage worker threads (using the standard library) to perform some task and then die off. I have a pool of threads vector<thread *> thread_pool that maintains a list of active threads.
Let's say I launch a new thread and add it to the pool using thread_pool.push_back(new thread(worker_task)), where worker_task is defined as follows:
void worker_task()
{
this_thread::sleep_for(chrono::milliseconds(1000));
cout << "Hello, world!\n";
}
Once the worker thread has terminated, what is the best way to reliably remove the thread from the pool? The main thread needs to run continuously and cannot block on a join call. I am more confused about the general structure of the code than the intricacies of synchronization.
Edit: It looks like I misused the concept of a pool in my code. All I meant was that I have a list of threads that are currently running.

You can use std::thread::detach to "separate the thread of execution from the thread object, allowing execution to continue independently. Any allocated resources will be freed once the thread exits."
If each thread should make its state visible, you can move this functionality into the thread function.
std::mutex mutex;
using strings = std::list<std::string>;
strings info;
strings::iterator insert(std::string value) {
std::unique_lock<std::mutex> lock{mutex};
return info.insert(info.end(), std::move(value));
}
void erase(strings::iterator p) { // plain auto return type deduction would need C++14
std::unique_lock<std::mutex> lock{mutex};
info.erase(p);
}
template <typename F>
void async(F f) {
std::thread{[f] {
auto p = insert("...");
try {
f();
} catch (...) {
erase(p);
throw;
}
erase(p);
}}.detach();
}
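To show how such a registry might be used from a main thread that never calls join(), here is a self-contained variation of the idea above. The names registry, running_tasks, spawn_detached and run_detach_demo are hypothetical, and unlike the snippet above the entry is registered before the thread starts, so the caller can see it immediately.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

std::mutex registry_mutex;
std::list<std::string> registry;  // names of tasks that are still running

// Number of tasks that have been started but have not finished yet.
std::size_t running_tasks() {
    std::lock_guard<std::mutex> lock(registry_mutex);
    return registry.size();
}

// Starts f on a detached thread; the task removes its own registry
// entry when it finishes, so nobody has to join it.
template <typename F>
void spawn_detached(std::string name, F f) {
    std::list<std::string>::iterator entry;
    {
        std::lock_guard<std::mutex> lock(registry_mutex);
        entry = registry.insert(registry.end(), std::move(name));
    }
    std::thread([entry, f]() mutable {
        try { f(); } catch (...) { /* exceptions are swallowed in this sketch */ }
        std::lock_guard<std::mutex> lock(registry_mutex);
        registry.erase(entry);  // list iterators stay valid across other erases
    }).detach();
}

// Hypothetical driver: starts n tasks and polls the registry instead of
// blocking in join(). Returns how many tasks ran to completion.
int run_detach_demo(int n) {
    std::atomic<int> finished{0};
    for (int i = 0; i < n; ++i) {
        spawn_detached("task" + std::to_string(i), [&finished] { ++finished; });
    }
    while (running_tasks() != 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return finished.load();
}
```

A real main loop would do other work between polls. Detached threads are still cut off when the process exits, so anything that must finish needs a check like running_tasks() before main() returns.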

starting std::thread with anonymous class call

I am curious as to how to correctly start a std::thread using an anonymous class call.
With the code below, if my class has only one member variable and I call std::thread td(someclass(shared_mutex)); I get the compiler warning
C4930: 'std::thread td(someclass)': prototyped function not called (was a variable definition intended?)
However, if I add a second member variable as below and call it with
std::thread td(someclass(shared_mutex,x));
I get error C2064: term does not evaluate to a function taking 0 arguments.
class someclass
{
private:
std::mutex& shared_mutex;
int x;
public:
someclass(std::mutex& init_mutex, int init_x) :
shared_mutex(init_mutex),
x(init_x)
{}
//...
};
int main()
{
std::mutex shared_mutex;
int x = 10;
std::thread td(someclass(shared_mutex,x));
td.join();
return 0;
}
The only way around this is to add a
void operator()()
{}
to the class, but is that the correct method, just to have some kind of starting function for the thread to call, or am I missing some other point here? I thought the constructor would take care of that?

Try using { and } syntax to construct your object to avoid the most vexing parse, where the statement is parsed as a function declaration.
std::thread td(someclass(shared_mutex,x))
becomes
std::thread td{someclass{shared_mutex,x}}
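Note that std::thread needs something it can call, which is why error C2064 shows up for a class without a call operator. A minimal sketch of the functor approach follows; the counter member and the run_functor_demo helper are hypothetical additions that give the thread an observable effect.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical variant of someclass from the question, made callable.
class someclass {
    std::mutex& shared_mutex;
    int& counter;  // added for this sketch so the work is observable
    int x;

public:
    someclass(std::mutex& init_mutex, int& init_counter, int init_x)
        : shared_mutex(init_mutex), counter(init_counter), x(init_x) {}

    // std::thread invokes this on its own copy of the object.
    void operator()() {
        std::lock_guard<std::mutex> lock(shared_mutex);
        counter += x;
    }
};

int run_functor_demo(int x) {
    std::mutex shared_mutex;
    int counter = 0;
    // Braces avoid the vexing parse: this cannot be read as a function
    // declaration, so td really is a std::thread object.
    std::thread td{someclass{shared_mutex, counter, x}};
    assert(td.joinable());
    td.join();
    return counter;
}
```

The constructor only builds the object; operator() is what the new thread actually runs.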

It seems that you want your thread to execute the long-running constructor of someclass and then immediately discard the newly constructed someclass. This can be done by passing the thread constructor a function object that does just that:
int main()
{
std::mutex shared_mutex;
int x = 10;
std::thread td([&]{someclass(shared_mutex,x);});
td.join();
return 0;
}
Be warned: constructing a new thread is a hugely expensive operation, so you should avoid casually spawning new threads if you have the ability to instead reuse existing threads, unless you are only going to create new threads very infrequently.
