An Efficient Non-Enforcing, Verifying Mutex

Class foo has a method bar. According to some synchronization protocol, the bar method of a specific foo object will only be called by at most one thread at any point in time.
I'd like to add a very lightweight verification_mutex to verify this / debug synchronization abuses. It will be used similarly to a regular mutex:
class foo {
public:
    void bar() {
        std::lock_guard<verification_mutex> lk{m};
        ...
    }
private:
    mutable verification_mutex m;
};
however, it will not in itself necessarily lock or unlock anything. Rather, it will just throw if multithreaded simultaneous access is detected. The point is to make its runtime footprint as low as possible (including its effect on other code, e.g., through memory barriers).
Here are three options for implementing verification_mutex:
1. A wrapper around std::mutex, but with lock implemented as a check that try_lock succeeded (this is just to convey the idea; clearly not very fast).
2. An atomic variable noting the current "locking" thread id, manipulated with atomic exchange operations (see the implementation sketch below).
3. The same as 2, but without atomics.
Are these correct or incorrect (in particular, 2 and esp. 3)? How will they affect performance (esp. of surrounding code)? Is there an altogether superior alternative?
Edit: The answer by @SergeyA below is fine, but I'm particularly curious about the memory barriers. A solution not utilizing them would be great, as would an answer giving some intuitive explanation of why any solution omitting them would necessarily fail.
Implementation Sketch
#include <atomic>
#include <thread>
#include <functional>
class verification_mutex {
public:
    verification_mutex() : m_holder{0} {}
    void lock() {
        // exchange in our id; any non-zero previous value means another
        // thread currently "holds" the mutex
        if (m_holder.exchange(get_this_thread_id()) != 0)
            throw std::logic_error("lock");
    }
    void unlock() {
        // exchange in the sentinel; the previous value must be our own id
        if (m_holder.exchange(0) != get_this_thread_id())
            throw std::logic_error("unlock");
    }
    bool try_lock() {
        // verification only: contention is a protocol violation,
        // so lock() throws rather than ever returning false
        lock();
        return true;
    }
private:
    static inline std::size_t get_this_thread_id() {
        // note: assumes the hash of a live thread's id is never 0,
        // which is the "unlocked" sentinel value
        return std::hash<std::thread::id>()(std::this_thread::get_id());
    }
private:
    std::atomic_size_t m_holder;
};

Option 3 is not viable. You need at least atomic operations (with their associated memory ordering) when reading and writing a variable from multiple threads; without them the program has a data race, which is undefined behavior.
Of all the options, an atomic boolean would be the fastest, since it won't require context switches (mutexes might). Something like this:
class verifying_mutex {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // compare_exchange_strong takes the expected value by reference
        bool expected = false;
        if (!locked.compare_exchange_strong(expected, true))
            throw std::runtime_error("Incorrect mt-access pattern");
    }
    void unlock() {
        locked = false;
    }
};
On a side note, your original version of lock used the thread id, which would slow you down unnecessarily. Do not do that.
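Regarding the edit about memory barriers: since the verification mutex never publishes data itself (the external protocol is supposed to provide all the happens-before edges), one plausible direction is demoting the exchanges to std::memory_order_relaxed. Overlapping lock/unlock pairs are still detected, because read-modify-write operations on a single atomic always observe the latest value in its modification order. This is an untested sketch of that idea, reusing the hash-based thread id scheme from the question, not a proven answer to whether relaxed ordering can mask some violation patterns:
#include <atomic>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <thread>

class relaxed_verification_mutex {
public:
    void lock() {
        // relaxed RMW: still atomic, still sees the latest value in the
        // modification order, but emits no fence on weakly ordered CPUs
        if (m_holder.exchange(get_this_thread_id(),
                              std::memory_order_relaxed) != 0)
            throw std::logic_error("lock");
    }
    void unlock() {
        if (m_holder.exchange(0, std::memory_order_relaxed)
                != get_this_thread_id())
            throw std::logic_error("unlock");
    }
    bool try_lock() { lock(); return true; }
private:
    static std::size_t get_this_thread_id() {
        // same assumption as the sketch above: the hash of a live
        // thread's id is never 0, the "unlocked" sentinel
        return std::hash<std::thread::id>()(std::this_thread::get_id());
    }
    std::atomic<std::size_t> m_holder{0};
};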

Related

How to move/swap a std::vector efficiently and thread safe?

Imagine a thread which continuously writes to a vector of strings which is being collected every now and then by another thread (see code).
#include <string>
#include <vector>
#include <chrono>
#include <thread>
#include <iostream>
#include <cassert>
// some public vector being filled by one and consumed by another
// thread
static std::vector<std::string> buffer;
// continuously writes data to buffer (has to be fast)
static const auto filler(std::thread([] {
    for (size_t i = 0;; ++i) {
        buffer.push_back(std::to_string(i));
    }
}));
// returns collected data and clears the buffer being written to
std::vector<std::string> fetch() {
    return std::move(buffer);
}
// continuously fetch buffered data and process it (can be slow)
int main() {
    size_t expected{};
    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        const auto fetched(fetch());
        for (auto && e : fetched) {
            size_t read(std::stoi(e));
            std::cout << read << " " << expected << std::endl;
            assert(read == expected);
            ++expected;
        }
    }
}
The provided example generally does what I want it to do, but it crashes because it is not thread safe. Obvious approaches would be:
securing the shared vector with a lock_guard
using two buffers and an atomic pointer
using a thread-safe vector implementation.
The provided scenario seems very simple to me. I don't think I need a thread safe vector because that would cover a lot more scenarios at the cost of performance.
Using a mutex or swapping between two instances of the vector seems plausible to me, but I wonder if there is some solution made specifically to 'atomically grab all data and leave an empty container'.
Maybe there's an obvious solution and it's just time to go to bed for me?
Important note: this question is somewhat academic, since performance is not (necessarily) a real issue here. The provided example gets throttled by about 15%, but there is hardly any 'real' work being done. I think that in a real-world example the benefit would be about 2-5%.
First of all, I would not recommend having a non-const static variable, so I propose encapsulating the vector in a class with the following interface:
class ValuesHolder
{
public:
    void push_back(std::string value);
    std::vector<std::string> take();
};
The second note, about 'atomically grab all data and leave an empty container': you could pull this trick off by swapping pointers, but the main issue is that push_back has to be synchronized with it (the vector must not be moved out while a push_back is executing). Otherwise there may be issues with the following interleaving:
Thread 1                              Thread 2
auto values = holder.take();          // push_back starts before take
for (const auto& value : values)      // but the value is inserted during the iteration
{...}
So the first option is just to lock during both calls:
#include <mutex>
#include <string>
#include <vector>
class ValuesHolder
{
public:
    void push_back(std::string value)
    {
        std::lock_guard<std::mutex> lock(mut);
        values.push_back(std::move(value));
    }
    std::vector<std::string> take()
    {
        std::lock_guard<std::mutex> lock(mut);
        // moving leaves 'values' in a valid but unspecified state;
        // mainstream implementations leave it empty
        return std::move(values);
    }
private:
    std::mutex mut;
    std::vector<std::string> values;
};
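For illustration, the question's example could then go through the holder instead of the bare global (a minimal sketch reusing the question's includes; filler and fetch are the names from the question):
static ValuesHolder holder;

// continuously writes data to the holder (has to be fast)
static const auto filler(std::thread([] {
    for (size_t i = 0;; ++i) {
        holder.push_back(std::to_string(i));
    }
}));

// returns collected data and leaves an empty vector behind
std::vector<std::string> fetch() {
    return holder.take();
}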
Otherwise you could switch from std::vector to a lock-free stack container. However, the performance should be measured carefully, since the number of allocations can increase, and the performance can end up worse.

Is it safe to initialize a c++11 function-static variable from a linux signal handler?

Two questions (below) about the C++11 static initialization at [1] in this reference code (a complete, tested C++11 example program).
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <string.h>
struct Foo {
    /* complex member variables. */
};
void DoSomething(Foo *foo) {
    // Complex, but signal safe, use of foo.
}
Foo InitFoo() {
    Foo foo;
    /* complex, but signal safe, initialization of foo */
    return foo;
}
Foo* GetFoo() {
    static Foo foo = InitFoo(); // [1]
    return &foo;
}
void Handler(int sig) {
    DoSomething(GetFoo());
}
int main() {
    // [2]
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = Handler;
    sigaction(SIGINT, &act, nullptr);
    for (;;) {
        sleep(1);
        DoSomething(GetFoo());
    }
}
Question1: Is this guaranteed safe (no deadlocks etc)? C++11 static initialization involves locks. What if the signal is delivered before/after/during the first call to GetFoo() in main?
Question2: Is this guaranteed safe if a call to GetFoo() is inserted at [2], before the signal handler is installed? (Edit:) I.e., does inserting GetFoo() at [2] ensure that, later, when a signal arrives while the loop is operating, there will be no deadlock?
I'm assuming C++11 (g++ or clang) on recent GNU/Linux, although answers for various Unices would also be interesting. (Spoiler: I think the answer is 1:NO and 2:YES but I don't know how to prove it.)
Edit: To be clear, I can imagine static initialization could be implemented like this:
Mutex mx; // global variable
bool done = false; // global variable
...
lock(mx);
if (!done) {
    foo = InitFoo();
    done = true;
}
unlock(mx);
and then it would not be deadlock safe because the signal handler might lock mx while the main thread has it locked.
But there are other implementations, for example:
Mutex mx; // global variable
std::atomic<bool> done = false; // global variable
...
if (!done.load()) {
    lock(mx);
    if (!done.load()) {
        foo = InitFoo();
        done.store(true);
    }
    unlock(mx);
}
which would not have potential for deadlock provided the codepath was run completely at least once before a signal handler runs it.
My question is whether the c++11 (or any later) standard requires the implementation to be async-signal-safe (deadlock free, aka lock free) after the initial pass through the code has completed?
Before getting into signals, we must first establish how static Foo foo = InitFoo(); gets initialized.
It requires dynamic initialization: it will be initialized the first time GetFoo() is called, since the "complex initialization" you mention in InitFoo() can't be done at compile time:
Dynamic initialization of a block-scope variable with static storage duration or thread storage duration is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization. [85] If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined.
[85] The implementation must not introduce any deadlock around execution of the initializer. Deadlocks might still be caused by the program logic; the implementation need only avoid deadlocks due to its own synchronization operations.
With that established, we can go to the questions.
Question1: Is this guaranteed safe (no deadlocks etc)? C++11 static initialization involves locks. What if the signal is delivered before/after/during the first call to GetFoo() in main?
No, this isn't guaranteed. Consider when GetFoo() is called the first time from inside the for loop:
GetFoo() -> a lock is taken to initialize foo -> a signal arrives [control goes to the signal handler]
--> Handler() -> DoSomething(GetFoo()) -> GetFoo() -> blocks here, because the lock is unavailable
(The signal handler has to wait since the initialization of foo isn't complete yet; see the quote above.)
So the deadlock occurs in this scenario (even without any threads) as the thread is blocked on itself.
Question2: Is this guaranteed safe if a call to GetFoo() is inserted at [2] before the signal handler is installed?
In this case, when GetFoo() runs at [2], no signal handler has been established yet for SIGINT. So if SIGINT arrives during that first initialization, the program simply exits: the default disposition for SIGINT is to terminate the process. It doesn't matter whether the initialization inside GetFoo() is in progress or not. So this is fine.
The fundamental problem with case (1) is that the signal handler Handler isn't async-signal-safe because it calls GetFoo() which isn't async-signal-safe.
Re. updated question with possible implementations of static initialization:
The C++11 standard only guarantees that the initialization of foo is done in a thread-safe manner (see the quote above). But handling a signal is not "concurrent execution"; it's more like "recursively re-entering the declaration", as it can happen even in a single-threaded program, and thus the behavior would be undefined. This is true even if static initialization is implemented like your second method, which would avoid deadlocks.
Put it the other way: if static initialization is implemented like your first method, does it violate the standard? The answer is no. So you can't rely on static initialization being implemented in an async-signal-safe way.
Given you ensure "...provided the codepath was run completely at least once before a signal handler runs it", you could introduce another check that makes GetFoo() async-signal-safe after that first pass, regardless of how static initialization is implemented:
std::atomic<bool> foo_done(false);
static_assert(ATOMIC_BOOL_LOCK_FREE == 2, "atomic<bool> must be lock-free");

Foo* GetFoo() {
    static Foo* foo_ptr; // zero-initialized before main(); no dynamic-init guard
    if (!foo_done) {
        static Foo foo = InitFoo(); // [1] may take a lock on the first pass
        foo_ptr = &foo;
        foo_done = true;            // set only after initialization completed
    }
    return foo_ptr; // after the first complete pass, no lock or guard is touched
}
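Under that premise, main() would complete one full pass before installing the handler; a minimal sketch reusing the question's setup:
int main() {
    GetFoo(); // complete one full pass: takes and releases any init lock
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = Handler; // from here on, Handler only hits the fast path
    sigaction(SIGINT, &act, nullptr);
    for (;;) {
        sleep(1);
        DoSomething(GetFoo());
    }
}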

How to join a thread in Linux kernel?

The main question is: how can we wait for a thread in the Linux kernel to complete? I have seen a few posts concerned with the proper way of handling threads in the Linux kernel, but I'm not sure how the main thread can wait for a single thread to complete (suppose we need thread[3] to be done before proceeding):
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/kthread.h>
#include <linux/slab.h>
// kthread functions return int and take a single void * argument
int func(void *data) {
    // doing something
    return 0;
}
int init_module(void) {
    struct task_struct *thread[5];
    int i;
    for (i = 0; i < 5; i++) {
        thread[i] = kthread_run(func, NULL, "my_thread%d", i);
    }
    return 0;
}
void cleanup_module(void) {
printk("cleaning up!\n");
}
AFAIK there is no equivalent of pthread_join() in the kernel. Also, your pattern (starting a bunch of threads and waiting for only one of them) is not really common in the kernel. That being said, the kernel does have a few synchronization mechanisms that can be used to accomplish your goal.
Note that these mechanisms will not guarantee that the thread has finished; they will only let the main thread know that the thread has finished doing the work it was supposed to do. It may still take some time to really stop that thread and free all its resources.
Semaphores
You can create a locked semaphore, then call down() in your main thread. This will put it to sleep. Then you up() this semaphore inside your thread just before exiting. Something like:
#include <linux/semaphore.h>
struct semaphore sem;

int func(void *arg) {
    struct semaphore *sem = (struct semaphore *)arg; // you could use the global instead
    // do something
    up(sem);
    return 0;
}

int init_module(void) {
    // some initialization
    sema_init(&sem, 0); // start locked (older kernels used init_MUTEX_LOCKED())
    kthread_run(&func, (void *)&sem, "my_thread");
    down(&sem); // this will block until the thread runs up()
    return 0;
}
This should work, but it is not the most optimal solution. I mention it because it's a known pattern that is also used in userspace. Kernel semaphores are optimized for the case where the semaphore is mostly available, whereas here the main thread always blocks, so a similar mechanism optimized for exactly this case was created.
Completions
You can declare completions using:
struct completion comp;
init_completion(&comp);
or:
DECLARE_COMPLETION(comp);
Then you can use wait_for_completion(&comp); instead of down() to wait in the main thread, and complete(&comp); instead of up() in your thread.
Here's the full example:
DECLARE_COMPLETION(comp);

struct my_data {
    int id;
    struct completion *comp;
};

int func(void *arg) {
    struct my_data *data = (struct my_data *)arg;
    // doing something
    if (data->id == 3)
        complete(data->comp);
    return 0;
}

int init_module(void) {
    struct my_data *data = kmalloc(sizeof(struct my_data) * N, GFP_KERNEL);
    int i;
    // some initialization
    for (i = 0; i < N; i++) {
        data[i].comp = &comp;
        data[i].id = i;
        kthread_run(func, (void *)&data[i], "my_thread%d", i);
    }
    wait_for_completion(&comp); // this will block until some thread runs complete()
    return 0;
}
Multiple threads
I don't really see why you would start 5 identical threads and want to wait for only the 3rd one, but of course you can send different data to each thread, with a field describing its id, and then call up() or complete() only if this id equals 3. That's shown in the completion example. There are other ways to do this; this is just one of them.
Word of caution
Go read some more about these mechanisms before using any of them. There are some important details I did not write about here. Also, these examples are simplified and not tested; they are here just to show the overall idea.
kthread_stop() is the kernel's way of waiting for a thread to end.
Aside from waiting, kthread_stop() also sets the should_stop flag for the awaited thread and wakes it up if needed. It is useful for threads which repeat some action indefinitely.
As for single-shot tasks, it is usually simpler to use work items (workqueues) for them instead of kthreads.
EDIT:
Note: kthread_stop() can only be called while the kthread's task_struct has not been freed.
Either the thread function should return only after it finds kthread_should_stop() returning true, or get_task_struct() should be called before starting the thread (and put_task_struct() called after kthread_stop()).

Initializing empty polymorphic Singleton type without magic statics

Suppose you had a polymorphic Singleton type (in our case a custom std::error_category type). The type is stateless, so no data members, but it does have a couple of virtual functions. The problem arises when instantiating this type in a multithreaded environment.
The easiest way to achieve this would be to use C++11's magic statics:
my_type const& instantiate() {
    static const my_type instance;
    return instance;
}
Unfortunately, one of our compilers (VC11) does not support this feature.
Should I expect that this will explode in a multithreaded environment? I'm quite certain that as far as the standard goes, all bets are off. But given that the type does not contain any data members and only virtual functions, what kind of errors should I expect from a mainstream implementation like VC11? For example, neither Boost.System nor VC seem to take any precautions against this in their implementation of error_category. Are they just being careless or is it unreasonably paranoid to worry about races here?
What would be the best way to get rid of the data race in a standard-compliant way? Since the type in this case is an error_category I want to avoid doing a heap allocation if possible. Keep in mind that the Singleton semantics are vital in this case, since equality of error categories is determined by pointer-comparison. This means that for example thread-local storage is not an option.
Here is a possibly simpler version of Casey's answer, which uses an atomic spinlock to guard a normal static declaration.
my_type const& instantiate()
{
    static std::atomic_int flag;
    while (flag != 2)
    {
        int expected = 0;
        if (flag.compare_exchange_weak(expected, 1))
            break;
    }
    try
    {
        static my_type instance = whatever; // <--- normal static decl and init
        flag = 2;
        return instance;
    }
    catch (...)
    {
        flag = 0;
        throw;
    }
}
This code is also easier to turn into three macros for reuse, which are easily #defined to nothing on platforms that support magic statics.
my_type const& instantiate()
{
    MY_MAGIC_STATIC_PRE;
    static my_type instance = whatever; // <--- normal static decl and init
    MY_MAGIC_STATIC_POST;
    return instance;
    MY_MAGIC_STATIC_SCOPE_END;
}
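For completeness, here is one hypothetical way those macros could be defined, mirroring the spinlock code above (the feature macro and the flag name are assumptions, and this scheme supports only one guarded static per function; requires <atomic>):
#if defined(COMPILER_HAS_MAGIC_STATICS) // hypothetical feature-test macro
#define MY_MAGIC_STATIC_PRE
#define MY_MAGIC_STATIC_POST
#define MY_MAGIC_STATIC_SCOPE_END
#else
#define MY_MAGIC_STATIC_PRE                              \
    static std::atomic_int ms_flag;                      \
    while (ms_flag != 2) {                               \
        int expected = 0;                                \
        if (ms_flag.compare_exchange_weak(expected, 1))  \
            break;                                       \
    }                                                    \
    try {
#define MY_MAGIC_STATIC_POST                             \
    ms_flag = 2
#define MY_MAGIC_STATIC_SCOPE_END                        \
    } catch (...) { ms_flag = 0; throw; }
#endif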
Attempt #2b: Implement your own equivalent of std::once_flag, with an atomic<int> (Live at Rextester):
my_type const& instantiate() {
    static std::aligned_storage<sizeof(my_type), __alignof(my_type)>::type storage;
    static std::atomic_int flag;
    while (flag < 2) {
        // all threads spin until the object is properly initialized
        int expected = 0;
        if (flag.compare_exchange_weak(expected, 1)) {
            // only one thread succeeds at the compare_exchange.
            try {
                ::new (&storage) my_type;
            } catch (...) {
                // Initialization failed. Let another thread try.
                flag = 0;
                throw;
            }
            // Success!
            if (!std::is_trivially_destructible<my_type>::value) {
                std::atexit([] {
                    reinterpret_cast<my_type&>(storage).~my_type();
                });
            }
            flag = 2;
        }
    }
    return reinterpret_cast<my_type&>(storage);
}
This only relies on the compiler correctly zero-initializing all static-storage-duration objects, and it also uses the nonstandard extension __alignof(<type>) to properly align the storage, since Microsoft's compiler team can't be bothered to add the keyword without the two underscores.
Attempt#1: Use std::call_once in conjunction with a std::once_flag (Live demo at Coliru):
my_type const& instantiate() {
    struct empty {};
    union storage_t {
        empty e;
        my_type instance;
        constexpr storage_t() : e{} {}
        ~storage_t() {}
    };
    static std::once_flag flag;
    static storage_t storage;
    std::call_once(flag, [] {
        ::new (&storage.instance) my_type;
        std::atexit([] {
            storage.instance.~my_type();
        });
    });
    return storage.instance;
}
The default constructor for std::once_flag is constexpr, so it's guaranteed to be constructed during constant initialization. I am under the impression [citation needed] that VC correctly performs constant initialization. EDIT: Unfortunately, MSVC up through VS12 still doesn't support constexpr, so this technique has some undefined behavior. I'll try again.
The C++03 standard was silent on how statics are constructed when the function is called on multiple threads; C++11 requires the initialization to be thread-safe ("magic statics"), but not every compiler implements that.
gcc uses locks to make function-level statics thread-safe (this can be disabled by a flag). Visual C++ versions through VS 2013 do NOT have thread-safe function-level statics; VS 2015 finally added them.
On compilers without this guarantee, it is recommended to use a lock around the variable declaration to ensure thread safety, as in the sketch below.
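A minimal sketch of that recommendation (the namespace-scope mutex is an assumption; every call pays for the lock, unlike the spinlock schemes above):
#include <mutex>

namespace {
    std::mutex instance_mutex; // guards the first-time construction below
}

my_type const& instantiate() {
    std::lock_guard<std::mutex> lock(instance_mutex);
    static const my_type instance; // only one thread can reach this at a time
    return instance;
}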

Looking for an optimum multithread message queue

I want to run several threads inside a process and I'm looking for the most efficient way to pass messages between them.
Each thread would have a shared-memory input message buffer; other threads would write to the appropriate buffer.
Messages would have priority. I want to manage this process myself.
Without getting into expensive locking or synchronizing, what's the best way to do this? Or is there already a well proven library available for this? (Delphi, C, or C# is fine).
This is hard to get right without repeating a lot of mistakes other people already made for you :)
Take a look at Intel Threading Building Blocks - the library has several well-designed queue templates (and other collections) that you can test and see which suits your purpose best.
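For instance, a rough sketch with TBB's concurrent_priority_queue (assuming TBB is installed; Message and its ordering are placeholder types, not part of the original post):
#include <tbb/concurrent_priority_queue.h>
#include <string>

struct Message {
    int priority;
    std::string text;
    bool operator<(const Message &other) const {
        return priority < other.priority; // higher priority pops first
    }
};

tbb::concurrent_priority_queue<Message> queue;

void producer() {
    queue.push(Message{5, "hello"});
}

void consumer() {
    Message m;
    if (queue.try_pop(m)) { // non-blocking; returns false when empty
        // process m.text ...
    }
}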
If you are going to work with multiple threads, it is hard to avoid synchronisation, but fortunately it is not very difficult.
For a single process, a Critical Section is frequently the best choice. It is fast and easy to use. For simplicity, I normally wrap it in a class to handle initialisation and cleanup.
#include <Windows.h>
class CTkCritSec
{
public:
    CTkCritSec(void)
    {
        ::InitializeCriticalSection(&m_critSec);
    }
    ~CTkCritSec(void)
    {
        ::DeleteCriticalSection(&m_critSec);
    }
    void Lock()
    {
        ::EnterCriticalSection(&m_critSec);
    }
    void Unlock()
    {
        ::LeaveCriticalSection(&m_critSec);
    }
private:
    CRITICAL_SECTION m_critSec;
};
You can make it even simpler with an "autolock" class that locks/unlocks it for you.
class CTkAutoLock
{
public:
    CTkAutoLock(CTkCritSec &lock)
        : m_lock(lock)
    {
        m_lock.Lock();
    }
    virtual ~CTkAutoLock()
    {
        m_lock.Unlock();
    }
private:
    CTkCritSec &m_lock;
};
Anywhere you want to lock something, instantiate an autolock. When the function finishes, it will unlock. Also, if there is an exception, it will automatically unlock (giving exception safety).
Now you can make a simple message queue out of a std::priority_queue:
#include <queue>
#include <deque>
#include <functional>
#include <string>
struct CMsg
{
    CMsg(const std::string &s, int n = 1)
        : sText(s), nPriority(n)
    {
    }
    int nPriority;
    std::string sText;
    // note: std::binary_function takes the argument types first
    // and the result type last
    struct Compare : public std::binary_function<const CMsg *, const CMsg *, bool>
    {
        bool operator () (const CMsg *p0, const CMsg *p1)
        {
            return p0->nPriority < p1->nPriority;
        }
    };
};
class CMsgQueue :
    private std::priority_queue<CMsg *, std::deque<CMsg *>, CMsg::Compare >
{
public:
    void Push(CMsg *pJob)
    {
        CTkAutoLock lk(m_critSec);
        push(pJob);
    }
    CMsg *Pop()
    {
        CTkAutoLock lk(m_critSec);
        CMsg *pJob(NULL);
        if (!Empty()) // nested lock is fine: CRITICAL_SECTION is recursive
        {
            pJob = top();
            pop();
        }
        return pJob;
    }
    bool Empty()
    {
        CTkAutoLock lk(m_critSec);
        return empty();
    }
private:
    CTkCritSec m_critSec;
};
The content of CMsg can be anything you like. Note that CMsgQueue inherits privately from std::priority_queue. That prevents raw access to the queue without going through our (synchronised) methods.
Assign a queue like this to each thread and you are on your way.
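A hypothetical usage sketch (not from the original post); note the queue stores raw pointers, so whoever pops a message owns it and must delete it:
CMsgQueue msgQueue; // one of these per consumer thread

void ProducerThread()
{
    msgQueue.Push(new CMsg("do work", 5));
}

void ConsumerThread()
{
    while (CMsg *pMsg = msgQueue.Pop()) // NULL when the queue is empty
    {
        // ... handle pMsg->sText at priority pMsg->nPriority ...
        delete pMsg; // popped messages are owned by the consumer
    }
}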
Disclaimer The code here was slapped together quickly to illustrate a point. It probably has errors and needs review and testing before being used in production.
