I am trying to implement a thread-safe STL vector without mutexes, so I followed this post and implemented a wrapper for the atomic primitives.
However, when I ran the code below, it printed Failed! twice (only two instances of a race condition), so it doesn't seem to be thread safe. How can I fix that?
Wrapper Class
#include <atomic>

template<typename T>
struct AtomicVariable
{
    std::atomic<T> atomic;

    AtomicVariable() : atomic(T()) {}
    explicit AtomicVariable(T const& v) : atomic(v) {}
    explicit AtomicVariable(std::atomic<T> const& a) : atomic(a.load()) {}
    AtomicVariable(AtomicVariable const& other) :
        atomic(other.atomic.load()) {}

    inline AtomicVariable& operator=(AtomicVariable const& rhs) {
        atomic.store(rhs.atomic.load());
        return *this;
    }

    inline AtomicVariable& operator+=(AtomicVariable const& rhs) {
        atomic.store(rhs.atomic.load() + atomic.load());
        return *this;
    }

    inline bool operator!=(AtomicVariable const& rhs) {
        return !(atomic.load() == rhs.atomic.load());
    }
};

typedef AtomicVariable<int> AtomicInt;
Functions and Testing
// Vector of 100 elements.
vector<AtomicInt> common(100, AtomicInt(0));

void add10(vector<AtomicInt> &param){
    for (vector<AtomicInt>::iterator it = param.begin();
        it != param.end(); ++it){
        *it += AtomicInt(10);
    }
}

void add100(vector<AtomicInt> &param){
    for (vector<AtomicInt>::iterator it = param.begin();
        it != param.end(); ++it){
        *it += AtomicInt(100);
    }
}

void doParallelProcessing(){
    // Create threads
    std::thread t1(add10, std::ref(common));
    std::thread t2(add100, std::ref(common));

    // Join 'em
    t1.join();
    t2.join();

    // Print vector again
    for (vector<AtomicInt>::iterator it = common.begin();
        it != common.end(); ++it){
        if (*it != AtomicInt(110)){
            cout << "Failed!" << endl;
        }
    }
}

int main(int argc, char *argv[]) {
    // Just for testing purposes
    for (int i = 0; i < 100000; i++){
        // Reset vector
        common.clear();
        common.resize(100, AtomicInt(0));
        doParallelProcessing();
    }
}
Is there such a thing as an atomic container? I've also tested this with a regular vector<int>; it didn't produce any Failed output, but that might just be a coincidence.

Just write operator+= as:
    inline AtomicVariable& operator+=(AtomicVariable const &rhs) {
        atomic += rhs.atomic;
        return *this;
    }
According to the documentation (http://en.cppreference.com/w/cpp/atomic/atomic), operator+= on std::atomic is an atomic read-modify-write operation.
Your example fails because the following interleaving is possible:
Thread1 - rhs.atomic.load() - returns 10 ; Thread2 - rhs.atomic.load() - returns 100
Thread1 - atomic.load() - returns 0 ; Thread2 - atomic.load() - returns 0
Thread1 - add values (0 + 10 = 10) ; Thread2 - add values (0 + 100 = 100)
Thread1 - atomic.store(10) ; Thread2 - atomic.store(100)
In this case the final value is either 10 or 100, depending on which thread's atomic.store runs last. One of the two additions is lost.
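The fix above can be sketched as a small standalone program. This is a minimal illustration, not the asker's wrapper: it uses a plain std::vector<std::atomic<int>> and two hypothetical helpers, add_to_all and run_once, which are names introduced here for the example only.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: adds `delta` to every element using one atomic
// read-modify-write per element.
void add_to_all(std::vector<std::atomic<int>>& values, int delta) {
    for (auto& v : values) {
        v.fetch_add(delta);  // a single atomic step, unlike store(load() + delta)
    }
}

// Hypothetical helper: runs the two-writer experiment once and reports
// whether every element ended up as 0 + 10 + 100.
bool run_once() {
    std::vector<std::atomic<int>> values(100);
    for (auto& v : values) v.store(0);

    std::thread t1(add_to_all, std::ref(values), 10);
    std::thread t2(add_to_all, std::ref(values), 100);
    t1.join();
    t2.join();

    for (auto& v : values) {
        if (v.load() != 110) return false;
    }
    return true;
}
```

Because fetch_add reads, adds, and writes as one indivisible operation, neither thread can overwrite the other's update, so run_once() should return true on every call.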

Please note that
    atomic.store(rhs.atomic.load() + atomic.load());
is not atomic: it is a load(), another load(), and then a store(), and another thread can run between any two of them.
You have two options to solve it.
1) Use a mutex.
2) Use memory order: http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/
memory_order_acquire: guarantees that subsequent loads are not moved before the current load or any preceding loads.
memory_order_release: preceding stores are not moved past the current store or any subsequent stores.
I'm still not sure about 2, but I think it will work if the stores do not run in parallel.
EDIT: as T.C mentioned in the comments, option 2 is irrelevant here, since the operation is load() then load() then store() (not relaxed mode), so memory order is not the issue.


how to implement std::weak_ptr::lock with atomic operations?

I recently tried to implement an atomic reference counter in C, so I looked at how std::shared_ptr is implemented in the STL, and I am very confused by the implementation of weak_ptr::lock.
For the compare-and-exchange step, clang specifies memory_order_seq_cst, g++ specifies memory_order_acq_rel, and MSVC specifies memory_order_relaxed.
I think memory_order_relaxed is enough, since there is no data that needs to be synchronized if the use count is non-zero.
I am not an expert in this area. Can anyone provide some advice?
The code snippets follow (MSVC, then libc++, then libstdc++):
bool _Incref_nz() noexcept { // increment use count if not zero, return true if successful
auto& _Volatile_uses = reinterpret_cast<volatile long&>(_Uses);
#ifdef _M_CEE_PURE
long _Count = *_Atomic_address_as<const long>(&_Volatile_uses);
long _Count = __iso_volatile_load32(reinterpret_cast<volatile int*>(&_Volatile_uses));
while (_Count != 0) {
const long _Old_value = _INTRIN_RELAXED(_InterlockedCompareExchange)(&_Volatile_uses, _Count + 1, _Count);
if (_Old_value == _Count) {
return true;
_Count = _Old_value;
return false;
__shared_weak_count::lock() noexcept
long object_owners = __libcpp_atomic_load(&__shared_owners_);
while (object_owners != -1)
if (__libcpp_atomic_compare_exchange(&__shared_owners_,
return this;
return nullptr;
inline bool
_M_add_ref_lock_nothrow() noexcept
// Perform lock-free add-if-not-zero operation.
_Atomic_word __count = _M_get_use_count();
if (__count == 0)
return false;
// Replace the current counter value with the old value + 1, as
// long as it's not changed meanwhile.
while (!__atomic_compare_exchange_n(&_M_use_count, &__count, __count + 1,
return true;
I am trying to answer this question myself.
The standard only says that weak_ptr::lock must execute as an atomic operation; it says nothing more about the memory order. Different threads can therefore call weak_ptr::lock in parallel without any race condition, and each implementation is free to choose its own memory_order for that operation.
Whichever order they choose, all of the implementations above are correct.
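The "increment if not zero" loop shared by the three snippets can be sketched with std::atomic. This is a minimal illustration under assumptions: RefCount, try_acquire, and release are hypothetical names, and the relaxed ordering mirrors the choice the question attributes to MSVC. The stronger orderings used by the other libraries would also be valid here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a control block's strong reference count.
struct RefCount {
    std::atomic<long> uses{1};

    // Increment the count only if it is currently non-zero.
    bool try_acquire() noexcept {
        long count = uses.load(std::memory_order_relaxed);
        while (count != 0) {
            // On failure, compare_exchange_weak writes the current value
            // back into `count`, so the loop retries with fresh data.
            if (uses.compare_exchange_weak(count, count + 1,
                                           std::memory_order_relaxed,
                                           std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;  // the object is already gone; do not resurrect it
    }

    // Drop one reference; returns true if this was the last one.
    bool release() noexcept {
        return uses.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

The important property is that the check for zero and the increment happen in one atomic compare-and-exchange, so a count that has reached zero can never be bumped back up.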

Predefined order of executing threads

There are 10 functions, say, Function_1 (that prints 1 and exits), Function_2 (that prints 2 and exits), and so on till Function_10 (that prints 10 and exits).
A main function forks 10 threads, T1 to T10. T1 calls Function_1, T2 calls Function_2, and so on till T10 calls Function_10.
When I execute the main function, I expect output as 1 2 3 4 ... 10.
How can I achieve this?
You need to establish what amounts to a protocol between T0 (main) and each of T{1..10}. That protocol could look like (T0 sends Print to Tn; Tn sends Printed to T0).
If you don't have message passing (which would make life easy; look at golang if interested), you can simulate it crudely with condition variables. Make a structure that looks like:
struct ToDo {
    enum { Print, Printed } Op;
    int Id;
    condvar_t cv;
    mutex_t lock;
};
And each thread then becomes:
void *Proc(int Id, struct ToDo *Next) {
    lock(&Next->lock);
    while (Next->Id != Id) {
        condvar_wait(&Next->cv, &Next->lock);
    }
    assert(Next->Op == Print);
    printf("%d\n", Id);
    Next->Op = Printed;
    Next->Id = 0;
    condvar_broadcast(&Next->cv);   /* wake main */
    unlock(&Next->lock);
}
And finally your program
main() {
    struct ToDo Next;
    ... /* create condvar, lock */
    lock(&Next.lock);
    Next.Id = 0;
    /* Create threads and pass structure */
    int i;
    for (i = 1; i <= 10; i++) {
        Next.Id = i;
        Next.Op = Print;
        condvar_broadcast(&Next.cv);   /* wake the workers */
        while (Next.Id != 0) {
            condvar_wait(&Next.cv, &Next.lock);
        }
        assert(Next.Op == Printed);
    }
    unlock(&Next.lock);
    ... /* join threads */
}
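The same idea can be sketched in standard C++ with std::mutex and std::condition_variable. This is a simplified variant, not the exact protocol above: the threads pass the turn to each other directly instead of reporting back to main, and the hypothetical helper run_in_order records the ids in a vector as a stand-in for printing.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: starts `n` threads and lets them run their "print"
// step strictly in the order 1..n. Returns the order that was observed.
std::vector<int> run_in_order(int n) {
    std::mutex m;
    std::condition_variable cv;
    int turn = 1;             // id of the thread allowed to run next
    std::vector<int> order;   // protected by m

    std::vector<std::thread> threads;
    for (int id = 1; id <= n; ++id) {
        threads.emplace_back([&, id] {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return turn == id; });  // sleep until our turn
            order.push_back(id);                        // the "print" step
            ++turn;
            cv.notify_all();                            // wake the next thread
        });
    }
    for (auto& t : threads) t.join();
    return order;
}
```

Calling run_in_order(10) should yield 1 through 10 in order regardless of how the scheduler starts the threads, because each thread blocks until the shared turn counter reaches its own id.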

Try to compare 2 methods to implement bounded blocking queue

The bounded blocking queue is a well-known problem. There are two common ways to implement it, and I am trying to understand which one is better:
Method 1: use counting semaphore
void *producer(void *arg) {
int i;
for (i = 0; i < loops; i++) {
void *consumer(void *arg) {
int i;
for (i = 0; i < loops; i++) {
int tmp = get();
printf("%d\n", tmp);
Method 2: classic monitor pattern
class BoundedBuffer {
    int buffer[MAX];
    int fill, use;
    int fullEntries;
    pthread_mutex_t monitor; // monitor lock
    pthread_cond_t empty;
    pthread_cond_t full;

    BoundedBuffer() {
        use = fill = fullEntries = 0;
    }

    void produce(int element) {
        while (fullEntries == MAX)
            pthread_cond_wait(&empty, &monitor);
        //do something
    }

    int consume() {
        while (fullEntries == 0)
            pthread_cond_wait(&full, &monitor);
        //do something
        return tmp;
    }
};
I understand that the 2nd method can solve a lot of other problems. But how do these 2 methods compare? It looks like both can fulfill the task.
Is there a link to a detailed comparison?
I appreciate your help.
The big difference between the two methods is that the first one does not use pthread_-specific functions (semaphores are not part of pthread) and as such is not guaranteed to work in a multithreaded environment.
In particular, semaphores do not protect memory ordering, so things written in one thread might not be readable in another. Mutexes are suitable for a multi-threaded message queue.
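For reference, the monitor approach (Method 2 in the question) can be sketched in standard C++ with one mutex and two condition variables. This is a minimal sketch under assumptions: BoundedQueue and sum_through_queue are hypothetical names, and a std::queue replaces the fixed array with fill/use indices from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical bounded queue of ints guarded by a single monitor lock.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void produce(int value) {
        std::unique_lock<std::mutex> lock(m_);
        // Block while the queue is full; the predicate guards against
        // spurious wakeups, like the while loop in the question.
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(value);
        not_empty_.notify_one();
    }

    int consume() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        not_full_.notify_one();
        return value;
    }

private:
    std::size_t capacity_;
    std::queue<int> items_;
    std::mutex m_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Hypothetical driver: one producer pushes 0..count-1 while the caller
// consumes the same number of items and sums them.
long sum_through_queue(int count, std::size_t capacity) {
    BoundedQueue q(capacity);
    std::thread producer([&] {
        for (int i = 0; i < count; ++i) q.produce(i);
    });
    long sum = 0;
    for (int i = 0; i < count; ++i) sum += q.consume();
    producer.join();
    return sum;
}
```

The mutex both serializes access to the queue and orders the memory operations between producer and consumer, so every value pushed is seen exactly once by the consumer.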

Copy struct with function pointer to device

I have a struct containing the parameters of a linear function, as well as the function itself. What I want to do is copy this struct to the device and then evaluate the linear function. The following example doesn't make sense but it is sufficient to describe the difficulties I have:
struct model
{
    double* params;
    double (*func)(double*, double);
};
I don't know how to copy this struct to the device.
Here are my functions:
Init function
// init function for struct model
__host__ void model_init(model* m, double* params, double(*func)(double*,double))
{
    m->params = params;
    m->func = func;
}
Evaluation function
__device__ double model_evaluate(model* m, double x)
{
    return m->func(m->params, x);
    return 0.0;
}
The actual function
__host__ __device__ double linear_function(double* params, double x)
{
    return params[0] + params[1] * x;
}
Function called inside kernel
__device__ double compute(model *d_linear_model)
{
    return model_evaluate(d_linear_model,1.0);
}
The kernel itself
__global__ void kernel(double *array, model *d_linear_model, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
    {
        array[idx] = compute(d_linear_model);
    }
}
I know how to copy an array from host to device but I don't know how to do this for this concrete struct which contains a function.
The kernel call in main then looks like this:
int block_size = 4;
int n_blocks = N_array/block_size + (N_array % block_size == 0 ? 0:1);
kernel<<<n_blocks, block_size>>>(device_array, d_linear_model, N_array);
You've outlined two items that I consider to be somewhat more difficult than beginner-level CUDA programming:
use of device function pointers
a "deep copy" operation (on the embedded params pointer in your model structure)
Both of these topics have been covered in other questions. For example this question/answer discusses deep copy operations - when a data structure has embedded pointers to other data. And this question/answer links to a variety of resources on device function pointer usage.
But I'll go ahead and offer a possible solution for your posted case. Most of your code is usable as-is (at least for demonstration purposes). As mentioned already, your model structure will present two challenges:
struct model
{
    double* params; // requires a "deep copy" operation
    double (*func)(double*, double); // requires special handling for device function pointers
};
As a result, although most of your code is usable as-is, your "init" function is not. That might work for a host realization, but not for a device realization.
The deep copy operation requires us to copy the overall structure, plus separately copy the data pointed to by the embedded pointer, plus separately copy or "fixup" the embedded pointer itself.
The usage of a device function pointer is restricted by the fact that we cannot grab the actual device function pointer in host code - that is illegal in CUDA. So one possible solution is to use a __device__ construct to "capture" the device function pointer, then do a cudaMemcpyFromSymbol operation in host code, to retrieve the numerical value of the device function pointer, which can then be moved about in ordinary fashion.
Here's a worked example building on what you have shown, demonstrating the two concepts above. I have not created a "device init" function, but all the code necessary to do that is in the main function. Once you've grasped the concepts, you can take whatever code you wish out of the main function below and craft it into your own "device init" function:
$ cat t968.cu
#include <iostream>
#define NUM_PARAMS 2
#define ARR_SIZE 1
#define nTPB 256

struct model
{
  double* params;
  double (*func)(double*, double);
};

// init function for struct model -- not using this for device operations
__host__ void model_init(model* m, double* params, double(*func)(double*,double))
{
  m->params = params;
  m->func = func;
}

__device__ double model_evaluate(model* m, double x)
{
  return m->func(m->params, x);
  return 0.0;
}

__host__ __device__ double linear_function(double* params, double x)
{
  return params[0] + params[1] * x;
}

__device__ double compute(model *d_linear_model)
{
  return model_evaluate(d_linear_model,1.0);
}

__global__ void kernel(double *array, model *d_linear_model, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
  {
    array[idx] = compute(d_linear_model);
  }
}

__device__ double (*linear_function_ptr)(double*, double) = linear_function;

int main(){
  // grab function pointer from device code
  double (*my_fp)(double*, double);
  cudaMemcpyFromSymbol(&my_fp, linear_function_ptr, sizeof(void *));
  // setup model
  model my_model;
  my_model.params = new double[NUM_PARAMS];
  my_model.params[0] = 1.0;
  my_model.params[1] = 2.0;
  my_model.func = my_fp;
  // setup for device copy of model
  model *d_model;
  cudaMalloc(&d_model, sizeof(model));
  // setup "deep copy" for params
  double *d_params;
  cudaMalloc(&d_params, NUM_PARAMS*sizeof(double));
  cudaMemcpy(d_params, my_model.params, NUM_PARAMS*sizeof(double), cudaMemcpyHostToDevice);
  // copy model to device
  cudaMemcpy(d_model, &my_model, sizeof(model), cudaMemcpyHostToDevice);
  // fixup device params pointer in device model
  cudaMemcpy(&(d_model->params), &d_params, sizeof(double *), cudaMemcpyHostToDevice);
  // run test
  double *d_array, *h_array;
  cudaMalloc(&d_array, ARR_SIZE*sizeof(double));
  h_array = new double[ARR_SIZE];
  for (int i = 0; i < ARR_SIZE; i++) h_array[i] = i;
  cudaMemcpy(d_array, h_array, ARR_SIZE*sizeof(double), cudaMemcpyHostToDevice);
  kernel<<<(ARR_SIZE+nTPB-1)/nTPB,nTPB>>>(d_array, d_model, ARR_SIZE);
  cudaMemcpy(h_array, d_array, ARR_SIZE*sizeof(double), cudaMemcpyDeviceToHost);
  std::cout << "Results: " << std::endl;
  for (int i = 0; i < ARR_SIZE; i++) std::cout << h_array[i] << " ";
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t968 t968.cu
$ cuda-memcheck ./t968
========= ERROR SUMMARY: 0 errors
For brevity of presentation, I've dispensed with proper CUDA error checking (instead I have run the code with cuda-memcheck to demonstrate that it runs without runtime errors), but I would recommend proper error checking if you're having any trouble with your code.

Readers / writer implementation using std::atomic (mutex free)

Below is an attempt at a multiple-reader / single-writer shared data structure that uses std::atomic and busy-waiting instead of a mutex and condition variables to synchronize readers and writers. I am puzzled as to why the asserts in it are being hit. I'm sure there's a bug somewhere in the logic, but I'm not certain where it is.
The idea behind the implementation is that threads that read are spinning until the writer is done writing. As they enter the read function they increase m_numReaders count and as they are waiting for the writer they increase the m_numWaiting count.
The idea is that the m_numWaiting should then always be smaller or equal to m_numReaders, provided m_numWaiting is always incremented after m_numReaders and decremented before m_numReaders.
There shouldn't be a case where m_numWaiting is bigger than m_numReaders (or I'm not seeing it) since a reader always increments the reader counter first and only sometimes increments the waiting counter and the waiting counter is always decremented first.
Yet this seems to be what's happening, because the asserts are being hit.
Can someone point out the logic error, if you see it?
#include <iostream>
#include <thread>
#include <atomic>
#include <cstdint>
#include <assert.h>
template<typename T>
class ReadWrite
ReadWrite() : m_numReaders(0), m_numWaiting(0), m_writing(false)
template<typename functor>
void read(functor& readFunc)
while (m_writing)
m_numWaiting++; // m_numWaiting should always be increased after m_numReaders
waiting = true;
assert(m_numWaiting <= m_numReaders);
assert(m_numWaiting <= m_numReaders); // <-- These asserts get hit ?
m_numWaiting--; // m_numWaiting should always be decreased before m_numReaders
assert(m_numWaiting <= m_numReaders); // <-- These asserts get hit ?
// Only a single writer can operate on this at any given time.
template<typename functor>
void write(functor& writeFunc)
while (m_writeFlag.test_and_set());
// Ensure no readers present
while (m_numReaders);
// At this point m_numReaders may have been increased !
m_writing = true;
// If a reader entered before the writing flag was set, wait for
// it to finish
while (m_numReaders > m_numWaiting);
m_writing = false;
T m_data;
std::atomic<int64_t> m_numReaders;
std::atomic<int64_t> m_numWaiting;
std::atomic<bool> m_writing;
std::atomic_flag m_writeFlag;
int main(int argc, const char * argv[])
const size_t numReaders = 2;
const size_t numWriters = 1;
const size_t numReadWrites = 10000000;
std::thread readThreads[numReaders];
std::thread writeThreads[numWriters];
ReadWrite<int> dummyData;
auto writeFunc = [&](int* pData) { return; }; // intentionally empty
auto readFunc = [&](int* pData) { return; }; // intentionally empty
auto readThreadProc = [&]()
size_t numReads = numReadWrites;
while (numReads--)
auto writeThreadProc = [&]()
size_t numWrites = numReadWrites;
while (numWrites--)
for (std::thread& thread : writeThreads) { thread = std::thread(writeThreadProc);}
for (std::thread& thread : readThreads) { thread = std::thread(readThreadProc);}
for (std::thread& thread : writeThreads) { thread.join();}
for (std::thread& thread : readThreads) { thread.join();}
