Behavior of WaitForMultipleObjects when multiple handles signal at the same time - multithreading

Given: I fill up an array of handles with auto reset events and pass it off to WaitForMultipleObjects with bWaitAll = FALSE.
From MSDN:
“When bWaitAll is FALSE, this function checks the handles in the array in order starting with index 0, until one of the objects is signaled. If multiple objects become signaled, the function returns the index of the first handle in the array whose object was signaled.”
So if multiple objects signal, I’ll get the index of the first one. Do I have to loop through my array to see if any others have signaled?
Right now I have a loop that’s along the lines of:
for ( ; ; )
    if (not failed)
        Process the object that signaled.
        Remove the handle that signaled from the array.
        Compact the array.

Why not just go back round into the Wait()? If multiple objects signalled, they will still be signalled when you come back round. Of course, if the first object in the wait array fires very rapidly, it will starve the others; to avoid that, order the objects in the wait array by frequency of firing, with the least frequent first.
BTW, where you're using an endless for(), you could use a goto. If you really are not leaving a loop, an unconditional goto most properly expresses your intent.

Yes. One alternative would be that you could do WaitForSingleObject(handle, 0) on each handle which will return immediately and indicate if they are signaled or not.
EDIT: Here's sample pseudocode for what I mean:
ret = WaitForMultipleObjects()
if (ret >= WAIT_OBJECT_0 && ret < WAIT_OBJECT_0 + count)
{
    firstSignaled = ret - WAIT_OBJECT_0;
    // handles[firstSignaled] is guaranteed to be signaled
    for (i = firstSignaled + 1; i < count; i++)
    {
        if (WaitForSingleObject(handles[i], 0) == WAIT_OBJECT_0)
        {
            // handles[i] was signaled too (an auto-reset event is reset by this wait)
        }
    }
}

One other option you might have is to use RegisterWaitForSingleObject. The idea is that you flag the signaled state of event in a secondary array from the callback function and then signal a master event which is used to wake up your primary thread (which calls WaitForSingleObject on the master event).
Obviously you'd have to make sure that access to the secondary array is synchronized between the callbacks and the main thread, but it would work.

Only the auto-reset event that ended the wait (whose index is returned) will be reset. If the wait times out no events will be reset.


Inserting with threads in C

I have 2 threads which work on the same list with the same insert function. Each thread should insert its values (200 each) whenever it has the CPU.
I am confused about how to implement the "loop" which counts the inserts per thread. I lock a mutex before and unlock it after calling the insert function in the thread function. So if I put a while loop in there, thread A would insert its 200, then B its 200. But that's not what I want here. Any ideas how I can make each thread insert its values as soon as it has the CPU, and stop once it has inserted 200?
Once you create the threads, it is out of your control as far as cpu access is concerned. As far as counting the inserts for each thread, I would create a wrapper function for each thread that counts each separately.
void first_thread_wrapper()
{
    // while first_thread count < 200
    //     lock
    //     insert
    //     increment first_thread count
    //     unlock
}
void second_thread_wrapper()
{
    // while second_thread count < 200
    //     lock
    //     insert
    //     increment second_thread count
    //     unlock
}
Then you would give the first thread the first wrapper and the second thread the second wrapper. Obviously this is just pseudo code, but I think you can understand where I'm going.
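To make the pseudocode above a bit more concrete, here is a minimal sketch in C++. It assumes std::thread and std::mutex stand in for whatever thread API you actually use (the question is about C, where the pthread calls would play the same role), and SharedList, insert_worker and run_two_inserters are names made up for the illustration. The point it tries to show is that the lock is taken per insert, not around the whole loop, so either thread can get in between two inserts of the other.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical shared state: the list plus the mutex that guards it.
struct SharedList {
    std::list<int> items;
    std::mutex lock;
};

// Inserts `value` until this thread has done `limit` inserts.
// The mutex is held for a single insert only, so the other thread
// may run between any two inserts; the scheduler decides the order.
void insert_worker(SharedList &shared, int value, int limit) {
    int inserted = 0;  // per-thread counter, needs no locking
    while (inserted < limit) {
        std::lock_guard<std::mutex> guard(shared.lock);
        shared.items.push_back(value);
        ++inserted;
    }
}

// Starts two workers on the same list and returns the final list size.
std::size_t run_two_inserters(int per_thread) {
    SharedList shared;
    std::thread a(insert_worker, std::ref(shared), 1, per_thread);
    std::thread b(insert_worker, std::ref(shared), 2, per_thread);
    a.join();
    b.join();
    return shared.items.size();
}
```

Calling run_two_inserters(200) should leave 400 elements in the list. How the 1s and 2s interleave is not controlled by the code and may differ from run to run.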

Why only __add_wait_queue(q, wait) when wait is empty?
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
    unsigned long flags;
    wait->flags &= ~WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&q->lock, flags);
    if (list_empty(&wait->task_list))
        __add_wait_queue(q, wait);
    spin_unlock_irqrestore(&q->lock, flags);
}
In the code above, we can see that __add_wait_queue(q, wait) is executed only when list_empty(&wait->task_list) is true.
Why doesn't wait need to be added to q (wait_queue_head_t) when &wait->task_list is not empty?
Does that mean that if wait (wait_queue_t) is already in a q (wait_queue_head_t), it is left unchanged?
Yes, the branch
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
means that wait is added to the wait queue q only if wait does not already belong to a queue.
Otherwise, if wait is found to belong to (some) wait queue already, it is assumed to belong to q specifically, and it is not added again.
There is a subtlety in calling list_empty on an object that can be a list element (rather than the head of a list).
list_empty always returns false if the object belongs to a list.
But if the object doesn't belong to any list, the return value is generally unspecified (and in most cases it is false too).
The exception is an object initialized with the INIT_LIST_HEAD function or the LIST_HEAD_INIT macro, or removed from a list with the list_del_init function: in those cases list_empty is guaranteed to return true.
Looking for uses of INIT_LIST_HEAD, LIST_HEAD_INIT and list_del_init in the wait.h header shows that prepare_to_wait may only be used with a wait object that was:
Created with the DEFINE_WAIT macro or one of the DEFINE_WAIT_* macros.
Initialized with the init_wait function, which is called e.g. from the wait_event_* macros.
Passed to the finish_wait function.
But prepare_to_wait cannot be used with a wait object created with the DECLARE_WAITQUEUE macro: this macro initializes the task_list field with {NULL, NULL}, so list_empty would return false for it (as if the wait object had already been added to a wait queue).
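The three cases described above can be reproduced with a small userspace model. This is only a sketch of the same pointer logic, not the kernel source: the struct and the helpers below are simplified re-implementations written for the illustration (named after their kernel counterparts, with init_list_head standing in for INIT_LIST_HEAD), and they leave out everything the real ones do around memory ordering and debugging.

```cpp
// Simplified userspace model of the kernel's struct list_head.
struct list_head {
    list_head *next;
    list_head *prev;
};

// Models INIT_LIST_HEAD: the node points at itself.
inline void init_list_head(list_head *node) {
    node->next = node;
    node->prev = node;
}

// Models list_empty: true only when next points back at the node itself.
inline bool list_empty(const list_head *node) {
    return node->next == node;
}

// Models list_add: link `entry` right after `head`.
inline void list_add(list_head *entry, list_head *head) {
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

// Models list_del_init: unlink the entry, then re-initialize it
// so that list_empty(entry) is true again.
inline void list_del_init(list_head *entry) {
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    init_list_head(entry);
}
```

With this model, a freshly initialized node reports list_empty as true, a node that has been added to a queue reports false, and a node set up as {nullptr, nullptr} (the DECLARE_WAITQUEUE case) also reports false, because its next pointer does not point back at the node.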

Safely close an indefinitely running thread

So first off, I realize that if my code was in a loop I could use a do while loop to check a variable set when I want the thread to close, but in this case that is not possible (so it seems):
DWORD WINAPI recv_thread(LPVOID random) {
    recv(ClientSocket, recvbuffer, recvbuflen, 0);
    return 1;
}
In the above, recv() is a blocking function.
(Please pardon me if the formatting isn't correct. It's the best I can do on my phone.)
How would I go about terminating this thread since it never closes but never loops?
Among other solutions, you can
a) set a receive timeout on the socket and handle timeouts by checking the return value and/or error code in an appropriate loop:
setsockopt(ClientSocket, SOL_SOCKET, SO_RCVTIMEO, (char *)&timeout, sizeof(timeout));
b) close the socket from another thread, which makes the blocked recv() return with an error.
You can use poll() before recv() to check whether there is something to receive.
struct pollfd pfd; // not named "poll", which would hide the poll() function
int res;
pfd.fd = ClientSocket;
pfd.events = POLLIN;
res = poll(&pfd, 1, 1000); // 1000 ms timeout
if (res == 0)
{
    // timeout
}
else if (res == -1)
{
    // error
}
else
{
    // something happened on the socket; check (pfd.revents & POLLIN) != 0
    recv(ClientSocket, recvbuffer, recvbuflen, 0); // we can read ...
}
The way I handle this problem is to never block inside recv() -- preferably by setting the socket to non-blocking mode, but you may also be able to get away with simply only calling recv() when you know the socket currently has some bytes available to read.
That leads to the next question: if you don't block inside recv(), how do you prevent CPU-spinning? The answer to that question is to call select() (or poll()) with the correct arguments so that you'll block there until the socket has bytes ready to recv().
Then comes the third question: if your thread is now blocked (possibly forever) inside select(), aren't we back to the original problem again? Not quite, because now we can implement a variation of the self-pipe trick. In particular, because select() (or poll()) can 'watch' multiple sockets at the same time, we can tell the call to block until either of two sockets has data ready-to-read. Then, when we want to shut down the thread, all the main thread has to do is send a single byte of data to the second socket, and that will cause select() to return immediately. When the thread sees that it is this second socket that is ready-for-read, it should respond by exiting, so that the main thread's blocking call to WaitForSingleObject(theThreadHandle) will return, and then the main thread can clean up without any risk of race conditions.
The final question is: how to set up a socket-pair so that your main thread can call send() on one of the pair's sockets, and your recv-thread will see the sent data appear on the other socket? Under POSIX it's easy, there is a socketpair() function that does exactly that. Under Windows, socketpair() does not exist, but you can roll your own implementation of it as shown here.
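As a rough illustration of the scheme described above, here is a POSIX-only sketch in C++ using poll() and socketpair(). The names WaitResult, wait_for_data_or_shutdown and demo_shutdown_wakeup are invented for the example, and the demo runs everything on one thread just to show the wake-up path. On Windows the same structure would need a hand-built socket pair, as mentioned, and that part is not shown here.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

enum class WaitResult { Data, Shutdown, Timeout, Error };

// Blocks until data_fd is readable, wake_fd is readable, or the timeout
// (in milliseconds, -1 = forever) expires. A wake-up takes priority over data.
WaitResult wait_for_data_or_shutdown(int data_fd, int wake_fd, int timeout_ms) {
    pollfd fds[2];
    fds[0].fd = data_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = wake_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    const int ready = poll(fds, 2, timeout_ms);
    if (ready < 0) return WaitResult::Error;
    if (ready == 0) return WaitResult::Timeout;
    // Any event on the wake socket (a byte arrived, or it was closed)
    // is treated as a request to exit.
    if (fds[1].revents != 0) return WaitResult::Shutdown;
    // Otherwise the event is on data_fd: recv() will not block now,
    // although it may report end-of-stream or an error.
    return WaitResult::Data;
}

// Single-threaded demonstration: send the wake-up byte first, then wait.
WaitResult demo_shutdown_wakeup() {
    int data[2];
    int wake[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, data) != 0) return WaitResult::Error;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, wake) != 0) {
        close(data[0]);
        close(data[1]);
        return WaitResult::Error;
    }
    const char byte = 'q';
    // This send() is what the main thread would do to stop the worker.
    const bool sent = send(wake[0], &byte, 1, 0) == 1;
    const WaitResult result =
        sent ? wait_for_data_or_shutdown(data[0], wake[1], 1000) : WaitResult::Error;
    close(data[0]);
    close(data[1]);
    close(wake[0]);
    close(wake[1]);
    return result;
}
```

In a real program the receive thread would call wait_for_data_or_shutdown in a loop, call recv() only on Data, and return from the thread function on Shutdown, so that the main thread's wait on the thread handle completes.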

lock-free bounded MPMC ringbuffer failure

I've been banging my head against (my attempt) at a lock-free multiple producer multiple consumer ring buffer. The basis of the idea is to use the innate overflow of unsigned char and unsigned short types, fix the element buffer to either of those types, and then you have a free loop back to beginning of the ring buffer.
The problem is - my solution doesn't work for multiple producers (it does though work for N consumers, and also single producer single consumer).
#include <atomic>
template<typename Element, typename Index = unsigned char> struct RingBuffer
{
    std::atomic<Index> readIndex;
    std::atomic<Index> writeIndex;
    std::atomic<Index> scratchIndex;
    Element elements[1 << (sizeof(Index) * 8)];
    RingBuffer() : readIndex(0), writeIndex(0), scratchIndex(0) {}
    bool push(const Element & element)
    {
        while(true)
        {
            const Index currentReadIndex = readIndex.load();
            Index currentWriteIndex = writeIndex.load();
            const Index nextWriteIndex = currentWriteIndex + 1;
            if(nextWriteIndex == currentReadIndex)
                return false;
            if(scratchIndex.compare_exchange_strong(
                currentWriteIndex, nextWriteIndex))
            {
                elements[currentWriteIndex] = element;
                writeIndex = nextWriteIndex;
                return true;
            }
        }
    }
    bool pop(Element & element)
    {
        while(true)
        {
            Index currentReadIndex = readIndex.load();
            const Index currentWriteIndex = writeIndex.load();
            const Index nextReadIndex = currentReadIndex + 1;
            if(currentReadIndex == currentWriteIndex)
                return false;
            element = elements[currentReadIndex];
            if(readIndex.compare_exchange_strong(
                currentReadIndex, nextReadIndex))
                return true;
        }
    }
};
The main idea for writing was to use a temporary index 'scratchIndex' that acts as a pseudo-lock, allowing only one producer at any one time to copy-construct into the elements buffer before updating the writeIndex and letting any other producer make progress. Before I am called a heathen for implying my approach is 'lock-free': I realise that this approach isn't exactly lock-free, but in practice (if it would work!) it is significantly faster than having a normal mutex!
I am aware of a (more complex) MPMC ringbuffer solution here, but I am really experimenting with my idea to then compare against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
Things I have tried;
Using compare_exchange_weak
Using more precise std::memory_order's that match the behaviour I want
Adding cacheline pads between the various indices I have
Making elements std::atomic instead of just Element array
I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get round using mutex's, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)
This is a form of the A-B-A problem. A successful producer looks something like this:
1. load currentReadIndex
2. load currentWriteIndex
3. cmpxchg store scratchIndex = nextWriteIndex
4. store element
5. store writeIndex = nextWriteIndex
If a producer stalls for some reason between steps 2 and 3 for long enough, it is possible for the other producers to produce an entire queue's worth of data and wrap back around to the exact same index so that the compare-exchange in step 3 succeeds (because scratchIndex happens to be equal to currentWriteIndex again).
By itself, that isn't a problem. The stalled producer is perfectly within its rights to increment scratchIndex to lock the queue—even if a magical ABA-detecting cmpxchg rejected the store, the producer would simply try again, reload exactly the same currentWriteIndex, and proceed normally.
The actual problem is the nextWriteIndex == currentReadIndex check between steps 2 and 3. The queue is logically empty if currentReadIndex == currentWriteIndex, so this check exists to make sure that no producer gets so far ahead that it overwrites elements that no consumer has popped yet. It appears to be safe to do this check once at the top, because all the consumers should be "trapped" between the observed currentReadIndex and the observed currentWriteIndex.
Except that another producer can come along and bump up the writeIndex, which frees the consumer from its trap. If a producer stalls between steps 2 and 3, when it wakes up the stored value of readIndex could be absolutely anything.
Here's an example, starting with an empty queue, that shows the problem happening:
Producer A runs steps 1 and 2. Both loaded indices are 0. The queue is empty.
Producer B interrupts and produces an element.
Consumer pops an element. Both indices are 1.
Producer B produces 255 more elements. The write index wraps around to 0, the read index is still 1.
Producer A awakens from its slumber. It had previously loaded both read and write indices as 0 (empty queue!), so it attempts step 3. Because the other producer coincidentally paused on index 0, the compare-exchange succeeds, and the store progresses. At completion the producer sets writeIndex = 1, and now both stored indices are 1, and the queue is logically empty. A full queue's worth of elements will now be completely ignored.
(I should mention that the only reason I can get away with talking about "stalling" and "waking up" is that all the atomics used are sequentially consistent, so I can pretend that we're in a single-threaded environment.)
Note that the way that you are using scratchIndex to guard concurrent writes is essentially a lock; whoever successfully completes the cmpxchg gets total write access to the queue until it releases the lock. The simplest way to fix this failure is to just replace scratchIndex with a spinlock—it won't suffer from A-B-A and it's what's actually happening.
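Here is one way the spinlock suggestion could look, as a hedged sketch and not a drop-in fix for the code in the question. SpinLock and LockedRingBuffer are names made up for the example. The second lock on the consumer side is my own addition: it keeps a consumer from copying a slot while a wrapped-around producer overwrites it, which is a separate issue from the one described above. The key difference from the original push is that the indices are read after the lock is taken, so the full check cannot go stale between the check and the store.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// Minimal test-and-set spinlock (hypothetical helper, busy-waits without backoff).
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the current holder calls unlock()
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

template <typename Element, typename Index = unsigned char>
class LockedRingBuffer {
    static constexpr std::size_t kSlots = std::size_t(1) << (sizeof(Index) * 8);
    std::atomic<Index> readIndex{0};
    std::atomic<Index> writeIndex{0};
    SpinLock producerLock;  // serializes producers (replaces scratchIndex)
    SpinLock consumerLock;  // serializes consumers (assumption, see lead-in)
    Element elements[kSlots];

public:
    bool push(const Element &element) {
        std::lock_guard<SpinLock> guard(producerLock);
        // Indices are read under the lock, so no other producer can move
        // writeIndex between this check and the store below.
        const Index w = writeIndex.load(std::memory_order_relaxed);
        const Index next = static_cast<Index>(w + 1);
        if (next == readIndex.load(std::memory_order_acquire))
            return false;  // full (one slot is always left unused)
        elements[w] = element;
        writeIndex.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    bool pop(Element &element) {
        std::lock_guard<SpinLock> guard(consumerLock);
        const Index r = readIndex.load(std::memory_order_relaxed);
        if (r == writeIndex.load(std::memory_order_acquire))
            return false;  // empty
        element = elements[r];
        readIndex.store(static_cast<Index>(r + 1), std::memory_order_release);
        return true;
    }
};
```

With the default unsigned char index the buffer holds at most 255 elements, since a write index one behind the read index means "full". Whether this beats a plain mutex in practice is something you would have to measure.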
bool push(const Element & element)
{
    while(true)
    {
        const Index currentReadIndex = readIndex.load();
        Index currentWriteIndex = writeIndex.load();
        const Index nextWriteIndex = currentWriteIndex + 1;
        if(nextWriteIndex == currentReadIndex)
            return false;
        if(scratchIndex.compare_exchange_strong(
            currentWriteIndex, nextWriteIndex))
        {
            elements[currentWriteIndex] = element;
            // Problem here!
            writeIndex = nextWriteIndex;
            return true;
        }
    }
}
I've marked the problematic spot. Multiple threads can get to the writeIndex = nextWriteIndex at the same time. The data will be written in any order, although each write will be atomic.
This is a problem because you're trying to update two values using the same atomic condition, which is generally not possible. Assuming the rest of your method is fine, one way around this would be to combine both scratchIndex and writeIndex into a single value of double-size. For example, treating two uint32_t values as a single uint64_t value and operating atomically on that.
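To illustrate only the "treat two uint32_t values as one uint64_t" part, here is a small sketch. IndexPair, pack, unpack and advance_both are hypothetical names, and the choice of which index goes in the upper half is arbitrary. It shows how a single compare-exchange can update both indices together or not at all. It is not a complete queue and does not by itself address how the element store fits in.

```cpp
#include <atomic>
#include <cstdint>

// Two 32-bit indices that must always change together.
struct IndexPair {
    std::uint32_t scratch;
    std::uint32_t write;
};

// scratch goes in the upper 32 bits, write in the lower 32 bits.
inline std::uint64_t pack(IndexPair p) {
    return (static_cast<std::uint64_t>(p.scratch) << 32) | p.write;
}

inline IndexPair unpack(std::uint64_t v) {
    return IndexPair{static_cast<std::uint32_t>(v >> 32),
                     static_cast<std::uint32_t>(v)};
}

// Advances both indices by one in a single atomic step.
// Fails (returns false) if either index no longer matches `expected`.
inline bool advance_both(std::atomic<std::uint64_t> &state, IndexPair expected) {
    std::uint64_t old_value = pack(expected);
    const IndexPair next{expected.scratch + 1u, expected.write + 1u};
    return state.compare_exchange_strong(old_value, pack(next));
}
```

A caller would load the state, unpack it, do its checks, and retry from the load whenever advance_both returns false.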

Design pattern for asynchronous while loop

I have a function that boils down to:
while (doWork)
{
    config = generateConfigurationForTesting();
    result = executeWork(config);
    doWork = !isDone(result);
}
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe, independent of previous iterations, and probably require more iterations than the maximum number of allowable threads?
The problem here is that we don't know in advance how many iterations are required, so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
int thread_count = 0;
bool doWork = true;
int max_threads = 20; // arbitrarily chosen number
dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
while (doWork)
{
    if (thread_count < max_threads)
    {
        thread_count++;
        dispatch_async(queue, ^{ Config myconfig = generateConfigurationForTesting();
            Result myresult = executeWork(myconfig);
            dispatch_async(queue, ^{ checkResult(myresult); }); });
    }
    usleep(100); // don't consume too much CPU
}
void checkResult(Result value)
{
    if (value == good) doWork = false;
    thread_count--;
}
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator which can make a near-infinite number of configuration (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created, since your executor needs to be limited by some reasonable assumptions about the queue and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding value ==good measurements.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to making this look a little nicer would be to have checkResult add new elements into the queue if value!=good. This way, you load up the initial elements of the queue using dispatch_apply( 20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or an amount that dispatch_apply feels is appropriate for your configuration) and then each time checkResult is called you can either set doWork=false or add another operation to queue.
dispatch_apply() works for this, just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() unless !doWork).
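The "keep each worker looping until there is no more work" idea is not specific to GCD. As a sketch of the same shape in portable C++, with std::thread standing in for the dispatch_apply workers, the function below runs a fixed number of workers that each repeat generate/execute/check until one of them succeeds. search_until_done and the attempt callback are hypothetical names. The callback is assumed to do the equivalent of generateConfigurationForTesting() plus executeWork() and to return true when the result is good.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Runs `attempt` repeatedly on `workers` threads until some call returns true.
// Returns the total number of attempts made across all threads.
inline long search_until_done(unsigned workers,
                              const std::function<bool()> &attempt) {
    std::atomic<bool> done{false};   // plays the role of !doWork
    std::atomic<long> attempts{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            // Each worker loops back for more work until someone is done.
            while (!done.load()) {
                attempts.fetch_add(1);
                if (attempt())
                    done.store(true);
            }
        });
    }
    for (auto &t : pool)
        t.join();
    return attempts.load();
}
```

A typical call would pass std::thread::hardware_concurrency() as the worker count, which matches the "ncpu iterations" suggestion. Note that attempts already in flight when the flag flips still run to completion, so the total can slightly exceed the number strictly needed.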
