Vulkan Queue Synchronization in Multithreading

Vulkan Queue Synchronization in Multithreading - multithreading

In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically vkCmdDispatchs to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws/vkQueuePresentKHRs.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to employ the usage of a host mutex fromvkQueueSubmit to vkWaitForFences in both threads.
I want to know, is there perhaps some other method that is more efficient or is this considered to be OK?

Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more optimal than waiting in the app and submitting work after previous work is fully processed.
When You submit work You can provide a semaphore which gets signaled when this work is finished. When You submit another work You can provide the same semaphore on which the second batch should wait. Processing of the second batch will start automatically when the semaphore gets signaled (this semaphore is also automatically unsignaled and can be reused).
(I think there are some constraints on using semaphores, associated with queues. I will update the answer later when I confirm this but they should be sufficient for Your purposes.
[EDIT] There are constraints on using semaphores but it shouldn't affect You - when You use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If You really need to synchronize GPU and Your application, use fences. They are signaled in a similar way as semaphores. But You can check their state on the app side and You need to manually unsignal them before You can use then again.
[EDIT]
I've added an image that more or less shows what I think You should do. One thread calculates state and with each submission adds a semaphore to the top of the list (or a ring buffer as #NicolasBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during "compute" batch submission).
Second thread renders Your scene. It manages it's own list of semaphores similarly to the compute thread. But when You want to render things, You need to be sure that compute thread finished calculations. That's why You need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during "render" batch submission). When You submit rendering commands, compute thread can't start and modify the data because it may influence the results of a rendering. So compute thread also needs to wait until the most recent rendering is done. That's why compute thread also needs to provide a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize submissions. Rendering thread cannot start when a compute threads submits commands and vice versa. That's why adding semaphores to the lists (and taking semaphores from the list) should be synchronized. But this has nothing to do with Vulkan. Probably some mutex will be helpful (for example a C++-ish std::lock_guard<std::mutex>). But this synchronization is a problem only when You have a single buffer.
Another thing is what to do with old semaphores from both lists. You cannot directly check what is their state and You cannot directly unsignal them. The state of semaphores can be checked by using additional fences provided with each submission. You don't wait on them but from time to time check if a given fence is signaled and, if it is, You can destroy old semaphore (as You cannot unsignal it from the application) or You can make an empty submission, with no command buffers, and use that semaphore as a wait semaphore. This way the semaphore will be unsignaled and You can reuse it. But I don't know which solution is more optimal: destroying old and creating new semaphores, or unsignaling them with empty submissions.
When You have a single buffer, a one-element list/ring is probably enough. But more optimal solution would have some kind of a ping-pong set of buffers - You read data from one buffer, but store results in another buffer. And in the next step You swap them. That's why in the image above, the lists of semaphores (rings) may have more elements depending on Your setup. The more independent buffers and semaphores in the lists (of course to some reasonable count), the best performance You will get as You reduce time wasted on waiting. But this complicates Your code and it may also increase a lag (rendering thread gets data that is a bit older than the data currently processed by the compute thread). So You may need to balance performance, code complexity and a rendering lag.

How you do this depends on two factors:
Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where the data is to read from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe of course. And non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. It's srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect if a Vulkan semaphore has been signaled by user code. Nor can you set a semaphore to the unsignaled state by user code. Nor can you reasonably submit a batch that has a semaphore in it that is currently signaled and nobody's waiting on it. You can do the latter, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this. When the graphics queue goes to get its compute data, this must send a signal to the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.

Related

Vulkan Compute dispatch from a child CPU thread

Can Vulkan Compute dispatch from a child CPU thread, or does it have to dispatch from the main thread? I don't think this is possible to dispatch compute shaders in Unity from child threads and I wanted to find out if it could be done in Unreal Engine.

It depends on what you mean by "dispatch" and "main thread".
vkCmdDispatch, as the "Cmd" prefix suggests, puts a command in a command buffer. This can be called on any thread, so long as the VkCommandBuffer object will not have other vkCmd functions called on it at the same time (typically, you reserve specific command buffers for a single thread). So by one definition, you can "dispatch" compute operations from other threads.
Of course, recording commands in a command buffer doesn't actually do anything. Commands only get executed when you queue up those CBs via vkQueueSubmit. Like vkCmdDispatch, it doesn't matter what thread you call that function on. However, like vkCmdDispatch, it does matter that multiple threads be prevented from accessing the same VkQueue object at the same time.
Now, you don't have to use a single thread for that VkQueue; you can lock the VkQueue behind some kind of mutex, so that only one thread can own it at a time. And thus, a thread that creates a CB could submit its own work.
However, ignoring the fact that tasks often need to be inserted into the queue in an order (one task might generate some compute data that a graphics task needs to wait on, so the graphics task CB must be after the compute CB), there's a bigger problem. vkQueueSubmit takes a long time. If you look at the function, it can take an arbitrarily large number of CBs to insert, and it has the ability to have multiple batches, with each batch guarded by semaphores and fences for synchronization. As such, you are strongly encouraged to make as few vkQueueSubmit calls as possible, since each call has a quantity of overhead to it that has nothing to do with how many CBs you are queuing up.
There's even a warning about this in the spec itself.
So the typical way applications are structured is that you farm out tasks to the available CPU threads, and these tasks build command buffers. One particular thread will be anointed as the owner of the queue. That thread may perform some CB building, but once it is done, it will wait for the other tasks to complete and gather up all of the CBs from the other threads. Once gathered, that thread will vkQueueSubmit them in appropriate batches.
You could call that thread the "main thread", but Vulkan itself doesn't really care which thread "owns" the queue. It certainly doesn't have to be your process's initial thread.

Which are blocking Vulkan functions?

In Vulkan, it is recommended to break the API calls into separate threads for better throughput. I am unsure which category of calls are the computationally expensive one which would cause the thread to block, and thus should be used asynchronously.
As I see it, these are the potential calls/family-of-calls that could take a long time to execute.
vkAcquireImageKHR()
vkQueueSubmit()
vkQueuePresentKHR()
memcpy into mapped memory
vkBegin/EndCommandBuffer
vkCmd* calls for drawing and compute
But, the more I think about them, the more it seems that most would be fairly cheap to call. I'll explain my rational, which is probably flawed.
vkAcquireImageKHR()
This could block, if you choose a timeout. But, it's likely that a sufficiently optimized app would call this function with a 0 timeout, and just do other work if the image is not yet available. So, this function can be made instant. There's no need to wait, if the app is smart enough.
vkQueueSubmit()
This function takes a fence, which will be signaled when the GPU has finished executing the command buffers. So, it doesn't actually wait around while the GPU performs the work. I'm assuming this function is the one that starts the physical movement of the command buffer data to the GPU, but I'm assuming that it tell the hardware to read from some memory location, and then the function returns as quickly as possible. So, it wouldn't wait around while the command buffers get sent to the GPU.
vkQueuePresentKHR()
Signal to the GPU to send some image to the window/monitor. It doesn't have to wait for much, does it?
memcpy into mapped memory
This is probably slow.
vkCmd* calls
This family of calls is the one I'm most unsure about. When I read about threads and Vulkan, it's usually these calls that get put onto the threads. But, what are these calls doing, really? Are they building some opcode buffer, made up of some ints and pointers, to be sent to the GPU? If so, that should be extremely fast. The actual work would be carrying out the operations described by those opcodes.

Define "block". The traditional definition of "block"ing is to wait on some internal synchronization, and thereby taking longer than would strictly be necessary for the operation. Doing a memcpy is not doing any synchronization; it's just copying data.
So you don't seem to be concerned about "block"ing; you're merely talking about what operations are expensive.
vkQueueSubmit does not block. But that doesn't mean it's not expensive. It is not "tell[ing] the hardware to read from some memory location" Just look at its interface. It doesn't take a single command buffer; it takes an arbitrary number of them, which are grouped into batches, with each batch waiting on semaphores before execution, signaling semaphores after execution, and the whole operation signaling a fence.
You cannot reasonably expect an implementation of such a thing to merely copy some pointers around.
And that doesn't even get into issues of different types of command buffers. Submitting SIMULTANEOUS_USE command buffers may require creating temporary copies of its buffered data, so that different batches can contain the same command buffer.
Now obviously, vkQueueSubmit is going to return well before any of the work it submits actually gets executed. But don't make the mistake of thinking that it's free to ship work off to the GPU. The Vulkan specification takes time out in a note to directly tell you not to call the function any more frequently than you can get away with:
Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit as possible.
The reason to present on the same thread that submitted the CBs that generates the image being presented is not because any of those operations are necessarily slow. It's for simple pragmatism; these three operations (acquire, submit, present) must happen in order. And the simplest and easiest way to ensure that is to do them on the same thread.
You cannot submit work that renders to a swapchain image until you have acquired it. Therefore, either you do it on the same thread, or you have to have some inter-thread communication pipe to tell the thread waiting to build the primary CB what the acquired image is. The two processes cannot overlap.
Unlike acquire, present is a queue operation. And both vkQueueSubmit and vkQueuePresent require that access to their VkQueue parameters must be "externally synchoronized". That of course means that you cannot call them both from different threads, on the same VkQueue, at the same time. So if you tried to do these in parallel, you'd need a mutex or something to synchronize CPU access to the VkQueue.
Whereas if you do them on the same thread, there's no need.
Additionally, in order to present an image, you must provide a semaphore that the present will wait on. This semaphore will get signaled by the batch that generates data for the image. Vulkan requires semaphore signal/wait pairs to be ordered; you cannot perform a queue operation that waits on a semaphore until the operation that signals that semaphore has been submitted. Therefore, either you do it on the same thread in sequence, or you use some inter-thread communication pipe to tell whatever thread is waiting to present the image that the submit operation that renders to it has been issued.
So what is to be gained by splitting these operations up onto different threads? They have to happen in sequence, so you may as well do them in sequence the easiest way that exists: on the same thread.
While timeline semaphores now allow you to call the present function before submitting the work that increments the semaphore counter, you still can't call them on separate threads (without synchronization) because they affect the same queue. So you may as well issue them on the same thread (though not necessarily in acquire, submit, present order).
Ultimately, it's not clear what the point of this exercise is. Yes, an individual vkCmd* call will be pretty fast. So what? In a real scene, you will be calling these functions thousands of times per frame. Spreading them evenly across 4 cores saves you ~4x the performance.

Why is this MPMC Queue not lock free? [duplicate]

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
Atomically LOAD the write_index
Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
If the CompareAndSwap is successful, copy the user data into the
reserved slot.
Finally, update the sequence_index on the
slot by making it equal to write_index + 1.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
int write_index = m_write_index.load(std::memory_order_relaxed);
for (;;)
{
slot& s = m_slots[write_index % m_num_slots];
int sequence_number = s.sequence_number.load(std::memory_order_acquire);
int difference = sequence_number - write_index;
if (difference == 0)
{
if (m_write_index.compare_exchange_weak(
write_index,
write_index + 1,
std::memory_order_acq_rel
))
{
break;
}
}
if (difference < 0) return false; // queue is full
}
// Copy user-data and update sequence number
//
s.user_data = data;
s.sequence_number.store(write_index + 1, std::memory_order_release);
return true;
}
Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the defintion of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?

This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:
A structure is lock-free if only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.
Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.
In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.
There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.
Performance
Uncontended Performance
The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.
This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)1, and little other overhead.
This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
So I expect this queue to perform reasonably well in the uncontended fast-path.
Contended Performance
One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually that case that lock-free structures exhibit better scaling that a simple one-lock-to-rule-them-all alternative.
This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all readers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.
Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.
Context-switch Immunity
One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unluckly scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.
Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.
This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.
While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!
Functional Aspects
Async Thread Termination
On advantage of lock-free structures it they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure is a consistent state.
This is not the case for this queue, as described above.
Queue Access from Interrupt or Signal
A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.
This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.
The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to full drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.
1 In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.

I am the author of liblfds.
The OP is correct in his description of this queue.
It is the single data structure in the library which is not lock-free.
This is described in the documentation for the queue;
http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour
"It must be understood though that this is not actually a lock-free data structure."
This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.
By then it was working, so I included it.
I do have some thought to remove it, since it is not lock-free.

Most of the time people use lock-free when they really mean lockless. lockless means a data-structure or algorithm that does not use locks, but there is no guarantee for forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned is not lock-free.

A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.
However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH and element and then try and FAIL to POP it, because some other thread is busy pushing an earlier item.
Even so, this queue could still be useful in some lock-free solutions for various problems.
For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?

"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.
When talk about mcmp_queue::enqueue function, which returns FALSE if underlying queue is full, its implementation (given in the question post) is lock-free.
However, implementing mcmp_queue::dequeue in lock-free manner would be difficult. E.g., this pattern is obviously not-lock free, as it spins on the variable changed by other thread:
while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;

I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.
Just because there is no explicit "locking", doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:
Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.

what is "synchronising point" in multi threading?

I need to decide on how many counting semaphores needs to be used for one of the multi threaded application. I came to know, if we know synchronising points then we can decide the number of semaphores to be used. What are synchronising points?

A synchronization point is a place in the flow of execution where a thread must wait for other busy threads, so that the data they are working on is in a proper state to continue.
For instance, a process may have to way for a free slot in a buffer, while another thread is emptying the buffer.

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (or maybe this is a just a dumb example; in such a case please point me to a sensible one)
Thanks in advance.

Consider this:
void update_shared_variable() {
sem_wait( &g_shared_variable_mutex );
g_shared_variable++;
sem_post( &g_shared_variable_mutex );
}
void thread1() {
do_thing_1a();
do_thing_1b();
do_thing_1c();
update_shared_variable(); // may block
}
void thread2() {
do_thing_2a();
do_thing_2b();
do_thing_2c();
update_shared_variable(); // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)

When you are using multithreading not everycode that runs will be blocking. For example, if you had a queue, and two threads are reading from that queue, you would make sure that no thread reads at the same time from the queue, so that part would be blocking, but that's the part that will probably take the less time. Once you have retrieved the item to process from the queue, all the rest of the code can be run asynchronously.

The idea behind the threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned, when a process has to wait for I/O.
Being blocked while waiting for the shared resource is small when compared to the processing time, this is when you want to use multiple threads.

This is of course a SSCCE (Short, Self Contained, Correct Example)
Let's say you have 2 worker threads that do a lot of work and write the result to a file.
you only need to lock the file (shared resource) access.

The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers like Intel's will convert suitable for loops to threads automatically for you. In that particular circumstances no semaphores are needed because of the iterations' data independence.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data then doing A then B and then output the data before reading more input. Or you could have a thread reading and doing A, another thread doing B and output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data it would be sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (which limits that communication channel to send-finishes-only-if-receiver-has-read) is that it stops you falling into the many many pitfalls that plague multithreaded do programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture, AMD's Hypertransport. And it means hat the 'channel' really could be a network connection; scalability built in by design.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string