I have vertex buffers holding meshes of terrain chunks. Whenever the player edits terrain, the mesh of the corresponding chunk must be regenerated and uploaded to the vertex buffer. Since regenerating the mesh takes some time, I do it in an asynchronous worker thread.
The issue is that the main thread draws the buffer at the same moment the worker thread uploads new data. That means that after the player edits the terrain, a corrupted chunk gets rendered for one frame. It flares up once, and after that the correct buffer gets drawn.
This kind of made sense to me; of course we shouldn't write and read the same data at the same time. So instead of updating the old buffer, I created a new one, filled it, and swapped them. The swap was just changing the buffer ID stored within the terrain chunk struct, so that should be atomic. However, that didn't help.
Since OpenGL commands are sent to a command queue on the GPU, they don't have to be finished executing by the time the application on the CPU continues. So I may have swapped the buffers before the new one was actually ready.
I also tried an alternative to swapping the buffers: a mutex for buffer access. The main thread locks the mutex while drawing and the worker thread locks it while uploading new buffer data. However, this didn't help either, and that may again be because of OpenGL's asynchronous nature: the main thread doesn't actually draw, it just sends draw commands to the GPU. On the other hand, if there really is only one command queue, uploading buffers and drawing them could never happen at the same time, could it?
How can I synchronize vertex buffer access from my two threads to prevent an undefined buffer from being drawn for one frame?
You must make sure that the buffer update has actually completed before you use that buffer in your draw thread. The easiest solution would be to call glFinish in your update thread after you have issued all the update GL commands, and only notify the draw thread after that call has returned.
For more fine-grained control over the synchronization, I would advise you to have a look at fence sync objects (as described in the GL_ARB_sync extension). You can issue a fence sync after you have issued your update commands and store the sync object handle alongside your buffer handle, so that the draw thread can check whether the update has actually completed (or wait for it). Note that sync objects are somewhat special in that they are the only objects not tied to a GL context, so they can be used in multi-context setups.
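For illustration, here is a minimal sketch of that idea, assuming a shared-context setup where the worker has its own upload context, and a chunk struct with hypothetical pendingBuffer/pendingFence fields (none of this is from the question):

// Worker thread (upload context): fill a fresh buffer and publish it with a fence.
glBindBuffer(GL_ARRAY_BUFFER, newBuffer);
glBufferData(GL_ARRAY_BUFFER, size, vertices, GL_STATIC_DRAW);
GLsync uploadDone = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();                          // ensure the fence reaches the GPU so other contexts can see it
chunk.pendingBuffer = newBuffer;    // hypothetical fields read by the draw thread
chunk.pendingFence  = uploadDone;

// Draw thread (render context), once per frame: only swap once the fence has signaled.
if (chunk.pendingFence) {
    GLenum r = glClientWaitSync(chunk.pendingFence, 0, 0);   // poll with zero timeout, never block
    if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
        glDeleteSync(chunk.pendingFence);
        chunk.pendingFence = nullptr;
        glDeleteBuffers(1, &chunk.buffer);
        chunk.buffer = chunk.pendingBuffer;                   // safe to draw from the new buffer now
    }
}
glBindBuffer(GL_ARRAY_BUFFER, chunk.buffer);
// ... draw the chunk as before ...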
I have a setup where a scene is rendered into an offscreen OpenGL framebuffer, then a compute shader extracts some data from it and puts it into a ring buffer allocated on the device. This ring buffer is mapped with glMapBufferRange() to host-readable memory.
On the host side, there should be an interface where a Push() function enqueues the OpenGL operations on the command queue, followed by a glFenceSync().
A Pull() function uses glClientWaitSync() to wait for the corresponding sync object to be signaled, and then reads and returns the data from that part of the ring buffer.
Ideally it should be possible to call Push() and Pull() from different threads.
But there is the problem that an OpenGL context can only be current on one thread at a time, and glClientWaitSync(), like all other GL functions, needs the proper context to be current.
So the Pull() thread would take the OpenGL context and then call glClientWaitSync(), which can block. During that time Push() cannot be called, because the context still belongs to the other thread while it is waiting.
Is there a way to temporarily release the thread's current OpenGL context while waiting in glClientWaitSync() (in a way similar to how std::condition_variable::wait() unlocks the mutex), or to wait on a GLSync object belonging to another context?
The only solutions seem to be to periodically poll glClientWaitSync() with a zero timeout (and release the context in between), or to set up a second OpenGL context with resource sharing.
You cannot change the current context in someone else's thread. Well you can (by making that context current in yours), but that causes a data race if the other thread is currently in, or tries to call, an OpenGL function.
Instead, you should have two contexts with objects shared between them. Sync objects are shared between contexts, so that's not a problem. However, after you create the fence, you need to flush it on the context that created it before another thread tries to wait on it.
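A minimal sketch of that arrangement, assuming the Pull() thread has its own context created with resource sharing against the main one, the ring buffer mapping stays valid across threads, and the handle/parameter names are made up:

#include <cstring>

// Push() runs on the thread that owns the main context.
GLsync Push() {
    // ... enqueue the render and compute-shader work here ...
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();   // required so the fence is visible to the other context before it waits
    return fence;
}

// Pull() runs on a second thread whose own shared context is current,
// so blocking here never stalls the Push() thread.
void Pull(GLsync fence, const void* mappedRing, size_t offset, size_t size, void* dst) {
    GLenum r = glClientWaitSync(fence, 0, 1000000000);   // wait up to 1 s (timeout in nanoseconds)
    if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
        std::memcpy(dst, static_cast<const char*>(mappedRing) + offset, size);
    }
    glDeleteSync(fence);
}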
In Vulkan, it is recommended to spread API calls across separate threads for better throughput. I am unsure which categories of calls are the computationally expensive ones that would cause a thread to block, and thus should be called asynchronously.
As I see it, these are the calls/families of calls that could potentially take a long time to execute.
vkAcquireNextImageKHR()
vkQueueSubmit()
vkQueuePresentKHR()
memcpy into mapped memory
vkBegin/EndCommandBuffer
vkCmd* calls for drawing and compute
But the more I think about them, the more it seems that most would be fairly cheap to call. I'll explain my rationale, which is probably flawed.
vkAcquireNextImageKHR()
This could block if you pass a non-zero timeout. But it's likely that a sufficiently optimized app would call this function with a 0 timeout and just do other work if the image is not yet available. So this function can be made effectively instant; there's no need to wait if the app is smart enough.
vkQueueSubmit()
This function takes a fence, which will be signaled when the GPU has finished executing the command buffers, so it doesn't actually wait around while the GPU performs the work. I assume this function is the one that starts the physical movement of the command buffer data to the GPU, but that it just tells the hardware to read from some memory location and then returns as quickly as possible. So it wouldn't wait around while the command buffers get sent to the GPU.
vkQueuePresentKHR()
Signal to the GPU to send some image to the window/monitor. It doesn't have to wait for much, does it?
memcpy into mapped memory
This is probably slow.
vkCmd* calls
This family of calls is the one I'm most unsure about. When I read about threads and Vulkan, it's usually these calls that get put onto the threads. But, what are these calls doing, really? Are they building some opcode buffer, made up of some ints and pointers, to be sent to the GPU? If so, that should be extremely fast. The actual work would be carrying out the operations described by those opcodes.
Define "block". The traditional definition of "block"ing is to wait on some internal synchronization, and thereby taking longer than would strictly be necessary for the operation. Doing a memcpy is not doing any synchronization; it's just copying data.
So you don't seem to be concerned about "block"ing; you're merely talking about what operations are expensive.
vkQueueSubmit does not block. But that doesn't mean it's not expensive. It is not "tell[ing] the hardware to read from some memory location" Just look at its interface. It doesn't take a single command buffer; it takes an arbitrary number of them, which are grouped into batches, with each batch waiting on semaphores before execution, signaling semaphores after execution, and the whole operation signaling a fence.
You cannot reasonably expect an implementation of such a thing to merely copy some pointers around.
And that doesn't even get into the issues with different types of command buffers. Submitting SIMULTANEOUS_USE command buffers may require creating temporary copies of their buffered data, so that different batches can contain the same command buffer.
Now obviously, vkQueueSubmit is going to return well before any of the work it submits actually gets executed. But don't make the mistake of thinking that it's free to ship work off to the GPU. The Vulkan specification takes time out in a note to directly tell you not to call the function any more frequently than you can get away with:
Submission can be a high overhead operation, and applications should attempt to batch work together into as few calls to vkQueueSubmit as possible.
The reason to present on the same thread that submitted the CBs that generate the image being presented is not that any of those operations are necessarily slow. It's simple pragmatism; these three operations (acquire, submit, present) must happen in order. And the simplest and easiest way to ensure that is to do them on the same thread.
You cannot submit work that renders to a swapchain image until you have acquired it. Therefore, either you do it on the same thread, or you have to have some inter-thread communication pipe to tell the thread waiting to build the primary CB what the acquired image is. The two processes cannot overlap.
Unlike acquire, present is a queue operation. And both vkQueueSubmit and vkQueuePresentKHR require that access to their VkQueue parameters must be "externally synchronized". That of course means that you cannot call them both from different threads, on the same VkQueue, at the same time. So if you tried to do these in parallel, you'd need a mutex or something to synchronize CPU access to the VkQueue.
Whereas if you do them on the same thread, there's no need.
Additionally, in order to present an image, you must provide a semaphore that the present will wait on. This semaphore will get signaled by the batch that generates data for the image. Vulkan requires semaphore signal/wait pairs to be ordered; you cannot perform a queue operation that waits on a semaphore until the operation that signals that semaphore has been submitted. Therefore, either you do it on the same thread in sequence, or you use some inter-thread communication pipe to tell whatever thread is waiting to present the image that the submit operation that renders to it has been issued.
So what is to be gained by splitting these operations up onto different threads? They have to happen in sequence, so you may as well do them in sequence the easiest way that exists: on the same thread.
While timeline semaphores now allow you to call the present function before submitting the work that increments the semaphore counter, you still can't call them on separate threads (without synchronization) because they affect the same queue. So you may as well issue them on the same thread (though not necessarily in acquire, submit, present order).
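For reference, a minimal per-frame sketch of that in-order sequence on a single thread (hypothetical handles, error handling omitted):

uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      imageAvailableSem, VK_NULL_HANDLE, &imageIndex);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = {};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &imageAvailableSem;      // wait until the image has been acquired
submit.pWaitDstStageMask = &waitStage;
submit.commandBufferCount = 1;
submit.pCommandBuffers = &frameCb;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores = &renderFinishedSem;    // signaled when rendering is done
vkQueueSubmit(graphicsQueue, 1, &submit, frameFence);

VkPresentInfoKHR present = {};
present.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present.waitSemaphoreCount = 1;
present.pWaitSemaphores = &renderFinishedSem;     // present only after rendering completes
present.swapchainCount = 1;
present.pSwapchains = &swapchain;
present.pImageIndices = &imageIndex;
vkQueuePresentKHR(graphicsQueue, &present);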
Ultimately, it's not clear what the point of this exercise is. Yes, an individual vkCmd* call will be pretty fast. So what? In a real scene, you will be calling these functions thousands of times per frame; spreading that recording work evenly across 4 cores gains you roughly a 4x speedup on it.
In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically vkCmdDispatchs to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws/vkQueuePresentKHRs.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to hold a host mutex from vkQueueSubmit to vkWaitForFences in both threads.
I want to know: is there perhaps some other method that is more efficient, or is this considered OK?
Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more efficient than waiting in the app and only submitting work after previous work is fully processed.
When you submit work you can provide a semaphore that gets signaled when this work is finished. When you submit another batch of work you can provide the same semaphore for the second batch to wait on. Processing of the second batch will start automatically when the semaphore gets signaled (the semaphore is also automatically unsignaled by that wait and can be reused).
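For example (hypothetical handles): the first submission signals a semaphore and the second waits on it, so the ordering is handled entirely on the GPU:

// First batch: signals computeDoneSem when it finishes executing on the GPU.
VkSubmitInfo first = {};
first.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
first.commandBufferCount = 1;
first.pCommandBuffers = &computeCb;
first.signalSemaphoreCount = 1;
first.pSignalSemaphores = &computeDoneSem;
vkQueueSubmit(computeQueue, 1, &first, VK_NULL_HANDLE);

// Second batch: does not start executing until computeDoneSem has been signaled.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
VkSubmitInfo second = {};
second.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
second.waitSemaphoreCount = 1;
second.pWaitSemaphores = &computeDoneSem;
second.pWaitDstStageMask = &waitStage;
second.commandBufferCount = 1;
second.pCommandBuffers = &renderCb;
second.signalSemaphoreCount = 1;
second.pSignalSemaphores = &renderDoneSem;
vkQueueSubmit(graphicsQueue, 1, &second, VK_NULL_HANDLE);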
(I think there are some constraints on using semaphores, associated with queues. I will update the answer when I confirm this, but they should be sufficient for your purposes.
[EDIT] There are constraints on using semaphores, but they shouldn't affect you: when you use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If you really need to synchronize the GPU with your application, use fences. They are signaled in a similar way to semaphores, but you can check their state on the app side, and you need to manually reset (unsignal) them before you can use them again.
[EDIT]
What I think you should do is more or less the following. One thread calculates state and with each submission adds a semaphore to the top of a list (or a ring buffer, as @NicolBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during the "compute" batch submission).
The second thread renders your scene. It manages its own list of semaphores, similar to the compute thread. But when you want to render things, you need to be sure that the compute thread has finished its calculations. That's why you need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during the "render" batch submission). And once you have submitted rendering commands, the compute thread must not start modifying the data, because that could influence the results of the rendering. So the compute thread also needs to wait until the most recent rendering is done, which is why it also provides a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize the submissions: the rendering thread cannot submit while the compute thread is submitting commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from them) should be synchronized. But this has nothing to do with Vulkan; some mutex will probably be helpful (for example a C++-ish std::lock_guard<std::mutex>), as sketched below. This synchronization is only a problem when you have a single buffer.
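A tiny sketch of that CPU-side handoff (all names are made up): each thread publishes the semaphore its latest submission signals and picks up the other thread's most recent one under a mutex before building its own submit:

struct SemaphoreExchange {
    std::mutex lock;
    VkSemaphore latestCompute = VK_NULL_HANDLE;   // signaled by the last compute submit
    VkSemaphore latestRender  = VK_NULL_HANDLE;   // signaled by the last render submit
};

// Compute thread: fetch the semaphore to wait on, publish the one it will signal.
VkSemaphore BeginComputeSubmit(SemaphoreExchange& ex, VkSemaphore computeDone) {
    std::lock_guard<std::mutex> guard(ex.lock);
    VkSemaphore waitOn = ex.latestRender;         // may be VK_NULL_HANDLE on the first iteration
    ex.latestRender = VK_NULL_HANDLE;             // a binary semaphore may only be waited on once
    ex.latestCompute = computeDone;
    return waitOn;                                // pass as the wait semaphore of the compute submit
}
// The render thread does the mirror image, swapping latestCompute and latestRender.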
Another question is what to do with the old semaphores from both lists. You cannot directly check their state and you cannot directly unsignal them. Their state can be checked by using additional fences provided with each submission: you don't wait on them, but from time to time you check whether a given fence is signaled, and if it is, you can either destroy the old semaphore (as you cannot unsignal it from the application) or make an empty submission, with no command buffers, that uses the semaphore as a wait semaphore. That way the semaphore gets unsignaled and you can reuse it. I don't know which solution is more efficient: destroying old semaphores and creating new ones, or unsignaling them with empty submissions.
When you have a single buffer, a one-element list/ring is probably enough. But a better solution would use some kind of ping-pong set of buffers: you read data from one buffer but store results in another, and in the next step you swap them. That's why the lists of semaphores (rings) may have more elements, depending on your setup. The more independent buffers and semaphores in the lists (up to some reasonable count), the better performance you will get, as you reduce time wasted on waiting. But this complicates your code, and it may also increase lag (the rendering thread gets data that is a bit older than the data currently being processed by the compute thread). So you may need to balance performance, code complexity, and rendering lag.
How you do this depends on two factors:
1. Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
2. The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some notion that a graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where to read the data from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe of course. And non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
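For example, a sketch of that barrier for the vertex-data case (the stage and access choices follow the text above; the command buffer handle is assumed):

// Recorded at the front of the graphics CB, which is issued after the compute CBs
// in the same batch. Makes the compute shader's writes visible to vertex fetching.
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
vkCmdPipelineBarrier(graphicsCb,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // srcStageMask
                     VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,     // dstStageMask; widen as needed
                     0,
                     1, &barrier,
                     0, nullptr,
                     0, nullptr);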
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect from user code whether a Vulkan semaphore has been signaled. Nor can you set a semaphore back to the unsignaled state from user code. Nor can you reasonably submit a batch that signals a semaphore which is currently signaled and which nobody is waiting on. You can do the latter, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this. When the graphics thread goes to get its compute data, it must signal the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.
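A tiny sketch of that last idea (hypothetical names; requires <atomic> and <thread>), where the counter is only incremented after the signaling submit has returned:

// Shared counter: number of compute batches whose signaling submit has been issued.
std::atomic<uint64_t> computeSubmitsIssued{0};

// Compute thread, right after the vkQueueSubmit that signals the semaphore:
computeSubmitsIssued.fetch_add(1, std::memory_order_release);

// Graphics thread, before submitting the batch that waits on that semaphore:
while (computeSubmitsIssued.load(std::memory_order_acquire) < neededCount) {
    std::this_thread::yield();   // or do other useful work in the meantime
}
// The signaling batch is now pending execution, so submitting the waiting batch is legal.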
I have a multithreaded application where I want to allow all but one of the threads to run concurrently. However, when one specific thread wakes up, I need the rest of the threads to block.
My current implementation is:
void ManyBackgroundThreadsDoingWork()
{
    AcquireMutex(mutex);      // capture/copy callbacks all contend for this
    DoTheBackgroundWork();
    ReleaseTheMutex(mutex);
}

void MainThread()
{
    AcquireMutex(mutex);      // render thread takes the same mutex
    DoTheMainThreadWork();
    ReleaseTheMutex(mutex);
}
This works, in that it does indeed keep the background threads from operating inside the critical section while the main thread is doing its work. However, there is a lot of contention for the mutex amongst the background threads, even when they don't necessarily need it. The main thread runs only intermittently, and the background threads are able to run concurrently with each other, just not with the main thread.
What I've effectively done is reduce a multithreaded architecture to a single-threaded one using locks... which is silly. What I really want is an architecture that is multithreaded most of the time, but that waits while a small operation completes and then goes back to being multithreaded.
Edit: An explanation of the problem.
What I have is an application that displays multiple video feeds coming from PCIe capture cards. The capture card driver issues callbacks on threads it manages into what is effectively the ManyBackgroundThreadsDoingWork function. In this function I copy the captured video frames into buffers for rendering. The main thread is the render thread, which runs intermittently. The copy threads need to block during the render to prevent tearing of the video.
My initial approach was to simply do double buffering, but that is not really an option, as the capture card driver won't let me buffer frames without pushing them through system memory. The technique being used is AMD's "DirectGMA", which allows the capture card to push video frames directly into GPU memory. The only method of synchronization is to put a GL fence and a mutex around the actual rendering, as the capture card will be continuously streaming data into GPU memory. The driver offers no indication of when a frame transfer completes; the callback supplies enough information for me to know that a frame is ready to be transferred, at which point I trigger the transfer. However, I need to block transfers during the scene render to prevent tearing and artifacts in the video. The technique described above is the one suggested by the PCIe card manufacturer, but it breaks down when you want more than one video playing at a time. Hence the question.
You need a lock that supports both shared and exclusive locking modes, sometimes called a readers/writer lock. This permits multiple threads to get read (shared) locks until one thread requests an exclusive (write) lock.
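For instance, with C++17's std::shared_mutex (a sketch reusing the hypothetical function names from the question):

#include <mutex>
#include <shared_mutex>

std::shared_mutex frameLock;

// Capture/copy callbacks: many threads may hold the shared lock at the same time.
void ManyBackgroundThreadsDoingWork()
{
    std::shared_lock<std::shared_mutex> lock(frameLock);
    DoTheBackgroundWork();
}

// Render thread: the exclusive lock keeps all copy threads out while it runs.
void MainThread()
{
    std::unique_lock<std::shared_mutex> lock(frameLock);
    DoTheMainThreadWork();
}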
Suppose I use QGLWidget's paintGL() method to draw into the widget using OpenGL. After Qt has called the paintGL() method, it automatically triggers a buffer swap. In OpenGL, this buffer swap usually blocks the calling thread until rendering to the back buffer has completed, right? I wonder which Qt thread calls paintGL() and performs the buffer swap. Is it the main Qt UI thread? If it is, wouldn't that mean that blocking during the buffer swap also blocks the whole UI? I could not find any information about this process in general.
Thanks
I don't use QGLWidget very often, but consider that yes, if swapBuffers() is synchronous, the Qt GUI thread is blocked. This means that during that operation you'll be unable to process events.
Anyway, if you're experiencing difficulties with this, consider reading this article, which manages to overcome the problem with multithreaded OpenGL.
Even better, this article explains the situation well and introduces the new multithreaded OpenGL capabilities in Qt 4.8, which is now at the release candidate stage.
In OpenGL, this buffer swap usually blocks the calling thread until the frame rendering to the background buffer is completed, right?
It depends on how it is implemented. Which means that it varies from hardware to hardware and driver to driver.
If it is, wouldn't that mean that the block during the buffer swap also blocks the whole UI?
Even if it does block, it will only do so for 1/60th of a second; maybe 1/30th if your game is slowing down, or 1/15th if you're really slow. The at most one keypress or mouse action that the user produces in that time will still be sitting in the message queue.
The issue with blocking isn't about the UI; it will be responsive enough that the user won't notice. But if you have strict timing requirements (such as you might for a game), I would suggest avoiding paintGL altogether. You should be rendering when you want to, not when Qt tells you to.