I have a setup where a scene is rendered in an offscreen OpenGL framebuffer, then a compute shader extracts some data from it and puts it into a ring buffer allocated on the device. This ring buffer is mapped with glMapBufferRange() to host-readable memory.
On the host side, there should be an interface where a Push() function enqueues the OpenGL operations on the command queue, followed by a glFenceSync().
And a Pull() function uses glClientWaitSync() to wait for the sync object to be signaled, and then reads and returns the data from part of the ring buffer.
Ideally it should be possible to call Push() and Pull() from different threads.
But there is the problem that an OpenGL context can only be current on one thread at a time, while glClientWaitSync(), like all other GL functions, needs the proper context to be current.
So the Pull() thread would take the OpenGL context and then call glClientWaitSync(), which can block. During that time Push() cannot be called, because the context still belongs to the other thread while it is waiting.
Is there a way to temporarily release the thread's current OpenGL context while waiting in glClientWaitSync() (similar to how std::condition_variable::wait() unlocks the mutex), or to wait on a GLsync object belonging to another context?
The only solutions seem to be to periodically poll glClientWaitSync() with a zero timeout instead (and release the context in between), or to set up a second OpenGL context with resource sharing.
You cannot change the current context in someone else's thread. Well, you can (by making that context current in yours), but that causes a data race if the other thread is currently in, or tries to call, an OpenGL function.
Instead, you should have two contexts with objects shared between them. Sync objects are shared between contexts, so that's not a problem. However, after creating the fence you need to flush the context that created it before another thread tries to wait on it.
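Something like this (an untested sketch; context creation and the ring-buffer commands are omitted, and the GL 3.2 sync entry points are assumed to come from your usual loader):

```cpp
// Both threads have their own context; the two contexts share objects.
// Sync objects are shared across all contexts.
GLsync Push()                        // producer context is current here
{
    // ... enqueue render / compute / copy-to-ring-buffer commands ...
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();                       // make sure the fence reaches the GPU
                                     // before another context waits on it
    return fence;                    // hand this to the Pull() thread
}

void Pull(GLsync fence)              // consumer context is current here
{
    // Blocks only this thread; the producer context stays free to use.
    while (glClientWaitSync(fence, 0, 1000000000 /* 1 s in ns */)
           == GL_TIMEOUT_EXPIRED)
        ;                            // keep waiting (or bail out / log)
    glDeleteSync(fence);
    // ... now read the mapped ring-buffer region ...
}
```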
Related
In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically records a vkCmdDispatch to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws and calls vkQueuePresentKHR.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to hold a host mutex from vkQueueSubmit to vkWaitForFences in both threads, roughly as sketched below.
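A rough, untested sketch of that idea (names like bufferMutex are made up; device, queue, submit info and fence come from elsewhere):

```cpp
#include <cstdint>
#include <mutex>
#include <vulkan/vulkan.h>

std::mutex bufferMutex;   // guards the VkBuffer across both threads

// Used by both the "state" and the "graphics" thread: holding the mutex from
// vkQueueSubmit until the fence signals means the two submissions can never
// overlap on the GPU (simple, but the CPU stalls while waiting).
void submitAndWait(VkDevice device, VkQueue queue,
                   const VkSubmitInfo& submit, VkFence fence)
{
    std::lock_guard<std::mutex> lock(bufferMutex);
    vkResetFences(device, 1, &fence);
    vkQueueSubmit(queue, 1, &submit, fence);
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}
```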
I want to know: is there some other method that is more efficient, or is this considered OK?
Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more efficient than waiting in the application and only submitting work after the previous work is fully processed.
When You submit work You can provide a semaphore which gets signaled when this work is finished. When You submit another work You can provide the same semaphore on which the second batch should wait. Processing of the second batch will start automatically when the semaphore gets signaled (this semaphore is also automatically unsignaled and can be reused).
(I think there are some constraints on using semaphores, associated with queues. I will update the answer later when I confirm this but they should be sufficient for Your purposes.
[EDIT] There are constraints on using semaphores but it shouldn't affect You - when You use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
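For example, a sketch of chaining a "compute" batch and a "render" batch with one semaphore (untested; the command buffers, queues and the semaphore are assumed to have been created elsewhere):

```cpp
#include <vulkan/vulkan.h>

// computeDone was created earlier with vkCreateSemaphore.
void submitComputeThenRender(VkQueue computeQueue, VkQueue graphicsQueue,
                             VkCommandBuffer computeCmdBuf,
                             VkCommandBuffer renderCmdBuf,
                             VkSemaphore computeDone)
{
    // "Compute" batch: signals the semaphore when it finishes.
    VkSubmitInfo computeSubmit{};
    computeSubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    computeSubmit.commandBufferCount   = 1;
    computeSubmit.pCommandBuffers      = &computeCmdBuf;
    computeSubmit.signalSemaphoreCount = 1;
    computeSubmit.pSignalSemaphores    = &computeDone;
    vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

    // "Render" batch: waits on the semaphore before its vertex shader stage runs.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    VkSubmitInfo renderSubmit{};
    renderSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    renderSubmit.waitSemaphoreCount = 1;
    renderSubmit.pWaitSemaphores    = &computeDone;
    renderSubmit.pWaitDstStageMask  = &waitStage;
    renderSubmit.commandBufferCount = 1;
    renderSubmit.pCommandBuffers    = &renderCmdBuf;
    vkQueueSubmit(graphicsQueue, 1, &renderSubmit, VK_NULL_HANDLE);
}
```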
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If You really need to synchronize the GPU and Your application, use fences. They are signaled in a similar way as semaphores. But You can check their state on the app side, and You need to manually unsignal them before You can use them again.
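A minimal fence sketch (the fence is the one You passed to vkQueueSubmit; it gets signaled when that batch finishes):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

bool isBatchDone(VkDevice device, VkFence fence)
{
    return vkGetFenceStatus(device, fence) == VK_SUCCESS;   // non-blocking check
}

void waitAndRecycle(VkDevice device, VkFence fence)
{
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX); // blocking wait
    vkResetFences(device, 1, &fence);                        // manually unsignal
                                                             // before the next use
}
```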
[EDIT]
I've added an image that more or less shows what I think You should do. One thread calculates state and with each submission adds a semaphore to the top of the list (or a ring buffer as #NicolasBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during "compute" batch submission).
The second thread renders Your scene. It manages its own list of semaphores, similarly to the compute thread. But when You want to render things, You need to be sure that the compute thread has finished its calculations. That's why You need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during the "render" batch submission). And once You have submitted rendering commands, the compute thread can't be allowed to start modifying the data, because that could influence the results of the rendering. So the compute thread also needs to wait until the most recent rendering is done. That's why the compute thread also needs to provide a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize the submissions themselves. The rendering thread cannot submit while the compute thread is submitting commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from them) should be synchronized. But this has nothing to do with Vulkan; some mutex will probably be helpful (for example the C++-ish std::lock_guard<std::mutex>). This synchronization is only a problem when You have a single buffer, though.
Another thing is what to do with old semaphores from both lists. You cannot directly check their state and You cannot directly unsignal them. Their state can be checked by using additional fences provided with each submission. You don't wait on them, but from time to time You check whether a given fence is signaled and, if it is, You can destroy the old semaphore (as You cannot unsignal it from the application), or You can make an empty submission, with no command buffers, and use that semaphore as a wait semaphore. This way the semaphore will be unsignaled and You can reuse it. But I don't know which solution is more optimal: destroying old semaphores and creating new ones, or unsignaling them with empty submissions.
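A sketch of that cleanup, following the destroy-the-old-semaphore variant (the struct and container names are made up for illustration):

```cpp
#include <deque>
#include <vulkan/vulkan.h>

struct SubmittedBatch { VkSemaphore semaphore; VkFence fence; };

// Call this from time to time; each submission got its own fence.
void recycleFinished(VkDevice device, std::deque<SubmittedBatch>& inFlight)
{
    while (!inFlight.empty() &&
           vkGetFenceStatus(device, inFlight.front().fence) == VK_SUCCESS)
    {
        // The batch is done, so the old semaphore is no longer needed: destroy
        // it (or, alternatively, drain it with an empty submission and reuse it).
        vkDestroySemaphore(device, inFlight.front().semaphore, nullptr);
        vkDestroyFence(device, inFlight.front().fence, nullptr);
        inFlight.pop_front();
    }
}
```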
When You have a single buffer, a one-element list/ring is probably enough. But a more optimal solution would have some kind of ping-pong set of buffers: You read data from one buffer but store results in another, and in the next step You swap them. That's why in the image above the lists of semaphores (rings) may have more elements, depending on Your setup. The more independent buffers and semaphores in the lists (to some reasonable count, of course), the better performance You will get, as You reduce the time wasted on waiting. But this complicates Your code and may also increase lag (the rendering thread gets data that is a bit older than the data currently processed by the compute thread). So You may need to balance performance, code complexity and rendering lag.
How you do this depends on two factors:
1. Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
2. The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where the data is to read from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe, of course, and ideally non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
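A minimal sketch of such a ring (names like ComputeResult are made up; it assumes a single producer, a single consumer, and a ring large enough that an entry is never overwritten while it is still being read):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <vulkan/vulkan.h>

struct ComputeResult {
    VkSemaphore  ready;       // signaled when this compute batch finishes
    VkDeviceSize dataOffset;  // where in the buffer its results live
};

template <std::size_t N>
class ResultRing {
public:
    void push(const ComputeResult& r) {                 // compute thread only
        std::size_t w = writeIndex_.load(std::memory_order_relaxed);
        slots_[w % N] = r;
        writeIndex_.store(w + 1, std::memory_order_release);
    }
    bool latest(ComputeResult& out) const {             // graphics thread only
        std::size_t w = writeIndex_.load(std::memory_order_acquire);
        if (w == 0) return false;                        // nothing produced yet
        out = slots_[(w - 1) % N];
        return true;
    }
private:
    std::array<ComputeResult, N> slots_{};
    std::atomic<std::size_t> writeIndex_{0};
};
```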
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
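For example, if the compute results are read as vertex attributes and as a uniform buffer, the barrier could look roughly like this (graphicsCmdBuf is assumed to be the graphics command buffer being recorded):

```cpp
#include <vulkan/vulkan.h>

// Record the barrier into the graphics command buffer, before the draw calls.
void recordComputeToGraphicsBarrier(VkCommandBuffer graphicsCmdBuf)
{
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT |
                            VK_ACCESS_UNIFORM_READ_BIT;

    vkCmdPipelineBarrier(graphicsCmdBuf,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // srcStageMask
                         VK_PIPELINE_STAGE_VERTEX_INPUT_BIT |
                         VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,    // dstStageMask
                         0,                                      // dependencyFlags
                         1, &barrier,                            // memory barriers
                         0, nullptr,                             // buffer barriers
                         0, nullptr);                            // image barriers
}
```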
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect if a Vulkan semaphore has been signaled by user code. Nor can you set a semaphore to the unsignaled state by user code. Nor can you reasonably submit a batch that contains a semaphore that is currently signaled with nobody waiting on it. You can technically do that last one, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this: when the graphics thread goes to get its compute data, it must send a signal to the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.
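A sketch of that handshake (untested; the queues, the submit infos and the semaphore are assumed to be created elsewhere, and the semaphore is listed in the compute batch's pSignalSemaphores):

```cpp
#include <atomic>
#include <thread>
#include <vulkan/vulkan.h>

std::atomic<VkSemaphore> publishedSemaphore{VK_NULL_HANDLE};

// Compute thread: publish the semaphore only after the batch that signals it
// has actually been submitted.
void computeSide(VkQueue computeQueue, const VkSubmitInfo& computeSubmit,
                 VkSemaphore sem)
{
    vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);
    publishedSemaphore.store(sem, std::memory_order_release);
}

// Graphics thread: don't submit the waiting batch until the semaphore is published.
void graphicsSide(VkQueue graphicsQueue, VkSubmitInfo graphicsSubmit)
{
    VkSemaphore waitSem;
    while ((waitSem = publishedSemaphore.load(std::memory_order_acquire))
            == VK_NULL_HANDLE)
        std::this_thread::yield();           // or sleep, or do other useful work

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    graphicsSubmit.waitSemaphoreCount = 1;
    graphicsSubmit.pWaitSemaphores    = &waitSem;
    graphicsSubmit.pWaitDstStageMask  = &waitStage;
    vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, VK_NULL_HANDLE);
}
```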
I have several QQuickFramebufferObjects between which I want to share some GL objects (shaders and VBOs mainly). My initial plan was:
Create a class SharedGLData to hold the shared object
Instantiate this class on the stack in C++'s main()
Pass a pointer to the object to QML via ctx->assignRootProperty or something
Pass the object as a property to the QML items of type SharedGLData
Access the pointer from C++
But that means I'd be creating GL objects on the main thread and later access them on the render thread. I'm pretty sure that's forbidden, for example see here where it says:
QOpenGLContext can be moved to a different thread with moveToThread(). Do not call makeCurrent() from a different thread than the one to which the QOpenGLContext object belongs.
Is it ok to follow my initial plan or is there a way to create shared GL objects on the render thread?
A possible hack would be to put the shared stuff into a singleton that gets initialized on first use, and make my first use be directly from the rendering code. But that's a hack.
Another idea is to call moveToThread on the QQFBO's GL context to move it to the main thread, instantiate SharedGLData, then move the GL context back to the render thread. But I don't have a pointer to the render thread...
Clarification after the answer I got: By "render thread" I mean the thread that Qt SceneGraph silently creates to do all the rendering. It's not a thread that I'm creating!
If you want multiple threads sharing OpenGL objects:
Create a QOpenGLContext. Usually, the first one you create should be the one belonging to the window (which will actually draw to the screen).
Create a second QOpenGLContext, but call setShareContext before calling create.
You now have two OpenGL contexts which share objects (shaders, VBOs, etc) and you can now use these contexts simultaneously from different threads.
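With Qt 5 this could look roughly like the sketch below (an assumption-laden example: window is the QQuickWindow, workerThread is your own QThread, and QQuickWindow::openglContext() is only valid once the scene graph has initialized, e.g. inside a directly-connected sceneGraphInitialized handler, which runs on the render thread):

```cpp
#include <QOpenGLContext>
#include <QQuickWindow>
#include <QThread>

QOpenGLContext* makeSharedContext(QQuickWindow* window, QThread* workerThread)
{
    QOpenGLContext* shareContext = new QOpenGLContext;
    shareContext->setFormat(window->requestedFormat());
    shareContext->setShareContext(window->openglContext()); // before create()
    shareContext->create();
    shareContext->moveToThread(workerThread); // the worker calls makeCurrent() itself
    return shareContext;
}
```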
But... this is often not an ideal experience. In many cases, using OpenGL simultaneously from different threads will be no faster than using it from one thread, or it will be slower, or it will be buggier. You are at the mercy of the OpenGL implementation.
A Different Way
It sounds like your goal is to load assets (shaders, textures, vertex data) in a background thread while your main thread continues to render. There is a more straightforward way of doing this that does not involve creating multiple contexts at all.
Simply map OpenGL buffers into memory in the rendering thread, and then pass the pointer to the background loader thread. The loader thread is free to write data into the buffer while the render thread continues to make OpenGL calls. When the loader thread is done, it signals the main thread, which does the appropriate synchronization and calls glTexImage2D or whatever. These days, you can even keep a single buffer persistently mapped, but the traditional two-buffer method also works quite well.
Under this scheme, your rendering thread does not have to do any IO, and your background thread does not have to make any OpenGL calls at all.
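For a texture upload it could look roughly like this (an untested sketch; pbo, tex, imageSize, width and height are placeholders, and the CPU-side "loader is done" signal is whatever mechanism you already use):

```cpp
// Render thread: create and map the pixel-unpack buffer, hand the pointer out.
void* mapUploadBuffer(GLuint pbo, GLsizeiptr imageSize)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, imageSize, nullptr, GL_STREAM_DRAW);
    // Hand the returned pointer to the loader thread; it fills the memory
    // with plain writes and never touches GL.
    return glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, imageSize,
                            GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
}

// Render thread again, after the loader signals that it is done:
void finishUpload(GLuint pbo, GLuint tex, GLsizei width, GLsizei height)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr); // sources from the bound PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```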
You can't use this to compile shaders in the background but c'est la vie.
OpenGL functions are only supposed to be called from the thread in which the OpenGL context is current. Does this limit apply to updating data using glMapBuffer/glMapBufferRange, i.e. can I map (a region of) a buffer and then read from / write to that region in another thread? Assuming, of course, that the mapping (and unmapping) functions are called from the rendering thread.
Before answering the main question, let's cover some misinformation:
I know that you're supposed to only call OpenGL functions in the thread that created the OpenGL context.
This is not true. You must call OpenGL functions only on the thread where the context is current. You can make an OpenGL context current in a different thread (which automatically makes it not current in the previous one; a context can only be current in one thread at a time). And you can create multiple contexts which share objects. Each such context can be current in a different thread.
Now to the issue. Yes, you are perfectly free to use the mapped pointer however you wish from another thread. Though, as you said, you must use appropriate synchronization mechanisms to let the original thread know that you've finished.
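A minimal sketch of that pattern (vbo and meshData are placeholders; the worker only touches plain memory, and join() stands in for whatever CPU-side synchronization you prefer):

```cpp
#include <cstring>
#include <thread>
#include <vector>

// Rendering thread: map the buffer, let a worker fill it, then unmap.
void updateBuffer(GLuint vbo, const std::vector<float>& meshData)
{
    const GLsizeiptr bytes = meshData.size() * sizeof(float);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes, GL_MAP_WRITE_BIT);

    std::thread worker([ptr, &meshData, bytes] {
        std::memcpy(ptr, meshData.data(), bytes);   // plain memory writes only
    });
    worker.join();       // any CPU-side synchronization works; join is the simplest

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glUnmapBuffer(GL_ARRAY_BUFFER);                 // back on the GL thread
}
```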
I have vertex buffers holding meshes of terrain chunks. Whenever the player edits terrain, the mesh of the corresponding chunk must be regenerated and uploaded to the vertex buffer. Since regenerating the mesh takes some time, I do it in an asynchronous worker thread.
The issue is that the main thread draws the buffer at the same moment the worker thread uploads new data. That means that after the player edits the terrain, a corrupted chunk gets rendered for one frame. It flares up once, and after that the correct buffer gets drawn.
This kind of made sense to me; we shouldn't write and read the same data at the same time, of course. So instead of updating the old buffer, I created a new one, filled it and swapped them. The swap was just changing the buffer id stored within the terrain chunk struct, so that should be atomic. However, that didn't help.
Since OpenGL commands are sent to a queue on the GPU, they don't have to be finished when the application on the CPU continues. So I may have swapped the buffers before the new one was actually ready.
I also tried an alternative to swapping the buffers: a mutex for buffer access. The main thread locks the mutex while drawing and the worker thread locks it while uploading new buffer data. However, this didn't help either, and it may be because of OpenGL's asynchronous nature, too. The main thread doesn't actually draw, but just sends draw commands to the GPU. On the other hand, when there really is only one command queue, uploading buffers and drawing them could never occur at the same time, could it?
How can I synchronize the vertex buffer access from my two threads to prevent that an undefined buffer gets drawn for one frame?
You must make sure that the buffer update is actually completed before you use that buffer in your draw thread. The easiest solution would be to call glFinish in your update thread after you have issued all the update GL commands, and only notify the draw thread after that has returned.
To have more fine-grained control over the synchronization, I would advise you to have a look at fence sync objects (as described in the GL_ARB_sync extension). You can issue a fence sync after your update commands and store the sync object handle together with your buffer handle, so that the draw thread can check whether the update actually completed (or wait for it). Note that sync objects are kind of special since they are the only objects not tied to a GL context, so they can be used in multi-context setups.
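A rough sketch of the fence variant, assuming two shared contexts and made-up chunk fields (access to the Chunk struct itself still needs its own CPU-side synchronization):

```cpp
#include <utility>

struct Chunk {                   // made-up fields for illustration
    GLuint vbo    = 0;           // buffer currently being drawn
    GLuint newVbo = 0;           // buffer the worker just filled
    GLsync uploadFence = nullptr;
};

// Worker thread (its own shared context current): upload, then fence.
void uploadOnWorker(Chunk& chunk, GLsizeiptr bytes, const void* meshData)
{
    glBindBuffer(GL_ARRAY_BUFFER, chunk.newVbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, meshData, GL_STATIC_DRAW);
    chunk.uploadFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();                                   // push the fence to the GPU
}

// Draw thread, each frame: swap only once the upload has really completed.
void drawChunk(Chunk& chunk)
{
    if (chunk.uploadFence) {
        GLenum status = glClientWaitSync(chunk.uploadFence, 0, 0); // poll, no wait
        if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
            glDeleteSync(chunk.uploadFence);
            chunk.uploadFence = nullptr;
            std::swap(chunk.vbo, chunk.newVbo);
        }
    }
    glBindBuffer(GL_ARRAY_BUFFER, chunk.vbo);    // always a complete mesh
    // ... glDrawArrays / glDrawElements ...
}
```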
Is there anything wrong with creating a window in a separate thread, which will also contain the message loop, then creating an OpenGL Context in another thread?
You should be able to get it to work if you're careful. See the parallel OpenGL FAQ:
Q: Why does my OpenGL application crash/not work when I am rendering from another thread?
A: The OpenGL context is thread-specific. You have to make it current in the thread using glXMakeCurrent, wglMakeCurrent or aglSetCurrentContext, depending on your operating system.
What you want to do is perfectly possible. Even better, OpenGL contexts can migrate between threads and even be used with multiple windows, as long as their pixel formats are compatible. The one constraint is that an OpenGL context can be bound in only one thread at a time, and that only an unbound context can be bound.
So you could even create the window and the context in one thread, then unbind the context, create another thread and re-bind the context to the window in the secondary thread. No problem there.
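On Windows, for example, the handover could look roughly like this (hdc and hglrc are assumed to have been created in the first thread; error checking omitted):

```cpp
// Thread A (created the window and the context):
wglMakeCurrent(nullptr, nullptr);   // unbind, so another thread may bind it

// Thread B (the rendering thread):
wglMakeCurrent(hdc, hglrc);         // bind the very same context here
// ... issue GL calls and SwapBuffers(hdc) from this thread from now on ...
```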
The only thing you must be aware of is that OpenGL itself doesn't like to be multithreaded. The API itself is more or less thread safe, since only one context can be bound to a thread at a time. But all the bookkeeping required when OpenGL operations span several threads may trigger nasty driver bugs, and it also carries a certain performance hit.