Vulkan Compute dispatch from a child CPU thread - multithreading

Can Vulkan Compute dispatch from a child CPU thread, or does it have to dispatch from the main thread? I don't think it is possible to dispatch compute shaders from child threads in Unity, and I wanted to find out if it could be done in Unreal Engine.

It depends on what you mean by "dispatch" and "main thread".
vkCmdDispatch, as the "Cmd" prefix suggests, puts a command in a command buffer. This can be called on any thread, so long as the VkCommandBuffer object will not have other vkCmd functions called on it at the same time (typically, you reserve specific command buffers for a single thread). So by one definition, you can "dispatch" compute operations from other threads.
Of course, recording commands in a command buffer doesn't actually do anything. Commands only get executed when you queue up those CBs via vkQueueSubmit. Like vkCmdDispatch, it doesn't matter what thread you call that function on. However, like vkCmdDispatch, it does matter that multiple threads be prevented from accessing the same VkQueue object at the same time.
Now, you don't have to use a single thread for that VkQueue; you can lock the VkQueue behind some kind of mutex, so that only one thread can own it at a time. And thus, a thread that creates a CB could submit its own work.
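A minimal sketch of that idea, assuming a single shared queue (the function and mutex names here are illustrative, not from any real engine):

#include <vulkan/vulkan.h>
#include <mutex>

std::mutex gQueueMutex; // guards the one VkQueue shared between threads

VkResult submitFromAnyThread(VkQueue queue, uint32_t count,
                             const VkSubmitInfo* submits, VkFence fence) {
    std::lock_guard<std::mutex> lock(gQueueMutex); // only one thread touches the queue at a time
    return vkQueueSubmit(queue, count, submits, fence);
}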
However, ignoring the fact that tasks often need to be inserted into the queue in an order (one task might generate some compute data that a graphics task needs to wait on, so the graphics task CB must be after the compute CB), there's a bigger problem. vkQueueSubmit takes a long time. If you look at the function, it can take an arbitrarily large number of CBs to insert, and it has the ability to have multiple batches, with each batch guarded by semaphores and fences for synchronization. As such, you are strongly encouraged to make as few vkQueueSubmit calls as possible, since each call has a quantity of overhead to it that has nothing to do with how many CBs you are queuing up.
There's even a warning about this in the spec itself.
So the typical way applications are structured is that you farm out tasks to the available CPU threads, and these tasks build command buffers. One particular thread will be anointed as the owner of the queue. That thread may perform some CB building, but once it is done, it will wait for the other tasks to complete and gather up all of the CBs from the other threads. Once gathered, that thread will vkQueueSubmit them in appropriate batches.
You could call that thread the "main thread", but Vulkan itself doesn't really care which thread "owns" the queue. It certainly doesn't have to be your process's initial thread.
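To make the record-then-gather pattern concrete, here is a hedged sketch. perThreadCmdBuf (one pre-allocated command buffer per worker, each from that worker's own command pool) and groupsX are assumptions for illustration:

#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

void recordAndSubmit(VkQueue computeQueue,
                     std::vector<VkCommandBuffer>& perThreadCmdBuf,
                     uint32_t groupsX) {
    std::vector<std::thread> workers;
    for (VkCommandBuffer cb : perThreadCmdBuf) {
        workers.emplace_back([cb, groupsX] {
            VkCommandBufferBeginInfo begin = {};
            begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
            vkBeginCommandBuffer(cb, &begin);  // record on this worker thread
            vkCmdDispatch(cb, groupsX, 1, 1);  // "dispatch" here = record, not execute
            vkEndCommandBuffer(cb);
        });
    }
    for (std::thread& w : workers) w.join();   // queue-owning thread waits for recording

    VkSubmitInfo submit = {};
    submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = static_cast<uint32_t>(perThreadCmdBuf.size());
    submit.pCommandBuffers = perThreadCmdBuf.data();
    vkQueueSubmit(computeQueue, 1, &submit, VK_NULL_HANDLE); // one submit for all CBs
}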

Related

How is ReactiveX different from multithreaded programming

Reading about ReactiveX (like here), it states something like:
An advantage of this approach is that when you have a bunch of tasks that are not dependent on each other, you can start them all at the same time rather than waiting for each one to finish before starting the next one — that way, your entire bundle of tasks only takes as long to complete as the longest task in the bundle.
Aren't we all doing this already using multithreaded programming? So how are the two things actually different?
This is a broader topic about light-weight async tasks in general vs threads.
A big difference is in cost and speed. Threads are expensive, and OSs generally limit the number that you can create. Every thread has an entire stack set aside in case it's needed (you get a stack overflow if it isn't enough). If you have more threads than processors, then switching tasks means saving off all the current thread info and loading the new thread into registers, etc.
ReactiveX libraries work with callbacks, so the only memory needed is the object with the callback data. Switching ReactiveX tasks is just a method call.
You can have many millions of ReactiveX tasks in progress at once, not so much with threads.
Most slow tasks (like file or network IO) actually do a lot of waiting. Why allocate an entire thread just to do nothing but wait?
With ReactiveX, the tasks are just simple objects that are waiting, just sitting in a queue.
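To make that concrete, here is a minimal C++ sketch (not any particular Rx library; the queue and lambdas are stand-ins) of why callback tasks are so much cheaper than threads: a million pending "tasks" are just a million small objects in a queue, and switching between them is just a function call.

#include <cstdio>
#include <functional>
#include <queue>

int main() {
    std::queue<std::function<void()>> pending;

    // Queue up a million "tasks": each is only a callback plus captured data,
    // not a thread with its own stack.
    for (int i = 0; i < 1'000'000; ++i)
        pending.push([i] { /* e.g. handle completed IO for request i */ });

    // One ordinary thread drains them; "switching tasks" is just a method call.
    while (!pending.empty()) {
        pending.front()();
        pending.pop();
    }
    std::printf("done\n");
    return 0;
}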
Now, ReactiveX is built on top of threading. Those millions of tasks (just callback objects in memory), when actually running, are running on some thread. And in ReactiveX, tasks aren't all really "running" at the same time (only a thread running on a core can actually do something). Most tasks do a lot of waiting, so really those millions of ReactiveX tasks are just all "waiting" at the same time by hanging out in a queue.
Also, consider a scenario like Javascript, which is a single threaded environment. Multi-threading just isn't an option there. Even if you can create threads, avoiding concurrency or simplifying UI code that needs thread affinity can be nice when many tasks are all managed on a single thread.
Even in multiple-thread scenarios, ReactiveX can be really helpful, since its API guarantees synchronous event callbacks for a particular stream, even if that stream is using many threads to generate the data.

Vulkan Queue Synchronization in Multithreading

In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically vkCmdDispatchs to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws/vkQueuePresentKHRs.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to employ the usage of a host mutex from vkQueueSubmit to vkWaitForFences in both threads.
I want to know, is there perhaps some other method that is more efficient or is this considered to be OK?
Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more efficient than waiting in the app and only submitting work after the previous work has been fully processed.
When You submit work, You can provide a semaphore which gets signaled when that work is finished. When You submit another batch of work, You can provide the same semaphore for the second batch to wait on. Processing of the second batch will start automatically when the semaphore gets signaled (the semaphore is also automatically unsignaled and can be reused).
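For example, a sketch of a compute batch that signals a semaphore and a render batch that waits on it (the command buffers, queues and semaphore are assumed to have been created already; the GPU itself waits, so neither app thread blocks here):

#include <vulkan/vulkan.h>

void submitComputeThenRender(VkQueue computeQueue, VkQueue graphicsQueue,
                             VkCommandBuffer computeCmdBuf,
                             VkCommandBuffer renderCmdBuf, VkSemaphore sem) {
    VkSubmitInfo computeSubmit = {};
    computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    computeSubmit.commandBufferCount = 1;
    computeSubmit.pCommandBuffers = &computeCmdBuf;
    computeSubmit.signalSemaphoreCount = 1;
    computeSubmit.pSignalSemaphores = &sem;      // signaled when compute finishes
    vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    VkSubmitInfo renderSubmit = {};
    renderSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    renderSubmit.waitSemaphoreCount = 1;
    renderSubmit.pWaitSemaphores = &sem;         // rendering waits on the GPU side
    renderSubmit.pWaitDstStageMask = &waitStage; // stages that must wait for the signal
    renderSubmit.commandBufferCount = 1;
    renderSubmit.pCommandBuffers = &renderCmdBuf;
    vkQueueSubmit(graphicsQueue, 1, &renderSubmit, VK_NULL_HANDLE);
}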
(I think there are some constraints on using semaphores, associated with queues. I will update the answer later when I confirm this, but they should be sufficient for Your purposes.
[EDIT] There are constraints on using semaphores, but they shouldn't affect You - when You use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If You really need to synchronize the GPU and Your application, use fences. They are signaled in a similar way as semaphores, but You can check their state on the app side, and You need to manually unsignal them before You can use them again.
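A small sketch of fence usage (device, queue, submit info and fence are assumed to have been created earlier): submit one batch, block the calling thread until the GPU has finished it, then unsignal the fence for reuse.

#include <vulkan/vulkan.h>
#include <cstdint>

void submitAndWait(VkDevice device, VkQueue queue,
                   const VkSubmitInfo& submitInfo, VkFence fence) {
    vkQueueSubmit(queue, 1, &submitInfo, fence);             // fence signaled on completion
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX); // app-side wait
    vkResetFences(device, 1, &fence);                        // manual unsignal before reuse
}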
[EDIT]
I've added an image that more or less shows what I think You should do. One thread calculates state and with each submission adds a semaphore to the top of the list (or a ring buffer, as @NicolBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during the "compute" batch submission).
The second thread renders Your scene. It manages its own list of semaphores similarly to the compute thread. But when You want to render things, You need to be sure the compute thread has finished its calculations. That's why You need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during the "render" batch submission). When You submit rendering commands, the compute thread can't start modifying the data, because it may influence the results of the rendering. So the compute thread also needs to wait until the most recent rendering is done. That's why the compute thread also needs to provide a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize the submissions. The rendering thread cannot submit while the compute thread is submitting commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from the lists) should be synchronized. But this has nothing to do with Vulkan; probably some mutex will be helpful (for example a C++-ish std::lock_guard<std::mutex>). But this synchronization is a problem only when You have a single buffer.
Another thing is what to do with old semaphores from both lists. You cannot directly check their state and You cannot directly unsignal them. Their state can be checked by using additional fences provided with each submission. You don't wait on them, but from time to time check whether a given fence is signaled and, if it is, You can destroy the old semaphore (as You cannot unsignal it from the application), or You can make an empty submission, with no command buffers, that uses that semaphore as a wait semaphore. This way the semaphore will be unsignaled and You can reuse it. But I don't know which solution is more efficient: destroying old semaphores and creating new ones, or unsignaling them with empty submissions.
When You have a single buffer, a one-element list/ring is probably enough. But a more efficient solution would use some kind of ping-pong set of buffers: You read data from one buffer but store results in another, and in the next step You swap them. That's why in the image above the lists of semaphores (rings) may have more elements, depending on Your setup. The more independent buffers and semaphores in the lists (up to some reasonable count), the better performance You will get, as You reduce the time wasted on waiting. But this complicates Your code and may also increase lag (the rendering thread gets data that is a bit older than the data currently processed by the compute thread). So You may need to balance performance, code complexity and rendering lag.
How you do this depends on two factors:
Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where the data is to read from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe of course. And non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
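A sketch of that barrier for the uniform-buffer case (graphicsCmdBuf is assumed to be the graphics command buffer being recorded):

#include <vulkan/vulkan.h>

void insertComputeToGraphicsBarrier(VkCommandBuffer graphicsCmdBuf) {
    VkMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // what compute wrote
    barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT; // how graphics reads it

    vkCmdPipelineBarrier(graphicsCmdBuf,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,           // srcStageMask
        VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,          // dstStageMask
        0,                                              // dependencyFlags
        1, &barrier,                                    // the one memory barrier
        0, nullptr, 0, nullptr);                        // no buffer/image barriers
}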
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect from user code whether a Vulkan semaphore has been signaled. Nor can you set a semaphore to the unsignaled state from user code. Nor can you reasonably submit a batch that signals a semaphore which is currently signaled and that nobody is waiting on. You can do the latter, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this: when the graphics thread goes to get its compute data, it must send a signal to the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.

What is the difference between user level threads and coroutines?

User-level threading involves N user level threads that run on a single kernel thread. What are the details of the user-level threading and how does this differ from coroutines?
Wikipedia has a quite in-depth summary on the subject: Thread (computing).
With green threads, there's a VM executing instructions that typically decides whether to switch threads in between two instructions.
With coroutines the two functions yield to each other at specified points, possibly passing values along, and typically requiring special language support. E.g. a producer yielding to a consumer, passing along an item.
The idea behind user-level threads is to have multiple different logical threads running in the same program, but to have the user program handle the mapping from logical threads to kernel threads (which actually get scheduled) rather than having the OS handle the entire mapping. This can improve performance by letting the user program handle scheduling. Conceptually, user threads are one implementation of preemptive multitasking, where multiple jobs are run to completion in parallel by having the threads periodically stopped while other threads run.
Coroutines, on the other hand, are a generalization of standard function call and return ("subroutines") where functions pass control back and forth to one another, communicating values as they switch between routines. The switching back and forth between coroutines is under the control of the coroutines themselves; control only passes from one coroutine to another if one of the coroutines explicitly yields a value to another. This is an example of cooperative multitasking, where multiple jobs are completed in parallel by having the individual steps in the task manually coordinate who gets to run and when.
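For example, here is a minimal C++20 coroutine sketch of a producer yielding values to a consumer. The Generator scaffolding is illustrative boilerplate written for this answer, not a standard library type; compile with -std=c++20.

#include <coroutine>
#include <cstdio>

struct Generator {
    struct promise_type {
        int current = 0;
        Generator get_return_object() {
            return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(int v) { current = v; return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };

    std::coroutine_handle<promise_type> handle;
    ~Generator() { if (handle) handle.destroy(); }

    bool next() { handle.resume(); return !handle.done(); }
    int value() const { return handle.promise().current; }
};

// The producer passes control (and a value) back to its caller at each co_yield.
Generator producer() {
    for (int i = 0; i < 3; ++i)
        co_yield i * 10;
}

int main() {
    Generator gen = producer();    // nothing runs yet (initial_suspend)
    while (gen.next())             // consumer explicitly resumes the producer
        std::printf("consumed %d\n", gen.value());
}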
Hope this helps!

If I "get back to the main thread" then what exactly happens, and how do interrupts work with threads?

Background: I was using Beej's guide and he mentioned forking and ensuring you "get the zombies". An Operating Systems book I grabbed explained how the OS creates "threads" (I always thought it was a more fundamental piece), and by quoting it, I mean the OS decides nearly everything. Basically they share all external resources, but they split the register and stack spaces (and I think a third thing).
So I get to the waitpid function which http://www.qnx.com's developer docs explain very well. In fact, I read the entire section on threads, minus all the types of conditions after a Processes and Threads google.
The fact that I can split code up and put it back together doesn't confuse me. HOW I can do this is confusing.
In C and C++, your program is a main() function which goes forward, calls other functions, maybe loops forever (waiting for input or rendering), and then eventually quits or returns. In this model I see NO reason for it to stop beyond an "I'm waiting for something", in which case it just loops.
Well, it seems it can loop by setting certain things, like "I'm waiting for a semaphore" or "a response" or "an interrupt". Or maybe it gets interrupted without waiting for one. This is what confuses me.
The processor time-slices processes and threads. That's all fine and dandy, but how does it decide when to stop one? I understand that you get to the polling function and say "Hey, I'm waiting for input, a clock tick, or the user to do something". Somehow it tells this to the OS? I'm not sure. But more so:
It seems to be able to completely randomly interrupt or interject, even on a single-threaded application. So you're running one thread and suddenly waitpid() says "Hey, I finished a process, let me interrupt this, we both hate zombies, I gotta do this." and you're still looping on some calculation. So, what just happens??? I have no idea, somehow they both run and your computation isn't messed with, 'cause it's single threaded, but that somehow doesn't mean that it won't stop what it's doing to run waitpid() inside the same thread WHILE you're still doing your other app things.
Also confusing is how you can be notified, like iOS's notifications, and say "Hey, I got some UI changes, get me off of 16 and put me back on 1 so I can change this thing". But same question as the last paragraph: how does it interrupt a thread that's running?
I think I understand the splitting, but this joining is utterly confusing. It's like the textbooks have this "rabbit from hat" step I'm supposed to accept. Other SO posts told me they don't share the same stack, but that didn't help, now I'm imagining a slinky (stack) leaning over to another slinky, but unsure how it recombines to change the data.
Thanks for any help, I apologize that this is long, but I know someone's going to misinterpret this and give me the "they are different stacks" answer if I'm too concise here.
Thanks,
OK, I'll have a go, though it's gonna be 'economical with the truth':)
It's sorta like this:
The OS kernel scheduler/dispatcher is a state-machine for managing threads. A thread comprises a stack, (allocated at the time of thread creation), and a Thread Control Block, (TCB), struct in the kernel that holds thread state and can store thread context, (including user registers, especially the stack-pointer). A thread must have code to run, but the code is not dedicated to the thread - many threads can run the same code. Threads have states, eg. blocked on I/O, blocked on an inter-thread signal, sleeping for a timer period, ready, running on a core.
Threads belong to processes - a process must have at least one thread to run its code and has one created for it by the OS loader when the process starts up. The 'main thread' may then create others that will also belong to that process.
The state-machine inputs are software interrupts - system calls from those threads that are already running on cores, and hardware interrupts from peripheral devices/controllers (disk, network, mouse, KB etc) that use processor hardware features to stop the processor/s running instructions from the threads and 'immediately' run driver code instead.
The output of the state-machine is a set of threads running on cores. If there are fewer ready threads than cores, the OS will halt the unusable cores. If there are more ready threads than cores, (ie. the machine is overloaded), the 'scheduling algorithm' that decides which threads to run takes into account several factors - thread and process priority, priority boosts for threads that have just become ready on I/O completion or inter-thread signal, foreground-process boosts and others.
The OS has the ability to stop any running thread on any core. It has an interprocessor hardware-interrupt channel and drivers that can force any thread to enter the OS and be blocked/stopped, (maybe because another thread has just become ready and the OS scheduling algorithm has decided that a running thread must be immediately preempted).
The software interrupts from running threads can change the set of running threads by requesting I/O, or by signaling other threads, (the events, mutexes, condition-variables and semaphores). The hardware interrupts from peripheral devices can change the set of running threads by signaling I/O completion.
When the OS gets these inputs, it uses that input, and internal state in containers of Thread Control Block and Process Control Block structs, to decide which set of ready threads to run next. It can block a thread from running by saving its context, (including registers, especially stack pointer), in its TCB and not returning from the interrupt. It can run a thread that was blocked by restoring its context from its TCB to a core and performing an interrupt-return, so allowing the thread to resume from where it left off.
The gain is that no thread that is waiting for I/O gets to run at all and so does not use any CPU and, when I/O becomes available, a waiting thread is made ready 'immediately' and, if there is a core available, running.
This combination of OS state data, and hardware/software interrupts, efficiently matches up threads that can make forward progress with cores available to run them, and no CPU is wasted on polling I/O or inter-thread comms flags.
All this complexity, both in the OS and for the developer who has to design multithreaded apps and so put up with locks, synchronization, mutexes etc, has just one vital goal - high performance I/O. Without it, you can forget video streaming, BitTorrent and browsers - they would all be too piss-slow to be useable.
Statements and phrases like 'CPU quantum', 'give up the remainder of their time-slice' and 'round-robin' make me want to throw up.
It's a state-machine. Hardware and software interrupts go in, a set of running threads comes out. The hardware timer interrupt, (the one that can time-out system calls, allow threads to sleep and share out CPU on a box that is overloaded), though valuable, is just one of many.
So I'm on thread 16, and I need to get to thread 1 to modify UI. I randomly stop it anywhere, "move the stack over to thread 1" then "take its context and modify it"?
No, time for 'economical with truth' #2...
Thread 1 is running the GUI. To do this, it needs inputs from the mouse and keyboard. The classic way for this to happen is that thread 1 waits, blocked, on a GUI input queue - a thread-safe producer-consumer queue - for KB/mouse messages. It's using no CPU - the cores are off running services and BitTorrent downloads. You hit a key on the keyboard, and the keyboard-controller hardware raises an interrupt line on the interrupt controller, causing a core to jump to the keyboard driver code as soon as it has finished its current instruction. The driver reads the KB controller, assembles a KeyPressed message and pushes it onto the input queue of the GUI thread with focus - your thread 1. The driver exits by calling the scheduler interrupt entry point so that a scheduling run can be performed, and your GUI thread is assigned a core and runs on it. To thread 1, all it has done is make a blocking 'pop' call on a queue and, eventually, it returns with a message to process.
So, thread 1 is performing:
void* HandleGui(void*) {
    while (true) {
        GUImessage message = thread1InputQueue.pop(); // blocks until a message arrives
        switch (message.type) {
            // .. lots of case statements to handle all the possible GUI messages
        }
    }
}
If thread 16 wants to interact with the GUI, it cannot do it directly. All it can do is to queue a message to thread 1, in a similar way to the KB/mouse drivers, to instruct it to do stuff.
This may seem a bit restrictive, but the message from thread 16 can contain more than POD. It could have a 'RunMyCode' message type and contain a function pointer to code that thread 16 wants to be run in the context of thread 1. When thread 1 gets around to handling the message, its 'RunMyCode' case statement calls the function pointer in the message. Note that this 'simple' mechanism is asynchronous - thread 16 has issued the message and runs on - it has no idea when thread 1 will get around to running the function it passed. This can be a problem if the function accesses any data in thread 16 - thread 16 may also be accessing it. If this is an issue, (and it may not be - all the data required by the function may be in the message, which can be passed into the function as a parameter when thread 1 calls it), it is possible to make the function call synchronous by making thread 16 wait until thread 1 has run the function. One way would be for the function to signal an OS synchronization object as its last line - an object upon which thread 16 will wait immediately after queueing its 'RunMyCode' message:
void* runOnGUI(GUImessage message) {
    // do stuff with GUI controls
    message.notifyCompletion->signal(); // tell thread 16 to run again
}

void* thread16run() {
    // ..
    GUImessage message;
    OSkernelWaitObject* waitEvent = new OSkernelWaitObject; // kernel event to wait on
    message.type = RunMyCode;
    message.function = runOnGUI;
    message.notifyCompletion = waitEvent;
    thread1InputQueue.push(message); // ask thread 1 to run my function
    waitEvent->wait(); // wait, blocked, until the function is done
    // ..
}
So, getting a function to run in the context of another thread requires cooperation. Threads cannot call other threads - only signal them, usually via the OS. Any thread that is expected to run such 'externally signaled' code must have an accessible entry point where the function can be placed, and must execute code to retrieve the function address and call it.
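For illustration, here is a minimal sketch of a blocking producer-consumer queue like the thread1InputQueue used above (the name and the pop-blocks-until-a-message-arrives semantics are assumptions based on the pseudocode):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BlockingQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(item)); }
        cv.notify_one(); // wake one blocked consumer, e.g. the GUI thread
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); }); // sleeps; uses no CPU while empty
        T item = std::move(q.front());
        q.pop();
        return item;
    }
};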

Multithreading thread control

How do I control the number of threads that my program is working on?
I have a program that is now ready for multithreading, but one problem is that the program is extremely memory intensive and I have to limit the number of threads running so that I don't run out of RAM. The main program goes through and creates a whole bunch of handles and associated threads in a suspended state.
I want the program to activate a set number of threads and, when one thread finishes, automatically unsuspend the next thread in line until all the work has been completed. How do I do this?
Someone has once mentioned something about using a thread handler, but I can't seem to find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using Windows and Visual C++.
Note: I don't need to worry about the traditional problems of shared access between the threads; each one is completely independent of the others. It's more like batch processing than true multithreading of a program.
Thanks,
-Faken
Don't create threads explicitly. Create a thread pool (see Thread Pools) and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (number of cores and the hyperthreading ratio) and the ratio of CPU to IO in your work items. By controlling the size of the thread pool you control the maximum number of concurrent threads.
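A minimal sketch of queueing work to the default Windows thread pool (WorkContext and ProcessItem are made-up names for illustration; the crude Sleep at the end just keeps the demo alive):

#include <windows.h>
#include <cstdio>

struct WorkContext {
    int itemId;
};

DWORD WINAPI ProcessItem(LPVOID param) {
    WorkContext* ctx = static_cast<WorkContext*>(param);
    std::printf("processing item %d on pool thread %lu\n",
                ctx->itemId, GetCurrentThreadId());
    delete ctx;
    return 0;
}

int main() {
    for (int i = 0; i < 8; ++i) {
        // The OS picks an idle pool thread; no thread is created per work item.
        QueueUserWorkItem(ProcessItem, new WorkContext{i}, WT_EXECUTEDEFAULT);
    }
    Sleep(1000); // demo only; real code would signal completion with an event
    return 0;
}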
A suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.
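As a sketch of that private-pool approach (all names are illustrative, not any particular API): a fixed set of worker threads pull std::function work items from a FIFO queue and block while it is empty.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::thread> workers;
    bool stopping = false;
public:
    explicit WorkerPool(unsigned n) { // n = max simultaneous tasks
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::unique_lock<std::mutex> lock(m);
                    cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                    if (stopping && tasks.empty()) return;
                    auto task = std::move(tasks.front());
                    tasks.pop();
                    lock.unlock();
                    task(); // run outside the lock so other workers can proceed
                }
            });
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
    ~WorkerPool() { // drain remaining work, then join the workers
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};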
There is so much to say here.
There are a few ways to do this.
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. This raises the question of finding out when a thread completes. You can have a callback be called just before a thread terminates, where a parameter in that callback is the thread handle that just finished. Use Boost.Bind and Boost.Signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way all you have to do is add to the "tasks to do" list and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.