I am making a simulation program that uses compute shaders, and I ran into a problem. I am currently using an OpenGL context to render the GUI that controls and displays the simulation, and I use the same context to call glDispatchCompute.
That can cause the program window to freeze, because the simulation may run at any UPS (anywhere from 0.1 to 10000 updates per second), while the window should update at a fixed FPS (the display refresh rate, commonly 60 FPS).
This becomes a problem when the simulation is slow and a single step takes, for example, 600 ms to compute: the swap-buffers call waits for all compute shaders to finish, and so the FPS drops.
How can I make updates and renders independent of each other? On the CPU I could just spawn a second thread, but an OpenGL context is not multithreaded. Should I use Vulkan for this task?
Even with Vulkan, there is no way to just shove a giant blob of work at the GPU and guarantee that later graphics work will interrupt the GPU's processing. The most reliable way to handle this is to break your compute work up into chunks of a size that you're reasonably sure will not break your framerate, and interleave them with your rendering commands.
Vulkan offers mechanisms that allow GPUs to execute interruptible work, but it does not require any particular preemption functionality. You can create a compute queue with the lowest possible priority and a graphics queue with the highest priority (see the sketch after this list). But even that assumes:
1. The Vulkan implementation offers multiple queues at all. Many embedded implementations do not.
2. Queue priority implementations actually preempt work in progress when higher-priority work is submitted. This may happen, but the specification offers no guarantees. Here is some documentation about the preemption behavior of GPUs, showing that some of them can handle this. It's a few years old, so more recent GPUs may be even better, but it should get you started.
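For illustration, this is roughly what requesting such queues looks like at device-creation time. This is only a sketch: graphicsFamily and computeFamily are placeholder indices you would have discovered via vkGetPhysicalDeviceQueueFamilyProperties, and the priorities are nothing more than a hint to the implementation.

#include <vulkan/vulkan.h>

// Sketch: assumes graphicsFamily and computeFamily are distinct queue families
// found earlier via vkGetPhysicalDeviceQueueFamilyProperties.
VkDevice createDeviceWithPrioritizedQueues(VkPhysicalDevice physicalDevice,
                                           uint32_t graphicsFamily,
                                           uint32_t computeFamily)
{
    float graphicsPriority = 1.0f;   // highest priority for rendering
    float computePriority  = 0.0f;   // lowest priority for the simulation

    VkDeviceQueueCreateInfo queueInfos[2] = {};
    queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = graphicsFamily;
    queueInfos[0].queueCount       = 1;
    queueInfos[0].pQueuePriorities = &graphicsPriority;

    queueInfos[1].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[1].queueFamilyIndex = computeFamily;
    queueInfos[1].queueCount       = 1;
    queueInfos[1].pQueuePriorities = &computePriority;

    VkDeviceCreateInfo deviceInfo   = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos    = queueInfos;
    // enabled extensions and features omitted for brevity

    VkDevice device = VK_NULL_HANDLE;
    // Whether the low-priority queue is actually preempted is up to the driver;
    // the specification treats priority only as a hint.
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
    return device;
}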
Overall, Vulkan can help, but you'll need to profile it and you'll need to have a fallback if you care about implementations that don't have enough queues to do anything.
OpenGL is of course even less useful for this, as it has no explicit queueing system at all. So breaking the work up is really the only way to ensure that the compute task doesn't starve the renderer.
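A rough sketch of what that chunking looks like with an OpenGL compute shader, assuming a GLFW window loop and a function loader (e.g. glad) are already set up; computeProgram, renderGui(), the uFirstGroup uniform and the group counts are placeholders you would adapt to your simulation (the shader is assumed to add uFirstGroup to gl_WorkGroupID.x, since glDispatchCompute has no base-group offset):

#include <glad/glad.h>   // or any other GL function loader
#include <GLFW/glfw3.h>

void renderGui();        // your existing GUI rendering, assumed defined elsewhere

// Sketch: run only a slice of the simulation step each frame so the GUI keeps
// rendering at display rate. `window` and `computeProgram` are your existing objects.
void runMainLoop(GLFWwindow* window, GLuint computeProgram)
{
    const GLuint totalGroups    = 4096;  // work groups in one full simulation step
    const GLuint groupsPerFrame = 256;   // tuned so one slice fits the frame budget
    GLuint       groupsDone     = 0;

    while (!glfwWindowShouldClose(window))
    {
        // Submit one slice of the compute work for this frame.
        glUseProgram(computeProgram);
        glUniform1ui(glGetUniformLocation(computeProgram, "uFirstGroup"), groupsDone);
        glDispatchCompute(groupsPerFrame, 1, 1);
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

        groupsDone += groupsPerFrame;
        if (groupsDone >= totalGroups)
            groupsDone = 0;              // one full simulation step has been submitted

        renderGui();                     // normal GUI / visualization rendering
        glfwSwapBuffers(window);
        glfwPollEvents();
    }
}

Note that the visualization will see partially updated simulation data unless you double-buffer the simulation state and only flip it when a full step has completed.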
The background for my question is that I have a game engine whose main rendering loop involves two threads: in short, one thread generates OpenGL commands and another thread takes the generated commands and dispatches them to the driver. This allows a certain degree of overlap in their work, but they still have to wait for each other at certain points. This is of course a fairly specialized situation local to me, but I also can't imagine that having two threads ping-ponging work between them with a limited degree of parallelism is a particularly uncommon scenario.
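For concreteness, the handoff between the two threads has roughly this shape (a minimal C++ sketch of the pattern only; the actual engine is in Java and all names here are illustrative):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// One thread pushes recorded "GL commands", the other pops them and issues them
// on the thread that owns the GL context. Each side blocks (sleeps) whenever the
// other has not caught up yet.
struct CommandQueue {
    std::queue<std::function<void()>> commands;
    std::mutex mutex;
    std::condition_variable ready;

    void push(std::function<void()> cmd) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            commands.push(std::move(cmd));
        }
        ready.notify_one();
    }

    std::function<void()> pop() {
        std::unique_lock<std::mutex> lock(mutex);
        // The dispatcher thread sleeps here while the generator catches up.
        ready.wait(lock, [this] { return !commands.empty(); });
        std::function<void()> cmd = std::move(commands.front());
        commands.pop();
        return cmd;
    }
};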
Now, my problem is that, since neither of the two threads is 100% busy (each spends some time sleeping while waiting for the other to catch up), my OS scheduler apparently decides that I'm not using the CPU enough to clock it up to maximum frequency. In marginal situations, this causes me to drop FPS significantly. This is on Linux with the default powersave cpufreq governor. Setting it to performance makes the problem go away entirely, but that's obviously not the default, and not what other people will be using. I don't have any Windows systems readily available to test on, so I don't know whether Windows' scheduler behaves similarly. My program is written in Java, but I can't imagine that has any significant bearing on the problem.
Again though, I'm sure I'm not the only one with a program that behaves in this manner. Has anyone else had this problem? Is there a good way to "solve" it? It seems a bit unnecessary to drop performance significantly just because the operating system doesn't really understand that the CPU is in fact being used to 100%.
How to use Multi-threaded Graphics Job optimisations?
I have a project (less of a game, more of an application) which uses many tens of cameras in a scene, all rendering at one time. (Don't ask why, it has to be this way!)
Needless to say, when running the application, this nearly maxes out one CPU core, since the rendering pipeline is single-threaded. It drops the frame rate very low as a result. (GPU, memory etc. are well below 100% load - the CPU is the bottleneck here.)
I was greatly relieved to see that Unity has an option called "Graphics Jobs (experimental)" which is specifically designed to split the rendering across multiple threads, and as such, multiple cores.
However, with this option enabled and DirectX12 set at the top of the graphics APIs list, I would expect the CPU to now be fully utilised, namely that all CPU cores can be engaged in active rendering. Instead, the application still only utilises around 40% of my CPU's potential while delivering a low frame rate. Surely it should only drop the frame rate once it has maxed out the CPU? The individual cores don't seem to be used any differently. Why can I not maximise the output frame rate by utilising close to 100% of all cores on my CPU (in total, including any additional programs which are running)?
Essentially I would like to know how I can get Unity to use all of my CPU cores to their full potential, so as to get as high a frame rate as possible whilst having as many cameras in my scene at once. I thought Graphics Jobs would solve this? Unless I'm using them incorrectly, or not using the correct combination of settings?
As an aside, my CPU is an i7-4790 @ 3.6 GHz and my GPU is a 980 Ti using DX12, with 32 GB of RAM.
During a camera render, your CPU has to talk to the GPU in what is called a draw call. Under the hood, Unity is already using a multi-threaded approach (ctrl+f "threads") to get the CPU to talk to the GPU, and this process of dealing out tasks to worker threads takes place in a black box. Note that "Not all platforms support multithreaded rendering; at the time of writing, WebGL does not support this feature. On platforms that do not support multithreaded rendering, all CPU tasks are carried out on the same thread."
So, if your application is bound by the CPU, then perhaps your platform is not supported, or the tasks simply can't be broken down into small enough chunks to fill up all your cores, or there is some other 'black box' reason, such as the CPU iterating over each camera sequentially instead of batching the bunch (this makes sense to me, because the main thread has to be the one to interface with the Unity API in a thread-safe manner to determine what must be drawn).
It seems that your only option is to read the links provided here and try to optimize your scene to minimize draw calls. With that being said, I think trying to render 'tens of cameras' is a very bad idea to start with, especially if they're all rendering a complex scene at a large resolution with post-processing, etc., though I've never tried anything like it.
My co-worker and I are working on a video rendering engine.
The whole idea is to parse a configuration file and render each frame to an offscreen FBO, and then fetch the rendered frame with glReadPixels for video encoding.
We tried to optimize the rendering speed by creating two threads, each with an independent OpenGL context. One thread renders the odd frames and the other renders the even frames. The two threads do not share any GL resources.
The results are quite confusing. On my computer the rendering speed increased compared to our single-threaded implementation, while on my partner's computer the overall speed dropped.
I wonder how the number of OpenGL contexts affects overall performance. Is it really a good idea to create multiple OpenGL contexts and threads if they do not share anything?
Context switching is certainly not free. As for pretty much all performance-related questions, it's impossible to quantify in general terms; if you want to know, you need to measure it on the system(s) you care about. It can be quite expensive.
You therefore add a certain amount of overhead by using multiple contexts. Whether that pays off depends on where your bottleneck is. If you were already GPU-limited with a single CPU thread, you won't really gain anything, because you can't get the GPU to do the work any quicker if it is already fully loaded. In that case you add overhead for the context switches without any gain, and make the whole thing slower.
If you were CPU-limited, using multiple CPU threads can reduce your total elapsed time. Whether the parallelization of the CPU work, combined with the added overhead of synchronization and context switches, results in a net gain again depends on your use case and the specific system. Trying both and measuring is the only good way to find out.
Based on your problem description, you might also be able to use multithreading while still sticking with a single OpenGL context, and keeping all OpenGL calls in a single thread. Instead of using glReadPixels() synchronously, you could have it read into PBOs (Pixel Buffer Objects), which allows you to use asynchronous reads. This decouples GPU and CPU work much better. You could also do the video encoding on a separate thread if you're not doing that yet. This approach will need some inter-thread synchronization, but it avoids using multiple contexts, while still using parallel processing to get the job done.
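A minimal sketch of that PBO approach, assuming the offscreen FBO is already bound for reading and that width, height, renderNextFrame() and encodeFrame() stand in for your existing code. Two pixel-pack buffers are used round-robin, so each glReadPixels only starts an asynchronous transfer and the previous frame is mapped for the encoder:

// Create two pixel-pack buffers for round-robin readback.
GLuint pbos[2];
glGenBuffers(2, pbos);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
}

int frame = 0;
while (renderNextFrame()) {                      // renders into the offscreen FBO
    int writeIdx = frame % 2;                    // PBO receiving this frame
    int readIdx  = (frame + 1) % 2;              // PBO holding the previous frame

    // With a PACK buffer bound, glReadPixels returns immediately and the copy
    // from the FBO into the PBO happens asynchronously.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[writeIdx]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    if (frame > 0) {
        // Map the PBO written last frame; by now its transfer has usually completed,
        // so this stalls far less than a synchronous read would.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[readIdx]);
        void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (pixels) {
            encodeFrame(pixels);                 // e.g. hand off to an encoder thread
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}

Whether one frame of latency is enough to hide the transfer depends on frame time and resolution; adding a third PBO gives the copy more time to complete at the cost of more latency.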
So I have built a small application that has a physics engine and a display. The display is attached to a controller which handles the physics engine (well, actually a view model that handles the controller, but details).
Currently the controller is a delegate that gets activated by a BeginInvoke, deactivated by a cancellation token, and then reaped by an EndInvoke. Inside the lambda it raises PropertyChanged (hooked into INotifyPropertyChanged), which keeps the UI up to date.
From what I understand, the BeginInvoke method activates a task rather than another thread (on my computers it does activate another thread, but from the reading I have done that isn't guaranteed; it's up to the thread pool how it wants to get the task completed), which is fine from all the testing I have done. The lambda doesn't complete until the CancellationToken is cancelled. It has a sleep and an update (so it is sort of simulating a real-time physics engine... it's crude, but I don't need real precision on the timing, just enough to get a feel).
The question I have is: will this work on other computers, or should I switch over to explicit threads that I start and cancel? The scenario I am thinking of is a 1-core processor: is it possible that the second task will get massively less processor time and thereby turn my acceptably inaccurate model into something unacceptably inaccurate (i.e. waiting for milliseconds before switching rather than microseconds)? Or is there some better way of doing this that I haven't come up with?
In my experience, using the threadpool in the way you described will pretty much guarantee reasonably optimal performance on most computers, without you having to go to the trouble to figure out how to divvy up the threads.
A thread is not the same thing as a core; you will still get multiple threads on a single-core machine, and those threads will each take part of the processing load. You won't get the "deadlock" condition you describe, unless you do something unusual with the threads, like give one of them real-time priority.
That said, microseconds is not a lot of time for context switching between threads, so YMMV. You'll have to try it, and see how well it works; there may be some tweaking required.
In an application which is GPU bound, I am wondering at what point the CPU will wait on the GPU to complete rendering. Is it different between DirectX and OpenGL?
Running an example similar to the one below, the CPU obviously doesn't run away; looking in Task Manager, CPU usage (if it were a single-core machine) would be below 100%.
while (running) {
    Clear();
    SetBuffers();     // vertex / index
    SetTexture();
    DrawPrimitives();
    Present();
}
The quick summary is that you will probably see the wait in Present(), but it really depends on what is in the Present() call.
Generally, unless you specifically ask to be notified when the GPU is finished, you might end up waiting at the (random to you) point where the driver's input buffer fills up. Think of the GPU driver and card as a very long pipeline. You can put in work at one end, and eventually, after a while, it comes out to the display. You might be able to put several frames' worth of commands into the pipeline before it fills up. The card could be taking a lot of time drawing primitives, but you might see the CPU waiting at a point several frames later.
If your Present() call contains the equivalent of glFinish(), that entire pipeline must drain before that call can return. So, the CPU will wait there.
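If you do want explicit notice of when the GPU has finished a batch of work, rather than relying on wherever the driver happens to block, a fence sync object is the usual tool in OpenGL 3.2+ (DirectX has analogous query/fence mechanisms). A minimal sketch:

// Insert a fence right after the commands you care about ...
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// ... then later check it (or wait on it) from the CPU.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                 16 * 1000 * 1000);  // timeout in nanoseconds (~16 ms)
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    // The GPU has finished everything submitted before the fence.
}
glDeleteSync(fence);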
I hope the following can be helpful:
Clear();
Causes all the pixels in the current buffer to change color, so the GPU is doing work. Look up your GPU's clear pixels/sec rate to see what time this should be taking.
SetBuffers();
SetTexture();
The driver may do some work here, but generally it wants to wait until you actually do drawing to use this new data. In any event, the GPU doesn't do much here.
DrawPrimitives();
Now here is where the GPU should be doing most of the work. Depending on the primitive size, you'll be limited by vertices/sec or pixels/sec. Perhaps you have an expensive shader and you'll be limited by shader instructions/sec.
However, you may not see this as the place where the CPU is waiting. The driver may buffer the commands for you, and the CPU may be able to continue on.
Present();
At this point, the GPU work is minimal. It just changes a pointer to start displaying from a different buffer.
However, this is probably the point that appears to the CPU to be where it is waiting on the GPU. Depending on your API, Present() may include something like glFlush() or glFinish(). If it does, then you'll likely wait here.
On Windows the waits are in the video driver. They depend somewhat on the driver implementation, though in a lot of cases the need for a wait is dictated by the requirements of the API you are using (whether calls are defined to be synchronous or not).
So yes, it would most likely be different between DirectX and OpenGL.