In an application which is GPU bound, I am wondering at what point the CPU will wait on the GPU to complete rendering. Is it different between DirectX and OpenGL?
Running an example similar to the one below, the CPU obviously doesn't run away, and looking at Task Manager, CPU usage (if this were a single-core machine) would be below 100%.
while (running) {
    Clear();
    SetBuffers();    // vertex / index buffers
    SetTexture();
    DrawPrimitives();
    Present();
}
The quick summary is that you will probably see the wait in Present(), but it really depends on what the Present() call actually does.
Generally, unless you specifically ask to be notified when the GPU is finished, you might end up waiting at the (to you, arbitrary) point where the driver's input buffer fills up. Think of the GPU driver and card as a very long pipeline: you put work in at one end and, after a while, it comes out on the display. You might be able to put several frames' worth of commands into the pipeline before it fills up. The card could be taking a lot of time drawing primitives, but you might see the CPU waiting at a point several frames later.
If your Present() call contains the equivalent of glFinish(), that entire pipeline must drain before that call can return. So, the CPU will wait there.
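If you do want explicit notice of when the GPU has finished a frame, a sync object lets you ask without draining the whole pipeline the way glFinish() does. A minimal sketch, assuming an OpenGL 3.2+ context with loaded function pointers; drawScene() and swapBuffers() are placeholders for your own frame code:

// Insert a fence after each frame's commands and poll it later,
// instead of draining the whole pipeline with glFinish().
GLsync lastFrameFence = 0;

void renderFrame() {
    if (lastFrameFence) {
        // Poll without blocking (timeout of 0 nanoseconds).
        GLenum state = glClientWaitSync(lastFrameFence, 0, 0);
        if (state == GL_ALREADY_SIGNALED || state == GL_CONDITION_SATISFIED) {
            // The GPU has drained everything up to last frame's fence.
        }
        glDeleteSync(lastFrameFence);   // sketch only; a real app might keep it until signaled
    }

    drawScene();                        // the Clear/SetBuffers/SetTexture/DrawPrimitives work
    lastFrameFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    swapBuffers();                      // the Present() equivalent; also flushes the fence
}

Passing a non-zero timeout to glClientWaitSync() turns the poll into exactly the kind of CPU-waits-on-GPU point the question asks about.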
I hope the following can be helpful:
Clear();
Causes all the pixels in the current buffer to change color, so the GPU is doing work. Look up your GPU's clear rate (pixels/sec) to see how long this should take.
SetBuffers();
SetTexture();
The driver may do some work here, but generally it wants to wait until you actually do drawing to use this new data. In any event, the GPU doesn't do much here.
DrawPrimitives();
Now here is where the GPU should be doing most of the work. Depending on the primitive size, you'll be limited by vertices/sec or pixels/sec; if you have an expensive shader, you'll be limited by shader instructions/sec.
However, you may not see this as the place where the CPU is waiting. The driver may buffer the commands for you, and the CPU may be able to continue on.
Present();
At this point, the GPU work is minimal. It just changes a pointer to start displaying from a different buffer.
However, this is probably the point that appears to the CPU to be where it is waiting on the GPU. Depending on your API, "Present()" may include something like glFlush() or glFinish(). If it does, then you'll likely wait here.
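If you want to see for yourself where the wait lands, one crude but effective approach is to time each call, optionally forcing a pipeline drain after it. A sketch, using the placeholder names from the question; uncommenting the glFinish() moves the GPU cost onto whichever call precedes the drain:

#include <chrono>
#include <cstdio>

// Hypothetical helper: measures the CPU-side cost of one call.
template <typename F>
void timeCall(const char* name, F&& call) {
    auto start = std::chrono::steady_clock::now();
    call();
    // glFinish();   // uncomment to drain the pipeline after every call
    auto end = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::printf("%s: %lld us\n", name, us);
}

// Usage with the placeholder names from the question:
//   timeCall("Clear",          [] { Clear(); });
//   timeCall("DrawPrimitives", [] { DrawPrimitives(); });
//   timeCall("Present",        [] { Present(); });

Without the glFinish(), most calls will appear nearly free and the time piles up in Present(); with it, the cost shifts to wherever the GPU is actually spending its time.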
On Windows the waits are in the video driver. They depend somewhat on the driver implementation, though in a lot of cases the need for a wait is dictated by the requirements of the API you are using (whether calls are defined to be synchronous or not).
So yes, it would most likely be different between DirectX and OpenGL.
Related
I am making a simulation program that uses compute shaders, and I ran into a problem. I am currently using an OpenGL context to render the GUI used to control and watch the simulation, and I use the same context to call glDispatchCompute.
That can cause the program window to freeze, because the simulation might run at any UPS (anywhere from 0.1 to 10000 updates per second), while the window should update at a fixed FPS (the display refresh rate, commonly 60 FPS).
That becomes a problem when the simulation is slow and a single step takes, for example, 600 ms to compute: the swap-buffers call waits for all compute shaders to finish, and so the FPS drops.
How can I make updates and renders independent of each other? On the CPU I could just spawn a second thread, but an OpenGL context is not multithreaded. Should I use Vulkan for this task?
Even with Vulkan, there is no way to just shove a giant blob of work at the GPU and guarantee that later graphics work will just interrupt the GPU's processing. The most reliable way to handle this is to break your compute work up into chunks of a size that you're reasonably sure will not break your framerate and interleave them with your rendering commands.
Vulkan offers mechanisms that allow GPUs to execute interruptible work, but it does not require any particular interrupting functionality. You can create a compute queue that has the lowest possible priority and a graphics queue with the highest priority (see the sketch after this list). But even that assumes:
The Vulkan implementation offers multiple queues at all. Many embedded ones do not.
Queue priority implementations will preempt work in progress if higher-priority work is submitted. This may happen, but the specification offers no guarantees. There is some documentation about how particular GPUs behave in this respect, showing that some of them can handle it. It's a few years old, so more recent GPUs may be even better, but it should get you started.
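For reference, here is a minimal sketch of the queue-priority part at device creation, assuming the physical device exposes a graphics family and a separate compute family; the two family indices are placeholders you would obtain from vkGetPhysicalDeviceQueueFamilyProperties():

#include <vulkan/vulkan.h>

// Hypothetical family indices; in real code you would pick these by inspecting
// the vkGetPhysicalDeviceQueueFamilyProperties() results.
uint32_t graphicsFamilyIndex = 0;
uint32_t computeFamilyIndex  = 1;

void createDeviceWithPrioritizedQueues(VkPhysicalDevice physicalDevice, VkDevice* device) {
    float graphicsPriority = 1.0f;   // highest priority
    float computePriority  = 0.0f;   // lowest priority

    VkDeviceQueueCreateInfo queueInfos[2] = {};
    queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = graphicsFamilyIndex;
    queueInfos[0].queueCount       = 1;
    queueInfos[0].pQueuePriorities = &graphicsPriority;

    queueInfos[1].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[1].queueFamilyIndex = computeFamilyIndex;
    queueInfos[1].queueCount       = 1;
    queueInfos[1].pQueuePriorities = &computePriority;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos    = queueInfos;
    // ... enabled extensions and features omitted for brevity
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, device);
}

Even with this, whether the graphics queue actually preempts in-flight compute work is up to the implementation, per the caveats above.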
Overall, Vulkan can help, but you'll need to profile it and you'll need to have a fallback if you care about implementations that don't have enough queues to do anything.
OpenGL is of course even less useful for this, as it has no explicit queueing system at all. So breaking the work up is really the only way to ensure that the compute task doesn't starve the renderer.
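For the OpenGL path, a rough sketch of that chunking, assuming one simulation step can be split into independent work groups; computeProgram, firstGroupLocation, drawGui() and swapBuffers() are placeholders for your own code, and the per-frame budget is a tuning knob you would profile:

// Assumes an OpenGL 4.3 context with a loader already set up.
GLuint computeProgram     = 0;    // placeholder: your compiled compute program
GLint  firstGroupLocation = -1;   // placeholder: uniform location of the group offset
const GLuint totalGroups    = 4096;   // assumed size of one full simulation step
const GLuint groupsPerFrame = 256;    // per-frame budget; tune by profiling
GLuint nextGroup = 0;

void frame() {
    // Dispatch only a slice of the simulation step, so the GPU work per frame
    // stays small enough not to delay the swap.
    glUseProgram(computeProgram);
    glUniform1ui(firstGroupLocation, nextGroup);   // shader adds this offset to gl_WorkGroupID.x
    GLuint remaining = totalGroups - nextGroup;
    GLuint count = remaining < groupsPerFrame ? remaining : groupsPerFrame;
    glDispatchCompute(count, 1, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);   // make results visible to later passes

    nextGroup += count;
    if (nextGroup >= totalGroups)
        nextGroup = 0;   // one simulation step finished; start the next one

    drawGui();           // render the GUI at full frame rate
    swapBuffers();       // no longer waits on a 600 ms compute blob
}

The compute shader has to add the offset uniform to gl_WorkGroupID.x itself, since every dispatch starts counting work groups from zero.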
The background for my question is that I have a game engine whose main rendering loop involves two threads: In short, one thread generating OpenGL commands and another thread that takes the generated OpenGL commands and dispatches them to the driver. This allows them a certain degree of overlap of the work, but they still have to wait for each other at certain points. This is of course a fairly specialized situation local to me, but I also can't imagine the situation of having two threads ping-ponging work between them with a limited degree of parallelism is a particularly uncommon scenario.
Now, my problem is that, since neither of the two threads is 100% busy but rather spends some time sleeping while waiting for the other thread to catch up, my OS scheduler apparently thinks I'm not using the CPU enough to clock it up to maximum frequency. In marginal situations, this causes me to drop FPS significantly. This is on Linux with the default powersave cpufreq governor. Setting it to performance makes the problem go away entirely, but that's obviously not the default, and not what other people will be using. I don't have any Windows systems readily available to test on, so I don't know whether Windows' scheduler behaves similarly. My program is written in Java, but I can't imagine that having any significant bearing on the problem.
Again though, I'm sure I'm not the only one with a program that behaves in this manner. Has someone else had this problem? Is there a good way to "solve" it? It seems a bit unnecessary to be dropping performance significantly just because the operating system doesn't really understand that the CPU is in fact being used to 100%.
From this post:
Two threads being timesliced on a single CPU core won't run into a reordering problem. A single core always knows about its own reordering and will properly resolve all its own memory accesses. Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
Why can't the instruction reorder issue occur on a single CPU core? This article doesn't explain it.
EXAMPLE:
The example comes from Memory Reordering Caught in the Act. The original question includes two pictures from that article (not reproduced here): the two-thread code, in which each thread stores 1 to one shared variable and then loads the other, and a recording of a run of it.
I think the recorded instructions can also cause this issue on a single CPU, because neither r1 nor r2 ends up as 1.
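For reference, a small C++ sketch of what the pictured experiment looks like (a reconstruction, not the article's exact code): each thread stores 1 to one shared variable and then loads the other, both with relaxed ordering.

#include <atomic>
#include <cstdio>
#include <thread>

// The outcome r1 == 0 && r2 == 0 means the loads effectively moved
// ahead of the stores, i.e. memory reordering was observed.
int main() {
    for (int trial = 0; trial < 100000; ++trial) {
        std::atomic<int> X{0}, Y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            X.store(1, std::memory_order_relaxed);
            r1 = Y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            Y.store(1, std::memory_order_relaxed);
            r2 = X.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0)
            std::printf("reordering observed on trial %d\n", trial);
    }
    return 0;
}

Thread startup makes the overlap window small, so observations may be rare (the article uses semaphores and random delays to line the threads up), but the outcome r1 == r2 == 0 is permitted and does occur on real hardware.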
A single core always knows about its own reordering and will properly resolve all its own memory accesses.
A single CPU core does reorder, but it knows its own reordering and can do clever tricks to pretend it doesn't. Thus things go faster, without weird side effects.
Multiple cores however operate independently in this regard and thus won't really know about each other's reordering.
When a CPU reorders, the other CPUs can't compensate for it. Imagine CPU #1 is waiting for a write to variableA and then reads from variableB. If CPU #2 writes to variableB and then to variableA, as the code says, no problem occurs. If CPU #2 reorders and writes to variableA first, CPU #1 doesn't know, and tries to read from variableB before it has a value. This can cause crashes or any kind of "random" behavior. (Intel chips have more magic that makes this not happen.)
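As an illustration of that scenario (variable names taken from the paragraph above; release/acquire ordering is one way to forbid the dangerous reordering, and 42 is just a stand-in payload):

#include <atomic>
#include <cstdio>
#include <thread>

int variableB = 0;                       // the payload
std::atomic<bool> variableA{false};      // the "payload is ready" flag

void cpu2Writer() {
    variableB = 42;                                     // write the payload first...
    variableA.store(true, std::memory_order_release);   // ...then publish the flag
}

void cpu1Reader() {
    while (!variableA.load(std::memory_order_acquire)) {}   // wait for the flag
    std::printf("%d\n", variableB);                          // sees 42, never a stale value
}

int main() {
    std::thread writer(cpu2Writer), reader(cpu1Reader);
    writer.join();
    reader.join();
    return 0;
}

If the store to variableA could be reordered ahead of the store to variableB (which plain non-atomic variables or relaxed ordering would permit), the reader could see the flag before the payload exists.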
Two threads being timesliced on a single CPU core won't run into a reordering problem.
If both threads are on the same CPU, then it doesn't matter which order the writes happen in, because if they're reordered, then they're both in progress, and the CPU won't really switch until both are written, in which case they're safe to read from the other thread.
Example
For the code to have a problem on a single core, it would have to reorder the two instructions from process 1, be interrupted by process 2, and execute process 2 between them. But if it is interrupted between them, it knows it has to abort both of them, since it knows about its own reordering and knows it's in a dangerous state. So it will either do them in order, do both before switching to process 2, or do neither before switching to process 2. All of these avoid the reordering problem.
There are multiple effects at work here, but they are modeled as just one effect because that makes them easier to reason about. Yes, a modern core already reorders instructions by itself, but it maintains the logical flow between them: if two instructions have an inter-dependency, they stay ordered so the logic of the program does not change. Discovering these inter-dependencies and preventing an instruction from being issued too early is the job of the reorder buffer in the execution engine.
This logic is solid and can be relied upon; it would be next to impossible to write a program if that weren't the case. But the same guarantee cannot be provided by the memory controller, which has the unenviable job of giving multiple processors access to the same shared memory.
First is the prefetcher: it reads data from memory ahead of time to ensure the data is available by the time a read instruction executes, so the core won't stall waiting for the read to complete. The problem is that, because the memory was read early, the value might be stale, changed by another core between the time the prefetch happened and the time the read instruction executes. To an outside observer, it looks like the instruction executed early.
Next is the store buffer: it takes the data of a write instruction and writes it to memory lazily, some time after the instruction has executed, so the core won't stall waiting for the memory-bus write cycle to complete. To an outside observer, it just looks like the instruction executed late.
Modeling the effects of the prefetcher and store buffer as instruction reordering effects is very convenient. You can write that down on a piece of paper easily and reason about the side-effects.
To the core itself, the effects of the prefetcher and store buffer are entirely benign and it is oblivious to them. As long as there isn't another core that's also changing memory content. A machine with a single core always has that guarantee.
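Tying this back to the code in the question: forcing the store buffer to drain before the following load removes the apparent reordering. In C++ terms (my illustration, not part of the original answer), a sequentially consistent fence between the store and the load does that; on x86 it compiles to an instruction that empties the store buffer:

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // store must be visible before the load
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // store must be visible before the load
    r2 = X.load(std::memory_order_relaxed);
}
// With both fences in place, the outcome r1 == 0 && r2 == 0 can no longer occur.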
I'm learning about Concurrency and how the OS handles interrupts such as moving your cursor across the screen while a program is doing some important computation like large matrix multiplications.
My question is: say you're on one of those old computers with a single core, wouldn't that single core need to constantly context-switch to handle interrupts from all the cursor movement, and therefore need more time for the important computation? But I assume it's not that huge of a delay, because perhaps the OS will prioritize my calculation above my interrupts? Maybe skip a few frames of the movement.
And if we move to a multi-core system, is this generally less likely to happen, as the cursor movement will probably be processed by another core? So my calculations will not really be delayed that much?
While I am at this, am I right to assume that the single-core computer probably goes through like hundreds of processes and it context-switches throughout all of them? So quite literally, your computer is doing one instruction at a time for a certain amount of time (a time slice) and then it needs to switch to another process with a different set of instructions. If so, it's amazing how the core does so much.... Jump, get the context, do a few steps, save the context onto stack, jump yet again. Rinse and repeat. There's obviously no parallelism. So no two instructions are EVER running at the same time. It only gives that illusion.
Your last paragraph is correct: it's the job of the operating system's scheduler to create the feeling of parallelism by letting each process execute some instructions and then continuing with the next. This doesn't only affect single-core CPUs, by the way; typically your computer will be running many more processes than you have CPUs. (Use Task Manager on Windows or top on Linux to see how many processes are currently running.)
As for your mouse question: that interrupt will most likely just update the current mouse coordinates in a variable and not cause a repaint. It is therefore extremely fast and should not cause any measurable delay in program execution. Maybe it would if you could move your mouse at the speed of light ;)
My co-worker and I are working on a video rendering engine.
The whole idea is to parse a configuration file, render each frame to an offscreen FBO, and then fetch the rendered frames with glReadPixels for video encoding.
We tried to optimize the rendering speed by creating two threads, each with an independent OpenGL context. One thread renders the odd frames and the other renders the even frames. The two threads do not share any GL resources.
The results are quite confusing. On my computer the rendering speed increased compared to our single-threaded implementation, while on my partner's computer the overall speed dropped.
What I'm wondering is how the number of OpenGL contexts affects overall performance. Is it really a good idea to create multiple OpenGL threads if they do not share anything?
Context switching is certainly not free. As with pretty much all performance-related questions, it's impossible to quantify in general terms; if you want to know, you need to measure it on the system(s) you care about. It can be quite expensive.
Therefore, you add a certain amount of overhead by using multiple contexts. Whether that pays off depends on where your bottleneck is. If you were already GPU limited with a single CPU thread, you won't really gain anything, because you can't get the GPU to do the work quicker if it was already fully loaded. You just add overhead for the context switches without any gain, and make the whole thing slower.
If you were CPU limited, using multiple CPU threads can reduce your total elapsed time. Whether the parallelization of the CPU work, combined with the added overhead of synchronization and context switches, results in a net gain again depends on your use case and the specific system. Trying both and measuring is the only good way to find out.
Based on your problem description, you might also be able to use multithreading while still sticking with a single OpenGL context, and keeping all OpenGL calls in a single thread. Instead of using glReadPixels() synchronously, you could have it read into PBOs (Pixel Buffer Objects), which allows you to use asynchronous reads. This decouples GPU and CPU work much better. You could also do the video encoding on a separate thread if you're not doing that yet. This approach will need some inter-thread synchronization, but it avoids using multiple contexts, while still using parallel processing to get the job done.
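A sketch of that PBO approach, double-buffered so the readback of frame N overlaps the rendering of frame N+1; encodeFrame() is a placeholder for handing the pixels to your encoder (possibly on another thread), and RGBA8 frames are assumed:

// Double-buffered PBO readback: glReadPixels() into a bound PBO returns without
// waiting, and the data is mapped one frame later, when the transfer has had
// time to complete.
void encodeFrame(const void* pixels, int width, int height);   // placeholder hand-off

GLuint pbo[2];
int writeIndex = 0;

void initReadback(int width, int height) {
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void readbackFrame(int width, int height) {
    // Start an asynchronous read of the frame just rendered into one PBO.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIndex]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

    // Map the other PBO, which by now holds the previous frame's pixels.
    int readIndex = 1 - writeIndex;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIndex]);
    if (void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
        encodeFrame(pixels, width, height);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    writeIndex = readIndex;
}

On the very first frame the "other" PBO is still empty, so a real implementation would skip encoding that one.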