OpenGL offscreen rendering in Linux is slow

My OpenGL application runs at about 110 fps. The moment I add off-screen rendering, it slows down to 15 fps. I am using framebuffer objects and glReadPixels to render off-screen. I have searched the net and found that GPU-to-CPU memory transfers are slow, while transfers in the other direction are fast.
I have ATI Mobility Radeon™ X2300 with 128MB video memory.
So my questions are:
1) Is there a way to increase the VRAM-to-CPU-RAM data transfer speed?
2) Are there any GPUs on the market optimized for better read speed?

The problem is not so much the transfer speed as the serialization between CPU and GPU. When you call glReadPixels that way, the CPU stops and waits for the GPU to finish all pending rendering, which is quite inefficient, as you have already noticed.
The solution is to use PBOs (pixel buffer objects). You can keep N PBOs and, on every frame, bind PBO X (where 0 <= X < N) and issue glReadPixels into it. Before that, you can map PBO X and read back the pixels of a past frame. N is not a magic number; for most implementations a delay of 3 frames is typical, so N = 3 is a good starting point.
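A minimal sketch of that round-robin readback, assuming a working GL context; width, height and the per-frame counter frame are placeholders, not names from the question:

const int NUM_PBOS = 3;                          // N = 3 frames in flight
GLuint pbos[NUM_PBOS];
const size_t bufSize = width * height * 4;       // tightly packed RGBA8

// One-time setup: allocate the PBOs for streaming readback.
glGenBuffers(NUM_PBOS, pbos);
for (int i = 0; i < NUM_PBOS; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, bufSize, nullptr, GL_STREAM_READ);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Every frame, after rendering to the FBO:
int index = frame % NUM_PBOS;
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[index]);
// With a PBO bound to GL_PIXEL_PACK_BUFFER, the last argument is an offset
// into that buffer, and the call returns without waiting for the GPU.
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

if (frame >= NUM_PBOS - 1) {
    // The oldest PBO now holds a frame rendered N-1 frames ago; mapping it
    // should not stall because the GPU has long since finished writing it.
    int oldest = (frame + 1) % NUM_PBOS;
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[oldest]);
    void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (src) {
        // ... copy or process bufSize bytes of pixel data here ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);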

Related

Low 'Average Physical Core Utilization' according to VTune when using OpenMP, not sure what the bigger picture is

I have been optimizing a ray tracer, and to get a nice speedup I used OpenMP generally as follows (C++):
Accelerator accelerator; // Has the data to make tracing way faster
Rays rays;               // Makes the rays so they're ready to go
#pragma omp parallel for
for (int y = 0; y < window->height; y++) {
    for (int x = 0; x < window->width; x++) {
        Ray& ray = rays.get(x, y);
        accelerator.trace(ray);
    }
}
I gained a 4.85x speedup on a 6-core/12-thread CPU. I thought I'd get more than that, maybe something like 6-8x, especially since this loop eats up >= 99% of the application's processing time.
I want to find out where my performance bottleneck is, so I opened VTune and profiled. Note that I am new to profiling, so maybe this is normal but this is the graph I got:
In particular, this is the 2nd biggest time consumer:
where the 58% is the microarchitecture usage.
Trying to solve this on my own, I went looking for information on this, but the most I could find was on Intel's VTune wiki pages:
Average Physical Core Utilization
Metric Description
The metric shows average physical cores utilization by computations of the application. Spin and Overhead time are not counted. Ideal average CPU utilization is equal to the number of physical CPU cores.
I'm not sure what this is trying to tell me, which leads me to my question:
Is a result like this normal, or is something going wrong somewhere? Is it okay to only see a 4.8x speedup (compared to a theoretical max of 12.0) for something that is embarrassingly parallel? While ray tracing itself can be unfriendly due to the rays bouncing everywhere, I have done what I can to compact the memory and be as cache-friendly as possible, I use libraries that utilize SIMD for the calculations, I have implemented countless techniques from the literature to speed things up, and I avoid branching as much as possible and use no recursion. I also parallelized over rows so that, as far as I know, there is no false sharing: each row is handled by a single thread, so no two threads should be writing to the same cache line (especially since ray traversal is all const). Also, the framebuffer is row-major, so I was hoping false sharing wouldn't be an issue from that either.
I do not know whether the profiler just reports an OpenMP-threaded main loop this way and this is an expected result, or whether I have made some kind of newbie mistake and I'm not getting the throughput that I want. I also checked whether OpenMP spawns 12 threads, and it does.
I guess tl;dr, am I screwing up using OpenMP? From what I gathered, the average physical core utilization is supposed to be up near the average logical core utilization, but I almost certainly have no idea what I'm talking about.
IMHO you're doing it right, and you're overestimating the efficiency of parallel execution. You did not give details about the architecture you're using (CPU, memory, etc.), nor the code... but to put it simply, I suppose that beyond a 4.8x speed increase you're hitting the memory bandwidth limit, so RAM speed is your bottleneck.
Why?
As you said, ray tracing is not hard to run in parallel and you're doing it right, so if the CPU is not 100% busy my guess is your memory controller is.
Supposing you're tracing a model (triangles? voxels?) that lives in RAM, your rays need to read bits of the model when checking for hits. You should check your maximum RAM bandwidth, then divide it by 12 (threads), then divide it by the number of rays per second... and you'll find that even 40 GB/s is "not so much" when you trace a lot of rays. That's why GPUs are a better option for ray tracing.
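As a rough, purely illustrative back-of-the-envelope (the bandwidth and ray-rate figures here are assumptions, not measurements): 40 GB/s shared by 12 threads is about 3.3 GB/s per thread; at, say, 10 million rays per second per thread, that leaves roughly 330 bytes of RAM traffic per ray, which a handful of cache-missing BVH-node and triangle fetches will use up easily.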
Long story short, I suggest you try to profile memory usage.

Why is slither.io CPU intensive?

From a programmer's point of view, how can a snakes-and-dots game consume far more resources than many more advanced games?
Lazy programming.
Considering you could run the original Quake, a fully 3D game with texture mapping, on a 66 MHz Pentium with 16 MB of RAM and no 3D accelerator card, there is no reason for a silly 2D browser snake game to stall a modern multi-GHz CPU other than programmer incompetence.
All the calculations are done by the CPU, whereas other computation-intensive games tend to make use of the GPU. You can observe this by looking at your CPU monitor/task manager, and your graphics driver configuration interface (if you have a fancy card).
When I load the slither game my GPU temp doesn't increase at all, whereas the CPU temp climbs steadily, and the game immediately swallows half of my cores. CPU use further increases in "busy" periods when there are many other players interacting in a small area.
I suspect slither.io is doing blockchain calculations while we play the game. In that case the game would run more smoothly on an old graphics card, where their secondary intention is not possible. The better the GPU, the more of the game's time gets split off for their nefarious purposes.

What is the least amount of (manageable) samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card, using my magic API functions. As those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause in the audio stream occurring? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card, so please provide a "rule of thumb" style answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, although it might not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
1. The audio hardware progressively outputs samples from its buffer.
2. At some point it reaches a low-water mark and generates an interrupt, signalling that the buffer needs replenishing with more samples.
3. The operating system services the interrupt and marks the thread as ready to run.
4. The operating system schedules the thread to run on a CPU.
5. The thread computes, or otherwise obtains, samples and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and it is dictated largely by the design of the host operating system. On a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. Mac OS X supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers run in kernel threads; you ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
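For illustration of the Linux case only, and as a sketch rather than a recipe: a common way to request real-time priority for the audio thread is the POSIX scheduling API. The priority value 80 below is an arbitrary placeholder, and the call usually needs an appropriate rtprio limit or privilege.

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Ask for SCHED_FIFO so the audio thread pre-empts normal threads as soon as
// it becomes runnable; 80 is just an example priority, not a recommendation.
static void request_realtime_priority() {
    sched_param sp{};
    sp.sched_priority = 80;
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
}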
It's also possible to take a (usually bad) latency hit in step 5, when your process fills the buffer, if it takes a page fault. Usual practice is to allocate all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
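Again only as a sketch of the mlock() idea on Linux/POSIX (mlockall() covers the whole address space rather than individual buffers):

#include <sys/mman.h>
#include <cstdio>

// Pin current and future allocations (heap, stack, code, data) into RAM so
// that filling the audio buffer never has to wait on a page fault.
static void lock_process_memory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        std::perror("mlockall");   // typically needs a raised memlock rlimit
}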
Absolutely forget about achieving low latency in an interpreted or JITed language runtime. You have far too little control over what the language runtime is doing and no realistic prospect of preventing page faults (e.g. for memory allocation). I suspect 10 ms is pushing your luck in those cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1 ms audio buffers are possible, they are not necessarily desirable. The scheduler tick interval on modern Windows, for example, is between 10 ms and 30 ms. That means that, usually at the audio-driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer-starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
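For a rough sense of scale (assuming a 44.1 kHz sample rate, so these are illustrative figures only): the 100-sample buffers from the question are about 2.3 ms each, 256 samples per frame is about 5.8 ms, and 2048 samples is about 46 ms, so the "safe" large setting trades a few tens of milliseconds of latency for robustness.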

Using GPU for number-crunching and rendering at the same time in parallel

Can a rendering job and a number-crunching job (e.g. in OpenCL) effectively share the same single GPU? For example:
thread A runs an OpenCL task to generate an image
then, when the image is ready, thread A notifies another thread B (the image is ready) and moves on to calculating the next image
thread B starts some pre-display activities on the given image (like overlay calculation with GDI), combines the final image, and renders it to the display
Can this kind of GPU resource sharing give a performance improvement or, on the contrary, will it cause an overall slowdown of both the compute and rendering tasks?
Thanks
There are many factors involved here, but generally you shouldn't see a slowdown.
Problems with directly answering your question:
OpenCL could be using your CPU as well, depending on how you set it up
Your graphics work could be done mostly on the CPU or on a different part of the GPU, depending on what you display; for example, many GDI implementations render using the CPU and only use very simple 2D acceleration techniques on the GPU, mostly to blit the final composed image.
It might depend on the GPU, GPU driver, graphics stack, etc. that you use.
As is usually the case, you will get the best answer by trying it out, or at least by benchmarking the different parts. After all, you won't really get much of a benefit if your computations are too simple or the image-rendering part is too simple.
Also, you might try going even further and rendering the result with shaders or the like; in that case you avoid having to move the data back from GPU memory to main memory, which could, depending on your circumstances, also give you a speed boost.
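As a rough illustration of that last point (a sketch, not code from the answer): with OpenCL/OpenGL interop you can let the kernel write straight into a GL texture that is then drawn, assuming the CL context was created with GL sharing enabled; context, queue and tex below are placeholders.

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

// Assumed to exist elsewhere: a CL context created with GL sharing enabled,
// a command queue on that context, and the GL texture the frame is drawn from.
extern cl_context context;
extern cl_command_queue queue;
extern GLuint tex;

void crunch_one_frame() {
    cl_int err = CL_SUCCESS;
    // Wrap the GL texture as a CL image (in real code, do this once at setup).
    cl_mem clImage = clCreateFromGLTexture(context, CL_MEM_WRITE_ONLY,
                                           GL_TEXTURE_2D, 0, tex, &err);

    glFinish();                                   // make sure GL is done with tex
    clEnqueueAcquireGLObjects(queue, 1, &clImage, 0, nullptr, nullptr);
    // ... enqueue the number-crunching kernel that writes into clImage ...
    clEnqueueReleaseGLObjects(queue, 1, &clImage, 0, nullptr, nullptr);
    clFinish(queue);                              // make sure CL is done with tex

    clReleaseMemObject(clImage);
    // GL can now sample 'tex' (e.g. draw a fullscreen quad); no readback needed.
}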
If the data/crunching ratio is big and you also have to send data from the CPU to the GPU:

crunch ---> crunch ---> render

GPU th0:     crunch for (t-1)    crunch for (t)       rendering
CPU th1:     send data for t     send data for t+1    send data for t+2
CPU th2:     get data of t-2     get data of t-1      get data of t
CPU th3-th7: other things independent of crunching or rendering

At the same time: crunching & comm.    crunching & comm.    rendering & comm.
                  and other things     and other things     and other things

If the data/crunching ratio is big and you do not have to send data from the CPU to the GPU: use the interoperability features of CL (for example, CL-GL interop).

If the data/crunching ratio is small: you should not see any slowdown.

Medium data/crunching ratio: crunch ---> render ---> crunch ---> render

GPU th0:     crunch for (t)      rendering            crunch for (t+1)     render again, and keep cycling like this
CPU th1:     get data of (t-1)   send data for t+1    get data of (t)
CPU th2-th7: other things independent of crunching or rendering

At the same time: crunching & getting   rendering & sending   crunching & getting
                  and other things      and other things      and other things

How would a multithreaded program be more energy efficient?

In its Energy-Efficient Software Guidelines, Intel suggests that programs be designed multithreaded for better energy efficiency.
I don't get it. Suppose I have a quad-core processor that can switch off unused cores. Suppose my code is perfectly parallelizable (synchronization overhead is negligible).
If I use only one core I burn one core for one hour, if I use four cores I burn four cores for 15 minutes - the same amount of core-hours either way. Where's the saving?
I suspect it has to do with a non-linear relation between CPU utilization and power consumption. So if you can spread 100% CPU utilization over 4 CPUs, each will have 25% utilization and, say, 12% of the power consumption.
This is especially true when dynamic CPU frequency scaling is used: according to Wikipedia, the power drain of a CPU is P = C(V^2)F. When a CPU is running faster it requires higher voltages, and that 'to the power of 2' becomes crucial. Furthermore, the voltage is itself a function of F (so V can be expressed in terms of F), giving something like P = C(F^2)F, i.e. power proportional to F^3. Thus, by spreading the load over 4 CPUs (each running at 100% capacity at that lower frequency) you can mitigate the cost of the same work.
We can make F a function of L (the load, as a percentage of one core, as your OS would report it), for example:
F = 1000 + L/100 * 500 = 1000 + 5L
p = C((1000 + 5L)^2)(1000 + 5L) = C(1000 + 5L)^3
Now that we can relate load (L) to power consumption, we can see the characteristics of the power consumption with everything on one core:
p = C(1000 + 5L)^3
p = C(1000000000 + 15000000L + 75000L^2 + 125L^3)
Or spread over 4 cores:
p = 4C(1000 + (5/4)L)^3
p = C(4000000000 + 15000000L + 18750L^2 + 7.8125L^3)
Notice the much smaller factors in front of the L^2 and L^3 terms.
During that one hour, the one core isn't the only thing you keep running: the rest of the machine draws power for the whole hour too.
You burn 4 times the power with 4 cores, but you do 4 times more work too! If, as you said, the synchronization is negligible and the work is parallelizable, you'll spend a quarter of the time.
Using multiple threads can save energy when you have I/O waits: one thread can wait while other threads perform other computations, instead of having your application sit idle.
A CPU is only one part of a computer; there are also fans, a motherboard, hard drives, a graphics card, RAM, etc. Let's call all of that the BASE. If you're doing scientific computing (e.g. on a compute cluster) you are powering many computers. If you are powering hundreds of BASEs anyway, why not let those BASEs have multiple physical CPUs on them, so those CPUs can share the BASE's resources, physical and logical?
Now, Intel's marketing blurb probably also depends on the fact that these days each CPU die contains multiple cores. Powering multiple physical CPUs is different from powering a single physical CPU with multiple cores.
So if the amount of work done per unit of power is the benchmark in question, then with modern CPUs performing highly parallel tasks, yes, you get more bang for your buck compared with the previous generation of processors: not only can you get more cores per CPU, it is also common to get BASEs which can take multiple CPUs.
One may easily assert that one top-end system can now house the processing power of 8-16 single-CPU, single-core machines of the past (assuming, in this hypothetical case, that each core on the new system has the same processing power as a core on the older-generation system).
If a program is multithreaded, that doesn't mean it uses more total processor time. It just means that more tasks are dealt with at the same time, so the overall running time is shorter.
There are 3 reasons, two of which have already been pointed out:
More overall time means that other (non-CPU) components need to run longer, even if the net calculation for the CPU remains the same
More threads mean more things are done at the same time (because stalls are used for something useful), so again the overall real time is reduced.
The CPU power consumption for running the same calculations on one core is not the same: Intel CPUs have built-in clock boosting for single-core usage (I forget the marketing buzzword for it). A higher clock means disproportionately more power consumption and disproportionately more heat, which in turn requires the fan to spin faster, too.
So in summary, you consume more power with the CPU and more power for cooling the CPU for a longer time, and you run other components for a longer time, too.
As a 4th reason, one could argue (note that this is only an assumption!) that Intel CPUs are hyper-threaded, and since hyper-threaded cores share some resources, running two threads at once is more efficient than running one thread for twice as long.
