Does the framebuffer contain depth buffer information, or just color buffer information, in graphics applications? And what about GUIs on Windows: is there a framebuffer, and does it hold both color and depth information, or just color?
If you're talking about the kernel-level framebuffer in Linux, it sets both the resolution and the color depth. Here's a list of common framebuffer modes; notice that the modes are determined by both the resolution and the color depth. You can override the framebuffer mode by passing a command-line parameter to the kernel in your bootloader (vga=...).
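For example, with the standard VESA framebuffer modes (assuming your video BIOS supports them), the common 1024x768 variants are:

    vga=773    1024x768,  8 bpp
    vga=791    1024x768, 16 bpp
    vga=792    1024x768, 24 bpp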
Unlike Linux, on Windows the graphics subsystem is a part of the OS. I don't think (and, please, someone correct me if I'm wrong) there is support for non-VGA output devices in the latest Windows, so the framebuffer is deprecated/unavailable there.
As for real-time 3D, the "standard" buffers are the color buffer, in RGBA format with 1 byte per component, and the depth buffer, typically 3 bytes (24 bits) per sample. There is one sample per fragment (i.e., if you have 8x antialiasing, you will have 8 color samples and 8 depth samples per pixel).
Many applications also use additional, custom buffers, usually called g-buffers. These can include: object ID, material ID, albedo, shininess, normals, tangent, binormal, AO factor, the most influential lights affecting the fragment, etc. At 1080p with 4x MSAA and double or triple buffering, this can take a huge amount of memory, so all this information is usually packed as tightly as possible.
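For illustration, here is a minimal sketch of setting up such a g-buffer as an OpenGL framebuffer with a few packed render targets plus a shared depth/stencil buffer. The formats, the packing choices and the use of GLEW are assumptions for the example, not a standard layout:

    /* Hypothetical g-buffer: three packed color attachments plus a shared
       depth/stencil renderbuffer. Formats are illustrative only. */
    #include <GL/glew.h>

    GLuint create_gbuffer(int width, int height)
    {
        GLuint fbo, rt[3], depth;
        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);

        const GLenum fmt[3] = {
            GL_RGBA8,     /* albedo.rgb + AO factor          */
            GL_RGB10_A2,  /* packed normal.xyz + material ID */
            GL_RG16F      /* shininess + object ID           */
        };
        glGenTextures(3, rt);
        for (int i = 0; i < 3; ++i) {
            glBindTexture(GL_TEXTURE_2D, rt[i]);
            glTexImage2D(GL_TEXTURE_2D, 0, fmt[i], width, height, 0,
                         GL_RGBA, GL_UNSIGNED_BYTE, NULL);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
            glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                                   GL_TEXTURE_2D, rt[i], 0);
        }

        /* 24-bit depth + 8-bit stencil, shared by all attachments. */
        glGenRenderbuffers(1, &depth);
        glBindRenderbuffer(GL_RENDERBUFFER, depth);
        glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH24_STENCIL8, width, height);
        glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                                  GL_RENDERBUFFER, depth);

        const GLenum bufs[3] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1,
                                 GL_COLOR_ATTACHMENT2 };
        glDrawBuffers(3, bufs);
        return fbo;
    }

The geometry pass then writes all three attachments at once via glDrawBuffers, and a later pass samples them back as ordinary textures.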
I'm working on a fullscreen Windows desktop application that's moderately graphics-intensive. It uses OpenGL but only renders 2D content; nothing fancy, mostly pushing pixels to the screen (up to 4K, single monitor) and uploading textures. We're using VSync to control the rendering framerate, i.e. calling SwapBuffers() at the end of rendering to block until the next VBlank.
The main requirement we have is that the app runs at a solid 60FPS as it's used with a touchscreen, and interactions need to be as fluid as possible.
Because it's pretty basic, the app runs just fine on an 8th-gen Intel i7 CPU with an integrated Intel HD Graphics 630 GPU. Neither the CPU nor the GPU is anywhere near peak usage, and we can see that we're hitting a comfortable 60FPS through our in-app FPS meter. I also have it running with similar results on my Surface Book 2 with an Intel i7 and integrated Intel UHD Graphics 620 GPU.
However, I've recently started noticing that the app sometimes drops to 30FPS and then stays there, either for long periods of time or sometimes even permanently. Through our FPS meter, I can tell that we're not actually spending any time rendering; it's just our SwapBuffers() call that blocks arbitrarily for two frames, capping us at 30FPS. The only way to get back to 60FPS is to alt-tab to another app and back to ours, or simply to bring up the Windows menu and then go back to the app.
Because the app goes back to 60FPS afterwards, I'm fairly sure this is intended behavior of the Intel driver, probably meant for gaming (gamers prefer a stable 30FPS to irregular, occasionally dropped frames, which make a game look choppy).
In our case, however, dropping an occasional frame isn't a big deal, but being capped at 30FPS makes our UI and interactions far less pleasing to the eye, especially when the app could easily render at a smooth 60FPS instead.
Is there any way to switch the driver behavior to prefer pushing 60FPS with occasional drops rather than capping at 30FPS?
OK, so I was able to figure this out with a little bit of tweaking and reverse engineering. The answer is that yes, this is an intended but unfortunate default behavior of the Intel driver, and it can be fixed via the Intel HD Graphics Control Panel app if available, or directly in the registry otherwise (which is the only way to fix the issue on the Surface Book and other Surface devices, where the custom Intel driver no longer exposes the Intel HD Graphics Control Panel app).
Starting with the simple solution: in the Intel HD Graphics Control Panel app, go to "3D", then "Application Settings". You'll first need to create an application profile by selecting the file on disk for the process that creates the OpenGL window. Once that's done, the setting you want to adjust is "Vertical Sync". By default, "Use Application Default Settings" is selected; this is the setting that causes the capping at 30FPS. Select "Use Driver Settings" instead to disable that behavior and always target 60FPS:
This would have been pretty obvious if it weren't for Intel's confusing choice of terms and documentation. To me the choices look inverted: I would expect the capping to happen when "Use Driver Settings" is selected, which implies the driver is free to adjust buffer swapping as it sees fit. Similarly, "Use Application Default Settings" implies that the app decides when to push frames, which is precisely the opposite of what the setting does. Even the little help bubbles in the app seem to contradict what these settings actually do.
ps: I'll post the registry-based solution in a separate answer to keep it short
Here is the registry-based answer, for when your driver does not expose the Intel HD control panel (such as the driver used on the Surface Book and possibly other Surface laptops), or when you want to apply the fix programmatically via regedit.exe or the Win32 API:
The application profiles created by the Intel HD control panel are saved in the registry under HKCU\Software\Intel\Display\igfxcui\3D using a key with the process file name (e.g. my_game.exe) and a REG_BINARY value with a 536-byte data blob divided like this:
Byte 0-3: Anisotropic Filtering (0 = use app default, 2-10 = multiplier setting)
Byte 4-7: Vertical Sync (0 = use app default, 1 = use driver setting)
Byte 8-11: MSAA (0 = use app default, 1 = force off)
Byte 12-15: CMAA (0 = use app default, 1 = override, 2 = enhance)
Byte 16-535: Application Display Name (wide-chars, for use in the control panel's application list)
Note: all values are stored in little-endian
In addition, you need to make sure that the Global value under the same key has its first byte set to 1; it acts as a sort of global toggle (the control panel sets it to 1 when one or more entries are added to the applications list, then back to 0 when the last entry is deleted from the list).
The Global value is also a REG_BINARY value, with 8 bytes encoded like this:
Byte 0-3: Global toggle for application entries (0 = no entries, 1 = entries)
Byte 4-7: Application Optimal mode (0 = enabled, 1 = disabled)
For example:
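Here is a minimal Win32 sketch that writes both values according to the layout above. It assumes the per-application entry is a value named after the executable directly under the 3D key, and uses placeholder names (my_game.exe, "My Game"); error handling is reduced to a bare minimum:

    #include <windows.h>
    #include <wchar.h>

    int main(void)
    {
        HKEY key;
        if (RegCreateKeyExW(HKEY_CURRENT_USER,
                            L"Software\\Intel\\Display\\igfxcui\\3D",
                            0, NULL, 0, KEY_SET_VALUE, NULL,
                            &key, NULL) != ERROR_SUCCESS)
            return 1;

        /* 536-byte per-application blob, all fields little-endian. */
        BYTE app[536] = {0};
        app[0]  = 0;   /* bytes 0-3:   Anisotropic Filtering = app default  */
        app[4]  = 1;   /* bytes 4-7:   Vertical Sync = use driver setting   */
        app[8]  = 0;   /* bytes 8-11:  MSAA = app default                   */
        app[12] = 0;   /* bytes 12-15: CMAA = app default                   */
        wcscpy((wchar_t *)(app + 16), L"My Game");  /* bytes 16-535: name   */
        RegSetValueExW(key, L"my_game.exe", 0, REG_BINARY, app, sizeof app);

        /* 8-byte Global value: bytes 0-3 = 1 (entries exist),
           bytes 4-7 = 0 (Application Optimal mode enabled). */
        BYTE global[8] = { 1, 0, 0, 0, 0, 0, 0, 0 };
        RegSetValueExW(key, L"Global", 0, REG_BINARY, global, sizeof global);

        RegCloseKey(key);
        return 0;
    }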
I'm developing a high-speed, high-resolution video camera for robotics applications. For various reasons I need to adopt gigabit Ethernet (1GbE) or 10GbE to interface my cameras to PCs; either that, or I'll need to develop my own PCIe card, which I'd prefer not to do (more work, plus then I'd have to write drivers).
I have two questions that I am not certain about after reading the Linux documentation.
#1: My desired Ethernet frame is:
8-byte interpacket pad + sync byte
6-byte MAC address (destination)
6-byte MAC address (source)
2-byte packet length (varies 6KB to 9KB depending on lossless compression)
n-byte image data (number of bytes specified in previous 2-byte field)
4-byte CRC32
The question is: will Linux accept this frame if the application tells Linux to expect AF_PACKET sockets (assuming applications can tell Linux this)? It is acceptable if the application that controls the camera (sends packets to it) and receives the image data in packets must run with root privileges.
#2: Which will be faster:
A: Linux sockets with the AF_PACKET protocol family
B: a libpcap-based application
Speed is crucial, because packets will arrive with little spacing between them, since each packet contains one horizontal row of pixels in my own lossless compression format (unless I can find a better algorithm that can also be implemented in the FPGA at real-time speeds). There will be a pause between frames, but only after 1200 or more horizontal rows (Ethernet frame packets).
Because the application is robotics, each horizontal row will be immediately decompressed and stored in a simple packed array of RGBA pixels, just like OpenGL accepts as textures, so the robotics software can inspect each image as it arrives row by row and possibly react as quickly as inhumanly possible.
The data for the first RGBA pixel in each row immediately follows the last RGBA pixel in the previous row, so at the end of the last horizontal row of pixels the image is complete and ready to transfer to GPUs and/or save to disk. Each horizontal row will be a multiple of 16 pixels, so no "padding" is required.
NOTE: The camera must be directly plugged into the RJ45 jack without routers or other devices between camera and PC.
I think you will have to change your Ethernet frame format to use the first two bytes after the source and destination MACs as a type, not a length. Old-style lengths must be less than 1536; anything greater is treated as an IEEE EtherType field instead. Since you want 6KB or more, there's a chance the receiving Ethernet chip or the Linux packet handler will discard your frames because they're badly formed.
As for performance, the golden rule is: measure, don't guess. Pick the one that is simplest to program and try it.
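If you do try the AF_PACKET route, a minimal receive sketch looks roughly like this (the EtherType 0x88B5 is just one of the IEEE local-experimental values used as a placeholder, "eth0" stands in for whatever interface the camera is plugged into, and it needs root or CAP_NET_RAW):

    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define CAMERA_ETHERTYPE 0x88B5   /* IEEE local-experimental EtherType */

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(CAMERA_ETHERTYPE));
        if (fd < 0) { perror("socket"); return 1; }

        /* Bind to the NIC the camera is plugged into. */
        struct sockaddr_ll addr = {0};
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(CAMERA_ETHERTYPE);
        addr.sll_ifindex  = if_nametoindex("eth0");
        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("bind"); return 1;
        }

        unsigned char frame[9200];   /* room for a 9KB jumbo frame */
        for (;;) {
            ssize_t n = recv(fd, frame, sizeof frame, 0);
            if (n < 14) continue;    /* shorter than an Ethernet header */
            /* frame[0..5] = dest MAC, frame[6..11] = src MAC,
               frame[12..13] = EtherType, frame[14..] = image row payload */
            printf("got %zd-byte frame\n", n);
        }
        close(fd);
        return 0;
    }

Note that the NIC normally strips the trailing CRC before the frame reaches user space, and the payload length simply falls out of recv()'s return value, so the in-frame length field becomes redundant once you switch to an EtherType.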
Hope this helps.
Many embedded/mobile GPUs provide access to performance registers called Pixel Write Speed and Texel Write Speed. Could you explain how those terms can be interpreted and defined from the actual GPU hardware point of view?
I would assume the difference between a pixel and a texel is pretty clear to you. Anyway, just to make this answer a little more "universal":
A pixel is the fundamental unit of screen space.
A texel, or texture element (also texture pixel) is the fundamental unit of texture space.
Textures are represented by arrays of texels, just as pictures are represented by arrays of pixels. When texturing a 3D surface (a process known as texture mapping), the renderer maps texels to appropriate pixels in the output picture.
BTW, it is more common to use the term fill rate instead of write speed, and you can easily find all the required information, since this terminology is quite old and widely used.
Answering your question
All fill-rate numbers (whatever definition is used) are expressed in Mpixels/sec or Mtexels/sec.

Well, the original idea behind fill rate was the number of finished pixels written to the frame buffer. This fits with the definition of theoretical peak fill rate, so in the good old days it made sense to express that number in Mpixels.

However, with the second generation of 3D accelerators a new feature was added: the ability to render to an off-screen surface and use that as a texture in the next frame. So the values written to the buffer are not necessarily on-screen pixels anymore; they might be texels of a texture. This allows several cool special effects: imagine rendering a room and storing that picture of the room as a texture. Now you don't show the picture directly, but you use it as a texture for a mirror or even a reflection map.

Another reason to use Mtexels is that games started using several layers of multi-texture effects, meaning an on-screen pixel is constructed from various sub-pixels that end up being blended together to form the final pixel. So it makes more sense to express the fill rate in terms of these sub-results, and you could refer to them as texels.
Read the whole article - Fill Rate Explained
Additional details can be found here - Texture Fill Rate
Update
Texture Fill Rate = (number of TMUs, i.e. texture mapping units) x (core clock)
The number of textured pixels the card can render to the screen every second.
It is obvious that the card with more TMUs will be faster at processing texture information.
The performance registers/counters Pixel Write Speed and Texel Write Speed maintain statistics about the pixels and texels processed/written. I will explain the peak (maximum possible) fill rates.
Pixel Rate
A picture element (pixel) is a physical point in a raster image, the smallest element of a display device's screen.
Pixel rate is the maximum amount of pixels the GPU could possibly write to the local memory in one second, measured in millions of pixels per second. The actual pixel output rate also depends on quite a few other factors, most notably the memory bandwidth - the lower the memory bandwidth is, the lower the ability to get to the maximum fill rate.
The pixel rate is calculated by multiplying the number of ROPs (Raster Operations Pipelines, aka Render Output Units) by the core clock speed.
Render Output Units: the pixel pipelines take pixel and texel information and process it, via specific matrix and vector operations, into a final pixel or depth value. The ROPs perform the transactions between the relevant buffers in local memory.
Importance: the higher the pixel rate, the higher the screen resolution the GPU can sustain.
Texel Rate
A texture element is the fundamental unit of texture space (a tile of a 3D object's surface).
The texel rate is the maximum number of texture map elements (texels) that can be processed per second, measured in millions of texels per second. It is calculated by multiplying the total number of texture units by the core clock speed of the chip.
Texture Mapping Units: textures need to be addressed and filtered. This job is done by TMUs, which work in conjunction with pixel and vertex shader units. It is the TMU's job to apply texture operations to pixels.
Importance: the higher the texel rate, the faster the card processes texture information, so demanding games render more fluently.
Example: I'm not an NVIDIA fan, but here are the specs of a GeForce GTX 680 (I could not find much for embedded GPUs):
Model: GeForce GTX 680
Memory: 2048 MB
Core Speed: 1006 MHz
Shader Speed: 1006 MHz
Memory Speed: 1502 MHz (6008 MHz effective)
Unified Shaders: 1536
Texture Mapping Units: 128
Render Output Units: 32
Bandwidth: 192,256 MB/sec
Texel Rate: 128,768 Mtexels/sec
Pixel Rate: 32,192 Mpixels/sec
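Those last two rows follow directly from the formulas above (theoretical peaks; real workloads will land lower):

    Texel rate = 128 TMUs x 1006 MHz = 128,768 Mtexels/sec
    Pixel rate =  32 ROPs x 1006 MHz =  32,192 Mpixels/sec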
I'm looking into rendering frames at a high rate (ideally close to the monitor's maximum refresh rate) and I was wondering if anyone had an idea of what level I should start looking at: kernel/driver level (OS space)? X11 level? svgalib (user space)?
On a modern computer, you can do this using the ordinary tools and APIs for graphics. If you have full frames of arbitrary pixel data ready in memory, a simple bit blit from an in-memory buffer will perform more than adequately. Without any optimization work, I found that I could push more than 500 frames per second on Windows XP using 2008-era PCs.
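For instance, a bare-bones GDI version of that kind of blit might look like this (the function name and the top-down 32-bit BGRX pixel buffer are assumptions, not from the original setup):

    #include <windows.h>

    /* Push one full frame of 32-bit pixels from system memory to a window DC. */
    void present_frame(HDC hdc, const void *pixels, int width, int height)
    {
        BITMAPINFO bmi = {0};
        bmi.bmiHeader.biSize        = sizeof(bmi.bmiHeader);
        bmi.bmiHeader.biWidth       = width;
        bmi.bmiHeader.biHeight      = -height;   /* negative = top-down rows */
        bmi.bmiHeader.biPlanes      = 1;
        bmi.bmiHeader.biBitCount    = 32;
        bmi.bmiHeader.biCompression = BI_RGB;

        StretchDIBits(hdc, 0, 0, width, height,  /* destination rectangle */
                      0, 0, width, height,       /* source rectangle      */
                      pixels, &bmi, DIB_RGB_COLORS, SRCCOPY);
    }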
I am using OpenGL, FFmpeg and SDL to play videos and am currently optimizing the process of getting frames, decoding them, converting them from YUV to RGB, uploading them to a texture and displaying the texture on a quad. Each of these stages is performed by a separate thread, and they write to shared buffers which are controlled by SDL mutexes and condition variables (except for the upload and display of the textures, as they need to be in the same context).
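For reference, one of those shared buffers looks roughly like the sketch below: a fixed-size ring guarded by an SDL mutex/condition pair (the names and queue size are made up for illustration, not taken from my actual code):

    #include <SDL.h>

    #define QUEUE_SIZE 8

    typedef struct {
        void      *frames[QUEUE_SIZE];   /* converted RGB frames */
        int        head, tail, count;
        SDL_mutex *lock;
        SDL_cond  *not_empty;
        SDL_cond  *not_full;
    } FrameQueue;

    void queue_push(FrameQueue *q, void *frame)     /* convert thread */
    {
        SDL_LockMutex(q->lock);
        while (q->count == QUEUE_SIZE)
            SDL_CondWait(q->not_full, q->lock);
        q->frames[q->tail] = frame;
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        q->count++;
        SDL_CondSignal(q->not_empty);
        SDL_UnlockMutex(q->lock);
    }

    void *queue_pop(FrameQueue *q)                  /* GL/upload thread */
    {
        SDL_LockMutex(q->lock);
        while (q->count == 0)
            SDL_CondWait(q->not_empty, q->lock);
        void *frame = q->frames[q->head];
        q->head = (q->head + 1) % QUEUE_SIZE;
        q->count--;
        SDL_CondSignal(q->not_full);
        SDL_UnlockMutex(q->lock);
        return frame;
    }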
I have the player working fine with the decode, convert and OpenGL contexts on separate threads, but realised that because the video is 25 frames per second, I only take a converted frame from the buffer, upload it to OpenGL and bind/display it every 40 milliseconds in the OpenGL thread. The render loop goes round about 6-10 times without showing the next frame for every frame it does show, due to this 40ms gap.
Therefore I decided it might be a good idea to have a buffer for the textures too, and set up an array of textures created and initialised with glGenTextures() and the glTexParameter settings I needed, etc.
When it hasn't been 40ms since the last frame refresh, a method runs which grabs the next converted frame from the convert buffer and uploads it to the next free texture in the texture buffer by binding it and then calling glTexSubImage2D(). When it has been 40ms since the last frame refresh, a separate method runs which grabs the next GLuint texture from the texture buffer and binds it with glBindTexture(). So effectively, I am just splitting up what was being done before (grab from convert buffer, upload, display) into separate methods (grab from convert buffer, upload to texture buffer | and | grab from texture buffer, display) to make use of the wasted time between 40ms refreshes.
Does this sound reasonable? When run, the video halts all the time in a sporadic manner: sometimes about 4 frames are played when they are supposed to be (every 40ms), but then there is a 2-second gap, then 1 frame is shown, then a 3-second gap, and the video is totally unwatchable.
The code is nearly identical to how I manage the convert thread grabbing decoded frames from the decode buffer, converting them from YUV to RGB and then putting them into the convert buffer, so I can't see where the massive bottleneck could be.
Could the bottleneck be on the OpenGL side of things? Is the fact that I am storing new image data in 10 different textures the problem, since when a new texture is grabbed from the texture buffer, its data could be a million miles away from the last one in terms of location in video memory? That's my only attempt at an answer, but I don't know much about how OpenGL works internally, so that's why I am posting here.
Anybody have any ideas?
I'm no OpenGL expert, but my guess is that the bottleneck comes from the textures being initialized properly in system memory but sent to video memory at the 'wrong' time (like all at once instead of as soon as possible), stalling the pipeline. When using glTexSubImage2D you have no guarantees about when a texture arrives in video memory until you bind it.
Googling around, it seems pixel buffer objects give you more control over when textures actually end up in video memory: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=262523
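To give an idea of the mechanics, here is a rough sketch of streaming the upload through a pixel buffer object; it assumes an existing GL context, a GL_RGBA texture of the frame size, and a loader such as GLEW, and is only meant as a starting point, not your exact fix:

    #include <GL/glew.h>
    #include <string.h>

    #define WIDTH  1920          /* frame size: placeholder values */
    #define HEIGHT 1080

    static GLuint pbo;           /* created once, reused every frame */

    void init_upload_pbo(void)
    {
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, WIDTH * HEIGHT * 4,
                     NULL, GL_STREAM_DRAW);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }

    void upload_frame(GLuint tex, const void *rgba)
    {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        /* Orphan the old storage so we never stall on a previous transfer. */
        glBufferData(GL_PIXEL_UNPACK_BUFFER, WIDTH * HEIGHT * 4,
                     NULL, GL_STREAM_DRAW);
        void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
        if (dst) {
            memcpy(dst, rgba, WIDTH * HEIGHT * 4);
            glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        }
        glBindTexture(GL_TEXTURE_2D, tex);
        /* With a PBO bound, the data pointer is an offset into the buffer,
           so this call returns quickly and the transfer runs asynchronously. */
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT,
                        GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }

The glBufferData call with a NULL pointer "orphans" the previous storage, so the driver can hand back fresh memory instead of making the CPU wait for the last transfer to finish.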