While reading the VideoCoreIV-AG100-R spec of the Broadcom VC4 chip, I came across this paragraph:
All rendering by the 3D system is in tiles, requiring separate binning and rendering passes to render a frame. In
normal operation the host processor creates a control list in memory defining all the operations and supplying
all the data for rendering for a complete frame.
It says that rendering a frame requires a binning pass and a rendering pass. Could anybody explain in detail what roles these two passes play in the graphics pipeline? Thanks a lot.
For a tile-based rendering architecture the passes are:
Binning pass - generates a stream/map between the frame's tiles and the geometry that should be rendered into each particular tile
Rendering pass - takes the map between tiles and geometry and renders the appropriate pixels per tile
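To make the split concrete, here is a minimal, self-contained C sketch of the two passes. It bins the 2D screen-space bounding boxes of primitives into 64x64-pixel tiles and then "renders" each tile by walking only the primitives binned to it; all names, sizes and the bounding-box representation are illustrative assumptions, not the VC4 control-list format.

#include <stdio.h>

#define SCREEN_W  256
#define SCREEN_H  256
#define TILE      64
#define TILES_X   (SCREEN_W / TILE)
#define TILES_Y   (SCREEN_H / TILE)
#define MAX_PRIMS 16

typedef struct { int x0, y0, x1, y1; } Prim;       /* screen-space bounding box */

static int tileList[TILES_Y][TILES_X][MAX_PRIMS];  /* the per-tile "map" */
static int tileCount[TILES_Y][TILES_X];

int main(void)
{
    Prim prims[2] = { { 10, 10, 100, 90 }, { 120, 30, 250, 200 } };
    int primCount = 2;

    /* Binning pass: record, per tile, which primitives overlap it. */
    for (int p = 0; p < primCount; p++)
        for (int ty = prims[p].y0 / TILE; ty <= prims[p].y1 / TILE; ty++)
            for (int tx = prims[p].x0 / TILE; tx <= prims[p].x1 / TILE; tx++)
                tileList[ty][tx][tileCount[ty][tx]++] = p;

    /* Rendering pass: each tile only touches its own list, so per-pixel work
     * can stay in small on-chip tile memory and is written out to the
     * framebuffer once per tile. Here we just print the lists. */
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++) {
            printf("tile (%d,%d): %d primitive(s)\n", tx, ty, tileCount[ty][tx]);
            /* ...rasterize tileList[ty][tx] into tile-local memory here... */
        }
    return 0;
}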
In mobile GPUs, because of many limitations compared to desktop GPUs (such as memory bandwidth, since in mobile devices memory is shared between the GPU and CPU, etc.), vendors use approaches that split the work into small pieces to reduce overall memory bandwidth consumption - for example tile-based rendering - in order to achieve efficient utilization of the available resources and acceptable performance.
Details
The tile-based rendering approach is described on many GPU vendors' sites, such as:
A look at the PowerVR graphics architecture: Tile-based rendering
GPU Framebuffer Memory: Understanding Tiling
Related
I am trying to find out how much texture memory is consumed by my application. These are the texture types I use and my calculations:
RGB textures -> textureWidth * textureHeight * 3 (memory usage)
RGBA textures -> textureWidth * textureHeight * 4 (memory usage)
As a result, I am wondering: can the graphics driver allocate much more memory than calculated above?
A few simple answers:
To the best of my knowledge, it's been around 2 decades since (the majority of) hardware devices supported packed 24bit RGB data. In modern hardware this is usually represented in an "XRGB" (or equivalent) format where there is one padding byte per pixel. It is painful in hardware to efficiently handle pixels that straddle cache lines etc. Further, since many applications (read "games") use texture compression, having support for fully packed 24bit seems a bit redundant.
Texture dimensions: If a texture's dimensions are not 'nice' for the particular hardware (e.g., a row is not a multiple of, say, 16 bytes, or the texture does not map evenly to, say, 4x4 or 8x8 blocks), then the driver may pad the physical size of the texture.
Finally, if you have MIP mapping (and you do want this for performance as well as quality reasons), it will expand the texture size by around 33%.
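Putting those points together, here is a rough, hedged C sketch of how one might estimate the real allocation; the XRGB expansion, the 16-byte row alignment and the ~33% mip overhead come from the answer above, while the exact padding rules of any given driver will differ.

#include <stddef.h>

size_t estimate_texture_bytes(size_t width, size_t height, int hasMips)
{
    /* Packed 24-bit RGB is usually stored as XRGB: 4 bytes per pixel,
     * the same as RGBA. */
    size_t bytesPerPixel = 4;

    /* Assume each row is padded to a 16-byte boundary (illustrative). */
    size_t rowBytes = width * bytesPerPixel;
    rowBytes = (rowBytes + 15) & ~(size_t)15;

    size_t total = rowBytes * height;

    /* A full mip chain adds roughly one third on top. */
    if (hasMips)
        total += total / 3;

    return total;
}

So a 1024x1024 "RGB" texture with mipmaps would come out at roughly 1024 * 1024 * 4 * 4/3 ≈ 5.6 MB rather than the roughly 3 MB the simple width * height * 3 formula suggests.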
In addition to the answer from Simon F, it's also worth noting that badly written applications can force the driver to allocate memory for multiple copies of the same texture. This can occur if it attempts to modify the texture while it is still referenced by an in-flight rendering operation. This is commonly known as "resource copy-on-write" or "resource ghosting".
This blog post explains it in more detail:
https://community.arm.com/developer/tools-software/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources
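A hedged sketch of one common way to avoid ghosting: instead of overwriting a single texture that may still be referenced by in-flight frames, cycle through a small ring of textures and always update the oldest one. The ring size of 3 and the helper names are assumptions; only the OpenGL calls themselves are real API.

#include <GL/gl.h>

#define RING 3

static GLuint texRing[RING];   /* created elsewhere with glGenTextures/glTexImage2D */
static int frame;

void update_dynamic_texture(int width, int height, const void *pixels)
{
    GLuint tex = texRing[frame % RING];        /* oldest texture in the ring */

    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    /* Draw this frame using 'tex'; the textures used by the previous couple
     * of frames are left untouched, so the driver has no reason to make a
     * ghost copy behind your back. */
    frame++;
}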
Can a rendering job and a number-crunching job (e.g. in OpenCL) effectively share the same single GPU? For example:
thread A runs an OpenCL task to generate an image
then, when the image is ready, thread A notifies another thread B (the image is ready) and continues with the next image calculation
thread B starts some pre-display work on the given image (like overlay calculation with GDI), combines the final image and renders it to the display
Can this kind of GPU resource sharing yield a performance improvement or, on the contrary, will it cause an overall slowdown of the compute and rendering tasks?
Thanks
There are many factors involved here, but generally you shouldn't see a slowdown.
Problems with directly answering your question:
OpenCL could be using your CPU as well, depending on how you set it up
Your gfx stuff could be done mostly on the CPU or on a different part of the GPU, depending on what you display; for example, many GDI implementations render using the CPU and only use very simple 2D acceleration techniques on the GPU, mostly to blit the final composed image.
It might depend on the GPU, GPU driver, graphics stack etc. you use.
As most of the time, you will get the best answer by trying it out, or at least by benchmarking the different parts. After all, you won't really get much of a benefit if your computations or the image rendering part are too simple.
Also, you might try going even further and rendering the result with shaders or the like - in that case you could avoid having to move the data back from GPU memory to main memory, which could - depending on your circumstances - also give you a speed boost.
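If you go the route of keeping the data on the GPU, OpenCL's GL sharing extension is the usual mechanism. Below is a hedged C sketch using the OpenCL 1.2-style API; it assumes a GL texture and a CL context created with GL sharing enabled elsewhere, and every identifier other than the CL/GL calls themselves (ctx, queue, kernel, glTex, ...) is a placeholder.

#include <GL/gl.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>

void run_kernel_on_gl_texture(cl_context ctx, cl_command_queue queue,
                              cl_kernel kernel, GLuint glTex,
                              size_t width, size_t height)
{
    cl_int err;

    /* Wrap the existing GL texture as a CL image; no copy through host memory. */
    cl_mem clImage = clCreateFromGLTexture(ctx, CL_MEM_WRITE_ONLY,
                                           GL_TEXTURE_2D, 0, glTex, &err);

    /* GL must be done with the texture before CL touches it. */
    glFinish();
    clEnqueueAcquireGLObjects(queue, 1, &clImage, 0, NULL, NULL);

    /* Let the OpenCL kernel write the image directly. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &clImage);
    size_t global[2] = { width, height };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

    /* Hand the texture back to GL; it can now be drawn without a readback. */
    clEnqueueReleaseGLObjects(queue, 1, &clImage, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(clImage);
}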
If the data/crunching ratio is big and you also have to send data from the CPU to the GPU:
crunch ---> crunch ---> render
GPU th0     : crunch for (t-1)   crunch for (t)      rendering
CPU th1     : send data for t    send data for t+1   send data for t+2
CPU th2     : get data of t-2    get data of t-1     get data of t
CPU th3-th7 : other things independent of crunching or rendering
At the same time: crunching & comm.   crunching & comm.   rendering & comm.
                  and other things    and other things    and other things
If the data/crunching ratio is big and you don't have to send data from the CPU to the GPU:
use the interoperability features of CL (example: CL-GL interop, as sketched in the previous answer)
If the data/crunching ratio is small:
you should not see any slowdown.
Medium data/crunching ratio: crunch ---> render ---> crunch ---> render
GPU th0     : crunch for (t)    rendering           crunch for (t+1)   render again, and keep cycling like this
CPU th1     : get data of (t-1) send data for t+1   get data of (t)
CPU th2-th7 : other things independent of crunching or rendering
At the same time: crunching & getting   rendering & sending   crunching & getting
                  and other things      and other things      and other things
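A hedged C sketch of how that overlap might be expressed on the host side with two OpenCL command queues (one for transfers, one for kernels) and events; all names except the OpenCL API calls (ctx, dev, kernel, bufIn, bufOut, ...) are placeholders, and whether the driver actually overlaps the work depends on the hardware.

#include <CL/cl.h>

void crunch_one_frame(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                      cl_mem bufIn, cl_mem bufOut, void *hostIn, void *hostOut,
                      size_t bytes, size_t globalSize)
{
    /* Separate queues so the driver is free to overlap copies with compute. */
    cl_command_queue xferQ = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue compQ = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_event uploaded, crunched;

    /* Non-blocking upload of this frame's input data. */
    clEnqueueWriteBuffer(xferQ, bufIn, CL_FALSE, 0, bytes, hostIn,
                         0, NULL, &uploaded);

    /* The kernel waits only on its own upload, not on unrelated work. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufIn);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufOut);
    clEnqueueNDRangeKernel(compQ, kernel, 1, NULL, &globalSize, NULL,
                           1, &uploaded, &crunched);

    /* Non-blocking readback; a render thread could wait on this instead. */
    clEnqueueReadBuffer(xferQ, bufOut, CL_FALSE, 0, bytes, hostOut,
                        1, &crunched, NULL);

    clFinish(xferQ);
    clFinish(compQ);
    clReleaseEvent(uploaded);
    clReleaseEvent(crunched);
    clReleaseCommandQueue(xferQ);
    clReleaseCommandQueue(compQ);
}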
Imagine a big procedural world that amounts to more than 300k triangles (about 5-10k per 8x8x8 m chunk). Since it's procedural, I need to create all the data in code. I've managed to make a good normal-smoothing algorithm, and now I'm going to use textures (everyone needs textures; you don't want to walk around a simply-colored world, do you?). The problem is that I don't know where it's better to calculate the UV sets (I'm using triplanar texture mapping).
I have three approaches:
1. Do the calculations on the CPU, then upload to the GPU (the mesh is not modified every frame - 90% of the time it stays static - and the calculation is done per chunk when a chunk changes);
2. Do the calculations on the GPU in the vertex shader (the calculation is done for every triangle in every frame, and the meshes are kind of big - is it expensive to do this every frame?);
3. Move the algorithm to OpenCL (the calculation is done per chunk when a chunk changes, and I already use OpenCL to do the meshing of my data) and call the kernel when the mesh changes (inexpensive, but all my OpenCL experience is based on modifying existing code; still, I have some C background, so it may take a long time before I get it to work).
Which approach is better, considering that my little experiment is already somewhat heavy for mid-range hardware?
I'm using C# (.NET 4.0) with SlimDX and Cloo.
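For reference, the core of triplanar mapping is just picking (or blending) a projection by the dominant axis of the normal. The question uses C#/Cloo, but here is a minimal language-neutral C sketch of the per-vertex CPU version (approach 1); the hard dominant-axis switch instead of a smooth blend, the struct names and the texScale parameter are all simplifying assumptions.

#include <math.h>

typedef struct { float x, y, z; } Vec3;
typedef struct { float u, v; } UV;

UV triplanar_uv(Vec3 pos, Vec3 normal, float texScale)
{
    float ax = fabsf(normal.x), ay = fabsf(normal.y), az = fabsf(normal.z);
    UV uv;

    if (ax >= ay && ax >= az) {        /* dominant X: project onto the YZ plane */
        uv.u = pos.z * texScale;
        uv.v = pos.y * texScale;
    } else if (ay >= az) {             /* dominant Y: project onto the XZ plane */
        uv.u = pos.x * texScale;
        uv.v = pos.z * texScale;
    } else {                           /* dominant Z: project onto the XY plane */
        uv.u = pos.x * texScale;
        uv.v = pos.y * texScale;
    }
    return uv;
}

The same few lines port almost directly to a vertex shader (approach 2) or an OpenCL kernel (approach 3), so the choice is mostly about where the cost of recomputation hurts least.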
I recently upgraded from a GTX 480 to a GTX 680 in the hope that the tripled number of cores would manifest as significant performance gains in my CUDA code. To my horror, I've discovered that my memory-intensive CUDA kernels run 30%-50% slower on the GTX 680.
I realize that this is not strictly a programming question, but it does directly impact the performance of CUDA kernels on different devices. Can anyone provide some insight into the specifications of CUDA devices and how they can be used to deduce their performance on CUDA C kernels?
Not exactly an answer to your question, but some information that might be of help in understanding the performance of the GK104 (Kepler, GTX680) vs. the GF110 (Fermi, GTX580):
On Fermi, the cores run at double the frequency of the rest of the logic. On Kepler, they run at the same frequency. That effectively halves the number of cores on Kepler if one wants to make more of an apples-to-apples comparison with Fermi. So that leaves the GK104 (Kepler) with 1536 / 2 = 768 "Fermi-equivalent cores", which is only 50% more than the 512 cores of the GF110 (Fermi).
Looking at the transistor counts, the GF110 has 3 billion transistors while the GK104 has 3.5 billion. So, even though the Kepler has 3 times as many cores, it only has slightly more transistors. So now, not only does the Kepler have only 50% more "Fermi equivalent cores" than Fermi, but each of those cores must be much simpler than the ones of Fermi.
So, those two issues probably explain why many projects see a slowdown when porting to Kepler.
Further, the GK104, being a version of Kepler made for graphics cards, has been tuned in such a way that cooperation between threads is slower than on Fermi (as such cooperation is not as important for graphics). Any potential performance gain, after taking the above facts into account, may be negated by this.
There is also the issue of double precision floating point performance. The version of GF110 used in Tesla cards can do double precision floating point at 1/2 the performance of single precision. When the chip is used in graphics cards, the double precision performance is artificially limited to 1/8 of single precision performance, but this is still much better than the 1/24 double precision performance of GK104.
One of the advances of the new Kepler architecture is its 1536 cores grouped into 8 SMXes of 192 cores each, but at the same time this number of cores is a big problem, because shared memory is still limited to 48 KB. So if your application needs a lot of SMX resources, you can't execute 4 warps in parallel on a single SMX. You can profile your code to find the real occupancy of your GPU. Possible ways to improve your application:
use warp vote functions instead of shared-memory communication;
increase the number of thread blocks and decrease the number of threads per block;
optimize global loads/stores. Kepler has 32 load/store units per SMX (twice as many as on Fermi).
I am installing nvieuw and I use coolbits 2.0 to unlock the shader cores from default to max performance. Also, you must have both connectors of your device going to one display, which can be enabled in the nVidia control panel under screen 1/2 and screen 2/2. Now you must clone this screen with the other, and in the Windows resolution config set the screen mode to extended desktop.
With nVidia Inspector 1.9 (BIOS-level drivers), you can activate this mode by setting up a profile for the application (you need to add the application's exe file to the profile). Now you have almost double the performance (keep an eye on the temperature).
DX11 also features tessellation, so you want to override that and scale your native resolution.
Your native resolution can be achieved by rendering at a lower one like 960x540 and letting the 3D pipelines do the rest to scale up to full HD (in the nv control panel under desktop size and position). Now scale the lower resolution to full screen with the display, and you have full HD with double the amount of texture size being rendered on the fly, and everything should be good for rendering 3D textures with extreme LOD bias (level of detail). Your display needs to be on auto zoom!
Also, you can beat SLI-configured computers. This way I get higher scores than 3-way SLI in Tessmark. High AA settings like 32X mixed sample make everything look like HD in AAA quality (in the Tessmark and Heaven benchmarks). There is no resolution setting in the end score, so that shows it's not important that you render at your native resolution!
This should give you some real results, so please read it thoughtfully, not literally.
I think the problem may lie in the number of Streaming Multiprocessors: The GTX 480 has 15 SMs, the GTX 680 only 8.
The number of SMs is important, since at most 8/16 blocks or 1536/2048 threads (compute capability 2.0/3.0) can reside on a single SM. The resources they share, e.g. shared memory and registers, can further limit the number of blocks per SM. Also, the higher number of cores per SM on the GTX 680 can only reasonably be exploited using instruction-level parallelism, i.e. by pipelining several independent operations.
To find out the number of blocks you can run concurrently per SM, you can use nVidia's CUDA Occupancy Calculator spreadsheet. To see the amount of shared memory and registers required by your kernel, add -Xptxas -v to the nvcc command line when compiling.
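To give a concrete (illustrative) example of those limits: with 256-thread blocks, a compute capability 2.0 SM can hold 1536 / 256 = 6 blocks, so the 15 SMs of a GTX 480 can keep up to 90 blocks resident, while a compute capability 3.0 SM can hold 2048 / 256 = 8 blocks, so the 8 SMs of a GTX 680 top out at 64 resident blocks, assuming shared memory and register usage don't become the limit first. A typical compile line for inspecting those per-kernel resources (mykernel.cu is a placeholder file name) would be:

nvcc -arch=sm_30 -Xptxas -v mykernel.cu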
OK, I can find simulation designs for simple architectures (Edit: definitely not something like x86). For example, use an int as the program counter, use a byte array as the memory, and so on. But how can I simulate a graphics card's functionality (the simplest graphics card imaginable)?
For example, use an array to represent each pixel and "paint" each pixel one by one.
But when should painting happen - synchronized with the CPU or asynchronously? Who stores the graphics data in that array? Is there an instruction for storing a pixel and for painting a pixel?
Please consider that all the question marks ('?') don't mean "you are asking a lot of questions" but spell out the problem itself: how to simulate a graphics card?
Edit : LINK to a basic implementation design for CPU+Memory simulation
Graphic cards typically carry a number of KBs or MBs of memory that stores colors of individual pixels that are then displayed on the screen. The card scans this memory a number of times per second turning the numeric representation of pixel colors into video signals (analog or digital) that the display understands and visualizes.
The CPU has access to this memory, and whenever it changes it, the card eventually translates the new color data into the appropriate video signals and the display shows the updated picture. The card does all the processing asynchronously and doesn't need much help from the CPU. From the CPU's point of view it's pretty much "write the new pixel color into the graphics card's memory at the location corresponding to the coordinates of the pixel and forget about it". It may be a little more complex in reality (due to poor synchronization artifacts such as tearing, snow and the like), but that's the gist of it.
When you simulate a graphics card, you need to somehow mirror the memory of the simulated card in the physical graphics card's memory. If the OS gives you direct access to the physical graphics card's memory, it's an easy task. Simply implement writing to the memory of your emulated computer something like this:
void MemoryWrite(unsigned long Address, unsigned char Value)
{
    /* Writes that land inside the simulated card's video memory are mirrored
       into the physical card's framebuffer so they become visible on screen. */
    if ((Address >= SimulatedCardVideoMemoryStart) &&
        (Address - SimulatedCardVideoMemoryStart < SimulatedCardVideoMemorySize))
    {
        PhysicalCard[Address - SimulatedCardVideoMemoryStart] = Value;
    }
    /* Every write also goes into the emulated computer's ordinary memory. */
    EmulatedComputerMemory[Address] = Value;
}
The above, of course, assumes that the simulated card has exactly the same resolution (say, 1024x768) and pixel representation (say, 3 bytes per pixel, first byte for red, second for green and third for blue) as the physical card. In real life things can be slightly more complex, but again, that's the general idea.
You can access the physical card's memory directly in MSDOS or on a bare x86 PC without any OS if you make your code bootable by the PC BIOS and limit it to using only the BIOS service functions (interrupts) and direct hardware access for all the other PC devices.
Btw, it will probably be very easy to implement your emulator as a DOS program and run it either directly in Windows XP (Vista and 7 have extremely limited support for DOS apps in 32-bit editions and none in 64-bit editions; you may, however, install XP Mode, which is XP in a VM in 7) or better yet in something like DOSBox, which appears to be available for multiple OSes.
If you implement the thing as a Windows program, you will have to use either GDI or DirectX in order to draw something on the screen. Unless I'm mistaken, neither of these two options lets you access the physical card's memory directly such that changes in it would be automatically displayed.
Drawing individual pixels on the screen using GDI or DirectX may be expensive if there's a lot of rendering. Redrawing all simulated card's pixels every time when one of them gets changed amounts to the same performance problem. The best solution is probably to update the screen 25-50 times a second and update only the parts that have changed since the last redraw. Subdivide the simulated card's buffer into smaller buffers representing rectangular areas of, say, 64x64 pixels, mark these buffers as "dirty" whenever the emulator writes to them and mark them as "clean" when they've been drawn on the screen. You may set up a periodic timer driving screen redraws and do them in a separate thread. You should be able to do something similar to this in Linux, but I don't know much about graphics programming there.
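A hedged C sketch of that dirty-region scheme: the simulated framebuffer is split into 64x64-pixel blocks, every write marks its block dirty, and a periodic redraw pushes only dirty blocks to the screen. The resolution, block size and the BlitBlockToScreen call are placeholders; the blit itself would be implemented with GDI or DirectX.

#define FB_W     1024
#define FB_H     768
#define BLOCK    64
#define BLOCKS_X (FB_W / BLOCK)
#define BLOCKS_Y (FB_H / BLOCK)

static unsigned char Framebuffer[FB_W * FB_H * 3];  /* simulated card memory */
static unsigned char Dirty[BLOCKS_Y][BLOCKS_X];

/* Placeholder: copies one block from the framebuffer to the real screen,
   e.g. via GDI's SetDIBitsToDevice or a DirectX texture upload. */
void BlitBlockToScreen(int x, int y, int w, int h, const unsigned char *fb);

void WritePixel(int x, int y,
                unsigned char r, unsigned char g, unsigned char b)
{
    unsigned char *p = &Framebuffer[(y * FB_W + x) * 3];
    p[0] = r; p[1] = g; p[2] = b;
    Dirty[y / BLOCK][x / BLOCK] = 1;                /* mark the block dirty */
}

void RedrawTimerTick(void)                          /* called 25-50 times/sec */
{
    for (int by = 0; by < BLOCKS_Y; by++)
        for (int bx = 0; bx < BLOCKS_X; bx++)
            if (Dirty[by][bx]) {
                BlitBlockToScreen(bx * BLOCK, by * BLOCK, BLOCK, BLOCK,
                                  Framebuffer);
                Dirty[by][bx] = 0;                  /* clean until next write */
            }
}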