Why is z-buffering fast? - graphics

When I wrote my rasterizer, I realized that for each pixel I had to compare all of the triangles in the model to determine the depth value. But if there are, say, a million triangles, does that mean each individual GPU core must test a million triangles per pixel? That would take an incredibly long time, so I would like to know how this problem is avoided. I have heard it is done in hardware, but I don't understand by what principle.

Depth sorting needs to sort all triangles by perpendicular distance to the camera, and even split intersecting triangles, in order to work correctly. That is a huge amount of work that scales with the number of entities rendered at roughly O(n log n), but it does not need much additional memory (unless there are too many splits)... That is why it was used in the past, when memory was scarce and CPUs were slow, so there were only a few entities to render, making it still fast enough... Also, in some edge cases the depth sorting can be replaced by simple O(1) back-face culling (simple scenes with a single convex, non-intersecting object, or objects too far from each other to block each other's view)...
Nowadays the situation is different: we have very complex scenes with a lot of entities, fast CPUs and GPUs, and plenty of memory, so depth buffering is used instead, because it is O(1) per pixel and pixel perfect, but it needs a shadow screen buffer holding the depths, which can be a large chunk of memory... The rendering is done like this:
1. Clear the depth buffer with the most distant value
This is the slowest operation, but it is done only once per frame and it's just memory filling... Usually done like this:
for (y=0;y<y_resolution;y++)
 for (x=0;x<x_resolution;x++)
  {
  depth[y][x]=z_max;            // farthest possible depth
  color[y][x]=background_color;
  }
In case the buffers are stored as linear arrays, you can use memset or even DMA on some platforms for this.
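For illustration, a minimal C++ sketch of that linear-array clear (the buffer names and types are my own assumptions, not part of the answer above); note that memset only works when the fill value is a repeated byte pattern such as 0, so std::fill is the safer general form:

#include <algorithm>
#include <cstdint>

void clear_buffers(uint32_t *color, float *depth, int width, int height,
                   uint32_t background_color, float z_max)
{
    size_t count = (size_t)width * height;
    std::fill(depth, depth + count, z_max);            // most distant value everywhere
    std::fill(color, color + count, background_color); // background color everywhere
}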
2. Add a condition to the pixel rendering and also store the rendered depth
This skips pixels if something has already been rendered in front of them, like:
void pixel(int x,int y,int z,int col)
 {
 if (depth[y][x]>z)
  {
  depth[y][x]=z;   // store new depth value to buffer
  color[y][x]=col; // render pixel
  }
 }
As this is done by HW, no branch- or cache-unfriendly operation is involved...
This approach outputs two images: one holding the colors (the wanted image) and the depth buffer holding the rendered depths, so we still have 3D info, which allows additional processing/effects like ray picking, lighting effects, shadows, scattering and much more...
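Since the answer mentions ray picking and similar effects: a small sketch (my own illustration, assuming an OpenGL-style depth range of [0,1] and a standard perspective projection with near/far planes n and f) of turning a value read back from the depth buffer into a view-space distance:

// Convert a depth-buffer value d in [0,1] back to a linear view-space distance,
// assuming a standard perspective projection with near plane n and far plane f.
float linearize_depth(float d, float n, float f)
{
    float z_ndc = 2.0f * d - 1.0f;                     // [0,1] -> [-1,1]
    return (2.0f * n * f) / (f + n - z_ndc * (f - n)); // distance along the view axis
}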
There are also hybrid techniques using both approaches like this:
OpenGL - How to create Order Independent transparency?

Related

How are graphics / drawing methods made?

One thing I've noticed about programming is that whenever there's a method that draws individual pixels or gets data from individual pixels, it's always much slower than methods for drawing primitives or pre-made graphics. I was just wondering why that is. Wouldn't making those methods require (at some point) the use of a single-pixel drawing method? And if there's a faster way to do it, why wouldn't they make the single-pixel method do it that way as well?
Some thoughts:
Many modern computing environments include hardware acceleration for drawing graphics primitives. Instead of having the CPU access the video memory one pixel at a time, graphics acceleration hardware touches the memory of many pixels at once. Drawing primitives can take advantage of such acceleration in ways that software pixel operations cannot. http://en.wikipedia.org/wiki/Hardware_acceleration has some useful pointers.
Even drawing primitives that run in software are highly optimized, often in ways that aren't (easily) accessible from higher-level languages. For example, SIMD instructions in modern CPUs allow touching multiple pixel locations at once.
Finally, an interpreted language may introduce significant pixel-to-pixel delays compared to a compiled language, depending on which programming language you experimented with most.
There is overhead in GetPixel/SetPixel operations that can usually be optimized away when drawing a primitive that touches many adjacent pixels.
Consider how GetPixel and SetPixel must be implemented:
Determine if each coordinate is in bounds.
Compute the location in memory of that pixel.
Depending on the pixel format, isolating the data for that pixel might involve unpacking.
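A minimal sketch of what those three steps might look like in code (the 32-bit framebuffer layout and the Surface struct are assumptions for illustration):

#include <cstdint>

struct Surface {
    uint32_t *pixels;   // 32-bit pixels
    int width, height;
    int stride;         // pixels per row (may be larger than width)
};

void SetPixel(Surface &s, int x, int y, uint32_t color)
{
    // 1. bounds check on every single call
    if (x < 0 || y < 0 || x >= s.width || y >= s.height) return;
    // 2. compute the pixel's address from scratch
    uint32_t *p = s.pixels + (size_t)y * s.stride + x;
    // 3. store the value (trivial for 32-bit; a 16-bit or paletted format would need packing)
    *p = color;
}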
Now consider something like an axis-aligned rectangle primitive. The naive way to implement it would be a double loop over x and y and calling SetPixel:
void DrawRect(RGB color, int left, int top, int right, int bottom) {
    for (int y = top; y < bottom; ++y) {
        for (int x = left; x < right; ++x) {
            SetPixel(x, y, color);
        }
    }
}
SetPixel is going to check that both coordinates are in bounds for every call, which is wasteful. If the y coordinate was valid for the previous pixel in the same row, then it's still going to be valid for the next one.
Additionally, in most raster formats, most of the pixels you set will be adjacent in memory, so a general purpose formula (even something simple like address = base + (y * stride) + x) is more calculation than just incrementing the last address for the next pixel on the same row.
This illustrates how many (most?) primitives can be drawn with far less bounds checking and calculation than a naive implementation using SetPixel. Primitive drawing operations tend to be optimized because they are so common.
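For comparison, a sketch of the same rectangle drawn with the checks hoisted out of the loops (reusing the hypothetical Surface struct sketched above):

void DrawRectFast(Surface &s, uint32_t color, int left, int top, int right, int bottom)
{
    // clip once, instead of bounds-checking every pixel
    if (left < 0) left = 0;
    if (top < 0) top = 0;
    if (right > s.width) right = s.width;
    if (bottom > s.height) bottom = s.height;
    if (left >= right || top >= bottom) return;

    uint32_t *row = s.pixels + (size_t)top * s.stride + left;
    int w = right - left;
    for (int y = top; y < bottom; ++y) {
        for (int x = 0; x < w; ++x)
            row[x] = color;   // just step to the next adjacent pixel
        row += s.stride;      // advance one row, no full address recomputation
    }
}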
On a modern machine, there's even more opportunity for optimization. Some primitives might actually be drawn by the GPU rather than the CPU. This is significant because the GPU has direct access to the video memory. The CPU typically has indirect access to the video memory, and pixel access would have to be shuttled back and forth across the graphics bus. Some graphics buses are very fast, but usually not as fast as accessing local memory. Sending a single draw-primitive command, which is only a few bytes, across the bus to the GPU is going to be a lot faster than sending many set-pixel commands.
Furthermore, GPUs are designed to do this kind of work in parallel. They generally have many simple cores that can work at the same time. So, if it's drawing a rectangle, the GPU might (for example) distribute each scanline to a separate core, and the entire rectangle will be drawn as fast as a single core can draw a single horizontal line.

Number of polygons in a 3D object and the rendering workload?

Is there any relation (preferably an equation) between the number of polygons in a 3D object and the rendering workload? I want to see how much the rendering workload would be increased if for instance the number of polygons doubles.
There is no clear connection between the arbitrary number of polygons and the mythical "workload".
See the following samples:
You render a cube with 6 faces composed of 12 triangles. You get, say, 1000fps (without vsync). When you tessellate the cube into 120 triangles, most likely the fps counter stays at 1000.
You render a single fullscreen-sized quad with a heavy fragment shader with a lot of calculation. You get 0.5fps (or more, but I hope you get the point).
Another extreme: you are rendering a thousand similar cubes, each with a different texture. The render state changes will take most of the time, not the actual rendering.
So, polygons may cover different screen areas and they may not be rendered within a single primitive. If you're talking about one big vertex array with a large number of polygons, then for certain scenarios the performance change should be roughly linear. "Roughly" because the video card and the drivers clip the invisible polygons and perform early-out tests for each pixel being rendered.
Could you define 'workload'? – Erno
Well, I mean working calculations. I want to see how much overhead (for GPU, CPU, memory, ...) would be increased. Actually I want to determine the energy usage of the device. – user1196937
If that is the actual question: to compare energy usage you will have to pick specific configurations and test them. Energy usage is very different from GPU to GPU and machine to machine.
Some GPU manufacturers give very detailed information on the performance of their processors, but when you want to compare them you will need an actual machine.

Best way to move sprites in OpenGL - translate or alter vertices

I am creating an app for android using openGL ES. I am trying to draw, in 2D, lots of moving sprites which bounce around the screen.
Let's consider I have a ball at coordinates 100,100. The ball graphic is 10px wide, therefore I can create the vertices boundingBox = {100,110,0, 110,110,0, 100,100,0, 110,100,0} and perform the following on each loop of onDrawFrame() with the ball texture loaded.
//for each ball object
FloatBuffer ballVertexBuffer = byteBuffer.asFloatBuffer();
ballVertexBuffer.put(ball.boundingBox);
ballVertexBuffer.position(0);
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
I would then update the boundingBox array to move the balls around the screen.
Alternatively, I could not alter the bounding box at all and instead translatef() the ball before drawing the vertices:
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glPushMatrix();
gl.glTranslatef(ball.posX, ball.posY, 0);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
gl.glPopMatrix();
What would be the best thing to do in this case, in terms of efficiency and best practices?
OpenGL ES (as of 2.0) does not support instancing, unfortunately. If it did, I would recommend drawing a 2-triangle sprite instanced N times, reading the x/y offsets of the center point, and possibly a scale value if you need differently sized sprites, from a vertex texture (which ES supports just fine). This would limit the amount of data you must push per frame to a minimum.
Assuming you can't do the simulation directly on the GPU (thus avoiding uploading the vertex data each frame)... this basically leaves you with only one efficient option:
Generate 2 VBOs, map one and fill it while the other is used as the source of the draw call. You can also do this with what is seemingly a single buffer if you call glBufferData with a null data pointer in between, which tells OpenGL to allocate a new backing store and throw the old one away as soon as it is done reading from it.
Streaming vertices in every frame may not be super fast, but this does not matter as long as the latency can be well-hidden (e.g. by drawing from one buffer while filling another). Few draw calls, few state changes, and ideally no stalls should still make this fast.
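A small sketch of that orphaning idiom (using desktop-style GL naming for illustration; the buffer handle and data variables here are hypothetical):

// Stream this frame's sprite vertices into a VBO, orphaning the old storage
// so the driver does not have to stall if the GPU is still reading from it.
glBindBuffer(GL_ARRAY_BUFFER, spriteVbo);
glBufferData(GL_ARRAY_BUFFER, dataSizeInBytes, NULL, GL_STREAM_DRAW); // orphan: fresh backing store
glBufferSubData(GL_ARRAY_BUFFER, 0, dataSizeInBytes, spriteVertices); // fill the new store
glVertexPointer(3, GL_FLOAT, 0, (void*)0);                            // fixed-function, as in the question
glDrawArrays(GL_TRIANGLE_STRIP, 0, vertexCount);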
Draw calls are much more expensive than altering the data. Also, glTranslate is not nearly as efficient as just adding a few numbers; after all, it has to go through a full 4×4 matrix multiplication, which is 64 scalar multiplies and 48 scalar additions.
Of course the best method is using some form of instancing.

Sprites in game programming, multiple files vs one "texture"?

Pardon me if my lingo is not correct as I'm new to game programming. I've been looking at some open source projects and noticed that some sprites are split up into several files, all of which are grouped together to make a 2d object look like it's animating. That's straight forward. Then I'll see a different approach, with the 2d object all in one png file or something similar, all next to each other.
Is there an advantage of using one approach to another? Should sprites be in separate files? Why are they sometimes all on one sheet?
The former approach is typically more straightforward and easy to program, so you see a lot of it in open source projects.
The second approach is more efficient on modern graphics hardware, because it allows you to draw multiple different sprites from one large texture by specifying different u,v coordinates to select each individual sprite from the composite sheet. Because u,v coordinates can be streamed along with vertex data to a shader, this allows you to draw a large group of sprites much more efficiently than you could if you had to switch textures (which means changing shader state) for each poly. That means you can draw more sprites per millisecond, and thus get more on screen.
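For illustration, a small sketch (my own, assuming a regular-grid layout) of how the u,v rectangle for one sprite in a sheet can be computed:

struct UvRect { float u0, v0, u1, v1; };

// Select cell 'index' from a sprite sheet laid out as a cols x rows grid.
UvRect atlas_cell(int index, int cols, int rows)
{
    int cx = index % cols;
    int cy = index / cols;
    UvRect r;
    r.u0 = (float)cx / cols;        // left
    r.v0 = (float)cy / rows;        // top
    r.u1 = (float)(cx + 1) / cols;  // right
    r.v1 = (float)(cy + 1) / rows;  // bottom
    return r;
}

These four values are what you feed into the quad's texture coordinates instead of the full 0..1 range.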
Every time you switch your currently bound texture you incur a penalty (sometimes a very big one if the system runs out of memory and starts paging textures in and out). So the more things you can draw with one texture the better. Going to extremes, if you never switched texture bindings, you'd incur 0 penalty.
On the other hand, video cards limit the maximum size of a texture, so you can only group smaller textures into a big one so much. The older the card the smaller the texture size you can use. So if you want to make your game work on a large variety of cards, you have to limit your textures to a more normal size (or have different sets of textures for different cards).
Another problem is that sometimes the stuff in your virtual world just doesn't lend itself to being grouped like this. While you can have a big texture with every little decoration for your UI (window frames, buttons, etc.), you're going to have a harder time using a single texture for different enemies, because they might not even appear on the screen at the same time, or you might be unable to draw them one after another because of the back-to-front drawing scheme necessary for transparency.
Not so long ago, one reason to use packed sprites instead of separate ones was that graphics hardware was limited to power-of-two textures (256, 512, 1024, ...). So you would waste a good amount of memory by not packing the sprites, as you would have to enlarge everything to power-of-two dimensions before you could upload it. Packing multiple sprites into a single texture worked around that.
Another reason is that it's much quicker to load one big image file from the HD than it is to load hundreds of small ones. This is still the case, as file access comes with quite a large overhead per file, so the fewer files you have the faster things become. And especially with small sprites, you can easily turn a hundred files into a single one, so the saving can be quite noticeable.
There are, however, also reasons against having everything in one texture. For one, OpenGL is no longer limited to power-of-two textures, so any size will work. But more importantly, packing everything into one texture has negative side effects. When you, for example, have lots of scaling in a game, you have to be careful about the borders of your sprites, as colors will bleed into neighboring sprites, giving you ugly artifacts. You can avoid that to a certain degree by adding extra space around your sprites, but it's not a perfect solution. Having everything in one texture also limits what you can do with the image. For certain effects, such as a waterfall, you might want to animate by simply offsetting the UV coordinates of the texture; you can't do that so easily when everything is packed into a single texture.

What does 'Polygon' mean in terms of 3D Graphics?

An old Direct3D book says
"...you can achieve an acceptable frame
rate with hardware acceleration while
displaying between 2000 and 4000
polygons per frame..."
What is one polygon in Direct3D? Do they mean one primitive (indexed or otherwise) or one triangle?
That book means triangles. Otherwise, what if I wanted 1000-sided polygons? Could I still achieve 2000-4000 such shapes per frame?
In practice, the only thing you'll want it to be is a triangle, because if a polygon is not a triangle it's generally tessellated into triangles anyway. (E.g., a quad consists of two triangles, et cetera.) A basic triangulation (tessellation) algorithm for a convex polygon is really simple; you just loop through the vertices and fan triangles out from the first one, as sketched below.
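A sketch of that convex-polygon fan triangulation (my own illustration):

#include <vector>

// Fan-triangulate a convex polygon with n ordered vertices.
// Emits index triples (0, i, i+1) for i = 1 .. n-2, i.e. n-2 triangles.
std::vector<int> triangulate_fan(int vertexCount)
{
    std::vector<int> indices;
    for (int i = 1; i + 1 < vertexCount; ++i) {
        indices.push_back(0);
        indices.push_back(i);
        indices.push_back(i + 1);
    }
    return indices;
}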
Here, a "polygon" refers to a triangle. All . However, as you point out, there are many more variables than just the number of triangles which determine performance.
Key issues that matter are:
The format of storage (indexed or not; list, fan, or strip)
The location of storage (host-memory vertex arrays, host-memory vertex buffers, or GPU-memory vertex buffers)
The mode of rendering (is the draw primitive command issued fully from the host, or via instancing)
Triangle size
Together, those variables can create much greater than a 2x variation in performance.
Similarly, the hardware on which the application is running may vary 10x or more in performance in the real world: a GPU (or integrated graphics processor) that was low-end in 2005 will perform 10-100x slower in any meaningful metric than a current top-of-the-line GPU.
All told, any recommendation that you use 2-4000 triangles is so ridiculously outdated that it should be entirely ignored today. Even low-end hardware today can easily push 100,000 triangles in a frame under reasonable conditions. Further, most visually interesting applications today are dominated by pixel shading performance, not triangle count.
General rules of thumb for achieving good triangle throughput today:
Use [indexed] triangle (or quad) lists
Store data in GPU-memory vertex buffers
Draw large batches with each draw-primitive call (thousands of primitives; see the sketch after this list)
Use triangles mostly >= 16 pixels on screen
Don't use the Geometry Shader (especially for geometry amplification)
Do all of those things, and any machine today should be able to render tens or hundreds of thousands of triangles with ease.
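As a concrete (hypothetical) illustration of the first three points, using OpenGL naming, an indexed draw from GPU-memory buffers looks roughly like this:

// Upload once into GPU-memory vertex and index buffers ...
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertexData, GL_STATIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexBytes, indexData, GL_STATIC_DRAW);

// ... then each frame, draw thousands of indexed triangles with a single call.
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, (void*)0);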
According to this page, a polygon is n-sided in Direct3D. In C#:
public static Mesh Polygon(
    Device device,
    float length,
    int sides
)
As others have already said, "polygons" here means triangles.
The main advantage of triangles is that, since 3 points define a plane, triangles are coplanar by definition. This means that every point within the triangle is exactly defined as a linear combination of the triangle's vertices. More than three vertices aren't necessarily coplanar, and they don't define a unique surface.
An advantage more in mechanical modeling than in graphics is that triangles are also undeformable.
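To make the "linear combination" point concrete, a small sketch (my own illustration) of barycentric interpolation inside a triangle:

struct Vec3 { float x, y, z; };

// Any point of triangle (A, B, C) can be written as P = wa*A + wb*B + wc*C
// with wa + wb + wc = 1 and all weights >= 0; the same weights interpolate
// colors, UVs and depth across the triangle.
Vec3 barycentric_point(Vec3 a, Vec3 b, Vec3 c, float wa, float wb, float wc)
{
    Vec3 p;
    p.x = wa * a.x + wb * b.x + wc * c.x;
    p.y = wa * a.y + wb * b.y + wc * c.y;
    p.z = wa * a.z + wb * b.z + wc * c.z;
    return p;
}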
