One thing I've noticed about programming is that whenever there's a method that draws individual pixels or gets data from individual pixels, it's always much slower than methods for drawing primitives or pre-made graphics. I was just wondering why that is. Wouldn't making those methods require (at some point) the use of a single-pixel drawing method? And if there's a faster way to do it, why wouldn't they make the single-pixel method do it that way as well?
Some thoughts:
Many modern computing environments include hardware acceleration for drawing graphics primitives. Instead of having the CPU access the video memory one pixel at a time, graphics acceleration hardware touches the video memory equivalent of multiple pixels at once. Drawing primitives can take advantage of such acceleration in ways that software pixel operations cannot. http://en.wikipedia.org/wiki/Hardware_acceleration has some useful pointers.
Even drawing primitives that run in software are highly optimized, often in ways that aren't (easily) accessible from higher-level languages. For example, SIMD instructions in modern CPUs allow touching multiple pixel locations at once.
Finally, an interpreted language may introduce significant pixel-to-pixel delays, compared to a compiled language. This depends on the programming language you experimented with most.
There is overhead in GetPixel/SetPixel operations that can usually be optimized away when drawing a primitive that touches many adjacent pixels.
Consider how GetPixel and SetPixel must be implemented:
Determine if each coordinate is in bounds.
Compute the location in memory of that pixel.
Depending on the pixel format, isolating the data for that pixel might involve unpacking.
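As a rough illustration of those three steps, here is a minimal sketch of a SetPixel-style function. The Surface struct, the RGB565 pixel format, and the names set_pixel and pack_rgb565 are assumptions made up for this example, not any particular library's API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical surface: 16-bit RGB565 pixels, rows `stride` pixels apart. */
typedef struct {
    uint16_t *pixels;  /* first pixel of the first row */
    int       width;   /* in pixels */
    int       height;  /* in pixels */
    size_t    stride;  /* pixels per row, including any padding */
} Surface;

/* Packs 8-bit channels into a single RGB565 value. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Returns false if (x, y) lies outside the surface. */
bool set_pixel(Surface *s, int x, int y, uint8_t r, uint8_t g, uint8_t b) {
    /* Step 1: bounds check on both coordinates, repeated on every call. */
    if (x < 0 || x >= s->width || y < 0 || y >= s->height) return false;
    /* Step 2: compute the pixel's location from scratch. */
    size_t index = (size_t)y * s->stride + (size_t)x;
    /* Step 3: pack the channels into the surface's pixel format. */
    s->pixels[index] = pack_rgb565(r, g, b);
    return true;
}
```

Every call pays for all three steps, even when the caller already knows the coordinates are valid and the previous pixel was the neighbour in memory.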
Now consider something like an axis-aligned rectangle primitive. The naive way to implement it would be a double loop over x and y and calling SetPixel:
void DrawRect(RGB color, int left, int top, int right, int bottom) {
    for (int y = top; y < bottom; ++y) {
        for (int x = left; x < right; ++x) {
            SetPixel(x, y, color);
        }
    }
}
SetPixel is going to check that both coordinates are in bounds for every call, which is wasteful. If the y coordinate was valid for the previous pixel in the same row, then it's still going to be valid for the next one.
Additionally, in most raster formats, most of the pixels you set will be adjacent in memory, so a general purpose formula (even something simple like address = base + (y * stride) + x) is more calculation than just incrementing the last address for the next pixel on the same row.
This illustrates how many (most?) primitives can be drawn with far less bounds checking and calculation than a naive implementation using SetPixel. Primitive drawing operations tend to be optimized because they are so common.
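For comparison with the naive version, a rectangle fill that applies both observations might look like the sketch below. It clips once and then only steps through memory. The Surface32 struct and the fill_rect name are hypothetical, and a 32-bit pixel format is assumed so that no packing is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit surface; `stride` is measured in pixels. */
typedef struct {
    uint32_t *pixels;
    int       width, height;
    size_t    stride;
} Surface32;

/* Fills [left, right) x [top, bottom) with `color`. */
void fill_rect(Surface32 *s, uint32_t color,
               int left, int top, int right, int bottom) {
    /* Clip the rectangle to the surface once, instead of once per pixel. */
    if (left < 0) left = 0;
    if (top < 0) top = 0;
    if (right > s->width) right = s->width;
    if (bottom > s->height) bottom = s->height;
    if (left >= right || top >= bottom) return;

    size_t span = (size_t)(right - left);
    /* One multiplication for the first row; after that, only additions. */
    size_t row = (size_t)top * s->stride + (size_t)left;
    for (int y = top; y < bottom; ++y) {
        for (size_t i = row; i < row + span; ++i) {
            s->pixels[i] = color;   /* adjacent pixel: just the next index */
        }
        row += s->stride;           /* next scanline: one addition */
    }
}
```

A production implementation would likely go further (for example, writing several pixels per store), but even this version removes the per-pixel bounds check and the per-pixel multiplication.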
On a modern machine, there's even more opportunity for optimization. Some primitives might actually be drawn by the GPU rather than the CPU. This is significant because the GPU has direct access to the video memory. The CPU typically has indirect access to the video memory, and pixel access would have to be shuttled back and forth across the graphics bus. Some graphics busses are very fast, but usually not as fast as accessing local memory. Sending a single draw-primitive command, which is only a few bytes, across the bus to the GPU is going to be a lot faster than sending many set-pixel commands.
Furthermore, GPUs are designed to do this kind of work in parallel. They generally have many simple cores that can work at the same time. So, if it's drawing a rectangle, the GPU might (for example) distribute each scanline to a separate core, and the entire rectangle will be drawn as fast as a single core can draw a single horizontal line.
When I made my rasterizer, I realized that each pixel needs to be compared against all the triangles in the model to determine its depth value. But if there are, for example, a million triangles, does each individual GPU core have to test all million of them? That would take an incredibly long time, so I would like to know how this problem is avoided. I heard that this is done in hardware, but I did not understand by what principle.
Depth sorting needs to sort all triangles by perpendicular distance to the camera, and even split intersecting triangles, in order to work correctly. That is a huge amount of work that scales with the number of entities rendered, roughly O(n.log(n)), but it does not need much additional memory (unless there are too many splits). That is why it was used in the past, when memory was scarce and CPUs were slow, so there were only a few entities to render and it was still fast enough. In some edge cases the depth sorting can also be replaced by simple O(1) back-face culling (simple scenes with a single convex, non-intersecting shape, or shapes too far from each other to block one another's view).
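The sorting part of depth sorting can be sketched with the standard qsort, as below. The Tri struct, its precomputed depth field, and the function names are assumptions for illustration. The sketch does not split intersecting triangles, which a correct depth sort would still need to do.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical triangle record. */
typedef struct {
    float depth;  /* perpendicular distance to the camera, assumed precomputed */
    int   id;     /* stand-in for the triangle's actual geometry */
} Tri;

/* qsort comparator that places farther triangles first. */
int farthest_first(const void *a, const void *b) {
    float da = ((const Tri *)a)->depth;
    float db = ((const Tri *)b)->depth;
    return (da < db) - (da > db);
}

/* Sorts back to front, so that drawing in array order lets nearer
   triangles overwrite farther ones. */
void sort_back_to_front(Tri *tris, size_t n) {
    qsort(tris, n, sizeof tris[0], farthest_first);
}
```

The sort has to be redone whenever the camera or the geometry moves, which is where the per-frame cost comes from.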
Nowadays the situation is different: we have very complex scenes with lots of entities, fast CPUs and GPUs, and lots of memory, so depth buffering is used instead. It costs O(1) per pixel and is pixel perfect, but it needs a shadow screen buffer holding the depths, which can be a large chunk of memory. The rendering is done like this:
clear depth buffer with most distant value
this is the slowest operation, but it is done only once per frame and it is just memory filling. It is usually done like this:
for (y=0;y<y_resolution;y++)
 for (x=0;x<x_resolution;x++)
  {
  depth[y][x]=z_max;
  color[y][x]=background_color;
  }
If the buffers are stored as linear arrays, you can use memset (when the fill value is made of identical bytes) or even DMA on some platforms for this.
add condition to rendering pixel and also store rendered depth
to skip pixels if something is already rendered before them like:
void pixel(int x,int y,int z,int col)
 {
 if (depth[y][x]>z)
  {
  depth[y][x]=z;   // store new depth value to buffer
  color[y][x]=col; // render pixel
  }
 }
Because this is done by hardware, no branch or cache-unfriendly operation is involved.
This approach results in two output images: one holding the colors (the wanted image) and the depth buffer holding the rendered depths. So we still have 3D info, which allows additional processing and effects like ray picking, lighting effects, shadows, scattering and much more.
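Putting the clear and the depth test together, a small software sketch could look like this. The tiny 4x4 resolution, the 16-bit depth and 8-bit color formats, and the XRES/YRES names are assumptions chosen so that memset can be used for the clear.

```c
#include <stdint.h>
#include <string.h>

enum { XRES = 4, YRES = 4 };

/* Linear (row-major) buffers, one entry per pixel. */
uint16_t depth[XRES * YRES];
uint8_t  color[XRES * YRES];

void clear_buffers(uint8_t background) {
    /* memset writes one repeated byte, so it only suits fill values whose
       bytes are all equal: 0xFFFF (the farthest 16-bit depth) qualifies,
       and so does any 8-bit color. */
    memset(depth, 0xFF, sizeof depth);
    memset(color, background, sizeof color);
}

void pixel(int x, int y, uint16_t z, uint8_t col) {
    int i = y * XRES + x;
    if (depth[i] > z) {   /* nearer than what is already stored? */
        depth[i] = z;     /* store the new depth value */
        color[i] = col;   /* render the pixel */
    }
}
```

Calling pixel(1, 1, 10, 5) and pixel(1, 1, 20, 9) in either order leaves color 5 and depth 10 at that position, which is why no sorting of the geometry is needed for opaque surfaces.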
There are also hybrid techniques using both approaches like this:
OpenGL - How to create Order Independent transparency?
If this question is off, please let me know as I don't want to clutter the platform with off-topic questions!
Anyways, I'm having a hard time finding information about what's actually going on when an image is rendered because of some code I've written.
Say I wanted to add the numbers 5 and 3. The CPU would write 5 to one register and 3 to another one. The ALU would take care of the calculation and output 8. That's fine, the CPU uses MOVE and ADD to produce a result.
What I can't find any information on, however, is what's going on when I want to draw a rectangle. There are importable frameworks for most programming languages that let you do this. In SpriteKit (Swift & Objc) for example, you would write something like
let node = SKSpriteNode(color: .white, size: CGSize(width: 200, height: 300))
and add node to an SKScene (just a scene containing childNodes), and a white rectangle would "magically" get rendered. What I would like to know is what goes on under the hood. Why does this exact framework let you draw a rectangle? What is the assembly code (say, for Intel Core M) which makes the GPU calculate what this rectangle will look like? And how does SpriteKit build on the basics of Swift/Objective C to actually do this (and could I do this myself)?
Maybe a weird question, but I feel like I have to know (yes, sometimes I'm too curious). Thank you.
P.S. I would love a really detailed answer, not "the CPU 'tells' the GPU to draw a rectangle" - CPUs can't talk!
There are many ways to render a convex polygon. The most used in the past was the scanline algorithm, where you simply rasterize all the edges of the outline into left/right buffers and then render using horizontal lines, interpolating the other coordinates along the way (like z,r,g,b,tx,ty,nx,ny,nz...). This was well suited to single-threaded, CPU-based software rendering.
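A minimal sketch of that scanline approach follows. The 16x16 image size, integer vertex coordinates, and the names Pt, scan_edge and fill_convex are assumptions for this example. Only the flat color is written; interpolating z, r, g, b and the other attributes along each span is left out.

```c
#include <limits.h>

enum { W = 16, H = 16 };

typedef struct { int x, y; } Pt;

/* Left/right buffers: smallest and largest x reached on each row. */
int span_left[H], span_right[H];

void touch(int x, int y) {
    if (y < 0 || y >= H) return;
    if (x < span_left[y])  span_left[y]  = x;
    if (x > span_right[y]) span_right[y] = x;
}

/* Rasterizes one edge of the outline into the left/right buffers. */
void scan_edge(Pt a, Pt b) {
    if (a.y == b.y) {            /* horizontal edge: record both ends */
        touch(a.x, a.y);
        touch(b.x, b.y);
        return;
    }
    if (a.y > b.y) { Pt t = a; a = b; b = t; }
    for (int y = a.y; y <= b.y; ++y) {
        /* Linear interpolation of x along the edge. */
        touch(a.x + (b.x - a.x) * (y - a.y) / (b.y - a.y), y);
    }
}

/* Fills a convex polygon with n vertices given in order. */
void fill_convex(unsigned char img[H][W], const Pt *v, int n,
                 unsigned char col) {
    for (int y = 0; y < H; ++y) {
        span_left[y] = INT_MAX;
        span_right[y] = INT_MIN;
    }
    for (int i = 0; i < n; ++i) scan_edge(v[i], v[(i + 1) % n]);
    for (int y = 0; y < H; ++y) {
        /* Rows the polygon never touched keep left > right and are skipped. */
        int x0 = span_left[y] < 0 ? 0 : span_left[y];
        int x1 = span_right[y] >= W ? W - 1 : span_right[y];
        for (int x = x0; x <= x1; ++x) img[y][x] = col;
    }
}
```

Because convexity guarantees a single span per row, one left and one right value per scanline is enough.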
With parallelization (like on a GPU), a different approach became more popular. It renders only triangles (so you need to triangulate your polygons) and works like this:
compute AABB
that is simply the min and max of the x,y coordinates of the triangle's vertices.
loop through AABB
this is done in parallel by the GPU interpolators. Each interpolated (looped) "pixel" is called a fragment (as it usually contains more than just a color)
for each fragment
compute the barycentric coordinates (s,t) and from the result decide whether the fragment is inside (s>=0, t>=0 and s+t<=1) or outside the triangle. If inside, invoke the fragment shader.
All this gets done just before the fragment shader stage, and usually all of it (or the majority of it) is implemented in hardware, so there is no code.
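In software, the three steps above can be sketched serially as follows. A GPU would evaluate the fragments in parallel. The 8x8 image, sampling at pixel centers, and the names Vec2, barycentric and raster_triangle are assumptions for this example, and writing a flat color stands in for the fragment shader.

```c
typedef struct { float x, y; } Vec2;

enum { W = 8, H = 8 };

/* Solves p = a + s*(b - a) + t*(c - a) for (s, t).
   Returns 0 for a degenerate (zero-area) triangle. */
int barycentric(Vec2 a, Vec2 b, Vec2 c, Vec2 p, float *s, float *t) {
    float d = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    if (d == 0.0f) return 0;
    *s = ((p.x - a.x) * (c.y - a.y) - (c.x - a.x) * (p.y - a.y)) / d;
    *t = ((b.x - a.x) * (p.y - a.y) - (p.x - a.x) * (b.y - a.y)) / d;
    return 1;
}

float min3(float a, float b, float c) {
    float m = a < b ? a : b;
    return m < c ? m : c;
}

float max3(float a, float b, float c) {
    float m = a > b ? a : b;
    return m > c ? m : c;
}

int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Returns the number of fragments that passed the inside test. */
int raster_triangle(unsigned char img[H][W], Vec2 a, Vec2 b, Vec2 c,
                    unsigned char col) {
    /* 1. AABB of the triangle, clamped to the image. */
    int x0 = clampi((int)min3(a.x, b.x, c.x), 0, W - 1);
    int x1 = clampi((int)max3(a.x, b.x, c.x), 0, W - 1);
    int y0 = clampi((int)min3(a.y, b.y, c.y), 0, H - 1);
    int y1 = clampi((int)max3(a.y, b.y, c.y), 0, H - 1);
    int covered = 0;
    /* 2. Loop through the AABB. */
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            Vec2 p = { x + 0.5f, y + 0.5f };  /* sample at the pixel center */
            float s, t;
            /* 3. Inside test on the barycentric coordinates. */
            if (barycentric(a, b, c, p, &s, &t) &&
                s >= 0.0f && t >= 0.0f && s + t <= 1.0f) {
                img[y][x] = col;              /* the "fragment shader" */
                ++covered;
            }
        }
    }
    return covered;
}
```

Each iteration of the inner loop is independent of all the others, which is what makes this formulation attractive for parallel hardware.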
Nowadays GPU rendering is done by passing geometry to the gfx driver itself. What the driver does under the hood is guesswork for us, but most likely it also just passes the geometry and configuration settings to the right places on the GPU (memory, registers, ...).
Is there any relation (preferably an equation) between the number of polygons in a 3D object and the rendering workload? I want to see how much the rendering workload would be increased if for instance the number of polygons doubles.
There is no clear connection between the arbitrary number of polygons and the mythical "workload".
See the following samples:
You render a cube with 6 faces composed of 12 triangles. You get, say, 1000fps (without vsync). When you tessellate the cube into 120 triangles, most likely the fps counter remains at 1000.
You render a single fullscreen-sized quad with a heavy fragment shader with a lot of calculation. You get 0.5fps (or more, but I hope you get the point).
Another extreme: you render a thousand similar cubes, each with a different texture. The render state changes will take most of the time, not the actual rendering.
So, polygons may have different screen areas, and they may not all be rendered within a single primitive. If you're talking about one big vertex array with a large number of polygons, then for certain scenarios the performance change should be something like linear. "Something like" because the video card and the drivers clip the invisible polygons and perform early-out tests for each pixel being rendered.
Could you define 'workload'? – Erno yesterday
Well, I mean working calculations. I want to see how much overhead (for GPU, CPU, memory,...) would be increased. Actually I want to conclude the energy usage of the device – user1196937 2 hours ago
If that is the actual question, a comparison of energy usage:
You will have to pick specific configurations and test those. Energy usage is very different from GPU to GPU and machine to machine.
Some GPU manufacturers give very detailed information on the performance of their processors, but when you want to compare those you will need an actual machine.
I am creating an app for Android using OpenGL ES. I am trying to draw, in 2D, lots of moving sprites which bounce around the screen.
Let's consider I have a ball at coordinates 100,100. The ball graphic is 10px wide, therefore I can create the vertices boundingBox = {100,110,0, 110,110,0, 100,100,0, 110,100,0} and perform the following on each loop of onDrawFrame() with the ball texture loaded.
//for each ball object
FloatBuffer ballVertexBuffer = byteBuffer.asFloatBuffer();
ballVertexBuffer.put(ball.boundingBox);
ballVertexBuffer.position(0);
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
I would then update the boundingBox array to move the balls around the screen.
Alternatively, I could leave the bounding box unaltered and instead glTranslatef() the ball before drawing the vertices
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glPushMatrix();
gl.glTranslatef(ball.posX, ball.posY, 0);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
gl.glPopMatrix();
What would be the best approach in this case, in terms of efficiency and best practices?
Unfortunately, OpenGL ES (as of 2.0) does not support instancing. If it did, I would recommend drawing a 2-triangle sprite instanced N times, reading the x/y offsets of the center point, and possibly a scale value if you need differently sized sprites, from a vertex texture (which ES supports just fine). This would limit the amount of data you must push per frame to a minimum.
Assuming you can't do the simulation directly on the GPU (thus avoiding uploading the vertex data each frame), this basically leaves you with only one efficient option:
Generate 2 VBOs, map one and fill it, while the other is used as the source of the draw call. You can also do this seemingly with a single buffer if you glBufferData(... 0) in between, which tells OpenGL to generate a new buffer and throw the old one away as soon as it's done reading from it.
Streaming vertices in every frame may not be super fast, but this does not matter as long as the latency can be well-hidden (e.g. by drawing from one buffer while filling another). Few draw calls, few state changes, and ideally no stalls should still make this fast.
Draw calls are much more expensive than altering the data. Also, glTranslate is not nearly as efficient as just adding a few numbers; after all, it has to go through a full 4×4 matrix multiplication, which is 64 scalar multiplies and 48 scalar additions.
Of course the best method is using some form of instancing.
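The "just adding a few numbers" route can be sketched as building one vertex array for all sprites on the CPU. The Ball struct reuses the posX/posY names from the question; build_batch, the unit-quad layout, and the use of two independent triangles per sprite are assumptions for this example. The GL calls themselves are left out so the sketch runs without a GL context.

```c
#include <stddef.h>

typedef struct { float posX, posY; } Ball;

enum { FLOATS_PER_VERTEX = 3, VERTS_PER_SPRITE = 6 };

/* Writes two triangles per ball into `out`, which must hold
   n * VERTS_PER_SPRITE * FLOATS_PER_VERTEX floats.
   Returns the number of vertices written. */
size_t build_batch(const Ball *balls, size_t n, float size, float *out) {
    /* Unit quad as two independent triangles. Separate triangle strips
       cannot simply be concatenated into one draw call. */
    static const float quad[VERTS_PER_SPRITE][2] = {
        {0, 0}, {1, 0}, {0, 1},
        {1, 0}, {1, 1}, {0, 1}
    };
    float *p = out;
    for (size_t i = 0; i < n; ++i) {
        for (int v = 0; v < VERTS_PER_SPRITE; ++v) {
            /* Translation is just an addition per coordinate. */
            *p++ = balls[i].posX + quad[v][0] * size;
            *p++ = balls[i].posY + quad[v][1] * size;
            *p++ = 0.0f;
        }
    }
    return n * VERTS_PER_SPRITE;
}
```

The filled array would then be handed to the GPU once per frame and drawn with a single glDrawArrays call using GL_TRIANGLES and the returned vertex count, instead of one draw call per ball.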
An old Direct3D book says
"...you can achieve an acceptable frame rate with hardware acceleration while displaying between 2000 and 4000 polygons per frame..."
What is one polygon in Direct3D? Do they mean one primitive (indexed or otherwise) or one triangle?
That book means triangles. Otherwise, what if I wanted 1000-sided polygons? Could I still achieve 2000-4000 such shapes per frame?
In practice, the only thing you'll want it to be is a triangle, because if a polygon is not a triangle it's generally tessellated into triangles anyway (e.g., a quad consists of two triangles). A basic triangulation (tessellation) algorithm for convex polygons is really simple: you just loop through the vertices and form a triangle from the first vertex and each consecutive pair of the remaining ones.
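One simple way to do that loop is fan triangulation, sketched below. The function name triangulate_fan and the index-triple output format are assumptions for this example. The result is only valid for convex polygons whose vertices are given in order.

```c
#include <stddef.h>

/* Writes (n - 2) triangles as index triples into `out`, which must have
   room for 3 * (n - 2) entries. Returns the number of triangles. */
size_t triangulate_fan(size_t n, unsigned *out) {
    if (n < 3) return 0;
    for (size_t i = 1; i + 1 < n; ++i) {
        *out++ = 0;                   /* every triangle shares vertex 0 */
        *out++ = (unsigned)i;
        *out++ = (unsigned)(i + 1);
    }
    return n - 2;
}
```

For a quad (n = 4) this yields the two triangles (0, 1, 2) and (0, 2, 3), so a 1000-sided polygon would cost 998 triangles.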
Here, a "polygon" refers to a triangle. However, as you point out, there are many more variables than just the number of triangles that determine performance.
Key issues that matter are:
The format of storage (indexed or not; list, fan, or strip)
The location of storage (host-memory vertex arrays, host-memory vertex buffers, or GPU-memory vertex buffers)
The mode of rendering (is the draw primitive command issued fully from the host, or via instancing)
Triangle size
Together, those variables can create much greater than a 2x variation in performance.
Similarly, the hardware on which the application is running may vary 10x or more in performance in the real world: a GPU (or integrated graphics processor) that was low-end in 2005 will perform 10-100x slower in any meaningful metric than a current top-of-the-line GPU.
All told, any recommendation that you use 2,000-4,000 triangles is so ridiculously outdated that it should be entirely ignored today. Even low-end hardware today can easily push 100,000 triangles in a frame under reasonable conditions. Further, most visually interesting applications today are dominated by pixel shading performance, not triangle count.
General rules of thumb for achieving good triangle throughput today:
Use [indexed] triangle (or quad) lists
Store data in GPU-memory vertex buffers
Draw large batches with each draw primitives call (thousands of primitives)
Use triangles mostly >= 16 pixels on screen
Don't use the Geometry Shader (especially for geometry amplification)
Do all of those things, and any machine today should be able to render tens or hundreds of thousands of triangles with ease.
According to this page, a polygon is n-sided in Direct3D.
In C#:
public static Mesh Polygon(
    Device device,
    float length,
    int sides
)
As others already said, polygons here means triangles.
The main advantage of triangles is that, since three points define a plane, triangles are planar by definition. This means that every point within the triangle is exactly defined as a linear combination of the triangle's vertices. Polygons with more vertices aren't necessarily planar, and non-coplanar vertices don't define a unique surface.
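Whether four vertices happen to be coplanar can be checked with a scalar triple product, as in this small sketch. The Vec3 type and the helper names are assumptions for illustration.

```c
typedef struct { float x, y, z; } Vec3;

Vec3 sub(Vec3 a, Vec3 b) {
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

Vec3 cross(Vec3 a, Vec3 b) {
    Vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Scalar triple product of the three edges leaving `a`.
   It is zero exactly when a, b, c and d lie in one plane. */
float triple(Vec3 a, Vec3 b, Vec3 c, Vec3 d) {
    return dot(sub(b, a), cross(sub(c, a), sub(d, a)));
}
```

With only three vertices there is no fourth point to test, which is the sense in which a triangle is always planar. With real data the result should be compared against a small tolerance rather than exactly zero.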
An advantage that matters more in mechanical modeling than in graphics is that triangles are also rigid: they cannot be deformed without changing the length of an edge.