GPU frustum culling: why use a scan? - graphics

I'm trying to implement frustum culling on the GPU. After reading a bit and also stumbling on this very helpful repo: https://github.com/ellioman/Indirect-Rendering-With-Compute-Shaders , I've noticed that the go-to implementation seems to be:
1. Test the bounding box of every object and mark the ones that are inside the camera frustum with a 1, and the ones that are not with a 0.
2. Save the result in a buffer.
3. Execute a scan algorithm on this buffer.
4. Use the computed indices as write positions in a final buffer that stores the selected matrices, which will be used in the draw pass.
But I'm wondering: why use the scan and all its complexity, and not just append the matrices of the objects that passed the bounding-box test into an append buffer directly?
My guess is that append buffer writes are slow, but are they THAT much slower than running a scan on the GPU (which can take two dispatch calls if the input array is bigger than the max threads per group)?
Thank you!
EDIT: I am on Unity, but I don't think that matters much for this question.
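For illustration, here is a rough sketch of the mark -> scan -> scatter pattern described in the steps above. The linked repo does this in Unity/HLSL compute shaders; this sketch uses CUDA kernels plus Thrust for the exclusive scan, and all struct and function names are made up. The usual argument for the scan is that it avoids the hidden atomic counter behind an append buffer and keeps the compacted output in the original object order, with plain, race-free writes.

#include <thrust/execution_policy.h>
#include <thrust/scan.h>

struct AABB { float3 mn, mx; };

// Conservative plane test: the box is culled if it lies fully behind any plane.
__device__ bool insideFrustum(const AABB& box, const float4* planes /* 6 planes */)
{
    for (int i = 0; i < 6; ++i) {
        float3 p = make_float3(planes[i].x > 0 ? box.mx.x : box.mn.x,
                               planes[i].y > 0 ? box.mx.y : box.mn.y,
                               planes[i].z > 0 ? box.mx.z : box.mn.z);
        if (planes[i].x * p.x + planes[i].y * p.y + planes[i].z * p.z + planes[i].w < 0.0f)
            return false;
    }
    return true;
}

// Step 1+2: mark visible objects with 1, culled objects with 0.
__global__ void markVisible(const AABB* boxes, const float4* planes,
                            unsigned* flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = insideFrustum(boxes[i], planes) ? 1u : 0u;
}

// Step 4: after an exclusive scan of `flags` into `offsets`, offsets[i] is the
// compact output slot of object i, so every write lands in a unique place.
__global__ void compactMatrices(const float* matrices,   // 16 floats per object
                                const unsigned* flags, const unsigned* offsets,
                                float* visibleMatrices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        for (int k = 0; k < 16; ++k)
            visibleMatrices[offsets[i] * 16 + k] = matrices[i * 16 + k];
}

void cullAndCompact(const AABB* boxes, const float4* planes, const float* matrices,
                    unsigned* flags, unsigned* offsets, float* visibleMatrices, int n)
{
    int block = 256, grid = (n + block - 1) / block;
    markVisible<<<grid, block>>>(boxes, planes, flags, n);                 // steps 1+2
    thrust::exclusive_scan(thrust::device, flags, flags + n, offsets);     // step 3
    compactMatrices<<<grid, block>>>(matrices, flags, offsets, visibleMatrices, n); // step 4
}

As a bonus, with an exclusive scan the visible-object count is simply flags[n-1] + offsets[n-1], which is the value that would feed the indirect draw arguments.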

Related

Why is z-buffering fast?

When I made my rasterizer, I realized that each pixel needed to compare against all the triangles in the model to determine its depth value. But if there are, for example, a million triangles, does that mean each individual GPU core must compare against a million triangles? That would take an incredibly long time, so I would like to know how this problem is avoided. I heard that this is done in hardware, but I did not understand on what principle.
Depth sorting needs to sort all triangles by perpendicular distance to the camera, and even split intersecting triangles, in order to work correctly. That is a huge amount of work, scaling with the number of rendered entities as ~O(n·log(n)), but it does not need much additional memory (unless there are too many splits)... That is why it was used in the past, when memory was scarce and CPUs were slow, so there were only a few entities to render, making it still fast enough... Also, in some edge cases the depth sorting can be replaced by simple O(1) back-face culling (simple scenes with single convex, non-intersecting objects, or objects too far from each other to block each other's view)...
Nowadays the situation is different: we have very complex scenes with lots of entities, fast CPUs and GPUs, and lots of memory, so depth buffering is used instead because it is O(1) per pixel and pixel perfect, but it needs a shadow screen buffer holding the depths, which can be a large chunk of memory... The rendering is done like this:
1. clear the depth buffer with the most distant value
This is the slowest operation, but it is done only once per frame and it is just memory filling... Usually done like this:
for (y=0;y<y_resolution;y++)
 for (x=0;x<x_resolution;x++)
  {
  depth[y][x]=z_max;            // farthest possible depth
  color[y][x]=background_color; // clear the color at the same time
  }
In case the buffers are stored as linear arrays you can use memset or even DMA on some platforms for this.
2. add a condition to the pixel-rendering routine and also store the rendered depth
This skips a pixel whenever something closer has already been rendered there, like:
void pixel(int x,int y,int z,int col)
 {
 if (depth[y][x]>z)     // is the new fragment closer?
  {
  depth[y][x]=z;        // store the new depth value in the buffer
  color[y][x]=col;      // render the pixel
  }
 }
As this is done by hardware, no branch- or cache-unfriendly operation is involved...
This approach outputs two images: one holding the colors (the wanted image) and the depth buffer holding the rendered depths, so we still have 3D info, which allows additional processing/effects like ray picking, lighting effects, shadows, scattering and much much more...
There are also hybrid techniques using both approaches like this:
OpenGL - How to create Order Independent transparency?

Parallelizing search in a 2D array on CUDA

I have a 500 x 500 2D array of floats. I wish to search from the middle of the array in the vertical and horizontal directions for the first zero element in each direction. The output should be 4 indices: the first zero element in the North, South, East and West directions. Is there a way to parallelize this search operation in CUDA?
Thanks.
(This answer assumes that you are not searching entire quadrants, but only the straight lines in each direction)
1. In case the array is in CPU memory
In fact, you have a search space of just 1,000 elements. The overhead of copying the data, launching the kernel and waiting for the result is such that it is not worth your trouble.
Do it on the CPU. One of your axes already has the data nicely laid out, consecutively; it's probably best to work on that axis first. The other axis will be a bitch in terms of memory access, but that's life. You could go multi-threaded here, but I'm not sure it's worth your trouble for so little work. If you did, each thread would take its own stretch of the data.
As far as the algorithm goes - since your data isn't sorted, it's basically a linear search (up to vectorization). If you've gone multi-threaded - perhaps use a shared variable which a thread occasionally polls to see if a "closer-to-the-center" thread has found a zero yet; and when a thread finds a zero, it updates that variable to let other threads know to stop working.
2. In case the array is in GPU global memory
Now you get lots of (CUDA) 'threads'. So, it makes less sense to use an atomic variable, or polling etc.
We treat each of the four directions separately (although it doesn't have to be 4 separate kernels).
As @RobertCrovella notes, you can treat this problem as a parallel reduction, with each thread assigned one input element: initially, each thread holds a value of infinity (if its corresponding element is non-zero), or its distance from the center if its corresponding array value is 0. The reduction operator is then "minimum".
This is not entirely optimal, because when warp or block results are collected (as part of a parallel reduction), this problem allows for short-circuiting once the lowest non-infinity value is located. You can read up on how parallel reduction is implemented - but I really wouldn't bother, because you have a very small amount of computational work here.
Note: It is also possible that your array is in GPU array memory. In that case you would get better locality in both dimensions
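A minimal sketch of the axis-only case this answer describes, for one direction (East). It cheats a little by using atomicMin on a single result word instead of a hand-written tree reduction, which is plenty for a few hundred elements; the kernel name and launch details are made up.

// *result must be initialised to INT_MAX before the launch.
__global__ void firstZeroEast(const float* grid, int width, int height, int* result)
{
    int cx = width / 2, cy = height / 2;
    int offset = blockIdx.x * blockDim.x + threadIdx.x;   // distance from the centre
    int x = cx + offset;
    if (x < width && grid[cy * width + x] == 0.0f)
        atomicMin(result, offset);                        // keep the smallest distance found
}
// Launch with enough threads to cover the columns right of the centre; afterwards
// *result is the distance of the first zero to the East, or INT_MAX if there is none.
// The other three directions are the same kernel with a different index mapping.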
It's not really clear how you define "first zero element in the North, South, East and West directions" but I could imagine a rectangular data set broken into 4 quadrants along the diagonals.
We could label the top region the "north region" and we could label the other regions similarly.
With that assumption, in the worst case you have to check every element of the array.
Therefore one possible approach is a parallel reduction.
You would then do a parallel reduction on each region, such that the distance from the center (using the standard distance formula) is minimized, considering the zero elements in the region.
If you are actually only interested in the elements associated with the vertical axis and horizontal axis that pass through the center of the image, then another approach may be better.
Even in that case, I think a parallel reduction would be a typical approach: two per axis, each considering only the zero elements on its half of the axis.
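As a sketch of how compactly that kind of reduction can be written, here is a Thrust version over the whole array (all names invented); restricting it to one quadrant or one axis half is just a matter of changing which indices feed the functor.

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>
#include <climits>

// Maps a linear index to its squared distance from the centre if that element
// is zero, or to INT_MAX otherwise; the reduction then takes the minimum.
struct ZeroDistance
{
    const float* grid;
    int width, cx, cy;

    __host__ __device__ int operator()(int idx) const
    {
        if (grid[idx] != 0.0f) return INT_MAX;
        int x = idx % width, y = idx / width;
        return (x - cx) * (x - cx) + (y - cy) * (y - cy);
    }
};

int nearestZeroSquaredDistance(const thrust::device_vector<float>& grid,
                               int width, int height)
{
    ZeroDistance op{thrust::raw_pointer_cast(grid.data()), width, width / 2, height / 2};
    return thrust::transform_reduce(thrust::device,
                                    thrust::counting_iterator<int>(0),
                                    thrust::counting_iterator<int>(width * height),
                                    op, INT_MAX, thrust::minimum<int>());
}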

Number of polygons in a 3D object and the rendering workload?

Is there any relation (preferably an equation) between the number of polygons in a 3D object and the rendering workload? I want to see how much the rendering workload would increase if, for instance, the number of polygons doubled.
There is no clear connection between the arbitrary number of polygons and the mythical "workload".
See the following samples:
You render a cube with 6 faces composed of 12 triangles. You get, say, 1000 fps (without vsync). When you tessellate the cube into 120 triangles, most likely the fps counter remains at 1000.
You render a single fullscreen-sized quad with a heavy fragment shader doing a lot of calculation. You get 0.5 fps (or more, but I hope you get the point).
Another extreme: you are rendering a thousand similar cubes, each with a different texture. The render-state changes will take most of the time, not the actual rendering.
So, polygons may cover different screen areas and they may not be rendered within a single primitive batch. If you're talking about one big vertex array with a large number of polygons, then for certain scenarios the performance change should be roughly linear. "Roughly" because the video card and the drivers clip the invisible polygons and perform early-out tests for each pixel being rendered.
Could you define 'workload'? – Erno yesterday
Well, I mean the work of the calculations. I want to see how much the overhead (for GPU, CPU, memory, ...) would increase. Actually I want to estimate the energy usage of the device. – user1196937 2 hours ago
If the actual question is a comparison of energy usage: you will have to pick specific configurations and test them. Energy usage differs greatly from GPU to GPU and machine to machine.
Some GPU manufacturers give very detailed information on the performance of their processors, but if you want to compare them you will need an actual machine.

Best way to move sprites in OpenGL - translate or alter vertices

I am creating an app for Android using OpenGL ES. I am trying to draw, in 2D, lots of moving sprites which bounce around the screen.
Let's say I have a ball at coordinates 100,100. The ball graphic is 10px wide, therefore I can create the vertices boundingBox = {100,110,0, 110,110,0, 100,100,0, 110,100,0} and perform the following on each loop of onDrawFrame(), with the ball texture loaded.
//for each ball object
FloatBuffer ballVertexBuffer = byteBuffer.asFloatBuffer();
ballVertexBuffer.put(ball.boundingBox);
ballVertexBuffer.position(0);
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
I would then update the boundingBox array to move the balls around the screen.
Alternatively, I could leave the bounding box alone and instead glTranslatef() the ball before drawing the vertices:
gl.glVertexPointer(3, GL10.GL_FLOAT, 0, ballVertexBuffer);
gl.glPushMatrix();
gl.glTranslatef(ball.posX, ball.posY, 0);
gl.glDrawArrays(GL10.GL_TRIANGLE_STRIP, 0,4);
gl.glPopMatrix();
What would be the best thing to do in this case in terms of efficiency and best practices?
OpenGL ES (as of 2.0) does not support instancing, unfortunately. If it did, I would recommend drawing a 2-triangle sprite instanced N times, reading the x/y offsets of the center point, and possibly a scale value if you need differently sized sprites, from a vertex texture (which ES supports just fine). This would limit the amount of data you must push per frame to a minimum.
Assuming you can't do the simulation directly on the GPU (thus avoiding uploading the vertex data each frame)... this basically leaves you with only one efficient option:
Generate 2 VBOs, and map and fill one while the other is used as the source of the draw call. You can also do this seemingly with a single buffer if you call glBufferData(..., 0) in between, which tells OpenGL to allocate new storage and throw the old one away as soon as it's done reading from it.
Streaming vertices every frame may not be super fast, but this does not matter as long as the latency can be well hidden (e.g. by drawing from one buffer while filling the other). Few draw calls, few state changes, and ideally no stalls should still make this fast.
Draw calls are much more expensive than altering the data. Also, glTranslatef is not nearly as efficient as just adding a few numbers; after all, it has to go through a full 4×4 matrix multiplication, which is 64 scalar multiplies and 48 scalar additions.
Of course the best method is using some form of instancing.

Why are GPU threads in CUDA and OpenCL allocated in a grid?

I'm just learning OpenCL, and I'm at the point of trying to launch a kernel. Why are GPU threads managed in a grid?
I'm going to read more about this in detail, but a simple explanation would be nice. Is it always like this when working with GPGPUs?
This is a common approach, which is used in CUDA, OpenCL and, I think, ATI Stream.
The idea behind the grid is to provide a simple, but flexible, mapping between the data being processed and the threads doing the data processing. In the simple version of the GPGPU execution model, one GPU thread is "allocated" for each output element in a 1D, 2D or 3D grid of data. To process this output element, the thread will read one (or more) elements from the corresponding location or adjacent locations in the input data grid(s). By organizing the threads in a grid, it's easier for the threads to figure out which input data elements to read and where to store the output data elements.
This contrasts with the common multi-core CPU threading model, where one thread is allocated per CPU core and each thread processes many input and output elements (e.g. 1/4 of the data in a quad-core system).
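To make the grid-to-data mapping above concrete, here is a tiny CUDA example (arbitrary names): the two nested loops you would write on the CPU become a 2D grid of threads, and each thread recovers its (row, col) from the built-in block and thread indices.

// Each thread handles exactly one output element of a width x height grid.
__global__ void add2D(const float* a, const float* b, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // inner loop index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // outer loop index
    if (col < width && row < height) {
        int i = row * width + col;                     // same indexing a CPU loop would use
        out[i] = a[i] + b[i];
    }
}

// Host side: one 16x16 block of threads per 16x16 tile of the data.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// add2D<<<grid, block>>>(a, b, out, width, height);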
The simple answer is that GPUs are designed to process images and textures that are 2D grids of pixels. When you render a triangle in DirectX or OpenGL, the hardware rasterizes it into a grid of pixels.
I will invoke the classic analogy of putting a square peg in a round hole. Well, in this case the GPU is a very square hole and not as well rounded as GP (general purpose) would suggest.
The above explanations put forward the idea of 2D textures, etc. The architecture of the GPU is such that all processing is done in streams, with the pipeline being identical in each stream, so the data being processed needs to be segmented like that.
One reason why this is a nice API is that typically you are working with an algorithm that has several nested loops. If you have one, two or three loops then a grid of one, two or three dimensions maps nicely to the problem, giving you a thread for the value of each index.
So values that you need in your kernel (index values) are naturally expressed in the API.
