Why are GPU threads in CUDA and OpenCL allocated in a grid?

I'm just learning OpenCL, and I'm at the point of trying to launch a kernel. Why is it that GPU threads are managed in a grid?
I'm going to read more about this in detail, but a simple explanation would be nice. Is it always like this when working with GPGPUs?

This is a common approach, used in CUDA, OpenCL and, I think, ATI Stream.
The idea behind the grid is to provide a simple, but flexible, mapping between the data being processed and the threads doing the data processing. In the simple version of the GPGPU execution model, one GPU thread is "allocated" for each output element in a 1D, 2D or 3D grid of data. To process this output element, the thread will read one (or more) elements from the corresponding location or adjacent locations in the input data grid(s). By organizing the threads in a grid, it's easier for the threads to figure out which input data elements to read and where to store the output data elements.
This contrasts with the common multi-core, CPU threading model where one thread is allocated per CPU core and each thread processes many input and output elements (e.g. 1/4 of the data in a quad-core system).
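For instance, here is a minimal CUDA sketch of that mapping (the kernel name, array names and sizes are invented for illustration; in OpenCL, get_global_id() plays the role of the blockIdx/blockDim/threadIdx arithmetic). Each thread derives its own (x, y) position from the grid, reads the matching input element plus its four neighbours, and writes exactly one output element:

// one thread per output element of a 2D data grid (illustrative names)
__global__ void blur4(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the data grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the data grid
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
        return;                                      // guard the edges (border skipped for brevity)
    int i = y * width + x;
    out[i] = 0.2f * (in[i] + in[i - 1] + in[i + 1] + in[i - width] + in[i + width]);
}

// launch: pick a block shape, then enough blocks to cover the whole data grid
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// blur4<<<grid, block>>>(d_in, d_out, width, height);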

The simple answer is that GPUs are designed to process images and textures that are 2D grids of pixels. When you render a triangle in DirectX or OpenGL, the hardware rasterizes it into a grid of pixels.

I will invoke the classic analogy of putting a square peg in a round hole. Well, in this case the GPU is a very square hole and not as well rounded as GP (general purpose) would suggest.
The above explanations put forward the idea of 2D textures, etc. The architecture of the GPU is such that all processing is done in streams, with the pipeline being identical in each stream, so the data being processed needs to be segmented accordingly.

One reason why this is a nice API is that typically you are working with an algorithm that has several nested loops. If you have one, two or three loops then a grid of one, two or three dimensions maps nicely to the problem, giving you a thread for the value of each index.
So values that you need in your kernel (index values) are naturally expressed in the API.
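As a sketch of that correspondence (names invented here), a doubly nested loop such as

// for (int i = 0; i < n; i++)
//     for (int j = 0; j < m; j++)
//         c[i * m + j] = a[i * m + j] + b[i * m + j];

turns into a 2D kernel in which each thread handles one (i, j) pair and the loop indices are simply read off the grid:

__global__ void add2d(const float *a, const float *b, float *c, int n, int m)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // former outer-loop index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // former inner-loop index
    if (i < n && j < m)
        c[i * m + j] = a[i * m + j] + b[i * m + j];
}

A third nested loop would map to the z dimension of the grid in the same way.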

Related

Why is z-buffering fast?

When I made my rasterizer, I realized that each pixel needed to be tested against all the triangles in the model to determine its depth value. But if there are, for example, a million of these triangles, does that mean each individual GPU core must compare against a million triangles? That would take an incredibly long time, so I would like to know how this problem is avoided. I heard that this is done in hardware, but I did not understand by what principle.
Depth sorting needs to sort all triangles by perpendicular distance to the camera, and even split intersecting triangles, in order to work correctly. That is a huge amount of work, scaling roughly as O(n·log(n)) with the number of entities rendered, but it does not need much additional memory (unless there are too many splits). That is why it was used in the past, when memory was scarce and CPUs were slow, so there were only a few entities to render, which kept it fast enough. Also, in some edge cases the depth sorting can be replaced by simple O(1) back-face culling (simple scenes with a single convex object, or with non-intersecting polygons too far from each other to block each other's view).
Nowadays the situation is different: we have very complex scenes with lots of entities, fast CPUs and GPUs, and plenty of memory, so depth buffering is used instead. It is O(1) per pixel and pixel perfect, but it needs a shadow screen buffer holding the depths, which can be a large chunk of memory. The rendering is done like this:
Clear the depth buffer with the most distant value.
This is the slowest operation, but it is done only once per frame and it is just memory filling. It is usually done like this:
for (y=0;y<y_resolution;y++)
 for (x=0;x<x_resolution;x++)
    {
    depth[y][x]=z_max;            // most distant depth
    color[y][x]=background_color; // clear color
    }
In case the buffers are stored as linear arrays, you can use memset or even DMA on some platforms for this.
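For example (a sketch assuming 32-bit color values and float depths stored as flat arrays, with names matching the snippet above): memset from <cstring> can clear the color buffer to black directly, while the depth clear stays a plain fill loop because memset only writes a repeated byte:

memset(color, 0, x_resolution * y_resolution * sizeof(unsigned int)); // black background
for (int i = 0; i < x_resolution * y_resolution; i++)
    depth[i] = z_max;                                                  // most distant depth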
Add a condition to pixel rendering and also store the rendered depth,
to skip pixels if something has already been rendered in front of them, like this:
void pixel(int x,int y,int z,int col)
    {
    if (depth[y][x]>z) // only render if this fragment is closer than what is stored
        {
        depth[y][x]=z;   // store new depth value in the buffer
        color[y][x]=col; // render pixel
        }
    }
As this is done by hardware, no branch or cache-unfriendly operation is involved.
This approach outputs two images: one holding the colors (the wanted image) and the depth buffer holding the rendered depths. So we still have 3D information, which allows additional processing/effects like ray picking, lighting effects, shadows, scattering and much more.
There are also hybrid techniques using both approaches like this:
OpenGL - How to create Order Independent transparency?

Parallelizing search in a 2D array on CUDA

I have a 500 x 500 2D array of floats. I wish to search in the vertical and horizontal directions from the middle of the array for the first zero element in both directions. The output should be 4 indices for the first zero element in the North, South, East and West directions. Is there a way to parallelize this search operation in CUDA?
Thanks.
(This answer assumes that you are not searching entire quadrants, but only the straight lines in each direction)
1. In case the array is in CPU memory
In fact, you have a search space of just 1,000 elements. The overhead of copying the data, launching the kernel and waiting for the result is such that it is not worth your trouble.
Do it on the CPU. One of your axes already has the data nicely laid out, consecutively; it's probably best to work on that axis first. The other axis will be a bitch in terms of memory access, but that's life. You could go multi-threaded here, but I'm not sure it's worth your trouble for so little work. If you did, each thread would work on its own element.
As far as the algorithm goes, since your data isn't sorted, it's basically a linear search (up to vectorization). If you've gone multi-threaded, perhaps use a shared variable which a thread occasionally polls to see if a "closer-to-the-center" thread has found a zero yet; when a thread finds a zero, it updates that variable to let other threads know to stop working.
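For example, the East search is just a scan outward along the middle row (a hypothetical helper, not taken from the question's code; the array is assumed to be stored row-major):

// scan from the centre of the middle row towards the right edge;
// returns the column of the first zero, or -1 if there is none
int first_zero_east(const float *data, int width, int height)
{
    int row = height / 2;
    for (int col = width / 2 + 1; col < width; col++)
        if (data[row * width + col] == 0.0f)
            return col;
    return -1;
}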
2. In case the array is in GPU global memory
Now you get lots of (CUDA) 'threads'. So, it makes less sense to use an atomic variable, or polling etc.
We treat each of the four directions separately (although it doesn't have to be 4 separate kernels).
As @RobertCrovella notes, you can treat this problem as a parallel reduction, with each thread assigned an input element: initially, each thread holds a value of infinity (if its corresponding element is non-zero), or its distance from the center if its corresponding array value is 0. Now, the reduction operator is "minimum".
This is not entirely optimal, because when warp or block results are collected (as part of a parallel reduction), this problem allows for short-circuiting as soon as the lowest non-infinity value is located. You can read up on how parallel reduction is implemented, but I really wouldn't bother, because you have a very small amount of computational work here.
Note: It is also possible that your array is in GPU array memory. In that case you would get better locality in both dimensions.
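Here is a sketch of that reduction for one direction (names invented; result is initialised to INT_MAX on the host, blockDim.x is assumed to be a power of two, and the per-block minima are combined with one atomicMin per block rather than a second kernel pass):

#include <climits>

// each thread maps one element of the East half of the middle row to a
// "distance from the centre" key (INT_MAX if the element is non-zero),
// the block reduces the keys to a minimum in shared memory, and block
// results are combined through atomicMin.
__global__ void first_zero_east_kernel(const float *data, int width, int height, int *result)
{
    extern __shared__ int s[];                          // blockDim.x ints of dynamic shared memory
    int d = blockIdx.x * blockDim.x + threadIdx.x + 1;  // distance from the centre, >= 1
    int col = width / 2 + d;
    int key = INT_MAX;                                  // "no zero here"
    if (col < width && data[(height / 2) * width + col] == 0.0f)
        key = d;
    s[threadIdx.x] = key;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)   // tree reduction to the block minimum
    {
        if (threadIdx.x < stride)
            s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicMin(result, s[0]);                        // combine per-block minima
}

// launch, e.g.: first_zero_east_kernel<<<blocks, 256, 256 * sizeof(int)>>>(d_data, 500, 500, d_result);

INT_MAX left in result after the kernel means there is no zero on that half-axis; the other three directions only change the indexing.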
It's not really clear how you define "first zero element in the North, South, East and West directions" but I could imagine a rectangular data set broken into 4 quadrants along the diagonals.
We could label the top region the "north region" and we could label the other regions similarly.
With that assumption, in the worst case you have to check every element of the array.
Therefore one possible approach is a parallel reduction.
You would then do a parallel reduction on each region, such that the distance from the center (using the standard distance formula) is minimized, considering the zero elements in the region.
If you are actually only interested in the elements associated with the vertical axis and horizontal axis that pass through the center of the image, then another approach may be better.
Even in that case, I think a parallel reduction would be a typical approach, two for each axis, considering only the zero elements on the axis half.

fixed function vs shader based

I'm a beginner to computer graphics and am trying to get a better understanding. My professor has discussed fixed function pipeline and shader based programming. How do these two compare to each other? What's the difference?
The fixed-function pipeline is as the name suggests: the functionality is fixed. So someone wrote a list of different ways you'd be permitted to transform and rasterise geometry, and that's everything available. In broad terms, you can do linear transformations and then rasterise by texturing, by interpolating a colour across a face, or by combinations and permutations of those things. But more than that, the fixed pipeline enshrines certain deficiencies.
For example, it was obvious at the time of design that there wasn't going to be enough power to compute lighting per pixel. So lighting is computed at vertices and linearly interpolated across the face.
There were some intermediate extensions related to specific effects — dot3 plus cubemaps for per-pixel lighting from a single source, for example — but the programmable pipeline lets you do whatever you want at each stage, giving you complete flexibility.
In the first place that allowed better lighting, then better general special effects (ripples on reflective water, imperfect glass, etc), and more recently has been used for things like deferred rendering that flip the pipeline on its end.
All support for the fixed-functionality pipeline is implemented by programming the programmable pipeline on hardware of the last decade or so. The programmable pipeline is an advance on its predecessor, afforded by hardware improvements.
Graphics Processing Units started off very simply with fixed functions that allowed for quick 3D maths (much faster than CPU maths), texture lookup, and some simple lighting and shading options (flat, Phong, etc.).
These were very basic but allowed the CPU to offload the very repetitive tasks of 3D rendering to the GPU. Once graphics was taken away from the CPU and given to the GPU, games made a massive leap forward.
It wasn't long before the fixed functions needed to be replaced by assembly programs, and soon there was demand for doing more than the simple shading, basic reflections, and single texture maps offered by the fixed-function GPUs.
So the second breed of GPU was created. This had two distinct pipelines: one that processed vertex programs and moved vertices around in 3D space, and one that ran shader programs on pixels, allowing multiple textures to be merged and more lights and shades to be created.
Now, in the latest form of GPU, all the pipes in the card are generic and can run any type of GPU assembler code. This has increased the number of uses for the pipes: they still do vertex mapping and pixel color calculation, but they also run geometry shaders (tessellation) and even compute shaders (where the parallel processor is used to do a non-graphics job).
So fixed function is limited but easy, and is now in the past for all but the most limited devices. Programmable shaders using OpenGL (GLSL) or DirectX (HLSL) are the de facto standard for modern GPUs.
Essentially, the fixed-function pipeline is a hardwired implementation of a, well, fixed program, through which each piece of data a GPU processes traverses, without the ability to change the details of any step. The only things you can parameterize are the occasional branch to switch between hardcoded paths in the program (like enabling or disabling lighting, or using a separate specular color) and some of the constants used (light colors and positions, texture environment base color modulation). And each and every step follows a specific formula.
In a programmable pipeline, however, the GPU is a clean slate. It's completely up to the programmer how the various stages of the rendering process (vertex transformation, tessellation, fragment processing) are carried out. And you can use whatever formula you see fit for the task.
Fixed-function pipeline GPUs have exactly one illumination mode: a Lambertian illumination model, implemented using Gouraud or Phong shading. There were a few tricks to slightly alter the illumination model, for example to make it anisotropic, but you had to somehow outsmart (or outdumb, to be honest) the GPU for this. With a programmable pipeline you simply do what you wanted to do in the first place.

Number of polygons in a 3D object and the rendering workload?

Is there any relation (preferably an equation) between the number of polygons in a 3D object and the rendering workload? I want to see how much the rendering workload would be increased if for instance the number of polygons doubles.
There is no clear connection between the arbitrary number of polygons and the mythical "workload".
See the following samples:
You render a cube with 6 faces composed of 12 triangles. You get, say, 1000 fps (without vsync). When you tessellate the cube into 120 triangles, most likely the fps counter remains at 1000.
You render a single fullscreen-sized quad with a heavy fragment shader with a lot of calculation. You get 0.5fps (or more, but I hope you get the point).
Another extreme: you are rendering a thousand similar cubes, each with a different texture. The rendering state changes will take most of the time, not the actual rendering.
So, polygons may have different screen areas and they may not all be rendered within a single primitive. If you're talking about one big vertex array with a large number of polygons, then for certain scenarios the performance change should be roughly linear. "Roughly", because the video card and the drivers clip the invisible polygons and perform early-out tests for each pixel being rendered.
Could you define 'workload'? – Erno
Well, I mean the computational work. I want to see how much the overhead (for GPU, CPU, memory, ...) would increase. Actually, I want to work out the energy usage of the device. – user1196937
If that is the actual question, namely a comparison of energy usage:
You will have to pick specific configurations and test those. Energy usage is very different from GPU to GPU and machine to machine.
Some GPU manufacturers give very detailed information on the performance of their processors, but when you want to compare them you will need an actual machine.

Signal processing: FFT overlap processing resources

Are there any good (if possible scientific) resources available (web or books) about overlap processing? I am not that interested in the effects of using overlap processing and windows when analyzing a signal, since the requirements are different. It is more about the following real-time situation (I am currently dealing with audio signals):
Dividing a signal into smaller parts.
Creating overlap windows.
FFTing the windowed chunks.
Doing the processing in the frequency domain.
IFFTing the results.
Putting the chunks back together into a continuous stream.
I am especially interested in the influence of the window used on the resulting error as well as the effect of the overlap length. However I couldn't find any good resources that deal with the subject in detail. Any suggestions?
Edit:
After some discussions if using a window function is appropriate, I found a decent handout explaining the overlap and add/save method. http://www.ece.tamu.edu/~deepa/ecen448/handouts/08c/10_Overlap_Save_Add_handouts.pdf
However, after doing some tests, I noticed that the windowed version performed more accurately in most cases than the overlap add/save method. Could anybody confirm this?
I don't want to jump to any conclusions regarding computation time though....
Edit2:
Here are some graphs from my tests:
I created a signal which consists of three cosine waves.
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which is also symmetric for real input signals.)
The output of the IFFT looks like this: it can be seen that low frequencies are attenuated more than frequencies in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like this (samples 490 to 540).
It can be seen that the overlap add/save results differ from the windowed version at the point where chunks are joined (sample 511). This is the error which leads to different results when comparing the windowed process and overlap add/save. The windowed process is closer to the signal processed in one big chunk.
However, I have no idea why these differences are there, or whether they should be there at all.
This is a fairly well-known area of signal processing, and generally speaking, if you are doing processing along the lines of FFT -> spectral processing -> IFFT, you need to use the "overlap and add" approach. Cross-correlation of two inputs is a classic example, done much more easily in the spectral domain than the time domain.
Here's a short paper I found right away via Google (I just searched for "fft overlap and add"): http://www.coe.montana.edu/ee/rmaher/ee477/ee477_fftlab_sp07.pdf
I would recommend you invest in a good Signal Processing book, such as the classic Rabiner & Gold "Theory and application of digital signal processing" (Prentice-Hall ISBN 0-13-914101-4). That should cover the concept of overlap-and-add processing.
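To make the block bookkeeping concrete, here is a sketch of overlap-add filtering (names invented; the per-block filtering is written as a direct convolution purely to keep the example self-contained, whereas in practice that inner step is FFT, multiply by the filter's frequency response, then IFFT, with an FFT size of at least block_len + filter_len - 1):

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> overlap_add_filter(const std::vector<float> &x,
                                      const std::vector<float> &h,
                                      std::size_t block_len)
{
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t start = 0; start < x.size(); start += block_len)
    {
        std::size_t len = std::min(block_len, x.size() - start);
        // filter one block; its output is len + h.size() - 1 samples long,
        // so the tail spills over and is added into the next block's region
        for (std::size_t n = 0; n < len; n++)
            for (std::size_t k = 0; k < h.size(); k++)
                y[start + n + k] += x[start + n] * h[k];
    }
    return y;
}

Because each block's convolution tail is added into the start of the next block's output, the sum of the block outputs equals the convolution of the whole signal, which is why no analysis window is needed for plain filtering.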
When using an FFT for overlap-add or overlap-save fast convolution filtering, normally you don't want to use a windowing function. The circular windowing artifacts cancel out when combining successive FFT frames in canonical overlap add/save filtering.
ADDED:
If you do use a non-rectangular window, you might want to make sure that all the overlapped frames of windows sum to DC, otherwise your resulting filtered signal will have amplitude scalloping. Rectangular windows and raised-cosine (von Hann) windows will sum to DC if the overlap amount is an exact submultiple of the window width (except, of course, at the very start and end of the overlap sequence).
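A quick way to check that condition for a given window and hop (a small self-contained sketch, assuming the periodic von Hann window at 50% overlap) is to sum the shifted windows and look at the interior of the result:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 256, hop = N / 2, frames = 8;
    const double pi = std::acos(-1.0);
    std::vector<double> acc(N + (frames - 1) * hop, 0.0);
    for (int f = 0; f < frames; f++)
        for (int n = 0; n < N; n++)
            acc[f * hop + n] += 0.5 * (1.0 - std::cos(2.0 * pi * n / N)); // periodic Hann window
    double lo = 1e9, hi = -1e9;                       // range of the overlapped sum away from the ends
    for (std::size_t i = hop; i + hop < acc.size(); i++)
    {
        lo = std::min(lo, acc[i]);
        hi = std::max(hi, acc[i]);
    }
    std::printf("interior window sum ranges from %f to %f\n", lo, hi);
    return 0;
}

With this window and hop both numbers print as 1.000000; a hop that is not an exact submultiple of the window width generally makes the interior sum ripple, which is the amplitude scalloping mentioned above.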
I have been playing with this attempting to answer the question for myself as to why one would use a window. My only references to a synthesis window are this:
https://ccrma.stanford.edu/~jos/sasp/Inverse_FFT_Synthesis.html
http://recherche.ircam.fr/anasyn/roebel/amt_audiosignale/VL2.pdf
http://www.dspdimension.com/tutorials/
Stephan Bernsee has some good overview information. His smbpitchshift code uses a synthesis window: he uses the raised cosine on the input block, then applies it again on the output block. I believe this is necessary because the pitch-shifting algorithm is not a linear filtering operation, so there may be discontinuous artifacts at the window boundaries, and the synthesis window is used to create a smooth transition between frames.
I think the reason there is not much information specifically addressing windowing for frequency-domain real-time convolution is that it doesn't have a practical application unless you also need to do some analysis (i.e., an adaptive filter of some sort), in which case the topics related to spectral spreading become interesting again.
I have plotted the outputs of a filtered signal using both a raised-cosine window and the overlap-add method, and the end result is an identical IR and identical signals. This comes as no surprise, since the same operations performed in the time domain yield the same results.
On the other hand, if I implement a broken filter kernel, a smooth windowing function can help mask artifacts. This in a sense windows the broken IR so there is a more cohesive transition between frames. It would still be better to have an IR that is limited to a length of nfft/2 in the time domain. If you need a filter response with an IR longer than nfft/2, then you should consider either using a larger FFT size (if latency is not a problem) or a partitioned convolution scheme:
http://pcfarina.eng.unipr.it/Public/Papers/164-Mohonk2001.PDF
or
http://www.music.miami.edu/programs/mue/Research/jvandekieft/jvchapter2.htm
I hope that is helpful to somebody reading this.
I hope those links help, even though they don't directly address windowing as used in real-time frequency-domain filtering.
