Understanding OpenCL paradigm - graphics

OpenCL abstracts the hardware with device objects. Each device object is made up of compute units, and each compute unit has a certain number of processing elements.
How are these three concepts mapped to physical hardware?
Let's take a graphics card as an example.
This is my guess:
device -> graphics card
compute units -> graphics card cores
processing elements -> single lanes of the vector ALUs inside the graphics card cores (stream cores)
What I read in various OpenCL tutorials is that we should divide our problem data into a 1-, 2- or 3-dimensional space and then assign a piece of that n-dimensional data to each work-group. A work-group is then executed on the stream cores inside a single compute unit.
If my graphics card has 4 compute units, does this mean that I can have at most 4 work-groups?
Every compute unit in my graphics card has 48 stream cores. Again, does this mean I should create work-groups with at most 48 work-items? Multiples of 48?
I guess OpenCL has some kind of scheduler that lets us use far more work-groups and threads than the available hardware resources, but I think the real parallelism happens as I stated above.
Have I got the OpenCL paradigm right?
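For reference, the limits discussed above can be queried at runtime through the standard OpenCL host API. A minimal sketch (error handling omitted, first GPU device assumed):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint computeUnits;
        size_t maxWorkGroupSize;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Number of compute units and the largest allowed work-group size. */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

        printf("compute units:       %u\n", computeUnits);
        printf("max work-group size: %zu\n", maxWorkGroupSize);
        return 0;
    }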

Related

Texture Memory Usage

I am trying to find out how much texture memory is consumed by my application. These are the texture types I use and my calculations:
RGB textures -> textureWidth * textureHeight * 3 (bytes)
RGBA textures -> textureWidth * textureHeight * 4 (bytes)
As a result, I am wondering: can the graphics driver allocate much more memory than calculated above?
A few simple answers:
To the best of my knowledge, it's been around two decades since (the majority of) hardware devices supported packed 24-bit RGB data. In modern hardware this is usually represented in an "XRGB" (or equivalent) format with one padding byte per pixel. It is painful in hardware to efficiently handle pixels that straddle cache lines etc. Further, since many applications (read "games") use texture compression, supporting fully packed 24-bit data seems somewhat redundant.
Texture dimensions: if a texture's dimensions are not 'nice' for the particular hardware (e.g., the width is not a multiple of, say, 16 bytes, or of 4x4 or 8x8 blocks), then the driver may pad the physical size of the texture.
Finally, if you have MIP mapping (and you do want this for performance as well as quality reasons), it will expand the texture size by around 33%.
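Putting those three effects together, a rough upper-bound estimate could look like the sketch below. The 4 bytes/pixel (XRGB) and 64-byte row alignment are illustrative assumptions; the real rules are hardware specific.

    #include <stddef.h>

    static size_t align_up(size_t v, size_t a) { return (v + a - 1) / a * a; }

    /* Rough estimate: assumes RGB is stored as XRGB (4 bytes/pixel), rows
       padded to an assumed 64-byte pitch, plus a full MIP chain on request. */
    size_t estimate_texture_bytes(size_t width, size_t height, int mipmapped)
    {
        size_t total = 0;
        for (;;) {
            size_t pitch = align_up(width * 4, 64);   /* assumed alignment */
            total += pitch * height;
            if (!mipmapped || (width == 1 && height == 1))
                break;
            if (width  > 1) width  /= 2;
            if (height > 1) height /= 2;
        }
        return total;
    }

With MIP mapping enabled, the result comes out roughly a third larger than the single-level size, matching the ~33% figure above.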
In addition to the answer from Simon F, it's also worth noting that badly written applications can force the driver to allocate memory for multiple copies of the same texture. This can occur if it attempts to modify the texture while it is still referenced by an in-flight rendering operation. This is commonly known as "resource copy-on-write" or "resource ghosting".
This blog post explains it in more detail:
https://community.arm.com/developer/tools-software/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources
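To illustrate the ghosting scenario, the problematic pattern looks roughly like this (OpenGL used as an example API; tex, vertexCount, width, height and newPixels are assumed to be set up elsewhere):

    /* The draw below may still be in flight when the update arrives, so the
       driver must "ghost" the texture: keep the old copy for the pending
       draw and allocate a second copy for the new contents. */
    glBindTexture(GL_TEXTURE_2D, tex);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, newPixels);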

What exactly is a GPU binning pass

While reading the VideoCoreIV-AG100-R spec of the Broadcom VC4 chip, I came across a paragraph that says:
All rendering by the 3D system is in tiles, requiring separate binning and rendering passes to render a frame. In normal operation the host processor creates a control list in memory defining all the operations and supplying all the data for rendering for a complete frame.
It says that rendering a frame requires a binning pass and a rendering pass. Could anybody explain in detail how exactly those two passes play their roles in the graphics pipeline? Thanks a lot.
For a tile-based rendering architecture, the passes are:
Binning pass - generates a stream/map between the frame's tiles and the geometry that should be rendered into each particular tile
Rendering pass - takes the map between tiles and geometry and renders the appropriate pixels per tile.
In mobile GPUs, due to many limitations compared to desktop GPUs (such as memory bandwidth, because in mobile devices memory is shared between the GPU and CPU, etc.), vendors use approaches that split the work into small pieces to decrease overall memory-bandwidth consumption - for example, tile-based rendering - to achieve efficient utilization of all available resources and acceptable performance.
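Conceptually, the two passes can be sketched in software like this (a toy illustration, assuming hypothetical Triangle/BBox/TileList types and boundingBox/tileAppend/renderTile helpers; real hardware builds control lists instead):

    #define TILE_W  32
    #define TILE_H  32
    #define TILES_X 40   /* e.g., 1280/32 */
    #define TILES_Y 23   /* e.g., ~720/32 */

    void renderFrame(Triangle *tris, int numTris,
                     TileList tiles[TILES_Y][TILES_X])
    {
        /* Binning pass: record, per tile, which triangles touch it. */
        for (int t = 0; t < numTris; t++) {
            BBox b = boundingBox(&tris[t]);          /* hypothetical helper */
            for (int ty = b.y0 / TILE_H; ty <= b.y1 / TILE_H; ty++)
                for (int tx = b.x0 / TILE_W; tx <= b.x1 / TILE_W; tx++)
                    tileAppend(&tiles[ty][tx], t);   /* hypothetical helper */
        }

        /* Rendering pass: rasterize one tile at a time, so that tile's
           pixels stay in fast on-chip memory until written out once. */
        for (int ty = 0; ty < TILES_Y; ty++)
            for (int tx = 0; tx < TILES_X; tx++)
                renderTile(&tiles[ty][tx], tris);    /* hypothetical helper */
    }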
Details
The tile-based rendering approach is described on many GPU vendors' sites, such as:
A look at the PowerVR graphics architecture: Tile-based rendering
GPU Framebuffer Memory: Understanding Tiling

How to calculate the max number of threads that my kernel runs on in OpenCL

When I read the device info from an OpenCL device, how can I calculate how good its processing capability is?
To add more information, assume that I want to do a very simple task on the pixels of an image. As far as I know (which may not be right!), when I run my kernel on a GPU, OpenCL runs it in parallel on the GPU's different processing units, and I can think of the kernel as the thread body that runs in parallel.
If this is correct, then for my simple task I need to find the device with the most processing units, so my kernel runs on them and hence finishes faster. Am I wrong?
How do I find a suitable device based on its processing power?
Counting the number of processors in an OpenCL device is not sufficient to know how it will perform, for many reasons:
Different processors can have very different frequencies (in MHz/GHz)
Different processors can have very different architectures, e.g. out-of-order execution, superscalar designs, functions implemented in hardware
Different OpenCL devices have different types of memory available to them, which can affect the overall performance to a large extent
OpenCL devices could be integrated with the main CPU, sit on a discrete peripheral board, or live across a network. The latency and the need to synchronize or copy memory will affect the performance.
Different algorithms favor different architectures, so while one device may be faster than another for one algorithm, the same may not be true for a different algorithm.
I don't recommend using the number of processors as a measure of performance. The best way is to benchmark with a specific algorithm.
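If the algorithm already exists as an OpenCL kernel, the built-in event profiling makes such a benchmark straightforward. A minimal sketch (assumes queue was created with CL_QUEUE_PROFILING_ENABLE and that kernel and globalSize are set up elsewhere):

    cl_event ev;
    cl_ulong start, end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize,
                           NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    /* Timestamps are reported in nanoseconds. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
    clReleaseEvent(ev);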

Thermal-aware scheduler in Linux

Currently I'm working on making a temperature-aware version of Linux for my university project. Right now I have to create a temperature-aware scheduler that takes processor temperature into account when making scheduling decisions. Is there any generalized way to get the temperature of the processor cores, or can I integrate the coretemp driver with the Linux kernel in some way? (I didn't find a way to do so on the internet.)
lm-sensors simply uses device files exported by the kernel for the CPU temperature; you can read whatever backing variables those device files expose in the kernel to get the temperature information. As for the scheduler, I would not write one from scratch: start with the kernel's CFS implementation and, in your case, modify the load-balancer check to include temperature (currently it uses a metric that is the calculated cost of moving a task from one core to another in terms of cache effects, etc. - I'm not sure whether you want to keep that or not).
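From user space, the coretemp driver exposes those readings through the hwmon sysfs interface, in millidegrees Celsius. A minimal sketch (the hwmon index varies between machines; hwmon0 here is just an example path):

    #include <stdio.h>

    /* Returns the temperature in millidegrees Celsius, or -1 on error. */
    int read_cpu_temp_millic(void)
    {
        int millideg = -1;
        FILE *f = fopen("/sys/class/hwmon/hwmon0/temp1_input", "r");
        if (f) {
            fscanf(f, "%d", &millideg);
            fclose(f);
        }
        return millideg;
    }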
Temperature control is very difficult. The difficulty is with thermal capacity and conductance. It is quite easy to read a temperature; how you control it will depend on the system model. A Kalman filter or some higher-order filter will be helpful. You don't know:
Sources of heat.
Distance from sensors.
Number of sensors.
Control elements, like a fan.
If you only measure at the CPU itself, the hard drive could have overheated 10 minutes ago, but the heat is only arriving at the CPU now. Throttling the CPU at that instant is not going to help. Only by getting a good thermal model of the system can you control the heat. Yet you say you don't really know anything about the system? I don't see how a scheduler by itself can do this.
I have worked on a mobile freezer application where operators would load pallets of ice cream, etc. from a freezer onto a truck. Very small distances between sensors and control elements can create havoc with a control system. Also, you want your ambient temperature to be read instantly if possible. There is a lot of lag in temperature control. A small distance could delay a reading by 5-15 minutes (i.e., it takes 5-15 minutes for heat to travel 1 cm).
I don't see the utility of what you are proposing. If you want this for a PC, then video cards, hard drives, power supplies, sound cards, etc. can create as much heat as the CPU. You cannot generically model a PC; maybe you could with an Apple product. I don't think you will have a lot of success, but you will learn a lot from trying!

How to quantify the processing tradeoffs of CUDA devices for C kernels?

I recently upgraded from a GTX480 to a GTX680 in the hope that the tripled number of cores would manifest as significant performance gains in my CUDA code. To my horror, I've discovered that my memory intensive CUDA kernels run 30%-50% slower on the GTX680.
I realize that this is not strictly a programming question, but it does directly affect the performance of CUDA kernels on different devices. Can anyone provide some insight into the specifications of CUDA devices and how they can be used to deduce their performance on CUDA C kernels?
Not exactly an answer to your question, but some information that might be of help in understanding the performance of the GK104 (Kepler, GTX680) vs. the GF110 (Fermi, GTX580):
On Fermi, the cores run on double the frequency of the rest of the logic. On Kepler, they run at the same frequency. That effectively halves the number of cores on Kepler if one wants to do more of an apples to apples comparison to Fermi. So that leaves the GK104 (Kepler) with 1536 / 2 = 768 "Fermi equivalent cores", which is only 50% more than the 512 cores on the GF110 (Fermi).
Looking at the transistor counts, the GF110 has 3 billion transistors while the GK104 has 3.5 billion. So, even though the Kepler has 3 times as many cores, it only has slightly more transistors. So now, not only does the Kepler have only 50% more "Fermi equivalent cores" than Fermi, but each of those cores must be much simpler than the ones of Fermi.
So, those two issues probably explain why many projects see a slowdown when porting to Kepler.
Further, the GK104, being a version of Kepler made for graphics cards, has been tuned in such a way that cooperation between threads is slower than on Fermi (as such cooperation is not as important for graphics). Any potential performance gain, after taking the above facts into account, may be negated by this.
There is also the issue of double precision floating point performance. The version of GF110 used in Tesla cards can do double precision floating point at 1/2 the performance of single precision. When the chip is used in graphics cards, the double precision performance is artificially limited to 1/8 of single precision performance, but this is still much better than the 1/24 double precision performance of GK104.
One of the advances of the new Kepler architecture is its 1536 cores grouped into 8 SMXes of 192 cores each, but at the same time this number of cores is a big problem, because shared memory is still limited to 48 KB. So if your application needs a lot of SMX resources, you can't execute 4 warps in parallel on a single SMX. You can profile your code to find the real occupancy of your GPU. Possible ways to improve your application:
use warp vote functions instead of shared-memory communication;
increase the number of thread blocks and decrease the number of threads per block;
optimize global loads/stores. Kepler has 32 load/store units for each SMX (twice as many as Fermi).
I am installing nvieuw and I use Coolbits 2.0 to unlock the shader cores from default to max performance. Also, you must have both connectors of your device going to one display, which can be enabled in the nVidia control panel under screen 1/2 and screen 2/2. Now you must clone this screen with the other, and in the Windows resolution configuration set the screen mode to extended desktop.
With nVidia Inspector 1.9 (BIOS-level drivers), you can activate this mode by setting up a profile for the application (you need to add the application's exe file to the profile). Now you have almost double the performance (keep an eye on the temperature).
DX11 also features tessellation, so you want to override that and scale your native resolution.
Your native resolution can be achieved by rendering at something lower like 960x540 and letting the 3D pipelines do the rest to scale up to full HD (in the NV control panel under desktop size and position). Now scale the lower resolution to full screen with the display, and you have full HD with double the amount of texture size rendering on the fly, and everything should be good for rendering 3D textures with extreme LOD bias (level of detail). Your display needs to be on auto zoom!
Also, you can beat SLI-configured computers. This way I get higher scores than 3-way SLI in TessMark. High AA settings like 32x mixed sample make it all look like HD in AAA quality (in TessMark and Heaven benchmarks). There is no resolution setting in the end score, so that shows it's not important that you render at your native resolution!
This should give you some real results, so please read thoughtfully, not literally.
I think the problem may lie in the number of Streaming Multiprocessors: The GTX 480 has 15 SMs, the GTX 680 only 8.
The number of SMs is important, since at most 8/16 blocks or 1536/2048 threads (compute capability 2.0/3.0) can reside on a single SM. The resources they share, e.g. shared memory and registers, can further limit the number of blocks per SM. Also, the higher number of cores per SM on the GTX 680 can only reasonably be exploited using instruction-level parallelism, i.e. by pipelining several independent operations.
To find out the number of blocks you can run concurrently per SM, you can use nVidia's CUDA Occupancy Calculator spreadsheet. To see the amount of shared memory and registers required by your kernel, add -Xptxas -v to the nvcc command line when compiling.
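The per-device side of that calculation (SM count, shared memory, registers) can also be queried programmatically. A minimal sketch using the CUDA runtime API:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* device 0 */

        printf("SMs:                  %d\n", prop.multiProcessorCount);
        printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers per block:  %d\n", prop.regsPerBlock);
        printf("max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }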
