Interleaved vs non-interleaved vertex buffers - graphics

This seems like a question which has been answered throughout time for one IHV or another, but recently I have been trying to come to a consensus about vertex layouts and best practices for a modern renderer across all IHVs and architectures. Before someone says "benchmark", I can't easily do that, as I don't have access to a card from every IHV and every architecture from the last 5 years. Therefore, I am looking for some best practices that will work decently well across all platforms.
First, the obvious:
Separating position from other attributes is good for:
Shadow and depth pre-passes
Per-triangle culling
Tile-based deferred renderers (such as the Apple M1)
Interleaved is more logical on the CPU; you can have a Vertex class.
Non-interleaved can make some CPU calculations faster due to being able to take advantage of SIMD.
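To make the two layouts concrete, here is a minimal C++ sketch of the same vertex data in interleaved (AoS) and de-interleaved (SoA) form; the attribute set and struct names are just illustrative:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Interleaved ("array of structures"): one stream, all attributes per vertex.
struct VertexAoS {
    float position[3];
    float normal[3];
    float uv[2];
};
// Stride the GPU fetches per vertex: 8 floats * 4 bytes = 32 bytes here.
static_assert(sizeof(VertexAoS) == 32, "8 floats, no padding");

// De-interleaved ("structure of arrays"): one tightly packed stream per
// attribute. A depth-only pass can bind just `positions` and skip the rest.
struct MeshSoA {
    std::vector<std::array<float, 3>> positions;
    std::vector<std::array<float, 3>> normals;
    std::vector<std::array<float, 2>> uvs;
};
```

A depth-only or shadow pass can bind just the `positions` stream in the SoA case, which is the cross-IHV argument for at least splitting position out.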
Now onto the less obvious.
Many people quote NVIDIA as saying that you should always interleave, and moreover that you should align to 32 or 64 bytes. I have not found the source of this, but have instead found a document about vertex shader performance by NVIDIA; it is quite old (2013) and concerns the Tegra GPU, which is mobile, not desktop. In particular it says:
Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
Fast forward 3 years to GDC 2016 and EA gives a presentation which mentions several reasons why you should de-interleave the vertex buffers. However, this recommendation seems to be tied to AMD architectures, in particular GCN. While they make a cross platform case for separating the position they propose de-interleaving everything with the statement that it will allow the GPU to:
Evict cache lines as quickly as possible
And that it is optimal for GCN (AMD) architectures.
This seems to be in conflict with what I have heard elsewhere, which says to use interleaved layouts in order to make the most use of a cache line. But again, that was not in regard to AMD.
With many different IHVs (Intel, NVIDIA, AMD, and now Apple with the M1 GPU), each with many different architectures, I am left completely uncertain about what one should do today (without the budget to test on dozens of GPUs) in order to best optimize performance across all architectures without incurring
a massive resultant drop in performance
on some architectures. In particular, is de-interleaved still best on AMD? Is it no longer a problem on NVIDIA, or was it never a problem on desktop NVIDIA GPUs? What about the other IHVs?
NOTE: I am not interested in mobile, only all desktop GPUs in the past 5 years or so.

Related

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason for this is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU can only carry audio output and can't generate it? Would this mean a large increase in latency, if I were to read the data back on the CPU and output it to the soundcard? I'm looking for a latency between 10 and 20 ms in terms of the time between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could take up their own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious for example, that computing sine values for 128 samples of output from a sine source could be done completely in parallel. Working in blocks of that size would permit a latency of only about 3ms, which is acceptable in most digital audio applications. Similarly, the many other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
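As a concrete illustration of how block-based sine synthesis maps to parallel hardware, here is a CPU-side sketch in C++ (the function name and block size are my own; on a GPU the loop body would become one work-item per sample):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fill one block of sine-oscillator output. Each sample depends only on its
// index, so all 128 values could be computed by 128 independent work-items.
std::vector<float> sine_block(double freq_hz, double sample_rate,
                              std::size_t start_sample,
                              std::size_t block_size = 128) {
    std::vector<float> out(block_size);
    const double twoPi = 6.283185307179586;
    for (std::size_t i = 0; i < block_size; ++i) {  // trivially parallel loop
        double t = double(start_sample + i) / sample_rate;
        out[i] = float(std::sin(twoPi * freq_hz * t));
    }
    return out;
}
```

A 128-sample block at 44.1 kHz covers 128 / 44100 ≈ 2.9 ms, which is where the ~3 ms latency figure comes from.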
In addition to oscillators, FIR filters are simple to parallelize, and a google search turned up some promising looking research papers (which I didn't take the trouble to read) that suggest that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing and many useful audio operations can be understood as such filters.
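To show why FIR filters parallelize so cleanly, here is a minimal direct-form FIR in C++; every output sample is an independent dot product over past inputs, so the outer loop has no cross-iteration dependencies (unlike an IIR filter's feedback path):

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum over k of h[k] * x[n-k]. Each y[n] reads only
// the input x and the taps h, never a previous output, so the outer loop can
// be split across as many parallel lanes as there are output samples.
std::vector<float> fir(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {    // parallelizable across n
        for (std::size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    }
    return y;
}
```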
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on, which are fundamentally stream-based operations. These are the type of operations where a GPU might also excel.
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like, and furthermore, the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of GPU.
The effort involved in using a GPU, particularly getting data in and out, is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers, and also tablets and mobile devices, are more than enough for many digital audio applications. AMD seem to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that only targets a limited sub-set of hardware.
Typical transfer times for a few MB to/from the GPU are around 50 µs.
Delay is not your problem; however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the copying of data.
If you are going to run multiple synthesizers at once, I would recommend running each synthesizer in a work-group and parallelizing the synthesis process across the available work-items. It will not be worthwhile to have each synthesizer in one work-item, since it is unlikely you will have thousands of them.
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP for its lower initialization times.
You could check out the NESS project, which is all about physical modelling synthesis. They are using GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis, because it is so demanding of processing.

Massively parallel application: what about several 8 bits cores for non-vector IA applications?

I was thinking (oh god, it starts badly) about neural networks and how it is not possible to simulate them because they require many atomic operations at the same time (here meaning simultaneously), because that's how neurons are fast: there are many of them computing things at once.
Our processors are 32-bit, so they can compute over a significantly larger range (meaning a lot of different atomic numbers, whether floating-point or integer); meanwhile the frequency race is over, and manufacturers have started shipping multicore processors, requiring developers to implement multithreading in their applications.
I was also thinking about the most important difference between computers and brains: brains use a lot of neurons, while computers use precision at a high frequency. That's why it seems harder or impossible to simulate a real-time AI with the current processor model.
Since 32-bit/64-bit chips also take a great number of transistors, and since AI doesn't require vector/floating-point precision, would it be a good idea to have many more 8-bit cores on a single processor, like 100 or 1000 for example, since they take much less room (I don't work at Intel or AMD, so I don't know how they design their processors; it's just a wild guess), to plan for those kinds of AI simulations?
I don't think it would only serve AI research, though, because I don't know how web servers can really take advantage of 64-bit processors (strings use 8 bits); Xeon processors are only really different in their cache size.
What you describe is already available by means of multimedia instruction sets. It turns out that computer graphics needs also many parallel operations on bytes or even half-bytes. So the CPUs started growing vector operations (SSE, MMX, etc); more recently, graphic processors have opened up to general purpose computing (GPGPU).
I think you are mistaken in assuming that neural processing is not a vector operation: many AI neural networks heavily rely on vector and matrix operations.

When machine code is generated from a program, how does it translate to hardware-level operations? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Say the instruction is something like 100010101 1010101 01010101 011101010101. How does this translate to an actual job of deleting something from memory? Memory consists of actual physical transistors that HOLD data. Is it some external signal that causes them to lose that data?
I want to know how that signal is generated, and how some binary numbers change the state of a physical transistor. Is there a level beyond machine code that isn't explicitly visible to a programmer? I have heard of microcode that handles code at the hardware level, even below assembly language. But I still pretty much don't understand. Thanks!
I recommend reading the Petzold book "Code". It explains these things as best as possible without the physics/electronics knowledge.
Each bit in the memory, at a functional level, HOLDs either a zero or a one (let's not get into the exceptions; they're not relevant to the discussion). You cannot delete memory; you can only set it to zeros or ones or some combination. The arbitrary definition of deleted or erased is just that, a definition: the software that erases memory is simply telling the memory to HOLD the value defined as erased.
There are two basic types of RAM, static and dynamic, and they are as their names imply: so long as you don't remove power, static RAM will hold its value until changed. Dynamic memory is more like a rechargeable battery, and there is a lot of logic that you don't see from assembler or microcode or any software (usually) that keeps the charged batteries charged and the empty ones empty.

Think about a bunch of water glasses, each one a bit. With static memory, the glasses hold their water until emptied: no evaporation, nothing. Say glasses with water are ones and glasses without are zeros (an arbitrary definition). When your software wants to write a byte, there is a lot of logic that interprets that instruction and commands the memory to write; there is a little helper that fills up or empties the glasses when commanded, or reads the values in the glasses when commanded.

In the case of dynamic memory, the glasses have little holes in the bottom and are constantly but slowly letting the water drain out. So glasses that are holding a one have to be filled back up; the helper logic not only responds to the read and write commands but also periodically walks down the row of glasses and refills the ones. Why would you bother with unreliable memory like that? It takes twice (four times?) as many transistors for an SRAM cell as for a DRAM cell. Twice the heat/power, twice the size, twice the price; even with the added logic it is still cheaper all the way around to use DRAM for bulk memory. The bits in your processor used, say, for the registers and other things are SRAM-based, static. Bulk memory, the gigabytes of system memory, is usually DRAM, dynamic.
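The glasses analogy can be turned into a toy simulation. This is purely illustrative (the decay rate and sense threshold are invented, not real device physics), but it shows the roles of the threshold and the periodic refresh:

```cpp
#include <cstddef>
#include <vector>

// Toy model of the "leaky glasses": each DRAM cell is a charge level that
// decays over time; the refresh logic periodically reads every cell and
// rewrites a full charge for any cell that still reads as a one.
struct ToyDram {
    std::vector<double> charge;                 // 1.0 = full, 0.0 = empty
    explicit ToyDram(std::size_t bits) : charge(bits, 0.0) {}

    void write(std::size_t i, bool one) { charge[i] = one ? 1.0 : 0.0; }
    bool read(std::size_t i) const { return charge[i] > 0.5; }  // sense threshold

    void leak() {                               // one tick of charge decay
        for (double& c : charge) c *= 0.9;
    }
    void refresh() {                            // read each bit, rewrite it full
        for (std::size_t i = 0; i < charge.size(); ++i)
            write(i, read(i));
    }
};
```

Skip the `refresh()` calls for long enough and a stored one decays below the threshold and reads back as zero, which is exactly the failure the helper logic exists to prevent.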
The bulk of the work done in the processor/computer is done by electronics that implement the instruction set (or microcode, in the rare case of microcoding: the x86 families are/were microcoded, but when you look at all processor types, including the microcontrollers that drive most of the everyday items you touch, they are generally not microcoded, so most processors are not microcoded). In the same way that you need some worker to help you turn C into assembler, and assembler into machine code, there is logic to turn that machine code into commands to the various parts of the chip and peripherals outside the chip. Download either the llvm or gcc source code to get an idea of how the size of the program being compiled compares to the amount of software it takes to do that compiling. You will get an idea of how many transistors are needed to turn your 8- or 16- or 32-bit instruction into some sort of command to some hardware.
Again I recommend the Petzold book, he does an excellent job of teaching how computers work.
I also recommend writing an emulator. You have done assembler, so you understand the processor at that level; in the same assembler reference for the processor, the machine code is usually defined as well, so you can write a program that reads the bits and bytes of the machine code and actually performs the function. For an instruction mov r0,#11 you would have some variable in your emulator program for register 0, and when you see that instruction you put the value 11 in that variable and continue on. I would avoid x86; go with something simpler: PIC12, MSP430, 6502, HC11, or even the Thumb instruction set I used. My code isn't necessarily pretty in any way, closer to brute force (and still buggy, no doubt). If everyone reading this were to take the same instruction set definition and write an emulator, you would probably have as many different implementations as there are people writing emulators. Likewise for hardware: what you get depends on the team or individual implementing the design. So not only is there a lot of logic involved in parsing through and executing the machine code, that logic can and does vary from implementation to implementation. One x86 to the next might be similar to refactoring software. Or for various reasons the team may choose a do-over and start from scratch with a different implementation. Realistically it is somewhere in the middle: chunks of old logic reused, tied to new logic.
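Here is a minimal sketch of that kind of emulator in C++. The two-byte instruction encoding is entirely made up for illustration; a real project would take the encodings from the processor's reference manual:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Tiny fetch/decode/execute loop for an invented 2-byte instruction set:
// the opcode byte selects the operation and register, the second byte is
// an immediate operand.
struct TinyCpu {
    uint8_t reg[4] = {0, 0, 0, 0};
    std::size_t pc = 0;

    // Opcodes: 0x1r = mov r,#imm   0x2r = add r,#imm   0xFF = halt
    void run(const std::vector<uint8_t>& code) {
        pc = 0;
        while (pc + 1 < code.size()) {
            uint8_t op = code[pc], imm = code[pc + 1];   // fetch
            if (op == 0xFF) break;                       // halt
            uint8_t r = op & 0x0F;                       // decode register
            switch (op & 0xF0) {                         // execute
                case 0x10: reg[r] = imm; break;                   // mov r,#imm
                case 0x20: reg[r] = uint8_t(reg[r] + imm); break; // add r,#imm
            }
            pc += 2;
        }
    }
};
```

Running the bytes `{0x10, 11}` through this loop is the mov r0,#11 example: the decode step picks the register out of the opcode and the execute step stores the immediate in it.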
Microcoding is like a hybrid car. Microcode is just another instruction set, machine code, and requires lots of logic to implement/execute. What it buys you in large processors is that the microcode can be modified in the field. Not unlike a compiler: your C program may be fine, but the compiler+computer as a whole may be buggy; by putting a fix in the compiler, which is soft, you don't have to replace the computer, the hardware. If a bug can be fixed in microcode, then they will patch it in such a way that the BIOS, on boot, will reprogram the microcode in the chip, and now your programs will run fine. No transistors were created or destroyed nor wires added; just the programmable parts changed. Microcode is essentially an emulator, but an emulator that is a very, very good fit for the instruction set. Google Transmeta and the work that was going on there when Linus was working there; the microcode was a little more visible on that processor.
I think the best way to answer your question, short of explaining how transistors work, is to say: look at the amount of software/source in a compiler that takes a relatively simple programming language and converts it to assembler, or look at an emulator like qemu and how much software it takes to implement a virtual machine capable of running your program. The amount of hardware in the chips and on the motherboard is on par with this; not counting the transistors in the memories, millions to many millions of transistors are needed to implement what is usually a few hundred different instructions or less. If you write a PIC12 emulator and get a feel for the task, then ponder what a 6502 would take, then a Z80, then a 486, then think about what a quad-core Intel 64-bit might involve. The number of transistors in a processor/chip is often advertised/bragged about, so you can also get a feel from that for how much is there that you cannot see from assembler.
It may help if you start with an understanding of electronics, and work up from there (rather than from complex code down).
Let's simplify this for a moment. Imagine an electric circuit with a power source, switch and a light bulb. If you complete the circuit by closing the switch, the bulb comes on. You can think of the state of the circuit as a 1 or a 0 depending on whether it is completed (closed) or not (open).
Greatly simplified, if you replace the switch with a transistor, you can now control the state of the bulb with an electric signal from a separate circuit. The transistor accepts a 1 or a 0 and will complete or open the first circuit. If you group these kinds of simple circuits together, you can begin to create gates and start to perform logic functions.
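Grouping switches into gates can be sketched in code as well: model NAND as the primitive (a handful of transistors in real hardware) and compose every other gate from it:

```cpp
// NAND as the primitive gate; everything else built purely by composition.
inline bool NAND(bool a, bool b) { return !(a && b); }
inline bool NOT (bool a)         { return NAND(a, a); }
inline bool AND (bool a, bool b) { return NOT(NAND(a, b)); }
inline bool OR  (bool a, bool b) { return NAND(NOT(a), NOT(b)); }
inline bool XOR (bool a, bool b) { return AND(OR(a, b), NAND(a, b)); }
```

This is exactly the "group simple circuits to perform logic functions" step: once you trust NAND, every larger structure (adders, decoders, flip-flops) is just more composition.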
Memory is based on similar principles.
In essence, the power coming in the back of your computer is being broken into billions of tiny pieces by the components of the computer. The behavior and activity of such is directed by the designs and plans of the engineers who came up with the microprocessors and circuits, but ultimately it is all orchestrated by you, the programmer (or user).
Heh, good question! Kind of involved for SO though!
Actually, main memory consists of arrays of capacitors, not transistors, although cache memories may be implemented with transistor-based SRAM.
At the low level, the CPU implements one or more state machines that process the ISA, or the Instruction Set Architecture.
Look up the following circuits:
Flip-flop
Decoder
ALU
Logic gates
A series of FFs can hold the current instruction. A decoder can select a memory or register to modify, and the state machine can then generate signals (using the gates) that change the state of a FF at some address.
Now, modern memories use a decoder to select an entire line of capacitors, and then another decoder is used when reading to select one bit out of them, and the write happens by using a state machine to change one of those bits, then the entire line is written back.
It's possible to implement a CPU in a modern programmable logic device. If you start with simple circuits you can design and implement your own CPU for fun these days.
That's one big topic you are asking about :-) The topic is generally called "Computer Organization" or "Microarchitecture". You can follow this Wikipedia link to get started if you want to learn.
I don't have any knowledge beyond a very basic level about either electronics or computer science, but I have a simple theory that could answer your question; most probably the actual processes involved are very complex manifestations of this answer.
You could imagine the logic gates getting their electric signals from the keystrokes or mouse strokes you make.
A series or pattern of keys you may press may trigger particular voltage or current signals in these logic gates.
Now what value of currents or voltages will be produced in which all logic gates when you press a particular pattern of keys, is determined by the very design of these gates and circuits.
For example, if you have a programming language in which the "print(var)" command prints "var", the sequence of keys "p-r-i-n-t" would trigger a particular set of signals in a particular set of logic gates that would result in displaying "var" on your screen.
Again, what all gates are activated by your keystrokes depends on their design.
Also, typing "print(var)" on your desktop or anywhere else apart from the IDE will not yield the same results, because the software behind that IDE activates a particular set of transistors or gates which respond in an appropriate way.
This is what I think happens at the Fundamental level, and the rest is all built layer upon layer.

Programming graphics and sound on PC - Total newbie questions, and lots of them!

This isn't exactly specifically a programming question (or is it?) but I was wondering:
How are graphics and sound processed from code and output by the PC?
My guess for graphics:
There is some reserved memory space somewhere that holds exactly enough room for a frame of graphics output for your monitor.
i.e. 800 x 600 in 24-bit color mode: 800 x 600 x 3 = ~1.4 MB of memory
Between each refresh, the program writes video data to this space. This action is completed before the monitor refresh.
Assume a simple 2D game: the graphics data is stored in machine code as many bytes representing color values. Depending on what the program(s) being run instruct the PC, the processor reads the appropriate data and writes it to the memory space.
When it is time for the monitor to refresh, it reads from each memory space byte-for-byte and activates hardware depending on those values for each color element of each pixel.
All of this of course happens crazy-fast, and repeats x times a second, x being the monitor's refresh rate. I've simplified my own likely-incorrect explanation by avoiding talk of double buffering, etc
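The three-step guess above can be sketched as a CPU-side framebuffer in C++ (sizes taken from the 800x600, 24-bit example; the struct is illustrative, not how any real driver exposes video memory):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A block of memory sized for exactly one 800x600 frame at 3 bytes per
// pixel; the program fills it between refreshes, and the display hardware
// would scan it out byte-for-byte.
struct Framebuffer {
    static constexpr int W = 800, H = 600, BPP = 3;   // bytes per pixel
    std::vector<uint8_t> mem = std::vector<uint8_t>(W * H * BPP);

    void put(int x, int y, uint8_t r, uint8_t g, uint8_t b) {
        std::size_t i = std::size_t(y * W + x) * BPP;
        mem[i] = r; mem[i + 1] = g; mem[i + 2] = b;
    }
};
// 800 * 600 * 3 = 1,440,000 bytes, roughly 1.4 MB, matching the estimate.
```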
Here are my questions:
a) How close is the above guess (the three steps)?
b) How could one incorporate graphics in pure C++ code? I assume the practical thing that everyone does is use a graphics library (SDL, OpenGL, etc.), but, for example, how do these libraries accomplish what they do? Would manual inclusion of graphics in pure C++ code (say, a 2D sprite) involve creating a two-dimensional array of bit values (or three-dimensional, to include multiple RGB values per pixel)? Is this how it would have been done waaay back in the day?
c) Also, continuing from above, do libraries such as SDL that use bitmaps actually just build the bitmap/etc. files into the machine code of the executable and use them as though they were built in the same manner mentioned in question b above?
d) In my hypothetical step 3 above, are there any registers involved? Like, could you write some byte value to some register to output a single color of one byte on the screen? Or is it purely dedicated memory space (=RAM) + hardware interaction?
e) Finally, how is all of this done for sound? (I have no idea :) )
a.
A long time ago, that was the case, but it hasn't been for quite a while. Most hardware will still support that type of configuration, but mostly as a fall-back -- it's not how they're really designed to work. Now most have a block of memory on the graphics card that's also mapped to be addressable by the CPU over the PCI/AGP/PCI-E bus. The size of that block is more or less independent of what's displayed on the screen though.
Again, at one time that's how it mostly worked, but it's mostly not the case anymore.
Mostly right.
b. OpenGL normally comes in a few parts -- a core library that's part of the OS, and a driver that's supplied by the graphics chipset (or possibly card) vendor. The exact distribution of labor between the CPU and GPU varies somewhat though (between vendors, over time within products from a single vendor, etc.) SDL is built around the general idea of a simple frame-buffer like you've described.
c. You usually build bitmaps, textures, etc., into separate files in formats specifically for the purpose.
d. There are quite a few registers involved, though the main graphics chipset vendors (ATI/AMD and nVidia) tend to keep their register-level documentation more or less secret (though this could have changed -- there's constant pressure from open source developers for documentation, not just closed-source drivers). Most hardware has capabilities like dedicated line drawing, where you can put (for example) line parameters into specified registers, and it'll draw the line you've specified. Exact details vary widely though...
e. Sorry, but this is getting long already, and sound covers a pretty large area...
For graphics, Jerry Coffin's got a pretty good answer.
Sound is actually handled similarly to your (the OP's) description of how graphics is handled. At a very basic level, you have a "buffer" (some memory, somewhere).
Your software writes the sound you want to play into that buffer. It is basically an encoding of the position of the speaker cone at a given instant in time.
For "CD quality" audio, you have 44100 values per second (a "sample rate" of 44.1 kHz).
Meanwhile, the audio subsystem reads from a separate read position in the buffer. This read position will be a little bit behind the write position. The distance behind is known as the latency. A larger distance gives more of a delay, but also helps to avoid the case where the read position catches up to the write position, leaving the sound device with nothing to actually play!
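The write-ahead/read-behind scheme can be sketched as a circular buffer; the names and sizes here are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Circular audio buffer: the distance between the write position and the
// read position is the latency, measured in samples.
struct AudioRing {
    std::vector<float> buf;
    std::size_t writePos = 0, readPos = 0;
    explicit AudioRing(std::size_t n) : buf(n, 0.0f) {}

    void write(float s) { buf[writePos] = s; writePos = (writePos + 1) % buf.size(); }
    float read()        { float s = buf[readPos]; readPos = (readPos + 1) % buf.size(); return s; }

    // Samples written but not yet played; at 44.1 kHz, ms = pending() / 44.1.
    std::size_t pending() const {
        return (writePos + buf.size() - readPos) % buf.size();
    }
};
```

At 44.1 kHz, keeping the write position 441 samples ahead of the read position corresponds to 10 ms of latency.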

When using Direct3D, how much math is being done on the CPU?

Context: I'm just starting out. I'm not even touching the Direct3D 11 API, and instead looking at understanding the pipeline, etc.
From looking at documentation and information floating around the web, it seems like some calculations are being handled by the application. That is, instead of simply presenting matrices to the GPU to multiply, the calculations are being done by a math library that operates on the CPU. I don't have any particular resources to point to, although I guess I can point to the XNA Math Library or the samples shipped in the February DX SDK. When you see code like mViewProj = mView * mProj;, that projection is being calculated on the CPU. Or am I wrong?
If you were writing a program where you can have 10 cubes on the screen, and you can move or rotate cubes as well as the viewpoint, what calculations would you do on the CPU? I think I would store the geometry for a single cube, and then transform matrices representing the actual instances. And then it seems I would use the XNA math library, or another of my choosing, to transform each cube in model space, get the coordinates in world space, and then push the information to the GPU.
That's quite a bit of calculation on the CPU. Am I wrong?
Am I reaching conclusions based on too little information and understanding?
What terms should I Google for, if the answer is STFW?
Or if I am right, why aren't these calculations being pushed to the GPU as well?
EDIT: By the way, I am not using XNA, but documentation notes the XNA Math Library replaces the previous DX Math library. (i see the XNA Library in the SDK as a sheer template library).
"Am I reaching conclusions based on too little information and understanding?"
Not as a bad thing, as we all do it, but in a word: Yes.
What is being done by the GPU is, generally, dependent on the GPU driver and your method of access. Most of the time you really don't care or need to know (other than curiosity and general understanding).
For mViewProj = mView * mProj;, this is most likely happening on the CPU. But it is not much of a burden (counted in hundreds of cycles at most). The real trick is the application of the new view matrix to the world. Every vertex needs to be transformed, more or less, along with shading, textures, lighting, etc. All of this work will be done on the GPU (if done on the CPU, things slow down really fast).
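For scale, a single 4x4 matrix product like mView * mProj is only 64 multiply-adds, which a CPU gets through in a trivial number of cycles. A rough sketch (row-major layout chosen arbitrarily for illustration):

```cpp
#include <array>

using Mat4 = std::array<float, 16>;   // row-major 4x4 matrix

// One 4x4 product: 4*4 outputs, each a 4-term dot product = 64 multiply-adds.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 c{};                         // zero-initialized accumulator
    for (int r = 0; r < 4; ++r)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                c[r * 4 + col] += a[r * 4 + k] * b[k * 4 + col];
    return c;
}
```

Composing a handful of these per frame on the CPU is negligible next to transforming millions of vertices, which is the part handed to the GPU.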
Generally you make high level changes to the world, maybe 20 CPU bound calculations, and the GPU takes care of the millions or billions of calculations needed to render the world based on the changes.
In your 10-cube example: You supply a transform for each cube; any math needed for you to create the transform is CPU-bound (with exceptions). You also supply a transform for the view; again, creating that transform matrix might be CPU-bound. Once you have your 11 new matrices, you apply them to the world. From a hardware point of view, the 11 matrices need to be copied to the GPU... that will happen very, very fast... once copied, the CPU is done, and the GPU recalculates the world based on the new data, renders it to a buffer, and pops it onto the screen. So for your 10 cubes, the CPU-bound calculations are trivial.
Look at some reflected code for an XNA project and you will see where your calculations end and XNA begins (XNA will do everything it possibly can on the GPU).
