When using Direct3D, how much math is being done on the CPU? - graphics

Context: I'm just starting out. I'm not even touching the Direct3D 11 API, and instead looking at understanding the pipeline, etc.
From looking at documentation and information floating around the web, it seems like some calculations are being handled by the application. That, is, instead of simply presenting matrices to multiply to the GPU, the calculations are being done by a math library that operates on the CPU. I don't have any particular resources to point to, although I guess I can point to the XNA Math Library or the samples shipped in the February DX SDK. When you see code like mViewProj = mView * mProj;, that projection is being calculated on the CPU. Or am I wrong?
If you were writing a program, where you can have 10 cubes on the screen, where you can move or rotate cubes, as well as viewpoint, what calculations would you do on the CPU? I think I would store the geometry for the a single cube, and then transform matrices representing the actual instances. And then it seems I would use the XNA math library, or another of my choosing, to transform each cube in model space. Then get the coordinates in world space. Then push the information to the GPU.
That's quite a bit of calculation on the CPU. Am I wrong?
Am I reaching conclusions based on too little information and understanding?
What terms should I Google for, if the answer is STFW?
Or if I am right, why aren't these calculations being pushed to the GPU as well?
EDIT: By the way, I am not using XNA, but documentation notes the XNA Math Library replaces the previous DX Math library. (i see the XNA Library in the SDK as a sheer template library).

"Am I reaching conclusions based on too little information and understanding?"
Not as a bad thing, as we all do it, but in a word: Yes.
What is being done by the GPU is, generally, dependent on the GPU driver and your method of access. Most of the time you really don't care or need to know (other than curiosity and general understanding).
For mViewProj = mView * mProj; this is most likely happening on the CPU. But it is not much burden (counted in 100's of cycles at the most). The real trick is the application of the new view matrix on the "world". Every vertex needs to be transformed, more or less, along with shading, textures, lighting, etc. All if this work will be done in the GPU (if done on the CPU things will slow down really fast).
Generally you make high level changes to the world, maybe 20 CPU bound calculations, and the GPU takes care of the millions or billions of calculations needed to render the world based on the changes.
In your 10 cube example: You supply a transform for each cube, any math needed for you to create the transform is CPU bound (with exceptions). You also supply a transform for the view, again creating the transform matrix might be CPU bound. Once you have your 11 new matrices you apply the to the world. From a hardware point of view the 11 matrices need to be copied to the GPU...that will happen very, very fast...once copied the CPU is done and the GPU recalculates the world based on the new data, renders it to a buffer and poops it on the screen. So for your 10 cubes the CPU bound calculations are trivial.
Look at some reflected code for an XNA project and you will see where your calculations end and XNA begins (XNA will do everything is possibly can in the GPU).

Related

How does GPU programming differ from usage of the graphics card in games?

One way of doing GPU programming is OpenCL, which will work with parallelized, number-crunching operations.
Now think of your favorite 3D PC game. When the screen renders, what's going on? Did the developers hand-craft an OpenCL kernel (or something like it), or are they using pre-programmed functions in the graphics card?
Sorry to make this sound like a homework problem, I couldn't think of a better way to ask it.
H'okay, so, I'ma answer this in terms of history. Hopefully that gives a nice overview of the situation and lets you decide how to proceed.
Graphics Pipeline
3D graphics have an almost set-in-stone flow of calculations. You start with your transformation matrices, you multiply out your vertex positions (maybe generate some more on the fly), figure out what your pixels ought to be colored, then spit out the result. This is the (oversimplified) gist of 3D graphics. To change anything in it, you just twiddle one aspect of the pipeline a bit with a 'shader', i.e. little programmable elements with defined inputs and outputs so that they could be slotted into the pipeline.
Early GPGPU
Back when GPGPU was still in its infancy, the only way people had access to the massively parallel prowess of the GPU was through graphics shaders. For example, there were fragment shaders, which basically calculated what colors should be on each pixel of the screen (I'm kind of oversimplifying here, but that's what they did).
So, for example, you might use a vertex shader to chuck data about the screen before reducing a bunch of values in the fragment shader by taking advantage of color blending (effectively making the tricky transformation of mathematical problem space to... well, color space).
The gist of this is that old GPGPU stuff worked within the confines of 3D graphics, using the same 'pre-programmed functions in the graphics card' that the rest of the 3D graphics pipeline used.
It was painful to read, write, and think about (or at least, I found it so painful that I was dissuaded).
CUDA and OpenCL and [all of the other less popular GPGPU solutions]
Then some folks came along and said, "Wow, this is kind of dumb - we're stuck in the graphics pipeline when we want to be doing more general calculations!"
Thus GPGPU escaped from the confines of the graphics pipeline, and now we have OpenCL and CUDA and Brook and HSA and... Well, you get the picture.
tl;dr
The difference between GPGPU kernels and 3D graphics kernels are that the latter are stuck in a pipeline with (convenient) constraints attached to them, while the former have far more relaxed requirements, the pipeline is defined by the user, and the results don't have to be attached to a display (although they can be if you're masochistic like that).
When you run a game there may be two distinct systems operating on your GPU:
OpenGL renders images to your screen (graphics)
OpenCL does general-purpose computing tasks (compute)
OpenGL is programed with shaders. OpenCL is programmed with kernels.
If you would like to learn in more detail how games work on the GPU, I recommend reading about OpenCL, OpenGL, and game engine architecture.

Why do we still use Fixed function blending operations in D3D11 ect?

I was looking aground trying to understand why we are still using fixed function blending modes in newer 3D API's (like D3D11). In D3D10 fixed function Alpha Clipping was removed in favor of using the shaders. Why because its a much more powerful approach to almost any situation.
So why then can we not calculate or own blending operations (aka texture sample from the RenderTarget we are currently rendering into)?? Is there some hardware design issue in the video card pipelines that make this difficult to accomplish?
The reason this would be useful, is because you could do things like make refraction shaders run way faster as you wouldn't have to swap back and forth between two renderTargets for each refractive object overlay. Such as a refractive windowing system for an OS or game UI.
Where might be the best place to suggest an idea like this as this is not a discussion forum as I would love to see this in D3D12? Or is this already possible in D3D11?
So why then can we not calculate or own blending operations
Who says you can't? With shader_image_load_store (and the D3D11 equivalent), you can do pretty much anything you want with images. Provided that you follow the rules. That last part is generally what trips people up. Doing a full read/modify/write in a shader, such that later fragment shader invocations don't read the wrong value is almost impossible in the most general case. You have to restrict it by saying that each rendered object will not overlap with itself, and you have to insert a memory barrier between rendered objects (which can overlap with other rendered objects). Or you use the linked list approach.
But the point is this: with these mechanisms, not only have people implemented blending in shaders, but they've implemented order-independent transparency (via linked lists). Nothing is stopping you from doing what you want right now.
Well, nothing except performance of course. The fixed-function blender will always be faster because it can run in parallel with the fragment shader operations. The blending units are separate hardware from the fragment shaders, so you can be doing blending operations while simultaneously doing fragment shader ops (obviously from later fragments, not the ones being blended).
The read/modify/write mechanism in the blend hardware is designed specifically for blending, while the image_load_store is a more generic mechanism. And while generic may beat specific in the long-term of hardware evolution, for the immediate and near-future, you can expect fixed-function blending to beat image_load_store blending performance-wise every time.
You should use it only when you must. And even the, decide if you really, really need it.
Is there some hardware design issue in the video card pipelines that make this difficult to accomplish?
Yes, this is actually the case. If one could do blending in the fragment shader, this would introduce possible feedback loops, and this really complicates things. Blending is done in a separate hardwired stage for performance and parallelization reasons.

Low level graphics programming and ZBrush

After a while of 3d modelling and enjoying ZBrush's impeccable performance and numerous features I thought it would be great OpenGL practice for me to create something similar, just a small sculpting tool. Sure enough I got it done, I couldn't match ZBrush's performance of course seeing as how a brigade of well payed professionals outmatch a hobbyist. For the moment I just assumed ZBrush was heavily hardware accelerated, imagine my surprise when I found out it's not and furthermore it uses neither opengl or direct3d.
This made me want to learn graphics on a lower level but I have no clue where to start. How are graphics libraries made and how does one access the framebuffer without the use of opengl. How much of a hassle would it be to display just a single pixel without any preexisting tools and what magic gives ZBrush such performance.
I'd appreciate any info on any question and a recommendation for a book that covers any of these topics. I'm already reading Michael Abrash's Graphics Programming Black Book but it's not really addressing these matters or I just haven't reached that point yet.
Thank you in advance.
(Please don't post answers like "just use opengl" or "learn math", this seems to be the reaction everywhere I post this question but these replies are off topic)
ZBrush is godly in terms of performance but I think it's because it was made by image processing experts with assembly expertise (it's also likely due to the sheer amount of assembly code that they've been almost 20 years late in porting to 64-bit). It actually started out without any kind of 3D sculpting and was just a 2.5D "pixol" painter where you could spray pixels around on a canvas with some depth and lighting to the "pixols". It didn't get sculpting until around ZB 1.5 or so. Even then it impressed people with how fast you could spray these 2.5D "pixols" around on the canvas back when a similarly-sized brush just painting flat pixels with Photoshop or Corel Painter would have brought framerates to a stutter. So they were cutting-edge in performance even before they tackled anything 3D and were doing nothing more than spraying pixels on a canvas; that tends to require some elite micro-optimization wizardry.
One of the things to note about ZB when you're sculpting 20 million polygon models with it is that it doesn't even use GPU rasterization. All the rasterization is done in CPU. As a result it doesn't benefit from a beefy video card with lots of VRAM supporting the latest GLSL/HLSL versions; all it needs is something that can plot colored pixels to a screen. This is probably one of the reasons it uses so little memory compared to, say, MudBox, since it doesn't have to triple the memory usage with, say, VBOs (which tend to double system memory usage while also requiring the data to be stored on the GPU).
As for how you get started with this stuff, IMO a good way to get your feet wet is to write your own raytracer. I don't think ZBrush uses, say, scanline rasterization which tends to rise very proportionally in cost the more polygons you have, since they reduce the number of pixels being rendered at times like when you rotate the model. That suggests that whatever technique they're using for rasterization is more dependent in terms of performance by the number of pixels being rendered rather than the number of primitives (vertices/triangles/lines/voxels) being rendered. Raytracing fits those characteristics. Also IMHO a raytracer is actually easier to write than a scanline rasterizer since you don't have to bother with tricky cases so much and elimination of overdrawing comes free of charge.
Once you got a software where the cost of an operation is more in proportion to the number of pixels being rendered than the amount of geometry, then you can throw a boatload of polygons at it as they did all the way back when they demonstrated 20 million polygon sculpting at Siggraph with silky frame rates almost 17 years ago.
However, it's very difficult to get a raytracer to update interactively in response to mesh data that is being not only sculpted interactively, but sometimes having its topology being changed interactively. So chances are that they are using some data structure other than your standard BVH or KD-Tree as popular in raytracing, and instead a data structure which is well-suited for dynamic meshes that are not only deforming but also having their topology being changed. Maybe they can voxelize and revoxelize (or "pixolize" and "repixolize") meshes on the fly really quickly and cast rays directly into the voxelized representation. That would start to make sense given how their technology originally revolved around these 2.5D "pixels" with depth.
Anyway, I'd suggest raytracing for a start even if it's only just getting your feet wet and getting you nowhere close to ZB's performance just yet (it's still a very good start on how to translate 3D geometry and lighting into an attractive 2D image). You can find minimal examples of raytracers on the web written with just a hundred lines of code. Most of the work typically in building a raytracer is performance and also handling a rich diversity of shaders/materials. You don't necessarily need to bother with the latter and ZBrush doesn't so much either (they use these dirt cheap matcaps for modeling). Then you'll likely have to innovate some kind of data structure that's well-suited for mesh changes to start getting on par with ZB and micro-tune the hell out of it. That software is really on a whole different playing field.
I have likewise been so inspired by ZB but haven't followed in their footsteps directly, instead using the GPU rasterizer and OpenGL. One of the reasons I find it difficult to explore doing all this stuff on the CPU as ZB has is because you lose the benefits of so much industrial research and revolutionary techniques that game engines and NVidia and AMD have come up with into lighting models in realtime and so forth that all benefit from GPU-side processing. There's 99% of the 3D industry and then there's ZBrush in its own little corner doing things that no one else is doing and you need a lot of spare time and maybe a lot of balls to abandon the rest of the industry and try to follow in ZB's footsteps. Still I always wish I could find some spare time to explore a pure CPU rasterizing engine like ZB since they still remain unmatched when your goal is to directly interact with ridiculously high-resolution meshes.
The closest I've gotten to ZB performance was sculpting 2 million polygon meshes at over 30 FPS back in the late 90s on an Athlon T-Bird 1.2ghz with 256MB of RAM, and that was after 6 weeks of intense programming and revisiting the drawing board over and over in a very simplistic demo, and that was a very rare time where my company gave me so much R&D time to explore what ZB was doing. Still, ZB was handling 5 times that geometry at the same frame rates even at that time and on the same hardware and using half the memory. I couldn't even get close, though I did end up with a newfound respect and admiration for the programmers at Pixologic. I also had to insist to my company to do the research. Some of the people there thought ZBrush would never become anything noteworthy and would just remain a cutesy artistic application. I thought the opposite since I saw something revolutionary long before it acquired such an epic following.
A lot of people at the time thought ZB's ability to handle so many polygons was impractical and that you could just paint bump/normal/displacement maps and add whatever details you needed into textures. But that's ignoring the workflow side of things. When you can just work straight with epic amounts of geometry, you get to uniformly apply the same tools and workflow to select vertices, polygons, edges, brush over things, etc. It becomes the most straightforward way to create such a detailed and complex model, after which you can bake out the details into bump/normal/displacement maps for use in other engines that would vomit on 20 million polygons. Nowadays I don't think anyone still questions the practicality of ZB.
[...] but it's not really addressing these matters or I just haven't
reached that point yet.
As a caveat, no one has published anything on how to achieve performance rivaling ZB. Otherwise there would be a number of applications rivaling its performance and features when it comes to sculpting, dynamesh, zspheres, etc and it wouldn't be so amazingly special. You definitely need your share of R&D to come up with anything close to it, but I think raytracing is a good start. After that you'll likely need to come up with some really interesting ideas for algorithms and data structures in addition to a lot of micro-tuning.
What I can say with a fair degree of confidence is that:
They have some central data structure to accelerate rasterization that can update extremely quickly in response to changes the user makes to a mesh (including topological ones).
The cost of rasterization is more in proportion to the number of pixels rendered rather than the size of the 3D input.
There's some micro-optimization wizardry in there, including straight up assembly coding (I'm quite certain ZB uses assembly coding since they were originally requiring programmers to have both assembly and C++ knowledge back when they were hiring in the 2000s; I really wanted to work at Pixologic but lacked the prerequisite assembly skills).
Whatever they use is pretty light on memory requirements given that the models are so dynamic. Last time I checked, they use less than 100MB per million polygons even when loading in production models with texture maps. Competing 3D software with the exception of XSI can take over a gigabyte for the same data. XSI uses even less memory than ZB with its gigapoly core but is ill-suited to manipulating such data, slowing down to a crawl (they probably optimized it in a way that's only well-suited for static data like offloading data to disk or even using some expensive forms of compression).
If you're really interested in exploring this, I'd be interested to see what you can come up with. Maybe we can exchange notes. I've devoted much of my career just being interested in figuring out what ZB is doing, or at least coming up with something of my own that can rival what it's doing. For just about everything else I've tackled over the years from raytracing to particle simulations to fluid dynamics and video processing and so forth, I've been able to at least come up with demos that rival or surpass the performance of the competition, but not ZBrush. ZBrush remains that elusive thorn in my side where I just can't figure out how they manage to be so damned efficient at what they do.
If you really want to crawl before you even begin to walk (I think raytracing is a decent enough start, but if you want to start out even more fundamental) then maybe a natural evolution is to first just focus on image processing: filtering images, painting them with brushes, etc, along with some support for basic vector graphics like a miniature Photoshop/Illustrator. Then work your way up to rasterizing some basic 3D primitives, like maybe just a wireframe of a model being rendered using Wu line rasterization and some basic projection functions. Then work your way towards rasterizing filled triangles without any lighting or texturing, at which point I think you'll get closer to ZBrush focusing on raytracing rather than scanline with a depth buffer. However, doing a little bit of the latter might be a useful exercise anyway. Then work on rendering lit triangles, maybe starting with direct lighting and just a single light source, just computing a luminance based on the angle of the normal relative to the light source. Then work towards textured triangles using baycentric coordinates to figure out what texels to render. Then work towards indirect lighting and multiple light sources. That should be plenty of homework for you to develop a fairly comprehensive idea of the fundamentals of rasterization.
Now once you get to raytracing, I'm actually going to recommend one of the least efficient data structures for the job typically: octrees, not BVH or KD-Tree, mainly because I believe octrees are probably closer to allowing what ZB allows. Your bottlenecks in this context don't have to do with rendering the most beautiful images with complex diffuse materials and indirect lighting and subpixel samples for antialiasing. It has to do with handling a boatload of geometry with simple lighting and simple shaders and one sample per pixel which is changing on the fly, including topologically. Octrees seem a little better suited in that case than KD-tree or BVHs as a starting point.
One of the problems with ignoring the fundamentals these days is that a lot of young developers have lost that connection from, say, triangle to pixel on the screen. So if you don't want to take such rasterization and projection for granted, then your initial goal is to project 3D data into a 2D coordinate space and rasterize it.
If you want a book that starts at a low level, with framebuffers and such, try Computer Graphics: Principles and Practice, by Foley, van Dam, et al. It is an older, traditional text, but newer books tend to have a higher-level view. For a more modern text, I can also recommend 3D Computer Graphics by Alan Watt. There are plenty of other good introductory texts available -- these are just two that I am personally familiar with.
Neither of the above books are tied to OpenGL -- if I recall correctly, they include the specific math and algorithms necessary to understand and implement 3D graphics from the bottom up.

Could a Cray XK6 run a real-time raytracer?

I heard about Cray's new supercomputer -- the XK6 -- today, but I am a little confused about where the bottlenecks are. Is it in interconnec? Can an XK6 configured with, say, 500,000 16-core processors achieve a graphic fidelity comprable to Toy Story 3 in real-time? By "real-time", I mean 60fps, or around 16.7 milliseconds per frame.
No. Pure computation is surprisingly little of what it takes to render a film frame from Toy Story 3 or a similar modern animated (or VFX) film. Those scenes may reference many hundreds of GB of texture, and even if you could know exactly which subset of that texture will be needed for a frame, it may be tens of GB, which still needs to be read from disk and/or transferred over a network. GPUs or massively parallel distributed computation doesn't speed that up. Furthermore, rendering is only the very last step... preparing the geometric input for a frame (simulating the fluids, cloth and hair, tessellating the geometry, reading and interpreting large scenes from disk) can be substantial.
So, just pulling numbers out of the air (but these are moderately realistic), say it takes 30 minutes to prepare the scene (load the models, tessellate it, some minor sims, etc.), and 4.5 hours to render (of which, say, 30 minutes is reading texture and other resources from disk, leaving 4 hours of "ray tracing" and other computation). If the XK6 made the ray tracing infinitely fast, it would only speed the total process up by 5x (1 hour is still hard-to-serialize prep and I/O). That's Amdahl's Law for you.
Now, you're probably asking yourself, "how do games go so fast?" They do it in two ways: (1) they drastically reduce the data set (texture size, geometric resolution, etc.) to make it all fit on the GPU and be reasonably fast to load levels (which, curiously, you the user are not counting when you think of the rendering as happening in "real time"); (2) they spare no expense in computation, tricks, and human labor to optimize the scenes and algorithms before they ship the disks, so that when it's in front of the player it can render quickly.
So, in summary, if you are asking if the total computational power of the XK6 is enough to compute in real time all the pure math required to render a film frame, then yes, it probably is. But if you are asking if an XK6 could actually render the movie in realtime given the kinds of inputs the renderer needs, then no, it couldn't. Would an XK6 be of any use to people rendering those movie frames? No, it probably wouldn't be worth the trouble of reprogramming all the software (hundreds of man years) from the ground up.
Looking at it from another viewpoint, users generally render only one scene at a time, then make small changes and re-render over and over again. To render only a single scene in realtime may still require several GB of textures loaded onto the GPU's RAM.
Could a supercomputer like those from Cray using a "sea of cores" or a vast array of modern CPU's perform the job in realtime? Yes, for simple enough scenes. More complex scenes that require 100+ rays per pixel at 8MP (4K x 2K for movies, 2MP for DSLR/indie type movies), along with lots of objects, shadows, haze, refraction, diffuse lighting sources, etc) would probably require too many computations, even at 24fps.

Collision detection with hardware generated primitives

There's a lot of literature on collision detection, and I've read at least a big enough portion of it to be fairly familiar with most techniques. However, there's something that has eluded me for a while, and I figured, since StackOverflow provides access to a large group of brilliant minds at once, I'd ask here first before digging around in the bookshelf.
In this day and age, more and more work is being delegated to GPU rather than CPU, and in a lot of cases this is a good thing. For example, there are geometry shaders to create new geometry, or (slightly less new, but still quite fascinating) vertex shaders to which you can through a bunch of vertexes at, and something elegant will come out of it. What I was considering though, as these primitives exists only on the graphics hardware, how would you perform reliable collision detection with these primitives? Let's assume I have some kind of extremely simplified mesh which is displaced in a vertex shader (I don't have a concrete problem, I'm more playing with the idea, and I haven't gotten very deep into geometry shaders yet).
What I've considered so far is separate 'rendering' passes from suitable angles with shading more or less turned off, and perhaps lower resolution mesh, rendering the inside (with faces flipped inward) of my second primitive along with the mesh I want to test against, and executing an occlusion query for the mesh. If the mesh is completely occluded there'd be no intersection. This would of course require that my second primitive is convex.
Somehow I get the feeling that this kind of test will be extremely expensive as the number of primitives increase (even if a large portion can be culled directly). Does anyone else have another idea or technique? I'm more familiar with opengl and cg than directx, but if you have some examples or so in directx, I guess I'll be able to figure out the opengl counterparts.
All ideas are appreciated, so please brainstorm. :)
It sounds like Dan Horn's article “Stream Reduction Operations for GPGPU Applications” in GPU Gems 2 is exactly what you want. Like all chapters, it's freely available online.

Resources