Could a Cray XK6 run a real-time raytracer? - graphics

I heard about Cray's new supercomputer -- the XK6 -- today, but I am a little confused about where the bottlenecks are. Is it the interconnect? Could an XK6 configured with, say, 500,000 16-core processors achieve graphical fidelity comparable to Toy Story 3 in real time? By "real-time", I mean 60fps, or around 16.7 milliseconds per frame.

No. Pure computation is surprisingly little of what it takes to render a film frame from Toy Story 3 or a similar modern animated (or VFX) film. Those scenes may reference many hundreds of GB of texture, and even if you knew exactly which subset of that texture will be needed for a frame, it may be tens of GB, which still needs to be read from disk and/or transferred over a network. GPUs and massively parallel distributed computation don't speed that up. Furthermore, rendering is only the very last step... preparing the geometric input for a frame (simulating the fluids, cloth and hair, tessellating the geometry, reading and interpreting large scenes from disk) can be substantial.
So, just pulling numbers out of the air (but these are moderately realistic), say it takes 30 minutes to prepare the scene (load the models, tessellate them, run some minor sims, etc.), and 4.5 hours to render (of which, say, 30 minutes is reading texture and other resources from disk, leaving 4 hours of "ray tracing" and other computation). Even if the XK6 made the ray tracing infinitely fast, it would only speed the total process up by 5x (1 hour is still hard-to-parallelize prep and I/O). That's Amdahl's Law for you.
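As a sanity check on those made-up numbers, here is a minimal sketch of the Amdahl's Law arithmetic (the 4-hour/1-hour split is assumed from the paragraph above, not measured from any real pipeline):

```cpp
#include <cstdio>
#include <initializer_list>

// Back-of-the-envelope Amdahl's Law check of the (made up) numbers above:
// 5 hours total per frame, of which 4 hours is parallelizable "ray tracing"
// and 1 hour is serial prep and I/O.
int main() {
    const double total_hours    = 5.0;  // 0.5 h prep + 4.5 h render
    const double parallel_hours = 4.0;  // the "ray tracing" portion
    const double p = parallel_hours / total_hours;  // parallel fraction = 0.8

    // Amdahl's Law: speedup(s) = 1 / ((1 - p) + p / s)
    for (double s : {10.0, 100.0, 1e6}) {
        double overall = 1.0 / ((1.0 - p) + p / s);
        std::printf("ray tracing %gx faster -> total speedup %.2fx\n", s, overall);
    }
    // Even as s goes to infinity, the limit is 1 / (1 - p) = 5x.
    return 0;
}
```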
Now, you're probably asking yourself, "how do games go so fast?" They do it in two ways: (1) they drastically reduce the data set (texture size, geometric resolution, etc.) to make it all fit on the GPU and be reasonably fast to load levels (which, curiously, you the user are not counting when you think of the rendering as happening in "real time"); (2) they spare no expense in computation, tricks, and human labor to optimize the scenes and algorithms before they ship the disks, so that when it's in front of the player it can render quickly.
So, in summary, if you are asking if the total computational power of the XK6 is enough to compute in real time all the pure math required to render a film frame, then yes, it probably is. But if you are asking if an XK6 could actually render the movie in realtime given the kinds of inputs the renderer needs, then no, it couldn't. Would an XK6 be of any use to people rendering those movie frames? No, it probably wouldn't be worth the trouble of reprogramming all the software (hundreds of man years) from the ground up.

Looking at it from another viewpoint, users generally render only one scene at a time, then make small changes and re-render over and over again. To render only a single scene in realtime may still require several GB of textures loaded onto the GPU's RAM.
Could a supercomputer like those from Cray, using a "sea of cores" or a vast array of modern CPUs, perform the job in realtime? Yes, for simple enough scenes. More complex scenes that require 100+ rays per pixel at 8MP (4K x 2K for movies; 2MP for DSLR/indie-type movies), along with lots of objects, shadows, haze, refraction, diffuse lighting sources, etc., would probably require too many computations, even at 24fps.
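To give a feel for the scale, here is a rough, back-of-the-envelope throughput estimate using the assumed figures above (8 MP frames, 100 rays per pixel, 24 fps); it ignores shading, texture access and traversal cost entirely:

```cpp
#include <cstdio>

// Rough throughput estimate using the assumed figures above:
// ~8 MP frames (4K x 2K), 100 rays per pixel, 24 fps.
int main() {
    const double pixels_per_frame = 4096.0 * 2048.0;  // ~8.4 million
    const double rays_per_pixel   = 100.0;
    const double fps              = 24.0;

    const double rays_per_second = pixels_per_frame * rays_per_pixel * fps;
    std::printf("required throughput: ~%.0f billion rays per second\n",
                rays_per_second / 1e9);   // ~20 billion rays/s
    // ...before counting shading, texture fetches and acceleration-structure
    // traversal for every one of those rays.
    return 0;
}
```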

Related

Interleaved vs non-interleaved vertex buffers

This seems like a question which has been answered throughout time for one IHV or another, but recently I have been trying to come to a consensus about vertex layouts and the best practices for a modern renderer across all IHVs and architectures. Before someone says "benchmark", I can't easily do that, as I don't have access to a card from every IHV and every architecture from the last 5 years. Therefore, I am looking for some best practices that will work decently well across all platforms.
First, the obvious:
Separating position from other attributes is good for:
Shadow and depth pre-passes
Per-triangle culling
Tile-based deferred renderers (such as the Apple M1)
Interleaved is more logical on the CPU side, since you can work with a Vertex class.
Non-interleaved can make some CPU calculations faster due to being able to take advantage of SIMD (see the layout sketch below).
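For concreteness, here is a minimal sketch of the two layouts being discussed; the struct names are illustrative, not from any particular engine:

```cpp
#include <vector>

// Interleaved ("array of structures"): one buffer, one stride; all the
// attributes of a vertex sit next to each other in memory.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};
using InterleavedBuffer = std::vector<Vertex>;

// Non-interleaved ("structure of arrays"): one tightly packed stream per
// attribute, bound as separate vertex buffers / vertex bindings.
struct DeinterleavedBuffers {
    std::vector<float> positions; // 3 floats per vertex
    std::vector<float> normals;   // 3 floats per vertex
    std::vector<float> uvs;       // 2 floats per vertex
};

// A depth-only or shadow pass that binds only `positions` never pulls
// normals and UVs through the cache, which is the cross-vendor argument
// for at least splitting position into its own stream.
```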
Now onto the less obvious.
Many people quote NVIDIA as saying that you should always interleave, and moreover that you should align to 32 or 64 bytes. I have not found the source of this, but I did find a document about vertex shader performance from NVIDIA; however, it is quite old (2013) and concerns the Tegra GPU, which is mobile, not desktop. In particular it says:
Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
Fast forward three years to GDC 2016, where EA gave a presentation which mentions several reasons why you should de-interleave vertex buffers. However, this recommendation seems to be tied to AMD architectures, in particular GCN. While they make a cross-platform case for separating the position, they propose de-interleaving everything with the statement that it will allow the GPU to:
Evict cache lines as quickly as possible
And that it is optimal for GCN (AMD) architectures.
This seems to be in conflict with what I have heard elsewhere, which says to use interleaved layouts in order to make the most of a cache line. But again, that advice was not in regard to AMD.
With many different IHVs (Intel, NVIDIA, AMD, and now Apple with the M1 GPU), each with many different architectures, I am left completely uncertain about what one should do today (without the budget to test on dozens of GPUs) to best optimize performance across all architectures without causing
a massive resultant drop in performance
on some architectures. In particular, is de-interleaved still best on AMD? Is it no longer a problem on NVIDIA, or was it never a problem on desktop NVIDIA GPUs? What about the other IHVs?
NOTE: I am not interested in mobile, only all desktop GPUs in the past 5 years or so.

plot speed up curve vs number of OpenMP threads - scalability?

I am working on a C++ code which uses OpenMP threads. I have plotted the speedup curve versus the number of OpenMP threads, along with the theoretical curve (as if the code were fully parallelizable).
Here is the plot: [speedup vs. number of OpenMP threads, together with the ideal linear curve]
From this plot, can we say this code is not scalable (from a parallelization point of view)? I.e., the code is not twice as fast with 2 OpenMP threads, four times as fast with 4 threads, etc.?
Thanks
For the code that barely achieves 2.5x speedup on 16 threads, it is fair to say that it does not scale. However "is not scalable" is often considered a stronger statement. The difference, as I understand it, is that "does not scale" typically refers to a particular implementation and does not imply inherent inability to scale; in other words, maybe you can make it scale if bottlenecks are eliminated. On the other hand, "is not scalable" usually means "you cannot make it scale, at least not without changing the core algorithm". Assuming such meaning, one cannot say "a problem/code/algorithm is not scalable" only looking at a chart.
On an additional note, it's not always reasonable to expect perfect scaling (2x with 2 threads, 4x with 4 threads, etc.). A curve that is "close enough" to the ideal might still be considered as showing good scalability, and what "close enough" means may depend on a number of factors. It can be useful to think in terms of parallel efficiency, rather than speedup, when scalability is the question. For example, if parallel efficiency is 0.8 (or 80%) and does not drop when the number of threads increases, that could be considered good scalability. Also, it's possible that a program scales well up to a certain number of threads, but then flattens out or even slows down as more resources are added.
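As a minimal illustration of thinking in parallel efficiency rather than raw speedup (the numbers in the table below are made up, apart from the 2.5x-on-16-threads figure quoted above):

```cpp
#include <cstdio>

// Parallel efficiency = speedup / number_of_threads.
double efficiency(double speedup, int threads) {
    return speedup / threads;
}

int main() {
    // Illustrative values; plug in your own measured speedups.
    struct Run { int threads; double speedup; } runs[] = {
        {2, 1.8}, {4, 3.2}, {8, 4.0}, {16, 2.5}
    };
    for (Run r : runs) {
        std::printf("%2d threads: speedup %.2fx, efficiency %.0f%%\n",
                    r.threads, r.speedup,
                    100.0 * efficiency(r.speedup, r.threads));
    }
    // 2.5x on 16 threads is ~16% efficiency, which is why the answer above
    // says this particular run "does not scale".
    return 0;
}
```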

OpenCL GPU Audio

There's not much on this subject, perhaps because it isn't a good idea in the first place.
I want to create a realtime audio synthesis/processing engine that runs on the GPU. The reason is that I will also be using a physics library that runs on the GPU, and the audio output will be determined by the physics state. Is it true that the GPU can only carry audio output and can't generate it? Would this mean a large increase in latency if I were to read the data back on the CPU and output it to the soundcard? I'm looking for a latency of between 10 and 20 ms between synthesis and playback.
Would the GPU accelerate synthesis by any worthwhile amount? I'm going to have a large number of synthesizers running at once, each of which I imagine could take up their own parallel process. AMD is coming out with GPU audio, so there must be something to this.
For what it's worth, I'm not sure that this idea lacks merit. If DarkZero's observation about transfer times is correct, it doesn't sound like there would be much overhead in getting audio onto the GPU for processing, even from many different input channels, and while there are probably audio operations that are not very amenable to parallelization, many are very VERY parallelizable.
It's obvious, for example, that computing the sine values for 128 samples of output from a sine oscillator could be done completely in parallel. At 44.1 kHz, working in blocks of that size would permit a latency of only about 3 ms, which is acceptable in most digital audio applications. Similarly, the many other fundamental oscillators could be effectively parallelized. Amplitude modulation of such oscillators would be trivial. Efficient frequency modulation would be more challenging, but I would guess it is still possible.
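As a rough sketch of that per-sample independence (shown with OpenMP on the CPU purely for brevity; an OpenCL kernel would have the same shape with one work-item per sample, and the function name is just illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Each output sample of a sine oscillator depends only on its own index,
// so a block can be filled entirely in parallel. A block of 128 samples at
// 44.1 kHz is roughly 2.9 ms of audio.
void render_sine_block(float* out, std::size_t n, double freq,
                       double sample_rate, std::size_t start_sample) {
    const double two_pi = 6.283185307179586;
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
        double t = static_cast<double>(start_sample + i) / sample_rate;
        out[i] = static_cast<float>(std::sin(two_pi * freq * t));
    }
}
```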
In addition to oscillators, FIR filters are simple to parallelize, and a google search turned up some promising looking research papers (which I didn't take the trouble to read) that suggest that there are reasonable parallel approaches to IIR filter implementation. These two types of filters are fundamental to audio processing and many useful audio operations can be understood as such filters.
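For example, a direct-form FIR filter is a textbook case of this: each output sample depends only on the input signal and the fixed coefficients, never on previous outputs, so every sample can be computed independently (one work-item per sample on a GPU). A minimal, untuned sketch, again CPU-side for brevity:

```cpp
#include <cstddef>
#include <vector>

// FIR filter: y[n] = sum_k h[k] * x[n - k]. Every y[n] is independent of
// every other y[n]. IIR filters feed y[n-1] back in, which is why they are
// the harder case mentioned above.
std::vector<float> fir(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    #pragma omp parallel for
    for (std::ptrdiff_t n = 0; n < static_cast<std::ptrdiff_t>(x.size()); ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < h.size() && k <= static_cast<std::size_t>(n); ++k)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
    return y;
}
```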
Wave-shaping is another task in digital audio that is embarrassingly parallel.
Even if you couldn't take an arbitrary software synth and map it effectively to the GPU, it is easy to imagine a software synthesizer constructed specifically to take advantage of the GPU's strengths, and avoid its weaknesses. A synthesizer relying exclusively on the components I have mentioned could still produce a fantastic range of sounds.
While marko is correct to point out that existing SIMD instructions can do some parallelization on the CPU, the number of inputs they can operate on at the same time pales in comparison to a good GPU.
In short, I hope you work on this and let us know what kind of results you see!
DSP operations on modern CPUs with vector processing units (SSE on x86/x64 or NEON on ARM) are already pretty cheap if exploited properly. This is particularly the case with filters, convolution, FFT and so on, which are fundamentally stream-based operations. These are the type of operations where a GPU might also excel.
As it turns out, soft synthesisers have quite a few operations in them that are not stream-like, and furthermore, the tendency is to process increasingly small chunks of audio at once to target low latency. These are a really bad fit for the capabilities of a GPU.
The effort involved in using a GPU - particularly getting data in and out - is likely to far exceed any benefit you get. Furthermore, the capabilities of inexpensive personal computers - and also tablets and mobile devices - are more than enough for many digital audio applications. AMD seem to have a solution looking for a problem. For sure, the existing music and digital audio software industry is not about to start producing software that only targets a limited sub-set of hardware.
Typical transfer times for a few MB to/from the GPU are around 50 µs.
Delay is not your problem; however, parallelizing an audio synthesizer on the GPU may be quite difficult. If you don't do it properly, the processing may take more time than the data copy.
If you are going to run multiple synthesizers at once, I would recommend you run each synthesizer in a work-group and parallelize the synthesis process across the available work-items. It will not be worthwhile to have each synthesizer in one work-item, since it is unlikely you will have thousands of them.
http://arxiv.org/ftp/arxiv/papers/1211/1211.2038.pdf
You might be better off using OpenMP, for its lower initialization times.
You could check out the NESS project, which is all about physical modelling synthesis. They are using GPUs for audio rendering because the process involves simulating an acoustic 3D space for a given sound, and calculating what happens to that sound within the virtual 3D space (and apparently GPUs are good at working with this sort of data). Note that this is not realtime synthesis, because it is so demanding of processing.

Low level graphics programming and ZBrush

After a while of 3D modelling and enjoying ZBrush's impeccable performance and numerous features, I thought it would be great OpenGL practice for me to create something similar, just a small sculpting tool. Sure enough I got it done. I couldn't match ZBrush's performance, of course, seeing as how a brigade of well-paid professionals outmatches a hobbyist. For the moment I just assumed ZBrush was heavily hardware accelerated; imagine my surprise when I found out it's not, and furthermore that it uses neither OpenGL nor Direct3D.
This made me want to learn graphics on a lower level, but I have no clue where to start. How are graphics libraries made, and how does one access the framebuffer without the use of OpenGL? How much of a hassle would it be to display just a single pixel without any preexisting tools, and what magic gives ZBrush such performance?
I'd appreciate any info on any question and a recommendation for a book that covers any of these topics. I'm already reading Michael Abrash's Graphics Programming Black Book but it's not really addressing these matters or I just haven't reached that point yet.
Thank you in advance.
(Please don't post answers like "just use opengl" or "learn math", this seems to be the reaction everywhere I post this question but these replies are off topic)
ZBrush is godly in terms of performance but I think it's because it was made by image processing experts with assembly expertise (it's also likely due to the sheer amount of assembly code that they've been almost 20 years late in porting to 64-bit). It actually started out without any kind of 3D sculpting and was just a 2.5D "pixol" painter where you could spray pixels around on a canvas with some depth and lighting to the "pixols". It didn't get sculpting until around ZB 1.5 or so. Even then it impressed people with how fast you could spray these 2.5D "pixols" around on the canvas back when a similarly-sized brush just painting flat pixels with Photoshop or Corel Painter would have brought framerates to a stutter. So they were cutting-edge in performance even before they tackled anything 3D and were doing nothing more than spraying pixels on a canvas; that tends to require some elite micro-optimization wizardry.
One of the things to note about ZB when you're sculpting 20 million polygon models with it is that it doesn't even use GPU rasterization. All the rasterization is done in CPU. As a result it doesn't benefit from a beefy video card with lots of VRAM supporting the latest GLSL/HLSL versions; all it needs is something that can plot colored pixels to a screen. This is probably one of the reasons it uses so little memory compared to, say, MudBox, since it doesn't have to triple the memory usage with, say, VBOs (which tend to double system memory usage while also requiring the data to be stored on the GPU).
As for how you get started with this stuff, IMO a good way to get your feet wet is to write your own raytracer. I don't think ZBrush uses, say, scanline rasterization, whose cost tends to rise in close proportion to the number of polygons, since they reduce the number of pixels being rendered at times such as when you rotate the model. That suggests that the performance of whatever technique they're using for rasterization depends more on the number of pixels being rendered than on the number of primitives (vertices/triangles/lines/voxels) being rendered. Raytracing fits those characteristics. Also, IMHO a raytracer is actually easier to write than a scanline rasterizer, since you don't have to bother with tricky cases so much, and elimination of overdraw comes free of charge.
Once you have software where the cost of an operation is more in proportion to the number of pixels being rendered than the amount of geometry, you can throw a boatload of polygons at it, as they did all the way back when they demonstrated 20 million polygon sculpting at Siggraph with silky frame rates almost 17 years ago.
However, it's very difficult to get a raytracer to update interactively in response to mesh data that is being not only sculpted interactively, but sometimes having its topology being changed interactively. So chances are that they are using some data structure other than your standard BVH or KD-Tree as popular in raytracing, and instead a data structure which is well-suited for dynamic meshes that are not only deforming but also having their topology being changed. Maybe they can voxelize and revoxelize (or "pixolize" and "repixolize") meshes on the fly really quickly and cast rays directly into the voxelized representation. That would start to make sense given how their technology originally revolved around these 2.5D "pixels" with depth.
Anyway, I'd suggest raytracing for a start even if it's only just getting your feet wet and getting you nowhere close to ZB's performance just yet (it's still a very good start on how to translate 3D geometry and lighting into an attractive 2D image). You can find minimal examples of raytracers on the web written with just a hundred lines of code. Most of the work typically in building a raytracer is performance and also handling a rich diversity of shaders/materials. You don't necessarily need to bother with the latter and ZBrush doesn't so much either (they use these dirt cheap matcaps for modeling). Then you'll likely have to innovate some kind of data structure that's well-suited for mesh changes to start getting on par with ZB and micro-tune the hell out of it. That software is really on a whole different playing field.
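For reference, the core of one of those hundred-line raytracers is little more than a ray-primitive intersection test; a minimal ray-sphere version (purely illustrative, and certainly not ZBrush's technique) looks like this:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Does the ray origin + t*dir hit a sphere? Solves the quadratic
// |o + t*d - c|^2 = r^2 and returns the nearest intersection with t > 0.
bool intersect_sphere(Vec3 origin, Vec3 dir, Vec3 center, float radius, float& t) {
    Vec3  oc   = sub(origin, center);
    float a    = dot(dir, dir);
    float b    = 2.0f * dot(oc, dir);
    float c    = dot(oc, oc) - radius * radius;
    float disc = b * b - 4.0f * a * c;
    if (disc < 0.0f) return false;            // ray misses the sphere
    float s  = std::sqrt(disc);
    float t0 = (-b - s) / (2.0f * a);
    float t1 = (-b + s) / (2.0f * a);
    t = (t0 > 0.0f) ? t0 : t1;                // nearest hit in front of origin
    return t > 0.0f;
}
```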
I have likewise been so inspired by ZB but haven't followed in their footsteps directly, instead using the GPU rasterizer and OpenGL. One of the reasons I find it difficult to explore doing all this stuff on the CPU as ZB has is because you lose the benefits of so much industrial research and revolutionary techniques that game engines and NVidia and AMD have come up with into lighting models in realtime and so forth that all benefit from GPU-side processing. There's 99% of the 3D industry and then there's ZBrush in its own little corner doing things that no one else is doing and you need a lot of spare time and maybe a lot of balls to abandon the rest of the industry and try to follow in ZB's footsteps. Still I always wish I could find some spare time to explore a pure CPU rasterizing engine like ZB since they still remain unmatched when your goal is to directly interact with ridiculously high-resolution meshes.
The closest I've gotten to ZB performance was sculpting 2 million polygon meshes at over 30 FPS back in the late 90s on an Athlon T-Bird 1.2ghz with 256MB of RAM, and that was after 6 weeks of intense programming and revisiting the drawing board over and over in a very simplistic demo, and that was a very rare time where my company gave me so much R&D time to explore what ZB was doing. Still, ZB was handling 5 times that geometry at the same frame rates even at that time and on the same hardware and using half the memory. I couldn't even get close, though I did end up with a newfound respect and admiration for the programmers at Pixologic. I also had to insist to my company to do the research. Some of the people there thought ZBrush would never become anything noteworthy and would just remain a cutesy artistic application. I thought the opposite since I saw something revolutionary long before it acquired such an epic following.
A lot of people at the time thought ZB's ability to handle so many polygons was impractical and that you could just paint bump/normal/displacement maps and add whatever details you needed into textures. But that's ignoring the workflow side of things. When you can just work straight with epic amounts of geometry, you get to uniformly apply the same tools and workflow to select vertices, polygons, edges, brush over things, etc. It becomes the most straightforward way to create such a detailed and complex model, after which you can bake out the details into bump/normal/displacement maps for use in other engines that would vomit on 20 million polygons. Nowadays I don't think anyone still questions the practicality of ZB.
[...] but it's not really addressing these matters or I just haven't reached that point yet.
As a caveat, no one has published anything on how to achieve performance rivaling ZB. Otherwise there would be a number of applications rivaling its performance and features when it comes to sculpting, dynamesh, zspheres, etc and it wouldn't be so amazingly special. You definitely need your share of R&D to come up with anything close to it, but I think raytracing is a good start. After that you'll likely need to come up with some really interesting ideas for algorithms and data structures in addition to a lot of micro-tuning.
What I can say with a fair degree of confidence is that:
They have some central data structure to accelerate rasterization that can update extremely quickly in response to changes the user makes to a mesh (including topological ones).
The cost of rasterization is more in proportion to the number of pixels rendered rather than the size of the 3D input.
There's some micro-optimization wizardry in there, including straight up assembly coding (I'm quite certain ZB uses assembly coding since they were originally requiring programmers to have both assembly and C++ knowledge back when they were hiring in the 2000s; I really wanted to work at Pixologic but lacked the prerequisite assembly skills).
Whatever they use is pretty light on memory requirements given that the models are so dynamic. Last time I checked, they use less than 100MB per million polygons even when loading in production models with texture maps. Competing 3D software with the exception of XSI can take over a gigabyte for the same data. XSI uses even less memory than ZB with its gigapoly core but is ill-suited to manipulating such data, slowing down to a crawl (they probably optimized it in a way that's only well-suited for static data like offloading data to disk or even using some expensive forms of compression).
If you're really interested in exploring this, I'd be interested to see what you can come up with. Maybe we can exchange notes. I've devoted much of my career just being interested in figuring out what ZB is doing, or at least coming up with something of my own that can rival what it's doing. For just about everything else I've tackled over the years from raytracing to particle simulations to fluid dynamics and video processing and so forth, I've been able to at least come up with demos that rival or surpass the performance of the competition, but not ZBrush. ZBrush remains that elusive thorn in my side where I just can't figure out how they manage to be so damned efficient at what they do.
If you really want to crawl before you even begin to walk (I think raytracing is a decent enough start, but if you want to start out even more fundamental), then maybe a natural evolution is to first just focus on image processing: filtering images, painting them with brushes, etc., along with some support for basic vector graphics like a miniature Photoshop/Illustrator. Then work your way up to rasterizing some basic 3D primitives, like maybe just a wireframe of a model being rendered using Wu line rasterization and some basic projection functions. Then work your way towards rasterizing filled triangles without any lighting or texturing, at which point I think you'll get closer to ZBrush by focusing on raytracing rather than scanline with a depth buffer; however, doing a little bit of the latter might be a useful exercise anyway. Then work on rendering lit triangles, maybe starting with direct lighting and just a single light source, just computing a luminance based on the angle of the normal relative to the light source. Then work towards textured triangles, using barycentric coordinates to figure out which texels to render. Then work towards indirect lighting and multiple light sources. That should be plenty of homework for you to develop a fairly comprehensive idea of the fundamentals of rasterization.
Now once you get to raytracing, I'm actually going to recommend one of the least efficient data structures for the job typically: octrees, not BVH or KD-Tree, mainly because I believe octrees are probably closer to allowing what ZB allows. Your bottlenecks in this context don't have to do with rendering the most beautiful images with complex diffuse materials and indirect lighting and subpixel samples for antialiasing. It has to do with handling a boatload of geometry with simple lighting and simple shaders and one sample per pixel which is changing on the fly, including topologically. Octrees seem a little better suited in that case than KD-tree or BVHs as a starting point.
One of the problems with ignoring the fundamentals these days is that a lot of young developers have lost that connection from, say, triangle to pixel on the screen. So if you don't want to take such rasterization and projection for granted, then your initial goal is to project 3D data into a 2D coordinate space and rasterize it.
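If it helps, that triangle-to-pixel connection starts with nothing more than a perspective projection; a minimal pinhole-camera sketch (assumed conventions: camera at the origin looking down -Z, function and struct names are just illustrative) is:

```cpp
#include <cmath>

// Minimal pinhole projection: camera at the origin looking down -Z,
// vertical field of view fov_y, output resolution width x height.
// These conventions are assumed; real engines fold this into a matrix.
struct Pixel { float x, y; };

Pixel project(float px, float py, float pz,
              float fov_y_radians, int width, int height) {
    float f      = 1.0f / std::tan(fov_y_radians * 0.5f);  // focal length
    float aspect = static_cast<float>(width) / height;
    // Perspective divide: points farther away land closer to the centre.
    float ndc_x = (f / aspect) * (px / -pz);                // roughly -1..1
    float ndc_y = f * (py / -pz);                           // roughly -1..1
    // Map normalized device coordinates to pixel coordinates
    // (y flipped, since screen y grows downward).
    return { (ndc_x * 0.5f + 0.5f) * width,
             (1.0f - (ndc_y * 0.5f + 0.5f)) * height };
}
```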
If you want a book that starts at a low level, with framebuffers and such, try Computer Graphics: Principles and Practice, by Foley, van Dam, et al. It is an older, traditional text, but newer books tend to have a higher-level view. For a more modern text, I can also recommend 3D Computer Graphics by Alan Watt. There are plenty of other good introductory texts available -- these are just two that I am personally familiar with.
Neither of the above books is tied to OpenGL -- if I recall correctly, they include the specific math and algorithms necessary to understand and implement 3D graphics from the bottom up.

When using Direct3D, how much math is being done on the CPU?

Context: I'm just starting out. I'm not even touching the Direct3D 11 API, and instead looking at understanding the pipeline, etc.
From looking at documentation and information floating around the web, it seems like some calculations are being handled by the application. That is, instead of simply presenting matrices to the GPU to multiply, the calculations are being done by a math library that operates on the CPU. I don't have any particular resources to point to, although I guess I can point to the XNA Math Library or the samples shipped in the February DX SDK. When you see code like mViewProj = mView * mProj;, that product is being calculated on the CPU. Or am I wrong?
If you were writing a program where you can have 10 cubes on the screen, and you can move or rotate the cubes as well as the viewpoint, what calculations would you do on the CPU? I think I would store the geometry for a single cube, and then store transform matrices representing the actual instances. Then it seems I would use the XNA math library, or another of my choosing, to transform each cube in model space, get the coordinates in world space, and push the information to the GPU.
That's quite a bit of calculation on the CPU. Am I wrong?
Am I reaching conclusions based on too little information and understanding?
What terms should I Google for, if the answer is STFW?
Or if I am right, why aren't these calculations being pushed to the GPU as well?
EDIT: By the way, I am not using XNA, but the documentation notes that the XNA Math Library replaces the previous DX Math library. (I see the XNA Math library in the SDK as purely a template library.)
"Am I reaching conclusions based on too little information and understanding?"
Not as a bad thing, as we all do it, but in a word: Yes.
What is being done by the GPU is, generally, dependent on the GPU driver and your method of access. Most of the time you really don't care or need to know (other than curiosity and general understanding).
For mViewProj = mView * mProj;, this is most likely happening on the CPU. But it is not much of a burden (hundreds of cycles at the most). The real trick is the application of the new view matrix to the "world". Every vertex needs to be transformed, more or less, along with shading, textures, lighting, etc. All of this work will be done on the GPU (if done on the CPU, things will slow down really fast).
Generally you make high level changes to the world, maybe 20 CPU bound calculations, and the GPU takes care of the millions or billions of calculations needed to render the world based on the changes.
In your 10-cube example: you supply a transform for each cube, and any math needed for you to create the transform is CPU bound (with exceptions). You also supply a transform for the view; again, creating that transform matrix might be CPU bound. Once you have your 11 new matrices, you apply them to the world. From a hardware point of view, the 11 matrices need to be copied to the GPU... that will happen very, very fast... once copied, the CPU is done and the GPU recalculates the world based on the new data, renders it to a buffer and poops it on the screen. So for your 10 cubes the CPU-bound calculations are trivial.
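As a concrete sketch of how little that CPU-side work is, here is roughly what the per-frame matrix math for the 10-cube case might look like using DirectXMath (the successor to the XNA Math library mentioned in the question); the constant-buffer upload and draw calls are omitted, and the camera values are made up:

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// Per-frame CPU work for the 10-cube example (a sketch only). This is the
// "handful of matrices" side of the split: a few multiplies on the CPU,
// while the per-vertex transform work happens in the vertex shader.
void BuildCubeTransforms(float time, XMMATRIX out_wvp[10])
{
    // One view and one projection matrix for the camera.
    XMMATRIX view = XMMatrixLookAtLH(
        XMVectorSet(0.0f, 3.0f, -12.0f, 1.0f),   // eye position
        XMVectorSet(0.0f, 0.0f, 0.0f, 1.0f),     // look-at target
        XMVectorSet(0.0f, 1.0f, 0.0f, 0.0f));    // up direction
    XMMATRIX proj = XMMatrixPerspectiveFovLH(XM_PIDIV4, 16.0f / 9.0f, 0.1f, 100.0f);

    // One world matrix per cube; the combined matrix is what gets copied to
    // the GPU and applied to every vertex there.
    for (int i = 0; i < 10; ++i) {
        XMMATRIX world = XMMatrixRotationY(time + i) *
                         XMMatrixTranslation(static_cast<float>(i) * 2.0f - 9.0f, 0.0f, 0.0f);
        out_wvp[i] = world * view * proj;
    }
}
```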
Look at some reflected code for an XNA project and you will see where your calculations end and XNA begins (XNA will do everything it possibly can on the GPU).
