Is depth-sorting redundant if using depth-prepass?

To avoid overdraw (shading the same pixel twice), it's beneficial to draw objects front to back. When you draw an object in front, its depth values are written to the depth buffer; when you then draw an object behind it, the pixel can be rejected if its depth value doesn't pass the comparison operation. However, I've been wondering whether there's any benefit to depth sorting if you do a complete depth prepass. The only way I can see a benefit to sorting front to back is if testing the depth is separate from writing it. Because the stored depth value is read in order to compare it with the currently rasterised triangle, the number of depth buffer reads is ALWAYS the same, whether rendering front to back or back to front.
However, the number of writes to the depth buffer (to replace the stored value with a new lowest/highest value) does seem to depend on whether you sort front to back or back to front. I don't know much about it, but I'm led to believe that the depth test and write are a single operation (disabling depth tests also disables depth writes). If that's the case, then sorting won't matter.
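To make the read/write counting above concrete, here is a toy single-pixel model. It assumes a LESS depth test with writes enabled and a depth buffer cleared to 1.0, and it ignores real hardware details such as Hi-Z, depth compression and early-z. It only shows that the number of tests stays fixed while the number of writes depends on submission order.

```python
def depth_pass(depths):
    """Simulate one pixel of a depth-only pass (LESS test, writes enabled).

    depths: fragment depths in submission order (smaller = closer).
    Returns (tests, writes).
    """
    stored = 1.0              # depth buffer cleared to the far plane
    tests = writes = 0
    for z in depths:
        tests += 1            # every fragment reads the stored value to compare
        if z < stored:        # test passes, so the closer value is written
            stored = z
            writes += 1
    return tests, writes

fragments = [0.2, 0.4, 0.6, 0.8]
print(depth_pass(sorted(fragments)))                # front to back: (4, 1)
print(depth_pass(sorted(fragments, reverse=True)))  # back to front: (4, 4)
```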
So does it make a difference? And if so or if not, why?
Also, consider whether there's a difference if the depth pass for an object has a fragment shader attached that writes to depth (which is necessary for alpha-tested writes, such as chain-link fences, trees, etc.). My intuition is that alpha-tested objects won't make a difference: if you write to the depth buffer, the early depth test cannot be done, so the fragment shader needs to run regardless. The same question applies in this case, except that the depth is read and written at the end of fragment shader execution. The question is still whether only testing (and not writing) is faster than both testing and writing, i.e., is there a benefit to depth sorting for a depth prepass?

Sorting with the depth pre-pass can still be helpful. Some depth buffers have functionality that can cull entire groups of fragments with a single test. Hierarchical depth buffers, Hi-Z, whatever the particular name for it is, these technologies exist. And they work best when you render in a roughly front-to-back order.
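A rough sketch of why coarse (Hi-Z style) culling rewards front-to-back order, under a deliberately simplified model: each primitive is assumed to cover a whole tile at a constant depth, and the tile's farthest stored depth is recomputed with max() rather than kept as a conservative per-tile value the way real hardware would. The function name and tile size are illustrative only.

```python
def draw_tile(prim_depths, pixels_per_tile=4):
    """Toy model of one Hi-Z tile with a LESS depth test.

    prim_depths: depth of each primitive in submission order; each primitive is
    assumed to cover the whole tile at a constant depth (smaller = closer).
    Returns how many primitives were rejected by the single coarse test.
    """
    depth = [1.0] * pixels_per_tile          # per-pixel depth, cleared to far
    culled = 0
    for z in prim_depths:
        # Coarse test: if the primitive is no closer than the farthest stored
        # depth in the tile, every per-pixel test would fail, so skip the tile.
        if z >= max(depth):
            culled += 1
            continue
        depth = [min(d, z) for d in depth]    # per-pixel test and write
    return culled

print(draw_tile([0.2, 0.5, 0.7]))  # front to back: 2
print(draw_tile([0.7, 0.5, 0.2]))  # back to front: 0
```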

There are graphics techniques that require depth sorting regardless of, or in addition to, depth buffering.
For example, transparency: rendering transparent polygons in the wrong order produces an incorrect result.
I expect there are more techniques and effects like this; I imagine volumetric lighting, clouds, light gaps, and similar.
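As a small illustration of why transparency needs sorting even with a depth buffer: the standard "over" blend (src * alpha + dst * (1 - alpha)) is not commutative, so the same two 50% transparent layers give different colours depending on draw order. The colours and alpha values are arbitrary examples.

```python
def over(src_rgb, src_a, dst_rgb):
    """Standard 'over' blend per channel: src * a + dst * (1 - a)."""
    return tuple(s * src_a + d * (1 - src_a) for s, d in zip(src_rgb, dst_rgb))

def composite(layers, background):
    """Blend (rgb, alpha) layers onto the background in the given draw order."""
    color = background
    for rgb, a in layers:
        color = over(rgb, a, color)
    return color

red = ((1.0, 0.0, 0.0), 0.5)    # nearer layer
blue = ((0.0, 0.0, 1.0), 0.5)   # farther layer
black = (0.0, 0.0, 0.0)
print(composite([blue, red], black))  # back to front (correct): (0.5, 0.0, 0.25)
print(composite([red, blue], black))  # wrong order:              (0.25, 0.0, 0.5)
```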

Related

VKRay/DXR: What information do you store in your "Payload" structure?

This will be an important design decision in my next ray-tracing project.
We know that the "Payload" structure is used for passing data from the closest-hit shader to the raygen shader. For a recursive PBRT system, basically, I have 2 options here:
Use it to store the ray-surface interaction information of the hit-point (be it a BRDF or a BSDF)
Use it to store the information of the next ray to trace
The 1st option is intuitive. There is a universal representation of the ray-surface interaction. The closest-hit shaders return that information to the raygen shader, where it is processed uniformly to update the integrated color and generate the next ray. A problem with this design is that the BSDF representation can be big and complicated in a PBRT system. For VKRay/DXR, we want the Payload to be as small as possible.
The 2nd option leaves the evaluation of the ray-surface interaction to the closest-hit shaders. Each closest-hit shader can have its own representation of the ray-surface interaction, which can be extremely simple (depending on the geometry representation). In this case, the Payload stores the evaluated results, such as the information for the next ray to trace. A possible issue is that the closest-hit shaders are the divergent parts of the execution flow; would they be less efficient than the raygen shader?
Which one do you prefer, and why?
The short answer is that it will depend on the hardware you're running on, and it's worth experimenting with. What we can say is that method 1 will certainly have a smaller memory footprint and probably better ray gathering and caching, but it may also need more bandwidth due to the increased payload size. Another advantage of method 1 is that you will not be limited by recursion depth.
Of course, method 2 is more flexible from a design perspective. Adding a new material/object would involve adding a new shader without modifying the raygen shader/payload definition.
If you are going for method 1, you may also want to look into "ray queries", which bypass the need for a closest-hit shader altogether. You will effectively end up with a very bulky raygen shader. It may be less flexible, but it should be faster.
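To put rough numbers on the payload-size trade-off, the sketch below measures two hypothetical payload layouts with Python's struct module. The field lists are assumptions for illustration, not taken from any particular renderer; real HLSL/GLSL payloads follow the shader language's own layout rules, and a full BSDF in a PBRT-style system would usually be larger than the option 1 layout here.

```python
import struct

# Hypothetical payload layouts; every field is 32 bits and "<" means no padding.
# Option 1: closest-hit returns the surface interaction (here a very small BSDF).
OPTION1 = "<3f 3f 3f f f f f 3f"  # position, normal, base color, roughness,
                                  # metallic, ior, transmission, emission
# Option 2: closest-hit returns the already-evaluated next ray.
OPTION2 = "<3f 3f 3f I"           # next origin, next direction, throughput, flags

print(struct.calcsize(OPTION1))  # 64 bytes
print(struct.calcsize(OPTION2))  # 40 bytes
```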

How might I organize vertex data in WebGL for a frame-by-frame (very specific) animated program?

I have been working on an animated graphics project with very specific requirements, and after quite a bit of searching and test coding, I have figured that I could take several approaches, but the Khronos and MDN documentation I have been reading coupled with other posts I have seen here don't answer all of my questions regarding my particular project. In the meantime, I have written short test programs (setting infrastructure for testing).
Firstly, I should describe the project:
The main object drawn to the screen is a simple quad surrounded by a black outline (LINE_LOOP or LINES will do, probably, though I have had issues with z-fighting...that will be left for another question). When the user interacts with the program, exactly one new quad is created and immediately drawn, but for a set amount of time its vertices move around until the quad moves to its final destination. (Note that translations won't do.) Random black lines are also drawn, and sometimes those lines also move around.
Once one of the quads reaches its final spot, it never moves again.
A new quad is always atop old quads (closer to the screen). That means that I need to layer the quads and lines from oldest to newest.
*This also means that it would probably be best to assign z-values to each quad and line, even if the graphics are in pixel coordinates and use an orthographic matrix. Would everyone agree with this?
Given these parameters, I have a few options with varying levels of complexity:
1> Take the object-oriented approach and just assign a buffer to each quad, and do the same for the random lines. This means creating and destroying buffers every frame for the one shape that is moving. I truthfully think this is a terrible idea that might only work in a higher-level library that does heavy optimization underneath. This approach also doesn't take advantage of the fact that almost every quad stays the same.
[vertices0] ... , [verticesN]
Draw x N (many draws for many small-size buffers)
2> Assign a z-value to each quad, outline, and line (as mentioned above). Allocate a huge vertex buffer and element buffer to store all permanently-in-their-final-positions quads. Resize only in the very unlikely case someone interacts for long enough. Create a second tiny buffer to store the one temporary moving quad and use bufferSubData every frame. When the quad reaches its destination, bufferSubData it into the large buffer and overwrite the small buffer upon creation of the next quad...all on the same frame. The main questions I have here are: is it possible (safe?) to use bufferSubData and draw it on the same frame? Also, would I use DYNAMIC_DRAW on both buffers even though the larger one would see fewer updates?
[permanent vertices ... | uninitialized (keep a count)]
bufferSubData -> [tempVerticesForOneQuad]
Draw 2x
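Here is a CPU-side sketch of the bookkeeping for approach 2: one preallocated array plus a count, where each finished quad is copied into the next free slot, and the returned byte offset is what you would pass as the offset argument of gl.bufferSubData(gl.ARRAY_BUFFER, offset, new Float32Array(floats)). QuadStore is a hypothetical helper, and the x, y, z layout with one z value per quad is an assumption based on the layering idea above.

```python
from array import array

FLOATS_PER_VERTEX = 3   # x, y, z (assumed layout)
VERTICES_PER_QUAD = 4
BYTES_PER_FLOAT = 4     # size of one Float32Array element

class QuadStore:
    """Hypothetical CPU-side mirror of one large, preallocated vertex buffer.

    Only the first `count` quads hold data; the rest is spare space that later
    bufferSubData calls fill in.
    """

    def __init__(self, capacity_quads):
        self.capacity = max(1, capacity_quads)
        self.count = 0
        size = self.capacity * VERTICES_PER_QUAD * FLOATS_PER_VERTEX
        self.data = array("f", [0.0] * size)

    def append(self, corners, z):
        """Copy a finished quad into the next free slot.

        Returns (byte_offset, floats): the arguments you would hand to
        gl.bufferSubData(gl.ARRAY_BUFFER, byte_offset, new Float32Array(floats)).
        """
        if self.count == self.capacity:
            self._grow()
        start = self.count * VERTICES_PER_QUAD * FLOATS_PER_VERTEX
        floats = [float(v) for (x, y) in corners for v in (x, y, z)]
        self.data[start:start + len(floats)] = array("f", floats)
        self.count += 1
        return start * BYTES_PER_FLOAT, floats

    def _grow(self):
        # The "very unlikely" resize: double the capacity. On the GPU this
        # means a new bufferData call and re-uploading everything.
        self.data.extend(array("f", [0.0] * len(self.data)))
        self.capacity *= 2
```

Doubling keeps the rare resize cheap on average; every other frame only touches the slot of the quad that just finished moving.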
3> Still create the large and small buffers, but instead of using bufferSubData every frame, create a second shader program and add an attribute for the new/moving quad that explicitly sets the vertex positions for the animation (I would pass vertex index attributes). Only draw with the small buffer when the quad is moving. For the frame when the quad reaches its destination, draw both large and small buffer, but then bufferSubData the final coordinates into the large permanent buffer to be used in the next frame.
switchToShaderProgramA();
[permanent vertices...| uninitialized (keep a count)]
switchToShaderProgramB();
[temp vertices] <- shader B accepts indices for each vertex so we can do all animation in the vertex shader
---last frame of movement arrives: bufferSubData into the permanent vertices buffer for when the next quad is created
I get the sense that the third option might be the best, but I would like to learn whether there are other factors that I did not consider. For example, I am assuming that a program switch, additional attributes, and vertex shader manipulation would be faster than just substituting the buffer values as in 2>. The advantage of approach 3> (I think) is that I can defer the buffer substitution to a time when nothing needs to be drawn.
Still, I am not sure how to handle the randomly appearing lines. I can't take the "single quad vertex buffer" approach, since the number of lines cannot be predicted. Might I also allocate a large buffer for the moving lines? Those also stay after the quad has finished moving, though I don't think I could use the vertex shader trick, because there would be too many attributes to set (as opposed to the 4 for the one quad). I suppose I could create a large "permanent line data" buffer first, but what to do during the animation is tricky because the lines move. Maybe bufferSubData() + draw on the same frame is not terrible? Or maybe it is. This is where I need advice.
I understand that this question might not be too specific code-wise, but I don't believe that I would be allowed to show the core of the program. All I have is the typical WebGL boilerplate ready.
I am looking forward to hearing people's thoughts on how I might proceed and whether there are any trade-offs I might have missed when considering the three options above.
Thank you in advance, and please feel free to ask any additional questions if clarification is necessary.
Honestly, for what you're describing, it doesn't sound to me like it matters which you choose. On modern hardware, drawing a few hundred quads and a few thousand lines each frame would not really tax the hardware much.
Having said that, I agree that approach 1 seems very inefficient. Approach 2 sounds perfectly fine. You can safely draw a buffer on the same frame that you uploaded the data. I don't think it matters much whether you use DYNAMIC_DRAW or STATIC_DRAW for the buffer. I tend to think of dynamic buffers as something you're updating every frame. If you only update it every few seconds or less often, then static is fine. Approach 3 is also fine. Between 2 and 3, I'd say do whichever is easier for you to understand and program.
Likewise, for the lines, I would use a separate buffer. It sounds like that one changes per frame, so I would use DYNAMIC_DRAW for that. Allocating a single large buffer for it and performing a glBufferSubData() per frame is probably a fine strategy. As always, trying it and profiling it will tell you for sure.

several walkers walking on a grid: How to organize the threads?

My algorithm processes DEMs. A DEM (Digital Elevation Model) is a representation of ground topography where elevation is known at grid nodes.
My problem can be summarized as follows:
Q is a queue containing nodes to visit.
At start, the boundary of the grid is pushed into Q.
while Q is not empty, do
    remove node N from the top of Q
    if N was never visited then do
        consider the 8 neighbors of N
        among them, select the unvisited ones
        among those, select the ones with a higher elevation than N's
        push these at Q's tail
        mark N as visited
    done
done
As described, the algorithm will mark as 'visited' every node that can be reached from the border by a continuously ascending path. It is worth noting that the order in which the nodes in the queue are processed is unimportant. Note also that some points may require a tortuous ascending path to be reached from the border. Think, for example, of a cone with a furrow spiraling around it. The ridge of the furrow is such a unique and tortuous path, capable of reaching the top of the cone without ever descending into the furrow.
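The pseudocode above translates almost line for line into a sequential Python sketch. It assumes the DEM is a rectangular list of rows and uses a strict "higher than" comparison, as in the description. A node may be pushed more than once, which is harmless because the visited check discards repeats.

```python
from collections import deque

def reachable_ascending(dem):
    """Mark every node reachable from the grid border by a strictly ascending path.

    dem: rectangular list of rows of elevations.
    Returns a list of lists of booleans (True = visited).
    """
    rows, cols = len(dem), len(dem[0])
    visited = [[False] * cols for _ in range(rows)]
    q = deque()
    # At start, the boundary of the grid is pushed into Q.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                q.append((r, c))
    while q:
        r, c = q.popleft()                 # remove N from the top of Q
        if visited[r][c]:
            continue
        for dr in (-1, 0, 1):              # the 8 neighbors of N
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    # unvisited and higher than N: push at Q's tail
                    if not visited[nr][nc] and dem[nr][nc] > dem[r][c]:
                        q.append((nr, nc))
        visited[r][c] = True
    return visited
```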
Anyway, I want to multithread this algorithm. I am still at the first step of working out the best organization of data and threads, in order to have as little pain as possible when debugging the beast once it is written.
My first thought is to divide the grid into tiles and split the queue into as many queues as there are tiles in the grid. The tiles are piled in a work-list. A few threads parse the work-list and grab any tile where something can be done at the moment.
Working on a specific tile first requires that the tile's queue is not empty. I may also need to lock the neighboring tiles if the walker has to visit a node at the edge of its tile.
I am thinking that when a walker cannot lock a neighboring tile that it needs, it can skip to the next node in the local queue, or the thread itself can even release the tile back to the work-list and look for another tile to work on.
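One way to keep the tiled version easy to debug is to replace the neighbour-locking scheme described above with bulk-synchronous rounds (a different technique from the one in the question). In each round, every tile with pending nodes is processed by exactly one task; nodes that cross into another tile go into an outbox, and the outboxes are merged single-threaded between rounds. Because the processing order does not matter, the result should match the sequential algorithm. The function and parameter names are hypothetical, and in CPython the GIL means the threads show the structure rather than a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def tiled_reachable(dem, tile=2, workers=4):
    """Tile-based variant: bulk-synchronous rounds, one task per tile per round."""
    rows, cols = len(dem), len(dem[0])
    visited = [[False] * cols for _ in range(rows)]

    def tile_of(r, c):
        return (r // tile, c // tile)

    # Per-tile queues, seeded with the grid boundary.
    inbox = {}
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                inbox.setdefault(tile_of(r, c), []).append((r, c))

    def work(tid, pending):
        # Only this task reads or writes `visited` for nodes of tile `tid`
        # during the current round, so no locks are needed.
        outbox = []
        while pending:
            r, c = pending.pop()              # processing order is unimportant
            if visited[r][c]:
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if not (dr or dc) or not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    if dem[nr][nc] <= dem[r][c]:
                        continue
                    if tile_of(nr, nc) == tid:
                        if not visited[nr][nc]:
                            pending.append((nr, nc))
                    else:
                        outbox.append((nr, nc))   # the owning tile filters repeats
            visited[r][c] = True
        return outbox

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while inbox:
            futures = [pool.submit(work, tid, q) for tid, q in inbox.items()]
            inbox = {}
            # Merge outboxes single-threaded, between rounds.
            for f in futures:
                for r, c in f.result():
                    inbox.setdefault(tile_of(r, c), []).append((r, c))
    return visited
```

Since every tile has a single owner per round, a bug can be reproduced by rerunning the same rounds, which avoids most of the lock-ordering trouble.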
My experience with multithreaded programming is good enough to understand that this lovely description is very likely to turn into a nightmare when I debug it. However, I am not experienced enough to evaluate the various ways of programming the algorithm and make a good decision, bearing in mind that I will not be given a month to debug a spaghetti dish.
Thanks for reading :)

Fixing an incorrectly taken 3D head scan

The problem I am facing is the following.
I have a number of 3D head scans. Some of them were taken correctly (like the attached example), but in many it is easy to see that the scanned person did not have his head exactly aligned with the machine's front, and thus one side of the texture (and depth map) seems "wider" (the exact reason is that one side was captured more from behind, which can easily be seen if you look at the ears).
Fortunately, when I convert from cylindrical coordinates to Cartesian ones and render the face with XNA, the face is symmetrical.
Now, I would like the texture and depth maps of all my heads to be as nice and symmetrical as the correct one (because later I want to align them and perform PCA).
The idea I have at the moment is that I could interpolate the surfaces between all of the vertices and, from those interpolations, take new vertices that are equally distanced from each other.
This solution seems like a lot of work, and maybe it's overkill.
Maybe there is some other way (like getting that interpolation data from DirectX/XNA, which has to calculate it at some point anyway).
I will be most thankful for helpful answers.
The correct example:
http://i55.tinypic.com/332mio2.jpg
Incorrect example:
http://i54.tinypic.com/309ujvt.jpg
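A minimal sketch of the equal-spacing resampling idea from the question, for one horizontal slice of the head treated as a 2D polyline in Cartesian coordinates: it computes the cumulative arc length and places n samples at equal arc-length steps using linear interpolation. Treating each slice independently and using linear (rather than smoother) interpolation are assumptions; n must be at least 2.

```python
import bisect
import math

def resample_polyline(points, n):
    """Return n points spaced at equal arc length along a 2D polyline."""
    cum = [0.0]                                   # cumulative arc length
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out = []
    for i in range(n):
        s = total * i / (n - 1)                   # target arc length
        # index of the segment that contains arc length s
        k = min(bisect.bisect_right(cum, s) - 1, len(points) - 2)
        seg = cum[k + 1] - cum[k]
        t = 0.0 if seg == 0 else (s - cum[k]) / seg
        (x0, y0), (x1, y1) = points[k], points[k + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

print(resample_polyline([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], 5))
# [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]
```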
It's probably possible to salvage (some of) the bad scans to some degree using some coordinate transformations, but you would have to guess the "incorrectness" of the alignment and it's probably impossible to do automatically.
But unless the original subject is dead (or otherwise unavailable), it's probably a lot easier to redo the scans.
Making another scan is very likely to be quicker, and you won't lose quality the way transforming the bad scans probably will. The nose on the incorrect sample seems to be shadowing the side of the nose, and no fancy algorithm can ever fix the missing data.

What is the best approach to compute efficiently the first intersection between a viewing ray and a set of objects?

For instance:
An approach to compute efficiently the first intersection between a viewing ray and a set of three objects: one sphere, one cone and one cylinder (other 3D primitives).
What you're looking for is a spatial partitioning scheme. There are a lot of options for dealing with this, and a lot of research has been done in this area as well. A good read would be Christer Ericson's Real-Time Collision Detection.
One easy approach covered in that book is to define a grid, assign each object to all the cells it intersects, and walk along the grid cells intersected by the ray, front to back, testing for intersection with each object associated with the current cell. Keep in mind that an object might be associated with several grid cells, so the computed intersection point might not actually be in the current cell but in a later one.
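The grid walk can be sketched in 2D with circles standing in for the 3D primitives (a simplification). The cell stepping follows the Amanatides and Woo style of DDA, and the check t <= t_exit implements the caveat above: an object registered in several cells can be hit beyond the current cell, so such a hit is only accepted in a cell whose exit lies past the hit. The grid is assumed to start at the origin with the ray origin inside it, and in the usage below cell registration is done by hand.

```python
import math

def ray_circle(o, d, center, r):
    """Smallest t >= 0 where o + t*d hits the circle, or None."""
    ox, oy = o[0] - center[0], o[1] - center[1]
    a = d[0] ** 2 + d[1] ** 2
    b = 2 * (ox * d[0] + oy * d[1])
    c = ox * ox + oy * oy - r * r
    disc = b * b - 4 * a * c
    if disc < 0:
        return None
    sq = math.sqrt(disc)
    for t in ((-b - sq) / (2 * a), (-b + sq) / (2 * a)):
        if t >= 0:
            return t
    return None

def first_hit_grid(o, d, grid, cell, nx, ny):
    """Walk grid cells front to back; return (t, obj) for the first hit or None.

    grid maps (i, j) -> list of (center, radius) objects overlapping that cell.
    """
    i, j = int(o[0] // cell), int(o[1] // cell)
    step_x = 1 if d[0] > 0 else -1
    step_y = 1 if d[1] > 0 else -1
    # Ray parameter of the next vertical / horizontal cell boundary.
    next_x = (i + (step_x > 0)) * cell
    next_y = (j + (step_y > 0)) * cell
    t_max_x = (next_x - o[0]) / d[0] if d[0] != 0 else math.inf
    t_max_y = (next_y - o[1]) / d[1] if d[1] != 0 else math.inf
    t_delta_x = cell / abs(d[0]) if d[0] != 0 else math.inf
    t_delta_y = cell / abs(d[1]) if d[1] != 0 else math.inf
    while 0 <= i < nx and 0 <= j < ny:
        t_exit = min(t_max_x, t_max_y)
        best = None
        for obj in grid.get((i, j), []):
            t = ray_circle(o, d, *obj)
            # Reject hits beyond this cell; a later cell will report them.
            if t is not None and t <= t_exit and (best is None or t < best[0]):
                best = (t, obj)
        if best is not None:
            return best
        if t_max_x < t_max_y:
            i += step_x
            t_max_x += t_delta_x
        else:
            j += step_y
            t_max_y += t_delta_y
    return None
```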
The next question would be how you define that grid. Unfortunately, there's no one good answer, and you need to consider what approach might fit your scenario best.
Other partitioning schemes of interest are different tree structures, such as kd-, Oct- and BSP-trees. You could even consider using trees combined with a grid.
EDIT
As pointed out, if your set really is just these three objects, you're definitely better off just intersecting each one and picking the earliest hit. If you're looking for ray-sphere, ray-cylinder, etc., intersection tests, these are not really hard, and a quick Google search should supply all the math you might possibly need. :)
"computationally efficient" depends on how large the set is.
For a trivial set of three, just test each of them in turn, it's really not worth trying to optimise.
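For the three-object case, a brute-force loop is enough. This sketch only implements a ray-sphere test; cone and cylinder tests would be plugged in as further intersect functions with the same (origin, direction) -> t or None shape, which is an assumed interface rather than anything from a library.

```python
import math

def ray_sphere(o, d, center, radius):
    """Smallest t >= 0 such that o + t*d lies on the sphere, or None."""
    oc = [o[i] - center[i] for i in range(3)]
    a = sum(x * x for x in d)
    b = 2.0 * sum(oc[i] * d[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    sq = math.sqrt(disc)
    for t in ((-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)):
        if t >= 0.0:
            return t
    return None

def first_hit(o, d, objects):
    """Brute force: test every object and keep the nearest hit as (t, name)."""
    best = None
    for name, intersect in objects:
        t = intersect(o, d)
        if t is not None and (best is None or t < best[0]):
            best = (t, name)
    return best

scene = [
    ("far sphere", lambda o, d: ray_sphere(o, d, (0.0, 0.0, 10.0), 1.0)),
    ("near sphere", lambda o, d: ray_sphere(o, d, (0.0, 0.0, 5.0), 1.0)),
    # cone and cylinder tests would be added here with the same signature
]
print(first_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))  # (4.0, 'near sphere')
```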
For larger sets, look at data structures that divide space (e.g. k-d trees). Whole chapters (and indeed whole books) are dedicated to this problem. My favourite reference book is An Introduction to Ray Tracing (ed. Andrew S. Glassner).
Alternatively, if I've misread your question and you're actually asking for algorithms for ray-object intersections for specific types of object, see the same book!
Well, it depends on what you're really trying to do. If you'd like to produce a solution that is correct for almost every pixel in a simple scene, an extremely quick method is to pre-calculate "what's in front" for each pixel by pre-rendering all of the objects, each with a unique identifying color, into an off-screen buffer using z-buffered scan conversion. This is sometimes referred to as an item buffer.
Using that pre-computation, you then know what will be visible for almost all rays that you'll be shooting into the scene. As a result, your ray-environment intersection problem is greatly simplified: each ray hits one specific object.
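A toy version of the item-buffer idea: screen-space rectangles at constant depth stand in for real scan-converted primitives (an assumption to keep it short), and each pixel records the id of the nearest object using an ordinary z-buffer test. A primary ray through pixel (x, y) then only needs an exact intersection test against items[y][x].

```python
def build_item_buffer(width, height, rects):
    """Scan-convert rectangles into an item buffer using a z-buffer.

    rects: list of (obj_id, x0, y0, x1, y1, depth) with half-open pixel ranges
    and a constant depth per rectangle (smaller = closer).
    Returns a height x width grid of object ids (None = background).
    """
    zbuf = [[float("inf")] * width for _ in range(height)]
    items = [[None] * width for _ in range(height)]
    for obj_id, x0, y0, x1, y1, depth in rects:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if depth < zbuf[y][x]:      # ordinary z-buffer test
                    zbuf[y][x] = depth
                    items[y][x] = obj_id    # remember which object is in front
    return items

items = build_item_buffer(4, 2, [("far", 0, 0, 4, 2, 5.0), ("near", 1, 0, 3, 2, 2.0)])
print(items[0])  # ['far', 'near', 'near', 'far']
```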
When I was doing this many years ago, I was producing real-time raytraced images of admittedly simple scenes. I haven't revisited that code in quite a while but I suspect that with modern compilers and graphics hardware, performance would be orders of magnitude better than I was seeing then.
PS: I first read about the item buffer idea when I was doing my literature search in the early 90s. I originally found it mentioned in (I believe) an ACM paper from the late 70s. Sadly, I don't have the source reference available but, in short, it's a very old idea and one that works really well on scan conversion hardware.
I assume you have a ray d = (dx,dy,dz), starting at o = (ox,oy,oz) and you are finding the parameter t such that the point of intersection p = o+d*t. (Like this page, which describes ray-plane intersection using P2-P1 for d, P1 for o and u for t)
The first question I would ask is "Do these objects intersect"?
If not, then you can cheat a little and check for ray collisions in order. Since you have three objects that may or may not move per frame, it pays to pre-calculate their distance from the camera (e.g. from their centre points). Test against each object in turn, by distance from the camera, from smallest to largest. Although the empty space is the most expensive part of the render now, this is more effective than just testing against all three and taking a minimum value. If your image is high-res, this is especially efficient, since you amortise the cost across the number of pixels.
Otherwise, test against all three and take a minimum value...
In other situations you may want to make a hybrid of the two methods. If you can test two of the objects in order then do so (e.g. a sphere and a cube moving down a cylindrical tunnel), but test the third and take a minimum value to find the final object.
