UAV counter indices used across multiple shaders? - graphics

I've been trying to implement a Compute Shader based particle system.
I have a compute shader which builds a structured buffer of particles, using a UAV with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
When I add to this buffer, I check if this particle has any complex behaviours, which I want to filter out and perform in a separate compute shader. As an example, if the particle wants to perform collision detection, I add its index to another structured buffer, also with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
I then run a second compute shader, which processes all the indices, and applies collision detection to those particles.
However, in the second compute shader, I'd estimate that about 5% of the indices are wrong - they belong to other particles, which don't support collision detection.
Here's the compute shader code that performs the list building:
// append to destination buffer
uint dstIndex = g_dstParticles.IncrementCounter();
g_dstParticles[ dstIndex ] = particle;
// add to behaviour lists
if ( params.flags & EMITTER_FLAG_COLLISION )
{
    uint behaviourIndex = g_behaviourCollisionIndices.IncrementCounter();
    g_behaviourCollisionIndices[ behaviourIndex ] = dstIndex;
}
If I split out the "add to behaviour lists" bit into a separate compute shader, and run it after the particle lists are built, everything works perfectly. However I think I shouldn't need to do this - it's a waste of bandwidth going through all the particles again.
I suspect that IncrementCounter is actually not guaranteed to return a unique index into the UAV, and that there is some clever optimisation going on that means the index is only valid inside the compute shader it is used in. And thus my attempt to pass it to the second compute shader is not valid.
Can anyone give any concrete answers to what's going on here? And if there's a way for me to keep the filtering inside the same compute shader as my core update?
Thanks!

IncrementCounter is an atomic operation and so will (driver/hardware bugs notwithstanding) return a unique value to each thread that calls it.
Have you thought about using Append/Consume buffers for this, as it's what they were designed for? The first pass simply appends the complex collision particles to an AppendStructuredBuffer and the second pass consumes from the same buffer but using a ConsumeStructuredBuffer view instead. The second run of compute will need to use DispatchIndirect so you only run as many thread groups as necessary for the number in the list (something the CPU won't know).
The usual recommendations apply, though: have you tried enabling the D3D11 Debug Layer, and running on the reference device to be sure it isn't a driver issue?
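The atomic-counter behaviour described in this answer can be modelled on the CPU. The following is only a minimal sketch, not D3D11 code: AtomicCounter, build_lists and the "collides" field are hypothetical names, and a lock stands in for the hardware atomic. It mirrors the list-building logic from the question and shows that, as long as each increment is atomic, every index written to the collision list points at a particle that requested collision.

```python
import threading


class AtomicCounter:
    """CPU stand-in for a UAV hidden counter (hypothetical helper)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Like IncrementCounter(): atomically returns the pre-increment value.
        with self._lock:
            old = self._value
            self._value += 1
            return old

    def count(self):
        with self._lock:
            return self._value


def build_lists(particles, num_threads=8):
    """Append particles to dst and record indices of colliding ones."""
    dst = [None] * len(particles)
    collision_indices = [None] * len(particles)
    dst_counter = AtomicCounter()
    collision_counter = AtomicCounter()

    def worker(chunk):
        for particle in chunk:
            dst_index = dst_counter.increment()
            dst[dst_index] = particle
            if particle["collides"]:
                behaviour_index = collision_counter.increment()
                collision_indices[behaviour_index] = dst_index

    threads = [
        threading.Thread(target=worker, args=(particles[k::num_threads],))
        for k in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Only the first count() entries of the index list are valid.
    return dst, collision_indices[: collision_counter.count()]
```

If a model like this behaves correctly while the GPU version does not, that points towards something outside the counter logic itself, such as the driver or how the buffers are bound between passes.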

Related

Ways of drawing vertex ranges in Direct3D

Suppose I have a one big shader program and want only specific ranges of vertices/triangles rendered.
What are some performant ways of doing this? Which one looks most promising?
I came up with 3 methods, are there some more?
1. Batching draw calls Draw(position, count) using command lists. (OpenGL has glMultiDrawArrays.)
If the ranges are known ahead of time, we don't need to worry about the time spent constructing them. However, since some ranges change unpredictably, it is probably unrealistic to keep command lists for all possibilities.
This method obviously reduces the time spent constructing the draw calls on the CPU side, and I assume that these calls already tell the GPU that it does not need to do any state changes.
2. Call Draw on the whole buffer and keep updating a per-vertex float buffer used as a boolean flag, which would just multiply the output positions by 0 in inactive ranges and by 1 in active ranges.
The benefit here is only one draw call. However, we need to update the buffer, and it seems like the buffer needs to be locked while we update it...
3. Powerset defined in a constant buffer. For n ranges, we can use one n-bit constant mask compared against a static n-bit per_vertex_mask. A vertex is visible if (mask & per_vertex_mask) != 0.
Constant buffers are probably cheaper to update than a whole vertex buffer. The number of ranges may, however, be too large for this method.
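The mask-based method can be sketched on the CPU to make the bookkeeping concrete. This is only an illustration under assumed names (build_per_vertex_masks, active_mask and visible are hypothetical, and ranges are given as (start, count) pairs); in a real renderer the final test would run in the vertex shader.

```python
def build_per_vertex_masks(ranges, vertex_count):
    """Static data: bit r is set for every vertex that belongs to range r."""
    masks = [0] * vertex_count
    for range_id, (start, count) in enumerate(ranges):
        for vertex in range(start, start + count):
            masks[vertex] |= 1 << range_id
    return masks


def active_mask(active_range_ids):
    """Per-draw data: the n-bit mask that would live in a constant buffer."""
    mask = 0
    for range_id in active_range_ids:
        mask |= 1 << range_id
    return mask


def visible(per_vertex_masks, mask):
    """The per-vertex test: a vertex is drawn if it shares a bit with the mask."""
    return [(vertex_mask & mask) != 0 for vertex_mask in per_vertex_masks]
```

For example, with ranges [(0, 3), (3, 3), (6, 3)] and active ranges 0 and 2, only the middle three vertices are hidden. The parentheses around the & matter in C-like shader languages, where != binds more tightly than &.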

Vulkan Shader & Resources: Why Uniform and not Const Resources

We usually use const in C++ to imply that a value does not change (read-only). Why, in GLSL/Vulkan shader and resource definitions, did they choose the word uniform? Wouldn't it be more consistent to use the keyword borrowed from C/C++?
Besides that, the uniform keyword in shader definitions probably gives the compiler a hint to place those resources as close to the hardware as possible, perhaps in shared memory or registers? I'm not sure about that.
That is probably also why the Vulkan spec mentions that these types of resources should hold only small amounts of data, e.g. values of cosmological constants.
Is there anything I'm missing, or some bit of history that has been lost?
Uniforms in GPU programming and const in C++ are focused on different things.
C++ const documents that a variable is not intended to be changed, with some compiler enforcement. As such it's more about using the type system to improve clarity and enforce intended usage -- important for large-project software engineering. You can still get around it with const_cast or other tricks, and the compiler can't assume you didn't, so it's not strictly enforced.
The important thing about uniforms is that they're, well, uniform, meaning they have the same value whenever they are read within a draw call. Since there might be hundreds to millions of reads of that value in a single draw call, this allows several optimizations: only one copy of it needs to be cached, it can be preloaded into registers (or cache) before shaders run, it can be kept in a non-coherent cache, a single read result can be broadcast across all SIMD lanes in a core, etc. For this to work, the fact that the contents can't change must be strictly enforced (with memory aliasing you can now get around even this, but results are very much undefined if you do). So uniform really isn't about declaring intent to other programmers for software engineering benefits like const is; it's about declaring intent to the compiler and driver so they can optimize based on it.
D3D uses "const" and "constant buffer" rather than uniform, so clearly there is some overlap. Though that does lead to saying things like "how many times do you update constants per frame?" which when you think about it is kind of a weird thing to say :). The values are constant within shader code, but very much aren't constant at the API level.
The etymology of the word is important here. The term "uniform" is derived from GLSL, which was inspired by the Renderman standard's shader terminology. In Renderman, "uniform" was used for values "whose values are constant over whatever portion of the surface being shaded". This was an alternative to "varying", which represented values interpolated across the surface.
"Constant" would imply that the value never changes. Uniform values do change; they simply don't change at the same frequency as other values. Input values change per-invocation, uniform values change per-draw call, and constant values don't change. Note that in GLSL, const usually means "compile-time constant": a value that is set at compile time and is never changed.
A uniform variable in Vulkan ultimately comes from a resource that exists outside of the shader. Blocks of uniform variables fed by buffers and uniforms in push constants fed by push constant state are both external resources, set by the user. That's a fundamentally different concept from having a compile-time constant struct.
Since it's different from a constant struct, it needs a different term to request it.
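The three update frequencies described above (compile-time constant, per-draw uniform, per-invocation input) can be illustrated with a toy CPU model. This is only a sketch with hypothetical names (FLIP_Y, vertex_shader, draw); it is not how any driver works, and it only shows which values may change at which point.

```python
from types import MappingProxyType

# Analogous to a GLSL const: fixed when the "shader" is written.
FLIP_Y = -1.0


def vertex_shader(position, uniforms):
    # position changes per invocation; uniforms are identical for every
    # invocation within one draw call.
    x, y = position
    return (x * uniforms["scale"] + uniforms["offset"],
            y * uniforms["scale"] * FLIP_Y)


def draw(vertices, uniforms):
    # Snapshot the externally supplied uniforms once per draw and hand the
    # shader a read-only view, so invocations cannot modify them.
    frozen = MappingProxyType(dict(uniforms))
    return [vertex_shader(vertex, frozen) for vertex in vertices]
```

Calling draw twice with different uniform dictionaries models "updating constants" between draw calls: the values differ at the API level but are fixed for the duration of each call.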

Efficient 2D rendering with Glium

I'm using Glium to do rendering for an emulator I'm writing. I've pieced together something that works (based on this example) but I suspect that it's pretty inefficient. Here's the relevant function:
fn update_screen(display: &Display, screen: &Rc<RefCell<NesScreen>>) {
    let target = display.draw();
    // Write screen buffer
    let borrowed_scr = screen.borrow();
    let mut buf = vec![0_u8; 256 * 240 * 3];
    buf.clone_from_slice(&borrowed_scr.screen_buffer[..]);
    let screen = RawImage2d::from_raw_rgb_reversed(buf, SCREEN_DIMENSIONS);
    glium::Texture2d::new(display, screen)
        .unwrap()
        .as_surface()
        .fill(&target, MagnifySamplerFilter::Nearest);
    target.finish().unwrap();
}
At a high level, this is what I'm doing:
Borrow NesScreen which contains the screen buffer, which is an array.
Clone the screen buffer into a vector
Create a texture from the vector data and render it
My suspicion is that cloning the entire screen buffer via clone_from_slice is really inefficient. The RawImage2d::from_raw_rgb_reversed function takes ownership of the vector passed into it, so I'm not sure how to do this in a way that avoids the clone.
So, two questions:
Is this actually inefficient? I don't have enough experience rendering stuff to know intuitively.
If so, is there a more efficient way to do this? I've scoured Glium quite a bit but there isn't much specific to 2D rendering.
This won't be a very good answer, but maybe a few things here could help you.
First of all: is this really inefficient? That's really hard to say, especially the OpenGL part, as OpenGL performance depends a lot on when synchronization is required/requested.
As for the cloning of the screen buffer: you are merely copying 180 KB (256 * 240 * 3 bytes), which is not too much. I quickly benchmarked it on my machine, and cloning a 180 KB vector takes around 5µs, which is really not a lot.
Note that you can create a RawImage2d without using a method, because all fields are public. This means that you can avoid the simple 5µs clone if you create a reversed vector yourself. However, reversing the vector with the method glium uses is a lot slower than just cloning the vector; on my machine it takes 170µs for a vector of the same length. This is probably still tolerable if you just want to achieve 60fps = 17ms per frame, but still not very nice.
You could think about using the correct row ordering in your original array to avoid this problem. OR you could, instead of directly copying the texture to the framebuffer, just draw a fullscreen quad (one vertex for each screen corner) with the texture on it. Sure, then you need a mesh, a shader and all that stuff, but you could just "reverse" the image by tweaking the texture coordinates.
Lastly, I unfortunately don't know a lot about the time the GPU takes to execute the OpenGL commands. I'd guess that it's not optimal because OpenGL doesn't have a lot of room to schedule your commands, but has to execute them right away (forced synchronization). But maybe that's not avoidable in your case.
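The two ways of dealing with row order mentioned in this answer can be sketched independently of Glium. The snippet below is a language-neutral illustration in Python, not Glium code: flip_rows and quad_uvs are hypothetical helpers. The first reverses the row order of a tightly packed image buffer on the CPU; the second returns texture coordinates for a fullscreen quad that achieve the same flip without touching the pixel data.

```python
def flip_rows(buf, width, height, channels=3):
    """Return a copy of a tightly packed image with its rows in reverse order."""
    stride = width * channels
    if len(buf) != stride * height:
        raise ValueError("buffer size does not match the given dimensions")
    rows = [buf[r * stride:(r + 1) * stride] for r in range(height)]
    return b"".join(reversed(rows))


def quad_uvs(flip_vertical):
    """UVs for quad corners: bottom-left, bottom-right, top-left, top-right."""
    if flip_vertical:
        # Swap the v coordinate so the top row of the texture is sampled
        # at the bottom of the quad, and vice versa.
        return [(0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)]
    return [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

For the NES frame in the question, the buffer is 256 * 240 * 3 = 184,320 bytes, so either approach handles a small amount of data per frame. The texture-coordinate variant simply moves the flip to the GPU.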

Iterative octree traversal

I am not able to figure out the procedure for iterative octree traversal though I have tried approaching it in the way of binary tree traversal. For my problem, I have octree nodes having child and parent pointers and I would like to iterate and only store the leaf nodes in the stack.
Also, is going for iterative traversal faster than recursive traversal?
It is indeed like binary tree traversal, but you need to store a bit of intermediate information. A recursive algorithm will not be slower per se, but it will use a bit more stack space for O(log8 N) nested recursive calls (about 10 levels for 1 billion elements in the octree).
Iterative algorithms will also need the same amount of space to be efficient, but you can place it on the heap if you are afraid that your stack might overflow.
Recursively you would do (pseudocode):
function traverse_rec (octree):
    collect value // if there are values in your intermediate nodes
    for child in children:
        traverse_rec (child)
The easiest way to arrive at an iterative algorithm is to use a stack or queue for depth-first or breadth-first traversal:
function traverse_iter_dfs(octree):
    stack = empty
    push_stack(root_node)
    while not empty (stack):
        node = pop(stack)
        collect value(node)
        for child in children(node):
            push_stack(child)
Replace the stack with a queue and you get breadth-first search. However, we are storing something in the region of O(7*(log8 N)) nodes which we have yet to traverse. If you think about it, that's the lesser evil though, unless you need to traverse really big trees. The only other way is to use the parent pointers when you are done in a child, and then you need to select the next sibling somehow.
If you don't store the index of the current node (with respect to its siblings) in advance, though, you can only search all the children of the parent in order to find the next sibling, which essentially doubles the amount of work to be done (for each node you don't just loop through the children but also through the siblings). Also, it looks like you at least need to remember which nodes you have visited already, because otherwise it is in general undecidable whether to descend farther down or return back up the tree (prove me wrong, somebody).
All in all I would recommend against searching for such a solution.
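The stack-based traversal above can be adapted to the question's requirement of keeping only leaf nodes. The following is a minimal runnable sketch, assuming a hypothetical OctreeNode class with parent and children fields; a node with no children is treated as a leaf.

```python
class OctreeNode:
    """Hypothetical node type: a leaf has no children, others have up to 8."""

    def __init__(self, value=None, parent=None):
        self.value = value
        self.parent = parent
        self.children = []

    def add_child(self, value=None):
        child = OctreeNode(value, parent=self)
        self.children.append(child)
        return child


def collect_leaves(root):
    """Depth-first traversal with an explicit stack; returns leaf nodes only."""
    leaves = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves.append(node)
            continue
        # Push in reverse so children are visited in their stored order.
        stack.extend(reversed(node.children))
    return leaves
```

The explicit stack is an ordinary list on the heap, so deep trees do not risk overflowing the call stack. Swapping the list for collections.deque and popping from the left would give breadth-first order instead.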
Depends on what your goal is. Are you trying to find whether a node is visible, if a ray will intersect its bounding box, or if a point is contained in the node?
Let's assume that you are doing the last one, checking if a point is/should be contained in the node. I would add a method to the Octnode that takes a point and checks whether or not it lies within the bounding box of the Octnode. If it does, return true; otherwise return false. Pretty simple. From here, call a drill-down method that starts at your head node and checks each child with a simple "for" loop to see which Octnode the point lies in; it can be in at most one.
Here is where your iterative vs recursive algorithm comes into play. If you want iterative, just store the pointer to the current node, and swap this pointer from the head node to the one containing your point. Then just keep drilling down till you reach maximal depth or don't find an Octnode containing it. If you want a recursive solution, then you will call this drill down method on the Octnode that you found the point in.
I wouldn't say that iterative versus recursive has much performance difference in terms of speed, but it could have a difference in terms of memory performance. Each time you recurse you add another call depth onto the stack. If you have a large Octree this could result in a large number of calls, possibly blowing your stack.
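The iterative drill-down described in this answer might look like the sketch below. The Octnode class here is a hypothetical stand-in with axis-aligned bounds given as (lo, hi) corner tuples; intervals are treated as half-open (lo <= p < hi) so that a point lies in at most one child.

```python
class Octnode:
    """Hypothetical octree node with an axis-aligned bounding box."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi
        self.children = []

    def contains(self, point):
        # Half-open test so neighbouring boxes never both claim a point.
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))

    def subdivide(self):
        """Split this node into 8 equally sized children."""
        mid = tuple((l + h) / 2 for l, h in zip(self.lo, self.hi))
        for octant in range(8):
            upper = [(octant >> axis) & 1 for axis in range(3)]
            lo = tuple(mid[a] if upper[a] else self.lo[a] for a in range(3))
            hi = tuple(self.hi[a] if upper[a] else mid[a] for a in range(3))
            self.children.append(Octnode(lo, hi))


def locate(root, point):
    """Return the deepest node containing point, or None if outside the root."""
    if not root.contains(point):
        return None
    node = root
    while node.children:
        child = next((c for c in node.children if c.contains(point)), None)
        if child is None:  # sparse tree: no child covers this point
            break
        node = child
    return node
```

Only a single pointer to the current node is kept, so memory use does not grow with the depth of the tree.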

remove individual points from vtkPoints

I have a question regarding a VTK class called vtkPoints. The class has the functionality to insert individual points, but doesn't have the functionality to remove individual points. This is inconvenient for the case when the view needs to be updated by calling vtkPoints::Modified() to drive the graphics pipeline again to update/re-render the view. The obvious case of re-initializing vtkPoints, adding all points again and updating/re-rendering the view is too slow and resource intensive.
Is there a possible solution to this problem?
Thanks,
timecatcher
The example http://www.vtk.org/Wiki/VTK/Examples/Cxx/PolyData/DeletePoint has a rather simple solution. Copy points to another temporary vtkPoints by filtering the id to remove, and shallow-copy it to the original one:
void ReallyDeletePoint(vtkSmartPointer<vtkPoints> points, vtkIdType id)
{
    vtkSmartPointer<vtkPoints> newPoints =
        vtkSmartPointer<vtkPoints>::New();
    for(vtkIdType i = 0; i < points->GetNumberOfPoints(); i++)
    {
        if(i != id)
        {
            double p[3];
            points->GetPoint(i,p);
            newPoints->InsertNextPoint(p);
        }
    }
    points->ShallowCopy(newPoints);
}
There is no way to remove individual points from vtkPoints. Depending on what your problem is here are some potential solutions:
Store all the points in a single vtkPoints instance, and overwrite each point you want to get rid of with the new point that replaces it. This would be useful to cap the maximum amount of memory a point cloud could use.
Store all the points in a single vtkPoints instance, and overwrite points you want to get rid of with a value that is far away from your scene.
Create a vtkPoints, vtkCellArray, and vtkPolyData for each point, and join them together using vtkAppendPolyData. This has a RemoveInput(vtkPolyData*) so you could remove individual points.
This is a way to remove a point from vtkPoints in Python:
import vtk

def deletePoint(vtk_points, *args):
    # Accept either deletePoint(points, 1, 2) or deletePoint(points, [1, 2]).
    if isinstance(args[0], list):
        args = args[0]
    points = vtk.vtkPoints()
    for i in range(vtk_points.GetNumberOfPoints()):
        # Skip the ids to delete; copy everything else.
        if i in args:
            continue
        p = vtk_points.GetPoint(i)
        points.InsertNextPoint(p)
    return points
No: it has the same limitations on mutability as a float[] array. The only way to remove a point is to copy and exclude. Note that you will incur the same copy penalty when doing Insert() operations if you exceed pre-allocated storage.
Other related data structure options include vtkCollection and vtkPolyData. Also, it might be informative to look at the source for some of the PolyData clip filters to get an idea of the way these types of operations are implemented internally; those should be about as fast as they can be within the limitations of the data structure.
Allowing a point to be deleted from vtkPoints can cause a data set that uses the point to become corrupted. You would also have to delete all cells that use that point as well as shrink the point data arrays.
If you have a filter that is creating the vtkPoints, I would suggest modifying the vtkPoints object, and anything that depends on it, in the RequestData() method.
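The bookkeeping described in this answer (dropping cells that use a deleted point and renumbering the rest) can be sketched without VTK. In this sketch plain Python lists stand in for vtkPoints and the cell connectivity, and remove_points is a hypothetical helper, not a VTK function.

```python
def remove_points(points, cells, ids_to_remove):
    """Copy-and-exclude removal that keeps cell connectivity consistent.

    points: list of (x, y, z) tuples
    cells: list of tuples of point ids
    ids_to_remove: iterable of point ids to delete
    """
    removed = set(ids_to_remove)
    old_to_new = {}
    new_points = []
    for old_id, point in enumerate(points):
        if old_id in removed:
            continue
        # Surviving points are renumbered to stay contiguous.
        old_to_new[old_id] = len(new_points)
        new_points.append(point)
    # Drop every cell that referenced a removed point; remap the others.
    new_cells = [
        tuple(old_to_new[i] for i in cell)
        for cell in cells
        if removed.isdisjoint(cell)
    ]
    return new_points, new_cells
```

Any per-point attribute arrays would need the same filtering so that they stay aligned with the renumbered points.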
