I'm currently implementing a basic deferred renderer with multithreading in Vulkan. Since my G-Buffer should have the same resolution as the final image I want to do it in a single render-pass with multiple sub-passes, according to this presentation, on slide 44 (page 138). It says:
vkBeginCommandBuffer
vkCmdBeginRenderPass
vkCmdExecuteCommands
vkCmdNextSubpass
vkCmdExecuteCommands
vkCmdEndRenderPass
vkEndCommandBuffer
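For concreteness, here is roughly what I understand that sequence to mean as actual API calls. This is only a sketch; primaryCmdBuf, renderPassBeginInfo and the secondary command buffer arrays/counts are placeholder names.

// Recording the primary command buffer; error handling omitted.
VkCommandBufferBeginInfo beginInfo{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
vkBeginCommandBuffer(primaryCmdBuf, &beginInfo);

// Subpass contents must be SECONDARY_COMMAND_BUFFERS to use vkCmdExecuteCommands.
vkCmdBeginRenderPass(primaryCmdBuf, &renderPassBeginInfo,
                     VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
vkCmdExecuteCommands(primaryCmdBuf, gbufferSecondaryCount, gbufferSecondaries);

vkCmdNextSubpass(primaryCmdBuf, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
vkCmdExecuteCommands(primaryCmdBuf, shadingSecondaryCount, shadingSecondaries);

vkCmdEndRenderPass(primaryCmdBuf);
vkEndCommandBuffer(primaryCmdBuf);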
I get that in the first sub-pass, you iterate the scene graph and record one secondary command buffer for each entity/mesh. What I don't get is how you are supposed to do the shading pass with secondary command buffers. Do you somehow split the screen into parts and render each part in a separate thread, or just record one secondary command buffer for the entire second sub-pass?
To me, as you said, you may need to multithread command buffer recording for the G-buffer-building subpass. For the shading pass, however, it depends on how you are doing things; to me (again), you do not need to multithread your shading subpass. However, you must take into consideration that you can have a "by region" dependency between the two subpasses.
So I encourage you to proceed this way.
Before beginning your render pass, use a compute shader to splat all your lights onto the screen (this gives you a kind of array of quads).
By splatting I mean this kind of thing: you have a point light (for example), and the idea is to compute the screen-space quad affected by that light. That gives you 4 vertices (representing a quad) that you write into an SSBO, which you can then use as a vertex buffer in the shading subpass.
Now you begin the render pass.
Multithread the scene-graph rendering if needed, and call vkCmdExecuteCommands().
NextSubpass
Use the "array of quads" you created with the earlier compute shader (do not forget a VK_SUBPASS_EXTERNAL dependency; see the sketch below).
NextSubpass and so on
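To make those steps concrete, here is a minimal C++ sketch, assuming subpass 0 builds the G-buffer and subpass 1 does the shading; QuadVertex, maxLights and the exact stage/access masks are my own illustrative assumptions, not something taken from the presentation.

// The light-quad buffer gets both storage-buffer and vertex-buffer usage, so the
// compute shader can write it and the shading subpass can bind it as a vertex buffer.
VkBufferCreateInfo lightQuadInfo{};
lightQuadInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
lightQuadInfo.size  = maxLights * 4 * sizeof(QuadVertex); // QuadVertex is hypothetical
lightQuadInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                      VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;

VkSubpassDependency deps[2]{};

// VK_SUBPASS_EXTERNAL dependency: the compute shader must finish writing the
// quads before the shading subpass reads them as vertex input.
deps[0].srcSubpass    = VK_SUBPASS_EXTERNAL;
deps[0].dstSubpass    = 1;
deps[0].srcStageMask  = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
deps[0].dstStageMask  = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
deps[0].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
deps[0].dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;

// G-buffer subpass -> shading subpass: the "by region" (per-fragment) dependency.
deps[1].srcSubpass      = 0;
deps[1].dstSubpass      = 1;
deps[1].srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
deps[1].dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
deps[1].srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
deps[1].dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
deps[1].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;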
However, you said
you iterate the scene graph and record one secondary commandbuffer for each entity/mesh.
I am not sure I really understand what you meant, but if you intend to have one secondary command buffer per mesh, I really advise you to change your approach. You should use batching. Let's say you have 64,000 different meshes to draw: you could, for example, record 64 secondary command buffers (dispatched across 4 threads), each drawing 1,000 meshes. (The numbers are arbitrary, so profile your application.)
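As a rough illustration of that batching, here is a minimal C++ sketch; Mesh, recordBatch, secondaryCmdBufs and primaryCmdBuf are hypothetical names, and in real code each worker thread needs its own VkCommandPool.

#include <thread>
#include <vector>

// Record 64 secondary command buffers on 4 worker threads (~1000 meshes each),
// then execute them all from the primary command buffer inside the first subpass.
void recordGBufferBatches(VkCommandBuffer primaryCmdBuf,
                          std::vector<VkCommandBuffer>& secondaryCmdBufs,
                          const std::vector<Mesh>& meshes) // Mesh is hypothetical
{
    const uint32_t kThreads = 4, kBuffersPerThread = 16, kMeshesPerBatch = 1000;
    std::vector<std::thread> workers;
    for (uint32_t t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            for (uint32_t b = 0; b < kBuffersPerThread; ++b) {
                const uint32_t batch = t * kBuffersPerThread + b;
                // Each secondary command buffer records draws for one batch of meshes.
                recordBatch(secondaryCmdBufs[batch], meshes,
                            batch * kMeshesPerBatch, kMeshesPerBatch);
            }
        });
    }
    for (auto& w : workers) w.join();

    // A single call on the main thread submits all 64 batches to the subpass.
    vkCmdExecuteCommands(primaryCmdBuf,
                         static_cast<uint32_t>(secondaryCmdBufs.size()),
                         secondaryCmdBufs.data());
}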
So to answer your question for the shading subpass: I would not use secondary command buffers, or only very few (one per kind of light: punctual, directional).
What I don't get is how you are supposed to do the shading pass with secondary command buffers.
The shading pass (presumably the second subpass) would typically take the G-buffers created by the first subpass as input attachments. It would then draw a screen-sized quad, using data from the G-buffers plus a set of lights (or whatever your deferred shader tries to defer).
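For illustration, a minimal C++ sketch of what the two subpass descriptions could look like, assuming two G-buffer color attachments at indices 0 and 1 and the final color attachment at index 2 (the attachment layout is my assumption, not something from the question):

// Subpass 0 writes the G-buffer; subpass 1 reads it back as input attachments
// while writing the shaded result to the final color attachment.
VkAttachmentReference gbufferWrite[] = {
    {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},   // e.g. albedo
    {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},   // e.g. normals
};
VkAttachmentReference gbufferRead[] = {
    {0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
};
VkAttachmentReference finalColor = {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2]{};
subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount = 2;
subpasses[0].pColorAttachments    = gbufferWrite;

subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 2;
subpasses[1].pInputAttachments    = gbufferRead;
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments    = &finalColor;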
The presentation you linked hints at this structure starting at page 13 (marked "Page 107").
The first step would be to get it working; use e.g. this SW example. The next step of optimizing it into a single render pass should then be easier.
I have been working on an animated graphics project with very specific requirements. After quite a bit of searching and test coding, I have figured out that I could take several approaches, but the Khronos and MDN documentation I have been reading, coupled with other posts I have seen here, doesn't answer all of my questions regarding my particular project. In the meantime, I have written short test programs (setting up infrastructure for testing).
Firstly, I should describe the project:
The main object drawn to the screen is a simple quad surrounded by a black outline (LINE_LOOP or LINES will do, probably, though I have had issues with z-fighting...that will be left for another question). When the user interacts with the program, exactly one new quad is created and immediately drawn, but for a set amount of time its vertices move around until the quad moves to its final destination. (Note that translations won't do.) Random black lines are also drawn, and sometimes those lines also move around.
Once one of the quads reaches its final spot, it never moves again.
A new quad is always atop old quads (closer to the screen). That means that I need to layer the quads and lines from oldest to newest.
*this also means that it would probably be best to assign z-values to each quad and line, even if the graphics are in pixel coordinates and use an orthographic matrix. Would everyone agree with this?
Given these parameters, I have a few options with varying levels of complexity:
1> Take the object-oriented approach and just assign a buffer to each quad, and the same goes for the random lines. This means creating and destroying buffers every frame for the one shape that is moving. I truthfully think that this is a terrible idea that might only work in a higher-level library that does heavy optimization underneath. This approach also doesn't take advantage of the fact that almost every quad will stay the same.
[vertices0] ... , [verticesN]
Draw x N (many draws for many small-size buffers)
2> Assign a z-value to each quad, outline, and line (as mentioned above). Allocate a huge vertex buffer and element buffer to store all permanently-in-their-final-positions quads. Resize only in the very unlikely case someone interacts for long enough. Create a second tiny buffer to store the one temporary moving quad and use bufferSubData every frame. When the quad reaches its destination, bufferSubData it into the large buffer and overwrite the small buffer upon creation of the next quad...all on the same frame. The main questions I have here are: is it possible (safe?) to use bufferSubData and draw it on the same frame? Also, would I use DYNAMIC_DRAW on both buffers even though the larger one would see fewer updates?
[permanent vertices ... | uninitialized (keep a count)]
bufferSubData -> [tempVerticesForOneQuad]
Draw 2x
3> Still create the large and small buffers, but instead of using bufferSubData every frame, create a second shader program and add an attribute for the new/moving quad that explicitly sets the vertex positions for the animation (I would pass vertex index attributes). Only draw with the small buffer when the quad is moving. For the frame when the quad reaches its destination, draw both the large and small buffers, but then bufferSubData the final coordinates into the large permanent buffer to be used in the next frame.
switchToShaderProgramA();
[permanent vertices...| uninitialized (keep a count)]
switchToShaderProgramB();
[temp vertices] <- shader B accepts indices for each vertex so we can do all animation in the vertex shader
---last frame of movement arrives : bufferSubData into the permanent vertices buffer for when the next quad is created
I get the sense that the third option might be the best, but I would like to learn whether there are other factors that I did not consider. For example, I am assuming that a program switch, additional attributes, and vertex-shader manipulation would be faster than just substituting the buffer values as in 2>. The advantage of approach 3> (I think) is that I can defer the buffer substitution to a time when nothing needs to be drawn.
Still, I am not sure how to work with the randomly appearing lines. I can't take the "single quad vertex buffer" approach since the number of lines cannot be predicted. Might I also allocate a large buffer for the moving lines? Those also stay after the quad is finished moving, though I don't think that I could use the vertex-shader trick because there would be too many attributes to set (as opposed to the 4 for the one quad). I suppose that I could create a large "permanent line data" buffer first, but what to do during the animation is tricky because the lines move. Maybe bufferSubData() + draw on the same frame is not terrible? Or it could be. This is where I need advice.
I understand that this question might not be too specific code-wise, but I don't believe that I would be allowed to show the core of the program. All I have is the typical WebGL boilerplate ready.
I am looking forward to hearing people's thoughts on how I might proceed and whether there are any trade-offs I might have missed when considering the three options above.
Thank you in advance, and please feel free to ask any additional questions if clarification is necessary.
Honestly, for what you're describing, it doesn't sound to me like it matters which you choose. On modern hardware, drawing a few hundred quads and a few thousand lines each frame would not really tax the hardware much.
Having said that, I agree that approach 1 seems very inefficient. Approach 2 sounds perfectly fine. You can safely draw a buffer on the same frame that you uploaded the data. I don't think it matters much whether you use DYNAMIC_DRAW or STATIC_DRAW for the buffer. I tend to think of dynamic buffers as being something you're updating every frame. If you only update it every few seconds or less, then static is fine. Approach 3 is also fine. Between 2 and 3, I'd say do whichever is easier for you to understand and program.
Likewise, for the lines, I would use a separate buffer. It sounds like that one changes per frame, so I would use DYNAMIC_DRAW for that. Allocating a single large buffer for it and performing a glBufferSubData() per frame is probably a fine strategy. As always, trying it and profiling it will tell you for sure.
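To make approach 2 concrete, here is a minimal sketch of the two-buffer pattern, written as desktop-GL calls in C++ (the WebGL2 calls are analogous); kMaxQuads, kBytesPerQuad, animatedVertices, finalVertices and settledQuadCount are hypothetical names.

// At startup: one large, rarely-updated buffer for settled quads and one small,
// per-frame-updated buffer for the quad that is currently animating.
GLuint permanentVbo = 0, movingVbo = 0;
glGenBuffers(1, &permanentVbo);
glGenBuffers(1, &movingVbo);

glBindBuffer(GL_ARRAY_BUFFER, permanentVbo);
glBufferData(GL_ARRAY_BUFFER, kMaxQuads * kBytesPerQuad, nullptr, GL_STATIC_DRAW);

glBindBuffer(GL_ARRAY_BUFFER, movingVbo);
glBufferData(GL_ARRAY_BUFFER, kBytesPerQuad, nullptr, GL_DYNAMIC_DRAW);

// Each frame while a quad is animating: upload its current vertices, then draw.
// Drawing data uploaded in the same frame is safe; the driver synchronizes it.
glBindBuffer(GL_ARRAY_BUFFER, movingVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, kBytesPerQuad, animatedVertices);
// ... draw permanentVbo (settledQuadCount quads), then movingVbo (1 quad) ...

// When the quad settles: append it to the permanent buffer and reuse the small one.
glBindBuffer(GL_ARRAY_BUFFER, permanentVbo);
glBufferSubData(GL_ARRAY_BUFFER, settledQuadCount * kBytesPerQuad,
                kBytesPerQuad, finalVertices);
++settledQuadCount;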
Very new to graphics in general, starting off with Metal for immediate needs, will try out OpenGL soon enough.
I was wondering what colorAttachments[n] means in layman's terms. Also, what is the extent of 'n'? I have just used it as 0 in the 2D triangle I made.
In general, the color attachment(s) are where rendered images are stored (at least temporarily) during a render pass. It is common to only use one color attachment at index 0, so what you're doing is fine. It is also possible to render to multiple color attachments simultaneously, which is why there's an array. It's an advanced technique that you don't have to worry about until you see the need, at which point it should be straightforward how to do it.
There are two places where colorAttachments[n] appears in Metal. First is in MTLRenderPipelineDescriptor. The other is in MTLRenderPassDescriptor.
In both cases, the extent is given in the Metal Implementation Limits table, in the "Render targets" section, in the row labelled "Maximum number of color render targets per render pass descriptor".
For MTLRenderPipelineDescriptor, colorAttachments[n] is a reference to a MTLRenderPipelineColorAttachmentDescriptor. Here, you configure the pixel format, write mask, and color blending operation.
For MTLRenderPassDescriptor, colorAttachments[n] is a reference to a MTLRenderPassColorAttachmentDescriptor. This is a subclass of MTLRenderPassAttachmentDescriptor, which is where most of its properties are defined. Here, you configure which part of which texture you will render to, what should happen to that texture's data when the render pass starts and ends, and, if it's to be cleared, what color it should be cleared to.
Information about color attachments is split across those two objects based on how expensive it is to change. The render pipeline state object is fairly expensive to create from the pipeline descriptor. You would typically create all of your pipeline state objects once in a run and reuse them for the rest of your app's lifetime.
By contrast, you will create render command encoders from render pass descriptors fairly often; at least once per frame. They are relatively inexpensive to create, so you can change the descriptor and create a new one to render elsewhere.
The render pipeline state's color attachment pixel format has to match the pixel format of the render command encoder's color attachment texture.
I am converting a full-screen effect to a compute shader so that I can take advantage of some features of compute that can't be done with fragment shaders. Right now, this full-screen effect uses stenciling to avoid writing to pixels which it should not affect, and I would like to mimic that behavior with my compute shader.
I know that I can write this info to a color channel somewhere, but I'm hoping to avoid that and instead just read the stencil bit directly in the compute shader. However, I can't find any way to bind a D24S8 buffer to a compute shader such that I can actually read the stencil bit. It seems to only provide the depth information.
Is there any way to do this? Google is failing me, because everyone calls it the depth/stencil buffer when talking about sampling depth values.
This is possible.
1. Create your D24S8 depth buffer texture with the typeless format DXGI_FORMAT_R24G8_TYPELESS.
2. Create your DepthStencilView of the resource from step 1 using the strongly typed format DXGI_FORMAT_D24_UNORM_S8_UINT.
3. Create a ShaderResourceView of the stencil buffer, for binding to the compute shader, using the strongly typed format DXGI_FORMAT_X24_TYPELESS_G8_UINT.
4. [Optional] Create a ShaderResourceView of the "depth" part of the resource using the strongly typed format DXGI_FORMAT_R24_UNORM_X8_TYPELESS.
Make sure you declare your stencil SRV in HLSL to be Texture2D<uint2> so you can access the Green channel (G8) which is where the stencil data will come from.
Note that since depth and stencil are separate SRVs, you'd need to declare two Texture2Ds in HLSL, one for each, e.g.:
Texture2D<float> depthBuffer; // Red contains depth.
Texture2D<uint2> stencilBuffer; // Green contains stencil. Red is unused.
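A hedged C++/D3D11 sketch of those steps (error handling omitted; device, width and height are assumed to exist):

// Step 1: typeless texture so the views below can reinterpret the bits.
D3D11_TEXTURE2D_DESC td = {};
td.Width  = width;
td.Height = height;
td.MipLevels = 1;
td.ArraySize = 1;
td.Format = DXGI_FORMAT_R24G8_TYPELESS;
td.SampleDesc.Count = 1;
td.Usage = D3D11_USAGE_DEFAULT;
td.BindFlags = D3D11_BIND_DEPTH_STENCIL | D3D11_BIND_SHADER_RESOURCE;
ID3D11Texture2D* depthTex = nullptr;
device->CreateTexture2D(&td, nullptr, &depthTex);

// Step 2: DSV for normal depth/stencil rendering.
D3D11_DEPTH_STENCIL_VIEW_DESC dsvd = {};
dsvd.Format = DXGI_FORMAT_D24_UNORM_S8_UINT;
dsvd.ViewDimension = D3D11_DSV_DIMENSION_TEXTURE2D;
ID3D11DepthStencilView* dsv = nullptr;
device->CreateDepthStencilView(depthTex, &dsvd, &dsv);

// Step 3: SRV exposing the stencil bits (green channel) to the compute shader.
D3D11_SHADER_RESOURCE_VIEW_DESC srvd = {};
srvd.Format = DXGI_FORMAT_X24_TYPELESS_G8_UINT;
srvd.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
srvd.Texture2D.MipLevels = 1;
ID3D11ShaderResourceView* stencilSrv = nullptr;
device->CreateShaderResourceView(depthTex, &srvd, &stencilSrv);

// Step 4 (optional): SRV for the depth part (red channel).
srvd.Format = DXGI_FORMAT_R24_UNORM_X8_TYPELESS;
ID3D11ShaderResourceView* depthSrv = nullptr;
device->CreateShaderResourceView(depthTex, &srvd, &depthSrv);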
I have an application which renders many filled polygons with OpenGL, in 2D. Filling is done by tessellation, but performance is not optimal: 1900 polygons made up of 122,000 vertices (that is, about 64 vertices per polygon) are displayed in about 3 seconds.
Apparently, the CPU is not the bottleneck: if I replace the calls to gluTessVertex with calls to glColor, just to test where the bottleneck is, performance is doubled.
I have the same problem with loading many small textures.
Now, what are the options to improve performance? It seems that most of the time is spent in the geometry subsystem; rendering is fast enough.
I already have a worker thread which does the loading (tessellation, texture binding) in one context, and another thread which does the drawing in another context. The two contexts share objects via wglShareLists and it works like a charm.
Can I have a third thread in a third context which would also handle tessellation for half of the polygons? Has anyone tried that? Is it safe? Any example of sharing objects between three contexts?
Forgot to say, I have an ATI Radeon HD 4550 graphics card; I suppose it can handle more than 39 kB/s of data.
Increase Performance
Sounds like you're using the old fixed-function pipeline.
If you're unsure of what that is, well, the following functions are a part of the fixed-function pipeline.
glBegin()
glEnd()
glVertex*()
glTexCoord*()
glNormal*()
glColor*()
etc.
Those functions are old and render geometry immediately. That means that each time you call one of the above functions, that geometry gets sent to the GPU. By doing that many times, you can easily make the FPS drop way below 60 just by rendering simple things.
Now you need to use buffers and, to be more precise, VAOs with/or VBOs (and IBOs).
A VBO, or Vertex Buffer Object, is a buffer which stores vertices which you can then render. This is much, much faster and better to use than glBegin() and glEnd(). When you create a VBO you supply it with vertices, and they only need to be sent to the GPU once; that's basically why VBOs are fast: the data is already on the GPU and only requires a single draw call instead of many.
The reason I said "with/or" is because in the newer versions you need to create a VAO which then uses a VBO, whereas before you could simply render the VBOs directly.
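For reference, a minimal C++ sketch of that setup (GL 3.3 core style); xyPositions is a hypothetical array of tessellated 2D positions.

#include <vector>

GLuint vao = 0, vbo = 0;

// Upload the tessellated positions once into a VBO bound to a VAO.
void uploadPolygons(const std::vector<float>& xyPositions) // x,y pairs
{
    glGenVertexArrays(1, &vao);
    glGenBuffers(1, &vbo);

    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, xyPositions.size() * sizeof(float),
                 xyPositions.data(), GL_STATIC_DRAW); // sent to the GPU only once

    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(float), (void*)0);
}

// Each frame: one draw call instead of one glVertex*() call per vertex.
void drawPolygons(GLsizei vertexCount)
{
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}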
Tessellation
There are multiple ways to do tessellation and things which look like/would give the effect of tessellation.
For instance, you could also simply render different models according to the required LOD (Level of Detail): when you're up close to an object you render the model with all its details, which probably has a high vertex count, and the further you are from the model, the lower-vertex-count (less detailed) version you render instead. However, you can't really do that on something like terrain, and you definitely shouldn't do it on dynamic and/or procedurally generated terrain.
You can also do actual geometry tessellation, which you would do through a shader. Since tessellation is a really huge topic, I will provide you with two URLs which both explain it and have code on them.
Both of these articles use modern OpenGL 4.0+.
http://prideout.net/blog/?p=48
http://antongerdelan.net/opengl/tessellation.html
Texturing
Generating and binding textures are still the same.
Instead of using gluBuild2DMipmaps() you can use glGenerateMipmap(GL_TEXTURE_2D); it was added in OpenGL version 3.0-ish if I remember correctly.
Again, you can (and should) replace all your glBegin() - glEnd() calls (and everything in between) with VAOs and VBOs. You can store everything you want inside a buffer: vertices, texture coordinates, normals, colors, etc. You can store the things in separate buffers, or inside a single buffer, usually called an interleaved buffer or interleaved VBO.
You won't be needing glEnable(GL_TEXTURE_2D) and glDisable(GL_TEXTURE_2D) anymore, because you do that within a shader: you bind textures and use them in the shader, and since you create the shader program you can make it act however you want it to.
I need to render some CPU generated images in Direct3D 9 and I'm not sure of the best way to get the texture data onto the graphics card as there seems to be a number of approaches.
My usage path goes along the following lines each frame:
Render a bunch of stuff with the textures
Update a few parts of the texture (which may have been used by the previous renders)
Render some more stuff with the texture
Update another part of the texture
and so on
I've thought of a couple of ways to do this; however, I'm not sure which one to go with. I considered benchmarking each method, but I have no way to know whether any results I get are representative of hardware in general, or only of my hardware.
Which pool is best for a texture for this task?
What's the best way to update this texture?
Call LockRect and UnlockRect for each region I need to update
Call LockRect and UnlockRect for the entire texture
Call LockRect and UnlockRect for the entire texture with D3DLOCK_DISCARD and copy in a bitmap from RAM.
Create a completely new texture each time I need to "update it"
Use 1, 2 or 3 to update a surface in D3DPOOL_SYSTEMMEM, then UpdateSurface to update level 0 of my texture from this surface
Same as 5 but specify RECT to cover the entire area I need
Same as 5 but make multiple calls, one for each region I updated
Probably yet another way to do this I haven't thought of yet...
It should be noted that the areas I'm updating are usually fairly small compared to the size of the entire texture; e.g. the texture may be 1024*1024 and I might want to update 5 or so 64*64 regions of it.
If you need to update multiple areas, you should lock the whole texture and use the D3DLOCK_NO_DIRTY_UPDATE flag, then for each area call AddDirtyRect before unlocking.
This all depends on the size of the texture, of course; for a small texture it may be more efficient to copy the whole thing from RAM.
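A hedged C++ sketch of that pattern; 'texture' is the IDirect3DTexture9* being updated, and 'regions' and CopyNewPixelsForRow are hypothetical stand-ins for your own update logic.

// Lock the whole texture without marking it dirty, update only the regions that
// changed, and tell D3D about each one via AddDirtyRect before unlocking.
D3DLOCKED_RECT locked;
if (SUCCEEDED(texture->LockRect(0, &locked, NULL, D3DLOCK_NO_DIRTY_UPDATE)))
{
    BYTE* base = static_cast<BYTE*>(locked.pBits);
    for (const RECT& r : regions)
    {
        for (LONG y = r.top; y < r.bottom; ++y)
        {
            BYTE* dst = base + y * locked.Pitch + r.left * 4; // 4 bytes/pixel assumed
            CopyNewPixelsForRow(dst, r, y);                   // hypothetical helper
        }
        texture->AddDirtyRect(&r); // only this region gets re-uploaded
    }
    texture->UnlockRect(0);
}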
D3DPOOL_DEFAULT
D3DUSAGE_DYNAMIC
call LockRect and UnlockRect for each region you need to update
--> This is the fastest!
Benchmark will follow...