I'm having an issue with atomics in wgpu / WGSL but I'm not sure if it's due to a fundamental misunderstanding or a bug in my code.
I have a input array declared in WGSL as
struct FourTileUpdate {
// (u32 = 4 bytes)
data: array<u32, 9>
};
#group(0) #binding(0) var<storage, read> tile_updates : array<FourTileUpdate>;
I'm limiting the size of this array to around 5MB, but sometimes I need to transfer more than that for a single frame and so use multiple command encoders & compute passes.
Each "tile update" has an associated position (x & y) and a ms_since_epoch property that indicates when the tile update was created. Tile updates get written to a texture.
I don't want to overwrite newer tile updates with older tile updates, so in my shader I have a guard:
storageBarrier();
let previous_timestamp_value = atomicMax(&last_timestamp_for_tile[x + y * r_locals.width], ms_since_epoch);
if (previous_timestamp_value > ms_since_epoch) {
return;
}
However, something is going wrong and older tile updates are overwriting newer tile updates. I can't reproduce this on Windows / Vulkan but it consistently happens on macOS / Metal. Here's an image of the rendered texture--it should be completely green instead of the occasional red and black pixel:
rendered texture
A few questions:
is execution order guaranteed to be the same as the order of the command encoder constructions?
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
I tried submitting each encoder with queue.submit(Some(encoder.finish())) before creating the next encoder for the frame, and even waiting for the queue to finish processing for each submitted encoder with
let (tx, rx) = mpsc::channel();
queue.on_submitted_work_done(move || {
tx.send().unwrap();
});
device.poll(wgpu::Maintain::Wait);
rx.rev().unwrap()
// ... loop back and create & submit next encoder for current frame
but that didn't work either.
Good questions!
is execution order guaranteed to be the same as the order of the command encoder constructions?
I believe that is the case. But I checked and the spec is actually unclear about this. I filed https://github.com/gpuweb/gpuweb/issues/3809 to fix this.
Further, I believe the intent is that all memory accesses (e.g. to storage buffers) from one GPU command will complete before the next GPU command begins. So the effect of any writes in one command will be visible in the next command (read-after-write hazard). Also, a write in a later command will not be visible in an earlier command (write-after-read hazard).
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
Another good question. storageBarrier() only works within a single workgroup. This may be surprising, but is due to a limitation in some platforms.
For further details, see https://github.com/gpuweb/gpuweb/issues/3774
This will be a FAQ because it is surprising, and subtle!
Update: I suspect the bad behaviour you're seeing is that storageBarrier() does not work across workgroups. It's a limitation in Metal.
Related
I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithread example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which is created here.
There is actually another resource that is created for each frame, the constant buffer. I kind of understand the constant buffer. Because it is written by CPU (D3D11 dynamic usage) and need to remain unchanged until the GPU finish using it, so there need to be 2 copies. However, I don't understand why the shadow map needs to do the same, because it is only modified by GPU (D3D11 default usage), and there are fence commands to separate reading and writing to that texture anyway. As long as the GPU follows the fence, a single texture should be enough for the GPU to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".
The key issue is that you don't want to stall the GPU for best performance. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better than it's likely not noticeable.
With all that said, typically you don't need to double-buffered depth/stencil buffers for rendering, although if you wanted to make the initial write of the depth-buffer overlap with the read of the previous depth-buffer passes then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
// Transition the shadow map from writeable to readable.
m_commandLists[CommandListMid]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_DEPTH_WRITE, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}
void FrameResource::Finish()
{
m_commandLists[CommandListPost]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE, D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.
So my goal is to add CVPixelBuffers into my AVAssetWriter / AVAssetWriterInputPixelBufferAdaptor with super high speed. My previous solution used CGContextDrawImage but it is very slow (0.1s) to draw. The reason seems to be with color matching and converting, but that's another question I think.
My current solution is trying to read the bytes of the image directly to skip the draw call. I do this:
CGImageRef cgImageRef = [image CGImage];
CGImageRetain(cgImageRef);
CVPixelBufferRef pixelBuffer = NULL;
CGDataProviderRef dataProvider = CGImageGetDataProvider(cgImageRef);
CGDataProviderRetain(dataProvider);
CFDataRef da = CGDataProviderCopyData(dataProvider);
CVPixelBufferCreateWithBytes(NULL,
CGImageGetWidth(cgImageRef),
CGImageGetHeight(cgImageRef),
kCVPixelFormatType_32BGRA,
(void*)CFDataGetBytePtr(da),
CGImageGetBytesPerRow(cgImageRef),
NULL,
0,
NULL,
&pixelBuffer);
[writerAdaptor appendPixelBuffer:pixelBuffer withPresentationTime:presentTime];
-- releases here --
This works fine on my simulator and inside an app. But when I run the code inside the SpringBoard process, it comes out as the images below. Running it outside the sandbox is a requirement, it is meant for jailbroken devices.
I have tried to play around with e.g. pixel format styles but it mostly comes out with differently corrupted images.
The proper image/video file looks fine:
But this is what I get in the broken state:
Answering my own question as I think I got the answer(s). The resolution difference was a simple code error, not using the device bounds on the latter ones.
The color issues. In short, the CGImages I got when running outside of the sandbox was using more bytes per pixel, 8 bytes. The images I get when running inside the sandbox was 4 bytes. So basically I was simply writing the wrong data into the buffer.
So, instead of simply slapping all of the bytes from the larger image into the smaller buffer. I loop through the pixel buffer row-by-row, byte-by-byte and I pick the RGBA values for each pixel. I essentially had to skip every other byte from the source image to get the right data into the right place within the buffer.
The example code for creating a version 3 AudioUnit demonstrates how the implementation needs to return a function block for rendering processing. The block will both get samples from the previous
AxudioUnit in the chain via pullInputBlock and supply the output buffers with the processed samples. It also must provide some output buffers if the unit further down the chain did not. Here is an excerpt of code from an AudioUnit subclass:
- (AUInternalRenderBlock)internalRenderBlock {
/*
Capture in locals to avoid ObjC member lookups.
*/
// Specify captured objects are mutable.
__block FilterDSPKernel *state = &_kernel;
__block BufferedInputBus *input = &_inputBus;
return Block_copy(^AUAudioUnitStatus(
AudioUnitRenderActionFlags *actionFlags,
const AudioTimeStamp *timestamp,
AVAudioFrameCount frameCount,
NSInteger outputBusNumber,
AudioBufferList *outputData,
const AURenderEvent *realtimeEventListHead,
AURenderPullInputBlock pullInputBlock) {
...
});
This is fine if the processing does not require knowing the frameCount before the call to this block, but many applications do require knowing the frameCount before this block in order to allocate memory, prepare processing parameters, etc.
One way around this would be to accumulate past buffers of output, outputting only frameCount samples each call to the block, but this only works if there is known minimum frameCount. The processing must be initialized with a size greater than this frame count in order to work. Is there a way to specify or obtain a minimum value for frameCount or force it to be a specific value?
The example code is taken from: https://github.com/WildDylan/appleSample/blob/master/AudioUnitV3ExampleABasicAudioUnitExtensionandHostImplementation/FilterDemoFramework/FilterDemo.mm
Under iOS, an audio unit callback must be able to handle variable frameCounts. You can't force it to be a constant.
Thus any processing that requires a fixed size buffer should be done outside the audio unit callback. You can pass data to a processing thread by using a lock-free circular buffer/fifo or similar structure that does not require memory management in the callback.
You can suggest that the frameCount be a certain size by setting a buffer duration using the AVAudioSession APIs. But the OS is free to ignore this, depending on other audio needs in the system (power saving modes, system sounds, etc.) In my experience, the audio driver will only increase your suggested size, not decrease it (by more than a couple samples if resampling to not a power of 2).
As my question title says, I want update texture for every frame.
I got an idea :
create a VkImage as a texture buffer with bellow configurations :
initialLayout = VK_IMAGE_LAYOUT_PREINITIALIZED
usage= VK_IMAGE_USAGE_SAMPLED_BIT
and it's memory type is VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
In draw loop :
first frame :
map texure data to VkImage(use vkMapMemory).
change VkImage layout from VK_IMAGE_LAYOUT_PREINITIALIZED to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
use this VkImage as texture buffer.
second frame:
The layout will be VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL after the first frame , can I map next texure data to this VkImage directly without change it's layout ? if I can not do that, which layout can I change this VkImage to ?
In vkspec 11.4 it says :
The new layout used in a
transition must not be VK_IMAGE_LAYOUT_UNDEFINED or VK_IMAGE_LAYOUT_PREINITIALIZED
So , I can not change layout back to _PREINITIALIZED .
Any help will be appreciated.
For your case you do not need LAYOUT_PREINITIALIZED. That would only complicate your code (forcing you to provide separate code for the first frame).
LAYOUT_PREINITIALIZED is indeed a very special layout intended only for the start of the life of the Image. It is more useful for static textures.
Start with LAYOUT_UNDEFINED and use LAYOUT_GENERAL when you need to write the Image from CPU side.
I propose this scheme:
berfore render loop
Create your VkImage with UNDEFINED
1st to Nth frame (aka render loop)
Transition image to GENERAL
Synchronize (likely with VkFence)
Map the image, write it, unmap it (weell, the mapping and unmaping can perhaps be outside render loop)
Synchronize (potentially done implicitly)
Transition image to whatever layout you need next
Do your rendering and whatnot
start over at 1.
It is a naive implementation but should suffice for ordinary hobbyist uses.
Double buffered access can be implemented — that is e.g. VkBuffer for CPU access and VkImage of the same for GPU access. And VkCmdCopy* must be done for the data hand-off.
It is not that much more complicated than the above approach and there can be some performance benefits (if you need those at your stage of your project). You usually want your resources in device local memory, which often is not also host visible.
It would go something like:
berfore render loop
Create your VkBuffer b with UNDEFINED backed by HOST_VISIBLE memory and map it
Create your VkImage i with UNDEFINED backed by DEVICE_LOCAL memory
Prepare your synchronization primitives between i and b: E.g. two Semaphores, or Events could be used or Barriers if the transfer is in the same queue
1st to Nth frame (aka render loop)
Operations on b and i can be pretty detached (even can be on different queues) so:
For b:
Transition b to GENERAL
Synchronize to CPU (likely waiting on VkFence or vkQueueIdle)
invalidate(if non-coherent), write it, flush(if non-coherent)
Synchronize to GPU (done implicitly if 3. before queue submission)
Transition b to TRANSFER
Synchronize to make sure i is not in use (likely waiting on a VkSemaphore)
Transition i to TRANSFER
Do vkCmdCopy* from b to i
Synchronize to make known I am finished with i (likely signalling a VkSemaphore)
start over at 1.
(The fence at 2. and semaphore at 6. have to be pre-signalled or skipped for first frame to work)
For i:
Synchronize to make sure i is free to use (likely waiting on a VkSemaphore)
Transition i to whatever needed
Do your rendering
Synchronize to make known I am finished with i (likely signalling a VkSemaphore)
start over at 1.
You have a number of problems here.
First:
create a VkImage as a texture buffer
There's no such thing. The equivalent of an OpenGL buffer texture is a Vulkan buffer view. This does not use a VkImage of any sort. VkBufferViews do not have an image layout.
Second, assuming that you are working with a VkImage of some sort, you have recognized the layout problem. You cannot modify the memory behind the texture unless the texture is in the GENERAL layout (among other things). So you have to force a transition to that, wait until the transition command has actually completed execution, then do your modifications.
Third, Vulkan is asynchronous in its execution, and unlike OpenGL, it will not hide this from you. The image in question may still be accessed by the shader when you want to change it. So usually, you need to double buffer these things.
On frame 1, you set the data for image 1, then render with it. On frame 2, you set the data for image 2, then render with it. On frame 3, you overwrite the data for image 1 (using events to ensure that the GPU has actually finished frame 1).
Alternatively, you can use double-buffering without possible CPU waiting, by using staging buffers. That is, instead of writing to images directly, you write to host-visible memory. Then you use a vkCmdCopyBufferToImage command to copy that data into the image. This way, the CPU doesn't have to wait on events or fences to make sure that the image is in the GENERAL layout before sending data.
And BTW, Vulkan is not OpenGL. Mapping of memory is always persistent; there's no reason to unmap a piece of memory if you're going to map it every frame.
I have Windows CE 6 running an Atom N270 with an Intel 945GSE chipset. I wrote a small test application for Direct3D Mobile and have been experiencing some strange behaviour. I can only call Draw*Primitive once per frame. Calling multiple times either results in a white screen as though Present failed (even though no error is given) or only the first call seems to be processed.
With the message handling omitted the body of the render loop is as follows:
HandleD3DMError(pD3DMobileDevice->Clear(0,0,D3DMCLEAR_TARGET | D3DMCLEAR_ZBUFFER ,0xA50AB0F,1.0f,0),_T("Clear"));
if(HandleD3DMError(pD3DMobileDevice->BeginScene(),_T("BeginScene")))
{
HandleD3DMError(pD3DMobileDevice->SetTexture(0,pTexture), _T("SetTexture"));
HandleD3DMError(pD3DMobileDevice->SetStreamSource(0,m_pPPVVB[0],sizeof(PrimitiveVertex)),_T("SetStreamSource Failed"));
HandleD3DMError(pD3DMobileDevice->SetIndices(pIndexBuffer),_T("SetIndices failed"));
HandleD3DMError(pD3DMobileDevice->DrawIndexedPrimitive(D3DMPT_TRIANGLELIST,0,0,NUM_TRIS*3,0,NUM_TRIS/2),_T("Draw Primitive 1"));
//HandleD3DMError(pD3DMobileDevice->DrawIndexedPrimitive(D3DMPT_TRIANGLELIST,0,0,NUM_TRIS*3,NUM_TRIS/2 * 3,NUM_TRIS/2),_T("Draw Primitive 2"));
HandleD3DMError(pD3DMobileDevice->SetStreamSource(0,0,0),_T("SetStreamSource Null Failed"));
HandleD3DMError(pD3DMobileDevice->SetIndices(0),_T("SetIndices Null failed"));
HandleD3DMError(pD3DMobileDevice->EndScene(),_T("EndScene"));
}
HandleD3DMError(pD3DMobileDevice->Present(0,0,0,0),_T("Present Failed"));
You can switch the two DrawIndexedPrimitive() lines and both render four triangles each, which is correct. However when they are both present nothing is rendered. The HandleD3DMError() function displays a message box when an error occurs. This is done all through initialisation too. No errors are displayed at any time.
I have tried drawing different primitive types and drawing non-indexed vertex buffers with no success. I am able to draw 10,000 triangles using a single buffer but trying to use multiple buffers fails (I assume it is related to the multiple Draw calls issue).
The documentation on MSDN does not mention any limitation on Draw calls. They even mention cases where you would need to make multiple Draw calls. The official samples I've looked at
only ever call Draw*() once per frame too.
If I try and BeginScene() and EndScene() multiple times in the one frame nothing is rendered, not even the cleared colour.
I can provide all of the source if needed.
I appreciate any help anyone can give me.
Cheers