I'm having an issue with atomics in wgpu / WGSL but I'm not sure if it's due to a fundamental misunderstanding or a bug in my code.
I have an input array declared in WGSL as
struct FourTileUpdate {
    // (u32 = 4 bytes)
    data: array<u32, 9>
};
@group(0) @binding(0) var<storage, read> tile_updates : array<FourTileUpdate>;
I'm limiting the size of this array to around 5MB, but sometimes I need to transfer more than that for a single frame, so I use multiple command encoders & compute passes.
Each "tile update" has an associated position (x & y) and a ms_since_epoch property that indicates when the tile update was created. Tile updates get written to a texture.
I don't want to overwrite newer tile updates with older tile updates, so in my shader I have a guard:
storageBarrier();
let previous_timestamp_value = atomicMax(&last_timestamp_for_tile[x + y * r_locals.width], ms_since_epoch);
if (previous_timestamp_value > ms_since_epoch) {
    return;
}
However, something is going wrong and older tile updates are overwriting newer ones. I can't reproduce this on Windows / Vulkan, but it consistently happens on macOS / Metal. Here's an image of the rendered texture - it should be completely green, but has occasional red and black pixels:
(image: rendered texture)
A few questions:
is execution order guaranteed to be the same as the order of the command encoder constructions?
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
I tried submitting each encoder with queue.submit(Some(encoder.finish())) before creating the next encoder for the frame, and even waiting for the queue to finish processing each submitted encoder with
let (tx, rx) = mpsc::channel();
queue.on_submitted_work_done(move || {
    tx.send(()).unwrap();
});
device.poll(wgpu::Maintain::Wait);
rx.recv().unwrap();
// ... loop back and create & submit next encoder for current frame
but that didn't work either.
Good questions!
is execution order guaranteed to be the same as the order of the command encoder constructions?
I believe that is the case, but when I checked, I found the spec is actually unclear about this. I filed https://github.com/gpuweb/gpuweb/issues/3809 to fix that.
Further, I believe the intent is that all memory accesses (e.g. to storage buffers) from one GPU command will complete before the next GPU command begins. So the effect of any writes in one command will be visible in the next command (read-after-write hazard). Also, a write in a later command will not be visible in an earlier command (write-after-read hazard).
do storageBarrier() and atomics work across all invocations in a single frame or just the compute pass?
Another good question. storageBarrier() only works within a single workgroup. This may be surprising, but it is due to a limitation on some platforms.
For further details, see https://github.com/gpuweb/gpuweb/issues/3774
This will be a FAQ because it is surprising, and subtle!
Update: I suspect the bad behaviour you're seeing occurs because storageBarrier() does not work across workgroups. It's a limitation in Metal.
I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithread example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which are created here.
There is actually another resource that is created for each frame: the constant buffer. I kind of understand the constant buffer: because it is written by the CPU (D3D11 dynamic usage) and needs to remain unchanged until the GPU finishes using it, there need to be 2 copies. However, I don't understand why the shadow map needs the same treatment, because it is only modified by the GPU (D3D11 default usage), and there are fence commands to separate reading from and writing to that texture anyway. As long as the GPU respects the fence, a single texture should be enough for the GPU to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".
The key issue is that, for best performance, you don't want to stall the GPU. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better then it's likely not noticeable.
With all that said, you typically don't need to double-buffer depth/stencil buffers for rendering, although if you wanted the initial write of the depth buffer to overlap with the reads of the previous frame's depth-buffer passes, then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
    // Transition the shadow map from writeable to readable.
    m_commandLists[CommandListMid]->ResourceBarrier(1,
        &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(),
            D3D12_RESOURCE_STATE_DEPTH_WRITE,
            D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}

void FrameResource::Finish()
{
    // Transition the shadow map back from readable to writeable.
    m_commandLists[CommandListPost]->ResourceBarrier(1,
        &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(),
            D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE,
            D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.
I want to use snd_pcm_delay() to query the delay until the samples I am about to write to the ALSA buffer become audible. I expect this value to vary between individual calls, yet on two systems this value is constant. On one platform the function always returns a value equal to the period size, and on the other it is always equal to the buffer size (two times the period size in my code).
Is my understanding of snd_pcm_delay() wrong? Is it a driver problem?
The delay is proportional to the number of samples in the buffer (the complement of snd_pcm_avail()), plus the time needed to move samples from the buffer to the speakers. The latter part is driver dependent and might not be implemented.
If the device takes out samples one entire period at a time (some DMA controllers cannot report the current position at a finer granularity), then the delay value will appear to stay constant for a while and then jump by an entire period, and you will see that jump only before you have re-filled the buffer.
I have been writing Spotify support for a project using libspotify, and when I wanted to implement seeking, I noticed that there is apparently no function to get the current playback position. In other words, there is no counterpart to sp_session_player_seek() that returns the current offset.
What some people seem to do is save the offset used in the last seek() call and then accumulate the number of delivered frames in a counter in music_delivery. Together with the stored offset, the current position can be calculated that way, yes - but is this really the only way to do it?
I know that the Spotify Web API has a way to get the current position, so it is strange that libspotify doesn't have one.
Keeping track of it yourself is the way to do it.
The way to look at it is this: libspotify doesn't actually play music, so it can't possibly know the current position.
All libspotify does is pass you PCM data. It's the application's job to get that out to the speakers, and audio pipelines invariably have various buffers and latencies and whatnot in place. When you seek, you're not saying "move playback to here", you're saying "start giving me PCM from this point in the track".
In fact, if you're just counting frames and then passing them to an audio pipeline, your position is likely incorrect if it doesn't take into account that pipeline's buffers etc.
You can always track the position yourself.
For example:
void SpSeek(int position)
{
sp_session_player_seek(mSession, position);
mMsPosition = position;
}
int OnSessionMusicDelivery(sp_session *session, const sp_audioformat *format, const void *frames, int numFrames)
{
return SendToAudioDriver(...)
}
In my case I'm using OpenSL (Android); each time the buffer finishes, I update the value of the position like this:
mMsPosition += (frameCount * 1000) / UtPlayerGetSampleRate();
where frameCount is the number of frames consumed by the driver.
I wish to record the microphone audio stream so I can do realtime DSP on it.
I want to do so without having to use threads and without having .read() block while it waits for new audio data.
UPDATE/ANSWER: It's a bug in Android. 4.2.2 still has the problem, but 5.01 IS FIXED! I'm not sure where the divide is but that's the story.
NOTE: Please don't say "Just use threads." Threads are fine, but this isn't about them, and the Android developers intended for AudioRecord to be fully usable without me having to specify threads and without me having to deal with a blocking read(). Thank you!
Here is what I have found:
When the AudioRecord object is initialized, it creates its own internal ring type buffer.
When .start() is called, it begins recording to said ring buffer (or whatever kind it really is.)
When .read() is called, it reads either half of bufferSize or the specified number of bytes (whichever is less) and then returns.
If there are enough audio samples in the internal buffer, then read() returns instantly with the data. If there are not enough yet, then read() waits until there are, then returns with the data.
.setRecordPositionUpdateListener() can be used to set a Listener, and .setPositionNotificationPeriod() and .setNotificationMarkerPosition() can be used to set the notification Period and Position, respectively.
However, the Listener seems never to be called unless certain requirements are met:
1: The Period or Position must be equal to bufferSize/2 or (bufferSize/2)-1.
2: A .read() must be called before the Period or Position timer starts counting - in other words, after calling .start(), also call .read(), and each time the Listener is called, call .read() again.
3: .read() must read at least half of bufferSize each time.
So using these rules I am able to get the callback/Listener working, but for some reason the reads are still blocking and I can't figure out how to get the Listener to only be called when there is a full read's worth.
If I rig up a button view to click to read, then I can tap it, and if I tap rapidly, read blocks. But if I wait for the audio buffer to fill, then the first tap is instant (read returns right away), but subsequent rapid taps are blocked because read() has to wait, I guess.
Greatly appreciated would be any insight on how I might make the Listener work as intended - in such a way that my listener gets called when there's enough data for read() to return instantly.
Below are the relevant parts of my code.
I have some log statements in my code which send strings to logcat, which lets me see how long each command takes; this is how I know that read() is blocking.
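The pattern is roughly this (simplified sketch; same fields as in the code below):
long t0 = System.nanoTime();
samplesread = recorder.read(audioData, 0, bufferSize);
long elapsedMs = (System.nanoTime() - t0) / 1000000;
// A large value here means read() was blocking:
Log.d("TimeTrack", "read() took " + elapsedMs + " ms");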
(And the buttons in my simple test app are also very slow to respond when it is reading repeatedly, but the CPU is not pegged.)
Thanks,
~Jesse
In my OnCreate():
bufferSize=AudioRecord.getMinBufferSize(samplerate,AudioFormat.CHANNEL_CONFIGURATION_MONO,AudioFormat.ENCODING_PCM_16BIT)*4;
recorder = new AudioRecord (AudioSource.MIC,samplerate,AudioFormat.CHANNEL_CONFIGURATION_MONO,AudioFormat.ENCODING_PCM_16BIT,bufferSize);
recorder.setRecordPositionUpdateListener(mRecordListener);
recorder.setPositionNotificationPeriod(bufferSize/2);
//recorder.setNotificationMarkerPosition(bufferSize/2);
audioData = new short [bufferSize];
recorder.startRecording();
samplesread=recorder.read(audioData,0,bufferSize);//This triggers it to start doing the callback.
Then here is my listener:
public OnRecordPositionUpdateListener mRecordListener = new OnRecordPositionUpdateListener()
{
    public void onPeriodicNotification(AudioRecord recorder) //This one gets called every period.
    {
        Log.d("TimeTrack", "AAA");
        samplesread=recorder.read(audioData,0,bufferSize);
        Log.d("TimeTrack", "BBB");
        //player.write(audioData, 0, samplesread);
        //Log.d("TimeTrack", "CCC");
        reads++;
    }

    @Override
    public void onMarkerReached(AudioRecord recorder) //This one gets called only once -- when the marker is reached.
    {
        Log.d("TimeTrack", "AAA");
        samplesread=recorder.read(audioData,0,bufferSize);
        Log.d("TimeTrack", "BBB");
        //player.write(audioData, 0, samplesread);
        //Log.d("TimeTrack", "CCC");
    }
};
UPDATE: I have tried this on Android 2.2.3, 2.3.4, and now 4.0.3, and all act the same.
Also: there are open bug reports about it on code.google.com - one started in 2012 by someone else, and one from 2013 started by me (I didn't know about the first):
http://code.google.com/p/android/issues/detail?id=53996
http://code.google.com/p/android/issues/detail?id=25138
If folks who care about this bug log in and vote for and/or comment on the bugs, maybe they'll get addressed sooner by Google.
UPDATE 2016: Ahhhh, finally, after years of wondering whether it was me or Android, I finally have the answer! I tried my above code on 4.2.2: same problem. I tried the above code on 5.01, AND IT WORKS!!! And the initial .read() call is no longer needed either. Now, once .setPositionNotificationPeriod() and .startRecording() are called, mRecordListener() just magically starts getting called every time there is data available, so it no longer blocks, because the callback is not fired until enough data has been recorded. I haven't listened to the data to confirm it's recording correctly, but the callback is happening like it should, and it is not blocking the activity like it used to!
It's a late answer, but I think I know where Jesse made a mistake. His read call is getting blocked because he is requesting a number of shorts equal to the buffer size, but the buffer size is in bytes and a short contains 2 bytes. If we make the short array the same length as the buffer, we request twice as much data.
The solution is to make audioData = new short[bufferSize / 2]. If the buffer size is 1000 bytes, this way we will request 500 shorts, which is 1000 bytes.
He should also change samplesread=recorder.read(audioData,0,bufferSize) to samplesread=recorder.read(audioData,0,audioData.length).
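In other words, using the variable names from the question, the two fixed lines would look like this (sketch of just the fix):
// bufferSize is in bytes, but audioData holds 16-bit (2-byte) shorts:
audioData = new short[bufferSize / 2];
// Request a number of shorts, not a number of bytes:
samplesread = recorder.read(audioData, 0, audioData.length);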
UPDATE
OK, Jesse, I can see where another mistake can be: the position notification period. This value has to be large enough that the listener isn't called too often, and we need to make sure that when the listener is called, the bytes to read are ready to be collected. If the bytes aren't ready when the listener is called, the main thread will get blocked by the recorder.read(audioData, 0, audioData.length) call until the requested bytes are collected by AudioRecord.
You should calculate the buffer size and the length of the shorts array based on the time interval you set - that is, how often you want the listener to be called. The position notification period, the buffer size, and the shorts array length all have to be adjusted together. Let me show you an example:
// Call the listener every 1/10 of a second (the period is expressed in frames).
int periodInFrames = sampleRate / 10;
// bytes = frames * channels * (bits per sample / 8)
int bufferSize = periodInFrames * 1 * 16 / 8;
audioData = new short[bufferSize / 2]; // a short holds 2 bytes
int minBufferSize = AudioRecord.getMinBufferSize(sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
if (bufferSize < minBufferSize) bufferSize = minBufferSize;
recorder = new AudioRecord(AudioSource.MIC, sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize);
recorder.setRecordPositionUpdateListener(mRecordListener);
recorder.setPositionNotificationPeriod(periodInFrames);
recorder.startRecording();
public OnRecordPositionUpdateListener mRecordListener = new OnRecordPositionUpdateListener() {
    @Override
    public void onPeriodicNotification(AudioRecord recorder) {
        samplesread = recorder.read(audioData, 0, audioData.length);
        player.write(short2byte(audioData));
    }

    @Override
    public void onMarkerReached(AudioRecord recorder) {
        // Not used, but required by the interface.
    }
};
private byte[] short2byte(short[] data) {
    int dataSize = data.length;
    byte[] bytes = new byte[dataSize * 2];
    for (int i = 0; i < dataSize; i++) {
        // Little-endian: low byte first, then high byte.
        bytes[i * 2] = (byte) (data[i] & 0x00FF);
        bytes[(i * 2) + 1] = (byte) (data[i] >> 8);
        data[i] = 0; // clear the source buffer as we go
    }
    return bytes;
}
So now a bit of explanation.
First we set how often the listener has to be called to collect audio data (periodInFrames). The position notification period is expressed in frames. The sampling rate is expressed in frames per second, so for a 44100 sampling rate we have 44100 frames per second. I divided it by 10 so the listener will be called every 4410 frames = 100 milliseconds - a reasonable time interval.
Now we calculate the buffer size based on our periodInFrames so no data gets overwritten before we collect it. The buffer size is expressed in bytes. Our time interval is 4410 frames; each frame contains one sample per channel, so we multiply by the number of channels (1 in your case), and each sample takes 1 byte for ENCODING_PCM_8BIT or 2 bytes for ENCODING_PCM_16BIT, so we multiply by the bits per sample (16 for ENCODING_PCM_16BIT, 8 for ENCODING_PCM_8BIT) and divide by 8. For our example: 4410 frames * 1 channel * 16 bits / 8 = 8820 bytes.
Then we set the audioData length to half of bufferSize, because a short contains 2 bytes while bufferSize is expressed in bytes; this way, when the listener gets called, the bytes to read are already there waiting to be collected.
Then we check whether bufferSize is large enough to successfully initialize the AudioRecord object; if it's not, we set bufferSize to its minimal size - we don't need to change our time interval or the audioData length.
In our listener we read and store the data into the short array. That's why we use audioData.length instead of the buffer size: only audioData.length tells us the number of shorts the buffer contains.
I had it working some time ago, so please let me know if it works for you.
I'm not sure why you're avoiding spawning separate threads, but if it's because you don't want to have to deal with coding them properly, you can call .schedule on a Timer object after each .read, with the time interval set to the time it takes to get your buffer filled (number of samples in the buffer / sampleRate). Yes, I know this uses a separate thread, but this advice was given assuming that the reason you were avoiding threads was to avoid having to code them properly.
This way, the longest time it could possibly block the thread for should be negligible. But I don't know why you'd want to do that.
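A rough sketch of that idea with java.util.Timer / java.util.TimerTask (illustrative only; recorder, audioData, and samplerate are the fields from the question):
private final Timer timer = new Timer();

private void scheduleNextRead() {
    // Wait roughly as long as the buffer takes to fill, then read.
    long intervalMs = (audioData.length * 1000L) / samplerate;
    timer.schedule(new TimerTask() {
        @Override
        public void run() {
            // By now a full read's worth should be buffered,
            // so this read should return almost immediately.
            samplesread = recorder.read(audioData, 0, audioData.length);
            // ... do the realtime DSP on audioData here ...
            scheduleNextRead(); // re-arm after each read
        }
    }, intervalMs);
}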
If the above reason is not why you're avoiding using separate threads, may I ask why?
Also, what exactly do you mean by realtime? Do you intend to play back the affected audio using, let's say, an AudioTrack? Because the latency on most Android devices is pretty bad.