XAudio2 voice pooling - audio

I seem to have some conflicting information, and I am not sure what is up to date and what is proper.
In this talk: https://www.microsoft.com/en-us/download/details.aspx?id=6871
They recommend pooling voices at the application level, since there is a cost to destroying voices.
However, looking at https://learn.microsoft.com/en-us/windows/win32/api/xaudio2/nf-xaudio2-ixaudio2-createsourcevoice
It states:
XAudio2 uses an internal memory pooler for voices with the same format. This means memory allocation for voices will occur less frequently as more voices are created and then destroyed. To minimize just-in-time allocations, a title can create the anticipated maximum number of voices needed up front, and then delete them as necessary. Voices will then be reused from the XAudio2 pool. The memory pool is tied to an XAudio2 engine instance.
So this leads me to believe I don't need pooling as XAudio2 has internal pooling. Then in another section:
https://learn.microsoft.com/en-us/windows/win32/api/xaudio2/nf-xaudio2-ixaudio2voice-destroyvoice
To avoid title thread interruptions from a blocking DestroyVoice call, the application can destroy voices on a separate non-critical thread, or the application can use voice pooling strategies to reuse voices rather than destroying them. Note that voices can only be reused with audio that has the same data format and the same number of channels the voice was created with. A voice can play audio data with different sample rates than that of the voice by calling IXAudio2SourceVoice::SetFrequencyRatio with an appropriate ratio parameter.
This has information on pooling again, which makes it seem like I should be pooling voices. Does anyone know which information is correct? Should I be pooling voices, or should I leave it to XAudio2's internal pooler?

The docs are not inconsistent, but it's a bit subtle. If you just create voices with the same format, then internally they will get reused to help minimize the cost of creating/destroying voices.
You can, however, go further and directly reuse source voices at the application level. They have to use the same basic format (PCM vs. ADPCM, bit depth, and channel count), but they can have different sample rates.
See DirectX Tool Kit for Audio for example code for voice reuse.
For games, you likely also want to prioritize voices and use 'voice stealing' to avoid having too many sounds playing at once, which would be a muddle anyhow.
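As a rough sketch of what application-level reuse can look like (this is not the DirectX Tool Kit implementation), assuming a single XAudio2 engine instance and a hypothetical VoicePool class that keys idle voices by the properties that must match for reuse:

    // Hypothetical voice pool: reuse idle IXAudio2SourceVoice objects instead of
    // calling DestroyVoice(). Error handling is omitted for brevity.
    #include <xaudio2.h>
    #include <unordered_map>
    #include <vector>

    struct VoiceKey {
        WORD formatTag;      // PCM vs. ADPCM
        WORD channels;       // channel count must match
        WORD bitsPerSample;  // bit depth must match
        bool operator==(const VoiceKey& o) const {
            return formatTag == o.formatTag && channels == o.channels &&
                   bitsPerSample == o.bitsPerSample;
        }
    };
    struct VoiceKeyHash {
        size_t operator()(const VoiceKey& k) const {
            return (size_t(k.formatTag) << 20) ^ (size_t(k.channels) << 10) ^ k.bitsPerSample;
        }
    };

    class VoicePool {
    public:
        explicit VoicePool(IXAudio2* engine) : engine_(engine) {}

        // Hand out a voice compatible with fmt, reusing an idle one when possible.
        IXAudio2SourceVoice* Acquire(const WAVEFORMATEX& fmt) {
            VoiceKey key{ fmt.wFormatTag, fmt.nChannels, fmt.wBitsPerSample };
            auto& idle = idle_[key];
            if (!idle.empty()) {
                IXAudio2SourceVoice* v = idle.back();
                idle.pop_back();
                // The data's sample rate may differ from the voice's original rate;
                // SetSourceSampleRate (or SetFrequencyRatio) adapts the voice.
                v->SetSourceSampleRate(fmt.nSamplesPerSec);
                return v;
            }
            IXAudio2SourceVoice* v = nullptr;
            engine_->CreateSourceVoice(&v, &fmt);  // check the HRESULT in real code
            return v;
        }

        // Return a finished voice to the pool instead of destroying it.
        void Release(IXAudio2SourceVoice* v, const WAVEFORMATEX& fmt) {
            v->Stop();
            v->FlushSourceBuffers();
            VoiceKey key{ fmt.wFormatTag, fmt.nChannels, fmt.wBitsPerSample };
            idle_[key].push_back(v);
        }

    private:
        IXAudio2* engine_;
        std::unordered_map<VoiceKey, std::vector<IXAudio2SourceVoice*>, VoiceKeyHash> idle_;
    };

Whether a pool like this is worth it mostly depends on how often you create and destroy voices per frame; measuring the cost of DestroyVoice on your target hardware is the deciding factor.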

Related

CoreAudio: multi-threaded back-end OS X

I'd like to learn how to deal with the possibility of using multiple CPU cores for audio rendering of a single input parameter array on OS X.
In AudioToolbox, one rendering callback normally lives on a single thread, which seemingly gets processed by a single CPU core.
How can one deal with input data overflow on that core while the other 3, 5, or 7 cores stay practically idle?
It is not possible to know in advance how many cores will be available on a particular machine, of course.
Is there a way of (statically or dynamically) allocating rendering callbacks to different threads or "threadbare blocks"?
Is there a way of precisely synchronising the moment at which various rendering callbacks on their own (highest priority) threads in parallel produce their audio buffers?
Can the GCD API perhaps be of any use?
Thanks in advance!
PS. This question is related to another question I posted a while ago:
OSX AudioUnit SMP, with the difference that I now seem to better understand the scope of the problem.
No matter how you set up your audio processing on macOS – be it just writing a single render callback, or setting up a whole application suite – CoreAudio will always provide you with just one single realtime audio thread. This thread runs with the highest priority there is, and thus is the only way the system can give you at least some guarantees about processing time and such.
If you really need to distribute load over multiple CPU cores, you need to create your own threads manually and share sample and timing data across them. However, you will not be able to create a thread with the same priority as the system's audio thread, so your additional threads should be considered much "slower" than your audio thread. This means the audio thread might end up waiting on some other thread(s) longer than the time it has available, which then results in an audible glitch.
Long story short, the most crucial part is to design the actual processing algorithm carefully, as in all scenarios you really need to know what task can take how long.
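To make that a bit more concrete, here is a very rough sketch of the idea: a worker thread you create yourself computes samples ahead of time into a single-producer/single-consumer ring buffer, and the real-time render callback only copies out of it. SpscRing, WorkerLoop, and ComputeSamples are made-up names for illustration; a real version needs more careful sizing and underrun handling.

    #include <AudioToolbox/AudioToolbox.h>
    #include <atomic>
    #include <chrono>
    #include <cstring>
    #include <thread>
    #include <vector>

    // Lock-free single-producer/single-consumer ring buffer of mono float samples.
    struct SpscRing {
        std::vector<float> data;
        std::atomic<size_t> head{0};  // advanced by the consumer (render callback)
        std::atomic<size_t> tail{0};  // advanced by the producer (worker thread)
        explicit SpscRing(size_t n) : data(n) {}

        size_t readable() const {
            return tail.load(std::memory_order_acquire) - head.load(std::memory_order_relaxed);
        }
        size_t writable() const {
            return data.size() - (tail.load(std::memory_order_relaxed) - head.load(std::memory_order_acquire));
        }
        void push(const float* src, size_t n) {
            size_t t = tail.load(std::memory_order_relaxed);
            for (size_t i = 0; i < n; ++i) data[(t + i) % data.size()] = src[i];
            tail.store(t + n, std::memory_order_release);
        }
        void pop(float* dst, size_t n) {
            size_t h = head.load(std::memory_order_relaxed);
            for (size_t i = 0; i < n; ++i) dst[i] = data[(h + i) % data.size()];
            head.store(h + n, std::memory_order_release);
        }
    };

    static SpscRing gRing(1 << 16);

    // Worker thread: does the heavy processing ahead of time, never seen by CoreAudio.
    void WorkerLoop(std::atomic<bool>& running) {
        std::vector<float> block(512);
        while (running.load()) {
            if (gRing.writable() >= block.size()) {
                // ComputeSamples(block.data(), block.size());  // your expensive DSP
                gRing.push(block.data(), block.size());
            } else {
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    }

    // Real-time render callback: only copies precomputed samples, no locks, no allocation.
    OSStatus RenderCallback(void*, AudioUnitRenderActionFlags*, const AudioTimeStamp*,
                            UInt32, UInt32 inNumberFrames, AudioBufferList* ioData) {
        float* out = static_cast<float*>(ioData->mBuffers[0].mData);
        if (gRing.readable() >= inNumberFrames) {
            gRing.pop(out, inNumberFrames);
        } else {
            // The worker fell behind: output silence instead of blocking the audio thread.
            std::memset(out, 0, inNumberFrames * sizeof(float));
        }
        return noErr;
    }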
EDIT: My previous answer here was quite different and uneducated. I updated the above parts for anybody coming across this answer in the future, to not be guided in the wrong direction.
You can find the previous version in the history of this answer.
I am not completely sure, but I do not think this is possible. Of course, you can use Apple's Accelerate.framework, which makes use of the available resources. But:
A render callback lives on a real-time priority thread on which subsequent render calls arrive asynchronously. (Apple documentation)
At the user level you are not able to create such threads.
By the way, these slides by Godfrey van der Linden may be interesting to you.
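On the other hand, Accelerate can help you get more out of the single render thread you do have. As a tiny illustration (the function names and buffers here are made up, but vDSP_vsmul and vDSP_vadd are real vDSP routines):

    #include <Accelerate/Accelerate.h>

    // Vectorized gain and mix steps via vDSP instead of scalar loops; useful
    // inside the one render callback to squeeze more out of a single core.
    void ApplyGain(const float* in, float* out, float gain, size_t frames) {
        vDSP_vsmul(in, 1, &gain, out, 1, frames);  // out = in * gain
    }
    void MixTwoSources(const float* a, const float* b, float* out, size_t frames) {
        vDSP_vadd(a, 1, b, 1, out, 1, frames);     // out = a + b
    }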

Do multiple isolated OpenGL contexts affect performance

My co-worker and I are working on a video rendering engine.
The whole idea is to parse a configuration file, render each frame to an offscreen FBO, and then fetch the frame render results using glReadPixels for video encoding.
We tried to optimize the rendering speed by creating two threads, each with an independent OpenGL context. One thread renders odd frames and the other renders even frames. The two threads do not share any GL resources.
The results are quite confusing. On my computer, the rendering speed increased compared to our single-threaded implementation, while on my partner's computer the overall speed dropped.
What I wonder is how the number of OpenGL contexts affects the overall performance. Is it really a good idea to create multiple OpenGL threads if they do not share anything?
Context switching is certainly not free. As is pretty much always the case with performance-related questions, it's impossible to quantify in general terms. If you want to know, you need to measure it on the system(s) you care about. It can be quite expensive.
You therefore add a certain amount of overhead by using multiple contexts. Whether that pays off depends on where your bottleneck is. If you were already GPU-limited with a single CPU thread, you won't really gain anything, because you can't make the GPU do the work quicker if it is already fully loaded. In that case you add overhead for the context switches without any gain and make the whole thing slower.
If you were CPU-limited, using multiple CPU threads can reduce your total elapsed time. Whether the parallelization of the CPU work, combined with the added overhead for synchronization and context switches, results in a net gain again depends on your use case and the specific system. Trying both and measuring is the only good way to find out.
Based on your problem description, you might also be able to use multithreading while still sticking with a single OpenGL context, and keeping all OpenGL calls in a single thread. Instead of using glReadPixels() synchronously, you could have it read into PBOs (Pixel Buffer Objects), which allows you to use asynchronous reads. This decouples GPU and CPU work much better. You could also do the video encoding on a separate thread if you're not doing that yet. This approach will need some inter-thread synchronization, but it avoids using multiple contexts, while still using parallel processing to get the job done.
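As a rough illustration of that suggestion, a double-buffered PBO readback might look something like the sketch below. RenderFrame and EncodeFrame are hypothetical placeholders for the engine's draw pass and the encoder hand-off, and the context and extension loading (here via GLEW) are assumed to be set up already.

    #include <GL/glew.h>

    void RenderFrame(int frameIndex);      // placeholder: draw the frame into the FBO
    void EncodeFrame(const void* pixels);  // placeholder: submit pixels to the encoder

    void RenderAndReadBack(int width, int height, int frameCount) {
        GLuint pbo[2];
        glGenBuffers(2, pbo);
        for (int i = 0; i < 2; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
        }

        for (int frame = 0; frame < frameCount; ++frame) {
            RenderFrame(frame);

            // Start an asynchronous copy of this frame's pixels into one PBO.
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[frame % 2]);
            glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

            if (frame > 0) {
                // Map the PBO filled on the previous frame; its transfer has had a
                // whole frame to complete, so this rarely stalls the pipeline the
                // way a direct synchronous glReadPixels would.
                glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[(frame + 1) % 2]);
                if (void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)) {
                    EncodeFrame(pixels);  // could also be queued to an encoder thread
                    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
                }
            }
        }
        // Note: the final frame's PBO would still need to be drained after the loop.
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        glDeleteBuffers(2, pbo);
    }

Everything here stays on one OpenGL context and one GL thread; only the encoding (and any CPU-side copying) needs to move to other threads.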

Make a background thread in Unity3D

I have a WP7 app with two background threads:
1. Planning of playback times
2. Playing different sound samples at the planned times (possibly a few samples at the same time).
How can I reproduce this logic with the Unity3D engine? Is it possible?
Unity will not allow you to access its APIs from any thread other than the main one; you can't use locking primitives to get around it.
You can use the standard .NET threading APIs to start threads that do not interact directly with the Unity API, though. You could calculate samples and buffers on an extra thread, but your main thread would have to call AudioClip.SetData to submit the calculated samples to Unity.
Note that since Unity 2018.1, the Job System has been introduced which allows certain kinds of computation tasks to be performed on background threads (for example, setting transform positions). The tasks that can be performed are being gradually opened up over time.
The fact that the API is not thread-safe does not mean that you cannot use it with additional thread safety. You only need to ensure that no two threads modify the common data at the same time. You can use a simple lock to ensure no one reads the samples list while it is being updated.
However, instead of threads I'd recommend using coroutines, because they make things a lot easier. No thread safety is needed, the benefits are similar, and the execution order is much clearer.
A simpler way to achieve a similar solution would be to update the samples list inside Update, and read it in a LateUpdate method.
No way =( The Unity API is not thread-safe.

Analyzing and profiling a multi-threaded application

We have a multithreaded application that does heavy packet processing across multiple pipeline stages. The application is written in C under Linux.
The entire application works fine and has no memory leaks or thread safety issues. However, in order to analyse the application, how can we profile and analyse the threads?
In particular here is what we are interested in:
the resource usage of each thread
the frequency and timing with which threads contend to acquire locks
the amount of overhead due to synchronization
any bottlenecks in the system
the best system throughput we can get
What are the best techniques and tools available for this?
Take a look at Intel VTune Amplifier XE (formerly … Intel Thread Profiler) to see if it will meet your needs.
This and other Intel Linux development tools are available free for non-commercial use.
In the video Using the Timeline in Intel VTune Amplifier XE a timeline of a multi-threaded application is demonstrated. The presenter uses a graphic display to show lock activity and how to dig down to the source line of the particular lock causing serialization. At 9:20 the presenter mentions "with the frame API you can programmatically mark certain events or phases in your code. And these marks will appear on the timeline."
I worked on a similar system some years ago. Here's how I did it:
Step 1. Get rid of unnecessary time-takers in individual threads. For that I used this technique. This is important to do because the overall messaging system is limited by the speed of its parts.
Step 2. This part is hard work but it pays off. For each thread, print a time-stamped log showing when each message was sent, received, and acted upon. Then merge the logs into a common timeline and study it. What you are looking for is a) unnecessary retransmissions, for example due to timeouts, b) extra delay between the time a message is received and when it is acted upon. This can happen, for example, if a thread has multiple messages in its input queue, some of which can be processed more quickly than others. It makes sense to process those first.
You may need to alternate between these two.
Don't expect this to be easy. Some programmers are too fine to be bothered with this kind of dirty work. But, you could be pleasantly surprised at how fast you can make the whole thing go.
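As a rough illustration of the per-thread, time-stamped logging from Step 2 (the names here are made up; a real system would record the message IDs and the sent/received/handled events described above), each thread keeps its own log and the files are merged by timestamp afterwards:

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Event {
        long long ns;        // monotonic timestamp in nanoseconds
        const char* what;    // "sent", "received", "handled", ...
        unsigned long msgId; // identifies the message across thread logs
    };

    // One EventLog per thread, so recording needs no locks and stays cheap.
    class EventLog {
    public:
        void Record(const char* what, unsigned long msgId) {
            auto now = std::chrono::steady_clock::now().time_since_epoch();
            long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
            events_.push_back(Event{ns, what, msgId});
        }
        // Called once per thread at shutdown; the per-thread files are then
        // merged and sorted by timestamp into a single timeline offline.
        void Dump(const std::string& path) const {
            if (FILE* f = std::fopen(path.c_str(), "w")) {
                for (const Event& e : events_)
                    std::fprintf(f, "%lld %s %lu\n", e.ns, e.what, e.msgId);
                std::fclose(f);
            }
        }
    private:
        std::vector<Event> events_;
    };

    thread_local EventLog tEventLog;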
1) Don't know. There are some profilers available for Linux.
2) If you are pipelining, each stage should be doing sufficient work to ensure that contention on the producer-consumer queues is minimal. You can dig this out with some timings - if a stage takes 10ms+ to process a packet, you can forget about contention/lock issues. If it takes 100us, you should consider amalgamating a couple of stages so that each stage does more work.
3) Same as (2), unless there is a separate synchronization issue with some global data or whatever.
4) Dumping/logging the queue counts every second would be useful. The longest queue will be before the stage with the narrowest neck.
5) No idea - I don't know how your current system works, what hardware it's running on, etc. There are some 'normal' optimizations - eliminating memory-manager calls with object pools, adding extra threads to the stages with the heaviest CPU loadings, things like that - but 'what is the best system throughput we can get' is too ethereal to say.
Do you have flexibility to develop under Darwin (OSX) and deploy on Linux? The performance analysis tools are excellent and easy to use (Shark and Thread Viewer are useful for your purpose).
There are many Linux performance tools, of course. gprof, Valgrind (with Cachegrind, Callgrind, Massif), and Vtune will do what you need.
To my knowledge, there is no tool that will directly answer your questions. However, the answers may be found by cross referencing the data points and metrics from both instrumentation and sampling based solutions.

Which thread to use for audio decoding?

When working with audio playback I am used to the following pattern:
one disk (or network) thread which reads data from disk (or the network) and fills a ringbuffer
one audio thread which reads data from the ringbuffer, possibly performs DSP, and writes to audio hardware
(pull or push API)
This works fine, and there's no issue when working with, say, a WAV file.
Now, if the source data is encoded in a compressed format, like Vorbis or MP3, decoding takes some time.
And it seems like it's quite common to perform decoding in the disk/network thread.
But isn't this the wrong design? While disk or network access blocks, some CPU time is available for decoding, but it is wasted if decoding happens in the same thread.
It seems to me that if the network becomes slow, the risk of buffer underruns is higher if decoding happens sequentially.
So, shouldn't decoding be performed in the audio thread?
In my context, I would prefer to avoid adding a dedicated decoding thread. It's for mobile platforms, and SMP is pretty rare right now. But please tell me if a dedicated decoding thread really makes sense to you.
It's more important for the audio thread to be available for playing audio smoothly than for the network thread to maintain a perfect size buffer. If you're only using two threads, then the decoding should be done on the network thread. If you were to decode on the playing thread then it's possible the time could come that you need to push more audio out to the hardware but the thread is busy decoding. It's better if you maintain a buffer of already decoded audio.
Ideally you would use three threads: one for reading the network, one for decoding, and one for playing. In our application that handles audio/video capture, recording, and streaming, we have eight threads per stream (recently increased from six when we added new functionality). It's much easier for each thread to have its own functionality, and it can then appropriately measure its performance against its incoming/outgoing buffers. This also benefits profiling and optimization.
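To make the three-thread suggestion a little more concrete, here is a minimal sketch using blocking bounded queues between the stages. ReadChunk, DecodeChunk, and PlayPcm are placeholders for the real I/O, codec, and audio-output calls, a blocking player thread like this only fits a push-style output API, and shutdown handling is omitted.

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Bounded queue: producers block when it is full, consumers block when empty.
    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(size_t cap) : cap_(cap) {}
        void Push(T item) {
            std::unique_lock<std::mutex> lk(m_);
            notFull_.wait(lk, [&] { return q_.size() < cap_; });
            q_.push_back(std::move(item));
            notEmpty_.notify_one();
        }
        T Pop() {
            std::unique_lock<std::mutex> lk(m_);
            notEmpty_.wait(lk, [&] { return !q_.empty(); });
            T item = std::move(q_.front());
            q_.pop_front();
            notFull_.notify_one();
            return item;
        }
    private:
        std::mutex m_;
        std::condition_variable notFull_, notEmpty_;
        std::deque<T> q_;
        size_t cap_;
    };

    using Chunk = std::vector<unsigned char>;  // encoded bytes from disk/network
    using Pcm   = std::vector<short>;          // decoded samples

    Chunk ReadChunk();                 // placeholder: read encoded data
    Pcm   DecodeChunk(Chunk encoded);  // placeholder: Vorbis/MP3 decode
    void  PlayPcm(const Pcm& pcm);     // placeholder: push samples to the output

    BoundedQueue<Chunk> encodedQ(16);  // reader  -> decoder
    BoundedQueue<Pcm>   decodedQ(16);  // decoder -> player

    void ReaderThread()  { for (;;) encodedQ.Push(ReadChunk()); }
    void DecoderThread() { for (;;) decodedQ.Push(DecodeChunk(encodedQ.Pop())); }
    void PlayerThread()  { for (;;) PlayPcm(decodedQ.Pop()); }
    // Launch with: std::thread r(ReaderThread), d(DecoderThread), p(PlayerThread);

A nice side effect is that each queue's depth becomes a direct measurement of which stage is the bottleneck.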
If your device has a single CPU, all threads share it. OS thread switching is usually very efficient (you won't lose any meaningful CPU power to the swapping). Therefore, you should create more threads if it will simplify your logic.
In your case, there is a pipeline. A different thread for each stage of the pipeline is a good pattern. The alternative, as you noticed, involves complex logic, synchronization, events, interrupts, or whatever. Sometimes there are no good alternatives at all.
Hence, my suggestion - create a dedicated thread for the audio decoding.
If you have more than a single CPU, you'll gain even more efficiency by using one thread for each pipeline stage.
