When working with audio playback I am used to the following pattern:
one disk (or network) thread which reads data from disk (or the network) and fills a ringbuffer
one audio thread which reads data from the ringbuffer, possibly performs DSP, and writes to audio hardware
(pull or push API)
This works fine, and there's no issue when working with, say, a WAV file.
Now, if the source data is encoded in a compressed format, like Vorbis or MP3, decoding takes some time.
And it seems like it's quite common to perform decoding in the disk/network thread.
But isn't this the wrong design? While disk or network access blocks, some CPU time is available for decoding, but it is wasted if decoding happens in the same thread.
It seems to me that if the network becomes slow, the risk of buffer underruns is higher if decoding happens sequentially in that same thread.
So, shouldn't decoding be performed in the audio thread?
In my context, I would prefer to avoid adding a dedicated decoding thread. It's for mobile platforms, and SMP is pretty rare right now. But please tell me if a dedicated decoding thread really makes sense to you.
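For reference, the kind of ring buffer I mean is a single-producer/single-consumer buffer shared by the two threads, roughly like the sketch below (names and sizes are made up, and it assumes exactly one writer thread and one reader thread):

#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 65536                       /* must be a power of two */

typedef struct {
    float          data[RING_SIZE];
    _Atomic size_t write_pos;                 /* advanced only by the disk/network thread */
    _Atomic size_t read_pos;                  /* advanced only by the audio thread */
} ring_buffer;

/* Disk/network thread: copy up to 'count' samples in, return how many fit. */
static size_t ring_write(ring_buffer *rb, const float *src, size_t count)
{
    size_t w = atomic_load_explicit(&rb->write_pos, memory_order_relaxed);
    size_t r = atomic_load_explicit(&rb->read_pos,  memory_order_acquire);
    size_t free_space = RING_SIZE - (w - r);  /* positions only grow; unsigned wraparound is fine */
    if (count > free_space) count = free_space;
    for (size_t i = 0; i < count; ++i)
        rb->data[(w + i) & (RING_SIZE - 1)] = src[i];
    atomic_store_explicit(&rb->write_pos, w + count, memory_order_release);
    return count;
}

/* Audio thread: copy up to 'count' samples out, return how many were available. */
static size_t ring_read(ring_buffer *rb, float *dst, size_t count)
{
    size_t r = atomic_load_explicit(&rb->read_pos,  memory_order_relaxed);
    size_t w = atomic_load_explicit(&rb->write_pos, memory_order_acquire);
    size_t available = w - r;
    if (count > available) count = available;
    for (size_t i = 0; i < count; ++i)
        dst[i] = rb->data[(r + i) & (RING_SIZE - 1)];
    atomic_store_explicit(&rb->read_pos, r + count, memory_order_release);
    return count;
}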
It's more important for the audio thread to be available for playing audio smoothly than for the network thread to maintain a perfectly sized buffer. If you're only using two threads, then the decoding should be done on the network thread. If you were to decode on the playing thread, a moment could come when you need to push more audio out to the hardware but the thread is busy decoding. It's better to maintain a buffer of already decoded audio.
Ideally you would use three threads: one for reading the network, one for decoding, and one for playing. In our application that handles audio/video capture, recording, and streaming, we have eight threads per stream (recently increased from six when we added new functionality). It's much easier for each thread to have its own responsibility, and each one can then measure its performance against that of its incoming/outgoing buffers. This also benefits profiling and optimization.
If your device has a single CPU, all threads share it. OS thread switching is usually very efficient (you won't lose any meaningful CPU power to the switching). Therefore, you should create more threads if it will simplify your logic.
In your case, there is a pipeline. A different thread for each stage of the pipeline is a good pattern. The alternative, as you noticed, involves complex logic, synchronization, events, interrupts, or whatever. Sometimes there is no good alternative at all.
Hence my suggestion: create a dedicated thread for the audio decoding.
If you have more than a single CPU, you'll gain even more efficiency by using one thread for each pipeline step.
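To make the pipeline concrete, a rough skeleton could look like the following (illustrative only: the blocking queue is deliberately simple, and read_packet_from_network(), decode_packet(), and play_pcm_block() are placeholders for your own code):

#include <pthread.h>

#define QUEUE_CAP 64

typedef struct {
    void           *items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} queue;

static void queue_push(queue *q, void *item)        /* blocks while the queue is full */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static void *queue_pop(queue *q)                    /* blocks while the queue is empty */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

/* Placeholders for your own networking, decoding, and playback code. */
void *read_packet_from_network(void);
void *decode_packet(void *compressed);
void  play_pcm_block(void *pcm);

static queue compressed_q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                              .not_empty = PTHREAD_COND_INITIALIZER,
                              .not_full  = PTHREAD_COND_INITIALIZER };
static queue pcm_q        = { .lock = PTHREAD_MUTEX_INITIALIZER,
                              .not_empty = PTHREAD_COND_INITIALIZER,
                              .not_full  = PTHREAD_COND_INITIALIZER };

static void *network_thread(void *arg)   /* stage 1: network -> compressed_q */
{
    (void)arg;
    for (;;) queue_push(&compressed_q, read_packet_from_network());
}

static void *decode_thread(void *arg)    /* stage 2: compressed_q -> pcm_q */
{
    (void)arg;
    for (;;) queue_push(&pcm_q, decode_packet(queue_pop(&compressed_q)));
}

static void *audio_thread(void *arg)     /* stage 3: pcm_q -> sound hardware */
{
    (void)arg;
    for (;;) play_pcm_block(queue_pop(&pcm_q));
}

Each stage only ever blocks on its own queue, so a slow network stalls the decoder (and eventually the player) only once the buffers actually drain, and on an SMP machine the stages run in parallel for free.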
I'm thinking about implementing a video converter using Node.js with ffmpeg, but since it's a CPU-intensive task, it might block Express from handling other requests. I've found a couple of articles about this; some of them use worker threads while others use queues like Agendajs or Bull.
Which one is more suitable for my use case? The video converter doesn't have to respond with the actual video; all it has to do is convert it and then upload it to an S3 bucket for later retrieval.
Two sub-problems here:
First problem is keeping your interface responsive during the conversion. If the conversion may take a long time, and you have no good way of splitting it into small chunks (such that you can service requests in between), then you will need to handle it asynchronously, indeed.
So you'll probably want to create at least one worker thread to work in parallel with the main thread.
The second problem is, presumably, making the conversion run fast. Since, as you write, it's a CPU-intensive task, it may profit from additional worker threads. This could mean:
2a. several threads working on a single (queued) conversion task, simultaneously
2b. several threads each working on separate conversion tasks at the same time
2c. a mix of both.
The good news is that you really won't have to worry about most of this yourself, because a) ffmpeg already uses multithreading where possible (this depends on the codec in use!), providing you with a ready-made solution for 2a, and b) node-fluent-ffmpeg (or node-ffmpeg) is already designed to call ffmpeg asynchronously, thus solving problem 1.
The only remaining question, then, is whether you want to make sure only one ffmpeg job runs at a time (queued), or start conversions as soon as they are requested (2b / 2c). The latter is going to be easier to implement. However, it could get you in trouble if a lot of jobs are running simultaneously: at the very least, each conversion job will buffer some input and some output data, and that could get you into memory trouble.
This is where a queue comes into the picture. You'll want to put jobs in a simple queue and start them so that no more than n are running concurrently. The optimal n will not necessarily be 1, but it is unlikely to be larger than 4 or so (again, each single conversion is already making use of parallelism). You'll have to experiment with that a bit, always keeping in mind that the answer may differ from codec to codec.
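If it helps to see the shape of the idea outside of Node, here is the bounded-concurrency pattern as a tiny C sketch (purely illustrative; MAX_CONCURRENT_JOBS is the n you would tune, and in practice a queue library such as Bull gives you this behaviour without hand-rolling anything):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_CONCURRENT_JOBS 2              /* the "n" to tune per codec/machine */

static sem_t job_slots;                    /* counts free conversion slots */

static void *conversion_job(void *arg)
{
    sem_wait(&job_slots);                  /* queued: wait until a slot frees up */
    printf("converting %s\n", (const char *)arg);
    sleep(1);                              /* stands in for the ffmpeg run + S3 upload */
    sem_post(&job_slots);                  /* release the slot for the next job */
    return NULL;
}

int main(void)
{
    const char *files[] = { "a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4" };
    pthread_t jobs[5];

    sem_init(&job_slots, 0, MAX_CONCURRENT_JOBS);
    for (int i = 0; i < 5; ++i)
        pthread_create(&jobs[i], NULL, conversion_job, (void *)files[i]);
    for (int i = 0; i < 5; ++i)
        pthread_join(jobs[i], NULL);
    return 0;
}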
I have a multithreaded application where I want to allow all but one of the threads to run concurrently with each other. However, when a specific thread wakes up, I need the rest of the threads to block.
My current implementation is:
#include <pthread.h>

/* DoTheBackgroundWork() and DoTheMainThreadWork() are placeholders. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void ManyBackgroundThreadsDoingWork(void)
{
    pthread_mutex_lock(&mutex);      /* every background thread serializes here */
    DoTheBackgroundWork();
    pthread_mutex_unlock(&mutex);
}

void MainThread(void)
{
    pthread_mutex_lock(&mutex);      /* excludes all background threads while it runs */
    DoTheMainThreadWork();
    pthread_mutex_unlock(&mutex);
}
This works, in that it does indeed keep the background threads from operating inside the critical block while the main thread is doing its work. However, there is a lot of contention for the mutex amongst the background threads even when they don't necessarily need it. The main thread runs intermittently, and the background threads are able to run concurrently with each other, just not with the main thread.
What I've effectively done is reduce a multithreaded architecture to a single-threaded one using locks... which is silly. What I really want is an architecture that is multithreaded most of the time, but waits while a small operation completes and then goes back to being multithreaded.
Edit: An explanation of the problem.
What I have is an application that displays multiple video feeds coming from PCIe capture cards. The PCIe capture card driver issues callbacks, on threads it manages, into what is effectively the ManyBackgroundThreadsDoingWork function. In this function I copy the captured video frames into buffers for rendering. The main thread is the render thread, which runs intermittently. The copy threads need to block during the render to prevent tearing of the video.
My initial approach was to simply do double buffering, but that is not really an option as the capture card driver won't allow me to buffer frames without pushing them through system memory. The technique being used is called "DirectGMA", from AMD, which allows the capture card to push video frames directly into GPU memory. The only method of synchronization is to put a glFence and mutex around the actual rendering, as the capture card will be continuously streaming data to GPU memory. The driver offers no indication of when a frame transfer completes. The callback supplies enough information for me to know that a frame is ready to be transferred, at which point I trigger the transfer. However, I need to block transfers during the scene render to prevent tearing and artifacts in the video. The technique described above is the one suggested by the PCIe card manufacturer; it breaks down, however, when you want more than one video playing at a time. Thus, the question.
You need a lock that supports both shared and exclusive locking modes, sometimes called a readers/writer lock. This permits multiple threads to get read (shared) locks until one thread requests an exclusive (write) lock.
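Sketched with pthreads, keeping the placeholder work functions from your post (a rough illustration rather than drop-in code):

#include <pthread.h>

/* Placeholders from the question. */
void DoTheBackgroundWork(void);
void DoTheMainThreadWork(void);

static pthread_rwlock_t frame_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Capture/copy threads: any number of these may hold the lock at once. */
void ManyBackgroundThreadsDoingWork(void)
{
    pthread_rwlock_rdlock(&frame_lock);    /* shared (read) lock */
    DoTheBackgroundWork();                 /* copy the frame into its buffer */
    pthread_rwlock_unlock(&frame_lock);
}

/* Render thread: takes the lock exclusively, blocking all copy threads. */
void MainThread(void)
{
    pthread_rwlock_wrlock(&frame_lock);    /* exclusive (write) lock */
    DoTheMainThreadWork();                 /* render the scene */
    pthread_rwlock_unlock(&frame_lock);
}

One caveat: whether a waiting writer gets priority over a steady stream of newly arriving readers depends on the lock implementation and its attributes, so check that the render thread cannot be starved by the copy threads.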
I'm working on an OSX application that transmits data to a hardware device over USB serial. The hardware has a small serial buffer that is drained at a variable rate and should always stay non-empty.
We have a write loop in its own NSThread that checks if the hardware buffer is full, and if not, writes data until it is. The majority of loop iterations don't write anything and take almost no time, but they can occasionally take up to a couple milliseconds (as timed with CACurrentMediaTime). The thread sleeps for 100ns after each iteration. (I know that sleep time seems insanely short, but if we bump it up, the hardware starts getting data-starved.)
This works well much of the time. However, if the main thread or another application starts doing something processor-intensive, the write thread slows down and isn't able to stream data fast enough to keep the device's queue from emptying.
So, we'd like to make the serial write thread real-time. I read the Apple docs on requesting real-time scheduling through the Mach API, then tried to adapt the code snippet from SetPriorityRealtimeAudio(mach_port_t mach_thread_id) in the Chromium source.
However, this isn't working - the application remains just as susceptible to serial communication slowdowns. Any ideas? I'm not sure if I need to change the write thread's behavior, or if I'm passing in the wrong thread policy parameters, or both. I experimented with various period/computation/constraint values, and with forcing a more consistent duty cycle (write for 100ns max and then sleep for 100ns) but no luck.
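For reference, the adapted code looks roughly like this (a sketch; the period/computation/constraint numbers are just placeholder values):

#include <mach/mach.h>
#include <mach/mach_time.h>
#include <stdint.h>

/* Ask the Mach scheduler for time-constraint (real-time) scheduling.
   The placeholder numbers say: "out of every 5 ms I need about 1 ms of CPU,
   finished within 2 ms". */
static kern_return_t make_thread_realtime(void)
{
    mach_timebase_info_data_t timebase;
    mach_timebase_info(&timebase);

    /* Convert nanoseconds to Mach absolute-time units. */
    double ns_to_abs = (double)timebase.denom / (double)timebase.numer;

    thread_time_constraint_policy_data_t policy;
    policy.period      = (uint32_t)(5 * 1000 * 1000 * ns_to_abs);   /* 5 ms */
    policy.computation = (uint32_t)(1 * 1000 * 1000 * ns_to_abs);   /* 1 ms */
    policy.constraint  = (uint32_t)(2 * 1000 * 1000 * ns_to_abs);   /* 2 ms */
    policy.preemptible = FALSE;    /* whether the computation may be broken up */

    return thread_policy_set(mach_thread_self(),
                             THREAD_TIME_CONSTRAINT_POLICY,
                             (thread_policy_t)&policy,
                             THREAD_TIME_CONSTRAINT_POLICY_COUNT);
}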
A related question: How can I check the thread's priority directly, and/or tell if it's starting off as real-time and then being demoted vs not being promoted to begin with? Right now I'm just making inferences from the hardware performance, so it's hard to tell exactly what's going on.
My suggestion is to move the thread of execution that requires the highest priority into a separate process. Apple often does this for real-time work such as driving the built-in camera. Depending on what versions of the OS you are targeting, you can use Distributed Objects (the predecessor to XPC) or XPC.
You can also roll your own RPC mechanism and use standard Unix fork techniques to create a separate child process. Since your main app is the owner of the child process, you should also be able to set the scheduling priority of the process in addition to the individual thread priority within the process.
As I edit this post, I have a WWDC video playing in the background and also started a QuickTime Movie Recording task. As you can see, the real-time aspects of both those apps are running in separate XPC processes:
ps -ax | grep Video
1933 ?? 0:00.08 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
2332 ?? 0:08.94 /System/Library/Frameworks/VideoToolbox.framework/Versions/A/XPCServices/VTDecoderXPCService.xpc/Contents/MacOS/VTDecoderXPCService
XPC Services at developer.apple.com
Distributed Objects at developer.apple.com
I have around 10k video streams that I want to monitor. There's going to be a small cluster (e.g. 5-10) of heterogeneous machines that monitor these streams. Because there isn't enough CPU to do all this, I will have to shuffle the streams: monitor a couple of them at a time, then switch to the next set.
Now, my problem is: I would like to utilize the cores as much as possible, so that I can use fewer sets and this way be able to monitor each stream more often.
Streams have different resolution, so consequently different CPU usage.
A relatively simple solution would be to measure the CPU usage of the highest-bitrate stream on each machine (different CPUs, different usage). If it's 10% and I have 4 cores, I can safely run 9*4=36 processes at a time on that machine. But this would clearly waste a lot of CPU power, as other streams have lower bitrates.
A better solution would be to constantly monitor the usage of the cores and, if the utilization is below a threshold (e.g. 95-10=85%), start a new process.
A more complex solution would be to start a new process with nice -n 20, then somehow check whether it is able to process the data (xx); if so, renice it to normal priority and try the same thing with the next process... (xx: at the moment I'm not sure whether this is doable.)
Do you see any flaws in these designs? Any other ideas how to do this efficiently?
My other concern is the Linux scheduler: will it be able to distribute the processes properly? There is taskset to set CPU affinity for a process; does it make sense to manually control the allocation? (I think it does.)
Also, what's the proper way to measure the CPU usage of a process? There are /proc/PID/stat and getrusage, but both of them return used CPU time, whereas I need a percentage. (Note: this question has the lowest priority; if there's no response I will just check the source of top.) I know I can monitor the cores with mpstat.
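Roughly what I have in mind for the percentage part (a sketch: it samples /proc/PID/stat twice and assumes the utime/stime field positions from proc(5)):

#include <stdio.h>
#include <unistd.h>

/* Sum of utime + stime (fields 14 and 15 of /proc/PID/stat), in clock ticks.
   Note: a comm field containing spaces would need more careful parsing. */
static long long cpu_ticks(pid_t pid)
{
    char path[64];
    unsigned long long utime = 0, stime = 0;
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
           &utime, &stime);
    fclose(f);
    return (long long)(utime + stime);
}

/* CPU usage of 'pid' over 'interval_sec' seconds, as a percentage of one core. */
static double cpu_percent(pid_t pid, unsigned interval_sec)
{
    long ticks_per_sec = sysconf(_SC_CLK_TCK);
    long long before = cpu_ticks(pid);
    sleep(interval_sec);
    long long after = cpu_ticks(pid);
    if (before < 0 || after < 0) return -1.0;
    return 100.0 * (double)(after - before)
                 / ((double)ticks_per_sec * (double)interval_sec);
}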
Perhaps I am missing something, but why do you need to group the video streams in fixed sets?
From my understanding of the problem you will be essentially sampling each stream and processing the samples. If I were implementing something like this I would place all streams in a work queue, preferably one that supports work stealing to minimize thread starvation.
Each worker thread would get a stream object/descriptor/URI/whatever from the head of the queue, sample and process it, then move it back to the end of the queue.
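In its simplest single-machine form (ignoring work stealing and the distributed case), that loop could look something like the sketch below, where sample_and_process() and the counts are placeholders:

#include <pthread.h>
#include <unistd.h>

#define NUM_STREAMS 10000
#define NUM_WORKERS 8

static const char     *stream_urls[NUM_STREAMS];   /* filled in from your config */
static int             next_stream = 0;            /* index of the next stream to sample */
static pthread_mutex_t queue_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder: whatever "monitor this stream for a while" means in practice. */
static void sample_and_process(const char *url)
{
    (void)url;
    sleep(1);          /* stands in for: connect, grab a few frames, analyze, report */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        const char *url = stream_urls[next_stream];
        next_stream = (next_stream + 1) % NUM_STREAMS;   /* back to the tail */
        pthread_mutex_unlock(&queue_lock);

        sample_and_process(url);   /* workers stay busy; no fixed sets of streams */
    }
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_join(workers[i], NULL);
    return 0;
}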
CPU utilization should not be an issue, unless a single stream cannot always saturate a single core due to real-time constraints. If the latency while processing each sample is not an issue, then you have a few alternatives:
Use a larger number of processing threads, until all cores are fully utilized in all cases.
Use separate input threads to receive stream chunks and pass those for processing. This should decouple the network latencies from the actual stream processing.
I am not aware of any worker queue implementation for distributed systems (as opposed to mere SMP systems), but it should be relatively easy to build one of your own if you don't find something that fits your needs...
I'm currently working on an audio recording application, that fetches up to 8 audio streams from the network and saves the data to the disk (simplified ;) ).
Right now, each stream is handled by one thread, and the same thread also does the saving work to the disk.
That means I have 8 different threads performing writes on the same disk, each one into a different file.
Do you think there would be an increase in disk I/O performance if all the writing work were done by one common thread (which would sequentially write the data into the individual files)?
OS is an embedded Linux, the "disk" is a CF card, the application is written in C.
Thanks for your ideas
Nick
The short answer: Given that you are writing to a Flash disk, I wouldn't expect the number of threads to make much difference one way or another. But if it did make a difference, I would expect multiple threads to be faster than a single thread, not slower.
The longer answer:
I wrote a similar program to the one you describe about 6 years ago -- it ran on an embedded PowerPC Linux card and read/wrote multiple simultaneous audio files to/from a SCSI hard drive. I originally wrote it with a single thread doing I/O, because I thought that would give the best throughput, but it turned out that that was not the case.
In particular, when multiple threads were reading/writing at once, the SCSI layer was aware of all the pending requests from all the different threads, and was able to reorder the I/O requests such that seeking of the drive head was minimized. In the single-thread-IO scenario, on the other hand, the SCSI layer knew only about the single "next" outstanding I/O request and thus could not do that optimization. That meant extra travel for the drive head in many cases, and therefore lower throughput.
Of course, your application is not using SCSI or a rotating drive with heads that need seeking, so that may not be an issue for you -- but there may be other optimizations that the filesystem/hardware layer can do if it is aware of multiple simultaneous I/O requests. The only real way to find out is to try various models and measure the results.
My suggestion would be to decouple your disk I/O from your network I/O by moving your disk I/O into a thread-pool. You can then vary the maximum size of your I/O-thread-pool from 1 to N, and for each size measure the performance of the system. That would give you a clear idea of what works best on your particular hardware, without requiring you to rewrite the code more than once.
If it's embedded Linux, I guess your machine has only one processor/core. In this case threads won't improve I/O performance at all. Of course the Linux block subsystem works well in a concurrent environment, but in your case (if my guess about the number of cores is right) there can't be a situation where several threads do something simultaneously.
If my guess is wrong and you have more than 1 core, then I'd suggest to benchmark disk I/O. Write a program that writes a lot of data from different threads and another program that does the same from only one thread. The results will show you everything you want to know.
I think there is no big difference between a multithreaded and a single-threaded solution in your case, but with multithreading you can synchronize between the receiving threads, and no single thread can affect the others if it blocks in some system call.
I did practically the same thing on an embedded system. The problem was high CPU usage when the kernel flushed many cached dirty pages to the CF card: the pdflush kernel process took all the CPU time at that moment, and if you receive the stream via UDP, packets can be dropped because the CPU is busy when the UDP data arrives. I solved that problem by calling fdatasync() every time a smallish amount of data had been received.
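Roughly like this (a sketch; the 64 KB threshold is just an example figure to tune for your card):

#include <unistd.h>

#define FLUSH_THRESHOLD (64 * 1024)    /* example figure: flush after ~64 KB */

/* Write a received chunk, forcing it to the CF card once enough has accumulated,
   so pdflush never has a huge backlog of dirty pages to write out at once. */
static ssize_t write_chunk(int fd, const void *buf, size_t len, size_t *since_sync)
{
    ssize_t written = write(fd, buf, len);
    if (written > 0) {
        *since_sync += (size_t)written;
        if (*since_sync >= FLUSH_THRESHOLD) {
            fdatasync(fd);             /* push the dirty pages out now, in small doses */
            *since_sync = 0;
        }
    }
    return written;
}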