When a CPU "multitasks", it rapidly switches between the threads of various processes, simulating what looks like truly parallel execution.
If there are many interrupts while a program is running, you may not notice any difference in that program's performance, since it only needs a thread while it is actively processing and otherwise lets the CPU idle.
What about the case of audio? Suppose we're talking about MP3 decompression. The CPU would decompress the MP3 and stream it at some bitrate to the DAC. Since the CPU is still constantly switching between threads while decompressing the MP3, what happens if there are too many interrupts while an audio file is playing? How is it possible that I can listen to an audio file completely uninterrupted while other performance-heavy tasks are running, nearly freezing the computer?
Related
I have a real-time process sending occasional communication over RS232 to a high speed camera. I have several other real-time processes occupying a lot of CPU time, doing image processing on several GPU boards using CUDA. Normally the serial communication is very fast, with a message and response taking about 50 ms every time. However, when the background processes are busy doing image processing, the serial communication slows way down, often taking multiple seconds (sometimes more than 10 seconds).
In summary, during serial communication, Process A is delayed if Process B, C, etc., are very busy, even though process A has the highest priority:
Process A (real-time, highest priority): occasional serial communication
Processes B, C, D, etc. (real-time, lower priority): heavy CPU and GPU processing
When I change the background processes to be SCHED_OTHER (non-real-time) processes, the serial communication is fast; however, this isn't a solution for me, because the background processes need to be real-time processes (when they are not, the GPU processing doesn't keep up adequately with the high speed camera).
Apparently the serial communication is relying on some non-real-time process in the system, which is being pre-empted by my real-time background processes. I think if I knew which process was being used for serial communication, I could increase its priority and solve the problem. Does anyone know whether serial communication relies on any particular process running on the system?
I'm running RHEL 6.5, with the standard kernel (not PREEMPT_RT). It has dual 6-core CPUs.
At Erki A's suggestion, I captured an strace. Apparently it is a select() system call which is slow (the "set roi2" is the command to the camera, and the "Ok!" at the end is the response from the camera):
write(9, "set roi2"..., 26) = 26 <0.001106>
ioctl(9, TCSBRK, 0x1) = 0 <0.000263>
select(10, [9], NULL, NULL, {2, 0}) = 1 (in [9], left {0, 0}) <2.252840>
read(9, "Ok!\r\n", 4096) = 5 <0.000092>
The slow select() makes it seem like the camera itself is slow to respond. However, I know that isn't true, because of how the speed is impacted by changing the background process priorities. Is select() in this case dependent on a certain other process running?
If I skip the select() and just do the read(), the read() system call is the slow one.
Depending on your serial device/driver, the serial communication is most likely relying on a kernel worker thread (kworker) to shift the incoming serial data from the interrupt service routine buffers to the line discipline buffers. You could increase the priority of the kernel worker thread; however, worker threads service a shared work queue, so raising the worker thread's priority would boost the serial processing along with a whole bunch of other work that probably doesn't need the boost.
You could modify the serial driver to use a dedicated high-priority work queue rather than the shared one. Another option would be to use a tasklet; however, both of these require driver-level modifications.
I suspect the most straightforward solution would be to set the serial port to low-latency mode, either from the command line via the setserial command:
setserial /dev/ttySxx low_latency
or programmatically:
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/serial.h>

struct serial_struct serinfo;
int fd = open("/dev/ttySxx", O_RDWR);
ioctl(fd, TIOCGSERIAL, &serinfo);
serinfo.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serinfo);
close(fd);
This will cause the serial port interrupt handler to transfer the incoming data to the line discipline immediately rather than deferring the transfer by adding it to a work queue. In this mode, when you call read() from your application, you will avoid the possibility of the read() call sleeping, which it would otherwise do, if there is work in the work queue to flush. This sleep is probably the cause of your intermittent delays.
You can use strace to see where it locks up. If it is more than 10 seconds, it should be easy to see.
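For example, attaching to the already-running process and timing every system call (here <pid> stands for your process's PID, and -T appends the time spent in each call):
strace -T -tt -p <pid> -o serial.trace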
I'm studying operating systems and have had this question in my head for almost a whole week, and I couldn't find the answer in the book (Operating System Concepts, Silberschatz).
The question is: how does the operating system deal with a process that demands many actions to be executed? For example, to play a video on a computer, the video must be processed, the audio must be processed, the video has to be sent to the monitor (an I/O operation), the audio has to be sent to the speakers (an I/O operation), and so on.
The book says that in a computer with a single CPU, a processor can run only one process at a time, so to play the video the operating system would have one thread for each of the operations listed above. The question is: how does the operating system (Linux or Windows) execute them? Does it execute one at a time and switch between them (process the video, send it to the monitor, process the audio, send it to the speakers, and so on) but do it so fast that it is imperceptible, or does it execute them concurrently (process the audio and video at the same time)? I think my main question can be translated as: "Can two threads execute concurrently on a single-CPU computer?"
Any correction and clarification in my understanding of the concepts is welcome.
Does it execute one at a time and switch between them (process the video, send it to the monitor, process the audio, send it to the speakers, and so on) but do it so fast that it is imperceptible, or does it execute them concurrently (process the audio and video at the same time)?
It switches between them fast enough to be imperceptible. On modern operating systems, this is done in three main ways:
Preemption, where the task is simply suspended by the OS kernel's scheduler to run a different process. This is typically done when a fixed amount of time, called a time slice, runs out.
Blocking on I/O: when a process starts to wait for I/O, from the network, from disk, or from most other sources, many operating systems will suspend it immediately. The process resumes running only when the results of the I/O are available.
Cooperative multitasking, when a process indicates to the OS that it is willing to wait.
The details differ on each OS, and differ greatly between desktop OSs, server OSs, and embedded and real-time OSs.
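As a rough C illustration of the second and third mechanisms (the first, preemption, is invisible to the program itself; this is only a sketch, not anything a video player would literally contain):

#include <sched.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* Waiting for I/O: read() blocks until data arrives on stdin,
       and the OS is free to run other processes in the meantime. */
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    (void)n;

    /* Cooperative yield: tell the scheduler we are willing to give up
       the rest of our time slice right now. */
    sched_yield();
    return 0;
}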
Can two threads execute concurrently on a single-CPU computer?
Check out this SO question on concurrency vs. parallelism.
I have around 10k video streams that I want to monitor. There's going to be a small cluster (e.g. 5-10) of heterogeneous machines that monitor these streams. Because there isn't enough CPU to do all this at once, I will have to shuffle the streams: monitor a couple of them at a time, then switch to the next set.
Now, my problem is: I would like to utilize the cores as much as possible, so that I can use fewer sets and thereby monitor each stream more often.
Streams have different resolutions, and consequently different CPU usage.
A relatively simple solution would be to measure the CPU usage of the highest-bitrate stream on each machine (different CPUs, different usage). If it's 10% and I have 4 cores, I can safely run 9*4=36 processes at a time on that machine. But this would clearly waste a lot of CPU power, as the other streams have lower bitrates.
A better solution would be to constantly monitor the usage of the cores and, if utilization is below a threshold (e.g. 95-10=85%), start a new process.
A more complex approach would be to start a new process with nice -n 20, then somehow check whether it is able to keep up with the data (xx); if so, renice it to normal priority and try the same thing with the next process... (xx: at the moment I'm not sure whether this is doable.)
Do you see any flaws in these designs? Any other ideas how to do this efficiently?
My other concern is the Linux scheduler: will it be able to distribute the processes properly? There is taskset to set the CPU affinity for a process; does it make sense to control the allocation manually? (I think it does.)
Also, what's the proper way to measure the CPU usage of a process? There is /proc/PID/stat and getrusage, but both of them return used CPU time, whereas I need a percentage. (Note: this question has the lowest priority; if there's no response I will just check the source of top.) I know I can monitor the cores with mpstat.
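For reference, one way to turn those counters into a percentage is to sample utime+stime from /proc/PID/stat twice and divide the delta by the elapsed clock ticks. A rough sketch (field offsets per proc(5); error handling mostly omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static unsigned long cpu_ticks(pid_t pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);

    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof buf, f)) { if (f) fclose(f); return 0; }
    fclose(f);

    /* comm (field 2) may contain spaces, so parse from the last ')' */
    char *p = strrchr(buf, ')');
    unsigned long utime = 0, stime = 0;
    sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
           &utime, &stime);
    return utime + stime;                 /* total CPU time, in clock ticks */
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    pid_t pid = (pid_t)atoi(argv[1]);
    long hz = sysconf(_SC_CLK_TCK);       /* clock ticks per second */
    unsigned interval = 1;                /* sampling interval, seconds */

    unsigned long t0 = cpu_ticks(pid);
    sleep(interval);
    unsigned long t1 = cpu_ticks(pid);

    /* percentage of one core used over the interval */
    printf("%.1f%%\n", 100.0 * (double)(t1 - t0) / (double)(hz * interval));
    return 0;
}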
Perhaps I am missing something, but why do you need to group the video streams in fixed sets?
From my understanding of the problem you will be essentially sampling each stream and processing the samples. If I were implementing something like this I would place all streams in a work queue, preferably one that supports work stealing to minimize thread starvation.
Each worker thread would get a stream object/descriptor/URI/whatever from the head of the queue, sample and process it, then move it back at the end of the queue.
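A minimal sketch of that worker loop with POSIX threads (stream_t, sample_and_process(), and the counts are placeholders; a real implementation would use per-worker deques with work stealing rather than this single mutex-protected rotation):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NSTREAMS 64
#define NWORKERS 4

typedef struct { int id; } stream_t;        /* placeholder stream descriptor */

static stream_t streams[NSTREAMS];
static int head = 0;                        /* next stream in the rotation */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void sample_and_process(stream_t *s)
{
    /* grab a sample from the stream and analyse it here */
    printf("processing stream %d\n", s->id);
    usleep(10000);                          /* stand-in for real work */
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        stream_t *s = &streams[head];       /* take the head of the queue... */
        head = (head + 1) % NSTREAMS;       /* ...which puts it back at the end */
        pthread_mutex_unlock(&lock);

        sample_and_process(s);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];

    for (int i = 0; i < NSTREAMS; i++)
        streams[i].id = i;
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}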
CPU utilization should not be an issue, unless a single stream cannot always saturate a single core due to real-time constraints. If latency while processing each sample is not an issue, then you have a few alternatives:
Use a larger number of processing threads, until all cores are fully utilized in all cases.
Use separate input threads to receive stream chunks and pass those for processing. This should decouple the network latencies from the actual stream processing.
I am not aware of any worker queue implementation for distributed systems (as opposed to mere SMP systems), but it should be relatively easy to build one of your own if you don't find something that fits your needs...
I'm currently working on an audio recording application, that fetches up to 8 audio streams from the network and saves the data to the disk (simplified ;) ).
Right now, each stream is handled by one thread, and that same thread also writes the data to disk.
That means I have 8 different threads performing writes to the same disk, each into a different file.
Do you think disk I/O performance would increase if all the writing were done by one common thread (which would sequentially write the data into the individual files)?
OS is an embedded Linux, the "disk" is a CF card, the application is written in C.
Thanks for your ideas
Nick
The short answer: Given that you are writing to a Flash disk, I wouldn't expect the number of threads to make much difference one way or another. But if it did make a difference, I would expect multiple threads to be faster than a single thread, not slower.
The longer answer:
I wrote a similar program to the one you describe about 6 years ago -- it ran on an embedded PowerPC Linux card and read/wrote multiple simultaneous audio files to/from a SCSI hard drive. I originally wrote it with a single thread doing I/O, because I thought that would give the best throughput, but it turned out that that was not the case.
In particular, when multiple threads were reading/writing at once, the SCSI layer was aware of all the pending requests from all the different threads, and was able to reorder the I/O requests such that seeking of the drive head was minimized. In the single-thread-IO scenario, on the other hand, the SCSI layer knew only about the single "next" outstanding I/O request and thus could not do that optimization. That meant extra travel for the drive head in many cases, and therefore lower throughput.
Of course, your application is not using SCSI or a rotating drive with heads that need seeking, so that may not be an issue for you -- but there may be other optimizations that the filesystem/hardware layer can do if it is aware of multiple simultaneous I/O requests. The only real way to find out is to try various models and measure the results.
My suggestion would be to decouple your disk I/O from your network I/O by moving your disk I/O into a thread-pool. You can then vary the maximum size of your I/O-thread-pool from 1 to N, and for each size measure the performance of the system. That would give you a clear idea of what works best on your particular hardware, without requiring you to rewrite the code more than once.
If it's embedded Linux, I guess your machine has only one processor/core. In that case threads won't improve I/O performance at all. Of course the Linux block subsystem works well in a concurrent environment, but in your case (if my guess about the number of cores is right) there can never be a situation where several threads are doing something simultaneously.
If my guess is wrong and you have more than one core, then I'd suggest benchmarking the disk I/O: write a program that writes a lot of data from different threads and another that does the same from only one thread. The results will show you everything you want to know.
I think there is no big difference between a multithreaded and a single-threaded solution in your case, but with multithreading you can synchronize between the receiving threads, and no single thread can stall the others if it blocks in some system call.
I did practically the same thing on an embedded system. The problem was high CPU usage when the kernel flushed many cached dirty pages to the CF card: the pdflush kernel process took all the CPU time at that moment, and if you receive the stream via UDP, packets can be dropped because the CPU is busy when the UDP data arrives. I solved that problem by calling fdatasync() every time a smallish amount of data had been received.
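Roughly like this (a sketch; the flush threshold is something you would tune for your card and bitrate):

#include <stddef.h>
#include <unistd.h>

#define FLUSH_EVERY (256 * 1024)   /* flush after ~256 KiB of new data */

static size_t unsynced = 0;

void store_chunk(int fd, const void *buf, size_t len)
{
    write(fd, buf, len);           /* error handling omitted in this sketch */
    unsynced += len;

    if (unsynced >= FLUSH_EVERY) {
        fdatasync(fd);             /* push the dirty pages out now, in a small batch */
        unsynced = 0;
    }
}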
When working with audio playback I am used to the following pattern:
one disk (or network) thread which reads data from disk (or the network) and fills a ringbuffer
one audio thread which reads data from the ringbuffer, possibly performs DSP, and writes to audio hardware
(pull or push API)
This works fine, and there's no issue when working with, say, a WAV file.
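Concretely, the hand-off between the two threads is roughly a single-producer/single-consumer ring buffer, something like this sketch (sizes are arbitrary, and a real implementation needs atomics or memory barriers if the threads run on different cores):

#include <stddef.h>

#define RING_SIZE (1 << 16)                 /* bytes; keep it a power of two */

typedef struct {
    unsigned char data[RING_SIZE];
    size_t read_pos;                        /* advanced only by the audio thread */
    size_t write_pos;                       /* advanced only by the disk/network thread */
} ringbuf_t;

/* disk/network thread: queue decoded (or raw) bytes, as space allows */
size_t ring_write(ringbuf_t *rb, const unsigned char *src, size_t len)
{
    size_t n = 0;
    while (n < len && rb->write_pos - rb->read_pos < RING_SIZE) {
        rb->data[rb->write_pos % RING_SIZE] = src[n++];
        rb->write_pos++;
    }
    return n;                               /* bytes actually queued */
}

/* audio thread: pull bytes out for playback / DSP */
size_t ring_read(ringbuf_t *rb, unsigned char *dst, size_t len)
{
    size_t n = 0;
    while (n < len && rb->read_pos != rb->write_pos) {
        dst[n++] = rb->data[rb->read_pos % RING_SIZE];
        rb->read_pos++;
    }
    return n;                               /* bytes actually delivered */
}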
Now, if the source data is encoded in a compressed format, like Vorbis or MP3, decoding takes some time.
And it seems like it's quite common to perform decoding in the disk/network thread.
But isn't this the wrong design? While disk or network access blocks, some CPU time is available for decoding, but it is wasted if decoding happens in the same thread.
It seems to me that if the network becomes slow, the risk of buffer underruns is higher if decoding happens sequentially.
So, shouldn't decoding be performed in the audio thread?
In my context, I would prefer to avoid adding a dedicated decoding thread. It's for mobile platforms and SMP is pretty rare right now. But please tell if a dedicated decoding thread really makes sense to you.
It's more important for the audio thread to be available to play audio smoothly than for the network thread to maintain a perfectly sized buffer. If you're only using two threads, then the decoding should be done on the network thread. If you decoded on the playing thread, the time could come when you need to push more audio out to the hardware but the thread is busy decoding. It's better to maintain a buffer of already-decoded audio.
Ideally you would use three threads: one for reading the network, one for decoding, and one for playing. In our application that handles audio/video capture, recording, and streaming, we have eight threads per stream (recently increased from six when we added new functionality). It's much easier for each thread to have its own job, and then each can measure its performance against the state of its incoming/outgoing buffers. This also benefits profiling and optimization.
If your device has a single CPU, all threads share it. OS thread switching is usually very efficient (you won't lose any meaningful CPU power to the switching). Therefore, you should create more threads if doing so simplifies your logic.
In your case, there is a pipeline. A separate thread for each stage of the pipeline is a good pattern. The alternative, as you note, involves complex logic, synchronization, events, interrupts, or whatever. Sometimes there are no good alternatives at all.
Hence my suggestion: create a dedicated thread for the audio decoding.
If you have more than a single CPU, you'll gain even more efficiency by using one thread for each pipeline step.