I would like to run a single low-latency task (for audio, via ALSA/JACK) on a separate core of an embedded Linux system. Removing scheduler ticks and other interrupts from that core might be the key here.
I have found several approaches so far, e.g. cpusets and an offline scheduler from 2009 (which unfortunately does not support user-space tasks).
Is there a newer/more convenient way to achieve this?
The topic you are looking for is called "CPU affinity". There are two main aspects to CPU affinity: the affinity of processes and the affinity of interrupts.
To my (admittedly limited) knowledge:
Processes are assigned to CPUs using the taskset command. (The affinity is inherited by child processes.)
The interrupt-to-CPU assignment on Linux can be manipulated through /proc/irq/<n>/smp_affinity. To verify the effectiveness of the assignment, check /proc/interrupts to see which CPUs serve which interrupts.
In your particular case, you want to reserve a single CPU (i.e. core) for your critical application, for example CPU0. That means that all processes and interrupts should be assigned to everything but CPU0, using an affinity mask that has bit 0 (== CPU0) cleared, e.g. 0xfffffffe. Your critical application would then have an affinity mask of 0x1, meaning that it is allowed to run only on CPU0.
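For the interrupt side, the mask can be written from a small program just as well as from the shell. A minimal sketch; the IRQ number 19 is a made-up example (look up your sound card's real IRQ in /proc/interrupts), and writing the file requires root:

```c
/* Sketch: steer a (hypothetical) IRQ away from CPU0 by writing an
 * affinity mask to /proc/irq/<n>/smp_affinity. IRQ 19 is an
 * assumption -- find the real number in /proc/interrupts.
 * Must be run as root. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/irq/19/smp_affinity", "w"); /* hypothetical IRQ */
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Bit 0 (CPU0) cleared: the IRQ may be serviced by any CPU except CPU0. */
    fputs("fffffffe\n", f);
    if (fclose(f) != 0) {       /* write errors surface on close */
        perror("fclose");
        return 1;
    }
    return 0;
}
```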
Additionally, you might need to use the sched_setscheduler syscall in the application to set the scheduling to one of the real-time policies. That might improve the latencies of your application (but it can also make them worse).
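Put together, a minimal sketch of what the critical application itself could do at startup; the SCHED_FIFO priority of 80 is an arbitrary assumption, and sched_setscheduler() requires root or CAP_SYS_NICE:

```c
/* Sketch: pin the calling process to CPU0 and request a real-time
 * scheduling policy. Priority 80 is an arbitrary choice. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 /* affinity mask 0x1: CPU0 only */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    struct sched_param sp = { .sched_priority = 80 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... low-latency audio work would run here ... */
    return 0;
}
```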
Note that tuning CPU affinity is not a trivial endeavor, and clear-cut solutions are rare. You will need to test and experiment to make sure the configuration can sustain the performance you need. For example, your application is likely to communicate with other processes. If the communication is synchronous, and the other processes are slow to react (because they have limited CPU resources), that will in turn negatively impact the performance of your critical application. The same applies to the interrupt(s) of the sound card.
Hope that helps.
Related
I'm running an application with multiple threads, and it seems Linux is distributing the threads among the NUMA nodes almost equally. Say my application spawns 4 threads and my machine has 4 sockets: I observe that each thread is assigned to a different NUMA node.
Is there any reason for this? Why not assign them all to one socket and then fill the next one?
The best binding for an application depends on what the application does. It is often a good idea to spread threads across the NUMA nodes so as to maximize memory throughput, since all NUMA nodes can then theoretically be used (assuming the application is well written and NUMA-aware). If all threads are bound to the same NUMA node, then only the memory of that node can be accessed efficiently (access to the memory of other NUMA nodes is possible but slower, and pages will not automatically be mapped efficiently, due to the first-touch policy, which is generally the default on most machines). When some threads communicate a lot, it is often better to put them on the same NUMA node so as not to pay latency overheads. In some cases it can even be better to put them on the same core (but different hardware threads) to speed up synchronization operations like locks and atomics.
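To make the first-touch point concrete, here is a small sketch using libnuma (link with -lnuma; it assumes node 0 exists): the thread binds itself to node 0, touches a freshly allocated page, and then asks the kernel which node backs that page:

```c
/* Sketch of the first-touch policy using libnuma (link with -lnuma).
 * The thread that first writes a page determines which NUMA node
 * backs it. Node 0 is an assumption; adjust for your machine. */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    numa_run_on_node(0);              /* bind this thread to node 0 */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);
    buf[0] = 1;                       /* first touch: the page is placed now */

    void *pages[1] = { buf };
    int status[1];
    /* Passing nodes == NULL turns move_pages() into a placement query. */
    if (move_pages(0, 1, pages, NULL, status, 0) == 0)
        printf("page resides on node %d\n", status[0]);

    free(buf);
    return 0;
}
```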
If you want the scheduling and the binding to be efficient, you need to provide more information to the OS or do it yourself. I strongly advise you to bind threads to specific cores. This is easy with HPC runtimes/tools like OpenMP (but a pain if your application uses low-level threads, unless you do not care about platform portability). As for NUMA, you can specify the memory policy using numactl.
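For the binding itself, a minimal sketch with raw pthreads; the one-thread-per-core numbering is illustrative, and real code should discover the topology first (e.g. with hwloc):

```c
/* Sketch: pin each worker thread to its own core before it starts,
 * via pthread_attr_setaffinity_np(). Core numbering is illustrative;
 * real code should discover the topology (e.g. via hwloc). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running on its pinned core\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);        /* thread i -> core i */

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```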
In practice, HPC applications generally use manual binding to improve performance. OS schedulers are generally not very good at binding threads automatically and efficiently. A few years ago there were even bugs in the scheduler causing inefficient behaviour: see The Linux Scheduler: A Decade of Wasted Cores. To my knowledge, such problems are not uncommon in this field and are not restricted to Linux. Efficient NUMA-aware OS scheduling is far from easy.
I'm a kernel noob, schedulers included. I understand that there is an I/O scheduler and a task scheduler, and according to this post the I/O scheduler uses normal tasks that are handled by the task scheduler in the end.
So if I run a user-space thread that was assigned to an isolated core (using isolcpus) and it does some I/O operation, will the task created by the I/O scheduler get executed on the isolated core?
Since CFS seems to favor user interaction, does this mean that CPU-intensive threads might get less CPU time in the long run?
Can isolating cores help mitigate this issue?
Can isolating cores decrease the scheduling latency (the time it takes for a thread that was marked runnable to get executed) for the threads that are pinned to the isolated cores?
So if I run a user-space thread that was assigned to an isolated core (using isolcpus) and it does some I/O operation, will the task created by the I/O scheduler get executed on the isolated core?
What isolcpus does is take that particular core out of the kernel's list of CPUs on which it can schedule tasks. Once you isolate a CPU from the kernel's list, the kernel will never schedule any task on that core by default, no matter whether the core is idle or being used by some other process/thread; a task only runs there if it is explicitly pinned to that core.
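The flip side is that your own latency-sensitive thread has to be placed on the isolated core explicitly, either with taskset or programmatically. A minimal sketch, assuming core 2 was isolated at boot with isolcpus=2:

```c
/* Sketch: move the calling thread onto a core that was isolated at
 * boot (isolcpus=2 is an assumption). isolcpus only removes the core
 * from the default scheduling mask; an explicit affinity call like
 * this is what actually places a task there. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                 /* the isolated core */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now restricted to the isolated core\n");
    return 0;
}
```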
Since CFS seems to favor user interaction, does this mean that CPU-intensive threads might get less CPU time in the long run? Can isolating cores help mitigate this issue?
Isolating CPUs serves a different purpose altogether, in my opinion. Basically, if your application has both fast threads (threads that make no system calls and are latency-sensitive) and slow threads (threads that make system calls), you would want dedicated CPU cores for the fast threads so that they are not interrupted by the kernel's scheduling and can run to completion without any noise. On the other hand, slow threads, or threads that are not really latency-sensitive and do supporting logic for your application, need not have dedicated CPU cores. We do this all the time in our organization.
Can isolating cores decrease the scheduling latency (the time it takes for a thread that was marked runnable to get executed) for the threads that are pinned to the isolated cores?
Since you are taking CPUs out of the kernel's list, this will surely impact the other threads and processes; but then again, you should pay extra thought and attention to which code really is latency-sensitive and separate it from your non-latency-sensitive code.
Hope it helps.
I recently stumbled upon the question above but I am not sure if I understand what it is asking.
How would one avoid the use of scheduling policies?
I would think that there isn't any other way...
Scheduling policy has nothing to do with resource allocation! Processes simply get scheduled, and thereby receive resources such as CPU time.
From "Resource allocation (computer)" description on Wikipedia :-
When the user opens any program this will be counted as a process, and therefore requires the computer to allocate certain resources for it to be able to run. Such resources could be access to a section of the computer's memory, data in a device interface buffer, one or more files, or the required amount of processing power.
I don't know how you got the two confused. Every process will, at one time or another, get scheduled; unless the scheduler is an unfair one.
EDIT:
How would one avoid the use of scheduling policies?
If there is more than one user process to be executed, then one has to apply a scheduling policy so that the processes get executed in some order. There has to be a queue to hold all the processes. See the different case of BareMetal OS below.
Then there is BareMetal OS, which is a single-address-space OS.
Multitasking on BareMetal is unusual for operating systems in this day and age. BareMetal uses an internal work queue that all CPU cores poll. A task added to the work queue will be processed by any available CPU core in the system and will execute until completion, which results in no context switch overhead.
So BareMetal OS doesn't use any scheduling policy; it is based on the cores polling the work queue.
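As a rough user-space analogy of that model (this is not BareMetal's actual code), worker threads polling a shared queue and running each task to completion could look like this:

```c
/* Rough user-space analogy of BareMetal's model: worker threads poll
 * a shared queue and run each task to completion, with no preemption
 * between tasks. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   8

typedef void (*task_fn)(int);

static void hello(int id) { printf("task %d ran to completion\n", id); }

static task_fn queue[NTASKS];
static int head = 0, tail = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        task_fn task = NULL;
        int id = -1;

        pthread_mutex_lock(&lock);
        if (head < tail) {            /* poll the queue */
            id = head;
            task = queue[head++];
        }
        pthread_mutex_unlock(&lock);

        if (!task)
            break;                    /* drained; a real OS would keep polling */
        task(id);                     /* run to completion, no context switch */
    }
    return NULL;
}

int main(void)
{
    for (tail = 0; tail < NTASKS; tail++)
        queue[tail] = hello;

    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```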
Like in process management and memory management.
Are the scheduler and memory manager implemented as kernel threads that are run on the CPU the moment they are needed? If not, how does the kernel treat them?
Are they like processes, tasks, or some line of code that gets executed when needed?
Some are, some aren't. The terms "process management" and "memory management" are kind of broad and cover a fair bit of kernel code.
For memory management, a call to mmap() just requires changing some data structures and can be done by the current thread, but if pages need to be swapped out, that is done by kswapd, which is a kernel thread.
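As an illustration of how cheap the mmap() bookkeeping is, the following sketch maps a region and only pays for physical pages when it touches them; the size is arbitrary:

```c
/* Illustration: mmap() itself only sets up kernel data structures;
 * physical pages are allocated lazily, on the page fault triggered
 * by the first write to each page. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;    /* 16 MiB, no physical memory yet */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(p, 0xff, len);             /* faults the pages in now */
    printf("mapped and touched %zu bytes\n", len);

    munmap(p, len);
    return 0;
}
```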
You might consider the scheduler a special case: since the scheduler is responsible for scheduling all threads, it itself is not a thread and does not execute on any thread (otherwise it would need to schedule itself... but how would it schedule itself, if it had to schedule itself first in order to do that?). You might think of the scheduler as running directly on each processor core when necessary.
Taking CPU affinity into account, will such an environment be useful with threading? Or will there be performance degradation in such a system if multiple users log in and spawn multiple kernel and user threads?
When you say "taking CPU affinity into account" - are you saying that all processes have CPU affinity in this hypothetical system? Or is that just one extra possible bit of information?
Using multiple threads will slow things down a bit if the system is already loaded (so that there are more runnable threads than cores), but if there are often times when only (say) 2 users are active and 4 cores are available, threading may help.
Another typical use for threads is to do something "in the background", whether that's explicitly using threads or using async calls. At that point multi-threading can definitely give a benefit (e.g. a non-hanging UI) without actually using more than one core simultaneously for much of the time.
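A minimal sketch of that background-work pattern with plain pthreads (the job itself is a stand-in):

```c
/* Sketch of the background-work pattern: a worker thread does a slow
 * job while the main thread stays responsive. Names are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *background_job(void *arg)
{
    (void)arg;
    sleep(2);                         /* stand-in for slow I/O or computation */
    puts("background job finished");
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, background_job, NULL);

    /* The "UI" keeps responding while the job runs. */
    for (int i = 0; i < 4; i++) {
        puts("main thread still responsive");
        sleep(1);
    }

    pthread_join(tid, NULL);
    return 0;
}
```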