Disadvantage(s) of preempt_rt [closed] - linux

the target hardware platform has limited horsepower, and/or you want
the real-time job to put the smallest possible overhead on your
system. This is where dual kernels are usually better than a native
preemption system.
From here: http://www.xenomai.org/index.php/Xenomai:Roadmap#Xenomai_3_FAQ
PREEMPT_RT makes the whole of Linux preemptible. In what way does preempting Linux put load on the system?
The FAQ there discusses PREEMPT_RT as compared to Xenomai.

CONFIG_PREEMPT_VOLUNTARY -
This option introduces checks at the most common causes of long latencies in kernel code, so that the kernel can voluntarily yield control to a higher-priority task waiting to execute. It is said to reduce the occurrences of long latencies to a great degree, but it still doesn't eliminate them entirely.
CONFIG_PREEMPT_RT -
This option makes all kernel code outside of regions protected by raw spinlocks (raw_spinlock_t) eligible for involuntary preemption by higher-priority kernel threads. Locks created with spinlock_t and rwlock_t become preemptible sleeping locks, and most interrupt handlers are moved into preemptible kernel threads. With this option, worst-case latency typically drops well below a millisecond (the often-quoted single-digit-millisecond figure applies to plain CONFIG_PREEMPT, discussed further down).
Disadvantage -
The normal Linux kernel allows a task to be preempted by a higher-priority task only while user-space code is executing.
To reduce latency, the CONFIG_PREEMPT_RT patch forces the kernel to involuntarily preempt the task at hand as soon as a higher-priority kernel task becomes runnable. This is bound to reduce the overall throughput of the system, since there will be many more context switches and the lower-priority tasks won't get much of a chance to run.
Source:
https://rt.wiki.kernel.org/index.php/Frequently_Asked_Questions
Description of technical terms used:
What is latency?
The time elapsed between a demand issued on a computer system and the beginning of the response to that demand is called latency or response time.
Kinds of latencies:
Interrupt Latency:
The time elapsed between the generation of an interrupt and the start of the execution of the corresponding interrupt handler.
Example: When a hardware device completes a task, it generates an interrupt. The interrupt identifies the work to be done and the interrupt handler to be executed; the interrupt handler then performs that work.
Scheduling Latency:
It is the time between a wakeup signaling that an event has occurred and the kernel scheduler getting an opportunity to schedule the thread that is waiting for the wakeup to occur (the response). Scheduling latency is also known as dispatch latency.
Worst-case Latency:
The maximum amount of time that can elapse between a demand issued on a computer system and the beginning of the response to that demand.
What is throughput?
The amount of work that a computer can do in a given period of time is called throughput.
What is Context switch?
Context switch is the switching of the CPU from one process/thread to another. Context switches can occur only in kernel mode. This is the process of saving the current execution state of the process (for resuming execution later on), and loading the saved state of the new process/thread for execution.
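To make these definitions concrete, here is a minimal, cyclictest-style measurement loop: it programs an absolute wakeup time, sleeps, and records how late the wakeup actually arrives, which is exactly the scheduling latency defined above. This is only an illustrative sketch; in practice the cyclictest tool from the rt-tests suite is the standard way to measure this.

/* Sketch: measure scheduling latency by sleeping to an absolute
 * deadline and checking how late the wakeup is. Run it with and
 * without an RT scheduling policy, under load, to see the difference. */
#include <stdio.h>
#include <time.h>

#define INTERVAL_NS 1000000L            /* ask to wake up every 1 ms */

static long long ts_ns(const struct timespec *t)
{
    return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
    struct timespec next, now;
    long long max_lat = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 10000; i++) {
        next.tv_nsec += INTERVAL_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        /* Sleep until the absolute deadline 'next'. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long long lat = ts_ns(&now) - ts_ns(&next);  /* how late we woke up */
        if (lat > max_lat)
            max_lat = lat;
    }
    printf("worst observed scheduling latency: %lld ns\n", max_lat);
    return 0;
}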

Adding to top vote answer "lower priority tasks won't be getting much a chance to get through"
That's sort of the whole point (though on a 4+ core system those low-priority tasks could still run, as long as they are forbidden from doing things that would interfere with the critical tasks - which is why it's important to make sure all the connected peripherals play nicely and don't block when the application running the critical thread wants to access them). The critical bit (if, for example, you are building a system that must process external input in a timely way, or testing the behaviour of data conversion with live data rather than a model) is to have a way to tell the kernel where the time-critical input is arriving from.
The problem with current (e.g. Windows) systems is that you might be a "serious gamer or serious musician" who notices things like 150-microsecond jitter. If you have no way to specify that the keyboard, mouse or other human-interface device should be treated as critical priority, then all sorts of "Windows updates" and the like can come along and do their thing, which might in turn trigger activity in the USB controller that has higher priority than the threads handling the input.
I read about a case where audio glitches were resolved by adding a second USB controller with nothing on it except the input device. In a portable setting, you practically need Thunderbolt PCIe passthrough to add a dedicated hub (or FPGA) that can, together with its drivers, override everything else on the system. This is why there aren't many USB products on the market that provide good enough performance for musicians (2 ms round-trip latency with at most 150 microseconds of jitter over a full day without dropouts).

Related

How do I test and / or benchmark traditional Linux Kernel vs Linux Kernel with RT Preempt patch?

I am working on a project to contrast and observe the performance gain of the PREEMPT_RT patch for Linux.
What kind of C programs should I run on the two different kernels to get a good understanding of the benefits that the PREEMPT_RT patch offers?
Looking for suggestions on the programs.
To compare/demonstrate specifically the scheduling characteristics perhaps implement a system where:
An interrupt is generated via a digital input IN
The interrupt handler passes the input event via a semaphore to a high priority user process.
The user process, on receipt of the semaphore, creates a (say) 10 ms pulse on a digital output OUT (a sketch of such a process is given at the end of this answer).
Then:
Drive IN with a series of pulses from a signal generator
Attach an oscilloscope to IN and OUT.
Trigger the scope on the active (interrupt generating) edge of IN
Measure the time and variance between the interrupt-edge on IN and the start of the pulse on OUT.
Trigger the scope on the rising edge of the pulse on OUT.
Measure the length and variance of the pulse width.
Most modern scopes have a "persistence" feature where the trace is not cleared between sweeps. That is useful for measuring the variance.
If you lack a scope or a signal/function generator, you could use a switch, and software timestamps in the ISR and in the user process to log event times. But you would need to ensure in the user task that no preemption occurs between capturing the time and setting the OUT state (by using a critical section), and you will likely need to debounce the switch. That, in this case, would simply be a matter of not setting the semaphore if the last event timestamp was less than, say, 20 ms ago.
If PREEMPT-RT is doing its job, the tests should exhibit lower latency, greater precision and less variance than with the default scheduler regardless of the load of other (lower priority) processes running. If that still does not meet your requirements you may need a real RTOS.
If this is not a characteristic your application requires, then you may not need or benefit from PREEMPT-RT, and inappropriate allocation of process priorities or poor task design may even cause your application to fail to meet its requirements. To make PREEMPT-RT work you have to know what you are doing; it does not magically make your system "real-time"; rather, it facilitates the implementation of real-time systems.
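For reference, here is a minimal sketch of the high-priority user process described in the test above. It assumes details that are not in the original answer: the interrupt is exposed to user space as a readable character device such as a UIO node (/dev/in_irq is a made-up name), and OUT is a GPIO exported via sysfs at /sys/class/gpio/gpio17/value. Run it as root (or with an appropriate rtprio rlimit) so that SCHED_FIFO and mlockall() succeed.

/* Sketch of the high-priority user process: block on the interrupt
 * event, then generate a 10 ms pulse on OUT. Device paths are
 * placeholders for whatever your driver actually exposes. */
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 80 };

    /* Real-time scheduling policy and locked memory, as recommended above. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
        perror("sched_setscheduler");
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
        perror("mlockall");

    int irq = open("/dev/in_irq", O_RDONLY);                   /* hypothetical IRQ device (e.g. UIO) */
    int out = open("/sys/class/gpio/gpio17/value", O_WRONLY);  /* hypothetical OUT pin */
    if (irq < 0 || out < 0) {
        perror("open");
        return 1;
    }

    unsigned int count;
    const struct timespec pulse = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 }; /* 10 ms */

    for (;;) {
        /* Block until the next interrupt event (the "semaphore" in the text). */
        if (read(irq, &count, sizeof(count)) != sizeof(count))
            break;
        write(out, "1", 1);   /* start of pulse: measure IN -> here on the scope */
        clock_nanosleep(CLOCK_MONOTONIC, 0, &pulse, NULL);
        write(out, "0", 1);   /* end of the 10 ms pulse */
    }
    return 0;
}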

Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?

Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.). It seems that if all operations on kernel-level threads go through system calls, the answer to my question would be yes. However, I've searched a lot to find out whether the management of kernel-level threads happens entirely through system calls, and found nothing. And I always had the instinct that such management should be done by the OS automatically, because only the OS knows which thread would be suitable to run at a specific time. So it seems impossible for programmers to write explicit system calls to manage threads. I'm appreciative of any ideas.
Some web search results told me that the only deficiency of kernel-level threads is the slow speed of their management (create, switch, terminate, etc.).
It's not that simple. To understand, think about what causes task switches. Here's a (partial) list:
a device told a device driver that an operation completed (some data arrived, etc) causing a thread that was waiting for the operation to unblock and then preempt the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
enough time passed; either causing an "end of time slice" task switch, or causing a sleeping thread to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the thread accessed virtual memory that isn't currently accessible, triggering the kernel's page fault handler, which finds out that the current task has to wait while the kernel fetches data from swap space or from a file (if the virtual memory is part of a memory-mapped file), or has to wait for the kernel to free up RAM by sending other pages to swap space (if the virtual memory was involved in some kind of "copy on write"); causing a task switch because the currently running task can't continue. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
a new process is being created, and its initial thread preempts the currently running thread. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread asked kernel to do something with a file and kernel got "VFS cache miss" that prevents the request from being performed without any task switches. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to a different process to unblock and preempt. For this case you're running kernel code when you find out that a task switch is needed, so kernel task switching is faster.
the currently running thread releases a mutex or sends some data (e.g. using a pipe or socket); causing a thread that belongs to the same process to unblock and preempt. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster, but in practice it can just as easily be an indicator of poor design (using too many threads and/or far too much lock contention).
a new thread is being created for the same process; and the new thread preempts the currently running thread. For this case you're running user-space code when you find out that a task switch is needed, so in theory user-space task switching is faster; but only if the kernel isn't informed (e.g. so that utilities like "top" can properly display details for threads) - if the kernel is informed anyway then it doesn't make much difference where the task switch happens.
For most software (which doesn't use very many threads), doing task switches in the kernel is faster. Of course it's also (hopefully) fairly irrelevant for performance (because time spent switching tasks should be tiny compared to time spent doing other work).
And I always had the instinct that such management should be done by the OS automatically, because only the OS knows which thread would be suitable to run at a specific time.
Yes; but possibly not for the reason you think.
Another problem with user-space threading (besides making most task switches slower) is that it can't support global thread priorities without becoming a severe security disaster. Specifically; a process can't know if its own thread is higher or lower priority than a thread belonging to a different process (unless it has information about all threads for the entire OS, which is information that normal processes shouldn't be trusted to have); so user-space threading leads to wasting CPU time doing unimportant work (for one process) when there's important work to do (for a different process).
Another problem with user-space threading is that (for some CPUs - e.g. most 80x86 CPUs) the CPUs are not independent, and there may be power management decisions involved with scheduling. For example; most 80x86 CPUs have hyper-threading (where a core is shared by 2 logical processors), where a smart scheduler may say "one logical processor in the core is running a high priority/important thread, so the other logical processor in the same core should not run a low priority/unimportant thread because that would make the important work slower"; most 80x86 CPUs have "turbo boost" (with similar "don't let low priority threads ruin the turbo-boost/performance of high priority thread" possibilities); and most CPUs have thermal management (where the scheduler might say "Hey, these threads are all low priority, so let's underclock the CPU so that it cools down and can go faster later (has more thermal headroom) when there's high priority/more important work to do!").
Would kernel-level threads be clearly preferable to user-level threads if system calls were as fast as procedure calls?
If system calls were as fast as normal procedure calls, then the performance differences between user-space threading and kernel threading would disappear (but all the other problems with user-space threading would remain). However, the reason why system calls are slower than normal procedure calls is that they pass through a kind of "isolation barrier" (that isolates kernel's code and data from malicious user-space code); so to make system calls as fast as normal procedure calls you'd have to get rid of the isolation (effectively turning the kernel into a kind of "global shared library" that can be dynamically linked) but without that isolation you'll have an extreme security disaster. In other words; to have any hope of achieving acceptable security, system calls must be slower than normal procedure calls.
Your basic premise is wrong. System calls are much slower than procedure calls in almost every interesting architecture.
The perceived CPU throughput is based on pipelining, speculative execution and speculative fetching. A syscall stops the pipeline, invalidates the speculative execution and halts the speculative fetching; it is a store and instruction barrier, and may flush the write FIFO.
So the processor slows down to its 'spec' speed around the syscall entry, accelerates back up until the syscall return, whereupon it does roughly the same thing again.
Attempts to optimise this area have given rise to lots of papers named after fictional James Bond organizations, and to not-conciliatory-enough apologies from not-embarrassed-enough CPU product managers. Google "Spectre" as an example, then follow the associated links.
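To get a feel for the gap this answer describes, here is a rough microbenchmark sketch that times a trivial system call (getpid, forced through syscall() so libc cannot cache it) against a plain function call. The absolute numbers vary wildly by CPU, kernel and mitigation settings; the point is only the order-of-magnitude difference.

/* Rough sketch: compare the per-call cost of a plain procedure call
 * with a trivial syscall. Illustrative only, not a rigorous benchmark. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static long dummy(long x) { return x + 1; }                         /* the "procedure call" */
static long do_syscall(long x) { return syscall(SYS_getpid) + x; }  /* forces a kernel entry */

static double bench(long (*fn)(long), long iters)
{
    struct timespec t0, t1;
    volatile long acc = 0;       /* keep the loop from being optimised away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        acc += fn(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
    long iters = 1000000;
    printf("procedure call: ~%.1f ns/iter\n", bench(dummy, iters));
    printf("getpid syscall: ~%.1f ns/iter\n", bench(do_syscall, iters));
    return 0;
}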
The other cost of syscall
A bit over 30 years ago, some smart guys wrote a paper about least privilege. Conceptually, it is a stunner. The basic premise is that whatever your program is doing, it should do it with the least privilege possible.
If your program is inverting arrays, according to the notion of least privilege, it should not be able to disable interrupts. Disabling interrupts can cause a very difficult to diagnose system failure. Simple user code should not have this ability.
The notion of user and kernel modes of execution evolved from early computer systems, and (with the possible exception of the iax32 / 80286 ) are increasingly showing their inadequacy in the connected computer environment. At one point in time you could say "this is a single user system"; but the IoT dweebs have made everything multi-user.
Least privilege insists that all code should execute with the minimum privilege required to complete the task at hand. Thus, nothing should be in the kernel that absolutely doesn't need to be. If you think that is a radical thought, in Ken Thompson's 1977(?) paper on the UNIX kernel he states exactly the same thing.
So no, putting your junk in the kernel just means you have increased the attack surface for no valid reason. Try to think in terms of exposing minimum risk, it leads to better software and better sleep.

Why is softirq used for highly threaded and high-frequency uses?

What makes softirqs so special that we use them for high-frequency uses, like in network drivers and block drivers?
SoftIrqs are typically used to complete queued work from a processed interrupt because they fit that need very well - they run with second-highest priority, but still run with hardware interrupts enabled.
Processing hardware interrupts is the utmost priority: if they are not processed quickly, then either too much latency is introduced and the user experience suffers, or the hardware buffer fills up before the interrupt services the device, and data is lost. Don't service a network adapter fast enough? It's going to overwrite data in its FIFO and you'll lose packets. Don't service a hard disk fast enough? The hard drive stalls queued read requests because it has nowhere to put the results.
SoftIrqs allow the critical part of servicing hardware interrupts to be as short as possible; instead of having to process the entire hardware interrupt on the spot, the important data is read off the device into RAM (or wherever), and then a SoftIrq is started to finish the work. This keeps hardware interrupts disabled for the shortest possible time, while still completing the work at high priority.
This article is a decent reference on the matter:
https://lwn.net/Articles/520076/
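To make the top-half/bottom-half split concrete, here is a hedched, hypothetical driver fragment that defers work from its hardware interrupt handler to softirq context via a tasklet (tasklets run on top of the softirq machinery). The mydev_* names and the IRQ number 42 are placeholders, and it assumes the post-5.9 tasklet API where the callback takes a struct tasklet_struct pointer.

/* Sketch: short hard-IRQ top half, heavier work deferred to softirq
 * context through a tasklet. */
#include <linux/interrupt.h>
#include <linux/module.h>

static void mydev_bottom_half(struct tasklet_struct *t)
{
    /* Deferred, heavier processing: runs in softirq context with
     * hardware interrupts re-enabled. */
    pr_info("mydev: processing deferred work\n");
}

static DECLARE_TASKLET(mydev_tasklet, mydev_bottom_half);

static irqreturn_t mydev_irq(int irq, void *dev_id)
{
    /* Top half: only the minimum (ack the device, pull data out of its
     * FIFO), then hand the rest to softirq context. */
    tasklet_schedule(&mydev_tasklet);
    return IRQ_HANDLED;
}

static int __init mydev_init(void)
{
    /* IRQ 42 is a placeholder; a real driver gets its IRQ from the
     * device/platform layer. */
    return request_irq(42, mydev_irq, 0, "mydev", NULL);
}

static void __exit mydev_exit(void)
{
    free_irq(42, NULL);
    tasklet_kill(&mydev_tasklet);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");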
Edits for questions:
SoftIrqs are re-entrant - they can be processed on any cpu. From the article I linked:
There are two places where software interrupts can "fire" and preempt the current thread. One of them is at the end of the processing for a hardware interrupt; it is common for interrupt handlers to raise softirqs, so it makes sense (for latency and optimal cache use) to process them as soon as hardware interrupts can be re-enabled.
Emphasis added. They can be processed inline - I believe this means they can be processed without causing a context switch, so that as soon as hardware interrupts are re-enabled, we can jump straight to the SoftIrq right where we are, with as little CPU cache abuse as possible. All of this contributes to SoftIrqs being lightweight but flexible, which makes them ideal for high-frequency processing.
They can be pushed to another CPU if needed, which improves throughput.
They can be processed immediately after hardware interrupts are re-enabled, right in the current context, preserving processor state as much as possible and improving latency.
They allow hardware interrupts to stay enabled and keep being serviced, since those are the most important thing.
They can be deferred to the ksoftirqd kernel threads if the load gets too high, so that heavy softirq processing is scheduled like a normal process instead of starving user processes.

Multiple hardware timers in Linux

Problem - There is intermittent clock drift (of 2 seconds) on my Linux system, so once in a while the kernel timer threads get executed 2 seconds later than the timeout.
Question - There are multiple hardware clocksources (TSC, HPET, ACPI_PM); is it possible to create kernel timer threads that forcibly use a secondary clocksource as a fallback if the primary clocksource drifts?
What you describe doesn't sound like clock drift (a systematic error) but rather like lost timer interrupts. If you have another piece of hardware that can generate timed interrupts (HPET or RTC, but not TSC), you can run your time-sensitive processing from either the timer or the interrupt handler (or handlers), whichever fires first; you just need to design some kind of synchronization between them.
If you experience genuine clock drift, where the speed of your clock differs from real time, you can try to estimate it and compensate when timers are scheduled. But lost interrupts are a sign of a more serious problem, and it makes sense to address the root cause, which may affect your secondary interrupt source as well.
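As a side note (not part of the answer above): the active clocksource is a system-wide setting, not something a timer can choose per thread, but it can be inspected and switched through the standard clocksource sysfs files. A small sketch:

/* Sketch: inspect and switch the system clocksource via sysfs.
 * Switching requires root; the kernel also switches on its own if it
 * marks the current clocksource unstable. */
#include <stdio.h>

#define CS_DIR "/sys/devices/system/clocksource/clocksource0/"

int main(void)
{
    char buf[128];

    FILE *f = fopen(CS_DIR "available_clocksource", "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("available: %s", buf);
    if (f) fclose(f);

    f = fopen(CS_DIR "current_clocksource", "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("current:   %s", buf);
    if (f) fclose(f);

    /* Request HPET as the clocksource (system-wide effect). */
    f = fopen(CS_DIR "current_clocksource", "w");
    if (f) {
        fputs("hpet\n", f);
        fclose(f);
    } else {
        perror("switch clocksource");
    }
    return 0;
}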

How "Real-Time" is Linux 2.6?

I am looking at moving my product from an RTOS to embedded Linux. I don't have many real-time requirements, and the few RT requirements I have are on the order of 10s of milliseconds.
Can someone point me to a reference that will tell me how Real-Time the current version of Linux is?
Are there any other gotchas in moving from a commercial RTOS to Linux?
You can get most of your answers from the Real Time Linux wiki and FAQ
What are real-time capabilities of the stock 2.6 linux kernel?
Traditionally, the Linux kernel only allows one process to preempt another under certain circumstances:
When the CPU is running user-mode code
When kernel code returns from a system call or an interrupt back to user space
When kernel code blocks on a mutex, or explicitly yields control to another process
If kernel code is executing when some event takes place that requires a high-priority thread to start executing, the high-priority thread cannot preempt the running kernel code until the kernel code explicitly yields control. In the worst case, the latency could potentially be hundreds of milliseconds or more.
The Linux 2.6 configuration option CONFIG_PREEMPT_VOLUNTARY introduces checks at the most common causes of long latencies, so that the kernel can voluntarily yield control to a higher-priority task waiting to execute. This can be helpful, but while it reduces the occurrences of long latencies (hundreds of milliseconds to potentially seconds or more), it does not eliminate them. However, unlike CONFIG_PREEMPT (discussed below), CONFIG_PREEMPT_VOLUNTARY has a much lower impact on the overall throughput of the system. (As always, there is a classical tradeoff between throughput --- the overall efficiency of the system --- and latency. With the faster CPUs of modern-day systems, it often makes sense to trade off throughput for lower latencies, but server-class systems that do not need minimum latency guarantees may very well choose to use either CONFIG_PREEMPT_VOLUNTARY, or to stick with the traditional non-preemptible kernel design.)
The 2.6 Linux kernel has an additional configuration option, CONFIG_PREEMPT, which causes all kernel code outside of spinlock-protected regions and interrupt handlers to be eligible for non-voluntary preemption by higher priority kernel threads. With this option, worst case latency drops to (around) single digit milliseconds, although some device drivers can have interrupt handlers that will introduce latency much worse than that. If a real-time Linux application requires latencies smaller than single-digit milliseconds, use of the CONFIG_PREEMPT_RT patch is highly recommended.
They also have a list of "gotchas", as you called them, in the FAQ.
What are important things to keep in mind while writing realtime applications?
Taking care of the following during the initial startup phase:
Call mlockall() as soon as possible from main().
Create all threads at startup time of the application, and touch each page of the entire stack of each thread. Never start threads dynamically during RT show time, this will ruin RT behavior.
Never use system calls that are known to generate page faults, such as fopen(). (Opening of files does the mmap() system call, which generates a page-fault).
If you use 'compile time global variables' and/or 'compile time global arrays', then use mlockall() to prevent page faults when accessing them.
more information: HOWTO: Build an RT-application
They also have a large publications page you might want to check out.
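Below is a sketch of the startup sequence that checklist describes, loosely adapted from the pattern in the RT wiki HOWTO; the priority value (80) and the amount of stack prefaulted are arbitrary choices, not requirements.

/* Sketch: lock memory early, create RT threads at startup, and
 * prefault each thread's stack before the time-critical phase begins.
 * Needs root or an appropriate rtprio rlimit for SCHED_FIFO. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PREFAULT_STACK (8 * 1024)   /* touch this much of each thread's stack */

static void *rt_thread(void *arg)
{
    unsigned char dummy[PREFAULT_STACK];
    (void)arg;
    memset(dummy, 0, sizeof(dummy));   /* fault in this thread's stack pages now */

    /* ... real-time work loop would go here ... */
    return NULL;
}

int main(void)
{
    /* 1. Lock all current and future pages as early as possible. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
        perror("mlockall");
        return 1;
    }

    /* 2. Create all threads up front, with an explicit RT policy and priority. */
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 80 };
    pthread_attr_init(&attr);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

    pthread_t tid;
    int err = pthread_create(&tid, &attr, rt_thread, NULL);
    if (err) {
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}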
Have you had a look at Xenomai? It will let you run "hard real time" processes above Linux, while still allowing you to access the regular Linux APIs for all the non-real-time needs.
There are two fundamentally different approaches to achieve real-time capabilities with Linux.
Patch the existing kernel with things like the rt-preempt patches. This will eventually lead to a fully preemptible kernel
Dual kernel approach (like xenomai, RTLinux, RTAI,...)
There are lots of gotchas in moving from an RTOS to Linux.
Maybe you don't really need real-time?
I'm talking about real-time Linux in my training sessions:
https://rlbl.me/elisa
https://rlbl.me/elisa-en-pdf
https://rlbl.me/intely
https://rlbl.me/intely-en-pdf
https://rlbl.me/entirety-en-all-pdf
The answer is probably "good enough".
If you're running an embedded system, you probably have control of all or most of the software on the box.
Stock Linux 2.6 has several features suitable for low-latency tasks - chiefly these are:
Scheduling policies
Memory locking
Assuming you're using a single-core machine, if you have just one task which has set its scheduling policy to SCHED_FIFO or SCHED_RR (it doesn't matter which if you have just one task), AND locked all its memory in with mlockall(), then it WILL get scheduled as soon as it is ready to run.
Then the only thing you'd have to worry about would be some non-preemptible part of the kernel taking longer than your acceptable latency to complete - which is unlikely to happen in an embedded system unless something bad occurs, such as extreme memory pressure, or your drivers being dodgy.
I guess "try it and see" is a good answer, but that's probably rather complicated in your case (and might involve writing device drivers etc).
Look at the doc for sched_setscheduler for some good info.
