I was reading this article, but my question is on a generic level, I was thinking along the following lines:
Can a kernel be called real time just because it has a real time scheduler? Or in other words, say I have a linux kernel, and if I change the default scheduler from O(1) or CFS to a real time scheduler, will it become an RTOS?
Does it require any support from the hardware? Generally I have seen embedded devices having an RTOS (eg VxWorks, QNX), do these have any special provisions/hw to support them? I know RTOS process's running time is deterministic, but then one can use longjump/setjump to get the output in determined time.
I'd really appreciate some input/insight on it, if I am wrong about something, please correct me.
After doing some research, talking to poeple (Jamie Hanrahan, Juha Aaltonen #linkedIn Group - Device Driver Experts) and ofcourse the input from #Jim Garrison, this what I can conclude:
In Jamie Hanrahan's words-
What makes a kernel real time?
The sine qua non of a real time OS -
The ability to guarantee a maximum latency between an external interrupt and the start of the interrupt handler.
Note that the maximum latency need not be particularly short (e.g. microseconds), you could have a real time OS that guaranteed an absolute maximum latency of 137 milliseconds.
A real time scheduler is one that offers completely predictable (to the developer) behavior of thread scheduling - "which thread runs next".
This is generally separate from the issue of a guaranteed maximum latency to responding to an interrupt (since interrupt handlers are not necessarily scheduled like ordinary threads) but it is often necessary to implement a real-time application. Schedulers in real-time OSs generally implement a large number of priority levels. And they almost always implement priority inheritance, to avoid priority inversion situations.
So, it is good to have a guaranteed latency for an interrupt and predictability of thread scheduling, then why not make every OS real time?
Because an OS suited for general purpose use (servers and/or desktops) needs to have characteristics that are generally at odds with real-time latency guarantees.
For example, a real-time scheduler should have completely predictable behavior. That means, among other things, that whatever priorities have been assigned to the various tasks by the developer should be left alone by the OS. This might mean that some low-priority tasks end up being starved for long periods of time. But the RT OS has to shrug and say "that's what the dev wanted." Note that to get the correct behavior, the RT system developer has to worry a lot about things like task priorities and CPU affinities.
A general-purpose OS is just the opposite. You want to be able to just throw apps and services on it, almost always things written by many different vendors (instead of being one tightly integrated system as in most R-T systems), and get good performance. Perhaps not the absolute best possible performance, but good.
Note that "good performance" is not just measured in interrupt latency. In particular, you want CPU and other resource allocations that are often described as "fair", without the user or admin or even the app developers having to worry much if at all about things like thread priorities and CPU affinities and NUMA nodes. One job might be more important than another, but in a general-purpose OS, that doesn't mean that the second job should get no resources at all.
So the general purpose OS will usually implement time-slicing among threads of equal priority, and it may adjust the priorities of threads according to their past behavior (e.g. a CPU hog might have its priority reduced; an I/O bound thread might have its priority increased, so it can keep the I/O devices working; a CPU-starved thread might have its priority boosted so it can get a little bit of CPU time now and then).
Can a kernel be called real time just because it has a real time scheduler?
No, an RT scheduler is a necessary component of an RT OS, but you also need predictable behavior in other parts of the OS.
Does it require any support from the hardware?
In general, the simpler the hardware the more predictable its behavior is. So PCI-E is less predictable than PCI, and PCI is less predictable than ISA, etc. There are specific I/O buses that were designed for (among other things) easy predictability of e.g. interrupt latency, but a lot of R-T requirements can be met these days with commodity hardware.
The specific description of real-time is that processes have minimum response time guarantees. This is often not sufficient for the application, and even less important than determinism. This is especially hard to achieve with modern feature rich OS's. Consider:
If I want to command some hardware or a machine at precise points in time, I need to be able to generate command signals at those specific moments, often with far sub millisecond accuracy. Generally if you compile let's say a C-code that runs a loop that waits for "half a millisecond" and does something, the wait time is not exactly half a millisecond, it is a little bit more, since the way common OS's handle this, is that they put the process aside at least up until the correct time has passed, after which the scheduler might (at some point) pick it up again.
What is seriously problematic is not that the time t is not exactly half a second but that it cannot be known in advance how much more it is. This inaccuracy is not constant nor deterministic.
This has surprising consequences when doing physical automation. For example it is impossible to command a stepper motor accurately with any typical OS without using dedicated hardware through kernel interfaces and telling them how long time steps you really want. Because of this, a single AVR module can command several motors accurately, but a Raspberry Pi (that absolutely stomps the AVR in terms of clockspeed) cannot manage more than 2 with any typical OS.
Related
The background for my question is that I have a game engine whose main rendering loop involves two threads: In short, one thread generating OpenGL commands and another thread that takes the generated OpenGL commands and dispatches them to the driver. This allows them a certain degree of overlap of the work, but they still have to wait for each other at certain points. This is of course a fairly specialized situation local to me, but I also can't imagine the situation of having two threads ping-ponging work between them with a limited degree of parallelism is a particularly uncommon scenario.
Now, my problem is that, since none of the two threads are 100% busy, but rather spends some time sleeping waiting for the other thread to catch up, my OS scheduler apparently thinks that I'm not using the CPU enough to clock it up to maximum frequency. In marginal situations, this causes me to drop FPS significantly. This is on Linux with the default powersave cpufreq governor. Setting it to performance makes the problem go away entirely, but that's obviously not the default, and not what other people will be using. I don't have any Windows systems readily available to test on, so I don't know if Windows' scheduler has a similar behavior. My program is written in Java, but I can't imagine that to have any significant bearing on the problem.
Again though, I'm sure I'm not the only one with a program that behaves in this manner. Has someone else had this problem? Is there a good way to "solve" it? It seems a bit unnecessary to be dropping performance significantly just because the operating system doesn't really understand that the CPU is in fact being used to 100%.
Considering one core , when multiple request is arrived at server at the same timestamp, and all have the same priority ,for which request the thread would be allotted first ?
Ex: CPU has single core and has 2 thread. Now the 4 people has made the request (process) A,B,C,D to any server & server need to assign threads in the message queue in order to process those request. But which 2 process would be given chance first to assign those 2 threads ?
Assumption they all have arrived at same timestamp and have equal priority.
TUSHAR, there is a bit of a language gap occurring here. Considering you chose, kernel, and didn't seem to think it was something to do with algebra, I am going to translate your question:
In a single CPU system, when multiple interrupts are asserted
simultaneously, and all have the same priority, which handler would be
serviced first?
The first bit of info is that most interrupt controllers are little more than a priority encoder with some extra glue. As such, they have no notion of same priority, but that is less important than you might think.
Real Time Operating Systems, in particular, seek to disassociate their implementation with the hardware, and may even dynamically adjust interrupt priorities to suit the current workload. The key here is that the OS spends a minimal time at the mercy of the interrupt controller, and chooses what to do based upon its state. As the system designer, you can choose what happens.
Time Sharing Operating Systems also have some control over this; but typically less as they strive for maximum throughput rather than predictable response. As such, they might do anything from first-in-first-served, random-served, or even random-starved.
So the answer to your question depends upon your environment. For the most part, if you have a very simple environment (eg. an executive like vxWorks or freeRTOS), expect it to follow the dictates of the interrupt controller. If you have a more sophisticated device OS (eg. INTEGRITY or QNX) it is up to your configuration. If you have Linux/winDOS, there are likely 320 control knobs that all result in burning the toast.
Searching answers here for "thread affinity", I see a lot of interest in doing it but little justification for it save possibly getting stable QueryPerformanceTimer results.
Assuming a modern OS and a modern 2-4 socket workstation/server class machine with modern 4-6 core CPUs, what good reasons would anyone have for thinking they know better than their OS's scheduler ? Are there any real world situations where taking more control of thead affinity is the right thing to do ? What sort of performance benefits can be demonstrated ?
The last time I saw a really good case for setting thread affinity somewhere (as in, it was backed up by concrete results showing genuine and significant improvements in system performance), it was some obscure thing to do with Win2K device drivers. But I haven't seen anything like that in years so when someone tells me they need to control thread affinity (but not why) these days I am deeply sceptical... but curious to be shown otherwise.
The primary reason is if you have something that depends heavily upon caching. The OS scheduler doesn't necessarily take that into account to the degree you might like.
I use it to assign threads to cores; for example in a simulation you do the physics entirely on one core, and allow the rest of the computation to be executed on another one. It makes sense to be able to control this, if you're on a tight environment where you know the hardware.
Of course, configuring this needs to be done per system, so by default I let the OS decide the cores on which to run, but keep the option of restricting core usage.
In the OS kernel and sometimes in kernel mode drivers you need to perform the same action on every CPU (e.g. update a system register). You can do that in a loop in a single thread, changing the affinity on each iteration.
For desktops it's quite unnecessary.
But I can see some applications where it would help. For example the CPU cache likes it if the app that runs on it doesn't change.
Another possibility is you have a critical task - you give it an entire CPU, and the other tasks use the rest of the CPUs.
Or the opposite: You have some low priority tasks, you put them all on one CPU, then leave the others free for more important tasks (using process priority will give you most of this benefit without having affinity, but I can imagine some memory heavy cases where it wouldn't).
I would agree its best to leave to the CPU to figure this out in most situations. However, the most common reason to go for thread affinity as far as I have seen is when you need good cache dependency. In multiple CPU systems, when a particular CPU caches something individually for itself and if the same thing has been cached in some other CPU, then I believe it can automatically get invalidated on the other CPU. So if a particular thread keeps changing CPUs on which it executes, then the cache hit rate will be too less. So in this case I guess it makes sense for the programmer to be a better judge of the COU affinities.
I also think the above point by Ariel about making sure a critical task constantly gets a CPU without throttling other low priority processes also makes sense.
I recently learned that sometimes people will lock specific processes or threads to specific processors or cores, and it's thought that this manual tuning will best distribute the load. This is a bit counter-intuitive to me -- I would think the OS scheduler would be able to make a better decision than a human about how to spread the load. I could see it being true for older operating systems that perhaps weren't aware of issues like their being more latency between specific pairs of cores, or shared cache between one pair of cores but not another pair. But I assume 'modern' OSs like Linux, Solaris 10, OS X, and Vista should have schedulers that know this information. Am I mistaken about their capabilities? Am I mistaken that it's a problem the OS can actually solve? I'm particularly interested in the answer for Solaris and Linux.
The consequence is whether or not I need to inform users of my (multithreaded) software of how they might consider balancing on their box.
First of all, 'Lock' is not a correct term to describe it. 'Affinity' is more suitable term.
In most case, you don't need to care about it. However, in some cases, manually setting CPU/Process/Thread affinity could be beneficial.
Operating systems are usually oblivious to the details of modern multicore architecture. For example, say we have 2-socket quadcore processors, and the processor supports SMT(=HyperThreading). In this case, we have 2 processors, 8 cores, and 16 hardware threads. So, OS will see 16 logical processors. If an OS does not recognize such hierarchy, it is highly likely to lose some performance gains. The reasons are:
Caches: in our example, two different processors (installed on two different sockets) are not sharing any on-chip caches. Say an application has 4 busy-running threads and a lot of data are shared by threads. If an OS schedules the threads across the processors, then we may lose some cache locality, resulting in performance lose. However, the threads are not sharing much data (having distinct working set), then separating to different physical processors would be better by increasing effective cache capacity. Also, more tricky scenario could be happen, which is very hard for OS to be aware of.
Resource conflict: let's consider SMT(=HyperThreading) case. SMT shares a lot of important resources of CPU such as caches, TLB, and execution units. Say there are only two busy threads. However, an OS may stupidly schedule these two threads on two logical processors from the same physical core. In such case, a significant resources are contended by two logical threads.
One good example is Windows 7. Windows 7 now supports a smart scheduling policy that consider SMT (related article). Windows 7 actually prevents the above 2. case. Here is a snapshot of task manger in Windows 7 with 20% load on Core i7 (quadcore with HyperThreading = 8 logical processors):
(source: egloos.com)
The CPU usage history is very interesting, isn't? :) You may see that only a single CPU in pairs is utilized, meaning Windows 7 avoids scheduling two threads on a same core simultaneously as possible. This policy will definitely decrease the negative effects of SMT such as resource conflict.
I'd like to say OS are not very smart to understand modern multicore architecture where a lot of caches, shared last-level cache, SMT, and even NUMA. So, there could be good reasons you may need to manually set CPU/process/thread affinity.
However, I won't say this is really needed. Only when you fully understand your workload patterns and your system architecture, then try it on. And, see the results whether your try is effective.
For general-purpose applications, there is no reason to set the CPU affinity; you should just allow the OS scheduler to choose which CPU should run the process or thread. However, there are instances where it is necessary to set the CPU affinity. For example, in real-time systems where the cost of migrating a thread from one core to another (which can happen at any time if the CPU affinity has not been set) can introduce unpredictable delays that can cause tasks to miss their deadlines and which preclude real-time guarantees.
You can take a look at this article about a multi-core aware implementation of real-time CORBA that, among other things, had to set the CPU affinity so that CPU migration could not result in missed deadlines.
The paper is: Real-Time Performance and Middleware for Multiprocessor and Multicore Linux Platforms
For applications designed with parallelism and multiple cores in mind, OS-default thread affinity is sometimes not enough. There are many approaches to parallelism, but so far all require involvement of the programmer and knowledge - at some level at least - of the architecture on which the solution will be mapped. This includes the machines, CPU's and threads that are involved.
This is an actively researched subject, and there is an excellent course on MIT's OpenCourseWare that delves into these issues: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-189January--IAP--2007/CourseHome/
Well something many people haven't thought here is the idea of forbidding two processes to run on the same processor (socket). It might be worth to help the system to bound different heavily used processes to different processors. This can avoid contention if the scheduler is not clever enough to figure it out itself.
But this is more a system admin task then one for the programmers. I have seen optimizations like this for a few high performance database servers.
Most modern operating systems will do an effective job of allocating work between cores. They also attempt to keep threads running on the same core, to get the cache benefits you mentioned.
In general, you should never be setting your thread affinity unless you have a very good reason to. You don't have as good an insight as the OS into the other work that threads on the system are doing. Kernels are constantly being updated based on new processor technology (single CPU per socket to hyper threading to multiple cores per sockets). Any attempt by you to set hard affinity may backfire on future platforms.
This article from MSDN Magazine, Using concurrency for scalability, gives a good overview of multithreading on Win32. Regarding CPU affinity,
Windows automatically employs
so-called ideal processor affinity in
an attempt to maximize cache
efficiency. For example, a thread
running on CPU 1 that gets context
switched out will prefer to run again
on CPU 1 in the hope that some of its
data will still reside in cache. But
if CPU 1 is busy and CPU 2 is not, the
thread could be scheduled on CPU 2
instead, with all the negative cache
effects that implies.
The article also warns that CPU affinity shouldn't be manipulated without a deep understanding of the problem. Based on this information, my answer to your question would be No, except for very specific, well-understood scenarios.
I am not even sure you can pin processes to a specific CPU on linux. So, my answer is "NO" - let the OS handle it, it's smarter then you most of the time.
Edit:
It seems that on win32 you have some control over which CPU family are you going to run this process. Now I only wait for someone to prove me wrong also on linux/posix ...
I am looking at moving my product from an RTOS to embedded Linux. I don't have many real-time requirements, and the few RT requirements I have are on the order of 10s of milliseconds.
Can someone point me to a reference that will tell me how Real-Time the current version of Linux is?
Are there any other gotchas from moving to a commercial RTOS to Linux?
You can get most of your answers from the Real Time Linux wiki and FAQ
What are real-time capabilities of the stock 2.6 linux kernel?
Traditionally, the Linux kernel will only allow one process to preempt another only under certain circumstances:
When the CPU is running user-mode code
When kernel code returns from a system call or an interrupt back to user space
When kernel code code blocks on a mutex, or explicitly yields control to another process
If kernel code is executing when some event takes place that requires a high priority thread to start executing, the high priority thread can not preempt the running kernel code, until the kernel code explicitly yields control. In the worst case, the latency could potentially be hundreds milliseconds or more.
The Linux 2.6 configuration option CONFIG_PREEMPT_VOLUNTARY introduces checks to the most common causes of long latencies, so that the kernel can voluntarily yield control to a higher priority task waiting to execute. This can be helpful, but while it reduces the occurences of long latencies (hundreds of milliseconds to potentially seconds or more), it does not eliminate them. However unlike CONFIG_PREEMPT (discussed below), CONFIG_PREEMPT_VOLUNTARY has a much lower impact on the overall throughput of the system. (As always, there is a classical tradeoff between throughput --- the overall efficiency of the system --- and latency. With the faster CPU's of modern-day systems, it often makes sense to trade off throughput for lower latencies, but server class systems that do not need minimum latency guarantees may very well chose to use either CONFIG_PREEMPT_VOLUNTARY, or to stick with the traditional non-preemptible kernel design.)
The 2.6 Linux kernel has an additional configuration option, CONFIG_PREEMPT, which causes all kernel code outside of spinlock-protected regions and interrupt handlers to be eligible for non-voluntary preemption by higher priority kernel threads. With this option, worst case latency drops to (around) single digit milliseconds, although some device drivers can have interrupt handlers that will introduce latency much worse than that. If a real-time Linux application requires latencies smaller than single-digit milliseconds, use of the CONFIG_PREEMPT_RT patch is highly recommended.
They also have a list of "Gotcha's" as you called them in the FAQ.
What are important things to keep in
mind while writing realtime
applications?
Taking care of the following during
the initial startup phase:
Call mlockall() as soon as possible from main().
Create all threads at startup time of the application, and touch each page of the entire stack of each thread. Never start threads dynamically during RT show time, this will ruin RT behavior.
Never use system calls that are known to generate page faults, such as
fopen(). (Opening of files does the
mmap() system call, which generates a
page-fault).
If you use 'compile time global variables' and/or 'compile time global
arrays', then use mlockall() to
prevent page faults when accessing
them.
more information: HOWTO: Build an
RT-application
They also have a large publications page you might want to checkout.
Have you had a look at Xenomai? It will let you run "hard real time" processes above Linux, while still allowing you to access the regular Linux APIs for all the non-real-time needs.
There are two fundamentally different approaches to achieve real-time capabilities with Linux.
Patch the existing kernel with things like the rt-preempt patches. This will eventually lead to a fully preemptive kernel
Dual kernel approach (like xenomai, RTLinux, RTAI,...)
There are lots of gotchas moving from a RTOS to Linux.
Maybe you don't really need real-time?
I'm talking about real-time Linux in my training sessions:
https://rlbl.me/elisa
https://rlbl.me/elisa-en-pdf
https://rlbl.me/intely
https://rlbl.me/intely-en-pdf
https://rlbl.me/entirety-en-all-pdf
The answer is probably "good enough".
If you're running an embedded system, you probably have control of all or most of the software on the box.
Stock Linux 2.6 has several features suitable for low-latency tasks - chiefly these are:
Scheduling policies
Memory locking
Assuming you're using a single-core machine, if you have just one task which has set its scheduling policy to SCHED_FIFO or SCHED_RR (it doesn't matter which if you have just one task), AND locked all its memory in with mlockall(), then it WILL get scheduled as soon as it is ready to run.
Then the only thing you'd have to worry about was some non-preemptable part of the kernel taking longer than your acceptable latency to complete - which is unlikely to happen in an embedded system unless something bad happens, such as extreme memory pressure, or your drivers are dodgy.
I guess "try it and see" is a good answer, but that's probably rather complicated in your case (and might involve writing device drivers etc).
Look at the doc for sched_setscheduler for some good info.