FreeRTOS vs Linux against single event upsets - linux

I am working on the on-board computer for a CubeSat. Our computer will be vulnerable to radiation, hence single event upsets, e.g. bit flips are likely to occur. Would a lighter, smaller OS like FreeRTOS bring more stability, robustness and a lower probability of failure over a full-blown Linux operating system?

The probability of a bit error in RAM is a function of time, memory size and radiation density, so a larger memory has a greater probability, and you can fit a FreeRTOS system in much less memory (like 10kb instead of 4Mb). However the usage rate of the smaller memory is likely much higher - i.e. in a FreeRTOS application, most of the code and data are accessed relatively frequently, while in a Linux deployment, much of it is redundant and if corrupted may never be accessed in any case.
However the question makes little sense for a number of reasons, such as:
The effect of a bit-flip event is entirely non-deterministic, any single event it may be benign or catastrophic. It is impossible to say that a system can tolerate 1 error when you don't know when or where the error will occur.
If your system can be implemented on FreeRTOS, why would you even consider Linux? They are chalk and cheese. If you need the extensive networking, filesystem, memory management, POSIX API and device support etc. provided by Linux, FreeRTOS is not suited to your application in any case, as you would have to add all that yourself from your own or additional third-party code. FreeRTOS is only a scheduling kernel, with threading, synchronisation and IPC support and little else. Conversely if you need hard real-time deterministic behaviour, Linux is unsuited to your application.
Where you might benefit from using an RTOS kernel like FreeRTOS is that it will execute from ROM which may be less prone to the bit-flipping cosmic ray issue - (although the availability of ECC/radiation hardened Flash memory may indicate otherwise). You still need RAM for R/W data, but at least the code itself will be robust. A typical FreeRTOS system might run in SRAM (possibly in on-chip RAM on a microcontroller) - I don't know whether low density SRAM is less prone to bit-flipping than high-density SDRAM, but I am willing to believe it is. It is also possible to source radiation hardened SRAM in any case.
The solution for a system using SDRAM in such an environment is to use ECC RAM which may largely overcome the problem of data corruption from radiation and non-deterministic system behaviour. However I would not imagine that even that would be sufficient for space or high-atmosphere applications.
In short the solution is not in the software, it has to be in the hardware, and the lengths you need to go to will depend on the radiation environment your system will be subjected to. However the selection of a small RTOS kernel allows the selection of hardware to be potentially much wider since it will run on a much wider range of architectures in much smaller memory, perform deterministically, respond to events in fewer cycles and is ROMable.

Related

Why can't we have a safe ISA?

Accroding to this paper: https://doi.org/10.1109/SP.2013.13, Memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, causing billions of dollars and huge efforts to fix them.
But the root of C/C++'s memory vulnerability can trace down to the ISA level. At ISA level, every instruction can access any memory address without any fine grained safe check (only corase grained check like page fault). Sure, we can implement memory safe at a higher software level, like Java (JVM), but this leads to significant cost of performance. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU has a safe ISA, which ensures the memory safe by, I don't know, taking the responsbilities of malloc and free, then maybe we can get rid of the performance decline of software safe checking. If anyone professional in microelectronics can tell me, is this idea realistic?
Depending on what you mean, it could make it impossible implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory-safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLBs as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment, the way pages do, which makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build in to a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory, it doesn't make sense to try to build that in to a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look to "Code for malloc and free" at SO. Those commands are very, very far away from even being defined within an instruction set.

Windows CE (RTOS) class-libraries for latency of interrupts and threads and USB?

I am getting started in working with Windows CE to utilize RTOS to reduce latency concerns with interrupts and threads and USB. What class-libraries(visual c++) can you point me to that would be good to have learned well to speed up the learning curve?
Thanks
That's a really, really broad question. The most important piece of advice I'll give you is that if you're after determinism and speed (your reference to an RTOS leads me to think you consider these important) then you need to be aware that any memory allocation or deallocation in a piece of code makes it non-deterministic.
C++ classes often have allocations and deallocations buried in them, so whatever you choose (and whatever you write), use them wisely. Sometimes they'll allow you to provide custom allocators (e.g. Boost) which you can use to just pull memory from an already allocated heap you create somewhere.
Keep the real-time parts of the code as small and simple as possible.

Do you expect that future CPU generations are not cache coherent?

I'm designing a program and i found that assuming implicit cache coherency make the design much much easier. For example my single writer (always the same thread) multiple reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPU's. But i want this program to generate income for at least the next ten years (a short time for software) so i wonder if you think this could be a problem for future cpu architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (rather, a lack of huge performance losses) on current generation CPUs. Granted, it's more work in the design, but it is often required.
We are already there. Computers claim cache coherency but at the same time they have a temporary store buffer for writes, reads can be completed via this buffer instead of the cache (ie the store buffer has just become a incoherent cache) and invalidate requests are also queued allowing the processor to temporarily use cache lines it knows are stale.
X86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques and others yet devised to be used. Even itanium, failed as it is, uses many of these ideas, so expect intel to migrate them into x86 over time.
As for avoiding locks, etc: it is always hard to guage people's level of expertise over the Internet so either you are misguided with what you think might work, or you are on the cutting edge of lockfree programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (can't just pass data in a method call, must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank the better. Because then we're working in a world of network message routing, where data is just not available rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper from reputed authors in computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" -By Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application will fail it will be you who's not making any money..
Anyway, I would suspect things are not going to change that dramatically because of all the legacy code that was written for single threaded execution but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important, what is important is the memory model of the platform you are working for.
You are developing the application is some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application at some other platforms, if the platform memory model will be different, you will have to adjust for it, but in such case you are the one doing the port and you have the complete control over how you do it. This is however true even for current platforms, e.g. PowerPC memory model allows much more reorderings to happen than the x86 one.

Kernel Scheduling for 1024 CPUs

Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.
Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to let 200 CPUs waiting while one CPU executes a critical section.
You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarififying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if the have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Architecture) or MMP going on here? What are your page tables like?
Knowing only the very smallest of info about Azul systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, then they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addrs, this might be troublesome, although they could all be NAT'd, and maybe it's not such a big deal). If course, this model would lead to some inefficiencies, like extra page tables, and a fair bit of RAM overhead, and may even not be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs) then I would expect that the old fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up nearly 100% in kernel and doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.
Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPU's.
Another thing, is that as far as possible the kernel uses per-cpu data structures. This is very important, as it avoids the cache coherency performance issues with shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.
My uneducated guess would be that there is a run-queue per processor and a work-stealing algorithm when a processor is idle. I could see this working in an M:N model, where there is a single process per cpu and light-weight processes as the work items. This would then feel similar to a work-stealing threadpool, such as the one in Java-7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading Design & Impl of FreeBSD, with Solaris Internals being the next on my list, so all I can do is make wild guesses atm.
I am pretty sure that the SGI Altix we have at work, (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead connected to hold 4mb cache per core coherent. It's unlikely to happen in software only.
in an array of 256 cpus you would need 768mb ram just to hold the cache-invalidation bits.
12mb cache / 128 bytes per cache line * 256² cores.
Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need complex affinity model, i.e. not to jump CPUs just because yours is busy. Scheduling threads based on hardware topology - cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution, you must consider topology. One solution is hierarchical work stealing - steal work by distance, divide topology to sectors and try to steal from closest first.
Touching a bit the lock issue; you'll still use spin-locks nd such, but using totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.
The easiest way to do this is to bind each process/thread to a few CPUS, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture, you have to minimize this as much as possible.
Even on dual-core intel systems, I'm pretty sure that Linux can already handle "Thousands" of threads with native posix threads.
(Glibc and the kernel both need to be configured to support this, however, but I believe most systems these days have that by default now).

When should I write a Linux kernel module?

Some people want to move code from user space to kernel space in Linux for some reason. A lot of times the reason seems to be that the code should have particularly high priority or simply "kernel space is faster".
This seems strange to me. When should I consider writing a kernel module? Are there a set of criterias?
How can I motivate keeping code in user space that (I believe) belong there?
Rule of thumb: try your absolute best to keep your code in user-space. If you don't think you can, spend as much time researching alternatives to kernel code as you would writing the code (ie: a long time), and then try again to implement it in user-space. If you still can't, research more to ensure you're making the right choice, then very cautiously move into the kernel. As others have said, there are very few circumstances that dictate writing kernel modules and debugging kernel code can be quite hellish, so steer clear at all costs.
As far as concrete conditions you should check for when considering writing kernel-mode code, here are a few: Does it need access to extremely low-level resources, such as interrupts? Is your code defining a new interface/driver for hardware that cannot be built on top of currently exported functionality? Does your code require access to data structures or primitives that are not exported out of kernel space? Are you writing something that will be primarily used by other kernel subsystems, such as a scheduler or VM system (even here it isn't entirely necessary that the subsystem be kernel-mode: Mach has strong support for user-mode virtual memory pagers, so it can definitely be done)?
There are very limited reasons to put stuff into the kernel. If you're writing device drivers it's ok. Any standard application: never.
The drawbacks are huge. Debugging gets harder, errors become more frequent and hard to find. You might compromise security and stability. You might have to adapt to kernel changes more frequently. It becomes impossible to port to other UNIX OSs.
The closest I've ever come to the kernel was a custom filesystem (with mysql in the background) and even for that we used FUSE (where the U stands for userspace).
I'm not sure the question is the right way around. There should be a good reason to move things to kernel space. If there aren't any reasons, don't do it.
For one thing, debugging is made harder, and the effect of bugs is far worse (crash/panic instead of simple coredump).
Basically, I agree with rpj. Code has to be in user-space, unless it's REALLY necessary.
But, to emphasize your question, which condition?
Some people claims that driver has to be in the kernel, which is not true. Some drivers are not timing sensitive, in fact lots of drivers are like that.
For example, the framer, RTC timer, i2c devices, etc. Those drivers can be easily moved to user space. There are even some file-systems that are written in user-space.
You should move to kernel space where the overhead, eg. user-kernel swap, becomes unacceptable for your code to work properly.
But there are lots of way to deal with this. For example, the /dev/mem provides a good way to access your physical memory, just like you do it from the kernel space.
When people talk about going to RTOS, I'm usually skeptical.
These days, the processor is so powerful, that most of the time, the real-time aspect becomes negligible.
But even, let's say, you're dealing with SONET, and you need to do a protection switching within 50ms (actually even less, since the 50ms constrains applies to the whole ring), you still can do the switching very fast, IF your hardware supports it.
Lots of framer these days can give you a hardware support that reduces the amount of writes that you need to do. Your job is basically responds to the interrupt as quickly as possible. And Linux is not bad at all. The interrupt latency I got was less 1ms, even if I have tons of other interrupts running (eg. IDE, ethernet, etc.).
And if that's still not enough, then maybe your hardware design is wrong. Some things are better left on the hardware. And when I said hardware, I mean ASIC, FPGA, Network Processor, or other advanced logic.
Code running in the kernel accesses memory, peripherals, system functions in ways that are different from userspace code and thus has the ability to be more efficient. Not to mention the reduced security restrictions for kernel code. However, all this usually comes at a cost, such as increasing the possibility of opening the kernel up to security threats, locking up the OS, complicating the debugging, and so forth.
If your people want really high priority, determinism, low latency etc, the right way to go is to use some real-time version of Linux (or other OS).
Also look at the preemptible kernel options etc. Exactly what you should do depends on the requirements, but to put the code in kernel modules is not likely the right solution, unless you are interfacing some hardware directly.
Another reason to not move code into kernel space is that when you use it in production or commercial situations, you will have to publish that code due to the GPL agreement. A situation that many software companies don't want to come into. :)
As general rule. Think on what you want to know and if that is something you would see in an operating system development book or class then it has a good chance to belong into the kernel. If not, keep it out of the kernel. If you have a very good reason to break that rule, be sure, you will have enough knowledge to know it by yourself or you will be working with someone that has that knowledge.
Yes, might sound harsh, but this exactly what I meant, if you don't know, then be almost sure the answer is no, don't do it in the kernel. Moving your development to kernel space opens a giant can of worms that you must be sure to be able to handle.
If you just need lower latency, higher throughput, etc., it is probably cheaper to buy a faster computer than to develop kernel code.
Kernel modules may be faster (due to less context switches, less system call overhead, and less interruptions), and certainly do run at very high priority. If you want to export a small amount of fairly simple code into kernel space, this might be OK. That is, if a small piece of code is found to be crucial to performance, and is the sort of code that would benefit from being placed in kernel mode, then it may be justified to place it there.
But moving large parts of your program into kernel space should be avoided unless all other options are completely exhausted. Aside from the difficulty of doing so, the performance benefit is not likely to be very large.
If you're asking such a question, then you shouldn't go to the kernel layer. Basically just wondering means you don't need to. The time of the context switch is so negligible that it doesn't matter anyway these days.

Resources