Why can't we have a safe ISA? - security

According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory safety and type safety has caused countless bugs, costing billions of dollars and huge effort to fix.
But the root of C/C++'s memory vulnerabilities can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher software level, like Java (JVM), but this comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is, why can't we implement the safety at the hardware level? If the CPU had a safe ISA, which ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could get rid of the performance penalty of software safety checking. If anyone with professional experience in microelectronics can tell me: is this idea realistic?

Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way. e.g. every memory access would have to be to some object that has a known size? I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing, in the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking a TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic. Especially when the ranges don't have power-of-2 sizes and natural alignment the way pages do, which is what makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits. (e.g. a page is 4k in size, always starting at a 4k alignment boundary.)
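As a rough illustration of why naturally aligned, power-of-2 pages are cheap to match while arbitrary object ranges aren't, here's a sketch in C (just modelling the comparisons, not how any real TLB is built):

```c
#include <stdint.h>

/* A 4k page match is a single equality compare on the high bits of the
 * address (the virtual page number), which maps nicely onto the tag-match
 * logic of content-addressable memory in a TLB. An arbitrary object range
 * [base, base+len) needs a subtraction and a compare per entry, and can't
 * be reduced to a simple tag match. */
static inline int page_match(uint64_t vaddr, uint64_t entry_vpn)
{
    return (vaddr >> 12) == entry_vpn;   /* compare virtual page number only */
}

static inline int object_match(uint64_t vaddr, uint64_t base, uint64_t len)
{
    return (vaddr - base) < len;         /* unsigned trick: base <= vaddr < base+len */
}
```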
You mean it may cost too much at the hardware level, like die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or cost latency in very important critical paths such as L1d hit load-use latency. Even if you could come up with some plausible way for software to make tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build into a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory; it doesn't make sense to try to build that into a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
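To make the "extra tools for cheaper bounds checks" idea concrete, here's a hypothetical sketch of the check software has to emit today, which a load instruction with an index-and-max operand could in principle fold into the addressing mode. The function name load_checked is made up for illustration; nothing like it exists in any current ISA or library.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: the compare-and-branch below is the part a
 * "load with index and max" instruction could cheapen; the load itself
 * is unchanged. */
static inline uint32_t load_checked(const uint32_t *base, size_t idx, size_t len)
{
    if (idx >= len)       /* bounds check done in software today */
        abort();
    return base[idx];
}
```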
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.

Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being defined within an instruction set.

Related

Why does x86 allow for unaligned accesses, and how can unaligned accesses be detected?

Perhaps I misunderstand something, but it seems unaligned access in x86 gives security troubles such as a Return Address Integrity issue.
Why do x86 designers allow for unaligned accesses in the first place? (Performance is the only benefit I can think of.)
If x86 designers permit this unaligned-access trouble, they should somehow know how to solve it, shouldn't they? Can unaligned accesses get detected with static techniques or sanitization techniques?
I'm skeptical of the entire premise that there's a security downside here; a quick search of your link doesn't find any mention of unaligned access being a problem.
Many other ISAs support unaligned access now, too. e.g. AArch64, later ARM including ARMv6 and ARMv7, and even MIPS32r6 (but earlier MIPS revisions didn't guarantee that). Non-x86 implementations often have a performance penalty for unaligned load or especially store, even when it's within a single cache line (which has no penalty on modern x86 for cacheable loads/stores).
The primary designer of 8086 was Stephen Morse (who wrote a book about it, The 8086 Primer, which is now free on his web site).
The x86 design choice was made between 1976 and 1978. (And couldn't be changed in later x86 without breaking backwards compat, which is the main thing x86 has going for it.) 8086 needed to support byte loads and stores, and the hardware required to support unaligned 2-byte words on its 16-bit bus was presumably minor. Especially since 8088 was also planned, with an 8-bit bus. I think its only differences from 8086 were in the bus-interface unit. Or it might have been cheaper to just do it than to implement some mechanism for alignment faults.
There is no obvious security problem, and certainly none that anyone then would have heard of.
8086 was designed for easy asm source-porting from 8080 - IDK if 8080 could ever load or store 2 bytes at once, but if it allowed doing so, it probably didn't care about alignment, so 8086 needed to support it. Modern static analysis tools probably weren't even dreamed of yet, and most 8080 code was hand-written in asm. (Like much early 8086 code, I'd guess.)
The Internet barely existed at the time and almost certainly wasn't a consideration. 8086 had no memory protection or privilege levels, so it certainly wasn't designed with security in mind. (Unlike contemporary CPUs for minicomputers that ran multi-user OSes).
The only real security threat for PCs at the time AFAIK was boot-sector viruses, and usually those spread by directly executing code that the system auto-ran during boot or from floppies, not attacking vulnerabilities in other programs. I could imagine malicious data files like .zip or word-processor formats were thought of at some point, but if there is any security advantage to disallowing misaligned accesses, it wasn't anything known then.
Software certainly wasn't spending extra code-size or cycles on hardening, not for decades after 8086.
Can unaligned accesses get detected with static techniques or sanitization techniques?
There's HW support for detecting unaligned accesses on x86, in the form of the AC bit in EFLAGS. But that's normally unusable because compilers (and hand-written asm memcpy etc. in libc) sometimes use unaligned loads, e.g. to initialize or copy adjacent narrow members of a struct.
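For what it's worth, here's a rough sketch of flipping EFLAGS.AC from user space, assuming x86-64 Linux, GCC/Clang inline asm, and that the kernel has set CR0.AM (which Linux normally does). In a real program this tends to fault inside libc or compiler-generated code long before it catches your own bug, which is exactly why it's normally unusable:

```c
#include <stdio.h>
#include <stdint.h>

/* Set EFLAGS.AC (bit 18). With CR0.AM set by the kernel, misaligned data
 * accesses at CPL 3 then raise #AC, which Linux delivers as SIGBUS. */
static void enable_alignment_check(void)
{
    __asm__ volatile("pushfq\n\t"
                     "orq $0x40000, (%%rsp)\n\t"   /* bit 18 = AC */
                     "popfq" ::: "cc", "memory");
}

int main(void)
{
    _Alignas(8) char buf[16];
    enable_alignment_check();
    volatile uint32_t *p = (volatile uint32_t *)(buf + 1);  /* deliberately misaligned */
    *p = 42;                     /* expected to die with SIGBUS once AC is set */
    printf("no fault\n");
    return 0;
}
```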
GCC has -fsanitize=alignment which seems to check for C UB of dereferencing pointers that aren't sufficiently aligned for their type. e.g. it checks *int_ptr, but doesn't add checks for memcpy(char_arr, &my_int, 4) even though it inlines as a dword store. https://godbolt.org/z/ac6K13nc1
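Roughly the distinction being described, as a small sketch (compile with something like gcc -O2 -fsanitize=alignment):

```c
#include <string.h>

/* The sanitizer instruments the dereference of int_ptr (C UB if the pointer
 * isn't sufficiently aligned for int), but not the memcpy through a char
 * pointer, even though the memcpy may inline to the same dword store. */
int deref(int *int_ptr)
{
    return *int_ptr;              /* checked by -fsanitize=alignment */
}

void copy_int(char *char_arr, int my_int)
{
    memcpy(char_arr, &my_int, 4); /* not checked: char access is always allowed */
}
```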
Misaligned locked instructions are extremely expensive (like system-wide bus lock or something), at least when split across two cache lines, and there is special support for detecting them specifically, without complaining about the normal misaligned loads/stores that happen in memcpy for odd sizes. The mechanisms include a perf counter for it, and a recent addition of an MSR (Model Specific Register) config bit to let the kernel make them raise an exception.
Cache-line-split locked instructions can apparently be a problem in terms of letting unprivileged code on one core interfere with hard-realtime code on another core.
It seems unaligned access in x86 gives security troubles such as a Return Address Integrity issue.
How so?
The paper you linked mentions alignment of the Function Lookup Table in this proposed hardening mechanism. There are only two instances of the string "align" in the whole paper, and neither of them talk about ARMv7-M's support for unaligned load/store creating any difficulty. (ARMv7-M is the ISA they're discussing, since it's about hardening embedded systems.)

Why are semaphores limited in Linux?

We just ran out of semaphores on our Linux box, due to the use of too many Websphere Message Broker instances or somesuch.
A colleague and I got to wondering why this is even limited - it's just a bit of memory, right?
I thoroughly googled and found nothing.
Anyone know why this is?
cheers
Semaphores, when being used, require frequent access with very, very low overhead.
Having an expandable system where memory for each newly requested semaphore structure is allocated on the fly would introduce complexity that would slow down access to them: the kernel would first have to look up where the particular semaphore in question is stored, then go fetch that memory and check the value. It is easier and faster to keep them in one compact block of fixed memory that is readily at hand.
Having them dispersed throughout memory via dynamic allocation would also make it more difficult to efficiently use memory pages that are locked (that is, not subject to being swapped out when there are high demands on memory). The use of "locked in" memory pages for kernel data is especially important for time-sensitive and/or critical kernel functions.
Having the limit be a tunable parameter (see links in the comments of the original question) allows it to be increased at runtime if needed, via an "expensive" reallocation and relocation of the block. But typically this is done one time at system initialization, before anything much is even using semaphores.
That said, the amount of memory used by a semaphore set is rather tiny. With modern systems having many gigabytes of memory available, the original default limits on the number of them might seem a bit stingy. But keep in mind that on many systems semaphores are rarely used by user-space processes, and the Linux kernel finds its way into lots of small embedded systems with rather limited memory, so setting the default limit arbitrarily high on the chance it might be used seems wasteful.
The few software packages that do depend on having many semaphores available, such as the Oracle database, typically do recommend increasing the system limits in their installation and/or system-tuning advice.
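For reference, a small sketch (Linux, System V semaphores) of where the limits show up from user space; the tunable is the standard SEMMSL/SEMMNS/SEMOPM/SEMMNI tuple exposed via /proc/sys/kernel/sem, and semget typically fails with ENOSPC when the system-wide limits are hit:

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    /* Read the current System V semaphore limits. */
    FILE *f = fopen("/proc/sys/kernel/sem", "r");
    if (f) {
        int semmsl, semmns, semopm, semmni;
        if (fscanf(f, "%d %d %d %d", &semmsl, &semmns, &semopm, &semmni) == 4)
            printf("SEMMSL=%d SEMMNS=%d SEMOPM=%d SEMMNI=%d\n",
                   semmsl, semmns, semopm, semmni);
        fclose(f);
    }

    /* Ask for a set of 8 semaphores; when SEMMNI or SEMMNS would be
     * exceeded, semget fails with ENOSPC ("running out of semaphores"). */
    int id = semget(IPC_PRIVATE, 8, IPC_CREAT | 0600);
    if (id == -1)
        fprintf(stderr, "semget: %s\n", strerror(errno));
    else
        semctl(id, 0, IPC_RMID);   /* clean up the set we just created */
    return 0;
}
```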

Windows CE (RTOS) class-libraries for latency of interrupts and threads and USB?

I am getting started working with Windows CE, to utilize an RTOS to reduce latency concerns with interrupts and threads and USB. What class libraries (Visual C++) can you point me to that would be good to learn well, to speed up the learning curve?
Thanks
That's a really, really broad question. The most important piece of advice I'll give you is that if you're after determinism and speed (your reference to an RTOS leads me to think you consider these important) then you need to be aware that any memory allocation or deallocation in a piece of code makes it non-deterministic.
C++ classes often have allocations and deallocations buried in them, so whatever you choose (and whatever you write), use them wisely. Sometimes they'll allow you to provide custom allocators (e.g. Boost) which you can use to just pull memory from an already allocated heap you create somewhere.
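As a rough sketch of the "pull memory from an already allocated heap" idea (in C, and not Boost's actual pool interface, just the shape of a fixed block pool you carve out at startup so the real-time path never touches the general-purpose heap):

```c
#include <stddef.h>
#include <stdint.h>

/* Single-threaded sketch of a fixed-size block pool reserved up front.
 * Allocation and free are O(1) and never call the system allocator; a real
 * version needs locking or per-thread pools, and bounds/ownership checks. */
#define POOL_BLOCKS     64
#define POOL_BLOCK_SIZE 128

static uint8_t pool_storage[POOL_BLOCKS][POOL_BLOCK_SIZE];
static void   *pool_free_list[POOL_BLOCKS];
static int     pool_free_top;

void pool_init(void)
{
    for (int i = 0; i < POOL_BLOCKS; i++)
        pool_free_list[i] = pool_storage[i];
    pool_free_top = POOL_BLOCKS;
}

void *pool_alloc(void)      /* returns NULL when the pool is exhausted */
{
    return pool_free_top > 0 ? pool_free_list[--pool_free_top] : NULL;
}

void pool_free(void *p)     /* caller must pass back a block from this pool */
{
    pool_free_list[pool_free_top++] = p;
}
```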
Keep the real-time parts of the code as small and simple as possible.

Do you expect that future CPU generations are not cache coherent?

I'm designing a program and I found that assuming implicit cache coherency makes the design much, much easier. For example, my single-writer (always the same thread) multiple-reader (always other threads) scenarios are not using any mutexes.
It's not a problem for current Intel CPUs. But I want this program to generate income for at least the next ten years (a short time for software), so I wonder if you think this could be a problem for future CPU architectures.
I suspect that future CPU generations will still handle cache coherence for you. Without this, most mainstream programming methodologies would fail. I doubt any CPU architecture that will be used widely in the next ten years will invalidate the current programming model - it may extend it, but it's difficult to drop something so widely assumed.
That being said, programming with the assumption of implicit cache coherency is not always a good idea. There are many issues with false sharing that can easily be avoided if you purposefully try to isolate your data. Handling this properly can lead to huge performance boosts (rather, a lack of huge performance losses) on current generation CPUs. Granted, it's more work in the design, but it is often required.
We are already there. Computers claim cache coherency, but at the same time they have a temporary store buffer for writes, reads can be completed via this buffer instead of the cache (i.e. the store buffer has effectively become an incoherent cache), and invalidate requests are also queued, allowing the processor to temporarily use cache lines it knows are stale.
x86 doesn't use many of these techniques, but it does use some. As long as memory stays significantly slower than the CPU, expect to see more of these techniques, and others yet to be devised, come into use. Even Itanium, failed as it is, uses many of these ideas, so expect Intel to migrate them into x86 over time.
As for avoiding locks, etc.: it is always hard to gauge people's level of expertise over the Internet, so either you are misguided about what you think might work, or you are on the cutting edge of lock-free programming. Hard to tell.
Do you understand the MESI protocol, memory barriers and visibility? Have you read stuff from Paul McKenney, etc?
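For the single-writer / multiple-reader scenario in the question, here's a minimal sketch of doing the publication through the C11 memory model rather than relying on x86's strong ordering; the release store and acquire load insert whatever barriers the target needs, so it stays correct on more weakly ordered hardware too (this is a one-shot handoff for illustration, not a full lock-free queue):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int payload;            /* plain data written only by the writer thread */
} message_t;

static message_t   slot;
static atomic_bool ready;   /* static storage: zero-initialized, i.e. false */

void writer_publish(int value)
{
    slot.payload = value;                                     /* 1: write data */
    atomic_store_explicit(&ready, true,
                          memory_order_release);              /* 2: publish    */
}

bool reader_try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))  /* pairs with 2  */
        return false;
    *out = slot.payload;    /* safe: acquire load is ordered after the release */
    return true;
}
```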
I don't know per se. But I'd like to see a trend toward non-cache coherent modes.
The conceptual mind shift is significant (can't just pass data in a method call, must pass it through a queue to an async method), but it's required as we move more and more into a multicore world anyway. The closer we get to one processor per memory bank the better. Because then we're working in a world of network message routing, where data is just not available rather than having threads that can silently stomp on data.
However, as Reed Copsey points out, the whole x86 world of computing is built on the assumption of cache coherency (which is even bigger than Microsoft's market share!). So it won't go away any time soon!
Here is a paper from reputable authors in the computer architecture area which argues that cache coherence is here to stay.
http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf
"Why On-Chip Cache Coherence Is Here to Stay" by Martin, Hill and Sorin
You are making a strange request. You are asking for our (the SO community's) assumptions about future CPU architectures - a very dangerous proposition. Are you willing to put your money where our mouth is? Because if we're wrong and your application fails, it will be you who's not making any money.
Anyway, I would suspect things are not going to change that dramatically because of all the legacy code that was written for single threaded execution but that's just my opinion.
The question seems misleading to me. The CPU architecture is not that important; what is important is the memory model of the platform you are working with.
You are developing the application in some environment, with some defined memory model. E.g. if you are currently targeting x86, you can be pretty sure any future platform will implement the same memory model when it is running x86 code. The same is true for Java or .NET VMs and other execution platforms.
If you expect to port your current application to some other platform and that platform's memory model is different, you will have to adjust for it, but in such a case you are the one doing the port and you have complete control over how you do it. This is, however, true even for current platforms; e.g. the PowerPC memory model allows many more reorderings to happen than the x86 one.

When should I write a Linux kernel module?

Some people want to move code from user space to kernel space in Linux for some reason. A lot of times the reason seems to be that the code should have particularly high priority or simply "kernel space is faster".
This seems strange to me. When should I consider writing a kernel module? Is there a set of criteria?
How can I motivate keeping code in user space that (I believe) belongs there?
Rule of thumb: try your absolute best to keep your code in user-space. If you don't think you can, spend as much time researching alternatives to kernel code as you would writing the code (ie: a long time), and then try again to implement it in user-space. If you still can't, research more to ensure you're making the right choice, then very cautiously move into the kernel. As others have said, there are very few circumstances that dictate writing kernel modules and debugging kernel code can be quite hellish, so steer clear at all costs.
As far as concrete conditions you should check for when considering writing kernel-mode code, here are a few: Does it need access to extremely low-level resources, such as interrupts? Is your code defining a new interface/driver for hardware that cannot be built on top of currently exported functionality? Does your code require access to data structures or primitives that are not exported out of kernel space? Are you writing something that will be primarily used by other kernel subsystems, such as a scheduler or VM system (even here it isn't entirely necessary that the subsystem be kernel-mode: Mach has strong support for user-mode virtual memory pagers, so it can definitely be done)?
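If the answer to those questions really is yes and you do end up in the kernel, the mechanical starting point is small; a minimal loadable-module skeleton looks roughly like this (the hard part is everything you hang off it, e.g. registering a device, filesystem, or other interface):

```c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

/* Minimal module skeleton: the init/exit pair is all the kernel requires.
 * Real modules register their functionality in the init function and tear
 * it down again in the exit function. */
static int __init example_init(void)
{
    pr_info("example module loaded\n");
    return 0;                     /* returning nonzero fails the insmod */
}

static void __exit example_exit(void)
{
    pr_info("example module unloaded\n");
}

module_init(example_init);
module_exit(example_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example module skeleton");
```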
There are very limited reasons to put stuff into the kernel. If you're writing device drivers it's ok. Any standard application: never.
The drawbacks are huge. Debugging gets harder, errors become more frequent and hard to find. You might compromise security and stability. You might have to adapt to kernel changes more frequently. It becomes impossible to port to other UNIX OSs.
The closest I've ever come to the kernel was a custom filesystem (with mysql in the background) and even for that we used FUSE (where the U stands for userspace).
I'm not sure the question is the right way around. There should be a good reason to move things to kernel space. If there aren't any reasons, don't do it.
For one thing, debugging is made harder, and the effect of bugs is far worse (crash/panic instead of simple coredump).
Basically, I agree with rpj. Code has to be in user-space, unless it's REALLY necessary.
But, to come back to your question: under which conditions?
Some people claim that a driver has to be in the kernel, which is not true. Some drivers are not timing-sensitive; in fact, lots of drivers are like that.
For example, framers, RTC timers, I2C devices, etc. Those drivers can easily be moved to user space. There are even some filesystems that are written in user space.
You should move to kernel space when the overhead, e.g. of user/kernel switches, becomes unacceptable for your code to work properly.
But there are lots of ways to deal with this. For example, /dev/mem provides a good way to access physical memory, just like you would from kernel space.
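A minimal sketch of the /dev/mem approach (Linux, C); PHYS_ADDR is a made-up register address for illustration, it must be page-aligned, and this typically requires root plus a kernel built without strict /dev/mem restrictions:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_ADDR 0x10000000UL   /* hypothetical device register base */
#define MAP_LEN   4096UL

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* Map one page of physical address space into this process. */
    volatile uint32_t *regs = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, PHYS_ADDR);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("register 0 reads 0x%08x\n", regs[0]);  /* device register access */

    munmap((void *)regs, MAP_LEN);
    close(fd);
    return 0;
}
```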
When people talk about going to RTOS, I'm usually skeptical.
These days, processors are so powerful that most of the time the real-time aspect becomes negligible.
But even if, let's say, you're dealing with SONET and you need to do protection switching within 50ms (actually even less, since the 50ms constraint applies to the whole ring), you still can do the switching very fast, IF your hardware supports it.
Lots of framers these days can give you hardware support that reduces the amount of writes that you need to do. Your job is basically to respond to the interrupt as quickly as possible. And Linux is not bad at all. The interrupt latency I got was less than 1ms, even with tons of other interrupts running (e.g. IDE, Ethernet, etc.).
And if that's still not enough, then maybe your hardware design is wrong. Some things are better left on the hardware. And when I said hardware, I mean ASIC, FPGA, Network Processor, or other advanced logic.
Code running in the kernel accesses memory, peripherals, system functions in ways that are different from userspace code and thus has the ability to be more efficient. Not to mention the reduced security restrictions for kernel code. However, all this usually comes at a cost, such as increasing the possibility of opening the kernel up to security threats, locking up the OS, complicating the debugging, and so forth.
If your people want really high priority, determinism, low latency etc, the right way to go is to use some real-time version of Linux (or other OS).
Also look at the preemptible kernel options etc. Exactly what you should do depends on the requirements, but to put the code in kernel modules is not likely the right solution, unless you are interfacing some hardware directly.
Another reason not to move code into kernel space is that when you use it in production or commercial situations, you will have to publish that code due to the GPL license, a situation that many software companies don't want to be in. :)
As a general rule: think about what you want to do, and if it is something you would see in an operating-system development book or class, then it has a good chance of belonging in the kernel. If not, keep it out of the kernel. If you have a very good reason to break that rule, be sure you have enough knowledge to know it yourself, or that you are working with someone who has that knowledge.
Yes, that might sound harsh, but it's exactly what I mean: if you don't know, then be almost sure the answer is no, don't do it in the kernel. Moving your development to kernel space opens a giant can of worms that you must be sure you can handle.
If you just need lower latency, higher throughput, etc., it is probably cheaper to buy a faster computer than to develop kernel code.
Kernel modules may be faster (due to fewer context switches, less system-call overhead, and fewer interruptions), and certainly do run at very high priority. If you want to export a small amount of fairly simple code into kernel space, this might be OK. That is, if a small piece of code is found to be crucial to performance, and is the sort of code that would benefit from being placed in kernel mode, then it may be justified to place it there.
But moving large parts of your program into kernel space should be avoided unless all other options are completely exhausted. Aside from the difficulty of doing so, the performance benefit is not likely to be very large.
If you're asking such a question, then you shouldn't go to the kernel layer. Basically, just having to wonder means you don't need to. The cost of a context switch is so negligible these days that it doesn't matter anyway.

Resources