Is a stack overflow a security hole? - security

Note: this question relates to stack overflows (think infinite recursion), NOT buffer overflows.
If I write a program that is correct, but it accepts an input from the Internet that determines the level of recursion in a recursive function that it calls, is that potentially sufficient to allow someone to compromise the machine?
I know someone might be able to crash the process by causing a stack overflow, but could they inject code? Or does the C runtime detect the stack-overflow condition and abort cleanly?
Just curious...

Rapid Refresher
First off, you need to understand that the fundamental units of protection in modern OSes are the process and the memory page. Processes are memory protection domains; they are the level at which an OS enforces security policy, and they thus correspond strongly with a running program. (Where they don't, it's either because the program is running in multiple processes or because the program is being shared in some kind of framework; the latter case has the potential to be “security-interesting” but that's 'nother story.) Virtual memory pages are the level at which the hardware applies security rules; every page in a process's memory has attributes that determine what the process can do with the page: whether it can read the page, whether it can write to it, and whether it can execute program code on it (though the third attribute is rather more rarely used than perhaps it should be). Compiled program code is mapped into memory into pages that are both readable and executable, but not writable, whereas the stack should be readable and writable, but not executable. Most memory pages are not readable, writable or executable at all; the OS only lets a process use as many pages as it explicitly asks for, and that's what memory allocation libraries (malloc() et al.) manage for you.
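As a hedged illustration of those per-page permissions on a POSIX-style system (the page manipulation shown here is chosen purely for the example):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        /* Ask the OS for one page that is readable and writable (like stack
           or heap memory) but deliberately not executable. */
        unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 0xC3;                  /* fine: the page is writable */
        /* ((void (*)(void))p)();        would fault: the page is not executable */

        /* Revoke write access, much as the loader does for mapped program code. */
        mprotect(p, page, PROT_READ);
        /* p[0] = 0;                     would now fault: the page is read-only */

        munmap(p, page);
        return 0;
    }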
Analysis
Provided each stack frame is smaller than a memory page[1], so that as the program advances down the stack it writes to each page in turn, the OS (i.e., the privileged part of the runtime) can at least in principle detect stack overflows reliably and terminate the program when one occurs. All that is needed for this detection is a page at the end of the stack that the program cannot write to (a guard page); if the program tries to write to it, the memory-management hardware traps the access and the OS gets a chance to intervene.
The potential problems with this arise if the OS can be tricked into not setting up such a guard page, or if a stack frame can become so large and so sparsely written that the guard page is jumped over entirely. (Keeping more guard pages would help prevent the jump-over case at little cost; forcing variable-sized stack allocations – e.g., alloca() – to write through the space they allocate before returning control to the program, and so trip the guard page, would also close it at some cost in speed, though the writes could be reasonably sparse – one per page – to keep that cost fairly small.)
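A hedged sketch of the jump-over case and the page-at-a-time mitigation (the 4 kB page size and the function names are invented for the example; compilers can emit equivalent probes for you, e.g. GCC and Clang's -fstack-clash-protection):

    #include <alloca.h>
    #include <stddef.h>

    /* Dangerous pattern: if n is attacker-controlled and larger than the guard
       region, the single write below can land beyond the guard page entirely. */
    void risky(size_t n) {
        char *buf = alloca(n);          /* moves the stack pointer down by n */
        buf[0] = 0;                     /* one write, possibly past the guard */
    }

    /* Mitigation sketch: touch the allocation one page at a time, starting from
       the end nearest the existing stack, so any overflow is forced to hit the
       guard page and be trapped by the OS before anything else is reached. */
    void probed(size_t n) {
        const size_t page = 4096;       /* assumed page size for the example */
        char *buf = alloca(n);
        for (size_t off = n; off > page; off -= page)
            buf[off - 1] = 0;
        if (n) buf[0] = 0;
    }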
Consequences
What are the consequences of this? Well, the OS has to do the right thing with memory management. (@Michael's link illustrates what can happen when it gets that wrong.) But it is also dangerous to let an attacker determine memory-allocation sizes where you don't force a write to the whole allocation immediately; alloca and C99 variable-length arrays are a particular threat. Moreover, I would be more suspicious of C++ code, as it tends to do a lot more stack-based memory allocation; it might be OK, but there's a greater potential for things to go wrong.
Personally, I prefer to keep stack sizes and stack-frame sizes small anyway and do all variable-sized allocations on the heap. In part, this is a legacy of working on some types of embedded system and with code which uses very large numbers of threads, but it does make protecting against stack overflow attacks much simpler; the OS can reliably trap them and all the attacker has then is a denial-of-service (annoying, but rarely fatal). I don't know whether this is a solution for all programmers.
[1] Typical page sizes: 4 kB on most systems; some 64-bit platforms use 16 kB or larger pages. Check your system documentation for what it is in your environment.

Most systems (like Windows) exit when the stack is overflowed. I don't think you are likely to see a security issue here. At least, not an elevation of privilege security issue. You could get some denial of service issues.

There is no universally correct answer... on some systems the stack might grow down or up to overwrite other memory that the program is using (or another program, or the OS), but on any well-designed, vaguely security-conscious OS (Linux, any common UNIX variant, even Windows) there will be no rights-escalation potential. On some systems, with stack size checks disabled, the address space might approach or exceed the free virtual memory size, allowing memory exhaustion to negatively affect or even bring down the entire machine rather than just the process, but on good OSes there's a limit on that by default (e.g. Linux's limit / ulimit commands).
Worth mentioning that it's typically pretty easy to use a counter to put an arbitrary but generous limit on recursion depth too: you can use a static local variable if single-threaded, or an extra trailing parameter (conveniently defaulted to 0 if your language allows it, else have an outer caller provide 0 the first time), as in the sketch below.
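A minimal sketch of the extra-parameter variant in C (the node type, the limit, and the error convention are invented for the example; propagating the error out of partial results is omitted):

    #define MAX_DEPTH 10000                 /* generous but finite; tune to your stack */

    typedef struct node { struct node *left, *right; } node;

    /* depth is the extra trailing parameter; outside callers pass 0. */
    long count_nodes(const node *n, int depth) {
        if (depth > MAX_DEPTH)
            return -1;                      /* refuse instead of overflowing the stack */
        if (n == NULL)
            return 0;
        return 1 + count_nodes(n->left, depth + 1)
                 + count_nodes(n->right, depth + 1);
    }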

Yes, it is. There are algorithms that let you avoid recursion. For example, in the case of arithmetic expressions, reverse Polish notation lets you evaluate without recursing; the main idea is to transform the original expression, or to use an explicit stack whose size you control (see the sketch below). Other algorithms may help you as well.
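A hedged sketch of the idea in C: a reverse-Polish expression can be evaluated with a plain loop and an explicit, bounded operand stack instead of recursive descent, so the input can no longer drive the call depth (tokenizing and error handling are pared down for the example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define STACK_MAX 256               /* an explicit bound we choose, not the OS */

    /* Evaluate space-separated RPN made of integers and the + - * operators. */
    long eval_rpn(const char *expr) {
        long stack[STACK_MAX];
        int top = 0;
        char copy[256];
        strncpy(copy, expr, sizeof copy - 1);
        copy[sizeof copy - 1] = '\0';

        for (char *tok = strtok(copy, " "); tok; tok = strtok(NULL, " ")) {
            if (strcmp(tok, "+") == 0 || strcmp(tok, "-") == 0 || strcmp(tok, "*") == 0) {
                if (top < 2) { fprintf(stderr, "malformed expression\n"); exit(1); }
                long b = stack[--top], a = stack[--top];
                stack[top++] = (*tok == '+') ? a + b : (*tok == '-') ? a - b : a * b;
            } else {
                if (top == STACK_MAX) { fprintf(stderr, "expression too deep\n"); exit(1); }
                stack[top++] = strtol(tok, NULL, 10);
            }
        }
        return top ? stack[top - 1] : 0;
    }

    int main(void) {
        printf("%ld\n", eval_rpn("3 4 + 2 *"));   /* (3 + 4) * 2 = 14 */
        return 0;
    }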
Another problem with stack overflow is that, if error handling is not appropriate, it can lead to almost anything. For example, in Java StackOverflowError is an Error, so it is caught by anyone who catches Throwable, which is a common mistake. So error handling is a key question in the case of stack overflow.

Yes, it is. Availability is an important aspect of security, and one that is often overlooked.
Don't fall into that pit.
edit
As an example of poorly understood security-consciousness in modern OSs, take a look at a relatively recently discovered vulnerability that has not yet been patched completely. There are countless other examples of privilege-escalation vulnerabilities that OS developers have written off as denial-of-service attacks.

Related

Why can't we have a safe ISA?

According to this paper: https://doi.org/10.1109/SP.2013.13, memory corruption bugs are one of the oldest problems in computer security. The lack of memory and type safety has caused countless bugs, costing billions of dollars and huge effort to fix.
But the root of C/C++'s memory vulnerability can be traced down to the ISA level. At the ISA level, every instruction can access any memory address without any fine-grained safety check (only coarse-grained checks like page faults). Sure, we can implement memory safety at a higher, software level, as Java (the JVM) does, but this comes at a significant performance cost. In a word, we can't have both safety and performance at the same time on existing CPUs.
My question is: why can't we implement safety at the hardware level? If the CPU had a safe ISA, which ensured memory safety by, I don't know, taking over the responsibilities of malloc and free, then maybe we could avoid the performance penalty of software safety checking. Can anyone with a background in microelectronics tell me whether this idea is realistic?
Depending on what you mean, it could make it impossible to implement memory-unsafe languages like C in a normal way, e.g. every memory access would have to be to some object that has a known size. I'd guess an operating system for such a machine might have to work around that "feature" by telling it that the entire address space was one large array object. Or else you'd need some mechanism for a read system call to know the proper bounds of the object it's writing into during the copy_to_user() part of its job. And then there's other OS stuff like accessing the same physical page from different virtual pages.
The OP (via asking on Reddit) found the CHERI project which is an attempt at this idea, involving "... revisit fundamental design choices in hardware and software to dramatically improve system security." Changing hardware alone can't work; compilers need to change, too. But they were able to adapt "Clang/LLVM, FreeBSD, FreeRTOS, and applications such as WebKit," so their approach could be practical. (Unlike the hypothetical versions I was imagining when writing other parts of this answer.)
CHERI uses "fine-grained memory protection", and "Language and compiler extensions" to implement memory-safe C and C++, and higher-level languages.
So it's not a drop-in replacement, and it sounds like you have to actively use the features to gain safety. As I argue in the rest of the answer, hardware can't do it alone, and it's highly non-trivial even with software cooperation. It's easy to come up with ways that wouldn't work. :P
For hardware-enforced memory safety to be possible, hardware would have to know about every object and its size, and be able to cache that structure in a way that allows efficient lookups to find the bounds. Page tables (4k granularity, or larger in more modern ISAs) are already hard enough for hardware to cache efficiently for large programs, and that's without even considering which pointer goes with which object.
Checking the TLB as part of every load and store can be done efficiently, but checking another structure in parallel with that might be problematic, especially when the ranges don't have power-of-2 sizes and natural alignment the way pages do. (That alignment is what makes it possible to build a TLB from content-addressable memory that checks for a match against each of several possible values for the high bits; e.g. a page is 4k in size and always starts at a 4k alignment boundary.)
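A hedged sketch of why that alignment matters (names invented): a page match reduces to masking off the low bits and comparing the rest, which is exactly what one CAM entry can do, whereas an arbitrary object range needs two full-width comparisons per entry.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Page match: a single mask-and-compare on the high bits (4k page here). */
    static bool in_page(uintptr_t addr, uintptr_t page_base) {
        return (addr & ~(uintptr_t)0xFFF) == page_base;
    }

    /* Arbitrary object match: two full comparisons, no convenient bit trick. */
    static bool in_object(uintptr_t addr, uintptr_t base, size_t len) {
        return addr >= base && addr - base < len;
    }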
You mean it may cost too much at hardware level, like the die area?
Die area might not even be the biggest problem, especially these days. It would cost power, and/or latency on very important critical paths such as L1d load-use latency, even if you could come up with some plausible way for software to build tables that hardware could check, or otherwise solve the other parts of this problem.
Modifying a page-table entry requires invalidating the entry, including TLB shootdown for other cores. If every free (and some malloc) cost inter-core communication to do similar things for object tables, that would be very expensive.
I think inventing a way for software to tell the hardware about objects would be an even bigger problem. malloc and free aren't something you can just build in to a CPU where memory addressing works anything like existing CPUs, or like it does in C. Software needs to manage memory, it doesn't make sense to try to build that in to a CPU. So then malloc and free (and mmap with file-backed mappings and shared memory...) need a way to tell the CPU about objects. Seems like a mess.
I think at best an ISA could provide more tools software can use to make bounds-checks cheaper. Perhaps some kind of extra semantics on loads/stores, like an extra operand for indexed addressing modes for load or store that takes a max?
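For instance, here's a hedged sketch (names invented) of the pure-software check that such a bounded-load operand could fold into the access itself:

    #include <stdlib.h>

    /* Software version of the check a hypothetical bounded-load instruction
       could absorb into the memory access itself. */
    static inline int load_checked(const int *base, size_t idx, size_t len) {
        if (idx >= len)        /* the comparison the ISA extension would take over */
            abort();           /* or raise a hardware fault */
        return base[idx];
    }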
At least if we want an ISA to work anything like current ones, rather than work like a JVM or a Transmeta Crusoe and internally recompile for some real ISA.
Intel's MPX ISA extension to x86 was an attempt to let software set up bound ranges, but it's been mostly abandoned due to lower performance than pure software. Intel even dropped it from their recent CPUs (Not present in 10th Gen CPUs using 10nm lithography, or later.)
This is all just off the top of my head; I haven't searched for any serious proposals for how a system could plausibly work.
I don't think memory safety is something you can easily add after the fact to languages like C that weren't originally designed with it.
Have a look at "Code for malloc and free" on SO. Those functions are very, very far away from even being something you could define within an instruction set.

Does using the program stack involves syscalls?

I'm studying operating system theory, and I know that heap allocation involves a specific syscall, and that allocators usually optimize for this by requesting more memory than is needed up front.
But I can't find information about stack allocation. What about it? Does it involve a specific syscall every time you read from or write to the stack (for example, when you call a function with some parameters)? Or is there some other mechanism that doesn't involve a syscall?
Typically when the OS starts your program it examines the executable file's headers and arranges various areas for various things (an area for your executable's code, an area for your executable's data, etc.). This includes setting up an initial stack (and a lot more, e.g. finding shared libraries and doing dynamic linking).
After the OS has done all this, your executable starts executing. At this point you already have memory for a stack and can just use it without any system calls.
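As a hedged illustration (the function is invented for the example): an ordinary local array is carved out of the already-mapped stack just by adjusting the stack pointer on function entry, so running code like this under strace shows no per-call allocation syscalls.

    #include <string.h>

    /* No syscall here: the compiler simply subtracts ~4 KB from the stack
       pointer on entry and adds it back on return. */
    int sum_locally(const char *src, size_t n) {
        char scratch[4096];
        if (n > sizeof scratch) n = sizeof scratch;
        memcpy(scratch, src, n);
        int total = 0;
        for (size_t i = 0; i < n; i++)
            total += scratch[i];
        return total;
    }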
Note 1: If you create threads, then there will probably be a system call involved to create the thread and that system call will probably allocate memory for the new thread's stack.
Note 2: Typically there's "virtual memory" (what your program sees) and "physical memory" (what the hardware sees); and in between typically the OS does lots of tricks to improve performance and avoid wasting physical memory, and to hide resource limits (so you don't have to worry so much about running out of physical memory). One of these tricks is to allocate virtual memory (e.g. for a large stack) without allocating any actual physical memory, and then allocate the physical memory if/when the virtual memory is first modified. Other tricks include various "swap space" schemes, and memory mapped files. These tricks rely on requests generated by the CPU on your program's behalf (e.g. page fault exceptions) which aren't system calls, but have similar ("ask kernel to do something") characteristics.
Note 3: All of the above depends on which OS. Different operating systems do things differently. I've chosen words carefully - e.g. "Typically" means that most modern operating systems work like I've described (but "typically" does not imply that all possible operating systems work like that; and some operating systems do not work like I've described).
No, the stack is ordinary memory. From the process's point of view there is no difference (hence the nasty class of bugs where you return a pointer to data on the stack, and that data has been overwritten by the time it's used).
As Brendan wrote, the OS sets up the stack for the process at program load time. But if you access a not-yet-allocated page of the stack (e.g. because your stack is growing), the kernel may automatically allocate a new stack page for you. (This is not much different from trying to allocate new heap memory when there is none left in the program's address space, except that in the heap case you explicitly make a syscall to tell the kernel you want more memory.)
You will notice that usually the stack grows in one direction and the heap (allocated memory) in the other, typically toward each other. So if your program needs more stack there is space for it, and if it does not need much stack, that memory can be used for, e.g., a huge array. Or the opposite: if you do a lot of recursion, you use a lot of stack (but then probably need less heap memory).
Two additional considerations: the CPU may have special stack instructions, but you can view them as syntactic sugar (you can simulate PUSH and POP with MOV, and CALL and RET with JMP plus the simulated PUSH and POP).
And the kernel may use a separate stack for its own purposes (especially important for interrupts).

What is the mincore syscall used for in userland applications?

What is Linux' mincore(2) useful for in userland applications? Why is it exposed to non-privileged users?
I can imagine some databases taking advantage of knowing which pages are cached but what are some other examples?
What is Linux mincore(2) useful for in userland applications?
I'd say that's most probably opinion based. Profiling, statistics, performance evaluation and stuff like that comes to mind. Other than that, I cannot think of other realistic legitimate use cases.
Here are some examples of programs I found that use mincore (as you can see, all profiling/statistics related); a minimal usage sketch follows the list:
http://man7.org/linux/man-pages/man1/fincore.1.html
https://github.com/fornwall/pagecache
https://github.com/bwaldvogel/mongocachestat
https://github.com/touzaniMarouane/BlackLab
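A hedged sketch of how tools like the ones above typically use mincore: map the file, then ask the kernel which of its pages are resident in the page cache (error handling trimmed for brevity):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Print how many pages of FILE are currently resident in the page cache. */
    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) return 1;

        void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t pages = ((size_t)st.st_size + page - 1) / page;
        unsigned char *vec = malloc(pages);

        if (mincore(map, (size_t)st.st_size, vec) == 0) {   /* one byte per page */
            size_t resident = 0;
            for (size_t i = 0; i < pages; i++)
                resident += vec[i] & 1;                     /* bit 0 = resident */
            printf("%zu of %zu pages resident\n", resident, pages);
        }

        free(vec);
        munmap(map, (size_t)st.st_size);
        close(fd);
        return 0;
    }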
Why is it exposed to non-privileged users?
This would actually have been a good question around one year ago, when the syscall's semantics weren't clearly defined and the existence of such a syscall was rather questionable. Prior to kernel version 4.14.2, mincore could in fact have been abused to leak uninitialized kernel memory from user space (see CVE-2017-16994 and the related Project Zero bug report).
Since then the syscall has been patched and its semantics updated. The only thing that a process can do by invoking it is to query information about its virtual memory map. Nothing harmful really, just self-inspection, hence the availability to unprivileged processes. There's no real reason to make it a privileged syscall, which if done could be also considered an API breakage.

Detect Stack overflows

How do operating systems detect stack overflows of user-space programs [and then send SIGTERM or SIGSEGV to those userspace programs] ?
Guard pages. When the OS creates the stack for the program it will allocate a little bit more than is specified. The memory is allocated in pages (usually 4KB each), and the extra page will have settings such that any attempt to access it will result in an exception being thrown.
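A hedged sketch of what that looks like from the program's side on Linux: unbounded recursion eventually touches the guard region, the kernel turns that into SIGSEGV, and a handler installed on an alternate signal stack (since the normal stack is exhausted) can still report it. Sizes and names here are chosen for the example.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Runs on the alternate stack, because the normal stack is exhausted. */
    static void on_segv(int sig) {
        (void)sig;
        static const char msg[] = "guard page hit: stack overflow detected\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }

    static int recurse(int n) {          /* unbounded recursion to force the fault */
        volatile char pad[1024];
        pad[0] = (char)n;
        return pad[0] + recurse(n + 1);
    }

    int main(void) {
        static char altstack[64 * 1024]; /* arbitrary size, comfortably enough */
        stack_t ss = { .ss_sp = altstack, .ss_size = sizeof altstack, .ss_flags = 0 };
        sigaltstack(&ss, NULL);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_segv;
        sa.sa_flags = SA_ONSTACK;        /* deliver SIGSEGV on the alternate stack */
        sigaction(SIGSEGV, &sa, NULL);

        return recurse(0);
    }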
The answer will depend on the target architecture and the particular OS. Since the question is tagged Linux, you have rather biased the question which on the face of it seems more general.
In a sophisticated OS or RTOS such as Linux or QNX Neutrino, with MMU protection support, memory protection mechanisms may be used such as the guard pages already mentioned. Such OSs require a target with an MMU of course.
Simpler OSs and typical RTOS scheduling kernels without MMU support may use a number of methods. The simplest is to place a guard signature at the top of the stack, which is checked for modification when the scheduler runs. This is a bit hit-and-miss: it requires that the stack overflow actually modifies the signature, and that the resulting corruption does not cause a crash before the scheduler next runs. Some systems with on-chip debug resources may be able to place an access breakpoint on the signature word and cause an exception when it is hit.
In development a common technique is to initially fill each thread stack with a signature and to have a thread periodically check for the "high-tide" and issue a warning if it exceeds a certain percentage level.
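A hedged sketch of that "paint and measure" technique (the sizes, names, and single-thread simplification are all invented for the example; RTOSes typically provide this built in):

    #include <stddef.h>
    #include <stdint.h>

    #define STACK_WORDS 1024
    #define STACK_PAINT 0xDEADBEEFu             /* signature value, arbitrary */

    static uint32_t thread_stack[STACK_WORDS];  /* stack handed to the thread */

    /* Called once before the thread starts. */
    void stack_paint(void) {
        for (size_t i = 0; i < STACK_WORDS; i++)
            thread_stack[i] = STACK_PAINT;
    }

    /* Called periodically: counts untouched words from the far end of the
       stack; a small result means the thread came close to overflowing. */
    size_t stack_headroom_words(void) {
        size_t free_words = 0;
        /* assumes a descending stack, so the far end is index 0 */
        while (free_words < STACK_WORDS && thread_stack[free_words] == STACK_PAINT)
            free_words++;
        return free_words;
    }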
As well as guard pages mentioned in another answer, some smaller (MMU-less) embedded microcontrollers have specific exceptions for stack overflow (and underflow).

Nvidia Information Disclosure / Memory Vulnerability on Linux and General OS Memory Protection

I thought this was expected behavior?
From: http://classic.chem.msu.su/cgi-bin/ceilidh.exe/gran/gamess/forum/?C35e9ea936bHW-7675-1380-00.htm
Paraphrased summary: "Working on the Linux port we found that cudaHostAlloc/cuMemHostAlloc CUDA API calls return un-initialized pinned memory. This hole may potentially allow one to examine regions of memory previously used by other programs and Linux kernel. We recommend everybody to stop running CUDA drivers on any multiuser system."
My understanding was that "Normal" malloc returns un-initialized memory, so I don't see what the difference here is...
The way I understand how memory allocation works would allow the following to happen:
-userA runs a program on a system that crunches a bunch of sensitive information. When the calculations are done, the results are written to disk, the processes exits, and userA logs off.
-userB logs in next. userB runs a program that requests all available memory in the system, and writes the content of his un-initialized memory, which contains some of userA's sensitive information that was left in RAM, to disk.
I have to be missing something here. What is it? Is memory zero'd-out somewhere? Is kernel/pinned memory special in a relevant way?
Memory returned by malloc() may be nonzero, but only after being used and freed by other code in the same process. Never another process. The OS is supposed to rigorously enforce memory protections between processes, even after they have exited.
Kernel/pinned memory is only special in that it apparently gave a kernel mode driver the opportunity to break the OS's process protection guarantees.
So no, this is not expected behavior; yes, this was a bug. Kudos to NVIDIA for acting on it so quickly!
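A hedged sketch of the distinction being drawn (allocator behaviour varies, so the reuse shown here may or may not happen on a given run): stale contents can only come from earlier activity in the same process, never from another process, because the kernel zero-fills pages before handing them to a new owner.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *a = malloc(64);
        strcpy(a, "secret written earlier in this same process");
        void *old = a;
        free(a);

        /* The allocator may hand back the same chunk, still holding (some of)
           the old bytes; it will never hand back another process's memory.
           Strictly speaking, comparing a dangling pointer and reading
           uninitialized bytes is undefined; this is only a demonstration. */
        unsigned char *b = malloc(64);
        printf("same block reused? %s\n", (void *)b == old ? "yes" : "no");
        printf("first bytes now: ");
        for (int i = 0; i < 16; i++)
            printf("%02x ", b[i]);
        printf("\n");

        free(b);
        return 0;
    }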
The only part of a CUDA installation that requires root privileges is the NVIDIA driver. As a result, all operations done with the NVIDIA compiler and linker can be done using regular system calls and standard compilation (provided you have the proper information -lol-). If any security hole lies there, it remains, whether or not cudaHostAlloc/cuMemHostAlloc is modified.
I am dubious about the first answer on this post. The man page for malloc specifies that the memory is not cleared, and the man page for free does not mention any clearing of the memory either.
Clearing memory therefore seems to be the responsibility of whoever codes the sensitive section -lol-, which leaves the problem of an unexpected (rare) exit. Apart from VMS (a good but not widely used OS), I don't think any OS accepts the performance cost of systematically clearing memory. I am also not clear how the system could track, within newly allocated heap memory, what was previously part of the process's address space and what was not.
My conclusion is: if you need a strict level of privacy, do not use a multi-user system
(or use VMS).

Resources