Why do ROP attacks occur despite buffer overflow detection? - security

I read these sentences on Wikipedia about ROP:
"Return-oriented programming is an advanced version of a stack smashing attack. Generally, these types of attacks arise when an adversary manipulates the call stack by taking advantage of a bug in the program, often a buffer overrun."
That means that if buffer overruns don't occur, ROP cannot occur. But some compilers (in my case LLVM) support detection of buffer overflows, yet defense against ROP is still an open problem for them.
I'm confused. Is there something that I didn't consider?

According to this Wikipedia article:
"Clang supports three buffer overflow detectors, namely AddressSanitizer (-fsanitize=address), -fsanitize=bounds, and SafeCode. These systems have different tradeoffs in terms of performance penalty, memory overhead, and classes of detected bugs."
So it can only detect certain classes of bugs (not all of them), which means that it has false negatives.
The problem mainly lies with the fact that any static analysis of programs cannot, in general, be both sound and complete. That is, any static analysis trying to detect buffer overflow will have false positives, false negatives, or both. This is a corollary of Rice's theorem, which intuitively states that any nontrivial property of a program's behavior is generally undecidable. The word "generally" here is important and means for all programs.
A false positive is when a static analysis flags a program statement as a buffer overflow while it is not.
A false negative is when a static analysis flags a program statement as a safe buffer access while it is not.
The most widely adopted approach in many fields, not just buffer overflow detection (e.g., signature-based intrusion detection), is to tolerate false negatives rather than false positives, because false positives would otherwise be too numerous, inundating programmers and obscuring the real problems. That approach is also applied if the detection problem is decidable but too complex (e.g., NP-hard) to solve exactly. Bottom line, approximations permeate computer science.

Several techniques can be used to detect the possibility of a buffer overflow at run-time.
One, very cheap and enabled by default in code generated by modern compilers, only detects buffer overflows on the stack, only probabilistically (the buffer overflow has, say, (2^32 - 1)/2^32 chances of being detected), and only checks for them just before returning from protected functions (which is a very good time to check but means the detection is not instantaneous). It works by inserting a “canary” value that the attacker wouldn't be able to predict at the top of the stack frame when the function begins its execution, and by checking the value of the canary at the end of the function's execution.
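As a rough illustration of the canary mechanism, here is a toy Python model of a stack frame. The layout (a 16-byte buffer, then a 4-byte canary, then a 4-byte return address) and the names Frame, unchecked_copy and canary_intact are assumptions made for this sketch; a real compiler emits the equivalent logic in machine code.

```python
import secrets

BUF_SIZE = 16      # hypothetical size of the local buffer
CANARY_SIZE = 4    # a 32-bit canary


class Frame:
    """Toy stack frame laid out as [buffer | canary | return address]."""

    def __init__(self, return_address: bytes):
        # A fresh, unpredictable canary is chosen when the function starts.
        self.canary = secrets.token_bytes(CANARY_SIZE)
        self.memory = (bytearray(BUF_SIZE)
                       + bytearray(self.canary)
                       + bytearray(return_address))

    def unchecked_copy(self, data: bytes) -> None:
        # Models an unchecked copy into the buffer: data longer than
        # BUF_SIZE spills over the canary and the return address.
        self.memory[0:len(data)] = data

    def canary_intact(self) -> bool:
        # The single check performed just before the function returns.
        stored = bytes(self.memory[BUF_SIZE:BUF_SIZE + CANARY_SIZE])
        return stored == self.canary
```

A copy of at most BUF_SIZE bytes leaves canary_intact() true. A longer copy overwrites the canary, and the check fails unless the attacker happened to write the exact canary bytes.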
The above technique is very interesting because it is cheap; in one single check, it protects well against all stack-based buffer overflows that could have happened during the function. However:
It does not protect against heap-based buffer overflows.
The attacker can limit themselves to a small stack-based buffer overflow in order to change the contents of some local variables and function arguments, without overwriting the canary. This may be enough to gain control of the target.
The attacker can try their luck at guessing the value of the canary. With one chance in 2^32 at each try, if they are allowed to cause as many buffer overflows as they want (say, in a server that is not monitored and is automatically restarted after each crash), they may eventually get lucky.
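To put a number on "eventually", here is a small sketch of the arithmetic. It assumes each try is an independent guess with a 1 in 2^32 chance of success (for example, because the canary is re-randomized at every restart); the function names are made up for the illustration.

```python
import math

CANARY_BITS = 32


def success_probability(tries: int, bits: int = CANARY_BITS) -> float:
    """Chance that at least one of `tries` independent guesses is right."""
    p = 2.0 ** -bits
    # 1 - (1 - p)**tries, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(tries * math.log1p(-p))


def tries_for_probability(target: float, bits: int = CANARY_BITS) -> int:
    """Smallest number of tries giving at least `target` chance of success."""
    p = 2.0 ** -bits
    return math.ceil(math.log1p(-target) / math.log1p(-p))
```

Under these assumptions an attacker needs about 3 billion tries for an even chance, which is slow but not out of reach against an unmonitored server that restarts itself.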
According to this article, GCC developers felt that even protecting every function with a single check was still too much (the option I have been describing is called -fstack-protector in GCC), and they used heuristics to omit the check from various functions, including some where it could have been useful.
At the other end of the spectrum, there exist techniques that detect all possibilities of buffer overflow at run-time, at a greater cost.
There is nothing impossible about systematically preventing all buffer overflows at run-time. It is only more expensive than the above cheap check. Some techniques, while more expensive than the canary, are cheap enough to belong in compiler features, so you may find them as disabled-by-default options in Clang or GCC. The techniques that detect all buffer overflows, on heap as well as on stack, incur an overhead that can go up to, say, 900% (they make execution 10 times slower). Most people in charge of deploying software do not find this compromise acceptable, and thus these techniques are found only in specialized academic C compilers, in source-to-source transformation tools sold separately from the compiler, or in sound static analyzers used as C interpreters.
To reiterate, there is nothing impossible about detecting all buffer overflows at run-time. Rice's theorem does not apply. The instrumentation techniques to do so are only too expensive (in run-time speed) to be used in practice.
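A minimal sketch of what such instrumentation amounts to: every single access carries an explicit bounds test. Python already checks indices by itself, so the hypothetical CheckedBuffer class below only makes visible the check that an instrumenting C tool would insert; the names are assumptions, not any real tool's API.

```python
class BoundsError(Exception):
    """Raised when an instrumented access falls outside its buffer."""


class CheckedBuffer:
    """Hypothetical instrumented buffer: every read and write is checked."""

    def __init__(self, size: int):
        self._data = bytearray(size)

    def _check(self, index: int) -> None:
        # This per-access test is where the run-time overhead comes from.
        if not 0 <= index < len(self._data):
            raise BoundsError(f"index {index} outside [0, {len(self._data)})")

    def write(self, index: int, value: int) -> None:
        self._check(index)
        self._data[index] = value

    def read(self, index: int) -> int:
        self._check(index)
        return self._data[index]


def checked_copy(dst: CheckedBuffer, src: bytes) -> None:
    # An overlong source is stopped at the first out-of-bounds write.
    for i, byte in enumerate(src):
        dst.write(i, byte)
```

The overflow is caught at the moment it happens, on heap or stack alike, at the price of one comparison per access.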
Another possibility, which is differently expensive, is to statically check that the program does not have any possibility of buffer overflow, for any possible input that might be sent to it. This way, the program can be compiled with an ordinary compiler and run at full speed without risk of remote code execution. This is where Rice's theorem begins to apply: it says that it is impossible to make an automatic static analyzer that guarantees that all safe programs are safe. This is no issue in practice, because Rice's theorem does not say that it is impossible to guarantee that one particular safe program is safe.
The important thing is to build a static analyzer that never says “safe” for a program that it cannot guarantee to be safe. The static analyzer is always allowed to say “maybe”, and the only difficulty in practice is if it says “maybe” too often, because it is never sure that any program is safe.
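The "safe or maybe" behavior can be sketched with a toy interval check. This is only an illustration under simplifying assumptions (a single array access whose index is known to lie in some over-approximated range); Interval and check_access are hypothetical names, and a real sound analyzer is far more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Over-approximation of every value an index expression may take."""
    lo: int
    hi: int


def check_access(index: Interval, length: int) -> str:
    """Sound verdict for buffer[index], where buffer has `length` elements."""
    if 0 <= index.lo and index.hi < length:
        return "safe"    # every possible index is in bounds
    return "maybe"       # safety not proven; this may be a false alarm
```

If the index really stays in [0, 9] but the analysis could only infer [0, 15], check_access(Interval(0, 15), 10) answers "maybe": a false positive, but never an unjustified "safe".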
The above kind of static analyzer is called a sound static analyzer. They are a little difficult to use and are mostly discussed in academia only, but for instance, I work at a company that sells a sound static analyzer and the expertise to apply it to security-critical C software components. The first C library we verified is PolarSSL, used in a specific configuration. Because we have checked that no buffer overflow can occur for any message sent from the network to a PolarSSL server in the configuration we chose, it can be compiled with an ordinary compiler and is safe from all consequences of buffer overflows (and generally of C's undefined behaviors), including ROP attacks.

Related

How to reliably influence generated code at near machine level using GHC?

While this may sound like a theoretical question, suppose I decide to invest and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve performance of some very thin bottleneck and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a real-time system - occasional latency spikes are tolerable (from interrupts, thread scheduling irregularities, occasional GC etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to specific GHC release (but no forking)
Performance goal requires inplace modification of pre-allocated offheap arrays taking alignment into account (C strings, bit-packed fields etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization? By saying reliably I mean that if a source change causes code to no longer perform, it is correctable in source code without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would a custom FFI help avoid C calling convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could a source code generator from a "high-level" assembly (LLVM?) be a solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation) by looking at the Core representation (a very Haskell-like intermediate language, produced by the first stage of compilation). So e.g. you might see Int# in the Core output, which means an integer that has no heap representation (it's gonna be in a register).
When optimizing haskell code we regularly look at core and expect to be able to manipulate or correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function such that it can be inlined). This isn't always fun, but will be fairly stable especially if you are pinning your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can look into how to write portions of your program in C and interop; Haskell supports pinned arrays among other structures for this purpose.
If you need more control than those give you then haskell is likely the wrong language. It's impossible to tell from your description if this is the case for your problem (Your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

Why is return-to-libc much earlier than return-to-user?

I'm really new to this topic and only know some basic concepts. Nevertheless, there is a question that quite confuses me.
Solar Designer proposed the idea of return-to-libc in 1997 (http://seclists.org/bugtraq/1997/Aug/63). Return-to-user, from my understanding, didn't become popular until 2007 (http://seclists.org/dailydave/2007/q1/224).
However, return-to-user seems much easier than return-to-libc. So my question is, why did hackers spend so much effort in building a gadget chain by using libc, rather than simply using their own code in the user space when exploiting a kernel vulnerability?
I don't believe that they did not realize there are NULL pointer dereference vulnerabilities in the kernel.
Great question and thank you for taking some time to research the topic before making a post that causes its readers to lose hope in the future of mankind (you might be able to tell I've read a few [exploit] tags tonight).
Anyway, there are two reasons
return-to-libc is generic, provided you have the offsets. There is no need to programmatically or manually build either a return chain or scrape existing user functionality.
Partially because of linker-enabled relocations and partly because of history, the executable runtime of most programs executing on a Linux system essentially demands the libc runtime, or at least a runtime that can correctly handle _start and cons. This formula still stands on Windows, just under the slightly different paradigm of return-to-kernel32/oleaut/etc, but can actually be more immediately powerful, particularly for shellcode with length requirements, for reasons relating to the way in which system calls are invoked indirectly by kernel32-SSDT functions.
As a side note, if you are discussing NULL pointer dereferences in the kernel, you may be confusing this with return to vDSO space, which is actually subject to a different set of constraints than standard "mprotect and roll" userland.
Ret2libc or ROP is a technique you deploy when you cannot return to your shellcode (jmp esp/whatever) because of memory protections (DEP/NX).
They perform different goals, and you may engage with both.
Return to libc
This is a buffer overrun exploit where you have control over some input, and you know there exists a badly written program which will copy your input onto the stack and return through it. This allows you to start executing code on a machine where you have no login, or only very restricted access.
Return to libc gives you the ability to control what code is executed, and would generally result in you running within the space, a more natural piece of code.
return from kernel
This is a privilege escalation attack. It takes code which is running on the machine, and exploits bugs in the kernel causing the kernel privileges to be applied to the user process.
Thus you may use return-to-libc to get your code running in a web-browser, and then return-to-user to perform restricted tasks.
Earlier shellcode
Before return-to-libc, there was straight shellcode, where the buffer overrun included the exploit code. With knowledge of where the stack would be, this code could be run directly. This became somewhat obsolete with NX bit support in Windows. (Older x86 hardware could control execute permission per segment, but not per page.)
Why is return-to-libc much earlier than return-to-user?
The goal in these attacks is to own the machine. Frequently, it was easy to find a vulnerability in a user-mode process which gave you access to what you needed. It is only with the hardening of the system, by reducing privilege at the boundary and fixing the bugs in important programs, that the new return-from-kernel became necessary.

Haskell for mission-critical systems [duplicate]

I've been curious to understand if it is possible to apply the power of Haskell to the embedded realtime world, and in googling have found the Atom package. I'd assume that in the complex case the code might have all the classical C bugs - crashes, memory corruptions, etc, which would then need to be traced to the original Haskell code that caused them. So, this is the first part of the question: "If you had the experience with Atom, how did you deal with the task of debugging the low-level bugs in compiled C code and fixing them in the original Haskell code?"
I searched for some more examples for Atom, this blog post mentions the resulting C code 22KLOC (and obviously no code:), the included example is a toy. This and this references have a bit more practical code, but this is where this ends. And the reason I put "sizable" in the subject is, I'm most interested if you might share your experiences of working with the generated C code in the range of 300KLOC+.
As I am a Haskell newbie, obviously there may be other ways that I did not find due to my unknown unknowns, so any other pointers for self-education in this area would be greatly appreciated - and this is the second part of the question - "what would be some other practical methods (if) of doing real-time development in Haskell?". If the multicore is also in the picture, that's an extra plus :-)
(About usage of Haskell itself for this purpose: from what I read in this blog post, the garbage collection and laziness in Haskell makes it rather nondeterministic scheduling-wise, but maybe in two years something has changed. Real world Haskell programming question on SO was the closest that I could find to this topic)
Note: "real-time" above would be closer to "hard realtime" - I'm curious if it is possible to ensure that the pause time when the main task is not executing is under 0.5ms.
At Galois we use Haskell for two things:
Soft real time (OS device layers, networking), where 1-5 ms response times are plausible. GHC generates fast code, and has plenty of support for tuning the garbage collector and scheduler to get the right timings.
For true real time systems, EDSLs are used to generate code for other languages that provide stronger timing guarantees. E.g. Cryptol, Atom and Copilot.
So be careful to distinguish the EDSL (Copilot or Atom) from the host language (Haskell).
Some examples of critical systems, and in some cases, real-time systems, either written or generated from Haskell, produced by Galois.
EDSLs
Copilot: A Hard Real-Time Runtime Monitor -- a DSL for real-time avionics monitoring
Equivalence and Safety Checking in Cryptol -- a DSL for cryptographic components of critical systems
Systems
HaLVM -- a lightweight microkernel for embedded and mobile applications
TSE -- a cross-domain (security level) network appliance
It will be a long time before there is a Haskell system that fits in small memory and can guarantee sub-millisecond pause times. The community of Haskell implementors just doesn't seem to be interested in this kind of target.
There is healthy interest in using Haskell or something Haskell-like to compile down to something very efficient; for example, Bluespec compiles to hardware.
I don't think it will meet your needs, but if you're interested in functional programming and embedded systems you should learn about Erlang.
Andrew,
Yes, it can be tricky to debug problems through the generated code back to the original source. One thing Atom provides is a means to probe internal expressions; it then leaves it up to the user how to handle these probes. For vehicle testing, we build a transmitter (in Atom) and stream the probes out over a CAN bus. We can then capture this data, format it, then view it with tools like GTKWave, either in post-processing or realtime. For software simulation, probes are handled differently. Instead of getting probe data from a CAN protocol, hooks are made to the C code to lift the probe values directly. The probe values are then used in the unit testing framework (distributed with Atom) to determine if a test passes or fails and to calculate simulation coverage.
I don't think Haskell, or other Garbage Collected languages are very well-suited to hard-realtime systems, as GC's tend to amortize their runtimes into short pauses.
Writing in Atom is not exactly programming in Haskell, as Haskell here can be seen as purely a preprocessor for the actual program you are writing.
I think Haskell is an awesome preprocessor, and using DSEL's like Atom is probably a great way to create sizable hard-realtime systems, but I don't know if Atom fits the bill or not. If it doesn't, I'm pretty sure it is possible (and I encourage anyone who does!) to implement a DSEL that does.
Having a very strong pre-processor like Haskell for a low-level language opens up a huge window of opportunity to implement abstractions through code-generation that are much more clumsy when implemented as C code text generators.
I've been fooling around with Atom. It is pretty cool, but I think it is best for small systems. Yes it runs in trucks and buses and implements real-world, critical applications, but that doesn't mean those applications are necessarily large or complex. It really is for hard-real-time apps and goes to great lengths to make every operation take the exact same amount of time. For example, instead of an if/else statement that conditionally executes one of two code branches that might differ in running time, it has a "mux" statement that always executes both branches before conditionally selecting one of the two computed values (so the total execution time is the same whichever value is selected). It doesn't have any significant type system other than built-in types (comparable to C's) that are enforced through GADT values passed through the Atom monad. The author is working on a static verification tool that analyzes the output C code, which is pretty cool (it uses an SMT solver), but I think Atom would benefit from more source-level features and checks. Even in my toy-sized app (LED flashlight controller), I've made a number of newbie errors that someone more experienced with the package might avoid, but that resulted in buggy output code that I'd rather have been caught by the compiler instead of through testing. On the other hand, it's still at version 0.1.something so improvements are undoubtedly coming.
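To make the "mux" idea concrete, here is a rough sketch in Python rather than Atom. The mux function and the two path functions are hypothetical names for this illustration and are not Atom's API; Python also gives no timing guarantees, so this only shows the structure: both branches are always computed, then one value is selected.

```python
def mux(condition: bool, if_true: int, if_false: int) -> int:
    """Select one of two already-computed values."""
    # Both arguments were evaluated by the caller before this point,
    # so the amount of work does not depend on `condition`.
    mask = -int(condition)                # all ones if True, zero if False
    return (if_true & mask) | (if_false & ~mask)


calls = []


def slow_path(x: int) -> int:
    calls.append("slow")
    return x * x


def fast_path(x: int) -> int:
    calls.append("fast")
    return x + 1


x = 4
result = mux(x > 10, slow_path(x), fast_path(x))
```

After this runs, calls records that both paths executed even though only the fast_path value was selected, which is the trade being made: wasted work in exchange for a running time that does not depend on the data.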

What are the prevention techniques for the Buffer overflow attacks?

What are the ideas for preventing buffer overflow attacks? I heard about StackGuard, but is this problem completely solved by applying StackGuard, or a combination of it with other techniques?
After the warm-up, a question for experienced programmers: why do you think that it is so difficult to provide adequate defenses for buffer overflow attacks?
Edit: thanks for all answers and keeping security tag active:)
There's a bunch of things you can do. In no particular order...
First, if your language choices are equally split (or close to equally split) between one that allows direct memory access and one that doesn't, choose the one that doesn't. That is, use Perl, Python, Lisp, Java, etc over C/C++. This isn't always an option, but it does help prevent you from shooting yourself in the foot.
Second, in languages where you have direct memory access, if classes are available that handle the memory for you, like std::string, use them. Prefer well exercised classes to classes that have fewer users. More use means that simpler problems are more likely to have been discovered in regular usage.
Third, enable mitigations like ASLR and DEP. Use any security-related options that your compiler offers. This won't prevent buffer overflows, but will help mitigate the impact of any overflows.
Fourth, use static code analysis tools like Fortify, Qualys, or Veracode's service to discover overflows that you didn't mean to code. Then fix the stuff that's discovered.
Fifth, learn how overflows work, and how to spot them in code. All your coworkers should learn this, too. Create an organization-wide policy that requires people be trained in how overruns (and other vulns) work.
Sixth, do secure code reviews separately from regular code reviews. Regular code reviews make sure code works, that it passes functional tests, and that it meets coding policy (indentation, naming conventions, etc). Secure code reviews are specifically, explicitly, and only intended to look for security issues. Do secure code reviews on all code that you can. If you have to prioritize, start with mission critical stuff, stuff where problems are likely (where trust boundaries are crossed (learn about data flow diagrams and threat models and create them), where interpreters are used, and especially where user input is passed/stored/retrieved, including data retrieved from your database).
Seventh, if you have the money, hire a good consultant like Neohapsis, VSR, Matasano, etc. to review your product. They'll find far more than overruns, and your product will be all the better for it.
Eighth, make sure your QA team knows how overruns work and how to test for them. QA should have test cases specifically designed to find overruns in all inputs.
Ninth, do fuzzing. Fuzzing finds an amazingly large number of overflows in many products.
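For a feel of what fuzzing means, here is a bare-bones random fuzzer. It is a sketch only: parse_record is a made-up target with a deliberate bug (it trusts a length byte), and Python's IndexError stands in for the memory corruption a C program would suffer. Real fuzzers are coverage-guided and far smarter about generating inputs.

```python
import random

BUF_SIZE = 8  # hypothetical fixed-size buffer inside the target


def parse_record(data: bytes) -> bytes:
    """Toy target with a deliberate bug: it trusts the length byte."""
    if not data:
        return b""
    declared = data[0]
    buf = bytearray(BUF_SIZE)
    for i in range(min(declared, len(data) - 1)):
        buf[i] = data[1 + i]    # IndexError here stands in for an overflow
    return bytes(buf)


def fuzz(target, runs: int = 1000, max_len: int = 32, seed: int = 0) -> list:
    """Feed random inputs to `target` and collect the ones that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(runs):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except IndexError:
            crashes.append(data)
    return crashes
```

Calling fuzz(parse_record) returns the crashing inputs, each of which declares and carries more than BUF_SIZE payload bytes, pointing straight at the missing length check.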
Edited to add: I misread the question. The title says, "what are the techniques" but the text says "why is it hard".
It's hard because it's so easy to make a mistake. Little mistakes, like off-by-one errors or numeric conversions, can lead to overflows. Programs are complex beasts, with complex interactions. Where there's complexity there are problems.
Or, to turn the question back on you: why is it so hard to write bug-free code?
Buffer overflow exploits can be prevented. If programmers were perfect, there would be no unchecked buffers, and consequently, no buffer overflow exploits. However, programmers are not perfect, and unchecked buffers continue to abound.
Only one technique is necessary: Don't trust data from external sources.
There's no magic bullet for security: you have to design carefully, code carefully, hold code reviews, test, and arrange to fix vulnerabilities as they arise.
Fortunately, the specific case of buffer overflows has been a solved problem for a long time. Most programming languages have array bounds checking and do not allow programs to make up pointers. Just don't use the few that permit buffer overflows, such as C and C++.
Of course, this applies to the whole software stack, from embedded firmware¹ up to your application.
¹ For those of you not familiar with the technologies involved, this exploit can allow an attacker on the network to wake up and take control of a powered off computer. (Typical firewall configurations block the offending packets.)
You can run analyzers to help you find problems before the code goes into production. Our Memory Safety Checker will find buffer overruns, bad pointer faults, array access errors, and memory management mistakes in C code, by instrumenting your code to watch for mistakes at the moment they are made. If you want the C program to be impervious to such errors, you can simply use the results of the Memory Safety analyzer as the production version of your code.
In modern exploitation the big three are:
ASLR
Canary
NX Bit
Modern builds of GCC apply canaries by default. Not all ASLR is created equal: Windows 7, Linux and *BSD have some of the best ASLR. OSX has by far the worst ASLR implementation; it's trivial to bypass. Some of the most advanced buffer overflow attacks use exotic methods to bypass ASLR. The NX bit is by far the easiest mitigation to bypass; return-to-libc style attacks make it a non-issue for exploit developers.

Trivial mathematical problems as language benchmarks

Why do people insist on using trivial mathematical problems like finding numbers in the Fibonacci sequence for language benchmarks? Don't these usually get optimized to relativistic speeds? Isn't the brunt of the bottlenecks usually in I/O, system API calls, operations on strings and structures, processing large quantities of data, abstract object-oriented stuff, etc?
It is a throwback to the old days, when compiler technology for what we would now call basic math was still evolving rapidly.
Now, compiler evolution is more focused on exploiting new instructions for niche operations, 64-bit math, and so on.
Micro-benchmarks such as the ones you mention were useful, though, when evaluating the efficiency of the hotspot compiler when Java was first launched, and in evaluating the efficiency of .NET versus C/C++.
Your suggestion that I/O and system calls are the likely bottlenecks is correct, at least for some space of problems. But I notice you suggested string operations. One person's irrelevant micro-benchmark is another person's critical performance metric.
EDIT: ps, I also remember using linpack and other micro-benchmarks to compare versions of the JVM, and to compare vendors of the JVM. From v4 to v5 there was a big jump in perf, I guess the JIT compiler got more effective. Also, IBM's JVM was ahead of Sun's at that time, on Windows-x86.
Because if you want to benchmark the language/compiler, these "math problems" are good indicators of the "bare speed" of the generated code. Either they use the iterative solution, which is a tight loop and indicates how well the compiler can push the instructions to the processor, or they use the recursive solution, which indicates how it handles recursive calls of short functions (inlining, tail-recursion etc.) (although the Ackermann function is usually used for that too).
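The two variants look like this as a micro-benchmark. This is a sketch in Python using the standard timeit module; the function names and the choice of n are arbitrary, and the timings will of course depend on the machine and interpreter.

```python
import timeit


def fib_iterative(n: int) -> int:
    # Tight loop: mostly measures raw instruction throughput.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_recursive(n: int) -> int:
    # Many tiny calls: mostly measures function call overhead.
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def benchmark(n: int = 20, repeats: int = 5) -> dict:
    """Best-of-`repeats` wall-clock time, in seconds, for each variant."""
    return {
        f.__name__: min(timeit.repeat(lambda: f(n), number=10, repeat=repeats))
        for f in (fib_iterative, fib_recursive)
    }
```

Both functions compute the same values, so any difference reported by benchmark() comes from how loops and calls are executed, which is exactly what such a benchmark is trying to isolate.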
Usually, the benchmark suite for the language contains tests benchmarking other parts as well - e.g. gzip compression, text searching, object creation, virtual function call, exception throw/catch benchmarks.
The other things you've noticed, syscalls and IO are usually not included because
syscalls are in fact not that slow - applications don't spend a significant portion of their time in the kernel, except for tests specifically targeted at them or when something is seriously wrong with the program
syscall and IO performance does not depend on the language, but rather on the OS & hardware
I'd think a simple, well-established algorithm would remove the possibility that the benchmark is biased (whether through ignorance or malice) to favor one language. It is very difficult to write a complex program exactly the same way in two different languages. Testing something like the efficiency of a multithreaded application in C# vs Java, for example, would require developers skilled in multithreaded development in both languages, and there would still be questions as to whether the benchmark app properly represents the general case, or if it is misrepresenting a special case that only one language handles well.
Back when the sieve of Eratosthenes was a popular benchmark for C compilers, I thought it would be funny if one of the compiler authors would recognize the sieve code and replace it with a pre-computed lookup.
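For reference, the benchmark in question is only a few lines long, which is what makes it recognizable. Here it is sketched in Python rather than the C of the original benchmarks; the function name is arbitrary.

```python
import math


def sieve(limit: int) -> list:
    """Return all primes <= limit using the sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n squared.
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]
```

Since the output for a fixed limit never changes, a compiler that recognized this code could indeed swap in a table of results, and the benchmark would measure nothing.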

Resources