Stack overflow exploits: RET vs. SEH overwrite

I have been studying tutorials on writing both RET value and SEH overwrite exploits, using stack overflow.
As I understand it, when I overwrite the SEH value, the RET value is overwritten as well. It is also much harder to write an SEH exploit, because you also need to trigger an exception for the exploit to run.
If so, what is the use of an SEH overwrite exploit if I can always use the RET value instead?
And what are the pros and cons of SEH overwrite over RET overwrite?

It depends on the vulnerability and on the exploit conditions.
If you can overwrite the RET and build a full-blown exploit, then you are correct and overwriting the SEH is unnecessary.
But this is not always the case. In some cases RET overwrite protections will be present, such as a stack canary.
In this case exploiting via a RET overwrite would be far more difficult (if not impossible) than overwriting the SEH handler and generating an exception, because the canary is only checked when the function returns, while an exception raised before the return transfers control to the overwritten handler first.
The same can be said about overwriting the SEH: if SafeSEH is on and the stack canary is off, it would be far easier to exploit using a RET overwrite than an SEH overwrite.
Generally speaking, I would say that the choice of exploit technique depends mainly on the existing mitigations and the ease of exploitation.
It's always good to have another attack vector that can be used if all other options fail.
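As a rough illustration of the layout argument, here is a toy C model of a stack frame. It is only a sketch: the struct layout, the field names and the constants are assumptions made up for this example, real frames differ by compiler and flags, and nothing here touches a real stack or a real SEH chain. It only shows that, in such a layout, a linear overflow that reaches the SEH record has already clobbered the canary and the RET on the way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified 32-bit frame, lowest address first. */
struct fake_frame {
    char     buffer[16];   /* the overflowable local buffer */
    uint32_t canary;       /* stack cookie, checked only on function return */
    uint32_t saved_ebp;
    uint32_t ret;          /* saved return address */
    uint32_t seh_next;     /* SEH record: pointer to the next record */
    uint32_t seh_handler;  /* SEH record: handler address */
};

/* Made-up values standing in for the legitimate frame contents. */
#define CANARY_VALUE  0xDEADC0DEu
#define RET_VALUE     0x00401000u
#define HANDLER_VALUE 0x00402000u

void init_frame(struct fake_frame *f)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);
    f->canary = CANARY_VALUE;
    f->ret = RET_VALUE;
    f->seh_handler = HANDLER_VALUE;
}

/* Models a linear overflow: n filler bytes written from buffer[0] upward. */
void simulate_overflow(struct fake_frame *f, size_t n)
{
    assert(f != NULL);
    if (n > sizeof *f)
        n = sizeof *f;      /* stay inside the toy frame */
    memset(f, 0x41, n);
}

int canary_intact(const struct fake_frame *f)  { return f->canary == CANARY_VALUE; }
int ret_intact(const struct fake_frame *f)     { return f->ret == RET_VALUE; }
int handler_intact(const struct fake_frame *f) { return f->seh_handler == HANDLER_VALUE; }
```

In this model, an overflow long enough to change seh_handler always changes canary and ret too, so a RET-based exploit would be caught by the canary check on return, while an exception raised before the return would already dispatch to the overwritten handler.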

Adding a new attribute on source code that propagates until MC level in LLVM?

I am interested in how the following is propagated:
void foo(int __attribute__((aligned(16)))* p) { ... }
In this case the “alignedness” of the pointer is available at the MC level, but it is evidently not using the LLVM-IR metadata approach to achieve this. The alignment information is very important to some targets, which change code generation depending on this value, and I think that what I need is more like this attribute.
How difficult would it be to add a new attribute such that it propagates through the compiler in the same way as ‘aligned’? So, I already added a new element to the LLVM-IR to do this. I also expect that the hardest part would be making other parts of LLVM ignore this new element when they don’t care about it.
It really is a pity that LLVM does not have a generic target independent way of passing target dependent information from parser to back-end.
Using the ‘DebugLoc’ approach was suggested in a similar question, but I think it’s a bit-of-a-hack since this is not related to debugging. But if the implementation is less difficult this way, then the hack might be acceptable.
UPDATE:
Would using inline assembly instead of a new attribute work here? If yes, what are the pros/cons?

As you have demonstrated, alignment is not using metadata.
To anyone who doesn't know: alignment is mentioned (implicitly or explicitly) in all relevant instructions, so for example that function in the question will be compiled to something like this (notice the aligns):
define void @foo(i32*) {
%2 = alloca i32*, align 16 ; Allocate a 16-aligned pointer
store i32* %0, i32** %2, align 16 ; An aligned store to place the arg there
...
Now, if you want to attach some information to existing instructions and have most of the rest of the compiler ignore it, using metadata is a good idea. However, since metadata is a compiler-internal abstract thing, at some point you'll have to actually do something with it. Typically, this means adding a pass of your own to consume it and act accordingly.
As for where to place your pass and how to implement it, it really depends on the actual information you're trying to pass and its intended effect.

Obfuscation of checksum guards

As part of my project, I have to insert small pieces of code called checksum guards into a C program. These guards calculate a checksum over a portion of code using a function (add, xor, etc.) that operates on the instruction opcodes. So if somebody has tampered with the instructions (added, modified or deleted them) in that region of code, the checksum value will change and the intrusion will be detected.
Here is the research paper which talks about this technique:
https://www.cerias.purdue.edu/assets/pdf/bibtex_archive/2001-49.pdf
Here is the guard template:
guard:
add ebp, -checksum
mov eax, client_addr
for:
cmp eax, client_end
jg end
mov ebx, dword[eax]
add ebp, ebx
add eax, 4
jmp for
end:
Now, I have two questions.
Would putting the guards in the assembly be better than putting them in the source program?
Assuming I put them in the assembly (at an appropriate place), what kind of obfuscation should I use to keep the guard template from being easily visible? (When I have more than one guard, the attacker should not be able to easily find all the guard templates and disable all the guards together, as that would leave the code with no security.)
Thank you in advance.

From the point of view of an attacker (without sources), the first question doesn't matter: they are tampering with the final binary machine code, and whether it was produced from .c or .s makes zero difference. So I would worry mainly about how to generate the correct binary with appropriate checksums. I'm not aware of any way to get a proper checksum inside the C source. But I can imagine some external tool running over the assembler files created by the C compiler as a post-processing step, before the .s files are assembled into .o. But keep in mind that some call targets and addresses are just relative offsets, and that the binary loaded into memory is patched by the OS loader according to the linker's table to make those point to the real memory addresses. Thus the data bytes will change (the opcodes will stay fixed).
Your guard template doesn't take that into account and checksums whole instructions, opcodes together with data bytes. (Some advanced guards have opcode definitions and checksum/encrypt/decipher only the opcodes themselves, without the operand bytes.)
Otherwise it's neat that the result is a damaged ebp value, ruining any surrounding C code (*) that works with stack variables. But it's still an artificial test: you can simply comment out both add ebp,-checksum and add ebp,ebx, making the guard harmless.
(*) Notice that you have to put the guard in between some classic C code to get real runtime problems from an invalid ebp value. If you put it at the end of a subroutine that ends with pop ebp, everything would work well.
So, to the second question:
You definitely want more malicious ways of guarding the correct value than only damaging ebp. Usually the hardest way (to remove) is to make the checksum value part of some calculation, eventually skewing results just slightly, so that serious usage of the SW is impossible but it takes the user some time to notice.
You can also add your checksumming to some genuine code loop, so that simply skipping the whole loop also skips valid code (but I can only imagine this being added by hand into the assembly generated from C, so you will have to redo it after every new compilation of that particular C source).
Then the particular guard template can be obfuscated by any imaginable mutation (different registers used, modified order of instructions, instruction variants); try searching for viruses with mutation encoding to get some ideas.
I didn't read that paper, but from the figures I would say the main point is to make the guarded areas overlap, so that patching out one of them affects another one. That sounds to me like the extra sugar that makes it somewhat functional (although this still looks like a normal challenge to 8-bit game crackers ;), not even "hard" level). But it also means you would need either a very cunning external tool to calculate that cyclic tree of dependencies and insert the guard templates in the correct order, or to do it all manually again.
Of course, when doing it manually you have to redo it after each new C compilation, so it's worth the effort only for something very precious and expensive, or rock-solid stable, where you will not produce another revision for the next 10 years or so... :D
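For reference, the arithmetic of the guard template from the question can be sketched in C roughly as follows. This is a minimal sketch under assumptions: the function names are made up here, it sums an ordinary array of 32-bit words rather than real code bytes, and it takes a word count instead of the template's inclusive client_end address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Additive checksum over 32-bit words, as in "add ebp, ebx" in the template. */
uint32_t region_checksum(const uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < count; ++i)
        sum += words[i];            /* unsigned arithmetic wraps modulo 2^32 */
    return sum;
}

/* Mirrors the guard: start from -expected ("add ebp, -checksum"), then add
 * every word. The result is 0 when the region is untouched, and is normally
 * nonzero after tampering; in the template this value ends up skewing ebp. */
uint32_t guard_delta(const uint32_t *words, size_t count, uint32_t expected)
{
    uint32_t acc = 0u - expected;
    acc += region_checksum(words, count);
    return acc;
}
```

A build-time tool would compute expected with region_checksum over the final bytes of the protected region, and the guard would then fold guard_delta into some value the program depends on, rather than comparing it against zero with an obvious branch.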

How to extract stack size of all library functions from c code?

I have a few benchmarks, such as fft and dijkstra. I want to collect the stack size of all library functions and user-defined functions. The C code is also available.
I am managing a cache in hardware, so I need the exact stack size of each small function and variable.

If compiling with a recent GCC, you could pass the -fstack-usage flag to gcc (in addition to optimization flags, if any), which:
Makes the compiler output stack usage information for the program, on a per-function basis. The filename for the dump is made by appending .su to the auxname. auxname is generated from the name of the output file, if explicitly specified and it is not an executable, otherwise it is the basename of the source file. An entry is made up of three fields:
The name of the function.
A number of bytes.
One or more qualifiers: static, dynamic, bounded.
The qualifier static means that the function manipulates the stack statically: a fixed number of bytes are allocated for the frame on function entry and released on function exit; no stack adjustments are otherwise made in the function. The second field is this fixed number of bytes.
The qualifier dynamic means that the function manipulates the stack dynamically: in addition to the static allocation described above, stack adjustments are made in the body of the function, for example to push/pop arguments around function calls. If the qualifier bounded is also present, the amount of these adjustments is bounded at compile time and the second field is an upper bound of the total amount of stack used by the function. If it is not present, the amount of these adjustments is not bounded at compile time and the second field only represents the bounded part.
You could also pass a -Wstack-usage=len warning flag, which:
Warn if the stack usage of a function might be larger than len bytes. The computation done to determine the stack usage is conservative. Any space allocated via alloca, variable-length arrays, or related constructs is included by the compiler when determining whether or not to issue a warning.
You may consider writing your own GCC plugin to extract the stack size of functions compiled by a recent GCC (e.g. GCC 10 in October 2020), and since GCC is free software, you could improve it.
Of course, if you want the same information for libraries, you should recompile them from their source code.
BTW, the stack usage of some functions, or of some function call occurrences, might be ill-defined (and certainly depends upon the optimization flags and the target system), since GCC is sometimes capable of tail-call optimization, function inlining (even for functions not qualified inline!) and/or function cloning. Also, a few C standard library functions (printf, memset, ...) are magically known to the compiler, which might use internal builtin functions to compile them. Finally, several programs (and more and more libraries) are compiled with link-time optimization (using -flto), in which case the stack usage of individual functions is not well defined (since they are often inlined).
So I am not sure your question makes precise sense. You might rephrase it, motivate it and improve it.
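To collect the three fields described above from a .su file, something like the following C sketch could be used. It assumes each line has the shape "file:line:column:function", a tab, the byte count, a tab, and the qualifiers; check this against the files your GCC version actually writes. The struct, the function name and the sample file names in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One entry of a .su file (hypothetical struct for this sketch). */
struct su_entry {
    char          function[128];
    unsigned long bytes;
    char          qualifiers[32];   /* e.g. "static" or "dynamic,bounded" */
};

/* Parses one line; returns 0 on success, -1 if the line does not match the
 * assumed "location<TAB>bytes<TAB>qualifiers" shape. */
int parse_su_line(const char *line, struct su_entry *out)
{
    char location[256];

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%255[^\t]\t%lu\t%31[^\n]",
               location, &out->bytes, out->qualifiers) != 3)
        return -1;

    /* The function name is taken as the part after the last ':'. */
    const char *name = strrchr(location, ':');
    name = name ? name + 1 : location;
    snprintf(out->function, sizeof out->function, "%s", name);
    return 0;
}
```

For example, a line such as "fft.c:12:6:fft_float", tab, "144", tab, "static" would yield the function fft_float with 144 bytes and the qualifier static. Reading every .su file line by line with fgets and passing each line to parse_su_line gives a per-function table.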

What is the fastcall keyword used for in visual c?

I have seen the fastcall notation placed before many functions. Why is it used?

That notation before the function is called the "calling convention." It specifies how (at a low level) the compiler will pass input parameters to the function and retrieve its results once it's been executed.
There are many different calling conventions, the most popular being stdcall and cdecl.
You might think there's only one way of doing it, but in reality, there are dozens of ways you could call a function and pass variables in and out. You could place the input parameters on a stack (push, push, push to call; pop, pop, pop to read input parameters). Or perhaps you would rather stick them in registers (this is fastcall - it tries to fit some of the input params in registers for speed).
But then what about the order? Do you push them from left to right or right to left? What about the result - there's always only one (assuming no reference parameters), so do you place the result on the stack, in a register, at a certain memory address?
Also, let's assume you're using the stack for communication - whose job is it to actually clear the stack after the function is called - the caller or the callee?
What about backing up and then restoring the contents of (certain) CPU registers - should the caller do it, or will the callee guarantee that it'll return everything the way it was?
The most popular calling convention (by far) is cdecl, which is the standard calling convention in both C and C++. The WIN32 API uses stdcall, which means any code that calls the WIN32 API needs to use stdcall for those function calls (making it another popular choice).
fastcall is a bit of an oddball. People realized that for many functions with only one in/out parameter, pushing and popping from a memory-based stack is quite a bit of overhead and makes function calls a little heavy, so the different compilers introduced (different) calling conventions that place one or more parameters in registers before placing the rest on the stack, for better performance. The problem is that not all compilers use the same rules for what goes where and who does what with fastcall, so you have to be careful when using it because you never know who does what. Finally, see Is fastcall really faster? for info on fastcall performance benefits.
Complicated stuff.
Something important to keep in mind: don't add or change calling conventions if you don't know exactly what you're doing, because if both the caller and the callee do not agree on the calling convention, you'll likely end up with stack corruption and a segfault. This usually happens when you have the function being called in a DLL/shared library and a program is written that depends on the DLL/SO/dylib being a certain calling convention (say, cdecl), then the library is recompiled with a different calling convention (say, fastcall). Now the old program can no longer communicate with the new library.
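To make the notation concrete, here is a small C sketch of how the keywords are written. The macro and function names are made up for this example. The keywords are only applied when building with 32-bit MSVC and expand to nothing elsewhere, so the sketch still compiles with other compilers (where it then demonstrates nothing about register passing).

```c
#include <assert.h>

/* Apply the MSVC keywords only where they are meaningful (32-bit x86). */
#if defined(_MSC_VER) && defined(_M_IX86)
#define CALL_CDECL    __cdecl
#define CALL_STDCALL  __stdcall
#define CALL_FASTCALL __fastcall
#else
#define CALL_CDECL
#define CALL_STDCALL
#define CALL_FASTCALL
#endif

/* cdecl: arguments pushed right to left, the caller cleans the stack. */
int CALL_CDECL add_cdecl(int a, int b) { return a + b; }

/* stdcall: arguments pushed right to left, the callee cleans the stack. */
int CALL_STDCALL add_stdcall(int a, int b) { return a + b; }

/* fastcall (32-bit MSVC): the first two small arguments go in ECX and EDX,
 * the rest go on the stack, and the callee cleans the stack. */
int CALL_FASTCALL add_fastcall(int a, int b) { return a + b; }

/* The convention is part of the function pointer type, so caller and callee
 * are forced to agree on it. */
typedef int (CALL_FASTCALL *fastcall_fn)(int, int);

int call_through(fastcall_fn fn, int a, int b)
{
    assert(fn != 0);
    return fn(a, b);
}
```

All three functions compute the same result; only the way the arguments travel differs. Keeping the convention in the pointer type is what lets the compiler reject a mismatch at build time instead of corrupting the stack at run time.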

Wikipedia states that
Conventions entitled fastcall have not been standardized, and have been implemented differently, depending on the compiler vendor. Typically fastcall calling conventions pass one or more arguments in registers which reduces the number of memory accesses required for the call.

Why is bounds checking not implemented in some of the languages?

According to Wikipedia (http://en.wikipedia.org/wiki/Buffer_overflow)
Programming languages commonly associated with buffer overflows include C and C++, which provide no built-in protection against accessing or overwriting data in any part of memory and do not automatically check that data written to an array (the built-in buffer type) is within the boundaries of that array. Bounds checking can prevent buffer overflows.
So, why is bounds checking not implemented in some languages, like C and C++?

Basically, it's because it means every time you change an index, you have to do an if statement.
Let's consider a simple C for loop:
int ary[X] = {...}; // Purposefully leaving size and initializer unknown
for(int ix=0; ix< 23; ix++){
printf("ary[%d]=%d\n", ix, ary[ix]);
}
If we have bounds checking, the generated code for ary[ix] has to be something like
LOOP:
INC IX ; add 1 to ix
CMP IX, 23 ; while test
JGE END ; leave the loop once the while test fails
CMP IX, X ; compare IX and X
JGE ERROR ; if IX >= X jump to ERROR
LD R1, IX ; put the value of IX into register 1
LD R2, ARY+IX ; put the array value in R2
LA R3, Str42 ; STR42 is the format string
JSR PRINTF ; now we call the printf routine
J LOOP ; go back to the top of the loop
;;; somewhere else in the code
ERROR:
HCF ; halt and catch fire
If we don't have that bounds check, then we can write instead:
LD R1, IX
LOOP:
CMP IX, 23
JGE END
LD R2, ARY+R1
JSR PRINTF
INC R1
J LOOP
This saves 3-4 instructions in the loop, which (especially in the old days) meant a lot.
In fact, in the PDP-11 machines, it was even better, because there was something called "auto-increment addressing". On a PDP, all of the register stuff etc turned into something like
CZ -(IX), END ; compare IX to zero, then decrement; jump to END if zero
(And anyone who happens to remember the PDP better than I do, don't give me trouble about the precise syntax etc; you're an old fart like me, you know how these things slip away.)
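The same trade-off can be seen at the C level. The following is a minimal sketch, with made-up function names, of an unchecked access next to a checked one. The checked version pays for one extra compare and branch on every access, which corresponds to the extra CMP/JGE pair in the listing above.

```c
#include <assert.h>
#include <stddef.h>

/* What C does by default: no check, the caller must guarantee ix is valid. */
int get_unchecked(const int *ary, size_t ix)
{
    return ary[ix];
}

/* A bounds-checked access: one extra compare and branch per call.
 * Returns 0 and stores the element in *out, or -1 if ix is out of range. */
int get_checked(const int *ary, size_t len, size_t ix, int *out)
{
    assert(ary != NULL && out != NULL);
    if (ix >= len)
        return -1;      /* the ERROR branch */
    *out = ary[ix];
    return 0;
}
```

Note that the checked version also needs the length to be passed along with the pointer, which plain C arrays do not carry once they decay to pointers.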

It's all about the performance. However, the assertion that C and C++ have no bounds checking is not entirely correct. It is quite common to have "debug" and "optimized" versions of each library, and it is not uncommon to find bounds-checking enabled in the debugging versions of various libraries.
This has the advantage of quickly and painlessly finding out-of-bounds errors when developing the application, while at the same time eliminating the performance hit when running the program for realz.
I should also add that the performance hit is non-negligible, and many languages other than C++ will provide various high-level functions operating on buffers that are implemented directly in C and C++ specifically to avoid the bounds checking. For example, in Java, if you compare the speed of copying one array into another using pure Java vs. using System.arraycopy (which does bounds checking once, but then straight-up copies the array without bounds-checking each individual element), you will see a decently large difference in the performance of those two operations.
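The "check once, then copy" idea can be sketched in C as follows. This is not how Java implements its copy routine; it is only an illustration of the technique, and the function name and parameter order are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Validates both ranges a single time, then copies the whole block without
 * any per-element checks. Returns 0 on success, -1 if a range is invalid. */
int copy_ints_checked(const int *src, size_t src_len, size_t src_pos,
                      int *dst, size_t dst_len, size_t dst_pos,
                      size_t count)
{
    assert(src != NULL && dst != NULL);

    /* Written so that the size arithmetic cannot wrap around. */
    if (src_pos > src_len || count > src_len - src_pos)
        return -1;
    if (dst_pos > dst_len || count > dst_len - dst_pos)
        return -1;

    /* memmove tolerates overlapping source and destination ranges. */
    memmove(dst + dst_pos, src + src_pos, count * sizeof *src);
    return 0;
}
```

The cost of the two range checks is paid once per call, regardless of how many elements are copied, which is why a bulk routine like this can beat an element-by-element loop that checks every index.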

It is easier to implement and faster both to compile and at run-time. It also simplifies the language definition (as quite a few things can be left out if this is skipped).
Currently, when you do:
int *p = (int*)malloc(sizeof(int));
*p = 50;
C (and C++) just says, "Okey dokey! I'll put something in that spot in memory".
If bounds checking were required, C would have to say, "Ok, first let's see if I can put something there? Has it been allocated? Yes? Good. I'll insert now." By skipping the test to see whether there is something which can be written there, you are saving a very costly step. On the other hand, (she wore a glove), we now live in an era where "optimization is for those who cannot afford RAM," so the arguments about the speed are getting much weaker.

The primary reason is the performance overhead of adding bounds checking to C or C++. While this overhead can be reduced substantially with state-of-the-art techniques (to 20-100% overhead, depending upon the application), it is still large enough to make many folks hesitate. I'm not sure whether that reaction is rational -- I sometimes suspect that people focus too much on performance, simply because performance is quantifiable and measurable -- but regardless, it is a fact of life. This fact reduces the incentive for major compilers to put effort into integrating the latest work on bounds checking into their compilers.
A secondary reason involves concerns that bounds checking might break your app. Particularly if you do funky stuff with pointer arithmetic and casting that violates the standard, bounds checking might block something your application is currently doing. Large applications sometimes do amazingly crufty and ugly things. If the compiler breaks the application, then there's no point in blaming the crufty code for the problem; people aren't going to keep using a compiler that breaks their application.
Another major reason is that bounds checking competes with ASLR + DEP. ASLR + DEP are perceived as solving, oh, 80% of the problem or so. That reduces the perceived need for full-fledged bounds checking.

Because it would cripple those general purpose languages for HPC requirements. There are plenty of applications where buffer overflows really do not matter one iota, simply because they do not happen. Such features are much better off in a library (where in fact you can already find examples for C/C++).
For domain specific languages it may make sense to bake such features into the language definition and trade the resulting performance hit for increased security.
