I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writting this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache (sum, sum + sizeof(sum)); // GNU C
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c hence the addresses near 401000
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on #AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4095ULL; // round down
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+2) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect -function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define ".word 0x00010001 \n"
Basically, c.nop instruction was not supported by my current assembler. So, I defined CNOP as the hex equivalent of c.nop with proper syntax and used inside asm, with which I was aware of.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.
On MSVC for x64 (19.10.25019),
InterlockedOr(&g, 1)
generates this code sequence:
prefetchw BYTE PTR ?g##3JC
mov eax, DWORD PTR ?g##3JC ; g
npad 3
$LL3#f:
mov ecx, eax
or ecx, 1
lock cmpxchg DWORD PTR ?g##3JC, ecx ; g
jne SHORT $LL3#f
I would have expected the much simpler (and loopless):
mov eax, 1
lock or [?g##3JC], eax
InterlockedAnd generates analogous code to InterlockedOr.
It seems wildly inefficient to have to have a loop for this instruction. Why is this code generated?
(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)
The documented contract for InterlockedOr has it returning the original value:
InterlockedOr
Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.
LONG __cdecl InterlockedOr(
_Inout_ LONG volatile *Destination,
_In_ LONG Value
);
Parameters:
Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.
Value [in]
The second operand.
Return value
The function returns the original value of the Destination parameter.
This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
As you observed, you see the same thing with InterlockedAnd, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code-generator has to fallback on the general pattern involving compare-and-exchange, which is directly supported by the hardware.
Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.
#include <intrin.h>
LONG InterlockedOrWithReturn()
{
LONG val = 42;
return _InterlockedOr(&val, 8);
}
void InterlockedOrWithoutReturn()
{
LONG val = 42;
LONG old = _InterlockedOr(&val, 8);
}
InterlockedOrWithoutReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
lock or DWORD PTR [rsp+8], 8
ret 0
InterlockedOrWithoutReturn ENDP
InterlockedOrWithReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
prefetchw BYTE PTR [rsp+8]
mov eax, DWORD PTR [rsp+8]
LoopTop:
mov ecx, eax
or ecx, 8
lock cmpxchg DWORD PTR [rsp+8], ecx
jne SHORT LoopTop
ret 0
InterlockedOrWithReturn ENDP
The optimizer is equally as smart for InterlockedAnd, and should be for the other Interlocked* functions, as well.
As intuition would tell you, the LOCK OR implementation is more efficient than the LOCK CMPXCHG in a loop. Not only is there the expanded code size and the overhead of looping, but you risk branch prediction misses, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value for interlocked operations, you can gain a performance boost.
However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.
So I'm a bit of a beginner to ARM Assembly (assembly in general, too). Right now I'm writing a program and one of the biggest parts of it is that the user will need to type in a letter, and then I will compare that letter to some other pre-inputted letter to see if the user typed the same thing.
For instance, in my code I have
.balign 4 /* Forces the next data declaration to be on a 4 byte segment */
dime: .asciz "D\n"
at the top of the file and
addr_dime : .word dime
at the bottom of the file.
Also, based on what I've been reading online I put
.balign 4
inputChoice: .asciz "%d"
at the top of the file, and put
inputVal : .word 0
at the bottom of the file.
Near the middle of the file (just trust me that there is something wrong with this standalone code, and the rest of the file doesn't matter in this context) I have this block of code:
ldr r3, addr_dime
ldr r2, addr_inputChoice
cmp r2, r3 /*See if the user entered D*/
addeq r5, r5, #10 /*add 10 to the total if so*/
Which I THINK should load "D" into r3, load whatever String or character the user inputted into r2, and then add 10 to r5 if they are the same.
For some reason this doesn't work, and the r5, r5, #10 code only works if addne comes before it.
addr_dime : .word dime is poitlessly over-complicated. The address is already a link-time constant. Storing the address in memory (at another location which has its own address) doesn't help you at all, it just adds another layer of indirection. (Which is actually the source of your problem.)
Anyway, cmp doesn't dereference its register operands, so you're comparing pointers. If you single-step with a debugger, you'll see that the values in registers are pointers.
To load the single byte at dime, zero-extended into r3, do
ldrb r3, dime
Using ldr to do a 32-bit load would also get the \n byte, and a 32-bit comparison would have to match that too for eq to be true.
But this can only work if dime is close enough for a PC-relative addressing mode to fit; like most RISC machines, ARM can't use arbitrary absolute addresses because the instruction-width is fixed.
For the constant, the easiest way to avoid that is not to store it in memory in the first place. Use .equ dime, 'D' to define a numeric constant, then you can use
cmp r2, dime # compare with immediate operand
Or ldr r3, =dime to ask the assembler to get the constant into a register for you. You can do this with addresses, so you could do
ldr r2, =inputVal # r2 = &inputVal
ldrb r2, [r2] # load first byte of inputVal
This is the generic way to handle loading from static data that might be too far away for a PC-relative addressing mode.
You could avoid that by using a stack address (sub sp, #16 / mov r5, sp or something). Then you already have the address in a register.
This is exactly what a C compiler does:
char dime[4] = "D\n";
char input[4] = "xyz";
int foo(int start) {
if (dime[0] == input[0])
start += 10;
return start;
}
From ARM32 gcc6.3 on the Godbolt compiler explorer:
foo:
ldr r3, .L4 # load a pointer to the data section at dime / input
ldrb r2, [r3]
ldrb r3, [r3, #4]
cmp r2, r3
addeq r0, r0, #10
bx lr
.L4:
# gcc greated this "literal pool" next to the code
# holding a pointer it can use to access the data section,
# wherever the linker ends up putting it.
.word .LANCHOR0
.section .data
.p2align 2
### These are in a different section, near each other.
### On Godbolt, click the .text button to see full assembler directives.
.LANCHOR0: # actually defined with a .set directive, but same difference.
dime:
.ascii "D\012\000"
input:
.ascii "xyz\000"
Try changing the C to compare with a literal character instead of a global the compiler can't optimize into a constant, and see what you get.
I am writing an LC3 program that increments each letter of a three-letter word stored in memory following the program. 'a' becomes 'd', 'n' becomes 'q', 'z' becomes 'c', etc.
I am using this as LC3 Assembly a reference
Here is my code so far
.orig x3000
ADD R1, R1, #3
LEA R2, STRING
HALT
STRING .STRINGZ "anz"
.END
I was able to figure out how to declare a string of characters in LC3 from my reference. However does anyone how to do the actual incrementation or have any references that I could use to figure out how to do it?
Using a while loop, I was able to get it to increment each char of the string until a null value is found. I didn't code it to loop back around (z becoming c) but this should get you started.
;tells simulator where to put my code in memory(starting location). PC is set to thsi address at start up
.orig x3000
MAIN
AND R1, R1, #0 ; clear our loop counter
WHILE_LOOP
LEA R2, STRING ; load the memory location of the first char into R1
ADD R2, R2, R1 ; Add our counter to str memory location. R2 = mem[R1 + R2]
LDR R3, R2, #0 ; Loads the value stored in the memory location of R2
BRz END_WHILE ; If there is no char then exit loop
ADD R3, R3, #3 ; change the char
STR R3, R2, #0 ; store the value in R3 back to the location in R2
ADD R1, R1, #1 ; add one to our loop counter
BR WHILE_LOOP ; jump to the top of our loop
END_WHILE
HALT
; Stored Data
STRING .STRINGZ "anz"
.END
I'm trying to understand how the format string vulnerabilities can work and how we can read any address.
I can check in real time how inputing "%x %x %x" in a printf will pop elements off the stack just above the string address.
This is how that stack looks once inside the printf:
(...)
0x7fffffffe018: 0x000000000040052c
0x7fffffffe020: 0x00007fffffffe180
0x7fffffffe028: 0x00007fffffffe168
0x7fffffffe030: *0x00007fffffffe48d* << address of argument
0x7fffffffe038: 0x00007ffff7dd4e80
0x7fffffffe040: 0x00007ffff7de9d60
0x7fffffffe048: 0x00007ffff7ffe268
0x7fffffffe050: 0x0000000000000000
0x7fffffffe058: 0x0000000000400563 <</ return address after printf
0x7fffffffe060: 0x00007fffffffe168
0x7fffffffe068: 0x0000000200400440
(...)
and 0x7fffffffe48d is the adress of the "%x %x %x" string further away in memory:
0x7fffffffe469: "/home/.../C/format_string/test"
*0x7fffffffe48d*: "%x %x %x"
0x7fffffffe4ac: "SSH_AGENT_PID=..."
So logically this will output the 3 elements on top of this, ie:
ffffe168 ffffe180 40052c
Now, what I don't understand is, if I put a random address in the parameter, say:
"\x15\xff\x0A\x23 %x %x %x %x %s", why would that "\x15\xff\x0A\x23" be actually stored on the stack and read by the "%s" ?
From what i'm seeing above, only the adress of the whole string is put on the stack (0x00007fffffffe48d) but not the characters (and indeed the address i intend to read) themselves.
Said differently, whatever I put inside my string, I can only control the content of address:
0x7fffffffe48d: "blablabla %x %x %x"
but not what's going popped from the stack.
You are correct, the code does not push the content of the string on the stack.
It merely pushes its address.
Using a buffer overflow technique here will not allow you to overwrite the return address.
Unfortunately you don't show the C code.
The following C code would put its string data on the stack and thus could be exploited.
The following example from: http://insecure.org/stf/smashstack.html
Stack based string - exploitable
void overrunmybuffer(char *str) {
char buffer[16]; <<-- local fixed sized array, stored on the stack.
//should have used `strncpy()`
strcpy(buffer,str); <<-- strcpy will blindly push 256 bytes in a 16 byte buffer
}
void main() {
char large_string[256];
int i;
for( i = 0; i < 255; i++)
large_string[i] = 'A';
overrunmybuffer(large_string);
}
Heap based string - difficult to exploit
In your code the function looks something like this:
void cannotoverrunstack(char *str) {
char *buffer; <<-- pointer to char, only the address is stored on stack
buffer = malloc(16);
strcpy(buffer,str); <<-- some other data in the heap will be overwritten
but not the stack.
}
Note that the data overwrite on the heap may or may not trigger an access violation and may overwrite useful data.
Because of the random way data gets assigned in the heap this is much less useful then overwriting the stack.
The stack frame has a very predictable layout and is thus very manipulatable.
The proper solution to the buffer overrun problem is to never use strcpy but to only use strncpy like so:
Save code for fixed sized buffers
void cannotoverrunmybuffer(char *str) {
char buffer[16]; <<-- local fixed sized array, stored on the stack.
strncpy(buffer,str,sizeof(buffer)-1); <<-- only copy first 15 bytes.
buffer[15] = 0; <<-- !!put terminating 0 in!
}
Do not forget to forcefully zero-terminate your string, or you'll end-up reading past the end of the string, leading to lossage. If you're working with Unicode strings you need to put two terminating 0's in.
Even better would be to use automatically expanding strings like those used in Java or Delphi (System::OpenString in C++).
Buffer overruns are impossible with those because they will automatically scale from 0 to 2GB's.