Can't understand the buffer overflow example in "The Art of Exploitation"

Can't understand the buffer overflow example in "The Art of Exploitation" - security

My problem is very similar but not the same with the this one.
I run the same example of exploit_notesearch.c in the book: Hacking, the Art of Exploitation on my 64-bit OS, Archlinux and it doesn't work.
From the above link I learnt that it just can't work on most 64-bit systems. But I still can't understand why the programme have to do this: ret = (unsigned int)&i - offset. Why can't I just do this: ret = (unsigned)shellcode so that I can replace the vulnerable program's return address with shellcode's beginning address?

The ret = (unsigned)shellcode will make the ret equals to the address of the shellcode array in your program. But that address is not the address of your malicious code in the target program(notesearch.c). The target process will put its searchstring on stack, so that your malicious code will be also put onto the stack of the target process.
In old days, the memory layout of processes was typically highly deterministic - the location of the stack buffer could usually be predicted quite well by the attacker (particularly if they knew exactly which version of the target software was being attacked). So it will be very easy to know what is the exact address of the searchstring and your shellcode.
However, today, many operating system will perform ASLR. So attackers trying to execute shellcode injected on the stack have to find the stack first. The system obscures related memory-addresses from the attackers. These values have to be guessed, and a mistaken guess is not usually recoverable due to the application crashing(Segmentation Fault).
To improve the chances of success when there was some guesswork involved, the active shellcode would often be preceeded by a large quantity of executable machine code that performed no useful operation - called a "NOP sled" or "NOP slide".
So even the ret = (unsigned int)&i - offset can not make sure your shellcode will be executed succesfully.

Related

How do different commands get executed in CPU x86-64 registers?

Years ago a teacher once said to class that 'everything that gets parsed through the CPU can also be exploited'.
Back then I didn't know too much about the topic, but now the statement is nagging on me and I
lack the correct vocabulary to find an answer to this question in the internet myself, so I kindly ask you for help.
We had the lesson about 'cat', 'grep' and 'less' and she said that in the worst case even those commands can cause harm if we parse the wrong content through it.
I don't really understand how she meant that. I do know how CPU registers work, we also had to write an educational buffer overflow so I have seen assembly code in the registers aswell.
I still don't get the following:
How do commands get executed in the CPU at all? e.g. I use 'cat' so somehwere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world' - can I find that string in HEX somewhere in the CPU registers? And if yes:
How does the CPU know that said string is NOT to be executed?
Could you think of any scencario where the above commands could get exploited? Afaik only text gets parsed through it, how could that be exploitable? What do I have to be careful about?
Thanks alot!

Machine code executes by being fetched by the instruction-fetch part of the CPU, at the address pointed to by RIP, the instruction-pointer. CPUs can only execute machine code from memory.
General-purpose registers get loaded with data from data load/store instructions, like mov eax, [rdi]. Having data in registers is totally unrelated to having it execute as machine code. Remember that RIP is a pointer, not actual machine-code bytes. (RIP can be set with jump instructions, including indirect jump to copy a GP register into it, or ret to pop the stack into it).
It would help to learn some basics of assembly language, because you seem to be missing some key concepts there. It's kind of hard to answer the security part of this question when the entire premise seems to be built on some misunderstanding of how computers work. (Which I don't think I can easily clear up here without writing a book on assembly language.) All I can really do is point you at CPU-architecture stuff that answers part of the title question of how instructions get executed. (Not from registers).
Related:
How does a computer distinguish between Data and Instructions?
How instructions are differentiated from data?
Modern Microprocessors
A 90-Minute Guide! covers the basic fetch/decode/execute cycle of simple pipelines. Modern CPUs might have more complex internals, but from a correctness / security POV are equivalent. (Except for exploits like Spectre and Meltdown that depend on speculative execution).
https://www.realworldtech.com/sandy-bridge/3/ is a deep-dive on Intel's Sandybridge microarchitecture. That page covering instruction-fetch shows how things really work under the hood in real CPUs. (AMD Zen is fairly similar.)
You keep using the word "parse", but I think you just mean "pass". You don't "parse content through" something, but you can "pass content through". Anyway no, cat usually doesn't involve copying or looking-at data in user-space, unless you run cat -n to add line numbers.
See Race condition when piping through x86-64 assembly program for an x86-64 Linux asm implementation of plain cat using read and write system calls. Nothing in it is data-dependent, except for the command-line arg. The data being copied is never loaded into CPU registers in user-space.
Inside the kernel, copy_to_user inside Linux's implementation of a read() system call on x86-64 will normally use rep movsb for the copy, not a loop with separate load/store, so even in kernel the data gets copied from the page-cache, pipe buffer, or whatever, to user-space without actually being in a register. (Same for write copying it to whatever stdout is connected to.)
Other commands, like less and grep, would load data into registers, but that doesn't directly introduce any risk of it being executed as code.

Most of the things have already been answered by Peter. However i would like to add a few things.
How do commands get executed in the CPU at all? e.g. I use 'cat' so somehwere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world' - can I find that string in HEX somewhere in the CPU registers?
cat is not directly executed by the CPU cat.c. You could check the source code and get and in-depth view. .
What actually happens is that each instruction is converted to assembly instruction and they get executed by the CPU. The instructions are not vulnerable because what they do is just move some data and switch some bits. Most of the vulnerability are due to memory management and cat has been vulnerable in the past Check this for more detail
How does the CPU know that said string is NOT to be executed?
It does not. Its the job of the operating system to tell what is to be executed and what not.
Could you think of any scencario where the above commands could get exploited? Afaik only text gets parsed through it, how could that be exploitable? What do I have to be careful about?
You have to be careful about how you are passing the text file to the memory. You could even make your own interpreter that would execute txt file and then the interpreter will be telling the CPU about how to execute that instruction.

Can a single byte instruction be executed while being only partially overwritten?

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generate the following shellcode:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions ending by the overwriting of the destination byte back to -0x2 to make it an infinite loop again, and an absolute jump to send the thread back to this infinite loop.
Now I was asking myself if it was possible that the instruction of the jump was executed while the single byte of the destination is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) to 0x0 (00000000) to release the thread of this infinite loop.
Could it happen that the jump goes to let's say 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond?
Before asking this question here I have done the experiment myself in a C++ program and I have let it run for some hours without it never missing a single jump.
If you want to have a look at the code I made for this experiment I have uploaded it to GitHub
Accordingly to this experiment, it seems to be impossible that an instruction is executed while being only partially overwritten .
However, I have very little knowledge in assembly and in processors, this is for this reason that I ask the question here:
Can anyone confirm my observation please? Is it indeed impossible to have an instruction executed while being partially overwritten by another thread? Does anyone knows why for sure?
Thank you very much for your help and knowledge on that, I did not know where to look for such an information.

No, byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals for cross modifying code. And maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (reducing the thermal headroom for the CPU to turbo above its rated clock speed up when there is work to do).
The mechanism used by current CPUs is that if a store near the EIP/RIP or any instruction in flight in the pipeline is detected, it does a machine clear. (perf counter machine_clears.smc, aka machine nuke.) It doesn't even try to handle it "efficiently", but if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache line boundary) the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. But a single byte is always updated atomically, so tearing within a byte is not possible.
However, x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast. (And / or avoid breaking existing code in widely-used software!)

Can anyone explain why NO-OP slide is used in shelllcoding?

An example where NO-OP slide is a must for the exploit to work would be really helpful.

An example of when it is a must is when you want an exploit to be portable when targeting a non-ASLR enabled executable/system. Consider a local privilege escalation exploit where you return to shellcode on the stack. Because the stack holds the environment, the shellcode on the stack will be at slightly different offsets from the top of the stack when executing from within different users' shells, or on different systems. By prefixing the shellcode with, for example, 64k nop instructions, you provide a large margin of error for the stack address since your code will execute the same whether you land on the first nop or the last one.
Using nops is generally not as useful when targeting ASLR enabled systems since data sections will be mapped in entirely different areas of memory

Is Linux program's stack somehow modified in a non-explicit way?

I am trying to write a simple ELF64 virus in NASM on Linux. It appends itself to the victim (and of course does all that section & segment related stuff) and changes victim's entry point so that it points to the malicious code.
When the infected program is being launched, the first one to be executed is my virus, and when does all it's work, it jumps to the original entry point and the victim's code is being executed.
When the virus infects simple C++ hello world, everything works fine: as I can see in strace, the virus executes properly and then the victim's code executes.
But if I append:
printf("%s\n", argv[0]);
to the victim's code, re-infect it and run, the virus' code executes properly, "hello world" is printed, but then a segmentation fault error is thrown.
I think it means that the stack is being changed during virus' execution so that there's some random number instead of the original argv[0].
However, I've analyzed the whole source of my virus, marked all pushes, pops and direct modifications of rsp, analyzed them carefully and it seems that the stack should be in the same state. But, as I see, it isn't.
Is it possible that the stack is being modified in some non-explicit way by for example a syscall? Or maybe it's impossible and I should just spend few more hours staring at the source to find a bug?

ELF64 virus in NASM on Linux
Persumably on x86_64 (64-bit Linux could also be aarch64, or powerpc64, or sparcv9, or ...).
it seems that the stack should be in the same state
What about registers? Note that on x86_64 argv is passed to main in $rsi, not on the stack.
You must read and understand x86_64 calling conventions.

2 questions regarding ASLR

I've been reading about ASLR and I have a couple of questions. I have little programming experience but I am interested in the theory behind it.
I understand that it randomizes where DLLs, stacks and heaps are in the virtual address space so that malicious code doesn't know their location, but how does the actual program know their location when it needs them?
If the legitimate process can locate them, what stops the malicious code doing the same?
and finally, is the malicious code that ASLR tries to prevent running in the user space of the process it is attacking?
Thanks

As background, ASLR is intended to complicate code injection attacks where the attacker tries to utilize your overflow bug to trick your application into running the attacker's code. For example, in a successful stack buffer overflow attack the attacker pushes their code onto the stack and modifies the call frame's return pointer to point to the on-stack code.
Most code injection attacks require the attacker to know the absolute address of some part of your process's memory layout. For stack buffer overflow attacks, they need to know the address of the stack frame of the vulnerable function call so they can set the functions return pointer to point to the stack. For other attacks this could be the address of heap variables, exception tables, etc...
One more important background fact: unlike programming languages, machine code has absolute addresses in it. While your program may call function foo(), the machine code will call address 0x12345678.
but how does the actual program know their location when it needs them?
This is established by the dynamic linker and other operating system features that are responsible for converting your on-disk executable into an in-memory process. This involves replacing references to foo with references to 0x12345678.
If the legitimate process can locate them, what stops the malicious code doing the same?
The legitimate process knows where the addresses are because the dynamic linker creates the process such that the actual addresses are hard-wired into the process. So the process isn't locating them, per se. By the time the process is started, the addresses are all calculated and inserted into the code. An attacker can't utilize this because their code is not modified by the dynamic linker.
Consider the scenario where an attacker has a copy of the same executable that they are trying to attack. They can run the executable on their machine, examine it, and find all of the relevant addresses. Without ASLR, these addresses have a good chance of being the same on your machine when you're running the executable. ASLR randomizes these addresses meaning that the attacker can't (easily) find the addresses.
and finally, is the malicious code that ASLR tries to prevent running in the user space of the process it is attacking?
Unless there's a kernel injection vulnerability (which would likely be very bad and result in patches by your OS vendpr), yes, it's running in the user space. More specifically, it will likely be located on the stack or the heap as this is where user input is stored. Using data execution prevention will also help to prevent successful injection attacks.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string