How do different commands get executed in CPU x86-64 registers?

Years ago a teacher told our class that 'everything that gets parsed through the CPU can also be exploited'.
Back then I didn't know much about the topic, but now the statement is nagging at me, and I
lack the correct vocabulary to find an answer to this question on the internet myself, so I kindly ask you for help.
We had the lesson about 'cat', 'grep' and 'less', and she said that in the worst case even those commands can cause harm if we parse the wrong content through them.
I don't really understand what she meant by that. I do know how CPU registers work; we also had to write an educational buffer overflow, so I have seen assembly code in the registers as well.
I still don't get the following:
How do commands get executed in the CPU at all? E.g. I use 'cat', so somewhere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world', can I find that string in HEX somewhere in the CPU registers? And if yes:
How does the CPU know that said string is NOT to be executed?
Could you think of any scenario where the above commands could get exploited? Afaik only text gets parsed through them, so how could that be exploitable? What do I have to be careful about?
Thanks a lot!

Machine code executes by being fetched by the instruction-fetch part of the CPU, at the address pointed to by RIP, the instruction-pointer. CPUs can only execute machine code from memory.
General-purpose registers get loaded with data from data load/store instructions, like mov eax, [rdi]. Having data in registers is totally unrelated to having it execute as machine code. Remember that RIP is a pointer, not actual machine-code bytes. (RIP can be set with jump instructions, including indirect jump to copy a GP register into it, or ret to pop the stack into it).
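To make the distinction between "data in a register" and "bytes being executed" concrete, here is a minimal sketch of a toy machine in Python. It is not x86: the opcodes LOAD, ADD and HALT, the single register a, and the memory layout are all made up for illustration. The only point is that the instruction pointer selects what gets decoded, so a data value that happens to look like an opcode is never executed unless the instruction pointer is made to point at it.

```python
# Toy machine (NOT x86): opcodes and layout are hypothetical, for illustration only.
LOAD, ADD, HALT = 1, 2, 0


def run(memory):
    """Fetch/decode/execute loop; returns final registers and fetched addresses."""
    regs = {"a": 0}
    ip = 0                      # plays the role of RIP: an address, not code bytes
    fetched = []                # addresses that were decoded as instructions
    while True:
        opcode = memory[ip]     # instruction fetch always reads memory[ip]
        fetched.append(ip)
        if opcode == LOAD:      # LOAD addr: copy memory[addr] into register a
            regs["a"] = memory[memory[ip + 1]]
            ip += 2
        elif opcode == ADD:     # ADD imm: add an immediate value to register a
            regs["a"] += memory[ip + 1]
            ip += 2
        elif opcode == HALT:
            return regs, fetched
        else:
            raise ValueError(f"unknown opcode {opcode} at address {ip}")


# Code lives at addresses 0..4; address 5 holds data whose value (2) happens
# to equal the ADD opcode. It is loaded into a register but never decoded.
program = [LOAD, 5, ADD, 1, HALT, 2]
regs, fetched = run(program)
print(regs, fetched)
```

The data cell at address 5 ends up in the register, yet it never shows up in the list of fetched addresses. An exploit has to change where the instruction pointer goes (for example through an overwritten return address), not merely get bytes into a register.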
It would help to learn some basics of assembly language, because you seem to be missing some key concepts there. It's kind of hard to answer the security part of this question when the entire premise seems to be built on some misunderstanding of how computers work. (Which I don't think I can easily clear up here without writing a book on assembly language.) All I can really do is point you at CPU-architecture stuff that answers part of the title question of how instructions get executed. (Not from registers).
Related:
How does a computer distinguish between Data and Instructions?
How instructions are differentiated from data?
Modern Microprocessors - A 90-Minute Guide! covers the basic fetch/decode/execute cycle of simple pipelines. Modern CPUs might have more complex internals, but from a correctness / security POV are equivalent. (Except for exploits like Spectre and Meltdown that depend on speculative execution).
https://www.realworldtech.com/sandy-bridge/3/ is a deep-dive on Intel's Sandybridge microarchitecture. That page covering instruction-fetch shows how things really work under the hood in real CPUs. (AMD Zen is fairly similar.)
You keep using the word "parse", but I think you just mean "pass". You don't "parse content through" something, but you can "pass content through". Anyway no, cat usually doesn't involve copying or looking-at data in user-space, unless you run cat -n to add line numbers.
See Race condition when piping through x86-64 assembly program for an x86-64 Linux asm implementation of plain cat using read and write system calls. Nothing in it is data-dependent, except for the command-line arg. The data being copied is never loaded into CPU registers in user-space.
Inside the kernel, copy_to_user inside Linux's implementation of a read() system call on x86-64 will normally use rep movsb for the copy, not a loop with separate load/store, so even in kernel the data gets copied from the page-cache, pipe buffer, or whatever, to user-space without actually being in a register. (Same for write copying it to whatever stdout is connected to.)
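As a rough user-space illustration of the plain-cat loop described above, here is a sketch in Python using os.read and os.write, which are thin wrappers around the read and write system calls. The function name plain_cat and the buffer size are my own choices, not taken from the linked asm program. The Python interpreter does far more work behind the scenes than the asm version, so this only shows that the program's logic never branches on the content being copied.

```python
import os


def plain_cat(in_fd, out_fd, bufsize=65536):
    """Copy in_fd to out_fd until EOF; returns the number of bytes copied."""
    total = 0
    while True:
        chunk = os.read(in_fd, bufsize)   # read(2): kernel fills our buffer
        if not chunk:                     # empty result means end of file
            return total
        view = memoryview(chunk)
        while view:                       # write(2) may accept fewer bytes than offered
            written = os.write(out_fd, view)
            view = view[written:]
            total += written
```

Calling plain_cat(0, 1) would copy stdin to stdout. The only decisions in the loop depend on how many bytes were read or written, never on what those bytes are.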
Other commands, like less and grep, would load data into registers, but that doesn't directly introduce any risk of it being executed as code.

Most of this has already been answered by Peter. However, I would like to add a few things.
How do commands get executed in the CPU at all? E.g. I use 'cat', so somewhere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world', can I find that string in HEX somewhere in the CPU registers?
The CPU does not execute the source code of cat (cat.c) directly. You could check the source code to get an in-depth view.
What actually happens is that the source code is compiled into machine instructions, and those get executed by the CPU. The instructions themselves are not vulnerable, because all they do is move some data and switch some bits. Most vulnerabilities are due to memory management, and cat has been vulnerable in the past. Check this for more detail.
How does the CPU know that said string is NOT to be executed?
It does not. It's the job of the operating system to tell what is to be executed and what is not (for example, by marking memory pages as executable or non-executable).
Could you think of any scenario where the above commands could get exploited? Afaik only text gets parsed through them, so how could that be exploitable? What do I have to be careful about?
You have to be careful about how the program reads the text file into memory. You could even write your own interpreter that executes a txt file; the interpreter would then be the one telling the CPU how to carry out each instruction in the file.

Related

Record dynamic instruction trace or histogram in QEMU?

I've written and compiled a RISC-V Linux application.
I want to dump all the instructions that get executed at run-time (which cannot be achieved by static analysis).
Is it possible to get a dynamic assembly instruction execution histogram from QEMU (or other tools)?

For instruction tracing, I go with -singlestep -d nochain,cpu, combined with some awk. This can become painfully slow and large depending on the code you run.
Regarding the statistics you'd like to obtain, delegate it to R/numpy/pandas/whatever after extracting the program counter.
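For the statistics step, a minimal sketch in plain Python could look like the following. It assumes the awk stage has already reduced the QEMU log to one hexadecimal program-counter value per line; that intermediate format, the function name pc_histogram and the sample addresses are my assumptions, not something QEMU emits directly.

```python
from collections import Counter


def pc_histogram(lines):
    """Count how often each program-counter value occurs.

    Assumes one hex PC per line (e.g. "0x10078"); blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:
            counts[int(text, 16)] += 1
    return counts


# Made-up sample standing in for the extracted trace.
sample = ["0x10078", "0x1007c", "0x10078", "", "0x10080", "0x10078"]
hist = pc_histogram(sample)
for pc, n in hist.most_common():
    print(f"{pc:#x} {n}")
```

To turn the PC histogram into an instruction histogram you would still need to map each address back to its instruction, for example from a disassembly of the binary.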
The presentation or video by user "yvr18" on that topic might cover some aspects of QEMU tracing at various levels (as well as some interesting heatmap visualization).

QEMU doesn't currently support that sort of trace of all instructions executed.
The closest we have today is that there are various bits of debug logging under the -d switch, and you can combine the tracing of "instructions translated from guest to native" with the tracing of "blocks of translated code executed" to work out what was executed, but this is pretty awkward.
Alternatively you could try scripting the gdbstub interface to do something like "disassemble instruction at PC; singlestep" which will (slowly!) give you all the instructions executed.
Note: There is ongoing work to improve QEMU's ability to introspect guest execution so that you can write a simple 'plugin' with functions that are called back on events like guest instruction execution; with that it would be fairly easy to write a dump of guest instructions executed (or do more interesting processing), but this is still work-in-progress, so not available yet.

It seems you can do something similar with rv8 (https://github.com/rv8-io/rv8), using the command:
rv-jit -l

The "spike" RISC-V emulator allows tracing instructions executed, new values stored into registers, or just simply a histogram of PC values (from which you can extract what instruction was at each PC location).
It's not as fast as qemu, but runs at 100 to 200 MIPS on current x86 hardware (at least without tracing enabled)

Can a single byte instruction be executed while being only partially overwritten?

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generates the following shellcode:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions, ending with a store that overwrites the destination byte back to -0x2 to make it an infinite loop again, and an absolute jump that sends the thread back to this infinite loop.
Now I was asking myself whether the jump instruction could be executed while the single destination byte is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) with 0x0 (00000000) to release the thread from this infinite loop.
Could it happen that the jump goes to, let's say, 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond?
Before asking this question here I did the experiment myself in a C++ program and let it run for some hours without it ever missing a single jump.
If you want to have a look at the code I made for this experiment, I have uploaded it to GitHub.
According to this experiment, it seems to be impossible for an instruction to be executed while being only partially overwritten.
However, I have very little knowledge of assembly and of processors, which is why I am asking the question here:
Can anyone confirm my observation please? Is it indeed impossible to have an instruction executed while it is being partially overwritten by another thread? Does anyone know why for sure?
Thank you very much for your help and knowledge on that; I did not know where to look for such information.

No, byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals on cross-modifying code, and maybe also Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs.
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (reducing the thermal headroom for the CPU to turbo above its rated clock speed when there is work to do).
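As a sketch of the condition-variable approach, here is a small Python example using threading.Condition. It is only an analogue of what you would do with std::condition_variable in the C++ experiment; the names released, waiter, release and log are hypothetical. The waiting thread sleeps inside wait() instead of spinning, and no code bytes are ever modified.

```python
import threading

released = False                 # the shared flag the waiter sleeps on
cond = threading.Condition()
log = []


def waiter():
    with cond:
        while not released:      # re-check the flag after every wakeup
            cond.wait()          # sleeps until notified; burns no CPU time
    log.append("released")


def release():
    global released
    with cond:
        released = True
        cond.notify_all()        # wake every thread waiting on the condition


t = threading.Thread(target=waiter)
t.start()
release()
t.join(timeout=5)
print(log)
```

Checking the flag in a loop under the lock makes the code correct whether release() runs before or after the waiter reaches wait().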
The mechanism used by current CPUs is that if a store near the EIP/RIP or any instruction in flight in the pipeline is detected, it does a machine clear. (perf counter machine_clears.smc, aka machine nuke.) It doesn't even try to handle it "efficiently", but if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache line boundary) the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. But a single byte is always updated atomically, so tearing within a byte is not possible.
However, x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast. (And / or avoid breaking existing code in widely-used software!)

Source of clock_gettime

I've tried to understand the behavior of the function clock_gettime by looking at the source code of the Linux kernel.
I'm currently using a 4.4.0-83-lowlatency kernel, but I could only get the 4.4.76 source files (which should be close enough).
My first issue is that there are several occurrences of the function. I chose pc_clock_gettime, which appears to be the closest and the only one handling CLOCK_MONOTONIC_RAW, but if I'm wrong, please correct me.
I traced the execution flow of the function and came to a mysterious ravb_ptp_gettime64 and ravb_ptp_time_read, which are related to the Ethernet driver.
So... if I understand correctly, when I ask the system to give me the time, it asks the Ethernet driver?
This is the first time I looked into kernel code so I'm not used to it. If someone could give me an explanation of "how" and "why", it would be marvelous.

clock_gettime uses a mechanism named vDSO (Virtual Dynamic Shared Object). It's a shared library which the kernel maps into the address space of user programs.
vDSO allows frequently used system calls to be made without a performance penalty. The kernel "puts" time information into memory which the user program can access. In the end, it won't be a system call but only a simple function call.
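From user space the call looks the same either way. Here is a minimal sketch in Python, whose time.clock_gettime wraps the C library function on Unix systems. The helper name monotonic_raw_pair is made up, and the fallback to CLOCK_MONOTONIC is only there for platforms that lack CLOCK_MONOTONIC_RAW. Whether a given clock id is actually served through the vDSO or falls back to a real system call depends on the kernel version and architecture, so treat this as an illustration of the API rather than proof of the fast path.

```python
import time


def monotonic_raw_pair():
    """Read the raw monotonic clock twice and return both readings in seconds."""
    # CLOCK_MONOTONIC_RAW is Linux-specific; fall back to CLOCK_MONOTONIC elsewhere.
    clock = getattr(time, "CLOCK_MONOTONIC_RAW", time.CLOCK_MONOTONIC)
    first = time.clock_gettime(clock)
    second = time.clock_gettime(clock)
    return first, second


first, second = monotonic_raw_pair()
print(f"elapsed between two calls: {second - first:.9f} s")
```

Running a loop like this under strace is one way to check whether the calls show up as system calls on a particular machine.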

Kernel Panic -- Failed copy_from_user, kmalloc?

I am writing a rootkit for my OS class (the teacher is okay with me asking for help here). My rootkit hooks the sys_read system call to hide "magic" ports from the user. When I copy the user buffer *buf (one of the arguments of sys_read) to kernel space (into a buffer called kbuf) I get a kernel panic/core dump error. It is possible that this is just because breaking read brings the system to a halt, but I wonder if anyone has any perspective on this.
The code is available online. Look at line 207: https://github.com/joshimhoff/toykit/blob/master/toykit.c
I hooked getdents and used copy_from_user to bring the getdents structs into kernel space, and this worked well! I am not sure what is different about read.
Thanks for the help!

I figured it out. I called the actual sys_read function and didn't check the return value. Sometimes it is negative to indicate an error. Instead of failing early, I asked kmalloc for a negative number of bytes.
Imagine that. Allocating negative memory. That would be a crazy world.
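The fix amounts to checking the return value before using it as a size. Below is a user-space sketch of that pattern in Python, not the actual kernel code from toykit.c: copy_result, read_result and source are hypothetical names, and a negative read_result stands in for a negative errno-style return from the real sys_read. In the kernel the size parameter of kmalloc is unsigned, so a negative value would be interpreted as an enormous request; in this sketch bytearray(-1) would instead raise ValueError.

```python
def copy_result(read_result, source):
    """Copy read_result bytes from source, or propagate a negative error code.

    read_result mimics a syscall return value: a byte count, or a negative
    number that signals an error.
    """
    if read_result < 0:
        return read_result               # fail early: never allocate with a negative size
    buf = bytearray(read_result)         # safe: size is known to be non-negative
    buf[:] = source[:read_result]
    return buf


print(copy_result(5, b"hello world"))    # a normal read of 5 bytes
print(copy_result(-14, b"hello world"))  # an error return is passed through untouched
```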

Shell Code and kernel hacking OS Agnostic?

I'm reading a book about hacking the kernel, and one area the author keeps coming back to is shell code: many attempts at kernel hacking try to find a way to execute shell code.
Can someone elaborate more on this topic, and in particular clarify what "shell code" is?
How does shell code get around sudo in *NIX systems or not being Admin on a Windows machine?
Are there examples of shell code attacks that aren't OS specific? I would think one has to be targeting a specific OS.

Shell code is the payload used when exploiting a vulnerability; its purpose is to create a command shell from which the attacker can control the machine.
A typical shell code when run might open a network connection and spawn cmd.exe on a Windows machine (or /bin/sh on Linux/unix) piping stdin and stdout over the network connection. An attacker may complete the connection from his machine and enter commands and get feedback as if he was sitting at the compromised machine.
A buffer overflow is not shell code. It is the vulnerability that is exploited to execute the shell code.
The buffer overflow is exploited to copy the shell code to the user's machine and overwrite the return address on the program's stack. When the currently executing function returns, the processor jumps to the uploaded shell code which creates the shell for the attacker.
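The return-address overwrite can be simulated safely in Python. This is only a model under stated assumptions: an 8-byte buffer sits directly below an 8-byte little-endian saved return address, and the addresses 0x401000 and 0xdeadbeef are made up. Real stack frames contain more than this, and modern systems add protections such as stack canaries.

```python
import struct

BUF_SIZE = 8  # hypothetical size of the local buffer


def make_frame(return_address):
    """Model a stack frame: BUF_SIZE buffer bytes, then the saved return address."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<Q", return_address))


def unchecked_copy(frame, data):
    """Copy data to the start of the frame with no bounds check, like strcpy."""
    frame[0:len(data)] = data


def saved_return(frame):
    """Read back the 64-bit return address stored after the buffer."""
    return struct.unpack_from("<Q", frame, BUF_SIZE)[0]


frame = make_frame(0x401000)
payload = b"A" * BUF_SIZE + struct.pack("<Q", 0xDEADBEEF)  # filler + new address
unchecked_copy(frame, payload)
print(hex(saved_return(frame)))
```

After the copy, the saved return address holds the attacker's value. In a real exploit that value would point at the uploaded shell code, which is what the function's return would then jump to.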
For more information on exploiting buffer overflows, have a look at Smashing the Stack for Fun and Profit.
You can try to use the -fno-stack-protector flag for gcc but I'm not very familiar with OSX or whatever stack protections it may use.
If you want to play around with buffer overflows, modern compilers and modern OSs have protections in place that make this difficult. Your best bet would be to grab yourself a Linux distro and turn them off. See this question for more information on disabling these protections.
Note you don't need to have a buffer overflow to execute a shell code. I've demonstrated opening a remote shell using a command injection exploit to upload and execute a batch file.

Essentially it's finding a buffer overflow or similar technique that allows you to insert malicious code into a process running as root.
For example, if you used a fixed sized buffer and you overrun that buffer, you can essentially overwrite memory contents and use this to execute a malicious payload.

A simple shell code snippet that can come back to bite you is:
/bin/sh
or inside a C program:
system("/bin/sh");
If you can direct your exploits to execute such a line of code (e.g. through a buffer overflow that hijacks the intended control path of the program), you get a shell prompt with the victim's privileges and you're in.

Basically, when a program runs, everything that's related to it (variables, instructions, etc.) is stored in memory, in buffers.
Memory is essentially a hell of a lot of bits in your RAM.
So, for the purpose of our example, let's say that there's a variable Name that gets stored in bits 1-10. Let's assume that bits 11-30 are used for storing instructions. It's clear that the programmer expects Name to be 10 bits long. If I give a 20-bit-long Name, its buffer's gonna overflow into the area that holds the instructions. So I'm gonna design the latter 10 bits of my Name such that the instructions will get overwritten by naughty ones.
'innocentmeNAUGHTYCOD'
That's an Attack.
Though not all instances are this obvious, there's some vulnerability in almost every large piece of code. It's all about how you exploit it.
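The example above can be played through in a few lines of Python. This sketch uses bytes rather than bits so that the 20-character string fits the numbers in the example, and it fills the "instruction" area with the letter I as a placeholder; both are simplifications of mine. It also shows the bounds check that prevents the overwrite.

```python
NAME_SLOT = 10      # cells reserved for Name
MEMORY_SIZE = 30    # cells 10..29 stand in for the instruction area


def fresh_memory():
    """Name slot zeroed, instruction area filled with placeholder 'I' bytes."""
    return bytearray(b"\x00" * NAME_SLOT + b"I" * (MEMORY_SIZE - NAME_SLOT))


def store_name_unchecked(memory, name):
    """Vulnerable version: writes however many bytes it is given."""
    memory[0:len(name)] = name


def store_name_checked(memory, name):
    """Safe version: refuses input that does not fit in the Name slot."""
    if len(name) > NAME_SLOT:
        raise ValueError("name does not fit in its slot")
    memory[0:len(name)] = name


attack = b"innocentmeNAUGHTYCOD"
memory = fresh_memory()
store_name_unchecked(memory, attack)
print(bytes(memory[NAME_SLOT:]))   # first half of the instruction area is overwritten
```

With store_name_checked the same input raises an error instead of spilling into the neighbouring area, which is the kind of check whose absence makes the attack possible.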
