I'm writing a bootloader and was thinking about how to jump to the kernel. I saw that people use jmp 0x8000 (or whatever address the kernel is at) in order to load it. But if the JMP fails for whatever reason (for example, there is nothing at the kernel address, or the kernel is 64-bit while we are in 32-bit mode), how could I make my NASM code run something different instead? For example, it would attempt a JMP to the kernel address, but if whatever is there can't or won't run, it could print a message stating that the kernel could not be loaded.
I'm trying to inject code into Windows process (a game) running under Wine in Linux (for learning purposes, of course). Not a pre-built DLL, but native code.
On Windows, we can use procedures like VirtualAllocEx and WriteProcessMemory to write code to the virtual memory of a foreign process and then create a thread in it via CreateRemoteThread, thus executing our code in the context of the foreign process.
This is a popular approach, which is often used by all sorts of cheats and trainers for games. The popular game hacking program Cheat Engine has its own API that does this trick.
To do the same on Linux, you need many more, and much more complex, manipulations: attach to the process using ptrace, swap registers and make system calls to allocate memory, write code there, then create a thread using the clone system call that will execute our code. It is even better and safer if the thread created by clone creates another thread using pthread_create, and that last thread executes our code.
In order for a thread created with clone to jump to the desired code, the address of this code is written to the end of the memory allocated for its stack, and the clone system call itself is performed using the syscall instruction, followed by the ret instruction. This combination of instructions is easily found in libc.so.
I found an example in C on Github on how to do it correctly and implemented my own program. I won't include its source code here because it's quite large. Let me just say that it works as expected with native Linux processes. Moreover, I can also inject code that calls Linux routines (for example, puts from libc.so) into the Wine process and see the correct result:
lea rdi, [text]
call <libc.so code vaddr + `puts` offset>
ret
text:
db 'puts called successfully!', 0
But when I try to call any Windows procedure in the thread created by pthread_create (be it any procedure from the game code or, for example, MessageBoxA from user32.dll), the thread hangs:
sub rsp, 32
xor rcx, rcx
lea rdx, [text]
lea r8, [caption]
mov r9, 0
call <user32.dll code vaddr + `MessageBoxA` offset>
add rsp, 32
ret
text:
db 'MessageBoxA called successfully!', 0
caption:
db 'Yaaay!', 0
But why is this happening? After all, Wine is not an emulator, but just an API and system call compatibility layer. Doesn't this mean that the code itself runs natively on the processor, just like the code of any Linux programs, and I should be able to interoperate between them?
I suspect that the problem may be, for example, a stack incompatibility. I once heard that Wine uses some kind of trick to convert the stack between Windows and Linux, but I don't know what exactly that trick is, and I'm not skilled enough to find and understand it in the Wine source code.
Could you explain to me exactly why my idea of calling Windows procedures from a thread created by pthread_create does not work, and how I can make it work?
On an x86-64 Intel system that supports syscall and sysret what's the "fastest" system call from 64-bit user code on a vanilla kernel?
In particular, it must be a system call that exercises the syscall/sysret user <-> kernel transition [1], but does the least amount of work beyond that. It doesn't even need to do the syscall itself: some type of early error which never dispatches to the specific call on the kernel side is fine, as long as it doesn't go down some slow path because of that.
Such a call could be used to estimate the raw syscall and sysret overhead independent of any work done by the call.
[1] In particular, this excludes things that appear to be system calls but are implemented in the VDSO (e.g., clock_gettime) or are cached by the runtime (e.g., getpid).
One that doesn't exist, and therefore returns -ENOSYS quickly.
From arch/x86/entry/entry_64.S:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
#ifdef CONFIG_RETPOLINE
movq sys_call_table(, %rax, 8), %rax
call __x86_indirect_thunk_rax
#else
call *sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
1:
Use an invalid system call number so the dispatching code simply returns with
eax = -ENOSYS instead of dispatching to a system-call handling function at all.
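As a rough illustration of the idea above, here is a minimal C sketch that issues a system call with a number no kernel is expected to implement and times it. The number 100000 and the helper names are assumptions made for this example, not anything taken from the kernel source; the call goes through the glibc syscall(2) wrapper, so its errno handling is included in the measurement.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical number, assumed to be far above any real syscall table entry. */
#define BOGUS_SYSCALL_NR 100000L

/* Issue the bogus call once and return the errno it produced (0 if it did not fail). */
static int bogus_syscall_errno(void)
{
    errno = 0;
    long ret = syscall(BOGUS_SYSCALL_NR);
    return (ret == -1) ? errno : 0;
}

/* Average wall-clock nanoseconds per bogus call over `iters` iterations. */
static double bogus_syscall_ns(long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(BOGUS_SYSCALL_NR);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

On a plain kernel bogus_syscall_errno() should report ENOSYS; a seccomp filter or other sandbox in front of the kernel may change both the result and the timing, so numbers from bogus_syscall_ns() are only meaningful on the exact setup being measured.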
Unless this causes the kernel to use the iret slow path instead of sysret / sysexit. That might explain the measurements showing an invalid number being 17 cycles slower than syscall(SYS_getpid), because glibc error handling (setting errno) probably doesn't explain it. But from my reading of the kernel source, I don't see any reason why it wouldn't still use sysret while returning -ENOSYS.
This answer is for sysenter, not syscall. The question originally said sysenter / sysret (which was weird because sysexit goes with sysenter, while sysret goes with syscall). I answered based on sysenter for a 32-bit process on an x86-64 kernel.
Native 64-bit syscall is handled more efficiently inside the kernel. (Update; with Meltdown / Spectre mitigation patches, it still dispatches via C do_syscall_64 in 4.16-rc2).
My What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? Q&A gives an overview of the kernel side of system-call entry points from compat mode into an x86-64 kernel (entry_64_compat.S). This answer is just taking the relevant parts of that.
The links in that answer and this one are to Linux 4.12 sources, which don't contain the Meltdown mitigation page-table manipulation, so that will be significant extra overhead.
int 0x80 and sysenter have different entry points. You're looking for entry_SYSENTER_compat. AFAIK, sysenter always goes there, even if you execute it in a 64-bit user-space process. Linux's entry point pushes a constant __USER32_CS as the saved CS value, so it will always return to user-space in 32-bit mode.
After pushing registers to construct a struct pt_regs on the kernel stack, there's a TRACE_IRQS_OFF hook (no idea how many instructions that amounts to), then call do_fast_syscall_32 which is written in C. (Native 64-bit syscall dispatching is done directly from asm, but 32-bit compat system calls are always dispatched through C).
do_syscall_32_irqs_on in arch/x86/entry/common.c is pretty light-weight: just a check if the process is being traced (I think this is how strace can hook system calls via ptrace), then
...
if (likely(nr < IA32_NR_syscalls)) {
regs->ax = ia32_sys_call_table[nr]( ... arg );
}
syscall_return_slowpath(regs);
}
AFAIK, the kernel can use sysexit after this function returns.
So the return path is the same whether or not EAX had a valid system call number, and obviously returning without dispatching at all is the fastest path through that function, especially in a kernel with Spectre mitigation where the indirect branch on the table of function pointers would go through a retpoline and always mispredict.
If you want to really test sysenter/sysexit without all that extra overhead, you'll need to modify Linux to put a much simpler entry point without checking for tracing or pushing / popping all the registers.
You'd probably also want to modify the ABI to pass a return address in a register (like syscall does on its own) instead of saved on the user-space stack which Linux's current sysenter ABI does; it has to get_user() to read the EIP value it should return to.
Or if all this overhead is part of what you want to measure, you're definitely all set with an eax that gives you -ENOSYS; at worst you'll be getting one extra branch miss from the range-check if branch predictors are hot for that branch based on normal 32-bit system calls.
In this benchmark by Brendan Gregg (linked from this blog post which is interesting reading on the topic) close(999) (or some other fd not in use) is recommended.
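A minimal sketch of that suggestion, assuming descriptor 999 is not open in the calling process (the macro and function names are made up for this example). The call enters the kernel, fails the descriptor lookup, and comes back with EBADF.

```c
#include <errno.h>
#include <unistd.h>

/* Assumption: nothing in this process has descriptor 999 open. */
#define UNUSED_FD 999

/* Call close() on the unused descriptor and return the resulting errno (0 on success). */
static int close_unused_fd_errno(void)
{
    errno = 0;
    return (close(UNUSED_FD) == -1) ? errno : 0;
}
```

Unlike an invalid system call number, this path is dispatched to the real close handler, so it measures the entry and exit cost plus a small, fixed amount of work.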
Some system calls don't even go through any user->kernel transition; see vdso(7).
I suspect that these VDSO system calls (e.g. time(2), ...) are the fastest. You could claim that there are no "real" system calls.
BTW, you could add a dummy system call to your kernel (e.g. some system call always returning 0, or a hello world system call, see also this) and measure it.
I suspect (without having benchmarked it) that getpid(2) should be a very fast system call, because the only thing it needs to do is fetch some data from the kernel memory. And AFAIK, it is a genuine system call, not using VDSO techniques. And you could use syscall(2) to avoid its caching done by your libc and forcing the genuine system call.
I maintain my position (given in a comment to your initial question): without actual motivation, your question does not make any concrete sense. I still think that syscall(2) doing getpid measures the typical overhead of making a system call (and I guess that is the one you really care about). In practice, almost all system calls do more work than such a getpid (or getppid).
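A short sketch of the syscall(2) route mentioned above; raw_getpid is a name invented for this example. It bypasses whatever the libc getpid() wrapper may do and always enters the kernel.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for the PID directly, without going through the libc getpid() wrapper. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Calling raw_getpid() in a timed loop gives a per-call figure that includes the small amount of work getpid does in the kernel.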
All,
I'm trying to figure this out by endlessly debugging applications, but I can't seem to find my answer.
In my 32-bit PE injection I eventually change EAX to the new entry point of the injected PE, then resume the thread. I've read that the kernel runs a call EAX at the end to get to the entry point (I did not see this when debugging applications, so I have no idea if that is really the case).
However, I can't seem to find if this is possible in x64 (Tried about all registers :)).
So all in all two questions:
Does the kernel actually call EAX? I can't see that call when debugging.
Is the same method usable of changing a register to get the new entrypoint to run in x64 or do I need to rely on e.g. CreateRemoteThread?
P.S.: I'm a security researcher :)
On x64, the RCX register holds the application-defined entry point of the thread; on x86, EAX is used. And it is not the kernel that calls this address, but kernel32.dll.
Using only (32 bits) x86 assembly, is it possible to check if an address is writable, without interacting with the operating system, and without risking a segfault?
The program will run on a Linux system in ring 3. I cannot use the "verw" instruction.
To give an example, I might want to check if the address 0x0804a000 is writable. But if I e.g. do "mov eax, 0x0804a000; mov [eax], eax" then, if the address is not writable, the program will segfault. The only other way I know is to e.g. call sys_read into the address and see if it fails, but this interacts with the operating system.
Is there a way to check if an address is writable given the constraints? If so, how?
If you have kernel privileges you could probably find that info in the MMU.
But if you don't, you simply do not have access to it and must use OS facilities.
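For reference, here is a sketch of what "using OS facilities" can look like on Linux, written in C rather than assembly. It does not meet the question's no-OS constraint; it only shows a probe that cannot segfault. The function name is_writable and the pipe-based approach are choices made for this example: one byte is copied out of the address with write() and then copied back in with read(), so the kernel reports EFAULT instead of the process taking a fault, and the byte keeps its value.

```c
#include <unistd.h>

/*
 * Returns 1 if one byte at addr can be read and written by this process,
 * 0 if the kernel refuses either access, and -1 if no pipe could be created.
 */
static int is_writable(void *addr)
{
    int fds[2];
    int ok = 0;

    if (pipe(fds) != 0)
        return -1;

    /* write() copies the byte out of addr; it fails if addr is unreadable. */
    if (write(fds[1], addr, 1) == 1) {
        /* read() copies the same byte back into addr; it fails if addr is not writable. */
        if (read(fds[0], addr, 1) == 1)
            ok = 1;
    }

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

The result is only a snapshot: another thread can change the mapping right after the check, so this is a diagnostic aid rather than a guarantee.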
If you mean not calling an OS function, then it is possible at least on Windows by using Structured Exception Handling. It is still OS specific of course, because you need to access the Windows TIB at the FS segment.
This might be a silly question, but I was debugging a binary with gdb trying to "reverse engineer" it and reached an instruction that makes a syscall after which the effect I want to reverse engineer appears. I assume that another process is taking over and does the job so I was wondering if it was possible to debug the kernel code that handles the syscall with gdb.
Here is the x86 assembly snippet that makes the syscall (it appears that it is sys_getpid):
0x00007ffff7660d3e <+14>: movsxd rdx,edx
0x00007ffff7660d41 <+17>: movsxd rdi,edi
0x00007ffff7660d44 <+20>: mov eax,0x14
0x00007ffff7660d49 <+25>: syscall
The syscall (or sysenter or int 0x80 etc...) machine instruction is for making syscalls which by definition are handled by the Linux kernel. Details are defined in the x86-64 ABI specification. Read Advanced Linux Programming to get an overview of most of them. See also Linux Assembly HowTo.
From the point of view of a user application, a syscall is a virtual atomic instruction.
No specific userland process is handling syscalls, it is the job of the kernel to handle them, and it is nearly the sole way for an application to interact with the kernel.
The processing of syscalls by the kernel for a given process is accounted as system CPU time, e.g. by time(1).
The list of documented syscalls is given in syscalls(2). See also <asm/unistd.h> and <asm/unistd_64.h> etc... headers.
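The numbers in those headers are architecture-specific, so it is worth checking which call a given eax value selects. In the x86-64 table, 20 (0x14) is writev and getpid is 39, while 20 is getpid in the 32-bit i386 table; if the disassembly above comes from a 64-bit process, as its registers suggest, the call is more likely writev. A small sketch for looking up a few numbers on the machine at hand (syscall_name is a hypothetical helper and covers only a handful of calls):

```c
#include <sys/syscall.h>

/* Map a few syscall numbers to names, using this architecture's own table. */
static const char *syscall_name(long nr)
{
    switch (nr) {
    case SYS_read:   return "read";
    case SYS_write:  return "write";
    case SYS_close:  return "close";
    case SYS_readv:  return "readv";
    case SYS_writev: return "writev";
    case SYS_getpid: return "getpid";
    default:         return "unknown to this sketch";
    }
}
```

Passing the value loaded into eax, for example syscall_name(0x14), shows how the kernel on that architecture would interpret it; strace(1) gives the same answer for a running process.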
You could use strace(1) to understand the sequence of syscalls done by a particular run (of some process).
See also vdso(7).