Linux 64-bit context switch

In the switch_to macro in 32-bit mode, the following code is executed before the __switch_to function is called:
asm volatile("pushfl\n\t" /* save flags */ \
"pushl %%ebp\n\t" /* save EBP */ \
"movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
"movl %[next_sp],%%esp\n\t" /* restore ESP */ \
"movl $1f,%[prev_ip]\n\t" /* save EIP */ \
"pushl %[next_ip]\n\t" /* restore EIP */ \
__switch_canary \
"jmp __switch_to\n" /* regparm call */
The next task's EIP is pushed onto the stack (the "restore EIP" line). When __switch_to finishes, its ret returns to that location.
Here is the corresponding 64-bit code:
asm volatile(SAVE_CONTEXT \
"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
"movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \
"call __switch_to\n\t"
There, only RSP is saved and restored. I think the RIP must already be at the top of the stack, but I cannot find the instruction where that is done.
How is the 64-bit context switch, especially for the RIP register, actually done?
Thanks in advance!

In the 32-bit kernel, thread.ip may be one of:
the 1 label in switch_to
ret_from_fork
ret_from_kernel_thread
The return to the proper place is ensured by simulating a call using a push + jmp pair.
In the 64-bit kernel, thread.ip is not used like this. Execution always continues after the call (the point that corresponds to the 1 label in the 32-bit case). As such, there is no need to emulate the call; it can be done normally. Dispatching to ret_from_fork happens using a conditional jump after __switch_to returns (you have omitted this part):
#define switch_to(prev, next, last) \
asm volatile(SAVE_CONTEXT \
"movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
"movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \
"call __switch_to\n\t" \
"movq "__percpu_arg([current_task])",%%rsi\n\t" \
__switch_canary \
"movq %P[thread_info](%%rsi),%%r8\n\t" \
"movq %%rax,%%rdi\n\t" \
"testl %[_tif_fork],%P[ti_flags](%%r8)\n\t" \
"jnz ret_from_fork\n\t" \
RESTORE_CONTEXT \
The ret_from_kernel_thread is incorporated into the ret_from_fork path, using yet another conditional jump in entry_64.S:
ENTRY(ret_from_fork)
DEFAULT_FRAME
LOCK ; btr $TIF_FORK,TI_flags(%r8)
pushq_cfi $0x0002
popfq_cfi # reset kernel eflags
call schedule_tail # rdi: 'prev' task parameter
GET_THREAD_INFO(%rcx)
RESTORE_REST
testl $3, CS-ARGOFFSET(%rsp) # from kernel_thread?
jz 1f
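To see why the call alone is enough, here is a minimal user-space sketch of the 64-bit scheme (assumptions: x86-64, GCC, System V ABI; all names are made up, and this is illustrative, not kernel code). Like switch_to, it explicitly saves only the callee-saved registers and the stack pointer; the RIP needs no explicit save because the switcher is entered by an ordinary call, so its ret pops the next context's RIP off the next context's stack:

#include <stdio.h>

static void *sp_main, *sp_co;           /* stand-ins for prev/next ->thread.sp */

void ctx_switch(void **save, void **load);
asm(".text\n"
    ".globl ctx_switch\n"
    "ctx_switch:\n\t"
    /* save callee-saved registers: the moral equivalent of SAVE_CONTEXT */
    "push %rbp\n\tpush %rbx\n\tpush %r12\n\t"
    "push %r13\n\tpush %r14\n\tpush %r15\n\t"
    "mov %rsp, (%rdi)\n\t"              /* save RSP, cf. movq %%rsp,...([prev]) */
    "mov (%rsi), %rsp\n\t"              /* restore RSP, cf. movq ...([next]),%%rsp */
    "pop %r15\n\tpop %r14\n\tpop %r13\n\t"
    "pop %r12\n\tpop %rbx\n\tpop %rbp\n\t"
    "ret\n");                           /* pops the RIP from the *new* stack */

static void coroutine(void)
{
    puts("in coroutine");
    ctx_switch(&sp_co, &sp_main);       /* switch back; never resumed here */
}

int main(void)
{
    static unsigned long stack[4096] __attribute__((aligned(16)));
    unsigned long *top = &stack[4095];  /* keeps entry RSP == 8 (mod 16), per ABI */

    *--top = (unsigned long)coroutine;  /* the RIP that ret will pop */
    for (int i = 0; i < 6; i++)
        *--top = 0;                     /* dummy callee-saved register slots */
    sp_co = top;

    ctx_switch(&sp_main, &sp_co);       /* an ordinary call, as in 64-bit switch_to */
    puts("back in main");
    return 0;
}

The hand-built initial frame is the user-space analogue of what copy_thread() prepares on a new task's kernel stack, so that the switch code finds sane values to pop the first time the task is switched to.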


What's %b1 (b one, not b L) in `xorb %b1, %b1`?

I am reading the copy_from_user function, in which the macro __get_user_asm is used. There is an mmap syscall in Linux; mmap ends up calling copy_from_user, and that function uses __get_user_asm when the size is constant. The content of __get_user_asm is:
#define __get_user_asm(x, addr, err, itype, rtype, ltype, errret) \
asm volatile("1: mov"itype" %2,%"rtype"1\n" \
"2:\n" \
".section .fixup,\"ax\"\n" \
"3: mov %3,%0\n" \
" xor"itype" %"rtype"1,%"rtype"1\n" \
" jmp 2b\n" \
".previous\n" \
_ASM_EXTABLE(1b, 3b) \
: "=r" (err), ltype(x) \
: "m" (__m(addr)), "i" (errret), "0" (err))
When I try to expand
__get_user_asm(*(u8 *)dst, (u8 __user *)src, ret, "b", "b", "=q", 1)
into real source, I get:
1: movb %2,%b1\n
2:\n
.section .fixup, "ax"\n
3: mov %3, %0\n
xorb %b1, %b1\n
jmp 2b\n
.previous\n
: "=r" (ret), =q(dst)
: "m"(dst), "i"(1), "0"(ret)
.quad "1b", "2b"\n
.previous\n
There are some things I can't understand:
1. In xorb %b1, %b1, what is %b1 (b one, not b L)?
2. In jmp 2b, is 2b a label or a memory address? If it is a label, how can I find it?
3. What is the function of .quad "1b", "2b"?
Where can I get the knowledge needed to understand the Linux kernel source at this level?
Reading the docs for gcc's extended asm, we see that %1 refers to the second parameter (because parameter numbers are zero based). In your example, that's dst.
Adding b (i.e. %b1) is described under the operand modifiers: b prints the QImode (one-byte) name of the register, so %b0 prints %al with masm=att (al with masm=intel).
jmp 2b means jump backward to the nearest label named 2.
The .quad directive is defined here:
.quad expects zero or more bignums, separated by commas. For each
bignum, it emits an 8-byte integer. If the bignum won't fit in 8
bytes, it prints a warning message; and just takes the lowest order 8
bytes of the bignum.
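To connect this to the _ASM_EXTABLE(1b, 3b) in the macro you are expanding: around this kernel era it expanded to roughly the following (a sketch; newer kernels store 32-bit relative offsets instead of absolute addresses):

#define _ASM_EXTABLE(from, to)      \
    " .section __ex_table,\"a\"\n"  \
    " .balign 8\n"                  \
    " .quad " #from "," #to "\n"    \
    " .previous\n"

So on 64-bit, .quad is what emits the two 8-byte addresses: the possibly-faulting instruction (1b) and its fixup code (3b). The page fault handler searches the __ex_table section for the faulting address and resumes at the paired fixup.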
As for where to get info, hopefully the links I have provided help.
XORing any register with itself sets it to zero, so the xorb %b1, %b1 in the fixup path sets %b1 to 0.
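Here is a standalone toy (hedged: x86-64, GCC; nothing kernel-specific) that exercises both points, the %b operand modifier and backward references to numeric local labels:

#include <stdio.h>

int main(void)
{
    unsigned int x = 0x1234;
    /* %b0 names the low byte of x's register (e.g. %al); "1:" is a
       numeric local label, and "1b" refers back to the nearest
       previous definition of label 1. */
    asm("xorb %b0, %b0\n\t"   /* clear the low byte: 0x1234 -> 0x1200 */
        "1: incb %b0\n\t"     /* loop body */
        "cmpb $3, %b0\n\t"
        "jne 1b"              /* jump backward to the "1:" above */
        : "+q" (x));
    printf("x = %#x\n", x);   /* prints 0x1203 */
    return 0;
}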

ARM inline asm: exit system call with value read from memory

Problem
I want to execute the exit system call in ARM using inline assembly on a Linux Android device, and I want the exit value to be read from a location in memory.
Example
Without giving this extra argument, a macro for the call looks like:
#define ASM_EXIT() __asm__("mov %r0, #1\n\t" \
"mov %r7, #1\n\t" \
"swi #0")
This works well.
To accept an argument, I adjust it to:
#define ASM_EXIT(var) __asm__("mov %r0, %0\n\t" \
"mov %r7, #1\n\t" \
"swi #0" \
: \
: "r"(var))
and I call it using:
#define GET_STATUS() (*(int*)(some_address)) //gets an integer from an address
ASM_EXIT(GET_STATUS());
Error
invalid 'asm': operand number out of range
I can't explain why I get this error, since I use only one input operand in the snippet above (%0/var). I have also tried with a regular variable and still got the same error.
Extended-asm syntax requires writing %% to get a single % in the asm output. e.g. for x86:
asm("inc %eax") // bad: undeclared clobber
asm("inc %%eax" ::: "eax"); // safe but still useless :P
In %r7, the % makes the compiler treat r7 as an operand reference. As commenters have pointed out, just omit the %s; you don't need them for ARM, even with GNU as.
Unfortunately, there doesn't seem to be a way to request input operands in specific registers on ARM the way you can for x86 (e.g. the "a" constraint means eax specifically).
You can use register int var asm ("r7") to force a var to use a specific register, and then use an "r" constraint and assume it will be in that register. I'm not sure this is always safe or a good idea, but it appears to work even after inlining. @Jeremy comments that this technique was recommended by the GCC team.
I did get some efficient code generated, which avoids wasting an instruction on a reg-reg move:
See it on the Godbolt Compiler Explorer:
__attribute__((noreturn)) static inline void ASM_EXIT(int status)
{
register int status_r0 asm ("r0") = status;
register int callno_r7 asm ("r7") = 1;
asm volatile("swi #0\n"
:
: "r" (status_r0), "r" (callno_r7)
: "memory" // any side-effects on shared memory need to be done before this, not delayed until after
);
// __builtin_unreachable(); // optionally let GCC know the inline asm doesn't "return"
}
#define GET_STATUS() (*(int*)(some_address)) //gets an integer from an address
void foo(void) { ASM_EXIT(12); }
push {r7} # gcc is still saving r7 before use, even though it sees the "noreturn" and doesn't generate a return
movs r0, #12 # stat_r0,
movs r7, #1 # callno,
swi #0
# yes, it literally ends here, after the inlined noreturn
void bar(int status) { ASM_EXIT(status); }
push {r7} #
movs r7, #1 # callno,
swi #0 # doesn't touch r0: already there as bar()'s first arg.
Since you always want the value read from memory, you could use an "m" constraint and include a ldr in your inline asm. Then you wouldn't need the register int var asm("r0") trick to avoid a wasted mov for that operand.
The mov r7, #1 might not always be needed either, which is why I used the register asm() syntax for it, too. If gcc wants a 1 constant in a register somewhere else in a function, it can do it in r7 so it's already there for the ASM_EXIT.
Any time the first or last instructions of a GNU C inline asm statement are mov instructions, there's probably a way to remove them with better constraints.
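For completeness, here is a hedged sketch of that "m"-constraint variant (assumptions: ARM EABI Linux where __NR_exit is 1; ASM_EXIT_MEM is a made-up name):

__attribute__((noreturn)) static inline void ASM_EXIT_MEM(int *status_p)
{
    register int callno_r7 asm("r7") = 1;     /* __NR_exit */
    asm volatile("ldr r0, %[status]\n\t"      /* load the exit value straight from memory */
                 "swi #0"
                 :
                 : [status] "m" (*status_p), "r" (callno_r7)
                 : "r0", "memory");
    __builtin_unreachable();
}

Called as ASM_EXIT_MEM((int *)some_address), the load happens inside the asm, so no mov is needed to stage the status value in r0 beforehand.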

Some kernel ARM code

I was reading through some ARM kernel sources until I stumbled upon the following macro:
314 #define __get_user_asm_byte(x, addr, err) \
315 __asm__ __volatile__( \
316 "1: " TUSER(ldrb) " %1,[%2],#0\n" \
317 "2:\n" \
318 " .pushsection .fixup,\"ax\"\n" \
319 " .align 2\n" \
320 "3: mov %0, %3\n" \
321 " mov %1, #0\n" \
322 " b 2b\n" \
323 " .popsection\n" \
324 " .pushsection __ex_table,\"a\"\n" \
325 " .align 3\n" \
326 " .long 1b, 3b\n" \
327 " .popsection" \
328 : "+r" (err), "=&r" (x) \
329 : "r" (addr), "i" (-EFAULT) \
330 : "cc")
The calling context is as follows:
299 #define __get_user_err(x, ptr, err) \
300 do { \
301 unsigned long __gu_addr = (unsigned long)(ptr); \
302 unsigned long __gu_val; \
303 __chk_user_ptr(ptr); \
304 might_fault(); \
305 switch (sizeof(*(ptr))) { \
306 case 1: __get_user_asm_byte(__gu_val, __gu_addr, err); break; \
Now I have a few doubts I'd like to clear up about the ARM assembly above.
What is __get_user_asm_byte doing? I can see that r3 is copied into r0 and that the value 0 is moved into r1. After that, does it branch to the offset 0x2b?
What is the macro trying to do? What does the "+r" (err), "=&r" (x) from line 328 onward mean?
What are .pushsection and .popsection?
Why is ARM assembly written for the kernel so different syntactically? (What assembler is used? Why is there %<regno> instead of r<regno>?)
First, as comments have noted, the syntax is standard gcc inline assembly syntax (the +r, =&r, %<arg> parts).
The rest is kernel magic designed to handle page faults. The point of __get_user_asm_byte is to pull a byte from user space. However, in pulling data from user space, two special situations need to be accommodated:
A perfectly legitimate user-space address that is simply not present at the moment (i.e. usually because it's paged out)
An illegal user-space address.
Either could cause a page fault. For (1), the desired behavior is to restore the user page (read it back in from swap space or wherever) and then re-execute the faulting load, resulting in eventual success; for (2), the desired behavior is to fail the operation without retrying, returning an EFAULT error to the caller.
Without special handling, the normal behavior of a page fault in kernel mode is an "Oops". The special sections coordinate with the page fault handling code and make it possible to recover correctly. If you want to understand the details of how this works, search for __ex_table in the kernel source.
Also, check out the kernel documentation at Documentation/x86/exception-tables.txt.
The only ARM-specific item is the use of either ldrt or MMU domains; this is conditional in domain.h. Modern ARM CPUs support domains; for older CPUs, the translated load/store variants (the TUSER() wrapper, e.g. ldrbt) perform the access with user-mode permissions.
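For intuition, here is a hedged sketch of the idea behind __ex_table (simplified, with made-up helpers; the real code lives in kernel/extable.c and the arch fault handlers, and modern kernels sort the table and binary-search it):

/* one table entry per instruction that is allowed to fault */
struct exception_table_entry {
    unsigned long insn;    /* address of the possibly-faulting instruction */
    unsigned long fixup;   /* where to resume if it does fault */
};

static const struct exception_table_entry *
search_extable(const struct exception_table_entry *tbl, int n,
               unsigned long fault_pc)
{
    int i;
    for (i = 0; i < n; i++)
        if (tbl[i].insn == fault_pc)
            return &tbl[i];
    return 0;
}

/* Conceptually, in the page-fault path:
 *   e = search_extable(table, n, regs->ARM_pc);
 *   if (e)
 *       regs->ARM_pc = e->fixup;   // resume at the fixup stub (label 3)
 *   else
 *       oops();                    // genuine kernel bug
 */

The ".long 1b, 3b" in the macro emits exactly one such (insn, fixup) pair into the __ex_table section.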

sys_execve hooking on 3.5 kernel

I am trying to hook the sys_execve syscall in Linux kernel v3.5 on x86_32. I simply change the sys_call_table entry to the address of my hook function:
asmlinkage long (*real_execve)( const char __user*, const char __user* const __user*,
const char __user* const __user* );
...
asmlinkage long hook_execve( const char __user* filename, const char __user* const __user* argv,
const char __user* const __user* envp )
{
printk( "Called execve hook\n" );
return real_execve( filename, argv, envp );
}
...
real_execve = (void*)sys_call_table[ __NR_execve ];
sys_call_table[ __NR_execve ] = (unsigned long)hook_execve;
I do set the page permissions to allow modifying sys_call_table entries, and this scheme works well for other syscalls (chdir, mkdir and so on). But on hooking execve I get a NULL pointer dereference:
Mar 11 14:18:08 mbz-debian kernel: [ 5590.596033] Called execve hook
Mar 11 14:18:08 mbz-debian kernel: [ 5590.596408] BUG: unable to handle kernel NULL pointer dereference at (null)
Mar 11 14:18:08 mbz-debian kernel: [ 5590.596486] IP: [< (null)>] (null)
Mar 11 14:18:08 mbz-debian kernel: [ 5590.596526] *pdpt = 0000000032302001 *pde = 0000000000000000
Mar 11 14:18:08 mbz-debian kernel: [ 5590.596584] Oops: 0010 [#1] SMP
I call sys_execve with three parameters because arch/x86/kernel/entry_32.S contains PTREGSCALL3(execve). However, I've tried calling it with four parameters (adding a struct pt_regs*) and got the same error. Maybe something is totally wrong with this approach to execve? Or did I miss something?
UPDATE #1
I found that sys_call_table[ __NR_execve ] actually contains the address of ptregs_execve (not sys_execve). It is defined as follows in arch/x86/kernel/entry_32.S:
#define PTREGSCALL3(name) \
ENTRY(ptregs_##name) ; \
CFI_STARTPROC; \
leal 4(%esp),%eax; \
pushl_cfi %eax; \
movl PT_EDX(%eax),%ecx; \
movl PT_ECX(%eax),%edx; \
movl PT_EBX(%eax),%eax; \
call sys_##name; \
addl $4,%esp; \
CFI_ADJUST_CFA_OFFSET -4; \
ret; \
CFI_ENDPROC; \
ENDPROC(ptregs_##name)
...
PTREGSCALL3(execve)
So, in order to modify sys_execve, do I need to replace its code without modifying its address? I have read something similar here; is this the way to go?
UPDATE #2
Actually, I found the following call sequence: do_execve -> do_execve_common -> search_binary_handler -> security_bprm_check. This security_bprm_check is a wrapper around an LSM (Linux Security Module) operation that controls execution of a binary. After that, I read and followed this link and got it working. It solves my problem, as I can now see the name of the process to be executed, but I am still unsure about its correctness. Maybe someone else can add some clarity about all this.
In the past, hooking syscalls in the Linux kernel was an easier task; in newer kernels, however, assembly stubs were added in front of some syscalls. To get around this, I patch the kernel's memory on the fly.
You can view my full solution for hooking sys_execve here:
https://github.com/kfiros/execmon
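If patching code bytes feels fragile, a hedged alternative sketch for this kernel era is a kprobe on do_execve, which sees every exec without touching sys_call_table or the ptregs stubs (assumptions: kprobes enabled, and an x86-32 regparm kernel where do_execve's first C argument, the filename, arrives in regs->ax):

#include <linux/module.h>
#include <linux/kprobes.h>

static struct kprobe kp = { .symbol_name = "do_execve" };

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    /* assumption: first argument (kernel-space filename string) is in ax */
    printk(KERN_INFO "exec: %s\n", (char *)regs->ax);
    return 0;
}

static int __init hookmod_init(void)
{
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);
}

static void __exit hookmod_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(hookmod_init);
module_exit(hookmod_exit);
MODULE_LICENSE("GPL");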

Assembler code in __range_ok macros

Can you explain this code to me? I really don't understand it.
See http://lxr.free-electrons.com/source/arch/arm/include/asm/uaccess.h#L70
#define __addr_ok(addr) ({ \
unsigned long flag; \
__asm__("cmp %2, %0; movlo %0, #0" \
: "=&r" (flag) \
: "" (current_thread_info()->addr_limit), "r" (addr) \
: "cc"); \
(flag == 0); })
/* We use 33-bit arithmetic here... */
#define __range_ok(addr,size) ({ \
unsigned long flag, roksum; \
__chk_user_ptr(addr); \
__asm__("adds %1, %2, %3; sbcccs %1, %1, %0; movcc %0, #0" \
: "=&r" (flag), "=&r" (roksum) \
: "r" (addr), "Ir" (size), "" (current_thread_info()->addr_limit) \
: "cc"); \
flag; })
This is from ARM Linux kernel, __range_ok
As a general source of info regarding the register usage and other decorations, look at the docs for GCC Extended Inline Assembly
I suggest you run this source through
gcc .... -S
to see what the resulting assembly is.
You could also run
objdump -dC -S <objectfile.o>
You will need objdump from your cross-compiler toolchain.
Also, compile with debug information (-g) so that objdump -S can annotate the disassembly with source.
Compile with -O0 to avoid confusion due to optimization.
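As a starting point before reading the disassembly: note that the "0" constraint ties an input to output operand 0, so flag enters the asm holding addr_limit and is only zeroed on success. A hedged C rendering of what the __range_ok asm computes (the "33-bit arithmetic" comment: do the add in 64 bits so the carry isn't lost) might look like:

#include <stdint.h>

/* returns 1 when the access is OK, mirroring (flag == 0); illustrative only */
static int range_ok_c(uint32_t addr, uint32_t size, uint32_t addr_limit)
{
    uint64_t end = (uint64_t)addr + size;   /* "adds": a carry lands in bit 32 */
    /* "sbcccs"/"movcc" only execute while the carry is clear, so flag is
       zeroed exactly when the sum didn't wrap and is still within the limit */
    return end <= addr_limit;
}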
