The difference between `entry_SYSCALL64_slow_path` and `entry_SYSCALL64_fast_path` - linux

We know that system call will call the function entry_SYSCALL_64 in entry_64.S. When I read the source code, I find there are two different types of call after the prepartion of registers, one is entry_SYSCALL64_slow_path and the other is entry_SYSCALL64_fast_path. Can you tell the difference between the two functions?

Upon entry in entry_SYSCALL_64 Linux will:
Swap gs to get per-cpu parameters.
Set the stack from the parameters of above.
Disable the IRQs.
Create a partial pt_regs structure on the stack. This saves the caller context.
If the current task has _TIF_WORK_SYSCALL_ENTRY or _TIF_ALLWORK_MASK set, it goes to the slow path.
Enter the fast path otherwise.
_TIF_WORK_SYSCALL_ENTRY is defined here with a comment stating:
/*
* work to do in syscall_trace_enter(). Also includes TIF_NOHZ for
* enter_from_user_mode()
*/
_TIF_ALLWORK_MASK does not seems to be defined for x86, a definition for MIPS is here with a comment stating:
/* work to do on any return to u-space */
Fast path
Linux will:
Enable the IRQs.
Check if the syscall number is out of range (note the pt_regs struct was already created with ENOSYS for the value of rax).
Dispatch to the system call with an indirect jump.
Save the return value (rax) of the syscall into rax in the pt_regs on the stack.
Check again if _TIF_ALLWORK_MASK is set for the current task, if it is it will jump to the slow return path.
Restore the caller context and issue a sysret.
Slow return path
Save the registers not saved before in pt_regs (rbx, rbp, r12-r15).
Call syscall_return_slowpath, defined here.
Note that point 2 will end up calling trace_sys_exit.
Slow path
Save the registers not saved before in pt_regs (see above)
Call do_syscall_64, defined here.
Point 2 will call syscall_trace_enter.
So the slow vs fast path has to do with ptrace. I haven't dug into the code but I suppose the whole machinery is skipped if ptrace is not needed for the caller.
This is indeed an important optimization.

Related

save the number of bytes read from file [duplicate]

When I try to research about return values of system calls of the kernel, I find tables that describe them and what do I need to put in the different registers to let them work. However, I don't find any documentation where it states what is that return value I get from the system call. I'm just finding in different places that what I receive will be in the EAX register.
TutorialsPoint:
The result is usually returned in the EAX register.
Assembly Language Step-By-Step: Programming with Linux book by Jeff Duntemann states many times in his programs:
Look at sys_read's return value in EAX
Copy sys_read return value for safe keeping
Any of the websites I have don't explain about this return value. Is there any Internet source? Or can someone explain me about this values?
See also this excellent LWN article about system calls which assumes C knowledge.
Also: The Definitive Guide to Linux System Calls (on x86), and related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
C is the language of Unix systems programming, so all the documentation is in terms of C. And then there's documentation for the minor differences between the C interface and the asm on any given platform, usually in the Notes section of man pages.
sys_read means the raw system call (as opposed to the libc wrapper function). The kernel implementation of the read system call is a kernel function called sys_read(). You can't call it with a call instruction, because it's in the kernel, not a library. But people still talk about "calling sys_read" to distinguish it from the libc function call. However, it's ok to say read even when you mean the raw system call (especially when the libc wrapper doesn't do anything special), like I do in this answer.
Also note that syscall.h defines constants like SYS_read with the actual system call number, or asm/unistd.h for the Linux __NR_read names for the same constants. (The value you put in EAX before an int 0x80 or syscall instruction).
Linux system call return values (in EAX/RAX on x86) are either "normal" success, or a -errno code for error. e.g. -EFAULT if you pass an invalid pointer. This behaviour is documented in the syscalls(2) man page.
-1 to -4095 means error, anything else means success. See AOSP non-obvious syscall() implementation for more details on this -4095UL .. -1UL range, which is portable across architectures on Linux, and applies to every system call. (In the future, a different architecture could use a different value for MAX_ERRNO, but the value for existing arches like x86-64 is guaranteed to stay the same as part of Linus's don't-break-userspace policy of keeping kernel ABIs stable.)
For example, glibc's generic syscall(2) wrapper function uses this sequence: cmp rax, -4095 / jae SYSCALL_ERROR_LABEL, which is guaranteed to be future-proof for all Linux system calls.
You can use that wrapper function to make any system call, like syscall( __NR_mmap, ... ). (Or use an inline-asm wrapper header like https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h that has safe inline-asm for multiple ISAs, avoiding problems like missing "memory" clobbers that some other inline-asm wrappers have.)
Interesting cases include getpriority where the kernel ABI maps the -20..19 return-value range to 1..40, and libc decodes it. More details in a related answer about decoding syscall error return values.
For mmap, if you wanted you could also detect error just by checking that the return value isn't page-aligned (e.g. any non-zero bits in the low 11, for a 4k page size), if that would be more efficient than checking p > -4096ULL.
To find the actual numeric values of constants for a specific platform, you need to find the C header file where they're #defined. See my answer on a question about that for details. e.g. in asm-generic/errno-base.h / asm-generic/errno.h.
The meanings of return values for each sys call are documented in the section 2 man pages, like read(2). (sys_read is the raw system call that the glibc read() function is a very thin wrapper for.) Most man pages have a whole section for the return value. e.g.
RETURN VALUE
On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number. It
is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to end-of-
file, or because we are reading from a pipe, or from a terminal), or
because read() was interrupted by a signal. See also NOTES.
On error, -1 is returned, and errno is set appropriately. In this
case, it is left unspecified whether the file position (if any)
changes.
Note that the last paragraph describes how the glibc wrapper decodes the value and sets errno to -EAX if the raw system call's return value is negative, so errno=EFAULT and return -1 if the raw system call returned -EFAULT.
And there's a whole section listing all the possible error codes that read() is allowed to return, and what they mean specifically for read(). (POSIX standardizes most of this behaviour.)

Can eBPF modify the return value or parameters of a syscall?

To simulate some behavior I would like to attach a probe to a syscall and modify the return value when certain parameters are passed. Alternatively, it would also be enough to modify the parameters of the function before they are processes.
Is this possible with BPF?
Within kernel probes (kprobes), the eBPF virtual machine has read-only access to the syscall parameters and return value.
However the eBPF program will have a return code of it's own. It is possible to apply a seccomp profile that traps BPF (NOT eBPF; thanks #qeole) return codes and interrupt the system call during execution.
The allowed runtime modifications are:
SECCOMP_RET_KILL: Immediate kill with SIGSYS
SECCOMP_RET_TRAP: Send a catchable SIGSYS, giving a chance to emulate the syscall
SECCOMP_RET_ERRNO: Force errno value
SECCOMP_RET_TRACE: Yield decision to ptracer or set errno to -ENOSYS
SECCOMP_RET_ALLOW: Allow
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
The SECCOMP_RET_TRACE method enables modifying the system call performed, arguments, or return value. This is architecture dependent and modification of mandatory external references may cause an ENOSYS error.
It does so by passing execution up to a waiting userspace ptrace, which has the ability to modify the traced process memory, registers, and file descriptors.
The tracer needs to call ptrace and then waitpid. An example:
ptrace(PTRACE_SETOPTIONS, tracee_pid, 0, PTRACE_O_TRACESECCOMP);
waitpid(tracee_pid, &status, 0);
http://man7.org/linux/man-pages/man2/ptrace.2.html
When waitpid returns, depending on the contents of status, one can retrieve the seccomp return value using the PTRACE_GETEVENTMSG ptrace operation. This will retrieve the seccomp SECCOMP_RET_DATA value, which is a 16-bit field set by the BPF program. Example:
ptrace(PTRACE_GETEVENTMSG, tracee_pid, 0, &data);
Syscall arguments can be modified in memory before continuing operation. You can perform a single syscall entry or exit with the PTRACE_SYSCALL step. Syscall return values can be modified in userspace before resuming execution; the underlying program won't be able to see that the syscall return values have been modified.
An example implementation:
Filter and Modify System Calls with seccomp and ptrace
I believe that attaching eBPF to kprobes/kretprobes gives you read access to function arguments and return values, but that you cannot tamper with them. I am NOT 100% sure; good places to ask for confirmation would be the IO Visor project mailing list or IRC channel (#iovisor at irc.oftc.net).
As an alternative solution, I know you can at least change the return value of a syscall with strace, with the -e option. Quoting the manual page:
-e inject=set[:error=errno|:retval=value][:signal=sig][:when=expr]
Perform syscall tampering for the specified set of syscalls.
Also, there was a presentation on this, and fault injection, at Fosdem 2017, if it is of any interest to you. Here is one example command from the slides:
strace -P precious.txt -efault=unlink:retval=0 unlink precious.txt
Edit: As stated by Ben, eBPF on kprobes and tracepoints is definitively read only, for tracing and monitoring use cases. I also got confirmation about this on IRC.
It is possible to modify some user space memory using eBPF. As stated in the bpf.h header file:
* int bpf_probe_write_user(void *dst, const void *src, u32 len)
* Description
* Attempt in a safe way to write *len* bytes from the buffer
* *src* to *dst* in memory. It only works for threads that are in
* user context, and *dst* must be a valid user space address.
*
* This helper should not be used to implement any kind of
* security mechanism because of TOC-TOU attacks, but rather to
* debug, divert, and manipulate execution of semi-cooperative
* processes.
*
* Keep in mind that this feature is meant for experiments, and it
* has a risk of crashing the system and running programs.
* Therefore, when an eBPF program using this helper is attached,
* a warning including PID and process name is printed to kernel
* logs.
* Return
* 0 on success, or a negative error in case of failure.
Also, quoting from the BPF design Q&A:
Tracing BPF programs can overwrite the user memory of the current
task with bpf_probe_write_user(). Every time such program is loaded
the kernel will print warning message, so this helper is only useful
for experiments and prototypes. Tracing BPF programs are root only.
Your eBPF may write data into user space memory locations. Note that you still cannot modify kernel structures from within you eBPF program.
It is possible to inject errors into a system call invocation using eBPF: https://lwn.net/Articles/740146/
There is a bpf function called bpf_override_return(), which can override the return value of an invocation. This is an example using bcc as the front-end: https://github.com/iovisor/bcc/blob/master/tools/inject.py
According to the Linux manual page:
bpf_override_return() is only available if the kernel was compiled with the CONFIG_BPF_KPROBE_OVERRIDE configuration option, and in this case it only works on functions tagged with ALLOW_ERROR_INJECTION in the kernel code.
Also, the helper is only available for the architectures having the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing, x86 architecture is the only one to support this feature.
It is possible to add a function to the error injection framework. More information could be found here: https://github.com/iovisor/bcc/issues/2485

how can I call a system call in freebsd?

I created a syscall same as /usr/share/examples/kld/syscall/module/syscall.c with a little change in message.
I used kldload and module loaded. now I want to call the syscall.
what is this syscall number so I can call it?
or what is the way to call this syscall?
I suggest you take a look at Designing BSD rootkits, that's how I learned kernel programming on FreeBSD, there's even a section that talks all about making your own syscalls.
Well, if you check /usr/share/examples/kld/syscall directory you will see it contains a test program..... but hey, let's assume the program is not there.
Let's take a look at part of the module itself:
/*
* The offset in sysent where the syscall is allocated.
*/
static int offset = NO_SYSCALL;
[..]
case MOD_LOAD :
printf("syscall loaded at %d\n", offset);
break;
The module prints syscall number on load, so the job now is to learn how to call it... a 'freebsd call syscall' search on google...
Reveals: http://www.freebsd.cz/doc/en/books/developers-handbook/x86-system-calls.html (although arguably not something to use on amd64) and.. https://www.freebsd.org/cgi/man.cgi?query=syscall&sektion=2 - a manual page for a function which allows you to call arbitrary syscalls.
I strongly suggest you do some digging on your own. If you don't, there is absolutely no way you will be able to write any kernel code.

Good references for the syscalls

I need some reference but a good one, possibly with some nice examples. I need it because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it's got a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 1 in the EAX register, and the ECX is probably a pointer to the string, but what about EBX and EDX? I would like that to be explained too, that EBX determines the input (0 for stdin, 1 for something else etc.) and EDX is the length of the string to be entered, etc. etc. I hope you understood me what I want, I couldn't find any such materials so that's why I am writing here.
Thanks in advance.
The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you call int $0x80, and the return value from the system call is on %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.
If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_. So, for example, you could check out man 2 lseek to get the information about the parameters forsys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx %ecx %edx
unsigned int off_t unsigned int

Initial state of program registers and stack on Linux ARM

I'm currently playing with ARM assembly on Linux as a learning exercise. I'm using 'bare' assembly, i.e. no libcrt or libgcc. Can anybody point me to information about what state the stack-pointer and other registers will at the start of the program before the first instruction is called? Obviously pc/r15 points at _start, and the rest appear to be initialised to 0, with two exceptions; sp/r13 points to an address far outside my program, and r1 points to a slightly higher address.
So to some solid questions:
What is the value in r1?
Is the value in sp a legitimate stack allocated by the kernel?
If not, what is the preferred method of allocating a stack; using brk or allocate a static .bss section?
Any pointers would be appreciated.
Since this is Linux, you can look at how it is implemented by the kernel.
The registers seem to be set by the call to start_thread at the end of load_elf_binary (if you are using a modern Linux system, it will almost always be using the ELF format). For ARM, the registers seem to be set as follows:
r0 = first word in the stack
r1 = second word in the stack
r2 = third word in the stack
sp = address of the stack
pc = binary entry point
cpsr = endianess, thumb mode, and address limit set as needed
Clearly you have a valid stack. I think the values of r0-r2 are junk, and you should instead read everything from the stack (you will see why I think this later). Now, let's look at what is on the stack. What you will read from the stack is filled by create_elf_tables.
One interesting thing to notice here is that this function is architecture-independent, so the same things (mostly) will be put on the stack on every ELF-based Linux architecture. The following is on the stack, in the order you would read it:
The number of parameters (this is argc in main()).
One pointer to a C string for each parameter, followed by a zero (this is the contents of argv in main(); argv would point to the first of these pointers).
One pointer to a C string for each environment variable, followed by a zero (this is the contents of the rarely-seen envp third parameter of main(); envp would point to the first of these pointers).
The "auxiliary vector", which is a sequence of pairs (a type followed by a value), terminated by a pair with a zero (AT_NULL) in the first element. This auxiliary vector has some interesting and useful information, which you can see (if you are using glibc) by running any dynamically-linked program with the LD_SHOW_AUXV environment variable set to 1 (for instance LD_SHOW_AUXV=1 /bin/true). This is also where things can vary a bit depending on the architecture.
Since this structure is the same for every architecture, you can look for instance at the drawing on page 54 of the SYSV 386 ABI to get a better idea of how things fit together (note, however, that the auxiliary vector type constants on that document are different from what Linux uses, so you should look at the Linux headers for them).
Now you can see why the contents of r0-r2 are garbage. The first word in the stack is argc, the second is a pointer to the program name (argv[0]), and the third probably was zero for you because you called the program with no arguments (it would be argv[1]). I guess they are set up this way for the older a.out binary format, which as you can see at create_aout_tables puts argc, argv, and envp in the stack (so they would end up in r0-r2 in the order expected for a call to main()).
Finally, why was r0 zero for you instead of one (argc should be one if you called the program with no arguments)? I am guessing something deep in the syscall machinery overwrote it with the return value of the system call (which would be zero since the exec succeeded). You can see in kernel_execve (which does not use the syscall machinery, since it is what the kernel calls when it wants to exec from kernel mode) that it deliberately overwrites r0 with the return value of do_execve.
Here's what I use to get a Linux/ARM program started with my compiler:
/** The initial entry point.
*/
asm(
" .text\n"
" .globl _start\n"
" .align 2\n"
"_start:\n"
" sub lr, lr, lr\n" // Clear the link register.
" ldr r0, [sp]\n" // Get argc...
" add r1, sp, #4\n" // ... and argv ...
" add r2, r1, r0, LSL #2\n" // ... and compute environ.
" bl _estart\n" // Let's go!
" b .\n" // Never gets here.
" .size _start, .-_start\n"
);
As you can see, I just get the argc, argv, and environ stuff from the stack at [sp].
A little clarification: The stack pointer points to a valid area in the process' memory. r0, r1, r2, and r3 are the first three parameters to the function being called. I populate them with argc, argv, and environ, respectively.
Here's the uClibc crt. It seems to suggest that all registers are undefined except r0 (which contains a function pointer to be registered with atexit()) and sp which contains a valid stack address.
So, the value you see in r1 is probably not something you can rely on.
Some data are placed on the stack for you.
I've never used ARM Linux but I suggest you either look at the source for the libcrt and see what they do, or use gdb to step into an existing executable. You shouldn't need the source code just step through the assembly code.
Everything you need to find out should happen within the very first code executed by any binary executable.
Hope this helps.
Tony

Resources