Where is the definition of __NR_accept? - linux

As far as I know, syscall numbers are defined in the form __NR_xxxx in asm/unistd.h, but I cannot find a definition of __NR_accept. Why?

On many architectures, the accept system call number is in <asm/unistd.h>.
However, I suspect you're asking about i386 or another "older" architecture. In that case, for historical reasons, there isn't really an accept system call -- instead, one uses the multiplexed socketcall system call with a call number of SYS_ACCEPT to perform accept(). You will find a definition of __NR_socketcall in your <asm/unistd.h> (and definitions of SYS_SOCKET, SYS_BIND, SYS_CONNECT, SYS_LISTEN, SYS_ACCEPT and so on in <linux/net.h> for the various socket-related system calls that are multiplexed through socketcall).
In any case, for architectures where there is no true accept system call, you will of course also not have a system call number __NR_accept.
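A quick way to see which case applies on a given machine is to let the preprocessor tell you. This is only a sketch: the helper name accept_syscall_number is made up for illustration, and it relies on <sys/syscall.h> pulling in the __NR_* names from <asm/unistd.h>, which is the case with glibc as far as I know.

```c
#include <sys/syscall.h>   /* pulls in the __NR_* names from <asm/unistd.h> */

/* Hypothetical helper: returns the system call number that carries accept()
 * on the architecture being compiled for, and stores a description in *how.
 * Returns -1 if the headers expose neither mechanism. */
static long accept_syscall_number(const char **how)
{
#if defined(__NR_accept)
    *how = "direct __NR_accept";
    return __NR_accept;
#elif defined(__NR_socketcall)
    *how = "multiplexed through __NR_socketcall with SYS_ACCEPT";
    return __NR_socketcall;
#else
    *how = "neither __NR_accept nor __NR_socketcall is defined";
    return -1;
#endif
}
```

Printing the returned number and the string from main() shows at a glance whether the headers for your target define a real accept system call or only socketcall.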

Related

save the number of bytes read from file [duplicate]

When I research the return values of the kernel's system calls, I find tables that describe the calls and what I need to put in the different registers to make them work. However, I can't find any documentation that states what the return value I get from the system call means. All I find, in various places, is that the result will be in the EAX register.
TutorialsPoint:
The result is usually returned in the EAX register.
Assembly Language Step-By-Step: Programming with Linux book by Jeff Duntemann states many times in his programs:
Look at sys_read's return value in EAX
Copy sys_read return value for safe keeping
None of the websites I have found explain this return value. Is there an Internet source for it? Or can someone explain these values to me?
See also this excellent LWN article about system calls which assumes C knowledge.
Also: The Definitive Guide to Linux System Calls (on x86), and related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
C is the language of Unix systems programming, so all the documentation is in terms of C. And then there's documentation for the minor differences between the C interface and the asm on any given platform, usually in the Notes section of man pages.
sys_read means the raw system call (as opposed to the libc wrapper function). The kernel implementation of the read system call is a kernel function called sys_read(). You can't call it with a call instruction, because it's in the kernel, not a library. But people still talk about "calling sys_read" to distinguish it from the libc function call. However, it's ok to say read even when you mean the raw system call (especially when the libc wrapper doesn't do anything special), like I do in this answer.
Also note that syscall.h defines constants like SYS_read with the actual system call number, or asm/unistd.h for the Linux __NR_read names for the same constants. (The value you put in EAX before an int 0x80 or syscall instruction).
Linux system call return values (in EAX/RAX on x86) are either "normal" success, or a -errno code for error. e.g. -EFAULT if you pass an invalid pointer. This behaviour is documented in the syscalls(2) man page.
-1 to -4095 means error, anything else means success. See AOSP non-obvious syscall() implementation for more details on this -4095UL .. -1UL range, which is portable across architectures on Linux, and applies to every system call. (In the future, a different architecture could use a different value for MAX_ERRNO, but the value for existing arches like x86-64 is guaranteed to stay the same as part of Linus's don't-break-userspace policy of keeping kernel ABIs stable.)
For example, glibc's generic syscall(2) wrapper function uses this sequence: cmp rax, -4095 / jae SYSCALL_ERROR_LABEL, which is guaranteed to be future-proof for all Linux system calls.
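The same test can be written in C as a single unsigned comparison. A minimal sketch, where MAX_ERRNO_SKETCH and raw_syscall_failed are illustrative names of mine (the kernel's own constant is MAX_ERRNO):

```c
#include <stdbool.h>

/* Illustrative name; mirrors the 4095 used by the cmp/jae sequence above. */
#define MAX_ERRNO_SKETCH 4095UL

/* True if a raw system call return value lies in the -4095..-1 error range.
 * Treating the value as unsigned turns the range check into one compare,
 * which is what cmp rax, -4095 / jae does. */
static bool raw_syscall_failed(unsigned long ret)
{
    return ret >= -MAX_ERRNO_SKETCH;
}
```

So raw_syscall_failed((unsigned long)-14L) is true (that would be -EFAULT), while 0, a byte count, or a high mmap address are all treated as success.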
You can use that wrapper function to make any system call, like syscall( __NR_mmap, ... ). (Or use an inline-asm wrapper header like https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h that has safe inline-asm for multiple ISAs, avoiding problems like missing "memory" clobbers that some other inline-asm wrappers have.)
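As a small illustration of the generic wrapper, here is a sketch that invokes gettid by number. The function name raw_gettid is hypothetical; note that syscall() itself already converts a -errno result into -1 plus errno, so you do not see the raw kernel value this way.

```c
#define _GNU_SOURCE
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* SYS_gettid */

/* Hypothetical helper: ask the kernel for the calling thread's ID by
 * passing the system call number to libc's generic syscall() wrapper. */
static long raw_gettid(void)
{
    return syscall(SYS_gettid);
}
```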
Interesting cases include getpriority where the kernel ABI maps the -20..19 return-value range to 1..40, and libc decodes it. More details in a related answer about decoding syscall error return values.
For mmap, if you wanted you could also detect error just by checking that the return value isn't page-aligned (e.g. any non-zero bits in the low 12, for a 4k page size), if that would be more efficient than checking p > -4096ULL.
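A sketch of that alignment test, assuming a 4 KiB page size (so the low 12 bits of any successful mmap result are zero). The name raw_mmap_failed is made up for illustration, and the check is only valid for system calls whose success values are always page-aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Every value in -4095..-1 has at least one of
 * its low 12 bits set, while a mapped address has none of them set. */
static bool raw_mmap_failed(uintptr_t ret)
{
    return (ret & 0xFFFu) != 0;
}
```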
To find the actual numeric values of constants for a specific platform, you need to find the C header file where they're #defined. See my answer on a question about that for details. e.g. in asm-generic/errno-base.h / asm-generic/errno.h.
The meanings of return values for each sys call are documented in the section 2 man pages, like read(2). (sys_read is the raw system call that the glibc read() function is a very thin wrapper for.) Most man pages have a whole section for the return value. e.g.
RETURN VALUE
On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number. It
is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to end-of-
file, or because we are reading from a pipe, or from a terminal), or
because read() was interrupted by a signal. See also NOTES.
On error, -1 is returned, and errno is set appropriately. In this
case, it is left unspecified whether the file position (if any)
changes.
Note that the last paragraph describes how the glibc wrapper decodes the value and sets errno to -EAX if the raw system call's return value is negative, so errno=EFAULT and return -1 if the raw system call returned -EFAULT.
And there's a whole section listing all the possible error codes that read() is allowed to return, and what they mean specifically for read(). (POSIX standardizes most of this behaviour.)
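To make the decoding step concrete, here is roughly what a libc wrapper does with the raw value. This is a sketch of the idea, not glibc's actual code, and decode_raw_return is a name I made up:

```c
#include <errno.h>

/* Turn a raw kernel return value into the C-level convention:
 * values in -4095..-1 become errno = -raw and a return of -1,
 * anything else is passed through unchanged. */
static long decode_raw_return(long raw)
{
    if ((unsigned long)raw >= -4095UL) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, decode_raw_return(-EFAULT) returns -1 with errno set to EFAULT, and decode_raw_return(17) returns 17 and leaves errno alone.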

Where can I find system call number for RISC-V

I'm trying to do some low level stuff, so I need to know the system call number of open on riscv32 platform.
The only thing close to my problem is in here, but it doesn't show the number of open.
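riscv32 has no open system call number: RISC-V uses the kernel's generic system call table (asm-generic/unistd.h), which leaves out open in favour of openat. Use openat(AT_FDCWD, ...) instead. Passing AT_FDCWD as the first argument, dirfd, makes it behave exactly like open. The rest of the arguments are the same as for open.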
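A minimal sketch of that substitution, going through libc's syscall() so the number comes from the headers for whatever target you compile for. The helper name open_via_openat is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD, O_* flags */
#include <sys/syscall.h>  /* SYS_openat */
#include <sys/types.h>    /* mode_t */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: behaves like open(path, flags, mode) by calling
 * openat with AT_FDCWD as the directory file descriptor. */
static long open_via_openat(const char *path, int flags, mode_t mode)
{
    return syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}
```

If you need the number itself for hand-written assembly, print SYS_openat from a program built for your target, or look up __NR_openat in the headers of your riscv32 toolchain.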

Can eBPF modify the return value or parameters of a syscall?

To simulate some behavior, I would like to attach a probe to a syscall and modify the return value when certain parameters are passed. Alternatively, it would also be enough to modify the parameters of the function before they are processed.
Is this possible with BPF?
Within kernel probes (kprobes), the eBPF virtual machine has read-only access to the syscall parameters and return value.
However, the eBPF program has a return code of its own. It is possible to apply a seccomp profile that traps BPF (NOT eBPF; thanks @qeole) return codes and interrupts the system call during execution.
The allowed runtime modifications are:
SECCOMP_RET_KILL: Immediate kill with SIGSYS
SECCOMP_RET_TRAP: Send a catchable SIGSYS, giving a chance to emulate the syscall
SECCOMP_RET_ERRNO: Force errno value
SECCOMP_RET_TRACE: Yield decision to ptracer or set errno to -ENOSYS
SECCOMP_RET_ALLOW: Allow
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
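As an illustration of the simplest of these actions, here is a sketch of a classic-BPF seccomp filter that makes getppid fail with EPERM and allows everything else. The function name is hypothetical, getppid is just an arbitrary harmless target, and a real filter should also check the arch field of struct seccomp_data before trusting the syscall number; that check is left out here to keep the sketch short.

```c
#include <errno.h>
#include <stddef.h>         /* offsetof */
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/prctl.h>      /* prctl(), PR_SET_* */
#include <sys/syscall.h>    /* SYS_getppid */

/* Hypothetical helper: installs the filter on the calling thread.
 * Returns 0 on success, -1 (with errno set) if the kernel refuses.
 * The filter cannot be removed again once installed. */
static int deny_getppid_with_eperm(void)
{
    struct sock_filter filter[] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is getppid, fall through; otherwise skip one instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        /* Fail the call with EPERM without running it. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Everything else runs normally. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After a successful call, getppid() in that thread returns -1 with errno set to EPERM. Swapping SECCOMP_RET_ERRNO for SECCOMP_RET_TRACE is what hands the decision to a ptracer instead.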
The SECCOMP_RET_TRACE method enables modifying the system call performed, arguments, or return value. This is architecture dependent and modification of mandatory external references may cause an ENOSYS error.
It does so by passing execution up to a waiting userspace ptrace, which has the ability to modify the traced process memory, registers, and file descriptors.
The tracer needs to call ptrace and then waitpid. An example:
ptrace(PTRACE_SETOPTIONS, tracee_pid, 0, PTRACE_O_TRACESECCOMP);
waitpid(tracee_pid, &status, 0);
http://man7.org/linux/man-pages/man2/ptrace.2.html
When waitpid returns, depending on the contents of status, one can retrieve the seccomp return value using the PTRACE_GETEVENTMSG ptrace operation. This will retrieve the seccomp SECCOMP_RET_DATA value, which is a 16-bit field set by the BPF program. Example:
ptrace(PTRACE_GETEVENTMSG, tracee_pid, 0, &data);
Syscall arguments can be modified in memory before continuing operation. You can perform a single syscall entry or exit with the PTRACE_SYSCALL step. Syscall return values can be modified in userspace before resuming execution; the underlying program won't be able to see that the syscall return values have been modified.
An example implementation:
Filter and Modify System Calls with seccomp and ptrace
I believe that attaching eBPF to kprobes/kretprobes gives you read access to function arguments and return values, but that you cannot tamper with them. I am NOT 100% sure; good places to ask for confirmation would be the IO Visor project mailing list or IRC channel (#iovisor at irc.oftc.net).
As an alternative solution, I know you can at least change the return value of a syscall with strace, with the -e option. Quoting the manual page:
-e inject=set[:error=errno|:retval=value][:signal=sig][:when=expr]
Perform syscall tampering for the specified set of syscalls.
Also, there was a presentation on this, and fault injection, at Fosdem 2017, if it is of any interest to you. Here is one example command from the slides:
strace -P precious.txt -efault=unlink:retval=0 unlink precious.txt
Edit: As stated by Ben, eBPF on kprobes and tracepoints is definitively read only, for tracing and monitoring use cases. I also got confirmation about this on IRC.
It is possible to modify some user space memory using eBPF. As stated in the bpf.h header file:
* int bpf_probe_write_user(void *dst, const void *src, u32 len)
* Description
* Attempt in a safe way to write *len* bytes from the buffer
* *src* to *dst* in memory. It only works for threads that are in
* user context, and *dst* must be a valid user space address.
*
* This helper should not be used to implement any kind of
* security mechanism because of TOC-TOU attacks, but rather to
* debug, divert, and manipulate execution of semi-cooperative
* processes.
*
* Keep in mind that this feature is meant for experiments, and it
* has a risk of crashing the system and running programs.
* Therefore, when an eBPF program using this helper is attached,
* a warning including PID and process name is printed to kernel
* logs.
* Return
* 0 on success, or a negative error in case of failure.
Also, quoting from the BPF design Q&A:
Tracing BPF programs can overwrite the user memory of the current
task with bpf_probe_write_user(). Every time such program is loaded
the kernel will print warning message, so this helper is only useful
for experiments and prototypes. Tracing BPF programs are root only.
Your eBPF program may write data into user space memory locations. Note that you still cannot modify kernel structures from within your eBPF program.
It is possible to inject errors into a system call invocation using eBPF: https://lwn.net/Articles/740146/
There is a bpf function called bpf_override_return(), which can override the return value of an invocation. This is an example using bcc as the front-end: https://github.com/iovisor/bcc/blob/master/tools/inject.py
According to the Linux manual page:
bpf_override_return() is only available if the kernel was compiled with the CONFIG_BPF_KPROBE_OVERRIDE configuration option, and in this case it only works on functions tagged with ALLOW_ERROR_INJECTION in the kernel code.
Also, the helper is only available for the architectures having the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing, x86 architecture is the only one to support this feature.
It is possible to add a function to the error injection framework. More information could be found here: https://github.com/iovisor/bcc/issues/2485

how can I call a system call in freebsd?

I created a syscall the same as /usr/share/examples/kld/syscall/module/syscall.c, with a small change to the message.
I used kldload and the module loaded. Now I want to call the syscall.
What is this syscall's number, so I can call it?
Or what is the way to call this syscall?
I suggest you take a look at Designing BSD rootkits, that's how I learned kernel programming on FreeBSD, there's even a section that talks all about making your own syscalls.
Well, if you check /usr/share/examples/kld/syscall directory you will see it contains a test program..... but hey, let's assume the program is not there.
Let's take a look at part of the module itself:
/*
* The offset in sysent where the syscall is allocated.
*/
static int offset = NO_SYSCALL;
[..]
case MOD_LOAD :
printf("syscall loaded at %d\n", offset);
break;
The module prints syscall number on load, so the job now is to learn how to call it... a 'freebsd call syscall' search on google...
Reveals: http://www.freebsd.cz/doc/en/books/developers-handbook/x86-system-calls.html (although arguably not something to use on amd64) and.. https://www.freebsd.org/cgi/man.cgi?query=syscall&sektion=2 - a manual page for a function which allows you to call arbitrary syscalls.
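Putting those two pieces together, a caller can be as small as the sketch below. The helper name call_by_number is made up, and it assumes the loaded syscall takes no arguments you care about; the number you pass is whatever the module printed in its "syscall loaded at %d" message.

```c
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* SYS_* constants for the built-in calls */

/* Hypothetical helper: invoke a system call purely by its number,
 * using the generic syscall(2) function from the manual page. */
static long call_by_number(int number)
{
    return syscall(number);
}
```

If the kernel log showed, say, "syscall loaded at 210" (210 is just an example value), then call_by_number(210) would enter your new syscall.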
I strongly suggest you do some digging on your own. If you don't, there is absolutely no way you will be able to write any kernel code.

Good references for the syscalls

I need some reference but a good one, possibly with some nice examples. I need it because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it's got a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 1 in the EAX register, and the ECX is probably a pointer to the string, but what about EBX and EDX? I would like that to be explained too, that EBX determines the input (0 for stdin, 1 for something else etc.) and EDX is the length of the string to be entered, etc. etc. I hope you understand what I want. I couldn't find any such material, which is why I am asking here.
Thanks in advance.
The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you execute int $0x80, and the return value from the system call is in %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
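The register assignment for write can be spelled out with GCC-style inline assembly. This is a sketch, not a drop-in replacement for libc: raw_write is a made-up name, the int $0x80 path is only compiled for 32-bit x86, and on any other target the function falls back to libc's syscall() and converts the result back to the raw -errno form so it behaves the same.

```c
#include <errno.h>
#include <sys/syscall.h>  /* SYS_write */
#include <unistd.h>       /* syscall() */

/* Returns what the kernel leaves in %eax: a byte count, or -errno. */
static long raw_write(int fd, const void *buf, unsigned long len)
{
#if defined(__i386__)
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)                 /* result comes back in %eax */
                      : "0"((long)SYS_write),     /* %eax = system call number */
                        "b"(fd),                  /* %ebx = file descriptor    */
                        "c"(buf),                 /* %ecx = pointer to buffer  */
                        "d"(len)                  /* %edx = length in bytes    */
                      : "memory");
    return ret;
#else
    /* Not 32-bit x86: emulate the raw convention on top of libc. */
    long ret = syscall(SYS_write, fd, buf, len);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

For instance, raw_write(1, "hi\n", 3) returns 3, and raw_write(-1, "x", 1) returns -EBADF rather than -1.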
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
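If you want to see the vDSO for yourself, the kernel passes its load address to every process in the auxiliary vector. A small sketch using getauxval() (available in glibc and musl as far as I know); the helper names are hypothetical, and finding the actual functions inside the image means parsing its ELF symbol table, which is not shown here.

```c
#include <string.h>     /* memcmp */
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

/* Address where the kernel mapped the vDSO, or 0 if there is none. */
static unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is an ordinary ELF image in memory, so it should start
 * with the four ELF magic bytes. */
static int looks_like_elf(unsigned long base)
{
    return base != 0 && memcmp((const void *)base, "\177ELF", 4) == 0;
}
```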
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.
If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_. So, for example, you could check out man 2 lseek to get the information about the parameters for sys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx          %ecx   %edx
unsigned int  off_t  unsigned int
