Understanding how Linux syscall() works

I'm trying to understand what the Linux syscall() function expects to get. I'm looking at the man page for syscall and I can't figure out the number of parameters or what they represent. In the source code:
extern long int syscall (long int __sysno, ...) __THROW;
Does it mean that it can handle an unlimited number of parameters? If not, what does each parameter represent?

The second parameter, ..., marks syscall() as a variadic function, one that accepts a variable number of args; common examples are printf() and co. A variadic function cannot discover the number or types of its args on its own, so the caller has to supply exactly what the callee expects. For syscall(), the correct arg count and types are specific to each system call, which is selected by __sysno; that should be a manifest constant like SYS_exit found in a system header.
Although C itself puts no small fixed limit on the args of a variadic function, syscall() is bounded in practice: Linux system calls take at most six arguments, and there are performance considerations and arch differences as well. In short, fewer is often better.
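As a minimal sketch of how the argument count follows from the system call number: the helper names raw_getpid and raw_write below are made up for illustration, while SYS_getpid, SYS_write and syscall() come from the system headers.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: getpid takes no arguments, so only the
 * system call number is passed. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Hypothetical helper: write takes three arguments (fd, buffer, length),
 * passed after the system call number in that order. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Like other libc wrappers, syscall() returns -1 and sets errno when the kernel reports an error.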
Note that variadic functions can be quite versatile. As one example: create your own (error_message + exit) variadic routine that combines an error status as the first arg followed by printf args; see man stdarg and services like vdprintf() and vfprintf().
Dual benefits include more concise source and a smaller .text segment.
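A rough sketch of such a routine, using only <stdarg.h> and the v*printf family. The names die and format_message are invented for this example; format_message is included only to show the same pattern in a form that does not exit.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format printf-style args into buf.
 * Returns whatever vsnprintf reports (the untruncated length). */
int format_message(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}

/* Hypothetical routine: exit status first, then printf-style args.
 * Prints the message to stderr and terminates the process. */
void die(int status, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    exit(status);
}
```

A call site then shrinks to one line, for example die(2, "cannot open %s", path), instead of a separate fprintf() and exit() at every error check.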

Related

save the number of bytes read from file [duplicate]

When I research the return values of kernel system calls, I find tables that describe the calls and what I need to put in the different registers to make them work. However, I can't find any documentation that states what the return value of a system call actually is. All I find, in various places, is that the result will be in the EAX register.
TutorialsPoint:
The result is usually returned in the EAX register.
Assembly Language Step-By-Step: Programming with Linux book by Jeff Duntemann states many times in his programs:
Look at sys_read's return value in EAX
Copy sys_read return value for safe keeping
None of the websites I have found explain this return value. Is there an Internet source that does? Or can someone explain these values to me?

See also this excellent LWN article about system calls which assumes C knowledge.
Also: The Definitive Guide to Linux System Calls (on x86), and related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
C is the language of Unix systems programming, so all the documentation is in terms of C. And then there's documentation for the minor differences between the C interface and the asm on any given platform, usually in the Notes section of man pages.
sys_read means the raw system call (as opposed to the libc wrapper function). The kernel implementation of the read system call is a kernel function called sys_read(). You can't call it with a call instruction, because it's in the kernel, not a library. But people still talk about "calling sys_read" to distinguish it from the libc function call. However, it's ok to say read even when you mean the raw system call (especially when the libc wrapper doesn't do anything special), like I do in this answer.
Also note that syscall.h defines constants like SYS_read with the actual system call number, or asm/unistd.h for the Linux __NR_read names for the same constants. (The value you put in EAX before an int 0x80 or syscall instruction).
Linux system call return values (in EAX/RAX on x86) are either "normal" success, or a -errno code for error. e.g. -EFAULT if you pass an invalid pointer. This behaviour is documented in the syscalls(2) man page.
-1 to -4095 means error, anything else means success. See AOSP non-obvious syscall() implementation for more details on this -4095UL .. -1UL range, which is portable across architectures on Linux, and applies to every system call. (In the future, a different architecture could use a different value for MAX_ERRNO, but the value for existing arches like x86-64 is guaranteed to stay the same as part of Linus's don't-break-userspace policy of keeping kernel ABIs stable.)
For example, glibc's generic syscall(2) wrapper function uses this sequence: cmp rax, -4095 / jae SYSCALL_ERROR_LABEL, which is guaranteed to be future-proof for all Linux system calls.
You can use that wrapper function to make any system call, like syscall( __NR_mmap, ... ). (Or use an inline-asm wrapper header like https://github.com/linux-on-ibm-z/linux-syscall-support/blob/master/linux_syscall_support.h that has safe inline-asm for multiple ISAs, avoiding problems like missing "memory" clobbers that some other inline-asm wrappers have.)
Interesting cases include getpriority where the kernel ABI maps the -20..19 return-value range to 1..40, and libc decodes it. More details in a related answer about decoding syscall error return values.
For mmap, if you wanted you could also detect an error just by checking that the return value isn't page-aligned (e.g. any non-zero bits in the low 12, for a 4k page size), if that would be more efficient than checking p > -4096ULL.
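The two checks can be sketched in C as follows. The function names are hypothetical, and the second one assumes a 4 KiB page size, so it tests the low 12 bits.

```c
#include <stdbool.h>

/* True if a raw Linux system call return value lies in the error
 * range -4095..-1. One unsigned compare, the same idea as
 * cmp rax, -4095 / jae. */
bool raw_syscall_failed(long ret)
{
    return (unsigned long)ret >= (unsigned long)-4095L;
}

/* mmap-only shortcut: a successful result is a page-aligned address,
 * while every value in -4095..-1 has non-zero low bits.
 * Assumption: 4 KiB pages (low 12 bits). */
bool raw_mmap_failed_4k(long ret)
{
    return ((unsigned long)ret & 0xfffUL) != 0;
}
```

The unsigned comparison matters: a large valid return value, such as a high address from mmap, is negative when viewed as a signed long but still falls outside the error range.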
To find the actual numeric values of constants for a specific platform, you need to find the C header file where they're #defined. See my answer on a question about that for details. e.g. in asm-generic/errno-base.h / asm-generic/errno.h.
The meanings of return values for each sys call are documented in the section 2 man pages, like read(2). (sys_read is the raw system call that the glibc read() function is a very thin wrapper for.) Most man pages have a whole section for the return value. e.g.
RETURN VALUE
On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number. It
is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to end-of-
file, or because we are reading from a pipe, or from a terminal), or
because read() was interrupted by a signal. See also NOTES.
On error, -1 is returned, and errno is set appropriately. In this
case, it is left unspecified whether the file position (if any)
changes.
Note that the last paragraph describes how the glibc wrapper decodes the value: if the raw system call's return value is in the error range, the wrapper sets errno to -EAX and returns -1. So a raw return value of -EFAULT becomes errno=EFAULT and a return value of -1.
And there's a whole section listing all the possible error codes that read() is allowed to return, and what they mean specifically for read(). (POSIX standardizes most of this behaviour.)

print an unknown length argument in linux assembly

In Linux assembly, we can write a string to standard output with the write system call. But this system call needs the string length, and the argument doesn't have the same length on every execution.
I know that we can calculate the length of the argument by scanning it for the null byte. However, I am looking for a simpler way to print an argument (or any string of unknown length) with Linux assembly.
So can anyone tell me the simplest way to print a string of unknown length with Linux assembly?

There are no Linux system calls that write an implicit-length string (C-style null-terminated) to a file descriptor. So you have to just work out the length yourself before making a system call.
Linux is portable across many architectures, so I'll express the answer in portable assembly language, aka C:
#include <string.h>   // strlen
#include <unistd.h>   // write, ssize_t
ssize_t write_implicit_length_string(const char *str) {
    size_t size = strlen(str);
    return write(1, str, size);   // stdout is always fd 1
}
If you want to see the asm, compile it with gcc (although that will just show you a function call to strlen. gcc -O3 doesn't inline code for strlen on x86).
As far as asm implementations of strlen, for x86-64 your best bet is an SSE2 loop that uses pcmpeqb / pmovmskb / test / jnz to find the first zero byte. Obviously every ISA will have its own way of doing it, but the important point is that there's no way to have the kernel do it for you.
There are C standard library functions that print strings to stdio FILE * (e.g. fputs) but not to unix file descriptors (libc just has wrappers for system calls).

How to access errno after clone (or: How to set errno location)

Per traditional POSIX, errno is simply an integer lvalue, which works perfectly well with fork, but obviously doesn't work nearly as well with threads. As per pthreads, errno is a thread-local integer lvalue. Under Linux/NPTL, as an implementation detail, errno is some "macro that expands to a function returning an integer lvalue".
On my Debian system, this seems to be *__errno_location (), on some other systems I've seen things like &(gettib()->errnum).
TL;DR
Assuming I've used clone to create a thread, can I just call errno and expect that it will work, or do I have to do some special rain dance? For example, do I need to read some special field in the thread information block, or some special TLS value, or, do I get to set the address of the thread-local variable where the glibc stores the error values somehow? Something like __set_errno_location() maybe?
Or, will it "just work" as it is?
Inevitably, someone will be tempted to reply "simply use pthreads" -- please don't. I do not want to use pthreads. I want clone. I do not want any of the ill-advised functionality of pthreads, and I do not want to deal with any of its quirks, nor do I want the overhead to implement those quirks. I recognize that much of the crud in pthreads comes from the fact that it has to work (and, surprisingly, it successfully works) amongst others for some completely broken systems that are nearly three decades old, but that doesn't mean that it is necessarily a good thing for everyone and every situation. Portability is not of any concern in this case.
All I want in this particular situation is fire up another process running in the same address space as the parent, synchronization via a simple lock (say, a futex), and write working properly (which means I also have to be able to read errno correctly). As little overhead as possible, no other functionality or special behavior needed or even desired.

According to the glibc source code, errno is defined as a thread-local variable. Unfortunately, this requires significant C library support. Any threads created using pthread_create() will be made aware of thread-local variables. I would not even bother trying to get glibc to accept your foreign threads.
An alternative would be to use a different libc implementation that may allow you to extract some of its internal structures and manually set the thread control block if errno is part of it. This would be incredibly hacky and unreliable. I doubt you'll find anything like __set_errno_location(), but rather something like __set_tcb().
#include <bits/some_hidden_file.h>
void init_errno(void)
{
    struct __tcb* tcb;
    /* allocate a dummy thread control block (malloc may set errno
     * so might have to store the tcb on stack or allocate it in the
     * parent) */
    tcb = malloc(sizeof(struct __tcb));
    /* initialize errno */
    tcb->errno = 0;
    /* set pointer to thread control block (x86) */
    arch_prctl(ARCH_SET_FS, tcb);
}
This assumes that the errno macro expands to something like: ((struct __tcb*)__read_fs())->errno.
Of course, there's always the option of implementing an extremely small subset of libc yourself. Or you could write your own implementation of the write() system call with a custom stub to handle errno and have it co-exist with the chosen libc implementation.
#define my_errno /* errno variable stored at some known location */
ssize_t my_write(int fd, const void* buf, size_t len)
{
    ssize_t ret;
    __asm__ (
        /* set system call number */
        /* set up parameters */
        /* make the call */
        /* retrieve return value in c variable */
    );
    /* the raw error range is -4095..-1 */
    if (ret >= -4095 && ret < 0) {
        my_errno = -ret;
        return -1;
    }
    return ret;
}
I don't remember the exact details of GCC inline assembly and the system call invocation details vary depending on platform.
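For x86-64 specifically, the skeleton above could be filled in roughly like this. Treat it as a sketch under assumptions rather than a vetted implementation: my_errno is a plain global invented for the example (shared by every task in the address space, so not thread-safe), and on any other architecture the code simply falls back to libc's syscall(), which relies on the ordinary errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <sys/types.h>     /* ssize_t */
#include <unistd.h>        /* syscall(), used only by the fallback */

/* Assumption: one plain global stands in for "some known location". */
int my_errno;

ssize_t my_write(int fd, const void *buf, size_t len)
{
    long ret;
#if defined(__x86_64__)
    __asm__ volatile (
        "syscall"
        : "=a" (ret)                    /* result comes back in rax */
        : "a" ((long)SYS_write),        /* system call number in rax */
          "D" ((long)fd),               /* arg 1 in rdi */
          "S" (buf),                    /* arg 2 in rsi */
          "d" (len)                     /* arg 3 in rdx */
        : "rcx", "r11", "memory");      /* syscall overwrites rcx and r11 */
#else
    /* Fallback for other architectures: go through libc and rebuild
     * the raw -errno convention from its -1/errno result. */
    ret = syscall(SYS_write, fd, buf, len);
    if (ret == -1)
        ret = -(long)errno;
#endif
    if (ret < 0 && ret >= -4095) {      /* raw error range is -4095..-1 */
        my_errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

The "memory" clobber tells the compiler that the kernel reads the buffer, so pending stores to it are not reordered past the system call.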
Personally, I'd just implement a very small subset of libc, which would just consist of a little assembler and a few constants. This is remarkably simple with so much reference code available out there, although it may be overambitious.

If errno is a thread-local variable, will clone() copy it into the new process's address space? Around 2001 I overrode the errno_location() function to use an errno based on the pid.
http://tamtrajnana.blogspot.com/2012/03/thread-safety-of-errno-variable.html
Since errno is now defined as "__thread int errno;" (see the comment above), this explains how __thread types are handled: Linux's thread local storage implementation

trying to understand the sys_socketcall parameter

Can anyone explain what this line exactly does:
socketcall(7,255);
I know that the command opens a port on the system, but I don't understand the parameters.
The man page says:
int socketcall(int call, unsigned long *args);
DESCRIPTION
socketcall() is a common kernel entry point for the socket system calls. call determines which socket function to invoke. args points to a block containing the actual arguments, which are passed through to the appropriate call.
User programs should call the appropriate functions by their usual names. Only standard library implementors and kernel hackers need to know about socketcall().
Ok, call 7 is sys_getpeername, but if I take a look in the man-page:
int getpeername(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
DESCRIPTION
getpeername() returns the address of the peer connected to the socket sockfd, in the buffer pointed to by addr. The addrlen argument should be initialized to indicate the amount of space pointed to by addr. On return it contains the actual size of the name returned (in bytes). The name is truncated if the buffer provided is too small.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
I really don't get it. The function needs 3 parameters. How does the function get them? What does the 255 mean? Does anyone have an idea how the function is opening a port?

Although Linux has a system call that is commonly called socketcall, the C library does not expose any C function with that name. Normally the standard wrapper functions such as socket() and getpeername() should be used, which will end up calling the system call, but if for some reason it is necessary to call the system call directly then that can be done with syscall(SYS_socketcall, call, args) or using assembly.
In this case the application or a library that it uses (other than the standard C library) has most likely defined its own function called socketcall(), that is unrelated to the system call. You should check that function or its documentation to see what it does.
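To illustrate how the real system call receives the three getpeername() parameters, here is a hedged sketch that packs them into the args block. The function name peername_via_socketcall and the macro CALL_GETPEERNAME are invented for the example; 7 is the value of SYS_GETPEERNAME in <linux/net.h>. SYS_socketcall only exists on some architectures (32-bit x86 has it, x86-64 does not), so the sketch falls back to the ordinary wrapper elsewhere.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical name for the sub-call number of getpeername. */
#define CALL_GETPEERNAME 7

int peername_via_socketcall(int sockfd, struct sockaddr *addr,
                            socklen_t *addrlen)
{
#ifdef SYS_socketcall
    /* The kernel reads the real arguments out of this array. */
    unsigned long args[3] = {
        (unsigned long)sockfd,
        (unsigned long)addr,
        (unsigned long)addrlen,
    };
    return (int)syscall(SYS_socketcall, CALL_GETPEERNAME, args);
#else
    /* No multiplexed entry point on this architecture. */
    return getpeername(sockfd, addr, addrlen);
#endif
}
```

Seen this way, the second argument of the system call is a pointer to such a block, which is why a bare number like 255 makes little sense for the real socketcall.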

Good references for the syscalls

I need some reference but a good one, possibly with some nice examples. I need it because I am starting to write code in assembly using the NASM assembler. I have this reference:
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html
which is quite nice and useful, but it's got a lot of limitations because it doesn't explain the fields in the other registers. For example, if I am using the write syscall, I know I should put 1 in the EAX register, and the ECX is probably a pointer to the string, but what about EBX and EDX? I would like that to be explained too, that EBX determines the input (0 for stdin, 1 for something else etc.) and EDX is the length of the string to be entered, etc. etc. I hope you understand what I want. I couldn't find any such material, which is why I am asking here.
Thanks in advance.

The standard programming language in Linux is C. Because of that, the best descriptions of the system calls will show them as C functions to be called. Given their description as a C function and a knowledge of how to map them to the actual system call in assembly, you will be able to use any system call you want easily.
First, you need a reference for all the system calls as they would appear to a C programmer. The best one I know of is the Linux man-pages project, in particular the system calls section.
Let's take the write system call as an example, since it is the one in your question. As you can see, the first parameter is a signed integer, which is usually a file descriptor returned by the open syscall. These file descriptors could also have been inherited from your parent process, as usually happens for the first three file descriptors (0=stdin, 1=stdout, 2=stderr). The second parameter is a pointer to a buffer, and the third parameter is the buffer's size (as an unsigned integer). Finally, the function returns a signed integer, which is the number of bytes written, or a negative number for an error.
Now, how to map this to the actual system call? There are many ways to do a system call on 32-bit x86 (which is probably what you are using, based on your register names); be careful that it is completely different on 64-bit x86 (be sure you are assembling in 32-bit mode and linking a 32-bit executable; see this question for an example of how things can go wrong otherwise). The oldest, simplest and slowest of them in the 32-bit x86 is the int $0x80 method.
For the int $0x80 method, you put the system call number in %eax, and the parameters in %ebx, %ecx, %edx, %esi, %edi, and %ebp, in that order. Then you execute int $0x80, and the return value from the system call is in %eax. Note that this return value is different from what the reference says; the reference shows how the C library will return it, but the system call returns -errno on error (for instance -EINVAL). The C library will move this to errno and return -1 in that case. See syscalls(2) and intro(2) for more detail.
So, in the write example, you would put the write system call number in %eax, the first parameter (file descriptor number) in %ebx, the second parameter (pointer to the string) in %ecx, and the third parameter (length of the string) in %edx. The system call will return in %eax either the number of bytes written, or the error number negated (if the return value is between -1 and -4095, it is a negated error number).
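The register assignment just described can be written as GCC inline assembly. This is only a sketch: the function name raw_write_int80 is made up, the int $0x80 path is compiled only for 32-bit x86, and on any other target the code falls back to libc's syscall() and converts its result to the same raw convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall(), used only by the fallback */

/* Returns the raw kernel value: bytes written, or a negated errno. */
long raw_write_int80(int fd, const void *buf, size_t len)
{
    long ret;
#if defined(__i386__)
    __asm__ volatile (
        "int $0x80"
        : "=a" (ret)                /* return value in eax */
        : "a" ((long)SYS_write),    /* system call number in eax */
          "b" (fd),                 /* first parameter in ebx */
          "c" (buf),                /* second parameter in ecx */
          "d" (len)                 /* third parameter in edx */
        : "memory");
#else
    /* Not 32-bit x86: emulate the raw convention through libc. */
    ret = syscall(SYS_write, fd, buf, len);
    if (ret == -1)
        ret = -(long)errno;
#endif
    return ret;
}
```

A caller would test the result against the -4095..-1 range instead of looking at errno.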
Finally, how do you find the system call numbers? They can be found at /usr/include/linux/unistd.h. On my system, this just includes /usr/include/asm/unistd.h, which finally includes /usr/include/asm/unistd_32.h, so the numbers are there (for write, you can see __NR_write is 4). The same goes for the error numbers, which come from /usr/include/linux/errno.h (on my system, after chasing the inclusion chain I find the first ones at /usr/include/asm-generic/errno-base.h and the rest at /usr/include/asm-generic/errno.h). For the system calls which use other constants or structures, their documentation tells which headers you should look at to find the corresponding definitions.
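Instead of chasing the include chain by hand, a small C function can print the values the headers resolve to on the machine it is compiled for. The function name print_abi_numbers is hypothetical; the constants come from <sys/syscall.h> and <errno.h>.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_* names, also pulls in the __NR_* ones */

/* Print a few ABI constants as seen by this compiler and target.
 * The system call numbers differ between architectures. */
void print_abi_numbers(void)
{
    printf("SYS_write = %ld, __NR_write = %ld\n",
           (long)SYS_write, (long)__NR_write);
    printf("EBADF = %d, EINVAL = %d\n", EBADF, EINVAL);
}
```

Building the same file with gcc -m32 (where a 32-bit toolchain is installed) would show the 32-bit numbers that the int $0x80 interface uses.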
Now, as I said, int $0x80 is the oldest and slowest method. Newer processors have special system call instructions which are faster. To use them, the kernel makes available a virtual dynamic shared object (the vDSO; it is like a shared library, but in memory only) with a function you can call to do a system call using the best method available for your hardware. It also makes available special functions to get the current time without even having to do a system call, and a few other things. Of course, it is a bit harder to use if you are not using a dynamic linker.
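As a small illustration of the "in memory only" part, a process can ask where the kernel mapped the vDSO by reading the auxiliary vector. The helper name vdso_base is made up; getauxval() and AT_SYSINFO_EHDR are the real libc and kernel names. Actually calling a function inside the vDSO additionally requires parsing its ELF symbol table, which this sketch does not attempt.

```c
#include <elf.h>        /* AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval() */

/* Address of the vDSO's ELF header in this process,
 * or 0 if the kernel did not provide one. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}
```

A dynamically linked program normally never needs this, because the dynamic linker and libc locate the vDSO themselves.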
There is also another older method, the vsyscall, which is similar to the vDSO but uses a single page at a fixed address. This method is deprecated, will result in warnings on the system log if you are using recent kernels, can be disabled on boot on even more recent kernels, and might be removed in the future. Do not use it.

If you download that web page (like it suggests in the second paragraph) and download the kernel sources, you can click the links in the "Source" column, and go directly to the source file that implements the system calls. You can read their C signatures to see what each parameter is used for.
If you're just looking for a quick reference, each of those system calls has a C library interface with the same name minus the sys_. So, for example, you could check out man 2 lseek to get the information about the parameters for sys_lseek:
off_t lseek(int fd, off_t offset, int whence);
where, as you can see, the parameters match the ones from your HTML table:
%ebx            %ecx     %edx
unsigned int    off_t    unsigned int
