How does seccomp-bpf filter syscalls? - linux

I'm investigating the implementation detail of seccomp-bpf, the syscall filtration mechanism that was introduced into Linux since version 3.5.
I looked into the source code of kernel/seccomp.c from Linux 3.10 and want to ask some questions about it.
From seccomp.c, it seems that seccomp_run_filters() is called from __secure_computing() to test the syscall called by the current process.
But looking into seccomp_run_filters(), the syscall number that is passed as an argument is not used anywhere.
It seems that sk_run_filter() is the implementation of BPF filter machine, but sk_run_filter() is called from seccomp_run_filters() with the first argument (the buffer to run the filter on) NULL.
My question is: how can seccomp_run_filters() filter syscalls without using the argument?
The following is the source code of seccomp_run_filters():
/**
* seccomp_run_filters - evaluates all seccomp filters against #syscall
* #syscall: number of the current system call
*
* Returns valid seccomp BPF response codes.
*/
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
u32 ret = SECCOMP_RET_ALLOW;
/* Ensure unexpected behavior doesn't result in failing open. */
if (WARN_ON(current->seccomp.filter == NULL))
return SECCOMP_RET_KILL;
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
for (f = current->seccomp.filter; f; f = f->prev) {
u32 cur_ret = sk_run_filter(NULL, f->insns);
if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
ret = cur_ret;
}
return ret;
}

When a user process enters the kernel, the register set is stored to a kernel variable.
The function sk_run_filter implements the interpreter for the filter language. The relevant instruction for seccomp filters is BPF_S_ANC_SECCOMP_LD_W. Each instruction has a constant k, and in this case it specifies the index of the word to be read.
#ifdef CONFIG_SECCOMP_FILTER
case BPF_S_ANC_SECCOMP_LD_W:
A = seccomp_bpf_load(fentry->k);
continue;
#endif
The function seccomp_bpf_load uses the current register set of the user thread to determine the system call information.

Related

mutex unlocking and request_module() behaviour

I've observed the following code pattern in the Linux kernel, for example net/sched/act_api.c or many other places as well :
rtnl_lock();
rtnetlink_rcv_msg(skb, ...);
replay:
ret = process_msg(skb);
...
/* try to obtain symbol which is in module. */
/* if fail, try to load the module, otherwise use the symbol */
a = get_symbol();
if (a == NULL) {
rtnl_unlock();
request_module();
rtnl_lock();
/* now verify that we can obtain symbols from requested module and return EAGAIN.*/
a = get_symbol();
module_put();
return -EAGAIN;
}
...
if (ret == -EAGAIN)
goto replay;
...
rtnl_unlock();
After request_module has succeeded, the symbol we are interested in, becomes available in kernel memory space, and we can use it. However I don't understand why return EAGAIN and re-read the symbol, why can't just continue right after request_module()?
If you look at the current implementation in the Linux kernel, there is a comment right after the 2nd call equivalent to get_symbol() in your above code (it is tc_lookup_action_n()) that explains exactly why:
rtnl_unlock();
request_module("act_%s", act_name);
rtnl_lock();
a_o = tc_lookup_action_n(act_name);
/* We dropped the RTNL semaphore in order to
* perform the module load. So, even if we
* succeeded in loading the module we have to
* tell the caller to replay the request. We
* indicate this using -EAGAIN.
*/
if (a_o != NULL) {
err = -EAGAIN;
goto err_mod;
}
Even though the module could be requested and loaded, since the semaphore was dropped in order to load the module which is an operation that can sleep (and is not the "standard way" this function is executed, the function returns EAGAIN to signal it.
EDIT for clarification:
If we look at the call sequence when a new action is added (which could cause a required module to be loaded) we have this sequence: tc_ctl_action() -> tcf_action_add() -> tcf_action_init() -> tcf_action_init_1().
Now if "move back" the EAGAIN error back up to tc_ctl_action() in the case RTM_NEWACTION:, we see that with the EAGAIN ret value the call to tcf_action_add is repeated.

Atomicity of writev() system call in Linux

I've looked in the kernel source for linux kernel 4.4.0-57-generic and don't see any locks in the writev() source. Is there something I'm missing? I don't see how writev() is atomic or thread-safe.
Not a kernel expert here, but I'll share my point of view anyway. Feel free to spot any mistakes.
Browsing the kernel (v4.9 though I wouldn't expect it to be so different), and trying to trace the writev(2) system call, I can observe subsequent function calls that create the following path:
SYSCALL_DEFINE3(writev, ..)
do_writev(..)
vfs_writev(..)
do_readv_writev(..)
Now the path branches, depending on whether a write_iter method is implemented and hooked on the struct file_operations field of the struct file that the system call is referring to.
If it's not NULL, the path is:
5a. do_iter_readv_writev(..), which calls the method filp->f_op->write_iter(..) at this point.
If it is NULL, the path is:
5b. do_loop_readv_writev(..), which calls repeatedly in a loop the method filp->f_op->write at this point.
So, as far as I understand, the writev() system call is as thread safe as the underlying write() (or write_iter()) is, which of course can be implemented in various ways, e.g. in a device driver, and may or may not use locks according to its needs and its design.
EDIT:
In kernel v4.4 the paths look pretty similar:
SYSCALL_DEFINE3(writev, ..)
vfs_writev(..)
do_readv_writev(..)
and then it depends on whether the write_iter method as a field in struct file_operations of the struct file is NULL or not, just like the case in v4.9, described above.
VFS (Virtual File System) by itself doesn't garantee atomicity of writev() call. It just calls filesystem-specific .write_iter method of struct file_operations.
It is responsibility of specific filesystem implementation for make method atomically write to the file.
For example, in ext4 filesystem function ext4_file_write_iter uses
mutex_lock(&inode->i_mutex);
for make writting atomic.
Found it in fs.h:
static inline void file_start_write(struct file *file)
{
if (!S_ISREG(file_inode(file)->i_mode))
return;
__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}
and then in super.c:
/*
* This is an internal function, please use sb_start_{write,pagefault,intwrite}
* instead.
*/
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
bool force_trylock = false;
int ret = 1;
#ifdef CONFIG_LOCKDEP
/*
* We want lockdep to tell us about possible deadlocks with freezing
* but it's it bit tricky to properly instrument it. Getting a freeze
* protection works as getting a read lock but there are subtle
* problems. XFS for example gets freeze protection on internal level
* twice in some cases, which is OK only because we already hold a
* freeze protection also on higher level. Due to these cases we have
* to use wait == F (trylock mode) which must not fail.
*/
if (wait) {
int i;
for (i = 0; i < level - 1; i++)
if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
force_trylock = true;
break;
}
}
#endif
if (wait && !force_trylock)
percpu_down_read(sb->s_writers.rw_sem + level-1);
else
ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
WARN_ON(force_trylock & !ret);
return ret;
}
EXPORT_SYMBOL(__sb_start_write);
Thanks again.

Why my implementation of sbrk system call does not work?

I try to write a very simple os to better understand the basic principles. And I need to implement user-space malloc. So at first I want to implement and test it on my linux-machine.
At first I have implemented the sbrk() function by the following way
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work. Instead, when I use sbrk given by os, this works fine.
I have tryed to use another implementation of the sbrk()
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk invoked from this function
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code also does not work, on the line (***) I get segmentation fault.
Where is a problem ?
Thanks All. I have solved my problem using new implementation of the sbrk. The given code works fine.
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}
The first sbrk should probably have a long increment. And you forgot to handle errors (and set errno)
The second sbrk function does not change the address space (as sbrk does). You could use mmap to change it (but using mmap instead of sbrk won't update the kernel's view of data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of process of pid 1234). or even read (e.g. with fopen&fgets) the /proc/self/maps from inside your program.
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single SYSENTER machine instruction with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, the GNU libc is a free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
At last compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers, if you wanted to). Perhaps read the x86/64-ABI specification and the Linux Assembly HowTo.

Confusing result from counting page fault in linux

I was writing programs to count the time of page faults in a linux system. More precisely, the time kernel execute the function __do_page_fault.
And somehow I wrote two global variables, named pfcount_at_beg and pfcount_at_end, which increase once when the function __do_page_fault is executed at different locations of the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;
static void __kprobes
__do_page_fault(...)
{
struct vm_area_sruct *vma;
... // VARIABLES DEFINITION
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
pfcount_at_beg++; // I add THIS
...
...
// ORIGINAL CODE OF THE FUNCTION
...
pfcount_at_end++; // I add THIS
}
I expected that the value of pfcount_at_end is smaller than the value of pfcount_at_beg.
Because, I think, every time kernel executes the instructions of code pfcount_at_end++, it must have executed pfcount_at_beg++(Every function starts at the very beginning of the code).
On the other hand, as there are many conditional return between these two lines of code.
However, the result turns out oppositely. The value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall. And I wrote the user level program to call the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
printk( KERN_INFO "total pf_at_beg%lu\ntotal pf_at_end%lu\n", pfcount_at_beg, pfcount_at_end)
return 0;
}
// user-level program
#include<linux/unistd.h>
#include<sys/syscall.h>
#define __NR_mysyscall 223
int main()
{
syscall(__NR_mysyscall);
return 0;
}
Is there anybody who knows what exactly happened during this?
Just now I modified the code, to make pfcount_at_beg and pfcount_at_end static. However the result did not change, i.e. the value of pfcount_at_end is larger than the value of pfcount_at_beg.
So possibly it might be caused by in-atomic operation of increment. Would it be better if I use read-write lock?
The ++ operator is not garanteed to be atomic, so your counters may suffer concurrent access and have incorrect values. You should protect your increment as a critical section, or use the atomic_t type defined in <asm/atomic.h>, and its related atomic_set() and atomic_add() functions (and a lot more).
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

Linux equivalent of FreeBSD's cpu_set_syscall_retval()

The title pretty much says it all. Looking for the Linux equivalent of cpu_set_syscall_retval() found in /usr/src/sys/amd64/amd64/vm_machdep.c. Not sure if there is even such a thing in Linux but I thought I'd ask anyway.
cpu_set_syscall_retval(struct thread *td, int error)
{
switch (error) {
case 0:
td->td_frame->tf_rax = td->td_retval[0];
td->td_frame->tf_rdx = td->td_retval[1];
td->td_frame->tf_rflags &= ~PSL_C;
break;
case ERESTART:
/*
* Reconstruct pc, we know that 'syscall' is 2 bytes,
* lcall $X,y is 7 bytes, int 0x80 is 2 bytes.
* We saved this in tf_err.
* %r10 (which was holding the value of %rcx) is restored
* for the next iteration.
* %r10 restore is only required for freebsd/amd64 processes,
* but shall be innocent for any ia32 ABI.
*/
td->td_frame->tf_rip -= td->td_frame->tf_err;
td->td_frame->tf_r10 = td->td_frame->tf_rcx;
break;
case EJUSTRETURN:
break;
default:
if (td->td_proc->p_sysent->sv_errsize) {
if (error >= td->td_proc->p_sysent->sv_errsize)
error = -1; /* XXX */
else
error = td->td_proc->p_sysent->sv_errtbl[error];
}
td->td_frame->tf_rax = error;
td->td_frame->tf_rflags |= PSL_C;
break;
}
}
There's no way to do the equivalent in linux. The return value of system calls is propagated via return value from whatever functions are called internally to implement the function all the way back to user-mode. The general convention is that a non-negative return value means success and a negative value indicates an error (with the errno being the negated return value: for example, a "-2" indicates an error with an errno value of 2 [ENOENT]).
You could look up the stored register values that will be popped on return to user-mode and replace one of them (what the BSD code here is doing), but the critical one that contains the return value will just be overwritten by the normal return-from-system-call path anyway, just prior to returning to user mode.

Resources