How to identify read or write page faults in a sigaction handler for SIGSEGV? (Linux)

I use sigaction to handle page-fault exceptions, and the handler function is defined like this:
void sigaction_handler(int signum, siginfo_t *info, void *_context)
So it's easy to get the faulting address by reading info->si_addr.
The question is: how can I tell whether the faulting operation was a memory READ or WRITE?
I found that the _context parameter is really a ucontext_t, defined in /usr/include/sys/ucontext.h.
There is a cr2 field defined in mcontext_t but, unfortunately, it is only available when x86_64 is not defined, so I could not use cr2 to identify read/write operations.
On the other hand, there is a struct named sigcontext defined in /usr/include/bits/sigcontext.h.
This struct contains a cr2 field, but I don't know how to get at it.

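You can check this on x86_64 by looking at the err register in the ucontext's mcontext struct:
void pf_sighandler(int sig, siginfo_t *info, void *context) {
    ucontext_t *ctx = (ucontext_t *)context; // SA_SIGINFO handlers receive the context as void *
    ...
    if (ctx->uc_mcontext.gregs[REG_ERR] & 0x2) { // with glibc, REG_ERR needs _GNU_SOURCE
        // Write fault
    } else {
        // Read fault
    }
    ...
}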

Here is where the kernel generates the SIGSEGV, in the __bad_area_nosemaphore() function of arch/x86/mm/fault.c:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L760
tsk->thread.cr2 = address;
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_PF;
force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);
Note the error_code field. Its values are also defined in arch/x86/mm/fault.c:
http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/mm/fault.c#L23
/*
 * Page fault error code bits:
 *
 *   bit 0 ==  0: no page found        1: protection fault
 *   bit 1 ==  0: read access          1: write access
 *   bit 2 ==  0: kernel-mode access   1: user-mode access
 *   bit 3 ==                          1: use of reserved bit detected
 *   bit 4 ==                          1: fault was an instruction fetch
 */
enum x86_pf_error_code {
    PF_PROT  = 1 << 0,
    PF_WRITE = 1 << 1,
    PF_USER  = 1 << 2,
    PF_RSVD  = 1 << 3,
    PF_INSTR = 1 << 4,
};
So the exact information about the access type is stored in thread_struct.error_code: http://lxr.missinglinkelectronics.com/linux+v3.12/arch/x86/include/asm/processor.h#L470
As far as I can see, the error_code field is not exported in the siginfo_t struct (which is documented at
http://man7.org/linux/man-pages/man2/sigaction.2.html; search for si_signo).
So you can:
1. Hack the kernel to export tsk->thread.error_code (or check whether it is already exported somewhere, for example through ptrace).
2. Take the faulting address, read and parse /proc/self/maps, and check the access bits of the mapping. If the page is present and read-only, the fault can only come from a write; if the page is not present, both kinds of access are possible. (There should be no write-only pages.)
3. Find the address of the faulting instruction, read it, and disassemble it.
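A minimal userspace sketch of the /proc/self/maps approach, assuming the usual "start-end perms ..." line format. The function name lookup_perms and the buffer sizes are made up for illustration. Keep in mind that fopen() and friends are not async-signal-safe, so calling this directly from a SIGSEGV handler is a gamble; it is shown here only to make the parsing step concrete.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Look up addr in /proc/self/maps and copy the 4-character permission
 * string (for example "r--p") into perms.
 * Returns 1 if a mapping contains addr, 0 if none does, -1 if the file
 * cannot be opened. perms is only meaningful when 1 is returned.
 * (lookup_perms is a hypothetical helper name, not a library function.)
 */
int lookup_perms(const void *addr, char perms[5])
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[1024];
    unsigned long start, end;
    uintptr_t a = (uintptr_t)addr;
    int found = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        /* Each line starts with "start-end perms", addresses in hex. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (a >= start && a < end) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

If the call returns 1 and perms reads "r--p" or "r-xp", a fault at that address can only have been a write. If it returns 0, the address is unmapped and the access type cannot be inferred this way.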

The error_code information can be accessed through:
err = ((ucontext_t*)context)->uc_mcontext.gregs[REG_ERR]
It is pushed on the stack by the hardware and then handed to the signal handler by the kernel, since the kernel passes the entire `frame'. Then
bool write_fault = (err & 0x2) != 0;
will be true if the access was a write access, and false otherwise.
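For what it's worth, here is a small self-contained sketch that exercises the REG_ERR approach on x86_64 Linux with glibc. It maps one PROT_NONE page, touches it, and lets the handler record bit 1 of the error code before making the page accessible so the faulting instruction can be retried. The names classify_fault, segv_handler and PF_WRITE_BIT are mine, not from any header, and calling mprotect() from a handler is an assumption that holds on Linux in practice rather than something POSIX promises.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes REG_ERR with this defined */
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

#define PF_WRITE_BIT 0x2        /* bit 1 of the x86 page-fault error code */

static volatile sig_atomic_t last_fault_was_write = -1;
static char *probe_page;        /* the page we expect to fault on */
static size_t probe_len;

static void segv_handler(int sig, siginfo_t *info, void *context)
{
    ucontext_t *ctx = (ucontext_t *)context;
    char *addr = (char *)info->si_addr;

    (void)sig;
    /* Any fault outside the probe page is a real bug: give up. */
    if (addr < probe_page || addr >= probe_page + probe_len)
        _exit(1);
    last_fault_was_write =
        (ctx->uc_mcontext.gregs[REG_ERR] & PF_WRITE_BIT) != 0;
    /* Make the page accessible so the faulting instruction can be retried. */
    mprotect(probe_page, probe_len, PROT_READ | PROT_WRITE);
}

/* Returns 1 for a write fault, 0 for a read fault, -1 on setup failure. */
int classify_fault(int do_write)
{
    struct sigaction sa = {0}, old;
    volatile char *p;
    void *mem;

    probe_len = (size_t)sysconf(_SC_PAGESIZE);
    mem = mmap(NULL, probe_len, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    probe_page = mem;

    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(mem, probe_len);
        return -1;
    }

    last_fault_was_write = -1;
    p = probe_page;
    if (do_write) {
        *p = 1;                 /* faults as a write */
    } else {
        char c = *p;            /* faults as a read */
        (void)c;
    }

    sigaction(SIGSEGV, &old, NULL);
    munmap(mem, probe_len);
    return last_fault_was_write;
}
```

Calling classify_fault(1) should report a write fault and classify_fault(0) a read fault. On other architectures the register layout is different, so this does not carry over as-is.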

Related

BUILD_BUG_ON(dt_virt_base % SZ_2M) gives me error when NR_CPUS is reduced to 2

When I set NR_CPUS to 2, I get this error while building linux-5.10.0rc (when NR_CPUS is 4, it does not).
arch/arm64/mm/mmu.c: In function 'fixmap_remap_fdt':
././include/linux/compiler_types.h:315:38: error: call to '__compiletime_assert_404' declared with attribute error: BUILD_BUG_ON failed: dt_virt_base % SZ_2M
  315 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
The function fixmap_remap_fdt looks like this, and the line BUILD_BUG_ON(dt_virt_base % SZ_2M); seems to be the one generating the error.
void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
{
const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
int offset;
void *dt_virt;
/*
* skip some comments
*/
BUILD_BUG_ON(MIN_FDT_ALIGN < 8);
if (!dt_phys || dt_phys % MIN_FDT_ALIGN)
return NULL;
/*
* skip (some comments)
*/
BUILD_BUG_ON(dt_virt_base % SZ_2M); // <=== line generating error
BUILD_BUG_ON(__fix_to_virt(FIX_FDT_END) >> SWAPPER_TABLE_SHIFT !=
__fix_to_virt(FIX_BTMAP_BEGIN) >> SWAPPER_TABLE_SHIFT);
I know BUILD_BUG_ON is supposed to generate an error when some condition is true at build time, which is possible because some conditions are known at build time. But in this case dt_virt_base is a virtual address in the fixmap (predefined virtual addresses for special purposes in the kernel), and it is supposed to be aligned to SZ_2M. See the definition in arch/arm64/include/asm/fixmap.h.
enum fixed_addresses {
FIX_HOLE,
/*
* Reserve a virtual window for the FDT that is 2 MB larger than the
* maximum supported size, and put it at the top of the fixmap region.
* The additional space ensures that any FDT that does not exceed
* MAX_FDT_SIZE can be mapped regardless of whether it crosses any
* 2 MB alignment boundaries.
*
* Keep this at the top so it remains 2 MB aligned.
*/
#define FIX_FDT_SIZE (MAX_FDT_SIZE + SZ_2M)
FIX_FDT_END,
FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
These indexes (an enum type) give the number of pages below the fixmap top virtual address, so given an index we can compute the virtual address. __fix_to_virt is defined in include/asm-generic/fixmap.h as below.
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
So, when I reduce NR_CPUS from 4 to 2, I get this build-time error, but I can't understand why. Could anyone tell me what is causing the error?
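The arithmetic in the question can be replayed in userspace to see what the assertion actually depends on. This is only a sketch under assumptions: 4 KiB pages, MAX_FDT_SIZE equal to SZ_2M, only the first enum entries quoted above, and a made-up fixaddr_top parameter standing in for FIXADDR_TOP. It does not say why NR_CPUS matters in the real build.

```c
#include <stdint.h>

#define PAGE_SHIFT   12                      /* assumption: 4 KiB pages */
#define PAGE_SIZE    (UINT64_C(1) << PAGE_SHIFT)
#define SZ_2M        UINT64_C(0x200000)
#define MAX_FDT_SIZE SZ_2M                   /* assumption, not taken from the question */
#define FIX_FDT_SIZE (MAX_FDT_SIZE + SZ_2M)

/* Only the leading entries of the enum quoted in the question. */
enum fixed_addresses {
    FIX_HOLE,
    FIX_FDT_END,
    FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
};

/* Same formula as __fix_to_virt(), with the top address as a parameter. */
uint64_t fix_to_virt(uint64_t fixaddr_top, unsigned int idx)
{
    return fixaddr_top - ((uint64_t)idx << PAGE_SHIFT);
}

/* Nonzero when BUILD_BUG_ON(dt_virt_base % SZ_2M) would NOT fire. */
int fdt_base_is_aligned(uint64_t fixaddr_top)
{
    return fix_to_virt(fixaddr_top, FIX_FDT) % SZ_2M == 0;
}
```

With these numbers FIX_FDT is 1024, so the FDT window starts exactly 4 MB below the top. The check therefore passes only if the top address itself is 2 MB aligned and nothing is inserted in the enum ahead of FIX_FDT, which are the two things worth comparing between the NR_CPUS=4 and NR_CPUS=2 configurations.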

How to use proc_pid_cmdline in kernel module

I am writing a kernel module to get the list of PIDs with their complete process names. proc_pid_cmdline() gives the complete process name; /proc/*/cmdline uses the same function to get it. (struct task_struct)->comm gives a hint of what process it is, but not the complete path.
I call the function by name, but the build fails because it cannot find the function.
How can I use proc_pid_cmdline() in a module?
You are not supposed to call proc_pid_cmdline().
It is a non-public function in fs/proc/base.c:
static int proc_pid_cmdline(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
However, what it does is simple:
get_cmdline(task, m->buf, PAGE_SIZE);
That is not likely to return the full path, though, and it will not be possible to determine the full path in every case. The argv[0] value may be overwritten, the file could be deleted or moved, etc. A process may exec() in a way which obscures the original command line, and all kinds of other maladies.
A scan of /proc/*/cmdline on my Fedora 20 system turns up all kinds of less-than-useful results:
-F
BUG:
WARNING: at
WARNING: CPU:
INFO: possible recursive locking detecte
ernel BUG at
list_del corruption
list_add corruption
do_IRQ: stack overflow:
ear stack overflow (cur:
eneral protection fault
nable to handle kernel
ouble fault:
RTNL: assertion failed
eek! page_mapcount(page) went negative!
adness at
NETDEV WATCHDOG
ysctl table check failed
: nobody cared
IRQ handler type mismatch
Machine Check Exception:
Machine check events logged
divide error:
bounds:
coprocessor segment overrun:
invalid TSS:
segment not present:
invalid opcode:
alignment check:
stack segment:
fpu exception:
simd exception:
iret exception:
/var/log/messages
--
/usr/bin/abrt-dump-oops
-xtD
I have managed to solve a version of this problem. I wanted to access the cmdline of all PIDs but within the kernel itself (as opposed to a kernel module as the question states), but perhaps these principles can be applied to kernel modules as well?
What I did was, I added the following function to fs/proc/base.c
int proc_get_cmdline(struct task_struct *task, char * buffer) {
int i;
int ret = proc_pid_cmdline(task, buffer);
for(i = 0; i < ret - 1; i++) {
if(buffer[i] == '\0')
buffer[i] = ' ';
}
return 0;
}
I then added the declaration in include/linux/proc_fs.h
int proc_get_cmdline(struct task_struct *, char *);
At this point, I could access the cmdline of all processes within the kernel.
To access the task_struct, perhaps you could refer to kernel: efficient way to find task_struct by pid?.
Once you have the task_struct, you should be able to do something like:
char cmdline[256];
proc_get_cmdline(task, cmdline);
if(strlen(cmdline) > 0)
printk(" cmdline :%s\n", cmdline);
else
printk(" cmdline :%s\n", task->comm);
I was able to obtain the commandline of all processes this way.
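The NUL-to-space step above is easy to try from userspace first, since /proc/<pid>/cmdline holds the same NUL-separated argument block. A rough sketch follows; join_cmdline and read_own_cmdline are names I made up, and this is ordinary libc code, not something that can be pasted into the kernel.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Replace the NUL separators between arguments with spaces, in place.
 * buf holds len bytes as read from a cmdline file; the last byte is left
 * alone so the terminating NUL of the final argument survives.
 */
void join_cmdline(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == '\0')
            buf[i] = ' ';
    }
}

/* Read /proc/self/cmdline into buf (capacity cap) and join it.
 * Returns the number of bytes read, 0 on failure. */
size_t read_own_cmdline(char *buf, size_t cap)
{
    FILE *f;
    size_t n;

    if (cap == 0)
        return 0;
    buf[0] = '\0';
    f = fopen("/proc/self/cmdline", "r");
    if (!f)
        return 0;
    n = fread(buf, 1, cap - 1, f);
    fclose(f);
    buf[n] = '\0';              /* terminate even if the data was cut short */
    join_cmdline(buf, n);
    return n;
}
```

The loop bound mirrors the `i < ret - 1` in the kernel version: everything except the final terminator is turned into a space.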
To get the full path of the binary behind a process:
char *exepathp = NULL;
struct file *exe_file = NULL;
struct mm_struct *mm;
char exe_path[1000];
// straight up stolen from get_mm_exe_file
mm = get_task_mm(current); // NULL for kernel threads
if (mm) {
    down_read(&mm->mmap_sem); // lock read
    exe_file = mm->exe_file;
    if (exe_file) get_file(exe_file);
    up_read(&mm->mmap_sem); // unlock read
    mmput(mm); // drop the reference taken by get_task_mm
}
if (exe_file) {
    // reduce exe path to a string (d_path may return an ERR_PTR)
    exepathp = d_path(&exe_file->f_path, exe_path, sizeof(exe_path));
    fput(exe_file); // drop the reference taken by get_file
}
Here current stands for the task_struct of the process you are interested in. The variable exepathp ends up pointing at the full path string. This is slightly different from the process cmd: it is the path of the binary that was loaded to start the process. Combining this path with the process cmd should give you the full path.

Linux equivalent of FreeBSD's cpu_set_syscall_retval()

The title pretty much says it all. Looking for the Linux equivalent of cpu_set_syscall_retval() found in /usr/src/sys/amd64/amd64/vm_machdep.c. Not sure if there is even such a thing in Linux but I thought I'd ask anyway.
void cpu_set_syscall_retval(struct thread *td, int error)
{
switch (error) {
case 0:
td->td_frame->tf_rax = td->td_retval[0];
td->td_frame->tf_rdx = td->td_retval[1];
td->td_frame->tf_rflags &= ~PSL_C;
break;
case ERESTART:
/*
* Reconstruct pc, we know that 'syscall' is 2 bytes,
* lcall $X,y is 7 bytes, int 0x80 is 2 bytes.
* We saved this in tf_err.
* %r10 (which was holding the value of %rcx) is restored
* for the next iteration.
* %r10 restore is only required for freebsd/amd64 processes,
* but shall be innocent for any ia32 ABI.
*/
td->td_frame->tf_rip -= td->td_frame->tf_err;
td->td_frame->tf_r10 = td->td_frame->tf_rcx;
break;
case EJUSTRETURN:
break;
default:
if (td->td_proc->p_sysent->sv_errsize) {
if (error >= td->td_proc->p_sysent->sv_errsize)
error = -1; /* XXX */
else
error = td->td_proc->p_sysent->sv_errtbl[error];
}
td->td_frame->tf_rax = error;
td->td_frame->tf_rflags |= PSL_C;
break;
}
}
There's no way to do the equivalent in Linux. The return value of a system call is propagated as the return value of whatever functions are called internally to implement the call, all the way back to user mode. The general convention is that a non-negative return value means success and a negative value indicates an error (with the errno being the negated return value: for example, -2 indicates an error with an errno value of 2 [ENOENT]).
You could look up the saved register values that will be restored on return to user mode and replace one of them (which is what the BSD code here is doing), but the critical one, the one that holds the return value, will just be overwritten by the normal return-from-system-call path anyway, just prior to returning to user mode.
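The negative-errno convention is easy to observe from userspace by issuing a system call without going through the libc wrapper. Below is a sketch for x86_64 only; raw_syscall3 and libc_style_result are hypothetical helper names, and the second function only approximates what a libc wrapper does with the raw value.

```c
#include <errno.h>
#include <sys/syscall.h>

/* Issue a system call with up to three arguments directly (x86_64 only). */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;

    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;                 /* values in [-4095, -1] are -errno */
}

/* Roughly what a libc wrapper does: turn -errno into -1 plus errno. */
long libc_style_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, raw_syscall3(SYS_close, -1, 0, 0) comes back as -EBADF straight from the kernel, with no carry flag or second register involved, unlike the FreeBSD code above.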

How does seccomp-bpf filter syscalls?

I'm investigating the implementation details of seccomp-bpf, the syscall filtering mechanism that was introduced in Linux 3.5.
I looked into the source code of kernel/seccomp.c from Linux 3.10 and want to ask some questions about it.
From seccomp.c, it seems that seccomp_run_filters() is called from __secure_computing() to test the syscall called by the current process.
But looking into seccomp_run_filters(), the syscall number that is passed as an argument is not used anywhere.
It seems that sk_run_filter() is the implementation of BPF filter machine, but sk_run_filter() is called from seccomp_run_filters() with the first argument (the buffer to run the filter on) NULL.
My question is: how can seccomp_run_filters() filter syscalls without using the argument?
The following is the source code of seccomp_run_filters():
/**
* seccomp_run_filters - evaluates all seccomp filters against @syscall
* @syscall: number of the current system call
*
* Returns valid seccomp BPF response codes.
*/
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
u32 ret = SECCOMP_RET_ALLOW;
/* Ensure unexpected behavior doesn't result in failing open. */
if (WARN_ON(current->seccomp.filter == NULL))
return SECCOMP_RET_KILL;
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
for (f = current->seccomp.filter; f; f = f->prev) {
u32 cur_ret = sk_run_filter(NULL, f->insns);
if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
ret = cur_ret;
}
return ret;
}
When a user process enters the kernel, its register set is saved in a kernel data structure (struct pt_regs).
The function sk_run_filter implements the interpreter for the filter language. The relevant instruction for seccomp filters is BPF_S_ANC_SECCOMP_LD_W. Each instruction has a constant k, and in this case it specifies the offset, within struct seccomp_data, of the word to be read.
#ifdef CONFIG_SECCOMP_FILTER
case BPF_S_ANC_SECCOMP_LD_W:
A = seccomp_bpf_load(fentry->k);
continue;
#endif
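The function seccomp_bpf_load uses the saved register set of the current user thread to work out the system call information. That is also why sk_run_filter() can be called with a NULL buffer: a seccomp filter never reads packet data, only these words.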
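Seen from userspace, the load that the kernel turns into BPF_S_ANC_SECCOMP_LD_W is an ordinary BPF_LD | BPF_W | BPF_ABS whose k is an offset into struct seccomp_data. The sketch below installs such a filter in a forked child and makes getppid() fail with EPERM. The function name run_filtered_child and the choice of getppid are arbitrary, and a real filter should also check seccomp_data.arch first, which is left out here for brevity.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, install a filter that fails getppid() with EPERM, and
 * report what the child saw: 0 = getppid() was blocked as intended,
 * 1 = it was not blocked, 2 = the filter could not be installed,
 * -1 = fork/wait problem.
 */
int run_filtered_child(void)
{
    struct sock_filter insns[] = {
        /* A = seccomp_data.nr; k is the byte offset into struct seccomp_data */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == SYS_getppid run the next instruction, otherwise skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Required so that an unprivileged process may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        errno = 0;
        if (syscall(SYS_getppid) == -1 && errno == EPERM)
            _exit(0);
        _exit(1);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The filter program never touches a packet buffer. Its only input is the word loaded from seccomp_data, which the kernel fills in from the saved registers at the time the load instruction runs.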

Check validity of virtual memory address

I am iterating through the pages between VMALLOC_START and VMALLOC_END, and I want to check whether each address I get is valid.
How can I do this?
I iterate through the pages like this:
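unsigned long addr;
// Step an integer address: p += PAGE_SIZE on an unsigned long * would advance PAGE_SIZE * sizeof(unsigned long) bytes
for (addr = VMALLOC_START; addr <= VMALLOC_END - PAGE_SIZE; addr += PAGE_SIZE)
{
    unsigned long *p = (unsigned long *) addr;
    //How to check if p is OK to access it?
}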
Thanks!
The easiest way is to try to read it and catch the exception.
Catching the exception is done by defining an entry in the __ex_table section, using inline assembly.
The exception table entry contains a pointer to a memory access instruction and a pointer to a recovery address. If a fault happens on this instruction, EIP will be set to the recovery address.
Something like this (I didn't test this, I may be missing something):
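void *ptr=whatever;
int ok=1;
unsigned long tmp; // scratch register for the loaded value
asm volatile(
"1: mov (%2),%1\n" // Try to access
"jmp 3f\n" // Success - skip error handling
"2: mov $0,%0\n" // Error - set ok=0
"3:\n" // Jump here on success
".section __ex_table,\"a\"\n"
".long 1b,2b\n" // Use .quad for 64bit. Newer x86 kernels use relative entries; see _ASM_EXTABLE.
".previous\n"
: "+r"(ok), "=r"(tmp) : "r"(ptr) : "memory"
);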
