Reading x86 MSR from kernel module - linux

My main aim is to get the address values of the last 16 branches maintained by the LBR registers when a program crashes. I have tried two approaches so far:
1) msr-tools
This lets me read MSR values from the command line. I invoke it via system() calls from the C program itself and try to read the values, but the register values seem nowhere related to the addresses in the program itself. Most probably the registers are getting polluted by the other branches in system code. I tried turning off recording of branches in ring 0 and of far jumps, but that doesn't help; I still get unrelated values.
2) accessing through kernel module
OK, I wrote a very simple module (I've never done this before) to access the MSRs directly and possibly avoid register pollution.
Here's what I have:
#define LBR 0x1d9 // IA32_DEBUGCTL MSR
// I first set this to some non-0 value using wrmsr (msr-tools)

static void __init do_rdmsr(unsigned msr, unsigned unused2)
{
    uint64_t msr_value;

    __asm__ __volatile__ ("rdmsr"
                          : "=A" (msr_value)
                          : "c" (msr));
    printk(KERN_EMERG "%lu \n", msr_value);
}

static int hello_init(void)
{
    printk(KERN_EMERG "Value is ");
    do_rdmsr(LBR, 0);
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_EMERG "End\n");
}

module_init(hello_init);
module_exit(hello_exit);
But the problem is that every time I use dmesg to read the output, I get just
Value is 0
(I have tried other registers; the value always comes out as 0.)
Is there something that I am forgetting here?
Any help? Thanks

Use the following. On x86-64 the "=A" constraint does not refer to the EDX:EAX pair as it does on 32-bit x86; it allocates a single register out of rax/rdx, so the value that rdmsr leaves split across EDX:EAX is lost. Read the two halves explicitly instead:
unsigned long long x86_get_msr(int msr)
{
    unsigned long msrl = 0, msrh = 0;

    /* NOTE: rdmsr always returns its result in the EDX:EAX register pair */
    asm volatile ("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(msr));
    return ((unsigned long long)msrh << 32) | msrl;
}

You can use Ilya Matveychikov's answer... or... OR :
#include <asm/msr.h>
int err;
unsigned int msr, cpu;
unsigned long long val;
/* rdmsr without exception handling; note that rdmsrl() is a macro that stores the result into val */
rdmsrl(msr, val);
/* rdmsr with exception handling */
err = rdmsrl_safe(msr, &val);
/* rdmsr on a given CPU (instead of current one) */
err = rdmsrl_safe_on_cpu(cpu, msr, &val);
And there are many more functions, such as:
int msr_set_bit(u32 msr, u8 bit)
int msr_clear_bit(u32 msr, u8 bit)
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs)
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8])
Have a look at /lib/modules/$(uname -r)/build/arch/x86/include/asm/msr.h
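For example, a minimal sketch of a module reading IA32_DEBUGCTL (0x1d9) with the safe variant (the module and symbol names here are made up for illustration):

#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/msr.h>

static int __init msrdemo_init(void)
{
    unsigned long long val;

    /* rdmsrl_safe() returns non-zero if the rdmsr faulted (e.g. unknown MSR) */
    if (rdmsrl_safe(0x1d9, &val))
        pr_err("rdmsr of IA32_DEBUGCTL failed\n");
    else
        pr_info("IA32_DEBUGCTL = 0x%llx\n", val);
    return 0;
}

static void __exit msrdemo_exit(void)
{
}

module_init(msrdemo_init);
module_exit(msrdemo_exit);
MODULE_LICENSE("GPL");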

Related

How to test/validate the vmalloc guard page is working in Linux

I am studying stack guarding in Linux. I found that the Linux kernel's VMAP_STACK config option uses a guard-page mechanism along with vmalloc() to provide stack guarding.
I am trying to find out how this guard page works in the Linux kernel. I googled and checked the kernel source, but could not find the relevant code.
A further question is how to verify the guarded stack.
I wrote a kernel module to underrun/overflow a process's kernel stack, like this:
static void shoot_kernel_stack(void)
{
    unsigned char *ptr = task_stack_page(current);
    unsigned char *tmp = NULL;

    tmp = ptr + THREAD_SIZE + PAGE_SIZE + 0;
    // tmp -= 0x100;
    memset(tmp, 0xB4, 0x10); // Underrun
}
I really get the kernel panic like below,
[ 8006.358354] BUG: stack guard page was hit at 00000000e8dc2d98 (stack is 00000000cff0f921..00000000653b24a9)
[ 8006.361276] kernel stack overflow (page fault): 0000 [#1] SMP PTI
Is this the right way to verify the guard page?
The VMAP_STACK Linux feature maps the kernel stacks of threads into VMAs. Because the stack is virtually mapped, the underlying physical pages do not need to be contiguous, and cross-page overflows become detectable by adding guard pages. Since each VMA is followed by a guard page (unless the VM_NO_GUARD flag is passed at allocation time), the stacks allocated in those areas benefit from stack-overflow detection.
ALLOCATION
Thread stacks are allocated at thread creation time with alloc_thread_stack_node() in kernel/fork.c. When VMAP_STACK is activated, the stacks are cached because, according to the comments in the source code:
vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
flush. Try to minimize the number of calls by caching stacks.
The kernel stack size is THREAD_SIZE (equal to 4 pages on x86_64 platforms). The source code of the allocation invoked at thread creation time is:
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
    void *stack;
    int i;

    [...] // <----- Part which gets a previously cached stack. If no stack is in
          // the cache, the following runs to allocate a brand new stack:

    /*
     * Allocated stacks are cached and later reused by new threads,
     * so memcg accounting is performed manually on assigning/releasing
     * stacks to tasks. Drop __GFP_ACCOUNT.
     */
    stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
                                 VMALLOC_START, VMALLOC_END,
                                 THREADINFO_GFP & ~__GFP_ACCOUNT,
                                 PAGE_KERNEL,
                                 0, node, __builtin_return_address(0));
    [...]
__vmalloc_node_range() is defined in mm/vmalloc.c. It calls __get_vm_area_node(). As the latter is not passed the VM_NO_GUARD flag, an additional page is added at the end of the allocated area. This is the guard page of the VMA:
static struct vm_struct *__get_vm_area_node(unsigned long size,
        unsigned long align, unsigned long flags, unsigned long start,
        unsigned long end, int node, gfp_t gfp_mask, const void *caller)
{
    struct vmap_area *va;
    struct vm_struct *area;

    BUG_ON(in_interrupt());
    size = PAGE_ALIGN(size);
    if (unlikely(!size))
        return NULL;

    if (flags & VM_IOREMAP)
        align = 1ul << clamp_t(int, get_count_order_long(size),
                               PAGE_SHIFT, IOREMAP_MAX_ORDER);

    area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
    if (unlikely(!area))
        return NULL;

    if (!(flags & VM_NO_GUARD)) // <----- A GUARD PAGE IS ADDED
        size += PAGE_SIZE;

    va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
    if (IS_ERR(va)) {
        kfree(area);
        return NULL;
    }

    setup_vmalloc_vm(area, va, flags, caller);

    return area;
}
OVERFLOW MANAGEMENT
Stack overflow management is architecture dependent (i.e. the source code is located under arch/...). The references listed below provide pointers to some architecture-dependent implementations.
For the x86_64 platform, the overflow check is done upon the page-fault exception, which triggers the following chain of function calls: do_page_fault()->__do_page_fault()->do_kern_addr_fault()->bad_area_nosemaphore()->no_context(), defined in arch/x86/mm/fault.c. In no_context(), there is a part dedicated to VMAP_STACK management for the detection of stack under/overflows:
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
           unsigned long address, int signal, int si_code)
{
    struct task_struct *tsk = current;
    unsigned long flags;
    int sig;

    [...]

#ifdef CONFIG_VMAP_STACK
    /*
     * Stack overflow? During boot, we can fault near the initial
     * stack in the direct map, but that's not an overflow -- check
     * that we're in vmalloc space to avoid this.
     */
    if (is_vmalloc_addr((void *)address) &&
        (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
         address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
        unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);

        /*
         * We're likely to be running with very little stack space
         * left. It's plausible that we'd hit this condition but
         * double-fault even before we get this far, in which case
         * we're fine: the double-fault handler will deal with it.
         *
         * We don't want to make it all the way into the oops code
         * and then double-fault, though, because we're likely to
         * break the console driver and lose most of the stack dump.
         */
        asm volatile ("movq %[stack], %%rsp\n\t"
                      "call handle_stack_overflow\n\t"
                      "1: jmp 1b"
                      : ASM_CALL_CONSTRAINT
                      : "D" ("kernel stack overflow (page fault)"),
                        "S" (regs), "d" (address),
                        [stack] "rm" (stack));
        unreachable();
    }
#endif
    [...]
}
In the above code, when a stack under/overflow is detected, the handle_stack_overflow() function (defined in arch/x86/kernel/traps.c) is called:
#ifdef CONFIG_VMAP_STACK
__visible void __noreturn handle_stack_overflow(const char *message,
                                                struct pt_regs *regs,
                                                unsigned long fault_address)
{
    printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
           (void *)fault_address, current->stack,
           (char *)current->stack + THREAD_SIZE - 1);
    die(message, regs, 0);

    /* Be absolutely certain we don't return. */
    panic("%s", message);
}
#endif
The example error message "BUG: stack guard page was hit at..." pointed out in the question comes from the above handle_stack_overflow() function.
FROM YOUR EXAMPLE MODULE
When VMAP_STACK is defined, the stack_vm_area field of the task descriptor appears and is set to the VMA address associated with the stack. From there, it is possible to grab interesting information:
struct task_struct *task;
#ifdef CONFIG_VMAP_STACK
struct vm_struct *vm;
#endif // CONFIG_VMAP_STACK
task = current;
printk("\tKernel stack: 0x%lx\n", (unsigned long)(task->stack));
printk("\tStack end magic: 0x%lx\n", *(unsigned long *)(task->stack));
#ifdef CONFIG_VMAP_STACK
vm = task->stack_vm_area;
printk("\tstack_vm_area->addr = 0x%lx\n", (unsigned long)(vm->addr));
printk("\tstack_vm_area->nr_pages = %u\n", vm->nr_pages);
printk("\tstack_vm_area->size = %lu\n", vm->size);
#endif // CONFIG_VMAP_STACK
printk("\tLocal var in stack: 0x%lx\n", (unsigned long)(&task));
The nr_pages field is the number of pages without the additional guard page. The last unsigned long at the top of the stack is set with STACK_END_MAGIC defined in include/uapi/linux/magic.h as:
#define STACK_END_MAGIC 0x57AC6E9D
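As a quick sketch (not part of the original answer, and assuming a module context), the magic can be checked with the end_of_stack() helper:

#include <linux/magic.h>            /* STACK_END_MAGIC */
#include <linux/sched/task_stack.h> /* end_of_stack() */

static void check_stack_end(struct task_struct *task)
{
    unsigned long *magic = end_of_stack(task);

    /* if the magic was overwritten, something scribbled past the stack end */
    pr_info("stack end magic %s (0x%lx)\n",
            *magic == STACK_END_MAGIC ? "intact" : "CORRUPTED", *magic);
}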
REFERENCES:
Preventing stack guard-page hopping
arm64: VMAP_STACK support
CONFIG_VMAP_STACK: Use a virtually-mapped stack
Linux 4.9 On x86_64 To Support Vmapped Stacks
A Decade of Linux Kernel Vulnerabilities

Is it possible to dump inode information from the inotify subsystem?

I am trying to figure out which files my editor is watching.
I have learnt that counting the number of inotify fds under /proc/${PID}/fd is possible, and my question is: is it possible to dump the list of inodes watched by one process?
UPDATE:
I have posted a working solution below, and thanks for a helpful reference here.
UPDATE 2: Recently I found that kallsyms_lookup_name (and more symbols) are no longer exported since Linux kernel v5.7, so I have decided to update my own solution in case anyone else cares.
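A common workaround, sketched below (the helper name resolve_kallsyms_lookup_name is made up for illustration), is to recover the address of kallsyms_lookup_name through a kprobe, since register_kprobe() resolves the symbol name for us:

#include <linux/kprobes.h>

static unsigned long (*klookup)(const char *name);

static int resolve_kallsyms_lookup_name(void)
{
    /* registering a probe on the symbol exposes its address via kp.addr */
    struct kprobe kp = { .symbol_name = "kallsyms_lookup_name" };
    int ret = register_kprobe(&kp);

    if (ret)
        return ret;
    klookup = (void *)kp.addr;
    unregister_kprobe(&kp);
    return 0;
}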
Solved.
With the help of the kprobe mechanism used in khook, I simply hook __x64_sys_inotify_add_watch and use user_path_at() to steal the dentry.
The code snippet is listed below, and my working solution is provided here.
#define IN_ONLYDIR      0x01000000  /* only watch the path if it is a directory */
#define IN_DONT_FOLLOW  0x02000000  /* don't follow a sym link */

//regs->(di, si, dx, r10), reference: arch/x86/include/asm/syscall_wrapper.h#L125
//SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname, u32, mask)
KHOOK_EXT(long, __x64_sys_inotify_add_watch, const struct pt_regs *);
static long khook___x64_sys_inotify_add_watch(const struct pt_regs *regs)
{
    int wd;
    struct path path;
    unsigned int flags = 0;
    char buf[PATH_MAX];
    char *pname;

    // decode the registers
    int fd = (int)regs->di;
    const char __user *pathname = (char __user *)regs->si;
    u32 mask = (u32)regs->dx;

    // do the original syscall
    wd = KHOOK_ORIGIN(__x64_sys_inotify_add_watch, regs);

    // get the pathname
    if (!(mask & IN_DONT_FOLLOW))
        flags |= LOOKUP_FOLLOW;
    if (mask & IN_ONLYDIR)
        flags |= LOOKUP_DIRECTORY;
    if (wd >= 0 && user_path_at(AT_FDCWD, pathname, flags, &path) == 0) {
        pname = dentry_path_raw(path.dentry, buf, PATH_MAX); // "pname" points into "buf[PATH_MAX]"
        path_put(&path);
        printk("%s, PID %d add (%d,%d): %s\n",
               current->comm, task_pid_nr(current), fd, wd, pname);
    }
    return wd;
}

VMX performance issue with rdtsc (no rdtsc exiting, using rdtsc offseting)

I am working on a Linux kernel module (VMM) to test Intel VMX, running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting, and to use TSC offsetting.
The VM then runs rdtsc to check the performance, like below.
static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(code)
                     : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    // RDTSC copies the contents of the 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

void i386mode_tests(void)
{
    u32 eax, ebx, ecx, edx;
    u32 i = 0;

    asm ("mov %%cr0, %%eax\n"
         "mov %%eax, %0 \n" : "=m" (eax) : :);
    my_printf("Guest CR0 = 0x%x\n", eax);

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx); // cpuid serializes the pipeline before the first sample
    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();
    my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}
The output is something like this,
Guest CR0 = 0x80050033
Rdtsc takes 2742
On the other hand, I wrote a host application that does the same thing as above:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(code)
                     : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    // RDTSC copies the contents of the 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
    uint64_t vm_tsc[2];
    uint32_t eax, ebx, ecx, edx, i;

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx); // cpuid serializes the pipeline before the first sample
    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();
    printf("Rdtsc takes %ld\n", vm_tsc[1] - vm_tsc[0]);

    return 0;
}
It outputs followings,
Rdtsc takes 2325
Running the two programs above for 40 iterations yields the following averages:
avag(VM) = 3188.000000
avag(host) = 2331.000000
The performance difference between running in the VM and on the host can NOT be ignored. It is NOT expected: my understanding is that with TSC offsetting + no RDTSC exiting, there should be little difference in rdtsc between the VM and the host.
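As an aside, one way to make each sample less sensitive to out-of-order execution is to fence rdtsc with lfence. This is a sketch of such a sampling function, offered as a suggestion rather than as part of the original setup:

#include <stdint.h>

/* read the TSC with lfence on both sides so surrounding instructions
 * cannot be reordered across the measurement */
static uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;

    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "lfence"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "memory");
    return ((uint64_t)hi << 32) | lo;
}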
Here are VMCS fields,
0xA501E97E = control_VMX_cpu_based
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0x0000000080050033 = control_CR0_shadow
In the last level of EPT PTEs, bit[5:3] = 6 (Write Back), bit[6] = 1. EPTP[2:0] = 6 (Write Back)
I tested on bare metal and in VMware, and got similar results.
I am wondering if there is anything I missed in this case.

How to restore stack frame in gcc?

I want to build my own checkpointing library. I'm able to save the stack frame to a file by calling checkpoint_here(stack pointer), and it can be restored later by calling the recover(stack pointer) function.
Here is my problem: I'm able to jump from recover(sp) back to main(), but the stack frame gets changed (stack pointer, frame pointer). So I want to jump from recover(sp) to the point in main just after checkpoint_here(sp) was called, retaining main()'s stack frame. I've tried setjmp/longjmp but can't get them working. Thanks in anticipation.
//jmp_buf env;

void *get_pc() { return __builtin_return_address(1); }

void checkpoint_here(register int *sp)
{
    //printf("%p\n", get_pc());
    void *pc;
    pc = get_pc(); // getting the program counter of the caller
    //printf("pc inside chk:%p\n", pc);
    size_t i;
    long size;
    //if (!setjmp(env)) {
    void *l = __builtin_frame_address(1); // frame pointer of the caller
    int fd = open("ckpt1.bin", O_WRONLY|O_CREAT, S_IWUSR|S_IRUSR|S_IRGRP);
    int mfd = open("map.bin", O_WRONLY|O_CREAT, S_IWUSR|S_IRUSR|S_IRGRP);
    size = (long)l - (long)sp;
    //printf("s->%ld\n", size);
    write(mfd, &size, sizeof(long)); // writing the size of the data to be written to file
    write(mfd, &pc, sizeof(long));   // writing the program counter of the caller
    write(fd, (char *)sp, (long)l - (long)sp); // writing local variables on the caller's stack frame
    close(fd);
    close(mfd);
    //}
}

void recover(register int *sp)
{
    //int dummy;
    long size;
    void *pc;
    //printf("old %p\n", sp);
    /*void *newsp = (void *)&dummy;
    printf("new %p old %p\n", newsp, sp);
    if (newsp >= (void *)sp)
        recover(sp);*/
    int fd = open("ckpt1.bin", O_RDONLY, 0644);
    int mfd = open("map.bin", O_RDONLY, 0644);
    read(mfd, &size, sizeof(long)); // reading the size of the data written
    read(mfd, &pc, sizeof(long));   // reading the program counter
    read(fd, (char *)sp, size);     // reading local variables
    close(mfd);
    close(fd);
    //printf("got->%ld\n", size);
    //longjmp(env, 1);
    void (*foo)(void) = pc;
    foo(); // trying to jump to main just after checkpoint_here() is called
    //asm volatile("jmp %0" : : "r" (pc));
}

int main(int argc, char **argv)
{
    register int *sp asm ("rsp");

    if (argc == 2) {
        if (strcmp(argv[1], "recover") == 0) {
            recover(sp); // restoring local variables
            exit(0);
        }
    }

    int a, b, c;
    float s, area;
    char x = 'a';

    printf("Enter the sides of triangle\n");
    //printf("\na->%p b->%p c->%p s->%p area->%p\n", &a, &b, &c, &s, &area);
    scanf("%d %d %d", &a, &b, &c);
    s = (a+b+c)/2.0;
    //printf("%p\n", get_pc());
    checkpoint_here(sp); // saving the stack
    //printf("here\n");
    //printf("nsp->%p\n", sp);
    area = (s*(s-a)*(s-b)*(s-c));
    printf("%d %d %d %f %f %d\n", a, b, c, s, area, x);
    printf("Area of triangle = %f\n", area);
    printf("%f\n", s);
    return 0;
}
You cannot do that in general.
You might try non-portable extended asm instructions (to restore %rsp and %rbp on x86-64). You could use longjmp (see setjmp(3) and longjmp(3)), since longjmp restores the stack pointer, assuming you understand the implementation details.
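For example, within a single run of the process, a minimal setjmp/longjmp sketch looks like this (note that it cannot survive across process runs, which is the harder part of your problem):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    if (setjmp(env) == 0) {
        /* first pass: the resume point is recorded in env */
        printf("checkpoint taken\n");
        longjmp(env, 1); /* restores the stack pointer and jumps back */
    }
    /* second pass: setjmp returned 1 after the longjmp */
    printf("resumed\n");
    return 0;
}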
Because of ASLR, the stack has a "random", non-reproducible location. In other words, if you start the same program twice, the stack pointer in main would differ between the runs. And in C, some stack frames contain pointers into other stack frames. See also this answer.
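You can observe this with a tiny program; run it twice and compare the printed addresses:

#include <stdio.h>

int main(void)
{
    int local;

    /* with ASLR enabled, this address differs from run to run */
    printf("address of a local: %p\n", (void *)&local);
    return 0;
}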
Read more about application checkpointing (see this) and study the source code of (or simply use) BLCR.
You could perhaps restrict the C code being used (e.g. if you generate the C code), and you might extend GCC using MELT for your needs. This is a significant amount of work.
BTW, MELT (internally) also generates C++ code with restricted stack frames that could easily be checkpointed. You could take that as a source of inspiration.
Read also about x86 calling conventions and garbage collection (since a precise GC has to scan local pointers, which is similar to your needs).

numa_police_memory

I'm debugging NUMACTL on a MIPS machine. In the numa_police_memory() API, we have:
void numa_police_memory(void *mem, size_t size)
{
    int pagesize = numa_pagesize_int();
    unsigned long i;

    for (i = 0; i < size; i += pagesize)
        asm volatile("" :: "r" (((volatile unsigned char *)mem)[i]));
}
It seems "asm volatile("" :: "r" (((volatile unsigned char *)mem)[i]));" is used for reading a VM so that all the memory applied by previous mmap will be allocated onto some specific physical memory. But how does this asm code work? I can't read assembly language! Why is the first double quote empty???
Thanks
Interestingly, there is no assembly code in this snippet at all, even though an asm statement is used. It contains a blank assembly "program", an empty list of outputs, and a list of inputs. The input specification forces ((volatile unsigned char *)mem)[i] to be loaded into a register. So all this bit of magic does is generate a load of the first byte of each page, pre-faulting the pages in.
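The same effect can be obtained without inline asm; here is a sketch (the name touch_pages is made up) that forces one read per page through a volatile pointer:

#include <stddef.h>

static void touch_pages(void *mem, size_t size, size_t pagesize)
{
    size_t i;

    for (i = 0; i < size; i += pagesize) {
        volatile unsigned char *p = (volatile unsigned char *)mem + i;
        (void)*p; /* the volatile read cannot be optimized away; it faults the page in */
    }
}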
