mprotect()-like functionality within the Linux kernel

I am in a Linux kernel module, and I allocate some memory with, say, vmalloc(). I want to make the memory have read, write, and execute permission. What is the clean and appropriate way of doing that? Basically, this is generally the equivalent of calling mprotect(), but in kernel space.
If I do the page walk with pgd_offset(), pud_offset(), pmd_offset(), pte_offset_map(), and then pte_mkwrite(), I run into linking errors (I tried it on 2.6.39). Also, doing the page walk seems like a hack, and there ought to be a cleaner and more appropriate method.
My kernel module will be a loadable module, so internal symbols are not available to me.
Thanks, in advance, for your guidance.

There is a good answer to this question here: https://unix.stackexchange.com/questions/450557/is-there-any-function-analogous-to-mprotect-in-the-linux-kernel.
asm-generic/set_memory.h:int set_memory_ro(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_rw(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_x(unsigned long addr, int numpages);
asm-generic/set_memory.h:int set_memory_nx(unsigned long addr, int numpages);
For x86 in v4.3 they are declared here: https://elixir.bootlin.com/linux/v4.3/source/arch/x86/include/asm/cacheflush.h#L47

Have you tried invoking do_mprotect() (the kernel function corresponding to mprotect()) directly?

Related

How can a LKM call a function of kernel driver?

I write a LKM (loadable kernel module), which needs to call functions in another kernel driver module under /linux/driver. I don't know how to import these functions into LKM. As the /lib/modules/linux/ (as make -C option) doesn't contain the header files of the kernel driver, I can't directly include them as the header files. Is there any way to do that?
Basically, you can only call a function from another module or the kernel if it's explicitly exported by the driver using the EXPORT macro in the source code.
Which kernel driver exactly were you thinking of? Can't you just copy the code into your driver?
a) As @stdcall points out regarding the macros: the macro names are actually EXPORT_SYMBOL and EXPORT_SYMBOL_GPL.
b) Regarding the particular call you'd like to use, I found this as the closest match in kernel version 4.6: arch/x86/include/asm/xen/hypercall.h
static inline long
privcmd_call(unsigned call,
             unsigned long a1, unsigned long a2,
             unsigned long a3, unsigned long a4,
             unsigned long a5)
...
The call is 'static' and not exported; hence you cannot use it in an LKM.
c) As @stdcall points out, you could try copying it, but in my experience this isn't always going to work out, as there may be too many dependencies.
Some things are deliberately meant to be done only in the kernel tree and not as kernel modules.

Linux device driver - memory mapped I/O example discussion

I have gone through the following topic and I still have some questions.
ioread32 followed by iowrite32 not giving same value
In the link, where can I get my base, which is defined as 0xfed00000 in the post?
What should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
What should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
With the Makefile in place and the kernel module built, I should use insmod and then dmesg to check whether the code works as I expect. Is this correct?
In that case, should I add iounmap(virtual_base); before return 0; in the source?
Thanks
In the link, where can I get my base which is defined as 0xfed00000 in the post ?
It's the base (physical) address of the peripheral's registers.
If the peripheral is a discrete chip on the board, then consult the board documentation.
If the peripheral is embedded in a SoC, then consult the memory map in the SoC datasheet.
what should I put for the second parameter in
void request_mem_region(unsigned long start, unsigned long len,char *name);
what should I put for the second parameter in
void *ioremap_nocache(unsigned long phys_addr, unsigned long size);
These two routines should be called with the same first and second parameters.
The length/size is the number of bytes the peripheral's register set occupies.
Sometimes the entire memory region to the next peripheral is specified.
By having the Makefile and generating the kernel module, I should use the insmod and then dmesg to check if the code works as I expect, is this correct ?
A judicious sprinkling of printk() statements is the tried & true method of testing a Linux kernel driver/module.
Linux also has the kdb and kgdb kernel debuggers.
In the case, should I add iounmap(virtual_base); before return 0; in the source ?
Do not copy that poorly written example of init code.
If ioremap() is performed in a driver's probe() (or other initialization) routine, then the iounmap() should be in the probe's error exit sequence and in the driver's remove() (or the complementary to init) routine.
There are numerous examples to study in the Linux kernel source. Use an online Linux cross reference such as http://lxr.free-electrons.com/source/
Note that almost all Linux drivers use iounmap() two or more times.

How do I use performance counters inside of the kernel?

I want to access performance counters inside the kernel. I found many ways to use performance counters in user space, but can you tell me some way to use those in kernel space.
Please don't specify tool name, I want to write my own code, preferably a kernel module. I am using Ubuntu with kernel 3.18.1.
http://www.cise.ufl.edu/~sb3/files/pmc.pdf
http://www.cs.inf.ethz.ch/stricker/lab/doc/intel-part4.pdf
The first pdf contains description on how to use pmc.
The second contains the address of perfeventsel0 and perfeventsel1.
I've shown an example below. You'll need to set the event number and umask as per your requirements.
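/* Program the event select register so that counter 0 starts counting */
void SetUpEvent(void)
{
	unsigned int reg_addr = 0x186;       /* IA32_PERFEVTSEL0 */
	unsigned int event_no = 0x0024;      /* event select, bits 0-7 */
	unsigned int umask = 0x3F00;         /* unit mask, bits 8-15 */
	unsigned int enable_bits = 0x430000; /* USR (16), OS (17), EN (22) */
	unsigned int event = event_no | umask | enable_bits;
	/* wrmsr must run at ring 0: ECX = MSR address, EDX:EAX = value */
	__asm__ volatile ("wrmsr" : : "c"(reg_addr), "a"(event), "d"(0x00));
}
/* Read the performance monitor counter */
unsigned long long ReadCounter(void)
{
	unsigned int eax_low, edx_high;
	unsigned int reg_addr = 0xC1;        /* IA32_PMC0 */
	/* rdmsr returns the low half in EAX and the high half in EDX */
	__asm__ volatile ("rdmsr" : "=a"(eax_low), "=d"(edx_high) : "c"(reg_addr));
	/* widen before shifting so this is also correct on 32-bit builds */
	return (unsigned long long)eax_low | ((unsigned long long)edx_high << 32);
}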
You should check whether your CPU and other hardware support what you need. Try looking into the oprofile source code. It has a kernel module and a userspace API. You can, for example, take the interesting parts of the oprofile kernel module and use them in your own module. I guess your module should have several readers or listeners with circular buffers for keeping events. You can also look inside linux/drivers/oprofile and the corresponding linux/arch/.../oprofile. Inside make menuconfig you can configure it as a module or built-in and add additional timers. The available events and counters are listed under oprofile/events/ in the oprofile tool (TLB_MISS, CPU_CYCLES, CYCLES_DATA_STALL, ...).
ARM Performance monitoring register
Under linux/arch/arm64/kernel/perf_regs.c you can find ARM-specific details.

How do I find the ptep for a given address?

I'm trying to write a function that write-protects every pte in a given vm_area_struct. What is the function that gives me the ptep for a given address? I have:
pte_t *ptep;
for (addr = start; addr < end; addr += PAGE_SIZE) {
    ptep = WHATS_THIS_FUNCTION_CALLED(addr);
    ptep_set_wrprotect(mm, addr, ptep);
}
What's the WHATS_THIS_FUNCTION_CALLED called?
The short answer to your question is to use __get_locked_pte. However, I would advise against it, since there are much better ways (more efficient and fairer in terms of resource contention) to accomplish your goal.
In Linux, the typical idiom for traversing page tables is a nested for loop four levels deep (four is the number of page table levels Linux supports). For examples, see copy_page_range and apply_to_page_range in mm/memory.c. In fact, if you look closely at copy_page_range, it is called from dup_mmap in kernel/fork.c when forking. It essentially operates on an entire vm_area_struct.
You can replicate the idiom used in either of those functions. There are some caveats, however. For example, copy_page_range fully supports transparent hugepages (2.6.38) by using a completely separate copy_huge_pmd inside copy_pmd_range. Unless you want to write two separate functions (one for normal pages and one for transparent huge pages), see "Graceful fallback" in Documentation/vm/transhuge.txt.
The point is that virtual memory in Linux is very complicated, so be sure to completely understand every possible use case. follow_page in mm/memory.c should demonstrate how to cover all of your bases.
I believe you are looking for either the function virt_to_pte as defined here or your own re-implementation of it.
This function uses pte_offset_kernel(pmd_t *dir, unsigned long address), which takes a pmd (Page mid-level Directory) structure and an address, though one could also use pte_offset(pmd_t *dir, unsigned long address).
See Linux Device Drivers 3rd Edition Chapter 15 or Linux Device Drivers, 2nd Edition Chapter 13 mmap and DMA for further references.

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop a test for whether a kernel module clears a cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
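#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
/* Evict the cache line containing p from the cache hierarchy. */
static inline void
clflush(volatile void *p)
{
	asm volatile ("clflush (%0)" :: "r"(p) : "memory");
}
/* Read the 64-bit time-stamp counter (low half in EAX, high half in EDX). */
static inline uint64_t
rdtsc(void)
{
	uint32_t a, d;
	asm volatile ("rdtsc" : "=a" (a), "=d" (d));
	return a | ((uint64_t)d << 32);
}
/* Lives in BSS; volatile so that the timed read is not optimized away. */
volatile int i;
/* Time a single read of i and print the elapsed ticks. */
static inline void
test(void)
{
	uint64_t start, end;
	volatile int j;
	start = rdtsc();
	j = i;
	end = rdtsc();
	printf("took %" PRIu64 " ticks\n", end - start);
}
int
main(void)
{
	test();
	test();
	printf("flush: ");
	clflush(&i);
	test();
	test();
	return 0;
}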
I don't know of any generic command to get the cache state, but there are ways:
1. I guess this is the easiest: if you have your kernel module, just disassemble it and look for cache invalidation/flushing instructions (at the moment just three come to mind: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an 80386. The problem is that there are many different processors with different extensions and features. E.g. the newest Intel series includes performance/profiling registers for the cache system, which you can use to evaluate cache misses/hits/number of transfers and similar.
3. Similar to 2, and very dependent on the system you have: in a multiprocessor configuration you could watch the first processor's cache through the cache coherence protocol (MESI) from the second.
You mentioned WBINVD: AFAIK that will always flush all cache lines completely.
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Resources