Walking page tables of a process in Linux - linux

i'm trying to navigate the page tables for a process in linux. In a kernel module i realized the following function:
static struct page *walk_page_table(unsigned long addr)
{
pgd_t *pgd;
pte_t *ptep, pte;
pud_t *pud;
pmd_t *pmd;
struct page *page = NULL;
struct mm_struct *mm = current->mm;
pgd = pgd_offset(mm, addr);
if (pgd_none(*pgd) || pgd_bad(*pgd))
goto out;
printk(KERN_NOTICE "Valid pgd");
pud = pud_offset(pgd, addr);
if (pud_none(*pud) || pud_bad(*pud))
goto out;
printk(KERN_NOTICE "Valid pud");
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd) || pmd_bad(*pmd))
goto out;
printk(KERN_NOTICE "Valid pmd");
ptep = pte_offset_map(pmd, addr);
if (!ptep)
goto out;
pte = *ptep;
page = pte_page(pte);
if (page)
printk(KERN_INFO "page frame struct is # %p", page);
out:
return page;
}
This function is called from the ioctl and addr is a virtual address in process address space:
static int my_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long addr)
{
struct page *page = walk_page_table(addr);
...
return 0;
}
The strange thing is that calling ioctl in a user space process, this segfaults...but it seems that the way i'm looking for the page table entry is correct because with dmesg i obtain for example for each ioctl call:
[ 1721.437104] Valid pgd
[ 1721.437108] Valid pud
[ 1721.437108] Valid pmd
[ 1721.437110] page frame struct is # c17d9b80
So why the process can't complete correcly the `ioctl' call? Maybe i have to lock something before navigating the page tables?
I'm working with kernel 2.6.35-22 and three levels page tables.
Thank you all!

pte_unmap(ptep);
is missing just before the label out. Try to change the code in this way:
...
page = pte_page(pte);
if (page)
printk(KERN_INFO "page frame struct is # %p", page);
pte_unmap(ptep);
out:

Look at /proc/<pid>/smaps filesystem, you can see the userspace memory:
cat smaps
bfa60000-bfa81000 rw-p 00000000 00:00 0 [stack]
Size: 136 kB
Rss: 44 kB
and how it is printed is via fs/proc/task_mmu.c (from kernel source):
http://lxr.linux.no/linux+v3.0.4/fs/proc/task_mmu.c
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
show_map_vma(m, vma.....);
seq_printf(m,
"Size: %8lu kB\n"
"Rss: %8lu kB\n"
"Pss: %8lu kB\n"
And your function is somewhat like that of walk_page_range(). Looking into walk_page_range() you can see that the smaps_walk structure is not supposed to change while it is walking:
http://lxr.linux.no/linux+v3.0.4/mm/pagewalk.c#L153
For eg:
}
201 if (walk->pgd_entry)
202 err = walk->pgd_entry(pgd, addr, next, walk);
203 if (!err &&
204 (walk->pud_entry || walk->pmd_entry || walk->pte_entry
If memory contents were to change, then all the above checking may get inconsistent.
All these just mean that you have to lock the mmap_sem when walking the page table:
if (!down_read_trylock(&mm->mmap_sem)) {
/*
* Activate page so shrink_inactive_list is unlikely to unmap
* its ptes while lock is dropped, so swapoff can make progress.
*/
activate_page(page);
unlock_page(page);
down_read(&mm->mmap_sem);
lock_page(page);
}
and then followed by unlocking:
up_read(&mm->mmap_sem);
And of course, when you issue printk() of the pagetable inside your kernel module, the kernel module is running in the process context of your insmod process (just printk the "comm" and you can see "insmod") meaning the mmap_sem is lock, it also mean the process is NOT running, and thus there is no console output till the process is completed (all printk() output goes to memory only).
Sounds logical?

Related

Could I/O memory access be used inside ISR under Linux (ARM)?

I'm creating driver for communication with FPGA under Linux. FPGA is connected via GPMC interface. When I tested read/write from driver context - everithing works perfectly. But the problem is that I need to read some address on interrupt. So I created interrupt handler, registred it and put iomemory reading in it (readw function). But when interrupt is fired - only zero's are readed. I tested every part of driver from the top to the bottom and it seems like the problem is in iomemory access inside ISR. When I replaced io access with constant value - it successfully passed to user-level application.
ARM version: armv7a (Cortex ARM-A8 (DM3730))
Compiler: CodeSourcery 2014.05
Here is some code from driver which represents performed actions:
// Request physical memory region for FPGA address IO
void* uni_PhysMem_request(const unsigned long addr, const unsigned long size) {
// Handle to be returned
void* handle = NULL;
// Check if memory region successfully requested (mapped to module)
if (!request_mem_region(addr, size, moduleName)) {
printk(KERN_ERR "\t\t\t\t%s() failed to request_mem_region(0x%p, %lu)\n", __func__, (void*)addr, size);
}
// Remap physical memory
if (!(handle = ioremap(addr, size))) {
printk(KERN_ERR "\t\t\t\t%s() failed to ioremap(0x%p, %lu)\n", __func__, (void*)addr, size);
}
// Return virtual address;
return handle;
}
// ...
// ISR
static irqreturn_t uni_IRQ_handler(int irq, void *dev_id) {
size_t readed = 0;
if (irq == irqNumber) {
printk(KERN_DEBUG "\t\t\t\tIRQ handling...\n");
printk(KERN_DEBUG "\t\t\t\tGPIO %d pin is %s\n", irqGPIOPin, ((gpio_get_value(irqGPIOPin) == 0) ? "LOW" : "HIGH"));
// gUniAddr is a struct which holds GPMC remapped virtual address (from uni_PhysMem_request), offset and read size
if ((readed = uni_ReadBuffer_IRQ(gUniAddr.gpmc.addr, gUniAddr.gpmc.offset, gUniAddr.size)) < 0) {
printk(KERN_ERR "\t\t\t\tunable to read data\n");
}
else {
printk(KERN_INFO "\t\t\t\tdata readed success (%zu bytes)\n", readed);
}
}
return IRQ_HANDLED;
}
// ...
// Read buffer by IRQ
ssize_t uni_ReadBuffer_IRQ(void* physAddr, unsigned long physOffset, size_t buffSize) {
size_t size = 0;
size_t i;
for (i = 0; i < buffSize; i += 2) {
size += uni_RB_write(readw(physAddr + physOffset)); // Here readed value sent to ring buffer. When "readw" replaced with any constant - everything OK
}
return size;
}
Looks like the problem was in code optimizations. I changed uni_RB_write function to pass physical address and data size, also read now performed via ioread16_rep function. So now everything works just fine.

Arm64 Linux Page Table Walk

Currently I'm developing some research-related programs and I need to find the pte of some specific addresses. My development environment is Juno r1 board (CPUs are A53 and A57 ) and it's running arm64 Linux kernel.
I use some typical page table walk codes like this:
int find_physical_pte(void *addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
unsigned long long address;
address = (unsigned long long)addr;
pgd = pgd_offset(current->mm, address);
printk(KERN_INFO "\npgd is: %p\n", (void *)pgd);
printk(KERN_INFO "pgd value: %llx\n", *pgd);
if (pgd_none(*pgd) || pgd_bad(*pgd))
return -1;
pud = pud_offset(pgd, address);
printk(KERN_INFO "\npud is: %p\n", (void *)pud);
printk(KERN_INFO "pud value: %llx\n", (*pud).pgd);
if (pud_none(*pud) || pud_bad(*pud))
return -2;
pmd = pmd_offset(pud, address);
printk(KERN_INFO "\npmd is: %p\n", (void *)pmd);
printk(KERN_INFO "pmd value: %llx\n",*pmd);
if (pmd_none(*pmd) || pmd_bad(*pmd))
return -3;
ptep = pte_offset_kernel(pmd, address);
printk(KERN_INFO "\npte is: %p\n", (void *)ptep);
printk(KERN_INFO "pte value: %llx\n",*ptep);
if (!ptep)
return -4;
return 1;
}
However, when the program checks the pte for the address(0xffffffc0008b2000), it always returns an empty pmd.
My guess is that I got the wrong pgd in the first step. I saw Tims Notes said that using current->mm only could get the pgd of TTBR0 (user space pgd) while the address I checked is a kernel space address so I should try to get the pgd of TTBR1.
So my question is: If I want to get the pte of a kernel space address, can I use current->mm to get the pgd?
If I can't, is there anything else I could try instead?
Any suggestion is welcome! Thank you.
Simon
I finally solved the problem.
Actually, my code is correct. The only part I missed is a page table entry check.
According to the page table design of ARMv8, ARM uses 4 levels page table for 4kb granule case. Each level (level 0-3 defined in the link) is implemented as pgd, pud, pmd, and ptep in Linux code.
In the ARM architecture, each level can be either block entry or the table entry (see the AArch64 Descriptor Format Section in the link).
If the memory address belongs to a 4kb table entry, then it needs to be traced down till level 3 entry (ptep). However, for the address belongs to a larger chunk, the corresponding table entry may save in the pgd, pud, or pmd level.
By checking the last 2 bits of the entry in each level, you know it's block entry or not and you only keep tracing down for the block entry.
Here is how to improve my code above:
Retrieving the descriptor based on the page table pointer desc = *pgd and then checking the last 2 bits of the descriptor.
If the descriptor is a block entry (0x01) then you need to extract the lower level entry as my code shows above.
If you already get the table entry (0x11) at any level, then you can stop there and translate the VA to PA based on the descriptor desc you just get.
int find_physical_pte(void *addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
unsigned long long address;
address = (unsigned long long)addr;
pgd = pgd_offset(current->mm, address);
printk(KERN_INFO "\npgd is: %p\n", (void *)pgd);
printk(KERN_INFO "pgd value: %llx\n", *pgd);
if (pgd_none(*pgd) || pgd_bad(*pgd))
return -1;
//check if (*pgd) is a table entry. Exit here if you get the table entry.
pud = pud_offset(pgd, address);
printk(KERN_INFO "\npud is: %p\n", (void *)pud);
printk(KERN_INFO "pud value: %llx\n", (*pud).pgd);
if (pud_none(*pud) || pud_bad(*pud))
return -2;
//check if (*pud) is a table entry. Exit here if you get the table entry.
pmd = pmd_offset(pud, address);
printk(KERN_INFO "\npmd is: %p\n", (void *)pmd);
printk(KERN_INFO "pmd value: %llx\n",*pmd);
if (pmd_none(*pmd) || pmd_bad(*pmd))
return -3;
//check if (*pmd) is a table entry. Exit here if you get the table entry.
ptep = pte_offset_kernel(pmd, address);
printk(KERN_INFO "\npte is: %p\n", (void *)ptep);
printk(KERN_INFO "pte value: %llx\n",*ptep);
if (!ptep)
return -4;
return 1;
}
I think the problem you are having is that you are passing the struct mm_struct * pointer of the current process. But the address you are passing if from the kernel virtual address space. You need to pass the mm pointer to the init process (&init_mm):
pgd = pgd_offset(&init_mm, address);
I think the rest should be fine, but I haven't tested it. You can also look at how it is done in the kernel in the file arch/arm64/mm/dump.c

Sharing memory from kernel to user space by disabling the "Write Protect Bit (CR0:16)"

If I have my own kernel space memory manager, is it theoretically possible to share a memory pointer allocated by a kernel module to a user space application by disabling the "Write Protect Bit (CR0:16)" using read_cr0() and write_cr0() thus allowing read/write access to the entire system memory space?
(This is for embedded devices where we "trust" our own processes)
I just experimented on my own computer but to answer my own question: NO, it is not possible to share the kernel's memory directly with a user space process.
I did try to write my own kernel module based on Tempesta's stack-like region-based memory manager (pool.c):
[...]
static int __init
tfw_pool_init(void)
{
printk(KERN_ALERT "HIJACK INIT\n");
write_cr0 (read_cr0 () & (~ 0x10000));
pg_cache = alloc_percpu(unsigned long [TFW_POOL_PGCACHE_SZ]);
if (pg_cache == NULL)
return -ENOMEM;
printk(KERN_NOTICE "__tfw_pool_new = %p\n", __tfw_pool_new);
printk(KERN_NOTICE "tfw_pool_alloc = %p\n", tfw_pool_alloc);
printk(KERN_NOTICE "tfw_pool_realloc = %p\n", tfw_pool_realloc);
printk(KERN_NOTICE "tfw_pool_free = %p\n", tfw_pool_free);
printk(KERN_NOTICE "tfw_pool_destroy = %p\n", tfw_pool_destroy);
return 0;
}
static void __exit
tfw_pool_exit(void)
{
free_percpu(pg_cache);
write_cr0 (read_cr0 () | 0x10000);
printk(KERN_ALERT "MODULE EXIT\n");
}
module_init(tfw_pool_init);
module_exit(tfw_pool_exit);
MODULE_LICENSE("GPL");
And not only calling the functions printed out segfaults but my system became highly unstable after loading the module so DO NOT TRY THIS AT HOME.

flush_cache_range() and flush_tlb_range() do not seem to work

Here is what I did:
A user space process uses malloc() to allocate memory on the heap and fills it with a specific pattern of characters and then spells out the address returned by the malloc().
The process id and the address of the memory chunk are passed to a kernel module that looks like this:
int init_module(void) {
int res = 0;
struct page *data_page;
struct task_struct *task = NULL;
struct vm_area_struct *next_vma;
struct mm_struct *mm;
task = pid_task(find_vpid(pid), PIDTYPE_PID);
if (pid != -1)
target_process_id = pid;
if (!task) {
printk("Could not find the task struct for process id %d\n", pid);
return 0;
} else {
printk("Found the task <%s>\n", task->comm);
}
mm = task->mm;
if (!mm) {
printk("Could not find the mmap struct for process id %d\n", pid);
return 0;
}
next_vma = find_vma(mm, addr);
down_read(&task->mm->mmap_sem);
res = get_user_pages(task, task->mm, addr, 1, 1, 1, &data_page, NULL);
if (res != 1) {
printk(KERN_INFO "get_user_pages error\n");
up_read(&task->mm->mmap_sem);
return 0;
} else {
printk("Found vma struct and it starts at: %lu\n", next_vma->vm_start);
}
flush_cache_range(next_vma,next_vma->vm_start,next_vma->vm_end);
flush_tlb_range(next_vma,next_vma->vm_start,next_vma->vm_end);
up_read(&task->mm->mmap_sem);
return 0;
}
I added printk() statement to the handle_mm_fault() function in the Linux kernel to track page faults caused by target_process_id (3rd line of code after variable definitions above). Something like this:
if (unlikely(current->pid == target_process_id))
printk("Target process <%d> generated a page fault at address %lu\n", current->pid, address);
Now, what I noticed is that the last printk() statement does not catch anything.
The function init_module is the initialization function for a kernel module. It is inserted into the running kernel using insmod...using the command insmod module.ko pid=<processId> addr=<address>
Any idea what might going wrong?

mmap in linux-kernel with vm_insert_page()

I am trying to understand mmap() functionality which can map page-by-page to the user space address.I am mapping these pages using the vm_insert_page(). This is returning with out any error. But when from the user space tried to write into the mapped memory, i'm getting the null pointer dereference. I am posting the code below. The system is x86,32 bit system.
The mmap() anf fault() function is posted.
static int mmap(struct file *filp , struct vm_area_struct *vma)
{
vma->vm_ops = &vm_ops;
vma->vm_flags |= VM_MIXEDMAP;
vm_open(vma);
return 0;
}
static int vm_fault(struct vm_area_struct *vma,struct vm_fault *vmf)
{
struct page *page=NULL;
page = alloc_page(GFP_HIGHUSER);//Both the alloc_pages() are working fine.
if(!page)
{
printk("The allocation of pagee memory is failed\n");
return -VM_FAULT_OOM;
}
// get_page(page);
if(vm_insert_page(vma,vmf->virtual_address,(page)))
printk("Thepage table creation is failed\n");
printk("The vm_insert of page is done\n");
return 0;
}
Thanks in advance.

Resources