I often see Freeing unused kernel memory: xxxK (......) from dmesg, but I can never find this log from kernel source code with the help of grep/rg.
Where does it come from?
That line of text does not exist as a single, complete string, hence your failure to grep it.
This all gets rolling when free_initmem() in init/main.c calls free_initmem_default().
The line in question originates from free_initmem_default() in include/linux/mm.h:
/*
* Default method to free all the __init memory into the buddy system.
* The freed pages will be poisoned with pattern "poison" if it's within
* range [0, UCHAR_MAX].
* Return pages freed into the buddy system.
*/
static inline unsigned long free_initmem_default(int poison)
{
extern char __init_begin[], __init_end[];
return free_reserved_area(&__init_begin, &__init_end,
poison, "unused kernel");
}
The rest of that text is from free_reserved_area() in mm/page_alloc.c:
unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
{
void *pos;
unsigned long pages = 0;
...
if (pages && s)
pr_info("Freeing %s memory: %ldK\n",
s, pages << (PAGE_SHIFT - 10));
return pages;
}
(Code excerpts from v5.2)
From my answer here:
Some functions in the kernel source code are marked with __init because they run only once during initialization. This instructs the compiler to mark a function in a special way. The linker collects all such functions and puts them at the end of the final binary file.
Example method signature:
static int __init clk_disable_unused(void)
{
// some code
}
When the kernel starts, this code runs only once during initialization. After it runs, the kernel can free this memory to reuse it and you will see the kernel
message:
Freeing unused kernel memory: 108k freed
Related
I am trying to read /proc//pagemap in a kernel driver like this:
uint64_t page;
uint64_t va = 0x7FFD1BF46530;`
loff_t pos = va / PAGE_SIZE * sizeof(uint64_t);
struct file * filp = filp_open("/proc/19030/pagemap", O_RDONLY, 0);
ssize_t nread = kernel_read(filp, &page, sizeof(page), &pos);
I get error -22 in nread (EINVAL, invalid argument) and
"kernel read not supported for file /19030/pagemap (pid: 19030 comm: tester)" in dmesg.
0x7FFD1BF46530 is a virtual address in a user space process pid 19030 (tester). I assume that pos is the offset into the file like in lseek64.
Doing the precise same thing as sudo with same values in a user space process, i.e. reading /proc/19030/pagemap works fine and produces a correct physical address.
The actual thing I am trying to do here is to find the physical address of a user space virtual address. I need the physical address for a device DMA transfer operation and a user space app needs to access this memory. This app allocates 1GB DMA memory with anonymous mmap from THP (Transparent Huge Pages). And I am trying to avoid the need for sudo by reading /proc//pagemap in a kernel driver via ioctl instead.
I would be happy to allocate huge page DMA memory in the driver but don't know how to do that. dma_alloc_coherent is limited to max 4MB allocations. Is there a way to get those allocated as continuous physical memory? I need hundreds of MB or many GB of DMA memory.
Problem with anonymous mmap is that it can only allocate max 1GB huge page as physically continuous memory. Allocating more works but the memory is not physically continuous and unusable for DMA.
Any good ideas or alternative ways of allocating huge pages as DMA memory?
Tried reading file /proc//pagemap in a kernel driver. Expected same results as when reading the file in a user space application which works ok.
"kernel read not supported for file …"
Indeed, as we see in __kernel_read()
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported(file, "read");
it fails if f_op->read_iter isn't or f_op->read is wired up (implemented), which is both the case for a pagemap file.
You could try pagemap_read() instead. – not feasible for reasons in the comments
When I had the problem of getting the physical address for a virtual address in a driver, I included and copied some kernel code (not that I recommend this, but I saw no other solution); here's an extract.
static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr
, unsigned long sz)
{ return NULL; }
void p4d_clear_bad(p4d_t *p4d) { p4d_ERROR(*p4d); p4d_clear(p4d); }
#include "mm/pagewalk.c"
static int pte(pte_t *pte, unsigned long addr
, unsigned long next, struct mm_walk *walk)
{
*(pte_t **)walk->private = pte;
return 1;
}
/* Scan the real Linux page tables and return a PTE pointer for
* a virtual address in a context.
* Returns true (1) if PTE was found, zero otherwise. The pointer to
* the PTE pointer is unmodified if PTE is not found.
*/
int
get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp)
{
struct mm_walk walk = { .pte_entry = pte, .mm = mm, .private = ptep };
return walk_page_range(addr, addr+PAGE_SIZE, &walk);
}
/* Find physical address for this virtual address. Normally used by
* I/O functions, but anyone can call it.
*/
static inline unsigned long iopa(unsigned long addr)
{
unsigned long pa;
/* I don't know why this won't work on PMacs or CHRP. It
* appears there is some bug, or there is some implicit
* mapping done not properly represented by BATs or in page
* tables.......I am actively working on resolving this, but
* can't hold up other stuff. -- Dan
*/
pte_t *pte;
struct mm_struct *mm;
#if 0
/* Check the BATs */
phys_addr_t v_mapped_by_bats(unsigned long va);
pa = v_mapped_by_bats(addr);
if (pa)
return pa;
#endif
/* Allow mapping of user addresses (within the thread)
* for DMA if necessary.
*/
if (addr < TASK_SIZE)
mm = current->mm;
else
mm = &init_mm;
ATTENTION: I needed the current address space.
You'd have to use mm = file->private_data instead.
pa = 0;
if (get_pteptr(mm, addr, &pte, NULL))
pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK);
return(pa);
}
I am studying stack guarding in Linux. I found that the Linux kernel VMAP_STACK config parameter is using the guard page mechanism along with vmalloc() to provide stack guarding.
I am trying to find a way to check how this guard page is working in Linux kernel. I googled and checked the kernel code, but did NOT find out the codes.
A further question is how to verify the guarded stack.
I had a kernel module to underrun/overflow a process's kernel stack, like this
static void shoot_kernel_stack(void)
{
unsigned char *ptr = task_stack_page(current);
unsigned char *tmp = NULL;
tmp = ptr + THREAD_SIZE + PAGE_SIZE + 0;
// tmp -= 0x100;
memset(tmp, 0xB4, 0x10); // Underrun
}
I really get the kernel panic like below,
[ 8006.358354] BUG: stack guard page was hit at 00000000e8dc2d98 (stack is 00000000cff0f921..00000000653b24a9)
[ 8006.361276] kernel stack overflow (page fault): 0000 [#1] SMP PTI
Is this the right way to verify the guard page?
The VMAP_STACK Linux feature is used to map the kernel stack of the threads into VMA. By virtually mapping stack, the underlying physical pages don't need to be contiguous. It is possible to detect cross-page overflows by adding guard pages. As the VMA are followed by a guard (unless the VM_NO_GUARD flag is passed at allocation time), the stacks allocated in those area benefits from it for stack overflow detection.
ALLOCATION
The thread stacks are allocated at thread creation time with alloc_thread_stack_node() in kernel/fork.c. When VMAP_STACK is activated, the stacks are cached because according to the comments in the source code:
vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
flush. Try to minimize the number of calls by caching stacks.
The kernel stack size is THREAD_SIZE (equal to 4 pages on x86_64 platforms). The source code of the allocation invoked at thread creation time is:
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
void *stack;
int i;
[...] // <----- Part which gets a previously cached stack. If no stack in cache
// the following is run to allocate a brand new stack:
/*
* Allocated stacks are cached and later reused by new threads,
* so memcg accounting is performed manually on assigning/releasing
* stacks to tasks. Drop __GFP_ACCOUNT.
*/
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
VMALLOC_START, VMALLOC_END,
THREADINFO_GFP & ~__GFP_ACCOUNT,
PAGE_KERNEL,
0, node, __builtin_return_address(0));
[...]
__vmalloc_node_range() is defined in mm/vmalloc.c. This calls __get_vm_area_node(). As the latter is not passed the VM_NO_GUARD flags, an additional page is added at the end of the allocated area. This is the guard page of the VMA:
static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long flags, unsigned long start,
unsigned long end, int node, gfp_t gfp_mask, const void *caller)
{
struct vmap_area *va;
struct vm_struct *area;
BUG_ON(in_interrupt());
size = PAGE_ALIGN(size);
if (unlikely(!size))
return NULL;
if (flags & VM_IOREMAP)
align = 1ul << clamp_t(int, get_count_order_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);
area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
return NULL;
if (!(flags & VM_NO_GUARD)) // <----- A GUARD PAGE IS ADDED
size += PAGE_SIZE;
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
kfree(area);
return NULL;
}
setup_vmalloc_vm(area, va, flags, caller);
return area;
}
OVERFLOW MANAGEMENT
The stack overflow management is architecture dependent (i.e. source code located in arch/...). The links referenced below provide some pointers on some architecture dependent implementations.
For x86_64 platform, the overflow check is done upon the page fault interruption which triggers the following chain of function calls: do_page_fault()->__do_page_fault()->do_kern_addr_fault()->bad_area_nosemaphore()->no_context() function defined in arch/x86/mm/fault.c. In no_context(), there is a part dedicated to VMAP_STACK management for the detection of the stack under/overflow:
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
unsigned long address, int signal, int si_code)
{
struct task_struct *tsk = current;
unsigned long flags;
int sig;
[...]
#ifdef CONFIG_VMAP_STACK
/*
* Stack overflow? During boot, we can fault near the initial
* stack in the direct map, but that's not an overflow -- check
* that we're in vmalloc space to avoid this.
*/
if (is_vmalloc_addr((void *)address) &&
(((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
/*
* We're likely to be running with very little stack space
* left. It's plausible that we'd hit this condition but
* double-fault even before we get this far, in which case
* we're fine: the double-fault handler will deal with it.
*
* We don't want to make it all the way into the oops code
* and then double-fault, though, because we're likely to
* break the console driver and lose most of the stack dump.
*/
asm volatile ("movq %[stack], %%rsp\n\t"
"call handle_stack_overflow\n\t"
"1: jmp 1b"
: ASM_CALL_CONSTRAINT
: "D" ("kernel stack overflow (page fault)"),
"S" (regs), "d" (address),
[stack] "rm" (stack));
unreachable();
}
#endif
[...]
}
In the above code, when a stack under/overflow is detected, the handle_stack_overflow() function defined in arch/x86/kernel/traps.c) is called:
#ifdef CONFIG_VMAP_STACK
__visible void __noreturn handle_stack_overflow(const char *message,
struct pt_regs *regs,
unsigned long fault_address)
{
printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
(void *)fault_address, current->stack,
(char *)current->stack + THREAD_SIZE - 1);
die(message, regs, 0);
/* Be absolutely certain we don't return. */
panic("%s", message);
}
#endif
The example error message "BUG: stack guard page was hit at..." pointed out in the question comes from the above handle_stack_overflow() function.
FROM YOUR EXAMPLE MODULE
When VMAP_STACK is defined, the stack_vm_area field of the task descriptor appears and is set with the VMA address associated to the stack. From there, it is possible to grab interesting information:
struct task_struct *task;
#ifdef CONFIG_VMAP_STACK
struct vm_struct *vm;
#endif // CONFIG_VMAP_STACK
task = current;
printk("\tKernel stack: 0x%lx\n", (unsigned long)(task->stack));
printk("\tStack end magic: 0x%lx\n", *(unsigned long *)(task->stack));
#ifdef CONFIG_VMAP_STACK
vm = task->stack_vm_area;
printk("\tstack_vm_area->addr = 0x%lx\n", (unsigned long)(vm->addr));
printk("\tstack_vm_area->nr_pages = %u\n", vm->nr_pages);
printk("\tstack_vm_area->size = %lu\n", vm->size);
#endif // CONFIG_VMAP_STACK
printk("\tLocal var in stack: 0x%lx\n", (unsigned long)(&task));
The nr_pages field is the number of pages without the additional guard page. The last unsigned long at the top of the stack is set with STACK_END_MAGIC defined in include/uapi/linux/magic.h as:
#define STACK_END_MAGIC 0x57AC6E9D
REFERENCES:
Preventing stack guard-page hopping
arm64: VMAP_STACK support
CONFIG_VMAP_STACK: Use a virtually-mapped stack
Linux 4.9 On x86_64 To Support Vmapped Stacks
A Decade of Linux Kernel Vulnerabilities
I am writing a kernel module to get the list of pids with their complete process name. The proc_pid_cmdline() gives the complete process name;using same function /proc/*/cmdline gets the complete process name. (struct task_struct) -> comm gives hint of what process it is, but not the complete path.
I have included the function name, but it gives error because it does not know where to find the function.
How to use proc_pid_cmdline() in a module ?
You are not supposed to call proc_pid_cmdline().
It is a non-public function in fs/proc/base.c:
static int proc_pid_cmdline(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
However, what it does is simple:
get_cmdline(task, m->buf, PAGE_SIZE);
That is not likely to return the full path though and it will not be possible to determine the full path in every case. The arg[0] value may be overwritten, the file could be deleted or moved, etc. A process may exec() in a way which obscures the original command line, and all kinds of other maladies.
A scan of my Fedora 20 system /proc/*/cmdline turns up all kinds of less-than-useful results:
-F
BUG:
WARNING: at
WARNING: CPU:
INFO: possible recursive locking detecte
ernel BUG at
list_del corruption
list_add corruption
do_IRQ: stack overflow:
ear stack overflow (cur:
eneral protection fault
nable to handle kernel
ouble fault:
RTNL: assertion failed
eek! page_mapcount(page) went negative!
adness at
NETDEV WATCHDOG
ysctl table check failed
: nobody cared
IRQ handler type mismatch
Machine Check Exception:
Machine check events logged
divide error:
bounds:
coprocessor segment overrun:
invalid TSS:
segment not present:
invalid opcode:
alignment check:
stack segment:
fpu exception:
simd exception:
iret exception:
/var/log/messages
--
/usr/bin/abrt-dump-oops
-xtD
I have managed to solve a version of this problem. I wanted to access the cmdline of all PIDs but within the kernel itself (as opposed to a kernel module as the question states), but perhaps these principles can be applied to kernel modules as well?
What I did was, I added the following function to fs/proc/base.c
int proc_get_cmdline(struct task_struct *task, char * buffer) {
int i;
int ret = proc_pid_cmdline(task, buffer);
for(i = 0; i < ret - 1; i++) {
if(buffer[i] == '\0')
buffer[i] = ' ';
}
return 0;
}
I then added the declaration in include/linux/proc_fs.h
int proc_get_cmdline(struct task_struct *, char *);
At this point, I could access the cmdline of all processes within the kernel.
To access the task_struct, perhaps you could refer to kernel: efficient way to find task_struct by pid?.
Once you have the task_struct, you should be able to do something like:
char cmdline[256];
proc_get_cmdline(task, cmdline);
if(strlen(cmdline) > 0)
printk(" cmdline :%s\n", cmdline);
else
printk(" cmdline :%s\n", task->comm);
I was able to obtain the commandline of all processes this way.
To get the full path of the binary behind a process.
char * exepathp;
struct file * exe_file;
struct mm_struct *mm;
char exe_path [1000];
//straight up stolen from get_mm_exe_file
mm = get_task_mm(current);
down_read(&mm->mmap_sem); //lock read
exe_file = mm->exe_file;
if (exe_file) get_file(exe_file);
up_read(&mm->mmap_sem); //unlock read
//reduce exe path to a string
exepathp = d_path( &(exe_file->f_path), exe_path, 1000*sizeof(char) );
Where current is the task struct for the process you are interested in. The variable exepathp gets the string of the full path. This is slightly different than the process cmd, this is the path of binary which was loaded to start the process. Combining this path with the process cmd should give you the full path.
Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.
I'm trying to access physical memory directly for an embedded Linux project, but I'm not sure how I can best designate memory for my use.
If I boot my device regularly, and access /dev/mem, I can easily read and write to just about anywhere I want. However, in this, I'm accessing memory that can easily be allocated to any process; which I don't want to do
My code for /dev/mem is (all error checking, etc. removed):
mem_fd = open("/dev/mem", O_RDWR));
mem_p = malloc(SIZE + (PAGE_SIZE - 1));
if ((unsigned long) mem_p % PAGE_SIZE) {
mem_p += PAGE_SIZE - ((unsigned long) mem_p % PAGE_SIZE);
}
mem_p = (unsigned char *) mmap(mem_p, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, mem_fd, BASE_ADDRESS);
And this works. However, I'd like to be using memory that no one else will touch. I've tried limiting the amount of memory that the kernel sees by booting with mem=XXXm, and then setting BASE_ADDRESS to something above that (but below the physical memory), but it doesn't seem to be accessing the same memory consistently.
Based on what I've seen online, I suspect I may need a kernel module (which is OK) which uses either ioremap() or remap_pfn_range() (or both???), but I have absolutely no idea how; can anyone help?
EDIT:
What I want is a way to always access the same physical memory (say, 1.5MB worth), and set that memory aside so that the kernel will not allocate it to any other process.
I'm trying to reproduce a system we had in other OSes (with no memory management) whereby I could allocate a space in memory via the linker, and access it using something like
*(unsigned char *)0x12345678
EDIT2:
I guess I should provide some more detail. This memory space will be used for a RAM buffer for a high performance logging solution for an embedded application. In the systems we have, there's nothing that clears or scrambles physical memory during a soft reboot. Thus, if I write a bit to a physical address X, and reboot the system, the same bit will still be set after the reboot. This has been tested on the exact same hardware running VxWorks (this logic also works nicely in Nucleus RTOS and OS20 on different platforms, FWIW). My idea was to try the same thing in Linux by addressing physical memory directly; therefore, it's essential that I get the same addresses each boot.
I should probably clarify that this is for kernel 2.6.12 and newer.
EDIT3:
Here's my code, first for the kernel module, then for the userspace application.
To use it, I boot with mem=95m, then insmod foo-module.ko, then mknod mknod /dev/foo c 32 0, then run foo-user , where it dies. Running under gdb shows that it dies at the assignment, although within gdb, I cannot dereference the address I get from mmap (although printf can)
foo-module.c
#include <linux/module.h>
#include <linux/config.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <asm/io.h>
#define VERSION_STR "1.0.0"
#define FOO_BUFFER_SIZE (1u*1024u*1024u)
#define FOO_BUFFER_OFFSET (95u*1024u*1024u)
#define FOO_MAJOR 32
#define FOO_NAME "foo"
static const char *foo_version = "#(#) foo Support version " VERSION_STR " " __DATE__ " " __TIME__;
static void *pt = NULL;
static int foo_release(struct inode *inode, struct file *file);
static int foo_open(struct inode *inode, struct file *file);
static int foo_mmap(struct file *filp, struct vm_area_struct *vma);
struct file_operations foo_fops = {
.owner = THIS_MODULE,
.llseek = NULL,
.read = NULL,
.write = NULL,
.readdir = NULL,
.poll = NULL,
.ioctl = NULL,
.mmap = foo_mmap,
.open = foo_open,
.flush = NULL,
.release = foo_release,
.fsync = NULL,
.fasync = NULL,
.lock = NULL,
.readv = NULL,
.writev = NULL,
};
static int __init foo_init(void)
{
int i;
printk(KERN_NOTICE "Loading foo support module\n");
printk(KERN_INFO "Version %s\n", foo_version);
printk(KERN_INFO "Preparing device /dev/foo\n");
i = register_chrdev(FOO_MAJOR, FOO_NAME, &foo_fops);
if (i != 0) {
return -EIO;
printk(KERN_ERR "Device couldn't be registered!");
}
printk(KERN_NOTICE "Device ready.\n");
printk(KERN_NOTICE "Make sure to run mknod /dev/foo c %d 0\n", FOO_MAJOR);
printk(KERN_INFO "Allocating memory\n");
pt = ioremap(FOO_BUFFER_OFFSET, FOO_BUFFER_SIZE);
if (pt == NULL) {
printk(KERN_ERR "Unable to remap memory\n");
return 1;
}
printk(KERN_INFO "ioremap returned %p\n", pt);
return 0;
}
static void __exit foo_exit(void)
{
printk(KERN_NOTICE "Unloading foo support module\n");
unregister_chrdev(FOO_MAJOR, FOO_NAME);
if (pt != NULL) {
printk(KERN_INFO "Unmapping memory at %p\n", pt);
iounmap(pt);
} else {
printk(KERN_WARNING "No memory to unmap!\n");
}
return;
}
static int foo_open(struct inode *inode, struct file *file)
{
printk("foo_open\n");
return 0;
}
static int foo_release(struct inode *inode, struct file *file)
{
printk("foo_release\n");
return 0;
}
static int foo_mmap(struct file *filp, struct vm_area_struct *vma)
{
int ret;
if (pt == NULL) {
printk(KERN_ERR "Memory not mapped!\n");
return -EAGAIN;
}
if ((vma->vm_end - vma->vm_start) != FOO_BUFFER_SIZE) {
printk(KERN_ERR "Error: sizes don't match (buffer size = %d, requested size = %lu)\n", FOO_BUFFER_SIZE, vma->vm_end - vma->vm_start);
return -EAGAIN;
}
ret = remap_pfn_range(vma, vma->vm_start, (unsigned long) pt, vma->vm_end - vma->vm_start, PAGE_SHARED);
if (ret != 0) {
printk(KERN_ERR "Error in calling remap_pfn_range: returned %d\n", ret);
return -EAGAIN;
}
return 0;
}
module_init(foo_init);
module_exit(foo_exit);
MODULE_AUTHOR("Mike Miller");
MODULE_LICENSE("NONE");
MODULE_VERSION(VERSION_STR);
MODULE_DESCRIPTION("Provides support for foo to access direct memory");
foo-user.c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void)
{
int fd;
char *mptr;
fd = open("/dev/foo", O_RDWR | O_SYNC);
if (fd == -1) {
printf("open error...\n");
return 1;
}
mptr = mmap(0, 1 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 4096);
printf("On start, mptr points to 0x%lX.\n",(unsigned long) mptr);
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
mptr[0] = 'a';
mptr[1] = 'b';
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
close(fd);
return 0;
}
I think you can find a lot of documentation about the kmalloc + mmap part.
However, I am not sure that you can kmalloc so much memory in a contiguous way, and have it always at the same place. Sure, if everything is always the same, then you might get a constant address. However, each time you change the kernel code, you will get a different address, so I would not go with the kmalloc solution.
I think you should reserve some memory at boot time, ie reserve some physical memory so that is is not touched by the kernel. Then you can ioremap this memory which will give you
a kernel virtual address, and then you can mmap it and write a nice device driver.
This take us back to linux device drivers in PDF format. Have a look at chapter 15, it is describing this technique on page 443
Edit : ioremap and mmap.
I think this might be easier to debug doing things in two step : first get the ioremap
right, and test it using a character device operation, ie read/write. Once you know you can safely have access to the whole ioremapped memory using read / write, then you try to mmap the whole ioremapped range.
And if you get in trouble may be post another question about mmaping
Edit : remap_pfn_range
ioremap returns a virtual_adress, which you must convert to a pfn for remap_pfn_ranges.
Now, I don't understand exactly what a pfn (Page Frame Number) is, but I think you can get one calling
virt_to_phys(pt) >> PAGE_SHIFT
This probably is not the Right Way (tm) to do it, but you should try it
You should also check that FOO_MEM_OFFSET is the physical address of your RAM block. Ie before anything happens with the mmu, your memory is available at 0 in the memory map of your processor.
Sorry to answer but not quite answer, I noticed that you have already edited the question. Please note that SO does not notify us when you edit the question. I'm giving a generic answer here, when you update the question please leave a comment, then I'll edit my answer.
Yes, you're going to need to write a module. What it comes down to is the use of kmalloc() (allocating a region in kernel space) or vmalloc() (allocating a region in userspace).
Exposing the prior is easy, exposing the latter can be a pain in the rear with the kind of interface that you are describing as needed. You noted 1.5 MB is a rough estimate of how much you actually need to reserve, is that iron clad? I.e are you comfortable taking that from kernel space? Can you adequately deal with ENOMEM or EIO from userspace (or even disk sleep)? IOW, what's going into this region?
Also, is concurrency going to be an issue with this? If so, are you going to be using a futex? If the answer to either is 'yes' (especially the latter), its likely that you'll have to bite the bullet and go with vmalloc() (or risk kernel rot from within). Also, if you are even THINKING about an ioctl() interface to the char device (especially for some ad-hoc locking idea), you really want to go with vmalloc().
Also, have you read this? Plus we aren't even touching on what grsec / selinux is going to think of this (if in use).
/dev/mem is okay for simple register peeks and pokes, but once you cross into interrupts and DMA territory, you really should write a kernel-mode driver. What you did for your previous memory-management-less OSes simply doesn't graft well to an General Purpose OS like Linux.
You've already thought about the DMA buffer allocation issue. Now, think about the "DMA done" interrupt from your device. How are you going to install an Interrupt Service Routine?
Besides, /dev/mem is typically locked out for non-root users, so it's not very practical for general use. Sure, you could chmod it, but then you've opened a big security hole in the system.
If you are trying to keep the driver code base similar between the OSes, you should consider refactoring it into separate user & kernel mode layers with an IOCTL-like interface in-between. If you write the user-mode portion as a generic library of C code, it should be easy to port between Linux and other OSes. The OS-specific part is the kernel-mode code. (We use this kind of approach for our drivers.)
It seems like you have already concluded that it's time to write a kernel-driver, so you're on the right track. The only advice I can add is to read these books cover-to-cover.
Linux Device Drivers
Understanding the Linux Kernel
(Keep in mind that these books are circa-2005, so the information is a bit dated.)
I am by far no expert on these matters, so this will be a question to you rather than an answer. Is there any reason you can't just make a small ram disk partition and use it only for your application? Would that not give you guaranteed access to the same chunk of memory? I'm not sure of there would be any I/O performance issues, or additional overhead associated with doing that. This also assumes that you can tell the kernel to partition a specific address range in memory, not sure if that is possible.
I apologize for the newb question, but I found your question interesting, and am curious if ram disk could be used in such a way.
Have you looked at the 'memmap' kernel parameter? On i386 and X64_64, you can use the memmap parameter to define how the kernel will hand very specific blocks of memory (see the Linux kernel parameter documentation). In your case, you'd want to mark memory as 'reserved' so that Linux doesn't touch it at all. Then you can write your code to use that absolute address and size (woe be unto you if you step outside that space).