TSS usage in the Linux kernel

While reading the Linux kernel code, I came across the use of the TSS (task state segment) inside __switch_to (arch/x86_64/kernel/process.c#L441).
Since Linux saves and restores process context in software, why does it need to maintain duplicate data such as the stack pointer? A small snippet for reference:
struct task_struct *__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread,
*next = &next_p->thread;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(init_tss, cpu);
unlazy_fpu(prev_p);
/*
* Reload esp0, LDT and the page table pointer:
*/
tss->rsp0 = next->rsp0;
Why is rsp0 copied from next (thread_struct) to tss (tss_struct)? Why can't we use next->rsp0 wherever tss->rsp0 is used? Why is tss_struct still present in the kernel?
I am a beginner who has recently started learning kernel internals.

Related

How to test/validate the vmalloc guard page is working in Linux

I am studying stack guarding in Linux. I found that the Linux kernel VMAP_STACK config parameter is using the guard page mechanism along with vmalloc() to provide stack guarding.
I am trying to find a way to check how this guard page works in the Linux kernel. I googled and checked the kernel code, but could not find the relevant code.
A further question is how to verify the guarded stack.
I wrote a kernel module to underrun/overflow a process's kernel stack, like this:
static void shoot_kernel_stack(void)
{
unsigned char *ptr = task_stack_page(current);
unsigned char *tmp = NULL;
tmp = ptr + THREAD_SIZE + PAGE_SIZE + 0;
// tmp -= 0x100;
memset(tmp, 0xB4, 0x10); // Underrun
}
I do get a kernel panic like the one below:
[ 8006.358354] BUG: stack guard page was hit at 00000000e8dc2d98 (stack is 00000000cff0f921..00000000653b24a9)
[ 8006.361276] kernel stack overflow (page fault): 0000 [#1] SMP PTI
Is this the right way to verify the guard page?
The VMAP_STACK Linux feature is used to map the kernel stacks of threads into VMAs. Because the stack is virtually mapped, the underlying physical pages don't need to be contiguous, and cross-page overflows can be detected by adding guard pages. As each VMA is followed by a guard page (unless the VM_NO_GUARD flag is passed at allocation time), the stacks allocated in those areas benefit from it for stack overflow detection.
ALLOCATION
The thread stacks are allocated at thread creation time with alloc_thread_stack_node() in kernel/fork.c. When VMAP_STACK is activated, the stacks are cached because according to the comments in the source code:
vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
flush. Try to minimize the number of calls by caching stacks.
The kernel stack size is THREAD_SIZE (equal to 4 pages on x86_64 platforms). The source code of the allocation invoked at thread creation time is:
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
void *stack;
int i;
[...] // <----- Part which gets a previously cached stack. If no stack in cache
// the following is run to allocate a brand new stack:
/*
* Allocated stacks are cached and later reused by new threads,
* so memcg accounting is performed manually on assigning/releasing
* stacks to tasks. Drop __GFP_ACCOUNT.
*/
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
VMALLOC_START, VMALLOC_END,
THREADINFO_GFP & ~__GFP_ACCOUNT,
PAGE_KERNEL,
0, node, __builtin_return_address(0));
[...]
__vmalloc_node_range() is defined in mm/vmalloc.c. It calls __get_vm_area_node(). As the latter is not passed the VM_NO_GUARD flag, an additional page is added at the end of the allocated area. This is the guard page of the VMA:
static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long flags, unsigned long start,
unsigned long end, int node, gfp_t gfp_mask, const void *caller)
{
struct vmap_area *va;
struct vm_struct *area;
BUG_ON(in_interrupt());
size = PAGE_ALIGN(size);
if (unlikely(!size))
return NULL;
if (flags & VM_IOREMAP)
align = 1ul << clamp_t(int, get_count_order_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);
area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
return NULL;
if (!(flags & VM_NO_GUARD)) // <----- A GUARD PAGE IS ADDED
size += PAGE_SIZE;
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
kfree(area);
return NULL;
}
setup_vmalloc_vm(area, va, flags, caller);
return area;
}
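The size arithmetic in __get_vm_area_node() can be reproduced in user space. The following is only a sketch: the demo_ names and the 4 KiB page size are assumptions for illustration, and DEMO_VM_NO_GUARD is a stand-in for the kernel's VM_NO_GUARD flag, not the kernel definition.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_PAGE_SIZE   4096UL /* assumed page size (typical on x86_64) */
#define DEMO_VM_NO_GUARD 0x40UL /* stand-in flag bit for this sketch only */

/* Round size up to a whole number of pages, like PAGE_ALIGN(). */
static unsigned long demo_page_align(unsigned long size)
{
    return (size + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/*
 * Virtual size reserved for an allocation: the page-aligned size plus
 * one guard page, unless the no-guard flag is set.
 */
static unsigned long demo_area_size(unsigned long size, unsigned long flags)
{
    assert(size <= ULONG_MAX - 2 * DEMO_PAGE_SIZE); /* keep the sums from wrapping */

    size = demo_page_align(size);
    if (!(flags & DEMO_VM_NO_GUARD))
        size += DEMO_PAGE_SIZE;
    return size;
}
```

With a 4-page stack, demo_area_size(4 * DEMO_PAGE_SIZE, 0) yields 5 pages of virtual address space: four for the stack itself and one for the guard page.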
OVERFLOW MANAGEMENT
The stack overflow management is architecture dependent (i.e. source code located in arch/...). The links referenced below provide some pointers on some architecture dependent implementations.
For the x86_64 platform, the overflow check is done in the page fault handler, which goes through the following chain of function calls: do_page_fault() -> __do_page_fault() -> do_kern_addr_fault() -> bad_area_nosemaphore() -> no_context(), the last of which is defined in arch/x86/mm/fault.c. In no_context(), there is a part dedicated to VMAP_STACK management for the detection of stack under/overflow:
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
unsigned long address, int signal, int si_code)
{
struct task_struct *tsk = current;
unsigned long flags;
int sig;
[...]
#ifdef CONFIG_VMAP_STACK
/*
* Stack overflow? During boot, we can fault near the initial
* stack in the direct map, but that's not an overflow -- check
* that we're in vmalloc space to avoid this.
*/
if (is_vmalloc_addr((void *)address) &&
(((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
/*
* We're likely to be running with very little stack space
* left. It's plausible that we'd hit this condition but
* double-fault even before we get this far, in which case
* we're fine: the double-fault handler will deal with it.
*
* We don't want to make it all the way into the oops code
* and then double-fault, though, because we're likely to
* break the console driver and lose most of the stack dump.
*/
asm volatile ("movq %[stack], %%rsp\n\t"
"call handle_stack_overflow\n\t"
"1: jmp 1b"
: ASM_CALL_CONSTRAINT
: "D" ("kernel stack overflow (page fault)"),
"S" (regs), "d" (address),
[stack] "rm" (stack));
unreachable();
}
#endif
[...]
}
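The address comparison in no_context() relies on unsigned wraparound: each subtraction yields a small value only when the faulting address lies within one page below or one page above the stack. A user-space sketch of just that comparison follows. The demo_ names, the 4 KiB page size and the 4-page stack are assumptions, and the is_vmalloc_addr() part of the kernel check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE   4096UL                 /* assumed page size */
#define DEMO_THREAD_SIZE (4UL * DEMO_PAGE_SIZE) /* assumed 4-page stack */

/*
 * Returns true when address falls in the page just below stack_base or in
 * the page just above stack_base + DEMO_THREAD_SIZE. For any other address
 * the unsigned subtraction wraps to a huge value and the test fails.
 */
static bool demo_hits_guard_page(uintptr_t stack_base, uintptr_t address)
{
    bool below, above;

    assert(stack_base >= DEMO_PAGE_SIZE); /* room for a page below the stack */

    below = (stack_base - 1 - address) < DEMO_PAGE_SIZE;
    above = (address - (stack_base + DEMO_THREAD_SIZE)) < DEMO_PAGE_SIZE;
    return below || above;
}
```

An address inside the stack itself returns false, as does an address more than one page away on either side.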
In the above code, when a stack under/overflow is detected, the handle_stack_overflow() function (defined in arch/x86/kernel/traps.c) is called:
#ifdef CONFIG_VMAP_STACK
__visible void __noreturn handle_stack_overflow(const char *message,
struct pt_regs *regs,
unsigned long fault_address)
{
printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
(void *)fault_address, current->stack,
(char *)current->stack + THREAD_SIZE - 1);
die(message, regs, 0);
/* Be absolutely certain we don't return. */
panic("%s", message);
}
#endif
The example error message "BUG: stack guard page was hit at..." pointed out in the question comes from the above handle_stack_overflow() function.
FROM YOUR EXAMPLE MODULE
When VMAP_STACK is defined, the stack_vm_area field of the task descriptor appears and is set with the VMA address associated to the stack. From there, it is possible to grab interesting information:
struct task_struct *task;
#ifdef CONFIG_VMAP_STACK
struct vm_struct *vm;
#endif // CONFIG_VMAP_STACK
task = current;
printk("\tKernel stack: 0x%lx\n", (unsigned long)(task->stack));
printk("\tStack end magic: 0x%lx\n", *(unsigned long *)(task->stack));
#ifdef CONFIG_VMAP_STACK
vm = task->stack_vm_area;
printk("\tstack_vm_area->addr = 0x%lx\n", (unsigned long)(vm->addr));
printk("\tstack_vm_area->nr_pages = %u\n", vm->nr_pages);
printk("\tstack_vm_area->size = %lu\n", vm->size);
#endif // CONFIG_VMAP_STACK
printk("\tLocal var in stack: 0x%lx\n", (unsigned long)(&task));
The nr_pages field is the number of pages without the additional guard page. The unsigned long at the lowest address of the stack (the end toward which the stack grows, read through task->stack in the snippet above) is set to STACK_END_MAGIC, defined in include/uapi/linux/magic.h as:
#define STACK_END_MAGIC 0x57AC6E9D
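The magic word acts as a canary: if it no longer holds the expected value, something has written past the end of the stack. The idea can be tried in user space with an ordinary array standing in for the stack. This is a sketch only; the demo_ functions are hypothetical stand-ins and not the kernel's own helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_STACK_END_MAGIC 0x57AC6E9DUL /* same value as the kernel constant */

/* Write the canary into the lowest word of the (fake) stack. */
static void demo_set_stack_end_magic(unsigned long *stack_lowest)
{
    assert(stack_lowest != NULL);
    *stack_lowest = DEMO_STACK_END_MAGIC;
}

/* True when the canary has been overwritten. */
static bool demo_stack_end_corrupted(const unsigned long *stack_lowest)
{
    assert(stack_lowest != NULL);
    return *stack_lowest != DEMO_STACK_END_MAGIC;
}
```

Unlike the guard page, this check only notices an overflow after the fact, and only if the lowest word was actually overwritten.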
REFERENCES:
Preventing stack guard-page hopping
arm64: VMAP_STACK support
CONFIG_VMAP_STACK: Use a virtually-mapped stack
Linux 4.9 On x86_64 To Support Vmapped Stacks
A Decade of Linux Kernel Vulnerabilities

Inserting a PID in the Linux Hash-Table

Currently I'm working on a Linux-Kernel-Module, that can hide any normal Process.
The hiding works fine, but I haven't found a way to unhide the process yet.
First I delete the struct task_struct from the big list of task_structs inside the Kernel:
struct task_struct *p;
//Finding the correct task_struct
for_each_process(p)
if(p->pid == pid){
// Removing the task_struct
struct list_head *next = task->tasks.next;
struct list_head *prev = task->tasks.prev;
next->prev=prev;
prev->next=next;
}
But the task_struct is still traceable, because it's inside a hash_table containing every process' task_struct. In fact most of the PID-Lookup is performed by this hash-table. Removing the task-struct from there is a bit tricky:
struct pid *pid; //struct pid of the task_struct
//Deleting for every pid_namespace the pid_chain from the hash_list
for (i = 0; i <= pid->level; i++) {
struct upid *upid = pid->numbers + i;
hlist_del_rcu(&upid->pid_chain);
}
The Problem is to restore both structs: Inserting the task_struct back in the list of task_structs is easy, but I haven't found a way to restore the link in the hash-table. It's difficult, because the kernel doesn't expose the needed structures.
Inside the Kernel, it's done within this line:
hlist_add_head_rcu(&upid->pid_chain,&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
pid_hashfn and pid_hash are defined as follows:
#define pid_hashfn(nr, ns) hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
static struct hlist_head *pid_hash;
And the struct I need to insert, is the pid_chain:
struct hlist_node pid_chain;
So my question is, how can I insert the pid_chain in the correct hash-list?
Is there a way to obtain the reference to the hash-list-array, even if it's declared as static?
Or, maybe an uncommon idea: The hash-list is allocated via
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,HASH_EARLY | HASH_SMALL, &pidhash_shift, NULL,0, 4096);
So, if I could get the starting-position of the memory of the hash-list, could I scan the corresponding memoryspace for the pointer of my struct and then cast the surrounding memoryregion to struct of type struct hlist?
Thanks for your help. Every solution or idea is appreciated :)
The hash list is listed in the System.map file; you can check there.
The pid_hash symbol can be located in /proc/kallsyms and is also accessible programmatically via kallsyms_lookup_name.

Why mm_struct->start_stack and vm_area_struct->start don't point to the same address?

As far as I understand memory management in the Linux kernel, each process has an mm_struct structure that describes its address space. One important memory region is the stack. It should be described by a vm_area_struct memory region, and mm_struct itself has a field, mm_struct->start_stack, which holds the stack's address.
I came across the code below, and what I cannot understand is why none of the memory region start/end addresses are equal to the mm_struct->start_stack value. Any help in understanding this would be very much appreciated. Thanks.
Some of the results of loading the compiled kernel module:
Vma number 14: Starts at 0x7fff4bb68000, Ends at 0x7fff4bb8a000
Vma number 15: Starts at 0x7fff4bbfc000, Ends at 0x7fff4bbfe000
Vma number 16: Starts at 0x7fff4bbfe000, Ends at 0x7fff4bc00000
Code Segment start = 0x400000, end = 0x400854
Data Segment start = 0x600858, end = 0x600a94
Stack Segment start = 0x7fff4bb88420
One can see that the stack segment start (0x7fff4bb88420) belongs to VMA number 14, but I don't know why the addresses are different.
Kernel module source code:
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/mm.h>
static int pid_mem = 1;
static void print_mem(struct task_struct *task)
{
struct mm_struct *mm;
struct vm_area_struct *vma;
int count = 0;
mm = task->mm;
printk("\nThis mm_struct has %d vmas.\n", mm->map_count);
for (vma = mm->mmap ; vma ; vma = vma->vm_next) {
printk ("\nVma number %d: \n", ++count);
printk(" Starts at 0x%lx, Ends at 0x%lx\n",
vma->vm_start, vma->vm_end);
}
printk("\nCode Segment start = 0x%lx, end = 0x%lx \n"
"Data Segment start = 0x%lx, end = 0x%lx\n"
"Stack Segment start = 0x%lx\n",
mm->start_code, mm->end_code,
mm->start_data, mm->end_data,
mm->start_stack);
}
static int mm_exp_load(void){
struct task_struct *task;
printk("\nGot the process id to look up as %d.\n", pid_mem);
for_each_process(task) {
if ( task->pid == pid_mem) {
printk("%s[%d]\n", task->comm, task->pid);
print_mem(task);
}
}
return 0;
}
static void mm_exp_unload(void)
{
printk("\nPrint segment information module exiting.\n");
}
module_init(mm_exp_load);
module_exit(mm_exp_unload);
module_param(pid_mem, int, 0);
MODULE_AUTHOR ("Krishnakumar. R, rkrishnakumar#gmail.com");
MODULE_DESCRIPTION ("Print segment information");
MODULE_LICENSE("GPL");
Looks like start_stack is the initial stack pointer address. It's calculated by the kernel when the program is executed and is based on the stack section address given in the executable file. I don't think it gets updated at all thereafter. The system uses start_stack in at least one instance: to identify which vma represents "the stack" (when providing /proc/<pid>/maps), as the vma containing that address is guaranteed to contain the (main) stack.
But note that this is only the stack for the "main" (initial) thread; a multi-threaded program will have other stacks too -- one per thread. Since they all share the same address space, all threads will show the same set of vmas, and I think you'll find they all have the same start_stack value as well. But only the main thread's stack pointer will be within the main stack vma. The other threads will each have their own stack vmas -- this is so that each thread's stack can grow independently.
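The "vma containing that address" test can be checked against the numbers in the question. The sketch below is plain user-space C; struct demo_vma and demo_find_vma() are hypothetical names, and the end address is treated as exclusive, which is an assumption that matches how vm_end is normally used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a vm_area_struct: [start, end) */
struct demo_vma {
    uint64_t start;
    uint64_t end;
};

/* Return the index of the region containing addr, or -1 if none does. */
static int demo_find_vma(const struct demo_vma *vmas, size_t n, uint64_t addr)
{
    size_t i;

    assert(vmas != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return (int)i;
    }
    return -1;
}
```

Feeding it VMAs 14 to 16 from the question's output and the address 0x7fff4bb88420 selects the first of the three, which corresponds to VMA number 14: start_stack lies inside that region rather than on its boundary.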
In general, there is one mm_struct per process but many vm_area_struct instances, each of which is responsible for one mapped area.
For example, on a 32-bit system a process has a virtual address space of 4 GB, all of which is described by the mm_struct. However, there can be many regions within that 4 GB space. Each region is described by a vm_area_struct and is bounded by vm_area_struct->vm_start and vm_area_struct->vm_end. So the mm_struct contains a list of vm_area_struct.
Here is a detailed introduction.

How to use proc_pid_cmdline in kernel module

I am writing a kernel module to get the list of PIDs with their complete process names. proc_pid_cmdline() gives the complete process name; /proc/*/cmdline uses the same function to get it. (struct task_struct)->comm gives a hint of what the process is, but not the complete path.
I have called the function by name, but I get an error because the function cannot be found.
How can I use proc_pid_cmdline() in a module?
You are not supposed to call proc_pid_cmdline().
It is a non-public function in fs/proc/base.c:
static int proc_pid_cmdline(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
However, what it does is simple:
get_cmdline(task, m->buf, PAGE_SIZE);
That is not likely to return the full path, though, and it will not be possible to determine the full path in every case. The argv[0] value may be overwritten, the file could be deleted or moved, etc. A process may exec() in a way that obscures the original command line, and there are all kinds of other maladies.
A scan of my Fedora 20 system /proc/*/cmdline turns up all kinds of less-than-useful results:
-F
BUG:
WARNING: at
WARNING: CPU:
INFO: possible recursive locking detecte
ernel BUG at
list_del corruption
list_add corruption
do_IRQ: stack overflow:
ear stack overflow (cur:
eneral protection fault
nable to handle kernel
ouble fault:
RTNL: assertion failed
eek! page_mapcount(page) went negative!
adness at
NETDEV WATCHDOG
ysctl table check failed
: nobody cared
IRQ handler type mismatch
Machine Check Exception:
Machine check events logged
divide error:
bounds:
coprocessor segment overrun:
invalid TSS:
segment not present:
invalid opcode:
alignment check:
stack segment:
fpu exception:
simd exception:
iret exception:
/var/log/messages
--
/usr/bin/abrt-dump-oops
-xtD
I have managed to solve a version of this problem. I wanted to access the cmdline of all PIDs but within the kernel itself (as opposed to a kernel module as the question states), but perhaps these principles can be applied to kernel modules as well?
What I did was, I added the following function to fs/proc/base.c
int proc_get_cmdline(struct task_struct *task, char * buffer) {
int i;
int ret = proc_pid_cmdline(task, buffer);
for(i = 0; i < ret - 1; i++) {
if(buffer[i] == '\0')
buffer[i] = ' ';
}
return 0;
}
I then added the declaration in include/linux/proc_fs.h
int proc_get_cmdline(struct task_struct *, char *);
At this point, I could access the cmdline of all processes within the kernel.
To access the task_struct, perhaps you could refer to kernel: efficient way to find task_struct by pid?.
Once you have the task_struct, you should be able to do something like:
char cmdline[256];
proc_get_cmdline(task, cmdline);
if(strlen(cmdline) > 0)
printk(" cmdline :%s\n", cmdline);
else
printk(" cmdline :%s\n", task->comm);
I was able to obtain the commandline of all processes this way.
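The NUL-to-space loop in proc_get_cmdline() can be tried on its own in user space, since the command line is a run of NUL-separated arguments. This is a sketch with a hypothetical demo_join_cmdline() helper; like the loop above, it leaves the final byte untouched so the result stays NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replace every NUL separator in buf[0..len-2] with a space and leave the
 * last byte alone. Returns the number of separators replaced.
 */
static size_t demo_join_cmdline(char *buf, size_t len)
{
    size_t i, replaced = 0;

    assert(buf != NULL || len == 0);

    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == '\0') {
            buf[i] = ' ';
            replaced++;
        }
    }
    return replaced;
}
```

For a buffer holding "ls", "-l" and "/tmp" separated by NUL bytes, the call turns the contents into the single string "ls -l /tmp".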
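To get the full path of the binary behind a process:
char *exepathp = NULL;
struct file *exe_file = NULL;
struct mm_struct *mm;
char exe_path[1000];
// adapted from get_mm_exe_file
mm = get_task_mm(current); // takes a reference; NULL for kernel threads
if (mm) {
    down_read(&mm->mmap_sem); // lock read
    exe_file = mm->exe_file;
    if (exe_file) get_file(exe_file);
    up_read(&mm->mmap_sem); // unlock read
    mmput(mm); // drop the reference taken by get_task_mm()
}
// reduce exe path to a string
if (exe_file) {
    // d_path() may return an ERR_PTR() value; check with IS_ERR() before use
    exepathp = d_path(&exe_file->f_path, exe_path, sizeof(exe_path));
    fput(exe_file); // drop the reference taken by get_file()
}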
Where current is the task struct for the process you are interested in. The variable exepathp gets the string of the full path. This is slightly different than the process cmd, this is the path of binary which was loaded to start the process. Combining this path with the process cmd should give you the full path.

Direct Memory Access in Linux

I'm trying to access physical memory directly for an embedded Linux project, but I'm not sure how I can best designate memory for my use.
If I boot my device regularly and access /dev/mem, I can easily read and write to just about anywhere I want. However, this way I'm accessing memory that can easily be allocated to any process, which I don't want to do.
My code for /dev/mem is (all error checking, etc. removed):
mem_fd = open("/dev/mem", O_RDWR);
mem_p = malloc(SIZE + (PAGE_SIZE - 1));
if ((unsigned long) mem_p % PAGE_SIZE) {
mem_p += PAGE_SIZE - ((unsigned long) mem_p % PAGE_SIZE);
}
mem_p = (unsigned char *) mmap(mem_p, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, mem_fd, BASE_ADDRESS);
And this works. However, I'd like to be using memory that no one else will touch. I've tried limiting the amount of memory that the kernel sees by booting with mem=XXXm, and then setting BASE_ADDRESS to something above that (but below the physical memory), but it doesn't seem to be accessing the same memory consistently.
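As a side note on the snippet above, mmap() does not need a malloc()ed buffer or MAP_FIXED: passing NULL lets the kernel choose the virtual address, and only the file offset has to be page-aligned. The following is a sketch under those assumptions. The demo_ helpers are hypothetical, the page size would normally come from sysconf(_SC_PAGESIZE), and whether /dev/mem permits the mapping depends on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Start of the page containing phys; page must be a power of two. */
static uint64_t demo_page_base(uint64_t phys, uint64_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return phys & ~(page - 1);
}

/* Offset of phys within its page. */
static uint64_t demo_page_offset(uint64_t phys, uint64_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return phys & (page - 1);
}

/*
 * Map len bytes of physical memory starting at phys through an already
 * opened /dev/mem descriptor. Returns a pointer to the byte at phys, or
 * NULL on failure. *map_base and *map_len receive the values to pass to
 * munmap() later.
 */
static void *demo_map_phys(int mem_fd, uint64_t phys, size_t len, uint64_t page,
                           void **map_base, size_t *map_len)
{
    uint64_t base = demo_page_base(phys, page);
    size_t delta = (size_t)demo_page_offset(phys, page);
    void *p;

    assert(map_base != NULL && map_len != NULL);

    /* NULL address: the kernel picks where the mapping goes. */
    p = mmap(NULL, len + delta, PROT_READ | PROT_WRITE, MAP_SHARED,
             mem_fd, (off_t)base);
    if (p == MAP_FAILED)
        return NULL;

    *map_base = p;
    *map_len = len + delta;
    return (char *)p + delta;
}
```

This does not solve the reservation problem described next; it only removes the alignment workaround from the mapping step.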
Based on what I've seen online, I suspect I may need a kernel module (which is OK) which uses either ioremap() or remap_pfn_range() (or both???), but I have absolutely no idea how; can anyone help?
EDIT:
What I want is a way to always access the same physical memory (say, 1.5MB worth), and set that memory aside so that the kernel will not allocate it to any other process.
I'm trying to reproduce a system we had in other OSes (with no memory management) whereby I could allocate a space in memory via the linker, and access it using something like
*(unsigned char *)0x12345678
EDIT2:
I guess I should provide some more detail. This memory space will be used for a RAM buffer for a high performance logging solution for an embedded application. In the systems we have, there's nothing that clears or scrambles physical memory during a soft reboot. Thus, if I write a bit to a physical address X, and reboot the system, the same bit will still be set after the reboot. This has been tested on the exact same hardware running VxWorks (this logic also works nicely in Nucleus RTOS and OS20 on different platforms, FWIW). My idea was to try the same thing in Linux by addressing physical memory directly; therefore, it's essential that I get the same addresses each boot.
I should probably clarify that this is for kernel 2.6.12 and newer.
EDIT3:
Here's my code, first for the kernel module, then for the userspace application.
To use it, I boot with mem=95m, then insmod foo-module.ko, then mknod /dev/foo c 32 0, then run foo-user, where it dies. Running under gdb shows that it dies at the assignment, although within gdb I cannot dereference the address I get from mmap (although printf can).
foo-module.c
#include <linux/module.h>
#include <linux/config.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <asm/io.h>
#define VERSION_STR "1.0.0"
#define FOO_BUFFER_SIZE (1u*1024u*1024u)
#define FOO_BUFFER_OFFSET (95u*1024u*1024u)
#define FOO_MAJOR 32
#define FOO_NAME "foo"
static const char *foo_version = "#(#) foo Support version " VERSION_STR " " __DATE__ " " __TIME__;
static void *pt = NULL;
static int foo_release(struct inode *inode, struct file *file);
static int foo_open(struct inode *inode, struct file *file);
static int foo_mmap(struct file *filp, struct vm_area_struct *vma);
struct file_operations foo_fops = {
.owner = THIS_MODULE,
.llseek = NULL,
.read = NULL,
.write = NULL,
.readdir = NULL,
.poll = NULL,
.ioctl = NULL,
.mmap = foo_mmap,
.open = foo_open,
.flush = NULL,
.release = foo_release,
.fsync = NULL,
.fasync = NULL,
.lock = NULL,
.readv = NULL,
.writev = NULL,
};
static int __init foo_init(void)
{
int i;
printk(KERN_NOTICE "Loading foo support module\n");
printk(KERN_INFO "Version %s\n", foo_version);
printk(KERN_INFO "Preparing device /dev/foo\n");
i = register_chrdev(FOO_MAJOR, FOO_NAME, &foo_fops);
if (i != 0) {
printk(KERN_ERR "Device couldn't be registered!\n");
return -EIO;
}
printk(KERN_NOTICE "Device ready.\n");
printk(KERN_NOTICE "Make sure to run mknod /dev/foo c %d 0\n", FOO_MAJOR);
printk(KERN_INFO "Allocating memory\n");
pt = ioremap(FOO_BUFFER_OFFSET, FOO_BUFFER_SIZE);
if (pt == NULL) {
printk(KERN_ERR "Unable to remap memory\n");
unregister_chrdev(FOO_MAJOR, FOO_NAME);
return -ENOMEM;
}
printk(KERN_INFO "ioremap returned %p\n", pt);
return 0;
}
static void __exit foo_exit(void)
{
printk(KERN_NOTICE "Unloading foo support module\n");
unregister_chrdev(FOO_MAJOR, FOO_NAME);
if (pt != NULL) {
printk(KERN_INFO "Unmapping memory at %p\n", pt);
iounmap(pt);
} else {
printk(KERN_WARNING "No memory to unmap!\n");
}
return;
}
static int foo_open(struct inode *inode, struct file *file)
{
printk("foo_open\n");
return 0;
}
static int foo_release(struct inode *inode, struct file *file)
{
printk("foo_release\n");
return 0;
}
static int foo_mmap(struct file *filp, struct vm_area_struct *vma)
{
int ret;
if (pt == NULL) {
printk(KERN_ERR "Memory not mapped!\n");
return -EAGAIN;
}
if ((vma->vm_end - vma->vm_start) != FOO_BUFFER_SIZE) {
printk(KERN_ERR "Error: sizes don't match (buffer size = %d, requested size = %lu)\n", FOO_BUFFER_SIZE, vma->vm_end - vma->vm_start);
return -EAGAIN;
}
ret = remap_pfn_range(vma, vma->vm_start, (unsigned long) pt, vma->vm_end - vma->vm_start, PAGE_SHARED);
if (ret != 0) {
printk(KERN_ERR "Error in calling remap_pfn_range: returned %d\n", ret);
return -EAGAIN;
}
return 0;
}
module_init(foo_init);
module_exit(foo_exit);
MODULE_AUTHOR("Mike Miller");
MODULE_LICENSE("NONE");
MODULE_VERSION(VERSION_STR);
MODULE_DESCRIPTION("Provides support for foo to access direct memory");
foo-user.c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void)
{
int fd;
char *mptr;
fd = open("/dev/foo", O_RDWR | O_SYNC);
if (fd == -1) {
printf("open error...\n");
return 1;
}
mptr = mmap(0, 1 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 4096);
printf("On start, mptr points to 0x%lX.\n",(unsigned long) mptr);
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
mptr[0] = 'a';
mptr[1] = 'b';
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
close(fd);
return 0;
}
I think you can find a lot of documentation about the kmalloc + mmap part.
However, I am not sure that you can kmalloc so much memory in a contiguous way, and have it always at the same place. Sure, if everything is always the same, then you might get a constant address. However, each time you change the kernel code, you will get a different address, so I would not go with the kmalloc solution.
I think you should reserve some memory at boot time, i.e. reserve some physical memory so that it is not touched by the kernel. Then you can ioremap this memory, which will give you a kernel virtual address, and then you can mmap it and write a nice device driver.
This takes us back to Linux Device Drivers (available in PDF format). Have a look at chapter 15, which describes this technique on page 443.
Edit: ioremap and mmap.
I think this might be easier to debug in two steps: first get the ioremap right and test it using a character device operation, i.e. read/write. Once you know you can safely access the whole ioremapped memory using read/write, try to mmap the whole ioremapped range.
And if you get into trouble, maybe post another question about mmaping.
Edit : remap_pfn_range
ioremap returns a virtual address, which you must convert to a pfn for remap_pfn_range.
Now, I don't understand exactly what a pfn (Page Frame Number) is, but I think you can get one calling
virt_to_phys(pt) >> PAGE_SHIFT
This is probably not the Right Way (tm) to do it, but you should try it.
You should also check that FOO_BUFFER_OFFSET is the physical address of your RAM block, i.e. that, before anything happens with the MMU, your memory is available at that offset from 0 in the memory map of your processor.
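For reference, a page frame number is simply the physical address divided by the page size, that is, shifted right by PAGE_SHIFT. virt_to_phys() is meant for directly mapped kernel addresses, so it may not give a meaningful result for an ioremap() pointer; since the module already knows the physical address, the pfn can be derived from it directly. The sketch below shows that arithmetic in user space, assuming 4 KiB pages; the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumed: 4 KiB pages */

/* Physical address -> page frame number (expects a page-aligned address). */
static uint64_t demo_phys_to_pfn(uint64_t phys)
{
    assert((phys & ((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1)) == 0);
    return phys >> DEMO_PAGE_SHIFT;
}

/* Page frame number -> physical address of the start of that page. */
static uint64_t demo_pfn_to_phys(uint64_t pfn)
{
    assert(pfn <= (UINT64_MAX >> DEMO_PAGE_SHIFT));
    return pfn << DEMO_PAGE_SHIFT;
}
```

For the 95 MB offset used in the question, this gives pfn 24320 (0x5F00). In the module, the corresponding value would presumably be FOO_BUFFER_OFFSET >> PAGE_SHIFT, passed as the third argument of remap_pfn_range(); that is an assumption based on the code in the question and has not been tested here.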
Sorry to answer but not quite answer, I noticed that you have already edited the question. Please note that SO does not notify us when you edit the question. I'm giving a generic answer here, when you update the question please leave a comment, then I'll edit my answer.
Yes, you're going to need to write a module. What it comes down to is the use of kmalloc() (which allocates physically contiguous kernel memory) or vmalloc() (which allocates kernel memory that is contiguous only in virtual address space).
Exposing the former is easy; exposing the latter can be a pain in the rear with the kind of interface that you are describing as needed. You noted 1.5 MB is a rough estimate of how much you actually need to reserve; is that ironclad? I.e., are you comfortable taking that from kernel space? Can you adequately deal with ENOMEM or EIO from userspace (or even disk sleep)? IOW, what's going into this region?
Also, is concurrency going to be an issue with this? If so, are you going to be using a futex? If the answer to either is 'yes' (especially the latter), it's likely that you'll have to bite the bullet and go with vmalloc() (or risk kernel rot from within). Also, if you are even THINKING about an ioctl() interface to the char device (especially for some ad-hoc locking idea), you really want to go with vmalloc().
Also, have you read this? Plus we aren't even touching on what grsec / selinux is going to think of this (if in use).
/dev/mem is okay for simple register peeks and pokes, but once you cross into interrupts and DMA territory, you really should write a kernel-mode driver. What you did for your previous memory-management-less OSes simply doesn't graft well onto a general-purpose OS like Linux.
You've already thought about the DMA buffer allocation issue. Now, think about the "DMA done" interrupt from your device. How are you going to install an Interrupt Service Routine?
Besides, /dev/mem is typically locked out for non-root users, so it's not very practical for general use. Sure, you could chmod it, but then you've opened a big security hole in the system.
If you are trying to keep the driver code base similar between the OSes, you should consider refactoring it into separate user & kernel mode layers with an IOCTL-like interface in-between. If you write the user-mode portion as a generic library of C code, it should be easy to port between Linux and other OSes. The OS-specific part is the kernel-mode code. (We use this kind of approach for our drivers.)
It seems like you have already concluded that it's time to write a kernel-driver, so you're on the right track. The only advice I can add is to read these books cover-to-cover.
Linux Device Drivers
Understanding the Linux Kernel
(Keep in mind that these books are circa-2005, so the information is a bit dated.)
I am by no means an expert on these matters, so this will be a question to you rather than an answer. Is there any reason you can't just make a small RAM disk partition and use it only for your application? Would that not give you guaranteed access to the same chunk of memory? I'm not sure if there would be any I/O performance issues or additional overhead associated with doing that. This also assumes that you can tell the kernel to partition a specific address range in memory; I'm not sure if that is possible.
I apologize for the newb question, but I found your question interesting, and am curious if ram disk could be used in such a way.
Have you looked at the 'memmap' kernel parameter? On i386 and x86_64, you can use the memmap parameter to define how the kernel will handle very specific blocks of memory (see the Linux kernel parameter documentation). In your case, you'd want to mark memory as 'reserved' so that Linux doesn't touch it at all. Then you can write your code to use that absolute address and size (woe be unto you if you step outside that space).
