Linux Kernel: copy_from_user - struct with pointers

Linux Kernel: copy_from_user - struct with pointers - linux

I've implemented some kind of character device and I need help with copy_ from_user function.
I've a structure:
struct my_struct{
int a;
int *b;
};
I initialize it in user space and pass pointer to my_struct to my char device using 'write' function. In Kernel's Space character device 'write' function I cast it from a *char to this kind of structure. I alloc some memory for a struct using kmalloc and do copy_from_user into it.
It's fine for simple 'int a', but it copies only pointer (address) of b value, not value pointed by b, so I'm now in Kernel Space and I'm working with a pointer that points to a user space memory. Is this incorrect and I shouldn't access to user space pointer directly and I have to copy_from_user every single pointer in my struct and then copy back every pointer in "read" function using copy_to_user function?

You must always use copy_from_user and similar to access user space memory from kernel space, regardless of how you got the pointer. Since b is a pointer to user space memory, you must use copy_from_user to access it.
These functions do two important additional tasks:
They make sure the pointer points into user space and not kernel space. Without this check, user space programs might be able to read or write to kernel memory, bypassing normal security.
They handle page faults correctly. Normally a page fault in kernel mode will result in an OOPS or panic - the copy_*_user family of functions have a special override that tells the PF handler that all is well, and the fault should be handled normally; and in the event that the fault cannot be satisfied by IO (ie, what would normally cause a SIGSEGV or SIGBUS), return an error code instead so their caller can do any necessary cleanup before returning to userspace with -EFAULT.

You are correct in your surmising. If you need to access the value *b, you will need to use copy_from_user (and copy_to_user to update it back in the user process).

Related

What is the benefit of calling ioread functions when using memory mapped IO

To use memory mapped I/O, we need to first call request_mem_region.
struct resource *request_mem_region(
unsigned long start,
unsigned long len,
char *name);
Then, as kernel is running in virtual address space, we need to map physical addresses to virtual address space by running ioremap function.
void *ioremap(unsigned long phys_addr, unsigned long size);
Then why can't we access the return value directly.
From Linux Device Drivers Book
Once equipped with ioremap (and iounmap), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space. Remember, though, that the addresses returned from ioremap should not be dereferenced directly; instead, accessor functions provided by the kernel should be used.
Can anyone explain the reason behind this or the advantage with accessor functions like ioread32 or iowrite8()?

You need ioread8 / iowrite8 or whatever to at least cast to volatile* to make sure optimization still results in exactly 1 access (not 0 or more than 1). In fact they do more than that, handling endianness (They also handle endianness, accessing device memory as little-endian. Or ioread32be for big-endian) and some compile-time reordering memory-barrier semantics that Linux chooses to include in these functions. And even a runtime barrier after reads, because of DMA. Use the _rep version to copy a chunk from device memory with only one barrier.
In C, data races are UB (Undefined Behaviour). This means the compiler is allowed to assume that memory accessed through a non-volatile pointer doesn't change between accesses. And that if (x) y = *ptr; can be transformed into tmp = *ptr; if (x) y = tmp; i.e. compile-time speculative loads, if *ptr is known to not fault. (Related: Who's afraid of a big bad optimizing compiler? re: why the Linux kernel need volatile for rolling its own atomics.)
MMIO registers may have side effects even for reading so you must stop the compiler from doing loads that aren't in the source, and must force it to do all the loads that are in the source exactly once.
Same deal for stores. (Compilers aren't allowed to invent writes even to non-volatile objects, but they can remove dead stores. e.g. *ioreg = 1; *ioreg = 2; would typically compile the same as *ioreg = 2; The first store gets removed as "dead" because it's not considered to have a visible side effect.
C volatile semantics are ideal for MMIO, but Linux wraps more stuff around them than just volatile.
From a quick look after googling ioread8 and poking around in https://elixir.bootlin.com/linux/latest/source/lib/iomap.c#L11 we see that Linux I/O addresses can encode IO address space (port I/O, aka PIO; in / out instructions on x86) vs. memory address space (normal load/store to special addresses). And ioread* functions actually check that and dispatch accordingly.
/*
* Read/write from/to an (offsettable) iomem cookie. It might be a PIO
* access or a MMIO access, these functions don't care. The info is
* encoded in the hardware mapping set up by the mapping functions
* (or the cookie itself, depending on implementation and hw).
*
* The generic routines don't assume any hardware mappings, and just
* encode the PIO/MMIO as part of the cookie. They coldly assume that
* the MMIO IO mappings are not in the low address range.
*
* Architectures for which this is not true can't use this generic
* implementation and should do their own copy.
*/
For example implementation, here's ioread16. (IO_COND is a macro that checks the address against a predefined constant: low addresses are PIO addresses).
unsigned int ioread16(void __iomem *addr)
{
IO_COND(addr, return inw(port), return readw(addr));
return 0xffff;
}
What would break if you just cast the ioremap result to volatile uint32_t*?
e.g. if you used READ_ONCE / WRITE_ONCE which just cast to volatile unsigned char* or whatever, and are used for atomic access to shared variables. (In Linux's hand-rolled volatile + inline asm implementation of atomics which it uses instead of C11 _Atomic).
That might actually work on some little-endian ISAs like x86 if compile-time reordering wasn't a problem, but others need more barriers. If you look at the definition of readl (which ioread32 uses for MMIO, as opposed to inl for PIO), it uses barriers around a dereference of a volatile pointer.
(This and the macros this uses are defined in the same io.h as this, or you can navigate using the LXR links: every identifier is a hyperlink.)
static inline u32 readl(const volatile void __iomem *addr) {
u32 val;
__io_br();
val = __le32_to_cpu(__raw_readl(addr));
__io_ar(val);
return val;
}
The generic __raw_readl is just the volatile dereference; some ISAs may provide their own.
__io_ar() uses rmb() or barrier() After Read. /* prevent prefetching of coherent DMA data ahead of a dma-complete */. The Before Read barrier is just barrier() - blocking compile-time reordering without asm instructions.
Old answer to the wrong question: the text below answers why you need to call ioremap.
Because it's a physical address and kernel memory isn't identity-mapped (virt = phys) to physical addresses.
And returning a virtual address isn't an option: not all systems have enough virtual address space to even direct-map all of physical address space as a contiguous range of virtual addresses. (But when there is enough space, Linux does do this, e.g. x86-64 Linux's virtual address-space layout is documented in x86_64/mm.txt
Notably 32-bit x86 kernels on systems with more than 1 or 2GB of RAM (depending on how the kernel is configured: 2:2 or 1:3 kernel:user split of virtual address space). With PAE for 36-bit physical address space, a 32-bit x86 kernel can use much more physical memory than it can map at once. (This is pretty horrible and makes life difficult for a kernel: some random blog reposed Linus Torvald's comments about how PAE really really sucks.)
Other ISAs may have this too, and IDK what Alpha does about IO memory when byte accesses are needed; maybe the region of physical address space that maps word loads/stores to byte loads/stores is handled earlier so you request the right physical address. (http://www.tldp.org/HOWTO/Alpha-HOWTO-8.html)
But 32-bit x86 PAE is obviously an ISA that Linux cares a lot about, even quite early in the history of Linux.

Null pointer dereference in User space and kernel space

what will happen if we dereference the null pointer in user space and kernel space?
From my understanding the behaviour is based on compiler,architecture,etc.
but in general for every user space program allocated with virtual memory and the paging is used to translate the virtual address to physical address using page tables.
so if we are dereferencing null pointer in user space,that address is invalid so the context switch will happen and in kernel based on the interrupt for this null pointer dereference 'Segmentation fault will come or page fault error will come'.
In Kernel space:
If we dereference the NULL pointer there is a possibility of crashing the system or kernel may not able to return from that call.
Is my understanding correct?or any other informations missing means please explain.
Ref:I have understood from this "What happens in OS when we dereference a NULL pointer in C?"

The kernel maps the page at virtual address 0 into all processes with no permission bits set. When you try to access that page, you get a page fault. The kernel routine that handles it issues a SIGSEGV signal to your process. If you have no handler for SIGSEGV registered, core is dumped and you see your "Segmentation fault" message.
Kernel side, things are a bit different. After all, the kernel is supposed to be robust:
If the dereference happens and recovery is possible (e.g. your trackpad driver did the offence), a kernel oops is generated. The kernel continues running (for now).
If the dereference occurs so that no recovery is possible, the Oops leads to a kernel panic. Reboot necessary.
If for some reason, there is data mapped at page zero, you will corrupt memory. Which could lead to a panic down the way, go unnoticed or even be abused for a privilege escalation attack.

What kind of PC address could be in memory other than function entry point and return address?

The program counter (PC) value/address (like the value in %RIP on x86_64) could be in memory for some reasons (direct call/jump instructions in .text segment are not considered):
Function pointers.
Function return address in stack.
Jump table (switch/case in C, method dispatch in C++).
Context restoration: setjmp/longjmp, C++ exception handling, ptrace, getcontext, siglongjmp, signal delivery/sigreturn.
In kernel,
Interrupt return address.
Page fault handler, restartable atomic sequences (RAS) registry.
Thread/process context structure, like Task state segment (TSS), task_struct.
What other possible usage for PC address stored in memory, other than function entry pointer (e.g., function pointer), for different operating systems, on AMD64/i386/ARM?
Also, is there any form of encodings for these addresses?
For example, in jump tables, on x86_64, compiler could use 32-bit offset instead of 64-bit full address. In glibc's setjmp/longjmp jmp_buf, they use a special encoding (PTR_MANGLE) for the saved register values.

How does copy_from_user from the Linux kernel work internally?

How exactly does the copy_from_user() function work internally? Does it use any buffers or is there any memory mapping done, considering the fact that kernel does have the privilege to access the user memory space?

The implementation of copy_from_user() is highly dependent on the architecture.
On x86 and x86-64, it simply does a direct read from the userspace address and write to the kernelspace address, while temporarily disabling SMAP (Supervisor Mode Access Prevention) if it is configured. The tricky part of it is that the copy_from_user() code is placed into a special region so that the page fault handler can recognise when a fault occurs within it. A memory protection fault that occurs in copy_from_user() doesn't kill the process like it would if it is triggered by any other process-context code, or panic the kernel like it would if it occured in interrupt context - it simply resumes execution in a code path which returns -EFAULT to the caller.

regarding "how bout copy_to_user since the kernel is passing on the kernel space address,how can a user space process access it"
A user space process can attempt to access any address. However, if the address is not mapped in that process user space (i.e. in the page tables of that process) or if there is a problem with the access like a write attempt to a read-only location, then a page fault is generated. Note that at least on the x86, every process has all the kernel space mapped in the lowest 1 gigabyte of that process's virtual address space, while the 3 upper gigabytes of the 4GB total address space (I'm using here the 32-bit classic case) are used for the process text (i.e. code) and data.
A copy to or from user space is executed by the kernel code that is executing on behalf of the process and actually it's the memory mapping (i.e. page tables) of that process that are in-use during the copy. This takes place while execution is in kernel mode - i.e. privileged/supervisor mode in x86 language.
Assuming the user-space code has passed a legitimate target location (i.e. an address properly mapped in that process address space) to have data copied to, copy_to_user, run from kernel context would be able to normally write to that address/region w/out problems and after the control returns to the user, user space also can read from this location setup by the process itself to start with.
More interesting details can be found in chapters 9 and 10 of Understanding the Linux Kernel, 3rd Edition, By Daniel P. Bovet, Marco Cesati. In particular, access_ok() is a necessary but not sufficient validity check. The user can still pass addresses not belong to the process address space. In this case, a Page Fault exception will occur while the kernel code is executing the copy. The most interesting part is how the kernel page fault handler determines that the page fault in such case is not due to a bug in the kernel code but rather a bad address from the user (especially if the kernel code in question is from a kernel module loaded).

The best answer has something wrong, copy_(from|to)_user can't be used in interrupt context, they may sleep, copy_(from|to)_user function can only be used in process context,
the process's page table include all the information that kernel need to access it, so kernel can direct access the user space address if we can make sure the page addressed is in memory, use copy_(from|to)_user function, because they can check it for us and if the user space addressed page is not resident, it will fix it for us directly.

The implementation of copy_from_user() system call is done using two buffers from different address spaces:
The user-space buffer in user virtual address space.
The kernel-space buffer in kernel virtual address space.
When the copy_from_user() system call is invoked, data is copied from user buffer to kernel buffer.
A part (write operation) of character device driver code where copy_from_user() is used is given below:
ssize_t cdev_fops_write(struct file *flip, const char __user *ubuf,
size_t count, loff_t *f_pos)
{
unsigned int *kbuf;
copy_from_user(kbuf, ubuf, count);
printk(KERN_INFO "Data: %d",*kbuf);
}

Allocating memory for user space from kernel thread

My question is about passing data from kernel to a user space program. I want to implement a system call "get_data(size, char *buff, char **meta_buf)". In this call, buff is allocated by user space program and its length is passed in the size argument. However, meta_buf is a variable length buffer that is allocated (in the user space program's vm pages) and filled by kernel. User space program will free this region.
(I cant allocate data in user space as user space program does not know size of the meta_buff. Also, user space program cannot allocate a fixed amount of memory and call the system call again and again to read the entire meta data. meta_data has to be returned in one single system call)
How do I allocate memory for a user space program from kernel thread?
(I would even appreciate if you can point me to any other system call that does a similar operation - allocating in kernel and freeing in user space)
Is this interface right or is there a better way to do it?

Don't attempt to allocate memory for userspace from the kernel - this is a huge violation of the kernel's abstraction layering. Instead, consider a few other options:
Have userspace ask how much space it needs. Userspace allocates, then grabs the memory from the kernel.
Have userspace mmap pages owned by your driver directly into its address space.
Set an upper bound on the amount of data needed. Just allocate that much.
It's hard to say more without knowing why this has to be atomic. Actually allocating memory's going to need to be interruptible anyway (or you're unlikely to succeed), so it's unlikely that going out of and back into the kernel is going to hurt much. In fact, any write to userspace memory must be interruptible, as there's the potential for page faults requiring IO.

Call vm_mmap() to get a address in user space.
Call find_vma() to find the vma corresponding to this address.
Call virt_to_phys() to get the physical address of kernel memory.
Call remap_pfn_range() map the physical address to the vma.
There is a point to take care, that is, the mapped address should be aligned to one page size. If not, you should ajust it, and add the offset after you get the user space address.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string