how to avoid caching when using mmap() - linux

I'm writing a driver in PetaLinux for a device in my FPGA, and I have implemented the mmap function in order to control the device from user space. My problem is that, even though I'm using
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
in the mmap function and the MAP_SHARED flag in the user application, it seems that caching is still enabled.
The test I did is to write a value (say 5) to a specific register of my mmapped device, which actually stores only the least significant bit of the data coming from the AXI bus. If I read back immediately after the write, I expect to read 1 (this is what happened with a bare-metal application on MicroBlaze); instead I read 5. However, the value is correctly written to the register, because the device does what it is supposed to do.
Thanks in advance.

Based on what was discussed in the question comments, the address pointer being assigned here:
address = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
wasn't declared with the volatile type qualifier, allowing the compiler to make assumptions about the pointed-to memory and potentially optimize away or reorder the read/write operations at compile time.
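A minimal user-space sketch of the fix, assuming a hypothetical /dev/mydevice node and a register at offset 0; the change that matters is the volatile qualifier on the mapped pointer.

/* Sketch only: device path and register offset are made up. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

int main(void)
{
    int fd = open("/dev/mydevice", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    /* volatile forces the compiler to re-read the register from the bus. */
    volatile uint32_t *regs = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 1;

    regs[0] = 5;                          /* write to the register        */
    printf("read back: %u\n", regs[0]);   /* re-read from the device      */

    munmap((void *)regs, PAGE_SIZE);
    close(fd);
    return 0;
}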

Where do FS and GS registers get mapped to in the linear address space?

I understand that in 32 bit you have segments where each segment would map to a base and limit. Therefore, a segment wouldn't be able to access another segments data.
With 64 bit, we throw away most of the segments and have a base of 0 with no limit, thus accessing the entire 64 bit address space. But I get confused when they state we have FS and GS registers for thread local storage and additional data.
If the default segment can access anything in the linear address space, then what is stopping the program from corrupting or accessing the FS/GS segments? The OS would have to keep track of FS/GS and make sure nothing else gets allocated there right? How does this work?
Also, if the default area can access anything, then why do we even have FS/GS. I guess FS makes sense because we can just switch the register during a thread switch. But why even use GS? Why not malloc memory instead? Sorry I am new to OS.
In 64-bit mode, the FS and GS "segment registers" aren't really used, per se. But using an FS or GS segment override for a memory access causes the address to be offset by the value contained in a hidden FSBASE/GSBASE register which can be set by a special instruction (possibly privileged, in which case a system call can ask the kernel to do it). So for instance, if FSBASE = 0x12340000 and RDI = 0x56789, then MOV AL, FS:[RDI] will load from linear address 0x12396789.
This is still just a linear address - it's not separate from the rest of the process's address space in any way, and it's subject to all the same paging and memory protection as any other memory access. The program could get the exact same effect with MOV AL, [0x12396789] (since DS has a base of 0). It is up to the program's usual memory allocation mechanisms to allocate an appropriate block of memory and set FSBASE to point to that block, if it intends to use FS in this way. There are no special protections needed to avoid "corrupting" this memory, any more than they are needed to prevent a program from corrupting any other part of its own memory.
So it doesn't really provide new functionality - it's more a convenience for programmers and compilers. As you say, it's nice for something like a pointer to thread-local storage, but if we didn't have FS/GS, there are plenty of other ways we could keep such a pointer in the thread's state (say, reserving R15). It's true that there's not an obvious general need for two such registers; for the most part, the other one is just there in case someone comes up with a good way to use it in a particular program.
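To make the mechanism concrete, here is a hedged user-space sketch (x86-64 Linux; not from the answer) that sets GSBASE via arch_prctl() and then reads through a GS override. The buffer and offset are arbitrary.

#define _GNU_SOURCE
#include <asm/prctl.h>      /* ARCH_SET_GS */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static long buf[4] = { 11, 22, 33, 44 };

int main(void)
{
    /* Point GSBASE at buf; afterwards GS:[off] is just buf + off. */
    syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)buf);

    long val;
    /* movq %gs:16, %rax -- same linear address as &buf[2] */
    asm volatile("movq %%gs:16, %0" : "=r"(val) : : "memory");

    printf("%ld\n", val);   /* prints 33 */
    return 0;
}

Newer CPUs also have WRFSBASE/WRGSBASE instructions that user space can use directly when the kernel enables them.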
See also:
How are the fs/gs registers used in Linux AMD64?
What is the "FS"/"GS" register intended for?

Can copy_to_user be used for IO memory?

I have a buffer coming in from user space which needs to be filled with device registers as a debugging mechanism. Is it safe to use copy_to_user() / copy_from_user() for device memory? If not, what's the best alternative, given that the device driver lives in kernel space?
All the comments are wrong.
For any data moves between user and kernel space, you have to use copy_from_user()/copy_to_user().
memcpy_fromio()/memcpy_toio() are reserved for addresses in kernel space and MMIO. It's unsafe to use those functions with user-space addresses.
Answer:
You can simply use copy_from_user()/copy_to_user() directly with the mapped MMIO address as the void *to or void *from argument, so you don't need a useless intermediate buffer.
This should only be done with prefetchable memory, since the copy may read/write the same memory several times and/or in an unordered way.
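As a hedged sketch of the pattern this answer describes (the my_dev structure, its regs field and REG_WINDOW are illustrative, not from the post), a debug read handler might look like this. Note that sparse will warn about dropping the __iomem annotation, and the more conventional alternative is to bounce through a kernel buffer with memcpy_fromio().

static ssize_t regs_read(struct file *file, char __user *ubuf,
                         size_t count, loff_t *ppos)
{
    struct my_dev *dev = file->private_data;   /* dev->regs came from ioremap() */

    if (count > REG_WINDOW)
        count = REG_WINDOW;

    /* Device registers -> user buffer, no intermediate kernel buffer. */
    if (copy_to_user(ubuf, (const void __force *)dev->regs, count))
        return -EFAULT;

    return count;
}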

What is the benefit of calling ioread functions when using memory mapped IO

To use memory mapped I/O, we need to first call request_mem_region.
struct resource *request_mem_region(unsigned long start,
                                    unsigned long len,
                                    char *name);
Then, as the kernel runs in a virtual address space, we need to map the physical addresses into it by calling the ioremap function.
void *ioremap(unsigned long phys_addr, unsigned long size);
Then why can't we access the return value directly?
From Linux Device Drivers Book
Once equipped with ioremap (and iounmap), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space. Remember, though, that the addresses returned from ioremap should not be dereferenced directly; instead, accessor functions provided by the kernel should be used.
Can anyone explain the reason behind this or the advantage with accessor functions like ioread32 or iowrite8()?
You need ioread8 / iowrite8 or whatever to at least cast to volatile* to make sure optimization still results in exactly 1 access (not 0 or more than 1). In fact they do more than that: they handle endianness (accessing device memory as little-endian; there's ioread32be for big-endian) and include some compile-time memory-barrier semantics that Linux chooses to build into these functions, and even a runtime barrier after reads, because of DMA. Use the _rep versions to copy a chunk from device memory with only one barrier.
In C, data races are UB (Undefined Behaviour). This means the compiler is allowed to assume that memory accessed through a non-volatile pointer doesn't change between accesses, and that if (x) y = *ptr; can be transformed into tmp = *ptr; if (x) y = tmp; i.e. compile-time speculative loads, if *ptr is known not to fault. (Related: Who's afraid of a big bad optimizing compiler? re: why the Linux kernel needs volatile for rolling its own atomics.)
MMIO registers may have side effects even for reading so you must stop the compiler from doing loads that aren't in the source, and must force it to do all the loads that are in the source exactly once.
Same deal for stores. (Compilers aren't allowed to invent writes even to non-volatile objects, but they can remove dead stores. E.g. *ioreg = 1; *ioreg = 2; would typically compile the same as *ioreg = 2; the first store gets removed as "dead" because it's not considered to have a visible side effect.)
C volatile semantics are ideal for MMIO, but Linux wraps more stuff around them than just volatile.
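A tiny illustrative sketch (not from the answer) of the difference a volatile-qualified pointer makes:

/* With a plain pointer the compiler may merge or drop accesses;
 * volatile forces exactly one access per source-level read/write. */
void kick_device_plain(unsigned int *reg)
{
    *reg = 1;            /* may be removed as a dead store      */
    *reg = 2;
}

void kick_device_mmio(volatile unsigned int *reg)
{
    *reg = 1;            /* both stores reach the bus, in order */
    *reg = 2;
}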
From a quick look after googling ioread8 and poking around in https://elixir.bootlin.com/linux/latest/source/lib/iomap.c#L11 we see that Linux I/O addresses can encode IO address space (port I/O, aka PIO; in / out instructions on x86) vs. memory address space (normal load/store to special addresses). And ioread* functions actually check that and dispatch accordingly.
/*
* Read/write from/to an (offsettable) iomem cookie. It might be a PIO
* access or a MMIO access, these functions don't care. The info is
* encoded in the hardware mapping set up by the mapping functions
* (or the cookie itself, depending on implementation and hw).
*
* The generic routines don't assume any hardware mappings, and just
* encode the PIO/MMIO as part of the cookie. They coldly assume that
* the MMIO IO mappings are not in the low address range.
*
* Architectures for which this is not true can't use this generic
* implementation and should do their own copy.
*/
For example implementation, here's ioread16. (IO_COND is a macro that checks the address against a predefined constant: low addresses are PIO addresses).
unsigned int ioread16(void __iomem *addr)
{
    IO_COND(addr, return inw(port), return readw(addr));
    return 0xffff;
}
What would break if you just cast the ioremap result to volatile uint32_t*?
e.g. if you used READ_ONCE / WRITE_ONCE which just cast to volatile unsigned char* or whatever, and are used for atomic access to shared variables. (In Linux's hand-rolled volatile + inline asm implementation of atomics which it uses instead of C11 _Atomic).
That might actually work on some little-endian ISAs like x86 if compile-time reordering wasn't a problem, but others need more barriers. If you look at the definition of readl (which ioread32 uses for MMIO, as opposed to inl for PIO), it uses barriers around a dereference of a volatile pointer.
(This and the macros it uses are defined in the same io.h, or you can navigate using the LXR links: every identifier is a hyperlink.)
static inline u32 readl(const volatile void __iomem *addr)
{
    u32 val;

    __io_br();
    val = __le32_to_cpu(__raw_readl(addr));
    __io_ar(val);
    return val;
}
The generic __raw_readl is just the volatile dereference; some ISAs may provide their own.
__io_ar() uses rmb() or barrier() After Read. /* prevent prefetching of coherent DMA data ahead of a dma-complete */. The Before Read barrier is just barrier() - blocking compile-time reordering without asm instructions.
Old answer to the wrong question: the text below answers why you need to call ioremap.
Because it's a physical address and kernel memory isn't identity-mapped (virt = phys) to physical addresses.
And returning a virtual address isn't an option: not all systems have enough virtual address space to even direct-map all of physical address space as a contiguous range of virtual addresses. (But when there is enough space, Linux does do this; e.g. x86-64 Linux's virtual address-space layout is documented in x86_64/mm.txt.)
This is notably the case for 32-bit x86 kernels on systems with more than 1 or 2 GB of RAM (depending on how the kernel is configured: 2:2 or 1:3 kernel:user split of virtual address space). With PAE for a 36-bit physical address space, a 32-bit x86 kernel can use much more physical memory than it can map at once. (This is pretty horrible and makes life difficult for a kernel: some random blog reposted Linus Torvalds's comments about how PAE really, really sucks.)
Other ISAs may have this too, and IDK what Alpha does about IO memory when byte accesses are needed; maybe the region of physical address space that maps word loads/stores to byte loads/stores is handled earlier so you request the right physical address. (http://www.tldp.org/HOWTO/Alpha-HOWTO-8.html)
But 32-bit x86 PAE is obviously an ISA that Linux cares a lot about, even quite early in the history of Linux.
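To tie the pieces of the question together, here is a minimal probe-time sketch; the base address, size, register offset and names are made up for illustration.

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/ioport.h>
#include <linux/printk.h>

#define MYDEV_BASE  0x43c00000   /* example physical address */
#define MYDEV_SIZE  0x1000
#define MYDEV_CTRL  0x00         /* example register offset  */

static void __iomem *regs;

static int mydev_setup(void)
{
    if (!request_mem_region(MYDEV_BASE, MYDEV_SIZE, "mydev"))
        return -EBUSY;

    regs = ioremap(MYDEV_BASE, MYDEV_SIZE);
    if (!regs) {
        release_mem_region(MYDEV_BASE, MYDEV_SIZE);
        return -ENOMEM;
    }

    /* Accessor functions, not a plain dereference of regs. */
    iowrite32(0x1, regs + MYDEV_CTRL);
    pr_info("ctrl = %#x\n", ioread32(regs + MYDEV_CTRL));
    return 0;
}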

Linux: Managing virtual memory mapping within my process for fast emulation

Recently it occurred to me that a lot of emulators are slow because they have to simulate not just the CPU but also the memory of the emulated device. When the device has memory-mapped I/O, virtual memory, or just unused address space, then every memory access has to be simulated in software.
I feel like it might be a lot faster if the OS did this for us, by means of virtual memory. I'll use Game Boy emulation as an example for simplicity's sake but obviously this method would be better for newer, more powerful machines.
The Game Boy memory map is roughly:
0x0000 - 0x7FFF: Mapped to cartridge ROM
Most cartridges have 0x0000 - 0x3FFF fixed and 0x4000 - 0x7FFF bank-switchable by writing to 0x2000
0x8000 - 0x9FFF: Video RAM (only accessible when not currently rendering)
0xA000 - 0xBFFF: Mapped to cartridge (usually battery-backed RAM)
0xC000 - 0xDFFF: Internal RAM (0xD000 - 0xDFFF is bankswitched on GB Color)
0xE000 - 0xFDFF: Mirror of internal RAM
0xFE00 - 0xFE9F: Object Attribute Memory (sprite RAM)
0xFEA0 - 0xFEFF: Unmapped (open bus or something, unsure)
0xFF00 - 0xFF7F: Memory-mapped I/O (sound system, video control, etc)
0xFF80 - 0xFFFF: Internal RAM (HRAM)
So a traditional emulator has to translate every memory access with something like:
if(addr < 0x4000) return rom[addr];
else if(addr < 0x8000) return rom[(addr - 0x4000) + (0x4000 * cur_rom_bank)];
else if(addr < 0xA000) {
    if(vram_accessible) return vram[addr - 0x8000];
    else return 0xFF;
}
else if(addr < 0xC000) return saveram[addr - 0xA000];
else if(addr < 0xE000) return ram[addr - 0xC000];
else if(addr < 0xFE00) return ram[addr - 0xE000];
else if(addr < 0xFEA0) return oam[addr - 0xFE00];
else if(addr < 0xFF00) return 0xFF; //or whatever should be here
else if(addr < 0xFF80) return handle_io_read(addr);
else return hram[addr - 0xFF80];
Obviously that can be optimized by using a switch or table, but still it's a lot of code to run for every memory access. We could potentially improve the emulation speed quite a bit by mapping some pages to those addresses in our process's memory map:
0x0000 - 0x3FFF: R-- (no Exec flag because native CPU doesn't execute it)
0x4000 - 0x7FFF: R--
0x8000 - 0x9FFF: ---
0xA000 - 0xBFFF: ---
0xC000 - 0xDFFF: RW-
0xE000 - 0xFDFF: RW- (and mapped to same physical page as 0xC000 - 0xDFFF)
0xFE00 - 0xFE9F: ---
0xFEA0 - 0xFEFF: ---
0xFF00 - 0xFF7F: ---
0xFF80 - 0xFFFF: RW-
Then handle the SIGSEGV (or whatever signal would be generated) we get when accessing those pages. So a read from ROM or a write to RAM can just be performed directly, and a write to ROM will raise an exception which we can handle. We can change the permissions of VRAM (0x8000 - 0x9FFF) to be RW- when it should be accessible and --- when it shouldn't. In theory it could be much faster since it doesn't require the emulator to manually map every memory access in software.
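A minimal sketch of such a handler, with assumed names (guest_base, handle_io_write) that aren't part of the original post; a real implementation also has to decode or patch the faulting instruction before returning.

#include <signal.h>
#include <stdint.h>

extern uint8_t *guest_base;                     /* assumed mmap()ed guest memory */
extern void handle_io_write(uint16_t gb_addr);  /* assumed emulator hook         */

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t fault = (uintptr_t)info->si_addr;
    uint16_t gb_addr = (uint16_t)(fault - (uintptr_t)guest_base);

    /* Forward the access to the slow path; a real handler must also
     * emulate or skip the faulting instruction (via the ucontext in
     * ctx) before returning, or it will fault again immediately.     */
    handle_io_write(gb_addr);
    (void)sig; (void)ctx;
}

static void install_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}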
I know that I can use mmap() to map pages at fixed addresses with various permissions. What I don't know is:
Can the mappings overlap, with different permissions?
Can I map pages to arbitrary addresses like this, regardless of the system's page size? Can I map to address 0?
How to change which memory a mapping points to? (eg when ROM bank is changed, we can just switch what memory is mapped at 0x4000 - 0x7FFF, but how do I do that?)
In a real-world case where the emulated system has a 32- or 64-bit CPU, can I map the entire first 4GB, or potentially the entire memory space? How would I avoid conflicting with whatever is already mapped (eg libraries, my stack, the kernel)?
Would this really be any faster? Or does throwing and catching a SIGSEGV generate more overhead than doing it the traditional way?
If it's not possible to do this in userspace, does Linux maybe provide a way to "take over" the kernel and do it there? So I could at least create an "emulator OS" which runs bare-metal while still having some Linux kernel facilities (such as video and filesystem drivers) available?
I'd expect generating a SIGSEGV, catching it, handling it, and resuming would have more perf overhead than on the original hardware, so arrange for it to only happen when there's actually an error, where being slow is acceptable.
This is a nice technique for memory protection / array bounds checking when violations are rare and it's OK for them to be slow. Speeding up the common case a bit, even if it makes the exceptional case much slower, is a win when the exceptional case doesn't happen in normal emulated code.
I've heard of Javascript emulators doing this to get cheaper array bounds checking: allocate an array so it ends at the top of a page, where the next page is unmapped.
Take this with a grain of salt: I haven't used any of this in code I've written. I've just heard about it and think I understand how it works and some of the implications.
Hopefully this will get you started looking at docs that will tell you what actually can be done.
Updating page tables is fairly slow. Try to find a balance where you can take advantage of user-space memory protection for some of the checks, but you aren't constantly mapping/unmapping pages from your memory space during the "common case" of what your emulated code does. Predicted branches run really fast, esp. if they're predicted not taken.
I've seen Linux kernel discussion / notes indicating that playing tricks with mmap isn't worth it over just memcpy of a single page. For larger blocks of memory, or less checking on repeated accesses, the benefit will outweigh the setup overhead.
You'll want to use mprotect(2) to change the permissions on (ranges of) pages. No, mappings can't overlap. See the MAP_FIXED option in mmap(2):
If the memory region specified by addr and len overlaps pages of any
existing mapping(s), then the overlapped part of the existing
mapping(s) will be discarded.
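As a hedged sketch of the mprotect() usage described here (guest_base is an assumed pointer to the mmap()ed guest address space), toggling the VRAM window between accessible and inaccessible could look like this; the window 0x8000-0x9FFF is 8 KiB and page aligned.

#include <stdint.h>
#include <sys/mman.h>

static void set_vram_accessible(uint8_t *guest_base, int accessible)
{
    /* Toggle guest 0x8000-0x9FFF between RW- and --- */
    mprotect(guest_base + 0x8000, 0x2000,
             accessible ? (PROT_READ | PROT_WRITE) : PROT_NONE);
}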
IDK if you can do anything useful with x86 segment registers when accessing emulated memory, to map guest address 0 to some other address in your process's virtual address space. You can map virtual address 0, but by default Linux disables it so that NULL-pointer dereferences don't silently work!
Users of your software will have to futz with sysctl (same as for WINE) to enable it:
# Ubuntu's /etc/sysctl.d/10-zeropage.conf
# Protect the zero page of memory from userspace mmap to prevent kernel
# NULL-dereference attacks against potential future kernel security
# vulnerabilities. (Added in kernel 2.6.23.)
#
# While this default is built into the Ubuntu kernel, there is no way to
# restore the kernel default if the value is changed during runtime; for
# example via package removal (e.g. wine, dosemu). Therefore, this value
# is reset to the secure default each time the sysctl values are loaded.
vm.mmap_min_addr = 65536
Like I said, you can maybe use a segment register override on all loads/stores into guest (emulated-machine) memory, to remap it to a more reasonable page. Or maybe just use a constant offset of 64kiB (or more, to maybe put it above the text/data/bss (heap) of the emulation software). Or a non-constant offset using a pointer to the base of your mmapped guest-memory region, so everything is relative to a global variable. With gcc, this might be a good candidate for requesting that gcc keep that global in a register across all your functions. IDK, you'd have to see if that helped perf or not. A constant offset would end up making every instruction accessing guest memory need a 32b displacement field in the addressing mode, rather than 0 or 8b.
A segment register, if it works the way I think it does (as a constant offset you can apply with a segment-override prefix, instead of a 32b displacement modifier) would be much harder to get the compiler to generate, AFAIK. If it was just loads/stores, that would be one thing: you could use an inline asm wrapper for a load and store insn. But for efficient x86 code, all kinds of ALU instructions should use memory operands to reduce frontend bottlenecks via micro-fusion.
You could maybe just define a global char *const guest_mem = (void*)0x2000000; or something, and then use mmap with MAP_FIXED to force mapping memory there? Then guest memory accesses can compile to more efficient one-register addressing modes.
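Along those lines, a hedged sketch (the fixed base, cart_fd descriptor and 16 KiB bank size are illustrative): since MAP_FIXED discards whatever overlapped the range, switching ROM banks is just another mmap() over the 0x4000-0x7FFF window.

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

static uint8_t *const guest_mem = (uint8_t *)0x2000000;

static void map_rom_bank(int cart_fd, unsigned bank)
{
    /* Replace the mapping at guest 0x4000-0x7FFF with the requested
     * 16 KiB bank of the cartridge image.                            */
    mmap(guest_mem + 0x4000, 0x4000, PROT_READ,
         MAP_PRIVATE | MAP_FIXED, cart_fd, (off_t)bank * 0x4000);
}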
General stuff
The Dolphin emulator has a feature called fastmem. AFAIU, the code blocks are JITed under the assumption that memory accesses go to ordinary memory. If at some point an instruction turns out to access hardware (MMIO) memory, the instruction is patched to use a slow (memory) path instead. This is triggered by a segfault which is handled by the emulator:
a trampoline calling the suitable (slow memory path) code is generated;
the existing instruction is patched and replaced with a jump to this trampoline.
Some references:
HandleFault() handles the segfaults;
BackPatch() patches the existing code;
GenerateWriteTrampoline() and GenerateReadTrampoline() generate the trampolines to Read_U64(), Write_U64(), etc.
This is somewhat similar to what you are describing, but the JIT/patching can amortise the cost of the page faults (because generating a page fault each time an instruction accesses a hardware address would be inefficient).
By the way, you might be interested in how the emulated memory is managed. See MemoryMap_Setup().
Answers to your questions
Can the mappings overlap, with different permissions?
If you mmap something with MAP_FIXED that overlaps a previous VMA, the overlapping part of the old VMA is replaced with the new one.
Can I map pages to arbitrary addresses like this,
regardless of the system's page size?
No, VMAs are always aligned to page boundaries (4 KiB on x86 and x86_64). If you are mapping a file or shared memory, there is an alignment constraint on the offset as well.
Can I map to address 0?
By default, Linux does not let you do this (see vm.mmap_min_addr above).
In a real-world case where the emulated system has a 32- or 64-bit CPU, can I map the entire first 4GB, or potentially the entire memory space?
You cannot map the entire address space. AFAIU, what Dolphin does is map the emulated 32 bit address space at a fixed offset of the native 64 bit address space.
How would I avoid conflicting with whatever is already mapped (eg
libraries, my stack, the kernel)?
Having an address space larger than the emulated one helps with that.
If it's not possible to do this in userspace, does Linux
maybe provide a way to "take over" the kernel and do it there?
So I could at least create an "emulator OS" which runs bare-metal
while still having some Linux kernel facilities
(such as video and filesystem drivers) available?
If you're trying to emulate the native CPU, you could use a virtualisation technology (such as KVM).

How does copy_from_user from the Linux kernel work internally?

How exactly does the copy_from_user() function work internally? Does it use any buffers, or is there any memory mapping done, given that the kernel has the privilege to access the user memory space?
The implementation of copy_from_user() is highly dependent on the architecture.
On x86 and x86-64, it simply does a direct read from the userspace address and write to the kernelspace address, while temporarily disabling SMAP (Supervisor Mode Access Prevention) if it is configured. The tricky part is that the copy_from_user() code is placed into a special region so that the page fault handler can recognise when a fault occurs within it. A memory protection fault that occurs in copy_from_user() doesn't kill the process like it would if it were triggered by any other process-context code, or panic the kernel like it would if it occurred in interrupt context - it simply resumes execution in a code path which returns -EFAULT to the caller.
Regarding "how about copy_to_user - since the kernel is passing a kernel-space address, how can a user-space process access it?":
A user-space process can attempt to access any address. However, if the address is not mapped in that process's user space (i.e. in the page tables of that process), or if there is a problem with the access, like a write attempt to a read-only location, then a page fault is generated. Note that at least on x86, every process has all of kernel space mapped into the highest 1 gigabyte of that process's virtual address space, while the lower 3 gigabytes of the 4 GB total address space (I'm using the classic 32-bit case here) are used for the process text (i.e. code) and data.
A copy to or from user space is executed by the kernel code that is executing on behalf of the process and actually it's the memory mapping (i.e. page tables) of that process that are in-use during the copy. This takes place while execution is in kernel mode - i.e. privileged/supervisor mode in x86 language.
Assuming the user-space code has passed a legitimate target location (i.e. an address properly mapped in that process's address space) for the data to be copied to, copy_to_user(), run from kernel context, will normally be able to write to that address/region without problems, and after control returns to the user, user space can also read from this location, which the process set up itself to begin with.
More interesting details can be found in chapters 9 and 10 of Understanding the Linux Kernel, 3rd Edition, by Daniel P. Bovet and Marco Cesati. In particular, access_ok() is a necessary but not sufficient validity check. The user can still pass addresses that don't belong to the process address space. In this case, a page fault exception will occur while the kernel code is executing the copy. The most interesting part is how the kernel page fault handler determines that the page fault in this case is not due to a bug in the kernel code but rather a bad address from the user (especially if the kernel code in question is from a loaded kernel module).
The best answer has something wrong: copy_(from|to)_user can't be used in interrupt context, because they may sleep; they can only be used in process context.
The process's page tables include all the information that the kernel needs to access user memory, so the kernel can access a user-space address directly as long as the addressed page is resident in memory. Use the copy_(from|to)_user functions because they check this for us, and if the user-space page is not resident, they fault it in for us directly.
copy_from_user() is not a system call but a kernel helper; it copies data between two buffers in different address spaces:
The user-space buffer in user virtual address space.
The kernel-space buffer in kernel virtual address space.
When copy_from_user() is invoked, data is copied from the user buffer to the kernel buffer.
Part of a character device driver's write handler, where copy_from_user() is used, is given below:
ssize_t cdev_fops_write(struct file *filp, const char __user *ubuf,
                        size_t count, loff_t *f_pos)
{
    unsigned int kbuf = 0;

    /* Copy from the user-space buffer into a kernel-space variable. */
    if (copy_from_user(&kbuf, ubuf, min(count, sizeof(kbuf))))
        return -EFAULT;

    printk(KERN_INFO "Data: %u\n", kbuf);
    return count;
}

Resources