What is the time complexity of mmap in Linux?

In big-O notation, I guess, with respect to the size of the memory requested. Also, can we assume the memory is not committed lazily? Lazy commitment makes the analysis complicated.
To be precise, for the call mmap(0, n, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0), where n is a variable.

This reference states that
MAP_ANONYMOUS initializes the region to zeros.
I believe that makes the process O(n), but it may still be more efficient than malloc:
On some systems using private anonymous mmaps is more efficient than using malloc for large blocks. This is not an issue with the GNU C Library, as the included malloc automatically uses mmap where appropriate.
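As a rough way to see this for yourself (a measurement sketch I'm adding, not from the original post), you can time the syscall separately from the first touch of each page; with lazy commitment the O(n) zeroing cost shows up at first touch rather than in mmap() itself:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    for (size_t n = 1 << 16; n <= (size_t)1 << 28; n <<= 4) {
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        void *p = mmap(0, n, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 1, n);                 /* commit every page */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("n = %9zu: mmap %8.0f ns, first touch %10.0f ns\n",
               n, elapsed_ns(t0, t1), elapsed_ns(t1, t2));
        munmap(p, n);
    }
    return 0;
}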

Related

What is the benefit of calling ioread functions when using memory mapped IO

To use memory-mapped I/O, we first need to call request_mem_region.
struct resource *request_mem_region(unsigned long start,
                                    unsigned long len,
                                    char *name);
Then, as the kernel runs in a virtual address space, we need to map physical addresses into virtual address space by calling the ioremap function.
void *ioremap(unsigned long phys_addr, unsigned long size);
Then why can't we access the return value directly?
From Linux Device Drivers Book
Once equipped with ioremap (and iounmap), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space. Remember, though, that the addresses returned from ioremap should not be dereferenced directly; instead, accessor functions provided by the kernel should be used.
Can anyone explain the reason behind this or the advantage with accessor functions like ioread32 or iowrite8()?
You need ioread8 / iowrite8 (or whichever width) to at least cast to volatile* to make sure optimization still results in exactly 1 access (not 0 or more than 1). In fact they do more than that: they handle endianness, accessing device memory as little-endian (there's ioread32be for big-endian), and they include some compile-time reordering memory-barrier semantics that Linux chooses to build into these functions, and even a runtime barrier after reads, because of DMA. Use the _rep version to copy a chunk from device memory with only one barrier.
In C, data races are UB (Undefined Behaviour). This means the compiler is allowed to assume that memory accessed through a non-volatile pointer doesn't change between accesses, and that if (x) y = *ptr; can be transformed into tmp = *ptr; if (x) y = tmp; i.e. a compile-time speculative load, if *ptr is known not to fault. (Related: Who's afraid of a big bad optimizing compiler? re: why the Linux kernel needs volatile for rolling its own atomics.)
MMIO registers may have side effects even for reading so you must stop the compiler from doing loads that aren't in the source, and must force it to do all the loads that are in the source exactly once.
Same deal for stores. (Compilers aren't allowed to invent writes even to non-volatile objects, but they can remove dead stores: e.g. *ioreg = 1; *ioreg = 2; would typically compile the same as *ioreg = 2; alone. The first store gets removed as "dead" because it's not considered to have a visible side effect.)
C volatile semantics are ideal for MMIO, but Linux wraps more stuff around them than just volatile.
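As a minimal sketch of what the volatile cast alone buys you (hypothetical helper names, not the kernel's implementation; readl()/ioread32() add endianness handling and barriers on top of this):

#include <stdint.h>

static inline uint32_t mmio_read32(const volatile uint32_t *reg)
{
    return *reg;         /* exactly one 32-bit load, in program order */
}

static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;          /* exactly one 32-bit store, never elided */
}

With these, mmio_write32(reg, 1); mmio_write32(reg, 2); emits both stores, whereas plain pointer stores could legally be collapsed into one.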
From a quick look after googling ioread8 and poking around in https://elixir.bootlin.com/linux/latest/source/lib/iomap.c#L11, we see that Linux I/O addresses can encode I/O address space (port I/O, aka PIO; in / out instructions on x86) vs. memory address space (normal load/store to special addresses), and the ioread* functions actually check that and dispatch accordingly.
/*
* Read/write from/to an (offsettable) iomem cookie. It might be a PIO
* access or a MMIO access, these functions don't care. The info is
* encoded in the hardware mapping set up by the mapping functions
* (or the cookie itself, depending on implementation and hw).
*
* The generic routines don't assume any hardware mappings, and just
* encode the PIO/MMIO as part of the cookie. They coldly assume that
* the MMIO IO mappings are not in the low address range.
*
* Architectures for which this is not true can't use this generic
* implementation and should do their own copy.
*/
For an example implementation, here's ioread16. (IO_COND is a macro that checks the address against a predefined constant: low addresses are PIO addresses.)
unsigned int ioread16(void __iomem *addr)
{
    IO_COND(addr, return inw(port), return readw(addr));
    return 0xffff;
}
What would break if you just cast the ioremap result to volatile uint32_t*?
e.g. if you used READ_ONCE / WRITE_ONCE, which just cast to volatile unsigned char* or whatever, and are used for atomic access to shared variables (in Linux's hand-rolled volatile + inline-asm implementation of atomics, which it uses instead of C11 _Atomic).
That might actually work on some little-endian ISAs like x86 if compile-time reordering wasn't a problem, but others need more barriers. If you look at the definition of readl (which ioread32 uses for MMIO, as opposed to inl for PIO), it uses barriers around a dereference of a volatile pointer.
(This and the macros this uses are defined in the same io.h as this, or you can navigate using the LXR links: every identifier is a hyperlink.)
static inline u32 readl(const volatile void __iomem *addr)
{
    u32 val;

    __io_br();
    val = __le32_to_cpu(__raw_readl(addr));
    __io_ar(val);
    return val;
}
The generic __raw_readl is just the volatile dereference; some ISAs may provide their own.
__io_ar() is the After Read barrier: it uses rmb() (or plain barrier()) to /* prevent prefetching of coherent DMA data ahead of a dma-complete */, as the kernel comment puts it. The Before Read barrier, __io_br(), is just barrier(), blocking compile-time reordering without emitting any asm instructions.
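For reference, a paraphrased sketch of the generic fallbacks (based on include/asm-generic/io.h; check your own tree for the exact definitions, since architectures can override these):

#ifndef __io_br
#define __io_br()   barrier()   /* compile-time barrier only, no instruction */
#endif

#ifndef __io_ar
#ifdef rmb
#define __io_ar(v)  rmb()       /* runtime read barrier after the MMIO load */
#else
#define __io_ar(v)  barrier()
#endif
#endif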
Old answer to the wrong question: the text below answers why you need to call ioremap.
Because it's a physical address and kernel memory isn't identity-mapped (virt = phys) to physical addresses.
And returning a virtual address isn't an option: not all systems have enough virtual address space to even direct-map all of physical address space as a contiguous range of virtual addresses. (But when there is enough space, Linux does do this; e.g. x86-64 Linux's virtual address-space layout is documented in x86_64/mm.txt.)
Notably, this affects 32-bit x86 kernels on systems with more than 1 or 2 GB of RAM (depending on how the kernel is configured: a 2:2 or 1:3 kernel:user split of virtual address space). With PAE for a 36-bit physical address space, a 32-bit x86 kernel can use much more physical memory than it can map at once. (This is pretty horrible and makes life difficult for a kernel: some random blog reposted Linus Torvalds' comments about how much PAE really sucks.)
Other ISAs may have this too, and IDK what Alpha does about IO memory when byte accesses are needed; maybe the region of physical address space that maps word loads/stores to byte loads/stores is handled earlier so you request the right physical address. (http://www.tldp.org/HOWTO/Alpha-HOWTO-8.html)
But 32-bit x86 with PAE is obviously a configuration that Linux has cared a lot about, from quite early in its history.

How to avoid caching when using mmap()

I'm writing a driver in PetaLinux for a device in my FPGA, and I have implemented the mmap function in order to control the device from user space. My problem is that, even though I'm using
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
in the mmap function and the MAP_SHARED flag in the user application, it seems that the cache is enabled.
The test I did is to write a value (say 5) to a specific register of my mmapped device, which actually stores only the least significant bit of the data coming from the AXI bus. If I read back immediately after the write, I expect to read 1 (this is what happened with a bare-metal application on MicroBlaze); instead I read 5. However, the value is correctly written to the register, because what has to happen... happens.
Thanks in advance.
Based on what was discussed in the question comments, the address pointer being assigned here:
address = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
wasn't declared with the volatile type qualifier, allowing the compiler to make assumptions about it and perform compile-time optimizations over the read/write operations.
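As a sketch of that fix (the fd, page size, and register offset are assumptions for illustration, not the asker's actual driver), declaring the mapping with volatile forces the read-back to be a real load:

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static uint32_t write_then_readback(int fd)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile uint32_t *regs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED)
        return 0;

    regs[0] = 5;               /* store reaches the device register */
    uint32_t v = regs[0];      /* fresh load: may legitimately return 1 */
    munmap((void *)regs, len);
    return v;
}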

mmap options in malloc

What is the effect of the MAP_ANONYMOUS|MAP_SHARED options in mmap? I see that malloc uses the MAP_ANONYMOUS|MAP_PRIVATE options when it calls mmap for larger memory allocations.
I'm observing that with MAP_ANONYMOUS|MAP_PRIVATE, the unmapped memory region stays with the process (observed through pmap), whereas with MAP_ANONYMOUS|MAP_SHARED, the unmapped region is released back immediately.
When using MAP_ANONYMOUS, MAP_PRIVATE versus MAP_SHARED only makes a difference if the process forks a child that also uses the mapped memory block.
If you use MAP_PRIVATE, the mapped memory is marked copy-on-write, so changes made by one of the processes will not be seen by the other process.
If you use MAP_SHARED, the mapped memory is shared by both processes, so they can see each other's changes.
malloc() uses MAP_PRIVATE so that the parent and child can continue to use the mapped memory for their heaps without needing to synchronize updates. It behaves just like the data segment that's used for the normal heap.
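A small standalone demonstration of the difference (my sketch, with error checking omitted for brevity):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static void demo(int flags, const char *label)
{
    int *p = mmap(NULL, sizeof *p, PROT_READ | PROT_WRITE,
                  flags | MAP_ANONYMOUS, -1, 0);
    *p = 0;
    if (fork() == 0) {
        *p = 42;               /* child writes */
        _exit(0);
    }
    wait(NULL);                /* parent reads after child exits */
    printf("%-12s parent sees %d\n", label, *p);
    munmap(p, sizeof *p);
}

int main(void)
{
    demo(MAP_SHARED, "MAP_SHARED:");    /* prints 42 */
    demo(MAP_PRIVATE, "MAP_PRIVATE:");  /* prints 0  */
    return 0;
}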

Index into a mmap?

I'm trying to create an array of structs as a sort of rudimentary cache.
Given a void* pointer to a mmap, does mmap provide any affordances for indexing into it? I think conceptually a mmap is simply providing a block of memory, but then I'm a bit confused as to what I can do with it. Can I just think of it as a malloc?
void * mptr = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
Thanks for any clarification here.
Yes, you can think of it as a malloc, but you must deallocate it with munmap(mptr,1024*1024) rather than free(mptr).
If you want to index into it, cast it to another type, for example char:
char *cptr = (char*) mptr;
Then you can index into it using cptr[10], for example.
Regardless of which allocator you're using (mmap, malloc, sbrk, ...), you're still left with a pointer to memory. Before you can use the memory, you must tell the compiler what types live in that memory. Use C-style or C++ casting to tell the compiler how to treat it.
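Putting the above together for the asker's cache idea (the struct layout here is invented for illustration):

#include <stdio.h>
#include <sys/mman.h>

struct entry {
    unsigned key;
    unsigned value;
};

int main(void)
{
    size_t len = 1024 * 1024;
    void *mptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mptr == MAP_FAILED)
        return 1;

    struct entry *cache = mptr;           /* implicit cast from void* */
    size_t nentries = len / sizeof *cache;

    cache[10].key = 7;                    /* MAP_ANONYMOUS memory is  */
    cache[10].value = 99;                 /* already zero-filled      */
    printf("%zu entries; cache[10].value = %u\n", nentries, cache[10].value);

    munmap(mptr, len);
    return 0;
}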

What is aligned memory allocation?

I also want to know whether glibc malloc() does this.
Suppose you have this structure:
struct S {
    short a;
    int b;
    char c, d;
};
Without alignment, it would be laid out in memory like this (assuming a 32-bit architecture):
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
|       |       | words
The problem is that on some CPU architectures, the instruction to load a 4-byte integer from memory only works on word boundaries. So your program would have to fetch each half of b with separate instructions.
But if the memory was laid out as:
 0 1 2 3 4 5 6 7 8 9 A B
|a|a| | |b|b|b|b|c|d| | | bytes
|       |       |       | words
Then access to b becomes straightforward. (The disadvantage is that more memory is required, because of the padding bytes.)
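You can verify this padded layout with offsetof (a sketch assuming a typical ABI where int is 4-byte aligned):

#include <stddef.h>
#include <stdio.h>

struct S {
    short a;
    int b;
    char c, d;
};

int main(void)
{
    printf("offsetof(b) = %zu\n", offsetof(struct S, b));  /* 4, not 2 */
    printf("offsetof(c) = %zu\n", offsetof(struct S, c));  /* 8 */
    printf("sizeof(struct S) = %zu\n", sizeof(struct S));  /* 12, incl. tail padding */
    return 0;
}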
Different data types have different alignment requirements. It's common for char to be 1-byte aligned, short to be 2-byte aligned, and 4-byte types (int, float, and pointers on 32-bit systems) to be 4-byte aligned.
malloc is required by the C standard to return a pointer that's properly aligned for any data type.
glibc malloc on x86-64 returns 16-byte-aligned pointers.
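A quick standalone check of these claims on your own platform (C11; the expected values in the comments are typical for x86-64):

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    printf("alignof(char)  = %zu\n", alignof(char));          /* 1 */
    printf("alignof(short) = %zu\n", alignof(short));         /* 2 */
    printf("alignof(int)   = %zu\n", alignof(int));           /* 4 */
    printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));

    void *p = malloc(1);
    printf("malloc(1) = %p, 16-byte aligned: %s\n",
           p, ((uintptr_t)p % 16 == 0) ? "yes" : "no");
    free(p);
    return 0;
}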
Alignment requirements specify what address offsets can be assigned to what types. This is completely implementation-dependent, but is generally based on word size. For instance, some 32-bit architectures require all int variables start on a multiple of four. On some architectures, alignment requirements are absolute. On others (e.g. x86) flouting them only comes with a performance penalty.
malloc is required to return an address suitable for any alignment requirement. In other words, the returned address can be assigned to a pointer of any type. From C99 §7.20.3 (Memory management functions):
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated).
The malloc() documentation says:
[...] the allocated memory that is suitably aligned for any kind of variable.
Which is true for most everything you do in C/C++. However, as pointed out by others, many special cases exist that require a specific alignment. For example, Intel processors support a 256-bit type, __m256, which is most certainly not taken into account by malloc().
Similarly, if you want to allocate a memory buffer for data that is to be paged (similar to addresses returned by mmap(), etc.), then you may need a very large alignment, which would waste a lot of memory if malloc() always returned buffers aligned to such boundaries.
Under Linux or other Unix systems, I suggest you use the posix_memalign() function:
int posix_memalign(void **memptr, size_t alignment, size_t size);
This is the modern function to use for such needs.
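A minimal usage sketch (the 4096/8192 values are just example numbers):

#include <stdlib.h>

int main(void)
{
    void *buf = NULL;

    /* alignment must be a power of two and a multiple of sizeof(void *);
     * 4096 here matches a typical page size */
    if (posix_memalign(&buf, 4096, 8192) != 0)
        return 1;

    /* ... use the 4096-byte-aligned buffer ... */

    free(buf);   /* memory from posix_memalign() is released with free() */
    return 0;
}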
As a side note, you could still use malloc(); in that case you need to allocate size + alignment - 1 bytes and do your own alignment on the returned pointer: (ptr + alignment - 1) & -alignment (untested, all casts missing). Also, the aligned pointer is not the one you'll use to call free(); in other words, you have to store the pointer that malloc() returned to be able to call free() properly. As mentioned above, this means you lose up to alignment - 1 bytes per such malloc(). In contrast, the posix_memalign() function should not lose more than sizeof(void*) * 4 - 1 bytes, although since your size is likely a multiple of alignment, you would only lose sizeof(void*) * 2... unless you only allocate such buffers, in which case you lose a full alignment's worth of bytes each time.
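Here is that trick spelled out with the missing casts (a sketch; alignment must be a power of two, and the struct and function names are mine, not standard):

#include <stdint.h>
#include <stdlib.h>

struct aligned_buf {
    void *raw;       /* what malloc() returned; pass this to free() */
    void *aligned;   /* rounded-up pointer your code actually uses  */
};

static struct aligned_buf alloc_aligned(size_t alignment, size_t size)
{
    struct aligned_buf b = { malloc(size + alignment - 1), NULL };
    if (b.raw) {
        uintptr_t p = (uintptr_t)b.raw;
        /* round up to the next multiple of alignment; same as
         * (p + alignment - 1) & -alignment for powers of two */
        b.aligned = (void *)((p + alignment - 1) & ~(uintptr_t)(alignment - 1));
    }
    return b;
}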
If you have particular memory alignment needs (for particular hardware or libraries), you can check out non-portable memory allocators such as _aligned_malloc() and memalign(). These can easily be abstracted behind a "portable" interface, but are unfortunately non-standard.
