What is aligned memory allocation? - malloc

I also want to know whether glibc malloc() does this.

Suppose that you have this structure:
struct S {
    short a;
    int b;
    char c, d;
};
Without alignment, it would be laid out in memory like this (assuming a 32-bit architecture):
 0 1 2 3 4 5 6 7
|a|a|b|b|b|b|c|d| bytes
|       |       | words
The problem is that on some CPU architectures, the instruction to load a 4-byte integer from memory only works on word boundaries. So your program would have to fetch each half of b with separate instructions.
But if the memory was laid out as:
 0 1 2 3 4 5 6 7 8 9 A B
|a|a| | |b|b|b|b|c|d| | | bytes
|       |       |       | words
Then access to b becomes straightforward. (The disadvantage is that more memory is required, because of the padding bytes.)
Different data types have different alignment requirements. It's common for char to be 1-byte aligned, short to be 2-byte aligned, and 4-byte types (int, float, and pointers on 32-bit systems) to be 4-byte aligned.
malloc is required by the C standard to return a pointer that's properly aligned for any data type.
glibc malloc on x86-64 returns 16-byte-aligned pointers.
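To see these layout rules concretely on your own machine, you can print the size, alignment, and member offsets of the struct above (a minimal C11 sketch; the exact numbers depend on your ABI):

#include <stdalign.h>  /* alignof (C11) */
#include <stddef.h>    /* offsetof */
#include <stdio.h>

struct S {
    short a;
    int b;
    char c, d;
};

int main(void) {
    /* On a typical ABI where int is 4-byte aligned, b lands at offset 4
       (two padding bytes after a) and the struct is padded to 12 bytes. */
    printf("sizeof(struct S)  = %zu\n", sizeof(struct S));
    printf("alignof(struct S) = %zu\n", (size_t)alignof(struct S));
    printf("offsetof(b) = %zu, offsetof(c) = %zu, offsetof(d) = %zu\n",
           offsetof(struct S, b), offsetof(struct S, c), offsetof(struct S, d));
    return 0;
}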

Alignment requirements specify which address offsets can be assigned to which types. This is completely implementation-dependent, but is generally based on word size. For instance, some 32-bit architectures require that all int variables start at an address that is a multiple of four. On some architectures, alignment requirements are absolute; on others (e.g. x86), flouting them only comes with a performance penalty.
malloc is required to return an address suitable for any alignment requirement. In other words, the returned address can be assigned to a pointer of any type. From C99 §7.20.3 (Memory management functions):
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated).
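You can observe this guarantee in practice by checking the returned address against the strictest fundamental alignment, alignof(max_align_t) (a C11 sketch, not a rigorous test of the standard's wording):

#include <stdalign.h>
#include <stddef.h>   /* max_align_t */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = malloc(sizeof(max_align_t));
    if (p == NULL)
        return 1;
    /* The remainder should print as 0: the block is aligned for the
       strictest fundamental alignment (max_align_t). */
    printf("alignof(max_align_t) = %zu, remainder = %zu\n",
           (size_t)alignof(max_align_t),
           (size_t)((uintptr_t)p % alignof(max_align_t)));
    free(p);
    return 0;
}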

The malloc() documentation says:
[...] the allocated memory that is suitably aligned for any kind of variable.
Which is true for most everything you do in C/C++. However, as pointed out by others, many special cases exist and require a specific alignment. For example, Intel processors support a 256-bit type, __m256, which is almost certainly not taken into account by malloc().
Similarly, if you want to allocate a memory buffer for data that is to be paged (similar to addresses returned by mmap(), etc.), then you need a possibly very large alignment, which would waste a lot of memory if malloc() were to return buffers always aligned to such boundaries.
Under Linux or other Unix systems, I suggest you use the posix_memalign() function:
int posix_memalign(void **memptr, size_t alignment, size_t size);
This is the modern function to use for such needs.
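A typical call, assuming you want an 8192-byte buffer aligned to a 4096-byte boundary, might look like this (a minimal sketch with error handling kept short):

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *buf = NULL;
    /* alignment must be a power of two and a multiple of sizeof(void *) */
    int err = posix_memalign(&buf, 4096, 8192);
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: error %d\n", err);
        return 1;
    }
    /* ... use the 8192-byte, page-aligned buffer ... */
    free(buf);   /* memory from posix_memalign() is released with plain free() */
    return 0;
}

Note that posix_memalign() returns the error number directly instead of setting errno.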
As a side note, you could still use malloc(); in that case, you need to allocate size + alignment - 1 bytes and do your own alignment on the returned pointer: (ptr + alignment - 1) & -alignment (not tested, all casts missing). Also, the aligned pointer is not the one you'll use to call free(); in other words, you have to store the pointer that malloc() returned in order to call free() properly. As mentioned above, this means you lose up to alignment - 1 bytes per such malloc(). In contrast, the posix_memalign() function should not lose more than sizeof(void*) * 4 - 1 bytes, although since your size is likely a multiple of alignment, you would only lose sizeof(void*) * 2... unless you only allocate such buffers, in which case you lose a full alignment bytes each time.
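To make that side note concrete, here is one way to write the fallback with the casts filled in (a sketch only: aligned_malloc_fallback and raw_out are illustrative names, alignment is assumed to be a power of two, and the original malloc() pointer is what must eventually be passed to free()):

#include <stdint.h>
#include <stdlib.h>

/* Returns an aligned pointer inside a malloc'd block; stores the raw block
   in *raw_out so the caller can free(*raw_out) later.
   Assumes alignment is a power of two. */
static void *aligned_malloc_fallback(size_t size, size_t alignment, void **raw_out) {
    void *raw = malloc(size + alignment - 1);
    if (raw == NULL)
        return NULL;
    uintptr_t addr = ((uintptr_t)raw + alignment - 1) & ~(uintptr_t)(alignment - 1);
    *raw_out = raw;               /* this is the pointer to pass to free() */
    return (void *)addr;
}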

If you have particular memory alignment needs (for particular hardware or libraries), you can check out non-portable memory allocators such as _aligned_malloc() and memalign(). These can easily be abstracted behind a "portable" interface, but are unfortunately non-standard.

Related

What is the benefit of calling ioread functions when using memory mapped IO

To use memory mapped I/O, we need to first call request_mem_region.
struct resource *request_mem_region(unsigned long start,
                                    unsigned long len,
                                    char *name);
Then, since the kernel runs in a virtual address space, we need to map the physical addresses into that space by calling the ioremap function.
void *ioremap(unsigned long phys_addr, unsigned long size);
Then why can't we access the return value directly?
From Linux Device Drivers Book
Once equipped with ioremap (and iounmap), a device driver can access any I/O memory address, whether or not it is directly mapped to virtual address space. Remember, though, that the addresses returned from ioremap should not be dereferenced directly; instead, accessor functions provided by the kernel should be used.
Can anyone explain the reason behind this or the advantage with accessor functions like ioread32 or iowrite8()?
You need ioread8 / iowrite8 (or whichever width) to at least cast to volatile* to make sure optimization still results in exactly 1 access (not 0 or more than 1). In fact they do more than that: they handle endianness, accessing device memory as little-endian (there's ioread32be for big-endian), and they include some compile-time reordering memory-barrier semantics that Linux chooses to build into these functions, and even a runtime barrier after reads, because of DMA. Use the _rep versions to copy a chunk from device memory with only one barrier.
In C, data races are UB (Undefined Behaviour). This means the compiler is allowed to assume that memory accessed through a non-volatile pointer doesn't change between accesses, and that if (x) y = *ptr; can be transformed into tmp = *ptr; if (x) y = tmp; i.e. a compile-time speculative load, if *ptr is known not to fault. (Related: Who's afraid of a big bad optimizing compiler? re: why the Linux kernel needs volatile for rolling its own atomics.)
MMIO registers may have side effects even for reading so you must stop the compiler from doing loads that aren't in the source, and must force it to do all the loads that are in the source exactly once.
Same deal for stores. (Compilers aren't allowed to invent writes even to non-volatile objects, but they can remove dead stores. E.g. *ioreg = 1; *ioreg = 2; would typically compile the same as just *ioreg = 2; the first store gets removed as "dead" because it's not considered to have a visible side effect.)
C volatile semantics are ideal for MMIO, but Linux wraps more stuff around them than just volatile.
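As a toy illustration of the dead-store point above (plain C, nothing Linux-specific): with a non-volatile pointer the compiler may legally collapse the two stores into one, while the volatile version has to perform both, in order.

#include <stdint.h>

void write_twice_plain(uint32_t *reg) {
    *reg = 1;   /* a compiler may delete this store as "dead"... */
    *reg = 2;   /* ...and keep only this one */
}

void write_twice_volatile(volatile uint32_t *reg) {
    *reg = 1;   /* must be emitted: volatile accesses are observable behavior */
    *reg = 2;   /* must also be emitted, in this order */
}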
From a quick look after googling ioread8 and poking around in https://elixir.bootlin.com/linux/latest/source/lib/iomap.c#L11 we see that Linux I/O addresses can encode IO address space (port I/O, aka PIO; in / out instructions on x86) vs. memory address space (normal load/store to special addresses). And ioread* functions actually check that and dispatch accordingly.
/*
* Read/write from/to an (offsettable) iomem cookie. It might be a PIO
* access or a MMIO access, these functions don't care. The info is
* encoded in the hardware mapping set up by the mapping functions
* (or the cookie itself, depending on implementation and hw).
*
* The generic routines don't assume any hardware mappings, and just
* encode the PIO/MMIO as part of the cookie. They coldly assume that
* the MMIO IO mappings are not in the low address range.
*
* Architectures for which this is not true can't use this generic
* implementation and should do their own copy.
*/
For an example implementation, here's ioread16. (IO_COND is a macro that checks the address against a predefined constant: low addresses are PIO addresses.)
unsigned int ioread16(void __iomem *addr)
{
    IO_COND(addr, return inw(port), return readw(addr));
    return 0xffff;
}
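Typical driver-side usage then looks roughly like this (a condensed sketch: my_dev_setup, REG_STATUS, and REG_CTRL are made-up names, and real code would normally live in a probe() routine with proper cleanup paths):

/* Sketch only: register offsets and device name are hypothetical. */
#include <linux/io.h>
#include <linux/ioport.h>

#define REG_STATUS 0x00
#define REG_CTRL   0x04

static void __iomem *base;

static int my_dev_setup(unsigned long phys, unsigned long len)
{
    u32 status;

    if (!request_mem_region(phys, len, "mydev"))
        return -EBUSY;

    base = ioremap(phys, len);
    if (!base) {
        release_mem_region(phys, len);
        return -ENOMEM;
    }

    /* Never dereference 'base' directly; always go through the accessors. */
    status = ioread32(base + REG_STATUS);
    iowrite32(status | 0x1, base + REG_CTRL);
    return 0;
}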
What would break if you just cast the ioremap result to volatile uint32_t*?
e.g. if you used READ_ONCE / WRITE_ONCE which just cast to volatile unsigned char* or whatever, and are used for atomic access to shared variables. (In Linux's hand-rolled volatile + inline asm implementation of atomics which it uses instead of C11 _Atomic).
That might actually work on some little-endian ISAs like x86 if compile-time reordering wasn't a problem, but others need more barriers. If you look at the definition of readl (which ioread32 uses for MMIO, as opposed to inl for PIO), it uses barriers around a dereference of a volatile pointer.
(This and the macros this uses are defined in the same io.h as this, or you can navigate using the LXR links: every identifier is a hyperlink.)
static inline u32 readl(const volatile void __iomem *addr)
{
    u32 val;

    __io_br();
    val = __le32_to_cpu(__raw_readl(addr));
    __io_ar(val);
    return val;
}
The generic __raw_readl is just the volatile dereference; some ISAs may provide their own.
__io_ar() uses rmb() or barrier() After Read. /* prevent prefetching of coherent DMA data ahead of a dma-complete */. The Before Read barrier is just barrier() - blocking compile-time reordering without asm instructions.
Old answer to the wrong question: the text below answers why you need to call ioremap.
Because it's a physical address and kernel memory isn't identity-mapped (virt = phys) to physical addresses.
And returning a virtual address isn't an option: not all systems have enough virtual address space to even direct-map all of physical address space as a contiguous range of virtual addresses. (But when there is enough space, Linux does do this; e.g. x86-64 Linux's virtual address-space layout is documented in x86_64/mm.txt.)
Notably, 32-bit x86 kernels on systems with more than 1 or 2GB of RAM (depending on whether the kernel is configured with a 2:2 or 1:3 kernel:user split of virtual address space) cannot direct-map all of physical memory. With PAE for a 36-bit physical address space, a 32-bit x86 kernel can use much more physical memory than it can map at once. (This is pretty horrible and makes life difficult for a kernel: some random blog reposted Linus Torvalds's comments about how much PAE really sucks.)
Other ISAs may have this too, and IDK what Alpha does about IO memory when byte accesses are needed; maybe the region of physical address space that maps word loads/stores to byte loads/stores is handled earlier so you request the right physical address. (http://www.tldp.org/HOWTO/Alpha-HOWTO-8.html)
But 32-bit x86 PAE is obviously an ISA that Linux cares a lot about, even quite early in the history of Linux.

Maximum size of an array in 32 bits?

According to the Rust Reference:
The isize type is a signed integer type with the same number of bits as the platform's pointer type. The theoretical upper bound on object and array size is the maximum isize value. This ensures that isize can be used to calculate differences between pointers into an object or array and can address every byte within an object along with one byte past the end.
This obviously constrains an array to at most 2G elements on a 32-bit system; however, what is not clear is whether an array is also constrained to at most 2GB of memory.
In C or C++, you would be able to cast pointers to the first and one-past-the-last elements to char* and take the difference of those two pointers, effectively limiting the array to 2GB (lest it overflow ptrdiff_t).
Is an array in 32 bits also limited to 2GB in Rust? Or not?
The internals of Vec do cap the value to 4GB, both in with_capacity and grow_capacity, using
let size = capacity.checked_mul(mem::size_of::<T>())
                   .expect("capacity overflow");
which will panic if the computed size in bytes overflows.
As such, Vec-allocated slices are also capped in this way in Rust. Given that this is because of an underlying restriction in the allocation API, I would be surprised if any typical type could circumvent this. And if they did, Index on slices would be unsafe due to pointer overflow. So I hope not.
It might still not be possible to allocate all 4GB for other reasons, though. In particular, allocate won't let you allocate more than 2GB (isize::MAX bytes), so Vec is restricted to that.
Rust uses LLVM as compiler backend. The LLVM instruction for pointer arithmetic (GetElementPtr) takes signed integer offsets and has undefined behavior on overflow, so it is impossible to index into arrays larger than 2GB when targeting a 32-bit platform.
To avoid undefined behavior, Rust will refuse to allocate more than 2 GB in a single allocation. See Rust issue #18726 for details.

Does mmap reuse an already allocated block if the next request fits into it?

Suppose I want to allocate 3000 bytes like this:
malloc(1000);
malloc(1000);
malloc(1000);
and my malloc implementation uses mmap().
So I want to know:
Does malloc call mmap() three times?
Does mmap allocate 3 separate pages (total allocated memory is 3*4096 bytes), or does it serve all three requests from one page (total allocated memory is 4096 bytes)?
If it allocates three different pages, how can I make my allocator do it with only one page?
The mmapping behavior of Linux's (that is, GNU libc's) malloc is described in the manpage mallopt(3). malloc uses a "dynamic mmap threshold" that starts off at 128kB, but this may automatically be adjusted upward based on a process's allocation patterns. Smaller allocations are served using an old-school free list, and the initial threshold can be set using environment variables or the mallopt function.
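For reference, the threshold the answer mentions can be pinned to a fixed value with mallopt() (a glibc-specific sketch; note that setting it also disables the dynamic adjustment):

#include <malloc.h>
#include <stdlib.h>

int main(void) {
    /* Serve allocations of 64 KiB or more via mmap(); smaller ones
       keep coming from the heap / free lists. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    char *small = malloc(1000);        /* below the threshold: heap */
    char *large = malloc(256 * 1024);  /* above the threshold: its own mmap */
    free(small);
    free(large);
    return 0;
}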
So malloc will almost certainly not mmap three 4kB pages, but whether or not it keeps the allocations in a single page is not guaranteed. You can either do a manual mmap or, if two pages is ok, do a single malloc:
char *a = malloc(3000);
// check for errors
char *b = a + 1000;
char *c = b + 1000;
// don't forget that you must free a, and only a, to free b and c
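If you want all three objects to come from exactly one page, the manual mmap route might look like this (a Linux-flavoured sketch using MAP_ANONYMOUS; no free-list logic, and the whole page is released at once):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* One anonymous page (4096 bytes on most systems), carved up by hand. */
    char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    char *a = page;          /* first 1000 bytes */
    char *b = page + 1000;   /* next 1000 bytes  */
    char *c = page + 2000;   /* last 1000 bytes  */
    (void)a; (void)b; (void)c;

    munmap(page, 4096);      /* releases all three "allocations" at once */
    return 0;
}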

Why does a (C) stack have a maximum of 2 MB?

This question is about stack overflows, so where better to ask it than here.
If we consider how memory is used for a program (a.out) in unix, it is something like this:
| etext | stack, 2mb | heap ->>>
And I have wondered for a few years now why there is a restriction of 2MB for the stack. Consider that we have 64 bits for a memory address, then why not allocate like this:
| MIN_ADDR MAX_ADDR|
| heap ->>>> <<<- stack | etext |
MAX_ADDR will be somewhere near 2^64 and MIN_ADDR somewhere near 2^0, so there are many bytes in between which the program can use, but which are not necessarily accounted for by the kernel (by actually assigning pages for them). The heap and stack will probably never reach each other, and hence the 2MB limit is not needed (and would instead be a ~1.8446744e+19 byte limit). If we are scared that they will reach each other, then set the limit to 2^63 or some bizarre and enormous number.
Furthermore, the heap grows from low to high, so our kernel can still resize blocks of memory (allocated with for example malloc) without necessarily needing to shift the content.
Moreover, a stack frame is always static in size in some way, so we never need to resize there; if we did, that would be awkward anyway, since we would also need to change the whole pointer structure used by return and created by call.
I read this as an answer on another stackoverflow question:
"My intuition is the following. The stack is not as easy to manage as the heap. The stack need to be stored in continuous memory locations. This means that you cannot randomly allocate the stack as needed, but you need to at least reserve virtual addresses for that purpose. The larger the size of the reserved virtual address space, the fewer threads you can create."
Source: Why is the page size of Linux (x86) 4 KB, how is that calculated
But we have loads of memory addresses! So this makes no sense. So why 2MB?
The reason I ask is that allocating memory on the stack is quite safe with respect to dangling pointers and memory leaks:
e.g. I prefer
int foo[5];
instead of
int *foo = malloc(5*sizeof(int));
Since it will deallocate by itself. Also, allocation on the stack is faster than allocation performed by malloc. However, if I allocate an image (e.g. a JPEG or PNG) on the stack, I am in a dangerous zone of overflowing the stack.
Another point on this matter, why not also allow this:
int *huge_list_of_data = malloc(1000*sizeof(char), 10 000 000 000*sizeof(char))
where we allocate a list object, which has initially the size of 1KB, but we ask the kernel to allocate it such that the page it is put on is not used for anything else, and that we want to have 10GB of pages behind it, which can be (partially) swapped in when necessary.
This way we don't need 10GB of memory, we only need 10GB of memory addresses.
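For what it's worth, that reserve-now, commit-later scheme can already be approximated with the existing mmap/mprotect interface (a 64-bit Linux sketch with minimal error handling; PROT_NONE plus MAP_NORESERVE claims addresses without committing memory):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t reserve = 10ULL * 1024 * 1024 * 1024;  /* 10 GB of addresses only */

    /* PROT_NONE + MAP_NORESERVE: claim address space, no memory or swap yet. */
    char *region = mmap(NULL, reserve, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Commit the first 1 KB (rounded up to a page) only when actually needed. */
    if (mprotect(region, 1024, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }

    region[0] = 42;          /* now backed by a real page */
    munmap(region, reserve);
    return 0;
}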
So why no:
void *malloc( unsigned long, unsigned long );
?
In essence: WHY NOT USE THE PAGING SYSTEM OF UNIX TO SOLVE OUR MEMORY ALLOCATION PROBLEMS?
Thank you for reading.

Will malloc implementations return free-ed memory back to the system?

I have a long-living application with frequent memory allocation-deallocation. Will any malloc implementation return freed memory back to the system?
What is, in this respect, the behavior of:
ptmalloc 1, 2 (glibc default) or 3
dlmalloc
tcmalloc (google threaded malloc)
solaris 10-11 default malloc and mtmalloc
FreeBSD 8 default malloc (jemalloc)
Hoard malloc?
Update
If I have an application whose memory consumption can be very different in daytime and nighttime (e.g.), can I force any of malloc's to return freed memory to the system?
Without such a return, freed memory will be swapped out and back in many times, even though it contains only garbage.
The following analysis applies only to glibc (based on the ptmalloc2 algorithm).
There are certain options that seem helpful to return the freed memory back to the system:
mallopt() (defined in malloc.h) provides the M_TRIM_THRESHOLD parameter for setting the trim threshold; this indicates the minimum amount of free memory (in bytes) allowed at the top of the data segment. If the amount falls below this threshold, glibc invokes brk() to give memory back to the kernel (a combined example with malloc_trim() is sketched after this list of options).
The default value of M_TRIM_THRESHOLD on Linux is 128K; setting a smaller value might save space.
The same behavior can be achieved by setting the trim threshold in the environment variable MALLOC_TRIM_THRESHOLD_, with no source changes at all.
However, preliminary test programs run using M_TRIM_THRESHOLD have shown that even though the memory allocated by malloc does return to the system, the remaining portion of the actual chunk of memory (the arena) initially requested via brk() tends to be retained.
It is possible to trim the memory arena and give any unused memory back to the system by calling malloc_trim(pad) (defined in malloc.h). This function resizes the data segment, leaving at least pad bytes at the end of it and failing if less than one page worth of bytes can be freed. Segment size is always a multiple of one page, which is 4,096 bytes on i386.
The implementation of this modified behavior of free() using malloc_trim could be done using the malloc hook functionality. This would not require any source code changes to the core glibc library.
Using the madvise() system call inside glibc's implementation of free().
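Combining the mallopt() and malloc_trim() approaches above, a minimal glibc-specific sketch might look like this (the threshold value and allocation counts are illustrative only):

#include <malloc.h>
#include <stdlib.h>

int main(void) {
    enum { N = 100000 };
    static char *blocks[N];

    /* Lower the trim threshold so glibc gives heap memory back to the
       kernel sooner when the top of the heap becomes free. */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024);   /* 64 KiB instead of the 128 KiB default */

    for (int i = 0; i < N; i++)
        blocks[i] = malloc(100);            /* small allocations come from the heap, not mmap */
    for (int i = 0; i < N; i++)
        free(blocks[i]);

    /* Explicitly trim the top of the heap, keeping 0 bytes of padding. */
    malloc_trim(0);
    malloc_stats();                         /* compare "system bytes" with and without the trim */
    return 0;
}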
Most implementations don't bother identifying those (relatively rare) cases where entire "blocks" (of whatever size suits the OS) have been freed and could be returned, but there are of course exceptions. For example, and I quote from the wikipedia page, in OpenBSD:
On a call to free, memory is released and unmapped from the process address space using munmap. This system is designed to improve security by taking advantage of the address space layout randomization and gap page features implemented as part of OpenBSD's mmap system call, and to detect use-after-free bugs—as a large memory allocation is completely unmapped after it is freed, further use causes a segmentation fault and termination of the program.
Most systems are not as security-focused as OpenBSD, though.
Knowing this, when I'm coding a long-running system that has a known-to-be-transitory requirement for a large amount of memory, I always try to fork the process: the parent then just waits for results from the child [[typically on a pipe]], the child does the computation (including memory allocation), returns the results [[on said pipe]], then terminates. This way, my long-running process won't be uselessly hogging memory during the long times between occasional "spikes" in its demand for memory. Other alternative strategies include switching to a custom memory allocator for such special requirements (C++ makes it reasonably easy, though languages with virtual machines underneath such as Java and Python typically don't).
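A bare-bones version of that fork-and-pipe pattern might look like this (a sketch under the stated assumptions: the child does some memory-hungry computation, writes a small result to the pipe, and exits so the OS reclaims everything it allocated):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                    /* child: transient memory hog */
        close(fd[0]);
        char *big = malloc(512UL * 1024 * 1024);   /* large scratch buffer */
        long result = 42;                          /* pretend computation */
        free(big);
        write(fd[1], &result, sizeof result);
        close(fd[1]);
        _exit(0);                      /* all child memory returns to the OS here */
    }

    close(fd[1]);                      /* parent: stays small */
    long result = 0;
    read(fd[0], &result, sizeof result);
    close(fd[0]);
    waitpid(pid, NULL, 0);
    printf("result = %ld\n", result);
    return 0;
}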
I had a similar problem in my app, after some investigation I noticed that for some reason glibc does not return memory to the system when allocated objects are small (in my case less than 120 bytes).
Look at this code:
#include <list>
#include <malloc.h>

template<size_t s> class x { char buf[s]; };

int main(int argc, char** argv) {
    typedef x<100> X;          // each element occupies 100 bytes
    std::list<X> lx;
    for (size_t i = 0; i < 500000; ++i) {
        lx.push_back(X());
    }
    lx.clear();                // destroy all elements, returning memory to the allocator
    malloc_stats();            // print how much memory glibc still holds
    return 0;
}
Program output:
Arena 0:
system bytes = 64069632
in use bytes = 0
Total (incl. mmap):
system bytes = 64069632
in use bytes = 0
max mmap regions = 0
max mmap bytes = 0
About 64 MB are not returned to the system. When I changed the typedef to:
typedef x<110> X;
the program output looks like this:
Arena 0:
system bytes = 135168
in use bytes = 0
Total (incl. mmap):
system bytes = 135168
in use bytes = 0
max mmap regions = 0
max mmap bytes = 0
Almost all memory was freed. I also noticed that using malloc_trim(0) in either case released the memory back to the system.
Here is output after adding malloc_trim to the code above:
Arena 0:
system bytes = 4096
in use bytes = 0
Total (incl. mmap):
system bytes = 4096
in use bytes = 0
max mmap regions = 0
max mmap bytes = 0
I am dealing with the same problem as the OP. So far, it seems possible with tcmalloc. I found two solutions:
compile your program with tcmalloc linked, then launch it as:
env TCMALLOC_RELEASE=100 ./my_pthread_soft
the documentation mentions that
Reasonable rates are in the range [0,10].
but 10 doesn't seem enough for me (i.e. I see no change).
find somewhere in your code where it would be interesting to release all the freed memory, and then add this code:
#include "google/malloc_extension_c.h" // C include
#include "google/malloc_extension.h" // C++ include
/* ... */
MallocExtension_ReleaseFreeMemory();
The second solution has been very effective in my case; the first would be great, but it isn't very successful: it is complicated to find the right rate, for example.
Of the ones you list, only Hoard will return memory to the system... but whether it can actually do so will depend a lot on your program's allocation behaviour.
The short answer: to force the malloc subsystem to return memory to the OS, use malloc_trim(). Otherwise, the behavior of returning memory is implementation-dependent.
For all 'normal' mallocs, including the ones you've mentioned, memory is released to be reused by your process, but not back to the whole system. Releasing back to the whole system happens only when your process is finally terminated.
FreeBSD 12's malloc(3) uses jemalloc 5.1, which returns freed memory ("dirty pages") to the OS using madvise(...MADV_FREE).
Freed memory is only returned after a time delay controlled by opt.dirty_decay_ms and opt.muzzy_decay_ms; see the manual page and this issue on implementing decay-based unused dirty page purging for more details.
Earlier versions of FreeBSD shipped with older versions of jemalloc, which also returns freed memory, but uses a different algorithm to decide what to purge and when.

Resources