GNU malloc size classes - malloc

I have been trying to find out whether GNU libc malloc uses size classes like jemalloc, but I haven't found any information. I'd like to be able to make a better estimate of the savings obtained by using std::vector::reserve() instead of letting a vector grow incrementally with push_back. I know that GNU libstdc++ doubles the vector's capacity whenever it needs to grow, but I can't calculate the real savings without knowing how much malloc is really allocating.
The allocations that I'm looking at are all fairly small (from about 140 to 1500 bytes, but there are several million of them). That means that they are well below the MMAP_THRESHOLD limit.
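One way to approach this empirically is to ask the allocator what it handed back. The sketch below assumes glibc, whose <malloc.h> provides malloc_usable_size(); the helper names usable_size_for and slack_for are made up for illustration. As far as I know, the usable size does not include the allocator's own per-chunk bookkeeping, so treat the result as a lower bound on the real footprint.

```cpp
#include <cstddef>
#include <cstdlib>
#include <malloc.h>  // malloc_usable_size: a glibc extension, not standard C++

// Returns how many bytes the allocator actually made usable for a request
// of `request` bytes, or 0 if the allocation failed.
// (Hypothetical helper, named for this example only.)
std::size_t usable_size_for(std::size_t request) {
    void* p = std::malloc(request);
    if (p == nullptr) {
        return 0;
    }
    const std::size_t usable = malloc_usable_size(p);
    std::free(p);
    return usable;
}

// Bytes handed out beyond what was asked for, i.e. the rounding slack
// for a single request of this size.
std::size_t slack_for(std::size_t request) {
    const std::size_t usable = usable_size_for(request);
    return usable >= request ? usable - request : 0;
}
```

Calling usable_size_for on each capacity a doubling vector would pass through (in bytes), and on the size passed to reserve(), gives the numbers needed for the comparison on the machine in question.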

Related

mmap with large alignment

I'm working on a memory management library that targets both Windows and Linux systems. On Windows, I'm currently using VirtualAlloc2 with MEM_ADDRESS_REQUIREMENTS to allocate blocks of memory that are aligned to relatively large powers of two. Is there any way to achieve the same result with mmap?
I'm aware that a possible solution is to overallocate and then trim the excess memory using munmap, but I'd like to avoid this since it potentially triples the number of system calls per allocation.
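For reference, the overallocate-and-trim approach mentioned above can be sketched as follows. This is only an illustration of that workaround, not a recommendation over it: the function name mmap_aligned is hypothetical, and the code assumes that both the size and the alignment are multiples of the page size and that the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Maps `size` bytes of anonymous memory whose start address is a multiple
// of `alignment`. Returns nullptr on failure.
// Assumptions (not checked in full): `alignment` is a power of two, and
// both `size` and `alignment` are multiples of the system page size.
void* mmap_aligned(std::size_t size, std::size_t alignment) {
    if (size == 0 || alignment == 0 || (alignment & (alignment - 1)) != 0) {
        return nullptr;
    }
    if (size > SIZE_MAX - alignment) {
        return nullptr;  // size + alignment would overflow
    }

    // Syscall 1: reserve enough extra space that an aligned start must exist.
    const std::size_t padded = size + alignment;
    void* raw = mmap(nullptr, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        return nullptr;
    }

    const auto base = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (base + mask) & ~mask;

    const std::size_t head = aligned - base;         // unused bytes in front
    const std::size_t tail = padded - head - size;   // unused bytes behind

    // Syscalls 2 and 3: give the unused head and tail back to the kernel.
    if (head != 0) {
        munmap(raw, head);
    }
    if (tail != 0) {
        munmap(reinterpret_cast<void*>(aligned + size), tail);
    }
    return reinterpret_cast<void*>(aligned);
}
```

The block is later released with an ordinary munmap(ptr, size). The three calls in the worst case (one mmap, two munmap) are exactly the overhead the question is trying to avoid.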

Why does the Linux kernel require small short-term memory chunks in odd sizes?

I'm reading Operating Systems: Internals and Design Principles by William Stallings, 7th edition. In section 8.4, Linux Memory Management, the discussion of kernel memory management says:
The foundation of kernel memory allocation for Linux is the page allocation
mechanism used for user virtual memory management. As in the virtual memory
scheme, a buddy algorithm is used so that memory for the kernel can be allocated
and deallocated in units of one or more pages. Because the minimum amount of
memory that can be allocated in this fashion is one page, the page allocator alone
would be inefficient because the kernel requires small short-term memory chunks
in odd sizes.
I could follow the discussion of paging, but why does the author say that the kernel requires small short-term memory chunks in odd sizes? In particular, why odd sizes?
Because most programs require small allocations, for relatively short periods, in a variety of sizes? That's why malloc and friends exist: To subdivide the larger allocations from the OS into smaller pieces with sub-page-size granularity. Want a linked list (commonly needed in OS kernels)? You need to be able to allocate small nodes that contain the value and a pointer to the next node (and possibly a reverse pointer too).
I suspect by "odd sizes" they just mean "arbitrary sizes"; I don't expect the kernel to be unusually heavy on 1, 3, 5, 7, etc. byte allocations, but the allocation sizes are, in many cases, not likely to be consistent enough that a fixed block allocator is broadly applicable. Writing a special block allocator for each possible linked list node size (let alone every other possible size needed for dynamically allocated memory) isn't worth it unless that linked list is absolutely performance critical after all.

How much memory did Linux give to malloc()?

This is a Linux system question, not a coding question. When I use "top" to check the memory usage of my program, it reports a value 3-4 times as large as the actual heap allocation as given by Valgrind's Massif, a memory profiler. It's a large program, and the difference is hundreds of megabytes. The Valgrind manual gives only a partial explanation:
(Massif) does not directly measure memory allocated with
lower-level system calls such as mmap, mremap, and brk.
Heap allocation functions such as malloc are built on top of these
system calls. For example, when needed, an allocator will typically
call mmap to allocate a large chunk of memory, and then hand over
pieces of that memory chunk to the client program in response to calls
to malloc et al. Massif directly measures only these higher-level
malloc et al calls, not the lower-level system calls.
Fine, but how much memory am I really taking away from the system? I need to be able to run as many instances of this program as possible on one machine, so I need to know how much of that memory is still available. Page alignment etc. cannot explain a difference of hundreds of megabytes in reported memory usage.
Also, what determines the block size of the underlying mmap() call? I'm seeing blocks of 64MB at a time being taken according to top, which seems bizarrely large.
Any malloc implementation will be optimised for applications with huge memory requirements, because apps with low requirements run just fine anyway, and virtual memory is cheap.
For example, you will find malloc implementations that use a block of memory for up to 1024 mallocs of up to 16 bytes, another block for up to 1024 mallocs of up to 32 bytes, and so on. With a few mallocs this is inefficient but still cheap. With gazillions of mallocs, it makes malloc very efficient.
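To make the arithmetic concrete, here is a toy model of that scheme. It is a sketch of the example just described and not the behaviour of any particular malloc: the power-of-two classes starting at 16 bytes, the 1024 slots per block, and the names size_class and block_bytes are all assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical size classes: 16, 32, 64, ... bytes.
// Returns the smallest class that can hold `request` bytes,
// or 0 if no power-of-two class fits in size_t.
std::size_t size_class(std::size_t request) {
    std::size_t cls = 16;
    while (cls < request) {
        if (cls > SIZE_MAX / 2) {
            return 0;  // doubling again would overflow
        }
        cls *= 2;
    }
    return cls;
}

// Bytes reserved for one block of `slots` equally sized slots in the
// class that serves `request`. A single 20-byte malloc would reserve
// the whole block, which is where the apparent overhead comes from.
std::size_t block_bytes(std::size_t request, std::size_t slots = 1024) {
    return size_class(request) * slots;
}
```

In this model a program that makes a single 20-byte allocation reserves a 32 KiB block, while a program that makes a thousand of them reserves the same 32 KiB, which is why the ratio of reserved to requested memory says little on its own.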
So saying "4 times as much" can be completely pointless. Tell us how many megabytes more than you thought.

1GB Vector, will Vector.Unboxed give trouble, will Vector.Storable give trouble?

We need to store a large block (1 GB) of contiguous bytes in memory for long periods of time (weeks to months), and are trying to choose a Vector/Array library. I have two concerns that I can't find answers to.
Vector.Unboxed seems to store the underlying bytes on the Haskell heap, where they can be moved around at will by the GC. Periodically moving 1 GB of data is something I would like to avoid.
Vector.Storable solves this problem by storing the underlying bytes in the C heap. But everything I've read seems to indicate that this is really only meant for communicating with other languages (primarily C). Is there some reason that I should avoid using Vector.Storable for internal Haskell usage?
I'm open to a third option if it makes sense!
My first thought was the mmap package, which allows you to "memory-map" a file into memory, using the virtual memory system to manage paging. I don't know if this is appropriate for your use case (in particular, I don't know if you're loading or computing this 1GB of data), but it may be worth looking at.
In particular, I think this prevents the GC moving the data around (since it's not on the Haskell heap, it's managed by the OS virtual memory subsystem). On the other hand, this interface handles only raw bytes; you couldn't have, say, an array of Customer objects or something.

How can I benchmark malloc implementations?

I am comparing different malloc implementations and would like to measure their run time and memory usage.
In particular, I am interested in the runtime and in the maximum resident memory. It is important that the maximum resident memory will be the real one (without the code segment etc.).
I cannot use tools like Valgrind, since it replaces the malloc implementation. Also, I run the tests on programs that I have not written, and I prefer not to change their source code.
You can use rdtscbench for the runtime measurement. See:
https://github.com/petersenna/rdtscbench
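For the memory side, one option that needs no source changes is to launch the program as a child process and read its resource usage when it exits. The sketch below is only an illustration under assumptions: run_and_measure and RunStats are hypothetical names, it relies on the Linux/BSD wait4() call, and ru_maxrss (reported in kilobytes on Linux) counts all resident pages of the process, so it does not exclude the code segment as the question asks. If the allocators under test are shared libraries, they can presumably be swapped in by setting LD_PRELOAD in the environment before calling this.

```cpp
#include <chrono>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct RunStats {
    double wall_seconds;  // elapsed wall-clock time of the child
    long max_rss_kib;     // peak resident set size (kilobytes on Linux)
    int exit_status;      // exit code, or -1 if the child did not exit normally
};

// Runs the program named by argv[0] (argv must end with a null pointer),
// waits for it, and fills `out`. Returns false if fork or wait4 failed.
bool run_and_measure(char* const argv[], RunStats& out) {
    const auto start = std::chrono::steady_clock::now();

    const pid_t pid = fork();
    if (pid < 0) {
        return false;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec failed
    }

    int status = 0;
    struct rusage usage {};
    if (wait4(pid, &status, 0, &usage) < 0) {
        return false;
    }
    const auto stop = std::chrono::steady_clock::now();

    out.wall_seconds = std::chrono::duration<double>(stop - start).count();
    out.max_rss_kib = usage.ru_maxrss;
    out.exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return true;
}
```

Because the figure includes everything resident in the process, it is probably most useful as a difference between runs of the same program under different allocators rather than as an absolute number.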

Resources