What are the advantages and disadvantages of using jemalloc vs malloc vs calloc and other common alternatives?

Reading the Rust subreddit today I came across comments that:
jemalloc is optimized for (multithreaded) speed, not memory usage
After doing more research I found that there are even more alternatives (such as calloc).
I would like to understand the advantages and disadvantages of the different memory allocators.
If this question seems silly, it's because my background is mainly in interpreted languages (which don't expose such fine-grained memory control).

malloc, calloc, and realloc
These functions are not different allocators. They are different ways of asking for memory from the same allocator.
malloc provides memory without initializing it (it is filled with whatever the previous user of that memory stored in it).
calloc is the same as malloc, but it also initializes the memory (fills it with the zero byte 0x00).
realloc takes already allocated memory and allows the user to resize it.
So, in the context of allocators and their different implementations, malloc, calloc, and realloc are not listed as independent options, because each allocator implementation provides its own version of these functions.
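As a minimal sketch of the difference between the three calls (all three requests are served by the same underlying allocator; error handling is kept to bare null checks):

    #include <cstdio>
    #include <cstdlib>   // std::malloc, std::calloc, std::realloc, std::free

    int main() {
        // malloc: space for 10 ints, contents are indeterminate until written
        int *a = static_cast<int*>(std::malloc(10 * sizeof(int)));
        if (!a) return 1;
        a[0] = 42;

        // calloc: space for 10 ints, guaranteed to be zero-filled
        int *b = static_cast<int*>(std::calloc(10, sizeof(int)));
        if (!b) { std::free(a); return 1; }

        // realloc: grow the first block to 20 ints; the old contents are kept,
        // but the block may move, so always use the returned pointer
        if (int *tmp = static_cast<int*>(std::realloc(a, 20 * sizeof(int)))) a = tmp;

        std::printf("%d %d\n", a[0], b[0]);   // prints "42 0"

        std::free(a);
        std::free(b);
    }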
jemalloc, ptmalloc, ...
When someone wants to implement a different allocator, they can't (well, they can, but shouldn't by default) name it malloc, because it would collide with the one in the C standard library. Instead, they usually pick a distinguishing prefix, which gives names like jemalloc, ptmalloc, nedmalloc, tcmalloc, and others.
It is worth mentioning that there are also multiple implementations of the C standard library itself, and each implements its allocator differently. So malloc will have a different implementation depending on which standard library the code is compiled against. Examples are the GNU C standard library, the MSVC standard library, etc.
What is the difference between different allocators?
To know the exact advantages and disadvantages of each implementation, you have to read the documentation written by its authors (if it exists), read the code to understand the algorithm, or read articles and research papers written by experts about that particular implementation.
However, if I were to categorize the differences between these implementations, I would list the following:
Some implementations focus on certain usage patterns and try to optimize for them, even at the expense of making other cases less efficient. An example is jemalloc, which focuses on making allocation from multiple threads faster at the expense of using more memory. Allocators of this type are typically deployed after careful investigation of a specific case has shown that it benefits from the trade-off.
Some implementations place a limitation on how the allocator may be used in order to make it faster. An example is single-threaded allocators, which eliminate the need for synchronization objects entirely (a rough sketch of such an allocator follows this list).
Other implementations try to be as general-purpose as possible and don't favor any case over the others. This category includes the default allocators included in the standard libraries.
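To make the second category concrete, here is a minimal sketch of a single-threaded bump allocator (the class and its names are made up for illustration). Because it assumes a single thread, it needs no locks at all; the price is that individual blocks can't be freed, only the whole arena at once:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Toy single-threaded "bump" allocator: allocation is just a pointer bump,
    // there are no locks, and memory can only be released all at once.
    class BumpAllocator {
    public:
        explicit BumpAllocator(std::size_t capacity) : buffer_(capacity), offset_(0) {}

        // align must be a power of two (true for any alignof(...) value)
        void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
            std::size_t start = (offset_ + align - 1) & ~(align - 1);  // round up to alignment
            if (start + size > buffer_.size()) return nullptr;         // arena exhausted
            offset_ = start + size;
            return buffer_.data() + start;
        }

        void release_all() { offset_ = 0; }  // the only way to "free" anything

    private:
        std::vector<std::uint8_t> buffer_;
        std::size_t offset_;
    };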

Related

How to reliably influence generated code at near machine level using GHC?

While this may sound like a theoretical question, suppose I decide to invest in and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve the performance of some very narrow bottleneck, and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread scheduling irregularities, occasional GC etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to a specific GHC release (but no forking)
The performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization? By saying reliably I mean that if a source change causes the code to no longer perform, it is correctable in source code without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would custom FFI help avoid C calling convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could a source code generator from a "high-level" assembly (LLVM?) be a solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation), say, by looking at the Core representation (a very Haskell-like intermediate language that is the first stage in compilation). So e.g. you might see Int# in the Core output, which means an integer with no heap representation (it's going to live in a register).
When optimizing Haskell code we regularly look at Core and expect to be able to manipulate or correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function so that it can be inlined). This isn't always fun, but it will be fairly stable, especially if you are pinning your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim; in particular, it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can research how to write portions of your program in C and interoperate; Haskell supports pinned arrays, among other structures, for this purpose.
If you need more control than those give you, then Haskell is likely the wrong language. It's impossible to tell from your description whether this is the case for your problem (your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

Does Racket offer anything for implementing a language with its own GC that could manage GPU memory?

I am working on this, and I have some regret that I am going to have to do some kind of region-based memory allocation scheme for GPU memory, because .NET does not allow an adequate level of control over its GC.
I was too naive. I admit that it did cross my mind that, just because I was on a platform with a GC, I would (and should) not have to do manual memory management, nor would I need to know how the C malloc works nor how it is implemented. I want to do better than this.
What are Racket's facilities in this area?
No. GPU processors are not like CPUs; practically speaking they don't run any GC-ed language implementation, only very low-level code (e.g. using OpenCL or CUDA or OpenACC or SPIR). They don't really have any general-purpose dynamic memory allocation, and they might not even have virtual memory or an MMU. Their memory is generally separate.
What you could do is use some existing library that provides GPU compute kernels (like TensorFlow, OpenCV, etc.) and call that library from your Racket-based system using some foreign function interface.
What you might do, with a lot of work (probably several years), is generate kernel code in OpenCL or CUDA (or SPIR), mixed with other generated code that manages that kernel code; that is, implement a compiler from a small subset (to be painfully defined) of your Spiral language into OpenCL or CUDA kernels. In that case, the devil is in the details (and the kernel code you'll generate would depend on the particular GPU model). You could look into SPOC for inspiration.
nor would I need to know how the C malloc works nor how it is implemented.
It is much worse than that. You'll need to take care of a lot of low-level details, you'll need to code stuff specific to your OS and hardware, and understanding the C malloc is easier than taking care of all the GPU details (that is, generating the "right" GPU and glue code: dive into the specifications of OpenCL for more).
(I believe that it is not worth the effort, several years of it, to compile your Spiral into GPU kernel code plus the necessary glue code running on the CPU.)
You should also read more about garbage collection, e.g. the GC handbook.
I was too naive.
You probably still are. Your subject is harder than you think, if you want an efficient and competitive implementation. Coding a naïve GC (or VM) is easy, but coding an efficient one is hard (requiring several years of work).
I want to do better than this.
You'll need several years of full time work.

Mutexes, atomics and fences: what offers the best tradeoff and portability? (C++11)

I'm trying to dig deeper to better understand how many options I have when writing multi-threaded applications in C++11.
In short, I see these 3 options so far (a brief sketch of all three follows the list):
mutexes, with an explicit locking and unlocking mechanism; they keep threads in sync by locking and unlocking, which is costly and doesn't guarantee the order in which my code executes, but this solution is often quite portable across different memory models.
atomic operations; since atomic means a single operation without a race, and it is always consistent, synchronization is accomplished without locking and unlocking, because there is no need for a lock when there is no race, and the operations themselves are highly optimized; but atomics still can't guarantee the order in which my code will be executed.
fences; they create a block in my code where nothing can be re-ordered by the compiler; they are less flexible and tend to be costly in terms of code maintenance, because I always have to keep an eye on what is really being executed and in what order, but they also play well with caching, and among these 3 solutions they are probably the one with the most predictable behaviour.
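For concreteness, here is a rough toy sketch of what I mean by the three options (the names are made up; for the fence case a release/acquire fence pair is used to publish a plain value, since a fence on its own doesn't make an operation atomic):

    #include <atomic>
    #include <mutex>
    #include <thread>

    int plain_counter = 0;
    std::mutex m;                       // 1) mutex: exclusive access, threads may block
    std::atomic<int> atomic_counter{0}; // 2) atomic: lock-free read-modify-write

    int payload = 0;                    // 3) fences: used here to publish a plain value
    std::atomic<bool> ready{false};

    void mutex_increment() {
        std::lock_guard<std::mutex> lock(m);
        ++plain_counter;
    }

    void atomic_increment() {
        atomic_counter.fetch_add(1);    // sequentially consistent by default
    }

    void publish() {
        payload = 42;                                          // ordinary write...
        std::atomic_thread_fence(std::memory_order_release);   // ...ordered before the flag store
        ready.store(true, std::memory_order_relaxed);
    }

    void consume() {
        while (!ready.load(std::memory_order_relaxed)) { }     // wait for the flag
        std::atomic_thread_fence(std::memory_order_acquire);   // pairs with the release fence
        // payload is now guaranteed to be 42
    }

    int main() {
        std::thread t1(mutex_increment), t2(atomic_increment), t3(publish), t4(consume);
        t1.join(); t2.join(); t3.join(); t4.join();
    }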
This is more or less the core of what I got from my first lessons about threading and memory models. My problem is:
I was going for lock-free data structures and atomics to achieve flexibility and good performance. The problem here is that an x86 machine apparently performs memory re-ordering differently from an ARM one, and I would like to keep my code as portable as possible, at least across these 2 platforms. So what kind of approach can you suggest for writing portable multi-threaded software when the 2 platforms are not guaranteed to have the same re-ordering mechanisms? Or are atomic operations the best choice as things stand, and have I got all this wrong?
For example, I noticed that the Intel TBB library (which is not C++11 code) is being ported to ARM/Android with heavy modifications to the part dedicated to atomics. So maybe I can write portable multi-threaded code in C++11, with lock-free data structures, and optimize the atomics part later on when porting my library to another platform?
The issues surrounding multi-threaded programming are not language-specific or architecture-specific. You are better off studying them first with a generalized view, and only afterwards, as a second step, specializing your general understanding to specific languages, libraries, platforms, etc.
The textbook required when I went to school was:
Principles of Concurrent and Distributed Programming - Ben-Ari
The second edition is 2006 I believe. There may be better ones, but this should suffice for starters.
Yep, X86 and ARM have different memory models.
The C++11 memory model is however not platform-specific, it has the same behavior everywhere.
That means the implementation of the C++11 atomics is different on each platform - on x86, which has a fairly strong memory model, the implementation of std::atomic might get away without special assembler instructions when storing a value, while on ARM, the implementation needs special locking or fence instructions internally.
So you can simply use the atomic classes in C++11; they will work the same on all platforms. If you want to, you can even tweak the memory order, if you are absolutely sure of what you are doing. A weaker memory order might be faster, since the implementation of the atomics might need fewer assembler instructions for locks and fences internally.
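As a small illustration (a sketch, not a benchmark): a plain event counter only needs atomicity, not ordering with respect to other data, so it can use memory_order_relaxed, which may compile to cheaper instructions on weakly ordered CPUs such as ARM; the default (sequentially consistent) order would be correct as well:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Tweaking the memory order: this counter only needs atomicity,
    // not ordering with respect to other data, so relaxed suffices.
    std::atomic<long> hits{0};

    void worker() {
        for (int i = 0; i < 100000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::thread a(worker), b(worker);
        a.join();
        b.join();
        std::printf("%ld\n", hits.load());  // 200000 on every platform
    }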
I can highly recommend watching Herb Sutter's talk Atomic Weapons for some detailed explanations about this.

How long should I expect a garbage collection to take before removing an opaque FFI object? Is it possible to speed it up some way?

I am considering writing Haskell bindings to a quantum mechanics library written in C++ (I'd write a plain C wrapper) and CUDA. A major bottleneck is always the GPU memory used by the CUDA parts. In C++, this is handled quite efficiently because all objects have automatic memory management, i.e. they are erased as soon as they go out of scope. I also use C++11 move semantics to avoid copies; those obviously wouldn't be necessary in Haskell anyway.
Yet I'm concerned it might not work as smoothly anymore when the objects are managed from garbage-collected Haskell, and I might need to come up with heuristics to migrate seldom-used objects back to host memory (which tends to be quite slow). Is this fear reasonable, or is the GHC garbage collector so effective that most objects will vanish almost as quickly as in C++, even when the Haskell runtime doesn't see that it needs to be economical with memory? Are there any tricks to help, or ways to signal that some objects take up too much GPU memory and should be removed as quickly as possible?
even when the Haskell runtime doesn't see that it needs to be economical with memory?
This is the issue: the GHC GC doesn't know how big your foreign objects are, so they don't exert any heap pressure, and thus aren't collected as soon as they could be.
You can mitigate this by calling performGC manually to force a major GC.

Boehm and tagged pointers

Tagged pointers are a common optimization when implementing dynamic languages: take advantage of alignment requirements that mean the low two or three bits of a pointer will always be zero, and use them to store type information.
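For concreteness, a typical encoding might look something like this (a minimal sketch; the tag values and helper names are made up):

    #include <cstdint>

    // Minimal sketch of pointer tagging: objects are assumed to be at least
    // 8-byte aligned, so the low three bits of a real pointer are always zero
    // and can carry a type tag instead.
    constexpr std::uintptr_t TAG_MASK = 0x7;
    constexpr std::uintptr_t TAG_PAIR = 0x1;   // hypothetical tag meaning "pair"

    inline std::uintptr_t tag_pair(void* p) {
        return reinterpret_cast<std::uintptr_t>(p) | TAG_PAIR;
    }

    inline bool is_pair(std::uintptr_t v) {
        return (v & TAG_MASK) == TAG_PAIR;
    }

    inline void* untag(std::uintptr_t v) {
        return reinterpret_cast<void*>(v & ~TAG_MASK);
    }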
Suppose you're using the Boehm garbage collector, which basically works by looking at active data for things that look like pointers. Tagged pointers don't look like pointers, in the sense that their low bits are nonzero.
Is this a showstopper, i.e. do you have to ditch tagged pointers if you're using Boehm? Or does it have a way around this problem?
AFAIK Boehm can handle this with the right options. It is capable, at a small price, of detecting interior pointers. It is also possible to write your own scanning code. Basically there are probably enough hooks to handle just about anything.
I have written my own collector; it is precise on the heap and conservative on the stack. It does not touch pointers created by C. For some applications it will be faster, because it knows a lot about the objects allocated by my language and doesn't care about other stuff which is managed, say, using traditional C++ destructors.
However, it isn't incremental or generational, and it doesn't handle threads either (it's not smart enough to stop threads with signals). On the plus side, it doesn't require the magic linkage techniques that Boehm does (to capture mallocs, etc.). On the seriously minus side, you can't put managed objects inside unmanaged ones.

Resources