How do Haskell compilers decide whether to allocate on the heap or the stack?

Haskell doesn't feature explicit memory management, and all objects are passed by value, so there's no obvious reference counting or garbage collection either. How does a Haskell compiler typically decide whether to generate code that allocates on the stack versus code that allocates on the heap for a given variable? Will it consistently heap or stack allocate the same variables across different call sites for the same function? And when it allocates, how does it decide when to free memory? Are stack allocations and deallocations still performed in the same function entrance/exit pattern as in C?

When you call a function like this
f 42 (g x y)
then the runtime behaviour is something like the following:
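p1 = malloc(2 * sizeof(Word))   // heap object for the boxed Int 42
p1[0] = &Tag_for_Int            // header: identifies the object as an Int
p1[1] = 42                      // payload
p2 = malloc(3 * sizeof(Word))   // heap object for the suspended call (g x y)
p2[0] = &Code_for_g_x_y         // header: code to run when the thunk is forced
p2[1] = x                       // free variables captured by the thunk
p2[2] = y
f(p1, p2)                       // both arguments are passed as pointers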
That is, arguments are usually passed as pointers to objects on the heap, like in Java, but unlike Java these objects may represent suspended computations, a.k.a. thunks, such as (g x y), i.e. p2, in our example. Without optimisations, this execution model is quite inefficient, but there are ways to avoid many of these overheads.
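To make the "suspended computation" point concrete, here is a minimal sketch. The definitions of f and g are hypothetical stand-ins (the original answer does not define them); g deliberately fails so that we can tell whether its thunk was ever forced.

```haskell
-- Hypothetical g: it fails when evaluated, so evaluation is observable.
g :: Int -> Int -> Int
g _ _ = error "g was evaluated"

-- Hypothetical f: it only looks at its second argument
-- when the first one is not positive.
f :: Int -> Int -> Int
f n arg
  | n > 0     = n + 1
  | otherwise = arg

-- The argument (g 1 2) is passed as a thunk and never forced here.
example :: Int
example = f 42 (g 1 2)   -- evaluates to 43
```

Loading this in GHCi and evaluating example yields 43 without the error being raised, because the thunk for (g 1 2) is never entered.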
GHC does a lot of inlining and unboxing. Inlining removes the function call overhead and often enables further optimisations. Unboxing means changing the calling convention: in the example above we could pass 42 directly instead of creating the heap object p1.
Strictness analysis finds out whether an argument is guaranteed to be evaluated. In that case, we don't need to create a thunk; we can evaluate the expression first and pass the final result as an argument.
Small objects (currently only 8-bit Chars and Ints) are cached. That is, instead of allocating a new object each time, a pointer to the cached object is returned. Even if such an object is initially allocated on the heap, the garbage collector will de-duplicate it later. Since objects are immutable, this is safe.
Limited escape analysis. For local functions some arguments may be passed on the stack, because they are known to be dead by the time the outer function returns.
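As an illustration of what strictness buys, here is a small sketch with two hypothetical summing functions. The second forces its accumulator by hand with seq, which is roughly the effect strictness analysis aims to achieve automatically when it can prove the argument is always needed. Whether GHC actually builds thunks for the first version depends on the optimisation level, so treat this as a sketch rather than a statement about generated code.

```haskell
-- Lazy accumulator: without optimisation, each step may allocate
-- a thunk for (acc + x) instead of computing the sum right away.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc []       = acc
    go acc (x : xs) = go (acc + x) xs

-- Strict accumulator: the new sum is evaluated before the recursive
-- call, so no chain of suspended additions builds up on the heap.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go acc []       = acc
    go acc (x : xs) =
      let acc' = acc + x
      in acc' `seq` go acc' xs
```

Both functions return the same result; they differ only in when the additions are performed and how much is allocated along the way.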
Edit: For (much) more information see "Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-machine". This paper uses "push/enter" as the calling convention. Newer versions of GHC use the "eval/apply" calling convention. For a discussion of the trade-offs and reasons for that switch, see "How to make a fast curry: push/enter vs eval/apply".

The only things GHC puts on the stack are evaluation contexts. Anything allocated with a let/where binding, and all data constructors and functions, are stored in the heap. Lazy evaluation makes everything you know about execution strategies in strict languages irrelevant.

Related

How do laziness and parallelism coexist in Haskell?

People argue that Haskell has an advantage in parallelism since it has immutable data structures. But Haskell is also lazy. It means data actually can be mutated from thunk to evaluated result.
So it seems laziness can harm the advantage of immutability. Am I wrong or does Haskell have countermeasures for this problem? Or is this Haskell's own feature?
Yes, GHC’s RTS uses thunks to implement non-strict evaluation, and they use mutation under the hood, so they require some synchronisation. However, this is simplified due to the fact that most heap objects are immutable and functions are referentially transparent.
In a multithreaded program, evaluation of a thunk proceeds as follows:
The thunk is atomically† replaced with a BLACKHOLE object
If the same thread attempts to force the thunk after it’s been updated to a BLACKHOLE, this represents an infinite loop, and the RTS throws an exception (<<loop>>)
If a different thread attempts to force the thunk while it’s a BLACKHOLE, it blocks until the original thread has finished evaluating the thunk and updated it with a value
When evaluation is complete, the original thread atomically† replaces the thunk with its result
† e.g., using a compare-and-swap (CAS) instruction
So there is a potential race here: if two threads attempt to force the same thunk at the same time, they may both begin evaluating it. In that case, they will do some redundant work—however, one thread will succeed in overwriting the BLACKHOLE with the result, and the other thread will simply discard the result that it calculated, because its CAS will fail.
Safe code cannot detect this, because it can’t obtain the address of an object or determine the state of a thunk. And in practice, this type of collision is rare for a couple of reasons:
Concurrent code generally partitions workloads across threads in a manner suited to the particular problem, so there is low risk of overlap
Evaluation of thunks is generally fairly “shallow” before you reach weak head normal form, so the probability of a “collision” is low
So thunks ultimately provide a good performance tradeoff when implementing non-strict evaluation, even in a concurrent context.
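The thunk life cycle described above (unevaluated, BLACKHOLE, value) can be modelled in ordinary Haskell with an IORef. This is only a toy model under stated assumptions: the names ThunkState, Claim, newThunk, claim and force are made up for illustration, and the real RTS is not written this way. In particular, where the RTS would block another thread or raise <<loop>>, this sketch simply returns Nothing.

```haskell
import Data.IORef (IORef, atomicModifyIORef', atomicWriteIORef, newIORef)

-- The three states a thunk goes through in this toy model.
data ThunkState a
  = Unevaluated (() -> a)  -- suspended computation
  | BlackHole              -- someone is currently evaluating it
  | Evaluated a            -- updated with its result

-- Outcome of trying to claim a thunk for evaluation.
data Claim a = Mine (() -> a) | Busy | Ready a

newThunk :: (() -> a) -> IO (IORef (ThunkState a))
newThunk = newIORef . Unevaluated

-- Atomically swap an unevaluated thunk for a BlackHole.
claim :: IORef (ThunkState a) -> IO (Claim a)
claim ref = atomicModifyIORef' ref step
  where
    step (Unevaluated k) = (BlackHole, Mine k)
    step BlackHole       = (BlackHole, Busy)
    step (Evaluated v)   = (Evaluated v, Ready v)

-- Force the thunk; Nothing means we hit a BlackHole.
force :: IORef (ThunkState a) -> IO (Maybe a)
force ref = do
  c <- claim ref
  case c of
    Ready v -> pure (Just v)
    Busy    -> pure Nothing
    Mine k  -> do
      let v = k ()
      -- Evaluate first, then overwrite the BlackHole with the result.
      v `seq` atomicWriteIORef ref (Evaluated v)
      pure (Just v)
```

For example, after t <- newThunk (\() -> 6 * 7 :: Int), the first force t runs the computation and stores the result, and any later force t just reads the stored value.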

Stack allocation of `isbits` types in Julia

Summary of question and answers
Objects of a particular type, say
type Foo
    a::A
    b::B
end
can be stored in either of two ways:
Inlined (aka by value): in this case, the statement "variable foo::Foo is stored at location x" effectively means we have a variable foo.a::A at location x and a variable foo.b::B at location x + sizeof(A) (technically the addresses could be a bit more complicated, but that's irrelevant for our purposes).
Referenced (aka by reference): "foo::Foo is stored at location x" means the location x contains a pointer fooptr::Ptr{Foo} such that there is a variable foo.a::A at location fooptr and foo.b::B at location fooptr + sizeof(A).
Unlike other languages (I'm looking at you, C/C++), Julia decides by itself whether to store variables inlined or referenced, and it does so based on the properties of the type:
mutable types -> referenced,
immutable types -> referenced if at least one of its fields is referenced, inlined otherwise.
There are at least two reasons for this rule:
StefanKarpinski's answer: The garbage collector needs to be able to find all pointers to heap-allocated objects on the stack. Currently, Julia ensures this by storing all such pointers on a separate "shadow stack", but if we allowed composite types containing pointers to be placed on the stack then such a neat separation would no longer be possible. Instead, the compiler would need to look for pointers among other variables, which poses technical difficulties.
yuyichao's answer: Julia requires the inline/reference decision to be made on a per-type rather than per-object basis, which means a hypothetical type
immutable A
    a::A
end
would have to be infinitely big if we insisted on inlining it. So we would either have to forbid such recursive immutable types, or we could at most allow non-recursive immutable types to be inlined.
Original question
My understanding of memory management in Julia is:
mutable types -> heap-allocated,
immutable types and tuples -> stack-allocated unless one of their fields is heap-allocated (i.e. mutable).
I don't quite understand the rationale for this behaviour, however. I've read somewhere that the problem with stack-allocating immutables with pointers to mutables is that then the garbage collector might consider the mutables unreachable and destroy them prematurely. On the other hand, if we place the immutable on the heap then there will still be a pointer to the mutables, so it might seem like we avoided the problem, but actually we just shifted it to making sure that now the immutable itself will not be destroyed.
Can anyone explain this to someone who has only very superficial knowledge of how garbage collection works?
The problem with stack-allocation of objects which reference other objects is knowing that they need to be traced during garbage collection. The simplest way to do this is what Julia does: heap allocate the objects and "root" them using a "shadow stack" which is pushed and popped in sync with the actual stack. This introduces a fair bit of overhead and forces these objects to be heap allocated.
A more sophisticated approach that avoids the overhead of a shadow stack and heap allocation is to stack allocate these objects, then scan the stack while doing garbage collection and follow references from objects on the stack to objects on the heap. However, this requires knowing which objects on the stack are pointers to objects on the heap; in general, non-heap-allocated objects are not guaranteed to be kept intact or contiguous in registers or on the stack. One approach to doing this is called "conservative stack scanning", which entails assuming during GC that any value on the stack which looks like it could be a pointer to an object on the heap actually is one. That approach has been successfully used in applications like Safari's JavaScript engine, but it's not without its challenges. We've contemplated using conservative stack scanning in Julia, and an initial effort to do so was started, but it was never completed.
References:
https://github.com/JuliaLang/julia/issues/11714
https://github.com/JuliaLang/julia/pull/8134
There are multiple issues/concepts that are frequently mixed together whenever this is brought up.
Mutable or non-pointerfree immutable doesn't necessarily mean heap allocation; we already have optimization passes to elide some of the allocations and are working on improving them further.
The object layout ABI is a user-visible behavior and not something an optimization pass can easily change (unless it can prove that the object it wants to optimize locally does not escape). The current ABI is that only isbits immutables will be stored inline (and "stack allocated" when used as local variables). There's a fundamental limitation to lifting the requirement of pointerfree-ness for inlined objects, i.e. the necessity to handle recursive types. It is impossible to store all types in a reference cycle inline, and the loop has to be broken somewhere if we want to make some of them inlined. I believe we do have a consistent and predictable model to do this, though whether this is desirable is another issue.
This is somewhat related to performance but not always. Storing inline means more copying, so it's hard to make sure there's no regression if we do the switch.
Edit: And I should also mention that pointer-free is a sufficient condition for cycle free and is easier to compute, which is partly why we are currently using it to break inlining cycles.
GC support. This is basically the easiest part. It's very easy to make GC recognize pointers on the stack. It just needs to be done if we decide to change the object layout ABI.
Edit: And I should add that "GC support" is needed because we currently only support a limited / simple stack layout for object reference (i.e. an array of pointers). It's this that needs to be improved.

Box<X> vs move semantics on X

I have an easy question regarding Box<X>.
I understand what it does, it allocates X on the heap.
In C++ you use the new operator to allocate something on the heap so it can outlive the current scope (because if you create something on the stack it goes away at the end of the current block).
But reading Rust's documentation, it looks like you can create something on the stack and still return it taking advantage of the language's move semantics without having to resort to the heap.
Then it's not clear to me when to use Box<X> as opposed to simply X.
I just started reading about Rust so I apologize if I'm missing something obvious.
First of all: C++11 (and newer) has move semantics with rvalue references, too. So your question would also apply to C++. Keep in mind, though, that C++'s move semantics are, unlike Rust's, highly unsafe.
Second: the term "move semantics" suggests the absence of a "copy", which is not true. Suppose you have a struct with 100 64-bit integers. If you transfer an object of this struct via move semantics, those 100 integers will be copied (of course, the compiler's optimizer can often remove those copies, but anyway...). The advantage of move semantics comes into play when dealing with objects that manage some kind of data on the heap (or pointers in general).
For example, take a look at Vec (similar to C++'s vector): the type itself only contains a pointer and two pointer-sized integers (ptr, len and cap). Those three 64-bit values are still copied when the vector is moved, but the main data of the vector (which lives on the heap) is not touched.
That being said, let's discuss the main question: "Why use Box at all?". There are actually many use cases:
Unsized types: some types (e.g. trait objects, which also include closures) are unsized, meaning their size is not known to the compiler. But the compiler has to know the size of each stack frame, hence those unsized types cannot live on the stack.
Recursive data structures: think of a BinaryTreeNode struct. It stores two members named "left" and "right" of type... BinaryTreeNode? That won't work. So you can box both children so that the compiler knows the size of your struct.
Huge structs: think of the 100 integer struct mentioned above. If you don't want to copy it every time, you can allocate it on the heap (this is rarely needed).
There are cases where you can't return X, e.g. if X is ?Sized (traits, non-compile-time-sized arrays, etc.). In those cases Box<X> will still work.

Understanding Haskell's `map` - Stack or Heap?

Given the following function:
f :: [String]
f = map integerToWord [1..999999999]
integerToWord :: Integer -> String
Let's ignore the implementation. Here's a sample output:
ghci> integerToWord 123999
"onehundredtwentythreethousandandninehundredninetynine"
When I execute f, do all results, i.e. integerToWord 1 through integerToWord 999999999, get stored on the stack or heap?
Note - I'm assuming that Haskell has a stack and heap.
After running this function for ~1 minute, I don't see the RAM increasing from its original usage.
To be precise: when you "just execute" f, it's not evaluated unless you use its result somehow. And when you do, how much of it is stored depends on what the caller demands.
As for this example, it's not stored anywhere: the function is applied to every number, the result is output to your terminal and is discarded. So at any given moment you only allocate enough memory to store the current value and the result (which is an approximation, but precise enough for this case).
References:
https://wiki.haskell.org/Non-strict_semantics
https://wiki.haskell.org/Lazy_vs._non-strict
First: To split hairs, the following answer applies to GHC. A different Haskell compiler could plausibly implement things differently.
There is indeed a heap and a stack. Almost everything goes on the heap, and hardly anything goes on the stack.
Consider, for example, the expression
let x = foo 17 in ...
Let's assume that the optimiser doesn't transform this into something completely different. The call to foo doesn't appear on the stack at all; instead, we create a note on the heap saying that we need to do foo 17 at some point, and x becomes a pointer to this note.
So, to answer your question: when you evaluate f, a note that says "we need to execute map integerToWord [1..999999999] someday" gets stored on the heap, and you get a pointer to that. What happens next depends on what you do with that result.
If, for example, you try to print the entire thing, then yes, the result of every call to integerToWord ends up on the heap. At any given moment, only a single call to integerToWord is on the stack.
Alternatively, if you just try to access the 8th element of the result, then a bunch of "call integerToWord 5 someday" notes end up on the heap, plus the result of integerToWord 8, plus a note for the rest of the list.
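Here is a small sketch of the "access only part of the result" case. The body of integerToWord is a stand-in (the question leaves the real implementation out), and eighth and firstThree are hypothetical helper names.

```haskell
-- Stand-in for the question's integerToWord; the real one spells
-- the number out in words.
integerToWord :: Integer -> String
integerToWord = show

f :: [String]
f = map integerToWord [1 .. 999999999]

-- Demands only the list spine up to index 7 and that one element;
-- the rest of the list remains an unevaluated thunk on the heap.
eighth :: String
eighth = f !! 7          -- "8" with the stand-in above

-- Demands only the first three cells of the list.
firstThree :: [String]
firstThree = take 3 f    -- ["1","2","3"] with the stand-in above
```

Both expressions return immediately in GHCi, even though f nominally describes almost a billion strings.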
Incidentally, there's a package out there ("vacuum"?) which lets you print out the actual object graphs for what you're executing. You might find it interesting.
GHC programs use a stack and a heap... but it doesn't work at all like the eager language stack machines you're familiar with. Somebody else is gonna have to explain this, because I can't.
The other challenge in answering your question is that GHC uses the following two techniques:
Lazy evaluation
List fusion
Lazy evaluation in Haskell means that (as the default rule) expressions are only evaluated when their value is demanded, and even then they may only be partially evaluated: only as far as needed to resolve a pattern match that requires the value. So we can't say what your map example does without knowing what is demanding its value.
List fusion is a set of rewrite rules built into GHC that recognize a number of situations where the output of a "good" list producer is only ever consumed as the input of a "good" list consumer. In these cases, GHC can fuse the producer and the consumer into an object-code loop without ever allocating list cells.
In your case:
[1..999999999] is a good producer
map is both a good consumer and a good producer
But you seem to be using ghci, which doesn't do fusion. You need to compile your program with -O for fusion to happen.
You haven't told us what would be consuming the output of the map. If it's a good consumer it will fuse with the map.
But there's a good chance that GHC would eliminate most or all of the list cell allocations if you compiled (with -O) a program that just prints the result of that code. In that case, the list would not exist as a data structure in memory at all—the compiler would generate object code that does something roughly equivalent to this:
for (int i = 1; i <= 999999999; i++) {
    print(integerToWord(i));
}
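The same idea can be sketched in Haskell itself. Below, totalLength is written as a producer/transformer/consumer pipeline, and totalLengthLoop is a hand-written loop of the shape that fusion aims to produce. Both names and the integerToWord stand-in are assumptions made for illustration, and whether GHC actually fuses the first version depends on the compiler version and on compiling with -O.

```haskell
-- Stand-in for the question's integerToWord.
integerToWord :: Integer -> String
integerToWord = show

-- Pipeline version: [1 .. n] produces, map transforms, sum consumes.
totalLength :: Integer -> Int
totalLength n = sum (map (length . integerToWord) [1 .. n])

-- Hand-written loop: a counter and an accumulator, no list cells.
totalLengthLoop :: Integer -> Int
totalLengthLoop n = go 1 0
  where
    go i acc
      | i > n     = acc
      | otherwise = let acc' = acc + length (integerToWord i)
                    in acc' `seq` go (i + 1) acc'
```

The two functions always agree on their result; the difference is only in how much intermediate structure gets allocated when the first one is not fused.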

Dynamic Languages and Variable Allocation

How does a dynamic language decide how much memory to allocate for a variable?
E.g., how does the compiler change variable = 5 to variable = "xxx" without too much memory overhead? When does it use the hardware stack and when does it use the memory heap?
The compiler allocates enough memory for each variable to hold a pointer plus whatever metadata the language runtime requires. But I think you mean to be asking how much memory is allocated for each object. In that case the answer is that it depends on the type of object. When a variable gets assigned to a different object, the pointer associated with that variable changes what it points to.
The answer, of course, varies by language - both the hosted dynamic language and the lower-level implementation language. That which applies to Perl does not necessarily apply to Python, nor does what applies in Tcl apply in Java or LISP or ... well, do they even count as dynamic languages?
In Perl, there's a C-level structure that goes by the name SV (scalar variable) that contains different storage for different versions of the variable's value. These are often heap-based; the storage for strings always ends up being heap-based, though a pure numeric value that has never been converted to a string might be in an SV that is strictly on the stack. In Perl, these things are reference counted (and mortalized, or immortalized, and all sorts of other interesting terms). More complicated types (AV, HV, RV, etc.) are based on SV.
