Efficient GC write barrier in generated C

I'm designing a system that compiles CIL bytecode ahead of time. To keep it relatively simple and very portable, the system will emit C source code (with all higher-level constructs such as OOP lowered away) rather than machine code. The intention is that a standard C compiler for the target platform will then be run on that code to produce the end product.
Initially I intend to use a very simple GC approach such as stop-the-world. However, while the application doesn't require stellar performance, it does need decent performance, so eventually the GC may need to be changed.
I'm thinking ahead to a more sophisticated GC that will require some sort of write barrier. I've looked at SATB and card-marking approaches, but I'm not ready to actually plan out a good GC yet. I just don't want to shoot myself in the foot by having the thing emit C source code only to find later that an efficient GC write barrier demands inline assembly, largely defeating the purpose of emitting C.
So, my question is: can typical write barriers be efficiently implemented in C code? We can assume that the C compiler has a decent optimizer. It's also a given that the resulting "source code" will be utterly illegible, so clarity doesn't matter.
I'm guessing that, at the expense of bloating the source files even more, it probably can be done reasonably well, but I'd appreciate word from people more experienced in GC design and/or compiler internals.

I am assuming you want a precise generational moving or copying GC.
You can have a write barrier in C: both the OCaml and MELT runtimes have generational GCs with write barriers, and Qish is a generational copying GC with a write barrier that works with C.
(Incidentally, MELT is a domain-specific language for extending GCC, and it is compiled to C, exactly as you want to do.)
A more significant issue is how you keep track of local pointers, and how the GC learns about them; that is the "precise" aspect of your GC. You may want to pack them into some local structure, as sketched below, but then the C compiler (e.g. GCC) may optimize a bit less.
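For concreteness, here is a minimal sketch of that idea in plain C; all the names (gc_frame, gc_top, generated_fn) are hypothetical, and a real implementation would choose its own layout:

struct gc_frame {
    struct gc_frame *prev;   /* enclosing frame in the root chain */
    unsigned nroots;         /* number of pointer slots in this frame */
    void **roots;            /* the function's GC-visible locals */
};

static struct gc_frame *gc_top;  /* head of the root chain (one per thread in a threaded build) */

void *generated_fn(void *arg)
{
    void *locals[2] = { arg, NULL };              /* slot 0 = arg, slot 1 = a temporary */
    struct gc_frame frame = { gc_top, 2, locals };
    gc_top = &frame;                              /* push: the collector can now walk this frame */
    /* ... the body reads and writes locals[0] / locals[1] instead of raw
       C locals, so a moving collector can update them in place ... */
    gc_top = frame.prev;                          /* pop on every exit path */
    return locals[1];
}

Because the slots are address-taken, the compiler has to keep them in memory rather than in registers, which is exactly the "optimizing a bit less" cost mentioned above.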
You might look into the source code of recent versions of Mono, which has a generational copying GC. Also look inside Chicken Scheme, which likewise generates C code.
Note that your C code generator will have to be adapted to fit a particular GC implementation (yours or an existing one), because each GC has slightly different invariants and expectations. Also take care with tail calls: some C compilers, notably recent GCC, optimize them only in limited cases.
In Qish, MELT, or OCaml, the write barrier (on the C side) is implemented as a macro (or inline function) invoked for each touched pointer. The details are implementation-specific; your C code generator will have to emit these calls, along the lines of the sketch below.
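To give a flavor of what such a macro can look like, here is a minimal card-marking barrier sketched in C; card_table, heap_base, and CARD_SHIFT are hypothetical names that a real GC would tie to its own heap layout:

extern unsigned char *card_table;   /* one dirty byte per card */
extern char *heap_base;             /* start of the collected heap */
#define CARD_SHIFT 9                /* 512-byte cards */

#define WRITE_BARRIER(obj, field, val)                               \
    do {                                                             \
        (obj)->field = (val);                                        \
        card_table[((char *)(obj) - heap_base) >> CARD_SHIFT] = 1;   \
    } while (0)

A decent C optimizer compiles the mark down to a subtraction, a shift, and a one-byte store next to the original store, so no inline assembly is needed.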
Be aware that multi-threaded GCs are difficult to design, and that debugging a GC, even a simple one, is hard and takes a lot of time.

Related

How to reliably influence generated code at near machine level using GHC?

While this may sound like a theoretical question, suppose I decide to invest in and build a mission-critical application written in Haskell, and a year later I find that I absolutely need to improve the performance of some very narrow bottleneck. This will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread-scheduling irregularities, occasional GC, etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns matter most (avoiding pointer chasing, reducing conditional jumps, etc.)
Code may be tied to a specific GHC release (but no forking)
The performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields, etc.)
Data is statically bounded in arrays, and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization? By "reliably" I mean that if a source change causes the code to stop performing, it can be corrected at the source level without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would a custom FFI help avoid C calling-convention overhead?
Could a special-purpose compiler plugin do it through a restricted source DSL?
Could a source-code generator from a "high-level" assembly (LLVM?) be the solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation) by looking at the Core representation, a very Haskell-like intermediate language produced in the first stage of compilation. For example, you might see Int# in the Core output, which denotes an integer with no heap representation (it will live in a register).
When optimizing Haskell code, we regularly look at Core and expect to be able to fix performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function so that it can be inlined). This isn't always fun, but it is fairly stable, especially if you pin your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
Many other libraries, such as vector, implement unboxed arrays on top of MutableByteArray; the point is that operations on that structure generate no garbage and compile down to fairly predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can research how to write portions of your program in C and interoperate; Haskell supports pinned arrays, among other structures, for this purpose.
If you need more control than those give you, then Haskell is likely the wrong language. It's impossible to tell from your description whether this is the case for your problem (your requirements seem somewhat contradictory: you need to write a carefully cache-tuned algorithm, yet arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength-reduction optimizations that e.g. GCC performs (GHC's NCG will probably never know about bit-twiddling hacks, autovectorization, and so on). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

Can I use the OCaml GC with the LLVM GC interface?

For my PHP LLVM backend I'd like to try out the OCaml GC. Is it possible to use it with LLVM? In particular:
Is the OCaml GC decoupled enough to be used outside of the compiler?
Is the LLVM GC interface mature enough to be used with the OCaml GC?
This shouldn't represent too much work, as the OCaml GC is already handled to some extent in LLVM: http://llvm.org/docs/GarbageCollection.html#the-erlang-and-ocaml-gcs . This means that stack frame descriptors are correctly emitted for function calls (not the smallest ones, but this should improve with the current developments in LLVM's GC handling). An old version of LLVM's documentation claims that the OCaml GC doesn't use write barriers, which is erroneous, so you should be careful to ensure that the generated code handles assignments correctly.
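Concretely, OCaml's write barrier is exposed to C as the caml_modify function, usually reached through the Store_field macro, and generated code must route every mutable store through it rather than writing the field directly. A minimal illustration (the function name is just an example):

#include <caml/mlvalues.h>
#include <caml/memory.h>

void set_first_field(value record, value v)
{
    /* Wrong under a generational GC: Field(record, 0) = v; */
    Store_field(record, 0, v);   /* performs the store plus the write barrier (caml_modify) */
}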
As for the LLVM GC interface, the current one is quite restricted and does not allow generating very efficient code, but it should be sufficient for prototyping while waiting for the next version, which should contain some important changes on that side.
While it appears as though it would be relatively easy to tear out the OCaml GC and Frankenstein it into a different project, I'm not sure this is really something you would want to do in practice.
The OCaml garbage collector was designed with a functional programming style in mind, and this GC architecture might be a liability for a language such as PHP, which is usually not used in a functional style.
If you are set on doing this, then I would suggest either waiting a few months for multicore support to be accepted into the OCaml compiler/runtime, or using one of the various projects currently trying to bring multicore support to OCaml (the most serious of these is probably the project by the people over at OCamllabs). Right now the OCaml GC lacks true multicore support, and while this isn't really much of a problem in practice, some people can't seem to live without it.

How long should I expect a garbage collection to take before it removes an opaque FFI object? Is it possible to speed it up in some way?

I'm considering writing Haskell bindings to a quantum mechanics library written in C++ (I'd write a plain C wrapper) and CUDA. A major bottleneck is always the GPU memory used by the CUDA parts. In C++ this is handled quite efficiently because all objects have automatic memory management, i.e. they are destroyed as soon as they leave scope. I also use C++11 move semantics to avoid copies; those obviously wouldn't be necessary in Haskell anyway.
Yet I'm concerned it might not work as smoothly when the objects are managed from garbage-collected Haskell, and that I might need heuristics to migrate seldom-used objects back to host memory (which tends to be quite slow). Is this fear reasonable, or is GHC's garbage collection so effective that most objects will vanish almost as quickly as in C++, even when the Haskell runtime doesn't see that it needs to be economical with memory? Are there any tricks to help, or ways to signal that some objects take up too much GPU memory and should be removed as quickly as possible?
even when the Haskell runtime doesn't see that it needs to be economical with memory?
This is the issue: the GHC GC doesn't know how big your foreign objects are, so they don't exert any heap pressure, and thus aren't collected as soon as they could be.
You can mitigate this by manually calling performGC (from System.Mem) to force a major GC.

Boehm and tagged pointers

Tagged pointers are a common optimization when implementing dynamic languages: take advantage of alignment requirements that mean the low two or three bits of a pointer will always be zero, and use them to store type information.
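For concreteness, the usual encoding looks something like this in C (the names and the choice of a 3-bit tag are illustrative, assuming 8-byte-aligned allocations):

#include <stdint.h>

#define TAG_MASK          ((uintptr_t)7)                          /* low 3 bits are free */
#define TAG_OF(v)         ((uintptr_t)(v) & TAG_MASK)             /* extract the type tag */
#define UNTAG(v)          ((void *)((uintptr_t)(v) & ~TAG_MASK))  /* recover the real pointer */
#define MAKE_TAGGED(p, t) ((void *)((uintptr_t)(p) | (t)))        /* attach a tag */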
Suppose you're using the Boehm garbage collector, which basically works by looking at active data for things that look like pointers. Tagged pointers don't look like pointers, in the sense that their low bits are nonzero.
Is this a showstopper, i.e. do you have to ditch tagged pointers if you're using Boehm? Or does it have a way around this problem?
AFAIK Boehm can handle this with the right options. It is capable, at a small price, of detecting interior pointers. It is also possible to write your own scanning code. Basically there are probably enough hooks to handle just about anything.
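For low-bit tags specifically, the relevant hook is GC_register_displacement (a real function in Boehm's gc.h), which tells the collector to treat pointers displaced by n bytes as references to the underlying object. A sketch:

#include <gc.h>

void init_runtime(void)
{
    GC_INIT();
    /* Recognize base+1 .. base+3 as references, so tagged
       pointers keep their referents alive. */
    GC_register_displacement(1);
    GC_register_displacement(2);
    GC_register_displacement(3);
}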
I have written my own collector; it is precise on the heap and conservative on the stack, and it does not touch pointers made by C. For some applications it will be faster because it knows a lot about the objects my language allocates and doesn't care about other data, which can be managed, say, with traditional C++ destructors.
However, it isn't incremental or generational, and it doesn't handle threads either (it's not smart enough to stop threads with signals). On the plus side, it doesn't require the magic linkage techniques Boehm does (to capture mallocs, etc.). On the seriously minus side, you can't put managed objects inside unmanaged ones.

How to implement closures without gc?

I'm designing a language. First, I want to decide what code to generate. The language will have lexical closures and prototype-based inheritance similar to JavaScript. But I'm not a fan of GC and try to avoid it as much as possible. So the question: is there an elegant way to implement closures without resorting to allocating the stack frame on the heap and leaving it to a garbage collector?
My first thoughts:
Use reference counting and garbage collect the cycles (I don't really like this)
Use a spaghetti stack (looks very inefficient)
Limit the forming of closures to certain contexts, in such a way that I can get away with a return-address stack and a locals stack.
I won't use a high-level language or follow any calling conventions, so I can smash the stack as much as I like.
(Edit: I know reference counting is a form of garbage collection, but I am using "GC" in its more common meaning.)
This would be a better question if you can explain what you're trying to avoid by not using GC. As I'm sure you're aware, most languages that provide lexical closures allocate them on the heap and allow them to retain references to variable bindings in the activation record that created them.
The only alternative to that approach that I'm aware of is what GCC uses for nested functions: create a trampoline for the function and allocate it on the stack. But as the GCC manual says:
If you try to call the nested function through its address after the containing function has exited, all hell will break loose. If you try to call it after a containing scope level has exited, and if it refers to some of the variables that are no longer in scope, you may be lucky, but it's not wise to take the risk. If, however, the nested function does not refer to anything that has gone out of scope, you should be safe.
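For illustration, here is a minimal sketch using GCC's nested-function extension (not portable C); it stays on the safe side of the warning above because the closure is called before its containing function exits:

#include <stdio.h>

static int apply(int (*f)(int), int x) { return f(x); }

int main(void)
{
    int a = 5;
    int add_a(int x) { return x + a; }   /* nested function closing over a */
    /* Taking add_a's address makes GCC build a trampoline on the stack;
       calling it here, while main's frame is still live, is safe. */
    printf("%d\n", apply(add_a, 4));     /* prints 9 */
    return 0;
}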
Short version is, you have three main choices:
allocate closures on the stack, and don't allow their use after their containing function exits.
allocate closures on the heap, and use garbage collection of some kind.
do original research, maybe starting from the region stuff that ML, Cyclone, etc. have.
This thread might help, although some of the answers here reflect answers there already.
One poster makes a good point:
It seems that you want garbage collection for closures "in the absence of true garbage collection". Note that closures can be used to implement cons cells, so your question really is about garbage collection "in the absence of true garbage collection" -- there is rich related literature. Restricting the problem to closures does not really change it.
So the answer is: no, there is no elegant way to have closures and no real GC.
The best you can do is some hacking to restrict your closures to a particular type of closure. All this is needless if you have a proper GC.
So, my question reflects some of the others here: why do you not want to implement GC? A simple mark-and-sweep or stop-and-copy collector takes about 200-300 lines of (Scheme) code and isn't really that bad in terms of programming effort. As for making your programs slower:
You can implement a more complex GC which has better performance.
Just think of all the memory leaks programs in your language won't suffer from.
Coding with a GC available is a blessing. (Think C#, Java, Python, Perl, etc... vs. C++ or C).
I understand that I'm very late, but I stumbled upon this question by accident.
I believe that full support of closures indeed requires GC, but in some special cases stack allocation is safe. Determining these special cases requires some escape analysis. I suggest that you take a look at the BitC language papers, such as Closure Implementation in BitC. (Although I doubt whether the papers reflect the current plans.) The designers of BitC had the same problem you do. They decided to implement a special non-collecting mode for the compiler, which denies all closures that might escape. If turned on, it will restrict the language significantly. However, the feature is not implemented yet.
I'd advise you to use a collector - it's the most elegant way. You should also consider that a well-built garbage collector allocates memory faster than malloc does. The BitC folks really do value performance, and they still think that GC is fine even for most parts of their operating system, Coyotos. You can mitigate the downsides by simple means:
create only a minimal amount of garbage
let the programmer control the collector
optimize stack/heap use by escape analysis
use an incremental or concurrent collector
if somehow possible, divide the heap like Erlang does
Many fear garbage collectors because of their experiences with Java. Java has a fantastic collector, but applications written in Java have performance problems because of the sheer amount of garbage generated. In addition, a bloated runtime and fancy JIT compilation is not really a good idea for desktop applications because of the longer startup and response times.
The C++0x spec defines lambdas without garbage collection. In short, the spec allows non-deterministic behavior in cases where the lambda closure contains references that are no longer valid. For example (C++11 syntax):
#include <functional>

std::function<int(int)> create_lambda(int a)
{
    return [&](int x) { return x + a; };  // captures a by reference
}

create_lambda(5)(4);  // undefined result: a's stack frame is gone
The lambda in this example refers by reference to a variable (a) that is allocated on the stack. However, that stack frame has been popped and is not necessarily available once the function returns. In this case it would probably work and return 9 (assuming sane compiler semantics), but there is no way to guarantee it.
If you are avoiding garbage collection, then I assume you also allow explicit heap vs. stack allocation and (probably) pointers. If that is the case, then you can do as C++ does and assume that developers using your language will be smart enough to spot the problem cases with lambdas and copy to the heap explicitly (just as you would when returning a value synthesized within a function).
Use reference counting and garbage collect the cycles (I don't really like this)
It's possible to design your language so there are no cycles: if you can only make new objects and not mutate old ones, and if making an object can't make a cycle, then cycles never appear. Erlang works essentially this way, though in practice it does use GC.
If you have the machinery for a precise copying GC, you could allocate on the stack initially and copy to the heap and update pointers if you discover at exit that a pointer to this stack frame has escaped. That way you only pay if you actually do capture a closure that includes this stack frame. Whether this helps or hurts depends on how often you use closures and how much they capture.
You might also look into C++0x's approach (N1968), though as one might expect from C++ it consists of counting on the programmer to specify what gets copied and what gets referenced, and if you get it wrong you just get invalid accesses.
Or just don't do GC at all. There can be situations where it's better to just accept the memory leak and let the operating system reclaim everything when the process exits.
Depending on your qualms about GC, you might be afraid of the periodic GC sweeps. In this case you could do a selective GC when an item falls out of scope or the pointer changes. I'm not sure how expensive this would be though.
@Allen
What good is a closure if you can't use it after the containing function exits? From what I understand, that's the whole point of closures.
You could work under the assumption that all closures will eventually be called, exactly once. Then, when a closure is called, you can do the cleanup when it returns, as in the sketch below.
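A sketch of that scheme in C (all names here are hypothetical): the environment is heap-allocated once, and the single permitted call frees it on return.

#include <stdlib.h>

typedef struct closure {
    int (*code)(struct closure *self, int x);
    int a;                            /* captured free variable, copied in */
} closure;

static int add_a(closure *self, int x) { return x + self->a; }

closure *make_adder(int a)
{
    closure *c = malloc(sizeof *c);
    c->code = add_a;
    c->a = a;
    return c;
}

int call_once(closure *c, int x)
{
    int r = c->code(c, x);
    free(c);                          /* the one call has happened: clean up */
    return r;
}

/* call_once(make_adder(5), 4) evaluates to 9 */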
How do you plan on dealing with returning objects? They have to be cleaned up at some point, which is the exact same problem with closures.
So the question: Is there an elegant way to implement closures without resorting to allocate the stack frame on the heap and leave it to garbage collector?
GC is the only solution for the general case.
Better late than never?
You might find this interesting: Differential Execution.
It's a little-known control structure, and its primary use is in programming user interfaces, including ones that can change dynamically while in use. It is a significant alternative to the Model-View-Controller paradigm.
I mention it because one might think that such code would rely heavily on closures and garbage-collection, but a side effect of the control structure is that it eliminates both of those, at least in the UI code.
Create multiple stacks?
I've read that the latest versions of ML use GC only sparingly.
I guess if the process is very short, which means it cannot use much memory, then GC is unnecessary. The situation is analogous to worrying about stack overflow: don't nest too deeply and you cannot overflow; don't run too long and you cannot need the GC. Cleaning up becomes a matter of simply reclaiming the large region that you pre-allocated. Even a longer process can be divided into smaller processes that each have their own pre-allocated heap. This would work well with event handlers, for example. It does not work well if you are writing a compiler, though; in that case, a GC is surely not much of a handicap.
