Heap Object representation for OO language - garbage-collection

As part of my master's thesis I am writing a compiler for an object-oriented language that was developed at my home university. Currently the compiler outputs assembler that runs on a virtual machine. The virtual machine handles things like stack operations, object generation, heap management and garbage collection.
The target architecture for my compiler is a MIPS-like CPU.
I am searching for strategies to develop an object layout and ideas to implement and trigger garbage collection at runtime. I could of course analyze how GCC implements this with C++, but I'd prefer to be pointed to some good publications/resources.

Read up on Python's internal object management. They use reference counting and dispose of objects when the reference count goes to zero.
Here's an older (but still helpful) document: http://docs.python.org/release/2.5.2/ext/refcounts.html
Here's general stuff: http://en.wikipedia.org/wiki/Reference_counting
And some more: http://code.google.com/p/augustus/wiki/OptionalGarbageCollection
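
To make the reference-counting idea concrete, here is a minimal sketch in C of a heap object whose header pairs a class descriptor pointer with a reference count. Every name in it (ClassInfo, ObjHeader, obj_new, obj_retain, obj_release, Point) is hypothetical and not taken from Python or any particular VM. It only illustrates the "free when the count reaches zero" rule; plain reference counting like this cannot reclaim cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-class descriptor: one shared instance per class. */
typedef struct ClassInfo {
    const char *name;
    size_t instance_size;            /* bytes, including the header */
    void (*finalize)(void *self);    /* optional cleanup hook, may be NULL */
} ClassInfo;

/* Hypothetical header placed at the start of every heap object. */
typedef struct ObjHeader {
    const ClassInfo *cls;
    size_t refcount;
} ObjHeader;

/* Allocate a zeroed object of the given class with one owning reference. */
static void *obj_new(const ClassInfo *cls) {
    ObjHeader *h = calloc(1, cls->instance_size);
    if (h == NULL) return NULL;
    h->cls = cls;
    h->refcount = 1;
    return h;
}

/* Take an additional reference. */
static void obj_retain(void *obj) {
    assert(obj != NULL);
    ((ObjHeader *)obj)->refcount++;
}

/* Drop one reference; returns 1 if the object was freed, 0 otherwise. */
static int obj_release(void *obj) {
    ObjHeader *h = obj;
    assert(h != NULL && h->refcount > 0);
    if (--h->refcount > 0) return 0;
    if (h->cls->finalize) h->cls->finalize(obj);
    free(h);
    return 1;
}

/* Example class: a 2D point whose fields follow the header. */
typedef struct Point {
    ObjHeader header;
    int x, y;
} Point;

static int points_finalized = 0;
static void point_finalize(void *self) { (void)self; points_finalized++; }

static const ClassInfo POINT_CLASS = { "Point", sizeof(Point), point_finalize };
```

Putting the header first means any object pointer can be treated as an ObjHeader pointer, which is what lets obj_retain and obj_release work on every class.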

Related

Kotlin for game dev

Background:
I'm always searching for a language to replace Java for game development. Kotlin looks promising, with good IDE support and Java interop. But one of the FPS killers for a game (on Android especially) is GC usage. So some libraries (like libgdx) use object pools, custom collections and other tricks to avoid frequent GC runs. In Java that can be done in a clear way. Some other JVM languages, especially those with functional support, use the GC heavily by their nature, so it is hard to avoid.
Questions:
Does Kotlin create any invisible GC overhead in comparison to Java?
Which Kotlin features are best avoided in order to reduce GC work?
You can write Kotlin code for the JVM that causes the same allocations as the corresponding Java logic. In both cases you have to check carefully whether a library call allocates new memory on the heap or not. Using Kotlin in combination with LibGDX doesn't introduce any invisible GC overhead. It's an effective way and works well (especially with the ktx extension).
But there are Kotlin language features which may help you to write your code with fewer allocations.
Singletons are a language feature (object declarations, companion objects).
You can create wrapper classes for primitive types which compile to primitives, so you get the power of type safety and rich domain models (inline classes).
With the combination of operator overloading and inline functions you can build nice APIs which modify objects without allocating new ones. (Example: Allocation-free Vectorial operations using custom operators)
If you use any kind of dependency injection mechanism or object pooling to connect your game logic and reuse objects, then reified type parameters may help you use it in a very elegant and short way. You can omit passing a class as a parameter when the compiler knows the actual type.
But there is also another option which does give you different memory management behavior. Thanks to Kotlin Multiplatform, you can write your game logic as a Kotlin common module and cross-compile it to native code or to JavaScript.
I did this in a sample game project, Candy Crush Clone. It works with Korge, a modern multiplatform game engine for Kotlin. The game runs on the JVM, as an HTML web app, and as a native binary on Windows, Linux, Mac, Android or iOS.
The native compiled code has its own simpler garbage collection and can run faster. So the speed-increase and the different memory management may give you the power reserve to bother even less with the GC.
In conclusion I can recommend Kotlin for Game dev, also for GC critical scenarios. In my projects I tend to create more classes and allocate more memory when I write Kotlin code. But this is a question of programming style, not a technical one.
As a rule of thumb, Kotlin generates bytecode as close as possible to the one generated by Java. So, for example, if you use a function as a value, an inner class will be created, like in Java, but no more. There are also some optimization tricks like IntArray and inline to perform even better.
And as #Peter-Lawrey said, it's always a better idea to measure the values for your specific case.
Technically, your questions comparing Kotlin to Java are moot, they will perform the same. But Kotlin will be a better development experience.
If Java is good for writing games, then Kotlin would only be better due to developer productivity.
Note: the gaming library LWJGL 3 uses Kotlin in part, with GitHub stats showing 67.3% of the code being Kotlin (template module looks to be mostly Kotlin). So asking people who work with LWJGL will give you the best answer to this question since they have a lot of experience in this area.

How to determine whether I need to use a garbage collector?

I am working on a program which will be used for drawing vector pictures. As such, it will have to store points, paths defined by these points, pictures defined by these paths etc. Inkscape (http://inkscape.org/) which does something similar seems to use the Boehm Garbage Collector (http://www.hpl.hp.com/personal/Hans_Boehm/gc/). Does that mean it would be advisable for me to do the same also? I mean, what criteria should I use to determine whether I need to use a GC in my program?
Thanks.
Whether garbage collection will be used or not depends on the programming language used to develop your application.
Garbage-collection is a memory-management technique and as such its use depends on the choice of programming language. Some programming languages are using garbage collection (such as Java, C#) and some do not use garbage collection (C/C++).
Btw. Inkscape uses the C, C++, Python, Perl and XSLT programming languages, of which Python and Perl use garbage collection for their memory management.
UPDATE: To learn more about C/C++ and Garbage Collection, I would recommend:
Why doesn't C++ have a garbage collector?
Garbage Collection in C++ -- why?

is Haskell a managed language?

I'm a complete newbie in Haskell. One thing that always bugs me is the ambiguity in whether Haskell is a managed (term borrowed from MS) language like Java or a compile-to-native-code language like C?
The GHC page says this "GHC compiles Haskell code either directly to native code or using LLVM as a back-end".
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
/Update/
Thanks so much for your answer. Conceptually, can you please help point out which one of my following understandings of garbage collection in Haskell is correct:
GHC compiles Haskell code to native code. In the process of compiling, garbage collection routines will be added to the original program code?
OR
There is a program that runs alongside a Haskell program to perform garbage collection?
As far as I am aware the term "managed language" specifically means a language that targets .NET/the Common Language Runtime. So no, Haskell is not a managed language and neither is Java.
Regarding what Haskell is compiled to: As the documentation you quoted says, GHC compiles Haskell to native code. It can do so by either directly emitting native code or by first emitting LLVM code and then letting LLVM compile that to native code. Either way the end result of running GHC is a native executable.
Besides GHC there are also other implementations of Haskell - most notably Hugs, which is a pure interpreter that never produces an executable (native or otherwise).
how can features like garbage collection be possible without something like a JVM?
The same way that they're possible with the JVM: Every time memory is allocated, it is registered with the garbage collector. Then from time to time the garbage collector runs, following the steps of the given garbage collection algorithm. GHC-compiled code uses generational garbage collection.
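
As a toy illustration of "register every allocation, then let the collector run from time to time", here is a minimal mark-and-sweep sketch in C. It is not GHC's collector (which, as noted, is generational), and every name in it (GcObject, gc_alloc, gc_add_root, gc_collect, the fixed-size root table) is hypothetical. Each object is limited to two outgoing references to keep the code short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct GcObject {
    struct GcObject *next_allocated;  /* intrusive list of every allocation */
    struct GcObject *fields[2];       /* outgoing references, may be NULL */
    int marked;
    int value;
} GcObject;

#define MAX_ROOTS 16

static GcObject *all_objects = NULL;
static GcObject *roots[MAX_ROOTS];
static size_t root_count = 0;
static size_t live_objects = 0;

/* Allocation registers the new object with the collector. */
static GcObject *gc_alloc(int value) {
    GcObject *obj = calloc(1, sizeof *obj);
    assert(obj != NULL);
    obj->value = value;
    obj->next_allocated = all_objects;
    all_objects = obj;
    live_objects++;
    return obj;
}

/* Roots are the references the program itself holds (stack, globals). */
static void gc_add_root(GcObject *obj) {
    assert(root_count < MAX_ROOTS);
    roots[root_count++] = obj;
}

/* Mark phase: flag everything reachable from obj. */
static void gc_mark(GcObject *obj) {
    if (obj == NULL || obj->marked) return;
    obj->marked = 1;
    gc_mark(obj->fields[0]);
    gc_mark(obj->fields[1]);
}

/* Mark from the roots, then sweep: free every unmarked object.
   Returns the number of objects freed. */
static size_t gc_collect(void) {
    size_t freed = 0;
    for (size_t i = 0; i < root_count; i++) gc_mark(roots[i]);

    GcObject **link = &all_objects;
    while (*link != NULL) {
        GcObject *obj = *link;
        if (obj->marked) {
            obj->marked = 0;              /* reset for the next cycle */
            link = &obj->next_allocated;
        } else {
            *link = obj->next_allocated;  /* unlink and reclaim */
            free(obj);
            freed++;
            live_objects--;
        }
    }
    return freed;
}
```

Nothing here needs a virtual machine: the collector is ordinary native code that the program calls into.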
In response to your edit:
GHC compiles Haskell code to native code. In the process of compiling, garbage collection routines will be added to the original program code?
Basically. Except that saying "garbage collection routines will be added to the original program code" might paint the wrong picture. The GC routines are just part of the library that every Haskell program is linked against. The compiled code simply contains calls to those routines at the appropriate places.
Basically all there is to it is to call the GC's alloc function every time you would otherwise call malloc.
Just look at any GC library for C and how it's used: all you need to do is #include the library's header, link against the library, and replace each occurrence of malloc with the GC library's alloc function (and remove all calls to free), and bam, your code is garbage collected.
There is a program that runs alongside a Haskell program to perform garbage collection?
No.
whether Haskell is a managed (term borrowed from MS) language like Java
GHC-compiled programs include a garbage collector. (As far as I know, all implementations of Haskell include garbage collection, but this is not part of the specification.)
or a compile-to-native code like C?
GHC-compiled programs are compiled to native code. Hugs interprets programs, and does not compile to native code. There are several other implementations which all, as far as I know, compile to native code, but I list these separately because I'm not as confident of this fact.
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
GHC-compiled programs include a runtime system that provides some basic capabilities like M-to-N green threading, garbage collection, and an IO manager. In a sense, this is a bit like having "something like a JVM" in that it provides many of the same features, but it's very different in implementation: there is no common bytecode across all architectures (and hence no "virtual machine").
which one of my following understandings of garbage collection in Haskell is correct:
GHC compiles Haskell code to native code. In the process of compiling, garbage collection routines will be added to the original program code?
There is a program that runs alongside a Haskell program to perform garbage collection?
Case 1 is correct: the runtime system code is added to the program code during compilation.
"Managed language" is an overloaded term so here are one-word answers and then some details for the usual different meanings that come to (my) mind:
Managed as in a CLR target
No, Haskell does not compile to Microsoft CLI's IL.
Well, I read there are some solutions that can do that, but IMO, don't: the CLR isn't built for FP and will seriously lack optimizations, probably yielding research-language performance. If I personally really, really wanted to target the CLR, I'd use F# -- it's not a functional language but it's close.
N.B. This is the most accurate and actual meaning for the term "managed language". The next meanings are, well, wrong, but nevertheless & unfortunately common.
Managed as in automatically garbage-collected
Yes, and this is pretty much a must-have. I mean, beyond the specification: if we had to collect garbage by hand, it would destroy the functional theme that lets us work in the high altitudes that are our beloved home.
It would also enforce impurity and a memory model.
Managed as in compiled to bytecode which is run by a VM
No (usually).
It depends on your backend:
Not only do we have different Haskell compilers today, some compilers have different backends -- there are even backends for JavaScript!
So if you do want to target a VM, you can use an existing / make a backend for it. But Haskell doesn't require it. So just as you can compile to native raw-metal binary, you can compile to anything else.
In contrast to CLR languages like C# [1] and VB.NET, and in contrast to Java etc., you don't have to target a JVM, the CLR, Mono, etc., as Haskell doesn't require a VM at all.
GHC is a good example. When you compile with GHC, it doesn't compile straight to binary: it compiles to an intermediate language called Core, then optimizes from Core to Core several times before it proceeds to another language called STG, and only then proceeds to code generation (it can stop there if you tell it to). These days you can also use it to compile to LLVM IR (which is subject to some awesome optimizations). With the LLVM backend, GHC can produce wildly faster programs. For more information about it and about GHC backends, go here.
The diagram below illustrates the GHC compilation pipeline, and here you can find more information about the various stages.
See the fork at the bottom for three different targets? those are the backends I was referring to.
[1] A future exception and a fun fact: Microsoft is currently working on native .NET, the cunningly named Microsoft .NET Native.
What, for you, is the defining feature of a "managed language"? The phrase "GHC compiles Haskell code either directly to native code or using LLVM as a back-end" that you quote is quite clear about what GHC does, so I suspect the "ambiguity" that bugs you is rather in the term "managed language" than in GHC's docs.
In the case of "compiled to native code", how can features like garbage collection be possible without something like a JVM?
How exactly do you think "something like a JVM" implements features like garbage collection? The JVM isn't magic, it's just a program like everything else. At some level you need to have native code in order for the CPU to execute it, so clearly features like garbage collection are possible in native code.
For where you currently are, it's probably best to think of (GHC) Haskell as "managed," but that the platform GHC compiles to is not targeted by anything else. There is, of course, more to it than that, but that's a sufficient explanation in lieu of more Haskell experience.

Garbage collection with glib?

I would like to interface a garbage-collected language (specifically, it's using the venerable Boehm libgc) to the glib family of APIs.
glib and gobject use reference counting internally to manage object lifetime. The normal way to wrap these is to use a garbage collected peer object which holds a reference to the glib object, and which drops the reference when the peer gets finalised; this means that the glib object is kept alive while the application is using the peer. I've done this before, and it works, but it's pretty painful and has its own problems (such as producing two peers of the same underlying object).
Given that I've got all the overhead of a garbage collector anyway, ideally what I'd like to do is to simply turn off glib's reference counting and use the garbage collector for everything. This would simplify the interface no end and hopefully improve performance.
On the face of things this would seem fairly simple --- hook up a garbage collector finaliser to the glib object finaliser, and override the ref and unref functions to be no-ops --- but further investigation shows there's more to it than that: glib is very fond of keeping its own allocator pools, for example, and of course if I let it do that, the garbage collector will assume that everything in the pool is live and it'll leak.
Is persuading glib to use libgc actually feasible? If so, what other gotchas am I likely to face? What sort of glib performance impact would forcing all allocations to go through libgc produce (as opposed to using the optimised allocators currently in glib)?
(The glib docs do say that it's supposed to interface cleanly to a garbage collector...)
http://mail.gnome.org/archives/gtk-devel-list/2001-February/msg00133.html is old but still relevant.
Learning how language bindings work (proxy objects, toggle references) would probably be helpful in thinking this through.
Update: oh, from hearing Boehm GC I was thinking you were trying to replace g_malloc etc. with GC, as in that old post.
If you're doing a language binding (not GC'ing C/C++) then yes, that's very achievable. A good, fairly manageable example to read over would be the gjs (SpiderMonkey JavaScript) codebase.
The basic idea is that you're going to have a proxy object that "holds" a GObject and often has the only reference to the GObject. But, the one complexity is toggle references: http://mail.gnome.org/archives/gtk-devel-list/2005-April/msg00095.html
You have to store the proxy object on the GObject so you can get it back (say someone does widget.get_parent(), then you need to return the same object that was previously set as the parent, by retrieving it from the C GObject). You also have to be able to go from the proxy object to the C object obviously.
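
The two-way link between proxy and C object can be sketched in plain C as follows. NativeObject and Proxy are hypothetical stand-ins, not glib types, and the sketch deliberately leaves out toggle references: it only shows the back-pointer that lets a lookup return the same proxy every time, and the finaliser dropping the proxy's reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct Proxy;

/* Stand-in for a reference-counted C object such as a GObject. */
typedef struct NativeObject {
    int refcount;
    struct Proxy *proxy;   /* back-pointer so the same proxy can be found again */
} NativeObject;

/* Stand-in for the object visible on the garbage-collected side. */
typedef struct Proxy {
    NativeObject *native;
} Proxy;

static NativeObject *native_new(void) {
    NativeObject *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->refcount = 1;       /* the creator's reference */
    return n;
}

static void native_unref(NativeObject *n) {
    assert(n != NULL && n->refcount > 0);
    if (--n->refcount == 0) free(n);
}

/* Return the unique proxy for n, creating it on first use.
   The proxy holds one reference to the native object. */
static Proxy *proxy_for(NativeObject *n) {
    if (n->proxy != NULL) return n->proxy;
    Proxy *p = malloc(sizeof *p);
    assert(p != NULL);
    p->native = n;
    n->refcount++;
    n->proxy = p;
    return p;
}

/* What a GC finaliser for the proxy would do: unlink, then drop the reference. */
static void proxy_finalize(Proxy *p) {
    NativeObject *n = p->native;
    n->proxy = NULL;
    free(p);
    native_unref(n);
}
```

With this shape, something like widget.get_parent() can call proxy_for on the C object it gets back and hand the language the proxy it already knows.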
No.
Since asking this I have discovered that libgc does not search memory owned by third-party libraries for references. This means that if glib has, in its own workspace, the only reference to an object allocated via libgc, libgc will collect it and then your program will crash.
libgc is only safe to use on objects owned by the main program.
For future visitors, you can refer to this article (not mine): http://d.hatena.ne.jp/bellbind/20090630/1246362401.
It's written in Japanese but the code is readable.
The compilation options mentioned in https://mail.gnome.org/archives/gtk-devel-list/2001-February/msg00133.html may also work; I haven't tested them myself.
And another relevant issue with G_SLICE, in case you encounter it: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2011-January/004289.html.

difference between kernel objects and C# objects?

In C# or C++ we have objects, instances of classes that live in memory. The kernel also has objects, like interrupt objects. I wondered whether these kernel objects can be thought of in the same way as the objects we C# or C++ programmers use?
The short answer is 'yes'. Objects are a state of mind. You can organize your work in objects in assembly language with a few macros, or in PL/I, or C, or C++. Some people might insist that it isn't an object without some sort of binding of dispatch to objects. Well, kernel/C object models use function pointers to accomplish, somewhat more manually, what languages like C++, C#, or Java do.
What, after all, is an 'object'?
Answer 1: any lump of data that groups some related items. Any C struct. Some people would bridle and insist on ...
Answer 2: the combination of data and functions, such that code 'calls' the object, and the results depend on conditions set up by the creator of the object. So, in C++ or C# or Java, there is inheritance. Code calls SomeObject.someFunction(), and what happens depends on the inheritance graph, which is controlled by the object author, not by the caller.
In kernels, and in the Pleistocene era when some of us learned to program, we accomplish(ed) the same thing, more or less, with simpler languages, using function pointers. That is to say, a slot in a structure that stores a reference to a function. The caller calls someobject.throwAnEgg, and what actually happens depends on what function pointer is sitting in 'throwAnEgg'.
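
A minimal sketch of that pattern in C, with made-up names (Thrower, gentle_throw, mighty_throw, call_throw) built around the throwAnEgg slot mentioned above:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Thrower Thrower;

struct Thrower {
    const char *name;
    int eggs_thrown;
    int (*throwAnEgg)(Thrower *self);   /* slot holding the "method" */
};

/* Two interchangeable implementations; each returns the distance thrown
   in arbitrary units and counts the throw on the object it was given. */
static int gentle_throw(Thrower *self) { self->eggs_thrown++; return 1; }
static int mighty_throw(Thrower *self) { self->eggs_thrown++; return 10; }

/* The caller does not know, or care, which implementation runs. */
static int call_throw(Thrower *t) {
    assert(t != NULL && t->throwAnEgg != NULL);
    return t->throwAnEgg(t);
}
```

Whoever fills in the throwAnEgg slot when the struct is created decides the behaviour, which is the same division of labour that a virtual method gives you in C++, C# or Java.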
I think this should be tagged as subjective as the answers are going to vary and reflect the individual's personal view of things.
My take is this...
When you are talking about low-level stuff, it is sometimes easier to bring in an OOP perspective, to make it easier to communicate the concepts of what happens at the kernel level
...but...
I'd rather talk in terms of low-level nuts and bolts, because the complexity of the nuts and bolts can be hammered out directly. Talking and thinking in terms of objects is contrived: it makes a complex thing sound simple and sets you up for false thinking in terms of code economy.
For example, from a kernel viewpoint, a TSS (Task State Segment) is a structure for holding the registers at the point before a task switch takes place (this assumes the processor has been switched to 32-bit mode with paging enabled, and so on). If you describe it in OOP terms, i.e. as a task selector object, that would not sound right, because you're using a high-level notion for what is really a nuts-and-bolts structure. Take a look at the Intel 80386 programmer's manual, which references the TSS in Chapter 13 - Protected-Mode Multitasking, Section 13.1 (document 24143004.pdf).
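
To show how "nuts and bolts" the TSS is, here is a sketch of the 32-bit layout as a plain C struct. The field names and the helper tss_set_kernel_stack are my own, and the layout follows my reading of the Intel documentation (104 bytes, with 16-bit selectors stored in 32-bit slots whose upper halves are reserved), so check it against the manual before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit Task State Segment as plain data: no methods, no dispatch. */
typedef struct Tss32 {
    uint32_t prev_task_link;        /* selector of the previous task */
    uint32_t esp0, ss0;             /* stack for privilege level 0 */
    uint32_t esp1, ss1;             /* stack for privilege level 1 */
    uint32_t esp2, ss2;             /* stack for privilege level 2 */
    uint32_t cr3;                   /* page directory base */
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t debug_trap;            /* bit 0 is the T (debug trap) flag */
    uint16_t io_map_base;           /* offset of the I/O permission bitmap */
} Tss32;

/* Set the stack the CPU loads when it enters privilege level 0. */
static void tss_set_kernel_stack(Tss32 *tss, uint16_t ss0, uint32_t esp0) {
    assert(tss != NULL);
    tss->ss0 = ss0;
    tss->esp0 = esp0;
}
```

The processor reads and writes these fields by fixed offset, which is why the answer above prefers to call it a structure rather than an object.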
If you are talking high-level, from a high-level programming aspect, then it would be easier to define the OOP paradigm.
So, going back to your question: from a kernel aspect you can, if you wish, talk in simplistic and concrete OOP terms. Nonetheless, it would make you think in terms of putting extra effort into coding in order to follow the OOP paradigm, which could end up producing convoluted code.
If the book you are reading is about Linux you might be looking for kobject, which is simply an abstraction supporting the Linux driver model. But in general any piece of data that is somehow associated with some behaviour, such as a set of functions, macros, etc., might be called an object. This is much more relaxed than the more or less formal definition of an object in OO languages like C# or C++.

Resources