Rakudo Memory/Garbage collecting techniques

Rakudo Memory/Garbage collecting techniques - garbage-collection

I understand that this question verges into implementation specific domains, but at this point, Rakudo/MoarVM specific answers would help me too.
I am working on some NativeCall modules, and wondering how to debug memory leaks. Some memory is handled in the C library, and I have a good handle over there. I know that domain is my responsibility and there is nothing that MoarVM can do over there. What can I do in the MoarVM domain? what is the best way to check for dangling objects, circular references, etc.?
Is there a way at the end of a series of operations, where I think all of my Perl objects are out of scope to say "Run Garbage Collection and tell me about anything left"?
Is there some Rakudo/NQP/MoarVM specific code I can run to help me? This isn't to release in production, just for testing/diagnostics while I am developing.
Garbage Collection in MoarVM gives a tantalizing overview, but not enough information for me to do anything with it.

Firstly, while leaked memory on the C-side isn't your problem in this case, it's worth knowing that Rakudo installs a perl6-valgrind-m that runs the program under valgrind. I've used this a number of times to figure out segfaults and leaks when writing native library bindings.
For looking into objects managed by MoarVM, it's possible to get the VM to dump heap snapshots. They are taken after each GC run, and an extra GC run is forced and a final snapshot taken at the end of the program. To record snapshots, run with --profile=heap. The output file can then be fed to moar-ha, which can be installed using zef install App::MoarVM::HeapAnalyzer (it's implemented in Perl 6, which may be worth knowing should you wish to extend it in some way to help you solve you problems).
If you have any idea of what kind of objects might be leaking, then it can be useful to search for objects of that type with the find command. There is then a path command that shows how that object is being kept alive. It can also be useful to look at counts of objects between different heap snapshots, to see what is growing in use. Unfortunately there's not yet a snapshot diff feature.
One thing to note is that the snapshots include everything that runs atop of the VM. That means the Perl 6 compiler will be in memory, as well as a bunch of objects for things from the language built-ins. (The tool was developed to help track down managed leaks in the compiler and built-ins, so this is considered a feature. :-) Some kind of filtering may be feasible in the future, however.)
Finally, you mentioned circular references. These are not a problem in Perl 6, since GC is done through tracing, not reference counting.

Related

Copying garbage collection of generated code

Copying (generational) garbage collection offers the best performance of any form of automatic memory management, but requires pointers to relocated chunks of data be fixed up. This is enabled, in languages which support this memory management technique, by disallowing pointer arithmetic and making sure all pointers are to the beginning of identifiable objects.
If you're generating code at run time with a JIT compiler, things look a bit trickier because return addresses on the call stack will point to, not the beginning of code blocks, but random locations within them, so fixup is a problem.
How is this typically solved?

Quite often, you don't relocate code. This is both because it is indeed complicated to fix the stack and other addresses (think jumps across code fragments), and because you don't actually need garbage collection for such code (as it is only manipulated by code you write anyway, so you can do manual memory management). You also don't expect to create a whole lot of machine code (compared to application objects), so fragmentation etc. is not a concern.
If you insist on moving machine code and fixing up the stack, there is a way, I think: Similar to Mark-Compact, build a "break table" (I have no idea where this name comes from; "relocation table" might be clearer) that tells you the amount by which you should adjust pointers to moved objects. Now, walk the stack for return addresses (highly platform-specific, of course) and fix them if they refer to relocated code. Instead of looking for exact matches, search for the highest address lower than the return address you're currently replacing. You can check that this address indeed refers to some machine code that moved by looking at the object size (you have a pointer to the start of the object, after all). This approach isn't feasible for all objects, for various reasons.
There are other reasons to do something similar though. Some JIT compilers feature on-stack replacement, which means creating a new version (e.g. more optimized, or less optimized) of some machine code and replacing all occurrences of the old version with it. This is far more complicated than just fixing the return addresses though. You have to ensure the new version logically continue where the old one was left hanging. I am not familiar with how this is implemented, so I will not go into detail.

Garbage collection with glib?

I would like to interface an garbage collected language (specifically, it's using the venerable Boehm libgc) to the glib family of APIs.
glib and gobject use reference counting internally to manage object lifetime. The normal way to wrap these is to use a garbage collected peer object which holds a reference to the glib object, and which drops the reference when the peer gets finalised; this means that the glib object is kept alive while the application is using the peer. I've done this before, and it works, but it's pretty painful and has its own problems (such as producing two peers of the same underlying object).
Given that I've got all the overhead of a garbage collector anyway, ideally what I'd like to do is to simply turn off glib's reference counting and use the garbage collector for everything. This would simplify the interface no end and hopefully improve performance.
On the face of things this would seem fairly simple --- hook up a garbage collector finaliser to the glib object finaliser, and override the ref and unref functions to be noops --- but further investigation shows there's more to it than that: glib is very fond of keeping its own allocator pools, for example, and of course I let it do that the garbage collector assume that everything in the pool is live and it'll leak.
Is persuading glib to use libgc actually feasible? If so, what other gotchas am I likely to face? What sort of glib performance impact would forcing all allocations to go through libgc produce (as opposed to using the optimised allocators currently in glib)?
(The glib docs do say that it's supposed to interface cleanly to a garbage collector...)

http://mail.gnome.org/archives/gtk-devel-list/2001-February/msg00133.html is old
but still relevant.
Learning how language bindings work (proxy objects, toggle references) would probably be helpful in thinking this through.
Update: oh, from hearing Boehm GC I was thinking you were trying to replace g_malloc etc. with GC, as in that old post.
If you're doing a language binding (not GC'ing C/C++) then yes that's very achievable. A good pretty manageable example to read over would be the gjs (SpiderMonkey JavaScript) codebase.
The basic idea is that you're going to have a proxy object that "holds" a GObject and often has the only reference to the GObject. But, the one complexity is toggle references: http://mail.gnome.org/archives/gtk-devel-list/2005-April/msg00095.html
You have to store the proxy object on the GObject so you can get it back (say someone does widget.get_parent(), then you need to return the same object that was previously set as the parent, by retrieving it from the C GObject). You also have to be able to go from the proxy object to the C object obviously.

No.
Since asking this I have discovered that libgc does not search memory owned by third-party libraries for references. Which means that if glib has, in its own workspace, the only reference to an object allocated via libgc, libgc will collect it and then your program will crash.
libgc is only safe to use on objects owned by the main program.

For future visitors, you can refer to this article (not mine): http://d.hatena.ne.jp/bellbind/20090630/1246362401.
It's written in Japanese but the code is readable.
The compilation options mentioned in https://mail.gnome.org/archives/gtk-devel-list/2001-February/msg00133.html may also work, I haven't tested it myself.
And another relavant issue on G_SLICE if you encountered it: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2011-January/004289.html.

Parsing .ply files without memory leaks

I downloaded the .ply files in the Stanford 3D scanning repository, and am using Stanford's code from that page (ply.h, plyfile.c) to parse them. However, looking at this code, I see that it's rife with mallocs that are never freed. I could close my eyes and look the other way, but it makes my teeth itch.
I can think of two workarounds:
One is to use Hans Boehm's garbage collector, or something similar, which redefines "malloc" so that it does so within a garbage collector. I've never used this library, but perhaps there's a way to have it operate just on the mallocs in the Stanford code and not anywhere else.
The other workaround is to use a different parser, preferably a C++ one with nicely RAII-ified memory management. I see a few alternative parsers and converters listed at the above link, but rather than kill a day or two trying them all, I was hoping to get a recommendation here.
Can anybody recommend a way to parse .ply files without memory leaks, either by containing the memory leaks in the Stanford parser, or using a different parser, or by some third method I haven't thought of?

Try also RPly.

This library looks promising; until someone else answers this question, I'll mark this as the answer: http://assimp.sourceforge.net/

Another library is the one used by MeshLab
http://vcg.sourceforge.net/index.php/Tutorial

How to detect and debug multi-threading problems?

This is a follow up to this question, where I didn't get any input on this point. Here is the brief question:
Is it possible to detect and debug problems coming from multi-threaded code?
Often we have to tell our customers: "We can't reproduce the problem here, so we can't fix it. Please tell us the steps to reproduce the problem, then we'll fix it." It's a somehow nasty answer if I know that it is a multi-threading problem, but mostly I don't. How do I get to know that a problem is a multi-threading issue and how to debug it?
I'd like to know if there are any special logging frameworks, or debugging techniques, or code inspectors, or anything else to help solving such issues. General approaches are welcome. If any answer should be language related then keep it to .NET and Java.

Threading/concurrency problems are notoriously difficult to replicate - which is one of the reasons why you should design to avoid or at least minimize the probabilities. This is the reason immutable objects are so valuable. Try to isolate mutable objects to a single thread, and then carefully control the exchange of mutable objects between threads. Attempt to program with a design of object hand-over, rather than "shared" objects. For the latter, use fully synchronized control objects (which are easier to reason about), and avoid having a synchronized object utilize other objects which must also be synchronized - that is, try to keep them self contained. Your best defense is a good design.
Deadlocks are the easiest to debug, if you can get a stack trace when deadlocked. Given the trace, most of which do deadlock detection, it's easy to pinpoint the reason and then reason about the code as to why and how to fix it. With deadlocks, it always going to be a problem acquiring the same locks in different orders.
Live locks are harder - being able to observe the system while in the error state is your best bet there.
Race conditions tend to be extremely difficult to replicate, and are even harder to identify from manual code review. With these, the path I usually take, besides extensive testing to replicate, is to reason about the possibilities, and try to log information to prove or disprove theories. If you have direct evidence of state corruption you may be able to reason about the possible causes based on the corruption.
The more complex the system, the harder it is to find concurrency errors, and to reason about it's behavior. Make use of tools like JVisualVM and remote connect profilers - they can be a life saver if you can connect to a system in an error state and inspect the threads and objects.
Also, beware the differences in possible behavior which are dependent on the number of CPU cores, pipelines, bus bandwidth, etc. Changes in hardware can affect your ability to replicate the problem. Some problems will only show on single-core CPU's others only on multi-cores.
One last thing, try to use concurrency objects distributed with the system libraries - e.g in Java java.util.concurrent is your friend. Writing your own concurrency control objects is hard and fraught with danger; leave it to the experts, if you have a choice.

I thought that the answer you got to your other question was pretty good. But I'll emphasis these points.
Only modify shared state in a critical section (Mutual Exclusion)
Acquire locks in a set order and release them in the opposite order.
Use pre-built abstractions whenever possible (Like the stuff in java.util.concurrent)
Also, some analysis tools can detect some potential issues. For example, FindBugs can find some threading issues in Java programs. Such tools can't find all problems (they aren't silver bullets) but they can help.
As vanslly points out in a comment to this answer, studying well placed logging output can also very helpful, but beware of Heisenbugs.

For Java there is a verification tool called javapathfinder which I find it useful to debug and verify multi-threading application against potential race condition and death-lock bugs from the code.
It works finely with both Eclipse and Netbean IDE.
[2019] the github repository
https://github.com/javapathfinder

Assuming I have reports of troubles that are hard to reproduce I always find these by reading code, preferably pair-code-reading, so you can discuss threading semantics/locking needs. When we do this based on a reported problem, I find we always nail one or more problems fairly quickly. I think it's also a fairly cheap technique to solve hard problems.
Sorry for not being able to tell you to press ctrl+shift+f13, but I don't think there's anything like that available. But just thinking about what the reported issue actually is usually gives a fairly strong sense of direction in the code, so you don't have to start at main().

In addition to the other good answers you already got: Always test on a machine with at least as many processors / processor cores as the customer uses, or as there are active threads in your program. Otherwise some multithreading bugs may be hard to impossible to reproduce.

Apart from crash dumps, a technique is extensive run-time logging: where each thread logs what it's doing.
The first question when an error is reported, then, might be, "Where's the log file?"
Sometimes you can see the problem in the log file: "This thread is detecting an illegal/unexpected state here ... and look, this other thread was doing that, just before and/or just afterwards this."
If the log file doesn't say what's happening, then apologise to the customer, add sufficiently-many extra logging statements to the code, give the new code to the customer, and say that you'll fix it after it happens one more time.

Sometimes, multithreaded solutions cannot be avoided. If there is a bug,it needs to be investigated in real time, which is nearly impossible with most tools like Visual Studio. The only practical solution is to write traces, although the tracing itself should:
not add any delay
not use any locking
be multithreading safe
trace what happened in the correct sequence.
This sounds like an impossible task, but it can be easily achieved by writing the trace into memory. In C#, it would look something like this:
public const int MaxMessages = 0x100;
string[] messages = new string[MaxMessages];
int messagesIndex = -1;
public void Trace(string message) {
int thisIndex = Interlocked.Increment(ref messagesIndex);
messages[thisIndex] = message;
}
The method Trace() is multithreading safe, non blocking and can be called from any thread. On my PC, it takes about 2 microseconds to execute, which should be fast enough.
Add Trace() instructions wherever you think something might go wrong, let the program run, wait until the error happens, stop the trace and then investigate the trace for any errors.
A more detailed description for this approach which also collects thread and timing information, recycles the buffer and outputs the trace nicely you can find at:
CodeProject: Debugging multithreaded code in real time 1

A little chart with some debugging techniques to take in mind in debugging multithreaded code.
The chart is growing, please leave comments and tips to be added.
(update file at this link)

Visual Studio allows you to inspect the call stack of each thread, and you can switch between them. It is by no means enough to track all kinds of threading issues, but it is a start. A lot of improvements for multi-threaded debugging is planned for the upcoming VS2010.
I have used WinDbg + SoS for threading issues in .NET code. You can inspect locks (sync blokcs), thread call stacks etc.

Tess Ferrandez's blog has good examples of using WinDbg to debug deadlocks in .NET.

assert() is your friend for detecting race-conditions. Whenever you enter a critical section, assert that the invariant associated with it is true (that's what CS's are for). Though, unfortunately, the check might be expensive and thus not suitable for use in production environment.

I implemented the tool vmlens to detect race conditions in java programs during runtime. It implements an algorithm called eraser.

Develop code the way that Princess recommended for your other question (Immutable objects, and Erlang-style message passing). It will be easier to detect multi-threading problems, because the interactions between threads will be well defined.

I faced a thread issue which was giving SAME wrong result and was not behaving un-predictably since each time other conditions(memory, scheduler, processing load) were more or less same.
From my experience, I can say that HARDEST PART is to recognize that it is a thread issue, and BEST SOLUTION is to review the multi-threaded code carefully. Just by looking carefully at the thread code you should try to figure out what can go wrong. Other ways (thread dump, profiler etc) will come second to it.

Narrow down on the functions that are being called, and rule out what could and could not be to blame. When you find sections of code that you suspect may be causing the issue, add lots of detailed logging / tracing to it. Once the issue occurs again, inspect the logs to see how the code executed differently than it does in "baseline" situations.
If you are using Visual Studio, you can also set breakpoints and use the Parallel Stacks window. Parallel Stacks is a huge help when debugging concurrent code, and will give you the ability to switch between threads to debug them independently. More info-
https://learn.microsoft.com/en-us/visualstudio/debugger/using-the-parallel-stacks-window?view=vs-2019
https://learn.microsoft.com/en-us/visualstudio/debugger/walkthrough-debugging-a-parallel-application?view=vs-2019

I'm using GNU and use simple script
$ more gdb_tracer
b func.cpp:2871
r
#c
while (1)
next
#step
end

The best thing I can think of is to stay away from multi-threaded code whenever possible. It seems there are very few programmers who can write bug free multi threaded applications and I would argue that there are no coders beeing able to write bug free large multi threaded applications.

How to implement closures without gc?

I'm designing a language. First, I want to decide what code to generate. The language will have lexical closures and prototype based inheritance similar to javascript. But I'm not a fan of gc and try to avoid as much as possible. So the question: Is there an elegant way to implement closures without resorting to allocate the stack frame on the heap and leave it to garbage collector?
My first thoughts:
Use reference counting and garbage collect the cycles (I don't really like this)
Use spaghetti stack (looks very inefficient)
Limit forming of closures to some contexts such a way that, I can get away with a return address stack and a locals' stack.
I won't use a high level language or follow any call conventions, so I can smash the stack as much as I like.
(Edit: I know reference counting is a form of garbage collection but I am using gc in its more common meaning)

This would be a better question if you can explain what you're trying to avoid by not using GC. As I'm sure you're aware, most languages that provide lexical closures allocate them on the heap and allow them to retain references to variable bindings in the activation record that created them.
The only alternative to that approach that I'm aware of is what gcc uses for nested functions: create a trampoline for the function and allocate it on the stack. But as the gcc manual says:
If you try to call the nested function through its address after the containing function has exited, all hell will break loose. If you try to call it after a containing scope level has exited, and if it refers to some of the variables that are no longer in scope, you may be lucky, but it's not wise to take the risk. If, however, the nested function does not refer to anything that has gone out of scope, you should be safe.
Short version is, you have three main choices:
allocate closures on the stack, and don't allow their use after their containing function exits.
allocate closures on the heap, and use garbage collection of some kind.
do original research, maybe starting from the region stuff that ML, Cyclone, etc. have.

This thread might help, although some of the answers here reflect answers there already.
One poster makes a good point:
It seems that you want garbage collection for closures
"in the absence of true garbage collection". Note that
closures can be used to implement cons cells. So your question
seem to be about garbage collection "in the absence of true
garbage collection" -- there is rich related literature.
Restricting problem to closures does not really change it.
So the answer is: no, there is no elegant way to have closures and no real GC.
The best you can do is some hacking to restrict your closures to a particular type of closure. All this is needless if you have a proper GC.
So, my question reflects some of the other ones here - why do you not want to implement GC? A simple mark+sweep or stop+copy takes about 2-300 lines of (Scheme) code, and isn't really that bad in terms of programming effort. In terms of making your programs slower:
You can implement a more complex GC which has better performance.
Just think of all the memory leaks programs in your language won't suffer from.
Coding with a GC available is a blessing. (Think C#, Java, Python, Perl, etc... vs. C++ or C).

I understand that I'm very late, but I stumbled upon this question by accident.
I believe that full support of closures indeed requires GC, but in some special cases stack allocation is safe. Determining these special cases requires some escape analysis. I suggest that you take a look at the BitC language papers, such as Closure Implementation in BitC. (Although I doubt whether the papers reflect the current plans.) The designers of BitC had the same problem you do. They decided to implement a special non-collecting mode for the compiler, which denies all closures that might escape. If turned on, it will restrict the language significantly. However, the feature is not implemented yet.
I'd advise you to use a collector - it's the most elegant way. You should also consider that a well-built garbage collector allocates memory faster than malloc does. The BitC folks really do value performance and they still think that GC is fine even for the most parts of their operating system, Coyotos. You can migitate the downsides by simple means:
create only a minimal amount of garbage
let the programmer control the collector
optimize stack/heap use by escape analysis
use an incremental or concurrent collector
if somehow possible, divide the heap like Erlang does
Many fear garbage collectors because of their experiences with Java. Java has a fantastic collector, but applications written in Java have performance problems because of the sheer amount of garbage generated. In addition, a bloated runtime and fancy JIT compilation is not really a good idea for desktop applications because of the longer startup and response times.

The C++ 0x spec defines lambdas without garbage collection. In short, the spec allows non-deterministic behavior in cases where the lambda closure contains references which are no longer valid. For example (pseudo-syntax):
(int)=>int create_lambda(int a)
{
return { (int x) => x + a }
}
create_lambda(5)(4) // undefined result
The lambda in this example refers to a variable (a) which is allocated on the stack. However, that stack frame has been popped and is not necessarily available once the function returns. In this case, it would probably work and return 9 as a result (assuming sane compiler semantics), but there is no way to guarantee it.
If you are avoiding garbage collection, then I'm assuming that you also allow explicit heap vs. stack allocation and (probably) pointers. If that is the case, then you can do like C++ and just assume that developers using your language will be smart enough to spot the problem cases with lambdas and copy to the heap explicitly (just like you would if you were returning a value synthesized within a function).

Use reference counting and garbage collect the cycles (I don't really like this)
It's possible to design your language so there are no cycles: if you can only make new objects and not mutate old ones, and if making an object can't make a cycle, then cycles never appear. Erlang works essentially this way, though in practice it does use GC.

If you have the machinery for a precise copying GC, you could allocate on the stack initially and copy to the heap and update pointers if you discover at exit that a pointer to this stack frame has escaped. That way you only pay if you actually do capture a closure that includes this stack frame. Whether this helps or hurts depends on how often you use closures and how much they capture.
You might also look into C++0x's approach (N1968), though as one might expect from C++ it consists of counting on the programmer to specify what gets copied and what gets referenced, and if you get it wrong you just get invalid accesses.

Or just don't do GC at all. There can be situations where it's better to just forget the memory leak and let the process clean up after it when it's done.
Depending on your qualms about GC, you might be afraid of the periodic GC sweeps. In this case you could do a selective GC when an item falls out of scope or the pointer changes. I'm not sure how expensive this would be though.
#Allen
What good is a closure if you can't use them when the containing function exits? From what I understand that's the whole point of closures.

You could work with the assumption that all closures will be called eventually and exactly one time. Now, when the closure is called you can do the cleanup at the closure return.
How do you plan on dealing with returning objects? They have to be cleaned up at some point, which is the exact same problem with closures.

So the question: Is there an elegant way to implement closures without resorting to allocate the stack frame on the heap and leave it to garbage collector?
GC is the only solution for the general case.

Better late than never?
You might find this interesting: Differential Execution.
It's a little-known control stucture, and its primary use is in programming user interfaces, including ones that can change dynamically while in use. It is a significant alternative to the Model-View-Controller paradigm.
I mention it because one might think that such code would rely heavily on closures and garbage-collection, but a side effect of the control structure is that it eliminates both of those, at least in the UI code.

Create multiple stacks?

I've read that the last versions of ML use GC only sparingly

I guess if the process is very short, which means it cannot use much memory, then GC is unnecessary. The situation is analogous to worrying about stack overflow. Don't nest too deeply, and you cannot overflow; don't run too long, and you cannot need the GC. Cleaning up becomes a matter of simply reclaiming the large region that you pre-allocated. Even a longer process can be divided into smaller processes that have their own heaps pre-allocated. This would work well with event handlers, for example. It does not work well, if you are writing compiler; in that case, a GC is surely not much of a handicap.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string