Why is the Compact algorithm called 'Compact'? - garbage-collection

As we know, the Compact algorithm also uses copying to move the live objects into a contiguous memory region in order to eliminate memory fragmentation. Why not consider the Compact algorithm a type of Copying algorithm? Could anyone tell me why the Compact algorithm is called 'Compact'?
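To make the contrast in the question concrete, here is a minimal sketch of the two approaches on a toy heap modelled as a Python list, where `None` marks a free slot. The names `compact_in_place` and `copy_to_new_space`, and the heap model itself, are assumptions made for illustration and are not taken from any particular collector; pointer fix-up is left out entirely.

```python
def compact_in_place(heap, live):
    """Slide live objects toward index 0 inside the same heap list.

    Returns the index of the first free slot after compaction.
    """
    free = 0
    for i in range(len(heap)):
        obj = heap[i]
        if obj is not None and obj in live:
            heap[free] = obj  # free <= i, so no unread slot is overwritten
            free += 1
    for i in range(free, len(heap)):
        heap[i] = None  # everything past the live prefix is now free
    return free


def copy_to_new_space(from_space, live):
    """Evacuate live objects into a separate, equally sized to-space."""
    to_space = [None] * len(from_space)
    free = 0
    for obj in from_space:
        if obj is not None and obj in live:
            to_space[free] = obj
            free += 1
    return to_space, free  # from_space is left untouched and can be reused
```

Both functions end with the live objects packed contiguously. In this toy model the difference is only where they end up: the first reuses the same region, while the second needs a second region of the same size.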

Related

How to reliably influence generated code at near machine level using GHC?

While this may sound like a theoretical question, suppose I decide to invest and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve the performance of a very narrow bottleneck, and this will require optimizing memory access close to raw machine capabilities.
Some assumptions:
It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread scheduling irregularities, occasional GC etc.)
It isn't a numeric problem - data layout and cache-friendly access patterns are most important (avoiding pointer chasing, reducing conditional jumps etc.)
Code may be tied to specific GHC release (but no forking)
Performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields etc.)
Data is statically bounded in arrays and allocations are rarely if ever needed
What mechanisms does GHC offer to perform this kind of optimization? By "reliably" I mean that if a source change causes the code to no longer perform, it is correctable in the source code without rewriting it in assembly.
Is it already possible using GHC-specific extensions and libraries?
Would custom FFI help avoid C calling convention overhead?
Could a special purpose compiler plugin do it through a restricted source DSL?
Could a source code generator from a "high-level" assembly (LLVM?) be a solution?
It sounds like you're looking for unboxed arrays. "Unboxed" in Haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation) by looking at GHC's Core output (Core is a small, Haskell-like intermediate language produced early in compilation). For example, you might see Int# in the Core output, which means an integer that has no heap representation (it's going to be in a register).
When optimizing haskell code we regularly look at core and expect to be able to manipulate or correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function such that it can be inlined). This isn't always fun, but will be fairly stable especially if you are pinning your compiler version.
Back to unboxed arrays: GHC exposes a lot of low-level primops in GHC.Prim, in particular it sounds like you want mutable unboxed arrays (MutableByteArray). The primitive package exposes these primops behind a slightly safer, friendlier API and is what you should use (and depend on if writing your own library).
There are many other libraries that implement unboxed arrays, such as vector, and which are built on MutableByteArray, but the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.
GHC also has a very powerful FFI, and you can look into writing portions of your program in C and interoperating with them; Haskell supports pinned arrays, among other structures, for this purpose.
If you need more control than those give you then haskell is likely the wrong language. It's impossible to tell from your description if this is the case for your problem (Your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).
One last note: you can't rely on GHC's native code generator to perform any of the low-level strength reduction optimizations that e.g. GCC performs (GHC's NCG will probably never ever know about bit-twiddling hacks, autovectorization, etc. etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.

When is it advantageous to use theano's scan function

I've been reading the Theano documentation on scan and find myself confused by two seemingly contradictory statements.
On http://deeplearning.net/software/theano/tutorial/loop.html#scan, one of the advantages of scan is listed as:
Slightly faster than using a for loop in Python with a compiled Theano function.
But, on http://deeplearning.net/software/theano/library/scan.html#lib-scan, in a section on optimizing the use of scan, it says:
Scan makes it possible to define simple and compact graphs that can do
the same work as much larger and more complicated graphs. However, it
comes with a significant overhead. As such, **when performance is the
objective, a good rule of thumb is to perform as much of the computation
as possible outside of Scan**. This may have the effect of increasing
memory usage but can also reduce the overhead introduces by using Scan.
My reading of 'performance', here, is as a synonym for speed. So, I'm left confused as to when/if scan will lead to shorter runtimes, once compilation has been completed.
If your expression intrinsically needs a for-loop, then you sometimes have two options:
Build the expression using a python for loop
Build the expression using scan
Option 1 only works if you know the length of your for-loop in advance. It can happen that the length of your for-loop depends on a symbolic variable that is not available at script-writing time. In that case you need to use scan. That said, you can often formulate the problem either way (note the absence of scan in TensorFlow).
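The difference between the two options can be sketched with a toy expression graph in plain Python. This does not use Theano; the tuple-based graph, the `"scan"` node, and the helper names `unrolled_power`, `scan_power` and `evaluate` are all hypothetical and serve only to show why an unrolled loop needs a concrete length while a loop node can take a symbolic one.

```python
def unrolled_power(x, n):
    """Option 1: a Python loop at graph-build time; n must be a concrete int."""
    expr = ("const", 1)
    for _ in range(n):
        expr = ("mul", expr, x)  # the graph grows by one node per iteration
    return expr


def scan_power(x, n):
    """Option 2: a single loop node; n may itself be a symbolic variable."""
    return ("scan", ("const", 1), x, n)


def evaluate(expr, env):
    """Evaluate a toy graph, looking up variables in env."""
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "var":
        return env[expr[1]]
    if op == "mul":
        return evaluate(expr[1], env) * evaluate(expr[2], env)
    if op == "scan":
        _, init, arg, n = expr
        acc, factor = evaluate(init, env), evaluate(arg, env)
        for _ in range(evaluate(n, env)):  # length is only known at run time
            acc = acc * factor
        return acc
    raise ValueError(f"unknown op: {op}")


x = ("var", "x")
fixed = unrolled_power(x, 3)            # length fixed while building the graph
looped = scan_power(x, ("var", "n"))    # length supplied when evaluating
```

Calling `unrolled_power(x, ("var", "n"))` fails with a `TypeError`, because `range` needs an actual integer, which mirrors the situation where the loop length is only available as a symbolic variable.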
As for time performance, there have been a number of results showing that it really depends on the problem which one is faster.

Does incremental GC preclude using the Cheney algorithm?

I've been using the Cheney two-space GC algorithm for many years, but I'm ready to graduate to a generational-style collector. I've read the 'beltway' paper, and I have the Jones book. I'm trying to understand the implications of doing a 'partial' GC - i.e., collecting only a portion of a generation/subspace. My plan is to implement the beltway collector.
If I understand it correctly, doing a partial collection precludes using the Cheney algorithm, because that algorithm assumes that you are copying everything that you visit. If I collect only a window within a subspace (an 'increment' in the terms used by the beltway paper), then I must visit some records (in other increments in the same belt) that I will not be copying.
More context: this is for a functional language, a Scheme dialect using an ML-style static type system. I'm currently using runtime tags (so I can easily tell pointer from non-pointer), but I intend to move toward a more tagless scheme using compile-time information. This is another motivator away from Cheney, since a recursive descent of some kind will be necessary while traversing the type graph of each pointer.
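For reference, the Cheney two-space copy that the question starts from can be sketched as follows. This is a toy model under stated assumptions: each space is a Python list of objects, each object is a list of fields, and the `Ref` class and `cheney_collect` function are hypothetical names introduced here. It shows the property the question relies on: every pointer met during the scan is assumed to point into from-space and is forwarded unconditionally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    addr: int  # index of the target object within its space


def cheney_collect(from_space, roots):
    """Copy everything reachable from roots into a fresh to-space."""
    to_space = []
    forwarding = {}  # from-space address -> to-space address

    def forward(ref):
        # Copy the target on first visit; later visits reuse the new address.
        if ref.addr not in forwarding:
            forwarding[ref.addr] = len(to_space)
            to_space.append(list(from_space[ref.addr]))
        return Ref(forwarding[ref.addr])

    new_roots = [forward(r) for r in roots]

    scan = 0  # objects before scan have had all their fields updated
    while scan < len(to_space):
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = forward(field)  # every pointer seen gets forwarded
        scan += 1
    return to_space, new_roots
```

The to-space list doubles as the work queue, so no recursion or separate mark stack is needed. The sketch has no notion of a pointer whose target should stay where it is, which is the case a partial collection would have to handle.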

nodejs buffers vs typed arrays

What is more efficient - nodejs buffers or typed arrays? What should I use for better performance?
I think that only those who know the internals of V8 and Node.js could answer this question.
A Node.js Buffer should be more efficient than a typed array. The reason is simply that a newly created Node.js Buffer does not need to be initialized to all zeros, whereas the typed array specification requires that newly created typed arrays have their values set to 0. Allocating the memory and then zeroing all of it takes more time.
In most applications picking either one won't matter. As always, the devil lies in the benchmarks :) However, I recommend that you pick one and stick with it. If you're often converting back and forth between the two, you'll take a performance hit.
Nice discussion here: https://github.com/joyent/node/issues/4884
There are a few things that I think are worth mentioning:
Buffer instances are Uint8Array instances but there are subtle incompatibilities with the TypedArray specification in ECMAScript 2015. For example, while ArrayBuffer#slice() creates a copy of the slice, the implementation of Buffer#slice() creates a view over the existing Buffer without copying, making Buffer#slice() far more efficient.
When using Buffer.allocUnsafe() and Buffer.allocUnsafeSlow() the memory isn't zeroed-out (as many have pointed out already). So make sure you completely overwrite the allocated memory or you can allow the old data to be leaked when the Buffer memory is read.
A raw ArrayBuffer is not readable right away; you'll need a typed array or a DataView over it. That means you might need to rewrite your code if you were to migrate back to Buffer. An adapter pattern could help here.
You can use for-of on Buffer. ES2015 typed arrays are iterable as well and provide entries(), values(), keys() and length, though older runtimes may lack this support.
Buffer is not supported in the frontend while TypedArray may well be. So if your code is shared between frontend and backend, you might consider sticking to one.
More info in the docs here.
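The copy-versus-view distinction from the first point above is not specific to Node.js. As a rough analogy only, in Python rather than JavaScript, slicing a `bytearray` copies the bytes (like ArrayBuffer#slice()), while slicing a `memoryview` shares the underlying memory (like Buffer#slice() as described above). The variable names are illustrative.

```python
data = bytearray(b"hello world")

copied = data[0:5]             # slicing a bytearray allocates a new copy
view = memoryview(data)[0:5]   # slicing a memoryview shares the same memory

data[0] = ord("J")             # mutate the original after slicing

copied_bytes = bytes(copied)   # the copy still holds the old contents
view_bytes = bytes(view)       # the view reflects the mutation
```

The view avoids the allocation and the copy, which is the efficiency argument made above, at the cost of aliasing: writes through either name are visible through the other.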
This is a tough one, but I think it will depend on what you are planning to do with them and how much data you are planning to work with.
Typed arrays themselves need Node buffers, but they are easier to play with and you can overcome the 1GB limit (kMaxLength = 0x3fffffff).
If you are doing common stuff such as iterating, setting, getting, slicing, etc., then typed arrays should be your best shot for performance, though not for memory (especially if you are dealing with float and 64-bit integer types).
In the end, probably only a good benchmark with what you want to do can shed real light on this doubt.

Boehm and tagged pointers

Tagged pointers are a common optimization when implementing dynamic languages: take advantage of alignment requirements that mean the low two or three bits of a pointer will always be zero, and use them to store type information.
Suppose you're using the Boehm garbage collector, which basically works by looking at active data for things that look like pointers. Tagged pointers don't look like pointers, in the sense that their low bits are nonzero.
Is this a showstopper, i.e. do you have to ditch tagged pointers if you're using Boehm? Or does it have a way around this problem?
AFAIK Boehm can handle this with the right options. It is capable, at a small price, of detecting interior pointers. It is also possible to write your own scanning code. Basically there are probably enough hooks to handle just about anything.
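The problem and the interior-pointer remark can be illustrated with a toy model. This is not Boehm's API: the block table, the two matching functions and the 3-bit tag width are assumptions made for the sketch, and addresses are plain integers.

```python
TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1  # 0b111, valid when objects are 8-byte aligned


def tag(addr, t):
    """Store a small type tag in the low bits of an aligned address."""
    assert addr & TAG_MASK == 0, "address must be aligned"
    assert 0 <= t <= TAG_MASK
    return addr | t


def untag(word):
    """Split a tagged word back into (address, tag)."""
    return word & ~TAG_MASK, word & TAG_MASK


def matches_base_only(word, blocks):
    """A scan that only accepts words equal to the start of a block."""
    return any(word == start for start, _size in blocks)


def matches_interior(word, blocks):
    """A scan that accepts any word pointing inside a block."""
    return any(start <= word < start + size for start, size in blocks)


blocks = [(0x1000, 16), (0x1010, 32)]  # hypothetical (start, size) pairs
word = tag(0x1010, 0b101)              # 0x1015: low bits carry the tag
```

Here `matches_base_only(word, blocks)` is false, because the tagged word no longer equals any block start, while `matches_interior(word, blocks)` is true, because the tag only moves the value a few bytes into the same block. Masking with `untag` before the check would also recover the base address.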
I have written my own collector; it is precise on the heap and conservative on the stack. It does not touch pointers made by C. For some applications it will be faster because it knows a lot about the objects allocated by my language and doesn't care about other stuff which is managed, say, using traditional C++ destructors.
However it isn't incremental or generational, and it doesn't handle threads as well (it's not smart enough to stop threads with signals). On the plus side, however, it doesn't require magic linkage techniques which Boehm does (to capture mallocs, etc). On the seriously minus side you can't put managed objects into unmanaged ones.

Resources