How is the card table structure actually used by the garbage collector across multiple threads? - garbage-collection

I actually have two questions. 1) I have studied various articles and answers here about garbage collection, but I still can't answer this question: how is the "card table" structure used by the garbage collector across multiple threads? I think I'm missing something. 2) Is it right that the "card table" structure is used only in concurrent garbage collectors?

A Card Table is a primitive implementation of a Remembered Set based on a bitmap. One bit in a Card Table corresponds to one or more words of a heap generation (or region).
The purpose of a remembered set is to track references from the old generation to the young generation, so that those references can be found and updated during a young-only collection. So a remembered set, or a Card Table as one particular implementation of it, is inherent to generational/regional collectors, whether concurrent or not.
A Card Table is not specific to concurrent collectors, and it has nothing to do with multithreading. Even the Serial GC uses a Card Table. I found traces of gc/gen/cardtable.c in the JDK 1.2 sources dated 1999, when there were no concurrent garbage collectors at all.
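To make the mechanism concrete, here is a minimal C sketch of what a card-table write barrier might look like. The card size, heap layout, and names are illustrative only (and it uses a byte per card for simplicity rather than a single bit); nothing here is taken from any particular VM:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative parameters: one card covers 512 bytes of heap. */
    #define CARD_SHIFT 9                    /* log2(512) */
    #define HEAP_SIZE  (1u << 20)           /* 1 MiB toy heap */
    #define NUM_CARDS  (HEAP_SIZE >> CARD_SHIFT)

    static uint8_t heap[HEAP_SIZE];
    static uint8_t card_table[NUM_CARDS];   /* 0 = clean, 1 = dirty */

    /* Write barrier: store a reference and mark the card holding the
     * updated field, so a young-only collection can rescan it later.
     * Assumes 'field' points into the toy heap above. */
    static void write_ref(void **field, void *value)
    {
        *field = value;
        size_t offset = (size_t)((uint8_t *)field - heap);
        card_table[offset >> CARD_SHIFT] = 1;
    }

Mutator threads all execute this barrier on reference stores; the collector only reads the table when it runs, which is why the structure itself is not tied to multithreading.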

Related

Data security during dynamic memory allocation

A few minutes ago, my friends and I solved some algorithmic problems on leetcode.com and shared our solutions. We used high-level languages, and when new memory is allocated by Array.new(128) in Ruby or int[] map = new int[128]; in Java, it is already filled with zero-like values (nil or 0 respectively).
So a high-level program is guaranteed to get cleared memory.
And here is my question: in a C or assembler program, could it happen that a newly allocated chunk of memory still contains data from another process, unchanged?
One process would thus get data from another process, and maybe even data from another user who worked on the system some time ago. Could this be a way for information to leak?
Does the OS clear memory before sharing it among processes? And if so, isn't it very expensive to run that many iterations?
Thank you.
UPD: http://www.cplusplus.com/articles/ETqpX9L8/ - it looks like valuable data in "lower-level" languages needs to be cleared manually to prevent leaks to other processes.
Yes, in lower-level languages where memory is not initialized, it could contain valuable stuff from other processes. There have been encryption key leakage attacks done this way by continually allocating memory and scanning it for what looks like useful information.
Security-sensitive programs that store passwords, crypto keys, etc. should always clear that memory as soon as possible after use. This is not only to prevent leaks through re-allocated memory; there are also other attack vectors, like RAM dumps, that could be used to extract secrets. Always zero or randomize your memory when you are done with it.
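As a rough C illustration of "clear the memory as soon as possible": note that a plain memset can be removed by the optimizer if the buffer is never read again. Routing the call through a volatile function pointer is one common workaround; platform-specific helpers such as explicit_bzero (glibc/BSD) or SecureZeroMemory (Windows) exist for the same purpose. The function name below is invented for the example:

    #include <string.h>

    /* Calling memset through a volatile function pointer prevents the
     * compiler from proving the call is dead and removing it. */
    static void *(*const volatile secure_memset)(void *, int, size_t) = memset;

    void wipe_secret(void *buf, size_t len)
    {
        secure_memset(buf, 0, len);
    }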

what is sequential store buffer structure in gc specifically?

I have read the garbage collection book; it mentions a data structure called a sequential store buffer. Could anyone explain how it works, or the principle behind it, or where I can find a paper about it?
For generational collectors, different regions of the heap get collected at different times (minor for young gen., major for old gen.). To ensure consistency of collection a remembered set is typically used that records links from objects in the old generation to the young generation.
There are different ways of recording the remembered set, as described in the GC book you mention. A common way is the use of a card table, which is how the G1 collector does it.
An alternative is the sequential store buffer. This is an area of memory that is treated roughly like a stack, i.e. there is a pointer to where the next piece of data can be stored. Once the data is saved the pointer is bumped by the size of the data. This is very efficient (and is also the way space is allocated in the young generation). For a GC algorithm that uses a write-barrier (most) this is a good way of reducing the load created by the write-barrier. It is also very efficient on pipelined architectures with branch prediction.
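Here is a minimal C sketch of that bump-pointer idea; the names and buffer size are invented for the example, and real VMs size and drain the buffer differently:

    #include <stddef.h>

    #define SSB_CAPACITY 4096

    /* The write barrier appends the address of every updated field by
     * bumping a pointer; the collector later drains the buffer to
     * rebuild the remembered set. */
    static void  *ssb[SSB_CAPACITY];
    static void **ssb_top = ssb;              /* next free slot */

    static void ssb_record(void *updated_field)
    {
        if (ssb_top == ssb + SSB_CAPACITY) {
            /* Buffer full: a real VM would hand it to the GC
             * (or process it) and then reset the pointer. */
            ssb_top = ssb;
        }
        *ssb_top++ = updated_field;           /* bump-pointer store */
    }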

card table and write barriers in .net GC

Can anybody explain the concept of the card table and write barriers in the garbage collection process in .NET?
I really can't find a good explanation of these terms, i.e. what they are, how they are useful, and how they participate in GC.
Any help would be really appreciated.
The card table is an array of bits, one bit for each chunk of 256 bytes of memory in the old generation. The bits are normally zero, but when a field of an object in the old generation is written to, the bit corresponding to the object's memory address is set to one. That is called executing the write barrier.
The garbage collector in .NET is generational and has a phase which only traces and collects objects in the young generation. So it goes through the object graph starting with the roots but does not recurse into objects in the old generation. In that way, it only traces a small fraction of the whole object graph.
To find the roots to start tracing from, it scans the program's local and global variables for young-generation objects. But it would miss objects only referenced from old-generation objects. Therefore it also scans the fields of objects in the old generation whose card table bit is set.
Then after the young generation collection is complete it resets all card table bits to zero.
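To illustrate the collector's side of this, here is a C sketch of how a young-generation collection might consume the card table. The sizes and the tracing helper are made up for the example; only the 256-byte chunk size mirrors the description above:

    #include <stddef.h>
    #include <stdint.h>

    #define CARD_SIZE 256u                   /* matches the 256-byte chunks above */
    #define NUM_CARDS 4096u                  /* illustrative old-generation size */

    extern uint8_t old_gen[NUM_CARDS * CARD_SIZE];
    extern uint8_t card_table[NUM_CARDS];    /* set to 1 by the write barrier */

    /* Hypothetical helper standing in for the real tracing code. */
    void trace_old_to_young_refs(uint8_t *start, size_t len);

    /* After scanning roots, rescan only the dirty cards, then clear them. */
    void scan_dirty_cards(void)
    {
        for (size_t i = 0; i < NUM_CARDS; i++) {
            if (card_table[i]) {
                trace_old_to_young_refs(old_gen + i * CARD_SIZE, CARD_SIZE);
                card_table[i] = 0;
            }
        }
    }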

How to optimize an algorithm for a given multi-core architecture

I would like to know what techniques I should look at for optimizing a given algorithm for a given architecture. How do I improve performance through better caching? How do I reduce cache-coherency overhead, and what access patterns should I avoid in my algorithm/program so that cache coherency doesn't hurt my performance?
I understand a few standard techniques for reusing recently cached data in L1, but how would I effectively use data in a shared cache (say L2) on a multi-core, thereby avoiding a main-memory access, which is even costlier?
Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture, and what data structures I could use, in what scenarios, for which architectures (with different levels of private and shared cache) to improve performance. Thanks.
What techniques should I look at for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data access patterns should I try to exploit or avoid for a better mapping to my given architecture?
Exploit
When the cache is full and an access misses, the cache must evict something to make room for the new data/code; what is evicted is usually chosen by an approximation of least-recently-used (LRU). If possible, your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of objects, each 64 bytes in size, and they are accessed randomly, then aligning them on a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If the objects aren't aligned, every object touched brings in two cache lines even though only 64 bytes are needed, so 50% of the data transferred isn't used (assuming 64-byte cache lines). A sketch of this follows below, after this list.
As @PaulA.Clayton pointed out in the comments, prefetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to help the hardware prefetcher and to increase the utilization of the data that is brought into the cache, pay careful attention to how matrices and other large structures are stored and accessed... see the Wikipedia article on row-major order.
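Here is a small C sketch of the packing and alignment points above. The field layout and sizes are invented for the example, and a 64-byte cache line is assumed:

    #include <stdint.h>
    #include <stdlib.h>

    /* A record sized to exactly one 64-byte cache line. */
    struct record {
        uint32_t id;              /* 32 bits where 64 bits aren't needed */
        uint32_t flags;
        double   values[7];       /* 4 + 4 + 56 = 64 bytes total */
    };

    int main(void)
    {
        /* aligned_alloc (C11) returns storage aligned to the line size,
         * so each randomly accessed record touches exactly one line. */
        struct record *arr = aligned_alloc(64, 1024 * sizeof *arr);
        if (!arr)
            return 1;
        /* ... use arr ... */
        free(arr);
        return 0;
    }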
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within that cache line, and at least one of them is a writer, you have false sharing... there will be unnecessary overhead and a latency hit from the cache-coherency protocol (see the sketch after this list).
Try not to use new data until you are done with the older data
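Here is a small C sketch of a common fix for false sharing; the counter layout, thread count, and 64-byte line size are assumptions for the example:

    #include <stdint.h>

    #define CACHE_LINE  64
    #define NUM_THREADS 8

    /* Each per-thread counter is aligned (and therefore padded) to its
     * own cache line, so one thread's writes do not invalidate the line
     * another thread is using. */
    struct padded_counter {
        _Alignas(CACHE_LINE) volatile uint64_t value;
    };

    static struct padded_counter counters[NUM_THREADS];

    void increment(int thread_id)
    {
        /* Without the alignment, all eight counters would share one
         * line and every increment would ping-pong it between cores. */
        counters[thread_id].value++;
    }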
Measure
As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies if necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is also cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls will put less strain on the L1C cache and, ultimately, L2, L3 and RAM where collisions with data fetches will occur.
If you are using hyperthreading there appears to be evidence to indicate that a lower optimization level (O1) on two hyperthreads in a core will overall get more work done than a single, highly optimized (O2 and higher) thread.

Stop and copy garbage collector in two phases

When implementing a stop-and-copy garbage collector, I need a pair of memory banks (the old one and a free new one). Each memory bank consists of the-cars and the-cdrs, so basically a new address is a pointer into the-cars and the-cdrs.
When I'm allocating new memory and see that I don't have enough space, I start a GC, which does the following:
switch the memory banks
move: read the car and cdr from the old bank, copy them to the new bank, and leave a pointer in the old bank to the new location for later.
scan: loop over the memory and copy everything from old to new.
Now the question is: why do I need to scan first and move after? Why can't I do both together?
It sounds like you are going through the really awesome garbage collection assignment where you implement your own collectors (mark and sweep, stop and copy, generational).
General answer to your question: two-pass algorithms typically use less memory than one-pass algorithms, by trading time for space.
More specific answer: in a stop-and-copy collector, you do it in two passes by (1) first copying the live data over to the new semispace, and (2) adjusting internal references in the live data to refer to elements in the new semispace, mapping old memory to new memory.
You must realize that the information necessary to do the mapping isn't magically available: you need memory to keep track of how to redirect the moved memory. And remember: your collector itself is a program, and it must use a bounded, small amount of memory! Keeping a hash table in your collector to do the bookkeeping, for example, would be verboten: it's not playing by the rules. So one thing you need to keep track of is making sure the collector is playing with a limited amount of memory. That explains why a stop-and-copy collector will reuse the old semispace and write those redirection records there.
With that constraint in mind: it's important to realize that we need to be careful of how we're traversing the live set. Which approach we choose may or may not require additional memory, in some very subtle and surprising ways. In particular, recursion in the general case is not free! Technically, in the first pass we should be using the new semispace not only as the target of our copying, but as a funky representation of the control stack that we use to implement the recursive process that walks through the live data.
Concretely, if we're doing a one-pass approach like this to copy the live set:
;; copy-live-set: number -> void
;; copies the live set starting from memory-location.

Pseudocode:

to copy-live-set starting at memory-location:
    copy the block at memory-location over to the new semispace, and
        record a redirection record in the old semispace
    for each internal-reference in the block:
        recursively call copy-live-set at the internal-reference if
            it hasn't been copied already
        remap the internal-reference to that new memory location
then you may be surprised to learn that we've messed up our memory usage. The above breaks the promise that the collector makes to the runtime environment, because the recursion here is not iterative! It will consume control stack space: during the live-set traversal, it uses control stack space proportional to the depth of the structures we're walking across. Oops.
If you try an alternative approach for walking through the live set, you should eventually see that there's a good way to traverse the whole live set while still guaranteeing bounded, small control stack usage. Hint: consider how graph traversal algorithms can be written as a simple while loop, with an explicit container that holds what to visit next till we exhaust the container. If you squint just right, the intermediate values in the new semispace look awfully like that container.
Once you discover how to traverse the live set in constant control stack space, you'll see that you'll need those two passes to do the complete copy-and-rewrite-internal-references thing. Worrying about these details is messy, but it's important in seeing how garbage collectors actually work. A real collector needs to do something like this, to be concerned about control stack usage, to ensure it uses bounded memory during the collection.
Summary: a two-pass algorithm is a solution that helps us with memory at the cost of some time. But we don't pay much in terms of performance: though we pass through the live set twice, the process is still linear in the size of the live set.
History: see Cheney's Algorithm, and note the title of the seminal paper's emphasis: "A Nonrecursive List Compacting Algorithm". That single highlighted word "Nonrecursive" is the key to what motivates the two-pass approach: it's trying to avoid consuming the control stack. Cheney's paper is an extension of the paper by Fenichel and Yochelson "A LISP Garbage-Collector for Virtual-Memory Computer Systems", in which the authors there proposed basically the recursive, stack-using approach first. To improve the situation, Fenichel and Yochelson then proposed using the non-recursive (but complicated!) Schorr-Waite DFS algorithm to do the traversal. Cheney's approach is an improvement because the traversal is simpler.
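For reference, here is a rough C sketch of a Cheney-style traversal, where the region between the scan and free pointers in to-space acts as the work queue, so no recursion and no auxiliary stack are needed. The object layout, sizes, and names are invented for the example:

    #include <stddef.h>
    #include <string.h>

    /* Toy object: a header plus pointer slots. */
    typedef struct obj {
        struct obj *forward;                /* set once the object is copied */
        size_t      nslots;
        struct obj *slots[];
    } obj;

    static _Alignas(max_align_t) char to_space[1 << 20];  /* toy to-space */
    static char *free_ptr, *scan_ptr;

    static obj *copy(obj *o)
    {
        if (o == NULL) return NULL;
        if (o->forward) return o->forward;  /* already moved: follow redirect */
        size_t bytes = sizeof(obj) + o->nslots * sizeof(obj *);
        obj *new_o = (obj *)free_ptr;
        memcpy(new_o, o, bytes);
        free_ptr += bytes;                  /* bump allocation in to-space */
        o->forward = new_o;                 /* redirection record in old space */
        return new_o;
    }

    /* Everything between scan_ptr and free_ptr is copied but not yet
     * scanned; the loop below drains that queue iteratively. */
    static void collect(obj **roots, size_t nroots)
    {
        scan_ptr = free_ptr = to_space;
        for (size_t i = 0; i < nroots; i++)
            roots[i] = copy(roots[i]);
        while (scan_ptr < free_ptr) {
            obj *o = (obj *)scan_ptr;
            for (size_t i = 0; i < o->nslots; i++)
                o->slots[i] = copy(o->slots[i]);
            scan_ptr += sizeof(obj) + o->nslots * sizeof(obj *);
        }
    }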
