Garbage collection - root nodes

I have recently read bits and pieces about garbage collection (mostly in Java), and one question still remains unanswered: how does a JVM (or a runtime system in general) keep track of the CURRENTLY live objects?
I understand these objects are the ones currently reachable from the stack, i.e. all the local variables and function parameters that ARE objects. The problem with this approach is: whenever the runtime system checks what is currently on the stack, how would it differentiate between a reference variable and a simple int? It can't, can it?
Therefore, there must be some mechanism that allows the runtime to build the initial list of live objects to pass to the mark-sweep phase...

I found the answer provided by greyfairer is wrong. The JVM runtime does not gather the root set from the stack by looking at which bytecodes were used to push data onto it. A stack frame consists of 4-byte (32-bit arch) slots, and each slot can hold either a reference to a heap object or a primitive value such as an int. When a GC is needed, the runtime scans the stack from top to bottom. A slot is assumed to contain a reference if:
a. Its value is aligned on a 4-byte boundary.
b. Its value points into the heap (between the heap's lower and upper bounds).
c. The allocbit is set. The allocbit is a flag indicating whether the corresponding memory location is allocated or not.
Here is my reference: http://www.ibm.com/developerworks/ibm/library/i-garbage2/.
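For illustration, a conservative scan along those lines could look like the sketch below (the heap bounds and the alloc-bit lookup are assumed helpers for demonstration, not the JVM's actual data structures):

#include <cstdint>
#include <vector>

// Assumed for illustration: heap bounds and an allocation-bitmap lookup.
extern uintptr_t heap_lo, heap_hi;
extern bool alloc_bit_set(uintptr_t addr);

// Conservatively collect potential roots from a range of 4-byte stack slots.
std::vector<uintptr_t> scan_stack(const uint32_t *top, const uint32_t *bottom) {
    std::vector<uintptr_t> roots;
    for (const uint32_t *slot = top; slot < bottom; ++slot) {
        uintptr_t value = *slot;
        if (value % 4 != 0) continue;                      // (a) not 4-byte aligned
        if (value < heap_lo || value >= heap_hi) continue; // (b) not in the heap
        if (!alloc_bit_set(value)) continue;               // (c) allocbit not set
        roots.push_back(value);                            // treat as a reference
    }
    return roots;
}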
There are some other techniques for finding the root set (not used in Java). For example, because pointers are usually aligned on 4/8-byte boundaries, the first bit can be used to indicate whether a slot holds a primitive value or a pointer: for primitive values, the first bit is set to 1. The disadvantage is that you only have 31 bits (on a 32-bit arch) to represent the integer, and every operation on primitive values involves shifting, which is an obvious overhead.
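A minimal sketch of such a tagging scheme (the function names are made up for illustration):

#include <cassert>
#include <cstdint>

// Bit 0 == 1 marks a primitive, leaving 31 payload bits on a 32-bit arch;
// bit 0 == 0 means the slot holds a (4-byte-aligned) pointer.
uint32_t box_int(int32_t v)    { return (static_cast<uint32_t>(v) << 1) | 1u; }
int32_t  unbox_int(uint32_t s) { return static_cast<int32_t>(s) >> 1; }  // shift on every use
bool     is_primitive(uint32_t slot) { return (slot & 1u) != 0; }

int main() {
    uint32_t slot = box_int(-21);
    assert(is_primitive(slot));
    assert(unbox_int(slot) == -21);
}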
Alternatively, you can allocate all types, including int, on the heap, so that everything is an object. All slots in a stack frame are then references.

The runtime can perfectly differentiate between reference variables and primitives, because that's in the compiled bytecode.
For example, if a function f1 calls a function f2(int i, Object o, long l), the calling function f1 will push 4 bytes on the stack (or into a register) representing i, 4 (or 8?) bytes for the reference to o, and 8 bytes for l. The called function f2 knows where to find these bytes on the stack and could potentially copy the reference to some object on the heap, or not. When f2 returns, the calling function drops the parameters from the stack.
The runtime interprets the bytecode and keeps a record of what it pushes onto or pops from the stack, so it knows what is a reference and what is a primitive value.
According to http://www.javacoffeebreak.com/articles/thinkinginjava/abitaboutgarbagecollection.html, Java uses a tracing garbage collector, not a reference-counting algorithm.

The HotSpot VM generates a GC map for each compiled subroutine, containing information about where the roots are. For example, suppose it has compiled a subroutine to machine code (the principle is the same for bytecode) that is 120 bytes long; the GC map for it could look something like this:
0 : [RAX, RBX]
4 : [RAX, [RSP+0]]
10 : [RBX, RSI, [RSP+0]]
...
120 : [[RSP+0],[RSP+8]]
Here [RSP+x] indicates a stack location and R?? a register. So if the thread is stopped at the assembly instruction at offset 10 and a GC cycle runs, then HotSpot knows that the three roots are in RBX, RSI and [RSP+0]. It traces those roots and updates the pointers if it has to move the objects.
The format I've described for the GC map is just to demonstrate the principle and is obviously not the one HotSpot actually uses. It is not complete because it contains no information about registers and stack slots holding live primitive values, and it is not space-efficient to keep a list for every instruction offset. There are many ways to pack the information much more efficiently.
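To make the idea concrete, here is one naive (purely illustrative) way such a map could be represented and queried; HotSpot's real format is far more compact:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// For each safepoint offset in the compiled code, record which registers
// and stack slots hold references at that point.
struct RootLocation {
    bool in_register;   // true: a register named below; false: a stack slot
    std::string reg;    // e.g. "RAX" (unused for stack slots)
    int rsp_offset;     // e.g. 0 for [RSP+0] (unused for registers)
};

using GcMap = std::map<uint32_t, std::vector<RootLocation>>;

// Root set for the safepoint at or before 'offset', or null if none.
const std::vector<RootLocation>* roots_at(const GcMap& map, uint32_t offset) {
    auto it = map.upper_bound(offset);
    if (it == map.begin()) return nullptr;
    return &std::prev(it)->second;
}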

Related

Capture All Heap Allocations

I want to find all accesses to heap memory in an application. I need to record each allocation, so I cannot simply check for addresses in the [heap] range (which also misses heap memory allocated via mmap()). Therefore, I wrote a pintool and captured all calls to malloc(), calloc(), realloc() and free(). Because of optimizations such as tail-call elimination, Pin cannot detect the last instruction of these calls. Therefore, I manually added callbacks after (precisely, I used IPOINT_TAKEN_BRANCH) the ret... instructions in each of the probable direct/indirect jump targets out of these functions (e.g., malloc() indirectly jumps to malloc_hook_ini(), so I added instrumentation code after all ret... instructions in malloc_hook_ini()). These targets, themselves, may have outgoing direct/indirect jumps, and, again, I tried to capture them.
But there are still some accesses in the [heap] range (and also in mmap() ranges) which do not pertain to any of the previously captured allocations. To clear up any doubts, I used Pwngdb to display all currently allocated heap chunks right before the access point. The access address was clearly in an allocated heap chunk. Of course, knowing the allocation IP for these heap chunks would be a great help, but this is not supported in Pwngdb or any similar tool.
In many cases analyzed by Pin, the access address does not belong to any address range allocated (even those removed in the meantime) during the whole program execution. How can I determine which allocation function was missed during Pin analysis?
It seems that there are two possible situations:
1) There exists some omitted function other than malloc(), calloc(), realloc() and free().
2) There are some missed return points for malloc(), calloc(), realloc() and free().
The second possibility can be ruled out: I put a counter before and after each of these allocation functions, and at the end they had equal values.
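For reference, the routine-level hooking being described looks roughly like this sketch, modeled on Pin's stock malloc-trace example (only malloc shown; the IPOINT_AFTER callback is exactly what tail-call elimination can defeat):

#include "pin.H"
#include <cstdio>

static VOID BeforeMalloc(ADDRINT size) { printf("malloc(%lu)", (unsigned long)size); }
static VOID AfterMalloc(ADDRINT ret)   { printf(" -> %p\n", (void *)ret); }

static VOID Image(IMG img, VOID *) {
    RTN rtn = RTN_FindByName(img, "malloc");
    if (RTN_Valid(rtn)) {
        RTN_Open(rtn);
        RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)BeforeMalloc,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 0, IARG_END);
        // IPOINT_AFTER instruments the return path and can be skipped
        // by tail calls -- the gap discussed in this question.
        RTN_InsertCall(rtn, IPOINT_AFTER, (AFUNPTR)AfterMalloc,
                       IARG_FUNCRET_EXITPOINT_VALUE, IARG_END);
        RTN_Close(rtn);
    }
}

int main(int argc, char *argv[]) {
    PIN_InitSymbols();
    if (PIN_Init(argc, argv)) return 1;
    IMG_AddInstrumentFunction(Image, 0);
    PIN_StartProgram();  // never returns
    return 0;
}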
UPDATE:
Here is the backtrace for one such access point and also the value for the RSI register:

How does the referencing of objects and variables in programs work?

Disclaimer: I am not a very experienced guy, and many questions might seem stupid or badly phrased.
I have heard about stacks and heaps and read a bit about them, but there are still a few things I don't quite understand:
How does a program find empty memory to store new variables/objects in physical memory?
How does a program know where an object starts and where an object ends in memory? With number variables I can imagine there is some extra information provided in memory that shows the program how many bits the variable occupies, but correct me if I'm wrong.
This is similar to my first question, but: when a variable has a value represented only by zeros, how does the program not confuse that with free memory?
Does the object value null mean that the address of an object is a bunch of 0's, or does the object point to literally nothing? And if so, how is the "reference" stored so it can be assigned an address later on?
How does a program find empty memory to store new variables/objects in physical memory?
Modern operating systems use logical address translation. A process sees a range of logical addresses—its address space. The system hardware breaks the address range into pages. The size of the page is system dependent and is often configurable. The operating system manages page tables that map logical pages to physical page frames of the same size.
The address space is divided into a system space, which is shared by all processes, and a user space, which is generally unique to each process.
Within the user and system spaces, pages may be valid or invalid. An invalid page has not yet been mapped to the process address space. Most pages are likely to be invalid.
Memory is always allocated from the operating system in pages. The operating system provides system services that transform invalid pages into valid pages with mappings to physical memory. In order to map pages, the operating system needs to find (or the application needs to specify) a range of pages that are invalid, and then has to allocate physical page frames to map to those pages. Note that physical page frames do not have to be mapped contiguously to logical pages.
You mention stacks and heaps. Stacks and heaps are just memory. The operating system cannot tell whether memory is a stack, a heap or something else. User-mode libraries for memory allocation (such as those that implement malloc/free) allocate memory in pages to create heaps. The only thing that makes this memory a heap is that there is a heap manager controlling it. The heap manager can then allocate smaller blocks of memory from the pages allocated to the heap.
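As a toy illustration of a heap manager sitting on top of page-granular allocation (Linux mmap; a bump allocator with no free(), unlike a real heap manager):

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Ask the OS for a run of pages once, then carve small blocks out of it.
static unsigned char *heap_cur = nullptr;
static unsigned char *heap_end = nullptr;

void *toy_alloc(std::size_t n) {
    if (heap_cur == nullptr) {                        // first call: map pages
        const std::size_t kHeapBytes = 64 * 4096;     // 64 pages
        void *p = mmap(nullptr, kHeapBytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return nullptr;
        heap_cur = static_cast<unsigned char *>(p);
        heap_end = heap_cur + kHeapBytes;
    }
    n = (n + 15) & ~std::size_t{15};                  // keep 16-byte alignment
    if (heap_cur + n > heap_end) return nullptr;      // toy heap exhausted
    void *block = heap_cur;
    heap_cur += n;
    return block;
}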
A stack is simpler. It is just a contiguous range of pages. Typically an operating system service that creates a thread or process will allocate a range of pages for a stack and assign the hardware stack pointer register to the high end of the stack range.
How does a program know where an object starts and where an object ends in memory? With number variables I can imagine there is some extra information provided in memory that shows the program how many bits the variable occupies, but correct me if I'm wrong.
This depends upon how the program is created and how the object is created in memory. For typed languages, the linker binds variables to addresses and generates instructions for mapping those addresses into the address space. For stack/auto variables, the compiler generates offsets from the stack pointer. When a function/subroutine gets called, the compiler generates code that allocates the memory required by the procedure simply by subtracting from the stack pointer; the memory gets freed by simply adding that value back to the stack pointer.
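As a concrete illustration (the x86-64 in the comments is approximate, unoptimized output; exact layout varies by compiler):

// Locals live at fixed offsets within a frame carved out of the stack.
int sum_squares(int a, int b) {
    int x = a * a;   // prologue: sub rsp, 16        ; reserve the frame
    int y = b * b;   //           mov [rsp+4], ...   ; locals at fixed offsets
    return x + y;    // epilogue: add rsp, 16        ; release the frame
}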
In the case of typeless languages, such as assembly language or Bliss, the programmer has to keep track of the type of each location. When memory is allocated dynamically, the programmer also has to keep track of its type. Most programming languages help with this by having typed pointers.
This is similar to my first question, but: when a variable has a value represented only by zeros, how does the program not confuse that with free memory?
Free memory is invalid. Accessing free memory causes a hardware exception.
Does the object value null mean that the address of an object is a bunch of 0's, or does the object point to literally nothing? And if so, how is the "reference" stored so it can be assigned an address later on?
The linker defines the initial state of a program's user address space. Most linkers do not map the first page (or even more than one page). That page is then invalid. That means a null pointer, as you say, references absolutely nothing. If you try to dereference a null pointer, you will usually get some kind of access violation exception.
Most operating systems will allow the user to map the first page. Some linkers will allow the user to override the default setting and map the first page. This is not commonly done, as it makes detecting memory errors difficult.
How does a program find empty memory to store new variables/objects in physical memory?
Physical memory is managed by the OS, which knows which parts of memory are used by processes and which parts are free. When a program needs memory, it asks the operating system for it. If this memory is for the heap, extra operations are needed. The operating system delivers memory in fixed-size blocks called pages. As a page is typically 4 kbytes, if the user mallocs just a few bytes, then to optimize memory use there is a need to know which parts of the page are used or available, and to monitor page contents across successive mallocs and frees. There are specific data structures to describe used space, and algorithms to find free space while avoiding fragmentation.
How does a program know where an object starts and where an object ends in memory? With number variables I can imagine there is some extra information provided in memory that shows the program how many bits the variable occupies, but correct me if I'm wrong.
The program knows the address (i.e., the start) of every variable. For global or static variables, it is generated by the linker when it places vars in memory. For local variables, the processor has means to compute it from the stack position. For allocated variables, it is stored in another variable (a pointer) when memory is allocated. Concerning the end, it depends on the type of the variable. For known types (like int) or compositions of known types (like structs), the size can be computed at compile time. In other situations the program has no way to know the entity's size. For instance, a declaration like int * a may describe an array, but the program has no way to know the array size. The programmer must keep track of this information, for instance by writing the number of elements in the array in another variable, as in the sketch below.
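A common way to keep that bookkeeping explicit (names are illustrative):

#include <cstdlib>

// The pointer alone carries no size, so store the element count with it.
struct IntArray {
    int         *data;
    std::size_t  len;
};

IntArray make_int_array(std::size_t n) {
    IntArray a;
    a.data = static_cast<int *>(std::malloc(n * sizeof(int)));
    a.len  = (a.data != nullptr) ? n : 0;
    return a;
}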
This is similar to my first question, but: when a variable has a value represented only by zeros, how does the program not confuse that with free memory?
The program never looks at the memory itself to know whether it is free or not. That is managed by other means (see question 1).
Does the object value null mean that the address of an object is a bunch of 0's, or does the object point to literally nothing? And if so, how is the "reference" stored so it can be assigned an address later on?
An address is never a bunch of zeros, except for address 0 itself; it is the content that may be set to zero. Actually, it is not possible to read or write address 0: doing so generates a "bus error" exception (maybe you have already encountered one). Pointing to address zero is exactly like "pointing to literally nothing" and generates an error if encountered in a program. Reference variables hold the addresses of other vars (pointers), so the address of the pointer itself is well defined; what may not be defined is what it points to. It can be changed by assigning something to the pointer (for instance, what malloc returned, or the address of another var).

How many ABA tag bits are needed in lock-free data structures?

One popular solution to the ABA problem in lock-free data structures is to tag pointers with an additional monotonically incrementing tag.
struct aba {
    void *ptr;
    uint32_t tag;
};
However, this approach has a problem: it is really slow and causes serious cache problems. I can obtain a twofold speed-up if I ditch the tag field, but isn't that unsafe?
So my next attempt, for 64-bit platforms, stuffs the tag bits into the ptr field.
struct aba {
    uintptr_t __ptr;
};

uint32_t get_tag(struct aba aba) { return aba.__ptr >> 48U; }
But someone told me that only 16 bits for the tag is unsafe. My new plan is to use pointer alignment to cache lines to stuff in more tag bits, but I want to know whether that will work.
If that fails, my next plan is to use Linux's MAP_32BIT mmap flag to allocate data so I only need 32 bits of pointer space.
How many bits do I need for the ABA tag in lock-free data structures?
The number of tag bits that is practically safe can be estimated based on the preemption time and the frequency of pointer modifications.
To remind, the ABA problem happens when a thread reads the value it wants to change with compare-and-swap, gets preempted, and when it resumes the actual value of the pointer happens to be equal to what the thread read before. Therefore the compare-and-swap operation may succeed despite data structure modifications possibly done by other threads during the preemption time.
The idea of adding a monotonically incremented tag is to make each modification of the pointer unique. For this to succeed, increments must produce unique tag values during the time a modifying thread might be preempted; i.e., for guaranteed correctness the tag may not wrap around during the whole preemption time.
Let's assume that preemption lasts a single OS scheduling time slice, which is typically tens to hundreds of milliseconds. The latency of CAS on modern systems is tens to hundreds of nanoseconds. A rough worst-case estimate is therefore 100 ms / 100 ns = 10^6 pointer modifications while a thread is preempted, so the tag needs 20+ bits (2^20 ≈ 10^6) in order not to wrap around.
In practice it may be possible to make a better estimate for a particular use case, based on the known frequency of CAS operations. One also needs to estimate the worst-case preemption time more carefully; for example, a low-priority thread preempted by higher-priority work might end up with a much longer preemption time.
According to the paper
http://web.cecs.pdx.edu/~walpole/class/cs510/papers/11.pdf
Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects (IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 15, NO. 6, JUNE 2004 p. 491) by PhD Maged M. Michael
tag bits should be sized to make wraparound impossible in real lock-free scenarios (I read this as: if you may have N threads running, each of which may access the structure, you should have at least N+1 different tag states):
6.1.1 IBM ABA-Prevention Tags
The earliest and simplest lock-free method for node reuse is the tag (update counter) method introduced with the documentation of CAS on the IBM System 370 [11]. It requires associating a tag with each location that is the target of ABA-prone comparison operations. By incrementing the tag when the value of the associated location is written, comparison operations (e.g., CAS) can determine if the location was written since it was last accessed by the same thread, thus preventing the ABA problem.
The method requires that the tag contains enough bits to make full wraparound impossible during the execution of any single lock-free attempt. This method is very efficient and allows the immediate reuse of retired nodes.
Depending on your data structure, you might be able to steal some extra bits from the pointers. For example, if the objects are 64 bytes and always aligned on 64-byte boundaries, the lower 6 bits of each pointer could be used for the tags (but that's probably what you already suggested in your new plan).
Another option would be to use an index into your objects instead of pointers.
In case of contiguous objects that would of course simply be an index into an array or vector. In case of lists or trees with objects allocated on the heap, you could use a custom allocator and use an index into your allocated block(s).
For say 17M objects you would only need 24 bits, leaving 40 bits for the tags.
This would need some (small and fast) extra calculation to get the address, but if the alignment is a power of 2 only a shift and an addition are needed.
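A sketch of that index-plus-tag layout packed into one 64-bit word (the 24/40 split and the names are illustrative):

#include <atomic>
#include <cstdint>

constexpr unsigned kIndexBits = 24;                      // ~17M objects
constexpr uint64_t kIndexMask = (uint64_t{1} << kIndexBits) - 1;

constexpr uint64_t pack(uint32_t index, uint64_t tag) {
    return (tag << kIndexBits) | (index & kIndexMask);
}
constexpr uint32_t index_of(uint64_t w) { return static_cast<uint32_t>(w & kIndexMask); }
constexpr uint64_t tag_of(uint64_t w)   { return w >> kIndexBits; }

std::atomic<uint64_t> head{pack(0, 0)};

// Bump the tag on every modification so a recycled index cannot
// masquerade as the value originally read (the ABA problem).
bool try_replace(uint32_t new_index) {
    uint64_t old_word = head.load(std::memory_order_acquire);
    uint64_t new_word = pack(new_index, tag_of(old_word) + 1);
    return head.compare_exchange_strong(old_word, new_word,
                                        std::memory_order_acq_rel);
}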

Maximum Thread Stack Size .NET?

What is the maximum stack size allowed for a thread in C#.NET 2.0? Also, does this value depend on the version of the CLR and/or the bitness (32 or 64) of the underlying OS?
I have looked at the following resources: msdn1 and msdn2.
public Thread(
    ThreadStart start,
    int maxStackSize
)
The only information I can see is that the default size is 1 megabyte, and that in the above method, if maxStackSize is 0, the default maximum stack size specified in the header for the executable will be used. What is the maximum value we can raise the header value to? Also, is it advisable to do so? Thanks.
For the record, this fits Raymond Chen's category of "if you need to know then you are doing something wrong".
The default stack size for threads running 64-bit code is 4 megabytes; it is 1 megabyte for 32-bit code. While the Thread constructor lets you pass an integer value up to int.MaxValue, you'll never get that on a 32-bit machine. The stack must fit in an available hole in the virtual memory address space, which usually tops out at ~600 MB early in the process lifetime and rapidly gets smaller as you allocate memory and fragment the address space.
Allocating more than the default is quite unnecessary. You might contemplate doing this when you have a heavily recursive method that blows the stack. Don't; fix the algorithm, or you'll blow the stack anyway when the job gets bigger.
The smallest stack that .NET lets you choose is 250 KB; it silently rounds up if you pass a smaller value. This is necessary because both the jitter and the garbage collector need stack space to get their job done. Again, shrinking the stack should be quite unnecessary. If you contemplate doing so because you have a lot of threads and their stacks consume all virtual memory, then you have too many threads. A StackOverflowException is one of the nastiest runtime exceptions you can get; process death is immediate and untrappable.
The stack size for the main thread is determined by an option in the EXE header. The compiler doesn't have an option to change it, you have to use editbin.exe /stack to patch the .exe header.
I am unaware of what the maximum is, but MSDN speaks to whether you should do it or not:
Avoid using this constructor overload. The default stack size used by the Thread(ThreadStart) constructor overload is the recommended stack size for threads. If a thread has memory problems, the most likely cause is programming error, such as infinite recursion.
I have never had a StackOverflowException occur in C# that was not due to infinite recursion. If there truly were a case where recursion went to that depth, I would consider replacing it with iteration.

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared-memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable, it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing it back.
But what if you only have one writer (and lots of readers) and the write does not depend on the previous value? I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. Lots of readers use this offset to timestamp events, and taking a read lock on every read is a little expensive. In this situation I don't care whether a reader gets the value just before the write or just after, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Does it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different from the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
In addition to the above comments, beware of the register bank in a slightly more general setting. You may end up updating only a CPU register and not writing the value back to main memory right away, or the other way around, using a cached register copy while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "read always, never cache in a register".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either these are the rules of the CPU architecture you are executing on, or they are determined by the virtual machine in which the language is running. Java, for instance, has a separate memory model you can consult to figure out exactly what to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on the 486 and later) or unaligned but within a cache line (on the P6 and later). Most compilers will guarantee that stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
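On C++11 and later you can also ask the implementation directly whether a given size is handled with a true atomic instruction (a quick sanity check, not a replacement for the manuals):

#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    std::atomic<uint32_t> a32;
    std::atomic<uint64_t> a64;
    // false means the library emulates atomicity with an internal lock.
    std::cout << std::boolalpha
              << "32-bit lock-free: " << a32.is_lock_free() << '\n'
              << "64-bit lock-free: " << a64.is_lock_free() << '\n';
}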
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32-bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index = ...;  // find a free entry in the global_data array
    global_data[new_index] = blah;
    WriteBarrier();   // write-release
    global_index = new_index;
}

read()
{
    read_index = global_index;
    ReadBarrier();    // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
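For comparison, here is the same publication pattern expressed with C++11 atomics for a single writer (a sketch; it still has the reuse caveat just mentioned, since a slow reader can observe a slot the writer is refilling):

#include <atomic>
#include <cstdint>

struct PublishedOffset {
    int64_t slots[2];                  // double buffer
    std::atomic<unsigned> latest{0};   // index of the published slot

    void write(int64_t value) {        // single writer assumed
        unsigned next = 1u - latest.load(std::memory_order_relaxed);
        slots[next] = value;                            // fill the free slot
        latest.store(next, std::memory_order_release);  // publish (write-release)
    }
    int64_t read() const {
        unsigned i = latest.load(std::memory_order_acquire);  // read-acquire
        return slots[i];
    }
};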
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example); see the Interlocked* APIs on Windows.
This can avoid the use of a heavier-weight lock for thread-safe variable or member access, but it should not be mixed with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and Interlocked* to modify or read it in another.
