GCC's std::string - why such a weird implementation?

When I looked at how std::string is implemented in GCC, I noticed that sizeof(std::string) is exactly the size of a pointer (4 bytes in a 32-bit build, 8 bytes in a 64-bit build). Since a string must hold at least a pointer to its buffer and its length, this made me think that a std::string object in GCC is actually a pointer to some internal structure that holds this data.
As a consequence, creating a new string should require one dynamic memory allocation (even if the string is empty).
In addition to the performance overhead, this also causes memory overhead (which happens whenever we allocate a very small chunk of memory).
So I see only downsides to such a design. What am I missing? What are the upsides, and what is the reason for such an implementation in the first place?

Read the long comment at the top of <bits/basic_string.h>; it explains what the pointer points to, where the string length (and reference count) are stored, and why it's done that way.
However, C++11 doesn't allow a reference-counted copy-on-write std::string, so the GCC implementation will have to change. Doing so would break the ABI, so it is being delayed until an ABI change is inevitable. We don't want to change the ABI, then have to change it again a few months later, then again. When it changes it should only change once, to minimise the hassle for users.

Related

In Rust, is it possible to allocate a slice of memory in such a way that the returned pointer fits in a u32?

HVM is a functional runtime which represents pointers as 32-bit values. Its allocator preemptively reserves a huge (4 GB) buffer, which it uses to create internal objects. This is not ideal. Instead, I'd like to use the system allocator, but that's not possible, since it returns 64-bit pointers, which may be larger than the space available to store them. Is there any cross-platform way in Rust to allocate a buffer such that the pointer to the buffer is guaranteed to fit in a u32? In other words, I'm looking for something akin to:
let ptr = Box::new_with_small_ptr(size);
assert!(ptr as u64 + size < u32::MAX);
There isn't, because it's an extremely niche need and requires a lot of care.
It's not as simple as "just returning a low pointer" - you need to actually allocate that space from the OS. Your entry point into that would be mmap. Be prepared to do some low-level work with MAP_FIXED and reading /proc/self/maps, and also implementing an allocator on top of the memory region you get from mmap.
If your concern is just excess memory usage, note that Linux overcommits memory by default - allocating 4GB of memory won't reserve physical memory unless you actually try to use it all.

What is uninitialized memory and why isn't it initialized when allocating?

Take this signature of a method of the GlobalAlloc trait:
unsafe fn alloc(&self, layout: Layout) -> *mut u8
and this sentence from the method's documentation:
The allocated block of memory may or may not be initialized.
Suppose that we are going to allocate a chunk of memory for an [i32; 10]. Assuming the size of i32 is 4 bytes, our example array would need 40 bytes of storage.
Now the allocator has found a memory spot that fits our requirements: some 40-byte memory region... but what's there? I always read the term garbage data, and assume that it's just old data already stored there by another process, program, etc.
What's uninitialized memory? Just data that is not initialized with zeros or with some default value for the type that we want to store there?
Why isn't memory always initialized before the pointer is returned? Is it too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't it come already initialized?
When some resource is deallocated, nothing must point to that freed memory anymore. Does that place get zeroed? What really happens when you deallocate a piece of memory?
What's uninitialized memory? Just data that is not initialized with zeros or with some default value for the type that we want to store there?
It's worse than either of those. Reading from uninitialized memory is undefined behavior, as in you can no longer reason about a program which does so. Practically, compilers often optimize assuming that code paths that would trigger undefined behavior are never executed and their code can be removed. Or not, depending on how aggressive the compiler is.
If you could reliably read from the pointer, it would contain arbitrary data. It may be zeroes, it may be old data structures, it may be parts of old data structures. It may even be things like passwords and encryption keys, which is another reason why reading uninitialized memory is problematic.
Why isn't memory always initialized before the pointer is returned? Is it too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't it come already initialized?
Yes, cost is the issue. The first thing that is typically done after allocating a piece of memory is to write to it. Having the allocator "pre-initialize" the memory is wasteful when the caller is going to overwrite it anyway with the values it wants. This is especially significant with large buffers used for IO or other large storage.
When some resource is deallocated, nothing must point to that freed memory anymore. Does that place get zeroed? What really happens when you deallocate a piece of memory?
It's up to how the memory allocator is implemented. Most don't waste processing power clearing data that has been deallocated, since it will be overwritten anyway when it's reallocated. Some allocators may write some bookkeeping data to the freed space. GlobalAlloc is an interface to whatever allocator is registered as the global one, so the behavior can vary depending on the environment.
I always read the term garbage data, and assume that it's just old data already stored there by another process, program... etc.
Worth noting: all modern desktop OSs have memory isolation between processes - your program cannot access the memory of other processes or the kernel (unless you explicitly share it via specialized functionality). The kernel will clear memory before it assigns it to your process, to prevent leaking sensitive data. But you can see old data from your own process, for the reasons described above.
What you are asking about are implementation details that can even vary from run to run. From the perspective of the abstract machine, and thus the optimizer, they don't matter.
Turning contents of uninitialized memory into almost any type (other than MaybeUninit) is immediate undefined behavior.
let mem: *mut u8 = unsafe { alloc(...) };
let x: u8 = unsafe { ptr::read(mem) }; // UB: reads uninitialized memory
if x != x {
    print!("wtf");
}
This may or may not print, crash, or delete the contents of your hard drive, possibly even before reaching that alloc call, because the optimizer worked backwards and eliminated the entire code block after proving that all execution paths are UB.
This may happen due to assumptions the optimizer relies on, i.e. even when the underlying allocator is well-behaved. But real systems may also behave non-deterministically. E.g. theoretically, on a freshly booted embedded system, memory might be in an uninitialized state that doesn't reliably return 0 or 1. Or on Linux, madvise(MADV_FREE) can cause allocations to return inconsistent results over time until initialized.
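As a rough sketch of the safe alternatives to the snippet above, the following uses the standard std::alloc functions in two ways: writing every element before reading it, and asking for zeroed memory up front. The function names sum_after_init and sum_zeroed are made up for illustration.

```rust
use std::alloc::{alloc, alloc_zeroed, dealloc, handle_alloc_error, Layout};

/// Allocates room for ten i32 values, initializes every element, then reads them.
pub fn sum_after_init() -> i32 {
    let layout = Layout::array::<i32>(10).expect("layout overflow");
    unsafe {
        let ptr = alloc(layout) as *mut i32;
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // At this point the block holds arbitrary bytes; reading it would be UB.
        for i in 0..10 {
            // write() stores the value without reading the old contents.
            ptr.add(i).write(i as i32);
        }
        // Every element has been written, so reading is now defined.
        let mut sum = 0;
        for i in 0..10 {
            sum += ptr.add(i).read();
        }
        dealloc(ptr as *mut u8, layout);
        sum
    }
}

/// Asks the allocator for zeroed memory, which is valid to read as i32 right away.
pub fn sum_zeroed() -> i32 {
    let layout = Layout::array::<i32>(10).expect("layout overflow");
    unsafe {
        let ptr = alloc_zeroed(layout) as *mut i32;
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        let sum: i32 = (0..10).map(|i| ptr.add(i).read()).sum();
        dealloc(ptr as *mut u8, layout);
        sum
    }
}
```

Here sum_after_init() pays only for the writes the caller wanted anyway, while sum_zeroed() pays for an extra zeroing pass in exchange for memory that is immediately readable.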

Maximum size of an array in 32 bits?

According to the Rust Reference:
The isize type is a signed integer type with the same number of bits as the platform's pointer type. The theoretical upper bound on object and array size is the maximum isize value. This ensures that isize can be used to calculate differences between pointers into an object or array and can address every byte within an object along with one byte past the end.
This obviously constrains an array to at most 2G elements on a 32-bit system; however, what is not clear is whether an array is also constrained to at most 2GB of memory.
In C or C++, you would be able to cast the pointers to the first and one-past-the-last elements to char* and take the difference of those two pointers, effectively limiting the array to 2GB (lest it overflow ptrdiff_t).
Is an array in 32 bits also limited to 2GB in Rust? Or not?
The internals of Vec do cap the value to 4GB, both in with_capacity and grow_capacity, using
let size = capacity.checked_mul(mem::size_of::<T>())
    .expect("capacity overflow");
which will panic if the size computation overflows.
As such, Vec-allocated slices are also capped in this way in Rust. Given that this is because of an underlying restriction in the allocation API, I would be surprised if any typical type could circumvent this. And if they did, Index on slices would be unsafe due to pointer overflow. So I hope not.
It might still not be possible to allocate all 4GB for other reasons, though. In particular, allocate won't let you allocate more than 2GB (isize::MAX bytes), so Vec is restricted to that.
Rust uses LLVM as compiler backend. The LLVM instruction for pointer arithmetic (GetElementPtr) takes signed integer offsets and has undefined behavior on overflow, so it is impossible to index into arrays larger than 2GB when targeting a 32-bit platform.
To avoid undefined behavior, Rust will refuse to allocate more than 2 GB in a single allocation. See Rust issue #18726 for details.
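The two limits discussed in these answers (no usize overflow when computing the byte size, and no single allocation above isize::MAX bytes) can be combined in a small check. This is only an illustrative sketch; checked_byte_size is a hypothetical helper, not a function from the standard library or from Vec's internals.

```rust
use std::mem;

/// Returns the byte size of `capacity` elements of `T`, or `None` if the
/// multiplication overflows `usize` or the result exceeds `isize::MAX`.
pub fn checked_byte_size<T>(capacity: usize) -> Option<usize> {
    // First limit: the multiplication itself must not wrap around.
    let size = capacity.checked_mul(mem::size_of::<T>())?;
    // Second limit: pointer offsets are signed, so cap at isize::MAX bytes.
    if size > isize::MAX as usize {
        None
    } else {
        Some(size)
    }
}
```

On a 32-bit target, isize::MAX is 2^31 - 1, so a request such as checked_byte_size::<u8>(3_000_000_000) would be rejected by the second check even though the multiplication does not overflow.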

Garbage collection - root nodes

I have recently read bits and pieces about garbage collection (mostly in Java) and one question still remains unanswered: how does a JVM (or a runtime system in general) keep track of CURRENTLY live objects?
I understand these objects are the ones which are currently on the stack, so all the local variables or function parameters which ARE objects. The problem with this approach is that whenever the runtime system checks what is currently on the stack, how would it differentiate between a reference variable and a simple int? It can't, can it?
Therefore, there must be some sort of mechanism that allows the runtime to build the initial list of live objects to pass to the mark-sweep phase...
I found that the answer provided by greyfairer is wrong. The JVM runtime does not gather the root set from the stack by looking at which bytecodes are used to push data on the stack. The stack frame consists of 4-byte slots (on a 32-bit architecture). Each slot could be a reference to a heap object or a primitive value such as an int. When a GC is needed, the runtime scans the stack from top to bottom. A slot is treated as containing a reference if:
a. Its value is aligned on a 4-byte boundary.
b. The value in the slot points into the heap region (between its lower and upper bounds).
c. The allocbit is set. The allocbit is a flag indicating whether the memory location corresponding to it is allocated or not.
Here is my reference: http://www.ibm.com/developerworks/ibm/library/i-garbage2/.
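The three tests (a) to (c) above can be sketched roughly as follows. This is not the JVM's code: the Heap struct, the one-allocbit-per-4-byte-granule layout, and the name looks_like_reference are all assumptions made for illustration.

```rust
/// Hypothetical heap description: address bounds plus one "allocbit"
/// per 4-byte granule (an assumed layout, for illustration only).
pub struct Heap {
    pub start: usize,          // inclusive lower bound, assumed 4-byte aligned
    pub end: usize,            // exclusive upper bound
    pub alloc_bits: Vec<bool>, // alloc_bits[i] covers address start + 4 * i
}

const GRANULE: usize = 4;

impl Heap {
    /// Conservative test: could the value found in a stack slot be a reference?
    pub fn looks_like_reference(&self, slot_value: usize) -> bool {
        // (a) the value must be aligned on a 4-byte boundary
        if slot_value % GRANULE != 0 {
            return false;
        }
        // (b) the value must point into the heap region
        if slot_value < self.start || slot_value >= self.end {
            return false;
        }
        // (c) the allocbit for that location must be set
        let index = (slot_value - self.start) / GRANULE;
        self.alloc_bits.get(index).copied().unwrap_or(false)
    }
}
```

Because the test is conservative, an int that happens to satisfy all three conditions is also treated as a reference, which keeps the object it appears to point at alive.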
There are some other techniques for finding the root set (not in Java). For example, because pointers are usually aligned on a 4- or 8-byte boundary, their lowest bit is always 0, so that bit can be used to indicate whether a slot holds a primitive value or a pointer: for primitive values, the bit is set to 1. The disadvantage is that you only have 31 bits (on a 32-bit architecture) to represent an integer, and every operation on primitive values involves shifting, which is obviously an overhead.
Alternatively, you can allocate all types, including int, on the heap. That is, all things are objects. Then all slots in a stack frame are references.
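A minimal sketch of the tagging technique, assuming 32-bit slots and taking the tag to be the lowest bit (1 for an integer, 0 for a pointer). The Slot type and its methods are hypothetical names, not taken from any particular runtime.

```rust
/// A 32-bit stack slot: lowest bit 1 means integer, lowest bit 0 means pointer.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Slot(pub u32);

impl Slot {
    /// Encodes an integer. Only 31 bits of the value survive the shift.
    pub fn from_int(value: i32) -> Slot {
        Slot(((value as u32) << 1) | 1)
    }

    /// Encodes a pointer. Alignment guarantees that the lowest bit is 0.
    pub fn from_pointer(addr: u32) -> Slot {
        assert!(addr % 4 == 0, "pointers must be 4-byte aligned");
        Slot(addr)
    }

    /// The collector only needs this check to find the roots.
    pub fn is_pointer(self) -> bool {
        self.0 & 1 == 0
    }

    /// Decodes an integer; the arithmetic shift restores the sign.
    pub fn as_int(self) -> Option<i32> {
        if self.is_pointer() {
            None
        } else {
            Some((self.0 as i32) >> 1)
        }
    }
}
```

The shift in from_int and as_int is the per-operation overhead mentioned above: every arithmetic operation on a tagged integer has to untag its operands and retag the result.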
The runtime can perfectly differentiate between reference variables and primitives, because that's in the compiled bytecode.
For example if a function f1 calls a function f2(int i, Object o, long l), the calling function f1 will push 4 bytes on the stack (or in a register) representing i, 4 (or 8?) bytes for the reference to o, and 8 bytes for l. The called function f2 knows where to find these bytes on the stack, and could potentially copy the reference to some object on the heap, or not. When the function f2 returns, the calling function will drop the parameters from the stack.
The runtime interprets the bytecode and keeps a record of what it pushes onto or drops from the stack, so it knows what is a reference and what is a primitive value.
According to http://www.javacoffeebreak.com/articles/thinkinginjava/abitaboutgarbagecollection.html, Java uses a tracing garbage collector and not a reference counting algorithm.
The HotSpot VM generates a GC map for each compiled subroutine, which contains information about where the roots are. For example, suppose it has compiled a subroutine to machine code (the principle is the same for bytecode) which is 120 bytes long; then the GC map for it could look something like this:
0 : [RAX, RBX]
4 : [RAX, [RSP+0]]
10 : [RBX, RSI, [RSP+0]]
...
120 : [[RSP+0],[RSP+8]]
Here [RSP+x] is supposed to indicate stack locations and R?? registers. So if the thread is stopped at the assembly instruction at offset 10 and a gc cycle runs then HotSpot knows that the three roots are in RBX, RSI and [RSP+0]. It traces those roots and updates the pointers if it has to move the objects.
The format I've described for the GC map is just for demonstrating the principle and obviously not the one HotSpot actually uses. It is not complete because it doesn't contain information about registers and stack slots which contain primitive live values and it is not space efficient to use a list for every instruction offset. There are many ways in which you can pack the information in a much more efficient way.
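The demonstration format above could be modelled with a plain ordered map from instruction offset to root locations. Like the format itself, this is only a sketch of the principle; the Root enum and both function names are invented here and say nothing about HotSpot's real data structures. Only the four entries shown above are included, and the elided ones are left out.

```rust
use std::collections::BTreeMap;

/// Where a root lives at a given instruction offset (illustrative model only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Root {
    Register(&'static str),
    Stack(u32), // byte offset from RSP
}

/// Builds the example GC map from the text (elided entries are omitted).
pub fn example_gc_map() -> BTreeMap<u32, Vec<Root>> {
    let mut map = BTreeMap::new();
    map.insert(0, vec![Root::Register("RAX"), Root::Register("RBX")]);
    map.insert(4, vec![Root::Register("RAX"), Root::Stack(0)]);
    map.insert(
        10,
        vec![Root::Register("RBX"), Root::Register("RSI"), Root::Stack(0)],
    );
    map.insert(120, vec![Root::Stack(0), Root::Stack(8)]);
    map
}

/// Looks up the roots for the instruction offset at which a thread stopped.
pub fn roots_at(map: &BTreeMap<u32, Vec<Root>>, offset: u32) -> Option<&[Root]> {
    map.get(&offset).map(|roots| roots.as_slice())
}
```

A thread stopped at offset 10 would yield RBX, RSI and [RSP+0], matching the walk-through above; an offset with no entry yields None in this sketch.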

Hazards of not protecting shared variables in a threaded environment

I'm trying to understand the hazards of not locking shared variables in a threaded (or shared memory) environment. It is easy to argue that if you are doing two or more dependent operations on a variable it is important to hold some lock first. The typical example is the increment operation, which first reads the current value before adding one and writing back.
But what if you only have one writer (and lots of readers) and the write does not depend on the previous value? I have one thread storing a timestamp offset once every second. The offset holds the difference between local time and some other time base. A lot of readers use this offset to timestamp events, and getting a read lock each time is a little expensive. In this situation I don't care if the reader gets the value just before the write or just after, as long as the reader doesn't get garbage (that is, an offset that was never set).
Say that the variable is a 32-bit integer. Is it possible to get a garbage read of the variable in the middle of a write? Or is writing a 32-bit integer an atomic operation? Will it depend on the OS or hardware? What about a 64-bit integer on a 32-bit system?
What about shared memory instead of threading?
Writing a 64-bit integer on a 32-bit system is not atomic, and you could have incorrect data if you don't take a lock.
As an example, if your integer is
0x00000000 0xFFFFFFFF
and you are going to write the next int in sequence, you want to write:
0x00000001 0x00000000
But if you read the value after one of the ints is written and before the other is, then you could read
0x00000000 0x00000000
or
0x00000001 0xFFFFFFFF
which are wildly different than the correct value.
If you want to work without locks, you have to be very certain what constitutes an atomic operation on your OS/CPU/compiler combination.
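The torn values in the example above can be reproduced deterministically by treating the 64-bit integer as two 32-bit halves that are written one at a time. This sketch only simulates the effect in a single thread; combine and observable_values are hypothetical helper names.

```rust
/// Joins two 32-bit halves into one 64-bit value.
pub fn combine(high: u32, low: u32) -> u64 {
    ((high as u64) << 32) | low as u64
}

/// Every value a reader might observe while a writer replaces `old` with `new`
/// one 32-bit half at a time: the old value, the two torn mixes, and the new value.
pub fn observable_values(old: u64, new: u64) -> [u64; 4] {
    let (old_high, old_low) = ((old >> 32) as u32, old as u32);
    let (new_high, new_low) = ((new >> 32) as u32, new as u32);
    [
        combine(old_high, old_low), // before the write
        combine(old_high, new_low), // torn: only the low half written
        combine(new_high, old_low), // torn: only the high half written
        combine(new_high, new_low), // after the write
    ]
}
```

For the values in the example, observable_values(0x0000_0000_FFFF_FFFF, 0x0000_0001_0000_0000) contains 0x0 and 0x1_FFFF_FFFF in addition to the old and new values, neither of which was ever written.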
In addition to the above comments, beware of registers in a slightly more general setting. You may end up updating only the CPU register and not write the value back to main memory right away. Or the other way around: you keep using a copy cached in a register while the original value in memory has been updated. Some languages have a volatile keyword to mark a variable as "read-always-and-never-locally-register-cache".
The memory model of your language is important. It describes exactly under what conditions a given value is shared among several threads. Either this is the rules of the CPU architecture you are executing on, or it is determined by a virtual machine in which the language is running. Java for instance has a separate memory model you can look at to figure out what exactly to expect.
An 8-bit, 16-bit or 32-bit read/write is guaranteed to be atomic if it is aligned to its size (on 486 and later) or unaligned but within a cache line (on P6 and later). Most compilers will guarantee that stack (local, assuming C/C++) variables are aligned.
A 64-bit read/write is guaranteed to be atomic if it is aligned (on Pentium and later), however, this relies on the compiler generating a single instruction (for example, popping a 64-bit float from the FPU or using MMX). I expect most compilers will use two 32-bit accesses for compatibility, though it is certainly possible to check (the disassembly) and it may be possible to coerce different handling.
The next issue is caching and memory fencing. However, the effect of ignoring these is that some threads may see the old value even though it has been updated. The value won't be invalid, simply out of date (by microseconds, probably). If this is critical to your application, you will have to dig deeper, but I doubt it is.
(Source: Intel Software Developer Manual Volume 3A)
It very much depends on hardware and how you are talking to it. If you are writing assembler, you will know exactly what you get as processor manuals will tell you which operations are atomic and under what conditions. For example, in the Intel Pentium, 32-bit reads are atomic if the address is aligned, but not otherwise.
If you are working on any level above that, it will depend on how that ultimately gets translated into machine code. Be that a compiler, interpreter, or virtual machine.
The platform you run on determines the size of atomic reads/writes. Generally, a 32-bit (register) platform only supports 32-bit atomic operations. So, if you are writing more than 32 bits, you will probably have to use some other mechanism to coordinate access to that shared data.
One mechanism is to double or triple buffer the actual data and use a shared index to determine the "latest" version:
write(blah)
{
    new_index = ...; // find a free entry in the global_data array
    global_data[new_index] = blah;
    WriteBarrier(); // write-release
    global_index = new_index;
}

read()
{
    read_index = global_index;
    ReadBarrier(); // read-acquire
    return global_data[read_index];
}
You need the memory barriers to ensure that you don't read from global_data[...] until after you read global_index and you don't write to global_index until after you write to global_data[...].
This is a little awful since you can also run into the ABA issue with preemption, so don't use this directly.
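In Rust the same publish-by-index idea might look roughly like the sketch below, assuming a single writer, two slots, and a 64-bit value stored as two 32-bit atomic halves. The barriers become Release and Acquire orderings on the index. The Published type is a made-up name, and the sketch shares the weakness described above: a reader that is preempted while the writer publishes twice can still see a mixed value.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

/// Two slots of [high, low] halves plus the index of the latest complete slot.
pub struct Published {
    slots: [[AtomicU32; 2]; 2],
    index: AtomicUsize,
}

impl Published {
    pub const fn new() -> Self {
        Published {
            slots: [
                [AtomicU32::new(0), AtomicU32::new(0)],
                [AtomicU32::new(0), AtomicU32::new(0)],
            ],
            index: AtomicUsize::new(0),
        }
    }

    /// Must only be called from one writer thread.
    pub fn write(&self, value: u64) {
        // Pick the slot that readers are not being directed to.
        let new_index = 1 - self.index.load(Ordering::Relaxed);
        self.slots[new_index][0].store((value >> 32) as u32, Ordering::Relaxed);
        self.slots[new_index][1].store(value as u32, Ordering::Relaxed);
        // Write-release: the slot contents become visible before the index does.
        self.index.store(new_index, Ordering::Release);
    }

    pub fn read(&self) -> u64 {
        // Read-acquire: the slot is read only after the index has been read.
        let read_index = self.index.load(Ordering::Acquire);
        let high = self.slots[read_index][0].load(Ordering::Relaxed) as u64;
        let low = self.slots[read_index][1].load(Ordering::Relaxed) as u64;
        (high << 32) | low
    }
}
```

Using atomics for the halves keeps the sketch free of data races in the language sense; it does not by itself remove the slot-reuse hazard, which would need more slots or a sequence counter.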
Platforms often provide atomic read/write access (enforced at the hardware level) to primitive values (32-bit or 64-bit, as in your example) - see the Interlocked* APIs on Windows.
This can avoid the use of a heavier weight lock for threadsafe variable or member access, but should not be mixed up with other types of lock on the same instance or member. In other words, don't use a Mutex to mediate access in one place and use Interlocked* to modify or read it in another.
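For the single-writer timestamp offset from the question, the portable counterpart of such platform APIs in Rust is the std::sync::atomic module. The sketch below is one possible shape, not a recommendation for every platform: TIME_OFFSET, store_offset and timestamp are hypothetical names, and AtomicI64 is not available on every 32-bit target.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Difference between local time and the other time base (illustrative unit: ms).
static TIME_OFFSET: AtomicI64 = AtomicI64::new(0);

/// Called by the single writer thread, e.g. once per second.
pub fn store_offset(offset: i64) {
    // The store is indivisible: readers see the old or the new offset, never a mix.
    TIME_OFFSET.store(offset, Ordering::Relaxed);
}

/// Called by any number of reader threads without taking a lock.
pub fn timestamp(local_time: i64) -> i64 {
    local_time + TIME_OFFSET.load(Ordering::Relaxed)
}
```

Relaxed ordering is enough here because the offset does not guard any other data; a reader may briefly see the previous offset, which the question explicitly accepts.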

Resources