What is uninitialized memory and why isn't it initialized when allocating? - rust

Taking this signature for a method of the GlobalAllocator:
unsafe fn alloc(&self, layout: Layout) -> *mut u8
and this sentence from the method's documentation:
The allocated block of memory may or may not be initialized.
Suppose that we are going to allocate some chunk of memory for an [i32, 10]. Assuming the size of i32 it's 4 bytes, our example array would need 40 bytes for the requested storage.
Now, the allocator found a memory spot that fits our requirements. Some 40 bytes of a memory region... but... what's there? I always read the term garbage data, and assume that it's just old data already stored there by another process, program... etc.
What's unitialized memory? Just data that is not initialized with zeros of with some default value for the type that we want to store there?
Why not always memory it's initialized before returning the pointer? It's too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't comes already initialized?
When some resource it's deallocated, things musn't be pointing to that freed memory. That's that place got zeroed? What really happens when you deallocate some piece of memory?

What's unitialized memory? Just data that is not initialized with zeros of with some default value for the type that we want to store there?
It's worse than either of those. Reading from uninitialized memory is undefined behavior, as in you can no longer reason about a program which does so. Practically, compilers often optimize assuming that code paths that would trigger undefined behavior are never executed and their code can be removed. Or not, depending on how aggressive the compiler is.
If you could reliably read from the pointer, it would contain arbitrary data. It may be zeroes, it may be old data structures, it may be parts of old data structures. It may even be things like passwords and encryption keys, which is another reason why reading uninitialized memory is problematic.
Why not always memory it's initialized before returning the pointer? It's too costly? But the memory must be initialized in order to use it properly and not cause UB. Why then doesn't comes already initialized?
Yes, cost is the issue. The first thing that is typically done after allocating a piece of memory is to write to it. Having the allocator "pre-initialize" the memory is wasteful when the caller is going to overwrite it anyway with the values it wants. This is especially significant with large buffers used for IO or other large storage.
When some resource it's deallocated, things musn't be pointing to that freed memory. That's that place got zeroed? What really happens when you deallocate some piece of memory?
It's up to how the memory allocator is implemented. Most don't waste processing power to clear the data that's been deallocated, since it will be overwritten anyway when it's reallocated. Some allocators may write some bookkeeping data to the freed space. GlobalAllocator is an interface to whatever allocator the system comes with, so it can vary depending on the environment.
I always read the term garbage data, and assume that it's just old data already stored there by another process, program... etc.
Worth noting: all modern desktop OSs have memory isolation between processes - your program cannot access the memory of other processes or the kernel (unless you explicitly share it via specialized functionality). The kernel will clear memory before it assigns it to your process, to prevent leaking sensitive data. But you can see old data from your own process, for the reasons described above.

What you are asking are implementation details that can even vary from run to run. From the perspective of the abstract machine and thus the optimizer they don't matter.
Turning contents of uninitialized memory into almost any type (other than MaybeUninit) is immediate undefined behavior.
let mem: *u8 = unsafe { alloc(...) };
let x: u8 = unsafe { ptr::read(mem) };
if x != x {
print!("wtf");
}
May or may not print, crash or delete the contents of your harddrive, possibly even before reaching that alloc call because the optimizer worked backwards and eliminated the entire code block because it could prove that all execution paths are UB.
This may happen due to assumptions the optimizer relies on, i.e. even when the underlying allocator is well-behaved. But real systems may also behave non-deterministically. E.g. theoretically on a freshly booted embedded system memory might be in an uninitialized state that doesn't reliably return 0 or 1. Or on linux madvise(MADV_FREE) can cause allocations to return inconsistent results over time until initialized.

Related

On Rust, is it possible to alloc a slice of memory, in such a way that the returned pointer fits in a u32?

HVM is a functional runtime which represents pointers as 32-bit values. Its allocator reserves a huge (4 GB) buffer preemptively, which it uses to create internal objects. This is not ideal. Instead, I'd like to use the system allocator, but that's not possible, since it returns 64-bit pointers, which may be larger than the space available to store them. Is there any cross-platform way in Rust to allocate a buffer, such that the pointer to the buffer is guaranteed to fit in an u32? In other words, I'm looking for something akin to:
let ptr = Box::new_with_small_ptr(size);
assert!(ptr as u64 + size < u32::MAX);
There isn't, because it's an extremely niche need and requires a lot of care.
It's not as simple as "just returning a low pointer" - you need to actually allocate that space from the OS. Your entry point into that would be mmap. Be prepared to do some low-level work with MAP_FIXED and reading /proc/self/maps, and also implementing an allocator on top of the memory region you get from mmap.
If your concern is just excess memory usage, note that Linux overcommits memory by default - allocating 4GB of memory won't reserve physical memory unless you actually try to use it all.

In Rust, is there a way to directly read the content of a file into the given uninitialized byte array?

I am looking for a way to directly read the content of a file into the provided uninitialized byte array.
Currently, I have a code like the following:
use std::fs::File;
use std::mem::MaybeUninit;
let buf: MaybeUninit<[u8; 4096]> = MaybeUninit::zeroed();
let f = File::open("some_file")?;
f.read(buf.as_mut_ptr().as_mut().unwrap())?;
The code does work, except that it unnecessarily initializes the byte array with 0. I would like to replace MaybeUninit::zeroed() with MaybeUninit::uninit() but doing so will trigger an undefined behavior according to the document of MaybeUninit. Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
The previous shot at the answer is kept below for posterity. Let's deal with the actual elephant in the room:
Is there a way to initialize an uninitialized memory region with the content of the file without first reading the data to somewhere else, by only using the standard library? Or do we need to go for the OS-specific API?
There is: Read::read_to_end(&mut self, &mut Vec<u8>)
This function will drain your impl Read object, and depending on the underlying implementation will do one or more reads, extending the Vec provided as it goes and appending all bytes to it.
It then returns the number of bytes read. It can also be interrupted, and this error needs to be handled.
You are trying to micro-optimize something based on heuristics you think are the case, when they are not.
The initialization of the array is done in one go as low-level as it can get with memset, all in one chunk. Both calloc and malloc+memset are highly optimized, calloc relies on a trick or two to make it even more performant. Somebody on codereview pitted "highly optimized code" against a naive implementation and lost as a result.
The takeaway is that second-guessing the compiler is typically fraught with issues and, overall, not worth micro-optimizing for unless you can put some real numbers on the issues.
The second takeaway is one of memory logic. As I am sure you are aware, allocation of memory is dramatically faster in some cases depending on the position of the memory you are allocating and the size of the contiguous chunk you are allocating, due to how memory is laid out in atomic units (pages). This is a much more impactful factor, to the point that below the hood, the compiler will often align your memory request to an entire page to avoid having to fragment it, particularly as it gets into L1/L2 caches.
If anything isn't clear, let me know and I'll generate some small benchmarks for you.
Finally, MaybeUninit is not at all the tool you want for the job in any case. The point of MaybeUninit isn't to skip a memset or two, since you will be performing those memsets yourself by having to guarantee (by contract due to assume_init) that those types are sane. There are cases for this, but they're rare.
In larger cases
There is an impact on performance in uninitializing vs. initializing memory, and we're going to show this by taking an absolutely perfect scenario: we're going to make ourselves a 64M buffer in memory and wrap it in a Cursor so we get a Read type. This Read type will have latency far, far inferior to most I/O operations you will encounter in the wild, since it is almost guaranteed to reside entirely in L2 cache during the benchmark cycle (due to its size) or L3 cache (because we're single-threaded). This should allow us to notice the performance loss from memsetting.
We're going to run three versions for each case (the code):
One where we define out buffer as [MaybeUninit::uninit().assume_init(); N], i.e. we're taking N chunks of MaybeUninit<u8>
One where out MaybeUninit is a contiguous N-element long chunk
One where we're just mapping straight into an initialized buffer
The results (on a core i9-9900HK laptop):
large reads/one uninit time: [1.6720 us 1.7314 us 1.7848 us]
large reads/small uninit elements
time: [2.1539 us 2.1597 us 2.1656 us]
large reads/safe time: [2.0627 us 2.0697 us 2.0771 us]
small reads/one uninit time: [4.5579 us 4.5722 us 4.5893 us]
small reads/small uninit elements
time: [5.1050 us 5.1219 us 5.1383 us]
small reads/safe time: [7.9654 us 7.9782 us 7.9889 us]
The results are as expected:
Allocating N MaybeUninit is slower than one huge chunk; this is completely expected and should not come as a surprise.
Small, iterative 4096-byte reads are slower than a huge, single, 128M read even when the buffer only contains 64M
There is a small performance loss in reading using initialized memory, of about 30%
Opening anything else on the laptop while testing causes a 50%+ increase in benchmarked time
The last point is particularly important, and it becomes even more important when dealing with real I/O as opposed to a buffer in memory. The more layers of cache you have to traverse, the more side-effects you get from other processes impacting your own processing. If you are reading a file, you will typically encounter:
The filesystem cache (may or may not be swapped)
L3 cache (if on the same core)
L2 cache
L1 cache
Depending on the level of the cache that produces a cache miss, you're more or less likely to have your performance gain from using uninitialized memory dwarfed by the performance loss in having a cache miss.
So, the (unexpected TL;DR):
Small, iterative reads are slower
There is a performance gain in using MaybeUninit but it is typically an order of magnitude less than any I/O opt

What is the case of using Buffer.allocUnsafe() and Buffer.alloc()?

I am confused about using Buffer.allocUnsafe() and Buffer.alloc() , I know that Buffer.allocUnsafe() creates a buffer with pre-filled data or old buffers, but why do i need such thing if Buffer.alloc() creates a buffer with zero filled data
In Node.js Buffer is an abstraction over RAM, therefore if you allocate it in an unsafe way, there is a high risk of having even some source code in the buffer instance. Try running console.log(Buffer.allocUnsafe(10000).toString('utf-8')) and I guarantee that you will see some code in your stdout.
Allocation is a synchronous operation and we know that single threaded Node.js doesn't really feel good about synchronous stuff. Unsafe allocation is much faster than safe, because the buffer santarization step takes time. Safe allocation is, well, safe, but there is a performance trade off.
I'd suggest sticking to safe allocation first and if you end up with low performance, you can think of ways to implement unsafe allocation, without exposing private stuff. Just keep in mind that allocUnsafe method has the word unsafe for a reason. E.g, if you are going to pass some compliance certification like PCI DSS, I'm pretty sure QSA will notice that and will have a lot of questions.
Buffer.alloc(size, fill, encoding) -> returns a new initialized Buffer
of the specified size. This method is slower than Buffer.allocUnsafe(size) but guarantees that newly created Buffer instances never contain old data that is potentially sensitive.
Buffer.allocUnsafe(size) -> the Buffer is uninitialized, the allocated
segment of memory might contain old data that is potentially
sensitive. Using a Buffer created by Buffer.allocUnsafe() without completely overwriting the memory can allow this old data to be leaked when the Buffer memory is read.
Note: While there are clear performance advantages to using Buffer.allocUnsafe(), extra care must be taken in order to avoid introducing security vulnerabilities into an application

mmap(): resetting old memory to a zero'd non-resident state

I'm writing a memory allocation routine, and it's currently running smoothly. I get my memory from the OS with mmap() in 4096-byte pages. When I start my memory allocator I allocate 1gig of virtual address space with mmap(), and then as allocations are made I divide it up into hunks according to the specifics of my allocation algorithm.
I feel safe allocating as much as a 1gig of memory on a whim because I know mmap() doesn't actually put pages into physical memory until I actually write to them.
Now, the program using my allocator might have a spurt where it needs a lot of memory, and in this case the OS would have to eventually put a whole 1gig worth of pages into physical RAM. The trouble is that the program might then go into a dormant period where it frees most of that 1gig and then uses only minimal amounts of memory. Yet, all I really do inside of my allocator's MyFree() function is to flip a few bits of bookkeeping data which mark the previously used gig as free, but I know this doesn't cause the OS remove those pages from physical memory.
I can't use something like munmap() to fix this problem, because the nature of the allocation algorithm is such that it requires a continuous region of memory without any holes in it. Basically I need a way to tell the OS "Listen, you can take these pages out of physical memory and clear them to 0, but please remap them on the fly when I need them again, as if they were freshly mmap()'d"
What would be the best way to go about this?
Actually, after writing this all up I just realized that I can probably do an munmap() followed immediately by a fresh mmap(). Would that be the correct way to go about? I get the sense that there's probably some more efficient way to do this.
You are looking for madvise(addr, length, MADV_DONTNEED). From the manpage:
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
Note especially the language about how subsequent accesses will succeed but revert to zero-fill-on-demand (for mappings without an underlying file).
Your thinking-out-loud alternative of an munmap followed immediately by another mmap will also work but risks kernel-side inefficiencies because it is no longer tracking the allocation a single contiguous region; if there are many such unmap-and-remap events the kernelside data structures might wind up being quite bloated.
By the way, with this kind of allocator it's very important that you use MAP_NORESERVE for the initial allocation, and then touch each page as you allocate it, and trap any resulting SIGSEGV and fail the allocation. (And you'll need to document that your allocator installs a handler for SIGSEGV.) If you don't do this your application will not work on systems that have disabled memory overcommit. See the mmap manpage for more detail.

Against cold boot attacks: how to restrain sensitive information in Haskell

Is there any way to ensure key material gets securely erased from the memory after the program exits? Being able to erase it manually and keep the program running would be even better. As Haskell uses automated garbage collection (which may not happen at all if there is loads of free memory?), I assume that the second task is impossible. Could something that serves the purpose be implemented using FFI?
GHC can return memory to the OS when it is no longer needed, so merely blanking the memory on exit won't achieve your goal. Garbage collection is a complicated business, but there is in general no way to ensure that old copies of your secure data are not returned to the OS memory pool.
However the OS will blank your memory before allocating it to another process. If you don't trust the OS to keep your memory secure then you have a much bigger problem.
I'm not sure what you mean by "unreliable"; Haskell GC is reliable, but the program has comparatively little visibility of what is happening.
However if you are concerned merely with a cryptographic key rather than a big, complicated data structure then life gets a bit better. You can use a Foreign Pointer to point to a memory location for your key, and then make blanking that bit of memory into a part of your finaliser. You can even write a bit of code that allocates a block of memory, mlocks it, and then hands off foreign pointers to key-sized chunks of that memory on request, with finalisers that wipe the key. That would probably do what you want.
The point of a ForeignPtr is that it is guaranteed not to be moved or re-interpreted by the GC.

Resources