Are read_volatile and write_volatile atomic for usize? - rust

I want to use read_volatile and write_volatile for IPC using shared memory. Is it guaranteed that writing of an unsigned integer of usize type will be atomic?

At the time of this writing, Rust does not have a formally defined memory model; in practice it uses the one imposed by LLVM, which is basically that of C++, which in turn is inherited from C. So the best reference for what is guaranteed when doing memory stuff is the C one.
In C, volatile should not be used for synchronization; its intended use is for memory-mapped I/O and perhaps for single-threaded signal handlers. See for example this Linux-kernel specific guideline, or this other description of volatile:
This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution.
If you want concurrent access to a value you should use atomic operations. They provide the volatile guarantee plus additional ones: they are guaranteed to be atomic even in the presence of concurrent access, and moreover they allow you to specify the ordering mode.
For your particular case you should use AtomicUsize. Note that the availability of that type is conditional on your architecture having the necessary support, but that is exactly what you want.
Note that an AtomicUsize has the same memory layout as a plain usize, so if you have a usize embedded in a shared struct you can access it atomically with a pointer cast. I think this code is sound:
use std::sync::atomic::{AtomicUsize, Ordering};

struct SharedData {
    // ...
    x: usize,
}

fn test(data: *mut SharedData) {
    // Reinterpret the `usize` field as an `AtomicUsize`; the layouts match.
    let x = unsafe { &*(&(*data).x as *const usize as *const AtomicUsize) };
    let _ = x.load(Ordering::Relaxed);
}
Although you would be better off just declaring that x as an AtomicUsize directly.
Also note that reading or writing that value using any non-atomic operation (even just reading it out of curiosity, even using volatile access) invokes Undefined Behavior.
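For completeness, a minimal sketch of that preferred layout (same placeholder names as above):

use std::sync::atomic::{AtomicUsize, Ordering};

struct SharedData {
    // ...
    x: AtomicUsize,
}

fn test(data: &SharedData) {
    // Atomic operations take `&self`, so no pointer cast or `unsafe` is needed.
    data.x.store(42, Ordering::Relaxed);
    let _ = data.x.load(Ordering::Relaxed);
}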

What does "uninitialized" mean in the context of FFI?

I'm writing some GPU code for macOS using the metal crate. In doing so, I allocate a Buffer object by calling:
let buffer = device.new_buffer(num_bytes, MTLResourceOptions::StorageModeShared)
This FFIs to Apple's Metal API, which allocates a region of memory that both the CPU and GPU can access and the Rust wrapper returns a Buffer object. I can then get a pointer to this region of memory by doing:
let data = buffer.contents() as *mut u32
In the colloquial sense, this region of memory is uninitialized. However, is this region of memory "uninitialized" in the Rust sense?
Is this sound?
let num_bytes = num_u32 * std::mem::size_of::<u32>();
let buffer = device.new_buffer(num_bytes, MTLResourceOptions::StorageModeShared);
let data = buffer.contents() as *mut u32;
let as_slice = unsafe { slice::from_raw_parts_mut(data, num_u32) };
for i in as_slice {
    *i = 42u32;
}
Here I'm writing u32s to a region of memory returned to me by FFI. From the nomicon:
...The subtle aspect of this is that usually, when we use = to assign to a value that the Rust type checker considers to already be initialized (like x[i]), the old value stored on the left-hand side gets dropped. This would be a disaster. However, in this case, the type of the left-hand side is MaybeUninit<Box<u32>>, and dropping that does not do anything! See below for some more discussion of this drop issue.
None of the from_raw_parts rules are violated and u32 doesn't have a drop method.
Nonetheless, is this sound?
Would reading from the region (as u32s) before writing to it be sound (nonsense values aside)? The region of memory is valid and u32 is defined for all bit patterns.
Best practices
Now consider a type T that does have a drop method (and you've done all the bindgen and #[repr(C)] nonsense so that it can go across FFI boundaries).
In this situation, should one:
Initialize the buffer in Rust by scanning the region with pointers and calling .write()?
Do:
let as_slice = unsafe { slice::from_raw_parts_mut(data as *mut MaybeUninit<T>, num_t) };
for i in as_slice {
    *i = unsafe { MaybeUninit::new(T::new()).assume_init() };
}
Furthermore, after initializing the region, how does the Rust compiler remember this region is initialized on subsequent calls to .contents() later in the program?
Thought experiment
In some cases, the buffer is the output of a GPU kernel and I want to read the results. All the writes occurred in code outside of Rust's control, and when I call .contents(), the region of memory contains the correct uint32_t values. This thought experiment should convey my concern.
Suppose I call C's malloc, which returns an allocated buffer of uninitialized data. Reading u32 values from this buffer (with pointers properly aligned and in bounds) should fall squarely into undefined behavior.
However, suppose I instead call calloc, which zeroes the buffer before returning it. If you don't like calloc, then suppose I have an FFI function that calls malloc, explicitly writes zeroed uint32_t values in C, then returns the buffer to Rust. This buffer is initialized with valid u32 bit patterns.
From Rust's perspective, does malloc return "uninitialized" data while calloc returns initialized data?
If the cases are different, how would the Rust compiler know the difference between the two with respect to soundness?
There are multiple parameters to consider when you have an area of memory:
The size of it is the most obvious.
Its alignment is still somewhat obvious.
Whether or not it's initialized -- and notably, for types like bool whether it's initialized with valid values as not all bit-patterns are valid.
Whether it's concurrently read/written.
Focusing on the trickier aspects, the recommendation is:
If the memory is potentially uninitialized, use MaybeUninit.
If the memory is potentially concurrently read/written, use a synchronization method -- be it a Mutex or AtomicXXX or ....
And that's it. Doing so will always be sound, no need to look for "excuses" or "exceptions".
Hence, in your case:
use std::mem::MaybeUninit;
use std::slice;

let num_bytes = num_u32 * std::mem::size_of::<u32>();
assert!(num_bytes <= isize::MAX as usize);
let buffer = device.new_buffer(num_bytes, MTLResourceOptions::StorageModeShared);
let data = buffer.contents() as *mut MaybeUninit<u32>;
// Safety:
// - `data` is valid for reads and writes.
// - `data` points to `num_u32` elements.
// - Access to `data` is exclusive for the duration.
// - `num_u32 * size_of::<u32>() <= isize::MAX`.
let as_slice = unsafe { slice::from_raw_parts_mut(data, num_u32) };
for i in as_slice {
    i.write(42); // Yes, you can write `*i = MaybeUninit::new(42);` too, but why would you?
}
// OR, with the nightly `maybe_uninit_write_slice` feature (replacing the loop):
MaybeUninit::write_slice(as_slice, some_slice_of_u32s);
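Once the loop has run, every element is initialized, so you can reinterpret the region; a minimal sketch, reusing data and num_u32 from above:

// Safety: all `num_u32` elements were just initialized above.
let init_slice = unsafe { slice::from_raw_parts_mut(data as *mut u32, num_u32) };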
This is very similar to this post on the users forum mentioned in the comment on your question.
The answers there aren't the most organized, but it seems like there are four main issues with uninitialized memory:
Rust assumes it is initialized
Rust assumes the memory is a valid bit pattern for the type
The OS may overwrite it
Security vulnerabilities from reading freed memory
For #1, this seems to me not to be an issue: if there were another version of the FFI function that returned initialized memory instead of uninitialized memory, it would look identical to Rust.
I think most people understand #2, and that's not an issue for u32.
#3 could be a problem, but since this is for a specific OS, you may be able to ignore it if macOS guarantees it does not do this.
#4 may or may not be undefined behavior, but it is highly undesirable. This is why you should treat the memory as uninitialized even if Rust thinks it's a list of valid u32s. You don't want Rust to think it's valid. Therefore, you should use MaybeUninit even for u32.
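For instance, a minimal sketch of reading GPU results through MaybeUninit (reusing buffer and num_u32 from the question; the assume_init is only justified once you know, via synchronization with the GPU, that the kernel has finished writing):

use std::mem::MaybeUninit;
use std::slice;

let data = buffer.contents() as *const MaybeUninit<u32>;
let results = unsafe { slice::from_raw_parts(data, num_u32) };
// Only claim a value is initialized at the point where you know it actually is:
let first: u32 = unsafe { results[0].assume_init() };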
MaybeUninit
It's correct to cast the pointer to a slice of MaybeUninit<T>. Your example isn't written correctly, though: assume_init returns T, and you can't assign that to an element of a [MaybeUninit<T>]. Fixed:
let as_slice = unsafe { slice::from_raw_parts_mut(data as *mut MaybeUninit<T>, num_t) };
for i in as_slice {
    i.write(T::new());
}
Then, turning that slice of MaybeUninit into a slice of T:
let init_slice = unsafe { &mut *(as_slice as *mut [MaybeUninit<T>] as *mut [T]) };
Another issue is that &mut may not be correct to have at all here, since you say the memory is shared between the GPU and the CPU. Rust depends on your Rust code being the only thing that can access data behind a &mut, so you need to ensure any &mut are gone while the GPU accesses the memory. If you want to interleave Rust access and GPU access, you need to synchronize them somehow, and only hold a *mut while the GPU has access (or reacquire it from FFI).
Notes
The code is mainly taken from Initializing an array element-by-element in the MaybeUninit doc, plus the very useful Alternatives section from transmute. The conversion from &mut [MaybeUninit<T>] to &mut [T] is how slice_assume_init_mut is written as well. You don't need to transmute like in the other examples since it is behind a pointer. Another similar example is in the nomicon: Unchecked Uninitialized Memory. That one accesses the elements by index, but it seems like doing that, using * on each &mut MaybeUninit<T>, and calling write are all valid. I used write since it's shortest and is easy to understand. The nomicon also says that using ptr methods like write is also valid, which should be equivalent to using MaybeUninit::write.
There are some nightly methods on [MaybeUninit<T>] slices that will be helpful in the future, like slice_assume_init_mut.

Why does a boolean need to be atomic?

In rust, there is such a thing as an AtomicBool. It is defined as:
A boolean type which can be safely shared between threads.
I understand that if you're using a boolean to implement a thread lock, to be used from multiple threads to control access to a resource, doing something like:
# Acquire the lock
if thread_lock == false:
    thread_lock = true
    ...
# Release the lock
thread_lock = false
Is definitely not thread safe. Both threads can read the thread_lock variable at the same time, see that it's unlocked (false), set it to true, and both think they have exclusive access to the resource.
With a proper thread lock, you need a boolean where, when you try to set it, one of two things will happen:
Trying to acquire a lock can fail if another thread already has a lock
Trying to acquire a lock will block until no other threads have a lock
I don't know if Rust has a concept like this, but I know Python's threading.Lock does exactly that.
As far as I can tell, this is NOT the scenario that an AtomicBool addresses. An AtomicBool has a load() method and a store() method. Neither returns a Result<bool> type (implying the operation can't fail), and as far as I can tell, neither does any kind of blocking.
What exactly does an AtomicBool protect us from? Why can we not use a regular bool from different threads (other than the fact that the compiler won't let us)?
The only thing I can think of is that when one thread is writing the bits into memory, another might try to read those bits at the same time. A bool is 8 bits. If 4 of the 8 bits were written when the other thread tries to read the data, the data read will be 4 bits of the old value, and 4 bits of the new value. Is this the problem being addressed? Can this happen? It doesn't seem like even in that scenario, a bool would need to be atomic, since of the 8 bits, only one bit matters, which will either be a 0 or a 1.
What exactly does an AtomicBool protect us from? Why can we not use a regular bool from different threads (other than the fact that the compiler won't let us)?
Anything that might go wrong, whether you can think of it or not. I hate to follow this up with something I can think of, because it doesn't matter: the rules say it's not guaranteed to work, and that should end it. Believing it can't fail just because you can't think of a way for it to fail is simply wrong.
But here's one way:
# Release the lock
thread_lock = false
Say this particular CPU doesn't have a particularly good way to set a boolean to false without using a register, but does have a good single operation that negates a boolean and tests whether it's zero, without using a register. On this CPU, under register pressure, this might get optimized to:
1. Negate thread_lock and test if it's zero.
2. If the original value of thread_lock was false, negate thread_lock again.
What happens if, in between steps 1 and 2, another thread observes thread_lock to be true, even though it was false going into this operation and will be false when it's done?
The thread lock in Rust is Mutex. It is typically used to provide multi-threaded mutable access to a value (which is usually why you want to lock between threads), but you can also lock on an empty tuple, Mutex<()>, to lock on nothing. I can't think of many good reasons to lock threads without locking a particular value, though. For example, if you want to write to a log file from multiple threads, you might want a Mutex<fs::File>, like this:
use std::fs;
use std::io::Write;
use std::sync::{Arc, Mutex};
use std::thread;

let file = Arc::new(Mutex::new(fs::File::create("write.log")?));
for _ in 0..10 {
    let file = Arc::clone(&file);
    thread::spawn(move || {
        // do other stuff
        let mut guard = file.lock().unwrap();
        guard.write_all(b"stuff").unwrap();
        drop(guard);
        // do other stuff
    });
}
For atomic values, the most important primitives are usually not load and store but compare_exchange and friends. Atomics can be thought of as "lightweight" mutexes that only contain primitive data, where you perform the whole operation you want in a single call instead of acquiring and releasing a lock in two separate operations. Furthermore, mutexes can actually be implemented on top of an AtomicBool if the operating system doesn't provide one, like the following code:
use std::sync::atomic::{AtomicBool, Ordering};

struct MyMutex(AtomicBool);

impl MyMutex {
    fn try_lock(&self) -> Result<(), ()> {
        // `compare_exchange` takes a success ordering and a failure ordering.
        match self.0.compare_exchange(false, true, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => Ok(()),   // we have acquired the lock
            Err(_) => Err(()), // someone else is holding the lock
        }
    }

    fn release(&self) {
        self.0.store(false, Ordering::Release);
    }
}
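A sketch of how you might use it as a spinlock (std::hint::spin_loop just tells the CPU we are busy-waiting):

let mutex = MyMutex(AtomicBool::new(false));
while mutex.try_lock().is_err() {
    std::hint::spin_loop(); // spin until the holder calls release()
}
// ... critical section ...
mutex.release();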
You can share any value that is Sync from multiple threads, provided that you can deal with the lifetime properly. For example, the following compiles without any unsafe code:
use std::thread;

fn do_something() {}
fn do_something_else() {}

fn process(b: &'static bool) {
    if *b { do_something() } else { do_something_else() }
}

fn main() {
    let boxed = Box::new(true);
    let refed: &'static bool = Box::leak(boxed);
    for _ in 0..10 {
        thread::spawn(move || process(refed));
    }
}
You can also share non-'static data given the right tools, such as by wrapping it in an Arc, etc.
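For example, a minimal sketch with an Arc (no leaking required):

use std::sync::Arc;
use std::thread;

let shared = Arc::new(true);
for _ in 0..10 {
    let b = Arc::clone(&shared);
    thread::spawn(move || {
        // Each clone keeps the allocation alive as long as its thread needs it.
        if *b { /* ... */ }
    });
}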
A bool is 8 bits. If 4 of the 8 bits were written when the other thread tries to read the data, the data read will be 4 bits of the old value, and 4 bits of the new value.
This cannot happen in Rust. Rust enforces ownership and borrowing very strictly. You can't even have two mutable references to the same value on the same thread, much less on different threads.
Multiple mutable references to the same value is always Undefined Behaviour in Rust; there are no exceptions to this strict rule. By declaring that a reference is mutable, the compiler is allowed to do various optimizations on your code assuming that we are the unique place that can read/write the value; not other threads, not other functions, not even other variables (if a: &mut bool and let b = &mut *a, you can't use a before b is dropped). You will have much worse problems than writing different bits concurrently if you have multiple mutable pointers.
(By the way, "writing bits" to the same value is not a correct way of thinking it; it's much more complicated than "writing bits" in modern CPUs even without Rust's borrow checking rules)
TL;DR: If you don't have the unsafe keyword anywhere in your code, you don't need to worry about data races (though higher-level race conditions, like deadlocks, are still possible). Rust is a memory-safe language where memory bugs are mostly caught at compile time.

Multiprocessing atomics as a spinlock in Rust?

I would like to use a spinlock in my code which will be used by different processes. Since Rust has atomics for multithreading with swap operations that could be useful for spinlocks, I would like to know:
Can I use atomics from Rust in a shared memory between processes while keeping the provided safety guarantees?
First I need to know how to create an instance inside shared memory, but more importantly:
Is it possible to do something like this to reuse an atomic in a second process?
use std::sync::atomic::{AtomicU8, Ordering};

fn use_atomic(ptr_value: u64) {
    // given: some memory pointer
    // e.g. ptr_value == 0xDEADBEEFu64
    let atomic = unsafe { &*(ptr_value as *mut AtomicU8) };
    let _old_value = atomic.swap(1, Ordering::Relaxed);
}
In case this is a bad idea: is there a better way to do this in Rust? (I'm not used to assembly, but maybe there is some finished code I missed)

Does reading or writing a whole 32-bit word, even though we only have a reference to a part of it, result in undefined behaviour?

I'm trying to understand what exactly the Rust aliasing/memory model allows. In particular I'm interested in when accessing memory outside the range you have a reference to (which might be aliased by other code on the same or different threads) becomes undefined behaviour.
The following examples all access memory outside what is ordinarily allowed, but in ways that would be safe if the compiler produced the obvious assembly code. In addition, I see little conflict potential with compiler optimization, but they might still violate strict aliasing rules of Rust or LLVM thus constituting undefined behavior.
The operations are all properly aligned and thus cannot cross a cache-line or page boundary.
Read the aligned 32-bit word surrounding the data we want to access and discard the parts outside of what we're allowed to read.
Variants of this could be useful in SIMD code.
pub fn read(x: &u8) -> u8 {
    let pb = x as *const u8;
    let pw = ((pb as usize) & !3) as *const u32;
    let w = unsafe { *pw }.to_le();
    (w >> ((pb as usize) & 3) * 8) as u8
}
Same as 1, but reads the 32-bit word using an atomic_load intrinsic.
pub fn read_vol(x: &u8) -> u8 {
    let pb = x as *const u8;
    let pw = ((pb as usize) & !3) as *const AtomicU32;
    let w = unsafe { (&*pw).load(Ordering::Relaxed) }.to_le();
    (w >> ((pb as usize) & 3) * 8) as u8
}
Replace the aligned 32-bit word containing the value we care about using CAS. It overwrites the parts outside what we're allowed to access with what's already in there, so it only affects the parts we're allowed to access.
This could be useful to emulate small atomic types using bigger ones. I used AtomicU32 for simplicity, in practice AtomicUsize is the interesting one.
pub fn write(x: &mut u8, value: u8) {
    let pb = x as *const u8;
    let atom_w = unsafe { &*(((pb as usize) & !3) as *const AtomicU32) };
    let mut old = atom_w.load(Ordering::Relaxed);
    loop {
        let shift = ((pb as usize) & 3) * 8;
        // Clear the target byte, then OR in the new value; the other bytes are preserved.
        let new = u32::from_le((old.to_le() & !(0xFF_u32 << shift)) | ((value as u32) << shift));
        match atom_w.compare_exchange_weak(old, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => break,
            Err(x) => old = x,
        }
    }
}
This is a very interesting question.
There are actually several issues with these functions, making them unsound (i.e., not safe to expose) for various formal reasons.
At the same time, I am unable to actually construct a problematic interaction between these functions and compiler optimizations.
Out-of-bounds accesses
I'd say all of these functions are unsound because they can access unallocated memory. Each of them I can call with a &*Box::new(0u8) or &mut *Box::new(0u8), resulting in out-of-bounds accesses, i.e. accesses beyond what was allocated using malloc (or whatever allocator). Neither C nor LLVM permit such accesses. (I'm using the heap because I find it easier to think about allocations there, but the same applies to the stack where every stack variable is really its own independent allocation.)
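Concretely, the problematic call looks like this (read is the function from the question):

let b = Box::new(0u8); // a one-byte heap allocation
let _ = read(&*b);     // loads an aligned 4-byte word: up to 3 bytes are out of bounds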
Granted, the LLVM language reference doesn't actually define when a load has undefined behavior due to the access not being inside the object. However, we can get a hint in the documentation of getelementptr inbounds, which says:
The in bounds addresses for an allocated object are all the addresses that point into the object, plus the address one byte past the end.
I am fairly certain that being in bounds is a necessary but not sufficient requirement for actually using an address with load/store.
Note that this is independent of what happens on the assembly level; LLVM will do optimizations based on a much higher-level memory model that argues in terms of allocated blocks (or "objects" as C calls them) and staying within the bounds of these blocks.
C (and Rust) are not assembly, and it is not possible to use assembly-based reasoning on them.
Most of the time it is possible to derive contradictions from assembly-based reasoning (see e.g. this bug in LLVM for a very subtle example: casting a pointer to an integer and back is not a NOP).
This time, however, the only examples I can come up with are fairly far-fetched: For example, with memory-mapped IO, even reads from a location could "mean" something to the underlying hardware, and there could be such a read-sensitive location sitting right next to the one that's passed into read.
But really I don't know much about this kind of embedded/driver development, so this may be entirely unrealistic.
(EDIT: I should add that I am not an LLVM expert. Probably the llvm-dev mailing list is a better place to determine if they are willing to commit to permitting such out-of-bounds accesses.)
Data races
There is another reason at least some of these functions are not sound: Concurrency. You clearly already saw this coming, judging from the use of concurrent accesses.
Both read and read_vol are definitely unsound under the concurrency semantics of C11. Imagine x is the first element of a [u8], and another thread is writing to the second element at the same time as we execute read/read_vol. Our read of the whole 32-bit word overlaps with the other thread's write. This is a classical "data race": two threads accessing the same location at the same time, one access being a write, and one access not being atomic. Under C11, any data race is UB, so we are out. LLVM is slightly more permissive, so both read and read_vol are probably allowed, but right now Rust declares that it uses the C11 model.
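To make that scenario concrete, here is a sketch of the racy interaction (this program has exactly the data race described, i.e. it is UB; the #[repr(align(4))] wrapper just guarantees both bytes share one aligned word, and std::thread::scope is only there to make the borrows compile):

#[repr(align(4))]
struct Aligned([u8; 4]);

let mut arr = Aligned([0u8; 4]);
let (first, rest) = arr.0.split_at_mut(1);
std::thread::scope(|s| {
    s.spawn(|| rest[0] = 1);          // non-atomic write to the second byte...
    s.spawn(|| { read(&first[0]); }); // ...overlaps `read`'s whole-word load: data race
});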
Also note that "vol" is a bad name (assuming you meant this as short-hand for "volatile") -- in C, atomicity has nothing to do with volatile! It is literally impossible to write correct concurrent code when using volatile and not atomics. Unfortunately, Java's volatile is about atomicity, but that's a very different volatile than the one in C.
And finally, write also introduces a data race between an atomic read-modify-update and a non-atomic write in the other thread, so it is UB in C11 as well. And this time it is also UB in LLVM: Another thread could be reading from one of the extra locations that write affects, so calling write would introduce a data race between our writing and the other thread's reading. LLVM specifies that in this case, the read returns undef. So, calling write can make safe accesses to the same location in other threads return undef, and subsequently trigger UB.
Do we have any examples of issues caused by these functions?
The frustrating part is, while I found multiple reasons to rule out your functions following the spec(s), there seems to be no good reason that these functions are ruled out! The read and read_vol concurrency issues are fixed by LLVM's model (which however has other problems, compared to C11), but write is illegal in LLVM just because read-write data races make the read return undef -- and in this case we know we are writing the same value that was already stored in these other bytes! Couldn't LLVM just say that in this special case (writing the value that's already there), the read must return that value? Probably yes, but this stuff is subtle enough that I would also not be surprised if that invalidates some obscure optimization.
Moreover, at least on non-embedded platforms the out-of-bounds accesses done by read are unlikely to cause actual trouble. I guess one could imagine a semantics which returns undef when reading an out-of-bounds byte that is guaranteed to sit on the same page as an in-bounds byte. But that would still leave write illegal, and that is a really tough one: write can only be allowed if the memory on these other locations is left absolutely unchanged. There could be arbitrary data sitting there from other allocations, parts of the stack frame, whatever. So somehow the formal model would have to let you read those other bytes, not allow you to gain anything by inspecting them, but also verify that you are not changing the bytes before writing them back with a CAS. I'm not aware of any model that would let you do that. But I thank you for bringing these nasty cases to my attention, it's always good to know that there is still plenty of stuff left to research in terms of memory models :)
Rust's aliasing rules
Finally, what you were probably wondering about is whether these functions violate any of the additional aliasing rules that Rust adds. The trouble is, we don't know -- these rules are still under development. However, all the proposals I have seen so far would indeed rule out your functions: When you hold an &mut u8 (say, one that points right next to the one that's passed to read/read_vol/write), the aliasing rules provide a guarantee that no access whatsoever will happen to that byte by anyone but you. So, your functions reading from memory that others could hold a &mut u8 to already makes them violate the aliasing rules.
However, the motivation for these rules is to conform with the C11 concurrency model and LLVM's rules for memory access. If LLVM declares something UB, we have to make it UB in Rust as well unless we are willing to change our codegen in a way that avoids the UB (and typically sacrifices performance). Moreover, given that Rust adopted the C11 concurrency model, the same holds true for that. So for these cases, the aliasing rules really don't have any choice but make these accesses illegal. We could revisit this once we have a more permissive memory model, but right now our hands are bound.

Is it possible to have safe mutable aliasing to non-overlapping memory?

I'm looking for a way to take a large object and break it into smaller mutable child objects, which can be processed in parallel.
Something like:
struct PixelBuffer { data: Vec<u32>, width: u32, height: u32 }
struct PixelBlock { data: Vec<u32> }
impl PixelBuffer {
    fn decompose<'a>(&'a mut self) -> Vec<Guard<'a, PixelBlock>> {
        ...
    }
}
Where the resulting PixelBlocks can be processed in parallel, and the parent PixelBuffer will remain locked until all Guard<PixelBlock> are dropped.
This is effectively mutable pointer aliasing; the large data block in PixelBuffer will be directly modified via each PixelBlock.
However, each PixelBlock is a non-overlapping segment of the internal data in PixelBuffer.
You can certainly do this in unsafe code (internal buffer is a raw pointer; generate a new external pointer for each PixelBlock); but is it possible to achieve the same result using safe code?
(NB. I'm open to using a data block allocated from libc::malloc if that'll help?)
This works fine and is a natural consequence of how, e.g., iterators work: the next method hands out a sequence of values that are not lifetime-connected to the reference they come from, i.e. fn next(&mut self) -> Option<Self::Item>. This automatically means that any iterator that yields &mut pointers (like slice.iter_mut()) is yielding pointers to non-overlapping memory, because anything else would be incorrect.
One way to use this in parallel is something like my simple_parallel library, e.g. Pool::for_.
(You'll need to give more details about the internals of PixelBuffer to be more specific about how to do it in this case.)
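As a small safe-code illustration of that property, you can collect the &mut references, proving they all coexist:

let mut v = vec![1, 2, 3];
// Each `&mut i32` points at a distinct element, so they can all live at once:
let refs: Vec<&mut i32> = v.iter_mut().collect();
for r in refs {
    *r += 1;
}
assert_eq!(v, [2, 3, 4]);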
There is no way to completely avoid unsafe Rust, because the compiler cannot currently evaluate the safety of sub-slices. However, the standard library contains code that provides a safe wrapper that you can use.
Read up on std::slice::Chunks and std::slice::ChunksMut.
Sample code: https://play.rust-lang.org/?gist=ceec5be3e1530c0a6d3b&version=stable
However, your next problem is sending the slices to separate threads, because the best way to do that would be thread::scoped, which is currently deprecated due to some safety problems that were discovered this year...
Also, keep in mind that Vec<_> owns its contents, whereas slices are just a view. Generally, you want to write most functions in terms of slices, and keep only one "Vec" to hold the data.
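A sketch of that pattern with today's standard library (std::thread::scope is the stable successor to the deprecated thread::scoped, and chunks_mut hands out disjoint mutable sub-slices of one Vec):

let mut data = vec![0u32; 1024];
std::thread::scope(|s| {
    for block in data.chunks_mut(256) {
        // Each `block` is a non-overlapping `&mut [u32]` into `data`.
        s.spawn(move || {
            for px in block {
                *px = 0xFF00_00FF;
            }
        });
    }
});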
