References in Rust self-referential structs

Given the code snippet below:
use std::{io::BufWriter, pin::Pin};

pub struct SelfReferential {
    pub writer: BufWriter<&'static mut [u8]>, // borrowed from buffer
    pub buffer: Pin<Box<[u8]>>,
}

#[cfg(test)]
mod tests {
    use std::io::Write;
    use super::*;

    fn init() -> SelfReferential {
        let mut buffer = Pin::new(vec![0; 12].into_boxed_slice());
        let writer = unsafe { buffer.as_mut().get_unchecked_mut() };
        let writer = unsafe { (writer as *mut [u8]).as_mut().unwrap() };
        let writer = BufWriter::new(writer);
        SelfReferential { writer, buffer }
    }

    #[test]
    fn move_works() {
        let mut sr = init();
        sr.writer.write(b"hello ").unwrap();
        sr.writer.flush().unwrap();

        let mut slice = &mut sr.buffer[6..];
        slice.write(b"world!").unwrap();
        assert_eq!(&sr.buffer[..], b"hello world!".as_ref());

        let mut sr_moved = sr;
        sr_moved.writer.write(b"W").unwrap();
        sr_moved.writer.flush().unwrap();
        assert_eq!(&sr_moved.buffer[..], b"hello World!".as_ref());
    }
}
The first question: is it OK to assign a 'static lifetime to the mutable slice reference in BufWriter? Technically speaking, it's bound to the lifetime of the struct instance itself, and AFAIK there's no safe way to invalidate it.
The second question: besides the fact that unsafe instantiation of this type, as in the test example, creates two mutable references into the underlying buffer, are there any other potential dangers associated with such an "unidiomatic" (for lack of a better word) type?

is it OK to assign a 'static lifetime to the mutable slice reference in BufWriter?
Sort of, but there's a bigger problem. The lifetime itself is not worse than any other choice, because there is no lifetime that you can use here which is really accurate. But it is not safe to expose that reference, because then it can be taken:
let w: BufWriter<&'static mut [u8]> = {
    let sr = init();
    sr.writer
};
// `sr.buffer` has now been dropped, so `w` has a dangling reference
are there any other potential dangers associated with such an "unidiomatic" (for lack of a better word) type?
Yes, it's undefined behavior. Box isn't just managing an allocation; it also (currently) signals a claim of unique, non-aliasing access to its contents. You violate that non-aliasing by creating the writer and then moving the buffer: even though the heap memory is not actually touched, the move of buffer counts as invalidating all references into it.
This is an area of Rust semantics which is not yet fully nailed down, but as far as the current compiler is concerned, this is UB. You can see this if you run your test code under the Miri interpreter (for example, with cargo +nightly miri test).
The good news is, what you're trying to do is a very common desire and people have worked on the problem. I personally recommend using ouroboros — with the help of a macro, it allows you to create the struct you want without writing any new unsafe code. There will be some restrictions on how you use the writer, but nothing you can't tidy out of the way by writing an impl io::Write for SelfReferential. Another, newer library in this space is yoke; I haven't tried it.
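If the end goal is simply "a buffered writer over a buffer the struct owns", there is also a design that avoids self-reference entirely. The following is a minimal sketch of that alternative (my own illustration, not the ouroboros approach recommended above; the OwnedBufWriter name and its methods are made up for this example): own the buffer plus a write position and implement io::Write directly, so no field borrows from a sibling field and moving the struct is trivially fine.

use std::io::{self, Write};

pub struct OwnedBufWriter {
    buffer: Box<[u8]>,
    pos: usize,
}

impl OwnedBufWriter {
    pub fn new(size: usize) -> Self {
        OwnedBufWriter { buffer: vec![0; size].into_boxed_slice(), pos: 0 }
    }

    pub fn buffer(&self) -> &[u8] {
        &self.buffer
    }
}

impl Write for OwnedBufWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        // Write into the remaining tail of the buffer; &mut [u8] itself
        // implements Write, so this reuses std's bounds handling.
        let mut tail = &mut self.buffer[self.pos..];
        let written = tail.write(data)?;
        self.pos += written;
        Ok(written)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut w = OwnedBufWriter::new(12);
    w.write_all(b"hello ")?;
    let mut moved = w; // moving is fine: nothing borrows a sibling field
    moved.write_all(b"world!")?;
    assert_eq!(moved.buffer(), b"hello world!".as_ref());
    Ok(())
}

This trades BufWriter's internal buffering for a plain write cursor, which is usually all that is wanted when the destination is already an in-memory buffer.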

Related

Can I implement Index/IndexMut for a type that has locked data?

I've got a struct that contains some locked data. The real world is complex, but here's a minimal example (or as minimal as I can make it):
use std::fmt::Display;
use std::ops::{Index, IndexMut};
use std::sync::Mutex;

struct LockedVector<T> {
    stuff: Mutex<Vec<T>>,
}

impl<T> LockedVector<T> {
    pub fn new(v: Vec<T>) -> Self {
        LockedVector {
            stuff: Mutex::new(v),
        }
    }
}

impl<T> Index<usize> for LockedVector<T> {
    type Output = T;

    fn index(&self, index: usize) -> &Self::Output {
        todo!()
    }
}

impl<T> IndexMut<usize> for LockedVector<T> {
    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
        let thing = self.stuff.get_mut().unwrap();
        &mut thing[index]
    }
}

impl<T: Display> Display for LockedVector<T> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let strings: Vec<String> = self
            .stuff
            .lock()
            .unwrap()
            .iter()
            .map(|s| format!("{}", s))
            .collect();
        write!(f, "{}", strings.join(", "))
    }
}

fn main() {
    let mut my_stuff = LockedVector::new(vec![0, 1, 2, 3, 4]);
    println!("initially: {}", my_stuff);
    my_stuff[2] = 5;
    println!("then: {}", my_stuff);
    let a_mut_var: &mut usize = &mut my_stuff[3];
    *a_mut_var = 54;
    println!("Still working: {}", my_stuff);
}
What I'm trying to do here is implement the Index and IndexMut traits on a struct, where the data being indexed is behind a Mutex lock. My very fuzzy reasoning for why this should be possible is that the result of locking a mutex is sort-of like a reference, and it seems like you could map a reference onto another reference, or somehow make a sort of reference that wraps the entire lock but only de-references the specific index.
My much less fuzzy reasoning is that the code above compiles and runs (note the todo!) - I'm able to get back mutable references, and I assume I haven't somehow snuck past the mutex in an unthread-safe way. (I made an attempt to test the threaded behavior, but ran into other issues trying to get a mutable reference into another thread at all.)
The weird issue is, I can't do the same for Index: there is no get_immut() I can use, and I haven't found another approach. I can get a mutable reference out of a Mutex, but not an immutable one (and of course, I can't get the mutable one if I only have an immutable reference to begin with).
My expectation is that indexing would acquire a lock, and the returned reference (in both mutable and immutable cases) would maintain the lock for their lifetimes. As a bonus, it would be nice if RwLock-ed things could only grab/hold the read lock for the immutable cases, and the write lock for mutable ones.
For context as to why I'd do this: I have a Grid trait that is used by a bunch of different code, but backed by different implementations, some of which are thread-safe. I was hoping to put the Index and IndexMut traits on it for the nice syntax. Threads don't generally have mutable references to the thread-safe Grids at all, so the IndexMut trait would see little use there, but I could see it being valuable during setup or for the non-thread-safe cases. The immutable Index behavior seems like it would be useful everywhere.
Bonus question: I absolutely hate that Display code, how can I make it less hideous?
If you look at the documentation of get_mut, you'll see it is only possible precisely because a mutable reference guarantees there is no other reference to the data and no outstanding lock. Unfortunately for you, that means a get_ref for Mutex would likewise only be possible by taking a mutable reference, which would just be an artificially limited get_mut.
Unfortunately, since Index only gives you a shared reference, you can't safely get a shared reference to its contents, so you can't implement Index such that it indexes into something behind a Mutex.
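For completeness, here is a hedged sketch of the usual workaround (my own addition, not part of the answer above): rather than implementing Index, expose a method that returns the lock guard itself and index through the guard, so the lock is held exactly as long as the borrow. The lock method below is a hypothetical helper, not an existing API of the question's type:

use std::sync::{Mutex, MutexGuard};

struct LockedVector<T> {
    stuff: Mutex<Vec<T>>,
}

impl<T> LockedVector<T> {
    fn new(v: Vec<T>) -> Self {
        LockedVector { stuff: Mutex::new(v) }
    }

    /// Lock the whole vector; callers index through the returned guard.
    fn lock(&self) -> MutexGuard<'_, Vec<T>> {
        self.stuff.lock().unwrap()
    }
}

fn main() {
    let v = LockedVector::new(vec![0, 1, 2, 3, 4]);
    let mut guard = v.lock();
    guard[2] = 5; // Vec's own Index/IndexMut apply through Deref on the guard
    println!("{}", guard[2]);
}

An RwLock-backed variant would work the same way, returning read() guards for shared access and write() guards for mutation.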

What is the preferred unsafe way to extend lifetimes which are correct but not provable?

I have a minimal arena allocator which demonstrates the intention, although it isn't optimized for minimizing allocations/deallocations like a true arena would be:
use std::sync::{Arc, Mutex};

#[derive(Clone)]
pub struct Arena(Arc<Mutex<Vec<Box<[u8]>>>>);

impl Arena {
    /// Allocate memory of the given size.
    pub fn allocate(&self, size: usize) -> &mut [u8] {
        let inner = &mut *self.0.lock().unwrap();
        let mut new_mem = vec![0u8; size].into_boxed_slice();
        let slice = &mut new_mem[..]; // THIS OBVIOUSLY DOESN'T WORK
        inner.push(new_mem);
        slice
    }
}
allocate is the only operation, so I know I can safely take a reference to the memory contained in new_mem with the same lifetime as &self: I don't provide any operations that would allow the boxed memory to become aliased, and because the memory block is boxed, it won't move even if the vector it is stored in has to reallocate to add additional blocks.
There's also no way I know of to safely tell the compiler that the reference to the memory block is safe. Using &mut new_mem[..] fails because the compiler thinks I'm borrowing new_mem while trying to move it into the vector in push. I could invert the order and do push followed by &mut inner.last().unwrap()[..], but that also fails, because the compiler sees that reference as being owned by the mutex guard.
That means that to tell the compiler that this borrow is OK, I need to do something unsafe to create a reference with a longer lifetime than normal borrowing would produce in this case.
I know of two ways to extend this lifetime:
std::mem::transmute:
let slice = {
    // Strongly-typed line to make sure we aren't accidentally starting from
    // a pointer to the box itself by accident.
    let slice: &mut [u8] = &mut new_mem[..];
    unsafe { mem::transmute(slice) }
};
Dereferencing a raw pointer:
let slice = {
    // Strongly-typed line to make sure we aren't accidentally starting from
    // a pointer to the box itself by accident.
    let slice: &mut [u8] = &mut new_mem[..];
    let ptr: *mut _ = slice;
    unsafe { &mut *ptr }
};
Is there any particular advantage to either of these options? For example, are there classes of mistakes that are possible with one option that aren't possible, or are harder to make, with the other? Are there other ways to do this that have different advantages? Or are all options about the same?
For classes of mistakes, I'm particularly wondering whether there are type inference mistakes that can occur in one that aren't possible in the other. Obviously transmute can convert to anything with the same size, and raw pointers allow casting to any type regardless of size, but I wonder if type inference on pointers is more restricted in the absence of an as cast.
How about this?
impl Arena {
    /// Allocate memory of the given size.
    pub fn allocate(&self, size: usize) -> &mut [u8] {
        let inner = &mut *self.0.lock().unwrap();
        let mut new_mem = vec![0u8; size].into_boxed_slice();
        let mem_ptr: *mut u8 = new_mem.as_mut_ptr();
        let slice = unsafe { std::slice::from_raw_parts_mut(mem_ptr, size) };
        inner.push(new_mem);
        slice
    }
}
I don't know the pros and cons, but that's how I would do it.
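On the sub-question about type-inference mistakes, one option (a sketch of my own, with a made-up helper name, not something from the answer above) is to wrap whichever unsafe mechanism you choose in a helper with a fully explicit signature. That way inference can never silently pick a different element type, and the only thing being changed is the lifetime:

/// Extend the lifetime of a byte slice.
///
/// Safety: the caller must guarantee that the underlying allocation outlives
/// 'a, is never moved or freed, and is not otherwise aliased mutably while
/// the returned slice is alive.
unsafe fn extend_slice_mut<'a>(slice: &mut [u8]) -> &'a mut [u8] {
    std::slice::from_raw_parts_mut(slice.as_mut_ptr(), slice.len())
}

Inside allocate this would be called as unsafe { extend_slice_mut(&mut new_mem[..]) } before pushing new_mem, which keeps the single unsafe assumption in one documented place.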

From a semantic perspective, at what moment does the undefined behavior of `&mut` noalias occur in Rust?

As the Rust reference documentation says:
Breaking the pointer aliasing rules. &mut T and &T follow LLVM’s scoped noalias model, except if the &T contains an UnsafeCell.
It's really ambiguous.
I want to know the exact moment at which undefined behavior from &mut noalias occurs in Rust.
Is it any of the below, or something else?
When two &mut that point to the same address are defined?
When two &mut that point to the same address are exposed to Rust?
When any operation is performed on a &mut that points to the same address as another &mut?
For example, this code is obviously UB:
unsafe {
    let mut x = 123usize;
    let a = (&mut x as *mut usize).as_mut().unwrap(); // created, but not accessed
    let b = (&mut x as *mut usize).as_mut().unwrap(); // created, accessed
    *b = 666;
    drop(a);
}
But what if I modify the code like this:
struct S<'a> {
    ref_x: &'a mut usize,
}

fn main() {
    let mut x = 123;
    let s = S { ref_x: &mut x }; // like the `T` in `ManuallyDrop<T>`
    let taken = unsafe { std::ptr::read(&s as *const S) }; // like `ManuallyDrop<T>::take`
    // at this time, we have two `&mut x`
    *(taken.ref_x) = 666;
    drop(s);
    // UB or not?
}
Is the second version also UB?
The second version is essentially the same implementation as std::mem::ManuallyDrop. If the second version is UB, is that a soundness bug in std::mem::ManuallyDrop<T>?
What the aliasing restriction is not
It is actually common to have multiple existing &mut T aliasing the same item.
The simplest example is:
fn main() {
    let mut i = 32;
    let j = &mut i;
    let k = &mut *j;
    *k = 3;
    println!("{}", i);
}
Note, though, that due to borrowing rules you cannot access the other aliases simultaneously.
If you look at the implementation of ManuallyDrop::take:
pub unsafe fn take(slot: &mut ManuallyDrop<T>) -> T {
    ptr::read(&slot.value)
}
You will note that there are no simultaneously accessible &mut T: calling the function re-borrows the ManuallyDrop, making slot the only accessible mutable reference.
Why is aliasing in Rust so ill-defined
It's really ambiguous. I want to know the exact moment at which undefined behavior from &mut noalias occurs in Rust.
Tough luck, because as specified in the Nomicon:
Unfortunately, Rust hasn't actually defined its aliasing model.
The reason is that the language team wants to make sure that the definition they reach is demonstrably safe and practical, yet does not close the door to possible refinements. It's a tall order.
The Rust Unsafe Code Guidelines Working Group is still working on establishing the exact boundaries, and in particular Ralf Jung is working on an operational model for aliasing called Stacked Borrows.
Note: the Stacked Borrows model is implemented in Miri, and therefore you can validate your code against the Stacked Borrows model simply by executing your code in Miri. Of course, Stacked Borrows is still experimental, so this doesn't guarantee anything.
What caution recommends
I personally subscribe to caution. Seeing as the exact model is unspecified, the rules are ever-changing, and therefore I recommend taking the strictest interpretation possible.
Thus, I interpret the no-aliasing rule of &mut T as:
At any point in the code, there shall not be two accessible references in scope which alias the same memory if one of them is &mut T.
That is, I consider that forming a &mut T to an instance T for which another &T or &mut T is in scope without invalidating the aliases (via borrowing) is ill-formed.
It may very well be overly cautious, but at least if the aliasing model ends up being more conservative than planned, my code will still be valid.
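A minimal sketch of what that stricter reading allows and forbids (my own illustration, not from the answer; the forbidden half is left commented out so the snippet runs cleanly):

fn main() {
    let mut x = 123usize;

    // Allowed: the reborrow suspends `outer` while `inner` is live, so at no
    // point are there two usable `&mut` to the same memory.
    let outer = &mut x;
    let inner = &mut *outer;
    *inner = 1;
    *outer = 2;

    // Forbidden under Stacked Borrows (Miri flags it): `a` is invalidated as
    // soon as `b` is derived from the same raw pointer, so using `a` again
    // afterwards is undefined behavior.
    // let p = &mut x as *mut usize;
    // let a = unsafe { &mut *p };
    // let b = unsafe { &mut *p };
    // *b = 666;
    // *a = 667;

    println!("{}", x);
}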

Using a HashSet to canonicalize objects in Rust

As an educational exercise, I'm looking at porting cvs-fast-export to Rust.
Its basic mode of operation is to parse a number of CVS master files into an intermediate form, and then to analyse the intermediate form with the goal of transforming it into a git fast-export stream.
One of the things that is done when parsing is to convert common parts of the intermediate form into a canonical representation. A motivating example is commit authors. A CVS repository may have hundreds of thousands of individual file commits, but probably fewer than a thousand authors. So an interning table is used during parsing: you feed in the author as you parse it from the file, and it gives you back a pointer to a canonical version, creating a new one if it hasn't been seen before. (I've heard this called atomizing or interning too.) This pointer then gets stored on the intermediate object.
My first attempt to do something similar in Rust attempted to use a HashSet as the interning table. Note this uses CVS version numbers rather than authors; a version number is just a sequence of digits such as 1.2.3.4, represented as a Vec<u16>.
use std::collections::HashSet;
use std::hash::Hash;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);

fn intern<T: Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> &T {
    let dupe = item.clone();
    if !set.contains(&item) {
        set.insert(item);
    }
    set.get(&dupe).unwrap()
}

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(vec![1, 2]);
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(vec![1, 2]);
    let c4 = intern(&mut set, c3);
}
This fails with error[E0499]: cannot borrow 'set' as mutable more than once at a time. This is fair enough: HashSet doesn't guarantee references to its keys will be valid if you add more items after you have obtained a reference. The C version is careful to guarantee this. To get this guarantee, I think the HashSet should be over Box<T>. However, I can't explain the lifetimes for this to the borrow checker.
The ownership model I am going for here is that the interning table owns the canonical versions of the data, and hands out references. The references should be valid as long the interning table exists. We should be able to add new things to the interning table without invalidating the old references. I think the root of my problem is that I'm confused how to write the interface for this contract in a way consistent with the Rust ownership model.
Solutions I see with my limited Rust knowledge are:
Do two passes, build a HashSet on the first pass, then freeze it and use references on the second pass. This means additional temporary storage (sometimes substantial).
Unsafe
Does anyone have a better idea?
I somewhat disagree with @Shepmaster on the use of unsafe here.
While right now it does not cause an issue, should someone decide in the future to change the use of HashSet to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe will bite you sternly.
In the absence of a strong performance reason, I would simply use a Rc<XXX>. You can alias it easily enough: type InternedXXX = Rc<XXX>;.
use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;

#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);

fn intern<T: Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
    if !set.contains(&item) {
        let dupe = item.clone();
        set.insert(dupe);
        item
    } else {
        set.get(&item).unwrap().clone()
    }
}

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(Rc::new(vec![1, 2]));
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(Rc::new(vec![1, 2]));
    let c4 = intern(&mut set, c3);
}
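To see that this really canonicalizes, the main above can be rewritten with a pointer-identity check (my addition, reusing the CvsNumber and intern definitions from the snippet): equal inputs come back sharing one Rc allocation.

fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c2 = intern(&mut set, CvsNumber(Rc::new(vec![1, 2])));
    let c4 = intern(&mut set, CvsNumber(Rc::new(vec![1, 2])));
    assert_eq!(c2, c4);                // equal by value
    assert!(Rc::ptr_eq(&c2.0, &c4.0)); // and backed by the same allocation
}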
Your analysis is correct. The ultimate issue is that when modifying the HashSet, the compiler cannot guarantee that the mutations will not affect the existing allocations. Indeed, in general they might affect them, unless you add another layer of indirection, as you have identified.
This is a prime example of a place that unsafe is useful. You, the programmer, can assert that the code will only ever be used in a particular way, and that particular way will allow the variable to be stable through any mutations. You can use the type system and module visibility to help enforce these conditions.
Note that String already introduces a heap allocation. So long as you don't change the String once it's allocated, you don't need an extra Box.
Something like this seems like an OK start:
use std::{cell::RefCell, collections::HashSet, mem};

struct EasyInterner(RefCell<HashSet<String>>);

impl EasyInterner {
    fn new() -> Self {
        EasyInterner(RefCell::new(HashSet::new()))
    }

    fn intern<'a>(&'a self, s: &str) -> &'a str {
        let mut set = self.0.borrow_mut();
        if !set.contains(s) {
            set.insert(s.into());
        }
        let interned = set.get(s).expect("Impossible missing string");
        // TODO: Document the pre- and post-conditions that the code must
        // uphold to make this unsafe code valid instead of copying this
        // from Stack Overflow without reading it
        unsafe { mem::transmute(interned.as_str()) }
    }
}

fn main() {
    let i = EasyInterner::new();
    let a = i.intern("hello");
    let b = i.intern("world");
    let c = i.intern("hello");

    // Still strings
    assert_eq!(a, "hello");
    assert_eq!(a, c);
    assert_eq!(b, "world");

    // But with the same address
    assert_eq!(a.as_ptr(), c.as_ptr());
    assert!(a.as_ptr() != b.as_ptr());

    // This shouldn't compile; a cannot outlive the interner
    // let x = {
    //     let i = EasyInterner::new();
    //     let a = i.intern("hello");
    //     a
    // };

    let the_pointer;
    let i = {
        let i = EasyInterner::new();
        {
            // Introduce a scope to constrain the borrow of `i` for `s`
            let s = i.intern("inner");
            the_pointer = s.as_ptr();
        }
        i // moving i to a new location
        // All outstanding borrows are invalidated
    };
    // but the data is still allocated
    let s = i.intern("inner");
    assert_eq!(the_pointer, s.as_ptr());
}
However, it may be much more expedient to use a crate like:
string_cache, which has the collective brainpower of the Servo project behind it.
typed-arena (see the sketch below)
generational-arena
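For instance, here is a hedged sketch of the typed-arena route (the API is quoted from memory of the crate and worth double-checking against its docs; note that an arena only provides stable storage, so you would still keep a map on the side to deduplicate values):

use typed_arena::Arena;

fn main() {
    // Values allocated here live as long as the arena itself, and allocating
    // more values never invalidates earlier references.
    let arena: Arena<String> = Arena::new();
    let a: &mut String = arena.alloc("hello".to_string());
    let b: &mut String = arena.alloc("world".to_string());
    println!("{} {}", a, b);
}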

Traits in algebraic data types

I'm having trouble understanding the rules about traits in algebraic data types.
Here's a simplified example:
use std::rc::Rc;
use std::cell::RefCell;

trait Quack {
    fn quack(&self);
}

struct Duck;

impl Quack for Duck {
    fn quack(&self) { println!("Quack!"); }
}

fn main() {
    let mut pond: Vec<Box<Quack>> = Vec::new();
    let duck: Box<Duck> = Box::new(Duck);
    pond.push(duck); // This is valid.

    let mut lake: Vec<Rc<RefCell<Box<Quack>>>> = Vec::new();
    let mallard: Rc<RefCell<Box<Duck>>> = Rc::new(RefCell::new(Box::new(Duck)));
    lake.push(mallard); // This is a type mismatch.
}
The above fails to compile, yielding the following error message:
expected `alloc::rc::Rc<core::cell::RefCell<Box<Quack>>>`,
found `alloc::rc::Rc<core::cell::RefCell<Box<Duck>>>`
(expected trait Quack,
found struct `Duck`) [E0308]
src/main.rs:19 lake.push(mallard);
Why is it that pond.push(duck) is valid, yet lake.push(mallard) isn't? In both cases, a Duck has been supplied where a Quack was expected. In the former, the compiler is happy, but in the latter, it's not.
Is the reason for this difference related to CoerceUnsized?
This is correct behavior, even if it is somewhat unfortunate.
In the first case we have this:
let mut pond: Vec<Box<Quack>> = Vec::new();
let duck: Box<Duck> = Box::new(Duck);
pond.push(duck);
Note that push(), when called on Vec<Box<Quack>>, accepts Box<Quack>, and you're passing Box<Duck>. This is OK - rustc is able to understand that you want to convert a boxed value to a trait object, like here:
let duck: Box<Duck> = Box::new(Duck);
let quack: Box<Quack> = duck; // automatic coercion to a trait object
In the second case we have this:
let mut lake: Vec<Rc<RefCell<Box<Quack>>>> = Vec::new();
let mallard: Rc<RefCell<Box<Duck>>> = Rc::new(RefCell::new(Box::new(Duck)));
lake.push(mallard);
Here push() accepts Rc<RefCell<Box<Quack>>> while you provide Rc<RefCell<Box<Duck>>>:
let mallard: Rc<RefCell<Box<Duck>>> = Rc::new(RefCell::new(Box::new(Duck)));
let quack: Rc<RefCell<Box<Quack>>> = mallard;
And now there is trouble. Box<T> is a DST-compatible type, so it can be used as a container for a trait object. The same thing will soon be true for Rc and other smart pointers when this RFC is implemented. However, in this case there is no coercion from a concrete type to a trait object, because Box<Duck> is inside additional layers of types (Rc<RefCell<..>>).
Remember, a trait object is a fat pointer, so Box<Duck> is different from Box<Quack> in size. Consequently, in principle, they are not directly compatible: you can't just take the bytes of Box<Duck> and write them where Box<Quack> is expected. Rust performs a special conversion: it obtains a pointer to the virtual table for Duck, constructs a fat pointer, and writes it to the Box<Quack>-typed variable.
When you have Rc<RefCell<Box<Duck>>>, however, rustc would need to know how to construct and destructure both RefCell and Rc in order to apply the same fat pointer conversion to its internals. Naturally, because these are library types, it can't know how to do it. This is also true for any other wrapper type, e.g. Arc or Mutex or even Vec. You don't expect that it would be possible to use Vec<Box<Duck>> as Vec<Box<Quack>>, right?
There is also the fact that, in the example with Rc, the Rcs created out of Box<Duck> and Box<Quack> wouldn't have been connected; they would have had different reference counters.
That is, a conversion from a concrete type to a trait object can only happen if you have direct access to a smart pointer which supports DST, not when it is hidden inside some other structure.
That said, I see how it may be possible to allow this for a few select types. For example, we could introduce some kind of Construct/Unwrap traits which are known to the compiler and which it could use to "reach" inside a stack of wrappers and perform the trait object conversion inside them. However, no one has designed this and written an RFC for it yet, probably because it is not a widely needed feature.
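For what it's worth, on current compilers you can sidestep the problem by dropping the inner Box, since the unsizing coercion can then reach the Rc directly. A sketch of that, using the modern dyn syntax (my addition, not part of the original answer):

use std::cell::RefCell;
use std::rc::Rc;

trait Quack {
    fn quack(&self);
}

struct Duck;

impl Quack for Duck {
    fn quack(&self) { println!("Quack!"); }
}

fn main() {
    let mut lake: Vec<Rc<RefCell<dyn Quack>>> = Vec::new();
    let mallard: Rc<RefCell<Duck>> = Rc::new(RefCell::new(Duck));
    lake.push(mallard); // coerces Rc<RefCell<Duck>> to Rc<RefCell<dyn Quack>>
    lake[0].borrow().quack();
}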
Vladimir's answer explained what the compiler is doing. Based on that information, I developed a solution: creating a wrapper struct around Box<Quack>.
The wrapper is called QuackWrap. It has a fixed size, and it can be used just like any other struct (I think). The Box inside QuackWrap allows me to build a QuackWrap around any type that implements Quack. Thus, I can have a Vec<Rc<RefCell<QuackWrap>>> where the inner values are a mixture of Ducks, Geese, etc.
use std::rc::Rc;
use std::cell::RefCell;

trait Quack {
    fn quack(&self);
}

struct Duck;

impl Quack for Duck {
    fn quack(&self) { println!("Quack!"); }
}

struct QuackWrap(Box<Quack>);

impl QuackWrap {
    pub fn new<T: Quack + 'static>(value: T) -> QuackWrap {
        QuackWrap(Box::new(value))
    }
}

fn main() {
    let mut pond: Vec<Box<Quack>> = Vec::new();
    let duck: Box<Duck> = Box::new(Duck);
    pond.push(duck); // This is valid.

    // This would be a type error:
    //let mut lake: Vec<Rc<RefCell<Box<Quack>>>> = Vec::new();
    //let mallard: Rc<RefCell<Box<Duck>>> = Rc::new(RefCell::new(Box::new(Duck)));
    //lake.push(mallard); // This is a type mismatch.

    // Instead, we can do this:
    let mut lake: Vec<Rc<RefCell<QuackWrap>>> = Vec::new();
    let mallard: Rc<RefCell<QuackWrap>> = Rc::new(RefCell::new(QuackWrap::new(Duck)));
    lake.push(mallard); // This is valid.
}
As an added convenience, I'll probably want to implement Deref and DerefMut on QuackWrap. But that's not necessary for the above example.
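For reference, a sketch of those impls (my addition, assuming the QuackWrap definition above; on modern compilers the trait object target would be spelled dyn Quack):

use std::ops::{Deref, DerefMut};

impl Deref for QuackWrap {
    type Target = dyn Quack;

    fn deref(&self) -> &Self::Target {
        &*self.0
    }
}

impl DerefMut for QuackWrap {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut *self.0
    }
}

With these in place, lake[0].borrow().quack() works directly, because the RefCell borrow derefs through QuackWrap to the trait object.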

Resources