Using a HashSet to canonicalize objects in Rust

As an educational exercise, I'm looking at porting cvs-fast-export to Rust.
Its basic mode of operation is to parse a number of CVS master files into an intermediate form, and then to analyse the intermediate form with the goal of transforming it into a git fast-export stream.
One of the things that is done when parsing is to convert common parts of the intermediate form into a canonical representation. A motivating example is commit authors. A CVS repository may have hundreds of thousands of individual file commits, but probably fewer than a thousand authors. So an interning table is used when parsing: you pass in the author as you parse it from the file and it gives you a pointer to a canonical version, creating a new one if it hasn't seen it before. (I've heard this called atomizing or interning.) This pointer then gets stored on the intermediate object.
My first attempt to do something similar in Rust used a HashSet as the interning table. Note that this example uses CVS version numbers rather than authors; a version number is just a sequence of digits such as 1.2.3.4, represented as a Vec<u16>.
use std::collections::HashSet;
use std::hash::Hash;
#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);
fn intern<T: Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> &T {
    let dupe = item.clone();
    if !set.contains(&item) {
        set.insert(item);
    }
    set.get(&dupe).unwrap()
}
fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(vec![1, 2]);
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(vec![1, 2]);
    let c4 = intern(&mut set, c3);
    // Using c2 after the second call keeps the first borrow alive,
    // which is what triggers the error on current compilers.
    println!("{:?} {:?}", c2, c4);
}
This fails with error[E0499]: cannot borrow `set` as mutable more than once at a time. This is fair enough: HashSet doesn't guarantee that references to its keys stay valid if you add more items after you have obtained a reference. The C version is careful to guarantee this. To get this guarantee, I think the HashSet should be over Box<T>. However, I can't explain the lifetimes for this to the borrow checker.
The ownership model I am going for here is that the interning table owns the canonical versions of the data, and hands out references. The references should be valid as long as the interning table exists. We should be able to add new things to the interning table without invalidating the old references. I think the root of my problem is that I'm confused about how to write the interface for this contract in a way consistent with the Rust ownership model.
Solutions I see with my limited Rust knowledge are:
Do two passes, build a HashSet on the first pass, then freeze it and use references on the second pass. This means additional temporary storage (sometimes substantial).
Unsafe
Does anyone have a better idea?

I somewhat disagree with @Shepmaster on the use of unsafe here.
While right now it does not cause an issue, should someone decide in the future to change the use of HashSet to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe will bite you sternly.
In the absence of a strong performance reason, I would simply use an Rc<XXX>. You can alias it easily enough: type InternedXXX = Rc<XXX>;.
use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;
#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);
fn intern<T: Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
    if !set.contains(&item) {
        let dupe = item.clone();
        set.insert(dupe);
        item
    } else {
        set.get(&item).unwrap().clone()
    }
}
fn main() {
    let mut set: HashSet<CvsNumber> = HashSet::new();
    let c1 = CvsNumber(Rc::new(vec![1, 2]));
    let c2 = intern(&mut set, c1);
    let c3 = CvsNumber(Rc::new(vec![1, 2]));
    let c4 = intern(&mut set, c3);
}
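To make the sharing visible, here is a minimal sketch of the same Rc idea applied to author names. The `Interner` type and its `intern` method are hypothetical names chosen for illustration, and the author strings are made up; the point is only that `Rc::ptr_eq` confirms two interned values share one allocation.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical interner: the set owns one `Rc<str>` per distinct string
/// and hands out cheap clones of it.
struct Interner(HashSet<Rc<str>>);

impl Interner {
    fn new() -> Self {
        Interner(HashSet::new())
    }

    fn intern(&mut self, s: &str) -> Rc<str> {
        // `Rc<str>: Borrow<str>`, so the set can be queried with a plain `&str`.
        if let Some(existing) = self.0.get(s) {
            return Rc::clone(existing);
        }
        let canonical: Rc<str> = Rc::from(s);
        self.0.insert(Rc::clone(&canonical));
        canonical
    }
}

fn main() {
    let mut authors = Interner::new();
    let a = authors.intern("alice");
    let b = authors.intern("bob");
    let c = authors.intern("alice");
    // Same text interned twice yields the same allocation.
    assert!(Rc::ptr_eq(&a, &c));
    assert!(!Rc::ptr_eq(&a, &b));
    assert_eq!(authors.0.len(), 2);
}
```

Because the handles are reference-counted, pruning the set later only drops the set's own clone; outstanding handles stay valid.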

Your analysis is correct. The ultimate issue is that when modifying the HashSet, the compiler cannot guarantee that the mutations will not affect the existing allocations. Indeed, in general they might affect them, unless you add another layer of indirection, as you have identified.
This is a prime example of a place where unsafe is useful. You, the programmer, can assert that the code will only ever be used in a particular way, and that this particular way keeps the referenced data at a stable address through any mutations. You can use the type system and module visibility to help enforce these conditions.
Note that String already introduces a heap allocation. So long as you don't change the String once it's allocated, you don't need an extra Box.
Something like this seems like an OK start:
use std::{cell::RefCell, collections::HashSet, mem};
struct EasyInterner(RefCell<HashSet<String>>);
impl EasyInterner {
    fn new() -> Self {
        EasyInterner(RefCell::new(HashSet::new()))
    }
    fn intern<'a>(&'a self, s: &str) -> &'a str {
        let mut set = self.0.borrow_mut();
        if !set.contains(s) {
            set.insert(s.into());
        }
        let interned = set.get(s).expect("Impossible missing string");
        // TODO: Document the pre- and post-conditions that the code must
        // uphold to make this unsafe code valid instead of copying this
        // from Stack Overflow without reading it
        unsafe { mem::transmute(interned.as_str()) }
    }
}
fn main() {
    let i = EasyInterner::new();
    let a = i.intern("hello");
    let b = i.intern("world");
    let c = i.intern("hello");
    // Still strings
    assert_eq!(a, "hello");
    assert_eq!(a, c);
    assert_eq!(b, "world");
    // But with the same address
    assert_eq!(a.as_ptr(), c.as_ptr());
    assert!(a.as_ptr() != b.as_ptr());
    // This shouldn't compile; a cannot outlive the interner
    // let x = {
    //     let i = EasyInterner::new();
    //     let a = i.intern("hello");
    //     a
    // };
    let the_pointer;
    let i = {
        let i = EasyInterner::new();
        {
            // Introduce a scope to constrain the borrow of `i` for `s`
            let s = i.intern("inner");
            the_pointer = s.as_ptr();
        }
        i // moving i to a new location
        // All outstanding borrows are invalidated
    };
    // but the data is still allocated
    let s = i.intern("inner");
    assert_eq!(the_pointer, s.as_ptr());
}
However, it may be much more expedient to use a crate like:
string_cache, which has the collective brainpower of the Servo project behind it.
typed-arena
generational-arena
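Arena-style crates generally work by handing out something other than a plain reference into a growing collection. As a rough, std-only sketch of that idea (not the API of any of the crates listed above), an interner can return small copyable indices instead of references, which avoids both unsafe and the borrow conflict. The names `Symbol`, `IndexInterner`, `intern` and `resolve` are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical handle: an index into the interner's storage.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Symbol(u32);

#[derive(Default)]
struct IndexInterner {
    lookup: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl IndexInterner {
    /// Returns the existing symbol for `s`, or stores `s` and returns a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.lookup.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        // For simplicity this sketch stores the text twice (map key and vector).
        self.strings.push(s.to_owned());
        self.lookup.insert(s.to_owned(), sym);
        sym
    }

    /// Looks the text back up; panics if `sym` came from another interner.
    fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}

fn main() {
    let mut interner = IndexInterner::default();
    let a = interner.intern("hello");
    let b = interner.intern("world");
    let c = interner.intern("hello");
    assert_eq!(a, c);
    assert_ne!(a, b);
    assert_eq!(interner.resolve(b), "world");
}
```

The trade-off is that every read goes through `resolve`, and a `Symbol` is only meaningful together with the interner that produced it.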

Related

References in rust self referential structs

Given the code snippet below:
use std::{io::BufWriter, pin::Pin};
pub struct SelfReferential {
    pub writer: BufWriter<&'static mut [u8]>, // borrowed from buffer
    pub buffer: Pin<Box<[u8]>>,
}
#[cfg(test)]
mod tests {
    use std::io::Write;
    use super::*;
    fn init() -> SelfReferential {
        let mut buffer = Pin::new(vec![0; 12].into_boxed_slice());
        let writer = unsafe { buffer.as_mut().get_unchecked_mut() };
        let writer = unsafe { (writer as *mut [u8]).as_mut().unwrap() };
        let writer = BufWriter::new(writer);
        SelfReferential { writer, buffer }
    }
    #[test]
    fn move_works() {
        let mut sr = init();
        sr.writer.write(b"hello ").unwrap();
        sr.writer.flush().unwrap();
        let mut slice = &mut sr.buffer[6..];
        slice.write(b"world!").unwrap();
        assert_eq!(&sr.buffer[..], b"hello world!".as_ref());
        let mut sr_moved = sr;
        sr_moved.writer.write(b"W").unwrap();
        sr_moved.writer.flush().unwrap();
        assert_eq!(&sr_moved.buffer[..], b"hello World!".as_ref());
    }
}
The first question: is it OK to assign a 'static lifetime to the mutable slice reference in BufWriter? Technically speaking, it's bound to the lifetime of the struct instance itself, and AFAIK there's no safe way to invalidate it.
The second question: besides the fact that the unsafe instantiation of this type, in the test example, creates two mutable references into the underlying buffer, are there any other potential dangers associated with such an "unidiomatic" (for lack of a better word) type?

is it OK to assign 'static lifetime to mutable slice reference in BufWriter?
Sort of, but there's a bigger problem. The lifetime itself is not worse than any other choice, because there is no lifetime that you can use here which is really accurate. But it is not safe to expose that reference, because then it can be taken:
let w: BufWriter<&'static mut [u8]> = {
    let sr = init();
    sr.writer
};
// `sr.buffer` has now been dropped, so `w` has a dangling reference
are there any other potential dangers associated with such an "unidiomatic" (for lack of a better word) type?
Yes, it's undefined behavior. Box isn't just managing an allocation; it also (currently) signals a claim of unique, non-aliasing access to the contents. You violate that non-aliasing by creating the writer and then moving the buffer: even though the heap memory is not actually touched, the move of buffer counts as invalidating all references into it.
This is an area of Rust semantics which is not yet fully nailed down, but as far as the current compiler is concerned, this is UB. You can see this if you run your test code under the Miri interpreter.
The good news is, what you're trying to do is a very common desire and people have worked on the problem. I personally recommend using ouroboros — with the help of a macro, it allows you to create the struct you want without writing any new unsafe code. There will be some restrictions on how you use the writer, but nothing you can't tidy out of the way by writing an impl io::Write for SelfReferential. Another, newer library in this space is yoke; I haven't tried it.
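If the BufWriter itself is not essential, a std-only way to sidestep the self-reference is to own the buffer and the write position together. The sketch below is not how ouroboros or yoke work; it assumes a fixed-size buffer is all that is needed and uses `std::io::Cursor` over a `Box<[u8]>`. The `OwnedWriter` type and its `buffer` method are hypothetical names.

```rust
use std::io::{self, Cursor, Write};

/// Hypothetical stand-in for `SelfReferential`: the cursor owns the buffer
/// and tracks the write position, so no internal reference is needed.
pub struct OwnedWriter {
    cursor: Cursor<Box<[u8]>>,
}

impl OwnedWriter {
    pub fn new(len: usize) -> Self {
        OwnedWriter {
            cursor: Cursor::new(vec![0u8; len].into_boxed_slice()),
        }
    }

    /// Read-only view of the whole underlying buffer.
    pub fn buffer(&self) -> &[u8] {
        &self.cursor.get_ref()[..]
    }
}

impl Write for OwnedWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.cursor.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.cursor.flush()
    }
}

fn main() -> io::Result<()> {
    let mut w = OwnedWriter::new(12);
    w.write_all(b"hello ")?;
    // Moving the struct is harmless: nothing points into it.
    let mut moved = w;
    moved.write_all(b"world!")?;
    assert_eq!(moved.buffer(), &b"hello world!"[..]);
    Ok(())
}
```

Writes past the end of the fixed buffer report an error through `write_all` rather than growing the allocation.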

`HashMap::get_mut` leading to "returns reference to local value", any efficient work-around?

There have been a fair number of questions around this, and the solution is mostly "use Entry".
However this is an issue because HashMap::entry() requires an owned value meaning possibly expensive copies / allocations even when the key is already present and we just want to update the value in-place, hence the use of get_mut. However the use of get_mut on a reference to a local leads rustc to assume that said reference gets stored into the hashmap, and thus that returning the hashmap is an error:
use std::borrow::Cow;
use std::collections::HashMap;
fn get_string() -> String { String::from("xxxxxxx") }
fn foo() -> HashMap<Cow<'static, str>, usize> {
    let mut v = HashMap::new();
    // stand-in for "get a string slice as key",
    // real case is getting a String from an
    // mpsc and the key being a segment of that string
    let s = get_string();
    // stand-in for a structure which contains an `Option<Cow>`
    let k = Cow::from(&s[2..3]);
    // because of get_mut, `&s` is apparently considered to be stored in `v`?
    if let Some(e) = v.get_mut(&k) {
        *e += 1;
    } else {
        v.insert(Cow::from(k.into_owned()), 0);
    }
    v
}
Note that the manipulations at lines 9~13 are there to clarify the point of the pattern, but get_mut alone is sufficient to trigger the issue
Is there a way around without the efficiency hit, or is an eager allocation the only way? (note: because this is a static issue, dynamic gates like contains_key or get obviously don't do anything).

According to the docs, HashMap::get_mut() requires a value of type &Q such that the key type of the map implements Borrow<Q>.
The key of your hash is Cow<'static, str>, that implements Borrow<str>. This means that you can use either a &Cow<'static, str> or a &str. But you are passing a &Cow<'local, str> for some 'local lifetime. The compiler tries to match that 'local with 'static and issues a somewhat confusing error message about lifetimes.
The solution is actually easy, because you can get a &str from the Cow either by calling k.as_ref() or by writing &*k, and the lifetime of that &str is not tied to 'static:
let k = Cow::from(&s[2..3]);
if let Some(e) = v.get_mut(k.as_ref()) { /* ...*/ }
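Put together, a self-contained sketch of the pattern might look like the following. The helper name `bump` is hypothetical, and the sketch uses the `&*k` form; the key is only converted to an owned `String` on the insert path.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical helper: increments the counter for `key`, allocating an
/// owned key only when it is not already present.
fn bump(v: &mut HashMap<Cow<'static, str>, usize>, key: &str) {
    let k: Cow<'_, str> = Cow::from(key);
    // `&*k` is a plain `&str`, so no `'static` bound is imposed on `k`.
    if let Some(e) = v.get_mut(&*k) {
        *e += 1;
    } else {
        v.insert(Cow::Owned(k.into_owned()), 0);
    }
}

fn foo() -> HashMap<Cow<'static, str>, usize> {
    let mut v = HashMap::new();
    let s = String::from("xxxxxxx");
    bump(&mut v, &s[2..3]);
    bump(&mut v, &s[2..3]);
    v
}

fn main() {
    let v = foo();
    // Inserted with 0 on the first call, incremented once on the second.
    assert_eq!(v.get("x"), Some(&1));
}
```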

How to create a Box<UnsafeCell<[T]>>

The recommended way to create a regular boxed slice (i.e. Box<[T]>) seems to be to first create a std::Vec<T>, and use .into_boxed_slice(). However, nothing similar to this seems to work if I want the slice to be wrapped in UnsafeCell.
A solution with unsafe code is fine, but I'd really like to avoid having to manually manage the memory.

The only (not-unsafe) way to create a Box<[T]> is via Box::from, given a &[T] as the parameter. This is because [T] is ?Sized and can't be passed as a parameter. This in turn effectively requires T: Copy, because T has to be copied from behind the reference into the new Box. But UnsafeCell is not Copy, regardless of whether T is. Discussion about making UnsafeCell Copy has been going on for years, yielding no final conclusion, due to safety concerns.
If you really, really want a Box<UnsafeCell<[T]>>, there are only two ways:
Because Box and UnsafeCell both implement CoerceUnsized, and [T; N] implements Unsize<[T]>, you can create a Box<UnsafeCell<[T; N]>> and coerce it to a Box<UnsafeCell<[T]>>. This limits you to initializing from fixed-size arrays.
Unsize coercion:
fn main() {
    use std::cell::UnsafeCell;
    let x: [u8; 3] = [1, 2, 3];
    let c: Box<UnsafeCell<[_]>> = Box::new(UnsafeCell::new(x));
}
Because UnsafeCell is #[repr(transparent)], you can create a Box<[T]> and unsafely convert it to a Box<UnsafeCell<[T]>>, as the UnsafeCell<[T]> is guaranteed to have the same memory layout as a [T], given that [T] doesn't use niche-values (even if T does).
Transmute:
// enclose the transmute in a function accepting and returning proper type-pairs
use std::{cell::UnsafeCell, mem};
fn into_boxed_unsafecell<T>(inp: Box<[T]>) -> Box<UnsafeCell<[T]>> {
    unsafe {
        mem::transmute(inp)
    }
}
fn main() {
    let x = vec![1, 2, 3];
    let b = x.into_boxed_slice();
    let c: Box<UnsafeCell<[_]>> = into_boxed_unsafecell(b);
}
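As a variation on the transmute, the same conversion can be sketched with a raw-pointer cast through Box::into_raw and Box::from_raw. This is an alternative formulation rather than the answer's own code, and it relies on the same layout assumption, namely that UnsafeCell<[T]> has the same in-memory representation as [T].

```rust
use std::cell::UnsafeCell;

fn into_boxed_unsafecell<T>(inp: Box<[T]>) -> Box<UnsafeCell<[T]>> {
    // Give up ownership and obtain the fat pointer (address + length).
    let raw: *mut [T] = Box::into_raw(inp);
    // SAFETY (assumed): `UnsafeCell<[T]>` has the same layout as `[T]`, and
    // the pointer came from `Box::into_raw`, so rebuilding a Box from it
    // hands the allocation back to exactly one owner.
    unsafe { Box::from_raw(raw as *mut UnsafeCell<[T]>) }
}

fn main() {
    let mut cell = into_boxed_unsafecell(vec![1, 2, 3].into_boxed_slice());
    // With exclusive access, `get_mut` gives a safe `&mut [T]`.
    cell.get_mut()[1] = 10;
    assert_eq!(&*cell.get_mut(), &[1, 10, 3][..]);
}
```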
Having said all this: I strongly suspect this is an XY problem. A Box<UnsafeCell<[T]>> is a very strange type (especially compared to UnsafeCell<Box<[T]>>). You may want to give details on what you are trying to accomplish with such a type.

Just swap the pointer types to UnsafeCell<Box<[T]>>:
use std::cell::UnsafeCell;
fn main() {
    let mut res: UnsafeCell<Box<[u32]>> = UnsafeCell::new(vec![1, 2, 3, 4, 5].into_boxed_slice());
    unsafe {
        println!("{}", (*res.get())[1]);
        res.get_mut()[1] = 10;
        println!("{}", (*res.get())[1]);
    }
}

replace a value behind a mutable reference by moving and mapping the original

TLDR: I want to replace a T behind &mut T with a new T that I construct from the old T
Note: please forgive me if the solution to this problem is easy to find. I did a lot of googling, but I am not sure how to word the problem correctly.
Sample code:
struct T { s: String }
fn main() {
    let ref mut t = T { s: "hello".to_string() };
    *t = T {
        s: t.s + " world"
    }
}
This obviously fails because the add impl on String takes self by value, and therefore would require being able to move out of T, which is however not possible, since T is behind a reference.
From what I was able to find, the usual way to achieve this is to do something like
let old_t = std::mem::replace(t, T { s: Default::default() });
t.s = old_t.s + " world";
but this requires that it's possible and feasible to create some placeholder T until we can fill it with real data.
Fortunately, in my use case I can create a placeholder T, but it's still not clear to me why an API similar to this is not possible:
map_in_place(t, |old_t: T| T { s: old_t.s + " world" });
Is there a reason that is not possible or commonly done?

Is there a reason [map_in_place] is not possible or commonly done?
A map_in_place is indeed possible:
// XXX unsound, don't use
pub fn map_in_place<T>(place: &mut T, f: impl FnOnce(T) -> T) {
    let place = place as *mut T;
    unsafe {
        let val = std::ptr::read(place);
        let new_val = f(val);
        std::ptr::write(place, new_val);
    }
}
But unfortunately it's not sound. If f() panics, *place will be dropped twice. First it will be dropped while unwinding the scope of f(), which thinks it owns the value it received. Then it will be dropped a second time by the owner of the value place is borrowed from, which never got the memo that the value it thinks it owns is actually garbage because it was already dropped. This can even be reproduced in the playground where a simple panic!() in the closure results in a double free.
For this reason an implementation of map_in_place would itself have to be marked unsafe, with a safety contract that f() not panic. But since pretty much anything in Rust can panic (e.g. any slice access), it would be hard to ensure that safety contract and the function would be somewhat of a footgun.
The replace_with crate does offer such functionality, with several recovery options in case of panic. Judging by the documentation, the authors are keenly aware of the panic issue, so if you really need that functionality, that might be a good place to get it from.
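For types that do have a cheap placeholder, a safe version can be sketched with std::mem::take. This is not the replace_with crate's API; the function name `map_in_place_with_default` and the `Greeting` type are hypothetical. If `f` panics, `*place` simply keeps the default value, so nothing is dropped twice.

```rust
use std::mem;

/// Hypothetical safe variant: requires `T: Default` so that `*place` always
/// holds a valid value, even while `f` is running or if it panics.
fn map_in_place_with_default<T: Default>(place: &mut T, f: impl FnOnce(T) -> T) {
    let old = mem::take(place); // leaves `T::default()` behind
    *place = f(old);
}

#[derive(Default, Debug, PartialEq)]
struct Greeting {
    s: String,
}

fn main() {
    let mut g = Greeting { s: "hello".to_string() };
    map_in_place_with_default(&mut g, |old| Greeting { s: old.s + " world" });
    assert_eq!(g.s, "hello world");
}
```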

From semantic perspective, what's the moment an undefined behavior of `&mut` noalias occurred in Rust?

As the Rust reference documentation says:
Breaking the pointer aliasing rules. &mut T and &T follow LLVM’s scoped noalias model, except if the &T contains an UnsafeCell.
This is really ambiguous.
I want to know the exact moment at which undefined behavior from &mut noalias occurs in Rust.
Is it any of the below, or something else?
When defining two &mut that point to the same address?
When two &mut that point to the same address are exposed to Rust?
When performing any operation on a &mut that points to the same address as any other &mut?
For example, this code is obviously UB:
unsafe {
    let mut x = 123usize;
    let a = (&mut x as *mut usize).as_mut().unwrap(); // created, but not accessed
    let b = (&mut x as *mut usize).as_mut().unwrap(); // created, accessed
    *b = 666;
    drop(a);
}
But what if I modify the code like this:
struct S<'a> {
    ref_x: &'a mut usize
}
fn main() {
    let mut x = 123;
    let s = S { ref_x: &mut x }; // like the `T` in `ManuallyDrop<T>`
    let taken = unsafe { std::ptr::read(&s as *const S) }; // like `ManuallyDrop<T>::take`
    // at this time, we have two `&mut x`
    *(taken.ref_x) = 666;
    drop(s);
    // UB or not?
}
Is the second version also UB?
The second version has essentially the same implementation as std::mem::ManuallyDrop. If the second version is UB, is that a security bug in std::mem::ManuallyDrop<T>?

What the aliasing restriction is not
It is actually common to have multiple existing &mut T aliasing the same item.
The simplest example is:
fn main() {
    let mut i = 32;
    let j = &mut i;
    let k = &mut *j;
    *k = 3;
    println!("{}", i);
}
Note, though, that due to borrowing rules you cannot access the other aliases simultaneously.
If you look at the implementation of ManuallyDrop::take:
pub unsafe fn take(slot: &mut ManuallyDrop<T>) -> T {
    ptr::read(&slot.value)
}
You will note that there are no simultaneously accessible &mut T: calling the function re-borrows ManuallyDrop making slot the only accessible mutable reference.
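For comparison, here is a small sketch of ManuallyDrop::take used the way its contract intends, where the slot is never touched again after the value has been taken. The struct `S` mirrors the question; the sketch illustrates the "only one accessible reference" reading described above and does not settle whether the question's second snippet is UB.

```rust
use std::mem::ManuallyDrop;

struct S<'a> {
    ref_x: &'a mut usize,
}

fn main() {
    let mut x = 123usize;
    let mut slot = ManuallyDrop::new(S { ref_x: &mut x });
    // SAFETY (as required by `take`): `slot` is not used again after this
    // point, so the stale copy of the `&mut` inside it is never accessed.
    let taken = unsafe { ManuallyDrop::take(&mut slot) };
    *taken.ref_x = 666;
    // The borrow of `x` has ended, so it can be read directly again.
    assert_eq!(x, 666);
}
```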
Why is aliasing in Rust so ill-defined
This is really ambiguous. I want to know the exact moment at which undefined behavior from &mut noalias occurs in Rust.
Tough luck, because as specified in the Nomicon:
Unfortunately, Rust hasn't actually defined its aliasing model.
The reason is that the language team wants to make sure that the definition they reach is both safe (demonstrably so), practical, and yet does not close the door to possible refinements. It's a tall order.
The Rust Unsafe Code Guidelines Working Group is still working on establishing the exact boundaries, and in particular Ralf Jung is working on an operational model for aliasing called Stacked Borrows.
Note: the Stacked Borrows model is implemented in Miri, and therefore you can validate your code against the Stacked Borrows model simply by executing your code in Miri. Of course Stacked Borrows is still experimental, so this doesn't guarantee anything.
What caution recommends
I personally subscribe to caution. Seeing as the exact model is unspecified, the rules are ever changing, and therefore I recommend taking the strictest interpretation possible.
Thus, I interpret the no-aliasing rule of &mut T as:
At any point in the code, there shall not be two accessible references in scope which alias the same memory if one of them is &mut T.
That is, I consider that forming a &mut T to an instance T for which another &T or &mut T is in scope without invalidating the aliases (via borrowing) is ill-formed.
It may very well be overly cautious, but at least if the aliasing model ends up being more conservative than planned, my code will still be valid.
