I am trying to create a disjoint set structure in Rust. It looks like this
struct DisjointSet<'s> {
id: usize,
parent: &'s mut DisjointSet<'s>,
}
The default disjoint set is a singleton structure, in which the parent refers to itself. Hence, I would like to have the option to do the following:
let a: DisjointSet = DisjointSet {
id: id,
parent: self,
};
where the self is a reference to the object that will be created.
I have tried working around this issue by creating a custom constructor. However, my attempts failed because partial initialization is not allowed. The compiler suggests using Option<DisjointSet<'s>>, but this is quite ugly. Do you have any suggestions?
My question differs from Structure containing fields that know each other
because I am interested in getting the reference to the struct that will be created.
As @delnan says, at their core these sorts of data structures are directed acyclic graphs (DAGs), with all the sharing that entails. Rust is strict about what sharing can happen, so it takes a bit of extra effort to convince the compiler to accept your code in this case.
Fortunately though, "all the sharing that entails" isn't literally "all the sharing": a DAG is acyclic (modulo wanting to have parent: self), so a reference counting type like Rc or Arc is a perfect way to handle the sharing (reference counting is not so good if there are cycles). Specifically:
struct DisjointSet {
id: Cell<usize>,
parent: Rc<DisjointSet>,
}
The Cell has zero runtime overhead (there is definitely some syntactic overhead) for such a small type.
Unfortunately, this still isn't quite right for the same reason that the compiler suggests using Option<...>. There's no way to create the first DisjointSet. However, the suggested fix still works:
struct DisjointSet {
id: Cell<usize>,
parent: Option<Rc<DisjointSet>>,
}
(The Option<...> is free: Option<Rc<...>> is a single nullable pointer, just like Rc<...> is a single non-nullable pointer, and presumably one would need a branch on "do I have a parent or not" anyway.)
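Both size claims can be checked with std::mem::size_of. This is only a quick sketch: the helper name layout_claims is made up for illustration, and the Option<Rc<...>> result reflects how current compilers lay out Rc rather than a guarantee I can cite.

```rust
use std::cell::Cell;
use std::mem::size_of;
use std::rc::Rc;

// Hypothetical helper: returns
// (Cell<usize> is as small as usize, Option<Rc<_>> is as small as Rc<_>).
fn layout_claims() -> (bool, bool) {
    (
        size_of::<Cell<usize>>() == size_of::<usize>(),
        size_of::<Option<Rc<usize>>>() == size_of::<Rc<usize>>(),
    )
}
```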
If you are going to take this approach, I would recommend not trying to use the Option for partial initialisation, but instead use it to represent the fact that the given set is a "root". It is easy to traverse up a chain with this representation, e.g.
fn find_root(mut x: &DisjointSet) -> &DisjointSet {
while let Some(ref parent) = x.parent {
x = parent
}
x
}
The same approach should work fine with references, but the lifetimes can often be hard to juggle.
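Putting the pieces together, a minimal sketch of the Option<Rc<...>> representation might look like the following. The constructors root and child are hypothetical names added for illustration, and the explicit &**parent only spells out the deref that the shorter version relies on.

```rust
use std::cell::Cell;
use std::rc::Rc;

struct DisjointSet {
    id: Cell<usize>,
    // `None` marks a root; `Some` points one step up the chain.
    parent: Option<Rc<DisjointSet>>,
}

impl DisjointSet {
    // Hypothetical constructor: a singleton set is simply a root.
    fn root(id: usize) -> Rc<DisjointSet> {
        Rc::new(DisjointSet { id: Cell::new(id), parent: None })
    }

    // Hypothetical constructor: a set whose parent is shared via `Rc`.
    fn child(id: usize, parent: &Rc<DisjointSet>) -> Rc<DisjointSet> {
        Rc::new(DisjointSet {
            id: Cell::new(id),
            parent: Some(Rc::clone(parent)),
        })
    }
}

// Walks up the parent chain until it reaches a node without a parent.
fn find_root(mut x: &DisjointSet) -> &DisjointSet {
    while let Some(ref parent) = x.parent {
        x = &**parent;
    }
    x
}
```

A root is its own representative here, so find_root on a root returns that same node without needing a self-reference.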
I want to create a function that provides a two step write and commit, like so:
// Omitting locking for brevity
struct States {
commited_state: u64,
// By reference is just a placeholder - I don't know how to do this
pending_states: HashSet<i64>
}
impl States {
fn read_dirty(&self) -> {
// Sum committed state and all non committed states
self.commited_state +
pending_states.into_iter().fold(sum_all_values).unwrap_or(0)
}
fn read_committed(&self) {
self.commited_state
}
}
let state_container = States::default();
async fn update_state(state_container: States, new_state: i64) -> Future {
// This is just pseudo code missing locking and such
// I'd like to add a reference to new_state
state_container.pending_states.insert(
new_state
)
async move {
// I would like to defer the commit
// I add the state to the commited state
state_container.commited_state += new_state;
// Then remove it *by reference* from the pending states
state_container.remove(new_state)
}
}
I'd like to be in a situation where I can call it like so
let commit_handler = update_state(state_container, 3).await;
// Do some external transactional stuff
third_party_transactional_service(...)?
// Commit if the above line does not error
commit_handler.await;
The problem I have is that HashMaps and HashSets hash entries based on their value and not on their actual reference, so I can't remove them by reference.
I appreciate this is a bit of a long question, but I'm just trying to give a bit more context as to what I'm trying to do. I know that in a typical database you'd generally have an atomic counter to generate the transaction ID, but that feels a bit overkill when the pointer reference would be enough.
However, I don't want to get the pointer value using unsafe, because it just seems a bit off to do something relatively simple.
Values in Rust don't have an identity like they do in other languages. You need to ascribe them an identity somehow. You've hit on two ways to do this in your question: an ID contained within the value, or the address of the value as a pointer.
Option 1: An ID contained in the value
It's trivial to have a usize ID with a static AtomicUsize (atomics have interior mutability).
use std::sync::atomic::{AtomicUsize, Ordering};
// No impl of clone/copy as we want these IDs to be unique.
#[derive(Debug, Hash, PartialEq, Eq)]
#[repr(transparent)]
pub struct OpaqueIdentifier(usize);
impl OpaqueIdentifier {
pub fn new() -> Self {
static COUNTER: AtomicUsize = AtomicUsize::new(0);
Self(COUNTER.fetch_add(1, Ordering::Relaxed))
}
pub fn id(&self) -> usize {
self.0
}
}
Now your map key becomes usize, and you're done.
Having this be a separate type that doesn't implement Copy or Clone allows you to have a concept of an "owned unique ID" and then every type with one of these IDs is forced not to be Copy, and a Clone impl would require obtaining a new ID.
(You can use a different integer type than usize. I chose it semi-arbitrarily.)
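As a rough sketch of how this could plug into the two-step write from the question: the States layout, the i64 value type, and the method names begin and commit below are all assumptions made for illustration, and locking and async are left out entirely.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Debug, Hash, PartialEq, Eq)]
pub struct OpaqueIdentifier(usize);

impl OpaqueIdentifier {
    pub fn new() -> Self {
        static COUNTER: AtomicUsize = AtomicUsize::new(0);
        Self(COUNTER.fetch_add(1, Ordering::Relaxed))
    }
    pub fn id(&self) -> usize {
        self.0
    }
}

#[derive(Default)]
struct States {
    committed_state: i64,
    // Keyed by ID, so two pending writes with equal values stay distinct.
    pending_states: HashMap<usize, i64>,
}

impl States {
    // Step 1: record the pending value and hand back its unique ticket.
    fn begin(&mut self, value: i64) -> OpaqueIdentifier {
        let ticket = OpaqueIdentifier::new();
        self.pending_states.insert(ticket.id(), value);
        ticket
    }

    // Step 2: consume the ticket and fold the value into the committed state.
    fn commit(&mut self, ticket: OpaqueIdentifier) {
        if let Some(value) = self.pending_states.remove(&ticket.id()) {
            self.committed_state += value;
        }
    }

    fn read_dirty(&self) -> i64 {
        self.committed_state + self.pending_states.values().sum::<i64>()
    }

    fn read_committed(&self) -> i64 {
        self.committed_state
    }
}
```

Because commit takes the ticket by value and the ticket is not Clone, the same pending write cannot be committed twice by accident.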
Option 2: A pointer to the value
This is more challenging in Rust since values in Rust are movable by default. In order for this approach to be viable, you have to remove this capability by pinning.
To make this work, both of the following must be true:
You pin the value you're using to provide identity, and
The pinned value is !Unpin (otherwise pinning still allows moves!), which can be forced by adding a PhantomPinned member to the value's type.
Note that the pin contract is only upheld if the object remains pinned for its entire lifetime. To enforce this, your factory for such objects should only dispense pinned boxes.
This could complicate your API as you cannot obtain a mutable reference to a pinned value without unsafe. The pin documentation has examples of how to do this properly.
Assuming that you have done all of this, you can then use *const T as the key in your map (where T is the pinned type). Note that conversion to a pointer is safe -- it's conversion back to a reference that isn't. So for a Pin<Box<T>> you can just use &*some_pin_box as *const T to obtain the pointer you'll use for lookup (get_ref is only available on Pin<&T>, so with a pinned box you would otherwise need some_pin_box.as_ref().get_ref()).
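For illustration only, a sketch of the pinned-box variant could look like this. The type Pending and the helpers address_of and register are hypothetical names, and nothing here dereferences the raw pointer, so no unsafe is needed.

```rust
use std::collections::HashMap;
use std::marker::PhantomPinned;
use std::pin::Pin;

struct Pending {
    value: i64,
    // Makes the type !Unpin, so a pinned value can no longer be moved out.
    _pin: PhantomPinned,
}

impl Pending {
    // The factory only hands out pinned boxes, as described above.
    fn new(value: i64) -> Pin<Box<Pending>> {
        Box::pin(Pending { value, _pin: PhantomPinned })
    }
}

// Converting to a raw pointer is safe; only dereferencing it would not be.
fn address_of(p: &Pin<Box<Pending>>) -> *const Pending {
    &**p as *const Pending
}

// Uses the heap address as the map key.
fn register(map: &mut HashMap<*const Pending, i64>, p: &Pin<Box<Pending>>) {
    map.insert(address_of(p), p.value);
}
```

Moving the Pin<Box<Pending>> handle itself is fine: the heap allocation stays where it is, so the key stays valid for as long as the box is alive.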
The pinned box approach comes with pretty significant drawbacks:
All values being used to provide identity have to be allocated on the heap (unless using local pinning, which is unlikely to be ergonomic here, even though the std::pin::pin! macro, stable since Rust 1.68, makes it simpler).
The implementation of the type providing identity has to accept self as a &Pin or &mut Pin, requiring unsafe code to mutate the contents.
In my opinion, it's not even a good semantic fit for the problem. "Location in memory" and "identity" are different things, and it's only kind of by accident that the former can sometimes be used to implement the latter. It's a bit silly that moving a value in memory would change its identity, no?
I'd just go with adding an ID to the value. This is a substantially more obvious pattern, and it has no serious drawbacks.
To illustrate the necessity of Rc<T>, the Book presents the following snippet (spoiler: it won't compile) to show that we cannot enable multiple ownership without Rc<T>.
enum List {
Cons(i32, Box<List>),
Nil,
}
use crate::List::{Cons, Nil};
fn main() {
let a = Cons(5, Box::new(Cons(10, Box::new(Nil))));
let b = Cons(3, Box::new(a));
let c = Cons(4, Box::new(a));
}
It then claims (emphasis mine)
We could change the definition of Cons to hold references instead, but then we would have to specify lifetime parameters. By specifying lifetime parameters, we would be specifying that every element in the list will live at least as long as the entire list. The borrow checker wouldn’t let us compile let a = Cons(10, &Nil); for example, because the temporary Nil value would be dropped before a could take a reference to it.
Well, not quite. The following snippet compiles under rustc 1.52.1
enum List<'a> {
Cons(i32, &'a List<'a>),
Nil,
}
use crate::List::{Cons, Nil};
fn main() {
let a = Cons(5, &Cons(10, &Nil));
let b = Cons(3, &a);
let c = Cons(4, &a);
}
Note that by taking a reference, we no longer need a Box<T> indirection to hold the nested List. Furthermore, I can point both b and c to a, which gives a multiple conceptual owners (which are actually borrowers).
Question: why do we need Rc<T> when immutable references can do the job?
With "ordinary" borrows you can very roughly think of a statically proven order-by-relationship, where the compiler needs to prove that the owner of something always comes to life before any borrows and always dies after all borrows died (a owns String, it comes to life before b which borrows a, then b dies, then a dies; valid). For a lot of use-cases, this can be done, which is Rust's insight to make the borrow-system practical.
There are cases where this can't be done statically. In the example you've given, you're sort of cheating, because all borrows have a 'static-lifetime; and 'static items can be "ordered" before or after anything out to infinity because of that - so there actually is no constraint in the first place. The example becomes much more complex when you take different lifetimes (many List<'a>, List<'b>, etc.) into account. This issue will become apparent when you try to pass values into functions and those functions try to add items. This is because values created inside functions will die after leaving their scope (i.e. when the enclosing function returns), so we cannot keep a reference to them afterwards, or there will be dangling references.
Rc comes in when one can't prove statically who is the original owner, whose lifetime starts before any other and ends after any other(!). A classic example is a graph structure derived from user input, where multiple nodes can refer to one other node. They need to form a "born after, dies before" relationship with the node they are referencing at runtime, to guarantee that they never reference invalid data. The Rc is a very simple solution to that because a simple counter can represent these relationships. As long as the counter is not zero, some "born after, dies before" relationship is still active. The key insight here is that it does not matter in which order the nodes are created and die, because any order is valid. Only the points on either end, where the counter gets to 0, are actually important; any increase or decrease in between is the same (0=+1+1+1-1-1-1=0 is the same as 0=+1+1-1+1-1-1=0). The value inside the Rc is destroyed when the counter reaches zero. In the graph example this is when a node is not being referred to any longer. This tells the owner of that Rc (the last node referring) "Oh, it turns out I am the owner of the underlying node - nobody knew! - and I get to destroy it".
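A small sketch of that counting, reusing the cons list from the question with Rc in place of references. The function names shared_tail and sum are illustrative only. The tail is built inside a function and returned, which is exactly the case where the &'a List<'a> version stops working.

```rust
use std::rc::Rc;

enum List {
    Cons(i32, Rc<List>),
    Nil,
}

// The tail is created inside this function and outlives the call,
// because ownership is shared through the reference count.
fn shared_tail() -> Rc<List> {
    Rc::new(List::Cons(5, Rc::new(List::Cons(10, Rc::new(List::Nil)))))
}

// Adds up every element by following the `Rc` links.
fn sum(list: &List) -> i32 {
    match list {
        List::Cons(value, rest) => *value + sum(rest),
        List::Nil => 0,
    }
}
```

Each list that points at the shared tail bumps the counter by one, and dropping any of them, in any order, decrements it again.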
Even single-threaded, there are still times the destruction order is determined dynamically, whereas for the borrow checker to work, there must be a determined lifetime tree (stack).
use std::rc::Rc;
fn run() {
let writer = Rc::new(std::io::sink());
let mut counters = vec![
(7, Rc::clone(&writer)),
(7, writer),
];
while !counters.is_empty() {
let idx = read_counter_index();
counters[idx].0 -= 1;
if counters[idx].0 == 0 {
counters.remove(idx);
}
}
}
fn read_counter_index() -> usize {
unimplemented!()
}
As you can see in this example, the order of destruction is determined by user input.
Another reason to use smart pointers is simplicity. The borrow checker does incur some code complexity. For example, using a smart pointer, you are able to maneuver around the self-referential struct problem with a tiny overhead.
struct SelfRefButDynamic {
a: Rc<u32>,
b: Rc<u32>,
}
impl SelfRefButDynamic {
pub fn new() -> Self {
let a = Rc::new(0);
let b = Rc::clone(&a);
Self { a, b }
}
}
This is not possible with static (compile-time) references:
struct WontDo {
a: u32,
b: &u32,
}
Given a struct like so:
pub struct MyStruct<'a> {
id: u8,
other: &'a OtherStruct,
}
I want to partially initialize it with an id field, then assign to other reference field afterwards. Note: For what I'm showing in this question, it seems extremely unnecessary to do this, but it is necessary in the actual implementation.
The Rust documentation talks about initializing a struct field-by-field, which would be done like so:
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;
fn get_struct<'a>(other: &'a OtherStruct) -> MyStruct<'a> {
let mut uninit: MaybeUninit<MyStruct<'a>> = MaybeUninit::uninit();
let ptr = uninit.as_mut_ptr();
unsafe {
addr_of_mut!((*ptr).id).write(8);
addr_of_mut!((*ptr).other).write(other);
uninit.assume_init()
}
}
OK, so that's a possibility and it works, but is it necessary? Is it safe to instead do the following, which also seems to work?
fn get_struct2<'a>(other: &'a OtherStruct) -> MyStruct<'a> {
let mut my_struct = MyStruct {
id: 8,
other: unsafe { MaybeUninit::uninit().assume_init() },
};
my_struct.other = other;
my_struct
}
Note the first way causes no warnings and the second one gives the following warning...
other: unsafe { MaybeUninit::uninit().assume_init() },
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
this code causes undefined behavior when executed
help: use `MaybeUninit<T>` instead, and only call `assume_init` after initialization is done
...which makes sense because if the other field were accessed that could cause problems.
From having almost no understanding of this, I'm guessing that for the second way it's initially defining a struct that has its other reference pointing at whatever location in memory, but once a valid reference is assigned it should be good. Is that correct? I'm thinking it might matter for situations like if there was a struct or enum that wasn't initialized due to compiler optimizations so wrapping in MaybeUninit would prevent those optimizations, but is it ok for a reference? I'm never accessing the reference until it's assigned to.
Edit: Also, I know this could also be solved by using an Option or some other container for initialization in the private API of the struct, but let's skip over that.
It's undefined behavior (see "What Every C Programmer Should Know About Undefined Behavior", which applies equally to Rust code that uses unsafe). Quoting the relevant rules:
Behavior considered undefined
A reference or Box that is dangling, unaligned, or points to an invalid value.
Note:
Undefined behavior affects the entire program. For example, calling a function in C that exhibits undefined behavior of C means your entire program contains undefined behaviour that can also affect the Rust code. And vice versa, undefined behavior in Rust can cause adverse affects on code executed by any FFI calls to other languages.
Dangling pointers
A reference/pointer is "dangling" if it is null or not all of the bytes it points to are part of the same allocation (so in particular they all have to be part of some allocation). The span of bytes it points to is determined by the pointer value and the size of the pointee type (using size_of_val). As a consequence, if the span is empty, "dangling" is the same as "null". Note that slices and strings point to their entire range, so it is important that the length metadata is never too large. In particular, allocations and therefore slices and strings cannot be bigger than isize::MAX bytes.
The reference book
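Following the compiler's hint, one way to keep the "not yet assigned" state sound is to hold it in a MaybeUninit until the real reference has been written, and only then build the struct. This is a sketch under assumptions: OtherStruct's value field is invented so the example is self-contained, and the local slot stands in for whatever deferred assignment the real code needs.

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for the question's `OtherStruct`.
pub struct OtherStruct {
    pub value: u32,
}

pub struct MyStruct<'a> {
    id: u8,
    other: &'a OtherStruct,
}

fn get_struct<'a>(other: &'a OtherStruct) -> MyStruct<'a> {
    // No reference exists yet; the slot is just uninitialized storage.
    let mut slot: MaybeUninit<&'a OtherStruct> = MaybeUninit::uninit();

    // ... other setup could happen here without touching `slot` ...

    slot.write(other);

    // SAFETY: `slot` was initialized by the `write` call above.
    let other = unsafe { slot.assume_init() };
    MyStruct { id: 8, other }
}
```

The difference from get_struct2 is that a value of type &OtherStruct never exists before it is valid; only the MaybeUninit wrapper does.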
I'm implementing an object that owns several resources created from C libraries through FFI. In order to clean up what's already been done if the constructor panics, I'm wrapping each resource in its own struct and implementing Drop for them. However, when it comes to dropping the object itself, I cannot guarantee that resources will be dropped in a safe order because Rust doesn't define the order that a struct's fields are dropped.
Normally, you would solve this by making it so the object doesn't own the resources but rather borrows them (so that the resources may borrow each other). In effect, this pushes the problem up to the calling code, where the drop order is well defined and enforced with the semantics of borrowing. But this is inappropriate for my use case and in general a bit of a cop-out.
What's infuriating is that this would be incredibly easy if drop took self instead of &mut self for some reason. Then I could just call std::mem::drop in my desired order.
Is there any way to do this? If not, is there any way to clean up in the event of a constructor panic without manually catching and repanicking?
You can specify drop order of your struct fields in two ways:
Implicitly
I wrote RFC 1857 specifying drop order and it was merged 2017/07/03! According to the RFC, struct fields are dropped in the same order as they are declared.
You can check this by running the example below
struct PrintDrop(&'static str);
impl Drop for PrintDrop {
fn drop(&mut self) {
println!("Dropping {}", self.0)
}
}
struct Foo {
x: PrintDrop,
y: PrintDrop,
z: PrintDrop,
}
fn main() {
let foo = Foo {
x: PrintDrop("x"),
y: PrintDrop("y"),
z: PrintDrop("z"),
};
}
The output should be:
Dropping x
Dropping y
Dropping z
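If you would rather assert the order than read it off the console, a variant that records drops into a shared log is sketched below. RecordDrop, Log and drop_order are illustrative names, not anything from the RFC.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

// Pushes its name onto the shared log when dropped.
struct RecordDrop {
    name: &'static str,
    log: Log,
}

impl Drop for RecordDrop {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

struct Foo {
    x: RecordDrop,
    y: RecordDrop,
    z: RecordDrop,
}

// Builds a `Foo`, drops it, and returns the order in which its fields were dropped.
fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let foo = Foo {
        x: RecordDrop { name: "x", log: Rc::clone(&log) },
        y: RecordDrop { name: "y", log: Rc::clone(&log) },
        z: RecordDrop { name: "z", log: Rc::clone(&log) },
    };
    drop(foo);
    let order = log.borrow().clone();
    order
}
```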
Explicitly
RFC 1860 introduces the ManuallyDrop type, which wraps another type and disables its destructor. The idea is that you can manually drop the object by calling a special function (ManuallyDrop::drop). This function is unsafe, since memory is left uninitialized after dropping the object.
You can use ManuallyDrop to explicitly specify the drop order of your fields in the destructor of your type:
// ManuallyDrop is stable since Rust 1.20, so no #![feature(manually_drop)] gate is needed.
use std::mem::ManuallyDrop;
struct Foo {
x: ManuallyDrop<String>,
y: ManuallyDrop<String>
}
impl Drop for Foo {
fn drop(&mut self) {
// Drop in reverse order!
unsafe {
ManuallyDrop::drop(&mut self.y);
ManuallyDrop::drop(&mut self.x);
}
}
}
fn main() {
Foo {
x: ManuallyDrop::new("x".into()),
y: ManuallyDrop::new("y".into())
};
}
If you need this behavior without being able to use either of the newer methods, keep on reading...
The issue with drop
The drop method cannot take its parameter by value, since the parameter would be dropped again at the end of the scope. This would result in infinite recursion for all destructors of the language.
A possible solution/workaround
A pattern that I have seen in some codebases is to wrap the values that are being dropped in an Option<T>. Then, in the destructor, you can replace each option with None and drop the resulting value in the right order.
For instance, in the scoped-threadpool crate, the Pool object contains threads and a sender that will schedule new work. In order to join the threads correctly upon dropping, the sender should be dropped first and the threads second.
pub struct Pool {
threads: Vec<ThreadData>,
job_sender: Option<Sender<Message>>
}
impl Drop for Pool {
fn drop(&mut self) {
// By setting job_sender to `None`, the job_sender is dropped first.
self.job_sender = None;
}
}
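A self-contained sketch of the same idea with a single worker thread, using only the standard library. This is not the scoped-threadpool code; Worker and its methods are hypothetical. The point is that the sender must go away before the join, otherwise the worker's receive loop never ends.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

struct Worker {
    job_sender: Option<Sender<u32>>,
    handle: Option<JoinHandle<u32>>,
}

impl Worker {
    fn spawn() -> Worker {
        let (tx, rx) = channel::<u32>();
        // The worker sums jobs until every sender has been dropped.
        let handle = thread::spawn(move || rx.iter().sum::<u32>());
        Worker { job_sender: Some(tx), handle: Some(handle) }
    }

    // Returns false if the worker has already been shut down.
    fn send(&self, job: u32) -> bool {
        self.job_sender
            .as_ref()
            .map_or(false, |tx| tx.send(job).is_ok())
    }

    // Drops the sender first, then joins the thread.
    fn finish(&mut self) -> Option<u32> {
        drop(self.job_sender.take());
        self.handle.take().and_then(|h| h.join().ok())
    }
}

impl Drop for Worker {
    fn drop(&mut self) {
        // Same order on implicit drop; a second call is a no-op.
        let _ = self.finish();
    }
}
```

Option::take leaves None behind, which is what lets the destructor move the fields out through &mut self.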
A note on ergonomics
Of course, doing things this way is more of a workaround than a proper solution. Also, if the optimizer cannot prove that the option will always be Some, you now have an extra branch for each access to your struct field.
Fortunately, nothing prevents a future version of Rust from implementing a feature that allows specifying drop order. It would probably require an RFC, but it certainly seems doable. There is an ongoing discussion on the issue tracker about specifying drop order for the language, though it has been inactive for the last few months.
A note on safety
If destroying your structs in the wrong order is unsafe, you should probably consider making their constructors unsafe and document this fact (in case you haven't done that already). Otherwise it would be possible to trigger unsafe behavior just by creating the structs and letting them fall out of scope.
I'm new to Rust (using 0.10) and exploring its use by implementing something like the rustc::middle::graph::Graph struct, but using strings as node indices and storing nodes in a HashMap.
Assuming non-static keys, what's a reasonable and efficient policy for ownership of the strings? Does the HashMap need to own its keys? Does each NodeIndex need to own its str? Is it possible for the node to own the string that defines its index and have everything else borrow that string? More generally, how should one share an immutable (but non-static) string amongst several data structures? If the answer is "it depends", what are the relevant issues?
If it is possible to have ownership of the string in one place and borrow it elsewhere, how is that accomplished? For example, if the Node struct were modified to store the node index as a string, how would the HashMap and NodeIndex use a borrowed version of it?
Is it possible for the node to own the string that defines its index and have everything else borrow that string?
[...]
If it is possible to have ownership of the string in one place and borrow it elsewhere, how is that accomplished? For example, if the Node struct were modified to store the node index as a string, how would the HashMap and NodeIndex use a borrowed version of it?
Not really: it's impossible (for the compiler) to verify that self-references don't get invalidated, i.e. many borrowing-internally situations (including this one specifically) end up allowing code similar to
struct Foo<'a> {
things: Vec<~str>,
borrows: Vec<&'a str>
}
let mut foo = Foo { ... };
foo.things.push(~"x");
foo.things.push(~"y");
// foo.things is [~"x", ~"y"]
// try to borrow the ~"y" to put a "y" into borrows
foo.borrows.push(foo.things.get(1).as_slice());
// ... time/SLoC passes ...
// we make the borrowed slice point to freed memory
// by popping/deallocating the ~"y"
foo.things.pop();
The compiler has a very hard time telling that an arbitrary modification won't be like the .pop call and invalidate the internal pointers. Basically, putting a self-reference to data that an object owns into the object itself would have to freeze that object, so no further modifications could occur (including moving out of the struct, e.g. to return it).
Tl;dr: you can't have one part of the Graph storing ~strs and another part storing &str slices into those ~strs.
That said, if you were to use some unsafe code, and only allow the Graph to be expanded, i.e. never removing nodes (i.e. never letting a ~str be deallocated until the whole graph is being destroyed), then some version of this could actually work.
More generally, how should one share an immutable (but non-static) string amongst several data structures?
You can use a Rc<~str>, with each data structure storing its own pointer. Something like
struct Graph<T> {
nodes: HashMap<Rc<~str>, Node<T>>
}
struct Node<T> {
name: Rc<~str>,
data: T,
outgoing_edges: Vec<Rc<~str>>
}
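The snippets in this answer use Rust 0.10 syntax. In current Rust the closest equivalent of Rc<~str> is Rc<str>, so a rough modern sketch of the same layout might look like this. The methods new, add_node and add_edge are hypothetical additions for illustration.

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct Node<T> {
    name: Rc<str>,
    data: T,
    outgoing_edges: Vec<Rc<str>>,
}

struct Graph<T> {
    nodes: HashMap<Rc<str>, Node<T>>,
}

impl<T> Graph<T> {
    fn new() -> Self {
        Graph { nodes: HashMap::new() }
    }

    // Allocates the string once; the map key and the node share it.
    fn add_node(&mut self, name: &str, data: T) -> Rc<str> {
        let name: Rc<str> = Rc::from(name);
        let node = Node {
            name: Rc::clone(&name),
            data,
            outgoing_edges: Vec::new(),
        };
        self.nodes.insert(Rc::clone(&name), node);
        name
    }

    // Reuses the target's existing key instead of allocating a new string.
    // Returns false if either endpoint is unknown.
    fn add_edge(&mut self, from: &str, to: &str) -> bool {
        let target = match self.nodes.get_key_value(to) {
            Some((key, _)) => Rc::clone(key),
            None => return false,
        };
        match self.nodes.get_mut(from) {
            Some(node) => {
                node.outgoing_edges.push(target);
                true
            }
            None => false,
        }
    }
}
```

Lookups by plain &str work because Rc<str> implements Borrow<str>, so callers never need to build an Rc just to query the map.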
Another approach would be using a bidirectional map connecting strings with an "interned" index of a simple type, something like
struct Graph<T> {
index: BiMap<~str, Index>,
nodes: HashMap<Index, Node<T>>
}
#[deriving(Hash, Eq, TotalEq)]
struct Index { x: uint }
struct Node<T> {
name: Index,
data: T,
outgoing_edges: Vec<Index>
}
(Unfortunately this is hypothetical: Rust's stdlib doesn't have a BiMap type like this (yet...).)
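In the absence of a BiMap, the interning half can be approximated with a HashMap in one direction and a Vec in the other. The sketch below is in current Rust rather than 0.10, and Interner with its methods is a made-up name. It stores each string twice, which a real bidirectional map might avoid.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Index(usize);

#[derive(Default)]
struct Interner {
    by_name: HashMap<String, Index>, // string -> index
    names: Vec<String>,              // index -> string
}

impl Interner {
    // Returns the existing index for `name`, or assigns the next free one.
    fn intern(&mut self, name: &str) -> Index {
        if let Some(&idx) = self.by_name.get(name) {
            return idx;
        }
        let idx = Index(self.names.len());
        self.names.push(name.to_owned());
        self.by_name.insert(name.to_owned(), idx);
        idx
    }

    // Reverse lookup; `None` for an index this interner never handed out.
    fn name(&self, idx: Index) -> Option<&str> {
        self.names.get(idx.0).map(String::as_str)
    }
}
```

Nodes and edges can then store the small Copy index, and only the interner owns string data.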