I would like to create a non binary tree structure in Rust. Here is a try
struct TreeNode<T> {
tag : T,
father : Weak<TreeNode<T>>,
childrenlists : [Rc<TreeNode<T>>]
}
Unfortunately, this does not compile.
main.rs:4:1: 8:2 error: the trait `core::marker::Sized` is not implemented for the type `[alloc::rc::Rc<TreeNode<T>>]` [E0277]
main.rs:4 struct TreeNode<T> {
main.rs:5 tag : T,
main.rs:6 father : Weak<TreeNode<T>>,
main.rs:7 childrenlist : [Rc<TreeNode<T>>]
main.rs:8 }
main.rs:4:1: 8:2 note: `[alloc::rc::Rc<TreeNode<T>>]` does not have a constant size known at compile-time
main.rs:4 struct TreeNode<T> {
main.rs:5 tag : T,
main.rs:6 father : Weak<TreeNode<T>>,
main.rs:7 childrenlist : [Rc<TreeNode<T>>]
main.rs:8 }
error: aborting due to previous error
The code compiles if we replace an array with a Vec. However, the structure is immutable and I do not need an overallocated Vec.
I heard it could be possible to have a struct field with size unknown at compile time, provided it is unique. How can we do it?
Rust doesn't have the concept of a variable-length (stack) array, which you seem to be trying to use here.
Rust has a couple different array-ish types.
Vec<T> ("vector"): Dynamically sized; dynamically allocated on the heap. This is probably what you want to use. Initialize it with Vec::with_capacity(foo) to avoid overallocation (this creates an empty vector with the given capacity).
[T; n] ("array"): Statically sized; lives on the stack. You need to know the size at compile time, so this won't work for you (unless I've misanalysed your situation).
[T] ("slice"): Unsized; usually used from &[T]. This is a view into a contiguous set of Ts in memory somewhere. You can get it by taking a reference to an array, or a vector (called "taking a slice of an array/vector"), or even taking a view into a subset of the array/vector. Being unsized, [T] can't be used directly as a variable (it can be used as a member of an unsized struct), but you can view it from behind a pointer. Pointers referring to [T] are fat ; i.e. they have an extra field for the length. &[T] would be useful if you want to store a reference to an existing array; but I don't think that's what you want to do here.
If you don't know the size of the list in advance, you have two choices:
&[T] which is just a reference to some piece of memory that you don't own
Vec<T> which is your own storage.
The correct thing here is to use a Vec. Why? Because you want the children list (array of Rc) to be actually owned by the TreeNode. If you used a &[T], it means that someone else would be keeping the list, not the TreeNode. With some lifetime trickery, you could write some valid code but you would have to go very far to please the compiler because the borrowed reference would have to be valid at least as long as the TreeNode.
Finally, a sentence in your question shows a misunderstanding:
However, the structure is immutable and I do not need an overallocated Vec.
You confuse mutability and ownership. Sure you can have an immutable Vec. It seems like you want to avoid allocating memory from the heap, but that's not possible, precisely because you don't know the size of the children list. Now if you're concerned with overallocating, you can fine-tune the vector storage with methods like with_capacity() and shrink_to_fit().
A final note: if you actually know the size of the list because it is fixed at compile time, you just need to use a [T; n] where n is compile-time known. But that's not the same as [T].
Related
I want to create a struct that will hold a bunch of Path references for later lookup.
I first started with this:
struct AbsoluteSystemPaths{
home_folder:Path,
shortcuts_file:Path
}
The compiler complained about -> Path doesn't have a size known at compile-time.
It is sensible, a struct can only have as much as one non sized var member and I added 2 (home_folder and shortcuts_file). So, I decided to move them to references:
struct AbsoluteSystemPaths{
home_folder:& Path,
shortcuts_file: &Path
}
That made the sized error go away but brought another one related to lifetimes: "Missing lifetime specifier" for the two var members of type Path.
And again, it is pretty sensible, Rust needs to know how much time this 2 references will live to avoid dangling references or memory leaks.
So, I added the life time specifiers:
struct AbsoluteSystemPaths<'a> {
home_folder:&'a Path,
shortcuts_file: &'a Path
}
and then everything compiled without errors.
What I would like to know is:
Is my general reasoning correct?
The lifetime annotations I added could be read as: Both path references will last at least as 'a that is the lifetime of an instance contrusted from AbsoluteSystemPaths struct?
Yes, your understanding is correct.
One thing to note with Path, as it is an unsized type as you've mentioned, is that the standard library provides a struct called PathBuf that represents "An owned, mutable path (akin to String).". This is an alternative to using &Path if you'd like your struct to own the Path instead of hold references to them. This may be useful depending on the way you're using your struct.
I'm not looking for code samples. I want to state my understanding of Box vs. Rc and have you tell me if my understanding is right or wrong.
Let's say I have some trait ChattyAnimal and a struct Cat that implements this trait, e.g.
pub trait ChattyAnimal {
fn make_sound(&self);
}
pub struct Cat {
pub name: String,
pub sound: String
}
impl ChattyAnimal for Cat {
fn make_sound(&self) {
println!("Meow!");
}
}
Now let's say I have other structs (Dog, Cow, Chicken, ...) that also implement the ChattyAnimal trait, and let's say I want to store all of these in the same vector.
So step 1 is I would have to use a Box type, because the Rust compiler cannot determine the size of everything that might implement this trait. And therefore, we must store these items on the heap – viola using a Box type, which is like a smarter pointer in C++. Anything wrapped with Box is automatically deleted by Rust when it goes out of scope.
// I can alias and use my Box type that wraps the trait like this:
pub type BoxyChattyAnimal = Box<dyn ChattyAnimal>;
// and then I can use my type alias, i.e.
pub struct Container {
animals: Vec<BoxyChattyAnimal>
}
Meanwhile, with Box, Rust's borrow checker requires changing when I pass or reassign the instance. But if I actually want to have multiple references to the same underlying instance, I have to use Rc. And so to have a vector of ChattyAnimal instances where each instance can have multiple references, I would need to do:
pub type RcChattyAnimal = Rc<dyn ChattyAnimal>;
pub struct Container {
animals: Vec<RcChattyAnimal>
}
One important take away from this is that if I want to have a vector of some trait type, I need to explicitly set that vector's type to a Box or Rc that wraps my trait. And so the Rust language designers force us to think about this in advance so that a Box or Rc cannot (at least not easily or accidentally) end up in the same vector.
This feels like a very and well thought design – helping prevent me from introducing bugs in my code. Is my understanding as stated above correct?
Yes, all this is correct.
There's a second reason for this design: it allows the compiler to verify that the operations you're performing on the vector elements are using memory in a safe way, relative to how they're stored.
For example, if you had a method on ChattyAnimal that mutates the animal (i.e. takes a &mut self argument), you could call that method on elements of a Vec<Box<dyn ChattyAnimal>> as long as you had a mutable reference to the vector; the Rust compiler would know that there could only be one reference to the ChattyAnimal in question (because the only reference is inside the Box, which is inside the Vec, and you have a mutable reference to the Vec so there can't be any other references to it). If you tried to write the same code with a Vec<Rc<dyn ChattyAnimal>>, the compiler would complain; it wouldn't be able to completely eliminate the possibility that your code might be mutating the animal at the same time as the code that called it was in the middle of trying to read the animal, which might lead to some inconsistencies in the calling code.
As a consequence, the compiler needs to know that all the elements of the Vec have their memory treated in the same way, so that it can check to make sure that a reference to some arbitrary element of the Vec is being used appropriately.
(There's a third reason, too, which is performance; because the compiler knows that this is a "vector of Boxes" or "vector of Rcs", it can generate code that assumes a particular storage mechanism. For example, if you have a vector of Rcs, and clone one of the elements, the machine code that the compiler generates will work simply by going to the memory address listed in the vector and adding 1 to the reference count stored there – there's no need for any extra levels of indirection. If the vector were allowed to mix different allocation schemes, the generated code would have to be a lot more complex, because it wouldn't be able to assume things like "there is a reference count", and would instead need to (at runtime) find the appropriate piece of code for dealing with the memory allocation scheme in use, and then run it; that would be much slower.)
This question already has answers here:
Why is it useful to use PhantomData to inform the compiler that a struct owns a generic if I already implement Drop?
(2 answers)
Closed 5 years ago.
How does PhantomData work in Rust? In the Nomicon it says the following:
In order to tell dropck that we do own values of type T, and therefore may drop some T's when we drop, we must add an extra PhantomData saying exactly that.
To me that seems to imply that when we add a PhantomData field to a structure, say in the case of a Vec.
pub struct Vec<T> {
data: *mut T,
length: usize,
capacity: usize,
phantom: PhantomData<T>,
}
that the drop checker should forbid the following sequence of code:
fn main() -> () {
let mut vector = Vec::new();
let x = Box::new(1 as i32);
let y = Box::new(2 as i32);
let z = Box::new(3 as i32);
vector.push(x);
vector.push(y);
vector.push(z);
}
Since the freeing of x, y, and z would occur before the freeing of the Vec, I would expect some complaint from the compiler. However, if you run the code above there is no warning or error.
The PhantomData<T> within Vec<T> (held indirectly via a Unique<T> within RawVec<T>) communicates to the compiler that the vector may own instances of T, and therefore the vector may run destructors for T when the vector is dropped.
Deep dive: We have a combination of factors here:
We have a Vec<T> which has an impl Drop (i.e. a destructor implementation).
Under the rules of RFC 1238, this would usually imply a relationship between instances of Vec<T> and any lifetimes that occur within T, by requiring that all lifetimes within T strictly outlive the vector.
However, the destructor for Vec<T> specifically opts out of this semantics for just that destructor (of Vec<T> itself) via the use of special unstable attributes (see RFC 1238 and RFC 1327). This allows for a vector to hold references that have the same lifetime of the vector itself. This is considered sound; after all, the vector itself will not dereference data pointed to by such references (all its doing is dropping values and deallocating the backing array), as long as an important caveat holds.
The important caveat: While the vector itself will not dereference pointers within its contained values while destructing itself, it will drop the values held by the vector. If those values of type T themselves have destructors, those destructors for T get run. And if those destructors access the data held within their references, then we would have a problem if we allowed dangling pointers within those references.
So, diving in even more deeply: the way that we confirm dropck validity for a given structure S, we first double check if S itself has an impl Drop for S (and if so, we enforce rules on S with respect to its type parameters). But even after that step, we then recursively descend into the structure of S itself, and double check for each of its fields that everything is kosher according to dropck. (Note that we do this even if a type parameter of S is tagged with #[may_dangle].)
In this specific case, we have a Vec<T> which (indirectly via RawVec<T>/Unique<T>) owns a collection of values of type T, represented in a raw pointer *const T. However, the compiler attaches no ownership semantics to *const T; that field alone in a structure S implies no relationship between S and T, and thus enforces no constraint in terms of the relationship of lifetimes within the types S and T (at least from the viewpoint of dropck).
Therefore, if the Vec<T> had solely a *const T, the recursive descent into the structure of the vector would fail to capture the ownership relation between the vector and the instances of T contained within the vector. That, combined with the #[may_dangle] attribute on T, would cause the compiler to accept unsound code (namely cases where destructors for T end up trying to access data that has already been deallocated).
BUT: Vec<T> does not solely contain a *const T. There is also a PhantomData<T>, and that conveys to the compiler "hey, even though you can assume (due to the #[may_dangle] T) that the destructor for Vec won't access data of T when the vector is dropped, it is still possible that some destructor of T itself will access data of T as the vector is dropped."
The end effect: Given Vec<T>, if T doesn't have a destructor, then the compiler provides you with more flexibility (namely, it allows a vector to hold data with references to data that lives for the same amount of time as the vector itself, even though such data may be torn down before the vector is). But if T does have a destructor (and that destructor is not otherwise communicating to the compiler that it won't access any referenced data), then the compiler is more strict, requiring any referenced data to strictly outlive the vector (thus ensuring that when the destructor for T runs, all the referenced data will still be valid).
Can someone explain to me why Rc<> is not Copy?
I'm writing code that uses a lot of shared pointers, and having to type .clone() all the time is getting on my nerves.
It seems to me that Rc<> should just consist of a pointer, which is a fixed size, so the type itself should be Sized and hence Copy, right?
Am I missing something?
It seems to me that Rc<> should just consist of a pointer, which is a fixed size, so the type itself should be Sized and hence Copy, right?
This is not quite true. Rc is short for Reference Counted. This means that the type keeps track of how many references point to the owned data. That way we can have multiple owners at the same time and safely free the data, once the reference count reaches 0.
But how do we keep the reference counter valid and up to date? Exactly, we have to do something whenever a new reference/owner is created and whenever a reference/owner is deleted. Specifically, we have to increase the counter in the former case and decrease it in the latter.
The counter is decreased by implementing Drop, the Rust equivalent of a destructor. This drop() function is executed whenever a variable goes out of scope – perfect for our goal.
But when do we do the increment? You guessed it: in clone(). The Copy trait, by definition, says that a type can be duplicated just by copying bits:
Types that can be copied by simply copying bits (i.e. memcpy).
This is not true in our case, because: yes, we "just copy bits", but we also do additional work! We do need to increment our reference counter!
Drop impl of Rc
Clone impl of Rc
A type cannot implement Copy if it implements Drop (source). Since Rc does implement it to decrement its reference count, it is not possible.
In addition, Rc is not just a pointer. It consists of a Shared:
pub struct Rc<T: ?Sized> {
ptr: Shared<RcBox<T>>,
}
Which, in turn, is not only a pointer:
pub struct Shared<T: ?Sized> {
pointer: NonZero<*const T>,
_marker: PhantomData<T>,
}
PhantomData is needed to express the ownership of T:
this marker has no consequences for variance, but is necessary for
dropck to understand that we logically own a T.
For details, see:
https://github.com/rust-lang/rfcs/blob/master/text/0769-sound-generic-drop.md#phantom-data
I'm trying to make a graph with adjacency lists, but I can't figure out how to specify an appropriate lifetime for the references in the adjacency list.
What I'm trying to get at is the following:
struct Graph<T> {
nodes : Vec<T>,
adjacencies : Vec<Vec<&T>>
}
This won't work because there is a lifetime specifier missing for the reference type. I suppose I could use indices for the adjacencies, but I'm actually interested in the internal reference problem, and this is just a vehicle to express that problem.
The way I see it, this should be possible to do safely, since the nodes are owned by the object. It should be allowed to keep references to those nodes around.
Am I right? How can this be done in Rust? Or, if I'm wrong, what did I miss?
It is not possible to represent this concept in Rust with just references due to Rust’s memory safety—such an object could not be constructed without already existing. As long as nodes and adjacencies are stored separately, it’s OK, but as soon as you try to join them inside the same structure, it can’t be made to work thus.
The alternatives are using reference counting (Rc<T> if immutable is OK or Rc<RefCell<T>> with inner mutability) or using unsafe pointers (*const T or *mut T).
I think there is an issue here. If we transpose this to C++:
template <typename T>
struct Graph {
std::vector<T> nodes;
std::vector<std::vector<T*>> adjacencies;
};
where adjacencies points into nodes, then one realizes that there is an issue: an operation on nodes that invalidates references (such as reallocation) will leave dangling pointers into adjacencies.
I see at least two ways to fix this:
Use indexes in adjacencies, as those are stable
Use a level of indirection in nodes, so that the memory is pinned
In Rust, this gives:
struct Graph<T> {
nodes: Vec<T>,
adjacencies: Vec<Vec<uint>>,
};
struct Graph<T> {
nodes: Vec<Rc<RefCell<T>>>,
adjacencies: Vec<Vec<Rc<RefCell<T>>>>,
};
Note: the RefCell is to allow mutating T despite it being aliased, by introducing a runtime check.