Is transmuting (T, ()) to T safe? - rust

I am implementing an alternative to a BTreeMap<K, V>. On top of this I'm building a BTreeSet, which is a wrapper around type MyBTreeSetContents<T> = MyBTreeMap<T, ()>.
Internally, leaf nodes of this BTree contain a Vec<(K, V)> of values.
In the case of the BTreeSet, this thus becomes a Vec<(K, ())>.
I want to provide a fast iterator over references of the values in the BTreeSet. An iterator that produces &T. But the best I can get so far without reaching for transmute is an iterator that produces &(T, ()).
So therefore the question:
Is the memory representation of K, (K, ) and (K, ()) the same?
Is it therefore OK to transmute between (K, ()) and K?
And by extension, is it OK to transmute the Vec<(K, ())> to a Vec<K>?
If there are alternative approaches that circumvent usage of std::mem::transmute altogether, those would of course also be very much appreciated!

No. As far as what is currently enforced, transmuting (T, ()) to T is not guaranteed to be sound. Tuples use the default representation, which does not imply anything about the layout beyond what is said in The Rust Reference. Only #[repr(transparent)] will guarantee layout compatibility.
However, it will probably work and may eventually be guaranteed. From Structs and Tuples in the Unsafe Code Guidelines:
In general, an anonymous tuple type (T1..Tn) of arity N is laid out "as if" there were a corresponding tuple struct...
...
For the purposes of struct layout 1-ZST[1] fields are ignored.
In particular, if all but one field are 1-ZST, then the struct is equivalent to a single-field struct. In other words, if all but one field is a 1-ZST, then the entire struct has the same layout as that one field.
For example:
type Zst1 = ();
struct S1(i32, Zst1); // same layout as i32
[1] Types with zero size are called zero-sized types, which is abbreviated as "ZST". This document also uses the "1-ZST" abbreviation, which stands for "one-aligned zero-sized type", to refer to zero-sized types with an alignment requirement of 1.
If my understanding of this is correct, (K, ()) has the equivalent layout to K and thus can be transmuted safely. However, that will not extend to transmuting Vec<T> to Vec<U> as mentioned in Transmutes from the Rustonomicon:
Even different instances of the same generic type can have wildly different layout. Vec<i32> and Vec<u32> might have their fields in the same order, or they might not.
Unfortunately, you should take this with a grain of salt. The Unsafe Code Guidelines is an effort to recommend what unsafe code can rely on, but it currently describes itself as a work in progress and states that any concrete additions to the language specification will be moved to the official Rust Reference. I say this "will probably work" because a facet of the guidelines is to document current behavior. But as of yet, no guarantee like this has been mentioned in the reference.
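On the request for alternatives that avoid transmute: a minimal sketch, assuming a plain slice or Vec of (T, ()) pairs stands in for a leaf node's storage. The helper names keys and into_keys are hypothetical. Projecting onto the first tuple field is ordinary safe code, so it does not depend on any layout guarantee.

```rust
/// Borrowing iterator: yields `&T` from a slice of `(T, ())` pairs.
/// Field projection is sound whatever layout the tuple happens to have.
fn keys<T>(leaf: &[(T, ())]) -> impl Iterator<Item = &T> + '_ {
    leaf.iter().map(|(key, ())| key)
}

/// Owning conversion: rebuilds a `Vec<T>` instead of transmuting the `Vec`.
fn into_keys<T>(leaf: Vec<(T, ())>) -> Vec<T> {
    leaf.into_iter().map(|(key, ())| key).collect()
}
```

The borrowing version does no copying; it only hands out a reference to the first field of each element.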

Related

Why do I get double references when filtering a Vec?

I am always having difficulties writing a proper iteration over Vec.
Perhaps this is because I don't understand yet properly when and why references are introduced.
For example
pub fn notInCheck(&self) -> bool { .... } // tells whether king left in check
pub fn apply(&self, mv: Move) -> Self { ... } // applies a move and returns the new position
pub fn rawMoves(&self, vec: &mut Vec<Move>) { ... } // generates possible moves, not taking check into account
/// List of possible moves in a given position.
/// Verified to not leave the king of the moving player in check.
pub fn moves(&self) -> Vec<Move> {
    let mut vec: Vec<Move> = Vec::with_capacity(40);
    self.rawMoves(&mut vec);
    vec.iter().filter(|m| self.apply(**m).notInCheck()).map(|m| *m).collect()
}
where
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[repr(transparent)]
pub struct Move {
    mv: u32,
}
The iteration I wrote first was:
vec.iter().filter(|m| self.apply(m).notInCheck()).collect()
but, of course, the compiler gave all sorts of errors. In addressing these errors, I arrived finally at the version shown above, but while the compiler is happy, I'm not sure I am.
It looks like the vector doesn't hold Move's at all, but merely references to Moves? But then, where are the Move's stored? In addition, the filter() function adds another level of indirection. Is this correct? Please explain to me!
Bonus question: When I have vector elements with a type that implements Copy, is there a way to avoid all this useless reference taking stuff. I understand how it would make sense with vector elements of a notable size one does not want to copy around. However, I definitely want to avoid &&value in filter(). Can I?
It looks like the vector doesn't hold Move's at all, but merely references to Moves? But then, where are the Move's stored? In addition, the filter() function adds another level of indirection. Is this correct? Please explain to me!
No, Vec<Move> definitely holds moves. The part you're missing is that aside from filter() getting a reference to the iterator item, slice::iter creates an iterator on references to the slice (or vec here) items, so Vec<Move> -> Iterator<Item=&Move> -> filter(predicate: FnMut(&&Move) -> bool), and that's why you've got two indirections in your filter callback.
When I have vector elements with a type that implements Copy, is there a way to avoid all this useless reference taking stuff. I understand how it would make sense with vector elements of a notable size one does not want to copy around. However, I definitely want to avoid &&value in filter(). Can I?
Yes. You can use into_iter, which will consume the source vector but iterate on the contained values directly, or you can use the Iterator::copied adapter, which will Copy the iterator item, therefore going from Iterator<Item=&T> to Iterator<Item=T>. However, filter will never get a T; the most it can get is an &T, since otherwise the item would get "lost" (it would be consumed by the filter, which would only return a boolean, yielding… nothing useful).
The alternative is to use something like filter_map which does get a T input, and returns an Option<U>. Because (as the name indicates) it both filters and maps, it gets to consume the input item and either return an output item (possibly the same) or return "nothing" and essentially remove the item from the collection.
Incidentally, there's also an Iterator::cloned adapter for types which are Clone but not (necessarily) Copy.
Also you could have basically done that by hand by just flipping map and filter around in the original:
vec.iter().map(|m| *m).filter(|m| self.apply(*m).notInCheck()).collect()
map transforms the Iterator<Item=&T> into an Iterator<Item=T>, then filter just gets an &T instead of an &&T.
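The three options above can be sketched side by side. This is only an illustration: Move is reduced to a tuple struct, and is_legal is a hypothetical stand-in for self.apply(m).notInCheck().

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Move(u32);

/// Hypothetical stand-in for `self.apply(m).notInCheck()`: keeps even moves.
fn is_legal(m: Move) -> bool {
    m.0 % 2 == 0
}

/// `copied()` turns `Iterator<Item = &Move>` into `Iterator<Item = Move>`,
/// so `filter` sees `&Move` rather than `&&Move`.
fn legal_copied(raw: &[Move]) -> Vec<Move> {
    raw.iter().copied().filter(|m| is_legal(*m)).collect()
}

/// `into_iter()` consumes the vector and yields `Move` directly.
fn legal_owned(raw: Vec<Move>) -> Vec<Move> {
    raw.into_iter().filter(|&m| is_legal(m)).collect()
}

/// `filter_map` receives the item itself and decides whether to keep it.
fn legal_filter_map(raw: &[Move]) -> Vec<Move> {
    raw.iter()
        .filter_map(|&m| if is_legal(m) { Some(m) } else { None })
        .collect()
}
```

In all three variants the closure deals with at most one level of reference.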
That aside, I don't really get why apply needs to consume the input move. Or why rawMoves doesn't just… create the vector internally and return it? I get the optimisation allowing for reusing the buffer but that seems like a case of premature optimisation maybe?
And your Move seems… both over-complicated and a bit too simple? If you just want to newtype a u32 then using a tuple-struct seems more than sufficient.
And repr(transparent) is unnecessary here: it only matters when code relies on the newtype having exactly the layout of the wrapped type, most commonly in FFI contexts where the newtype is intended as a type-safety measure which only exists on the Rust side (i.e. the newtype itself is not visible from / exposed to C, only the wrapped type is).

Why are len() and is_empty() not defined in a trait?

Most patterns in Rust are captured by traits (Iterator, From, Borrow, etc.).
How come a pattern as pervasive as len/is_empty has no associated trait in the standard library? Would that cause problems which I do not foresee? Was it deemed useless? Or is it only that nobody thought of it (which seems unlikely)?
Was it deemed useless?
I would guess that's the reason.
What could you do with the knowledge that something is empty or has length 15? Pretty much nothing, unless you also have a way to access the elements of the collection, for example. The trait that unifies collections is Iterator. In particular, an iterator can tell you how many elements its underlying collection has (via count(), or len() for an ExactSizeIterator), but it also does a lot more.
Also note that should you need an Empty trait, you can create one and implement it for all standard collections, unlike interfaces in most languages. This is the power of traits. This also means that the standard library doesn't need to provide small utility traits for every single use case, they can be provided by libraries!
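A minimal sketch of such a user-defined trait, assuming the name Len (it is hypothetical; nothing by that name exists in the standard library). Each impl simply forwards to the collection's inherent len method.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical trait; not part of the standard library.
trait Len {
    fn len(&self) -> usize;

    /// Default implementation in terms of `len`.
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

// Each impl forwards to the inherent method of the same name.
impl<T> Len for Vec<T> {
    fn len(&self) -> usize { Vec::len(self) }
}
impl Len for String {
    fn len(&self) -> usize { String::len(self) }
}
impl<K, V> Len for HashMap<K, V> {
    fn len(&self) -> usize { HashMap::len(self) }
}
impl<T> Len for BTreeSet<T> {
    fn len(&self) -> usize { BTreeSet::len(self) }
}

/// Works for any type that implements the hypothetical `Len` trait.
fn describe<C: Len>(c: &C) -> String {
    format!("len = {}, empty = {}", c.len(), c.is_empty())
}
```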
Just adding a late but perhaps useful answer here. Depending on what exactly you need, using the slice type might be a good option, rather than specifying a trait. Slices have len(), is_empty(), and other useful methods (see the slice documentation). Consider the following:
use core::fmt::Display;
// Accepts anything that can be viewed as a slice of displayable items.
fn printme<T: Display>(x: &[T]) {
    println!("length: {}, empty: {}", x.len(), x.is_empty());
    for c in x {
        print!("{}, ", c);
    }
    println!("\nDone!");
}
fn main() {
    let s = "This is some string";
    // Vector
    let vv: Vec<char> = s.chars().collect();
    printme(&vv);
    // Array
    let x = [1, 2, 3, 4];
    printme(&x);
    // Empty
    let y: Vec<u8> = Vec::new();
    printme(&y);
}
printme can accept either a vector or an array. Most other things that it accepts will need some massaging.
I think maybe the reason for there being no Length trait is that most functions will either a) work through an iterator without needing to know its length (with Iterator), or b) require len because they do some sort of random element access, in which case a slice would be the best bet. In the first case, knowing length may be helpful to pre-allocate memory of some size, but size_hint takes care of this when used for anything like Vec::with_capacity, or ExactSizeIterator for anything that needs specific allocations. Most other cases would probably need to be collected to a vector at some point within the function, which has its len.
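To illustrate the size_hint and ExactSizeIterator point, here is a small sketch. The function names collect_doubled and exact_len are made up for this example.

```rust
/// Pre-allocates from the iterator's lower size bound, then fills the buffer.
fn collect_doubled<I: Iterator<Item = u32>>(iter: I) -> Vec<u32> {
    let (lower, _upper) = iter.size_hint();
    let mut out = Vec::with_capacity(lower);
    for x in iter {
        out.push(x * 2);
    }
    out
}

/// `ExactSizeIterator` exposes an exact `len()` without naming a collection type.
fn exact_len<I: ExactSizeIterator>(iter: I) -> usize {
    iter.len()
}
```

Note that size_hint is only a hint: an adapter such as filter reports a lower bound of 0, because it cannot know in advance how many items will pass.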
Playground link to my example here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9a034c2e8b75775449afa110c05858e7

Is transmuting PhantomData markers safe?

This is taken out of context so it might seem a bit weird, but I have the following data structure:
use std::marker::PhantomData;
pub struct Map<T, M=()> {
    data: Vec<T>,
    _marker: PhantomData<fn(M) -> M>,
}
Map is an associative map where keys are "marked" to prevent using keys from one map on another unrelated map. Users can opt into this by passing some unique type they've made as M, for example:
struct PlayerMapMarker;
let mut player_map: Map<String, PlayerMapMarker> = Map::new();
This is all fine, but some iterators (e.g. the ones giving only values) I want to write for this map do not contain the marker in their type. Would the following transmute be safe to discard the marker?
fn discard_marker<T, M>(map: &Map<T, M>) -> &Map<T, ()> {
    unsafe { std::mem::transmute(map) }
}
So that I could write and use:
fn values(&self) -> Values<T> {
    Values { inner: discard_marker(self).iter() }
}
struct Values<'a, T> {
    inner: Iter<'a, T, ()>,
}
TL;DR: Add #[repr(C)] and you should be good.
There are two separate concerns here: Whether the transmute is valid in the sense of returning valid data at the return type, and whether the entire thing violates any higher-level invariants that might be attached to the involved types. (In the terminology of my blog post, you have to make sure that both validity and safety invariants are maintained.)
For the validity invariant, you are in uncharted territory. The compiler could decide to lay out Map<T, M> very differently from Map<T, ()>, i.e. the data field could be at a different offset and there could be spurious padding. It does not seem likely, but so far we are guaranteeing very little here. Discussion about what we can and want to guarantee there is happening right now. We purposefully want to avoid making too many guarantees about repr(Rust) to avoid painting ourselves into a corner.
What you could do is to add repr(C) to your struct, then I am fairly sure you can count on ZSTs not changing anything (but I asked for clarification just to be sure). For repr(C) we provide more guarantees about how the struct is laid out, which in fact is its entire purpose. If you want to play tricks with struct layout, you should probably add that attribute.
For the higher-level safety invariant, you must be careful not to create a broken Map and let that "leak" beyond the borders of your API (into the surrounding safe code), i.e. you shouldn't return an instance of Map that violates any invariants you might have put on it. Moreover, PhantomData has some effects on variance and the drop checker that you should be aware of. With the types that are being transmuted being so trivial (your marker types don't require dropping, i.e. neither they nor their transitive fields implement Drop), I do not think you have to expect any problem from this side.
To be clear, repr(Rust) (the default) might also be fine once we decide this is something we want to guarantee, and ignoring size-0-align-1 types (like PhantomData) entirely seems like a pretty sensible guarantee to me. Personally though, I'd still advise using repr(C) unless that has a cost you are not willing to pay (e.g. because you lose the compiler's automatic size-reduction-by-reordering and cannot replicate it manually).
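A sketch of the repr(C) suggestion, under the assumption stated above that repr(C) lays out Map<T, M> and Map<T, ()> identically because they differ only in a zero-sized, align-1 field. The push method is a hypothetical addition so the example has data to iterate over, and values returns a plain slice iterator rather than the question's Values type.

```rust
use std::marker::PhantomData;

#[repr(C)] // fixes field order and offsets, unlike the default repr(Rust)
pub struct Map<T, M = ()> {
    data: Vec<T>,
    _marker: PhantomData<fn(M) -> M>,
}

impl<T, M> Map<T, M> {
    pub fn new() -> Self {
        Map { data: Vec::new(), _marker: PhantomData }
    }

    // Hypothetical helper so the sketch has something to iterate over.
    pub fn push(&mut self, value: T) {
        self.data.push(value);
    }

    fn discard_marker(&self) -> &Map<T, ()> {
        // SAFETY (assumed, per the answer): with repr(C) the two types differ
        // only in a zero-sized, align-1 PhantomData field, so they share a layout.
        unsafe { &*(self as *const Map<T, M> as *const Map<T, ()>) }
    }

    /// Iterates over the values; the marker no longer appears in the return type.
    pub fn values(&self) -> std::slice::Iter<'_, T> {
        self.discard_marker().data.iter()
    }
}
```

A pointer cast is used here instead of transmute; it rests on the same layout assumption but makes the source and target types explicit.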

What is the `PhantomData` actually doing in the implementation of `Vec`? [duplicate]

How does PhantomData work in Rust? In the Nomicon it says the following:
In order to tell dropck that we do own values of type T, and therefore may drop some T's when we drop, we must add an extra PhantomData saying exactly that.
To me that seems to imply that when we add a PhantomData field to a structure, say in the case of a Vec,
pub struct Vec<T> {
    data: *mut T,
    length: usize,
    capacity: usize,
    phantom: PhantomData<T>,
}
that the drop checker should forbid the following sequence of code:
fn main() -> () {
    let mut vector = Vec::new();
    let x = Box::new(1 as i32);
    let y = Box::new(2 as i32);
    let z = Box::new(3 as i32);
    vector.push(x);
    vector.push(y);
    vector.push(z);
}
Since the freeing of x, y, and z would occur before the freeing of the Vec, I would expect some complaint from the compiler. However, if you run the code above there is no warning or error.
The PhantomData<T> within Vec<T> (held indirectly via a Unique<T> within RawVec<T>) communicates to the compiler that the vector may own instances of T, and therefore the vector may run destructors for T when the vector is dropped.
Deep dive: We have a combination of factors here:
We have a Vec<T> which has an impl Drop (i.e. a destructor implementation).
Under the rules of RFC 1238, this would usually imply a relationship between instances of Vec<T> and any lifetimes that occur within T, by requiring that all lifetimes within T strictly outlive the vector.
However, the destructor for Vec<T> specifically opts out of this semantics for just that destructor (of Vec<T> itself) via the use of special unstable attributes (see RFC 1238 and RFC 1327). This allows for a vector to hold references that have the same lifetime as the vector itself. This is considered sound; after all, the vector itself will not dereference data pointed to by such references (all it's doing is dropping values and deallocating the backing array), as long as an important caveat holds.
The important caveat: While the vector itself will not dereference pointers within its contained values while destructing itself, it will drop the values held by the vector. If those values of type T themselves have destructors, those destructors for T get run. And if those destructors access the data held within their references, then we would have a problem if we allowed dangling pointers within those references.
So, diving in even more deeply: to confirm dropck validity for a given structure S, we first check whether S itself has an impl Drop for S (and if so, we enforce rules on S with respect to its type parameters). But even after that step, we then recursively descend into the structure of S itself, and check for each of its fields that everything is kosher according to dropck. (Note that we do this even if a type parameter of S is tagged with #[may_dangle].)
In this specific case, we have a Vec<T> which (indirectly via RawVec<T>/Unique<T>) owns a collection of values of type T, represented in a raw pointer *const T. However, the compiler attaches no ownership semantics to *const T; that field alone in a structure S implies no relationship between S and T, and thus enforces no constraint in terms of the relationship of lifetimes within the types S and T (at least from the viewpoint of dropck).
Therefore, if the Vec<T> had solely a *const T, the recursive descent into the structure of the vector would fail to capture the ownership relation between the vector and the instances of T contained within the vector. That, combined with the #[may_dangle] attribute on T, would cause the compiler to accept unsound code (namely cases where destructors for T end up trying to access data that has already been deallocated).
BUT: Vec<T> does not solely contain a *const T. There is also a PhantomData<T>, and that conveys to the compiler "hey, even though you can assume (due to the #[may_dangle] T) that the destructor for Vec won't access data of T when the vector is dropped, it is still possible that some destructor of T itself will access data of T as the vector is dropped."
The end effect: Given Vec<T>, if T doesn't have a destructor, then the compiler provides you with more flexibility (namely, it allows a vector to hold data with references to data that lives for the same amount of time as the vector itself, even though such data may be torn down before the vector is). But if T does have a destructor (and that destructor is not otherwise communicating to the compiler that it won't access any referenced data), then the compiler is more strict, requiring any referenced data to strictly outlive the vector (thus ensuring that when the destructor for T runs, all the referenced data will still be valid).
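The end effect can be sketched with two small functions. Noisy is a hypothetical type introduced only for this illustration; its destructor reads through the reference it holds.

```rust
/// `&String` has no destructor: the vector may be declared before the data
/// it borrows, even though that data is dropped first.
fn plain_refs() -> usize {
    let mut v: Vec<&String> = Vec::new();
    let s = String::from("hello"); // dropped before `v`
    v.push(&s);
    v.len()
}

/// Hypothetical element type with a destructor.
struct Noisy<'a>(&'a String);

impl Drop for Noisy<'_> {
    fn drop(&mut self) {
        // Reads through the reference, so the referent must still be alive.
        println!("dropping {}", self.0);
    }
}

/// With a destructor on the element type, the borrowed data must strictly
/// outlive the vector, so `s` has to be declared first.
fn with_destructor() -> usize {
    let s = String::from("hello");
    let mut v: Vec<Noisy<'_>> = Vec::new();
    v.push(Noisy(&s));
    v.len()
    // Swapping the two `let` lines is expected to be rejected with E0597
    // ("`s` does not live long enough").
}
```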

Indexing vector by a 32-bit integer

In Rust, vectors are indexed using usize, so when writing
let my_vec: Vec<String> = vec!["Hello".to_string(), "world".to_string()];
let index: u32 = 0;
println!("{}", my_vec[index]);
you get an error, as index is expected to be of type usize. I'm aware that this can be fixed by explicitly converting index to usize:
my_vec[index as usize]
but this is tedious to write. Ideally I'd simply overload the [] operator by implementing
impl<T> std::ops::Index<u32> for Vec<T> { ... }
but that's impossible as Rust prohibits this (as neither the trait nor struct are local). The only alternative that I can see is to create a wrapper class for Vec, but that would mean having to write lots of function wrappers as well. Is there any more elegant way to address this?
Without a clear use case it is difficult to recommend the best approach.
There are basically two questions here:
do you really need indexing?
do you really need to use u32 for indices?
When using functional programming style, indexing is generally unnecessary as you operate on iterators instead. In this case, the fact that Vec only implements Index for usize really does not matter.
If your algorithm really needs indexing, then why not use usize? There are many ways to convert from u32 to usize: converting at the last possible moment is one option, but there are other sites where you could do the conversion, and if you find (or create) a chokepoint you can get away with only a handful of conversions.
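As a rough sketch of such a chokepoint, assuming hypothetical helpers named to_index and get, every u32-to-usize conversion can be routed through one function:

```rust
use std::convert::TryFrom;

/// Single conversion point from a stored `u32` index to `usize`.
/// `try_from` can only fail on targets where `usize` is narrower than 32 bits.
fn to_index(i: u32) -> usize {
    usize::try_from(i).expect("u32 index does not fit in usize")
}

/// Bounds-checked lookup that accepts the `u32` index directly.
fn get<T>(items: &[T], i: u32) -> Option<&T> {
    items.get(to_index(i))
}
```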
At least, that's the YAGNI point of view.
Personally, as a type freak, I tend to wrap things around a lot. I just like to add semantic information, because let's face it Vec<i32> just doesn't mean anything.
Rust offers a simple way to create wrapper structures: struct MyType(WrappedType);. That's it.
Once you have your own type, adding indexing is easy. There are several ways to add other operations:
if only a few operations make sense, then adding explicitly is best.
if many operations are necessary, and you do not mind exposing the fact that underneath is a Vec<X>, then you can expose it:
by making it public: struct MyType(pub WrappedType);, users can then call .0 to access it.
by implementing AsRef and AsMut, or creating a getter.
by implementing Deref and DerefMut (which is implicit, make sure you really want to).
Of course, breaking encapsulation can be annoying later, as it also prevents the maintenance of invariants, so I would consider it a last ditch solution.
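A minimal sketch of the wrapper approach, assuming a hypothetical newtype called Nodes. It adds indexing by u32 explicitly and, optionally, exposes the read-only slice methods through Deref.

```rust
use std::ops::{Deref, Index};

/// Hypothetical newtype around `Vec<T>` that can be indexed by `u32`.
struct Nodes<T>(Vec<T>);

impl<T> Index<u32> for Nodes<T> {
    type Output = T;

    fn index(&self, i: u32) -> &T {
        &self.0[i as usize]
    }
}

/// Optional: expose read-only slice methods (`len`, `iter`, ...) implicitly.
impl<T> Deref for Nodes<T> {
    type Target = [T];

    fn deref(&self) -> &[T] {
        &self.0
    }
}
```

Dereferencing to a slice rather than to the Vec keeps mutation behind the wrapper, which makes it easier to maintain invariants later.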
I prefer to store "references" to nodes as u32 rather than usize. So when traversing the graph I keep retrieving adjacent vertex "references", which I then use to look up the actual vertex object in the Vec object
So actually you don't want u32, because you will never do calculations on it, and u32 easily allows you to do math. You want an index-type that can just do indexing but whose values are immutable otherwise.
I suggest you implement something along the lines of rustc_data_structures::indexed_vec::IndexVec.
This custom IndexVec type is not only generic over the element type, but also over the index type, and thus allows you to use a NodeId newtype wrapper around u32. You'll never accidentally use a non-id u32 to index, and you can use them just as easily as a u32. You don't even have to create any of these indices by calculating them from the vector length, instead the push method returns the index of the location where the element has just been inserted.
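The idea can be sketched in a simplified form. NodeVec below is a hypothetical, non-generic-over-index analogue written for this answer, not rustc's actual IndexVec API.

```rust
use std::convert::TryFrom;
use std::ops::Index;

/// Opaque handle: no arithmetic, only usable for indexing.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct NodeId(u32);

/// Hypothetical, simplified analogue of an index-typed vector.
pub struct NodeVec<T> {
    items: Vec<T>,
}

impl<T> NodeVec<T> {
    pub fn new() -> Self {
        NodeVec { items: Vec::new() }
    }

    /// Inserts a value and returns the id of the slot it landed in.
    pub fn push(&mut self, value: T) -> NodeId {
        let id = NodeId(u32::try_from(self.items.len()).expect("too many nodes"));
        self.items.push(value);
        id
    }
}

impl<T> Index<NodeId> for NodeVec<T> {
    type Output = T;

    fn index(&self, id: NodeId) -> &T {
        &self.items[id.0 as usize]
    }
}
```

Because the field of NodeId is private, code outside the defining module can only obtain ids from push, so an arbitrary u32 cannot be used as an index by accident.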
