Is transmuting PhantomData markers safe? - rust

This is taken out of context so it might seem a bit weird, but I have the following data structure:
use std::marker::PhantomData;
pub struct Map<T, M=()> {
data: Vec<T>,
_marker: PhantomData<fn(M) -> M>,
}
Map is an associative map where keys are "marked" to prevent using keys from one map on another unrelated map. Users can opt into this by passing some unique type they've made as M, for example:
struct PlayerMapMarker;
let mut player_map: Map<String, PlayerMapMarker> = Map::new();
This is all fine, but some iterators (e.g. the ones giving only values) I want to write for this map do not contain the marker in their type. Would the following transmute be safe to discard the marker?
fn discard_marker<T, M>(map: &Map<T, M>) -> &Map<T, ()> {
unsafe { std::mem::transmute(map) }
}
So that I could write and use:
fn values(&self) -> Values<T> {
Values { inner: discard_marker(self).iter() }
}
struct Values<'a, T> {
inner: Iter<'a, T, ()>,
}

TL;DR: Add #[repr(C)] and you should be good.
There are two separate concerns here: Whether the transmute is valid in the sense of returning valid data at the return type, and whether the entire thing violates any higher-level invariants that might be attached to the involved types. (In the terminology of my blog post, you have to make sure that both validity and safety invariants are maintained.)
For the validity invariant, you are in uncharted territory. The compiler could decide to lay out Map<T, M> very differently from Map<T, ()>, i.e. the data field could be at a different offset and there could be spurious padding. It does not seem likely, but so far we are guaranteeing very little here. Discussion about what we can and want to guarantee there is happening right now. We purposefully want to avoid making too many guarantees about repr(Rust) to avoid painting ourselves into a corner.
What you could do is to add repr(C) to your struct, then I am fairly sure you can count on ZSTs not changing anything (but I asked for clarification just to be sure). For repr(C) we provide more guarantees about how the struct is laid out, which in fact is its entire purpose. If you want to play tricks with struct layout, you should probably add that attribute.
For the higher-level safety invariant, you must be careful not to create a broken Map and let that "leak" beyond the borders of your API (into the surrounding safe code), i.e. you shouldn't return an instance of Map that violates any invariants you might have put on it. Moreover, PhantomData has some effects on variance and the drop checker that you should be aware of. With the types that are being transmuted being so trivial (your marker types don't require dropping, i.e. them and their transitive fields all do not implement Drop) I do not think you have to expect any problem from this side.
To be clear, repr(Rust) (the default) might also be fine once we decide this is something we want to guarantee -- and ignoring size-0-align-1 types (like PhantomData) entirely seems like a pretty sensible guarantee to me. Personally though I'd still advise for using repr(C) unless that has a cost you are not willing to pay (e.g. because you lose the compilers automatic size-reduction-by-reordering and cannot replicate it manually).

Related

Object safe generic without type erasure

In the trait Extendable below, I'd like to make app generic where it currently uses the concrete type i32.
At first blush, you'd think to use a generic type but doing so while keeping Extendable object safe isn't easy.
trait Isoextender {
type Input;
type Output;
fn forward(&self, v: Self::Input) -> Self::Output;
fn backward(&self, v: Self::Output) -> Self::Input;
}
trait Extendable {
type Item;
fn app(
&self,
v: &dyn Isoextender<Input = Self::Item, Output = i32>,
) -> Box<dyn Extendable<Item = i32>>;
}
There's a (really neat) type erasure trick (link) I can use with std::Any but then I'm losing type information.
This is one of those things that feels like it should be possible from my understanding of how Rust works but which simply might not be. I do see there's an issue with size. Clearly rust needs to know the size of the associated type Item, but I don't see how to solve with with references/pointers in a way that's both object safe and keeps the type information.
Is it the case that:
I am missing something and it's actually possible?
This is not possible today, but it may become possible in the future?
This is not likely to be possible ever?
Update:
So I suppose a part of the issue here is that I feel like the following is a generic function that has only 1 possible implementation. You should only have to compile this once.
fn pointer_identity<T>(v: &T) -> &T {
v
}
It feels like I should be able to put this function into a vtable, but still somehow express that it is generic. There are a whole class of easily identifiable functions that effectively act this way. Interactions between trait objects often act this way. It's all pointers to functions where the types don't matter except to be carried forward.
I found some document ion that may answer my question.
Currently, the compiler has two concepts surrounding Generic code, only one of which seems to be surfaced in the type system (link).
It sounds like this may be possible some day, but not currently expressible with Rust.
Monomorphization
The compiler stamps out a different copy of the code of a generic function for each concrete type needed.
Polymorphization
In addition to MIR optimizations, rustc attempts to determine when fewer copies of functions are necessary and avoid making those copies - known as "polymorphization".
As a result of polymorphization, items collected during monomorphization cannot be assumed to be monomorphic.
It is intended that polymorphization be extended to more advanced cases, such as where only the size/alignment of a generic parameter are required.

Is it safe to transmute::<&'a Arc<T>, &'a Weak<T>>(…)?

Is it safe to transmute a shared reference & to a strong Arc<T> into a shared reference & to a Weak<T>?
To ask another way: is the following safe function sound, or is it a vulnerability waiting to happen?
pub fn as_weak<'a, T>(strong: &'a Arc<T>) -> &'a Weak<T> {
unsafe { transmute::<&'a Arc<T>, &'a Weak<T>>(strong) }
}
Why I want to do this
We have an existing function that returns a &Weak<T>. The internal data structure has changed a bit, and I now have an Arc<T> where I previously had a Weak<T>, but I need to maintain semver compatibility with this function's interface. I'd rather avoid needing to stash an actual Weak<T> copy just for the sake of this function if I don't need to.
Why I hope this is safe
The underlying memory representations of Arc<T> and Weak<T> are the same: a not-null pointer (or pointer-like value for Weak::new()) to an internal ArcInner struct, which contains the strong and weak reference counts and the inner T value.
Arc<T> also contains a PhantomData<T>, but my understanding is that if that changes anything, it would only apply on drop, which isn't relevant for the case here as we're only transmuting a shared reference, not an owned value.
The operations that an Arc<T> will perform on its inner pointer are presumably a superset of those that may be performed by a Weak<T>, since they have the same representation but Arc carries a guarantee that the inner T value is still alive, while Weak does not.
Given these facts, it seems to me like nothing could go wrong. However, I haven't written much unsafe code before, and never for a production case like this. I'm not confident that I fully understand the possible issues. Is this transmutation safe and sound, or are are there other factors that need to be considered?
No, this is not sound.
Neither Arc nor Weak has a #[repr] forcing a particular layout, therefore they are both #[repr(Rust)] by default. According to the Rustonomicon section about repr(Rust):
struct A {
a: i32,
b: u64,
}
struct B {
a: i32,
b: u64,
}
Rust does guarantee that two instances of A have their data laid out in exactly the same way. However Rust does not currently guarantee that an instance of A has the same field ordering or padding as an instance of B.
You cannot therefore assume that Arc<T> and Weak<T> have the same layout.

Is allowing library users to embed arbitrary data in your structures a correct usage of std::mem::transmute?

A library I'm working on stores various data structures in a graph-like manner.
I'd like to let users store metadata ("annotations") in nodes, so they can retrieve them later. Currently, they have to create their own data structure which mirrors the library's, which is very inconvenient.
I'm placing very little constraints on what an annotation can be, because I do not know what the users will want to store in the future.
The rest of this question is about my current attempt at solving this use case, but I'm open to completely different implementations as well.
User annotations are represented with a trait:
pub trait Annotation {
fn some_important_method(&self)
}
This trait contains a few methods (all on &self) which are important for the domain, but these are always trivial to implement for users. The real data of an annotation implementation cannot be retrieved this way.
I can store a list of annotations this way:
pub struct Node {
// ...
annotations: Vec<Box<dyn Annotation>>,
}
I'd like to let the user retrieve whatever implementation they previously added to a list, something like this:
impl Node {
fn annotations_with_type<T>(&self) -> Vec<&T>
where
T: Annotation,
{
// ??
}
}
I originally aimed to convert dyn Annotation to dyn Any, then use downcast_ref, however trait upcasting coercion is unsable.
Another solution would be to require each Annotation implementation to store its TypeId, compare it with annotations_with_type's type parameter's TypeId, and std::mem::transmute the resulting &dyn Annotation to &T… but the documentation of transmute is quite scary and I honestly don't know whether that's one of the allowed cases in which it is safe. I definitely would have done some kind of void * in C.
Of course it's also possible that there's a third (safe) way to go through this. I'm open to suggestions.
What you are describing is commonly solved by TypeMaps, allowing a type to be associated with some data.
If you are open to using a library, you might consider looking into using an existing implementation, such as https://crates.io/crates/typemap_rev, to store data. For example:
struct MyAnnotation;
impl TypeMapKey for MyAnnotation {
type Value = String;
}
let mut map = TypeMap::new();
map.insert::<MyAnnotation>("Some Annotation");
If you are curious. It underlying uses a HashMap<TypeId, Box<(dyn Any + Send + Sync)>> to store the data. To retrieve data, it uses a downcast_ref on the Any type which is stable. This could also be a pattern to implement it yourself if needed.
You don't have to worry whether this is valid - because it doesn't compile (playground):
error[E0512]: cannot transmute between types of different sizes, or dependently-sized types
--> src/main.rs:7:18
|
7 | _ = unsafe { std::mem::transmute::<&dyn Annotation, &i32>(&*v) };
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: source type: `&dyn Annotation` (128 bits)
= note: target type: `&i32` (64 bits)
The error message should be clear, I hope: &dyn Trait is a fat pointer, and has size 2*size_of::<usize>(). &T, on the other hand, is a thin pointer (as long as T: Sized), of size of only one usize, and you cannot transmute between types of different sizes.
You can work around that with transmute_copy(), but it will just make things worse: it will work, but it is unsound and is not guaranteed to work in any way. It may become UB in future Rust versions. This is because the only guaranteed thing (as of now) for &dyn Trait references is:
Pointers to unsized types are sized. The size and alignment is guaranteed to be at least equal to the size and alignment of a pointer.
Nothing guarantees the order of the fields. It can be (data_ptr, vtable_ptr) (as it is now, and thus transmute_copy() works) or (vtable_ptr, data_ptr). Nothing is even guaranteed about the contents. It can not contain a data pointer at all (though I doubt somebody will ever do something like that). transmute_copy() copies the data from the beginning, meaning that for the code to work the data pointer should be there and should be first (which it is). For the code to be sound this needs to be guaranteed (which is not).
So what can we do? Let's check how Any does its magic:
// SAFETY: caller guarantees that T is the correct type
unsafe { &*(self as *const dyn Any as *const T) }
So it uses as for the conversion. Does it work? Certainly. And that means std can do that, because std can do things that are not guaranteed and relying on how things work in practice. But we shouldn't. So, is it guaranteed?
I don't have a firm answer, but I'm pretty sure the answer is no. I have found no authoritative source that guarantees the behavior of casts from unsized to sized pointers.
Edit: #CAD97 pointed on Zulip that the reference promises that *[const|mut] T as *[const|mut V] where V: Sized will be a pointer-to-pointer case, and that can be read as a guarantee this will work.
But I still feel fine with relying on that. Because, unlike the transmute_copy(), people are doing it. In production. And there is no better way in stable. So the chance it will become undefined behavior is very low. It is much more likely to be defined.
Does a guaranteed way even exist? Well, yes and no. Yes, but only using the unstable pointer metadata API:
#![feature(ptr_metadata)]
let v: &dyn Annotation;
let v = v as *const dyn Annotation;
let v: *const T = v.to_raw_parts().0.cast::<T>();
let v: &T = unsafe { &*v };
In conclusion, if you can use nightly features, I would prefer the pointer metadata API just to be extra safe. But in case you can't, I think the cast approach is fine.
Last point, there may be a crate that already does that. Prefer that, if it exists.

Is my understanding of a Rust vector that supports Rc or Box wrapped types correct?

I'm not looking for code samples. I want to state my understanding of Box vs. Rc and have you tell me if my understanding is right or wrong.
Let's say I have some trait ChattyAnimal and a struct Cat that implements this trait, e.g.
pub trait ChattyAnimal {
fn make_sound(&self);
}
pub struct Cat {
pub name: String,
pub sound: String
}
impl ChattyAnimal for Cat {
fn make_sound(&self) {
println!("Meow!");
}
}
Now let's say I have other structs (Dog, Cow, Chicken, ...) that also implement the ChattyAnimal trait, and let's say I want to store all of these in the same vector.
So step 1 is I would have to use a Box type, because the Rust compiler cannot determine the size of everything that might implement this trait. And therefore, we must store these items on the heap – viola using a Box type, which is like a smarter pointer in C++. Anything wrapped with Box is automatically deleted by Rust when it goes out of scope.
// I can alias and use my Box type that wraps the trait like this:
pub type BoxyChattyAnimal = Box<dyn ChattyAnimal>;
// and then I can use my type alias, i.e.
pub struct Container {
animals: Vec<BoxyChattyAnimal>
}
Meanwhile, with Box, Rust's borrow checker requires changing when I pass or reassign the instance. But if I actually want to have multiple references to the same underlying instance, I have to use Rc. And so to have a vector of ChattyAnimal instances where each instance can have multiple references, I would need to do:
pub type RcChattyAnimal = Rc<dyn ChattyAnimal>;
pub struct Container {
animals: Vec<RcChattyAnimal>
}
One important take away from this is that if I want to have a vector of some trait type, I need to explicitly set that vector's type to a Box or Rc that wraps my trait. And so the Rust language designers force us to think about this in advance so that a Box or Rc cannot (at least not easily or accidentally) end up in the same vector.
This feels like a very and well thought design – helping prevent me from introducing bugs in my code. Is my understanding as stated above correct?
Yes, all this is correct.
There's a second reason for this design: it allows the compiler to verify that the operations you're performing on the vector elements are using memory in a safe way, relative to how they're stored.
For example, if you had a method on ChattyAnimal that mutates the animal (i.e. takes a &mut self argument), you could call that method on elements of a Vec<Box<dyn ChattyAnimal>> as long as you had a mutable reference to the vector; the Rust compiler would know that there could only be one reference to the ChattyAnimal in question (because the only reference is inside the Box, which is inside the Vec, and you have a mutable reference to the Vec so there can't be any other references to it). If you tried to write the same code with a Vec<Rc<dyn ChattyAnimal>>, the compiler would complain; it wouldn't be able to completely eliminate the possibility that your code might be mutating the animal at the same time as the code that called it was in the middle of trying to read the animal, which might lead to some inconsistencies in the calling code.
As a consequence, the compiler needs to know that all the elements of the Vec have their memory treated in the same way, so that it can check to make sure that a reference to some arbitrary element of the Vec is being used appropriately.
(There's a third reason, too, which is performance; because the compiler knows that this is a "vector of Boxes" or "vector of Rcs", it can generate code that assumes a particular storage mechanism. For example, if you have a vector of Rcs, and clone one of the elements, the machine code that the compiler generates will work simply by going to the memory address listed in the vector and adding 1 to the reference count stored there – there's no need for any extra levels of indirection. If the vector were allowed to mix different allocation schemes, the generated code would have to be a lot more complex, because it wouldn't be able to assume things like "there is a reference count", and would instead need to (at runtime) find the appropriate piece of code for dealing with the memory allocation scheme in use, and then run it; that would be much slower.)

Why do I get double references when filtering a Vec?

I am always having difficulties writing a proper iteration over Vec.
Perhaps this is because I don't understand yet properly when and why references are introduced.
For example
pub fn notInCheck(&self) -> bool { .... } // tells whether king left in check
pub fn apply(&self, mv: Move) -> Self { ... } // applies a move and returns the new position
pub fn rawMoves(&self, vec: &mut Vec<Move>) { ... } // generates possible moves, not taking chess into account
/// List of possible moves in a given position.
/// Verified to not leave the king of the moving player in check.
pub fn moves(&self) -> Vec<Move> {
let mut vec: Vec<Move> = Vec::with_capacity(40);
self.rawMoves(&mut vec);
vec.iter().filter(|m| self.apply(**m).notInCheck()).map(|m| *m).collect()
}
where
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd, Eq, Ord, Hash)]
#[repr(transparent)]
pub struct Move {
mv: u32,
}
The iteration I wrote first was:
vec.iter().filter(|m| self.apply(m).notInCheck()).collect()
but, of course, the compiler gave all sorts of errors. In addressing these errors, I arrived finally at the version shown above, but while the compiler is happy, I'm not sure I am.
It looks like the vector doesn't hold Move's at all, but merely references to Moves? But then, where are the Move's stored? In addition, the filter() function adds another level of indirection. Is this correct? Please explain to me!
Bonus question: When I have vector elements with a type that implements Copy, is there a way to avoid all this useless reference taking stuff. I understand how it would make sense with vector elements of a notable size one does not want to copy around. However, I definitely want to avoid &&value in filter(). Can I?
It looks like the vector doesn't hold Move's at all, but merely references to Moves? But then, where are the Move's stored? In addition, the filter() function adds another level of indirection. Is this correct? Please explain to me!
No, Vec<Move> definitely holds moves. The part you're missing is that aside from filter() getting a reference to the iterator item, slice::iter creates an iterator on references to the slice (or vec here) items, so Vec<Move> -> Iterator<Item=&Move> -> filter(predicate: FnMut(&&Move) -> bool), and that's why you've got two indirections in your filter callback.
When I have vector elements with a type that implements Copy, is there a way to avoid all this useless reference taking stuff. I understand how it would make sense with vector elements of a notable size one does not want to copy around. However, I definitely want to avoid &&value in filter(). Can I?
Yes. You can use into_iter which will consume the source vector but iterate on the contained values directly, or you can use the Iterator::copied adapter which will Copy the iterator item, therefore going from Iterator<Item=&T> to Iterator<Item=T>. However filter will never get a T, the most it can get is an &T since otherwise the item would get "lost" (it would be consumed by the filter, which would only return a boolean, yielding… nothing useful).
The alternative is to use something like filter_map which does get a T input, and returns an Option<U>. Because (as the name indicates) it both filters and maps, it gets to consume the input item and either return an output item (possibly the same) or return "nothing" and essentially remove the item from the collection.
Incidentally, there's also an Iterator::cloned adapter for types which are Clone but not (necessarily) Copy.
Also you could have basically done that by hand by just flipping map and filter around in the original:
vec.iter().map(|m| *m).filter(|m| self.apply(*m).notInCheck()).collect()
map transforms the Iterator<Item=&T> into an Iterator<Item=T>, then filter just gets an &T instead of an &&T.
That aside, I don't really get why apply needs to consume the input move. Or why rawMoves doesn't just… create the vector internally and return it? I get the optimisation allowing for reusing the buffer but that seems like a case of premature optimisation maybe?
And your Move seems… both over-complicated and a bit too simple? If you just want to newtype a u32 then using a tuple-struct seems more than sufficient.
And repr(transparent) is wholly unnecessary, it's only a concern in FFI contexts where the newtype is intended as a type-safety measure which only exists on the Rust side (aka the newtype itself is not visible from / exposed to C, only the wrapped type is).

Resources