Can From trait implementations be lossy? - rust

Context
I have a pair of related structs in my program, Rom and ProfiledRom. They both store a list of u8 values and implement a common trait, GetRom, to provide access to those values.
trait GetRom {
    fn get(&self, index: usize) -> u8;
}
The difference is that Rom just wraps a simple Vec<u8>, but ProfiledRom wraps each byte in a ProfiledByte type that counts the number of times it is returned by get.
struct Rom(Vec<u8>);
struct ProfiledRom(Vec<ProfiledByte>);
struct ProfiledByte {
    value: u8,
    get_count: u32,
}
Much of my program operates on trait GetRom values, so I can substitute in Rom or ProfiledRom type/value depending on whether I want profiling to occur.
Question
I have implemented From<Rom> for ProfiledRom, because converting a Rom to a ProfiledRom just involves wrapping each byte in a new ProfiledByte: a simple and lossless operation.
However, I'm not sure whether it's appropriate to implement From<ProfiledRom> for Rom, because ProfiledRom contains information (the get counts) that can't be represented in a Rom. If you did a round-trip conversion, these values would be lost/reset.
Is it appropriate to implement the From trait when only parts of the source object will be used?
Related
I have seen that the standard library doesn't implement integer conversions like From<i64> for i32 because these could result in bytes being truncated/lost. However, that seems like a somewhat distinct case from what we have here.
With the potentially-truncating integer conversion, you would need to inspect the original i64 to know whether it would be converted appropriately. If you didn't, the behaviour of your code could change unexpectedly when you get an out-of-bounds value. However, in our case above, it's always statically clear what data is being preserved and what data is being lost. The conversion's behaviour won't suddenly change. It should be safer, but is it an appropriate use of the From trait?

From implementations are usually lossless, but there is currently no strict requirement that they be.
The ongoing discussion at rust-lang/rfcs#2484 is related. Some possibilities include adding a FromLossy trait and more exactly prescribing the behaviour of From. We'll have to see where that goes.
For consideration, here are some Target::from(Source) implementations in the standard library:
Lossless conversions
Each Source value is converted into a distinct Target value.
u16::from(u8), i16::from(u8) and other conversions to strictly-larger integer types.
Vec<u8>::from(String)
Vec<T>::from(BinaryHeap<T>)
OsString::from(String)
char::from(u8)
Lossy conversions
Multiple Source values may be converted into the same Target value.
BinaryHeap<T>::from(Vec<T>) loses the order of elements.
Box<[T]>::from(Vec<T>) and Box<str>::from(String) lose any excess capacity.
Vec<T>::from(VecDeque<T>) loses the internal split of elements exposed by .as_slices().
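For the types in the question, the two directions might look like the sketch below. It reuses the question's struct definitions; the function bodies are an assumption about how the conversions would be written, and whether the lossy direction belongs in From (rather than in a named method) remains a judgment call.

```rust
struct Rom(Vec<u8>);
struct ProfiledRom(Vec<ProfiledByte>);
struct ProfiledByte {
    value: u8,
    get_count: u32,
}

// Lossless direction: every byte is wrapped with a fresh counter.
impl From<Rom> for ProfiledRom {
    fn from(rom: Rom) -> Self {
        ProfiledRom(
            rom.0
                .into_iter()
                .map(|value| ProfiledByte { value, get_count: 0 })
                .collect(),
        )
    }
}

// Lossy direction: the get counts are dropped, only the bytes survive.
impl From<ProfiledRom> for Rom {
    fn from(profiled: ProfiledRom) -> Self {
        Rom(profiled.0.into_iter().map(|byte| byte.value).collect())
    }
}
```

A round trip through Rom preserves the byte values but resets every get_count to zero, which is the same kind of loss as the excess capacity dropped by Box<[T]>::from(Vec<T>).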

Related

The most idiomatic way to efficiently serialize/deserialize a Copy struct into/out of [u8]

Copy means that the struct could be copied just by copying bytes as is. As a result, it should be easily possible to re-interpret such a struct as [u8]. What's the most idiomatic way to do so, preferably without involving unsafe?
I want an optimized struct that can easily be sent between processes, over the wire, or to disk. I understand that there are a lot of details that need to be taken care of, such as alignment, and I am looking for a solution for such a high-performance use case, that is, serialization that is as close to zero-copy as possible.
Copy means that the struct could be copied just by copying bytes as is.
This is true.
As a result, it should be easily possible to re-interpret such a struct as [u8].
This is not true, because Copy structs can still contain padding, which is not permitted to be read except incidentally while copying.
What's the most idiomatic way to do so, preferably without involving unsafe?
You should start with bytemuck. It is a library which provides trivial conversion to and from [u8] when it is safe to do so. In particular, it checks that there is no padding in the struct, and that the representation is well-defined (not subject to the whims of the compiler).
You will still need to consider alignment, and for that purpose may need to introduce explicit “padding” fields (whose value is explicitly set rather than being left undefined) so that the alignment of other fields is satisfied.
Your program's data will also not be compatible with machines of different endianness unless you take care. (However, it is possible to do so, in ways which have zero run-time overhead if not necessary, and most machines are little-endian today so that cost will almost never actually apply.)
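To make the padding and endianness points concrete, here is a standard-library-only sketch. It does not use bytemuck and it copies bytes rather than reinterpreting them in place; the struct names (Padded, Explicit) and the helper methods are hypothetical and exist only for illustration.

```rust
use std::mem::size_of;

// With repr(C), the compiler inserts 3 padding bytes after `tag`
// so that `value` is 4-byte aligned. Those bytes have no defined value.
#[repr(C)]
#[derive(Clone, Copy)]
struct Padded {
    tag: u8,
    value: u32,
}

// Same information, but the padding is an explicit field that we set ourselves.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Explicit {
    value: u32,
    tag: u8,
    _pad: [u8; 3],
}

impl Explicit {
    // Encode with a fixed (little-endian) byte order, independent of the host.
    fn to_le_bytes(self) -> [u8; 8] {
        let mut out = [0u8; 8];
        out[..4].copy_from_slice(&self.value.to_le_bytes());
        out[4] = self.tag;
        // out[5..8] stays zero: these are the explicit padding bytes.
        out
    }

    // Decode from the same layout; incoming padding bytes are ignored.
    fn from_le_bytes(bytes: [u8; 8]) -> Self {
        let value = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        Explicit { value, tag: bytes[4], _pad: [0; 3] }
    }
}

// Both structs occupy 8 bytes on targets where u32 is 4-byte aligned,
// although Padded only declares 5 bytes of fields.
fn layout_sizes() -> (usize, usize) {
    (size_of::<Padded>(), size_of::<Explicit>())
}
```

A struct shaped like Explicit (repr(C), no implicit padding) is the kind of type a checked-cast library can accept, whereas Padded is not, because its padding bytes must not be read.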

Rust Storage of Option<i32>

How is an Option laid out in memory? Since an i32 already takes up an even number of bytes, is Rust forced to use a full byte to store the single bit None/Some?
EDIT: According to this answer, Rust in fact uses an extra 4 (!) bytes. Why?
For structs and enums declared without special layout modifiers, the Rust docs state
Nominal types without a repr attribute have the default representation. Informally, this representation is also called the rust representation.
There are no guarantees of data layout made by this representation.
Option cannot possibly be repr(transparent) or repr(i*) since it is neither a newtype struct nor a fieldless enum, and we can check the source code and see that it's not declared repr(C). So no guarantees are made about the layout.
If it were declared repr(C), then we'd get the C representation, which is what you're envisioning. We need one integer to indicate whether it's None or Some (which size of integer is implementation-defined) and then we need enough space to store the actual i32.
In reality, since Rust is given a lot of leeway here, it can do clever things. If you have a variable which is only ever Some, it needn't store the tag bit (and, again, no guarantees are made about layout, so it's free to make this change internally). If you have an i32 that starts at 0 and goes up to 10, it's provably never negative, so Rust might choose to use, say, -1 to indicate None.
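The effect can be observed with std::mem::size_of. This is a small sketch; the helper name option_sizes is made up for illustration. The size of Option<i32> is not guaranteed by the language, although it must be larger than an i32 because every i32 bit pattern is already a valid value. Option<NonZeroI32>, by contrast, is documented to have the same size as i32, because the forbidden value 0 can stand in for None.

```rust
use std::mem::size_of;
use std::num::NonZeroI32;

// Returns the sizes of i32, Option<i32>, and Option<NonZeroI32>, in bytes.
fn option_sizes() -> (usize, usize, usize) {
    (
        size_of::<i32>(),
        // Needs room for a tag in addition to the i32 (8 bytes on current compilers).
        size_of::<Option<i32>>(),
        // The unused bit pattern 0 encodes None, so no extra tag is needed.
        size_of::<Option<NonZeroI32>>(),
    )
}
```

The extra space for Option<i32> usually ends up being a full 4 bytes rather than 1 because the whole value must keep the 4-byte alignment of its i32 payload.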

Is it better to return by value for small types for getters in traits?

For most data types, I follow the convention in https://stackoverflow.com/a/35391084/11963778 and have getters returning references:
trait HasName {
    fn name(&self) -> &String;
    fn name_mut(&mut self) -> &mut String;
}
However, for data types that have copy semantics and are smaller than (or around the size of) a pointer, should I have a getter method returning the value instead? It would look something like this:
trait HasNum {
    fn num_v(&self) -> i32;
    fn num(&self) -> &i32;
    fn num_mut(&mut self) -> &mut i32;
}
Is it good practice to have a getter that returns the value instead? If so, then up to what size should I do this for small data types?
As a rule of thumb, you can copy values that fit on a single cache line instead of using references. While cache lines are typically 64 bytes on x86, Intel recommends limiting data to 16 bytes to reduce the chance of the value not being aligned.
In other words, it's probably fine to just copy anything around the size of [i32; 4] or less.
Note: While there is some reasoning behind it, I just made this rule up based on what I know about performance so far. If enough people were to look at this post, I'm sure someone else will have a better answer. That being said, even if my reasoning is a bit off I think it will likely still hold up in most cases when you aren't trying to optimize an extremely time critical piece of code or for a specific CPU.
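Applied to the trait from the question, that rule of thumb would suggest something like the following sketch, where the shared getter returns the i32 by value and only the mutable accessor hands out a reference. The Counter type is hypothetical.

```rust
trait HasNum {
    // i32 is Copy and only 4 bytes, so returning it by value is cheap.
    fn num(&self) -> i32;
    // Mutation still needs a reference to the stored field.
    fn num_mut(&mut self) -> &mut i32;
}

// Hypothetical implementor used only for illustration.
struct Counter {
    num: i32,
}

impl HasNum for Counter {
    fn num(&self) -> i32 {
        self.num
    }

    fn num_mut(&mut self) -> &mut i32 {
        &mut self.num
    }
}
```

With this shape there is no need for a separate num_v method; callers that want a reference can still borrow the returned copy locally.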
In the time I spent writing this answer I also found a few more interesting points in this question.
https://stackoverflow.com/a/40185996/5987669
It is common, for example, for a machine to have an architecture (machine registers, memory architecture, etc) which result in a "sweet spot" - copying variables of some size is most "efficient", but copying larger OR SMALLER variables is less so. Larger variables will cost more to copy, because there may be a need to do multiple copies of smaller chunks. Smaller ones may also cost more, because the compiler needs to copy the smaller value into a larger variable (or register), do the operations on it, then copy the value back.
https://stackoverflow.com/a/49523201/5987669
This answer is specific to C, but I wouldn't be surprised if it applied to Rust as well
There is a certain GCC optimization called IPA SRA, that replaces "pass by reference" with "pass by value" automatically: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html (-fipa-sra)
...
So with this optimization enabled, using references for small types should be as fast as passing them by value.
On the other hand passing (for example) std::string by value could not be optimized to by-reference speed, as custom copy semantics are being involved.

How can I simplify multiple uses of BigInt::from()?

I wrote a program where I manipulated a lot of BigInt and BigUint values and performed some arithmetic operations.
I produced code where I frequently used BigInt::from(Xu8) because it is not possible to directly add numbers from different types (if I understand correctly).
I want to reduce the number of BigInt::from in my code. I thought about a function to "wrap" this, but I would need a function for each type I want to convert into BigInt/BigUint:
fn short_name(n: X) -> BigInt {
    return BigInt::from(n)
}
Where X will be each type I want to convert.
I couldn't find any solution that is not in contradiction with the static typing philosophy of Rust.
I feel that I am missing something about traits, but I am not very comfortable with them, and I did not find a solution using them.
Am I trying to do something impossible in Rust? Am I missing an obvious solution?
To answer this part:
I produced code where I frequently used BigInt::from(Xu8) because it is not possible to directly add numbers from different types (if I understand correctly).
On the contrary, if you look at BigInt's documentation you'll see many impl Add:
impl<'a> Add<BigInt> for &'a u64
impl Add<u8> for BigInt
and so on. The first allows a_ref_to_u64 + a_bigint, the second a_bigint + a_u8 (and both set the associated type Output to BigInt). You don't need to convert these types to BigInt before adding them! And if you want your method to handle any such type, you just need an Add bound similar to the From bound in Frxstrem's answer. Of course, if you want many such operations, From may end up more readable.
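An Add bound of that kind could be sketched as follows. To keep the example free of external crates, Big is a hypothetical stand-in for BigInt with just two Add implementations, and add_any is a made-up function name; with the real crate the bound would read BigInt: Add<T, Output = BigInt>.

```rust
use std::ops::Add;

// Hypothetical stand-in for BigInt, only for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Big(i128);

impl Add<u8> for Big {
    type Output = Big;
    fn add(self, rhs: u8) -> Big {
        Big(self.0 + i128::from(rhs))
    }
}

impl Add<u64> for Big {
    type Output = Big;
    fn add(self, rhs: u64) -> Big {
        Big(self.0 + i128::from(rhs))
    }
}

// Accepts any right-hand side that Big knows how to add, without converting first.
fn add_any<T>(a: Big, b: T) -> Big
where
    Big: Add<T, Output = Big>,
{
    a + b
}
```

Here add_any(Big(10), 5u8) and add_any(Big(10), 7u64) both compile, while a type without a matching Add implementation is rejected at compile time.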
The From<T> trait (and the complementary Into<T> trait) is what is typically used to convert between types in Rust. In fact, the BigInt::from method comes from the From trait.
You can modify your short_name function into a generic function with a where clause to accept all types that BigInt can be converted from:
fn short_name<T>(n: T) -> BigInt // function with generic type T
where
    BigInt: From<T>, // where BigInt implements the From<T> trait
{
    BigInt::from(n)
}
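The same pattern can be tried without the num-bigint dependency. In this sketch BigNum is a hypothetical stand-in for BigInt, and big and big_into are made-up names; the second function shows the equivalent spelling through Into, which is available automatically for every From implementation.

```rust
// Hypothetical stand-in for BigInt, only for illustration.
#[derive(Debug, PartialEq)]
struct BigNum(i128);

impl From<u8> for BigNum {
    fn from(n: u8) -> Self {
        BigNum(i128::from(n))
    }
}

impl From<i64> for BigNum {
    fn from(n: i64) -> Self {
        BigNum(i128::from(n))
    }
}

// Generic over every T that BigNum can be built from.
fn big<T>(n: T) -> BigNum
where
    BigNum: From<T>,
{
    BigNum::from(n)
}

// Equivalent spelling using the blanket Into implementation.
fn big_into(n: impl Into<BigNum>) -> BigNum {
    n.into()
}
```

Both functions accept a u8 or an i64 here, and neither needs one wrapper per source type.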

Indexing vector by a 32-bit integer

In Rust, vectors are indexed using usize, so when writing
let my_vec: Vec<String> = vec!["Hello".to_string(), "world".to_string()];
let index: u32 = 0;
println!("{}", my_vec[index]);
you get an error, as index is expected to be of type usize. I'm aware that this can be fixed by explicitly converting index to usize:
my_vec[index as usize]
but this is tedious to write. Ideally I'd simply overload the [] operator by implementing
impl<T> std::ops::Index<u32> for Vec<T> { ... }
but that's impossible as Rust prohibits this (as neither the trait nor struct are local). The only alternative that I can see is to create a wrapper class for Vec, but that would mean having to write lots of function wrappers as well. Is there any more elegant way to address this?
Without a clear use case it is difficult to recommend the best approach.
There are basically two questions here:
do you really need indexing?
do you really need to use u32 for indices?
When using functional programming style, indexing is generally unnecessary as you operate on iterators instead. In this case, the fact that Vec only implements Index for usize really does not matter.
If your algorithm really needs indexing, then why not use usize? There are many places to convert from u32 to usize. Converting at the last possible moment is one option, but if you find (or create) a chokepoint, you can get away with only a handful of conversions.
At least, that's the YAGNI point of view.
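Both suggestions can be sketched briefly. The function names at and total_len are hypothetical: the first is a single chokepoint where a u32 becomes a usize, and the second shows an iterator-based computation that needs no index at all.

```rust
use std::convert::TryFrom;

// Single chokepoint for the u32 -> usize conversion.
// Returns None if the index is out of range for the slice
// (or does not fit in usize on a 16-bit target).
fn at<T>(items: &[T], index: u32) -> Option<&T> {
    let index = usize::try_from(index).ok()?;
    items.get(index)
}

// Iterator style: no indexing, so the index type never comes up.
fn total_len(words: &[String]) -> usize {
    words.iter().map(|word| word.len()).sum()
}
```

Callers keep their u32 values and write at(&my_vec, index) instead of repeating an as usize cast at every use.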
Personally, as a type freak, I tend to wrap things around a lot. I just like to add semantic information, because let's face it Vec<i32> just doesn't mean anything.
Rust offers a simple way to create wrapper structures: struct MyType(WrappedType);. That's it.
Once you have your own type, adding indexing is easy. There are several ways to add other operations:
if only a few operations make sense, then adding explicitly is best.
if many operations are necessary, and you do not mind exposing the fact that underneath is a Vec<X>, then you can expose it:
by making it public: struct MyType(pub WrappedType);, users can then call .0 to access it.
by implementing AsRef and AsMut, or creating a getter.
by implementing Deref and DerefMut (deref coercion happens implicitly, so make sure you really want it).
Of course, breaking encapsulation can be annoying later, as it also prevents the maintenance of invariants, so I would consider it a last ditch solution.
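A minimal sketch of such a wrapper, assuming a hypothetical Words type around Vec<String>: because the wrapper is a local type, implementing Index<u32> for it is allowed, and AsRef exposes the contents read-only without making the field public.

```rust
use std::ops::{Index, IndexMut};

// Hypothetical newtype; the inner Vec stays private to the module.
struct Words(Vec<String>);

impl Index<u32> for Words {
    type Output = String;
    fn index(&self, index: u32) -> &String {
        // The only place where the u32 -> usize cast is written.
        &self.0[index as usize]
    }
}

impl IndexMut<u32> for Words {
    fn index_mut(&mut self, index: u32) -> &mut String {
        &mut self.0[index as usize]
    }
}

// Read-only access to the underlying elements.
impl AsRef<[String]> for Words {
    fn as_ref(&self) -> &[String] {
        &self.0
    }
}
```

With this in place, words[index] works directly for a u32 index, and slice methods remain reachable through words.as_ref().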
I prefer to store "references" to nodes as u32 rather than usize. So when traversing the graph I keep retrieving adjacent vertex "references", which I then use to look up the actual vertex object in the Vec object
So actually you don't want u32, because you will never do calculations on it, and u32 easily allows you to do math. You want an index-type that can just do indexing but whose values are immutable otherwise.
I suggest you implement something along the line of rustc_data_structures::indexed_vec::IndexVec.
This custom IndexVec type is not only generic over the element type, but also over the index type, and thus allows you to use a NodeId newtype wrapper around u32. You'll never accidentally use a non-id u32 to index, and you can use them just as easily as a u32. You don't even have to create any of these indices by calculating them from the vector length, instead the push method returns the index of the location where the element has just been inserted.
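The idea can be sketched in a few lines. This is a simplified illustration, not the actual rustc API: the Idx trait, its method names, and the NodeId type are assumptions made for the example.

```rust
use std::convert::TryFrom;
use std::marker::PhantomData;
use std::ops::Index;

// Hypothetical trait for index newtypes.
trait Idx: Copy {
    fn from_usize(i: usize) -> Self;
    fn as_usize(self) -> usize;
}

// Node "reference" stored as a u32; no arithmetic is exposed on it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u32);

impl Idx for NodeId {
    fn from_usize(i: usize) -> Self {
        NodeId(u32::try_from(i).expect("more nodes than fit in a u32"))
    }
    fn as_usize(self) -> usize {
        self.0 as usize
    }
}

// A Vec that can only be indexed by its own index type I.
struct IndexVec<I: Idx, T> {
    raw: Vec<T>,
    _marker: PhantomData<I>,
}

impl<I: Idx, T> IndexVec<I, T> {
    fn new() -> Self {
        IndexVec { raw: Vec::new(), _marker: PhantomData }
    }

    // Returns the index of the element that was just inserted.
    fn push(&mut self, value: T) -> I {
        let id = I::from_usize(self.raw.len());
        self.raw.push(value);
        id
    }
}

impl<I: Idx, T> Index<I> for IndexVec<I, T> {
    type Output = T;
    fn index(&self, id: I) -> &T {
        &self.raw[id.as_usize()]
    }
}
```

Indexing an IndexVec<NodeId, T> with a plain u32 or usize fails to compile, so only ids handed out by push (or constructed deliberately) can be used for lookups.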

Resources