I was wondering what would happen when I cast a very large float value to an integer. This is an example I wrote:
fn main() {
    let x = 82747650246702476024762_f32;
    let y = x as u8;
    let z = x as i32;
    println!("{} {} {}", x, y, z);
}
and the output is:
$ ./casts
82747650000000000000000 0 -2147483648
Obviously the float wouldn't fit in any of the integers, but since Rust so strongly advertises that it is safe, I would have expected an error of some kind. These operations use the LLVM fptosi and fptoui instructions, which produce a so-called poison value if the value doesn't fit within the type it is cast to. This may produce undefined behavior, which is very bad, especially when writing Rust code.
How can I be sure my float to int casts don't result in undefined behavior in Rust? And why would Rust even allow this (as it is known for creating safe code)?
In Rust 1.44 and earlier, if you use as to cast a floating-point number to an integer type and the floating-point number does not fit¹ in the target type, the result is an undefined value², and most things that you can do with it cause undefined behavior.
This serious issue (#10184) was fixed in Rust 1.45. Since that release, float-to-integer casts saturate instead (that is, values that are too large or small are converted to T::MAX or T::MIN, respectively; NaN is converted to 0).
In older versions of Rust, you can enable the new, safe behavior with the -Z saturating-float-casts flag. Note that saturating casts may be slightly slower since they have to check the type bounds first. If you really need to avoid the check, the standard library provides to_int_unchecked. Since the behavior is undefined when the number is out of range, you must use unsafe.
(There used to be a similar issue for certain integer-to-float casts, but it was resolved by making such casts always saturating. This change was not considered a performance regression and there is no way to opt in to the old behavior.)
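On Rust 1.45 or later, the saturating behavior is easy to verify; here's a quick sketch using the question's value:

```rust
fn main() {
    let x = 82747650246702476024762_f32;
    assert_eq!(x as u8, u8::MAX);       // too large: saturates to 255
    assert_eq!(x as i32, i32::MAX);     // saturates instead of producing an undefined value
    assert_eq!((-x) as i32, i32::MIN);  // too small: saturates to the minimum
    assert_eq!(f32::NAN as i32, 0);     // NaN becomes 0
}
```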
Related questions
Can casting in safe Rust ever lead to a runtime error?
¹ "Fit" here means either NaN, or a number of such large magnitude that it cannot be approximated by the smaller type. 8.7654321_f64 will still be truncated to 8 by an as u8 cast, even though the value cannot be represented exactly by the destination type -- loss of precision does not cause undefined behavior, only being out of range does.
² A "poison" value in LLVM, as you correctly note in the question, but Rust itself does not distinguish between undef and poison values.
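The distinction in footnote ¹ is easy to confirm: losing precision is fine, only being out of range was ever a problem.

```rust
fn main() {
    // Loss of precision does not cause undefined behavior; `as` just truncates:
    assert_eq!(8.7654321_f64 as u8, 8);
    // Out-of-range values now saturate rather than being undefined:
    assert_eq!(300.0_f64 as u8, u8::MAX);
}
```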
Part of your problem is that `as` performs lossy casts: you can cast a `u8` to a `u16` with no problem, but a `u16` can't always fit in a `u8`, so the `u16` is truncated. You can read about the behavior of `as` here.
Rust is designed to be memory safe, which means you can't access memory you shouldn't, or have data races (unless you use unsafe), but you can still have memory leaks and other undesirable behavior.
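For example, leaking memory requires no unsafe at all; a minimal illustration:

```rust
fn main() {
    // Leaking a heap allocation is safe Rust; no `unsafe` block required.
    let leaked: &'static mut i32 = Box::leak(Box::new(42));
    assert_eq!(*leaked, 42);
    // The allocation is never freed: that's a leak, not a memory-safety bug.
}
```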
What you described is unexpected behavior, but it is still well defined; these are very different things. Two different Rust compilers would compile this to code that has the same result. If it were undefined behavior, the compiler implementer could have it compile to whatever they wanted, or not compile at all.
Edit: As pointed out in the comments and other answers, casting a float to an int using as currently causes undefined behavior; this is due to a known bug in the compiler.
Related
How is an Option laid out in memory? Since an i32 already takes up an even number of bytes, is Rust forced to use a full byte to store the single-bit None/Some?
EDIT: According to this answer, Rust in fact uses an extra 4 (!) bytes. Why?
For structs and enums declared without special layout modifiers, the Rust docs state
Nominal types without a repr attribute have the default representation. Informally, this representation is also called the rust representation.
There are no guarantees of data layout made by this representation.
Option cannot possibly be repr(transparent) or repr(i*) since it is neither a newtype struct nor a fieldless enum, and we can check the source code and see that it's not declared repr(C). So no guarantees are made about the layout.
If it were declared repr(C), then we'd get the C representation, which is what you're envisioning. We need one integer to indicate whether it's None or Some (which size of integer is implementation-defined) and then we need enough space to store the actual i32.
In reality, since Rust is given a lot of leeway here, it can do clever things. If you have a variable which is only ever Some, it needn't store the tag bit (and, again, no guarantees are made about layout, so it's free to make this change internally). If you have an i32 that starts at 0 and goes up to 10, it's provably never negative, so Rust might choose to use, say, -1 to indicate None.
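One of those clever things, the "niche" optimization, can be observed with `size_of`; the exact size of `Option<i32>` is not guaranteed, but the values below are what typical targets produce:

```rust
use std::mem::size_of;
use std::num::NonZeroI32;

fn main() {
    // Every i32 bit pattern is a valid value, so Option<i32> needs a separate
    // tag (4 bytes of payload + tag, padded to 8 on typical targets):
    assert_eq!(size_of::<Option<i32>>(), 8);
    // NonZeroI32 forbids 0, so Rust uses 0 as the None "niche": zero overhead.
    assert_eq!(size_of::<Option<NonZeroI32>>(), 4);
    // References are never null, so Option<&T> is pointer-sized too.
    assert_eq!(size_of::<Option<&i32>>(), size_of::<&i32>());
}
```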
I am reading through Rust by Example, and I am curious about why we cannot coerce a decimal to a u8, like in the following snippet:
let decimal = 65.4321_f32;
// Error! No implicit conversion
let integer: u8 = decimal;
But explicit casting is allowed, so I don't understand why can't we have it implicit too.
Is this a language design decision? What advantages does this bring?
Safety is a big part of the design of Rust and its standard library. A lot of the focus is on memory safety but Rust also tries to help prevent common bugs by forcing you to make decisions where data could be lost or where your program could panic.
A good example of this is that it uses the Option type instead of null. If you are given an Option<T> you are now forced to decide what to do with it. You could decide to unwrap it and panic, or you could use unwrap_or to provide a sensible default. Your decision, but you have to make it.
To convert a f64 to a u8 you can use the as operator. It doesn't happen automatically because Rust can't decide for you what you want to happen in the case where the number is too big or too small. Or maybe you want to do something with the extra decimal part? Do you want to round it up or down or to the nearest integer?
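Those choices can be spelled out explicitly; a small sketch of the options:

```rust
fn main() {
    let decimal = 65.9_f32;
    assert_eq!(decimal as u8, 65);         // `as` truncates toward zero
    assert_eq!(decimal.round() as u8, 66); // round to nearest
    assert_eq!(decimal.floor() as u8, 65); // round down
    assert_eq!(decimal.ceil() as u8, 66);  // round up
}
```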
Even the as operator is considered by some[1] to be an early design mistake, since you can easily lose data unintentionally - especially when your code evolves over time and the types are less visible because of type inference.
[1] https://github.com/rust-lang/rfcs/issues/2784#issuecomment-543180066
It seems like I am in a catch-22 situation here.
I have this code:
const MAX: u32 = 10;
let vec: Vec<String> = vec![String::from("test")];
let exceeds = vec.len() > MAX;
I get this error:
let res = vec_one.len() < MAX;
^^^ expected `usize`, found `u32`
help: you can convert an `u32` to `usize` and panic if
the converted value wouldn't fit
let res = vec_one.len() < MAX.try_into().unwrap();
^^^^^^^^^^^^^^^^^^^^^^^
As far as I understand from the Rust documentation, comparing two different integer types is a code smell. That shouldn't happen.
However, on the one hand u32 is the recommended integer type, if one is not sure which one to take, because the compiler can handle it most efficiently. I use MAX in many places in the code, where I use it for comparisons with other u32 variables and constants. Therefore MAX has to be of u32 type.
On the other hand Rust's Vector instance returns usize, which is, again as far as I understand from the documentation, an integer type depending on the underlying architecture.
In this context the help hint:
help: you can convert an `u32` to `usize` and panic if
the converted value wouldn't fit
looks to me rather misleading. A situation where the difference of two integer types raises panic should IMHO be avoided by using the proper integer types up front.
The rust help message should come up with a hint how to solve the integer type collision and not with a help which might in the worst case lead to the halt of program execution.
And, to get to the point, how can I resolve the integer type collision in a better way?
However, on the one hand u32 is the recommended integer type, if one is not sure which one to take, because the compiler can handle it most efficiently.
The compiler doesn't really care; it's more a detail of the architecture. u32 is generally a pretty good default because it's handled efficiently on both 32b and 64b architectures, and when sufficient it avoids "wasting" memory on a u64 (or, god forbid, u128).
On the other hand Rust's Vector instance returns usize, which is, again as far as I understand from the documentation, an integer type depending on the underlying architecture.
It's, specifically, an integer large enough to hold a pointer. So strictly speaking it's ABI-related rather than architecture: though the two are usually identical, Linux's x32 ABI uses 32b pointers on a 64b architecture.
x32 makes some sense because 32b values are handled efficiently on 64b architectures (so there's no loss there) and it saves memory when lots of values are pointers (lower stack use, smaller structures, better cache locality, ...).
The rust help message should come up with a hint how to solve the integer type collision and not with a help which might in the worst case lead to the halt of program execution.
And, to get to the point, how can I resolve the integer type collision in a better way?
Just don't put in the unwrap call?
try_into is not magic, it's just a fallible conversion: it'll return Ok(result) if the conversion succeeds and Err(...) if the conversion fails (which would require that the platform's pointers be less than 32b and the specific value doesn't fit in a pointer, which seems unlikely).
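If you want the comparison without any possibility of a panic, you can handle the Err case explicitly; a sketch (the `exceeds` helper is just for illustration):

```rust
use std::convert::TryFrom;

const MAX: u32 = 10;

fn exceeds(v: &[String]) -> bool {
    // If MAX didn't fit in usize (only conceivable on a sub-32-bit target),
    // then no Vec length could exceed it, so `false` is correct anyway.
    usize::try_from(MAX).map_or(false, |max| v.len() > max)
}

fn main() {
    let vec = vec![String::from("test")];
    assert!(!exceeds(&vec));
}
```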
But I don't really see the point of performing a runtime conversion here; just provide a usize version of MAX.
As Rust doesn't have untyped constants, and `as` is less safe than desirable (while using separate literals risks them drifting apart), I'd suggest using a trivial macro expanding to the literal (I guess you could even use a macro instead of a constant in the first place, but that's a bit meh doc-wise), e.g.
macro_rules! MAX {
    () => { 500 };
}

const MAX: u32 = MAX!();
const MAX_USIZE: usize = MAX!();
const MAX8: u8 = MAX!();
will properly trigger a compilation error on the third definition whereas const MAX8: u8 = MAX as u8; would not.
Or you could perform the conversion with as and ignore the issue altogether; given the magnitude of your constant's value the risk is basically non-existent (though present, if the possibility exists that MAX would ever be larger than... 2^16, probably).
Context
I have a pair of related structs in my program, Rom and ProfiledRom. They both store a list of u8 values and implement a common trait, GetRom, to provide access to those values.
trait GetRom {
    fn get(&self, index: usize) -> u8;
}
The difference is that Rom just wraps a simple Vec<u8>, but ProfiledRom wraps each byte in a ProfiledByte type that counts the number of times it is returned by get.
struct Rom(Vec<u8>);
struct ProfiledRom(Vec<ProfiledByte>);
struct ProfiledByte {
    value: u8,
    get_count: u32,
}
Much of my program operates on trait GetRom values, so I can substitute in Rom or ProfiledRom type/value depending on whether I want profiling to occur.
Question
I have implemented From<Rom> for ProfiledRom, because converting a Rom to a ProfiledRom just involves wrapping each byte in a new ProfiledByte: a simple and lossless operation.
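For reference, that lossless direction might look something like this (a sketch based on the struct definitions above, not necessarily the exact implementation):

```rust
struct Rom(Vec<u8>);
struct ProfiledRom(Vec<ProfiledByte>);
struct ProfiledByte {
    value: u8,
    get_count: u32,
}

impl From<Rom> for ProfiledRom {
    fn from(rom: Rom) -> Self {
        // Every byte is wrapped and the counter starts at zero: nothing is lost.
        ProfiledRom(
            rom.0
                .into_iter()
                .map(|value| ProfiledByte { value, get_count: 0 })
                .collect(),
        )
    }
}

fn main() {
    let profiled = ProfiledRom::from(Rom(vec![1, 2, 3]));
    assert_eq!(profiled.0.len(), 3);
    assert_eq!(profiled.0[0].get_count, 0);
}
```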
However, I'm not sure whether it's appropriate to implement From<ProfiledRom> for Rom, because ProfiledRom contains information (the get counts) that can't be represented in a Rom. If you did a round-trip conversion, these values would be lost/reset.
Is it appropriate to implement the From trait when only parts of the source object will be used?
Related
I have seen that the standard library doesn't implement integer conversions like From<i64> for i32 because these could result in bytes being truncated/lost. However, that seems like a somewhat distinct case from what we have here.
With the potentially-truncating integer conversion, you would need to inspect the original i64 to know whether it would be converted appropriately. If you didn't, the behaviour of your code could change unexpectedly when you get an out-of-bounds value. However, in our case above, it's always statically clear what data is being preserved and what data is being lost. The conversion's behaviour won't suddenly change. It should be safer, but is it an appropriate use of the From trait?
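(The standard library's answer to that integer case is the fallible TryFrom trait rather than From:)

```rust
use std::convert::TryFrom;

fn main() {
    // There is no From<i64> for i32, but TryFrom makes the failure explicit:
    assert_eq!(i32::try_from(42_i64).unwrap(), 42);
    assert!(i32::try_from(i64::MAX).is_err());
}
```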
From implementations are usually lossless, but there is currently no strict requirement that they be.
The ongoing discussion at rust-lang/rfcs#2484 is related. Some possibilities include adding a FromLossy trait and more exactly prescribing the behaviour of From. We'll have to see where that goes.
For consideration, here are some Target::from(Source) implementations in the standard library:
Lossless conversions
Each Source value is converted into a distinct Target value.
u16::from(u8), i16::from(u8) and other conversions to strictly-larger integer types.
Vec<u8>::from(String)
Vec<T>::from(BinaryHeap<T>)
OsString::from(String)
char::from(u8)
Lossy conversions
Multiple Source values may be converted into the same Target value.
BinaryHeap<T>::from(Vec<T>) loses the order of elements.
Box<[T]>::from(Vec<T>) and Box<str>::from(String) lose any excess capacity.
Vec<T>::from(VecDeque<T>) loses the internal split of elements exposed by .as_slices().
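The BinaryHeap case is easy to demonstrate: the elements survive a round trip, but their order need not.

```rust
use std::collections::BinaryHeap;

fn main() {
    let original = vec![1, 2, 3];
    let heap = BinaryHeap::from(original.clone());
    let round_tripped = Vec::from(heap);
    // Same elements come back, but not necessarily in the original order:
    let mut sorted = round_tripped;
    sorted.sort();
    assert_eq!(sorted, original);
}
```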
Running this code in rust:
fn main() {
    println!("{:?}", std::mem::size_of::<[u8; 1024]>());
    println!("{:?}", std::mem::size_of::<[bool; 1024]>());
}
1024
1024
This is not what I expected. So I compiled and ran in release mode. But I got the same answer.
Why does the Rust compiler seemingly allocate a whole byte for each single boolean? To me it seems like a simple optimization to allocate only 128 bytes instead. This project implies I'm not the first to think this.
Is this a case of compilers being way harder than they seem? Or is this not optimized because it isn't a realistic scenario? Or am I not understanding something here?
Pointers and references.
There is an assumption that you can always take a reference to an item of a slice, a field of a struct, etc...
There is an assumption in the language that any reference to an instance of a statically sized type can be transmuted to a type-erased pointer *mut ().
Those two assumptions together mean that:
due to (2), it is not possible to create a "bit-reference" which would allow sub-byte addressing,
due to (1), it is not possible not to have references.
This essentially means that any type must have a minimum alignment of one byte.
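A quick sketch of both points: the reference below must have a real byte address, which is why bool has byte size and alignment.

```rust
use std::mem::{align_of, size_of};

fn main() {
    let flags = [true, false, true];
    // Each element needs its own address for this reference to exist:
    let second: &bool = &flags[1];
    assert!(!*second);
    // So bool occupies, and is aligned to, one full byte:
    assert_eq!(size_of::<bool>(), 1);
    assert_eq!(align_of::<bool>(), 1);
}
```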
Note that this is not necessarily an issue. Opting in to a 128-byte representation should be done cautiously, as it implies trading off speed (and convenience) for memory. It's not a pure win.
Prior art (in the form of std::vector&lt;bool&gt; in C++) is widely considered a mistake in hindsight.