When doing integer arithmetic with checks for overflows, calculations often need to compose several arithmetic operations. A straightforward way of chaining checked arithmetic in Rust uses checked_* methods and Option chaining:
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    elem_size.checked_mul(length)
             .and_then(|acc| acc.checked_add(offset))
}
However, this tells the compiler to generate a branch for each elementary operation. I have encountered a more unrolled approach using the overflowing_* methods:
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
Continuing the computation regardless of overflows and aggregating the overflow flags with bitwise OR ensures that at most one branch is taken in the entire evaluation (provided that the implementations of the overflowing_* methods generate branchless code). This optimization-friendly approach is more cumbersome and requires some caution when dealing with intermediate values.
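For more complex expressions the same pattern extends naturally: keep accumulating with overflowing_* operations and OR all the flags, so there is still only a single branch at the end. A minimal sketch (the extra header term is made up for illustration):

fn calculate_size_with_header(elem_size: usize,
                              length: usize,
                              offset: usize,
                              header: usize)
                              -> Option<usize> {
    // elem_size * length + offset + header, checked with a single final branch
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    let (acc, oflo3) = acc.overflowing_add(header);
    if oflo1 | oflo2 | oflo3 {
        None
    } else {
        Some(acc)
    }
}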
Does anyone have experience with how the Rust compiler optimizes either of the patterns above on various CPU architectures, to tell whether the explicit unrolling is worthwhile, especially for more complex expressions?
You can use the playground to check how LLVM optimizes things: just click on "LLVM IR" or "ASM" instead of "Run". Stick a #[inline(never)] on the function you wish to check, and make sure to pass it run-time arguments to avoid constant folding. As in here:
use std::env;

#[inline(never)]
fn calculate_size(elem_size: usize,
                  length: usize,
                  offset: usize)
                  -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}

fn main() {
    // skip(1) drops the program name, which would not parse as usize
    let vec: Vec<usize> = env::args().skip(1).map(|s| s.parse().unwrap()).collect();
    let result = calculate_size(vec[0], vec[1], vec[2]);
    println!("{:?}", result);
}
The answer you'll get, however, is that the overflow intrinsics in Rust and LLVM have been coded for convenience and not performance, unfortunately. This means that while the explicit unrolling optimizes well, counting on LLVM to optimize the checked code is not realistic for now.
Normally this is not an issue; but for a performance hotspot, you may want to unroll manually.
Note: this lack of performance is also the reason that overflow checking is disabled by default in Release mode.
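As a quick illustration of that default (a sketch assuming the standard debug and release profiles; std::hint::black_box is only used here to keep the compiler from folding the arithmetic at compile time):

fn main() {
    let x: u8 = std::hint::black_box(255);
    // Panics in a debug build (overflow checks on);
    // wraps around to 0 in a default release build (checks off).
    let y = x + 1;
    println!("{}", y);
}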
Related
I'm developing a chess engine in Rust. I have a Move struct with from and to fields, which are of type Square. Square is a struct containing a usize, so that I can use it directly when accessing board elements of the position. Since indexing in Rust must be done with usize, I'm wondering what the fastest way to handle this situation is (note that move generation should be as fast as possible). I understand it's more memory-friendly to store a u8 and cast it every time I need to use it as an index, but is it faster? What would be the idiomatic way to approach this?
I have:
struct Square { index: usize }

impl Position {
    fn at(&self, square: Square) -> Option<Piece> {
        self.board[square.index] // board: [Option<Piece>; 64]
    }
}
I've tried migrating to u8 and casting every time with mixed results:
struct Square(u8);

impl Position {
    fn at(&self, square: Square) -> Option<Piece> {
        self.board[square.0 as usize]
    }
}
Pro u8 casting:
better cache utilization (objects are smaller); but this might only matter when there are a lot of objects
Con u8:
casting might require additional instructions on some platforms; but these are usually only register operations which are optimized by the CPU
Idiomatic way to avoid the as usize: implement a wrapper
impl Square {
    #[inline]
    pub fn index(&self) -> usize {
        self.0 as usize
    }
}
Or, when you want to make it really typesafe, implement std::ops::Index:
struct Piece;
struct Board([Piece; 64]);
struct Square(u8);

impl std::ops::Index<Square> for Board {
    type Output = Piece;

    fn index(&self, index: Square) -> &Self::Output {
        &self.0[index.0 as usize]
    }
}
When you only cast for indexing, Rust's compiler is smart enough to notice that, so both versions produce the exact same assembly.
See the playground.
Even if that weren't the case, casting from u8 to usize wouldn't be much more than a single instruction (which is pretty much no overhead).
On the other hand, usize takes eight times as much space as u8 (on a 64-bit machine).
So if you plan on having A LOT of Square instances, that might be a factor to consider in favor of the casting option; if not, it pretty much doesn't matter at all.
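If you want to sanity-check the memory argument yourself, here is a quick sketch with std::mem::size_of (the two type names are made up for the comparison):

use std::mem::size_of;

#[allow(dead_code)]
struct SquareUsize { index: usize }
#[allow(dead_code)]
struct SquareU8(u8);

fn main() {
    // 8 bytes vs 1 byte per square on a typical 64-bit target
    println!("usize-based: {} bytes", size_of::<SquareUsize>());
    println!("u8-based:    {} bytes", size_of::<SquareU8>());
    // the difference scales with the number of squares you store
    println!("1024 squares: {} vs {} bytes",
             size_of::<[SquareUsize; 1024]>(),
             size_of::<[SquareU8; 1024]>());
}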
I'd like to try to eliminate bounds checking in code generated by Rust. I have variables that are rarely zero, and my code paths ensure they do not run into trouble. But because they can be zero, I cannot use NonZeroU64. When I am sure they are non-zero, how can I signal this to the compiler?
For example, if I have the following function, I know it will be non-zero. Can I tell the compiler this or do I have to have the unnecessary check?
pub fn f(n: u64) -> u32 {
    n.trailing_zeros()
}
I can wrap the number in NonZeroU64 when I am sure, but then I've already incurred the check, which defeats the purpose ...
Redundant checks within a single function body can usually be optimized out. So you just need to convert the number to NonZeroU64 before calling trailing_zeros(), and rely on the compiler to optimize out the redundant checks.
use std::num::NonZeroU64;

pub fn g(n: NonZeroU64) -> u32 {
    n.trailing_zeros()
}

pub fn other_fun(n: u64) -> u32 {
    if n != 0 {
        println!("Do something with non-zero!");
        let n = NonZeroU64::new(n).unwrap();
        g(n)
    } else {
        42
    }
}
In the above code, the if n != 0 makes sure n cannot be zero within the block, and the compiler is smart enough to remove the unwrap call, making NonZeroU64::new(n).unwrap() a zero-cost operation. You can check the asm to verify that.
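If you'd rather avoid the unwrap altogether, one possible variant (a sketch, not part of the answer above) is to branch on NonZeroU64::new directly, which performs the single zero check for you:

use std::num::NonZeroU64;

pub fn other_fun(n: u64) -> u32 {
    match NonZeroU64::new(n) {
        // None exactly when n == 0, so the Some arm sees a proven non-zero value
        Some(non_zero) => non_zero.trailing_zeros(),
        None => 42,
    }
}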
core::intrinsics::assume
Informs the optimizer that a condition is always true. If the condition is false, the behavior is undefined.
No code is generated for this intrinsic, but the optimizer will try to preserve it (and its condition) between passes, which may interfere with optimization of surrounding code and reduce performance. It should not be used if the invariant can be discovered by the optimizer on its own, or if it does not enable any significant optimizations.
This intrinsic does not have a stable counterpart.
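For completeness, a nightly-only sketch of how that intrinsic could be applied to the function from the question; the assume call is a promise to the optimizer, and breaking it is undefined behavior:

#![feature(core_intrinsics)]

pub fn f(n: u64) -> u32 {
    // SAFETY: every caller must guarantee n != 0;
    // if that promise is broken, this is undefined behavior.
    unsafe { core::intrinsics::assume(n != 0) };
    n.trailing_zeros()
}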
Generalized Question
How can I implement a general function pinned_array_of_default in stable Rust where [T; N] is too large to fit on the stack?
fn pinned_array_of_default<T: Default, const N: usize>() -> Pin<Box<[T; N]>> {
    unimplemented!()
}
Alternatively, T can implement Copy if that makes the process easier.
fn pinned_array_of_element<T: Copy, const N: usize>(x: T) -> Pin<Box<[T; N]>> {
    unimplemented!()
}
Keeping the solution in safe Rust would have been preferable, but it seems unlikely that it is possible.
Approaches
Initially I was hoping that by implementing Default I might be able to get Default to handle the initial allocation; however, it still creates the array on the stack, so this will not work for large values of N.
let boxed: Box<[T; N]> = Box::default();
let foo = Pin::new(boxed);
I suspect I need to use MaybeUninit to achieve this, and there is a Box::new_uninit() function, but it is currently unstable and I would ideally like to keep this within stable Rust. I am also somewhat unsure whether transmuting Pin<Box<MaybeUninit<B>>> to Pin<Box<B>> could somehow have negative effects on the Pin.
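For reference, here is roughly what that MaybeUninit route could look like with the unstable Box::new_uninit (a sketch only, since the goal is to stay on stable; note it also leaks already-initialized elements if T::default() panics part-way through):

#![feature(new_uninit)]

use std::mem::MaybeUninit;
use std::pin::Pin;

fn pinned_array_of_default<T: Default, const N: usize>() -> Pin<Box<[T; N]>> {
    // Allocate the uninitialized array directly on the heap.
    let mut boxed: Box<MaybeUninit<[T; N]>> = Box::new_uninit();
    let ptr = boxed.as_mut_ptr() as *mut T;
    for i in 0..N {
        // Write each element in place, never constructing [T; N] on the stack.
        unsafe { ptr.add(i).write(T::default()) };
    }
    // SAFETY: every element has been initialized in the loop above.
    let boxed: Box<[T; N]> = unsafe { boxed.assume_init() };
    Box::into_pin(boxed)
}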
Background
The purpose behind using a Pin<Box<[T; N]>> is to hold a block of pointers where N is some constant factor/multiple of the page size.
#[repr(C)]
#[derive(Copy, Clone)]
pub union Foo<R: ?Sized> {
    assigned: NonNull<R>,
    next_unused: Option<NonNull<Self>>,
}
Each pointer may or may not be in use at a given point in time. An in-use Foo points to R, and an unused/empty Foo holds a pointer to either the next empty Foo in the block or None. A pointer to the first unused Foo in the block is stored separately. When a block is full, a new block is created and the pointer chain of unused positions continues through the next block.
The box needs to be pinned since it will contain self referential pointers as well as outside structs holding pointers into assigned positions in each block.
I know that Foo is wildly unsafe by Rust standards, but the general question of creating a Pin<Box<[T; N]>> still stands.
A way to construct a large array on the heap and avoid creating it on the stack is to proxy through a Vec. You can construct the elements and use .into_boxed_slice() to get a Box<[T]>. You can then use .try_into() to convert it to a Box<[T; N]>. And then use .into() to convert it to a Pin<Box<[T; N]>>:
use std::pin::Pin;

fn pinned_array_of_default<T: Default, const N: usize>() -> Pin<Box<[T; N]>> {
    let mut vec = vec![];
    vec.resize_with(N, T::default);
    let boxed: Box<[T; N]> = match vec.into_boxed_slice().try_into() {
        Ok(boxed) => boxed,
        Err(_) => unreachable!(),
    };
    boxed.into()
}
You can optionally make this look more straightforward if you add T: Clone so that you can do vec![T::default(); N], and/or add T: Debug so you can use .unwrap() or .expect().
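Hypothetical usage of the function above (assuming it is in scope), building an array that would be too large to construct on the stack first:

fn main() {
    // One million u64s (~8 MB), allocated and initialized directly on the heap.
    let pinned = pinned_array_of_default::<u64, 1_000_000>();
    assert_eq!(pinned.len(), 1_000_000);
    assert_eq!(pinned[0], 0u64);
}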
See also:
Creating a fixed-size array on heap in Rust
I like using partial application because, among other things, it lets me split up a complicated function call, which is more readable.
An example of partial application:
fn add(x: i32, y: i32) -> i32 {
    x + y
}

fn main() {
    let add7 = |x| add(7, x);
    println!("{}", add7(35));
}
Is there overhead to this practice?
Here is the kind of thing I like to do (from a real code):
fn foo(n: u32, mut things: Vec<Things>) {
    let create_new_multiplier = |thing| ThingMultiplier::new(thing, n); // ThingMultiplier is an Iterator
    let new_things = things.clone().into_iter().flat_map(create_new_multiplier);
    things.extend(new_things);
}
This is purely visual. I do not like to nest things too much.
There should not be a performance difference between defining the closure before it's used versus defining and using it directly. There is a type system difference, though: the compiler doesn't fully know how to infer types in a closure that isn't immediately called.
In code:
let create_new_multiplier = |thing| ThingMultiplier::new(thing, n);
things.clone().into_iter().flat_map(create_new_multiplier)
will be the exact same as
things.clone().into_iter().flat_map(|thing| {
    ThingMultiplier::new(thing, n)
})
In general, there should not be a performance cost for using closures. This is what Rust means by "zero cost abstraction": the programmer could not have written it better themselves.
The compiler converts a closure into implementations of the Fn* traits on an anonymous struct. At that point, all the normal compiler optimizations kick in. Because of techniques like monomorphization, it may even be faster. This does mean that you need to do normal profiling to see if they are a bottleneck.
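As a rough, hand-written illustration of that desugaring (the names are invented, and implementing the real Fn traits manually is unstable, so an ordinary method stands in for Fn::call):

// What a capturing closure like |thing| thing * n conceptually compiles to:
struct CaptureN {
    n: u32, // the captured variable
}

impl CaptureN {
    // stand-in for the Fn trait's call method
    fn call(&self, thing: u32) -> u32 {
        thing * self.n
    }
}

fn main() {
    let n = 3;
    let closure = CaptureN { n };
    assert_eq!(closure.call(14), 42);
}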
In your particular example, yes, extend can get inlined as a loop, containing another loop for the flat_map which in turn just puts ThingMultiplier instances into the same stack slots holding n and thing.
But you're barking up the wrong efficiency tree here. Instead of wondering whether an allocation of a small struct holding two fields gets optimized away, you should rather wonder how efficient that clone is, especially for large inputs.
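One possible way to sidestep that clone (a sketch with made-up stand-in types, since the original ThingMultiplier is not shown): iterate by reference, collect the new items into a temporary Vec, then extend.

#[derive(Clone, Debug)]
struct Thing(u32);

// Hypothetical stand-in for the ThingMultiplier iterator from the question.
fn multiples_of(thing: &Thing, n: u32) -> impl Iterator<Item = Thing> {
    let base = thing.0;
    (1..=n).map(move |k| Thing(base * k))
}

fn foo(n: u32, things: &mut Vec<Thing>) {
    // Collecting first avoids cloning the entire input Vec up front.
    let new_things: Vec<Thing> = things
        .iter()
        .flat_map(|thing| multiples_of(thing, n))
        .collect();
    things.extend(new_things);
}

fn main() {
    let mut things = vec![Thing(1), Thing(2)];
    foo(2, &mut things);
    println!("{:?}", things);
}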
Does the compiler generate the same code for iter().map().sum() and iter().fold()? In the end they achieve the same goal, but the first version would iterate twice, once for the map and once for the sum.
Here is an example. Which version would be faster in total?
pub fn square(s: u32) -> u64 {
    match s {
        s @ 1...64 => 2u64.pow(s - 1),
        _ => panic!("Square must be between 1 and 64"),
    }
}

pub fn total() -> u64 {
    // A fold
    (0..64).fold(0u64, |r, s| r + square(s + 1))
    // or a map
    (1..64).map(square).sum()
}
What would be good tools to look at the assembly or benchmark this?
For them to generate the same code, they'd first have to do the same thing. Your two examples do not:
fn total_fold() -> u64 {
    (0..64).fold(0u64, |r, s| r + square(s + 1))
}

fn total_map() -> u64 {
    (1..64).map(square).sum()
}

fn main() {
    println!("{}", total_fold());
    println!("{}", total_map());
}
18446744073709551615
9223372036854775807
Let's assume you meant
fn total_fold() -> u64 {
    (1..64).fold(0u64, |r, s| r + square(s + 1))
}

fn total_map() -> u64 {
    (1..64).map(|i| square(i + 1)).sum()
}
There are a few avenues to check:
The generated LLVM IR
The generated assembly
Benchmark
The easiest source for the IR and assembly is one of the playgrounds (official or alternate). These both have buttons to view the assembly or IR. You can also pass --emit=llvm-ir or --emit=asm to the compiler to generate these files.
Make sure to generate assembly or IR in release mode. The attribute #[inline(never)] is often useful to keep functions separate to find them easier in the output.
Benchmarking is documented in The Rust Programming Language, so there's no need to repeat all that valuable information.
Before Rust 1.14, these did not produce the exact same assembly. I'd wait for benchmarking / profiling data to see if there's any meaningful impact on performance before I worried.
As of Rust 1.14, they do produce the same assembly! This is one reason I love Rust. You can write clear and idiomatic code and smart people come along and make it equally as fast.
but the first code would iterate two times, once for the map and once for the sum.
This is incorrect, and I'd love to know what source told you this so we can go correct it at that point and prevent future misunderstandings. An iterator operates on a pull basis; one element is processed at a time. The core method is next, which yields a single value, running just enough computation to produce that value.
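A tiny demonstration of that pull model (a sketch, not from the answer): the closure given to map only runs when next() asks for the following element.

fn main() {
    let mut iter = (1..4).map(|x| {
        println!("mapping {}", x);
        x * x
    });
    println!("nothing has been mapped yet");
    println!("got {:?}", iter.next()); // prints "mapping 1" only now
    println!("got {:?}", iter.next()); // then "mapping 2"
}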
First, let's fix those example to actually return the same result:
pub fn total_fold_iter() -> u64 {
    (1..65).fold(0u64, |r, s| r + square(s))
}

pub fn total_map_iter() -> u64 {
    (1..65).map(square).sum()
}
Now, let's develop them, starting with fold. A fold is just a loop and an accumulator; it is roughly equivalent to:
pub fn total_fold_explicit() -> u64 {
    let mut total = 0;
    for i in 1..65 {
        total = total + square(i);
    }
    total
}
Then, let's go with map and sum, and unwrap the sum first, which is roughly equivalent to:
pub fn total_map_partial_iter() -> u64 {
    let mut total = 0;
    for i in (1..65).map(square) {
        total += i;
    }
    total
}
It's just a simple accumulator! And now, let's unwrap the map layer (which only applies a function), obtaining something that is roughly equivalent to:
pub fn total_map_explicit() -> u64 {
    let mut total = 0;
    for i in 1..65 {
        let s = square(i);
        total += s;
    }
    total
}
As you can see, both of them are extremely similar: they apply the same operations in the same order and have the same overall complexity.
Which is faster? I have no idea. And a micro-benchmark may only tell half the truth anyway: just because something is faster in a micro-benchmark does not mean it is faster in the midst of other code.
What I can say, however, is that they both have equivalent complexity and therefore should behave similarly, i.e. within a small factor of each other.
And that I would personally go for map + sum, because it expresses the intent more clearly whereas fold is the "kitchen-sink" of Iterator methods and therefore far less informative.