There's no much information about __builtin_ia32_addcarryx_u64, only I coud find is this: https://clang.llvm.org/doxygen/adxintrin_8h_source.html
It looks like __builtin_ia32_addcarryx_u64 returns the carry of the addition of 2 u64 numbers as a u8. While I could implement this in Rust, would be nice to have the fast version of it.
What you're looking for is probably the u64::overflowing_add function (which exists for u64 and all other primitive integer types):
let x: u64 = 0x0123456789abcdef;
let y: u64 = 0xfedcba9876543212;
let (sum, carry) = x.overflowing_add(y);
// sum = 1, carry = true
This function returns a tuple with the resulting sum and the carry bit (as a bool), and is implemented using the add_with_overflow compiler intrinsic, so it should be very fast.
If you need to take a carry value as input, you can (as you mentioned in a comment) use the nightly-only u64::carrying_add. It's just implemented in terms of two overflowing_add operations, though, and you can just use these instead on stable Rust if you need it:
impl u64 {
pub const fn carrying_add(self, rhs: u64, carry: bool) -> (u64, bool) {
// note: longer-term this should be done via an intrinsic, but
// this has been shown to generate optimal code for now, and
// LLVM doesn't have an equivalent intrinsic
let (a, b) = self.overflowing_add(rhs);
let (c, d) = a.overflowing_add(carry as u64);
(c, b || d)
}
}
Related
I'm currently writing a simple function to swap numbers in Rust:
fn swapnumbers() {
let a = 1;
let b = 2;
let (a, b) = (b, a);
println!("{}, {}", a, b);
}
I am now trying to make a test for it, how do I do it? All my other attempts have failed.
I would suggest modifying the function to return something instead of printing it, and then using either the assert_eq! or assert! macros to test for proper function. (docs for assert_eq!, docs for assert!)
fn swapnumbers() -> (i32, i32) {
let a = 1;
let b = 2;
let (a, b) = (b, a);
return (a, b);
}
assert_eq!(swapnumbers(), (2, 1));
(-> (i32, i32) means that this function returns a tuple of two i32s)
And if you're unfamiliar with testing in Rust, the official Rust book tutorial can help you out with that!
If you want to actually swap numbers, you would need to do something like this:
fn swapnumbers(a: &mut i32, b: &mut i32) {
std::mem::swap(a, b);
}
Note the types specified after the parameter names. &mut i32 means the passed value must be a mutable reference of an i32 The parameter must be mutable for you to be able to assign to it and change its value, and it must be a reference so that the function does not actually take ownership of the data.
I ran into a Rustlings exercise that keeps bugging me:
pub fn factorial(num: u64) -> u64 {
// Complete this function to return factorial of num
// Do not use:
// - return
// For extra fun don't use:
// - imperative style loops (for, while)
// - additional variables
// For the most fun don't use:
// - recursion
// Execute `rustlings hint iterators4` for hints.
}
A hint to solution tells me...
In an imperative language you might write a for loop to iterate
through multiply the values into a mutable variable. Or you might
write code more functionally with recursion and a match clause. But
you can also use ranges and iterators to solve this in rust.
I tried this approach, but I am missing something:
if num > 1 {
(2..=num).map(|n| n * ( n - 1 ) ??? ).???
} else {
1
}
Do I have to use something like .take_while instead of if?
The factorial is defined as the product of all the numbers from a starting number down to 1. We use that definition and Iterator::product:
fn factorial(num: u64) -> u64 {
(1..=num).product()
}
If you look at the implementation of Product for the integers, you'll see that it uses Iterator::fold under the hood:
impl Product for $a {
fn product<I: Iterator<Item=Self>>(iter: I) -> Self {
iter.fold($one, Mul::mul)
}
}
You could hard-code this yourself:
fn factorial(num: u64) -> u64 {
(1..=num).fold(1, |acc, v| acc * v)
}
See also:
How to sum the values in an array, slice, or Vec in Rust?
How do I sum a vector using fold?
Although using .product() or .fold() is probably the best answer, you can also use .for_each().
fn factorial(num: u64) -> u64 {
let mut x = 1;
(1..=num).for_each(|i| x *= i);
x
}
Does the compiler generate the same code for iter().map().sum() and iter().fold()? In the end they achieve the same goal, but the first code would iterate two times, once for the map and once for the sum.
Here is an example. Which version would be faster in total?
pub fn square(s: u32) -> u64 {
match s {
s # 1...64 => 2u64.pow(s - 1),
_ => panic!("Square must be between 1 and 64")
}
}
pub fn total() -> u64 {
// A fold
(0..64).fold(0u64, |r, s| r + square(s + 1))
// or a map
(1..64).map(square).sum()
}
What would be good tools to look at the assembly or benchmark this?
For them to generate the same code, they'd first have to do the same thing. Your two examples do not:
fn total_fold() -> u64 {
(0..64).fold(0u64, |r, s| r + square(s + 1))
}
fn total_map() -> u64 {
(1..64).map(square).sum()
}
fn main() {
println!("{}", total_fold());
println!("{}", total_map());
}
18446744073709551615
9223372036854775807
Let's assume you meant
fn total_fold() -> u64 {
(1..64).fold(0u64, |r, s| r + square(s + 1))
}
fn total_map() -> u64 {
(1..64).map(|i| square(i + 1)).sum()
}
There are a few avenues to check:
The generated LLVM IR
The generated assembly
Benchmark
The easiest source for the IR and assembly is one of the playgrounds (official or alternate). These both have buttons to view the assembly or IR. You can also pass --emit=llvm-ir or --emit=asm to the compiler to generate these files.
Make sure to generate assembly or IR in release mode. The attribute #[inline(never)] is often useful to keep functions separate to find them easier in the output.
Benchmarking is documented in The Rust Programming Language, so there's no need to repeat all that valuable information.
Before Rust 1.14, these do not produce the exact same assembly. I'd wait for benchmarking / profiling data to see if there's any meaningful impact on performance before I worried.
As of Rust 1.14, they do produce the same assembly! This is one reason I love Rust. You can write clear and idiomatic code and smart people come along and make it equally as fast.
but the first code would iterate two times, once for the map and once for the sum.
This is incorrect, and I'd love to know what source told you this so we can go correct it at that point and prevent future misunderstandings. An iterator operates on a pull basis; one element is processed at a time. The core method is next, which yields a single value, running just enough computation to produce that value.
First, let's fix those example to actually return the same result:
pub fn total_fold_iter() -> u64 {
(1..65).fold(0u64, |r, s| r + square(s))
}
pub fn total_map_iter() -> u64 {
(1..65).map(square).sum()
}
Now, let's develop them, starting with fold. A fold is just a loop and an accumulator, it is roughly equivalent to:
pub fn total_fold_explicit() -> u64 {
let mut total = 0;
for i in 1..65 {
total = total + square(i);
}
total
}
Then, let's go with map and sum, and unwrap the sum first, which is roughly equivalent to:
pub fn total_map_partial_iter() -> u64 {
let mut total = 0;
for i in (1..65).map(square) {
total += i;
}
total
}
It's just a simple accumulator! And now, let's unwrap the map layer (which only applies a function), obtaining something that is roughly equivalent to:
pub fn total_map_explicit() -> u64 {
let mut total = 0;
for i in 1..65 {
let s = square(i);
total += s;
}
total
}
As you can see, the both of them are extremely similar: they have apply the same operations in the same order and have the same overall complexity.
Which is faster? I have no idea. And a micro-benchmark may only tell half the truth anyway: just because something is faster in a micro-benchmark does not mean it is faster in the midst of other code.
What I can say, however, is that they both have equivalent complexity and therefore should behave similarly, ie within a factor of each other.
And that I would personally go for map + sum, because it expresses the intent more clearly whereas fold is the "kitchen-sink" of Iterator methods and therefore far less informative.
Let's say I have the following struct in Rust:
struct Num {
pub num: i32;
}
impl Num {
pub fn new(x: i32) -> Num {
Num { num: x }
}
}
impl Clone for Num {
fn clone(&self) -> Num {
Num { num: self.num }
}
}
impl Copy for Num { }
impl Add<Num> for Num {
type Output = Num;
fn add(self, rhs: Num) -> Num {
Num { num: self.num + rhs.num }
}
}
And then I have the following code snippet:
let a = Num::new(0);
let b = Num::new(1);
let c = a + b;
let d = a + b;
This works because Num is marked as Copy. Otherwise, the second addition would be a compilation error, since a and b had already been moved into the add function during the first addition (I think).
The question is what the emitted assembly does. When the add function is called, are two copies of the arguments made into the new stack frame, or is the Rust compiler smart enough to know that in this case, it's not necessary to do that copying?
If the Rust compiler isn't smart enough, and actually does the copying like a function with a argument passed by value in C++, how do you avoid the performance overhead in cases where it matters?
The context is I am implementing a matrix class (just to learn), and if I have a 100x100 matrix, I really don't want to be invoking two copies every time I try to do a multiply or add.
The question is what the emitted assembly does
There's no need to guess; you can just look. Let's use this code:
use std::ops::Add;
#[derive(Copy, Clone, Debug)]
struct Num(i32);
impl Add for Num {
type Output = Num;
fn add(self, rhs: Num) -> Num {
Num(self.0 + rhs.0)
}
}
#[inline(never)]
fn example() -> Num {
let a = Num(0);
let b = Num(1);
let c = a + b;
let d = a + b;
c + d
}
fn main() {
println!("{:?}", example());
}
Paste it into the Rust Playground, then select the Release mode and view the LLVM IR:
Search through the result to see the definition of the example function:
; playground::example
; Function Attrs: noinline norecurse nounwind nonlazybind readnone uwtable
define internal fastcc i32 #_ZN10playground7example17h60e923840d8c0cd0E() unnamed_addr #2 {
start:
ret i32 2
}
That's right, this was completely and totally evaluated at compile time and simplified all the way down to a simple constant. Compilers are pretty good nowadays.
Maybe you want to try something not quite as hardcoded?
#[inline(never)]
fn example(a: Num, b: Num) -> Num {
let c = a + b;
let d = a + b;
c + d
}
fn main() {
let something = std::env::args().count();
println!("{:?}", example(Num(something as i32), Num(1)));
}
Produces
; playground::example
; Function Attrs: noinline norecurse nounwind nonlazybind readnone uwtable
define internal fastcc i32 #_ZN10playground7example17h73d4138fe5e9856fE(i32 %a) unnamed_addr #3 {
start:
%0 = shl i32 %a, 1
%1 = add i32 %0, 2
ret i32 %1
}
Oops, the compiler saw that we we basically doing (x + 1) * 2, so it did some tricky optimizations here to get to 2x + 2. Let's try something harder...
#[inline(never)]
fn example(a: Num, b: Num) -> Num {
a + b
}
fn main() {
let something = std::env::args().count() as i32;
let another = std::env::vars().count() as i32;
println!("{:?}", example(Num(something), Num(another)));
}
Produces
; playground::example
; Function Attrs: noinline norecurse nounwind nonlazybind readnone uwtable
define internal fastcc i32 #_ZN10playground7example17h73d4138fe5e9856fE(i32 %a, i32 %b) unnamed_addr #3 {
start:
%0 = add i32 %b, %a
ret i32 %0
}
A simple add instruction.
The real takeaway from this is:
Look at the generated assembly for your cases. Even similar-looking code might optimize differently.
Perform micro and macro benchmarking. You never know exactly how the code will play out in the big picture. Maybe all your cache will be blown, but your micro benchmarks will be stellar.
is the Rust compiler smart enough to know that in this case, it's not necessary to do that copying?
As you just saw, the Rust compiler plus LLVM are pretty smart. In general, it is possible to elide copies when it knows that the operand isn't needed. Whether it will work in your case or not is tough to answer.
Even if it did, you might not want to be passing large items via the stack as it's always possible that it will need to be copied.
And note that you don't have to implement copy for the value, you can choose to only allow it via references:
impl<'a, 'b> Add<&'b Num> for &'a Num {
type Output = Num;
fn add(self, rhs: &'b Num) -> Num {
Num(self.0 + rhs.0)
}
}
In fact, you may want to implement both ways of adding them, and maybe all 4 permutations of value / reference!
If the Rust compiler isn't smart enough, and actually does the copying like a function with a argument passed by value in C++, how do you avoid the performance overhead in cases where it matters?
The context is I am implementing a matrix class (just to learn), and if I have a 100x100 matrix, I really don't want to be invoking two copies every time I try to do a multiply or add.
All of Rust's implicit copies (be them from moves or actual Copy types) are a shallow memcpy. If you heap allocate, only the pointers and such are copied. Unlike C++, passing a vector by value will only copy three pointer-sized values.
To copy heap memory, an explicit copy must be made, normally by calling .clone(), implemented with #[derive(Clone)] or impl Clone.
I've talked in more detail about this elsewhere.
Shepmaster points out that shallow copies are often messed with a lot by the compiler - generally only heap memory and massive stack values cause problems.
Running example on play.rust-lang.org
fn main() {
show({
let number = b"123456";
for sequence in number.windows(6) {
let product = sequence.iter().fold(1, |a, &b| a * (b as u64));
println!("product of {:?} is {}", sequence, product);
}
});
}
Instead of having an output like "product of [49, 50, 51, 52, 53, 54] is 15312500000" I need the normal numbers in the brackets and the normalized result for the product.
Trying around with - b'0' to subtract the 48 to get the normal digits in line 5 doesn't work, i.e.
a * ((b as u64) -b'0')
or
(a - b'0') * (b as u64)
Seems I'm missing something here, for example I have no idea what exactly are the 'a' and 'b' values in the fold(). Can anyone enlighten me? :)
Looking at the signature of fold, we can see that it takes two arguments:
fn fold<B, F>(self, init: B, f: F) -> B
where F: FnMut(B, Self::Item) -> B
init, which is of some arbitrary type B, and f, which is a closure that takes a B value and an element from the iterator, in order to compute a new B value. The whole function returns a B. The types are strongly suggestive of what happens: the closure f is repeatedly called on successive elements of the iterator, passing the computed B value into the next f call. Checking the implementation confirms this suspicion:
let mut accum = init;
for x in self {
accum = f(accum, x);
}
accum
It runs through the iterator, passing the accumulated state into the closure in order to compute the next state.
First things first, lets put the type on the fold call:
let product = sequence.iter().fold(1, |a: u64, &b: &u8| a * (b as u64));
That is, the B type we want is u64 (that's what our final product will be), and the item type of the iterator is &u8, a reference to a byte.
Now, we can manually inline the definition of fold to compute product to try to clarify the desired behaviour (I'm ignoring the normalisation for now):
let mut accum = 1;
for x in sequence.iter() {
accum = { // the closure
let a: u64 = accum;
let &b: &u8 = x;
a * b as u64
}
}
let product = accum;
Simplifying:
let mut product = 1;
for &b in sequence.iter() {
product = product * (b as u64)
}
Hopefully this makes it clearer what needs to happen: b runs across each byte, and so it is the value that needs adjustment, to bring the ASCII encoded value down to the expected 0..10 range.
So, you were right with:
a * ((b as u64) -b'0')
However, the details mean that fails to compile, with a type error: b'0' has type u8, but b as u64 as type u64, and it's not legal to use - with u64 and u8. Moving the normalisation to happen before the u64 cast will ensure this works ok, since then you're subtracting b (which is a u8) and a u8:
product * (b - b'0') as u64
All in all, the fold might look clearer (and actually work) as:
let product = sequence.iter()
.fold(1, |prod, &byte| prod * (byte - b'0') as u64);
(I apologise for giving you such confusing code on IRC.)
As an alternative to fold, you can use map and MultiplicativeIterator::product. I find that the two steps help make it clearer what is happening.
#![feature(core)]
use std::iter::MultiplicativeIterator;
fn main() {
let number = b"123456";
for sequence in number.windows(6) {
let product = sequence.iter().map(|v| (v - b'0') as u64).product();
println!("product of {:?} is {}", sequence, product);
}
}
You could even choose to split up the resizing from u8 to u64:
sequence.iter().map(|v| v - b'0').map(|v| v as u64).product();
Nowadays, an alternative is product + to_digit: (itertools was used to print the contents of the iterator)
use {itertools::Itertools, std::char};
fn main() {
let number = b"123456";
let sequence = number
.iter()
.map(|&c| u64::from(char::from(c).to_digit(10).expect("not a digit")));
let product: u64 = sequence.clone().product();
println!("product of {:?} is {}", sequence.format(", "), product);
}
(playground)