I need to convert the first 8 bytes of a String in Rust to a u64, big endian. This code almost works:
fn main() {
    let s = String::from("01234567");
    let mut buf = [0u8; 8];
    buf.copy_from_slice(s.as_bytes());
    let num = u64::from_be_bytes(buf);
    println!("{:X}", num);
}
There are multiple problems with this code. First, it only works if the string is exactly 8 bytes long. .copy_from_slice() requires that both source and destination have the same length. This is easy to deal with if the String is too long because I can just grab a slice of the right length, but if the String is short it won't work.
Another problem is that this code is part of a function which is very performance sensitive. It runs in a tight loop over a large data set.
In C, I would just zero the buf, memcpy over the right number of bytes, and do a cast to an unsigned long.
Is there some way to do this in Rust which runs just as fast?
You can just modify your existing code to take the length into account when copying:
let len = 8.min(s.len());
buf[..len].copy_from_slice(&s.as_bytes()[..len]);
If the string is short this will copy the bytes into what will become the most significant bits of the u64, of course.
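For example, with a hypothetical 2-byte input, the remaining bytes stay zero and end up as the low bits of the big-endian value:

```rust
fn main() {
    let s = String::from("AB"); // 2 bytes: 0x41, 0x42
    let mut buf = [0u8; 8];
    let len = 8.min(s.len());
    buf[..len].copy_from_slice(&s.as_bytes()[..len]);
    // buf is now [0x41, 0x42, 0, 0, 0, 0, 0, 0]
    let num = u64::from_be_bytes(buf);
    println!("{:X}", num); // 4142000000000000
}
```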
As to performance: in this simple test main(), the conversion is completely optimized away into a constant integer. So, to get meaningful assembly, we need an explicit function or loop:
pub fn convert(s: &str) -> u64 {
    let mut buf = [0u8; 8];
    let len = 8.min(s.len());
    buf[..len].copy_from_slice(&s.as_bytes()[..len]);
    u64::from_be_bytes(buf)
}
This (on the Rust Playground) generates the assembly:
playground::convert:
pushq %rax
movq %rdi, %rax
movq $0, (%rsp)
cmpq $8, %rsi
movl $8, %edx
cmovbq %rsi, %rdx
movq %rsp, %rdi
movq %rax, %rsi
callq *memcpy@GOTPCREL(%rip)
movq (%rsp), %rax
bswapq %rax
popq %rcx
retq
I feel a little skeptical that that memcpy call is actually a good idea compared to just issuing instructions to copy the bytes, but I'm no expert on instruction-level performance, and presumably it will at least equal your C code explicitly calling memcpy(). What we do see is that there are no branches in the compiled code, only a conditional move (presumably handling the 8 vs. len() choice), and no bounds-check panic.
(And the generated assembly will of course be different — hopefully for the better — when this function or code snippet is inlined into a larger loop.)
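For reference, a quick sanity check of convert on exact, long, and short inputs (the values are arbitrary):

```rust
pub fn convert(s: &str) -> u64 {
    let mut buf = [0u8; 8];
    let len = 8.min(s.len());
    buf[..len].copy_from_slice(&s.as_bytes()[..len]);
    u64::from_be_bytes(buf)
}

fn main() {
    assert_eq!(convert("01234567"), 0x3031323334353637);   // exactly 8 bytes
    assert_eq!(convert("0123456789"), 0x3031323334353637); // extra bytes ignored
    assert_eq!(convert("01"), 0x3031000000000000);         // short input zero-padded
    println!("all good");
}
```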
I'd like to compare the assembly of different implementations of functions but I don't want to implement certain functions, just declare them.
In Rust, forward declarations are typically unnecessary, because the compiler does not need to see a function's declaration before a call to it (unlike in C). However, is it possible to do something equivalent to a forward declaration?
If you declare your function as #[inline(never)] you will get a function call instruction to prevent further optimizations.
The main limitation is that your function must not be empty after optimizations, so it must have some side effect (thanks to @hellow for suggesting compiler_fence instead of println!).
For example, this code (godbolt):
pub fn test_loop(num: i32) {
    for _i in 0..num {
        dummy();
    }
}

#[inline(never)]
pub extern "C" fn dummy() {
    use std::sync::atomic::*;
    compiler_fence(Ordering::Release);
}
will produce the following assembly (with -O), which I think is what you need:
example::test_loop:
push r14
push rbx
push rax
test edi, edi
jle .LBB0_3
mov ebx, edi
mov r14, qword ptr [rip + example::dummy@GOTPCREL]
.LBB0_2:
call r14
add ebx, -1
jne .LBB0_2
.LBB0_3:
add rsp, 8
pop rbx
pop r14
ret
plus the code for dummy(), which is effectively empty:
example::dummy:
ret
It's not possible to forward-declare functions. There is only a single declaration for any given entity in Rust.
However, you can use the unimplemented!() and todo!() macros to quickly fill in the bodies of functions you don't want to implement (yet) for some reason. Both are basically aliases for panic!() with specific error messages.
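For example (fancy_algorithm is a hypothetical name), a body of todo!() type-checks against any return type, so the function can be declared without being implemented:

```rust
fn fancy_algorithm(_input: &[u8]) -> u64 {
    todo!() // compiles against any signature; panics with "not yet implemented" if called
}

fn main() {
    // The function type-checks and links; actually calling it panics.
    let result = std::panic::catch_unwind(|| fancy_algorithm(b"abc"));
    assert!(result.is_err());
    println!("fancy_algorithm is declared but not implemented");
}
```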
Declare a trait and put your function's signature in it.
eg.
pub trait Summary {
    fn summarize(&self) -> String;
}
A very common operation in implementing algorithms is the cyclic rotate: given, say, 3 variables a, b, c change them to the effect of
t ⇽ c
c ⇽ b
b ⇽ a
a ⇽ t
Given that everything is bitwise swappable, cyclic rotation should be an area where Rust excels more than any other language I know of.
For comparison, in C++ the most efficient generic way to rotate N elements is to perform n+1 std::move operations, which in turn (for a typical move-constructor implementation) comes to roughly 3·(n+1)·sizeof(T) word assignments (this can be improved for PODs by specializing rotate, but that requires extra work).
In Rust, the language makes it possible to implement rotate with only (n+1)·size_of::<T>() word assignments. To my surprise, I could not find standard library support for rotation (there is no rotate in std::mem). It would probably look like this:
pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    use std::mem::MaybeUninit;
    use std::ptr;

    unsafe {
        // MaybeUninit replaces the deprecated mem::uninitialized()/mem::forget() pattern
        let mut t = MaybeUninit::<T>::uninit();
        ptr::copy_nonoverlapping(z as *const T, t.as_mut_ptr(), 1); // t <- z
        ptr::copy_nonoverlapping(y as *const T, z, 1);              // z <- y
        ptr::copy_nonoverlapping(x as *const T, y, 1);              // y <- x
        ptr::copy_nonoverlapping(t.as_ptr(), x, 1);                 // x <- t
        // t is never dropped, so the duplicated value is not freed twice
    }
}
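A runnable check of the manual-copy approach (written here with MaybeUninit, since mem::uninitialized is deprecated); note it follows the t ⇽ c, c ⇽ b, b ⇽ a, a ⇽ t direction given above:

```rust
use std::mem::MaybeUninit;
use std::ptr;

pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    unsafe {
        let mut t = MaybeUninit::<T>::uninit();
        ptr::copy_nonoverlapping(z as *const T, t.as_mut_ptr(), 1); // t <- z
        ptr::copy_nonoverlapping(y as *const T, z, 1);              // z <- y
        ptr::copy_nonoverlapping(x as *const T, y, 1);              // y <- x
        ptr::copy_nonoverlapping(t.as_ptr(), x, 1);                 // x <- t
        // t is a MaybeUninit, so the duplicated value is never dropped twice
    }
}

fn main() {
    let (mut a, mut b, mut c) = (1, 2, 3);
    rotate(&mut a, &mut b, &mut c);
    println!("{}, {}, {}", a, b, c); // 3, 1, 2
}
```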
For clarification on why rotation cannot be implemented efficiently in C++, consider:
struct String {
    char *data1;
    char *data2;

    String(String &&other) : data1(other.data1), data2(other.data2) {
        other.data1 = other.data2 = nullptr;
    }

    String &operator=(String &&other) {
        std::swap(data1, other.data1);
        std::swap(data2, other.data2);
        return *this;
    }

    ~String() {
        delete [] data1;
        delete [] data2;
    }
};
Here, an operation like s2 = std::move(s1); takes 3 pointer assignments for each member field, totaling 6 assignments, since a pointer swap requires 3 assignments (one into a temporary, one out of the temporary, and one across the operands).
Is there a standard way of cyclically rotating mutable variables in Rust?
No.
I'd just swap the variables twice, no need for unsafe:
use std::mem;

pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    mem::swap(x, y);
    mem::swap(y, z);
}

fn main() {
    let mut a = 1;
    let mut b = 2;
    let mut c = 3;
    println!("{}, {}, {}", a, b, c); // 1, 2, 3

    rotate(&mut a, &mut b, &mut c);
    println!("{}, {}, {}", a, b, c); // 2, 3, 1
}
This produces 7 movl instructions (Rust 1.35.0, release mode, x86_64 Linux):
playground::rotate:
movl (%rdi), %eax
movl (%rsi), %ecx
movl %ecx, (%rdi)
movl %eax, (%rsi)
movl (%rdx), %ecx
movl %ecx, (%rsi)
movl %eax, (%rdx)
retq
As opposed to the original 6 movl instructions:
playground::rotate_original:
movl (%rdx), %eax
movl (%rsi), %ecx
movl %ecx, (%rdx)
movl (%rdi), %ecx
movl %ecx, (%rsi)
movl %eax, (%rdi)
retq
I'm OK giving up that single instruction for purely safe code that is also easier to reason about.
In "real" code, I'd make use of the fact that all the variables are the same type and that slice::rotate_left and slice::rotate_right exist:
fn main() {
    let mut vals = [1, 2, 3];
    let [a, b, c] = &vals;
    println!("{}, {}, {}", a, b, c); // 1, 2, 3

    vals.rotate_left(1);
    let [a, b, c] = &vals;
    println!("{}, {}, {}", a, b, c); // 2, 3, 1
}
In my example below does cons.push(...) ever copy the self parameter?
Or is rustc intelligent enough to realize that the values coming from lines #a and #b can always use the same stack space and no copying needs to occur (except for the obvious i32 copies)?
In other words, does a call to Cons.push(self, ...) always create a copy of self as ownership is being moved? Or does the self struct always stay in place on the stack?
References to documentation would be appreciated.
#[derive(Debug)]
struct Cons<T, U>(T, U);

impl<T, U> Cons<T, U> {
    fn push<V>(self, value: V) -> Cons<Self, V> {
        Cons(self, value)
    }
}

fn main() {
    let cons = Cons(1, 2); // #a
    let cons = cons.push(3); // #b
    println!("{:?}", cons); // #c
}
The implication in my example above is whether the push(...) function grows more expensive to call each time we add a line like #b: at the rate of O(n^2) if self is copied each time, or at the rate of O(n) if self stays in place.
I tried implementing the Drop trait and noticed that both #a and #b were dropped after #c. To me this seems to indicate that self stays in place in this example, but I'm not 100% sure.
In general, trust in the compiler! Rust + LLVM is a very powerful combination that often produces surprisingly efficient code. And it will improve even more in time.
In other words, does a call to Cons.push(self, ...) always create a copy of self as ownership is being moved? Or does the self struct always stay in place on the stack?
self cannot stay in place because the new value returned by the push method has type Cons<Self, V>, which is essentially a tuple of Self and V. Although tuples don't have any memory layout guarantees, I strongly believe they can't have their elements scattered arbitrarily in memory. Thus, self and value must both be moved into the new structure.
The above paragraph assumes that self is actually placed on the stack before calling push. In practice, the compiler has enough information to reserve space for the final structure up front; especially with function inlining, this becomes a very likely optimization.
The implication in my example above is whether the push(...) function grows more expensive to call each time we add a line like #b: at the rate of O(n^2) if self is copied each time, or at the rate of O(n) if self stays in place.
Consider two functions (playground):
pub fn push_int(cons: Cons<i32, i32>, x: i32) -> Cons<Cons<i32, i32>, i32> {
    cons.push(x)
}

pub fn push_int_again(
    cons: Cons<Cons<i32, i32>, i32>,
    x: i32,
) -> Cons<Cons<Cons<i32, i32>, i32>, i32> {
    cons.push(x)
}
push_int adds a third element to a Cons and push_int_again adds a fourth element.
push_int compiles to the following assembly in Release mode:
movq %rdi, %rax
movl %esi, (%rdi)
movl %edx, 4(%rdi)
movl %ecx, 8(%rdi)
retq
And push_int_again compiles to:
movq %rdi, %rax
movl 8(%rsi), %ecx
movl %ecx, 8(%rdi)
movq (%rsi), %rcx
movq %rcx, (%rdi)
movl %edx, 12(%rdi)
retq
You don't need to understand assembly to see that pushing the fourth element requires more instructions than pushing the third element.
Note that this observation was made for these functions in isolation. Calls like cons.push(x).push(y).push(...) are inlined and the assembly grows linearly with one instruction per push.
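To see the chained behavior end-to-end, here is the structure from the question extended by a few pushes (the values are arbitrary):

```rust
#[derive(Debug)]
struct Cons<T, U>(T, U);

impl<T, U> Cons<T, U> {
    fn push<V>(self, value: V) -> Cons<Self, V> {
        Cons(self, value)
    }
}

fn main() {
    // Each push nests the previous Cons as the first element of a new one
    let chain = Cons(1, 2).push(3).push(4).push(5);
    println!("{:?}", chain); // Cons(Cons(Cons(Cons(1, 2), 3), 4), 5)
}
```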
The ownership of the cons value created at #a (of type Cons<i32, i32>) is transferred into push(). Ownership of the result, of type Cons<Cons<i32, i32>, i32>, is then transferred to the shadowed variable cons at #b.
If Cons implemented the Copy and Clone traits, it would be copied. Otherwise there is no copy, and you cannot use the original variables after ownership has been moved.
Move semantics:
let cons = Cons(1, 2);  // Cons(1, 2) is a resource in memory, owned by `cons`
let cons2 = cons;       // Cons(1, 2) is now owned by `cons2`; access through `cons` must be prevented
println!("{:?}", cons); // error: `cons` has been moved
I am working on a project where I am doing a lot of index-based calculation. I have a few lines like:
let mut current_x: usize = (start.x as isize + i as isize * delta_x) as usize;
start.x and i are usizes and delta_x is of type isize. Most of my data is unsigned, therefore storing it signed would not make much sense. On the other hand, when I manipulate an array I am accessing a lot I have to convert everything back to usize as seen above.
Is casting between integers expensive? Does it have an impact on runtime performance at all?
Are there other ways to handle index arithmetics easier / more efficiently?
It depends
It's basically impossible to answer your question in isolation. These types of low-level things can be aggressively combined with operations that have to happen anyway, so any amount of inlining can change the behavior. Additionally, it strongly depends on your processor; changing to a 64-bit number on an 8-bit microcontroller is probably pretty expensive!
My general advice is to not worry. Keep your types consistent, get the right answers, then profile your code and fix the issues you find.
Pragmatically, what are you going to do instead?
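If the mixed-sign arithmetic itself is bothersome, one option (assuming Rust 1.66+, where usize::checked_add_signed is stable, and a hypothetical index helper) is to keep the offset signed and convert exactly once, with overflow handled explicitly:

```rust
// Hypothetical helper: start_x and i are unsigned, delta_x is signed.
fn index(start_x: usize, i: usize, delta_x: isize) -> Option<usize> {
    // Overflow or a negative result yields None instead of silently wrapping
    start_x.checked_add_signed((i as isize).checked_mul(delta_x)?)
}

fn main() {
    assert_eq!(index(10, 3, -2), Some(4)); // 10 + 3 * (-2)
    assert_eq!(index(1, 3, -2), None);     // would be negative
    println!("ok");
}
```

In a release build the checked variants cost a compare and branch each, so this trades a little speed for explicit handling of the edge cases; the plain `as` casts remain free.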
That said, here's some concrete stuff for x86-64 and Rust 1.18.0.
Same size, changing sign
Basically no impact. If these were inlined, then you probably would never even see any assembly.
#[inline(never)]
pub fn signed_to_unsigned(i: isize) -> usize {
    i as usize
}

#[inline(never)]
pub fn unsigned_to_signed(i: usize) -> isize {
    i as isize
}
Each generates the assembly
movq %rdi, %rax
retq
Extending a value
These have to sign- or zero-extend the value, so some kind of minimal operation has to occur to fill those extra bits:
#[inline(never)]
pub fn u8_to_u64(i: u8) -> u64 {
    i as u64
}

#[inline(never)]
pub fn i8_to_i64(i: i8) -> i64 {
    i as i64
}
These generate, respectively, the assembly:
movzbl %dil, %eax
retq
movsbq %dil, %rax
retq
Truncating a value
Truncating is again just another move, basically no impact.
#[inline(never)]
pub fn u64_to_u8(i: u64) -> u8 {
    i as u8
}

#[inline(never)]
pub fn i64_to_i8(i: i64) -> i8 {
    i as i8
}
Both generate the same assembly:
movl %edi, %eax
retq
movl %edi, %eax
retq
All these operations boil down to a single instruction on x86-64. Then you get into complications around "how long does an operation take" and that's even harder.
I have a small struct:
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}
I was using pairs of the fields in the form (a,b) (b,c) (c,a). To avoid duplication of the code, I created a utility function which would allow me to iterate over the pairs:
fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
    [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
}
I had to decide if I should return the values as references or copy the i32. Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references. I expected the resulting code should be equivalent since everything would be inlined.
I am generally optimistic about optimizations, so I suspected that the code would be equivalent when using this function as compared to hand written code examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];
        sum += *l + *r;
    }
    sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
Update:
I've modified slightly code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated C++ code is exactly the same in all cases. However, the Rust assembly still differs, and returning by reference results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
Yields:
example::testing_direct:
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi + 4]
add eax, dword ptr [rdi]
add eax, dword ptr [rdi + 8]
add eax, eax
pop rbp
ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
That is, the compiler realized that:
f.X was added twice
f.X * 2 was equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (integer addition being associative and commutative).
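The integer identity the optimizer exploits can be checked directly (with arbitrary values):

```rust
fn main() {
    let (a, b, c) = (5i32, 7, 11);
    // (a+b) + (b+c) + (c+a) counts every field exactly twice,
    // so the whole sum collapses to 2 * (a + b + c)
    assert_eq!((a + b) + (b + c) + (c + a), 2 * (a + b + c));
    println!("identity holds");
}
```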
For example, switching your code to f32 (still Copy, but addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):
example::testing:
push rbp
mov rbp, rsp
movss xmm1, dword ptr [rdi]
movss xmm2, dword ptr [rdi + 4]
movss xmm0, dword ptr [rdi + 8]
movaps xmm3, xmm1
addss xmm3, xmm2
xorps xmm4, xmm4
addss xmm4, xmm3
addss xmm2, xmm0
addss xmm2, xmm4
addss xmm0, xmm1
addss xmm0, xmm2
pop rbp
ret
And there's no trickery any longer.
So it's really not possible to infer much from your example, check with the real type.