I'd like to compare the assembly of different implementations of functions but I don't want to implement certain functions, just declare them.
In Rust, forward declarations are typically unnecessary, because the compiler does not need to see a function's declaration before its uses (unlike in C). However, is it possible to do something equivalent to a forward declaration?
If you mark your function as #[inline(never)], calls to it will be emitted as actual call instructions, which prevents further optimization across the call.
The main limitation is that your function must not be empty after optimizations, so it must have some side effect (thanks to @hellow for suggesting compiler_fence instead of println!).
For example, this code (godbolt):
pub fn test_loop(num: i32) {
    for _i in 0..num {
        dummy();
    }
}

#[inline(never)]
pub extern fn dummy() {
    use std::sync::atomic::*;
    compiler_fence(Ordering::Release);
}
will produce the following assembly (with -O), which I think is what you need:
example::test_loop:
push r14
push rbx
push rax
test edi, edi
jle .LBB0_3
mov ebx, edi
mov r14, qword ptr [rip + example::dummy@GOTPCREL]
.LBB0_2:
call r14
add ebx, -1
jne .LBB0_2
.LBB0_3:
add rsp, 8
pop rbx
pop r14
ret
plus the code for dummy(), which is effectively empty:
example::dummy:
ret
It's not possible to forward-declare functions. There is only a single declaration for any given entity in Rust.
However, you can use the unimplemented!() and todo!() macros to quickly fill in the bodies of functions you don't want to implement (yet) for some reason. Both are basically aliases for panic!() with specific error messages.
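For example, a minimal sketch (the function name and signature here are made up for illustration):
pub fn encode(_input: &[u8]) -> Vec<u8> {
    // Type-checks and compiles, but panics with "not yet implemented" if called.
    todo!()
}
Note that because the body diverges, the optimizer may treat calls to such a stub differently from calls to a real implementation.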
Declare a trait and put your function's signature in it, e.g.:
pub trait Summary {
    fn summarize(&self) -> String;
}
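A sketch of how this serves the original goal (the function name call_summarize is made up for illustration): a function that takes a trait object compiles to an indirect call, so no implementation of the trait is needed to inspect the caller's assembly.
pub fn call_summarize(s: &dyn Summary) -> String {
    // Indirect call through the vtable; no impl of Summary is required
    // for this function to compile in a library crate.
    s.summarize()
}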
I need to convert the first 8 bytes of a String in Rust to a u64, big endian. This code almost works:
fn main() {
    let s = String::from("01234567");
    let mut buf = [0u8; 8];
    buf.copy_from_slice(s.as_bytes());
    let num = u64::from_be_bytes(buf);
    println!("{:X}", num);
}
There are multiple problems with this code. First, it only works if the string is exactly 8 bytes long. .copy_from_slice() requires that both source and destination have the same length. This is easy to deal with if the String is too long because I can just grab a slice of the right length, but if the String is short it won't work.
Another problem is that this code is part of a function which is very performance sensitive. It runs in a tight loop over a large data set.
In C, I would just zero the buf, memcpy over the right number of bytes, and do a cast to an unsigned long.
Is there some way to do this in Rust which runs just as fast?
You can just modify your existing code to take the length into account when copying:
let len = 8.min(s.len());
buf[..len].copy_from_slice(&s.as_bytes()[..len]);
If the string is short this will copy the bytes into what will become the most significant bits of the u64, of course.
As to performance: in a simple test main() like the one in the question, the conversion is completely optimized out to a constant integer. So, to see the real code, we need an explicit function or loop:
pub fn convert(s: &str) -> u64 {
    let mut buf = [0u8; 8];
    let len = 8.min(s.len());
    buf[..len].copy_from_slice(&s.as_bytes()[..len]);
    u64::from_be_bytes(buf)
}
This (on the Rust Playground) generates the assembly:
playground::convert:
pushq %rax
movq %rdi, %rax
movq $0, (%rsp)
cmpq $8, %rsi
movl $8, %edx
cmovbq %rsi, %rdx
movq %rsp, %rdi
movq %rax, %rsi
callq *memcpy@GOTPCREL(%rip)
movq (%rsp), %rax
bswapq %rax
popq %rcx
retq
I feel a little skeptical that that memcpy call is actually a good idea compared to just issuing instructions to copy the bytes, but I'm no expert on instruction-level performance and presumably it'll at least equal your C code explicitly calling memcpy(). What we do see is that there are no branches in the compiled code, only a conditional move presumably to handle the 8 vs. len() choice — and no bounds-check panic.
(And the generated assembly will of course be different — hopefully for the better — when this function or code snippet is inlined into a larger loop.)
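For completeness, a small usage sketch of the convert function above; the expected values follow from the ASCII codes of the digit characters:
fn main() {
    // "01234567" is exactly 8 bytes, so every byte of the u64 is filled.
    assert_eq!(convert("01234567"), 0x3031_3233_3435_3637);
    // A short string fills only the most significant bytes; the rest stay zero.
    assert_eq!(convert("01"), 0x3031_0000_0000_0000);
    println!("{:X}", convert("01234567"));
}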
I have the following function using inline assembly, targeting mipsel-unknown-linux-gnu:
#![feature(asm)]
#[no_mangle]
pub unsafe extern "C" fn f(ptr: u32) {
    let value: i8;
    asm!(
        "lb $v0, ($a0)",
        in("$4") ptr,
        out("$2") value,
    );
    asm!(
        "sb $v0, ($a0)",
        in("$4") ptr,
        in("$2") value,
    );
}
I expected this to compile into the following:
lb $v0, ($a0)
sb $v0, ($a0)
jr $ra
nop
Note: In this example, the compiler might reorder the store instruction to after the jump to fill the delay slot, but in my actual use case I return via an asm block, so this is not a concern. Given that, I expected exactly the assembly above.
Instead, what I got was:
00000000 <f>:
0: 80820000 lb v0,0(a0)
4: 304200ff andi v0,v0,0xff
8: a0820000 sb v0,0(a0)
c: 03e00008 jr ra
10: 00000000 nop
The compiler seems not to have trusted that the output is an i8, and inserted an andi $v0, $v0, 0xff instruction there.
I need to produce the assembly I specified above exactly, so I'd like to get rid of the andi instruction, while keeping the type of value i8.
My use case for this is that I want to produce an exact assembly output from this function, while being able to later fork it and add Rust code that interacts with the existing assembly to extend the function. For this I'd like value's type to be properly described as an i8 on the Rust side.
Edit
Looking at the LLVM IR generated by rustc, the andi instruction seems to have been added by rustc, not LLVM.
; Function Attrs: nonlazybind uwtable
define void @f(i32 %ptr) unnamed_addr #0 {
start:
  %0 = tail call i32 asm sideeffect alignstack "lbu $$v0, ($$a0)", "=&{$2},{$4},~{memory}"(i32 %ptr) #1, !srcloc !2
  %1 = and i32 %0, 255                 ; <--------- Over here
  tail call void asm sideeffect alignstack "sb $$v0, ($$a0)", "{$4},{$2},~{memory}"(i32 %ptr, i32 %1) #1, !srcloc !3
  ret void
}
There is also no mention of an i8, so I'm not quite sure what rustc is doing here.
I am working on a project where I am doing a lot of index-based calculation. I have a few lines like:
let mut current_x: usize = (start.x as isize + i as isize * delta_x) as usize;
start.x and i are usizes and delta_x is of type isize. Most of my data is unsigned, so storing it as signed would not make much sense. On the other hand, when indexing into an array that I access a lot, I have to convert everything back to usize, as seen above.
Is casting between integers expensive? Does it have an impact on runtime performance at all?
Are there other ways to handle index arithmetic more easily or efficiently?
It depends
It's basically impossible to answer your question in isolation. These types of low-level things can be aggressively combined with operations that have to happen anyway, so any amount of inlining can change the behavior. Additionally, it strongly depends on your processor; changing to a 64-bit number on an 8-bit microcontroller is probably pretty expensive!
My general advice is to not worry. Keep your types consistent, get the right answers, then profile your code and fix the issues you find.
Pragmatically, what are you going to do instead?
That said, here's some concrete stuff for x86-64 and Rust 1.18.0.
Same size, changing sign
Basically no impact. If these were inlined, then you probably would never even see any assembly.
#[inline(never)]
pub fn signed_to_unsigned(i: isize) -> usize {
    i as usize
}

#[inline(never)]
pub fn unsigned_to_signed(i: usize) -> isize {
    i as isize
}
Each generates the assembly
movq %rdi, %rax
retq
Extending a value
These have to sign- or zero-extend the value, so some kind of minimal operation has to occur to fill those extra bits:
#[inline(never)]
pub fn u8_to_u64(i: u8) -> u64 {
    i as u64
}

#[inline(never)]
pub fn i8_to_i64(i: i8) -> i64 {
    i as i64
}
Generates the assembly
movzbl %dil, %eax
retq
movsbq %dil, %rax
retq
Truncating a value
Truncating is again just another move, basically no impact.
#[inline(never)]
pub fn u64_to_u8(i: u64) -> u8 {
    i as u8
}

#[inline(never)]
pub fn i64_to_i8(i: i64) -> i8 {
    i as i8
}
Generates the assembly
movl %edi, %eax
retq
movl %edi, %eax
retq
All these operations boil down to a single instruction on x86-64. Then you get into complications around "how long does an operation take" and that's even harder.
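Applied to the expression from the question: isize and usize have the same width, so the casts themselves compile to nothing, and only the multiply and add remain. A minimal sketch (names are illustrative, not from the original code):
#[inline(never)]
pub fn index_at(start_x: usize, i: usize, delta_x: isize) -> usize {
    // The same-width casts are free; the real work is the multiply and add.
    (start_x as isize + i as isize * delta_x) as usize
}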
I have a small struct:
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}
I was using pairs of the fields in the form (a,b) (b,c) (c,a). To avoid duplication of the code, I created a utility function which would allow me to iterate over the pairs:
fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
    [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
}
I had to decide if I should return the values as references or copy the i32. Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references. I expected the resulting code should be equivalent since everything would be inlined.
I am generally optimistic about optimizations, so I suspected that the code would be equivalent when using this function as compared to hand written code examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];
        sum += *l + *r;
    }
    sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
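(get_foo is not shown in this excerpt; presumably it is the by-value counterpart of get_foo_ref, roughly like this sketch:)
fn get_foo(&self) -> [(i32, i32); 3] {
    [(self.a, self.b), (self.b, self.c), (self.c, self.a)]
}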
Update:
I've modified slightly code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated C++ code is exactly the same. However, the Rust assembly still differs, and returning references results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
Yields:
example::testing_direct:
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi + 4]
add eax, dword ptr [rdi]
add eax, dword ptr [rdi + 8]
add eax, eax
pop rbp
ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
That is, the compiler realized that:
f.X was added twice
f.X * 2 was equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (integer addition being associative and commutative).
For example, switching your code to f32 (still Copy, but its addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):
example::testing:
push rbp
mov rbp, rsp
movss xmm1, dword ptr [rdi]
movss xmm2, dword ptr [rdi + 4]
movss xmm0, dword ptr [rdi + 8]
movaps xmm3, xmm1
addss xmm3, xmm2
xorps xmm4, xmm4
addss xmm4, xmm3
addss xmm2, xmm0
addss xmm2, xmm4
addss xmm0, xmm1
addss xmm0, xmm2
pop rbp
ret
And there's no trickery any longer.
So it's really not possible to infer much from your example; check with the real type.
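For reference, the f32 variant being discussed would look roughly like this (the struct and function names here are illustrative, not taken from the linked code):
pub struct FooF {
    pub a: f32,
    pub b: f32,
    pub c: f32,
}
pub fn testing_direct_f32(f: FooF) -> f32 {
    let mut sum = 0.0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    // Float addition is not associative, so the compiler cannot legally
    // rewrite this into 2.0 * (a + b + c).
    sum
}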
Rust employs dynamic checks for many things. One such example is bounds checking of array accesses.
Take this code for example,
fn test_dynamic_checking() -> i32 {
    let x = [1, 2, 3, 4];
    x[1]
}
The resulting LLVM IR is:
; Function Attrs: uwtable
define internal i32 @_ZN10dynamic_ck21test_dynamic_checking17hcef32a1e8c339e2aE() unnamed_addr #0 {
entry-block:
  %x = alloca [5 x i32]
  %0 = bitcast [5 x i32]* %x to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast ([5 x i32]* @const7091 to i8*), i64 20, i32 4, i1 false)
  %1 = getelementptr inbounds [5 x i32], [5 x i32]* %x, i32 0, i32 0
  %2 = call i1 @llvm.expect.i1(i1 false, i1 false)
  br i1 %2, label %cond, label %next

next:                                             ; preds = %entry-block
  %3 = getelementptr inbounds i32, i32* %1, i64 1
  %4 = load i32, i32* %3
  ret i32 %4

cond:                                             ; preds = %entry-block
  call void @_ZN4core9panicking18panic_bounds_check17hcc71f10000bd8e6fE({ %str_slice, i32 }* noalias readonly dereferenceable(24) @panic_bounds_check_loc7095, i64 1, i64 5)
  unreachable
}
A branch instruction is inserted to decide whether the index is out of bounds; no such branch exists in the LLVM IR that Clang produces for the equivalent C code.
Here are my questions:
In what situations does Rust implement dynamic checking?
How does Rust implement dynamic checking in different situations?
Is there any way to turn off the dynamic checking?
Conceptually, Rust performs array bound checking on each and every array access. However, the compiler is very good at optimizing away the checks when it can prove that it's safe to do so.
The LLVM intermediate output is misleading because it still undergoes optimization by LLVM's optimizing machinery before the machine assembly is generated. A better way to inspect the output is to generate the final assembly with an invocation such as rustc -O --emit asm --crate-type=lib. The assembly output for your function is just:
push rbp
mov rbp, rsp
mov eax, 2
pop rbp
ret
Not only is there no bounds check in sight, there is no array to begin with: the compiler has optimized the entire function down to returning 2i32! To force a bounds check, the function needs to be written so that Rust cannot prove that the check can be elided:
pub fn test_dynamic_checking(ind: usize) -> i32 {
    let x = [1, 2, 3, 4];
    x[ind]
}
This results in a larger assembly, where the bound check is implemented as the following two instructions:
cmp rax, 3 ; compare index with 3
ja .LBB0_2 ; if greater, jump to panic code
That is as efficient as it gets. Turning off the bound check is rarely a good idea because it can easily cause the program to crash. It can be done, but explicitly and only within the confines of an unsafe block or function:
pub unsafe fn test_dynamic_checking(ind: usize) -> i32 {
    let x = [1, 2, 3, 4];
    *x.get_unchecked(ind)
}
The generated assembly shows the error checking to be entirely omitted:
push rbp
mov rbp, rsp
lea rax, [rip + const3141]
mov eax, dword ptr [rax + 4*rdi]
pop rbp
ret
const3141:
.long 1
.long 2
.long 3
.long 4
In what situations does Rust implement dynamic checking?
It's a bit of a cop-out... but Rust, the language, does not implement any dynamic checking. The Rust libraries, however, starting with the core and std libraries, do.
A library needs to use run-time checking whenever not doing so could lead to a memory safety issue. Examples include:
guaranteeing that an index is within bounds, such as when implementing Index
guaranteeing that no other reference to an object exists, such as when implementing RefCell
...
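As a concrete illustration of the second point, here is a minimal sketch of RefCell's run-time borrow check:
use std::cell::RefCell;
fn main() {
    let cell = RefCell::new(5);
    let first = cell.borrow_mut();
    // The borrowing rules are enforced at run time: a second mutable borrow
    // while `first` is alive fails the dynamic check (borrow_mut would panic).
    assert!(cell.try_borrow_mut().is_err());
    drop(first);
}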
In general, the Rust philosophy is to perform as many checks as possible at compile time, but some checks need to be delayed to run time because they depend on dynamic behavior/values.
How does Rust implement dynamic checking in different situations?
As efficiently as possible.
When dynamic checking is required, the Rust code will be crafted to be as efficient as possible. This can be slightly complicated, as it involves trying to ease the work of the optimizer and conforming to patterns that the optimizer recognizes (or fixing it), but we are fortunate that a few developers are obsessed with performance (@bluss, for example).
Not all checks may be elided, of course, and those that remain will typically be just a branch.
Is there any way to turn off the dynamic checking?
Whenever dynamic checking is necessary to guarantee code safety, it is not possible to turn it off in safe code.
In some situations, though, an unsafe block or function may allow you to bypass the check (for example, get_unchecked for indexing).
This is not recommended in general, and should be a last-resort behaviour:
Most of the time, the run-time checks have little to no performance impact; CPU branch prediction is awesome like that.
Even if they have some impact, unless they sit in a very hot loop, it may not be worth optimizing them.
Finally, if a check does NOT get optimized away, it is worth trying to understand why: maybe it's impossible, or maybe the optimizer is not clever enough... In the latter case, reporting the issue is the best way to get someone digging into the optimizer and fixing it (if you cannot, or are not willing to, do it yourself).