Rust employs dynamic checking in a number of places; one example is bounds checking of array accesses. Take this code, for example:
fn test_dynamic_checking() -> i32 {
let x = [1, 2, 3, 4, 5];
x[1]
}
The resulting LLVM IR is:
; Function Attrs: uwtable
define internal i32 @_ZN10dynamic_ck21test_dynamic_checking17hcef32a1e8c339e2aE() unnamed_addr #0 {
entry-block:
%x = alloca [5 x i32]
%0 = bitcast [5 x i32]* %x to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast ([5 x i32]* @const7091 to i8*), i64 20, i32 4, i1 false)
%1 = getelementptr inbounds [5 x i32], [5 x i32]* %x, i32 0, i32 0
%2 = call i1 @llvm.expect.i1(i1 false, i1 false)
br i1 %2, label %cond, label %next
next: ; preds = %entry-block
%3 = getelementptr inbounds i32, i32* %1, i64 1
%4 = load i32, i32* %3
ret i32 %4
cond: ; preds = %entry-block
call void @_ZN4core9panicking18panic_bounds_check17hcc71f10000bd8e6fE({ %str_slice, i32 }* noalias readonly dereferenceable(24) @panic_bounds_check_loc7095, i64 1, i64 5)
unreachable
}
A branch instruction is inserted to test whether the index is out of bounds; no such branch appears in clang-compiled LLVM IR.
Here are my questions:
In what situations does Rust implement dynamic checking?
How does Rust implement dynamic checking in different situations?
Is there any way to turn off the dynamic checking?
Conceptually, Rust performs array bound checking on each and every array access. However, the compiler is very good at optimizing away the checks when it can prove that it's safe to do so.
The LLVM intermediate output is misleading because it still undergoes optimization by LLVM's optimizing machinery before the machine assembly is generated. A better way to inspect the output is to generate the final assembly, using an invocation such as rustc -O --emit asm --crate-type=lib. The assembly output for your function is just:
push rbp
mov rbp, rsp
mov eax, 2
pop rbp
ret
Not only is there no bound checking in sight, there is no array to begin with: the compiler has optimized the entire function into return 2i32! To force bound checking, the function needs to be written so that Rust cannot prove the check unnecessary:
pub fn test_dynamic_checking(ind: usize) -> i32 {
let x = [1, 2, 3, 4];
x[ind]
}
This results in a larger assembly, where the bound check is implemented as the following two instructions:
cmp rax, 3 ; compare index with 3
ja .LBB0_2 ; if greater, jump to panic code
That is as efficient as it gets. Turning off the bound check is rarely a good idea because it can easily cause the program to crash. It can be done, but explicitly and only within the confines of an unsafe block or function:
pub unsafe fn test_dynamic_checking(ind: usize) -> i32 {
let x = [1, 2, 3, 4];
*x.get_unchecked(ind)
}
The generated assembly shows the error checking to be entirely omitted:
push rbp
mov rbp, rsp
lea rax, [rip + const3141]
mov eax, dword ptr [rax + 4*rdi]
pop rbp
ret
const3141:
.long 1
.long 2
.long 3
.long 4
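As an aside, if the goal is merely to avoid the panic rather than the check itself, the safe get method returns an Option instead; a minimal sketch:

pub fn test_checked(ind: usize) -> Option<i32> {
    let x = [1, 2, 3, 4];
    x.get(ind).copied() // None when ind is out of bounds; no panic, no unsafe
}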
In what situations does Rust implement dynamic checking?
It's a bit of a cop out... but Rust, the language, does not implement any dynamic checking. The Rust libraries, however, starting with the core and std libraries do.
A library needs to use run-time checking whenever not doing so could lead to a memory safety issue. Examples include:
guaranteeing that an index is within bounds, such as when implementing Index
guaranteeing that no other reference to an object exists, such as when implementing RefCell (see the sketch after this list)
...
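For instance, RefCell moves the aliasing check to run-time; a minimal sketch of the behaviour guaranteed by the standard library:

use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(5);
    let _guard = cell.borrow_mut(); // dynamic check passes: no other borrow exists
    // cell.borrow_mut();           // would panic at run-time: already mutably borrowed
    assert!(cell.try_borrow_mut().is_err()); // the non-panicking variant reports the conflict
}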
In general, the Rust philosophy is to perform as many checks as possible at compile-time, but some checks need to be delayed to run-time because they depend on dynamic behavior/values.
How does Rust implement dynamic checking in different situations?
As efficiently as possible.
When dynamic checking is required, the Rust code will be crafted to be as efficient as possible. This can be slightly complicated, as it involves trying to ease the work of the optimizer and conforming to patterns that the optimizer recognizes (or fixing it), but we are fortunate that a few developers are obsessed with performance (@bluss, for example).
Not all checks may be elided, of course, and those that remain will typically be just a branch.
Is there any way to turn off the dynamic checking?
Whenever dynamic checking is necessary to guarantee code safety, it is not possible to turn it off in safe code.
In some situations, though, an unsafe block or function may allow you to bypass the check (for example, get_unchecked for indexing).
This is not recommended in general, and should be a last-resort behaviour:
Most of the time, the run-time checks have little to no performance impact; branch prediction is awesome like that.
Even if they do have some impact, unless they sit in a very hot loop, it may not be worth optimizing them.
Finally, if a check does NOT get optimized away, it is worth trying to understand why: maybe it's impossible, or maybe the optimizer is not clever enough... in the latter case, reporting the issue is the best way to have someone dig into the optimizer and fix it (if you cannot or are not willing to do it yourself).
Intro:
I'm curious about the performance difference (in both CPU and memory usage) of storing small numbers as bitpacked unsigned integers versus vectors of bytes.
Example
I'll use the example of storing RGBA values. They're 4 bytes, so it is very tempting to store them as a u32.
However, it would be more readable to store them as a vector of type u8.
As a more detailed example, say I want to store and retrieve the color rgba(255,0,0,255)
This is how I would go about doing the two methods:
// Bitpacked:
let i: u32 = 4278190335;
//binary is 11111111 00000000 00000000 11111111
//In reality I would most likely do something more similar to:
let i: u32 = (255 << 24) + 255; // parentheses are needed: + binds more tightly than <<
// Vector:
let v: Vec<u8> = vec![255, 0, 0, 255];
Then the two red values could be queried with
i >> 24
//or
&v[0]
//both expressions evaluate to 255 (I think; I'm really new to Rust <3)
Question 1
As far as I know, the values of v must be stored on the heap and so there are the performance costs that are associated with that. Are these costs significant enough to make bit packing worth it?
Question 2
Then there are the two expressions i >> 24 and &v[0]. I don't know how fast Rust is at bit shifting versus getting values off the heap. I'd test it, but I won't have access to a machine with Rust installed for a while. Are there any immediate insights someone could give on the drawbacks of these two operations?
Question 3
Finally, is the difference in memory usage as simple as just storing 32 bits on the stack for the u32 versus storing 64 bits on the stack for the pointer v as well as 32 bits on the heap for the values of v?
Sorry if this question is a bit confusing
Using a Vec will be more expensive; as you mentioned, it will need to perform heap allocations, and access will be bounds-checked as well.
That said, if you use an array [u8; 4] instead, the performance compared with a bitpacked u32 representation should be almost identical.
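To make the footprint concrete, here is a quick sketch (the sizes assume a typical 64-bit target):

fn main() {
    // u32 and [u8; 4] are both 4 bytes, stored inline (stack, register, or inside a struct).
    println!("{}", std::mem::size_of::<u32>());     // 4
    println!("{}", std::mem::size_of::<[u8; 4]>()); // 4
    // Vec<u8> is three words on the stack (pointer, length, capacity)...
    println!("{}", std::mem::size_of::<Vec<u8>>()); // 24
    // ...plus the 4 bytes of element data on the heap.
}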
In fact, consider the following simple example:
pub fn get_red_bitpacked(i: u32) -> u8 {
(i >> 24) as u8
}
pub fn get_red_array(v: [u8; 4]) -> u8 {
v[3]
}
pub fn test_bits(colour: u8) -> u8 {
let colour = colour as u32;
let i = (colour << 24) + colour;
get_red_bitpacked(i)
}
pub fn test_arr(colour: u8) -> u8 {
let v = [colour, 0, 0, colour];
get_red_array(v)
}
I took a look at Compiler Explorer, and the compiler decided that get_red_bitpacked and get_red_array were completely identical: so much so that it didn't even bother generating code for the former. The two "test" functions obviously optimised to the exact same assembly as well.
example::get_red_array:
mov eax, edi
shr eax, 24
ret
example::test_bits:
mov eax, edi
ret
example::test_arr:
mov eax, edi
ret
Obviously this example was seen through by the compiler: for a proper comparison you should benchmark with actual code. That said, I feel fairly safe in saying that with Rust the performance of u32 versus [u8; 4] for these kinds of operations should be identical in general.
tl;dr use a struct:
struct Color {
r: u8,
g: u8,
b: u8,
a: u8,
}
Maybe use repr(packed) as well.
It gives you the best of all worlds, and you can give the channels their names.
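If you occasionally need the packed form anyway, the conversions are cheap; a minimal sketch extending the Color struct above (the helper names are made up, not from any library):

impl Color {
    // Pack as 0xRRGGBBAA; from_be_bytes/to_be_bytes compile down to simple moves.
    fn to_packed(&self) -> u32 {
        u32::from_be_bytes([self.r, self.g, self.b, self.a])
    }

    fn from_packed(i: u32) -> Color {
        let [r, g, b, a] = i.to_be_bytes();
        Color { r, g, b, a }
    }
}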
Are these costs significant enough to make bit packing worth it?
Heap allocation has a huge cost.
Are there any immediate insights someone could give on the drawbacks of these two operations?
Both are noise compared to allocating memory.
I'd like to compare the assembly of different implementations of functions, but I don't want to implement certain functions, just declare them.
In Rust, forward declarations are typically unnecessary, because the compiler does not require a function to be declared before it is used (unlike in C). However, is it possible to do something equivalent to a forward declaration?
If you declare your function as #[inline(never)] you will get a function call instruction to prevent further optimizations.
The main limitation is that your function must not be empty after optimizations, so it must have some side effect (thanks to @hellow, who suggested using compiler_fence instead of println!).
For example, this code (godbolt):
pub fn test_loop(num: i32) {
for _i in 0..num {
dummy();
}
}
#[inline(never)]
pub extern fn dummy() {
use std::sync::atomic::*;
compiler_fence(Ordering::Release);
}
Will produce the following assembly (with -O), which I think is what you need:
example::test_loop:
push r14
push rbx
push rax
test edi, edi
jle .LBB0_3
mov ebx, edi
mov r14, qword ptr [rip + example::dummy@GOTPCREL]
.LBB0_2:
call r14
add ebx, -1
jne .LBB0_2
.LBB0_3:
add rsp, 8
pop rbx
pop r14
ret
plus the code for dummy(), which is actually empty:
example::dummy:
ret
It's not possible to forward-declare functions. There is only a single declaration for any given entity in Rust.
However, you can use the unimplemented!() and todo!() macros to quickly fill in the bodies of functions you don't want to implement (yet) for some reason. Both are basically aliases for panic!() with specific error messages.
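For example, a minimal sketch (the function name and signature are illustrative):

fn fast_variant(_data: &[u8]) -> u64 {
    todo!() // type-checks against the full signature, but panics if ever called
}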
Declare a trait and have your function's signature in there, e.g.:
pub trait Summary {
fn summarize(&self) -> String;
}
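An implementor can then stub the body out until it is written; a quick sketch (Article is a made-up type):

struct Article;

impl Summary for Article {
    fn summarize(&self) -> String {
        unimplemented!() // the signature is fixed by the trait; the body can come later
    }
}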
In my example below, does cons.push(...) ever copy the self parameter?
Or is rustc intelligent enough to realize that the values coming from lines #a and #b can always use the same stack space and no copying needs to occur (except for the obvious i32 copies)?
In other words, does a call to Cons::push(self, ...) always create a copy of self as ownership is moved? Or does the self struct always stay in place on the stack?
References to documentation would be appreciated.
#[derive(Debug)]
struct Cons<T, U>(T, U);
impl<T, U> Cons<T, U> {
fn push<V>(self, value: V) -> Cons<Self, V> {
Cons(self, value)
}
}
fn main() {
let cons = Cons(1, 2); // #a
let cons = cons.push(3); // #b
println!("{:?}", cons); // #c
}
The implication in my example above is whether the push(...) function grows more expensive to call each time we add a line like #b: at the rate of O(n^2) if self is copied each time, or at the rate of O(n) if self stays in place.
I tried implementing the Drop trait and noticed that both #a and #b were dropped after #c. To me this seems to indicate that self stays in place in this example, but I'm not 100% sure.
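That experiment might look like adding the following to the example above (the Drop impl and its message exist purely to observe drop order):

impl<T, U> Drop for Cons<T, U> {
    fn drop(&mut self) {
        println!("dropping a Cons"); // fires once per nesting level, all after line #c
    }
}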
In general, trust in the compiler! Rust + LLVM is a very powerful combination that often produces surprisingly efficient code. And it will improve even more in time.
In other words, does a call to Cons::push(self, ...) always create a copy of self as ownership is moved? Or does the self struct always stay in place on the stack?
self cannot stay in place because the new value returned by the push method has type Cons<Self, V>, which is essentially a tuple of Self and V. Although tuples don't have any memory layout guarantees, I strongly believe they can't have their elements scattered arbitrarily in memory. Thus, self and value must both be moved into the new structure.
The above paragraph assumed that self was placed firmly on the stack before calling push. The compiler actually has enough information to know it should reserve enough space for the final structure up front. Especially with function inlining, this becomes a very likely optimization.
The implication in my example above is whether or not the push(...) function grows more expensive to call each time we add a line like #b at the rate of O(n^2) (if self is copied each time) or at the rate of O(n) (if self stays in place).
Consider two functions (playground):
pub fn push_int(cons: Cons<i32, i32>, x: i32) -> Cons<Cons<i32, i32>, i32> {
cons.push(x)
}
pub fn push_int_again(
cons: Cons<Cons<i32, i32>, i32>,
x: i32,
) -> Cons<Cons<Cons<i32, i32>, i32>, i32> {
cons.push(x)
}
push_int adds a third element to a Cons and push_int_again adds a fourth element.
push_int compiles to the following assembly in Release mode:
movq %rdi, %rax
movl %esi, (%rdi)
movl %edx, 4(%rdi)
movl %ecx, 8(%rdi)
retq
And push_int_again compiles to:
movq %rdi, %rax
movl 8(%rsi), %ecx
movl %ecx, 8(%rdi)
movq (%rsi), %rcx
movq %rcx, (%rdi)
movl %edx, 12(%rdi)
retq
You don't need to understand assembly to see that pushing the fourth element requires more instructions than pushing the third element.
Note that this observation was made for these functions in isolation. Calls like cons.push(x).push(y).push(...) are inlined and the assembly grows linearly with one instruction per push.
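For instance, a chain like the following (a sketch built on the Cons type above) compiles down to a handful of stores rather than repeated whole-structure copies, assuming optimizations are enabled:

pub fn push_three(x: i32) -> Cons<Cons<Cons<Cons<i32, i32>, i32>, i32>, i32> {
    Cons(1, 2).push(x).push(x).push(x) // fully inlined by the optimizer
}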
The ownership of cons (of type Cons<i32, i32>) from line #a is transferred into push(). Ownership of the resulting Cons<Cons<i32, i32>, i32> is then transferred to the shadowed variable cons on line #b.
If Cons implemented the Copy and Clone traits, it would be copied. Otherwise there is no copy, and you cannot use the original variables after their values have been moved (or owned) by someone else.
Move semantics:
let cons = Cons(1, 2); // Cons(1,2) is a resource in memory, owned by cons
let cons2 = cons; // Cons(1,2) is now owned by cons2. Problem: cons also points to it, so access through cons must be prevented
println!("{:?}", cons); // error, because cons has been moved
I have a small struct:
pub struct Foo {
pub a: i32,
pub b: i32,
pub c: i32,
}
I was using pairs of the fields in the form (a, b), (b, c), (c, a). To avoid duplicating code, I created a utility function which allows me to iterate over the pairs:
impl Foo {
    fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}
I had to decide whether to return the values as references or to copy the i32s. Later on, I plan to switch to a non-Copy type instead of i32, so I decided to use references. I expected the resulting code to be equivalent, since everything would be inlined.
I am generally optimistic about optimizations, so I suspected that the code using this function would be equivalent to the hand-written examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
let mut sum = 0;
for i in 0..3 {
let (l, r) = f.get_foo_ref()[i];
sum += *l + *r;
}
sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
let mut sum = 0;
sum += f.a + f.b;
sum += f.b + f.c;
sum += f.c + f.a;
sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
Update:
I've modified the code slightly to use arrays and added one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time, the generated C++ code is exactly the same in all cases. However, the Rust assembly still differs, and returning by reference results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high/low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
let mut sum = 0;
sum += f.a + f.b;
sum += f.b + f.c;
sum += f.c + f.a;
sum
}
Yields:
example::testing_direct:
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi + 4]
add eax, dword ptr [rdi]
add eax, dword ptr [rdi + 8]
add eax, eax
pop rbp
ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
That is, the compiler realized that:
f.X was added twice
f.X * 2 was equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (integer addition being associative and commutative).
For example, switching your code to f32 (still Copy, but addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):
example::testing:
push rbp
mov rbp, rsp
movss xmm1, dword ptr [rdi]
movss xmm2, dword ptr [rdi + 4]
movss xmm0, dword ptr [rdi + 8]
movaps xmm3, xmm1
addss xmm3, xmm2
xorps xmm4, xmm4
addss xmm4, xmm3
addss xmm2, xmm0
addss xmm2, xmm4
addss xmm0, xmm1
addss xmm0, xmm2
pop rbp
ret
And there's no trickery any longer.
So it's really not possible to infer much from your example, check with the real type.
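As an aside, the non-associativity of floating-point addition is easy to demonstrate; a quick sketch:

fn main() {
    let (a, b, c) = (1e30_f32, -1e30_f32, 1.0_f32);
    assert_eq!((a + b) + c, 1.0); // (1e30 - 1e30) + 1.0
    assert_eq!(a + (b + c), 0.0); // the 1.0 is absorbed: -1e30 + 1.0 == -1e30 in f32
    // Because the two groupings differ, the optimizer cannot reorder float additions.
}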
I came across an interesting case while playing with zero sized types (ZSTs). A reference to an empty array will mold to a reference with any lifetime:
fn mold_slice<'a, T>(_: &'a T) -> &'a [T] {
&[]
}
I thought about how that is possible, since basically the "value" here lives on the stack frame of the function, yet the signature promises to return a reference to a value with a longer lifetime ('a contains the function call). I came to the conclusion that it is because the empty array [] is a ZST which basically only exists statically. The compiler can "fake" the value the reference refers to.
So I tried this:
fn mold_unit<'a, T>(_: &'a T) -> &'a () {
&()
}
and then the compiler complained:
error: borrowed value does not live long enough
--> <anon>:7:6
|
7 | &()
| ^^ temporary value created here
8 | }
| - temporary value only lives until here
|
note: borrowed value must be valid for the lifetime 'a as defined on the block at 6:40...
--> <anon>:6:41
|
6 | fn mold_unit<'a, T>(_: &'a T) -> &'a () {
| ^
It doesn't work for the unit () type, and it also does not work for an empty struct:
struct Empty;
// fails to compile as well
fn mold_struct<'a, T>(_: &'a T) -> &'a Empty {
&Empty
}
Somehow, the unit type and the empty struct are treated differently from the empty array. Are there any additional differences between those values besides just being ZSTs? Do the differences (&[] fitting any lifetime; &() and &Empty not) have nothing to do with ZSTs at all?
Playground example
It's not that [] is zero-sized (though it is), it's that [] is a constant, compile-time literal. This means the compiler can store it in the executable, rather than having to allocate it dynamically on the heap or stack. This, in turn, means that pointers to it last as long as they want, because data in the executable isn't going anywhere.
Annoyingly, this doesn't extend to something like &[0], because Rust isn't quite smart enough to realise that [0] is definitely constant. You can work around this by using something like:
fn mold_slice<'a, T>(_: &'a T) -> &'a [i32] {
const C: &'static [i32] = &[0];
C
}
This trick also works with anything you can put in a const, like () or Empty.
Realistically, however, it'd be simpler to just have functions like this return a &'static borrow, since that can be coerced to any other lifetime automatically.
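A minimal sketch of that simpler form:

fn empty_slice() -> &'static [i32] {
    &[] // the empty array literal is promoted to static storage
}

fn any_lifetime<'a>(_x: &'a i32) -> &'a [i32] {
    empty_slice() // &'static [i32] coerces to &'a [i32] for any 'a
}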
Edit: the previous version noted that &[] is not zero sized, which was a little tangential.
Do the differences (&[] fitting any lifetime; &() and &Empty not) have nothing to do with ZSTs at all?
I think this is exactly the case. The compiler probably just treats arrays differently and there is no deeper reasoning behind it.
The only difference that could play a role is that &[] is a fat pointer, consisting of the data pointer and a length. This fat pointer itself expresses the fact that there is actually no data behind it (because length=0). &() on the other hand is just a normal pointer. Here, only the type system expresses the fact that it's not pointing to anything real. But I'm just guessing here.
To clarify: a reference fitting any lifetime means that the reference has the 'static lifetime. So instead of introducing some lifetime 'a, we can just return a static reference, to the same effect (&[] works, the others don't).
There is an RFC which specifies that references to constexpr rvalues will be stored in the static data section of the executable, instead of on the stack. After this RFC has been implemented (see the tracking issue), all of your examples will compile, as [], () and Empty are all constexpr rvalues. References to them will always be 'static. But the important part of the RFC is that it works for non-ZSTs, too: e.g. &27 has the type &'static i32.
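Concretely, once the RFC lands, a sketch like this will compile, the temporary being promoted to the static data section:

fn promoted() -> &'static i32 {
    &27 // no stack allocation; 27 lives in static data
}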
To have some fun, let's look at the generated assembly (I used the amazing Compiler Explorer)! First let's try the working version:
pub fn mold_slice() -> &'static [i32] {
&[]
}
Using the -O flag (meaning: optimizations enabled; I checked the unoptimized version, too, and it doesn't have significant differences), this is compiled down to:
mold_slice:
push rbp
mov rbp, rsp
lea rax, [rip + ref.0]
xor edx, edx
pop rbp
ret
ref.0:
The fat pointer is returned in the rax (data pointer) and rdx (length) registers. As you can see, the length is set to 0 (xor edx, edx) and the data pointer is set to this mysterious ref.0. The ref.0 is not actually referencing anything at all. It's just an empty marker. This means we return just some pointer to the data section.
Now let's just tell the compiler to trust us on &() in order to compile it:
pub fn possibly_broken() -> &'static () {
unsafe { std::mem::transmute(&()) }
}
Result:
possibly_broken:
push rbp
mov rbp, rsp
lea rax, [rip + ref.1]
pop rbp
ret
ref.1:
Wow, we see pretty much the same result! The pointer (returned via rax) points somewhere into the data section. So it actually is a 'static reference after code generation. Only the lifetime checker doesn't quite know that, and still refuses to compile the code. Well... I guess this is nothing dramatic, especially since the RFC mentioned above will fix this in the near future.