I need to perform 128-bit by 64-bit divisions in Rust. The x86-64 ISA contains a native DIV instruction for this purpose. However, my compiled test code doesn't use this instruction.
Test code:
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}
Compiler explorer output:
example::div:
push rax
cmp rdi, rdx
jae .LBB0_1
mov rax, rdi
mov rdi, rsi
mov rsi, rax
xor ecx, ecx
call qword ptr [rip + __udivti3@GOTPCREL]
pop rcx
ret
.LBB0_1:
...
Instead, an inefficient 128-bit by 128-bit division is performed via a call to __udivti3. This is probably because the DIV instruction causes a CPU exception if the quotient does not fit into 64 bits.
In my case, however, this is impossible:
hi < divisor, lo < 2^64 -> dividend = hi * 2^64 + lo <= (divisor - 1) * 2^64 + 2^64 - 1 = divisor * 2^64 - 1
-> dividend / divisor <= (divisor * 2^64 - 1) / divisor = 2^64 - 1/divisor < 2^64
How can I force the compiler to use the native instruction?
Your only option is to use inline assembly. There might be an obscure combination of compiler flags that can coerce llvm into performing the optimization itself, but I don't think trying to find it is worth the effort. With assembly, it would look like this:
use std::arch::asm;
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let mut quot = lo;
        let mut _rem = hi;
        asm!(
            "div {divisor}",
            divisor = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") _rem,
            options(pure, nomem, nostack)
        );
        quot
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }
}
Godbolt
On x86_64, this just compiles the division down to a little register shuffling followed by a div, and performs the call to __udivti3 on other systems. It also shouldn't get in the way of the optimizer too much since it's pure.
It's definitely worth actually benchmarking your application to see if this actually helps. It's a lot easier for llvm to reason about integer division than inline assembly, and missed optimizations elsewhere could easily result in this version running slower than using the default version.
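If you do adopt the asm version, it is also worth cross-checking it against the plain u128 version, since the asm block bypasses the compiler's usual checks. A minimal test sketch (the test values are only illustrative):
#[cfg(test)]
mod tests {
    use super::div;

    // Reference implementation using 128-bit division.
    fn div_ref(hi: u64, lo: u64, divisor: u64) -> u64 {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }

    #[test]
    fn asm_matches_u128_division() {
        let cases = [
            (0u64, 0u64, 1u64),
            (0, u64::MAX, u64::MAX),
            (1, 0, 2), // 2^64 / 2 = 2^63
            (u64::MAX - 1, u64::MAX, u64::MAX),
        ];
        for (hi, lo, divisor) in cases {
            assert!(hi < divisor);
            assert_eq!(div(hi, lo, divisor), div_ref(hi, lo, divisor));
        }
    }
}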
What is the idiomatic Rust way to map an i64 in [-9223372036854775808, 9223372036854775807] onto the u64 domain [0, 18446744073709551615]? For example, 0 in i64 maps to 9223372036854775808 in u64.
Here is what I have done.
let x: i64 = -10;
let x_transform = ((x as u64) ^ (1 << 63)) & (1 << 63) | (x as u64 & (u64::MAX >> 1));
let x_original = ((x_transform as i64) ^ (1 << 63)) & (1 << 63) | (x_transform & (u64::MAX >> 1)) as i64;
println!("x_transform {}", x_transform);
println!("x_original {} {}", x_original, x_original == x);
yielding
x_transform 9223372036854775798
x_original -10 true
Is there a built-in way to do this? It seems too verbose and error-prone.
From a performance point of view it doesn't really matter how you write it; a quick check on Godbolt shows that both the wrapping and the bit-shifting versions compile to the same machine code.
But I'd argue the variants with wrapping are way more readable and convey the intent better.
pub fn wrap_to_u64(x: i64) -> u64 {
    (x as u64).wrapping_add(u64::MAX / 2 + 1)
}

pub fn wrap_to_i64(x: u64) -> i64 {
    x.wrapping_sub(u64::MAX / 2 + 1) as i64
}

pub fn to_u64(x: i64) -> u64 {
    ((x as u64) ^ (1 << 63)) & (1 << 63) | (x as u64 & (u64::MAX >> 1))
}

pub fn to_i64(x: u64) -> i64 {
    ((x as i64) ^ (1 << 63)) & (1 << 63) | (x & (u64::MAX >> 1)) as i64
}
example::wrap_to_u64:
movabs rax, -9223372036854775808
xor rax, rdi
ret
example::wrap_to_i64:
movabs rax, -9223372036854775808
xor rax, rdi
ret
example::to_u64:
movabs rax, -9223372036854775808
xor rax, rdi
ret
example::to_i64:
movabs rax, -9223372036854775808
xor rax, rdi
ret
The lesson to learn is that unless you have a very specific optimization in mind, the compiler will probably outperform you.
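As a quick sanity check of the wrapping variants (a sketch, assuming wrap_to_u64 and wrap_to_i64 from above are in scope; the expected value for -10 matches the one in the question):
fn main() {
    // The mapping round-trips for any value.
    for x in [i64::MIN, -10, 0, 10, i64::MAX] {
        assert_eq!(wrap_to_i64(wrap_to_u64(x)), x);
    }
    // It also preserves ordering across the signed/unsigned boundary.
    assert!(wrap_to_u64(-10) < wrap_to_u64(10));
    assert_eq!(wrap_to_u64(-10), 9223372036854775798);
    assert_eq!(wrap_to_u64(i64::MIN), 0);
}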
The simplest solution would be to just translate the two's complement representation, rather than use offset-binary:
let x_transform = u64::from_ne_bytes(x.to_ne_bytes());
let x_original = i64::from_ne_bytes(x_transform.to_ne_bytes());
However, according to the wiki:
Offset binary may be converted into two's complement by inverting the most significant bit.
So you could do that and use the less error-prone two's complement for the actual translation:
pub fn convert1(x: i64) -> u64 {
    ((x as u64) ^ (1 << 63)) & (1 << 63) | (x as u64 & (u64::MAX >> 1))
}

pub fn convert3(x: i64) -> u64 {
    // manipulating the bytes in transit requires
    // knowing the MSB; use LE as that's the most
    // common by far
    let mut bytes = x.to_le_bytes();
    bytes[7] ^= 0x80;
    u64::from_le_bytes(bytes)
}

pub fn convert4(x: i64) -> u64 {
    u64::from_ne_bytes((x ^ i64::MIN).to_ne_bytes())
}
all produce the exact same x86_64 code:
movabs rax, -9223372036854775808
xor rax, rdi
ret
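A quick spot check that the variants agree, including at the extremes (a sketch, assuming convert1, convert3 and convert4 from above are in scope):
fn main() {
    for x in [i64::MIN, -1, 0, 1, i64::MAX] {
        let expected = convert1(x);
        assert_eq!(convert3(x), expected);
        assert_eq!(convert4(x), expected);
    }
    // Inverting the MSB maps i64::MIN to 0 and i64::MAX to u64::MAX.
    assert_eq!(convert4(i64::MIN), 0);
    assert_eq!(convert4(i64::MAX), u64::MAX);
}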
How do you efficiently convert a number n to a number with that many low significant bits set? That is, for an 8-bit integer:
0 -> 00000000
1 -> 00000001
2 -> 00000011
3 -> 00000111
4 -> 00001111
5 -> 00011111
6 -> 00111111
7 -> 01111111
8 -> 11111111
Seems trivial:
fn conv(n: u8) -> u8 { (1 << n) - 1 }
But it's not:
thread 'main' panicked at 'attempt to shift left with overflow', src/main.rs:2:5
playground
This is because, in Rust, a number with N bits cannot be shifted left by N or more places; such a shift is undefined behaviour on many architectures.
u8::MAX >> (8 - n) doesn't work either, as shift right has the same overflow restriction (consider n = 0, which would shift by 8).
So, what is an efficient way to achieve this without a lookup table or a conditional? Or is this not possible and one must resort to an implementation like:
fn conv2(n: u8) -> u8 {
    match 1u8.checked_shl(n.into()) {
        Some(n) => n - 1,
        None => u8::MAX,
    }
}
The most efficient way, for any bit width smaller than the largest native integer size, will almost certainly be to use a wider type for calculations and then narrow it to the output type:
pub fn conv(n: u8) -> u8 {
    ((1u32 << n) - 1) as u8
}
rustc 1.50 at -C opt-level=1 or higher renders this as four instructions with no branches or indirection, which is probably optimal for x86-64.
example::conv:
mov ecx, edi
mov eax, 1
shl eax, cl
add al, -1
ret
On most (all?) 32-bit or larger platforms, there's no advantage to be gained from doing math with u8 or u16 instead of u32; if you have to use a 32-bit register you may as well use the whole thing (which is also why certain integer methods only accept u32).
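If you want reassurance that the widened version agrees with the checked_shl fallback from the question, the input space is small enough to check exhaustively. A quick sketch (conv_checked here is just the question's conv2, renamed for the comparison):
pub fn conv(n: u8) -> u8 {
    ((1u32 << n) - 1) as u8
}

fn conv_checked(n: u8) -> u8 {
    match 1u8.checked_shl(n.into()) {
        Some(v) => v - 1,
        None => u8::MAX,
    }
}

fn main() {
    for n in 0..=8u8 {
        assert_eq!(conv(n), conv_checked(n), "mismatch at n = {n}");
    }
}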
Even if you write a match, it may not be compiled to a branch. In my tests, the following function
pub fn conv(n: u32) -> u64 {
    match 1u64.checked_shl(n) {
        Some(n) => n - 1,
        _ => !0,
    }
}
compiles with no branches, just a cmov instruction, and looks pretty similar to the equivalent using (software-emulated) u128. Be sure to profile before assuming that the match is a problem.
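For reference, the software-emulated u128 equivalent alluded to above would presumably be the widened-type trick from the previous answer applied to u64, something like:
// Valid for n in 0..=64; 1u128 << n cannot overflow in that range.
pub fn conv_u64(n: u32) -> u64 {
    ((1u128 << n) - 1) as u64
}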
This is just to satisfy my own curiosity.
Is there an implementation of this:
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}
in Rust? If it exists, post the code.
I tried it and failed. I don't know how to encode the float number using integer format. Here is my attempt:
fn main() {
    println!("Hello, world!");
    println!("sqrt1: {}, ", sqrt2(100f64));
}

fn sqrt1(x: f64) -> f64 {
    x.sqrt()
}

fn sqrt2(x: f64) -> f64 {
    let mut x = x;
    let xhalf = 0.5 * x;
    let mut i = x as i64;
    println!("sqrt1: {}, ", i);
    i = 0x5f375a86 as i64 - (i >> 1);
    x = i as f64;
    x = x * (1.5f64 - xhalf * x * x);
    1.0 / x
}
Reference:
1. Origin of Quake3's Fast InvSqrt() - Page 1
2. Understanding Quake’s Fast Inverse Square Root
3. FAST INVERSE SQUARE ROOT.pdf
4. source code: q_math.c#L552-L572
I don't know how to encode the float number using integer format.
There is a function for that: f32::to_bits, which returns a u32. There is also a function for the other direction: f32::from_bits, which takes a u32 as argument. These functions are preferred over mem::transmute, as the latter is unsafe and tricky to use.
With that, here is the implementation of InvSqrt:
fn inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();
    let i = 0x5f3759df - (i >> 1);
    let y = f32::from_bits(i);
    y * (1.5 - 0.5 * x * y * y)
}
(Playground)
This function compiles to the following assembly on x86-64:
.LCPI0_0:
.long 3204448256 ; f32 -0.5
.LCPI0_1:
.long 1069547520 ; f32 1.5
example::inv_sqrt:
movd eax, xmm0
shr eax ; i >> 1
mov ecx, 1597463007 ; 0x5f3759df
sub ecx, eax ; 0x5f3759df - ...
movd xmm1, ecx
mulss xmm0, dword ptr [rip + .LCPI0_0] ; x *= -0.5
mulss xmm0, xmm1 ; x *= y
mulss xmm0, xmm1 ; x *= y
addss xmm0, dword ptr [rip + .LCPI0_1] ; x += 1.5
mulss xmm0, xmm1 ; x *= y
ret
I have not found any reference assembly (if you have, please tell me!), but it seems fairly good to me. I am just not sure why the float was moved into eax just to do the shift and integer subtraction. Maybe SSE registers do not support those operations?
clang 9.0 with -O3 compiles the C code to basically the same assembly. So that's a good sign.
It is worth pointing out that if you actually want to use this in practice: please don't. As benrg pointed out in the comments, modern x86 CPUs have a specialized instruction for this function, which is faster and more accurate than this hack. Unfortunately, 1.0 / x.sqrt() does not seem to optimize to that instruction. So if you really need the speed, using the _mm_rsqrt_ps intrinsics is probably the way to go. This, however, does again require unsafe code. I won't go into much detail in this answer, since only a minority of programmers will actually need it.
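If you do reach for the intrinsics, a minimal sketch of the scalar variant could look like this (using _mm_rsqrt_ss rather than the packed _mm_rsqrt_ps; SSE is part of the x86_64 baseline, so the unsafe block is sound on that target):
#[cfg(target_arch = "x86_64")]
pub fn rsqrt_approx(x: f32) -> f32 {
    use std::arch::x86_64::{_mm_cvtss_f32, _mm_rsqrt_ss, _mm_set_ss};
    // SAFETY: SSE is always available on x86_64.
    unsafe { _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))) }
}
Note that rsqrtss is itself only an approximation (roughly 12 bits of precision), so depending on your accuracy requirements you may still want a Newton-Raphson refinement step on top of it.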
This one is implemented with Rust's lesser-known union feature:
union FI {
    f: f32,
    i: i32,
}

fn inv_sqrt(x: f32) -> f32 {
    let mut u = FI { f: x };
    unsafe {
        u.i = 0x5f3759df - (u.i >> 1);
        u.f * (1.5 - 0.5 * x * u.f * u.f)
    }
}
Did some micro-benchmarks using the criterion crate on an x86-64 Linux box. Surprisingly, Rust's own sqrt().recip() is the fastest. But of course, any micro-benchmark result should be taken with a grain of salt.
inv sqrt with transmute        time: [1.6605 ns 1.6638 ns 1.6679 ns]
inv sqrt with union            time: [1.6543 ns 1.6583 ns 1.6633 ns]
inv sqrt with to and from bits time: [1.7659 ns 1.7677 ns 1.7697 ns]
inv sqrt with powf             time: [7.1037 ns 7.1125 ns 7.1223 ns]
inv sqrt with sqrt then recip  time: [1.5466 ns 1.5488 ns 1.5513 ns]
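For reference, a criterion benchmark along these lines might look like the following sketch (placed under benches/ with harness = false; the benchmark names and input value are illustrative, not the exact harness used above):
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn inv_sqrt(x: f32) -> f32 {
    let i = x.to_bits();
    let i = 0x5f3759df - (i >> 1);
    let y = f32::from_bits(i);
    y * (1.5 - 0.5 * x * y * y)
}

fn bench_inv_sqrt(c: &mut Criterion) {
    c.bench_function("inv sqrt with to and from bits", |b| {
        b.iter(|| inv_sqrt(black_box(0.0042f32)))
    });
    c.bench_function("inv sqrt with sqrt then recip", |b| {
        b.iter(|| black_box(0.0042f32).sqrt().recip())
    });
}

criterion_group!(benches, bench_inv_sqrt);
criterion_main!(benches);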
You may use std::mem::transmute to make the needed conversion:
fn inv_sqrt(x: f32) -> f32 {
    let xhalf = 0.5f32 * x;
    let mut i: i32 = unsafe { std::mem::transmute(x) };
    i = 0x5f3759df - (i >> 1);
    let mut res: f32 = unsafe { std::mem::transmute(i) };
    res = res * (1.5f32 - xhalf * res * res);
    res
}
You can find a live example here.
I have a small struct:
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}
I was using pairs of the fields in the form (a,b) (b,c) (c,a). To avoid duplication of the code, I created a utility function which would allow me to iterate over the pairs:
impl Foo {
    fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}
I had to decide if I should return the values as references or copy the i32. Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references. I expected the resulting code should be equivalent since everything would be inlined.
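For reference, the by-value alternative (the get_foo mentioned further down) would look something like this:
impl Foo {
    // Sketch of the by-value counterpart; the exact code is on the Godbolt link.
    fn get_foo(&self) -> [(i32, i32); 3] {
        [(self.a, self.b), (self.b, self.c), (self.c, self.a)]
    }
}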
I am generally optimistic about optimizations, so I suspected that the code would be equivalent when using this function as compared to hand written code examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];
        sum += *l + *r;
    }
    sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code for get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
Update:
I've slightly modified the code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated C++ code is exactly the same. However, the Rust assembly differs, and returning by reference results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
Yields:
example::testing_direct:
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi + 4]
add eax, dword ptr [rdi]
add eax, dword ptr [rdi + 8]
add eax, eax
pop rbp
ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
That is, the compiler realized that:
f.X was added twice
f.X * 2 was equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (and to the fact that integer addition is associative).
For example, switching your code to f32 (still Copy, but addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different for testing_ref):
example::testing:
push rbp
mov rbp, rsp
movss xmm1, dword ptr [rdi]
movss xmm2, dword ptr [rdi + 4]
movss xmm0, dword ptr [rdi + 8]
movaps xmm3, xmm1
addss xmm3, xmm2
xorps xmm4, xmm4
addss xmm4, xmm3
addss xmm2, xmm0
addss xmm2, xmm4
addss xmm0, xmm1
addss xmm0, xmm2
pop rbp
ret
And there's no trickery any longer.
So it's really not possible to infer much from your example; check with the real type.
Rust employs dynamic (run-time) checks for many things. One such example is array bounds checking.
Take this code for example,
fn test_dynamic_checking() -> i32 {
    let x = [1, 2, 3, 4, 5];
    x[1]
}
The resulting LLVM IR is:
; Function Attrs: uwtable
define internal i32 @_ZN10dynamic_ck21test_dynamic_checking17hcef32a1e8c339e2aE() unnamed_addr #0 {
entry-block:
  %x = alloca [5 x i32]
  %0 = bitcast [5 x i32]* %x to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast ([5 x i32]* @const7091 to i8*), i64 20, i32 4, i1 false)
  %1 = getelementptr inbounds [5 x i32], [5 x i32]* %x, i32 0, i32 0
  %2 = call i1 @llvm.expect.i1(i1 false, i1 false)
  br i1 %2, label %cond, label %next

next:                                             ; preds = %entry-block
  %3 = getelementptr inbounds i32, i32* %1, i64 1
  %4 = load i32, i32* %3
  ret i32 %4

cond:                                             ; preds = %entry-block
  call void @_ZN4core9panicking18panic_bounds_check17hcc71f10000bd8e6fE({ %str_slice, i32 }* noalias readonly dereferenceable(24) @panic_bounds_check_loc7095, i64 1, i64 5)
  unreachable
}
A branch instruction is inserted to decide whether the index is out of bounds, something that doesn't exist in the LLVM IR clang produces for equivalent C code.
Here are my questions:
In what situations does Rust implement dynamic checking?
How does Rust implement dynamic checking in different situations?
Is there any way to turn off the dynamic checking?
Conceptually, Rust performs array bound checking on each and every array access. However, the compiler is very good at optimizing away the checks when it can prove that it's safe to do so.
The LLVM intermediate output is misleading because it still undergoes optimizations by the LLVM's optimizing machinery before the machine assembly is generated. A better way to inspect assembly output is by generating the final assembly using an invocation such as rustc -O --emit asm --crate-type=lib. The assembly output for your function is just:
push rbp
mov rbp, rsp
mov eax, 2
pop rbp
ret
Not only is there no bound checking in sight, there is no array to begin with: the compiler has optimized the entire function down to a return 2i32! To force bound checking, the function needs to be written so that Rust cannot prove that the check can be elided:
pub fn test_dynamic_checking(ind: usize) -> i32 {
    let x = [1, 2, 3, 4];
    x[ind]
}
This results in a larger assembly, where the bound check is implemented as the following two instructions:
cmp rax, 3 ; compare index with 3
ja .LBB0_2 ; if greater, jump to panic code
That is as efficient as it gets. Turning off the bound check is rarely a good idea because it can easily cause the program to crash. It can be done, but explicitly and only within the confines of an unsafe block or function:
pub unsafe fn test_dynamic_checking(ind: usize) -> i32 {
    let x = [1, 2, 3, 4];
    *x.get_unchecked(ind)
}
The generated assembly shows the error checking to be entirely omitted:
push rbp
mov rbp, rsp
lea rax, [rip + const3141]
mov eax, dword ptr [rax + 4*rdi]
pop rbp
ret
const3141:
.long 1
.long 2
.long 3
.long 4
In what situations does Rust implement dynamic checking?
It's a bit of a cop out... but Rust, the language, does not implement any dynamic checking. The Rust libraries, however, starting with the core and std libraries do.
A library needs to use run-time checking whenever not doing so could lead to a memory safety issue. Examples include:
guaranteeing that an index is within bounds, such as when implementing Index
guaranteeing that no other reference to an object exists, such as when implementing RefCell (see the sketch below)
...
In general, the Rust philosophy is to perform as many checks as possible at compile time, but some checks need to be delayed to run time because they depend on dynamic behavior/values.
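As a small illustration of the RefCell case above, the aliasing rule is enforced at run time rather than at compile time (a minimal sketch):
use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(5);
    let shared = cell.borrow();
    // A conflicting mutable borrow is reported as an Err at run time
    // instead of being rejected at compile time.
    assert!(cell.try_borrow_mut().is_err());
    drop(shared);
    // Once the shared borrow is gone, the mutable borrow succeeds.
    assert!(cell.try_borrow_mut().is_ok());
}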
How does Rust implement dynamic checking in different situations?
As efficiently as possible.
When dynamic checking is required, the Rust code will be crafted to be as efficient as possible. This can be slightly complicated, as it involves trying to ease the work of the optimizer and conforming to patterns that the optimizer recognizes (or fixing the optimizer), but we are fortunate that a few developers are obsessed with performance (@bluss, for example).
Not all checks may be elided, of course, and those that remain will typically be just a branch.
Is there any way to turn off the dynamic checking?
Whenever dynamic checking is necessary to guarantee code safety, it is not possible to turn it off in safe code.
In some situations, though, an unsafe block or function may allow you to bypass the check (for example get_unchecked for indexing).
This is not recommended in general, and should be a last-resort behaviour:
Most of the time, the run-time checks have little to no performance impact; branch prediction is awesome like that.
Even if they have some impact, unless they sit in a very hot loop, it may not be worth optimizing them.
Finally, if it does NOT optimize, it is worth trying to understand why: maybe it's impossible or maybe the optimizer is not clever enough... in the latter case, reporting the issue is the best way to have someone dig into the optimizer and fix it (if you cannot or are not willing to do it yourself).