How to initialize an array with one non-zero value - rust

I'm creating a number type that uses arrays to store the numbers. To implement the trait One, I find myself writing code like this:
fn one() -> Self {
    let mut ret_array = [0; N];
    ret_array[0] = 1;
    Self(ret_array)
}
Is there an alternative way to initialize an array with one non-zero element?

I don't think so, no.
But the Rust compiler understands what you are trying to achieve and optimizes it accordingly:
pub fn one<const N: usize>() -> [i32; N] {
    let mut ret_array = [0; N];
    ret_array[0] = 1;
    ret_array
}

pub fn one_with_length_5() -> [i32; 5] {
    one()
}
example::one_with_length_5:
    mov rax, rdi
    xorps xmm0, xmm0
    movups xmmword ptr [rdi + 4], xmm0
    mov dword ptr [rdi], 1
    ret
xorps xmm0, xmm0 sets the 16-byte (four-int) SSE register xmm0 to [0, 0, 0, 0].
movups xmmword ptr [rdi + 4], xmm0 copies all four ints of the xmm0 register to the location [rdi + 4], i.e. elements 1 through 4 of ret_array.
mov dword ptr [rdi], 1 writes the value 1 to the first element of ret_array.
ret_array starts at [rdi], so [rdi + 4] is ret_array[1], [rdi + 8] is ret_array[2], and so on.
As you can see, the compiler zeroes the other four elements with a single 16-byte store and then sets the first element to 1. The first value does not get written twice.
Small remark
If you set N to e.g. 8, one element actually does get written twice:
example::one_with_length_8:
    mov rax, rdi
    xorps xmm0, xmm0
    movups xmmword ptr [rdi + 16], xmm0
    movups xmmword ptr [rdi + 4], xmm0
    mov dword ptr [rdi], 1
    ret
Interestingly, it isn't element [0] that gets written twice, but element [4]: the code first zeroes elements [4] through [7], then elements [1] through [4] (writing [4] a second time), and finally writes 1 to element [0].
That's because this is the fastest way to do it: the SSE register holds four ints of zeros, so the array is zero-initialized four ints at a time. Writing one int twice is cheaper than covering the remaining elements without 16-byte SSE stores.
This would even happen if you initialized it completely manually:
pub fn one_with_length_8() -> [i32; 8] {
    [1, 0, 0, 0, 0, 0, 0, 0]
}
example::one_with_length_8:
    mov rax, rdi
    mov dword ptr [rdi], 1
    xorps xmm0, xmm0
    movups xmmword ptr [rdi + 4], xmm0
    movups xmmword ptr [rdi + 16], xmm0
    ret
You can see the order is different, but the instructions are identical.
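As an aside, if you only want to avoid the mutable binding, core::array::from_fn (stable since Rust 1.63) can express the same initialization declaratively; a minimal sketch (worth confirming on Godbolt that it optimizes to the same stores):
pub fn one<const N: usize>() -> [i32; N] {
    // Build the array from a closure over the index.
    core::array::from_fn(|i| if i == 0 { 1 } else { 0 })
}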

The impact of avoiding let variable bindings

Having fun with a project using gdnative, I wrote this:
#[method]
fn _save_player_position(&mut self, player_current_position: VariantArray) {
    let player_current_position: (f64, f64) = (
        player_current_position.get(0).to::<f64>().unwrap(),
        player_current_position.get(1).to::<f64>().unwrap(),
    );
    self.player_data.set_player_position(player_current_position.0, player_current_position.1);
    self.received_signals += 1;
}
My question is: do you gain any benefit by rewriting the code like this:
#[method]
fn _save_player_position(&mut self, player_current_position: VariantArray) {
    self.player_data.set_player_position(
        player_current_position.get(0).to::<f64>().unwrap(),
        player_current_position.get(1).to::<f64>().unwrap(),
    );
    self.received_signals += 1;
}
As far as I know, I am avoiding:
The creation of a new tuple
Storing the data in its unnamed fields
Binding the data to the let player_current_position
Then moving the data into some of the self fields
And the questions are:
Is the above true?
Is it worth writing code like this to avoid allocations (even if they are on the stack)?
Is it better to optimize only the heap ones, and improve readability whenever possible?
You can check the compiler output for both cases (slightly rewritten for clarity) here:
https://godbolt.org/z/Gc4nr6afb
struct Foo {
    pos: (f64, f64),
}

impl Foo {
    fn bar(&mut self, current: (f64, f64)) {
        let player_current_position: (f64, f64) = (
            current.0,
            current.1,
        );
        self.pos = (player_current_position.0, player_current_position.1);
    }

    fn bar2(&mut self, current: (f64, f64)) {
        self.pos = (current.0, current.1);
    }
}

pub fn main() {
    let mut foo = Foo { pos: (1.0, 1.0) };
    foo.bar((2.0, 2.0));
    foo.bar2((2.0, 2.0));
}
This is with the local variable binding:
Foo::bar:
    sub rsp, 16
    movsd qword ptr [rsp], xmm0
    movsd qword ptr [rsp + 8], xmm1
    movsd xmm1, qword ptr [rsp]
    movsd xmm0, qword ptr [rsp + 8]
    movsd qword ptr [rdi], xmm1
    movsd qword ptr [rdi + 8], xmm0
    add rsp, 16
    ret
And this is without it:
Foo::bar2:
    movsd qword ptr [rdi], xmm0
    movsd qword ptr [rdi + 8], xmm1
    ret
Note that, as per @Jmb's comment, once compiler optimization is enabled the output is identical: https://godbolt.org/z/xKPfMP6Yr
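If you want named values without binding the whole tuple first, plain destructuring is a readable middle ground; a minimal sketch on the same Foo as above (the method name bar3 is illustrative):
impl Foo {
    // Destructure once into named bindings; with optimizations enabled
    // this compiles to the same two stores as bar2.
    fn bar3(&mut self, current: (f64, f64)) {
        let (x, y) = current;
        self.pos = (x, y);
    }
}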

128-bit by 64-bit native division

I need to perform 128-bit by 64-bit divisions in Rust. The x86-64 ISA contains a native DIV instruction for this purpose. However, my compiled test code doesn't use this instruction.
Test code:
pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    let dividend = ((hi as u128) << 64) + lo as u128;
    (dividend / divisor as u128) as u64
}
Compiler explorer output:
example::div:
    push rax
    cmp rdi, rdx
    jae .LBB0_1
    mov rax, rdi
    mov rdi, rsi
    mov rsi, rax
    xor ecx, ecx
    call qword ptr [rip + __udivti3@GOTPCREL]
    pop rcx
    ret
.LBB0_1:
    ...
Instead, an inefficient 128-bit by 128-bit division is performed via __udivti3. This is probably because the DIV instruction causes a CPU exception if the quotient does not fit into 64 bits.
In my case, however, this is impossible:
hi < divisor and lo < 2^64
=> dividend = hi * 2^64 + lo <= (divisor - 1) * 2^64 + (2^64 - 1) = divisor * 2^64 - 1
=> dividend / divisor <= (divisor * 2^64 - 1) / divisor = 2^64 - 1/divisor < 2^64
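A quick runtime spot-check of this bound at the extreme values (using the u128 fallback; these are the worst-case values the assert allows):
fn main() {
    // Worst case: divisor as large as possible, hi = divisor - 1, lo maximal.
    let divisor = u64::MAX;
    let (hi, lo) = (divisor - 1, u64::MAX);
    let dividend = ((hi as u128) << 64) + lo as u128;
    let quotient = dividend / divisor as u128;
    assert!(quotient <= u64::MAX as u128); // the quotient fits into 64 bits
    println!("quotient = {:#x}", quotient);
}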
How can I force the compiler to use the native instruction?
Your only option is to use inline assembly. There might be an obscure combination of compiler flags that can coerce LLVM into performing the optimization itself, but I don't think finding it is worth the effort. With assembly, it would look like this:
use std::arch::asm;

pub fn div(hi: u64, lo: u64, divisor: u64) -> u64 {
    assert!(hi < divisor);
    #[cfg(target_arch = "x86_64")]
    unsafe {
        let mut quot = lo;
        let mut _rem = hi;
        asm!(
            "div {divisor}",
            divisor = in(reg) divisor,
            inout("rax") quot,
            inout("rdx") _rem,
            options(pure, nomem, nostack)
        );
        quot
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let dividend = ((hi as u128) << 64) + lo as u128;
        (dividend / divisor as u128) as u64
    }
}
Godbolt
On x86_64, this compiles the division down to a little register shuffling followed by a div, and falls back to the __udivti3 call on other targets. It also shouldn't get in the way of the optimizer too much, since the asm block is marked pure.
It's definitely worth benchmarking your application to see whether this actually helps. LLVM finds it much easier to reason about integer division than about inline assembly, and missed optimizations elsewhere could easily make this version run slower than the default one.
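A quick sanity check of the asm version against plain u128 arithmetic (the values are illustrative):
fn main() {
    let (hi, lo, divisor) = (3u64, 42u64, 7u64);
    // Compute the expected quotient via the u128 fallback.
    let expected = ((((hi as u128) << 64) + lo as u128) / divisor as u128) as u64;
    assert_eq!(div(hi, lo, divisor), expected);
    println!("ok: {:#x}", div(hi, lo, divisor));
}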

Convert bytes to u64

I need to convert the first 8 bytes of a String in Rust to a u64, big endian. This code almost works:
fn main() {
    let s = String::from("01234567");
    let mut buf = [0u8; 8];
    buf.copy_from_slice(s.as_bytes());
    let num = u64::from_be_bytes(buf);
    println!("{:X}", num);
}
There are multiple problems with this code. First, it only works if the string is exactly 8 bytes long. .copy_from_slice() requires that both source and destination have the same length. This is easy to deal with if the String is too long because I can just grab a slice of the right length, but if the String is short it won't work.
Another problem is that this code is part of a function which is very performance sensitive. It runs in a tight loop over a large data set.
In C, I would just zero the buf, memcpy over the right number of bytes, and do a cast to an unsigned long.
Is there some way to do this in Rust which runs just as fast?
You can just modify your existing code to take the length into account when copying:
let len = 8.min(s.len());
buf[..len].copy_from_slice(&s.as_bytes()[..len]);
If the string is short this will copy the bytes into what will become the most significant bits of the u64, of course.
As to performance: in this simple test main(), the conversions are completely optimized out to become a constant integer. So, we need an explicit function or loop:
pub fn convert(s: &str) -> u64 {
    let mut buf = [0u8; 8];
    let len = 8.min(s.len());
    buf[..len].copy_from_slice(&s.as_bytes()[..len]);
    u64::from_be_bytes(buf)
}
This (on the Rust Playground) generates the assembly:
playground::convert:
    pushq %rax
    movq %rdi, %rax
    movq $0, (%rsp)
    cmpq $8, %rsi
    movl $8, %edx
    cmovbq %rsi, %rdx
    movq %rsp, %rdi
    movq %rax, %rsi
    callq *memcpy@GOTPCREL(%rip)
    movq (%rsp), %rax
    bswapq %rax
    popq %rcx
    retq
I feel a little skeptical that that memcpy call is actually a good idea compared to just issuing instructions to copy the bytes, but I'm no expert on instruction-level performance and presumably it'll at least equal your C code explicitly calling memcpy(). What we do see is that there are no branches in the compiled code, only a conditional move presumably to handle the 8 vs. len() choice — and no bounds-check panic.
(And the generated assembly will of course be different — hopefully for the better — when this function or code snippet is inlined into a larger loop.)
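For reference, a quick usage check of convert (the hex values follow from the ASCII codes of the digit characters, 0x30 through 0x39):
fn main() {
    assert_eq!(convert("01234567"), 0x3031323334353637);
    // Short input lands in the most significant bytes; the rest stay zero.
    assert_eq!(convert("01"), 0x3031000000000000);
    // Bytes beyond the first 8 are ignored.
    assert_eq!(convert("0123456789"), 0x3031323334353637);
}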

Rust: Collect result of chain into byte array

I have the following two variables: s, which has type Option<&[u8; 32]>, and k, which has type &[u8; 32]. I would like to concatenate s and k into a byte array [u8; 64] using the chain method (I'm using chain instead of concat because I've read it is much better performance-wise); however, it seems I'm doing something wrong.
Here is a simplified example of what I'm trying to do:
fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = *s.unwrap();
    let tmpk = *k;
    let result = tmps.iter().chain(tmpk.iter()).collect();
    result
}

fn main() {
    let s = [47u8; 32];
    let k = [23u8; 32];
    println!("s: {:?}", s);
    println!("k: {:?}", k);
    let sk = combine(Some(&s), &k);
    println!("sk: {:?}", sk);
}
The error I'm getting: a value of type '[u8; 64]' cannot be built from an iterator over elements of type '&u8'
Here is the link to Rust Playground with the code above.
fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = s.unwrap();
    let mut result = [0; 64];
    let (left, right) = result.split_at_mut(32);
    left.copy_from_slice(&tmps[..]);
    right.copy_from_slice(&k[..]);
    result
}
slice::split_at_mut returns two mutable slices.
And from slice::copy_from_slice(&mut self, src: &[T]) docs:
Copies all elements from src into self, using a memcpy.
The resulting asm (godbolt.org) is nicely vectorized:
example::combine:
    push rax
    test rsi, rsi
    je .LBB0_1
    movups xmm0, xmmword ptr [rsi]
    movups xmm1, xmmword ptr [rsi + 16]
    movups xmmword ptr [rdi + 16], xmm1
    movups xmmword ptr [rdi], xmm0
    movups xmm0, xmmword ptr [rdx]
    movups xmm1, xmmword ptr [rdx + 16]
    movups xmmword ptr [rdi + 32], xmm0
    movups xmmword ptr [rdi + 48], xmm1
    mov rax, rdi
    pop rcx
    ret
.LBB0_1:
    lea rdi, [rip + .L__unnamed_1]
    lea rdx, [rip + .L__unnamed_2]
    mov esi, 43
    call qword ptr [rip + core::panicking::panic@GOTPCREL]
    ud2
Here is a solution using the arrayvec crate.
use arrayvec::ArrayVec; // 0.7.0

fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = *s.unwrap();
    let tmpk = *k;
    let result: ArrayVec<u8, 64> = tmps.iter().chain(tmpk.iter()).copied().collect();
    result.into_inner().unwrap()
}
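The root cause of the original error is that arrays do not implement FromIterator, because the element count cannot be verified at compile time. If you would rather stay iterator-based without an extra crate, a minimal sketch:
fn combine_zip(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = s.unwrap();
    let mut result = [0u8; 64];
    // zip stops at the shorter side, so this writes exactly 64 bytes.
    for (dst, src) in result.iter_mut().zip(tmps.iter().chain(k.iter())) {
        *dst = *src;
    }
    result
}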

Why is the produced assembly not equivalent between returning by reference and copy when inlined?

I have a small struct:
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}
I was using pairs of the fields in the form (a,b) (b,c) (c,a). To avoid duplication of the code, I created a utility function which would allow me to iterate over the pairs:
impl Foo {
    fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
        [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
    }
}
I had to decide whether to return the values as references or to copy the i32s. Later on, I plan to switch to a non-Copy type instead of i32, so I decided to use references. I expected the resulting code to be equivalent, since everything would be inlined.
I am generally optimistic about optimizations, so I expected the code using this function to be equivalent to the hand-written examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];
        sum += *l + *r;
    }
    sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
Update:
I've slightly modified the code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated C++ code is exactly the same. However, the Rust assembly differs, and returning by reference results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
Yields:
example::testing_direct:
    push rbp
    mov rbp, rsp
    mov eax, dword ptr [rdi + 4]
    add eax, dword ptr [rdi]
    add eax, dword ptr [rdi + 8]
    add eax, eax
    pop rbp
    ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
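In Rust terms, a sketch of what the optimizer effectively computed (a hypothetical hand-written equivalent, not compiler output):
pub fn testing_direct_equivalent(f: Foo) -> i32 {
    // (a + b) + (b + c) + (c + a) == 2 * (a + b + c) for integers.
    let partial = f.a + f.b + f.c;
    partial + partial
}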
That is, the compiler realized that:
f.X was added twice
f.X * 2 was equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (and to integer addition being associative and commutative).
For example, switching your code to f32 (still Copy, but its addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):
example::testing:
    push rbp
    mov rbp, rsp
    movss xmm1, dword ptr [rdi]
    movss xmm2, dword ptr [rdi + 4]
    movss xmm0, dword ptr [rdi + 8]
    movaps xmm3, xmm1
    addss xmm3, xmm2
    xorps xmm4, xmm4
    addss xmm4, xmm3
    addss xmm2, xmm0
    addss xmm2, xmm4
    addss xmm0, xmm1
    addss xmm0, xmm2
    pop rbp
    ret
And there's no trickery any longer.
So it's really not possible to infer much from your example, check with the real type.
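For reference, the f32 variant behind the assembly above is the same code with the field and return types swapped; a sketch (the name FooF32 is illustrative):
// Same shape as Foo above, but with f32 fields.
pub struct FooF32 {
    pub a: f32,
    pub b: f32,
    pub c: f32,
}

pub fn testing_direct_f32(f: FooF32) -> f32 {
    let mut sum = 0.0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}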
