Rust: Collect result of chain into byte array

I have the following two variables: s, which has type Option<&[u8; 32]>, and k, which has type &[u8; 32]. I would like to concatenate s and k into a byte array [u8; 64] using the chain method (I'm using chain instead of concat because I've read it performs much better); however, it seems I'm doing something wrong.
Here is a simplified example of what I'm trying to do:
fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = *s.unwrap();
    let tmpk = *k;
    let result = tmps.iter().chain(tmpk.iter()).collect();
    result
}
fn main() {
    let s = [47u8; 32];
    let k = [23u8; 32];
    println!("s: {:?}", s);
    println!("k: {:?}", k);
    let sk = combine(Some(&s), &k);
    println!("sk: {:?}", sk);
}
The error I'm getting: a value of type '[u8; 64]' cannot be built from an iterator over elements of type '&u8'

fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = s.unwrap();
    let mut result = [0; 64];
    let (left, right) = result.split_at_mut(32);
    left.copy_from_slice(&tmps[..]);
    right.copy_from_slice(&k[..]);
    result
}
slice::split_at_mut returns two mutable slices.
And from slice::copy_from_slice(&mut self, src: &[T]) docs:
Copies all elements from src into self, using a memcpy.
The resulting assembly (Godbolt) is easily vectorized:
example::combine:
push rax
test rsi, rsi
je .LBB0_1
movups xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rsi + 16]
movups xmmword ptr [rdi + 16], xmm1
movups xmmword ptr [rdi], xmm0
movups xmm0, xmmword ptr [rdx]
movups xmm1, xmmword ptr [rdx + 16]
movups xmmword ptr [rdi + 32], xmm0
movups xmmword ptr [rdi + 48], xmm1
mov rax, rdi
pop rcx
ret
.LBB0_1:
lea rdi, [rip + .L__unnamed_1]
lea rdx, [rip + .L__unnamed_2]
mov esi, 43
call qword ptr [rip + core::panicking::panic#GOTPCREL]
ud2

Here is a solution using the arrayvec crate.
use arrayvec::ArrayVec; // 0.7.0

fn combine(s: Option<&[u8; 32]>, k: &[u8; 32]) -> [u8; 64] {
    let tmps = *s.unwrap();
    let tmpk = *k;
    let result: ArrayVec<u8, 64> = tmps.iter().chain(tmpk.iter()).copied().collect();
    result.into_inner().unwrap()
}

Related

Representing a limited-range integer with an enum in Rust?

I need an integral type that has a predefined limited range that includes 0, and want to implement it like this:
#[repr(u8)]
pub enum X { A, B, C, D, E, F, G, H }

impl From<u8> for X {
    fn from(x: u8) -> X {
        unsafe { std::mem::transmute(x & 0b111) }
    }
}
When I need the integer value, I would cast with as u8. Arithmetic ops would be implemented by casting to u8 then converting back into the enum using from. And because I limit the range with the bitand when converting from u8 to the enum, I'm always in range of the enum.
Some benefits I can see are that the range is known to the compiler so it can skip bounds checking, and enum optimizations such as representing Option<X> as 1 byte.
A drawback I can see via assembly is that I incur and al, 7 every time I convert to enum, but I can live with that.
Is this a sound transmutation of u8 into the enum? What are other drawbacks of representing a limited range integer this way, if any?
I don't think there is anything wrong with this transmutation, in that it is likely sound. However, I believe it is unnecessary.
If performance is critical for your application, you should test on your target arch, but I used the Rust playground to show the generated ASM (for whatever arch the playground runs on):
Your version:
#[repr(u8)]
#[derive(Debug)]
pub enum X { A, B, C, D, E, F, G, H }

impl From<u8> for X {
    fn from(x: u8) -> X {
        unsafe { std::mem::transmute(x & 0b111) }
    }
}

#[no_mangle]
fn do_it_x(a: u8) -> X {
    a.into()
}
Explicit match:
#[repr(u8)]
#[derive(Debug)]
pub enum Y { A, B, C, D, E, F, G, H }

impl From<u8> for Y {
    fn from(y: u8) -> Y {
        match y & 0b111 {
            0 => Y::A,
            1 => Y::B,
            2 => Y::C,
            3 => Y::D,
            4 => Y::E,
            5 => Y::F,
            6 => Y::G,
            7 => Y::H,
            _ => unreachable!(),
        }
    }
}

#[no_mangle]
fn do_it_y(a: u8) -> Y {
    a.into()
}
The resulting assembly (from the playground at least) is:
do_it_x:
pushq %rax
movb %dil, %al
movb %al, 7(%rsp)
movzbl %al, %edi
callq <T as core::convert::Into<U>>::into
movb %al, 6(%rsp)
movb 6(%rsp), %al
popq %rcx
retq
do_it_y:
pushq %rax
movb %dil, %al
movb %al, 7(%rsp)
movzbl %al, %edi
callq <T as core::convert::Into<U>>::into
movb %al, 6(%rsp)
movb 6(%rsp), %al
popq %rcx
retq
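The two listings are identical, which supports the claim that the unsafe version buys nothing. Beyond comparing assembly, one can also check exhaustively that the two conversions agree for every possible input; here is a small sketch (a hypothetical harness that folds the match into a free function, not from the original answer):

```rust
// Exhaustive check that the transmute-based and match-based
// conversions agree for every u8 value.
#[repr(u8)]
#[derive(Debug, PartialEq)]
pub enum X { A, B, C, D, E, F, G, H }

impl From<u8> for X {
    fn from(x: u8) -> X {
        // Sound: x & 0b111 is always in 0..=7, matching the discriminants.
        unsafe { std::mem::transmute(x & 0b111) }
    }
}

// The explicit-match version, as a free function for comparison.
fn from_match(y: u8) -> X {
    match y & 0b111 {
        0 => X::A, 1 => X::B, 2 => X::C, 3 => X::D,
        4 => X::E, 5 => X::F, 6 => X::G, 7 => X::H,
        _ => unreachable!(),
    }
}

fn main() {
    for b in 0u8..=255 {
        assert_eq!(X::from(b), from_match(b));
    }
}
```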

The impact of avoiding let variable bindings [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 months ago.
Having fun with a project using gdnative, I wrote this:
#[method]
fn _save_player_position(&mut self, player_current_position: VariantArray) {
    let player_current_position: (f64, f64) = (
        player_current_position.get(0).to::<f64>().unwrap(),
        player_current_position.get(1).to::<f64>().unwrap()
    );
    self.player_data.set_player_position(player_current_position.0, player_current_position.1);
    self.received_signals += 1;
}
My doubt is: do you "win" any benefit by rewriting the code like this:
#[method]
fn _save_player_position(&mut self, player_current_position: VariantArray) {
    self.player_data.set_player_position(
        player_current_position.get(0).to::<f64>().unwrap(),
        player_current_position.get(1).to::<f64>().unwrap()
    );
    self.received_signals += 1;
}
As far as I know, I am avoiding:
The creation of a new tuple
Storing the data in its unnamed fields
Binding the data with the let player_current_position
Then moving the data into one of the self fields
And the questions are:
Is the above true?
Is it worth writing code like this in order to avoid allocations (even if they are on the stack)?
Is it better to optimize only heap allocations and improve readability whenever possible?
You can check the compiler output for both cases (slightly rewritten for clarity) here:
https://godbolt.org/z/Gc4nr6afb
struct Foo {
    pos: (f64, f64),
}

impl Foo {
    fn bar(&mut self, current: (f64, f64)) {
        let player_current_position: (f64, f64) = (
            current.0,
            current.1,
        );
        self.pos = (player_current_position.0, player_current_position.1);
    }

    fn bar2(&mut self, current: (f64, f64)) {
        self.pos = (current.0, current.1);
    }
}

pub fn main() {
    let mut foo = Foo { pos: (1.0, 1.0) };
    foo.bar((2.0, 2.0));
    foo.bar2((2.0, 2.0));
}
this is with local variables:
Foo::bar:
sub rsp, 16
movsd qword ptr [rsp], xmm0
movsd qword ptr [rsp + 8], xmm1
movsd xmm1, qword ptr [rsp]
movsd xmm0, qword ptr [rsp + 8]
movsd qword ptr [rdi], xmm1
movsd qword ptr [rdi + 8], xmm0
add rsp, 16
ret
and this is without
Foo::bar2:
movsd qword ptr [rdi], xmm0
movsd qword ptr [rdi + 8], xmm1
ret
Note that, as per @Jmb's comment, once compiler optimization is enabled the output will be identical: https://godbolt.org/z/xKPfMP6Yr

How to initialize array with one non-zero value

I'm creating a number type that uses arrays to store the numbers. To implement the trait One, I find myself writing code like this:
fn one() -> Self {
    let mut ret_array = [0; N];
    ret_array[0] = 1;
    Self(ret_array)
}
Is there an alternative way to initialize an array with one non-zero element?
I don't think so, no.
But the Rust compiler understands what you are trying to achieve and optimizes it accordingly:
pub fn one<const N: usize>() -> [i32; N] {
    let mut ret_array = [0; N];
    ret_array[0] = 1;
    ret_array
}

pub fn one_with_length_5() -> [i32; 5] {
    one()
}
example::one_with_length_5:
mov rax, rdi
xorps xmm0, xmm0
movups xmmword ptr [rdi + 4], xmm0
mov dword ptr [rdi], 1
ret
xorps xmm0, xmm0 sets the 16-byte (or 4-int) SSE register xmm0 to [0,0,0,0].
movups xmmword ptr [rdi + 4], xmm0 copies all 4 ints of the xmm0 register to the location [rdi + 4], which is the elements 1, 2, 3 and 4 of ret_array.
mov dword ptr [rdi], 1 moves the value 1 to the first element of the ret_array.
ret_array is at the location of [rdi], [rdi + 4] is the element at position ret_array[1], [rdi + 8] is the element at position ret_array[2], etc.
As you can see, it only initializes the other four values with 0, and then sets the first value to 1. The first value does not get written twice.
Small remark
If you set N to e.g. 8, it does actually write the value twice:
example::one_with_length_8:
mov rax, rdi
xorps xmm0, xmm0
movups xmmword ptr [rdi + 16], xmm0
movups xmmword ptr [rdi + 4], xmm0
mov dword ptr [rdi], 1
ret
Interestingly, it doesn't actually write element [0] twice, but element [4]: it first writes elements [1,2,3,4], then [4,5,6,7], and then the 1 to [0].
But that's because this is the fastest way to do it: it keeps four ints of zeros in an SSE register and zero-initializes the array four ints at a time. Writing one int twice is faster than initializing the remaining values without the help of the 4-int SSE stores.
This would even happen if you initialized it completely manually:
pub fn one_with_length_8() -> [i32; 8] {
    [1, 0, 0, 0, 0, 0, 0, 0]
}
example::one_with_length_8:
mov rax, rdi
mov dword ptr [rdi], 1
xorps xmm0, xmm0
movups xmmword ptr [rdi + 4], xmm0
movups xmmword ptr [rdi + 16], xmm0
ret
You can see the order is different, but the instructions are identical.
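As an aside (an assumption beyond the original answer): if you prefer a single expression, std::array::from_fn (stable since Rust 1.63) expresses the same initialization, though the codegen is not guaranteed to match the zero-then-set version, so it is worth checking on Godbolt for your N:

```rust
// Alternative sketch using std::array::from_fn (Rust 1.63+).
// Same result as zeroing then setting index 0; codegen may differ.
pub fn one<const N: usize>() -> [i32; N] {
    std::array::from_fn(|i| if i == 0 { 1 } else { 0 })
}

fn main() {
    assert_eq!(one::<5>(), [1, 0, 0, 0, 0]);
    assert_eq!(one::<8>(), [1, 0, 0, 0, 0, 0, 0, 0]);
}
```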

Is there a standard way of cyclically rotating mutable variables in Rust?

A very common operation in implementing algorithms is the cyclic rotate: given, say, 3 variables a, b, c change them to the effect of
t ⇽ c
c ⇽ b
b ⇽ a
a ⇽ t
Given that everything is bitwise swappable, cyclic rotation should be an area where Rust excels more than any other language I know of.
For comparison, in C++ the most efficient generic way to rotate N elements is performing n+1 std::move operations, which in turn roughly leads to (for a typical move-constructor implementation) 3(n+1)·sizeof(T) word assignments (this can be improved for PODs by template-specializing rotate, but that requires work).
In Rust, the language makes it possible to implement rotate with only (n+1)·size_of::<T>() word assignments. To my surprise, I could not find standard library support for rotation (there is no rotate method in std::mem). It would probably look like this:
pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    unsafe {
        // NOTE: mem::uninitialized is deprecated since Rust 1.39;
        // MaybeUninit is the modern replacement.
        let mut t: T = std::mem::uninitialized();
        std::ptr::copy_nonoverlapping(&*z, &mut t, 1);
        std::ptr::copy_nonoverlapping(&*y, z, 1);
        std::ptr::copy_nonoverlapping(&*x, y, 1);
        std::ptr::copy_nonoverlapping(&t, x, 1);
        std::mem::forget(t);
    }
}
For clarification on why rotation cannot be implemented efficiently in C++, consider:
struct String {
    char *data1;
    char *data2;
    String(String &&other) : data1(other.data1), data2(other.data2) {
        other.data1 = other.data2 = nullptr;
    }
    String &operator=(String &&other) {
        std::swap(data1, other.data1);
        std::swap(data2, other.data2);
        return *this;
    }
    ~String() { delete [] data1; delete [] data2; }
};
Here an operation like s2 = std::move(s1); takes 3 pointer assignments for each member field, totaling 6 assignments, since a pointer swap requires 3 assignments (one into the temporary, one out of it, and one across the operands).
Is there a standard way of cyclically rotating mutable variables in Rust?
No.
I'd just swap the variables twice, no need for unsafe:
use std::mem;

pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    mem::swap(x, y);
    mem::swap(y, z);
}

fn main() {
    let mut a = 1;
    let mut b = 2;
    let mut c = 3;

    println!("{}, {}, {}", a, b, c);
    // 1, 2, 3

    rotate(&mut a, &mut b, &mut c);

    println!("{}, {}, {}", a, b, c);
    // 2, 3, 1
}
This produces 7 movl instructions (Rust 1.35.0, Release, x86_64, Linux)
playground::rotate:
movl (%rdi), %eax
movl (%rsi), %ecx
movl %ecx, (%rdi)
movl %eax, (%rsi)
movl (%rdx), %ecx
movl %ecx, (%rsi)
movl %eax, (%rdx)
retq
As opposed to the original 6 movl instructions:
playground::rotate_original:
movl (%rdx), %eax
movl (%rsi), %ecx
movl %ecx, (%rdx)
movl (%rdi), %ecx
movl %ecx, (%rsi)
movl %eax, (%rdi)
retq
I'm OK giving up that single instruction for purely safe code that is also easier to reason about.
In "real" code, I'd make use of the fact that all the variables are the same type and that slice::rotate_left and slice::rotate_right exist:
fn main() {
    let mut vals = [1, 2, 3];

    let [a, b, c] = &vals;
    println!("{}, {}, {}", a, b, c);
    // 1, 2, 3

    vals.rotate_left(1);

    let [a, b, c] = &vals;
    println!("{}, {}, {}", a, b, c);
    // 2, 3, 1
}
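As a quick sanity check (a small harness, not part of the original answer), the two-swap rotate and slice::rotate_left produce the same cycle:

```rust
use std::mem;

// The two-swap rotate from the answer above.
pub fn rotate<T>(x: &mut T, y: &mut T, z: &mut T) {
    mem::swap(x, y);
    mem::swap(y, z);
}

fn main() {
    // Rotate three separate variables.
    let (mut a, mut b, mut c) = (1, 2, 3);
    rotate(&mut a, &mut b, &mut c);
    assert_eq!((a, b, c), (2, 3, 1));

    // Same cycle via the slice method.
    let mut vals = [1, 2, 3];
    vals.rotate_left(1);
    assert_eq!(vals, [2, 3, 1]);
}
```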

Why is the produced assembly not equivalent between returning by reference and copy when inlined?

I have a small struct:
pub struct Foo {
    pub a: i32,
    pub b: i32,
    pub c: i32,
}
I was using pairs of the fields in the form (a,b) (b,c) (c,a). To avoid duplication of the code, I created a utility function which would allow me to iterate over the pairs:
fn get_foo_ref(&self) -> [(&i32, &i32); 3] {
    [(&self.a, &self.b), (&self.b, &self.c), (&self.c, &self.a)]
}
I had to decide if I should return the values as references or copy the i32. Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references. I expected the resulting code should be equivalent since everything would be inlined.
I am generally optimistic about optimizations, so I suspected that the code would be equivalent when using this function as compared to hand written code examples.
First the variant using the function:
pub fn testing_ref(f: Foo) -> i32 {
    let mut sum = 0;
    for i in 0..3 {
        let (l, r) = f.get_foo_ref()[i];
        sum += *l + *r;
    }
    sum
}
Then the hand-written variant:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
To my disappointment, all 3 methods resulted in different assembly code. The worst code was generated for the case with references, and the best code was the one that didn't use my utility function at all. Why is that? Shouldn't the compiler generate equivalent code in this case?
You can view the resulting assembly code on Godbolt; I also have the 'equivalent' assembly code from C++.
In C++, the compiler generated equivalent code between get_foo and get_foo_ref, although I don't understand why the code for all 3 cases is not equivalent.
Why did the compiler not generate equivalent code for all 3 cases?
Update:
I've modified slightly code to use arrays and to add one more direct case.
Rust version with f64 and arrays
C++ version with f64 and arrays
This time the generated code in C++ is exactly the same. However, the Rust assembly differs, and returning by reference results in worse assembly.
Well, I guess this is another example that nothing can be taken for granted.
TL;DR: Microbenchmarks are tricky; instruction count does not directly translate into high or low performance.
Later on, I plan to switch to a non-Copy type instead of an i32, so I decided to use references.
Then, you should check the generated assembly for your new type.
In your optimized example, the compiler is being very crafty:
pub fn testing_direct(f: Foo) -> i32 {
    let mut sum = 0;
    sum += f.a + f.b;
    sum += f.b + f.c;
    sum += f.c + f.a;
    sum
}
Yields:
example::testing_direct:
push rbp
mov rbp, rsp
mov eax, dword ptr [rdi + 4]
add eax, dword ptr [rdi]
add eax, dword ptr [rdi + 8]
add eax, eax
pop rbp
ret
Which is roughly sum += f.a; sum += f.b; sum += f.c; sum += sum;.
That is, the compiler realized that:
f.X was added twice
f.X * 2 is equivalent to adding it twice
While the former may be inhibited in the other cases by the use of indirection, the latter is VERY specific to i32 (and to integer addition being associative).
For example, switching your code to f32 (still Copy, but addition is no longer associative), I get the very same assembly for both testing_direct and testing (and slightly different assembly for testing_ref):
example::testing:
push rbp
mov rbp, rsp
movss xmm1, dword ptr [rdi]
movss xmm2, dword ptr [rdi + 4]
movss xmm0, dword ptr [rdi + 8]
movaps xmm3, xmm1
addss xmm3, xmm2
xorps xmm4, xmm4
addss xmm4, xmm3
addss xmm2, xmm0
addss xmm2, xmm4
addss xmm0, xmm1
addss xmm0, xmm2
pop rbp
ret
And there's no trickery any longer.
So it's really not possible to infer much from your example, check with the real type.
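To spell out the rewrite the optimizer found: for integers, (a+b) + (b+c) + (c+a) equals 2*(a+b+c), which is exactly what licenses the add eax, eax. A minimal sketch (hypothetical function names) demonstrating the identity, plus the classic float counterexample for reassociation:

```rust
// The reassociation the compiler applied: each field appears in two pairs,
// so the pairwise sums collapse to twice the plain sum.
fn sum_pairs(a: i32, b: i32, c: i32) -> i32 {
    (a + b) + (b + c) + (c + a)
}

fn doubled(a: i32, b: i32, c: i32) -> i32 {
    2 * (a + b + c)
}

fn main() {
    // Holds for all integers (wrapping behavior aside).
    assert_eq!(sum_pairs(1, 2, 3), doubled(1, 2, 3));

    // Floats: addition is commutative but NOT associative,
    // so the compiler may not reassociate like this.
    assert_ne!((1e16 + 1.0) + 1.0, 1e16 + (1.0 + 1.0));
}
```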
