Given I want to sum the first n terms of the series 1, 2, 3, … with the following function in Rust:
fn sum_sequence(x: u64) -> u64
{
    let mut s: u64 = 0;
    for n in 1..=x
    {
        s = s + n;
    }
    return s;
}
When I compile it for x64 architecture
cargo build --release
and run it with x=10000000000 the result is 13106511857580896768 - fine.
But when I compile this very function to WebAssembly (WASM)
cargo build --target wasm32-unknown-unknown --release
and run it with the same argument as before, x=10000000000,
wasmtime ./target/wasm32-unknown-unknown/release/sum_it.wasm --invoke sum_sequence 10000000000
Then the result is -5340232216128654848.
I would not have expected any deviation in results between Rust being compiled to x64 in comparison to Rust being compiled to WASM. Also, from the WASM text file (below), I do not see why I should get a negative result when I run it with WASM.
How does it come that WASM shows a different result, and what can I do to correct the calculation in WASM?
(module
  (type (;0;) (func (param i64) (result i64)))
  (func $sum_sequence (type 0) (param i64) (result i64)
    (local i64 i64 i32)
    block  ;; label = @1
      local.get 0
      i64.eqz
      i32.eqz
      br_if 0 (;@1;)
      i64.const 0
      return
    end
    i64.const 1
    local.set 1
    i64.const 0
    local.set 2
    block  ;; label = @1
      loop  ;; label = @2
        local.get 1
        local.get 2
        i64.add
        local.set 2
        local.get 1
        local.get 1
        local.get 0
        i64.lt_u
        local.tee 3
        i64.extend_i32_u
        i64.add
        local.tee 1
        local.get 0
        i64.gt_u
        br_if 1 (;@1;)
        local.get 3
        br_if 0 (;@2;)
      end
    end
    local.get 2)
  (table (;0;) 1 1 funcref)
  (memory (;0;) 16)
  (global (;0;) (mut i32) (i32.const 1048576))
  (global (;1;) i32 (i32.const 1048576))
  (global (;2;) i32 (i32.const 1048576))
  (export "memory" (memory 0))
  (export "sum_sequence" (func $sum_sequence))
  (export "__data_end" (global 1))
  (export "__heap_base" (global 2)))
It seems to be because WASM does not have a native u64 type; it only has a sign-agnostic i64, which is why i64 is used for the arithmetic operations. Since the sum overflows a 64-bit integer (the correct output is n * (n+1) / 2, or 50000000005000000000), you get a wrapped value, which is then printed to the console as a signed integer, hence the negative result.
Just for reference: Σ n=0 to N = N * (N+1) / 2, which I use from here on out since it's much faster computationally, and correct for our purposes.
The result, 50000000005000000000, takes ~65.4 bits to represent accurately, which is why you get wrapping behavior on both x86_64 and WASM; the wrapped bit patterns are identical, they are just printed with different signedness.
Using NumPy, we can clearly confirm this:
>>> import numpy as np
>>> a = np.uint64(10000000000)
>>> b = np.uint64(10000000001)
>>> (a >> np.uint64(1)) * b
13106511857580896768
>>> import numpy as np
>>> a = np.int64(10000000000)
>>> b = np.int64(10000000001)
>>> (a >> np.int64(1)) * b
-5340232216128654848
The values you are getting are due to unsigned and signed (two's complement) integer overflow. (Note: I'm using a right bitshift to simulate division by two; I could probably also use the // operator.)
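For completeness, the same wrapped value and both of its readings can be reproduced in Rust itself; this is a small sketch using the closed form with explicitly wrapping arithmetic:

```rust
fn main() {
    let n: u64 = 10_000_000_000;
    // Closed form n * (n + 1) / 2, computed with wrapping (mod 2^64) arithmetic.
    // n is even, so dividing n by 2 first avoids a second overflow.
    let wrapped = (n / 2).wrapping_mul(n + 1);
    println!("{}", wrapped); // 13106511857580896768 (unsigned reading)
    println!("{}", wrapped as i64); // -5340232216128654848 (two's-complement reading)
}
```

Both printed numbers are the same 64-bit pattern; only the interpretation differs.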
EDIT: Also, a good point was raised in the comments by Herohtar: it clearly overflows if run in debug mode, panicking with 'attempt to add with overflow'.
How can I divide int64?
let v: int64 = 100
echo v / 10
Error: type mismatch: got <int64, int literal(10)>
Full example
import math
proc sec_to_min*(sec: int64): int =
  let min = sec / 60 # <= error
  min.round.to_int
echo 100.sec_to_min
P.S.
Also, is there a way to safely cast int64 to int, so that the result would be int and not int64, with a check for overflow?
There has already been a bit of discussion about int64 division in this issue, and probably some improvements to the current state can be made. From the above issue:

a good reason for not having float division between int64 in stdlib is that it may incur a loss of precision, and so the user should explicitly convert int64 to float
still, float division between int types is present in stdlib
on 64 bit system int is int64 (and so you have division between int64 in 64 bit systems)
For your use case I think the following (playground) should work (better to use div instead of doing float division and then rounding off):
import math
proc sec_to_min*(sec: int64): int = sec.int div 60
echo 100.sec_to_min
let a = high(int64)
echo a.int # on playground this does not raise error since int is int64
echo a.int32 # this instead correctly raises error
output:
1
9223372036854775807
/usercode/in.nim(9) in
/playground/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: value out of range: 9223372036854775807 notin -2147483648 .. 2147483647 [RangeError]
P.S.: as you can see above, the standard conversion has range checks
Apparently division between int64 types is terribly dangerous because it invokes an undying horde of bike shedding, but at least you can create your own operator:
proc `/`(x, y: int64): int64 = x div y
let v: int64 = 100
echo v / 10
Or
proc `/`(x, y: int64): int64 = x div y
import math
proc sec_to_min*(sec: int64): int =
  int(sec / 60)
echo 100.sec_to_min
With regards to the int64 to int conversion, I'm not sure that makes much sense since most platforms will run int as an alias of int64. But of course you could be compiling/running on a 32 bit platform, where the loss would be tragic, so you can still do runtime checks:
let a = int64.high
echo "Unsurprising but potentially wrong ", int(a)
proc safe_int(big_int: int64): int =
  if big_int > int32.high:
    raise new_exception(Overflow_error, "Value is too high for 32 bit platforms")
  int(big_int)
echo "Reachable code ", safe_int(int32.high)
echo "Unreachable code ", safe_int(a)
Also, if you are running into confusing minute, hour, day conversions, you might want to look into distinct types to avoid adding months to seconds (or do so in a more safe way).
On page 322 of Programming Rust by Blandy and Orendorff is this claim:
...Rust...recognizes that there's a simpler way to sum the numbers from one to n: the sum is always equal to n * (n+1) / 2.
This is of course a fairly well-known equivalence, but how does the compiler recognize it? I'm guessing it's in an LLVM optimization pass, but is LLVM somehow deriving the equivalence from first principles, or does it just have some set of "common loop computations" that can be simplified to arithmetic operations?
First of all, let's demonstrate that this actually happens.
Starting with this code:
pub fn sum(start: i32, end: i32) -> i32 {
    let mut result = 0;
    for i in start..end {
        result += i;
    }
    return result;
}
And compiling in Release, we get:
; playground::sum
; Function Attrs: nounwind nonlazybind readnone uwtable
define i32 @_ZN10playground3sum17h41f12649b0533596E(i32 %start1, i32 %end) {
start:
  %0 = icmp slt i32 %start1, %end
  br i1 %0, label %bb5.preheader, label %bb6

bb5.preheader:                                    ; preds = %start
  %1 = xor i32 %start1, -1
  %2 = add i32 %1, %end
  %3 = add i32 %start1, 1
  %4 = mul i32 %2, %3
  %5 = zext i32 %2 to i33
  %6 = add i32 %end, -2
  %7 = sub i32 %6, %start1
  %8 = zext i32 %7 to i33
  %9 = mul i33 %5, %8
  %10 = lshr i33 %9, 1
  %11 = trunc i33 %10 to i32
  %12 = add i32 %4, %start1
  %13 = add i32 %12, %11
  br label %bb6

bb6:                                              ; preds = %bb5.preheader, %start
  %result.0.lcssa = phi i32 [ 0, %start ], [ %13, %bb5.preheader ]
  ret i32 %result.0.lcssa
}
Where we can indeed observe that there is no loop any longer.
Thus we validate the claim by Blandy and Orendorff.
As for how this occurs, my understanding is that this all happens in ScalarEvolution.cpp in LLVM. Unfortunately, that file is a 12,000+ line monstrosity, so navigating it is a tad complicated; still, the head comment hints that we are in the right place, and points to the papers it draws on, which mention optimizing loops and closed-form functions1:
//===----------------------------------------------------------------------===//
//
// There are several good references for the techniques used in this analysis.
//
// Chains of recurrences -- a method to expedite the evaluation
// of closed-form functions
// Olaf Bachmann, Paul S. Wang, Eugene V. Zima
//
// On computational properties of chains of recurrences
// Eugene V. Zima
//
// Symbolic Evaluation of Chains of Recurrences for Loop Optimization
// Robert A. van Engelen
//
// Efficient Symbolic Analysis for Optimizing Compilers
// Robert A. van Engelen
//
// Using the chains of recurrences algebra for data dependence testing and
// induction variable substitution
// MS Thesis, Johnie Birch
//
//===----------------------------------------------------------------------===//
According to this blog article by Krister Walfridsson, it builds up chains of recurrences, which can be used to obtain a closed-form formula for each inductive variable.
This is a mid-point between full reasoning and full hardcoding:
Pattern-matching is used to build the chains of recurrence, so LLVM may not recognize all ways of expressing a certain computation.
A large variety of formulas can be optimized, not only the triangle sum.
The article also notes that the optimization may end up pessimizing the code: a small number of iterations can be faster if the "optimized" code requires a larger number of operations compared to the inner body of the loop.
1 n * (n+1) / 2 is the closed-form function to compute the sum of numbers in [0, n].
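As a quick sanity check on that closed form (a sketch of the arithmetic, not the compiler's actual output), one can compare the naïve loop against the formula the optimizer effectively produces for a half-open range:

```rust
fn sum_loop(start: i32, end: i32) -> i32 {
    (start..end).sum()
}

// Closed form for the sum of i over [start, end):
// n * start + n * (n - 1) / 2, where n = end - start.
// Intermediate math is done in i64 to avoid overflow in the product.
fn sum_closed(start: i32, end: i32) -> i32 {
    if start >= end {
        return 0;
    }
    let n = (end - start) as i64;
    let s = start as i64;
    (n * s + n * (n - 1) / 2) as i32
}

fn main() {
    assert_eq!(sum_loop(0, 10), sum_closed(0, 10)); // both are 45
    assert_eq!(sum_loop(-3, 7), sum_closed(-3, 7)); // both are 15
}
```

The loop disappears in the closed form, which is exactly the transformation scalar evolution performs.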
I must admit I'm a bit lost with macros.
I want to build a macro that does the following task and
I'm not sure how to do it. I want to perform a scalar product
of two arrays, say x and y, which have the same length N.
The result I want to compute is of the form:
z = sum_{i=0}^{N-1} x[i] * y[i].
x is a const whose elements are 0, 1, or -1 and are known at compile time,
while y's elements are determined at runtime. Because of the
structure of x, many computations are useless (terms multiplied by 0
can be removed from the sum, and multiplications of the form 1 * y[i] or -1 * y[i] can be replaced by y[i] or -y[i], respectively).
As an example if x = [-1, 1, 0], the scalar product above would be
z=-1 * y[0] + 1 * y[1] + 0 * y[2]
To speed up my computation I can unroll the loop by hand and rewrite
the whole thing without x[i], and I could hard code the above formula as
z = -y[0] + y[1]
But this procedure is not elegant, error prone
and very tedious when N becomes large.
I'm pretty sure I can do that with a macro, but I don't know where to
start (the different books I read are not going too deep into macros and
I'm stuck)...
Would any of you have an idea how to solve this problem (if it is possible) using macros?
Thank you in advance for your help!
Edit: As pointed out in many of the answers, the compiler is smart enough to optimize the loop away in the case of integers. I am not only using integers but also floats (the x array is i32s, but in general y is f64s), so the compiler is not able (and rightfully so) to optimize the loop away. The following piece of code gives the following asm.
const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    X.iter().zip(y.iter()).map(|(i, j)| (*i as f64) * j).sum()
}
playground::dot_x:
    xorpd %xmm0, %xmm0
    movsd (%rdi), %xmm1
    mulsd %xmm0, %xmm1
    addsd %xmm0, %xmm1
    addsd 8(%rdi), %xmm1
    subsd 16(%rdi), %xmm1
    movupd 24(%rdi), %xmm2
    xorpd %xmm3, %xmm3
    mulpd %xmm2, %xmm3
    addsd %xmm3, %xmm1
    unpckhpd %xmm3, %xmm3
    addsd %xmm1, %xmm3
    addsd 40(%rdi), %xmm3
    mulsd 48(%rdi), %xmm0
    addsd %xmm3, %xmm0
    subsd 56(%rdi), %xmm0
    retq
First of all, a (proc) macro simply cannot look inside your array x. All it gets are the tokens you pass it, without any context. If you want it to know about the values (0, 1, -1), you need to pass those to the macro directly:
let result = your_macro!(y, -1, 0, 1, -1);
But you don't really need a macro for this. The compiler optimizes a lot, as also shown in the other answers. However, it will not, as you already mention in your edit, optimize away 0.0 * x[i], as the result of that is not always 0.0. (It could be -0.0 or NaN for example.) What we can do here, is simply help the optimizer a bit by using a match or if, to make sure it does nothing for the 0.0 * y case:
const X: [i32; 8] = [0, -1, 0, 0, 0, 0, 1, 0];

fn foobar(y: [f64; 8]) -> f64 {
    let mut sum = 0.0;
    for (&x, &y) in X.iter().zip(&y) {
        if x != 0 {
            sum += x as f64 * y;
        }
    }
    sum
}
In release mode, the loop is unrolled and the values of X inlined, resulting in most iterations being thrown away as they don't do anything. The only thing left in the resulting binary (on x86_64), is:
foobar:
    xorpd xmm0, xmm0
    subsd xmm0, qword ptr [rdi + 8]
    addsd xmm0, qword ptr [rdi + 48]
    ret
(As suggested by @lu-zero, this can also be done using filter_map. That will look like this: X.iter().zip(&y).filter_map(|(&x, &y)| match x { 0 => None, _ => Some(x as f64 * y) }).sum(), and gives the exact same generated assembly. Or even without a match, by using filter and map separately: .filter(|(&x, _)| x != 0).map(|(&x, &y)| x as f64 * y).sum().)
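Written out as a complete function (using the same X constant as above), the filter_map variant looks like this:

```rust
const X: [i32; 8] = [0, -1, 0, 0, 0, 0, 1, 0];

pub fn foobar(y: [f64; 8]) -> f64 {
    X.iter()
        .zip(&y)
        // Skip the zero coefficients entirely instead of multiplying by 0.0.
        .filter_map(|(&x, &y)| match x {
            0 => None,
            _ => Some(x as f64 * y),
        })
        .sum()
}
```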
Pretty good! However, this function calculates 0.0 - y[1] + y[6], since sum started at 0.0 and we only subtract and add things to it. The optimizer is again not willing to optimize away a 0.0. We can help it a bit more by not starting at 0.0, but starting with None:
fn foobar(y: [f64; 8]) -> f64 {
    let mut sum = None;
    for (&x, &y) in X.iter().zip(&y) {
        if x != 0 {
            let p = x as f64 * y;
            sum = Some(sum.map_or(p, |s| s + p));
        }
    }
    sum.unwrap_or(0.0)
}
This results in:
foobar:
movsd xmm0, qword, ptr, [rdi, +, 48]
subsd xmm0, qword, ptr, [rdi, +, 8]
ret
Which simply does y[6] - y[1]. Bingo!
You may be able to achieve your goal with a macro that returns a function.
First, write this function without a macro. This one takes a fixed number of parameters.
fn main() {
    println!("Hello, world!");
    let func = gen_sum([1,2,3]);
    println!("{}", func([4,5,6])) // 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
}

fn gen_sum(xs: [i32; 3]) -> impl Fn([i32; 3]) -> i32 {
    move |ys| ys[0]*xs[0] + ys[1]*xs[1] + ys[2]*xs[2]
}
Now, completely rewrite it, because the prior design doesn't work well as a macro. We had to give up on fixed-size arrays, as macros appear unable to allocate them.
Rust Playground
fn main() {
    let func = gen_sum!(1,2,3);
    println!("{}", func(vec![4,5,6])) // 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
}

#[macro_export]
macro_rules! gen_sum {
    ( $( $x:expr ),* ) => {
        {
            let mut xs = Vec::new();
            $(
                xs.push($x);
            )*
            move |ys: Vec<i32>| {
                if xs.len() != ys.len() {
                    panic!("lengths don't match")
                }
                let mut total = 0;
                for i in 0..xs.len() {
                    total += xs[i] * ys[i];
                }
                total
            }
        }
    };
}
What does this do/What should it do
At compile time, it generates a lambda. This lambda accepts a list of numbers and multiplies it element-wise with a vec that was generated at compile time. I don't think this is exactly what you were after, as it does not optimize away zeroes at compile time. You could optimize away zeroes at compile time, but you would necessarily incur some cost at run time by having to check where the zeroes were in x to determine which elements of y to multiply. (You could even make this lookup constant-time using a hashset, though it's still probably not worth it in general, where I presume 0 is not all that common.) Computers are better at doing one "inefficient" thing than at detecting that the thing they're about to do is "inefficient" and then skipping it; this abstraction only breaks down when a significant portion of the operations they do are "inefficient".
Follow-up
Was that worth it? Does it improve run times? I didn't measure, but it seems like understanding and maintaining the macro I wrote isn't worth it compared to just using a function. Writing a macro that does the zero optimization you talked about would probably be even less pleasant.
In many cases, the optimisation stage of the compiler will take care of this for you. To give an example, this function definition
const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [i32; 8]) -> i32 {
    X.iter().zip(y.iter()).map(|(i, j)| i * j).sum()
}
results in this assembly output on x86_64:
playground::dot_x:
    mov eax, dword ptr [rdi + 4]
    sub eax, dword ptr [rdi + 8]
    add eax, dword ptr [rdi + 20]
    sub eax, dword ptr [rdi + 28]
    ret
You won't be able to get any more optimised version than this, so simply writing the code in a naïve way is the best solution. Whether the compiler will unroll the loop for longer vectors is unclear, and it may change with compiler versions.
For floating-point numbers, the compiler is not normally able to perform all the optimisations above, since the numbers in y are not guaranteed to be finite – they could also be NaN, inf or -inf. For this reason, multiplying with 0.0 is not guaranteed to result in 0.0 again, so the compiler needs to keep the multiplication instructions in the code. You can explicitly allow it to assume all numbers are finite, though, by using the fmul_fast() intrinsic function:
#![feature(core_intrinsics)]
use std::intrinsics::fmul_fast;

const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    X.iter().zip(y.iter()).map(|(i, j)| unsafe { fmul_fast(*i as f64, *j) }).sum()
}
This results in the following assembly code:
playground::dot_x: # @playground::dot_x
# %bb.0:
    xorpd xmm1, xmm1
    movsd xmm0, qword ptr [rdi + 8] # xmm0 = mem[0],zero
    addsd xmm0, xmm1
    subsd xmm0, qword ptr [rdi + 16]
    addsd xmm0, xmm1
    addsd xmm0, qword ptr [rdi + 40]
    addsd xmm0, xmm1
    subsd xmm0, qword ptr [rdi + 56]
    ret
This still redundantly adds zeros between the steps, but I would not expect this to result in any measurable overhead for realistic CFD simulations, since such simulations tend to be limited by memory bandwidth rather than CPU. If you want to avoid these additions as well, you need to use fadd_fast() for the additions to allow the compiler to optimise further:
#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

const X: [i32; 8] = [0, 1, -1, 0, 0, 1, 0, -1];

pub fn dot_x(y: [f64; 8]) -> f64 {
    let mut result = 0.0;
    for (&i, &j) in X.iter().zip(y.iter()) {
        unsafe { result = fadd_fast(result, fmul_fast(i as f64, j)); }
    }
    result
}
This results in the following assembly code:
playground::dot_x: # @playground::dot_x
# %bb.0:
    movsd xmm0, qword ptr [rdi + 8] # xmm0 = mem[0],zero
    subsd xmm0, qword ptr [rdi + 16]
    addsd xmm0, qword ptr [rdi + 40]
    subsd xmm0, qword ptr [rdi + 56]
    ret
As with all optimisations, you should start with the most readable and maintainable version of the code. If performance becomes an issue, you should profile your code and find the bottlenecks. As the next step, try to improve the fundamental approach, e.g. by using an algorithm with better asymptotic complexity. Only then should you turn to micro-optimisations like the one you suggested in the question.
If you can spare an #[inline(always)] probably using an explicit filter_map() should be enough to have the compiler do what you want.
In the below code:
c := "fool"
d := []byte("fool")
fmt.Printf("c: %T, %d\n", c, unsafe.Sizeof(c)) // 16 bytes
fmt.Printf("d: %T, %d\n", d, unsafe.Sizeof(d)) // 24 bytes
To decide the datatype needed to receive JSON data from CloudFoundry, I am testing the above sample code to understand the memory allocation for []byte vs string types.
The expected size of the string variable c is 1 byte x 4 ASCII-encoded letters = 4 bytes, but unsafe.Sizeof reports 16 bytes.
For the byte-slice variable d, Go embeds the string in the executable as a string literal and converts it to a byte slice at runtime using the runtime.stringtoslicebyte function. Something like... []byte{102, 111, 111, 108}
The expected size of d is again 1 byte x 4 ASCII values = 4 bytes, but its size shows as 24 bytes.
Why is the size of both variables not 4 bytes?
Both slices and strings in Go are struct-like headers:
reflect.SliceHeader:
type SliceHeader struct {
    Data uintptr
    Len  int
    Cap  int
}
reflect.StringHeader:
type StringHeader struct {
    Data uintptr
    Len  int
}
The sizes reported by unsafe.Sizeof() are the sizes of these headers, excluding the size of the pointed-to arrays:
Sizeof takes an expression x of any type and returns the size in bytes of a hypothetical variable v as if v was declared via var v = x. The size does not include any memory possibly referenced by x. For instance, if x is a slice, Sizeof returns the size of the slice descriptor, not the size of the memory referenced by the slice.
To get the actual ("recursive") size of some arbitrary value, use Go's builtin testing and benchmarking framework. For details, see How to get memory size of variable in Go?
For strings specifically, see String memory usage in Golang. The complete memory required by a string value can be computed like this:
var str string = "some string"
stringSize := len(str) + int(unsafe.Sizeof(str))