Given I want to sum the first n terms of a series 1,2,3,.. with the following function in Rust
fn sum_sequence(x: u64) -> u64
{
let mut s: u64 = 0;
for n in 1..=x
{
s = s + n;
}
return s;
}
When I compile it for x64 architecture
cargo build --release
and run it with x=10000000000 the result is 13106511857580896768 - fine.
But when I compile this very function to Webassembly (WASM)
cargo build --target wasm32-unknown-unknown --release
and run it with the same argument as before, x=10000000000,
wasmtime ./target/wasm32-unknown-unknown/release/sum_it.wasm --invoke sum_sequence 1000000000
Then the result is -5340232216128654848.
I would not have expected any deviation in results between Rust being compiled to x64 in comparison to Rust being compiled to WASM. Also, from the WASM text file (below), I do not see why I should get a negative result when I run it with WASM.
How does it come that WASM shows a different result and what can I do do correct the calculation of WASM?
(module
(type (;0;) (func (param i64) (result i64)))
(func $sum_sequence (type 0) (param i64) (result i64)
(local i64 i64 i32)
block ;; label = #1
local.get 0
i64.eqz
i32.eqz
br_if 0 (;#1;)
i64.const 0
return
end
i64.const 1
local.set 1
i64.const 0
local.set 2
block ;; label = #1
loop ;; label = #2
local.get 1
local.get 2
i64.add
local.set 2
local.get 1
local.get 1
local.get 0
i64.lt_u
local.tee 3
i64.extend_i32_u
i64.add
local.tee 1
local.get 0
i64.gt_u
br_if 1 (;#1;)
local.get 3
br_if 0 (;#2;)
end
end
local.get 2)
(table (;0;) 1 1 funcref)
(memory (;0;) 16)
(global (;0;) (mut i32) (i32.const 1048576))
(global (;1;) i32 (i32.const 1048576))
(global (;2;) i32 (i32.const 1048576))
(export "memory" (memory 0))
(export "sum" (func $sum))
(export "__data_end" (global 1))
(export "__heap_base" (global 2)))
It seems to be because wasm does not support native u64 as a type, only the signed variants (notably, i64), which is why it's using i64 as the type for the arithmetic operations. Since this then overflows a 64-bit integer (the correct output is n * (n+1) / 2, or 50000000005000000000, you're getting a negative value due to the overflow, which is then getting printed to console. This is due to a lack of type support in wasm.
Just for reference, a Σ n=0 to N := (N * (N+1) / 2, which I use from here on out since it's much faster computationally, and correct for our purposes.
The result, 50000000005000000000, takes ~65.4 bits in memory to accurately represent in memory, which is why you get wrapping behavior for x86_64 and wasm, just the types it wraps to are different.
Using NumPy, we can clearly confirm this:
>>> import numpy as np
>>> a = np.uint64(10000000000)
>>> b = np.uint64(10000000001)
>>> (a >> np.uint64(1)) * b
13106511857580896768
>>> import numpy as np
>>> a = np.int64(10000000000)
>>> b = np.int64(10000000001)
>>> (a >> np.int64(1)) * b
-5340232216128654848
The values you are getting are due to unsigned and signed (two's complement) integer overflow. (Note: I'm using a right bitshift to simulate division-by-two, I could probably also use the // operator).
EDIT: Also, a good point was raised in the comments by Herohtar: it clearly overflows if run in debug mode, panicking with 'attempt to add with overflow'.
Related
How can I divide int64?
let v: int64 = 100
echo v / 10
Error Error: type mismatch: got <int64, int literal(10)>
Full example
import math
proc sec_to_min*(sec: int64): int =
let min = sec / 60 # <= error
min.round.to_int
echo 100.sec_to_min
P.S.
And, is there a way to safely cast int64 to int, so the result would be int and not int64, with the check for overflow.
There has been already a bit of discussion over int64 division in this issue and probably some improvement to current state can be made. From the above issue:
a good reason for not having in stdlib float division between int64 is that it might it may incur in loss of precision and so the user should explicitly convertint64 to float
still, float division between int types is present in stdlib
on 64 bit system int is int64 (and so you have division between int64 in 64 bit systems)
For your use case I think the following (playground) should work (better to use div instead of doing float division and then rounding off):
import math
proc sec_to_min*(sec: int64): int = sec.int div 60
echo 100.sec_to_min
let a = high(int64)
echo a.int # on playground this does not raise error since int is int64
echo a.int32 # this instead correctly raises error
output:
1
9223372036854775807
/usercode/in.nim(9) in
/playground/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: value out of range: 9223372036854775807 notin -2147483648 .. 2147483647 [RangeError]
P.S.: as you see above standard conversion has range checks
Apparently division between int64 types is terribly dangerous because it invokes an undying horde of bike shedding, but at least you can create your own operator:
proc `/`(x, y: int64): int64 = x div y
let v: int64 = 100
echo v / 10
Or
proc `/`(x, y: int64): int64 = x div y
import math
proc sec_to_min*(sec: int64): int =
int(sec / 60)
echo 100.sec_to_min
With regards to the int64 to int conversion, I'm not sure that makes much sense since most platforms will run int as an alias of int64. But of course you could be compiling/running on a 32 bit platform, where the loss would be tragic, so you can still do runtime checks:
let a = int64.high
echo "Unsurprising but potentially wrong ", int(a)
proc safe_int(big_int: int64): int =
if big_int > int32.high:
raise new_exception(Overflow_error, "Value is too high for 32 bit platforms")
int(big_int)
echo "Reachable code ", safe_int(int32.high)
echo "Unreachable code ", safe_int(a)
Also, if you are running into confusing minute, hour, day conversions, you might want to look into distinct types to avoid adding months to seconds (or do so in a more safe way).
On page 322 of Programming Rust by Blandy and Orendorff is this claim:
...Rust...recognizes that there's a simpler way to sum the numbers from one to n: the sum is always equal to n * (n+1) / 2.
This is of course a fairly well-known equivalence, but how does the compiler recognize it? I'm guessing it's in an LLVM optimization pass, but is LLVM somehow deriving the equivalence from first principles, or does it just have some set of "common loop computations" that can be simplified to arithmetic operations?
First of all, let's demonstrate that this actually happens.
Starting with this code:
pub fn sum(start: i32, end: i32) -> i32 {
let mut result = 0;
for i in start..end {
result += i;
}
return result;
}
And compiling in Release, we get:
; playground::sum
; Function Attrs: nounwind nonlazybind readnone uwtable
define i32 #_ZN10playground3sum17h41f12649b0533596E(i32 %start1, i32 %end) {
start:
%0 = icmp slt i32 %start1, %end
br i1 %0, label %bb5.preheader, label %bb6
bb5.preheader: ; preds = %start
%1 = xor i32 %start1, -1
%2 = add i32 %1, %end
%3 = add i32 %start1, 1
%4 = mul i32 %2, %3
%5 = zext i32 %2 to i33
%6 = add i32 %end, -2
%7 = sub i32 %6, %start1
%8 = zext i32 %7 to i33
%9 = mul i33 %5, %8
%10 = lshr i33 %9, 1
%11 = trunc i33 %10 to i32
%12 = add i32 %4, %start1
%13 = add i32 %12, %11
br label %bb6
bb6: ; preds = %bb5.preheader, %start
%result.0.lcssa = phi i32 [ 0, %start ], [ %13, %bb5.preheader ]
ret i32 %result.0.lcssa
}
Where we can indeed observe that there is no loop any longer.
Thus we validate the claim by Bandy and Orendorff.
As for how this occurs, my understanding is that this all happens in ScalarEvolution.cpp in LLVM. Unfortunately, that file is a 12,000+ lines monstruosity, so navigating it is a tad complicated; still, the head comment hints that we should be in the right place, and points to the papers it used which mention optimizing loops and closed-form functions1:
//===----------------------------------------------------------------------===//
//
// There are several good references for the techniques used in this analysis.
//
// Chains of recurrences -- a method to expedite the evaluation
// of closed-form functions
// Olaf Bachmann, Paul S. Wang, Eugene V. Zima
//
// On computational properties of chains of recurrences
// Eugene V. Zima
//
// Symbolic Evaluation of Chains of Recurrences for Loop Optimization
// Robert A. van Engelen
//
// Efficient Symbolic Analysis for Optimizing Compilers
// Robert A. van Engelen
//
// Using the chains of recurrences algebra for data dependence testing and
// induction variable substitution
// MS Thesis, Johnie Birch
//
//===----------------------------------------------------------------------===//
According to this blog article by Krister Walfridsson, it builds up chains of recurrences, which can be used to obtain a closed-form formula for each inductive variable.
This is a mid-point between full reasoning and full hardcoding:
Pattern-matching is used to build the chains of recurrence, so LLVM may not recognize all ways of expressing a certain computation.
A large variety of formulas can be optimized, not only the triangle sum.
The article also notes that the optimization may end up pessimizing the code: a small number of iterations can be faster if the "optimized" code requires a larger number of operations compared to the inner body of the loop.
1 n * (n+1) / 2 is the closed-form function to compute the sum of numbers in [0, n].
This code works for setting the program counter to the address of the vector_table on the ARM architecture:
static mut JUMP: Option<extern "C" fn()> = None;
JUMP = Some(core::mem::transmute(vector_table));
(JUMP.unwrap())();
I calculate the vector table using let vector_table = *((address + 4) as * const u32);
Is there any way of expressing the same in pure Rust code?
The equivalent C code is:
((void (*)(void))address[1])();
address is an uint32_t *address, so you offset it by 4 bytes to hit the vector table.
In the below code, I create several arrays that are 4 elements long. This works fine however, when I make them larger (5 or 6) elements I get the below error.
IR Code
declare [4 x double] #malloc(double)
declare double #printd(double)
define double #__anon_expr0() {
entry:
%foo = alloca [4 x double]
%calltmp = call [4 x double] #malloc(double 2.560000e+02)
%0 = insertvalue [4 x double] %calltmp, double 1.000000e+00, 0
%1 = insertvalue [4 x double] %0, double 1.000000e+00, 1
%2 = insertvalue [4 x double] %1, double 1.000000e+00, 2
store [4 x double] %2, [4 x double]* %foo
%3 = getelementptr [4 x double], [4 x double]* %foo, i32 0, i32 0
%__ = load double, double* %3
%calltmp1 = call double #printd(double %__)
%4 = getelementptr [4 x double], [4 x double]* %foo, i32 0, i32 0
%__2 = load double, double* %4
%calltmp3 = call double #printd(double %__2)
ret double %calltmp3
}
define i32 #main() {
call double #__anon_expr0 ()
ret i32 0
}
Error
built(16968,0x7fff8f39c380) malloc: *** mach_vm_map(size=140732867354624) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
What is causing the error? I would think that malloc could handle more than a 4 element array. What am I doing wrong here?
The signature of malloc is void* malloc(size_t size) where size_t is a platform-specific integer type (32-bit on 32-bit platforms, 64-bit on 64-bit platforms).
So when you declare it as taking a double and call it as such, you're invoking undefined behaviour. In practice what will happen is that you're moving the value 256.0 into the double register for first arguments and then calling malloc, which will read its argument from the integer register for first arguments (which was never initialized).
Instead you should declare malloc to take an i64 on 64-platforms and i32 on 32-bit. And then you should also call it with an integer argument (i.e. 256 instead of 2.56e2).
Another problem is the return type: malloc returns a pointer, not an array (C function never return arrays - that isn't even syntactically possible). So it should be declared and used as such.
I've got this haskell file, compiled with ghc -O2 (ghc 7.4.1), and takes 1.65 sec on my machine
import Data.Bits
main = do
print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
The same algorithm in C, compiled with gcc -O2 (gcc 4.6.3), runs in 0.18 sec.
#include <stdio.h>
void main() {
int count = 0;
const int max = 123456789;
int i;
for (i = 0; i < max; ++i)
if ((i & (1 << i % 4)) != 0)
++count;
printf("count: %d\n", count);
}
Update
I thought it might be the Data.Bits stuff going slow, but surprisingly if I remove the shifting and just do a straight mod, it actually runs slower at 5.6 seconds!?!
import Data.Bits
main = do
print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]
whereas the equivalent C runs slightly faster at 0.16 sec:
#include <stdio.h>
void main() {
int count = 0;
const int max = 123456789;
int i;
for (i = 0; i < max; ++i)
if ((i % 4) != 0)
++count;
printf("count: %d\n", count);
}
The two pieces of code do very different things.
import Data.Bits
main = do
print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
creates a list of 123456790 Integer (lazily), takes the remainder modulo 4 of each (involving first a check whether the Integer is small enough to wrap a raw machine integer, then after the division a sign-check, since mod returns non-negative results only - though in ghc-7.6.1, there is a primop for that, so it's not as much of a brake to use mod as it was before), shifts the Integer 1 left the appropriate number of bits, which involves a conversion to "big" Integers and a call to GMP, takes the bitwise and with i - yet another call to GMP - and checks whether the result is 0, which causes another call to GMP or a conversion to small integer, not sure what GHC does here. Then, if the result is nonzero, a new list cell is created where that Integer is put in, and consumed by length. That's a lot of work done, most of which unnecessarily complicated due to the defaulting of unspecified number types to Integer.
The C code
#include <stdio.h>
int main(void) {
int count = 0;
const int max = 123456789;
int i;
for (i = 0; i < max; ++i)
if ((i & (1 << i % 4)) != 0)
++count;
printf("count: %d\n", count);
return 0;
}
(I took the liberty of fixing the return type of main), does much much less. It takes an int, compares it to another, if smaller, takes the bitwise and of the first int with 3(1), shifts the int 1 to the left the appropriate number of bits, takes the bitwise and of that and the first int, and if nonzero increments another int, then increments the first. Those are all machine ops, working on raw machine types.
If we translate that code to Haskell,
module Main (main) where
import Data.Bits
maxNum :: Int
maxNum = 123456789
loop :: Int -> Int -> Int
loop acc i
| i < maxNum = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i+1)
| otherwise = acc
main :: IO ()
main = print $ loop 0 0
we get a much closer result:
C, gcc -O3:
count: 30864196
real 0m0.180s
user 0m0.178s
sys 0m0.001s
Haskell, ghc -O2:
30864196
real 0m0.247s
user 0m0.243s
sys 0m0.003s
Haskell, ghc -O2 -fllvm:
30864196
real 0m0.144s
user 0m0.140s
sys 0m0.003s
GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.
Okay, I have done the optimisation of replacing a modulus calculation with a power-of-two modulus with a bitwise and by hand, GHC's native code generator doesn't do that (yet), so with ```rem4`` instead of.&. 3`, the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does that optimisation, and produces the same code as with the hand-made optimisation.
Now, let us turn to gspr's question
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Well, the original code used Integers and lists, llvm doesn't know too well what to do with these, it can't transform that code into loops. The modified code uses Ints and the vector package rewrites the code to loops, so llvm does know how to optimise that well, and that shows.
(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a div instruction is faster than a shift.
Few things beat a hand-written loop with a strict accumulator:
{-# LANGUAGE BangPatterns #-}
import Data.Bits
f :: Int -> Int
f n = g 0 0
where g !i !s | i <= n = g (i+1) (if i .&. (unsafeShiftL 1 (i `rem` 4)) /= 0 then s+1 else s)
| otherwise = s
main = print $ f 123456789
In addition to the tricks mentioned so far, this also replaces shift with unsafeShiftL, which doesn't check its argument.
Compiled with -O2 and -fllvm, this is about 13x faster than the original on my machine.
Note: Testing if bit i of x is set can be written more clearly as x `testBit` i. This produces the same assembly as the above.
Vector instead of list, fold instead of filter-and-length
Substituting the list for an unboxed vector and the filter-and-length for a fold (i.e. incrementing a counter) improves the time significantly for me. Here's what I used:
import qualified Data.Vector.Unboxed as UV
import Data.Bits
foo :: Int
foo = UV.foldl (\s i -> if i .&. (shift 1 (i `rem` 4)) /= 0 then s+1 else s) 0 (UV.enumFromN 0 123456789)
main = print foo
The original code (with two changes though: rem instead of mod as suggested in the comments, and adding an Int to the signature to avoid Integer) gave:
$ time ./orig
30864196
real 0m2.159s
user 0m2.144s
sys 0m0.008s
The modified code above gave:
$ time ./new
30864196
real 0m1.450s
user 0m1.440s
sys 0m0.004s
LLVM
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Original (LLVM):
$ time ./orig-llvm
30864196
real 0m2.047s
user 0m2.036s
sys 0m0.008s
Modified (LLVM):
$ time ./new-llvm
30864196
real 0m0.233s
user 0m0.228s
sys 0m0.004s
For comparison, OP's original C code comes in at 0m0.152s user on my system.
This is all GHC 7.4.1, GCC 4.6.3, and vector 0.9.1. LLVM is either 2.9 or 3.0; I have both but can't seem to figure out which one GHC is actually using.
Try this:
import Data.Bits
main = do
print $ length $ filter (\i -> i .&. (shift 1 (i `rem` 4)) /= 0) [0..123456789::Int]
Without the ::Int, the type defaults to ::Integer.
rem does the same as mod on positive values, and it is the same as % in C. mod on the other hand ist mathematically correct on negative values, but is slower.
int in C is 32bit
Int in Haskell is either 32 or 64bit wide, like long in C
Integer is an arbitrary-bit-integer, it has no min/max values, and its memory size depends on its value (similar to a string).