Performance issue when calling isSpaceAscii - nim-lang

I tried to call isSpaceAscii from the standard library but got worse performance than with my own proc.
Code to reproduce:
import strutils
import std/monotimes
import stats

template timeIt(tag: string, iter: untyped, body: untyped) =
  var st: RunningStat
  for i in countup(1, iter):
    let t0 = getMonoTime().ticks
    body
    let t1 = getMonoTime().ticks
    let d = t1 - t0
    st.push(d.float64)
  echo tag, ": ", st.min

proc isSpace(c: char): bool =
  result = c in Whitespace
when isMainModule:
  # check equality
  for i in 1..255:
    let c = char(i)
    doAssert isSpace(c) == isSpaceAscii(c)

  timeIt "isSpaceAscii", 1000:
    for i in 1..255:
      let c = char(i)
      discard isSpaceAscii(c)

  timeIt "isSpace", 1000:
    for i in 1..255:
      let c = char(i)
      discard isSpace(c)
Benchmark results:
$ nim compile -d:release --verbosity:0 --hints:off --run test.nim
isSpaceAscii: 380.0
isSpace: 20.0
Compiler version:
$ nim -V
Nim Compiler Version 1.4.2 [Linux: amd64]
Why is isSpace faster than isSpaceAscii?

Benchmarking is hard, because you're not always measuring what you think you're measuring.
The incredibly stark difference you're seeing is because the isSpace loop doesn't do any observable work and is in the same compilation unit as the isSpace proc, so the compiler can optimize it away entirely, as you can see on godbolt.
If you instead compile with -d:release -d:lto, the compiler will perform link-time optimization and will optimize away both versions.
$ nim c -d:release -d:lto -r test.nim
isSpaceAscii: 16
isSpace: 16
We're just measuring the loop overhead.
To actually compare isSpaceAscii with isSpace, they need to do actual work as far as the compiler is concerned.
import std/[strutils, monotimes, stats]

template timeIt(tag: string, iters: int, body: untyped) =
  var st: RunningStat
  when declared(warmup): #BUG
    for i in 1..iters:
      body
  for i in 1..iters:
    let t0 = getMonoTime().ticks
    body
    let t1 = getMonoTime().ticks
    st.push((t1 - t0).float64)
  echo tag, ": ", st.min
proc isSpace(c: char): bool = c in Whitespace

template badloop(procname: untyped) =
  for i in 1..255:
    let c = char(i)
    discard procname(c)

template goodloop(procname: untyped) =
  var x: int
  for i in 1..255:
    let c = char(i)
    if procname(c): inc x

when isMainModule:
  let nruns = 1000
  for i in 1..255:
    doAssert isSpace(i.char) == isSpaceAscii(i.char)

  timeIt "isSpaceAscii, good", nruns:
    goodloop(isSpaceAscii)
  timeIt "isSpace, good", nruns:
    goodloop(isSpace)
  timeIt "isSpaceAscii, bad", nruns:
    badloop(isSpaceAscii)
  timeIt "isSpace, bad", nruns:
    badloop(isSpace)
Result:
$ nim c -d:release -d:lto -r test.nim
isSpaceAscii, good: 439.0
isSpace, good: 382.0
isSpaceAscii, bad: 17.0
isSpace, bad: 17.0
Closer, but there still seems to be a discrepancy; what's that about?
~~The cpu has warmed up by the time it gets to the second test, and it goes faster.~~ (Edit: due to a bug, -d:warmup didn't change the code; the differences were due to the compiler making different optimization choices.)
Try again, ~~turning on the warmup loop~~ we've added to timeIt, and this time let's go all in for speed and use -d:danger:
$ nim c -d:danger -d:lto -d:warmup -r test.nim
isSpaceAscii, good: 237.0
isSpace, good: 234.0
isSpaceAscii, bad: 15.0
isSpace, bad: 15.0
Pretty much the same.
Edit: as if to highlight the inscrutability of microbenchmarks, I was completely wrong about the warmup section. I should have written when defined(warmup), and because of that bug, my "warmup" code never actually ran. Indeed, since we took the fastest time, the first few runs were sufficient warmup.
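For reference, here's a minimal sketch of the distinction that bit me (my own illustration, not from the benchmark code):

# `defined` tests a conditional symbol passed via -d:,
# `declared` tests whether an identifier of that name is in scope.
when defined(warmup):
  echo "compiled with -d:warmup"        # enabled by the flag
when declared(warmup):
  echo "an identifier `warmup` exists"  # never true here: nothing declares one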
I've run several versions of the code since then, and the results vary too widely to draw any more conclusions, other than maybe:
- compilers' optimization choices are fickle
- Benchmarking is hard

Related

How to catch the output of a compiler error in nim?

I am not sure if this is currently possible (and maybe it is not even advisable), but I would like to be able to catch the output of a compiler error and reuse it in code. An example would be:
type IntOrString = int | string
var s: seq[IntOrString]
this code would not compile with error:
/usercode/in.nim(2, 5) Error: invalid type: 'IntOrString' in this context: 'seq[IntOrString]' for var
I am interested in a way to be able to use the message of this error in the code.
My use case is being able to easily document and discuss compiler errors in nimib. If I were to write a document that shows and discusses the different types of compiler errors, catching the message automatically would be useful (a workaround right now would be to write the code to a file and compile with verbosity 0).
It is possible to use the compiler API to catch errors:

import ../compiler/[nimeval, llstream, ast, lineinfos, options], os, strformat

let std = findNimStdLibCompileTime()
let modules = [std, std / "pure", std / "std", std / "core"]
var intr = createInterpreter("script", modules)
intr.registerErrorHook(proc (config: ConfigRef, info: TLineInfo, msg: string, severity: Severity) =
  raise newException(CatchableError, &"{severity}: {(info.line, info.col)} {msg}")
)
try:
  intr.evalScript(llStreamOpen("""type IntOrString = int | string
var s: seq[IntOrString]"""))
except CatchableError as e:
  echo e.msg
echo "done"
destroyInterpreter(intr)
outputs:
Error: (2, 4) invalid type: 'IntOrString' in this context: 'seq[IntOrString]' for var
done
Caveat: you can't run runtime code at compile time, for example, trying to run
type Dog = object of RootObj
method speak(i: Dog): string = "woof"
echo Dog().speak()
in the interpreter will give you errors; instead, you would have to do something like:
type Dog = object of RootObj
method speak*(i: Dog): string = "woof"
proc newDog*(): Dog = discard

let
  dog = intr.callRoutine(intr.selectRoutine("newDog"), [])
  speech = intr.callRoutine(intr.selectRoutine("speak"), [dog])
if speech.kind == nkStrLit:
  echo speech.strVal
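Building on the first snippet, here is a minimal sketch of how this could be packaged for the nimib use case. firstError is my own name, not part of nimib, and whether a single interpreter instance can safely be reused after an error is something you'd want to verify:

# returns the first compile error for `code`, or "" if it compiles;
# assumes `intr` was configured with the error hook shown above
proc firstError(intr: Interpreter, code: string): string =
  try:
    intr.evalScript(llStreamOpen(code))
  except CatchableError as e:
    result = e.msg

echo firstError(intr, """type IntOrString = int | string
var s: seq[IntOrString]""")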

Nim: work with read-only memory mapped files

I've only just started with Nim, hence this is possibly a simple question. We need to do many lookups into data stored in a file. Some of these files are too large to load into memory, hence the mmapped approach. I'm able to mmap the file by means of memfiles and have either a pointer or a MemSlice at hand. The file and the memory region are read-only, and hence have a fixed size. I was hoping to be able to access the data as immutable fixed-size byte and char arrays without copying them, leveraging all the existing functionality available for seqs, arrays, strings etc. All the MemSlice / string methods copy the data, which is fair, but not what I want (and in my use case don't need).
I understand that array, string etc. types have a pointer to the data and a len field, but I couldn't find a way to create them from a pointer and a len. I assume it has something to do with ownership and refs to memory that may outlive my slice.
let mm = memfiles.open(...)
let myImmutableFixedSizeArr = ?? # cast[ptr array[fsize, char]](mm.mem) doesn't compile, as fsize needs to be const.
                                 # Neither could I find something like: let x: [char] = array_from(mm.mem, fsize)
let myImmutableFixedSizeString = mm[20, 30].to_fixed_size_immutable_string # Create something string-like so that I can use all the existing string methods.
UPDATE: I did find https://forum.nim-lang.org/t/4680#29226 which explains how to use openArray, but openArray is only allowed as a function argument, and - if I'm not mistaken - it doesn't behave like a normal array.
Thanks for your help
It is not possible to convert a raw char array in memory (ptr UncheckedArray[char]) to a string without copying, only to an openArray[char] (or cstring).
So it won't be possible to use procs that expect a string, only those that accept openArray[T] or openArray[char].
Happily, an openArray[T] behaves exactly like a seq[T] when sent to a proc.
({.experimental: "views".} does let you assign an openArray[T] to a local variable, but it's not anywhere near ready for production; see the sketch below.)
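For completeness, a minimal sketch of that experimental feature, assuming a reasonably recent compiler (again: not production-ready):

{.experimental: "views".}
import memfiles

let mm = memfiles.open("file.txt")
# a local openArray view over the whole mapping, without copying
let view: openArray[char] =
  toOpenArray(cast[ptr UncheckedArray[char]](mm.mem), 0, mm.size - 1)
echo view.len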
you can use the memSlices iterator to loop over delimited chunks in a MemFile without copying:

import memfiles

template toOpenArray(ms: MemSlice, T: typedesc = byte): openArray[T] =
  ## template because openArray isn't a valid return type yet
  toOpenArray(cast[ptr UncheckedArray[T]](ms.data), 0, (ms.size div sizeof(T)) - 1)

func process(slice: openArray[char]) =
  ## your code here, but e.g.
  ## count number of A's
  var nA: int
  for ch in slice.items:
    if ch == 'A': inc nA
  debugEcho nA

let mm = memfiles.open("file.txt")
for slice in mm.memSlices:
  process slice.toOpenArray(char)
Or, to work with some char array represented in the middle of the file, you can use pointer arithmetic.

import memfiles

template extractImpl(typ, pntr, offset) =
  cast[typ](cast[ByteAddress](pntr) + offset)

template checkFileLen(memfile, len, offset) =
  if offset + len > memfile.size:
    raise newException(IndexDefect, "file too short")

func extract*(mm: MemFile, T: typedesc, offset: Natural): ptr T =
  checkFileLen(mm, sizeof(T), offset)
  result = extractImpl(ptr T, mm.mem, offset)

func extract*[U](mm: MemFile, T: typedesc[ptr U], offset: Natural): T =
  extractImpl(T, mm.mem, offset)

let mm = memfiles.open("file.txt")

# to extract a compile-time known length string:
let mystring_offset = 3
const mystring_len = 10
type MyStringT = array[mystring_len, char]
let myString: ptr MyStringT = mm.extract(MyStringT, mystring_offset)
process myString[]

# to extract a dynamic length string:
let size_offset = 14
let string_offset = 18
let sz: ptr int32 = mm.extract(int32, size_offset)
let str: ptr UncheckedArray[char] = mm.extract(ptr UncheckedArray[char], string_offset)
checkFileLen(mm, sz[], string_offset)
process str.toOpenArray(0, sz[] - 1)

How to divide int64 in Nim?

How can I divide int64?
let v: int64 = 100
echo v / 10
Error:
Error: type mismatch: got <int64, int literal(10)>
Full example
import math

proc sec_to_min*(sec: int64): int =
  let min = sec / 60 # <= error
  min.round.to_int

echo 100.sec_to_min
P.S. Also, is there a way to safely convert int64 to int, so that the result would be int and not int64, with a check for overflow?
There has already been a bit of discussion about int64 division in this issue, and probably some improvements to the current state can be made. From the above issue:
- a good reason for not having float division between int64s in the stdlib is that it may incur a loss of precision, so the user should explicitly convert int64 to float
- still, float division between int types is present in the stdlib
- on 64 bit systems int is int64 (and so you do have division between int64s on 64 bit systems)
For your use case I think the following (playground) should work (better to use div instead of doing float division and then rounding off):

import math

proc sec_to_min*(sec: int64): int = sec.int div 60
echo 100.sec_to_min

let a = high(int64)
echo a.int   # on playground this does not raise an error since int is int64
echo a.int32 # this instead correctly raises an error
output:
1
9223372036854775807
/usercode/in.nim(9) in
/playground/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: value out of range: 9223372036854775807 notin -2147483648 .. 2147483647 [RangeError]
P.S.: as you see above, standard conversions have range checks.
Apparently division between int64 types is terribly dangerous because it invokes an undying horde of bike shedding, but at least you can create your own operator:
proc `/`(x, y: int64): int64 = x div y

let v: int64 = 100
echo v / 10
Or
import math

proc `/`(x, y: int64): int64 = x div y

proc sec_to_min*(sec: int64): int =
  int(sec / 60)

echo 100.sec_to_min
With regards to the int64 to int conversion, I'm not sure that makes much sense, since most platforms will run int as an alias of int64. But of course you could be compiling/running on a 32 bit platform, where the loss would be tragic, so you can still do runtime checks:
let a = int64.high
echo "Unsurprising but potentially wrong ", int(a)

proc safe_int(big_int: int64): int =
  if big_int > int32.high:
    raise new_exception(Overflow_error, "Value is too high for 32 bit platforms")
  int(big_int)

echo "Reachable code ", safe_int(int32.high)
echo "Unreachable code ", safe_int(a)
Also, if you are running into confusing minute, hour, day conversions, you might want to look into distinct types to avoid adding months to seconds (or to do so in a safer way); see the sketch below.
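A minimal sketch of that idea, with illustrative names of my own (not from any particular library):

type
  Seconds = distinct int64
  Minutes = distinct int64

proc `+`(a, b: Seconds): Seconds {.borrow.} # seconds + seconds is fine
proc toMinutes(s: Seconds): Minutes = Minutes(s.int64 div 60)

echo (Seconds(90) + Seconds(30)).toMinutes.int64 # 2
# Seconds(90) + Minutes(1) would be a compile-time error, which is the point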

Calling a Clojure Function from Haskell

Is it possible to call a Clojure function from Haskell (on the GHC), using the FFI or some other trick? Here I'm interested in staying within the bounds of GHC (i.e., not using Frege). I'm also interested in keeping the central program in Haskell (meaning that the Clojure function should be called from Haskell, and not vice versa).
How to do this?
Let me start by advertising inline-java, which should make it pretty easy to call Clojure by just writing the Java code that calls the Clojure API. That said, since I am not running the bleeding-edge GHC 8.0.2 (and had a variety of other install issues), I haven't been able to use this. When (if) I get inline-java running, I'll update this solution.
My solution below starts by creating a C interface to the Java methods in the Clojure API for Java via the JNI. Then, it calls that C interface using Haskell FFI support. You may need to adjust the library and include file paths depending on where your JDK and JRE are installed. If everything works right, you should see 7 printed to stdout. This is 3 plus 4 calculated by Clojure.
Setup
Download the Clojure 1.8.0 jar if you don't already have it. We'll be using the Java Clojure API. Make sure you've defined LD_LIBRARY_PATH. On the machine I used, that means exporting
export LD_LIBRARY_PATH="/usr/lib64/jvm/java/jre/lib/amd64/server/"
Finally, here is a makefile to make compiling a bit easier. You may need to adjust some library and include paths.
# makefile
all:
	gcc -O -c \
	-I /usr/lib64/jvm/java/include/ \
	-I /usr/lib64/jvm/java/include/linux/ \
	java.c
	ghc -O2 -Wall \
	-L/usr/lib64/jvm/java/jre/lib/amd64/server/ \
	-ljvm \
	clojure.hs \
	java.o

run:
	./clojure

clean:
	rm -f java.o
	rm -f clojure clojure.o clojure.hi
C Interface to Clojure functions
Now, we will make a C interface for the JVM and Clojure functionality we need. For this, we will be using the JNI. I chose to expose a pretty limited interface:
- create_vm initializes a new JVM with the Clojure jar on the classpath (make sure you adjust this if you put your Clojure jar somewhere other than in the same folder)
- load_methods looks up the Clojure methods we will need. Thankfully the Java Clojure API is pretty small, so we can wrap almost all of the functions there without too much difficulty. We also need functions that convert things like numbers or strings to and from their corresponding Clojure representation. I've only done this for java.lang.Long (which is Clojure's default integral number type).
- readObj wraps clojure.java.api.Clojure.read (with C strings)
- varObj wraps the one-arg version of clojure.java.api.Clojure.var (with C strings)
- varObjQualified wraps the two-arg version of clojure.java.api.Clojure.var (with C strings)
- longValue converts a Clojure long to a C long
- newLong converts a C long to a Clojure long
- invokeFn dispatches to the clojure.lang.IFn.invoke of the right arity. Here, I only bother to expose this up to arity 2, but nothing is stopping you from going further.
Here is the code:
// java.c
#include <stdio.h>
#include <stdbool.h>
#include <jni.h>

// Uninitialized Java Native Interface
JNIEnv *env;
JavaVM *jvm;

// JClass for Clojure
jclass clojure, ifn, longClass;
jmethodID readM, varM, varQualM, // defined on 'clojure.java.api.Clojure'
          invoke[3],             // defined on 'clojure.lang.IFn'
          longValueM, longC;     // defined on 'java.lang.Long'

// Initialize the JVM with the Clojure JAR on classpath.
bool create_vm() {
  // Configuration options for the JVM
  JavaVMOption opts = {
    .optionString = "-Djava.class.path=./clojure-1.8.0.jar",
  };
  JavaVMInitArgs args = {
    .version = JNI_VERSION_1_6,
    .nOptions = 1,
    .options = &opts,
    .ignoreUnrecognized = false,
  };

  // Make the VM
  int rv = JNI_CreateJavaVM(&jvm, (void**)&env, &args);
  if (rv < 0 || !env) {
    printf("Unable to Launch JVM %d\n", rv);
    return false;
  }
  return true;
}

// Lookup the classes and objects we need to interact with Clojure.
void load_methods() {
  clojure  = (*env)->FindClass(env, "clojure/java/api/Clojure");
  readM    = (*env)->GetStaticMethodID(env, clojure, "read", "(Ljava/lang/String;)Ljava/lang/Object;");
  varM     = (*env)->GetStaticMethodID(env, clojure, "var", "(Ljava/lang/Object;)Lclojure/lang/IFn;");
  varQualM = (*env)->GetStaticMethodID(env, clojure, "var", "(Ljava/lang/Object;Ljava/lang/Object;)Lclojure/lang/IFn;");

  ifn = (*env)->FindClass(env, "clojure/lang/IFn");
  invoke[0] = (*env)->GetMethodID(env, ifn, "invoke", "()Ljava/lang/Object;");
  invoke[1] = (*env)->GetMethodID(env, ifn, "invoke", "(Ljava/lang/Object;)Ljava/lang/Object;");
  invoke[2] = (*env)->GetMethodID(env, ifn, "invoke", "(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;");
  // Obviously we could keep going here. The Clojure API has 'invoke' for up to 20 arguments...

  longClass  = (*env)->FindClass(env, "java/lang/Long");
  longValueM = (*env)->GetMethodID(env, longClass, "longValue", "()J");
  longC      = (*env)->GetMethodID(env, longClass, "<init>", "(J)V");
}

// call the 'invoke' function of the right arity on 'IFn'.
jobject invokeFn(jobject obj, unsigned n, jobject *args) {
  return (*env)->CallObjectMethodA(env, obj, invoke[n], (jvalue*)args);
}

// 'read' static method from 'Clojure' object.
jobject readObj(const char *cStr) {
  jstring str = (*env)->NewStringUTF(env, cStr);
  return (*env)->CallStaticObjectMethod(env, clojure, readM, str);
}

// 'var' static method from 'Clojure' object.
jobject varObj(const char* fnCStr) {
  jstring fn = (*env)->NewStringUTF(env, fnCStr);
  return (*env)->CallStaticObjectMethod(env, clojure, varM, fn);
}

// qualified 'var' static method from 'Clojure' object.
jobject varObjQualified(const char* nsCStr, const char* fnCStr) {
  jstring ns = (*env)->NewStringUTF(env, nsCStr);
  jstring fn = (*env)->NewStringUTF(env, fnCStr);
  return (*env)->CallStaticObjectMethod(env, clojure, varQualM, ns, fn);
}

// convert a Clojure 'Long' to a C 'long'.
long longValue(jobject n) {
  return (*env)->CallLongMethod(env, n, longValueM);
}

// convert a C 'long' to a Clojure 'Long'.
jobject newLong(long n) {
  return (*env)->NewObject(env, longClass, longC, (jlong)n);
}
Haskell Interface to C functions
Finally, we use Haskell's FFI to plug into the C functions we just made. This compiles to an executable which adds 3 and 4 using Clojure's add function. Here, I lost the motivation to make functions for readObj and varObj (mostly because I don't happen to need them for my example).
-- clojure.hs
{-# LANGUAGE GeneralizedNewtypeDeriving, ForeignFunctionInterface #-}

import Foreign
import Foreign.C.Types
import Foreign.C.String

-- Clojure objects are just Java objects, and jvalue is a union with size 64
-- bits. Since we are cutting corners, we might as well just derive 'Storable'
-- from something else that has the same size - 'CLong'.
newtype ClojureObject = ClojureObject CLong deriving (Storable)

foreign import ccall "load_methods" load_methods :: IO ()
foreign import ccall "create_vm" create_vm :: IO ()
foreign import ccall "invokeFn" invokeFn :: ClojureObject -> CUInt -> Ptr ClojureObject -> IO ClojureObject
-- foreign import ccall "readObj" readObj :: CString -> IO ClojureObject
-- foreign import ccall "varObj" varObj :: CString -> IO ClojureObject
foreign import ccall "varObjQualified" varObjQualified :: CString -> CString -> IO ClojureObject
foreign import ccall "newLong" newLong :: CLong -> ClojureObject
foreign import ccall "longValue" longValue :: ClojureObject -> CLong

-- | In order for anything to work, this needs to be called first.
loadClojure :: IO ()
loadClojure = create_vm *> load_methods

-- | Make a Clojure function call
invoke :: ClojureObject -> [ClojureObject] -> IO ClojureObject
invoke fn args = do
  args' <- newArray args
  let n = fromIntegral (length args)
  invokeFn fn n args'

-- | Make a Clojure number from a Haskell one
long :: Int64 -> ClojureObject
long l = newLong (CLong l)

-- | Make a Haskell number from a Clojure one
unLong :: ClojureObject -> Int64
unLong cl = let CLong l = longValue cl in l

-- | Look up a var in Clojure based on the namespace and name
varQual :: String -> String -> IO ClojureObject
varQual ns fn = withCString ns (\nsCStr ->
  withCString fn (\fnCStr -> varObjQualified nsCStr fnCStr))

main :: IO ()
main = do
  loadClojure
  putStrLn "Clojure loaded"
  plus <- varQual "clojure.core" "+"
  out <- invoke plus [long 3, long 4]
  print $ unLong out -- prints "7" on my tests
Try it!
Compiling should be just make all, and running just make run.
Limitations
Since this is only a proof of concept, there are a bunch of things that should be fixed:
proper conversion for all of Clojure's primitive types
tear down the JVM after you are done (see the sketch after this list)!
make sure we aren't introducing memory leaks anywhere (which we might be doing with newArray)
represent Clojure objects properly in Haskell
many more!
That said, it works!
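For instance, the teardown item could look something like the following sketch, using the standard JNI call (untested here):

// companion to create_vm: shut the JVM down once you are completely done
void destroy_vm(void) {
  if (jvm) {
    (*jvm)->DestroyJavaVM(jvm);
    jvm = NULL;
    env = NULL;
  }
}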
An easy way would be to launch your Clojure process with a socket REPL or nREPL server.
This enables a socket based REPL, so you could then use sockets to call your Clojure function.
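A minimal sketch of that route, assuming the Clojure process was started with -Dclojure.server.repl="{:port 5555 :accept clojure.core.server/repl}" and using the network package; prompt parsing and error handling are omitted:

{-# LANGUAGE OverloadedStrings #-}
import Network.Socket
import Network.Socket.ByteString (recv, sendAll)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  addr:_ <- getAddrInfo (Just defaultHints { addrSocketType = Stream })
                        (Just "127.0.0.1") (Just "5555")
  sock <- socket (addrFamily addr) (addrSocketType addr) (addrProtocol addr)
  connect sock (addrAddress addr)
  sendAll sock "(+ 3 4)\n" -- send the call to the REPL as text
  reply <- recv sock 4096  -- the reply contains the prompt and the result
  B.putStrLn reply
  close sock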

Can I speed up this Haskell algorithm?

I've got this Haskell file, compiled with ghc -O2 (GHC 7.4.1), and it takes 1.65 sec on my machine:
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
The same algorithm in C, compiled with gcc -O2 (gcc 4.6.3), runs in 0.18 sec.
#include <stdio.h>

void main() {
  int count = 0;
  const int max = 123456789;
  int i;
  for (i = 0; i < max; ++i)
    if ((i & (1 << i % 4)) != 0)
      ++count;
  printf("count: %d\n", count);
}
Update
I thought it might be the Data.Bits stuff being slow, but surprisingly, if I remove the shifting and just do a straight mod, it actually runs slower, at 5.6 seconds!?!
import Data.Bits

main = do
  print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]
whereas the equivalent C runs slightly faster at 0.16 sec:
#include <stdio.h>

void main() {
  int count = 0;
  const int max = 123456789;
  int i;
  for (i = 0; i < max; ++i)
    if ((i % 4) != 0)
      ++count;
  printf("count: %d\n", count);
}
The two pieces of code do very different things.
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
creates a list of 123456790 Integer (lazily), takes the remainder modulo 4 of each (involving first a check whether the Integer is small enough to wrap a raw machine integer, then after the division a sign-check, since mod returns non-negative results only - though in ghc-7.6.1, there is a primop for that, so it's not as much of a brake to use mod as it was before), shifts the Integer 1 left the appropriate number of bits, which involves a conversion to "big" Integers and a call to GMP, takes the bitwise and with i - yet another call to GMP - and checks whether the result is 0, which causes another call to GMP or a conversion to small integer, not sure what GHC does here. Then, if the result is nonzero, a new list cell is created where that Integer is put in, and consumed by length. That's a lot of work done, most of which unnecessarily complicated due to the defaulting of unspecified number types to Integer.
The C code
#include <stdio.h>

int main(void) {
  int count = 0;
  const int max = 123456789;
  int i;
  for (i = 0; i < max; ++i)
    if ((i & (1 << i % 4)) != 0)
      ++count;
  printf("count: %d\n", count);
  return 0;
}
(I took the liberty of fixing the return type of main), does much, much less. It takes an int, compares it to another; if smaller, it takes the bitwise and of the first int with 3 (1), shifts the int 1 to the left the appropriate number of bits, takes the bitwise and of that and the first int, and if nonzero increments another int, then increments the first. Those are all machine ops, working on raw machine types.
If we translate that code to Haskell,
module Main (main) where

import Data.Bits

maxNum :: Int
maxNum = 123456789

loop :: Int -> Int -> Int
loop acc i
  | i < maxNum = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i+1)
  | otherwise  = acc

main :: IO ()
main = print $ loop 0 0
we get a much closer result:
C, gcc -O3:
count: 30864196
real 0m0.180s
user 0m0.178s
sys 0m0.001s
Haskell, ghc -O2:
30864196
real 0m0.247s
user 0m0.243s
sys 0m0.003s
Haskell, ghc -O2 -fllvm:
30864196
real 0m0.144s
user 0m0.140s
sys 0m0.003s
GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.
Okay, I have done the optimisation of replacing a modulus calculation (with a power-of-two modulus) with a bitwise and by hand; GHC's native code generator doesn't do that (yet), so with `rem 4` instead of `.&. 3`, the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does that optimisation, and produces the same code as with the hand-made optimisation.
Now, let us turn to gspr's question
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Well, the original code used Integers and lists, llvm doesn't know too well what to do with these, it can't transform that code into loops. The modified code uses Ints and the vector package rewrites the code to loops, so llvm does know how to optimise that well, and that shows.
(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a div instruction is faster than a shift.
Few things beat a hand-written loop with a strict accumulator:
{-# LANGUAGE BangPatterns #-}

import Data.Bits

f :: Int -> Int
f n = g 0 0
  where
    g !i !s | i <= n    = g (i+1) (if i .&. (unsafeShiftL 1 (i `rem` 4)) /= 0 then s+1 else s)
            | otherwise = s

main = print $ f 123456789
In addition to the tricks mentioned so far, this also replaces shift with unsafeShiftL, which doesn't check its argument.
Compiled with -O2 and -fllvm, this is about 13x faster than the original on my machine.
Note: Testing if bit i of x is set can be written more clearly as x `testBit` i. This produces the same assembly as the above.
Vector instead of list, fold instead of filter-and-length
Substituting an unboxed vector for the list, and a fold (i.e. incrementing a counter) for the filter-and-length, improves the time significantly for me. Here's what I used:
import qualified Data.Vector.Unboxed as UV
import Data.Bits

foo :: Int
foo = UV.foldl (\s i -> if i .&. (shift 1 (i `rem` 4)) /= 0 then s+1 else s) 0 (UV.enumFromN 0 123456789)

main = print foo
The original code (with two changes though: rem instead of mod as suggested in the comments, and adding an Int to the signature to avoid Integer) gave:
$ time ./orig
30864196
real 0m2.159s
user 0m2.144s
sys 0m0.008s
The modified code above gave:
$ time ./new
30864196
real 0m1.450s
user 0m1.440s
sys 0m0.004s
LLVM
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Original (LLVM):
$ time ./orig-llvm
30864196
real 0m2.047s
user 0m2.036s
sys 0m0.008s
Modified (LLVM):
$ time ./new-llvm
30864196
real 0m0.233s
user 0m0.228s
sys 0m0.004s
For comparison, OP's original C code comes in at 0m0.152s user on my system.
This is all GHC 7.4.1, GCC 4.6.3, and vector 0.9.1. LLVM is either 2.9 or 3.0; I have both but can't seem to figure out which one GHC is actually using.
Try this:
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `rem` 4)) /= 0) [0..123456789::Int]
Without the ::Int, the type defaults to ::Integer.
rem does the same as mod on positive values, and it is the same as % in C. mod, on the other hand, is mathematically correct on negative values, but slower.
int in C is 32 bits wide.
Int in Haskell is either 32 or 64 bits wide, like long in C.
Integer is an arbitrary-precision integer: it has no min/max values, and its memory size depends on its value (similar to a string).
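A quick demonstration of those last points (run on a 64-bit machine):

main :: IO ()
main = do
  print ((-3) `mod` 4)       -- 1: mod follows the sign of the divisor
  print ((-3) `rem` 4)       -- -3: rem truncates toward zero, like C's %
  print (maxBound :: Int)    -- 9223372036854775807 on a 64-bit platform
  print (10 ^ 20 :: Integer) -- no overflow: Integer has no maxBound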
