Can I speed up this Haskell algorithm? - haskell

I've got this Haskell file, compiled with ghc -O2 (GHC 7.4.1), and it takes 1.65 sec on my machine:
import Data.Bits
main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
The same algorithm in C, compiled with gcc -O2 (gcc 4.6.3), runs in 0.18 sec.
#include <stdio.h>
void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
}
Update
I thought it might be the Data.Bits stuff going slow, but surprisingly if I remove the shifting and just do a straight mod, it actually runs slower at 5.6 seconds!?!
import Data.Bits
main = do
  print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]
whereas the equivalent C runs slightly faster at 0.16 sec:
#include <stdio.h>
void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i % 4) != 0)
            ++count;
    printf("count: %d\n", count);
}

The two pieces of code do very different things.
import Data.Bits
main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
creates a list of 123456790 Integers (lazily). For each one it takes the remainder modulo 4 (which first involves a check whether the Integer is small enough to fit in a raw machine integer, and, after the division, a sign check, since mod returns non-negative results only; in ghc-7.6.1 there is a primop for that, so using mod is not as much of a brake as it used to be), then shifts the Integer 1 left by the appropriate number of bits, which involves a conversion to "big" Integers and a call to GMP, takes the bitwise and with i (yet another call to GMP), and checks whether the result is 0, which causes another call to GMP or a conversion to a small integer (I'm not sure what GHC does here). Then, if the result is nonzero, a new list cell is created holding that Integer, to be consumed by length. That is a lot of work, most of it unnecessarily complicated due to the defaulting of unspecified number types to Integer.
The C code
#include <stdio.h>
int main(void) {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
    return 0;
}
(I took the liberty of fixing the return type of main) does much, much less. It takes an int, compares it to another; if smaller, it takes the bitwise and of the first int with 3 (1), shifts the int 1 to the left by the appropriate number of bits, takes the bitwise and of that and the first int, and if the result is nonzero increments another int, then increments the first. Those are all machine ops, working on raw machine types.
If we translate that code to Haskell,
module Main (main) where
import Data.Bits
maxNum :: Int
maxNum = 123456789
loop :: Int -> Int -> Int
loop acc i
  | i < maxNum = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i+1)
  | otherwise = acc
main :: IO ()
main = print $ loop 0 0
we get a much closer result:
C, gcc -O3:
count: 30864196
real 0m0.180s
user 0m0.178s
sys 0m0.001s
Haskell, ghc -O2:
30864196
real 0m0.247s
user 0m0.243s
sys 0m0.003s
Haskell, ghc -O2 -fllvm:
30864196
real 0m0.144s
user 0m0.140s
sys 0m0.003s
GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.
Okay, I have done the optimisation of replacing a modulus calculation by a power of two with a bitwise and by hand; GHC's native code generator doesn't do that (yet). So with rem 4 instead of .&. 3, the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does that optimisation itself and produces the same code as with the hand-made optimisation.
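For reference, this is the variant being measured there: the same strict loop, with the shift amount computed via rem 4 instead of the hand-optimised .&. 3 (my own spelling-out of that one-token change, not code from the original answer):
module Main (main) where
import Data.Bits
maxNum :: Int
maxNum = 123456789
loopRem :: Int -> Int -> Int
loopRem acc i
  | i < maxNum = loopRem (if i .&. (1 `shiftL` (i `rem` 4)) /= 0 then acc + 1 else acc) (i+1)
  | otherwise = acc
main :: IO ()
main = print $ loopRem 0 0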
Now, let us turn to gspr's question
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Well, the original code used Integers and lists; llvm doesn't know too well what to do with these, so it can't transform that code into loops. The modified code uses Ints and the vector package, and vector rewrites the code to loops, so llvm does know how to optimise that well, and that shows.
(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a div instruction is faster than a shift.
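For the record, the identity behind that optimisation is easy to check from the Haskell side (a throwaway snippet of mine, not part of the original answer): for non-negative fixed-width integers, rem 4 and .&. 3 agree.
import Data.Bits
main :: IO ()
main = print $ all (\n -> n `rem` 4 == n .&. 3) [0 .. 1000000 :: Int]
-- prints True; the two expressions only differ for negative operands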

Few things beat a hand-written loop with a strict accumulator:
{-# LANGUAGE BangPatterns #-}
import Data.Bits
f :: Int -> Int
f n = g 0 0
  where g !i !s | i <= n = g (i+1) (if i .&. (unsafeShiftL 1 (i `rem` 4)) /= 0 then s+1 else s)
                | otherwise = s
main = print $ f 123456789
In addition to the tricks mentioned so far, this also replaces shift with unsafeShiftL, which doesn't check its argument.
Compiled with -O2 and -fllvm, this is about 13x faster than the original on my machine.
Note: Testing if bit i of x is set can be written more clearly as x `testBit` i. This produces the same assembly as the above.
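Spelled out, the testBit version of that loop looks like this (my own transcription of the note above, assuming the same BangPatterns loop and -O2 -fllvm flags):
{-# LANGUAGE BangPatterns #-}
import Data.Bits
f :: Int -> Int
f n = g 0 0
  where g !i !s | i <= n = g (i+1) (if i `testBit` (i `rem` 4) then s+1 else s)
                | otherwise = s
main :: IO ()
main = print $ f 123456789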

Vector instead of list, fold instead of filter-and-length
Substituting an unboxed vector for the list, and a fold (i.e. incrementing a counter) for the filter-and-length, improves the time significantly for me. Here's what I used:
import qualified Data.Vector.Unboxed as UV
import Data.Bits
foo :: Int
foo = UV.foldl (\s i -> if i .&. (shift 1 (i `rem` 4)) /= 0 then s+1 else s) 0 (UV.enumFromN 0 123456789)
main = print foo
The original code (with two changes though: rem instead of mod as suggested in the comments, and adding an Int to the signature to avoid Integer) gave:
$ time ./orig
30864196
real 0m2.159s
user 0m2.144s
sys 0m0.008s
The modified code above gave:
$ time ./new
30864196
real 0m1.450s
user 0m1.440s
sys 0m0.004s
LLVM
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Original (LLVM):
$ time ./orig-llvm
30864196
real 0m2.047s
user 0m2.036s
sys 0m0.008s
Modified (LLVM):
$ time ./new-llvm
30864196
real 0m0.233s
user 0m0.228s
sys 0m0.004s
For comparison, OP's original C code comes in at 0m0.152s user on my system.
This is all GHC 7.4.1, GCC 4.6.3, and vector 0.9.1. LLVM is either 2.9 or 3.0; I have both but can't seem to figure out which one GHC is actually using.

Try this:
import Data.Bits
main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `rem` 4)) /= 0) [0..123456789::Int]
Without the ::Int, the type defaults to Integer.
rem does the same as mod on positive values, and it is the same as % in C. mod, on the other hand, is mathematically correct on negative values, but is slower (see the quick example below).
int in C is typically 32 bits wide.
Int in Haskell is either 32 or 64 bits wide, like long in C.
Integer is an arbitrary-precision integer: it has no min/max values, and its memory size depends on its value (similar to a string).
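A quick illustration of the rem/mod difference on negative operands (results shown as comments):
main :: IO ()
main = do
  print ((-7) `mod` 4)  -- 1: mod follows the sign of the divisor
  print ((-7) `rem` 4)  -- -3: rem follows the sign of the dividend, like % in C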

Related

unsigned integer division rounding AVR GCC

I am struggling to understand how to divide an unsigned integer by 10 while rounding the result the way a float would round.
uint16_t val = 331 / 10; // two decimal to one decimal places 3.31 to 3.3
uint16_t val = 329 / 10; // two decimal to one decimal places 3.29 to 3.3 (not 2.9)
I would like both of these to round to 33 (the one-decimal-place equivalent of 3.3).
I'm sure the answer would be simple if I were more knowledgeable than I am about how processors perform integer division.
Since integer division truncates the result toward zero, just add half of the divisor, whatever it is. I.e.:
uint16_t val = ((uint32_t)x + 5) / 10; // convert to 32-bit to avoid overflow
or in more general form:
static inline uint16_t divideandround(uint16_t quotient, uint16_t divisor) {
    return ((uint32_t)quotient + (divisor >> 1)) / divisor;
}
If you are sure there will be no 16-bit overflow (i.e. values will always be no more than 65530), you can speed up the calculation by keeping the values 16-bit:
uint16_t val = (x + 5) / 10;
I think I have worked it out. This seems to give me the right answer; please let me know if I am actually wrong and it fails.
uint16_t val = 329;
if (val%10>=5)
{
    val = (val+5)/10;
}
else
{
    val = val/10;
}
You can do it with just one 16-bit divmod operation:
#include <stdint.h>
uint16_t udiv10_round (uint16_t n)
{
    uint16_t q = n / 10;
    uint16_t r = n % 10;
    return r >= 5 ? q + 1 : q;
}
When you are optimizing for size (-Os), avr-gcc will compute both quotient and remainder by means of one library call to __udivmodhi4.
When you are optimizing for speed (-O2), avr-gcc might avoid (1) __udivmodhi4 altogether and instead perform a 16×16=32 multiplication with a factor of 0xcccd, so that the quotient and remainder are easy to compute from the high part of that product.
(1) This happens if the MCU you are compiling for supports MUL. If MUL is not supported, avr-gcc still uses the divmod operation, as a 16×16=32 multiplication would not gain anything.
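To convince yourself that the 0xcccd trick really reproduces division by 10 for every 16-bit input, here is a quick exhaustive check, sketched in Haskell (the language of the main question on this page); the shift count of 19 is my reading of "take the high part of the product", not something taken from avr-gcc's output:
import Data.Bits (shiftR)
import Data.Word (Word16, Word32)

-- 0xCCCD / 2^19 is slightly above 1/10; the excess stays small enough that
-- truncating the product gives exactly n `div` 10 for every 16-bit n.
div10 :: Word16 -> Word16
div10 n = fromIntegral ((fromIntegral n * 0xCCCD :: Word32) `shiftR` 19)

main :: IO ()
main = print $ all (\n -> div10 n == n `div` 10) [minBound .. maxBound :: Word16]
-- prints True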

How to divide int64 in Nim?

How can I divide int64?
let v: int64 = 100
echo v / 10
Error Error: type mismatch: got <int64, int literal(10)>
Full example
import math
proc sec_to_min*(sec: int64): int =
  let min = sec / 60 # <= error
  min.round.to_int
echo 100.sec_to_min
P.S.
And, is there a way to safely cast int64 to int, so the result would be int and not int64, with the check for overflow.
There has already been a bit of discussion about int64 division in this issue, and probably some improvement to the current state can be made. From the above issue:
a good reason for not having float division between int64 in the stdlib is that it may incur a loss of precision, and so the user should explicitly convert int64 to float
still, float division between int types is present in the stdlib
on a 64-bit system int is int64 (and so you have division between int64 on 64-bit systems)
For your use case I think the following (playground) should work (better to use div instead of doing float division and then rounding off):
import math
proc sec_to_min*(sec: int64): int = sec.int div 60
echo 100.sec_to_min
let a = high(int64)
echo a.int # on playground this does not raise error since int is int64
echo a.int32 # this instead correctly raises error
output:
1
9223372036854775807
/usercode/in.nim(9) in
/playground/nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: value out of range: 9223372036854775807 notin -2147483648 .. 2147483647 [RangeError]
P.S.: as you can see above, the standard conversion has range checks
Apparently division between int64 types is terribly dangerous because it invokes an undying horde of bike shedding, but at least you can create your own operator:
proc `/`(x, y: int64): int64 = x div y
let v: int64 = 100
echo v / 10
Or
proc `/`(x, y: int64): int64 = x div y
import math
proc sec_to_min*(sec: int64): int =
  int(sec / 60)
echo 100.sec_to_min
With regards to the int64 to int conversion, I'm not sure that makes much sense since most platforms will run int as an alias of int64. But of course you could be compiling/running on a 32 bit platform, where the loss would be tragic, so you can still do runtime checks:
let a = int64.high
echo "Unsurprising but potentially wrong ", int(a)
proc safe_int(big_int: int64): int =
  if big_int > int32.high:
    raise new_exception(Overflow_error, "Value is too high for 32 bit platforms")
  int(big_int)
echo "Reachable code ", safe_int(int32.high)
echo "Unreachable code ", safe_int(a)
Also, if you are running into confusing minute, hour, day conversions, you might want to look into distinct types to avoid adding months to seconds (or do so in a more safe way).

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end have 1 thread make the total sum.
In pseudo code, I would like to be able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.
To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float4 vectors.
__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe this code also avoids bank-switching issues by reading from local memory essentially sequentially. (Would love confirmation on that from anyone who knows.)
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
    const size_t globalId = get_global_id(0);
    const size_t localId = get_local_id(0);
    target[localId] = A[globalId+AOffset];
    barrier(CLK_LOCAL_MEM_FENCE);
    size_t blockSize = get_local_size(0);
    size_t halfBlockSize = blockSize / 2;
    while (halfBlockSize>0) {
        if (localId<halfBlockSize) {
            target[localId] += target[localId + halfBlockSize];
            if ((halfBlockSize*2)<blockSize) { // uneven block division
                if (localId==0) { // when localID==0
                    target[localId] += target[localId + (blockSize-1)];
                }
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        blockSize = halfBlockSize;
        halfBlockSize = blockSize / 2;
    }
    if (localId==0) {
        output[get_group_id(0)] = target[0];
    }
}
https://pastebin.com/xN4yQ28N
You can use the new work_group_reduce_add() function for sum reduction inside a single work group if you have support for OpenCL C 2.0 features.
A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
    int m = t / 2;
    int n = (t + 1) / 2;
    clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
    clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
    cl_event evFold = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
    clWaitForEvents(1, new cl_event[]{evFold});
    t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"
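To see how the m/n fiddle in the host loop handles non-power-of-two sizes, here is a tiny CPU-side model of the same folding scheme, written in Haskell for consistency with the rest of this page (fold1 and reduceAll are made-up names, not part of the JOCL code):
-- One fold step: the first t `div` 2 elements absorb the top half; with an
-- odd length, the middle element is carried over unchanged, just as in the
-- kernel where only the first m work-items do any work.
fold1 :: [Float] -> [Float]
fold1 xs = zipWith (+) lo (hi ++ repeat 0)
  where n        = (length xs + 1) `div` 2
        (lo, hi) = splitAt n xs

reduceAll :: [Float] -> Float
reduceAll []  = 0
reduceAll [x] = x
reduceAll xs  = reduceAll (fold1 xs)

main :: IO ()
main = print $ reduceAll [1 .. 7]  -- lengths go 7 -> 4 -> 2 -> 1; prints 28.0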

Generating a comprehensive callgraph using GCC & Egypt

I am trying to generate a comprehensive callgraph (complete with low level calls to Linux, runtime, the lot).
I have statically compiled my source files with "-fdump-rtl-expand" and created RTL files, which I passed to a Perl script called Egypt (which I believe uses Graphviz/Dot) and generated a PDF file of the callgraph. This works perfectly, no problems at all.
Except, there are calls being made into some libraries that are getting shown as <built-in>. I was looking to see if there is a way for the callgraph not to print <built-in> and instead show the real calls made into the libraries?
Please let me know if the question is unclear.
http://i.imgur.com/sp58v.jpg
Basically, I am trying to stop the callgraph from generating <built-in> nodes.
Is there a way to do that ?
-------- CODE ---------
#include <cilk/cilk.h>
#include <stdio.h>
#include <stdlib.h>
unsigned long int t0, t5;
unsigned int NOSPAWN_THRESHOLD = 32;
int fib_nospawn(int n)
{
    if (n < 2)
        return n;
    else
    {
        int x = fib_nospawn(n-1);
        int y = fib_nospawn(n-2);
        return x + y;
    }
}
// spawning fibonacci function
int fib(long int n)
{
    long int x, y;
    if (n < 2)
        return n;
    else if (n <= NOSPAWN_THRESHOLD)
    {
        x = fib_nospawn(n-1);
        y = fib_nospawn(n-2);
        return x + y;
    }
    else
    {
        x = cilk_spawn fib(n-1);
        y = cilk_spawn fib(n-2);
        cilk_sync;
        return x + y;
    }
}
int main(int argc, char *argv[])
{
    int n;
    long int result;
    long int exec_time;
    n = atoi(argv[1]);
    NOSPAWN_THRESHOLD = atoi(argv[2]);
    result = fib(n);
    printf("%ld\n", result);
    return 0;
}
I compiled the Cilk Library from source.
I might have found a partial solution to the problem:
You need to pass the following option to egypt
--include-external
This produced a slightly more comprehensive callgraph, although the <built-in> node is still visible:
http://i.imgur.com/GWPJO.jpg?1
Can anyone suggest how I can get more depth in the callgraph?
You can use the GCC VCG Plugin: A gcc plugin, which can be loaded when debugging gcc, to show internal structures graphically.
gcc -fplugin=/path/to/vcg_plugin.so -fplugin-arg-vcg_plugin-cgraph foo.c
The call-graph is the place where data needed for inter-procedural optimization is stored. All data structures are divided into three components: local_info, which is produced while analyzing the function; global_info, which is the result of walking the call-graph globally at the end of compilation; and rtl_info, which is used by the RTL back-end to propagate data from already-compiled functions to their callers.

ThreadDelay Problem in Haskell (GHC) on Ubuntu

I noticed odd behavior with the threadDelay function in GHC.Conc on some of my machines. The following program:
main = do print "start"
          threadDelay (1000 * 1000)
          print "done"
takes 1 second to run, as expected. On the other hand, this program:
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
main = do print "start"
          loop 1000
          print "done"
  where loop :: Int -> IO ()
        loop !n =
          if n == 0
            then return ()
            else do threadDelay 1000
                    loop (n-1)
takes about 10 seconds to run on two of my machines, though on other machines it takes about 1 second, as expected. (I compiled both of the above programs with the '-threaded' flag.) Here is a screen shot from Threadscope showing that there is activity only once every 10 milliseconds:
On the other hand, here is a screenshot from ThreadScope from one of my machines on which the program takes 1 second total:
A similar C program:
#include <unistd.h>
#include <stdio.h>
int main() {
    int i;
    for (i=1; i < 1000; i++) {
        printf("%i\n",i);
        usleep(1000);
    }
    return 0;
}
does the right thing, i.e. running 'time ./a.out' gives output like:
1
2
...
999
real 0m1.080s
user 0m0.000s
sys 0m0.020s
Has anyone encountered this problem before, and if so, how can this be fixed? I am running ghc 7.2.1 for Linux(x86_64) on all of my machines and am running various versions of Ubuntu. It works badly on Ubuntu 10.04.2, but fine on 11.04.
threadDelay is not an accurate timer. It promises that your thread will sleep for at least as long as its argument says it should, but it doesn't promise anything more than that. If you want something to happen periodically, you will have to use something else. (I'm not sure what, but possibly Unix' realtime alarm signal would work for you.)
I suspect you forgot to compile with the '-threaded' option. (I did that once for 6.12.3, and consistently had 30 millisecond thread delays.)
As noted above, threadDelay only makes one guarantee, which is that you'll wait at least as long as you request. Haskell's runtime does not obtain any special cooperation from the OS; other than that, it's best effort from the OS.
It might be worth benchmarking your results for threadDelays. For example:
module Main where
import Control.Concurrent
import Data.Time
time op =
  getCurrentTime >>= \ t0 ->
  op >>
  getCurrentTime >>= \ tf ->
  return $! (diffUTCTime tf t0)
main :: IO ()
main =
  let action tm = time (threadDelay tm) >>= putStrLn . show in
  mapM action [2000,5000,10000,20000,30000,40000,50000] >>
  return ()
On my Windows box, this gives me:
0.0156098s
0.0156098s
0.0156098s
0.0312196s
0.0312196s
0.0468294s
0.0624392s
This suggests the combo of delay and getCurrentTime has a resolution of 15.6 milliseconds. When I loop 1000 times delay 1000, I end up waiting 15.6 seconds, so this is just the minimum wait for a thread.
On my Ubuntu box (11.04, with kernel 2.6.38-11), I get much greater precision (~100us).
It might be that you can avoid the timing problem by keeping the program busier, so we don't context-switch away. Either way, I would suggest you do not use threadDelay for timing, or at least check the time and perform any operations up to the given instant.
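One way to do the "check the time" approach: sleep toward an absolute deadline and re-check the clock, so the per-call slop does not accumulate (a rough sketch of mine; delayUntil is not a library function):
import Control.Concurrent (threadDelay)
import Data.Time (UTCTime, addUTCTime, diffUTCTime, getCurrentTime)

-- Sleep in bounded slices until the wall clock passes the deadline; any
-- oversleeping in one threadDelay call is absorbed by the next time check.
delayUntil :: UTCTime -> IO ()
delayUntil deadline = do
  now <- getCurrentTime
  let remaining = realToFrac (diffUTCTime deadline now) :: Double
  if remaining <= 0
    then return ()
    else do threadDelay (min 100000 (ceiling (remaining * 1e6)))
            delayUntil deadline

main :: IO ()
main = do
  start <- getCurrentTime
  delayUntil (addUTCTime 1 start)  -- about one second after start, regardless of timer resolution
  putStrLn "done"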
Your high-precision sleep via C might work for you, if you are willing to muck with FFI, but the cost is you'll need to use bound threads (at least for your timer).
