Improving Haskell code performance (BangPatterns, LazyByteString)

I've used BangPatterns and lazy ByteString, but I don't know what else to do to improve the performance of this code. Any ideas or suggestions? It's clearly not fast enough, since it exceeds the time limit.
-- Find the sum of all the multiples of 3 or 5 below N
-- Input Format
-- First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -optc-O2 #-}
import qualified Data.ByteString.Lazy as L
import Control.Monad (mapM_)
readInt :: L.ByteString -> Int
readInt !s = L.foldl' (\x c -> 10 * x + fromIntegral c - 48) 0 s
main :: IO ()
main = do
  -- don't need the number of inputs, since it is read lazily.
  -- split input by lines
  (_:ls) <- L.split 10 `fmap` L.getContents
  -- length ls <= 10^5
  mapM_ (print . f . readInt) ls
-- n <= 10^9
f :: Int -> Int
f n = go 0 0
  where
    go !i !a | i == n = a
    go !i !a | i `mod` 3 == 0
               || i `mod` 5 == 0 = go (i+1) (a+i)
    go !i !a = go (i+1) a

danidiaz has already discussed the input and output issue somewhat.
One fast way to produce multiples of 3 or 5 is to use a "wheel" of the sort commonly used for prime sieves.
multiples3or5 = go 0 $ cycle [3,2,1,3,1,2,3]
  where
    go n (x : xs) = n : go (n+x) xs
    go n [] = error "impossible"
In fact, since the circular list never ends, it's cleaner to use a different type. And since you're using Int, it might as well be specialized and unpacked for performance. Note that the UNPACK pragma in this context is not needed for GHC version 7.8 or above.
data IntStream = {-# UNPACK #-} !Int :> IntStream
infixr 5 :>
wheel :: IntStream
wheel = 3 :> 2 :> 1 :> 3 :> 1 :> 2 :> 3 :> wheel
multiples3or5 = go 0 wheel
  where
    go !n (x :> xs) = n : go (n+x) xs
As fgv commented, this is in the nature of an anamorphism. You can see this by writing
multiples3or5 = unfoldr go (0, wheel) where
  go (!n, (x :> xs)) = Just (n, (n+x, xs))
but note that unfoldr did not become efficient enough to be much use for anything until base 4.8, which has not officially been released.
When printing out the results, the system has to divide a lot of things by 10. I don't know if those routines are specially optimized, but I do know that GHC's native code generator does not currently optimize division by a known divisor unless that divisor is a power of 2. So you might find that you can improve performance by using -fllvm, and being careful to use a compatible version of LLVM.
Edit
See Chad Groft's answer for a better way.

Your use of print in the line
mapM_ (print . f . readInt) ls
may be introducing some overhead, because print depends on the Show instance for Int, meaning a conversion to inefficient Strings will take place.
Add the following imports
import qualified Data.ByteString.Builder as BB
import qualified Data.Foldable as F
import Data.List.Split (chunksOf) -- from the "split" package
import System.IO -- for stdout
and try to change that line with something like
let resultList = map (f . readInt) ls
F.mapM_ (BB.hPutBuilder stdout . F.foldMap (\r -> BB.intDec r <> BB.char7 '\n')) (chunksOf 1000 resultList)
that takes chunks of size 1000 from the list of Ints and uses the efficient Builder type and the specialized hPutBuilder function to write them to stdout, one result per line as print did. (On older versions of base, import (<>) from Data.Monoid.)
(I added the chunking because otherwise I feared constructing the Builder would force the whole input list into memory. And we don't want that, because the list is being read lazily.)
I'm not sure if that's the main bottleneck, though.
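As a rough sketch of how the parsing and the Builder output could fit together, here is a pure function that turns the whole input into one Builder. The name respond is made up for illustration, the per-case function is passed in as a parameter, and it assumes GHC 8.4 or later so that (<>) comes from the Prelude. Chunking is left out for brevity.

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as LC
import Data.Maybe (mapMaybe)

-- Hypothetical helper: skip the first line (the test-case count), parse each
-- remaining line as an Int, apply f, and render one result per line.
-- Lines that do not start with a number (e.g. a trailing blank line) are dropped.
respond :: (Int -> Int) -> LC.ByteString -> BB.Builder
respond f input = foldMap line results
  where
    results = map f . mapMaybe (fmap fst . LC.readInt) . drop 1 . LC.lines $ input
    line n  = BB.intDec n <> BB.char7 '\n'

-- Possible wiring (needs System.IO for stdout; `solve` stands for your own function):
--   main = LC.getContents >>= BB.hPutBuilder stdout . respond solve
```

Using LC.readInt instead of a hand-written fold also means an empty last line no longer produces a spurious 0.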

If you're really concerned with efficiency, rethink the algorithm. Your main bottleneck is that you're manually summing a bunch of numbers between 1 and N, which will perform poorly on large N no matter what you do.
Instead, think mathematically. The sum of all multiples of 3 or 5 up to N is almost the sum of all multiples of 3 up to N (call this S_3), plus the sum of all multiples of 5 up to N (call this S_5). I say "almost" because some numbers get double-counted; call their sum T. Now the sum you want is exactly S_3 + S_5 – T, and each term has a nice closed formula (what is it?). Calculating these three numbers is much faster than what you're doing.

Here is the formula, without the "think about it" mentoring:
sumMultiplesOf::Integral n=>n->n->n
sumMultiplesOf k n = d * (1 + d) `div` 2 * k where d = (n - 1) `div` k
sumMultiplesOf3or5::Integral n=>n->n
sumMultiplesOf3or5 n = sumMultiplesOf 3 n + sumMultiplesOf 5 n - sumMultiplesOf 15 n
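If you want to convince yourself that the closed form matches the loop, a small check along these lines might help. It is only a sketch: the names sumMultiplesBelow, closedForm, bruteForce and agreeUpTo are invented here, and everything is specialised to Int (which assumes a 64-bit Int for N up to 10^9).

```haskell
-- Sum of the multiples of k strictly below n: k * (1 + 2 + ... + d),
-- where d = (n - 1) `div` k is how many such multiples there are.
sumMultiplesBelow :: Int -> Int -> Int
sumMultiplesBelow k n = k * (d * (d + 1) `div` 2)
  where d = (n - 1) `div` k

-- Inclusion-exclusion: multiples of 15 are counted by both the 3 and the 5 term.
closedForm :: Int -> Int
closedForm n = sumMultiplesBelow 3 n + sumMultiplesBelow 5 n - sumMultiplesBelow 15 n

-- The O(n) filter-and-sum, kept only as a reference for checking.
bruteForce :: Int -> Int
bruteForce n = sum [i | i <- [1 .. n - 1], i `mod` 3 == 0 || i `mod` 5 == 0]

-- True when both versions agree for every n in [1 .. limit].
agreeUpTo :: Int -> Bool
agreeUpTo limit = all (\n -> closedForm n == bruteForce n) [1 .. limit]
```

For example, closedForm 10 gives 23 (3 + 5 + 6 + 9), and each query now costs a handful of arithmetic operations instead of N loop iterations.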

Related

Implementing an efficient sliding-window algorithm in Haskell

I needed an efficient sliding window function in Haskell, so I wrote the following:
windows n xz@(x:xs)
  | length v < n = []
  | otherwise = v : windows n xs
  where
    v = take n xz
My problem with this is that I think the complexity is O(n*m), where m is the length of the list and n is the window size. You count down the list once for take and another time for length, and you do this essentially m-n times as you walk down the list. It seems like it can be more efficient than this, but I'm at a loss for how to make it more linear. Any takers?

You can't get better than O(m*n), since this is the size of the output data structure.
But you can avoid checking the lengths of the windows if you reverse the order of operations: First create n shifted lists and then just zip them together. Zipping will get rid of those that don't have enough elements automatically.
import Control.Applicative
import Data.Traversable (sequenceA)
import Data.List (tails)
transpose' :: [[a]] -> [[a]]
transpose' = getZipList . sequenceA . map ZipList
Zipping a list of lists is just a transposition, but unlike transpose from Data.List it throws away outputs that would have less than n elements.
Now it's easy to make the window function: Take m lists, each shifted by 1, and just zip them:
windows :: Int -> [a] -> [[a]]
windows m = transpose' . take m . tails
Works also for infinite lists.

You can use Seq from Data.Sequence, which has O(1) enqueue and dequeue at both ends:
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import Data.Sequence ((|>))
windows :: Int -> [a] -> [[a]]
windows n0 = go 0 Seq.empty
  where
    go n s (a:as) | n' <  n0 = go n' s' as
                  | n' == n0 = toList s' : go n' s' as
                  | otherwise = toList s'' : go n s'' as
      where
        n' = n + 1 -- O(1)
        s' = s |> a -- O(1)
        s'' = Seq.drop 1 s' -- O(1)
    go _ _ [] = []
Note that if you materialize the entire result your algorithm is necessarily O(N*M) since that is the size of your result. Using Seq just improves performance by a constant factor.
Example use:
>>> windows 3 [1..5]
[[1,2,3],[2,3,4],[3,4,5]]

First let's get the windows without worrying about the short ones at the end:
import Data.List (tails)
windows' :: Int -> [a] -> [[a]]
windows' n = map (take n) . tails
> windows' 3 [1..5]
[[1,2,3],[2,3,4],[3,4,5],[4,5],[5],[]]
Now we want to get rid of the short ones without checking the length of every one.
Since we know they are at the end, we could lose them like this:
windows n xs = take (length xs - n + 1) (windows' n xs)
But that's not great since we still go through xs an extra time to get its length. It also doesn't work on infinite lists, which your original solution did.
Instead let's write a function for using one list as a ruler to measure the amount to take from another:
takeLengthOf :: [a] -> [b] -> [b]
takeLengthOf = zipWith (flip const)
> takeLengthOf ["elements", "get", "ignored"] [1..10]
[1,2,3]
Now we can write this:
windows :: Int -> [a] -> [[a]]
windows n xs = takeLengthOf (drop (n-1) xs) (windows' n xs)
> windows 3 [1..5]
[[1,2,3],[2,3,4],[3,4,5]]
Works on infinite lists too:
> take 5 (windows 3 [1..])
[[1,2,3],[2,3,4],[3,4,5],[4,5,6],[5,6,7]]
As Gabriella Gonzalez says, the time complexity is no better if you want to use the whole result. But if you only use some of the windows, we now manage to avoid doing the work of take and length on the ones you don't use.

If you want O(1) length then why not use a structure that provides O(1) length? Assuming you aren't looking for windows from an infinite list, consider using:
import qualified Data.Vector as V
import Data.Vector (Vector)
import Data.List(unfoldr)
windows :: Int -> [a] -> [[a]]
windows n = map V.toList . unfoldr go . V.fromList
  where
    go xs | V.length xs < n = Nothing
          | otherwise =
              let (a,b) = V.splitAt n xs
              in Just (a,b)
Conversion of each window from a vector to a list might cost you some; I won't hazard an optimistic guess there, but I will bet that the performance is better than the list-only version.

For the sliding window I also used unboxed Vectors, since length, take, drop, and splitAt are all O(1) operations on them.
The code from Thomas M. DuBuisson produces windows shifted by n, not a sliding window (except when n = 1). A (++) is needed to rebuild each window, but it has a cost of O(n+m), so be careful where you put it.
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector)
import Data.List (unfoldr)
windows :: Int -> Vector Double -> [[Double]]
windows n = unfoldr go
  where
    go !xs | V.length xs < n = Nothing
           | otherwise =
               let (a,b) = V.splitAt 1 xs
                   c = V.toList a ++ V.toList (V.take (n-1) b)
               in Just (c,b)
I tried it out with +RTS -sstderr and:
putStrLn $ show (sum $ concat $ windows 10 (V.fromList [1..1000000]))
and got real time 1.051s and 96.9% usage, keeping in mind that after the sliding window two O(m) operations are performed.
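Since slicing an unboxed vector is O(1), one way to avoid the (++) entirely is to keep each window as a slice of the original vector. This is only a sketch under that assumption; slidingWindows is a name made up here, and it requires the vector package.

```haskell
import qualified Data.Vector.Unboxed as V

-- All length-n windows of a vector. Each window is an O(1) slice that shares
-- the original buffer, so no elements are copied until a window is consumed.
slidingWindows :: V.Unbox a => Int -> V.Vector a -> [V.Vector a]
slidingWindows n v
  | n <= 0 || n > V.length v = []
  | otherwise = [V.slice i n v | i <- [0 .. V.length v - n]]
```

Converting every window with V.toList brings back the O(n*m) cost, so this only pays off if the windows are consumed as vectors (for example with V.sum).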

Project euler 10 - [haskell] Why so inefficient?

Alright, so I've picked up Project Euler where I left off when using Java, and I'm at problem 10. I use Haskell now, and I figured it'd be a good way to learn some Haskell since I'm still very much a beginner.
http://projecteuler.net/problem=10
My friend, who still codes in Java, came up with a very straightforward way to implement the Sieve of Eratosthenes:
http://puu.sh/5zQoU.png
I tried implementing a better-looking (and what I thought was going to be a slightly more efficient) Haskell function to find all primes up to 2,000,000.
I came up with this very elegant, yet apparently enormously inefficient function:
primeSieveV2 :: [Integer] -> [Integer]
primeSieveV2 [] = []
primeSieveV2 (x:xs) = x:primeSieveV2( (filter (\n -> ( mod n x ) /= 0) xs) )
Now I'm not sure why my function is so much slower than his (he claims his runs in 5 ms). If anything, mine should be faster, since I only check composites once (they are removed from the list when they are found), whereas his checks them as many times as they can be formed.
Any help?

You don't actually have a sieve here. In Haskell you could write a sieve as
import Data.Vector.Unboxed hiding (forM_)
import Data.Vector.Unboxed.Mutable
import Control.Monad.ST (runST)
import Control.Monad (forM_, when)
import Prelude hiding (read)
sieve :: Int -> Vector Bool
sieve n = runST $ do
  vec <- new (n + 1) -- Create the mutable vector
  set vec True -- Set all the elements to True
  write vec 0 False -- 0 and 1 are not prime
  write vec 1 False
  forM_ [2..n] $ \ i -> do -- Loop for i from 2 to n
    val <- read vec i -- read the value at i
    when val $ -- if the value is true, set all its multiples to false
      forM_ [2*i, 3*i .. n] $ \j -> write vec j False
  freeze vec -- return the immutable vector
main = print . ifoldl' summer 0 $ sieve 2000000
  where summer s i b = if b then i + s else s
This "cheats" by using a mutable unboxed vector, but it's pretty darn fast
$ ghc -O2 primes.hs
$ time ./primes
142913828922
real: 0.238 s
This is about 5x faster than my benchmarking of augustss's solution.

To actually implement the sieve efficiently in Haskell you probably need to do it the Java way (i.e., allocate a mutable array and modify it).
For just generating primes I like this:
primes = 2 : filter (isPrime primes) [3,5 ..]
where isPrime (p:ps) x = p*p > x || x `rem` p /= 0 && isPrime ps x
And then you can print the sum of all primes < 2,000,000
main = print $ sum $ takeWhile (< 2000000) primes
You can speed it up by adding a type signature primes :: [Int].
But it works well with Integer too, and that also gives you the correct sum (which a 32-bit Int will not).
See The Genuine Sieve of Eratosthenes for more information.

The time complexity of your code is n^2 (in n primes produced). It is impractical to run for producing more than the first 10,000 to 20,000 primes.
The main problem with that code is not that it uses rem but that it starts its filters prematurely, so creates too many of them. Here's how you fix it, with a small tweak:
{-# LANGUAGE PatternGuards #-}
primes = 2 : sieve primes [3..]
sieve (p:ps) xs | (h,t) <- span (< p*p) xs = h ++ sieve ps [x | x <- t, rem x p /= 0]
-- sieve ps (filter (\x->rem x p/=0) t)
main = print $ sum $ takeWhile (< 100000) primes
This improves the time complexity by about n^(1/2) (in n primes produced) and gives it a drastic speedup: it gets to 100,000 75x faster. Your 28 seconds should become ~ 0.4 sec. But you probably tested it in GHCi as interpreted code, not compiled. Marking it (1) as :: [Int] and compiling with the -O2 flag gives it another ~ 40x speedup, so it'll be ~ 0.01 sec. To reach 2,000,000 with this code takes ~ 90x longer, for a whopping ~ 1 sec of projected run time.
(1) Be sure to use sum $ map (fromIntegral :: Int -> Integer) $ takeWhile ... in main.
see also: http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth
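The empirical order of growth from that link can be estimated from two timings. A minimal sketch follows; the function name empiricalOrder is made up here, and it simply assumes the run time behaves like t ~ n^b between the two measurements.

```haskell
-- Empirical order of growth: the exponent b in t ~ n^b, estimated from two
-- measurements (n1, t1) and (n2, t2) of problem size and run time.
empiricalOrder :: (Double, Double) -> (Double, Double) -> Double
empiricalOrder (n1, t1) (n2, t2) = logBase (n2 / n1) (t2 / t1)
```

For instance, measured against the upper limit (not the number of primes produced), going from 100,000 to 2,000,000 is 20x the range for roughly 90x the time, and empiricalOrder (100000, 1) (2000000, 90) comes out at about 1.5.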

Stack space overflow error while running recursive compiled code. Fine tune algorithm, or give more resources?

I am working on Project Euler #14 and have a solution that gets the answer, but I am getting a stack space overflow error when I try to run the code. The algorithm works OK in the interactive GHCi (on low numbers), but won't work when I throw a really big number at it and try to compile it.
Here is a rough idea of what it does in the interactive GHCI. It takes about 10 seconds to calculate "answer 50000" on my computer.
After letting GHCI run the problem for a few minutes, it spits out the correct answer.
*Euler System.IO> answer 1000000
(525,837799)
But that doesn't solve the stack overflow error when compiling the program to run natively.
*Euler System.IO> answer 10
(20,9)
*Euler System.IO> answer 100
(119,97)
*Euler System.IO> answer 1000
(179,871)
*Euler System.IO> answer 10000
(262,6171)
*Euler System.IO> answer 50000
(324,35655)
What should I do to get the answer for "answer 1000000"? I imagine my algorithm needs to be fine-tuned a bit, but I have no idea how to go about doing that.
Code:
module Main
where
import System.IO
import Control.Monad
main = print (answer 1000000)
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = length $ game n
-- Find the maximum chain value of the game
answer n = maximum $ map count' [1..n]
-- Working game.
-- game 13 = [13,40,20,10,5,16,8,4,2,1]
game n = n : play n
play x
  | x <= 0 = [] -- is negative or 0
  | x == 1 = [] -- is 1
  | even x = doEven x : play ((doEven x)) -- even
  | otherwise = doOdd x : play ((doOdd x)) -- odd
  where doOdd x = (3 * x) + 1
        doEven x = (x `div` 2)

The problem here is that maximum is too lazy. Instead of keeping track of the largest element as it goes along, it builds up a huge tree of max thunks. This is because maximum is defined in terms of foldl, so the evaluation goes as follows:
maximum [1, 2, 3, 4, 5]
foldl max 1 [2, 3, 4, 5]
foldl max (max 1 2) [3, 4, 5]
foldl max (max (max 1 2) 3) [4, 5]
foldl max (max (max (max 1 2) 3) 4) [5]
foldl max (max (max (max (max 1 2) 3) 4) 5) []
max (max (max (max 1 2) 3) 4) 5 -- this expression will be huge for large lists
Trying to evaluate too many of these nested max calls causes a stack overflow.
The solution is to force it to evaluate these as it goes along by using the strict version foldl', (or, in this case, its cousin foldl1'). This prevents the max's from building up by reducing them at each step:
foldl1' max [1, 2, 3, 4, 5]
foldl' max 1 [2, 3, 4, 5]
foldl' max 2 [3, 4, 5]
foldl' max 3 [4, 5]
foldl' max 4 [5]
foldl' max 5 []
5
GHC can often solve these kinds of problems on its own if you compile with -O2 which (among other things) runs a strictness analysis of your program. However, I think it's good practice to write programs that don't need to rely on optimizations to work.
Note: After fixing this, the resulting program is still very slow. You might want to look into using memoization for this problem.
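Applied to the problem at hand, the foldl1' fix could look roughly like the sketch below. The names chainLength and longest are invented for this example, the chain is counted without building a list, and a 64-bit Int is assumed.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl1')

-- Length of the Collatz chain starting at n, counting both n and the final 1.
chainLength :: Int -> Int
chainLength = go 1
  where
    go !acc 1 = acc
    go !acc m
      | even m    = go (acc + 1) (m `div` 2)
      | otherwise = go (acc + 1) (3 * m + 1)

-- (chain length, start) for the longest chain among the starts 1..n.
-- foldl1' reduces each `max` as it goes, so no tower of thunks builds up.
longest :: Int -> (Int, Int)
longest n = foldl1' max [(chainLength k, k) | k <- [1 .. n]]
```

This reproduces the small results from the question, for example longest 10 is (20,9).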

@hammar already pointed out the problem that maximum is too lazy, and how to resolve that (using foldl1', the strict version of foldl1).
But there are further inefficiencies in the code.
cSeq n = length $ game n
cSeq lets game construct a list, only to calculate its length. Unfortunately, length is not a "good consumer", so the construction of the intermediate list is not fused away. That's quite a bit of unnecessary allocation and costs time. Eliminating these lists
cSeq n = coll (1 :: Int) n
  where
    coll acc 1 = acc
    coll acc m
      | even m = coll (acc + 1) (m `div` 2)
      | otherwise = coll (acc + 1) (3*m+1)
cuts down the allocation by something like 65% and the running time by about 20% (still slow). Next point, you're using div, which performs a sign check in addition to the normal division. Since all numbers involved are positive, using quot instead does speed it up a bit more (not much here, but it will become important later).
The next big point is that, since you haven't given type signatures, the type of the numbers (except where it was determined by the use of length or by the expression type signature (1 :: Int) in my rewrite) is Integer. The operations on Integer are considerably slower than the corresponding operations on Int, so if possible, you should use Int (or Word) rather than Integer when speed matters. If you have a 64-bit GHC, Int is sufficient for these computations. With the native code generator, switching to Int reduces the running time by about half when using div and by about 70% when using quot; with the LLVM backend, it reduces the running time by about 70% when using div and by about 95% when using quot.
The difference between the native code generator and the LLVM backend is mostly due to some elementary low-level optimisations.
even and odd are defined
even, odd :: (Integral a) => a -> Bool
even n = n `rem` 2 == 0
odd = not . even
in GHC.Real. When the type is Int, LLVM knows to replace the division by 2 used to determine the modulus with a bitwise and (n .&. 1 == 0). The native code generator does not (yet) do many of these low-level optimisations. If you do that by hand, the code produced by the NCG and the LLVM backend performs nearly identically.
When using div, both, the NCG and LLVM, are not able to replace the division with a short shift-and-add sequence, so you get the relatively slow machine division instruction with the sign-test. With quot, both are able to do that for Int, so you get much faster code.
The knowledge that all occurring numbers are positive allows us to replace the division by 2 with a simple right shift, without any code to correct for negative arguments. That speeds up the code produced by the LLVM backend by another ~33%; oddly, it doesn't make a difference for the NCG.
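The two hand-written replacements can be sketched as follows. The names isEven and halve are only for illustration; the point is that the bit test is valid for every Int, while the shift matches quot only for non-negative arguments.

```haskell
import Data.Bits (shiftR, (.&.))

-- Parity test via the lowest bit; agrees with `even` for every Int,
-- including negative ones (two's complement).
isEven :: Int -> Bool
isEven n = n .&. 1 == 0

-- Halving by an arithmetic right shift. This rounds towards negative
-- infinity, so it agrees with `quot` only when n >= 0.
halve :: Int -> Int
halve n = n `shiftR` 1
```

For example, halve (-3) is -2 while (-3) `quot` 2 is -1, which is why the shift is only safe once you know the numbers are positive.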
So from the original that took eight seconds plus/minus a bit (a little less with the NCG, a little more with the LLVM backend), we've gone to
{-# LANGUAGE BangPatterns #-}
module Main (main)
where
import Data.List
import Data.Bits
main = print (answer (1000000 :: Int))
-- Count the length of the sequences
-- count' creates a tuple with the second value
-- being the starting number of the game
-- and the first value being the total
-- length of the chain
count' n = (cSeq n, n)
cSeq n = go (1 :: Int) n
  where
    go !acc 1 = acc
    go acc m
      | even' m = go (acc+1) (m `shiftR` 1)
      | otherwise = go (acc+1) (3*m+1)
even' :: Int -> Bool
even' m = m .&. 1 == 0
-- Find the maximum chain value of the game
answer n = foldl1' max $ map count' [1..n]
which takes 0.37 seconds with the NCG, and 0.27 seconds with the LLVM backend on my setup.
A minute improvement in running time, but a huge reduction of allocation can be obtained by replacing the foldl1' max with a manual recursion,
answer n = go 1 1 2
  where
    go ml mi i
      | n < i = (ml,mi)
      | l > ml = go l i (i+1)
      | otherwise = go ml mi (i+1)
      where
        l = cSeq i
that makes it 0.35 resp. 0.25 seconds (and produces a tiny 52,936 bytes allocated in the heap).
Now if that is still too slow, you can worry about a good memoisation strategy. The best I know(1) is to use an unboxed array to store the chain lengths for the numbers not exceeding the limit,
{-# LANGUAGE BangPatterns #-}
module Main (main) where
import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
main :: IO ()
main = do
  args <- getArgs
  let bd = case args of
             a:_ -> read a
             _ -> 100000
  print $ mxColl bd
mxColl :: Int -> (Int,Int)
mxColl bd = runST $ do
  arr <- newArray (0,bd) 0
  unsafeWrite arr 1 1
  goColl arr bd 1 1 2
goColl :: STUArray s Int Int -> Int -> Int -> Int -> Int -> ST s (Int,Int)
goColl arr bd ms ml i
  | bd < i = return (ms,ml)
  | otherwise = do
      nln <- collatzLength arr bd i
      if ml < nln
        then goColl arr bd i nln (i+1)
        else goColl arr bd ms ml (i+1)
collatzLength :: STUArray s Int Int -> Int -> Int -> ST s Int
collatzLength arr bd n = go 1 n
  where
    go !l 1 = return l
    go l m
      | bd < m = go (l+1) $ case m .&. 1 of
                              0 -> m `shiftR` 1
                              _ -> 3*m+1
      | otherwise = do
          l' <- unsafeRead arr m
          case l' of
            0 -> do
              l'' <- go 1 $ case m .&. 1 of
                              0 -> m `shiftR` 1
                              _ -> 3*m+1
              unsafeWrite arr m (l''+1)
              return (l + l'')
            _ -> return (l+l'-1)
which does the job for a limit of 1000000 in 0.04 seconds when compiled with the NCG, 0.05 with the LLVM backend (apparently, that is not as good at optimising STUArray code as the NCG is).
If you don't have a 64-bit GHC, you can't simply use Int, since that would overflow then for some inputs.
But the overwhelming part of the computation is still performed in Int range, so you should use that where possible and only move to Integer where required.
switch :: Int
switch = (maxBound - 1) `quot` 3
back :: Integer
back = 2 * fromIntegral (maxBound :: Int)
cSeq :: Int -> Int
cSeq n = goInt 1 n
  where
    goInt acc 1 = acc
    goInt acc m
      | m .&. 1 == 0 = goInt (acc+1) (m `shiftR` 1)
      | m > switch = goInteger (acc+1) (3*toInteger m + 1)
      | otherwise = goInt (acc+1) (3*m+1)
    goInteger acc m
      | fromInteger m .&. (1 :: Int) == 1 = goInteger (acc+1) (3*m+1)
      | m > back = goInteger (acc+1) (m `quot` 2) -- yup, quot is faster than shift for Integer here
      | otherwise = goInt (acc + 1) (fromInteger $ m `quot` 2)
makes it harder to optimise the loop(s), so it is slower than the single loop using Int, but still decent. Here (where the Integer loop is never run), it takes 0.42 seconds with the NCG and 0.37 with the LLVM backend (which is pretty much the same as using quot in the pure Int version).
Using a similar trick for the memoised version has similar consequences, it's considerably slower than the pure Int version, but still blazingly fast compared to unmemoised versions.
(1) For this special (type of) problem, where you need to memoise the results for a contiguous range of arguments. For other problems, a Map or some other data structure will be the better choice.

It seems that the maximum function is the culprit as already pointed out, but you shouldn't have to worry about it if you compile your program with the -O2 flag.
The program is still quite slow; this is because the problem is supposed to teach you about memoization. One good way of doing this in Haskell is by using Data.MemoCombinators:
import Data.MemoCombinators
import Control.Arrow
import Data.List
import Data.Ord
import System.Environment
play m = maximumBy (comparing snd) . map (second threeNPuzzle) $ zip [1..] [1..m]
  where
    threeNPuzzle = arrayRange (1,m) memoized
    memoized n
      | n == 1 = 1
      | odd n = 1 + threeNPuzzle (3*n + 1)
      | even n = 1 + threeNPuzzle (n `div` 2)
main = getArgs >>= print . play . read . head
The above program runs in under a second when compiled with -O2 on my machine.
Note that in this case it is not a good idea to memoize all values found by threeNPuzzle, the program above memoizes the ones up until the limit (1000000 in the problem).
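If you would rather not pull in an extra package, the same bounded memoization can be sketched with a lazy boxed array from Data.Array. This is a different technique from the memo combinators above, not a drop-in equivalent; the names chainLengths and longestChain are made up, limit is assumed to be at least 1, and a 64-bit Int is assumed.

```haskell
import Data.Array (Array, assocs, listArray, (!))
import Data.List (foldl1')

-- Chain lengths for the starts 1..limit, memoised in a lazy boxed array.
-- Each cell is a thunk that refers to other cells, so every length is
-- computed at most once. Values above the limit are followed without caching.
chainLengths :: Int -> Array Int Int
chainLengths limit = table
  where
    table = listArray (1, limit) (map start [1 .. limit])
    start 1 = 1
    start n = 1 + walk (step n)
    walk m
      | m <= limit = table ! m
      | otherwise  = 1 + walk (step m)
    step m
      | even m    = m `div` 2
      | otherwise = 3 * m + 1

-- (chain length, start) for the longest chain among the starts 1..limit.
longestChain :: Int -> (Int, Int)
longestChain limit = foldl1' max [(len, k) | (k, len) <- assocs (chainLengths limit)]
```

A boxed lazy array uses far more memory than the unboxed STUArray version, so treat this as the simple option rather than the fast one.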

Slowdown when using parallel strategies in Haskell

I was working through the exercises in Andres Löh's Deterministic Parallel Programming in Haskell. I was trying to convert the sequential N-Queens code into parallel code by using strategies, but I noticed that the parallel code runs much slower than the sequential code and also errors out with insufficient stack space.
This is the code for the parallel N-Queens,
import Control.Monad
import System.Environment
import GHC.Conc
import Control.Parallel.Strategies
import Data.List
import Data.Function
type PartialSolution = [Int] -- per column, list the row the queen is in
type Solution = PartialSolution
type BoardSize = Int
chunk :: Int -> [a] -> [[a]]
chunk n [] = []
chunk n xs = case splitAt n xs of
  (ys, zs) -> ys : chunk n zs
-- Generate all solutions for a given board size.
queens :: BoardSize -> [Solution]
--queens n = iterate (concatMap (addQueen n)) [[]] !! n
queens n = iterate (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) [[]] !! n
-- Given the size of the problem and a partial solution for the
-- first few columns, find all possible assignments for the next
-- column and extend the partial solution.
addQueen :: BoardSize -> PartialSolution -> [PartialSolution]
addQueen n s = [ x : s | x <- [1..n], safe x s 1 ]
-- Given a row number, a partial solution and an offset, check
-- that a queen placed at that row threatens no queen in the
-- partial solution.
safe :: Int -> PartialSolution -> Int -> Bool
safe x [] n = True
safe x (c:y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)
main = do
  [n] <- getArgs
  print $ length $ queens (read n)
The line (\l -> concat (map (addQueen n) l `using` parListChunk (n `div` numCapabilities) rdeepseq)) is what I changed from the original code. I have seen Simon Marlow's solution, but I wanted to know the reason for the slowdown and error in my code.
Thanks in advance.

You are sparking way too much work. The parListChunk parameter of div n numCapabilities is probably, what, 7 on your system (2 cores and you're running with n ~ 14). The list is going to grow large very quickly so there is no point in sparking such small units of work (and I don't see why it makes sense tying it to the value of n).
If I add a factor of ten (making the sparking unit 70 in this case) then I get a clear performance win over single threading. Also, I don't have the stack issue you refer to - if it goes away with a change to your parListChunk value then I'd report that as a bug.
If I make the chunking every 800 then the times top off at 5.375s vs 7.9s. Over 800 and the performance starts to get worse again, ymmv.
EDIT:
[tommd@mavlo Test]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.4
[tommd@mavlo Test]$ ghc -O2 so.hs -rtsopts -threaded -fforce-recomp ; time ./so 13 +RTS -N2
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m5.404s
[tommd@mavlo Test]$ ghc -O2 so.hs -rtsopts -fforce-recomp ; time ./so 13
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
73712
real 0m8.134s
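To experiment with the chunk size, it may be easier to make it an explicit parameter instead of deriving it from n. A sketch along those lines, reusing addQueen and safe from the question; queensPar and extendPar are names invented here, and it needs the parallel package plus -threaded and +RTS -N to actually use several cores.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Copied from the question: extend a partial solution by one column.
addQueen :: Int -> [Int] -> [[Int]]
addQueen n s = [x : s | x <- [1 .. n], safe x s 1]

safe :: Int -> [Int] -> Int -> Bool
safe _ [] _ = True
safe x (c : y) n = x /= c && x /= c + n && x /= c - n && safe x y (n + 1)

-- One round of extension. Sparks are created per chunk of `chunkSize`
-- partial solutions, so a larger chunkSize means fewer, bigger units of work.
extendPar :: Int -> Int -> [[Int]] -> [[Int]]
extendPar chunkSize n partials =
  concat (map (addQueen n) partials `using` parListChunk chunkSize rdeepseq)

-- All solutions for an n x n board, with a caller-chosen chunk size.
queensPar :: Int -> Int -> [[Int]]
queensPar chunkSize n = iterate (extendPar chunkSize n) [[]] !! n
```

With this you can try values such as 70 or 800 from the command line and see where the times top off on your machine.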

How do you save a tree data structure to binary file in Haskell

I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:
-- For simplicity assume each Node has only 4 children
data Tree = Node [Tree] | Leaf [Int]
And here is how I need the data to look on disk:
1. Each node starts with four 32-bit offsets to its children, followed by the children themselves.
2. I don't care much about the leaves; let's say each is just n consecutive 32-bit numbers.
3. For practical purposes I would need some node labels or some other additional data, but right now I don't care much about that either.
It appears to me that Haskellers' first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem with bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.
This is not something that Data.Binary.Put provides, so I thought this must be a perfect application of monad transformers. But even though it sounds cool and functional, so far I have not been successful with this approach.
I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further, but unfortunately I am still unable to solve the problem as a whole.
Here is what I've got so far; it still leaks too much memory to be practical.
I would love to have a solution that uses such a functional approach, but would be grateful for any other solution as well.

Here is an implementation of the two-pass solution proposed by sclv.
import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')
data Tree = Node [Tree] | Leaf [Word32] deriving Show
makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1
SizeTree mimics the original Tree: it does not contain data, but at each node it stores the size of the corresponding child in Tree.
We need to have SizeTree in memory, so it is worth making it more compact (e.g. replacing Ints with unboxed words).
data SizeTree
  = SNode {sz :: Int, chld :: [SizeTree]}
  | SLeaf {sz :: Int}
  deriving Show
With SizeTree in memory it is possible to serialize the original Tree in a streaming fashion.
putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
  putWord8 $ fromIntegral $ length xs -- number of children
  mapM_ (putWord32be . fromIntegral . sz) ys -- sizes of children
  sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
  putWord8 0 -- zero means 'leaf'
  putWord32be $ fromIntegral $ length xs -- data length
  mapM_ putWord32be xs -- leaf data
mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
  where
    ys = map mkSizeTree xs
    sum' = foldl' (+) 0
It is important to prevent GHC from merging the two passes into one (in which case it would hold the whole tree in memory).
Here that is done by feeding the function a tree generator rather than a tree.
serialize mkTree size = runPut $ putTree (mkTree size) treeSize
  where
    treeSize = mkSizeTree $ mkTree size
main = L.writeFile "dump.bin" $ serialize makeTree 10

There are two basic approaches I would consider. If the entire serialized structure will easily fit into memory, you can serialize each node into a lazy bytestring and just use the lengths for each of them to calculate the offset from the current position.
serializeTree (Leaf nums) = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
  where
    childBs = map serializeTree subtrees
    offsets = scanl (\acc bs -> acc+L.length bs) (fromIntegral $ 2*length subtrees) childBs
    header = runPut (mapM_ putInt32 $ init offsets)
The other option is, after serializing a node, go back and re-write the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports this. It would involve working in IO and seeking to the correct locations.

What I think you want is an explicit two pass solution. The first converts your tree into a size annotated tree. This pass forces the tree, but can be done, in fact, without any monadic machinery at all by tying the knot. The second pass is in the plain old Put monad, and given that the size annotations are already calculated, should be very straightforward.

Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 Mbytes and then hangs on to that for the rest of the execution.
Note that I haven't tried reading the data back, so there may be lurking errors in my size and offset calculations.
-- Paste this into TreeBinary.hs, and compile with
-- ghc -O2 --make TreeBinary.hs -o TreeBinary
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B
import Data.List (init)
import Data.Monoid
import Data.Word
-- -------------------------------------------------------------------
-- Test data.
data Tree = Node [Tree] | Leaf [Word32] deriving Show
-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)
-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int. This is an example of lazy programming instead
-- of lazy evaluation.
makeTree :: Tree
makeTree = makeTree1 9
  where makeTree1 0 = Leaf [0..100]
        makeTree1 n = Node [ makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1
                           , makeTree1 $ n - 1 ]
-- --------------------------------------------------------------------
-- The actual serialisation code.
-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches),
                           baseLength + sum subLengths)
  where
    (branches, subLengths) = unzip $ map serialiseTree ts
    baseLength = fromIntegral $ 1 + 4 * length ts
    offsets = init $ scanl (+) baseLength subLengths
main = do
  putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
  BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree
