No speedup with naive merge sort parallelization in Haskell

No speedup with naive merge sort parallelization in Haskell - haskell

Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performanc gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)
instance NFData a => NFData (Tree a) where
rnf (Leaf v) = deepseq v ()
rnf (Node x y) = deepseq (x, y) ()
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
splitAt (length xs `div` 2) xs
-- mergeSort' :: Ord a => Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
xr <- strat $ runEval $ mergeSort' (l - 1) x
yr <- rseq $ runEval $ mergeSort' (l - 1) y
rdeepseq (merge xr yr)
where
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
strat | l > 0 = rpar
| otherwise = rseq
mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- some constant factor order of n.
Results
Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 # 3.33GHz show little improvement
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data. (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 cpus half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot, 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
Is there a better way (arrows, applicative) to express parallelism?

The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:
do xr <- rpar $ runEval $ mergeSort' x
yr <- rseq $ runEval $ mergeSort' y
rseq (merge xr yr)
This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.
Standard tips also kind-of apply:
The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use lazyness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).
So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
do xr <- spawn $ mergeSort' x
yr <- spawn $ mergeSort' y
merge <$> get xr <*> get yr
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.

I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.
import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
where half = length xs `div` 2
-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)
mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree
-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
yr = mergeSortP' y
in xr `par` yr `pseq` merge xr yr
mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree
-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) =
runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)
mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree
-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t#(Node x y)
| n <= 1 = mergeSort' t
| otherwise = let xr = smartMerge' (n-1) x
yr = smartMerge' (n-2) y
in xr `par` yr `pseq` merge xr yr
smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree
main = defaultMain $ [ bench "original" $ nf mergeSort lst
, bench "par" $ nf mergeSortP lst
, bench "rpar" $ nf mergeSortR lst
, bench "smart" $ nf smartMerge lst ]
where lst = [100000,99999..0] :: [Int]

Related

Haskell bottom-up DP space leak

Apologies if this is too specific, I am new here and not exactly sure what is reasonable. I have been bashing my head against this problem for hours with nothing to show for it. The following code is my implementation of a competitive programming problem.
module Main where
import Data.List (foldl', groupBy)
import Debug.Trace
type Case = (Int, [(Int, Int)])
type Soln = Int
main = interact handle
handle :: String -> String
handle = fmt . solve . parse
fmt :: Soln -> String
fmt s = (show s) ++ "\n"
parse :: String -> Case
parse s = (l, fs)
where
(l:_:fs') = (map read) $ words s
fs = pairs fs'
pairs :: [a] -> [(a, a)]
pairs [] = []
pairs (a:b:s) = (a, b):(pairs s)
solve :: Case -> Soln
solve c#(l, fs) = last $ foldl' run [0..l] f
where
f = concat $ map rep $ map combine $ groupBy samev fs
samev a b = (snd a) == (snd b)
combine a = (sum $ map fst $ a, snd $ head $ a)
rep (n, v) = replicate (min n (l `div` v)) v
run :: [Int] -> Int -> [Int]
run b v = (take v b) ++ (zipWith min b (drop v b))
-- run b v = (take v b) ++ (zipMin b (drop v b))
zipMin :: [Int] -> [Int] -> [Int]
zipMin [] _ = []
zipMin _ [] = []
zipMin (a:as) (b:bs) = (min a b):(zipMin as bs)
The intent is that this works like a bottom-up dynamic programming solution generating each row of the DP table from the previous using the fold in solve. In theory GHC should be able to optimize out all the old rows of the table. However, running this program on a moderately large input with approximately l = 2000 and length f = 5000, I get this:
> time ./E < E.in
0
1.98user 0.12system 0:02.10elapsed 99%CPU (0avgtext+0avgdata 878488maxresident)k
0inputs+0outputs (0major+219024minor)pagefaults 0swaps
That's using 878 MB of memory at peak! The table I am generating is only 10,000 Ints, and in theory I only need one row at a time! It seems obvious that this is some form of thunk leak or other space leak. Profiling reveals that run is consuming 99% of total runtime and memory. Digging further indicated that 98% of that was in the zipWith call. Interestingly, replacing the call to zipWith min with my custom zipMin function produces a significant improvement:
> time ./E < E.in
0
1.39user 0.08system 0:01.48elapsed 99%CPU (0avgtext+0avgdata 531400maxresident)k
0inputs+0outputs (0major+132239minor)pagefaults 0swaps
That's just 70% the run time, and 60% the memory! I tried all sorts to make this work. I know (++) is generally a bad idea, so I replaced the lists in run with Data.Sequence.Seq Int... and it got slower and used more memory. I am not particularly experienced with Haskell, but I am at my wit's end here. I am sure the answer to this problem exists somehwere on SO, but I am too new to Haskell to be able to find it, it seems.
Any help any of you can offer is very much appreciated. I would love an explanation of exactly what I have done wrong, how to diagnose it in future, and how to fix it.
Thanks in advance.
EDIT:
After following Steven's Excellent advice and replacing my lists with unboxed vectors the performance is... uh... signficantly improved:
> time ./E < E.in
0
0.01user 0.00system 0:00.02elapsed 80%CPU (0avgtext+0avgdata 5000maxresident)k
24inputs+0outputs (0major+512minor)pagefaults 0swaps

So, by using foldl' you have ensured that the accumulator will be in WHNF. Putting a list in WHNF only evaluates the first element of the list. The remainder of the list exists as a thunk, and will be passed around as a thunk to the subsequent calls of your fold. Since you are interested in multiple positions in the list at once (that is, you are dropping some parts of it in the zipWith) large portions of the lists are being kept from previous iterations.
The structure you need here is an unboxed vector. An unboxed vector will ensure that everything is maximally strict, and will run in far less memory.

Pseudo-quicksort time complexity

I know that quicksort has O(n log n) average time complexity. A pseudo-quicksort (which is only a quicksort when you look at it from far enough away, with a suitably high level of abstraction) that is often used to demonstrate the conciseness of functional languages is as follows (given in Haskell):
quicksort :: Ord a => [a] -> [a]
quicksort [] = []
quicksort (p:xs) = quicksort [y | y<-xs, y<p] ++ [p] ++ quicksort [y | y<-xs, y>=p]
Okay, so I know this thing has problems. The biggest problem with this is that it does not sort in place, which is normally a big advantage of quicksort. Even if that didn't matter, it would still take longer than a typical quicksort because it has to do two passes of the list when it partitions it, and it does costly append operations to splice it back together afterwards. Further, the choice of the first element as the pivot is not the best choice.
But even considering all of that, isn't the average time complexity of this quicksort the same as the standard quicksort? Namely, O(n log n)? Because the appends and the partition still have linear time complexity, even if they are inefficient.

This "quicksort" is actually deforested tree sort:
http://www.reddit.com/r/programming/comments/2h0j2/real_quicksort_in_haskell
data Tree a = Leaf | Node a (Tree a) (Tree a)
mkTree [] = Leaf
mkTree (x:xs) = Node x (mkTree (filter (<= x) xs)) (mkTree (filter (x <) xs))
Binary tree is unbalanced, so O(N^2) worst-case and O(N*Log N) average-case complexity for building search tree.
foldTree f g Leaf = g
foldTree f g (Node x l r) = f x (foldTree f g l) (foldTree f g r)
treeSort l = foldTree (\x lft rht -> lft++[x]++rht) [] (mkTree l)
Retrieval algorithm have O(N^2) worst-case and O(N*Log N) average-case complexity.
Well-balanced:
Prelude> let rnds = iterate step where step x = (75*x) `mod` 65537
Prelude> length . quicksort . take 4000 . rnds $ 1
4000
(0.08 secs, 10859016 bytes)
Prelude> length . quicksort . take 8000 . rnds $ 1
8000
(0.12 secs, 21183208 bytes)
Prelude> length . quicksort . take 16000 . rnds $ 1
16000
(0.25 secs, 42322744 bytes)
Not-so-well-balanced:
Prelude> length . quicksort . map (`mod` 10) $ [1..4000]
4000
(0.62 secs, 65024528 bytes)
Prelude> length . quicksort . map (`mod` 10) $ [1..8000]
8000
(2.45 secs, 241906856 bytes)
Prelude> length . quicksort . map (`mod` 10) $ [1..16000]
16000
(9.52 secs, 941667704 bytes)

I agree with your assumption that the average time complexity still is O(n log n). I'm not an expert and 100% sure, but these are my thoughts:
This is a pseudo code of the in-place quicksort: (call quicksort with l=1 and r=length of the array)
Quicksort(l,r)
--------------
IF r-l>=1 THEN
choose pivot element x of {x_l,x_l+1,...,x_r-1,x_r}
order the array-segment x_l,...x_r in such a way that
all elements < x are on the left side of x // line 6
all elements > x are on the right side of x // line 7
let m be the position of x in the 'sorted' array (as said in the two lines above)
Quicksort(l,m-1);
Quicksort(m+1,r)
FI
The average time complexity analysis then reasons by selecting the "<"-comparisons in line 6 and 7 as the dominant operation in this algorithm and finally comes to the conclusion that the average time complexity is O(n log n). As the cost of line "order the array-segment x_l,...x_r in such a way that..." are not considered (only the dominant operation is important in time complexity analysis if you want to find bounds), I think "because it has to do two passes of the list when it partitions it" is not a problem, also as your Haskell version would just take approximately twice as long in this step. The same holds true for the appendix-operation and I agree with on that this adds nothing to the asymptotic costs:
Because the appends and the partition still have linear time complexity, even if they are inefficient.
For the sake of convenience lets assume that this adds up "n" to our time complexity costs, so that we have "O(n log n+n)". As there exists a natural number o for that n log n > n for all natural numbers greater than o holds true, you can estimate n log n +n to the top by 2 n log n and to the bottom by n log n, therefore n log n+n = O(n log n).
Further, the choice of the first element as the pivot is not the best choice.
I think the choice of the pivot element is irrelevant here, because in the average case analysis you assume uniform distribution of the elements in the array. You can't know from which place in the array you should select it, and you therefore have to consider all these cases in which your pivot-element (independently from which place of the list you take it) is the i-st smallest element of your list, for i=1...r.

I can offer you a run time test on Ideone.com which seems to show more or less linearithmic run-times for both (++) based versions and the one using accumulator technique from the Landei's answer, as well as another one, using one-pass three-way partitioning. On ordered data this turns quadratic or worse for all of them.
-- random: 100k 200k 400k 800k
-- _O 0.35s-11MB 0.85s-29MB 1.80s-53MB 3.71s-87MB n^1.3 1.1 1.0
-- _P 0.36s-12MB 0.80s-20MB 1.66s-45MB 3.76s-67MB n^1.2 1.1 1.2
-- _A 0.31s-14MB 0.62s-20MB 1.58s-54MB 3.22s-95MB n^1.0 1.3 1.0
-- _3 0.20s- 9MB 0.41s-14MB 0.88s-24MB 1.92s-49MB n^1.0 1.1 1.1
-- ordered: 230 460 900 1800
-- _P 0.09s 0.33s 1.43s 6.89s n^1.9 2.1 2.3
-- _A 0.09s 0.33s 1.44s 6.90s n^1.9 2.1 2.3
-- _3 0.05s 0.15s 0.63s 3.14s n^1.6 2.1 2.3
quicksortO xs = go xs where
go [] = []
go (x:xs) = go [y | y<-xs, y<x] ++ [x] ++ go [y | y<-xs, y>=x]
quicksortP xs = go xs where
go [] = []
go (x:xs) = go [y | y<-xs, y<x] ++ (x : go [y | y<-xs, y>=x])
quicksortA xs = go xs [] where
go [] acc = acc
go (x:xs) acc = go [y | y<-xs, y<x] (x : go [y | y<-xs, y>=x] acc)
quicksort3 xs = go xs [] where
go (x:xs) zs = part x xs zs [] [] []
go [] zs = zs
part x [] zs a b c = go a ((x : b) ++ go c zs)
part x (y:ys) zs a b c =
case compare y x of
LT -> part x ys zs (y:a) b c
EQ -> part x ys zs a (y:b) c
GT -> part x ys zs a b (y:c)
The empirical run-time complexities are estimated here as O(n^a) where a = log( t2/t1 ) / log( n2/n1 ). The timings are very approximate as ideone aren't very reliable with occasional far outlyers, but for checking the time complexity it's enough.
Thus these data seem to indicate that one-pass partition is faster by 1.5x-2x than two-pass schemes, and that using (++) is in no way slowing things down - at all. I.e. the "append operations" are not "costly" at all. The quadratic behaviour or (++)/append seems to be an urban myth — in Haskell context of course (edit: ... i.e. in the context of guarded recursion/tail recursion modulo cons; cf. this answer) (update: as user:AndrewC explains, it really is quadratic with the left folding; linear when (++) is used with the right folding; more about this here and here).
later addition: To be stable, the three-way partitioning quicksort version should too build its parts in the top-down manner:
q3s xs = go xs [] where
go (x:xs) z = part x xs go (x:) (`go` z)
go [] z = z
part x [] a b c = a [] (b (c []))
part x (y:ys) a b c =
case compare y x of
LT -> part x ys (a . (y:)) b c
EQ -> part x ys a (b . (y:)) c
GT -> part x ys a b (c . (y:))
(performance not tested).

I don't know how much this improves the runtime complexity, but by using an accumulator you can avoid the expensive (++):
quicksort xs = go xs [] where
go [] acc = acc
go (x:xs) acc = go [y | y<-xs, y<x] (x : go [y | y<-xs, y>=x] acc)

Look here for a true O(n log n) quicksort that will work on both arrays and lists :
http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.23.4398&rep=rep1&type=pdf
It is quite easy to implement in Common Lisp, and it outperforms the sort implementation of many commercial lisps.

Yes, this version has the same asymptotic complexity as the classic version -- you replace the linear-time partition with: two passes (< and >=), and you have the additional linear-time ++ (which includes linear re-allocing/copying). So it's a hefty constant-factor worse than an in-place partition, but it's still linear. All the other aspects of the algorithm are the same, so the same analysis that gives O(n log n) average-case for "true" (i.e. in-place) quicksort still holds here.

Project Euler 23: insight on this stackoverflow-ing program needed

Hi haskell fellows. I'm currently working on the 23rd problem of Project Euler. Where I'm at atm is that my code seems right to me - not in the "good algorithm" meaning, but in the "should work" meaning - but produces a Stack memory overflow.
I do know that my algorithm isn't perfect (in particular I could certainly avoid computing such a big intermediate result at each recursion step in my worker function).
Though, being in the process of learning Haskell, I'd like to understand why this code fails so miserably, in order to avoid this kind of mistakes next time.
Any insight on why this program is wrong will be appreciated.
import qualified Data.List as Set ((\\))
main = print $ sum $ worker abundants [1..28123]
-- Limited list of abundant numbers
abundants :: [Int]
abundants = filter (\x -> (sum (divisors x)) - x > x) [1..28123]
-- Given a positive number, returns its divisors unordered.
divisors :: Int -> [Int]
divisors x | x > 0 = [1..squareRoot x] >>=
(\y -> if mod x y == 0
then let d = div x y in
if y == d
then [y]
else [y, d]
else [])
| otherwise = []
worker :: [Int] -> [Int] -> [Int]
worker (a:[]) prev = prev Set.\\ [a + a]
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
-- http://www.haskell.org/haskellwiki/Generic_number_type#squareRoot
(^!) :: Num a => a -> Int -> a
(^!) x n = x^n
squareRoot :: Int -> Int
squareRoot 0 = 0
squareRoot 1 = 1
squareRoot n =
let twopows = iterate (^!2) 2
(lowerRoot, lowerN) =
last $ takeWhile ((n>=) . snd) $ zip (1:twopows) twopows
newtonStep x = div (x + div n x) 2
iters = iterate newtonStep (squareRoot (div n lowerN) * lowerRoot)
isRoot r = r^!2 <= n && n < (r+1)^!2
in head $ dropWhile (not . isRoot) iters
Edit: the exact error is Stack space overflow: current size 8388608 bytes.. Increasing the stack memory limit through +RTS -K... doesn't solve the problem.
Edit2: about the sqrt thing, I just copy pasted it from the link in comments. To avoid having to cast Integer to Doubles and face the rounding problems etc...

In the future, it's polite to attempt a bit of minimalization on your own. For example, with a bit of playing, I was able to discover that the following program also stack-overflows (with an 8M stack):
main = print (worker [1..1000] [1..1000])
...which really nails down just what function is screwing you over. Let's take a look at worker:
worker (a:[]) prev = prev Set.\\ [a + a]
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
Even on my first read, this function was red-flagged in my mind, because it's tail-recursive. Tail recursion in Haskell is generally not such a great idea as it is in other languages; guarded recursion (where you produce at least one constructor before recursing, or recurse some small number of times before producing a constructor) is generally better for lazy evaluation. And in fact, here, what's happening is that each recursive call to worker is building a deeper- and deeper-ly nested thunk in the prev argument. When the time comes to finally return prev, we have to go very deeply into a long chain of Set.\\ calls to work out just what it was we finally have.
This problem is obfuscated slightly by the fact that the obvious strictness annotation doesn't help. Let's massage worker until it works. The first observation is that the first clause is completely subsumed by the second one. This is stylistic; it shouldn't affect the behavior (except on empty lists).
worker [] prev = prev
worker (a:as) prev = worker as (prev Set.\\ map (a+) (a:as))
Now, the obvious strictness annotation:
worker [] prev = prev
worker (a:as) prev = prev `seq` worker as (prev Set.\\ map (a+) (a:as))
I was surprised to discover that this still stack overflows! The sneaky thing is that seq on lists only evaluates far enough to learn whether the list matches either [] or _:_. The following does not stack overflow:
import Control.DeepSeq
worker [] prev = prev
worker (a:as) prev = prev `deepseq` worker as (prev Set.\\ map (a+) (a:as))
I didn't plug this final version back into the original code, but it at least works with the minimized main above. By the way, you might like the following implementation idea, which also stack overflows:
import Control.Monad
worker as bs = bs Set.\\ liftM2 (+) as as
but which can be fixed by using Data.Set instead of Data.List, and no strictness annotations:
import Control.Monad
import Data.Set as Set
worker as bs = toList (fromList bs Set.\\ fromList (liftM2 (+) as as))

As Daniel Wagner correctly said, the problem is that
worker (a:as) prev = worker as (prev Set.\\ (map ((+) a) (a:as)))
builds a badly nested thunk. You can avoid that and get somewhat better performance than with deepseq by exploiting the fact that both arguments to worker are sorted in this application. Thus you can get incremental output by noting that at any step everything in prev smaller than 2*a cannot be the sum of two abundant numbers, so
worker (a:as) prev = small ++ worker as (large Set.\\ map (+ a) (a:as))
where
(small,large) = span (< a+a) prev
does better. However, it's still bad because (\\) cannot use the sortedness of the two lists. If you replace it with
minus xxs#(x:xs) yys#(y:ys)
= case compare x y of
LT -> x : minus xs yys
EQ -> minus xs ys
GT -> minus xxs ys
minus xs _ = xs -- originally forgot the case for one empty list
(or use the data-ordlist package's version), calculating the set-difference is O(length) instead of O(length^2).

Ok, I loaded it up and gave it a shot. Daniel Wagner's advice is pretty good, probably better than mine. The problem is indeed with the worker function, but I was going to suggest using Data.MemoCombinators to memoize your function instead.
Also, your divisors algorithm is kind of silly. There's a much better way to do that. It's kind of mathy and would require a lot of TeX, so here's a link to a math.stackexchange page about how to do that. The one I was talking about, was the accepted answer, though someone else gives a recursive solution that I think would run faster. (It doesn't require prime factorization.)
https://math.stackexchange.com/questions/22721/is-there-a-formula-to-calculate-the-sum-of-all-proper-divisors-of-a-number

How do you save a tree data structure to binary file in Haskell

I'm trying to save a simple (but quite big) Tree structure into a binary file using Haskell. The structure looks something like this:
-- For simplicity assume each Node has only 4 childs
data Tree = Node [Tree] | Leaf [Int]
And here is how I need the data look on disk:
Each node starts with four 32-bit offsets to it's children, then follow the childs.
I don't care much about the leafs, let's say it's just n consecutive 32-bit numbers.
For practival purposes I would need some node labels or some other additional data
but right now I don't care about that much neither.
It apears to me that Haskellers first choice when writing binary files is the Data.Binary.Put library. But with that I have a problem in the bullet #1. In particular, when I'm about to write a Node to a file, to write down the child offsets I need to know my current offset and the size of each child.
This is not something that Data.Binary.Put provides so I thought this must be a perfect application of Monad transformers. But even though it sounds cool and functional, so far I have not been successfull with this approach.
I asked two other questions that I thought would help me solve the problem here and here. I must say that each time I received very nice answers that helped me progress further but unfortunatelly I am still unable to solve the problem as a whole.
Here is what I've got so far, it still leaks too much memory to be practical.
I would love to have solution that uses such functional approach, but would be grateful for any other solution as well.

Here is implementation of two pass solution proposed by sclv.
import qualified Data.ByteString.Lazy as L
import Data.Binary.Put
import Data.Word
import Data.List (foldl')
data Tree = Node [Tree] | Leaf [Word32] deriving Show
makeTree 0 = Leaf $ replicate 100 0xdeadbeef
makeTree n = Node $ replicate 4 $ makeTree $ n-1
SizeTree mimics original Tree, it does not contain data but at each node it stores size of corresponding child in Tree.
We need to have SizeTree in memory, so it worth to make it more compact (e.g. replace Ints with uboxed words).
data SizeTree
= SNode {sz :: Int, chld :: [SizeTree]}
| SLeaf {sz :: Int}
deriving Show
With SizeTree in memory it is possible to serialize original Tree in streaming fashion.
putTree :: Tree -> SizeTree -> Put
putTree (Node xs) (SNode _ ys) = do
putWord8 $ fromIntegral $ length xs -- number of children
mapM_ (putWord32be . fromIntegral . sz) ys -- sizes of children
sequence_ [putTree x y | (x,y) <- zip xs ys] -- children data
putTree (Leaf xs) _ = do
putWord8 0 -- zero means 'leaf'
putWord32be $ fromIntegral $ length xs -- data length
mapM_ putWord32be xs -- leaf data
mkSizeTree :: Tree -> SizeTree
mkSizeTree (Leaf xs) = SLeaf (1 + 4 + 4 * length xs)
mkSizeTree (Node xs) = SNode (1 + 4 * length xs + sum' (map sz ys)) ys
where
ys = map mkSizeTree xs
sum' = foldl' (+) 0
It is important to prevent GHC from merging two passes into one (in which case it will hold tree in memory).
Here it is done by feeding not tree but tree generator to the function.
serialize mkTree size = runPut $ putTree (mkTree size) treeSize
where
treeSize = mkSizeTree $ mkTree size
main = L.writeFile "dump.bin" $ serialize makeTree 10

There are two basic approaches I would consider. If the entire serialized structure will easily fit into memory, you can serialize each node into a lazy bytestring and just use the lengths for each of them to calculate the offset from the current position.
serializeTree (Leaf nums) = runPut (mapM_ putInt32 nums)
serializeTree (Node subtrees) = mconcat $ header : childBs
where
childBs = map serializeTree subtrees
offsets = scanl (\acc bs -> acc+L.length bs) (fromIntegral $ 2*length subtrees) childBs
header = runPut (mapM_ putInt32 $ init offsets)
The other option is, after serializing a node, go back and re-write the offset fields with the appropriate data. This may be the only option if the tree is large, but I don't know of a serialization library that supports this. It would involve working in IO and seeking to the correct locations.

What I think you want is an explicit two pass solution. The first converts your tree into a size annotated tree. This pass forces the tree, but can be done, in fact, without any monadic machinery at all by tying the knot. The second pass is in the plain old Put monad, and given that the size annotations are already calculated, should be very straightforward.

Here is an implementation using Builder, which is part of the "binary" package. I haven't profiled it properly, but according to "top" it immediately allocates 108 Mbytes and then hangs on to that for the rest of the execution.
Note that I haven't tried reading the data back, so there may be lurking errors in my size and offset calculations.
-- Paste this into TreeBinary.hs, and compile with
-- ghc -O2 --make TreeBinary.hs -o TreeBinary
module Main where
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Builder as B
import Data.List (init)
import Data.Monoid
import Data.Word
-- -------------------------------------------------------------------
-- Test data.
data Tree = Node [Tree] | Leaf [Word32] deriving Show
-- Approximate size in memory (ignoring laziness) I think is:
-- 101 * 4^9 * sizeof(Int) + 1/3 * 4^9 * sizeof(Node)
-- This version uses [Word32] instead of [Int] to avoid having to write
-- a builder for Int. This is an example of lazy programming instead
-- of lazy evaluation.
makeTree :: Tree
makeTree = makeTree1 9
where makeTree1 0 = Leaf [0..100]
makeTree1 n = Node [ makeTree1 $ n - 1
, makeTree1 $ n - 1
, makeTree1 $ n - 1
, makeTree1 $ n - 1 ]
-- --------------------------------------------------------------------
-- The actual serialisation code.
-- | Given a tree, return a builder for it and its estimated length in bytes.
serialiseTree :: Tree -> (B.Builder, Word32)
serialiseTree (Leaf ns) = (mconcat (B.singleton 2 : map B.putWord32be ns), fromIntegral $ 4 * length ns + 1)
serialiseTree (Node ts) = (mconcat (B.singleton 1 : map B.putWord32be offsets ++ branches),
baseLength + sum subLengths)
where
(branches, subLengths) = unzip $ map serialiseTree ts
baseLength = fromIntegral $ 1 + 4 * length ts
offsets = init $ scanl (+) baseLength subLengths
main = do
putStrLn $ "Length = " ++ show (snd $ serialiseTree makeTree)
BL.writeFile "test.bin" $ B.toLazyByteString $ fst $ serialiseTree makeTree

A way to measure performance

Given Exercise 14 from 99 Haskell Problems:
(*) Duplicate the elements of a list.
Eg.:
*Main> dupli''' [1..10]
[1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10]
I've implemented 4 solutions:
{-- my first attempt --}
dupli :: [a] -> [a]
dupli [] = []
dupli (x:xs) = replicate 2 x ++ dupli xs
{-- using concatMap and replicate --}
dupli' :: [a] -> [a]
dupli' xs = concatMap (replicate 2) xs
{-- usign foldl --}
dupli'' :: [a] -> [a]
dupli'' xs = foldl (\acc x -> acc ++ [x,x]) [] xs
{-- using foldl 2 --}
dupli''' :: [a] -> [a]
dupli''' xs = reverse $ foldl (\acc x -> x:x:acc) [] xs
Still, I don't know how to really measure performance .
So what's the recommended function (from the above list) in terms of performance .
Any suggestions ?

These all seem more complicated (and/or less efficient) than they need to be. Why not just this:
dupli [] = []
dupli (x:xs) = x:x:(dupli xs)
Your last example is close to a good fold-based implementation, but you should use foldr, which will obviate the need to reverse the result:
dupli = foldr (\x xs -> x:x:xs) []
As for measuring performance, the "empirical approach" is profiling. As Haskell programs grow in size, they can get fairly hard to reason about in terms of runtime and space complexity, and profiling is your best bet. Also, a crude but often effective empirical approach when gauging the relative complexity of two functions is to simply compare how long they each take on some sufficiently large input; e.g. time how long length $ dupli [1..1000000] takes and compare it to dupli'', etc.
But for a program this small it shouldn't be too hard to figure out the runtime complexity of the algorithm based on your knowledge of the data structure(s) in question--in this case, lists. Here's a tip: any time you use concatenation (x ++ y), the runtime complexity is O(length x). If concatenation is used inside of a recursive algorithm operating on all the elements of a list of size n, you will essentially have an O(n ^2) algorithm. Both examples I gave, and your last example, are O(n), because the only operation used inside the recursive definition is (:), which is O(1).

As recommended you can use the criterion package. A good description is http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/.
To summarize it here and adapt it to your question, here are the steps.
Install criterion with
cabal install criterion -fchart
And then add the following to your code
import Criterion.Main
l = [(1::Int)..1000]
main = defaultMain [ bench "1" $ nf dupli l
, bench "2" $ nf dupli' l
, bench "3" $ nf dupli'' l
, bench "4" $ nf dupli''' l
]
You need the nf in order to force the evaluation of the whole result list. Otherwise you'll get just the thunk for the computation.
After that compile and run
ghc -O --make dupli.hs
./dupli -t png -k png
and you get pretty graphs of the running times of the different functions.
It turns out that dupli''' is the fastest from your functions but the foldr version that pelotom listed beats everything.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string