I have written a function generating subsets of subset. It caused stack overflow when I use in the following way subsets [1..]. And it is "normal" behaviour when it comes to "normal" (no-lazy) languages. And now, I would like to improve my function to be lazy.
P.S. I don't understand laziness ( And I try to understand it) so perhaps my problem is strange for you- please explain. :)
P.S. 2 Feel free to say me something about my disability in Haskell ;)
subsets :: [a] -> [[a]]
subsets (x:xs) = (map (\ e -> x:e) (subsets xs)) ++ (subsets xs)
subsets [] = [[]]
There's two problems with that function. First, it recurses twice, which makes it exponentially more ineffiecient than necessary (if we disregard the exponential number of results...), because each subtree is recalculated every time for all overlapping subsets; this can be fixed by leting the recursive call be the same value:
subsets' :: [a] -> [[a]]
subsets' [] = [[]]
subsets' (x:xs) = let s = subsets' xs
in map (x:) s ++ s
This will already allow you to calculate length $ subsets' [1..25] in a few seconds, while length $ subsets [1..25] takes... well, I didn't wait ;)
The other issue is that with your version, when you give it an infinite list, it will recurse on the infinite tail of that list first. To generate all finite subsets in a meaningful way, we need to ensure two things: first, we must build up each set from smaller sets (to ensure termination), and second, we should ensure a fair order (ie., not generate the list [[1], [2], ...] first and never get to the rest). For this, we start from [[]] and recursively add the current element to everything we have already generated, and then remember the new list for the next step:
subsets'' :: [a] -> [[a]]
subsets'' l = [[]] ++ subs [[]] l
where subs previous (x:xs) = let next = map (x:) previous
in next ++ subs (previous ++ next) xs
subs _ [] = []
Which results in this order:
*Main> take 100 $ subsets'' [1..]
[[],[1],[2],[2,1],[3],[3,1],[3,2],[3,2,1],[4],[4,1],[4,2],[4,2,1],[4,3],[4,3,1],[4,3,2],[4,3,2,1],[5],[5,1],[5,2],[5,2,1],[5,3],[5,3,1],[5,3,2],[5,3,2,1],[5,4],[5,4,1],[5,4,2],[5,4,2,1],[5,4,3],[5,4,3,1],[5,4,3,2],[5,4,3,2,1],[6],[6,1],[6,2],[6,2,1],[6,3],[6,3,1],[6,3,2],[6,3,2,1],[6,4],[6,4,1],[6,4,2],[6,4,2,1],[6,4,3],[6,4,3,1],[6,4,3,2],[6,4,3,2,1],[6,5],[6,5,1],[6,5,2],[6,5,2,1],[6,5,3],[6,5,3,1],[6,5,3,2],[6,5,3,2,1],[6,5,4],[6,5,4,1],[6,5,4,2],[6,5,4,2,1],[6,5,4,3],[6,5,4,3,1],[6,5,4,3,2],[6,5,4,3,2,1],[7],[7,1],[7,2],[7,2,1],[7,3],[7,3,1],[7,3,2],[7,3,2,1],[7,4],[7,4,1],[7,4,2],[7,4,2,1],[7,4,3],[7,4,3,1],[7,4,3,2],[7,4,3,2,1],[7,5],[7,5,1],[7,5,2],[7,5,2,1],[7,5,3],[7,5,3,1],[7,5,3,2],[7,5,3,2,1],[7,5,4],[7,5,4,1],[7,5,4,2],[7,5,4,2,1],[7,5,4,3],[7,5,4,3,1],[7,5,4,3,2],[7,5,4,3,2,1],[7,6],[7,6,1],[7,6,2],[7,6,2,1]]
You can't generate all the subsets of an infinite set: they form an uncountable set. Cardinality makes it impossible.
At most, you can try to generate all the finite subsets. For that, you can't proceed by induction, from [] onwards, since you'll never reach []. You need to proceed inductively from the beginning of the list, instead of the end.
A right fold solution would be:
powerset :: Foldable t => t a -> [[a]]
powerset xs = []: foldr go (const []) xs [[]]
where go x f a = let b = (x:) <$> a in b ++ f (a ++ b)
then:
\> take 8 $ powerset [1..]
[[],[1],[2],[2,1],[3],[3,1],[3,2],[3,2,1]]
Since a code example is worth a thousand words I'll start with that:
testList = [1,2,2,3,4,5]
testSet = map sumMapper $ tails testList
where sumMapper [] = []
sumMapper (a:b) = sumMap a b
sumMap a b = map (+ a) b
This code takes a list and adds up all the elements to get the sum of all of them (I'd also be interested in efficiency of this). The output of testSet is:
[[3,3,4,5,6],[4,5,6,7],[5,6,7],[7,8],[9],[],[]]
I would like to find the union of these lists (to make it into a set) but I feel that:
whatIWant = foldl1 union testSet
will have bad performance (the real lists will be thousands of elements long).
Is this the correct solution or am I missing something obvious?
You might want to try
nub $ concat theListOfLists
In the version using union, the code to cut out duplicates will get run many times. Here it only is run once.
It will only execute the code to pull out the unique values once.
There is also a Data.Set library, you could alternatively use
import Data.Set
S.fromList $ concat theListOfLists
The important point is that the code (here and above) that pulls out duplicates only gets run on the full list once, rather than over and over again.
edit- Rein mentions below that nub is O(n^2), so you should avoid the first solution above in favor of something O(n log n), as Data.Set.fromList should be. As others have mentioned in the comments, you need something that enforces Ord a to get the proper complexity O(n log n), and Data.Set does, nub does not.
I will leave the two solutions (poor performance and good performance) because I think the resulting discussion was useful.
If you're using elements that are members of the Ord typeclass, as in your example, you can use Data.Set:
import qualified Data.Set as Set
whatYouWant = foldl' (Set.union . Set.fromList) Set.empty testSet
This has the advantage of taking space proportional to the size of the largest sublist rather than to the size of the entire concatenated list as does the Set.fromList . concat solution. The strict foldl' also prevents buildup of unevaluated thunks, preventing O(n) stack and heap space usage.
Generally speaking, an Ord constraint allows more efficient algorithms than an Eq constraint because it allows you to build a tree. This is also the reason that nub is O(n^2): the more efficient algorithm requires Ord rather than just Eq.
Since union is an associative operation (a+(b+c)==(a+b)+c), you can use tree-shaped folding for a logarithmic advantage in time complexity:
_U [] = []
_U (xs:t) = union xs (_U (pairs t))
pairs (xs:ys:t) = union xs ys : pairs t
pairs t = t
Of course Data.List.union itself is O(n2) in general, but if your testList is ordered non-decreasing, all the lists will be too, and you can use a linear ordUnion instead of the union, for a solution which is linearithmic overall and shouldn't leak space:
ordUnion :: (Ord a) => [a] -> [a] -> [a]
ordUnion a [] = a
ordUnion [] b = b
ordUnion (x:xs) (y:ys) = case compare x y of
LT -> x : ordUnion xs (y:ys)
EQ -> x : ordUnion xs ys
GT -> y : ordUnion (x:xs) ys
To prevent duplicates which might slip through, one more function is needed to process _U's output—a linear ordNub :: (Ord a) => [a] -> [a], with an obvious implementation.
Using the left-preferential (\(x:xs) ys -> x:ordUnion xs ys) could be even more productive overall (force smaller portions of the input at each given moment):
g testList = ordNub . _U $ [map (+ a) b | (a:b) <- tails testList]
where
_U [] = []
_U ((x:xs):t) = x : ordUnion xs (_U (pairs t))
pairs ((x:xs):ys:t) = (x : ordUnion xs ys) : pairs t
pairs t = t
see also:
data-ordlist package
even less forcing "implicit heap" by apfelmus
Tree-like folds
Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performanc gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first steps of the computation are parallelized, when parameter l (level) is greater than zero below. This is done by [via variable strat] selecting the rpar strategy, which will make sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results, and force its evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)
instance NFData a => NFData (Tree a) where
rnf (Leaf v) = deepseq v ()
rnf (Node x y) = deepseq (x, y) ()
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
splitAt (length xs `div` 2) xs
-- mergeSort' :: Ord a => Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
xr <- strat $ runEval $ mergeSort' (l - 1) x
yr <- rseq $ runEval $ mergeSort' (l - 1) y
rdeepseq (merge xr yr)
where
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
strat | l > 0 = rpar
| otherwise = rseq
mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- some constant factor order of n.
Results
Obtain the 4th version source code here [ http://pastebin.com/DxYneAaC ], and run it with the following to inspect thread usage, or subsequent command lines for benchmarking,
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 # 3.33GHz show little improvement
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data. (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 cpus half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256mb is the sweet spot, 128mb gives 9 seconds, 512 gives 8.4, and 1024 gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
I notice that the second bar in threadscope is sometimes green (c.f. its homepage, where the second bar seems to always be garbage collection). What does this mean?
Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
Is there a better way (arrows, applicative) to express parallelism?
The answer is pretty easy: Because you have at no point introduced parallelism. Eval is just a monad to order computations, you have to ask for things to be executed in parallel manually. What you probably want is:
do xr <- rpar $ runEval $ mergeSort' x
yr <- rseq $ runEval $ mergeSort' y
rseq (merge xr yr)
This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.
Standard tips also kind-of apply:
The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: Your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use lazyness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).
So don't be surprised when performance actually drops sharply when you start introducing parallelism under the assumptions that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
do xr <- spawn $ mergeSort' x
yr <- spawn $ mergeSort' y
merge <$> get xr <*> get yr
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.
I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.
import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show
listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
where half = length xs `div` 2
-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
| otherwise = y : merge (x:xs) ys
-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)
mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree
-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
yr = mergeSortP' y
in xr `par` yr `pseq` merge xr yr
mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree
-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) =
runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)
mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree
-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t#(Node x y)
| n <= 1 = mergeSort' t
| otherwise = let xr = smartMerge' (n-1) x
yr = smartMerge' (n-2) y
in xr `par` yr `pseq` merge xr yr
smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree
main = defaultMain $ [ bench "original" $ nf mergeSort lst
, bench "par" $ nf mergeSortP lst
, bench "rpar" $ nf mergeSortR lst
, bench "smart" $ nf smartMerge lst ]
where lst = [100000,99999..0] :: [Int]