Lazy tree with a space leak - haskell

I'm writing a toy XML processor. Right now the program is supposed to read a stream of events (think SAX) describing the structure of a document and lazily build the corresponding tree.
The events are defined by the following datatype:
data Event = Open String
           | Close
A possible input would then be:
[Open "a", Open "b", Close, Open "c", Close, Close]
that would correspond to the tree:
  a
 / \
b   c
I would like to generate the tree in a lazy way, so that it does not need to be present in memory in full form at any time. My current implementation, however, seems to have a space leak causing all the nodes to be retained even when they are no longer needed. Here is the code:
data Event = Open String
           | Close
data Tree a = Tree a (Trees a)
type Trees a = [Tree a]
data Node = Node String
trees [] = []
trees (Open x : es) =
    let (children, rest) = splitStream es
    in  (Tree (Node x) (trees children)) : (trees rest)
splitStream es = scan 1 es
scan depth (s@(Open {}) : ss) =
    let (b, a) = scan (depth+1) ss
    in  (s:b, a)
scan depth (s@Close : ss) =
    case depth of
      1 -> ([], ss)
      x -> let (b, a) = scan (depth-1) ss
           in  (s:b, a)
getChildren = concatMap loop
  where
    loop (Tree _ cs) = cs
main = print .
       length .
       getChildren .
       trees $
       [Open "a"] ++ (concat . replicate 1000000 $ [Open "b", Close]) ++ [Close]
The function trees converts the list of events into a list of Tree Node. getChildren collects all the children nodes (labeled "b") of the root ("a"). These are then counted and the resulting number is printed.
The compiled program, built with GHC 7.0.4 (-O2), keeps increasing its memory usage up to the point when it prints the node count. I was expecting, on the other hand, an almost constant memory usage.
Looking at the "-hd" heap profile, it is clear that most of the memory is taken by the list constructor (:). It seems like one of the lists produced by scan or by trees is retained in full. I don't understand why, however, as length . getChildren should get rid of child nodes as soon as they are traversed.
Is there a way to fix such space leak?

I suspect that trees is the culprit. As John L said, this is probably an instance of the Wadler space leak in which the compiler is unable to apply the optimization that prevents the leak. The problem is that you use lazy pattern matching (the let expression) to deconstruct the pair and then pattern match, via the application of trees, on one of the components of the tuple. I had a quite similar problem once: http://comments.gmane.org/gmane.comp.lang.haskell.glasgow.user/19129. That thread also provides a more detailed explanation. To prevent the space leak you can simply use a case expression to deconstruct the tuple, as follows.
trees [] = []
trees (Open x : es) =
    case splitStream es of
      (children, rest) -> Tree (Node x) (trees children) : trees rest
With this implementation the maximum residency drops from 38MB to 28KB.
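To make the measurement easier to reproduce, here is a minimal self-contained sketch of the case-based version. The names treesStrict and countChildren are mine, not from the question, and the clauses for an empty or malformed stream are assumptions added only so that the functions are total.

```haskell
data Event = Open String | Close
data Tree a = Tree a [Tree a]
newtype Node = Node String

-- Split the stream just after the Close that matches an already consumed Open.
splitStream :: [Event] -> ([Event], [Event])
splitStream = scan (1 :: Int)
  where
    scan _ [] = ([], [])                    -- assumption: a truncated stream just ends
    scan depth (s@(Open _) : ss) =
      let (b, a) = scan (depth + 1) ss in (s : b, a)
    scan depth (s@Close : ss)
      | depth == 1 = ([], ss)
      | otherwise  = let (b, a) = scan (depth - 1) ss in (s : b, a)

-- The case expression forces the pair before either component is used.
treesStrict :: [Event] -> [Tree Node]
treesStrict [] = []
treesStrict (Close : _) = []                -- assumption: a stray Close ends the forest
treesStrict (Open x : es) =
  case splitStream es of
    (children, rest) -> Tree (Node x) (treesStrict children) : treesStrict rest

-- Count the children of a root with n empty child elements.
countChildren :: Int -> Int
countChildren n =
  length . concatMap (\(Tree _ cs) -> cs) . treesStrict $
    [Open "a"] ++ concat (replicate n [Open "b", Close]) ++ [Close]
```

Adding main = print (countChildren 1000000), building with -O2 -rtsopts and running with +RTS -s should report the maximum residency, which is the figure to compare against the let-based version.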
But note that this new implementation of trees is stricter than the original one, as it demands the application of splitStream. Therefore, in some cases this transformation might even cause a space leak. To regain a less strict implementation you might use a trick similar to the one in the lines function in Data.List, which has a similar problem: http://hackage.haskell.org/packages/archive/base/latest/doc/html/src/Data-List.html#lines. In this case trees would look as follows.
trees [] = []
trees (Open x : es) =
    context (case splitStream es of
               (children, rest) -> (trees children, trees rest))
  where
    context ~(children', rest') = Tree (Node x) children' : rest'
If we desugar the lazy pattern matching we get the following implementation. Here the compiler is able to detect the selector to the tuple component as we do not perform pattern matching on one of the components.
trees [] = []
trees (Open x : es) = Tree (Node x) children' : rest'
  where
    (children', rest') =
      case splitStream es of
        (children, rest) -> (trees children, trees rest)
Does anybody know whether this transformation always does the trick?

I strongly suspect this is an example of the "Wadler space leak" bug. Unfortunately I don't know how to solve it, but I did find a few things that mitigate the effects somewhat:
1) Change getChildren to
getChildren' = ($ []) . foldl (\ xsf (Tree _ cs) -> xsf . (cs ++)) id
This is a small, but noticeable, improvement.
2) In this example trees always outputs a single-element list. If this is always true for your data, explicitly dropping the rest of the list fixes the space leak:
main = print .
       length .
       getChildren .
       (:[]) .
       head .
       trees

Related

Program that counts Ints runs for a very long time

I want to write a program that takes a list of Ints and a length, and returns a list whose position i holds all the elements equal to i. For example:
[0,0,0,1,3,5,3,2,2,4,4,4] 6 -> [[0,0,0],[1],[2,2],[3,3],[4,4,4],[5]]
[0,0,4] 7 -> [[0,0],[],[],[],[4],[],[]]
[] 3 -> [[],[],[]]
[2,2] 3 -> [[],[],[2,2]]
Here is my solution:
import Data.List
import Data.Function
f :: [Int] -> Int -> [[Int]]
f ls len = g 0 ls' [] where
    ls' = group . sort $ ls
    g :: Int -> [[Int]] -> [[Int]] -> [[Int]]
    g val [] accum
        | len == val = accum
        | otherwise = g (val+1) [] (accum ++ [[]])
    g val (x:xs) accum
        | len == val = accum
        | val == head x = g (val+1) xs (accum ++ [x])
        | otherwise = g (val+1) (x:xs) (accum ++ [[]])
But the query f [] 1000000 takes a really long time. Why?
I see we're accumulating over some data structure. I think foldMap. I ask "Which Monoid"? It's some kind of lists of accumulations. Like this
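newtype Bunch x = Bunch {bunch :: [x]}
-- Since GHC 8.4, Monoid has Semigroup as a superclass, so the combining
-- operation is defined as (<>) and mappend defaults to it.
instance Semigroup x => Semigroup (Bunch x) where
    Bunch xss <> Bunch yss = Bunch (glom xss yss) where
        glom [] yss = yss
        glom xss [] = xss
        glom (xs : xss) (ys : yss) = (xs <> ys) : glom xss yss
instance Semigroup x => Monoid (Bunch x) where
    mempty = Bunch []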
Our underlying elements have some associative operator <>, and we can thus apply that operator pointwise to a pair of lists, just like zipWith does, except that when we run out of one of the lists, we don't truncate, rather we just take the other. Note that Bunch is a name I'm introducing for purposes of this answer, but it's not that unusual a thing to want. I'm sure I've used it before and will again.
If we can translate
0 -> Bunch [[0]] -- single 0 in place 0
1 -> Bunch [[],[1]] -- single 1 in place 1
2 -> Bunch [[],[],[2]] -- single 2 in place 2
3 -> Bunch [[],[],[],[3]] -- single 3 in place 3
...
and foldMap across the input, then we'll get the right number of each in each place. There should be no need for an upper bound on the numbers in the input to get a sensible output, as long as you are willing to interpret [] as "the rest is silence". Otherwise, like Procrustes, you can pad or chop to the length you need.
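The translation and the foldMap described above can be sketched without any extra packages. This is a minimal version under my own assumptions: place and groupIntsTo are hypothetical helper names, the inputs are assumed to be non-negative, and place rebuilds its padding on every call instead of memoizing it.

```haskell
newtype Bunch x = Bunch { bunch :: [x] }

-- Pointwise combination that keeps the longer tail instead of truncating.
instance Semigroup x => Semigroup (Bunch x) where
  Bunch xss <> Bunch yss = Bunch (glom xss yss)
    where
      glom [] qs = qs
      glom ps [] = ps
      glom (p : ps) (q : qs) = (p <> q) : glom ps qs

instance Semigroup x => Monoid (Bunch x) where
  mempty = Bunch []

-- Hypothetical helper: the translation i -> [[], ..., [], [i]].
place :: Int -> [[Int]]
place i = replicate i [] ++ [[i]]

-- Accumulate every element into its own position.
groupInts :: [Int] -> [[Int]]
groupInts = bunch . foldMap (Bunch . place)

-- Pad or chop to the length the question asks for.
groupIntsTo :: Int -> [Int] -> [[Int]]
groupIntsTo len = take len . (++ repeat []) . groupInts
```

With these definitions groupIntsTo 7 [0,0,4] gives [[0,0],[],[],[],[4],[],[]], matching the second example in the question.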
Note, by the way, that when mappend's first argument comes from our translation, we do a bunch of ([]++) operations, a.k.a. ids, then a single ([i]++), a.k.a. (i:), so if foldMap is right-nested (which it is for lists), then we will always be doing cheap operations at the left end of our lists.
Now, as the question works with lists, we might want to introduce the Bunch structure only when it's useful. That's what Control.Newtype is for. We just need to tell it about Bunch.
instance Newtype (Bunch x) [x] where
    pack = Bunch
    unpack = bunch
And then it's
groupInts :: [Int] -> [[Int]]
groupInts = ala' Bunch foldMap (basis !!) where
    basis = ala' Bunch foldMap id [iterate ([]:) [], [[[i]] | i <- [0..]]]
What? Well, without going to town on what ala' is in general, its impact here is as follows:
ala' Bunch foldMap f = bunch . foldMap (Bunch . f)
meaning that, although f is a function to lists, we accumulate as if f were a function to Bunches: the role of ala' is to insert the correct pack and unpack operations to make that just happen.
We need (basis !!) :: Int -> [[Int]] to be our translation. Hence basis :: [[[Int]]] is the list of images of our translation, computed on demand at most once each (i.e., the translation, memoized).
For this basis, observe that we need these two infinite lists
[ []              [ [[0]]
, [[]]            , [[1]]
, [[],[]]         , [[2]]
, [[],[],[]]      , [[3]]
...               ...
combined Bunchwise. As both lists have the same length (infinity), I could also have written
basis = zipWith (++) (iterate ([]:) []) [[[i]] | i <- [0..]]
but I thought it was worth observing that this also is an example of Bunch structure.
Of course, it's very nice when something like accumArray hands you exactly the sort of accumulation you need, neatly packaging a bunch of grungy behind-the-scenes mutation. But the general recipe for an accumulation is to think "What's the Monoid?" and "What do I do with each element?". That's what foldMap asks you.
The (++) operator copies the left-hand list. For this reason, adding to the beginning of a list is quite fast, but adding to the end of a list is very slow.
In summary, avoid adding things to the end of a list. Try to always add to the beginning instead. One simple way to do that is to build the list backwards, and then reverse it at the end. A more devious trick is to use "difference lists" (Google it). Another possibility is to use Data.Sequence rather than a list.
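The "build it backwards, then reverse" advice can be sketched on the question's function as follows. fillGroups is a hypothetical name; the only intended change from the original is that the accumulator grows at the front and is reversed once at the end, plus a >= test so that a negative length terminates.

```haskell
import Data.List (group, sort)

fillGroups :: [Int] -> Int -> [[Int]]
fillGroups ls len = reverse (go 0 (group (sort ls)) [])
  where
    go :: Int -> [[Int]] -> [[Int]] -> [[Int]]
    go val groups acc
      | val >= len = acc
      | otherwise =
          case groups of
            -- The next group belongs at this position: cons it on.
            (x@(y : _) : xs) | y == val -> go (val + 1) xs (x : acc)
            -- Nothing for this position: cons an empty list instead.
            _ -> go (val + 1) groups ([] : acc)
```

Each step is now a constant-time cons, so filling a million empty positions takes a linear number of steps rather than a quadratic one.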
The first thing to note is that the most obvious way to implement this is to use a data structure that allows random access; an array is an obvious choice. Note that you need to add elements to the same array slot multiple times and somehow "join them".
accumArray is perfect for this.
So we get:
f l i = elems $ accumArray (\l e -> e:l) [] (0,i-1) (map (\e -> (e,e)) l)
And we're good to go (see full code here).
This approach does involve converting the final array back into a list, but that step is very likely faster than say sorting the list, which often involves scanning the list at least a few times for a list of decent size.
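For reference, a self-contained variant of the accumArray one-liner might look like the sketch below. bucket is a hypothetical name, and the range filter is my addition: as far as I know accumArray raises an error for an index outside the bounds, so out-of-range inputs are silently dropped here instead.

```haskell
import Data.Array (accumArray, elems)

bucket :: [Int] -> Int -> [[Int]]
bucket xs len
  | len <= 0 = []
  | otherwise =
      -- Each element is consed onto the slot named by its own value.
      elems (accumArray (flip (:)) [] (0, len - 1)
               [(x, x) | x <- xs, x >= 0, x < len])
```

Consing reverses the order within each slot, which does not matter here because all elements of a slot are equal.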
Whenever you use ++ you have to recreate the entire list, since lists are immutable.
A simple solution would be to use :, but that builds a reversed list. However that can be fixed using reverse, which results in only building two lists (instead of 1 million in your case).
Your concept of glomming things onto an accumulator is a very useful one, and both MathematicalOrchid and Guvante show how you can use that concept reasonably efficiently. But in this case, there is a simpler approach that is likely also faster. You started with
group . sort $ ls
and this was a very good place to start! You get a list that's almost the one you want, except that you need to fill in some blanks. How can we figure those out? The simplest way, though probably not quite the most efficient, is to work with a list of all the numbers you want to count up to: [0 .. len-1].
So we start with
f ls len = g [0 .. len-1] (group . sort $ ls)
  where
    ?
How do we define g? By pattern matching!
f ls len = g [0 .. len-1] (group . sort $ ls)
  where
    -- We may or may not have some lists left,
    -- but we counted as high as we decided we
    -- would
    g [] _ = []
    -- We have no lists left, so the rest of the
    -- numbers are not represented
    g ns [] = map (const []) ns
    -- This shouldn't be possible, because group
    -- doesn't make empty lists.
    g _ ([]:_) = error "group isn't working!"
    -- Finally, we have some work to do!
    g (n:ns) xls@(xl@(x:_):xls')
      | n == x = xl : g ns xls'
      | otherwise = [] : g ns xls
That was nice, but making the list of numbers isn't free, so you might be wondering how you can optimize it. One method I invite you to try is using your original technique of keeping a separate counter, but following this same sort of structure.
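One possible shape for that exercise, offered only as a sketch: countUp is a hypothetical name, and skipping values below the counter (such as negative inputs) is my own choice, not something the question specifies.

```haskell
import Data.List (group, sort)

countUp :: [Int] -> Int -> [[Int]]
countUp ls len = go 0 (group (sort ls))
  where
    -- Counted as high as requested: stop, whatever is left.
    go n _
      | n >= len = []
    go n (xl@(x : _) : rest)
      | n == x = xl : go (n + 1) rest   -- this group fills position n
      | x < n  = go n rest              -- assumption: drop values below the counter
    -- No group for position n (guards above fall through to here).
    go n groups = [] : go (n + 1) groups
```

Because every result cell is produced before the recursive call is demanded, the output can be consumed lazily; take 3 (countUp [] 1000000) only builds three cells.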

Reconstruct a graph from BFS output in Haskell

I want to reconstruct the incidence structure of a graph in Haskell, which is given by the output of a breadth first traversal of it. Explicitly, the output consists of a root vertex and a list of neighborhoods (a neighborhood is a list of vertices marked as new or old (= already visited)), where each neighborhood corresponds to the least vertex which has not been assigned to a neighborhood, yet.
In any imperative language, I would solve the problem by using a queue:
Input: root vertex r, list of neighborhoods L
(1) Put r into the empty queue Q
(2) if Q is empty then STOP
(3) extract the first vertex v of Q
(4) extract the first neighborhood N of L
(5) append the unvisited vertices of N to Q
(6) remove the markings (new/old) of the nodes of N and assign v to N
(7) goto (2)
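The steps above can be sketched directly with Data.Sequence as the queue. This is only an illustration under simplifying assumptions: vertices are plain Ints, Mark and nodeId are hypothetical stand-ins for the real node type, and the result is just the incidence list.

```haskell
import Data.Sequence (Seq, ViewL (..), viewl, (><))
import qualified Data.Sequence as Seq

-- Hypothetical simplified marking: vertices are plain Ints here.
data Mark = New Int | Old Int

nodeId :: Mark -> Int
nodeId (New n) = n
nodeId (Old n) = n

-- Pair each dequeued vertex with the ids found in its neighborhood.
rebuild :: Int -> [[Mark]] -> [(Int, [Int])]
rebuild root = go (Seq.singleton root)           -- step (1)
  where
    go :: Seq Int -> [[Mark]] -> [(Int, [Int])]
    go _ [] = []
    go queue (nb : nbs) =                        -- step (4)
      case viewl queue of
        EmptyL -> []                             -- step (2)
        v :< rest ->                             -- step (3)
          let fresh = Seq.fromList [n | New n <- nb]        -- step (5)
          in (v, map nodeId nb) : go (rest >< fresh) nbs    -- steps (6) and (7)
```

Each result cell is emitted before the recursive call, so a consumer can process the incidence list incrementally instead of waiting for the whole input.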
I tried to implement this naive algorithm in Haskell (by using a list or by using Data.Sequence as queue), but ghci always runs out of memory. This should not happen, because although the input consists of 300MB data, 16GB RAM should clearly suffice.
Therefore the naive implementation seems to cause a memory leak. How would you implement this algorithm in Haskell?
Edit:
Here are the (slightly simplified) data types I use:
data Output = Out !Vertex ![[BFSNode]]
data Vertex = Vertex Integer SomeMoreComplexData
data BFSNode = New Vertex | Old Integer
data Graph = Graph ![Vertex] ![(Integer,[Integer])]
The data type "Output" contains the already parsed BFS output consisting of the root vertex and the lists of neighborhoods. BFSNode corresponds to a node in the BFS tree which belongs either to a new vertex which is visited for the first time, or to an old vertex which has already been visited and which is therefore referred to by its unique number. Note that the parsing process works fine and consumes very little memory.
My aim is to convert "Output" into the data type "Graph" which consists of the lists of vertices and of an incidence list.
Here is a simplified version of my implementation:
readTree :: [[BFSNode]] -> Seq Integer -> Graph
readTree [] _ = Graph [] []
readTree (nb:nbs) qs =
    let (i :< qs') = viewl qs
        newVs = fromList $! map nodeNr . filter isNew $ nb
        (Graph vs adj) = readTree nbs $ qs' >< newVs
    in  Graph (map unNew (filter isNew nb) ++ vs) ((i,nub $ map nodeNr nb):adj)
"nbs" is the list of neighborhoods, "qs" is the queue. The function "nodeNr" extracts the unique identification number from a vertex, "isNew" tests whether a vertex is new, and "unNew" unpacks a new vertex from the data type "BFSNode".
Edit2:
I think I have localized the problem now. Maybe it has nothing to do with my implementation of the conversion process. My mistake was to use the built-in function "read" to read the data type "Output" from a file. I now realize that Haskell has problems reading big files this way. Even if it were just about reading a list of integers, e.g.
main = do
    txt <- readFile "test"
    writeFile "test2" . show $ (read txt :: [Integer])
the program will run out of memory if the file "test" is big enough. I understand now that it is not a good idea to parse data in this way, since "read" will load all the data into memory before showing any output, but I still do not understand why it fills 16GB of RAM although the file is not even 500MB. Do you have any idea what is wrong with "read"? Does Haskell show the same behavior on your machines?
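For the list-of-integers case, one way to avoid "read" on the whole input is to tokenize lazily. The sketch below assumes the file holds decimal integers in the format produced by show on a list; parseIntegers is a hypothetical name, and it does no validation at all.

```haskell
import Data.Char (isDigit)

-- Turn every non-numeric character into a separator, then read each token.
parseIntegers :: String -> [Integer]
parseIntegers = map read . words . map keep
  where
    keep c
      | isDigit c || c == '-' = c
      | otherwise             = ' '
```

Since map and words are lazy, each number becomes available as soon as its characters have been consumed, so a consumer that streams the result does not need the whole list in memory at once.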
Edit3:
Now I implemented a stream based parsing function "readOutput" which takes a String and returns the data type "Output". This function is lazy, so I immediately get an output when I call it. But when I compose it with my conversion function "readTree" (which is clearly tail-recursive) I get no output at all and the memory usage increases as usual. What am I doing wrong?
Edit4:
The problem in Edit3 came from some strictness annotations, which I have now removed.
This question does not specify a key ingredient: how is the graph going to be represented in Haskell? Functional programs require carefully thought out data structures to maximize sharing and run efficiently. Usually, this means they're recursively built from nothing (inductive). There's a paper on inductive graphs and functional graph algorithms that gives one representation:
module Test where
data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
    deriving Show
That is, a graph is either Empty, or a (smaller) graph extended by one node. This is exactly how lists are built using Cons in functional languages, except that the additional node has to specify the smaller graph, the predecessor links ([Int]), and the new node number and data, (Int,a). Note that they also implemented this as an abstract type "for efficiency reasons."
A graph with one node can be generated by extending the empty graph.
singleton :: (Int,a) -> Graph a
singleton x = Extension Empty [] x
Using this structure, it's simple to define a recursive parse algorithm for your BFS tree.
data Mark a = Visited Int | New (Int,a) deriving Show
parse :: (Int,a) -> [[Mark a]] -> Graph a
parse x nbrs = extend Empty [x] nbrs
extend :: Graph a -> [(Int,a)] -> [[Mark a]] -> Graph a
extend g [] [] = g
extend g _ [] = Empty -- leftover nodes, really an error.
extend g [] _ = Empty -- leftover neighborhoods, really an error.
extend g (x : tl) (nbr : nbrs) =
    extend (Extension g (seen nbr) x) (news tl nbr) nbrs
news :: [(Int,a)] -> [Mark a] -> [(Int,a)]
news l (New x : tl) = news (uniq l x) tl
news l (_ : tl) = news l tl
news l [] = l
uniq :: [(Int,a)] -> (Int,a) -> [(Int,a)]
uniq (x:tl) y = x : if (fst x == fst y) then tl else uniq tl y
uniq [] y = [y]
seen :: [Mark a] -> [Int]
seen (Visited i : tl) = i : seen tl
seen (_ : tl) = seen tl
seen [] = []
m0 = [New (1,())]
m1 = [Visited 0, New (2,()), New (3,())]
m2 = [Visited 1, New (3,())]
m3 = [Visited 1, Visited 2]
nbrs = [m0,m1,m2,m3]
Testing it out,
$ ghci
GHCi, version 7.6.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> :load Test
[1 of 1] Compiling Test ( Test.hs, interpreted )
Ok, modules loaded: Test.
*Test> parse (0,()) nbrs
Extension (Extension (Extension (Extension Empty [] (0,())) [0] (1,())) [1] (2,())) [1,2] (3,())
For efficiency, you could do the following:
The news and seen functions could be combined into a single pass, let (ns,sn) = newseen nbr ([],[]), and made tail-recursive (passing their partially constructed lists and returning immediately).
Your input could keep track of the node at the center of each neighbor list. This would avoid the list concatenation in the stack of neighbors. Alternatively, you could use a functional deque to hold that stack.
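A single-pass split of a neighborhood into new and visited nodes could look like the sketch below. Note that it swaps in a lazy right fold for the accumulator-passing style suggested above, and it deliberately leaves out the duplicate check that uniq performs; newSeen is a hypothetical name.

```haskell
data Mark a = Visited Int | New (Int, a)

-- One traversal producing both the new nodes and the visited ids, in order.
newSeen :: [Mark a] -> ([(Int, a)], [Int])
newSeen = foldr step ([], [])
  where
    -- The lazy pattern keeps the fold productive on long neighborhoods.
    step (New x)     ~(ns, sn) = (x : ns, sn)
    step (Visited i) ~(ns, sn) = (ns, i : sn)
```

For the m1 neighborhood above, newSeen returns ([(2,()),(3,())],[0]).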
If you haven't seen it, I'd recommend Okasaki's book on purely functional data structures.

Haskell profiling subgraph mining algorithm

I am trying to solve the problem of finding all connected subgraphs in Haskell. The algorithm used is described here. Quote from that paper:
As in every path algorithm, there are forward steps and back steps. A step forward is done if a given connected subgraph can be extended by addition of edge k, that is if edge k is not already part of the given subgraph, if k is adjacent to at least one edge of the given subgraph, and if addition of edge k is not forbidden by some restrictions given below.
A step back is done as soon as a given connected subgraph cannot be further elongated. In this case the edge added last is removed from the string, it is temporarily given the status "forbidden", and any other edges which were forbidden by backtracking from a previous longer string are simultaneously "allowed" again. In contrast, an edge which is forbidden by being removed from a string shorter than the present one remains forbidden, thus assuring that every connected subgraph is constructed once and only once.
To implement this algorithm, I represented graphs as lists of edges:
type Edge = (Int,Int)
type Graph = [Edge]
First, I wrote a function addEdge that checks whether it is possible to extend a graph. It returns Nothing if that isn't possible, or the Edge to extend with.
I have a "parent" graph and an "extensible" graph, so I try to find exactly one edge that exists in the "parent" graph, is connected to the "extensible" graph, is not already included in the "extensible" graph, and is not in the forbidden set.
I wrote the function below:
addEdge :: Graph -> Graph -> [Edge] -> Maybe Edge
addEdge !parent !extensible !forb = listToMaybe $ intersectBy (\ (i,j) (k,l) -> (i == k || i == l || j == k || j == l)) (parent \\ (extensible `union` forb)) extensible
It works! But, as I see from profiling the whole program, addEdge is the most expensive function. I am sure that my code isn't optimal. At the very least, intersectBy finds all possible solutions while I need only one. Is there any way to make this code faster? Maybe by using Set from Data.Set instead of standard lists? That is my first concern.
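As a rough sketch of the Data.Set idea raised above: the version below stops at the first suitable edge and replaces the repeated list scans with set lookups. It assumes the parent graph contains no duplicate edges, in which case it should return the same edge as the original; the local names excluded and touched are mine.

```haskell
import Data.List (find)
import qualified Data.Set as Set

type Edge = (Int, Int)
type Graph = [Edge]

addEdge :: Graph -> Graph -> [Edge] -> Maybe Edge
addEdge parent extensible forb = find ok parent
  where
    -- Edges that may not be chosen: already present or forbidden.
    excluded = Set.fromList (extensible ++ forb)
    -- Vertices touched by the graph being extended.
    touched = Set.fromList (concat [[a, b] | (a, b) <- extensible])
    ok e@(i, j) =
      e `Set.notMember` excluded
        && (i `Set.member` touched || j `Set.member` touched)
```

Both sets are still rebuilt on every call, so a further step would be to carry them along as arguments of ext rather than recomputing them.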
The main recursive function, ext, is presented below:
ext :: Graph -> [Graph] -> Maybe Graph -> [(Edge,Int)] -> Int -> [Graph]
ext !main !list !grow !forb !maxLength
    | isEnd == True = (filter (\g -> (length g /= 1)) list) ++ (group main)
    | ((addEdge main workGraph forbEdges) == Nothing) || (length workGraph) >= maxLength = ext main list (Just workGraph) forbProcess maxLength
    | otherwise = ext main ((addedEdge:workGraph):list) Nothing forb maxLength
    where
        workGraph = if grow == Nothing then (head list) else (bite (fromJust grow)) -- the [Edge] graph currently being processed
        workGraphLength = length workGraph
        addedEdge = fromJust $ addEdge'
        addEdge' = addEdge main workGraph forbEdges
        bite xz = if (length xz == 1) then (fromJust (addEdge main xz forbEdges)):[] else tail xz
        forbProcess = (head workGraph,workGraphLength):(filter ((<=workGraphLength).snd) forb)
        forbEdges = map fst forb -- convert from (Edge,Level) to [Edge]
        isEnd = (grow /= Nothing) && (length (fromJust grow) == 1) && ((addEdge main (fromJust grow) forbEdges) == Nothing)
I test my program on this graph:
c60 = [(1,4),(1,3),(1,2),(2,6),(2,5),(3,10),(3,7),(4,24),(4,21),(5,8),(5,7),(6,28),(6,25),
    (7,9),(8,11),(8,12),(9,16),(9,13),(10,20),(10,17),(11,14),(11,13),(12,28),(12,30),(13,15),
    (14,43),(14,30),(15,44),(15,18),(16,18),(16,17),(17,19),(18,47),(19,48),(19,22),(20,22),(20,21),
    (21,23),(22,31),(23,32),(23,26),(24,26),(24,25),(25,27),(26,35),(27,36),(27,29),(28,29),(29,39),
    (30,40),(31,32),(31,33),(32,34),(33,50),(33,55),(34,37),(34,55),(35,36),(35,37),(36,38),(37,57),
    (38,41),(38,57),(39,40),(39,41),(40,42),(41,59),(42,45),(42,59),(43,44),(43,45),(44,46),(45,51),
    (46,49),(46,51),(47,48),(47,49),(48,50),(49,53),(50,53),(51,52),(52,60),(52,54),(53,54),(54,56),(55,56),(56,58),(57,58),(58,60),(59,60)] :: Graph
For example, to find all subgraphs with length from 1 to 7:
length $ ext c60 [[(1,2)]] Nothing [] 7
>102332
The problem is that the computation is far too slow. As pointed out in the original article, the program written in FORTRAN 77 and run on a 150MHz workstation performs the test task at least 30 times faster than my code does on a modern i5 processor.
I can't understand why my program is so slow. Is there any way to refactor this code? Or is the best solution to port it to C and write bindings to the C library via the FFI?
I decided to take a shot at implementing the algorithm described in the paper using fgl. The complete code follows.
{-# LANGUAGE NoMonomorphismRestriction #-}
import Data.Graph.Inductive
import Data.List
import Data.Tree
uniq = map head . group . sort . map (\(a, b) -> (min a b, max a b))
delEdgeLU (from, to) = delEdge (from, to) . delEdge (to, from)
insEdgeDU (from, to) = insEdge (from, to, ()) . insNodeU to . insNodeU from where
    insNodeU n g = if gelem n g then g else insNode (n, ()) g
nextEdges subgraph remaining
    | isEmpty subgraph = uniq (edges remaining)
    | otherwise = uniq $ do
        n <- nodes subgraph
        n' <- suc remaining n
        return (n, n')
search_ subgraph remaining
    = Node subgraph
    . snd . mapAccumL step remaining
    $ nextEdges subgraph remaining
    where
    step r e = let r' = delEdgeLU e r in (r', search_ (insEdgeDU e subgraph) r')
search = search_ empty
mkUUGraph :: [(Int, Int)] -> Gr () ()
mkUUGraph es = mkUGraph ns (es ++ map swap es) where
    ns = nub (map fst es ++ map snd es)
    swap (a, b) = (b, a)
-- the one from the paper
sampleGraph = mkUUGraph cPaper
cPaper = [(1, 2), (1, 5), (1, 6), (2, 3), (3, 4), (4, 5)]
The functions you'll want to use at the top-level are mkUUGraph, which constructs a graph from a list of edges, and search, which constructs a tree whose nodes are connected subgraphs of its input. For example, to compute the statistics shown at the bottom of "Scheme 1" in the paper, you might do this:
*Main> map length . tail . levels . search . mkUUGraph $ [(1, 2), (1, 5), (1, 6), (2, 3), (3, 4), (4, 5)]
[6,7,8,9,6,1]
*Main> sum it
37
I had a little trouble comparing it to your implementation, because I don't understand what all the arguments to ext are supposed to do. In particular, I couldn't work out how to call ext on the adjacency graph in the paper in such a way that I got 37 results. Perhaps you have a bug.
In any case, I did my best to emulate what I think your code is trying to do: finding graphs with up to seven edges, and certainly containing the edge (1, 2) (despite the fact that your code outputs many graphs that do not contain (1, 2)). I added this code:
mainHim = print . length $ ext c60 [[(1,2)]] Nothing [] 7
mainMe = print . length . concat . take 7 . levels $ search_ (mkUUGraph [(1,2)]) (mkUUGraph c60)
My code finds 3301 such graphs; yours finds 35571. I didn't try very hard to figure out where that discrepancy came from. In ghci, mainHim takes 36.45s; mainMe takes 0.13s. When compiled with -O2, mainHim takes 4.65s; mainMe takes 0.05s. The numbers for mainMe can be cut in half again by using the PatriciaTree graph implementation rather than the default one, and probably cut still farther with profiling and some thought. Just in case the reason mainMe is so much faster is that it is finding so many fewer graphs, I tested a modified main as well:
main = print . length . concat . take 8 . levels $ (search (mkUUGraph c60) :: Tree (Gr () ()))
This prints 35853, so it is finding roughly the same number of graphs as your test command. It takes 0.72s in ghci and 0.38s when compiled with -O2.
Or the best solution is porting it on C, and write bindings to C library over FFI?
No, you don't have to write it in C. The code generated by GHC is not that much slower than C. This huge speed difference suggests that you're implementing a different algorithm. So instead of rewriting in a different language, you should rewrite the Haskell code.
I guess the problem with your code is that you ...
use lists instead of sets
use breadth-first instead of depth-first enumeration (not sure)
use operations on the whole set of edges instead of cleverly keeping track of which edges are in which set
encode the recursive structure of the algorithm by hand, instead of using recursive calls.
I have to admit that I don't fully understand your code. But I read the paper you linked to, and the algorithm described there seems to be a simple brute-force enumeration of all results. So I guess the Haskell implementation should use the list monad (or list comprehensions) to enumerate all subgraphs, filtering out non-connected subgraphs during the enumeration. If you've never written code with the list monad before, just enumerating all subgraphs might be a good starting point.
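As a starting point for that suggestion, the brute-force enumeration can be written as a list comprehension over all edge subsets. This is a sketch for small graphs only, since the number of subsets doubles with every edge; isConnected and connectedSubgraphs are hypothetical names, and the empty subgraph is not counted.

```haskell
import Data.List (partition, subsequences)

type Edge = (Int, Int)

-- True when the edges form one connected component.
isConnected :: [Edge] -> Bool
isConnected [] = False
isConnected ((a, b) : rest) = grow [a, b] rest
  where
    grow _ [] = True
    grow seen pending =
      case partition (touches seen) pending of
        ([], _)     -> False   -- no remaining edge reaches the component
        (near, far) -> grow (concatMap ends near ++ seen) far
    touches seen (x, y) = x `elem` seen || y `elem` seen
    ends (x, y) = [x, y]

-- Enumerate every subset of edges and keep the connected ones.
connectedSubgraphs :: [Edge] -> [[Edge]]
connectedSubgraphs es = [s | s <- subsequences es, isConnected s]
```

On the six-edge sample graph from the paper, [(1,2),(1,5),(1,6),(2,3),(3,4),(4,5)], this yields 37 subgraphs, which agrees with the total quoted in the other answer.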

I need to create haskell function, which returns all possible binary trees, given a list of integers

As the title says, I need this:
getAllTrees :: [Int] -> [Tree Int]
getAllTrees xs = undefined
where tree is
data Tree x
    = Null
    | Leaf x
    | Node (Tree x) x (Tree x)
I will appreciate any help, even the smallest clue :)
Thanks
I usually find it easiest to use the list monad for these kinds of problems. We can define getAllTrees by reasoning as follows:
The only tree of zero items is Null:
getAllTrees [] = return Null
There is also only one tree of one element, namely a Leaf:
getAllTrees [x] = return $ Leaf x
When we have more than one element, we can split the list in all possible ways to determine how we should branch, and then recursively generate the sub-trees from each list. Let's say we have a function splits :: [a] -> [([a], [a])] that returns all ways of splitting a list, for example:
> splits [1..3]
[([],[1,2,3]),([1],[2,3]),([1,2],[3]),([1,2,3],[])]
We can then define the final case of getAllTrees by using the list monad. This allows us to write code which looks as if we're focusing on only one case, and the monad will give us all the combinations.
getAllTrees xs = do
    (left, x : right) <- splits xs
    Node <$> getAllTrees left <*> pure x <*> getAllTrees right
The first line splits the input list and takes the first item from the second part as the middle element. The case when the second part is empty doesn't match the pattern, so it gets discarded since that's how the list monad handles pattern match failures.
The second line uses applicative syntax to say that we want the result to be a list of nodes, made from all combinations of sub-trees from the left list, the fixed middle element x, and all sub-trees from the right list.
All that remains then is to implement splits. Looking at the example above, it's easy to see that we can just take the inits and tails of the list and zip them together:
splits xs = zip (inits xs) (tails xs)
Time for a quick sanity check in the interpreter:
> mapM_ print $ getAllTrees [1..3]
Node Null 1 (Node Null 2 (Leaf 3))
Node Null 1 (Node (Leaf 2) 3 Null)
Node (Leaf 1) 2 (Leaf 3)
Node (Node Null 1 (Leaf 2)) 3 Null
Node (Node (Leaf 1) 2 Null) 3 Null
> length $ getAllTrees [1..5]
42
Looks like we're done! Some key lessons:
Try to think about the small cases first, and build up from there.
The list monad is useful for code that needs to generate all combinations of things.
You don't have to do everything at once. Dealing with the list splitting separately made the code much simpler than it would have been otherwise.
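Putting the pieces above together gives a module that can be loaded as is. The only things added here are the import, the type signature for splits, and the deriving clause, which is an assumption needed to print and compare trees.

```haskell
import Data.List (inits, tails)

data Tree x
  = Null
  | Leaf x
  | Node (Tree x) x (Tree x)
  deriving (Eq, Show)

-- All ways of cutting a list into a prefix and a suffix.
splits :: [a] -> [([a], [a])]
splits xs = zip (inits xs) (tails xs)

getAllTrees :: [Int] -> [Tree Int]
getAllTrees []  = return Null
getAllTrees [x] = return (Leaf x)
getAllTrees xs  = do
  -- Splits with an empty right part fail the pattern and are skipped.
  (left, x : right) <- splits xs
  Node <$> getAllTrees left <*> pure x <*> getAllTrees right
```

The counts follow the Catalan numbers from three elements onwards (5, 14, 42, ...), which is a quick way to check a modified version.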

Two-dimensional zipper

Inspired by the recent question about 2d grids in Haskell, I'm wondering if it would be possible to create a two-dimensional zipper to keep track of a position in a list of lists. A one-dimensional zipper on a list allows us to really efficiently move locally in a large list (the common example being a text editor). But let's say we have a second dimension like this:
grid =
    [[ 1, 2, 3, 4, 5]
    ,[ 6, 7, 8, 9,10]
    ,[11,12,13,14,15]
    ,[16,17,18,19,20]
    ,[21,22,23,24,25]]
Can we create some kind of zipper data structure to efficiently move not only left and right but up and down in the grid here? If so, what if we replace the list of lists with an infinite list of infinite lists, can we still get efficient movement?
Not quite, no. One of the key aspects of how zippers work is that they represent a location in a structure by a path used to reach it, plus extra fragments created along the way, with the end result that you can backtrack along that path and rebuild the structure as you go. The nature of the paths available through the data structure thus constrains the zipper.
Because locations are identified by paths, each distinct path represents a different location, so any data structure with multiple paths to the same value can't be used with a zipper--for example, consider a cyclic list, or any other structure with looping paths.
Arbitrary movement in 2D space doesn't really fit the above requirements, so we can deduce that a 2D zipper would necessarily be somewhat limited. Perhaps you'd start from the origin, walk a path through the structure, and then backtrack along that path some distance in order to reach other points, for example. This also implies that for any point in the structure, there are other points that can only be reached via the origin.
What you can do is build some notion of 2D distance into the data structure, so that as you follow a path down through the structure, the points "below" you are close to each other; the idea is to minimize the amount of backtracking needed on average to move a short distance in 2D space. This ends up being roughly the same approach needed to search 2D space by distance--nearest neighbor searches, efficient geometric intersection, that sort of thing--and can be done with the same kind of data structure, namely space partitioning to create a higher-dimensional search tree. Implementing a zipper for a quadtree, a kd-tree, or similar structures is straightforward, just like any other tree.
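To illustrate the quadtree case, here is a minimal zipper sketch. All of the names (Quad, Dir, Crumb, descend, ascend) are hypothetical, and it only shows movement between a node and its four children, not how grid coordinates would be mapped onto the tree.

```haskell
data Quad a = Leaf a | Branch (Quad a) (Quad a) (Quad a) (Quad a)
  deriving (Eq, Show)

data Dir = NW | NE | SW | SE
  deriving (Eq, Show)

-- The direction taken plus the three siblings left behind, in NW NE SW SE order.
data Crumb a = Crumb Dir (Quad a) (Quad a) (Quad a)
  deriving (Eq, Show)

-- The focused subtree and the path back to the root.
type Zipper a = (Quad a, [Crumb a])

descend :: Dir -> Zipper a -> Maybe (Zipper a)
descend _ (Leaf _, _) = Nothing
descend d (Branch nw ne sw se, cs) = Just $ case d of
  NW -> (nw, Crumb NW ne sw se : cs)
  NE -> (ne, Crumb NE nw sw se : cs)
  SW -> (sw, Crumb SW nw ne se : cs)
  SE -> (se, Crumb SE nw ne sw : cs)

ascend :: Zipper a -> Maybe (Zipper a)
ascend (_, []) = Nothing
ascend (q, Crumb d x y z : cs) = Just $ case d of
  NW -> (Branch q x y z, cs)
  NE -> (Branch x q y z, cs)
  SW -> (Branch x y q z, cs)
  SE -> (Branch x y z q, cs)
```

Moving to a spatial neighbour then means ascending to the nearest common ancestor and descending again, which is the backtracking cost discussed above.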
Well, you can use something simple like the following code. We represent a table by the rows above the selected element, the rows below it, the elements to its left, and the elements to its right.
The top rows and the left elements are stored in a reverse order to enable efficient movement.
I'm not sure if this qualifies as a zipper though, because even though we hold a "location" in the data structure, it is not a "path".
-- Table sel left right top bottom
data Table a = Table a [a] [a] [[a]] [[a]] deriving Show
left :: Table a -> Table a
left tab@(Table _ [] _ _ _) = tab
left (Table sel (l:ls) rs ts bs) = Table l ls (sel:rs) ts bs
right :: Table a -> Table a
right tab@(Table _ _ [] _ _) = tab
right (Table sel ls (r:rs) ts bs) = Table r (sel:ls) rs ts bs
up :: Table a -> Table a
up tab@(Table _ _ _ [] _) = tab
up (Table sel ls rs (t:ts) bs) = Table sel' (reverse before) rs' ts (b:bs)
  where
    -- split the row above at the current column; ls is stored reversed,
    -- so un-reverse it when rebuilding the old row and reverse the new prefix
    (before, sel':rs') = splitAt (length ls) t
    b = reverse ls ++ (sel:rs)
down :: Table a -> Table a
down tab@(Table _ _ _ _ []) = tab
down (Table sel ls rs ts (b:bs)) = Table sel' (reverse before) rs' (t:ts) bs
  where
    (before, sel':rs') = splitAt (length ls) b
    t = reverse ls ++ (sel:rs)
tableToList :: Table a -> [[a]]
tableToList (Table sel ls rs ts bs) = reverse ts ++ [reverse ls ++ (sel:rs)] ++ bs
listToTable :: [[a]] -> Table a
listToTable [] = error "cannot make empty table"
listToTable ([]:_) = error "cannot make empty table"
listToTable ((t:tr):ts) = Table t [] tr [] ts
This even works for infinite lists:
selected :: Table a -> a
selected (Table sel _ _ _ _) = sel
a :: Table Int
a = listToTable $ replicate 10 [1..]
selected a                  -- => 1
selected $ down a           -- => 1
selected $ right $ down a   -- => 2
I was looking for something similar: a way to cheaply and easily navigate (which includes going “backwards”) a doubly-infinite list of lists. Here's my take on it.
If I read the other answers carefully, what I'm presenting here isn't really a zipper: while navigation is amortized O(1), the memory used by the zipper structure network is never released. On the other hand, it ought to tie the knot enough for “cells” to be shared no matter the path we take to get to them, which is the kind of topology we'd want on a 2D list of lists.
To compensate, the list of lists used to generate it ought to eventually go unreferenced and garbage-collected.
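A one-dimensional analogue may make the knot-tying idea easier to see. This is only an illustrative sketch: the `DNode` type and the `fromList` helper below are hypothetical and are not part of the 2D code that follows.

```haskell
-- A node of a doubly-linked list; both links point at shared nodes.
data DNode a = DNode
  { prev :: Maybe (DNode a)
  , val  :: a
  , next :: Maybe (DNode a)
  }

-- Build the nodes lazily from a plain list. Each node passes itself
-- to the construction of its successor, so a node's `next` points to a
-- node whose `prev` is that very same node (no copying on navigation).
fromList :: [a] -> Maybe (DNode a)
fromList = go Nothing
  where
    go _ [] = Nothing
    go p (x:xs) =
      let node = DNode p x (go (Just node) xs)
      in Just node
```

Walking `next` and then `prev` lands on the node we started from, which is the kind of sharing the 2D structure aims for in both directions.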
data FakeZip2D a = Z2 { z2Val   :: a
                      , goUp    :: !( Maybe (FakeZip2D a) )
                      , goDown  ::    Maybe (FakeZip2D a)
                      , goLeft  :: !( Maybe (FakeZip2D a) )
                      , goRight ::    Maybe (FakeZip2D a)
                      }
fromList2 :: [[a]] -> Maybe (FakeZip2D a)
fromList2 xss = head (head zss) where
  extended = [ repeat Nothing ] ++
             map (\z -> [Nothing] ++ z ++ repeat Nothing) zss ++
             [ repeat Nothing ]
  zss = zipWith4' row xss extended (drop 1 extended) (drop 2 extended)
  row xs prev cur next = Just <$> zipWith5' Z2 xs (tail prev) (tail next)
                                            cur (drop 2 cur)
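-- totally inspired by https://stackoverflow.com/a/54096748/12274
zipWith4' f (a:as) (b:bs) ~(c:cs) ~(d:ds) =
  f a b c d : zipWith4' f as bs cs ds
zipWith4' _ _ _ _ _ = []    -- stop when a finite input runs out
zipWith5' f (a:as) (b:bs) ~(c:cs) (d:ds) ~(e:es) =
  f a b c d e : zipWith5' f as bs cs ds es
zipWith5' _ _ _ _ _ _ = []  -- stop when a finite input runs out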
The data structure ought to be self-explanatory. Up and left can afford to be strict because we're building from singly-linked lists. AFAIK, there's no point in making them lazy in Haskell, as they wouldn't let anything go out of scope anyway.
The lattice is built recursively, expanding the borders of the provided input with Nothing. The lazy-enough variants of zipWith I needed are inspired from answers to another series of questions of mine on the topic.
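To illustrate why lazy (irrefutable) patterns matter here, consider a small self-contained sketch, unrelated to the grid itself: `zipLazy` and `suffixSums` are hypothetical names, and the example only shows the general trick of zipping against a list that is defined in terms of the zip's own result.

```haskell
-- Like zipWith, but the second list is matched lazily: the result's
-- first cons cell is produced before the second list is inspected.
zipLazy :: (a -> b -> c) -> [a] -> [b] -> [c]
zipLazy f (a:as) ~(b:bs) = f a b : zipLazy f as bs
zipLazy _ _ _ = []

-- Each element is added to the sum of everything after it. The second
-- argument refers to `sums` itself, so the ordinary zipWith would loop
-- here: it would have to force `tail sums` before producing any output.
suffixSums :: [Int] -> [Int]
suffixSums xs = sums
  where
    sums = zipLazy (+) xs (tail sums ++ [0])
```

The lazily matched arguments of `zipWith4'` and `zipWith5'` play the same role: they are the ones that depend on rows or cells still being built.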
Here it is in action:
-- requires: import Control.Monad ((>=>)) and import Data.Maybe (catMaybes)
demo :: IO ()
demo = do
  let multList2 = [[ i*j | j <- [0..] ] | i <- [0..] ]
      multZ2 = fromList2 multList2
  let rows = iterate (>>= goDown) multZ2
      cols = map (iterate (>>= goRight)) rows
  putStrLn "Multiplication table"
  mapM_ (print . map z2Val) $ take 5 $ map (catMaybes . take 5) cols
  putStrLn "List of squares"
  let goDiag = goRight >=> goDown
  print $ map z2Val $ take 25 $ catMaybes $ iterate (>>= goDiag) multZ2
  putStrLn "Convoluted list of squares"
  let goDiag' = goDown >=> goRight >=> goUp >=> goLeft >=> goDiag
  print $ map z2Val $ take 25 $ catMaybes $ iterate (>>= goDiag') multZ2
The interface can likely be made even easier to use by dropping the Maybes. At your own risk, naturally.
This might be slightly off-topic as it's not a real zipper either, but it solved my problem; and since this is the question that came up when I first looked for a solution, I'm posting it here with the intent it helps someone else.
