Haskell profiling subgraph mining algorithm

I am trying to solve the problem of finding all connected subgraphs in Haskell. The algorithm I use is described here. A quote from that paper:
As in every path algorithm, there are forward steps and back steps. A step forward is done if a given connected subgraph can be extended by addition of edge k, that is if edge k is not already part of the given subgraph, if k is adjacent to at least one edge of the given subgraph, and if addition of edge k is not forbidden by some restrictions given below.
A step back is done as soon as a given connected subgraph cannot be further elongated. In this case the edge added last is removed from the string, it is temporarily given the status "forbidden", and any other edges which were forbidden by backtracking from a previous longer string are simultaneously "allowed" again. In contrast, an edge which is forbidden by being removed from a string shorter than the present one remains forbidden, thus assuring that every connected subgraph is constructed once and only once.
To implement this algorithm, I represent graphs as lists of edges:
type Edge = (Int,Int)
type Graph = [Edge]
First, I wrote a function addEdge that checks whether the graph can be extended. It returns Nothing if that isn't possible, or the Edge to extend with.
I have a "parent" graph and an "extensible" graph, so I try to find one and only one edge that exists in the "parent" graph, is connected to the "extensible" graph, is not already included in the "extensible" graph, and is not in the forbidden set.
I wrote this function below:
addEdge :: Graph -> Graph -> [Edge] -> Maybe Edge
addEdge !parent !extensible !forb = listToMaybe $ intersectBy (\ (i,j) (k,l) -> (i == k || i == l || j == k || j == l)) (parent \\ (extensible `union` forb)) extensible
It works! But, as I see from profiling the whole program, addEdge is the heaviest function. I am sure that my code isn't optimal. At the very least, intersectBy finds all possible solutions, but I need only one. Is there any way to make this code faster? Maybe I shouldn't use standard lists but Set from Data.Set? That's the first point of attention.
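One possible direction, given only as a sketch: keep the vertices of the current subgraph and the already used or forbidden edges in Data.Set, and stop at the first parent edge that qualifies. The names addEdgeSet, nodes and used are hypothetical and not part of the original program. Because of laziness, listToMaybe probably already stops intersectBy at the first hit, so the larger cost in the original is more likely the quadratic `\\` and `union` on lists.

```haskell
import Data.List (find)
import qualified Data.Set as Set

type Edge = (Int, Int)

-- Hypothetical variant of addEdge.
--   parent: all edges of the parent graph
--   nodes:  vertices of the subgraph built so far
--   used:   edges already in the subgraph plus the forbidden ones,
--           stored with the same orientation as in 'parent'
addEdgeSet :: [Edge] -> Set.Set Int -> Set.Set Edge -> Maybe Edge
addEdgeSet parent nodes used = find ok parent
  where
    -- An edge qualifies if it is unused and touches the subgraph.
    ok e@(i, j) =
      e `Set.notMember` used
        && (i `Set.member` nodes || j `Set.member` nodes)
```

Each membership test is O(log n), so one call costs roughly O(m log n) for m parent edges, and find returns as soon as one edge passes.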
The main recursive function, ext, is presented below:
ext :: Graph -> [Graph] -> Maybe Graph -> [(Edge,Int)] -> Int -> [Graph]
ext !main !list !grow !forb !maxLength
    | isEnd == True = (filter (\g -> (length g /= 1)) list) ++ (group main)
    | ((addEdge main workGraph forbEdges) == Nothing) || (length workGraph) >= maxLength = ext main list (Just workGraph) forbProcess maxLength
    | otherwise = ext main ((addedEdge:workGraph):list) Nothing forb maxLength where
        workGraph = if grow == Nothing then (head list) else (bite (fromJust grow)) -- [Edge] graph now proceeded
        workGraphLength = length workGraph
        addedEdge = fromJust $ addEdge'
        addEdge' = addEdge main workGraph forbEdges
        bite xz = if (length xz == 1) then (fromJust (addEdge main xz forbEdges)):[] else tail xz
        forbProcess = (head workGraph,workGraphLength):(filter ((<=workGraphLength).snd) forb)
        forbEdges = map fst forb -- convert from (Edge,Level) to [Edge]
        isEnd = (grow /= Nothing) && (length (fromJust grow) == 1) && ((addEdge main (fromJust grow) forbEdges) == Nothing)
I test my program on the graph
c60 = [(1,4),(1,3),(1,2),(2,6),(2,5),(3,10),(3,7),(4,24),(4,21),(5,8),(5,7),(6,28),(6,25),
(7,9),(8,11),(8,12),(9,16),(9,13),(10,20),(10,17),(11,14),(11,13),(12,28),(12,30),(13,15),
(14,43),(14,30),(15,44),(15,18),(16,18),(16,17),(17,19),(18,47),(19,48),(19,22),(20,22),(20,21),
(21,23),(22,31),(23,32),(23,26),(24,26),(24,25),(25,27),(26,35),(27,36),(27,29),(28,29),(29,39),
(30,40),(31,32),(31,33),(32,34),(33,50),(33,55),(34,37),(34,55),(35,36),(35,37),(36,38),(37,57),
(38,41),(38,57),(39,40),(39,41),(40,42),(41,59),(42,45),(42,59),(43,44),(43,45),(44,46),(45,51),
(46,49),(46,51),(47,48),(47,49),(48,50),(49,53),(50,53),(51,52),(52,60),(52,54),(53,54),(54,56),(55,56),(56,58),(57,58),(58,60),(59,60)] :: Graph
For example, to find all subgraphs with length from 1 to 7:
length $ ext c60 [[(1,2)]] Nothing [] 7
>102332
The problem is that the computation is too slow. According to the original article, the program was written in FORTRAN 77 and run on a 150 MHz workstation, where it performed the test task at least 30 times faster than my code does on a modern i5 processor.
I can't understand why my program is so slow. Is there any way to refactor this code? Or is the best solution to port it to C and write bindings to the C library via the FFI?

I decided to take a shot at implementing the algorithm described in the paper using fgl. The complete code follows.
{-# LANGUAGE NoMonomorphismRestriction #-}
import Data.Graph.Inductive
import Data.List
import Data.Tree
uniq = map head . group . sort . map (\(a, b) -> (min a b, max a b))
delEdgeLU (from, to) = delEdge (from, to) . delEdge (to, from)
insEdgeDU (from, to) = insEdge (from, to, ()) . insNodeU to . insNodeU from where
    insNodeU n g = if gelem n g then g else insNode (n, ()) g
nextEdges subgraph remaining
    | isEmpty subgraph = uniq (edges remaining)
    | otherwise = uniq $ do
        n <- nodes subgraph
        n' <- suc remaining n
        return (n, n')
search_ subgraph remaining
    = Node subgraph
    . snd . mapAccumL step remaining
    $ nextEdges subgraph remaining
    where
    step r e = let r' = delEdgeLU e r in (r', search_ (insEdgeDU e subgraph) r')
search = search_ empty
mkUUGraph :: [(Int, Int)] -> Gr () ()
mkUUGraph es = mkUGraph ns (es ++ map swap es) where
    ns = nub (map fst es ++ map snd es)
    swap (a, b) = (b, a)
-- the one from the paper
sampleGraph = mkUUGraph cPaper
cPaper = [(1, 2), (1, 5), (1, 6), (2, 3), (3, 4), (4, 5)]
The functions you'll want to use at the top-level are mkUUGraph, which constructs a graph from a list of edges, and search, which constructs a tree whose nodes are connected subgraphs of its input. For example, to compute the statistics shown at the bottom of "Scheme 1" in the paper, you might do this:
*Main> map length . tail . levels . search . mkUUGraph $ [(1, 2), (1, 5), (1, 6), (2, 3), (3, 4), (4, 5)]
[6,7,8,9,6,1]
*Main> sum it
37
I had a little trouble comparing it to your implementation, because I don't understand what all the arguments to ext are supposed to do. In particular, I couldn't work out how to call ext on the adjacency graph in the paper in such a way that I got 37 results. Perhaps you have a bug.
In any case, I did my best to emulate what I think your code is trying to do: finding graphs with up to seven edges, and certainly containing the edge (1, 2) (despite the fact that your code outputs many graphs that do not contain (1, 2)). I added this code:
mainHim = print . length $ ext c60 [[(1,2)]] Nothing [] 7
mainMe = print . length . concat . take 7 . levels $ search_ (mkUUGraph [(1,2)]) (mkUUGraph c60)
My code finds 3301 such graphs; yours finds 35571. I didn't try very hard to figure out where that discrepancy came from. In ghci, mainHim takes 36.45s; mainMe takes 0.13s. When compiled with -O2, mainHim takes 4.65s; mainMe takes 0.05s. The numbers for mainMe can be cut in half again by using the PatriciaTree graph implementation rather than the default one, and probably cut still farther with profiling and some thought. Just in case the reason mainMe is so much faster is that it is finding so many fewer graphs, I tested a modified main as well:
main = print . length . concat . take 8 . levels $ (search (mkUUGraph c60) :: Tree (Gr () ()))
This prints 35853, so it is finding roughly the same number of graphs as your test command. It takes 0.72s in ghci and 0.38s when compiled with -O2.

Or the best solution is porting it on C, and write bindings to C library over FFI?
No, you don't have to write it in C. The code generated by GHC is not that much slower than C. This huge speed difference suggests that you're implementing a different algorithm. So instead of rewriting in a different language, you should rewrite the Haskell code.
I guess the problem with your code is that you ...
use lists instead of sets
use breadth-first instead of depth-first enumeration (not sure)
use operations on the whole set of edges instead of cleverly keeping track of which edges are in which set
encode the recursive structure of the algorithm by hand, instead of using recursive calls.
I have to admit that I don't fully understand your code. But I read the paper you linked to, and the algorithm described there seems to be a simple brute-force enumeration of all results. So I guess the Haskell implementation should use the list monad (or list comprehensions) to enumerate all subgraphs, filtering out non-connected subgraphs during the enumeration. If you've never written code with the list monad before, just enumerating all subgraphs might be a good starting point.
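To make the "enumerate, then filter" suggestion concrete, here is a minimal brute-force sketch using a list comprehension. It is not the algorithm from the paper: it tries all 2^n edge subsets, so it is only usable on tiny graphs and certainly not on c60. The helper names touches, connected and connectedSubgraphs are made up for this illustration.

```haskell
import Data.List (partition, subsequences)

type Edge = (Int, Int)

-- Two edges touch when they share an endpoint.
touches :: Edge -> Edge -> Bool
touches (a, b) (c, d) = a == c || a == d || b == c || b == d

-- An edge set is connected when, starting from one edge and repeatedly
-- absorbing the edges that touch the most recently absorbed ones,
-- every edge is eventually absorbed. The empty set is rejected.
connected :: [Edge] -> Bool
connected [] = False
connected (e : es) = grow [e] es
  where
    grow _ [] = True
    grow frontier rest =
      case partition (\r -> any (touches r) frontier) rest of
        ([], _) -> False
        (near, far) -> grow near far

-- Enumerate every subset of edges and keep the connected ones.
connectedSubgraphs :: [Edge] -> [[Edge]]
connectedSubgraphs g = [s | s <- subsequences g, connected s]
```

On the six-edge example graph from the paper, [(1,2),(1,5),(1,6),(2,3),(3,4),(4,5)], this yields 37 connected subgraphs, the same total as reported for that graph in the other answer.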

Related

Benchmarking Filter and Partition

I was testing the performance of the partition function for lists and got some strange results, I think.
We have that partition p xs == (filter p xs, filter (not . p) xs), but we chose the first implementation because it performs only a single traversal over the list. Yet the results I got suggest that it may be better to use the implementation with two traversals.
Here is the minimal code that shows what I'm seeing
import Criterion.Main
import System.Random
import Data.List (partition)
mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)
randList :: RandomGen g => g -> Integer -> [Integer]
randList gen 0 = []
randList gen n = x:xs
  where
    (x, gen') = random gen
    xs = randList gen' (n - 1)
main = do
  gen <- getStdGen
  let arg10000000 = randList gen 10000000
  defaultMain [
    bgroup "filters -- split list in half " [
      bench "partition100" $ nf (partition (>= 50)) arg10000000
      , bench "mypartition100" $ nf (mypartition (>= 50)) arg10000000
      ]
    ]
I ran the tests both with -O and without it, and both times the double traversal comes out faster.
I am using ghc-7.10.3 with criterion-1.1.1.0
My questions are:
Is this expected?
Am I using Criterion correctly? I know that laziness can be tricky and (filter p xs, filter (not . p) xs) will only do two traversals if both elements of the tuple are used.
Does this have something to do with the way lists are handled in Haskell?
Thanks a lot!

There is no black or white answer to the question. To dissect the problem consider the following code:
import Control.DeepSeq
import Data.List (partition)
import System.Environment (getArgs)
mypartition :: (a -> Bool) -> [a] -> ([a],[a])
mypartition p l = (filter p l, filter (not . p) l)
main :: IO ()
main = do
  let cnt = 10000000
      xs = take cnt $ concat $ repeat [1 .. 100 :: Int]
  args <- getArgs
  putStrLn $ unwords $ "Args:" : args
  case args of
    [percent, fun]
      -> let p = (read percent >=)
         in case fun of
              "partition" -> print $ rnf $ partition p xs
              "mypartition" -> print $ rnf $ mypartition p xs
              "partition-ds" -> deepseq xs $ print $ rnf $ partition p xs
              "mypartition-ds" -> deepseq xs $ print $ rnf $ mypartition p xs
              _ -> err
    _ -> err
  where
    err = putStrLn "Sorry, I do not understand."
I do not use Criterion, in order to have better control over the order of evaluation. To get timings, I use the +RTS -s runtime option. The different test cases are executed using different command line options. The first command line option defines for which percentage of the data the predicate holds. The second command line option chooses between different tests.
The tests distinguish two cases:
The data is generated lazily (2nd argument partition or mypartition).
The data is already fully evaluated in memory (2nd argument partition-ds or mypartition-ds).
The result of the partitioning is always evaluated from left to right, i.e. starting with the list that contains all the elements for which the predicate holds.
In case 1 partition has the advantage that elements of the first resulting list get discarded before all elements of the input list were even produced. Case 1 is especially good, if the predicate matches many elements, i.e. the first command line argument is large.
In case 2, partition cannot play out this advantage, since all elements are already in memory.
For mypartition, in any case all elements are held in memory after the first resulting list is evaluated, because they are needed again to compute the second resulting list. Therefore there is not much of a difference between the two cases.
It seems, the more memory is used, the harder garbage collection gets. Therefore partition is well suited, if the predicate matches many elements and the lazy variant is used.
Conversely, if the predicate does not match many elements or all elements are already in memory, mypartition performs better, since its recursion does not deal with pairs in contrast to partition.
The Stackoverflow question “Irrefutable pattern does not leak memory in recursion, but why?” might give some more insights about the handling of pairs in the recursion of partition.
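As a rough illustration of what "dealing with pairs" means here, a single-pass partition can be written as a foldr that threads a pair of result lists through the recursion. This resembles the way Data.List defines partition in base, but treat it as a sketch rather than the library source; partitionLazy is a made-up name.

```haskell
-- Single traversal: each element is consed onto one side of a pair.
partitionLazy :: (a -> Bool) -> [a] -> ([a], [a])
partitionLazy p = foldr select ([], [])
  where
    -- The lazy pattern (~) avoids forcing the recursive result before
    -- the current element has been placed, which keeps the function
    -- productive even on long or infinite inputs.
    select x ~(ts, fs)
      | p x = (x : ts, fs)
      | otherwise = (ts, x : fs)
```

Every step allocates a fresh pair in addition to a list cell, which is the overhead that the two-filter version does not pay.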

Long running time of a program that counts Ints

I want to write a program that takes a list of Ints and a length, and returns a list that holds at position i all the elements equal to i. For example:
[0,0,0,1,3,5,3,2,2,4,4,4] 6 -> [[0,0,0],[1],[2,2],[3,3],[4,4,4],[5]]
[0,0,4] 7 -> [[0,0],[],[],[],[4],[],[]]
[] 3 -> [[],[],[]]
[2,2] 3 -> [[],[],[2,2]]
So, that's my solution
import Data.List
import Data.Function
f :: [Int] -> Int -> [[Int]]
f ls len = g 0 ls' [] where
    ls' = group . sort $ ls
    g :: Int -> [[Int]] -> [[Int]] -> [[Int]]
    g val [] accum
        | len == val = accum
        | otherwise = g (val+1) [] (accum ++ [[]])
    g val (x:xs) accum
        | len == val = accum
        | val == head x = g (val+1) xs (accum ++ [x])
        | otherwise = g (val+1) (x:xs) (accum ++ [[]])
But the query f [] 1000000 takes a really long time. Why?

I see we're accumulating over some data structure. I think foldMap. I ask "Which Monoid"? It's some kind of lists of accumulations. Like this
newtype Bunch x = Bunch {bunch :: [x]}
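-- Since GHC 8.4, Monoid has Semigroup as a superclass, so the combining
-- operation lives in a Semigroup instance and mappend defaults to (<>).
instance Semigroup x => Semigroup (Bunch x) where
  Bunch xss <> Bunch yss = Bunch (glom xss yss) where
    glom [] yss = yss
    glom xss [] = xss
    glom (xs : xss) (ys : yss) = (xs <> ys) : glom xss yss
instance Semigroup x => Monoid (Bunch x) where
  mempty = Bunch []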
Our underlying elements have some associative operator <>, and we can thus apply that operator pointwise to a pair of lists, just like zipWith does, except that when we run out of one of the lists, we don't truncate, rather we just take the other. Note that Bunch is a name I'm introducing for purposes of this answer, but it's not that unusual a thing to want. I'm sure I've used it before and will again.
If we can translate
0 -> Bunch [[0]] -- single 0 in place 0
1 -> Bunch [[],[1]] -- single 1 in place 1
2 -> Bunch [[],[],[2]] -- single 2 in place 2
3 -> Bunch [[],[],[],[3]] -- single 3 in place 3
...
and foldMap across the input, then we'll get the right number of each in each place. There should be no need for an upper bound on the numbers in the input to get a sensible output, as long as you are willing to interpret [] as "the rest is silence". Otherwise, like Procrustes, you can pad or chop to the length you need.
Note, by the way, that when mappend's first argument comes from our translation, we do a bunch of ([]++) operations, a.k.a. ids, then a single ([i]++), a.k.a. (i:), so if foldMap is right-nested (which it is for lists), then we will always be doing cheap operations at the left end of our lists.
Now, as the question works with lists, we might want to introduce the Bunch structure only when it's useful. That's what Control.Newtype is for. We just need to tell it about Bunch.
instance Newtype (Bunch x) [x] where
  pack = Bunch
  unpack = bunch
And then it's
groupInts :: [Int] -> [[Int]]
groupInts = ala' Bunch foldMap (basis !!) where
  basis = ala' Bunch foldMap id [iterate ([]:) [], [[[i]] | i <- [0..]]]
What? Well, without going to town on what ala' is in general, its impact here is as follows:
ala' Bunch foldMap f = bunch . foldMap (Bunch . f)
meaning that, although f is a function to lists, we accumulate as if f were a function to Bunches: the role of ala' is to insert the correct pack and unpack operations to make that just happen.
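For readers who would rather not depend on Control.Newtype, the same accumulation can be sketched with the pack and unpack steps written out by hand. This is a simplified variant, not the code above: it skips the memoized basis and builds each element's image directly with replicate, and it assumes the inputs are non-negative. The name groupIntsPlain is made up.

```haskell
newtype Bunch x = Bunch { bunch :: [x] }

-- Pointwise combination that keeps the longer tail instead of truncating.
instance Semigroup x => Semigroup (Bunch x) where
  Bunch xss <> Bunch yss = Bunch (glom xss yss)
    where
      glom [] ys = ys
      glom xs [] = xs
      glom (x : xs) (y : ys) = (x <> y) : glom xs ys

instance Semigroup x => Monoid (Bunch x) where
  mempty = Bunch []

-- Translate i to i empty slots followed by [i], then accumulate.
groupIntsPlain :: [Int] -> [[Int]]
groupIntsPlain = bunch . foldMap (\i -> Bunch (replicate i [] ++ [[i]]))
```

For the input [0,0,4] this gives [[0,0],[],[],[],[4]]; the result ends at the largest value seen, so padding to a requested length is a separate step.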
We need (basis !!) :: Int -> [[Int]] to be our translation. Hence basis :: [[[Int]]] is the list of images of our translation, computed on demand at most once each (i.e., the translation, memoized).
For this basis, observe that we need these two infinite lists
[ [] [ [[0]]
, [[]] , [[1]]
, [[],[]] , [[2]]
, [[],[],[]] , [[3]]
... ...
combined Bunchwise. As both lists have the same length (infinity), I could also have written
basis = zipWith (++) (iterate ([]:) []) [[[i]] | i <- [0..]]
but I thought it was worth observing that this also is an example of Bunch structure.
Of course, it's very nice when something like accumArray hands you exactly the sort of accumulation you need, neatly packaging a bunch of grungy behind-the-scenes mutation. But the general recipe for an accumulation is to think "What's the Monoid?" and "What do I do with each element?". That's what foldMap asks you.

The (++) operator copies the left-hand list. For this reason, adding to the beginning of a list is quite fast, but adding to the end of a list is very slow.
In summary, avoid adding things to the end of a list. Try to always add to the beginning instead. One simple way to do that is to build the list backwards, and then reverse it at the end. A more devious trick is to use "difference lists" (Google it). Another possibility is to use Data.Sequence rather than a list.
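The "build it backwards, then reverse" advice could look like the following sketch, which keeps the structure of the question's g but conses onto the accumulator. The name groupRev is hypothetical.

```haskell
import Data.List (group, sort)

-- Same idea as the question's f, but the accumulator grows at the
-- front with (:) and is reversed once at the end.
groupRev :: [Int] -> Int -> [[Int]]
groupRev ls len = reverse (go 0 (group (sort ls)) [])
  where
    go :: Int -> [[Int]] -> [[Int]] -> [[Int]]
    go val groups acc
      | val >= len = acc
      | otherwise =
          case groups of
            -- The next group holds the current value: emit it.
            (x@(v : _) : xs) | v == val -> go (val + 1) xs (x : acc)
            -- Otherwise this position is empty.
            _ -> go (val + 1) groups ([] : acc)
```

Each step is now constant time, so groupRev [] 1000000 does about a million cheap conses plus one linear reverse instead of a million list copies.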

The first thing to note is that the most obvious way to implement this is to use a data structure that allows random access; an array is an obvious choice. Note that you need to add elements to the array multiple times and somehow "join" them.
accumArray is perfect for this.
So we get:
f l i = elems $ accumArray (\l e -> e:l) [] (0,i-1) (map (\e -> (e,e)) l)
And we're good to go (see full code here).
This approach does involve converting the final array back into a list, but that step is very likely faster than say sorting the list, which often involves scanning the list at least a few times for a list of decent size.
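A self-contained version of the one-liner above, with its import and a type signature, might look like this. The name groupByValue is made up; the sketch assumes every input element lies in the range 0 to len-1, since accumArray raises an error for out-of-range indices.

```haskell
import Data.Array (accumArray, elems)

-- Bucket every element x into slot x of an array of lists,
-- then read the slots back in index order.
groupByValue :: [Int] -> Int -> [[Int]]
groupByValue xs len =
  elems (accumArray (flip (:)) [] (0, len - 1) [(x, x) | x <- xs])
```

Within a slot all elements are equal, so the order in which they are consed does not matter.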

Whenever you use ++ you have to recreate the entire list, since lists are immutable.
A simple solution would be to use :, but that builds a reversed list. However that can be fixed using reverse, which results in only building two lists (instead of 1 million in your case).

Your concept of glomming things onto an accumulator is a very useful one, and both MathematicalOrchid and Guvante show how you can use that concept reasonably efficiently. But in this case, there is a simpler approach that is likely also faster. You started with
group . sort $ ls
and this was a very good place to start! You get a list that's almost the one you want, except that you need to fill in some blanks. How can we figure those out? The simplest way, though probably not quite the most efficient, is to work with a list of all the numbers you want to count up to: [0 .. len-1].
So we start with
f ls len = g [0 .. len-1] (group . sort $ ls)
  where
    ?
How do we define g? By pattern matching!
f ls len = g [0 .. len-1] (group . sort $ ls)
  where
    -- We may or may not have some lists left,
    -- but we counted as high as we decided we
    -- would
    g [] _ = []
    -- We have no lists left, so the rest of the
    -- numbers are not represented
    g ns [] = map (const []) ns
    -- This shouldn't be possible, because group
    -- doesn't make empty lists.
    g _ ([]:_) = error "group isn't working!"
    -- Finally, we have some work to do!
    g (n:ns) xls@(xl@(x:_):xls')
      | n == x = xl : g ns xls'
      | otherwise = [] : g ns xls
That was nice, but making the list of numbers isn't free, so you might be wondering how you can optimize it. One method I invite you to try is using your original technique of keeping a separate counter, but following this same sort of structure.
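One way that counter-based variant might turn out is sketched below; it is only one possible answer to the exercise, and the name groupCounter is hypothetical. It produces its result lazily, one slot at a time, without an accumulator.

```haskell
import Data.List (group, sort)

-- Count from 0 up to len-1 with an explicit counter instead of
-- materialising the list [0 .. len-1].
groupCounter :: [Int] -> Int -> [[Int]]
groupCounter ls len = go 0 (group (sort ls))
  where
    go n groups
      | n >= len = []
      | otherwise =
          case groups of
            (xl@(x : _) : rest) | x == n -> xl : go (n + 1) rest
            _ -> [] : go (n + 1) groups
```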

Haskell - Calculating the shortest path using trees

I am trying to write code in Haskell that goes from point A to point F on a game board, which is essentially a matrix, following the shortest path.
This is the board:
AAAA
ACCB
ADEF
*
0 0 N
The robot enters at the letter A at the bottom (where the * is) and must reach F. At the bottom of the board are the coordinates, x=0, y=0, and the robot is pointing North. F's coordinate is (3,0).
The trick is that it can't jump more than one letter: it can go from A to B, B to C, etc., and it can walk between letters of the same type (A to A, B to B, etc.).
It can only move forward and make turns (left, right), so the path that gets me to F would be
Forward, Forward, Right, Forward ,Forward, Forward, Right, Jump, Right, Jump, Forward, Left, Jump, Left, Forward, Forward
Once it reaches F, it's done.
I want to try this approach, using a Tree
A
/ \
A D
/ \
/ \
A C
/ \ / \
/ \ D C
A
/ \
/ \
A
/
/
A
/ \
B A
/ \
C F
After that I would only need to validate the correct and shortest path, right?
The problem is, I don't have much experience using trees.
Would you suggest any other way to get the best path?
Thank you very much.

We're going to solve this problem by searching a tree in three parts. First we will build a Tree representing the paths through the problem, with branches for each state. We'd like to find the shortest path to get to a state with a certain criteria, so we will write a breadth first search for searching any Tree. This won't be fast enough for the example problem you provided, so we will improve on the breadth first search with a transposition table which keeps track of states we have already explored to avoid exploring them again.
Building a Tree
We'll assume that your playing board is represented in an Array from Data.Array
import Data.Array
type Board = Array (Int, Int) Char
board :: Board
board = listArray ((1,1),(3,4)) ("AAAA" ++ "ACCB" ++ "ADEF")
Data.Array doesn't provide a default easy way to make sure indexes that we look up values for with ! are actually in the bounds of the Array. For convenience, we'll provide a safe version that returns Just v if the value is in the Array or Nothing otherwise.
import Data.Maybe
(!?) :: Ix i => Array i a -> i -> Maybe a
a !? i = if inRange (bounds a) i then Just (a ! i) else Nothing
The State of the puzzle can be represented by the combination of a position of the robot and the direction that the robot is facing.
data State = State {position :: (Int, Int), direction :: (Int, Int)}
deriving (Eq, Ord, Show)
The direction is a unit vector that can be added to the position to get a new position. We can rotate the direction vector left or right and moveTowards it.
right :: Num a => (a, a) -> (a, a)
right (down, across) = (across, -down)
left :: Num a => (a, a) -> (a, a)
left (down, across) = (-across, down)
moveTowards :: (Num a, Num b) => (a, b) -> (a, b) -> (a, b)
moveTowards (x1, y1) (x2, y2) = (x1 + x2, y1 + y2)
To explore a board, we will need to be able to determine from a state what moves are legal. To do this it'd be useful to name the moves, so we'll make a data type to represent the possible moves.
import Prelude hiding (Right, Left)
data Move = Left | Right | Forward | Jump
deriving (Show)
To determine what moves are legal on a board we need to know which Board we are using and the State of the robot. This suggests the type moves :: Board -> State -> [Move], but we're going to be computing the new state after each move just to decide if the move was legal, so we will also return the new state for convenience.
moves :: Board -> State -> [(Move, State)]
moves board (State pos dir) =
    (if inRange (bounds board) pos then [(Right, State pos (right dir)), (Left, State pos (left dir))] else []) ++
    (if next == Just here then [(Forward, State nextPos dir)] else []) ++
    (if next == Just (succ here) then [(Jump, State nextPos dir)] else [])
    where
        here = fromMaybe 'A' (board !? pos)
        nextPos = moveTowards dir pos
        next = board !? nextPos
If we're on the board, we can turn Left and Right; the restriction that we be on the board guarantees all the States returned by moves have positions that are on the board. If next, the value held at nextPos, matches Just here, we can go Forward to it (if we're off the board, we assume what is here is 'A'). If next is Just the successor of what is here, we can Jump to it. If nextPos is off the board, next is Nothing and can't match either Just here or Just (succ here).
Up until this point, we've just provided the description of the problem and haven't touched on answering the question with tree. We are going to use the rose tree Tree defined in Data.Tree.
data Tree a = Node {
    rootLabel :: a, -- ^ label value
    subForest :: Forest a -- ^ zero or more child trees
}
type Forest a = [Tree a]
Each node of a Tree a holds a single value a and a list of branches which are each a Tree a.
We are going to build a list of Trees in a straightforward manner from our moves function. We are going to make each result of moves the rootLabel of a Node and make the branches be the list of Trees we get when we explore the new state.
import Data.Tree
explore :: Board -> State -> [Tree (Move, State)]
explore board = map go . moves board
    where
        go (label, state) = Node (label, state) (explore board state)
At this point, our trees are infinite; nothing keeps the robot from endlessly spinning in place. We can't draw one, but we could if we limited the tree to just a few steps.
limit :: Int -> Tree a -> Tree a
limit n (Node a ts)
    | n <= 0 = Node a []
    | otherwise = Node a (map (limit (n-1)) ts)
We'll display just the first couple levels of the tree when we start off the bottom left corner facing towards the board in State (4, 1) (-1, 0).
(putStrLn .
drawForest .
map (fmap (\(m, s) -> show (m, board ! position s)) . limit 2) .
explore board $ State (4, 1) (-1, 0))
(Forward,'A')
|
+- (Right,'A')
| |
| +- (Right,'A')
| |
| `- (Left,'A')
|
+- (Left,'A')
| |
| +- (Right,'A')
| |
| `- (Left,'A')
|
`- (Forward,'A')
|
+- (Right,'A')
|
+- (Left,'A')
|
`- (Forward,'A')
Breadth First Search
Breadth first search explores all the possibilities at one level (across the "breadth" of what is being searched) before descending into the next level (into the "depth" of what is being searched). Breadth first search finds the shortest path to a goal. For our trees, this means exploring everything at one layer before exploring any of what's in the inner layers. We'll accomplish this by making a queue of nodes to explore adding the nodes we discover in the next layer to the end of the queue. The queue will always hold nodes from the current layer followed by nodes from the next layer. It will never hold any nodes from the layer past that because we won't discover those nodes until we have moved on to the next layer.
To implement that, we need an efficient queue, so we'll use a sequence from Data.Sequence.
import Data.Sequence (viewl, ViewL (..), (><))
import qualified Data.Sequence as Seq
We start with an empty queue Seq.empty of nodes to explore and an empty path [] into the Trees. We add the initial possibilities to the end of the queue with >< (concatenation of sequences) and go. We look at the start of the queue. If there's nothing left, EmptyL, we didn't find a path to the goal and return Nothing. If there is something there, and it matches the goal p, we return the path we have accumulated backwards. If the first thing in the queue doesn't match the goal, we add it as the most recent part of the path and add all of its branches to the remainder of what was queued.
breadthFirstSearch :: (a -> Bool) -> [Tree a] -> Maybe [a]
breadthFirstSearch p = combine Seq.empty []
    where
        combine queue ancestors branches =
            go (queue >< (Seq.fromList . map ((,) ancestors) $ branches))
        go queue =
            case viewl queue of
                EmptyL -> Nothing
                (ancestors, Node a bs) :< queued ->
                    if p a
                    then Just . reverse $ a:ancestors
                    else combine queued (a:ancestors) bs
This lets us write our first solve for Boards. It's convenient here that all of the positions returned from moves are on the board.
solve :: Char -> Board -> State -> Maybe [Move]
solve goal board = fmap (map fst) . breadthFirstSearch ((== goal) . (board !) . position . snd) . explore board
If we run this for our board it never finishes! Well, eventually it will, but my back of a napkin calculation suggests it will take about 40 million steps. The path to the end of the maze is 16 steps long and the robot is frequently presented with 3 options for what to do at each step.
> solve 'F' board (State (4, 1) (-1, 0))
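The napkin estimate is just the branching factor raised to the path length. Assuming, as stated, roughly 3 choices at each of 16 steps:

```latex
3^{16} = 43{,}046{,}721 \approx 4 \times 10^{7}
```

This counts paths rather than distinct states, which is why most of that work is redundant.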
We can solve much smaller puzzles like
AB
AC
*
We can represent the board for this puzzle with
smallBoard :: Board
smallBoard = listArray ((1,1),(2,2)) ("AB" ++ "AC")
We solve it looking for 'C' starting in row 3 column 1 looking towards lower numbered rows.
> solve 'C' smallBoard (State (3, 1) (-1, 0))
Just [Forward,Forward,Right,Jump,Right,Jump]
Transposition Table
Certainly this problem must be easier to solve than exploring 40 million possible paths. Most of those paths consist of spinning in place or randomly meandering back and forth. The degenerate paths all share one property: they keep visiting states they have already visited. In the breadthFirstSearch code, those paths keep adding the same nodes to the queue. We can get rid of all of this extra work just by remembering the nodes that we've already seen.
We'll remember the set of nodes we've already seen with a Set from Data.Set.
import qualified Data.Set as Set
To the signature of breadthFirstSearch we'll add a function from the label for a node to a representation for the branches of that node. The representation should be equal whenever all the branches out of the node are the same. In order to quickly compare the representations in O(log n) time with a Set we require that the representation have an Ord instance instead of just equality. The Ord instance allows Set to check for membership with binary search.
breadthFirstSearchUnseen:: Ord r => (a -> r) -> (a -> Bool) -> [Tree a] -> Maybe [a]
In addition to keeping track of the queue, breadthFirstSearchUnseen keeps track of the set of representations that have been seen, starting with Set.empty. Each time we add branches to the queue with combine we also add the representations to seen. We only add the unseen branches whose representations are not in the set of branches we've already seen.
breadthFirstSearchUnseen repr p = combine Set.empty Seq.empty []
    where
        combine seen queued ancestors unseen =
            go
                (seen `Set.union` (Set.fromList . map (repr . rootLabel) $ unseen))
                (queued >< (Seq.fromList . map ((,) ancestors ) $ unseen))
        go seen queue =
            case viewl queue of
                EmptyL -> Nothing
                (ancestors, Node a bs) :< queued ->
                    if p a
                    then Just . reverse $ ancestors'
                    else combine seen queued ancestors' unseen
                    where
                        ancestors' = a:ancestors
                        unseen = filter (flip Set.notMember seen . repr . rootLabel) bs
Now we can improve our solve function to use breadthFirstSearchUnseen. All of the branches from a node are determined by the State - the Move label that got to that state is irrelevant - so we only use the snd part of the (Move, State) tuple as the representation for a node.
solve :: Char -> Board -> State -> Maybe [Move]
solve goal board = fmap (map fst) . breadthFirstSearchUnseen snd ((== goal) . (board !) . position . snd) . explore board
We can now solve the original puzzle very quickly.
> solve 'F' board (State (4, 1) (-1, 0))
Just [Forward,Forward,Forward,Right,Forward,Forward,Forward,Right,Jump,Right,Jump,Forward,Left,Jump,Left,Jump,Jump]

Reconstruct a graph from BFS output in Haskell

I want to reconstruct the incidence structure of a graph in Haskell, which is given by the output of a breadth first traversal of it. Explicitly, the output consists of a root vertex and a list of neighborhoods (a neighborhood is a list of vertices marked as new or old (= already visited)), where each neighborhood corresponds to the least vertex which has not been assigned to a neighborhood, yet.
In any imperative language, I would solve the problem by using a queue:
Input: root vertex r, list of neighborhoods L
(1) Put r into the empty queue Q
(2) if Q is empty then STOP
(3) extract the first vertex v of Q
(4) extract the first neighborhood N of L
(5) append the unvisited vertices of N to Q
(6) remove the markings (new/old) of the nodes of N and assign v to N
(7) goto (2)
I tried to implement this naive algorithm in Haskell (by using a list or by using Data.Sequence as queue), but ghci always runs out of memory. This should not happen, because although the input consists of 300MB data, 16GB RAM should clearly suffice.
Therefore the naive implementation seems to cause a memory leak. How would you implement this algorithm in Haskell?
Edit:
Here are the (slightly simplified) data types I use:
data Output = Out !Vertex ![[BFSNode]]
data Vertex = Vertex Integer SomeMoreComplexData
data BFSNode = New Vertex | Old Integer
data Graph = Graph ![Vertex] ![(Integer,[Integer])]
The data type "Output" contains the already parsed BFS output, consisting of the root vertex and the list of neighborhoods. A BFSNode corresponds to a node in the BFS tree that is either a new vertex, visited for the first time, or an old vertex that has already been visited and is therefore referred to by its unique number. Note that the parsing process works fine and consumes very little memory.
My aim is to convert "Output" into the data type "Graph" which consists of the lists of vertices and of an incidence list.
Here is a simplified version of my implementation:
readTree :: [[BFSNode]] -> Seq Integer -> Graph
readTree [] _ = Graph [] []
readTree (nb:nbs) qs =
    let (i :< qs') = viewl qs
        newVs = fromList $! map nodeNr . filter isNew $ nb
        (Graph vs adj) = readTree nbs $ qs' >< newVs
    in Graph (map unNew (filter isNew nb) ++ vs) ((i,nub $ map nodeNr nb):adj)
"nbs" is the list of neighborhoods, "qs" is the queue. The function "nodeNr" extracts the unique identification number from a vertex, "isNew" tests whether a vertex is new, and "unNew" unpacks a new vertex from the data type "BFSNode".
Edit2:
I think I have localized the problem now. Maybe it has nothing to do with my implementation of the conversion process. My mistake was to use the built-in function "read" to read the data type "Output" from a file. I realize now that Haskell has problems with reading big files. Even if it were just about reading a list of integers, e.g.
main = do
    txt <- readFile "test"
    writeFile "test2" . show $ (read txt :: [Integer])
the program will run out of memory if the file "test" is big enough. I understand now that it is not a good idea to parse data this way, since "read" will load all the data into memory before showing any output, but I still do not understand why it fills 16GB of RAM although the file is not even 500MB. Do you have any idea what is wrong with "read"? Does Haskell show the same behavior on your machines?
Edit3:
Now I implemented a stream based parsing function "readOutput" which takes a String and returns the data type "Output". This function is lazy, so I immediately get an output when I call it. But when I compose it with my conversion function "readTree" (which is clearly tail-recursive) I get no output at all and the memory usage increases as usual. What am I doing wrong?
Edit4:
The problem in Edit3 came from some strictness annotations, which I have now removed.

This question does not specify a key ingredient - how is the graph going to be represented in Haskell? Functional programs require carefully thought out data structures to maximize sharing and run efficiently. Usually, this means they're recursively built from nothing (inductive). There's a paper on inductive graphs and functional graph algorithms that gives one representation:
module Test where
data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
    deriving Show
That is, a graph is either Empty, or a (smaller) graph extended by one node. This is exactly how lists are built using Cons in functional languages, except that the additional node has to specify the smaller graph, the predecessor links ([Int]), and the new node number and data, (Int,a). Note that they also implemented this as an abstract type "for efficiency reasons."
A graph with one node can be generated by extending the empty graph.
singleton :: (Int,a) -> Graph a
singleton x = Extension Empty [] x
Using this structure, it's simple to define a recursive parse algorithm for your BFS tree.
data Mark a = Visited Int | New (Int,a) deriving Show
parse :: (Int,a) -> [[Mark a]] -> Graph a
parse x nbrs = extend Empty [x] nbrs
extend :: Graph a -> [(Int,a)] -> [[Mark a]] -> Graph a
extend g [] [] = g
extend g _ [] = Empty -- leftover nodes, really an error.
extend g [] _ = Empty -- leftover neighborhoods, really an error.
extend g (x : tl) (nbr : nbrs) =
    extend (Extension g (seen nbr) x) (news tl nbr) nbrs
news :: [(Int,a)] -> [Mark a] -> [(Int,a)]
news l (New x : tl) = news (uniq l x) tl
news l (_ : tl) = news l tl
news l [] = l
uniq :: [(Int,a)] -> (Int,a) -> [(Int,a)]
uniq (x:tl) y = x : if (fst x == fst y) then tl else uniq tl y
uniq [] y = [y]
seen :: [Mark a] -> [Int]
seen (Visited i : tl) = i : seen tl
seen (_ : tl) = seen tl
seen [] = []
m0 = [New (1,())]
m1 = [Visited 0, New (2,()), New (3,())]
m2 = [Visited 1, New (3,())]
m3 = [Visited 1, Visited 2]
nbrs = [m0,m1,m2,m3]
Testing it out,
$ ghci
GHCi, version 7.6.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> :load Test
[1 of 1] Compiling Test ( Test.hs, interpreted )
Ok, modules loaded: Test.
*Test> parse (0,()) nbrs
Extension (Extension (Extension (Extension Empty [] (0,())) [0] (1,())) [1] (2,())) [1,2] (3,())
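If you eventually need an incidence list, the nested value can be unwound with plain structural recursion. The following is only a sketch under the representation above: links and nodes are hypothetical helper names, and example is simply the value printed by the test run.

```haskell
data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
  deriving Show

-- Every stored predecessor link as (node, predecessor),
-- starting with the node that was added last.
links :: Graph a -> [(Int, Int)]
links Empty                      = []
links (Extension g preds (n, _)) = [(n, p) | p <- preds] ++ links g

-- Node numbers in the order in which they were added.
nodes :: Graph a -> [Int]
nodes = reverse . go
  where
    go Empty                  = []
    go (Extension g _ (n, _)) = n : go g

-- The graph produced by the test run above.
example :: Graph ()
example =
  Extension
    (Extension (Extension (Extension Empty [] (0, ())) [0] (1, ())) [1] (2, ()))
    [1, 2]
    (3, ())
```

For example, links example lists (3,1), (3,2), (2,1) and (1,0). Each edge appears once, from the later node to the earlier one, so an undirected incidence list would still need the reverse direction added.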
For efficiency, you could do the following:
The news and seen functions could be combined into a single pass over the neighborhood, e.g. let (ns,sn) = newseen nbr ([],[]), and made tail-recursive (passing along their partially constructed lists and returning them as soon as the input is exhausted).
Your input could keep track of the node at the center of each neighbor list. This would avoid the list concatenation in the stack of neighbors. Alternatively, you could use a functional deque to hold that stack.
If you haven't seen it, I'd recommend Okasaki's book on purely functional data structures.
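As a rough sketch of the first suggestion, news and seen can be fused into one accumulating pass. The name newseen comes from the suggestion above, but its exact shape here, and the primed names extend' and parse', are my own assumptions. The new nodes are still merged into the pending list with uniq, so the result should match the original parse.

```haskell
data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
  deriving Show

data Mark a = Visited Int | New (Int, a) deriving Show

-- Split one neighborhood into (new nodes, visited ids) in a single pass.
-- Both accumulators are built in reverse and flipped once at the end.
newseen :: [Mark a] -> ([(Int, a)], [Int]) -> ([(Int, a)], [Int])
newseen []               (ns, sn) = (reverse ns, reverse sn)
newseen (New x : tl)     (ns, sn) = newseen tl (x : ns, sn)
newseen (Visited i : tl) (ns, sn) = newseen tl (ns, i : sn)

-- Unchanged from the code above: append y unless its number is already pending.
uniq :: [(Int, a)] -> (Int, a) -> [(Int, a)]
uniq (x : tl) y = x : if fst x == fst y then tl else uniq tl y
uniq []       y = [y]

-- Variant of extend that walks each neighborhood only once.
extend' :: Graph a -> [(Int, a)] -> [[Mark a]] -> Graph a
extend' g [] [] = g
extend' _ _  [] = Empty -- leftover nodes, really an error
extend' _ [] _  = Empty -- leftover neighborhoods, really an error
extend' g (x : tl) (nbr : nbrs) =
  let (ns, sn) = newseen nbr ([], [])
  in  extend' (Extension g sn x) (foldl uniq tl ns) nbrs

parse' :: (Int, a) -> [[Mark a]] -> Graph a
parse' x = extend' Empty [x]

-- The same test input as above.
nbrs :: [[Mark ()]]
nbrs =
  [ [New (1, ())]
  , [Visited 0, New (2, ()), New (3, ())]
  , [Visited 1, New (3, ())]
  , [Visited 1, Visited 2]
  ]
```

With this input, parse' (0,()) nbrs should print the same nested Extension value as the test run of parse.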

Merging an unbound number of ordered infinite sequences

I want to generate all natural numbers together with their decomposition in prime factors, up to a certain threshold.
I came up with the following function:
vGenerate :: [a]                 -- generator set for monoid B* (Kleene star of B)
          -> (a, (a -> a -> a))  -- (identity element, generating function)
          -> (a -> Bool)         -- filter
          -> [a]                 -- B* filtered
vGenerate [] (g0,_) _ = [g0]
vGenerate (e:es) (g0,g) c =
    let coEs = vGenerate es (g0,g) c
        coE = takeWhile (c) $ iterate (g e) g0
    in concatMap (\m -> takeWhile (c) $ map (g m) coE) coEs
gen then generates all natural numbers together with their prime factors:
gen threshold =
    let b = map (\x -> (x,[x])) $ takeWhile (<= threshold) primes
        condition = (<= threshold) . fst
        g0 = (1,[])
        g = \(n,nl)(m,ml) -> ((n*m), nl ++ ml)
    in vGenerate b (g0,g) condition
primes = [2,3,5,7,11,.. ] -- pseudo code
I have the following questions:
It is not always known in advance how many numbers we will need. Can we modify vGenerate such that it starts with a lazy infinite list of primes, and generates all the factorizations in increasing order? The challenge is that we have an infinite list of primes, for each prime an infinite list of powers of that prime number, and then have to take all possible combinations. The lists are naturally ordered by increasing first element, so they could be generated lazily.
I documented vGenerate in terms of a monoid, with the intention of keeping it as abstract as possible, but perhaps this just obfuscates the code? I want to generalize it later (more as an exercise than for real usage), e.g. for generating raster points within certain constraints, which can also be put in the monoid context, so I thought it was a good start to get rid of all references to the problem space (in casu: primes). But I feel that the filtering function does not fit well in the abstraction: the generation must happen in an order that is monotonic for the metric tested by c, because recursion is terminated as soon as c is not satisfied. Any advice?

Have a look at mergeAll :: Ord a => [[a]] -> [a] from the data-ordlist package (module Data.List.Ordered). It merges an unbounded number of infinite sequences as long as the sequences are ordered, and the heads of the sequences are ordered. I've used it for similar problems before, for example to generate all numbers of the form 2^i*3^j.
> let numbers = mergeAll [[2^i*3^j | j <- [0..]] | i <- [0..]]
> take 20 numbers
[1,2,3,4,6,8,9,12,16,18,24,27,32,36,48,54,64,72,81,96]
You should be able to extend this to generate all numbers with their factorizations.
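One possible way to carry the factorizations along is sketched below. To stay self-contained it does not use mergeAll; it swaps in a hand-written two-way merge and relies on the primes being listed in ascending order. The names Factored, merge, productsOf and factored are hypothetical, and the sieve is a naive stand-in for whatever primes you already have.

```haskell
-- A number paired with its prime factors in ascending order.
type Factored = (Integer, [Integer])

-- Naive, unoptimized sieve; a stand-in for a real prime generator.
primes :: [Integer]
primes = sieve [2 ..]
  where
    sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
    sieve []       = []

-- Merge two lists that are ordered by their first component.
merge :: [Factored] -> [Factored] -> [Factored]
merge xa@(x : xs) ya@(y : ys)
  | fst x <= fst y = x : merge xs ya
  | otherwise      = y : merge xa ys
merge xs [] = xs
merge [] ys = ys

-- All numbers > 1 whose prime factors come from the given ascending list.
-- Such a number either avoids the first prime p (productsOf ps) or is p
-- times another such number (map times xs). The head is always p itself,
-- so it is available without inspecting the rest, which keeps the
-- recursion productive even for an infinite list of primes.
productsOf :: [Integer] -> [Factored]
productsOf []       = []
productsOf (p : ps) = xs
  where
    xs = (p, [p]) : merge (productsOf ps) (map times xs)
    times (n, fs) = (p * n, p : fs)

-- Every natural number with its factorization, in increasing order.
factored :: [Factored]
factored = (1, []) : productsOf primes
```

For instance, take 6 factored should give [(1,[]),(2,[2]),(3,[3]),(4,[2,2]),(5,[5]),(6,[2,3])]. The two merged lists never share an element, because a number either has p as a factor or it does not, so no duplicate handling is needed in merge.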
