Given a Huffman tree and a stream of bits, return a pair containing (1) the string of symbols encoded by the bits (according to the Huffman tree), and (2) a Bool indicating whether the output stream contains every bit from the input (that is, return False if there were any bits left over).
Here is the code; it only returns the first symbol in the tree. What's the problem?
data BTree a = Leaf a | Fork (BTree a) (BTree a) deriving (Show, Eq)
traT :: BTree a -> BTree a -> [Bool] -> [a] -> ([a], Bool)
traT (Leaf v) c bs res= (res++[v], True)
traT (Fork left right) c (b:bs) res
  | b = traT right c bs res
  | otherwise = traT left c bs res
traT _ c [] res = (res, True)
traT _ c bs res = traT c c (bs) res
traT _ c bs res = (res, False)
decode :: BTree a -> [Bool] -> ([a], Bool)
decode (Fork x y) bs = traT (Fork x y) (Fork x y) bs []
decode (Leaf x) bs = traT(Leaf x) (Leaf x) bs []
Well, you seem to be on the right track.
You say that it only returns the first symbol in the tree.
Your main problem is with these 2 lines:
traT (Leaf v) c bs res= (res++[v], True)
...
traT _ c bs res = traT c c (bs) res
The first one masks the second one for all leaf nodes. And the second one is your only forward recursive call that could operate at leaf nodes, hence your only hope to process any further bits.
A couple of remarks:
the res++[v] expression forces the code to rescan the whole symbol list at each new symbol.
The second line would call itself endlessly (but it is masked by the first one).
Another (smaller) problem is that returning just one flag for the presence of "extra" bits at the end of the bit stream loses information, as we would like to know what the extra bits are. It is a bit risky to do this in your core recursive function. Of course, it is perfectly OK to do it in the final outer decode function.
This is why in the code sample below, I have used an extra symBits argument to keep the bits that have been processed but not yet attributed to a symbol. I keep them in reverse order, because Haskell prefers to prepend items to a list, rather than to put them at the end, rescanning the whole list to do so. Hence the call to reverse in the final stage of processing. It is a cheap reverse call, as it is limited in length to the depth of our Huffman tree.
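As a small illustration of the prepend-then-reverse idiom described above, here is a minimal sketch. The names collectAppend and collectPrepend are made up for this example and are not part of the question's code.

```haskell
-- Appending at the end: every step re-walks the accumulator, O(n^2) overall.
collectAppend :: [a] -> [a]
collectAppend = foldl (\acc x -> acc ++ [x]) []

-- Prepending and reversing once at the end: O(n) overall.
collectPrepend :: [a] -> [a]
collectPrepend = reverse . foldl (flip (:)) []
```

Both functions return their input unchanged; only the amount of list traversal differs.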
So here is some suggested reworked code, where I have tried to distinguish the 4 cases: leaf node or fork node AND at end of bit stream or not. I also took the liberty to rename your c argument as htop.
data BTree a = Leaf a | Fork (BTree a) (BTree a) deriving (Show, Eq)
type Bit = Bool
-- hnode htop symBits bs
travHT :: BTree a -> BTree a -> [Bit] -> [Bit] -> ([a], [Bit])
-- situations where at least one input bit remains:
travHT (Leaf v) htop symBits (b:rbs) = -- CHANGE: forward recursive call
    -- symbol completed, jump from leaf node to top of htree:
    let fwdRes = travHT htop htop [] (b:rbs)
        nextSyms = fst fwdRes
        lastSymBits = snd fwdRes
    in (v : nextSyms, lastSymBits)
travHT (Fork left right) htop symBits (b:rbs)
    | b = travHT right htop (b:symBits) rbs
    | otherwise = travHT left htop (b:symBits) rbs
-- situations where we have reached the end of the bit stream:
travHT (Leaf v) htop symBits [] = ([v],[])
-- no more bits and not at a leaf --> incomplete last symbol:
travHT (Fork left right) htop symBits [] = ([], reverse symBits)
-- homework-mandated interface:
decode :: BTree a -> [Bit] -> ([a], Bool)
decode htree bs =
    let pair = travHT htree htree [] bs
        (symbols, restOfBits) = pair
        weUsedAllBits = null restOfBits
    in (symbols, weUsedAllBits)
Testing code with token main program:
xyz_code :: BTree Char
xyz_code = Fork (Leaf 'x') (Fork (Leaf 'y') (Leaf 'z'))
-- Bit streams for test purposes:
------ Y Z X X X Y/Z??
bl0 = [True,False, True,True , False, False, False]
bl1 = [True,False, True,True , False, False, False, True]
main = do
    let bitList = bl0
    let htree = xyz_code
    let result = decode htree bitList
    putStrLn $ "result = " ++ show result
Program output:
result = ("yzxxx",True)
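The second test stream, bl1, ends with a dangling True bit, so it exercises the leftover-bits case. The block below is a condensed, self-contained restatement of the same idea, not the code above verbatim: decodeRest and xyzCode are names chosen for this sketch, and it assumes the root of the tree is a Fork (a tree consisting of a single Leaf never consumes a bit).

```haskell
data BTree a = Leaf a | Fork (BTree a) (BTree a) deriving (Show, Eq)

-- Decode symbols and return the bits of an unfinished final symbol.
-- 'pending' holds the bits consumed for the current symbol, in reverse order.
decodeRest :: BTree a -> [Bool] -> ([a], [Bool])
decodeRest top = go top []
  where
    go (Leaf v) _ [] = ([v], [])
    go (Leaf v) _ bs = let (syms, rest) = go top [] bs in (v : syms, rest)
    go (Fork l r) pending (b : bs) = go (if b then r else l) (b : pending) bs
    go (Fork _ _) pending [] = ([], reverse pending)

-- Homework interface: the flag is True only if no bits were left over.
decode :: BTree a -> [Bool] -> ([a], Bool)
decode tree bits = (symbols, null rest)
  where (symbols, rest) = decodeRest tree bits

xyzCode :: BTree Char
xyzCode = Fork (Leaf 'x') (Fork (Leaf 'y') (Leaf 'z'))

bl1 :: [Bool]
bl1 = [True, False, True, True, False, False, False, True]
```

With these definitions, decode xyzCode bl1 evaluates to ("yzxxx",False), and decodeRest xyzCode bl1 reports the single leftover bit as [True].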
Hope it helps. I will also ask the powers that be to add the [huffman-code] tag to your question. Tags are a nice way to help people find the questions of interest to them. And we do have a tag for Huffman codes.
So I'm trying to make a little program that can take in data captured during an experiment, and for the most part I think I've figured out how to recursively take in data until the user signals there is no more. However, upon termination of data taking, Haskell throws Exception: <<loop>> and I can't really figure out why. Here's the code:
readData :: (Num a, Read a) => [Point a] -> IO [Point a]
readData l = do putStr "Enter Point (x,y,<e>) or (d)one: "
                entered <- getLine
                if (entered == "d" || entered == "done")
                   then return l
                   else do let l = addPoint l entered
                           nl <- readData l
                           return nl
addPoint :: (Num a, Read a) => [Point a] -> String -> [Point a]
addPoint l s = l ++ [Point (dataList !! 0) (dataList !! 1) (dataList !! 2)]
  where dataList = (map read $ checkInputData . splitOn "," $ s) :: (Read a) => [a]
checkInputData :: [String] -> [String]
checkInputData xs
  | length xs < 2 = ["0","0","0"]
  | length xs < 3 = (xs ++ ["0"])
  | length xs == 3 = xs
  | length xs > 3 = ["0","0","0"]
As far as I can tell, the exception is indication that there is an infinite loop somewhere, but I can't figure out why this is occurring. As far as I can tell when "done" is entered the current level should simply return l, the list it's given, which should then cascade up the previous iterations of the function.
Thanks for any help. (And yes, checkInputData will have proper error handling once I figure out how to do that.)
<<loop>> basically means GHC has detected an infinite loop caused by a value which depends immediately on itself (cf. this question, or this one for further technical details if you are curious). In this case, that is triggered by:
else do let l = addPoint l entered
This definition, which shadows the l you passed as an argument, defines l in terms of itself. You meant to write something like...
else do let l' = addPoint l entered
... which defines a new value, l', in terms of the original l.
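To see why the original binding loops, it helps to remember that let in Haskell is recursive: the right-hand side can see the name being defined. A minimal sketch follows; ones and addOne are illustrative names, not part of the question's code.

```haskell
-- let is recursive: xs on the right refers to the xs being defined.
-- Here the self-reference is productive and yields 1, 1, 1, ...
ones :: [Int]
ones = let xs = 1 : xs in xs

-- With a fresh name, the argument l is still visible on the right-hand side.
addOne :: [Int] -> [Int]
addOne l = let l' = l ++ [1] in l'

-- By contrast, `let l = l ++ [1] in l` would refer to the new l on both
-- sides and never produce a value (GHC may report <<loop>>).
```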
As Carl points out, turning on -Wall (e.g. by passing it to GHC at the command line, or with :set -Wall in GHCi) would make GHC warn you about the shadowing:
<interactive>:171:33: warning: [-Wname-shadowing]
This binding for ‘l’ shadows the existing binding
bound at <interactive>:167:10
Also, as highlighted by dfeuer, the whole do-block in the else branch can be replaced by:
readData (addPoint l entered)
As an unrelated suggestion, in this case it is a good idea to replace your uses of length and (!!) with pattern matching. For instance, checkInputData can be written as:
checkInputData :: [String] -> [String]
checkInputData xs = case xs of
  [_,_] -> xs ++ ["0"]
  [_,_,_] -> xs
  _ -> ["0","0","0"]
addPoint, in its turn, might become:
addPoint :: (Num a, Read a) => [Point a] -> String -> [Point a]
addPoint l s = l ++ [Point x y z]
  where [x,y,z] = (map read $ checkInputData . splitOn "," $ s) :: (Read a) => [a]
That becomes even neater if you change checkInputData so that it returns a (String, String, String) triple, which would better express the invariant that you are reading exactly three values.
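A rough sketch of that triple-returning variant is shown below. The question does not show the Point type, so the three-field definition here is an assumption, and to stay free of the split package this version of addPoint takes the already-split fields rather than the raw input line.

```haskell
-- Assumed shape of Point; the question does not show its definition.
data Point a = Point a a a deriving (Show, Eq)

-- Always yields exactly three fields; malformed input becomes all zeros.
checkInputData :: [String] -> (String, String, String)
checkInputData [x, y]    = (x, y, "0")
checkInputData [x, y, e] = (x, y, e)
checkInputData _         = ("0", "0", "0")

-- Takes the fields already split on commas (an assumption of this sketch).
addPoint :: Read a => [Point a] -> [String] -> [Point a]
addPoint l fields = l ++ [Point (read x) (read y) (read e)]
  where (x, y, e) = checkInputData fields
```

The irrefutable-looking list pattern [x,y,z] is gone, so there is no partial pattern match left in addPoint; read can still fail on non-numeric text.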
I am working with the following data type:
data SuffixTree = Leaf Int | Node [(String, SuffixTree)]
  deriving (Eq, Show)
Each subtree has a corresponding label (string).
The idea is to build the corresponding suffix tree by adding each suffix and its index into an accumulating tree (at the beginning it is Node []).
This is already defined
buildTree s
  = foldl (flip insert) (Node []) (zip (suffixes s) [0..length s-1])
where suffixes is correctly defined.
I've been trying to implement the insert function for a while but can't seem to succeed.
This is what I have now (the names and style are not the best since this is still work in progress):
insert :: (String, Int) -> SuffixTree -> SuffixTree
insert pair tree@(Node content)
  = insert' pair tree content
  where
    insert' :: (String, Int) -> SuffixTree -> [(String, SuffixTree)] -> SuffixTree
    insert' (s, n) (Node []) subtrees
      = Node ((s, Leaf n) : subtrees)
    insert' (s, n) (Node content@((a, tree) : pairs)) subtrees
      | null p = insert' (s, n) (Node pairs) subtrees
      | p == a = insert' (r, n) tree subtrees
      | p /= a = Node ((p, newNode) : (subtrees \\ [(a, tree)]))
      where
        (p, r, r') = partition s a
        newNode = Node [(r, (Leaf n)), (r', tree)]
The partition function takes two strings and returns a tuple consisting of:
The common prefix (if it exists)
The first string without the prefix
The second string without the prefix
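The partition function itself is not shown in the question. A minimal sketch that matches the three-part description above could look like this (an assumption about its behavior, not the asker's actual definition):

```haskell
-- Split two lists into (common prefix, rest of first, rest of second).
partition :: Eq a => [a] -> [a] -> ([a], [a], [a])
partition (x : xs) (y : ys)
  | x == y = let (p, r, r') = partition xs ys in (x : p, r, r')
partition xs ys = ([], xs, ys)  -- no (further) common prefix
```

For example, partition "ny" "nanny" gives ("n","y","anny"), and partition "y" "ny" gives ("","y","ny").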
I think I understand the rules needed to build the tree.
We start by comparing the label of the first subtree to the string we want to insert (say, str). If they don't have a prefix in common, we try to insert in the next subtree.
If the label is a prefix of str, we continue to look into that subtree, but instead of using str we try to insert str without the prefix.
If str is a prefix of label, then we replace the existing subtree with a new Node, having a Leaf and the old subtree. We also adjust the labels.
If we don't have a match between str and any label then we add a new Leaf to the list of subtrees.
However, the biggest problem that I have is that I need to return a new tree containing the changes, so I have to keep track of everything else in the tree (not sure how to do this or if I'm thinking correctly about this).
The code appears to be working correctly on this string: "banana":
Node [("a",Node [("",Leaf 5),("na",Node [("",Leaf 3),("na",Leaf 1)])]),
("na",Node [("",Leaf 4),("na",Leaf 2)]),("banana",Leaf 0)]
However, on this string "mississippi" I get an Exception: Non-exhaustive patterns in function insert'.
Any help or ideas are greatly appreciated!
You are using a quadratic algorithm, whereas a suffix tree can optimally be constructed in linear time. That said, sticking with the same algorithm, a possibly better approach would be to first build the (uncompressed) suffix trie (not tree) and then compress the resulting trie.
The advantage would be that a suffix trie can be represented using Data.Map:
data SuffixTrie
  = Leaf' Int
  | Node' (Map (Maybe Char) SuffixTrie)
which makes manipulations both more efficient and easier than with a list of pairs. Doing so, you may also completely bypass the common-prefix calculations, as they fall out of the construction by themselves:
import Data.List (tails)
import Data.Maybe (maybeToList)
import Control.Arrow (first, second)
import Data.Map.Strict (Map, empty, insert, insertWith, assocs)
data SuffixTree
  = Leaf Int
  | Node [(String, SuffixTree)]
  deriving Show
data SuffixTrie
  = Leaf' Int
  | Node' (Map (Maybe Char) SuffixTrie)
buildTrie :: String -> SuffixTrie
buildTrie s = foldl go (flip const) (init $ tails s) (length s) $ Node' empty
  where
  go run xs i (Node' ns) = run (i - 1) $ Node' tr
    where tr = foldr loop (insert Nothing $ Leaf' (i - 1)) xs ns
  loop x run = insertWith (+:) (Just x) . Node' $ run empty
    where _ +: Node' ns = Node' $ run ns
buildTree :: String -> SuffixTree
buildTree = loop . buildTrie
  where
  loop (Leaf' i) = Leaf i
  loop (Node' m) = Node $ con . second loop <$> assocs m
  con (Just x, Node [(xs, tr)]) = (x:xs, tr) -- compress single-child nodes
  con n = maybeToList `first` n
then:
> buildTree "banana"
Node [("a",Node [("",Leaf 5),
                 ("na",Node [("",Leaf 3),
                             ("na",Leaf 1)])]),
      ("banana",Leaf 0),
      ("na",Node [("",Leaf 4),
                  ("na",Leaf 2)])]
similarly:
> buildTree "mississippi"
Node [("i",Node [("",Leaf 10),
                 ("ppi",Leaf 7),
                 ("ssi",Node [("ppi",Leaf 4),
                              ("ssippi",Leaf 1)])]),
      ("mississippi",Leaf 0),
      ("p",Node [("i",Leaf 9),
                 ("pi",Leaf 8)]),
      ("s",Node [("i",Node [("ppi",Leaf 6),
                            ("ssippi",Leaf 3)]),
                 ("si",Node [("ppi",Leaf 5),
                             ("ssippi",Leaf 2)])])]
Here's how the problem is occurring.
Let's say you're processing buildTree "nanny". After you've inserted the suffixes "nanny", "anny", and "nny", your tree looks like t1 given by:
let t1 = Node t1_content
    t1_content = [("n",t2),("anny",Leaf 1)]
    t2 = Node [("ny",Leaf 2),("anny",Leaf 0)]
Next, you try to insert the suffix "ny":
insert ("ny", 3) t1
= insert' ("ny", 3) t1 t1_content
-- matches guard p == a with p="n", r="y", r'=""
= insert' ("y", 3) t2 t1_content
What you intend to do next is insert ("y", 3) into t2 to yield:
Node [("y", Leaf 3), ("ny",Leaf 2),("anny",Leaf 0)]
Instead, what happens is:
insert' ("y", 3) t2 t1_content
-- have s="y", a="ny", so p="", r="y", r'="ny"
-- which matches guard: null p
= insert' ("y", 3) (Node [("anny", Leaf 0)]) t1_content
-- have s="y", a="anny", so p="", r="y", r'="anny"
-- which matches guard: null p
= insert' ("y", 3) (Node []) t1_content
= Node [("y", Leaf 3), ("n",t2), ("anny",Leaf 1)]
and suffix "y" has been added to t1 instead of t2.
When you next try to insert suffix "y", the guard p == a case tries to insert ("", 4) into Leaf 3, which no equation of insert' matches, and you get a pattern error.
The reason it works on banana is that you only ever insert a new node at the top level of the tree, so "adding to t2" and "adding to t1" are the same thing.
I suspect you'll need to substantially rethink the structure of your recursion to get this working.
Looks like this code does the job, although there may still be improvements to make. I hope that it's general enough to work on any string. I also tried to avoid using ++, but it's still better than nothing.
getContent (Node listOfPairs)
  = listOfPairs
insert :: (String, Int) -> SuffixTree -> SuffixTree
insert (s, n) (Node [])
  = Node [(s, Leaf n)]
insert (s, n) (Node (pair@(a, tree) : pairs))
  | p == a = Node ((a, insert (r, n) tree) : pairs)
  | null p = Node (pair : (getContent (insert (r, n) (Node pairs))))
  | p /= a = Node ([(p, Node [(r, Leaf n), (r', tree)])] ++ pairs)
  where
    (p, r, r') = partition s a
I'm solving the Bridge and torch problem in Haskell.
I wrote a function that given a state of the puzzle, as in which people have yet to cross and those who have crossed, gives back a list of all possible moves from one side to the other (moving two people forwards and one person backwards).
module DarkBridgeDT where
data Crossing = Trip [Float] [Float] Float deriving (Show)
data RoundTrip = BigTrip Crossing Crossing deriving (Show)
trip :: [Float] -> [Float] -> Float -> Crossing
trip x y z = Trip x y z
roundtrip :: Crossing -> Crossing -> RoundTrip
roundtrip x y = BigTrip x y
next :: Crossing -> [RoundTrip]
next (Trip [] _ _) = []
next (Trip (a:b:[]) s _ )
  |a>b = [BigTrip (Trip [] (a:b:s) a) (Trip [] [] 0)]
  |otherwise = [BigTrip (Trip [] (b:a:s) b) (Trip [] [] 0)]
next (Trip d s _) = [BigTrip (Trip [x,z] (i:j:s) j) b | i <- d, j <- d, i < j, x <- d, z <- d, x < z, z /= i, z /= j, x /= z, x /= i, x /= j, b <- (back [x,z] (i:j:s))]
  where
    back [] s = []
    back d s = [Trip (i:d) (filter (/= i) s) i | i <- s]
Now I need a function that given a state as the one above and a maximum amount of time gives back all possible solutions to the puzzle in less than that given time.
All I have for that is this:
cross :: Crossing -> Float -> [[RoundTrip]]
cross (Trip [] _ _) _ = []
cross (Trip _ _ acu) max
| acu > max = []
cross (Trip a b acu) max = map (cross (map (crec) (next (Trip a b acu)) acu)) max
where
crec (BigTrip (Trip _ _ t1) (Trip a b t2)) acu = (Trip a b (t1+t2+acu))
Of course that doesn't compile, the 5th line is the one that's driving me insane. Any input?
Edit:
The cross function is meant to apply the next function to every result of the last next function called.
If the first result of next was something like [A,B,C,D], then it would call next on A, B, C and D to see if any or all of those get to a solution in less than max (A, B, C and D would be Crossings, which contain the floats giving the time that adds up and is compared to max).
My data structure is
Crossing: Contains the first side of the bridge (the people on it, represented by the time they take to cross the bridge), the other side of the bridge (represented the same way), and a time that is either the greatest time of the last crossing (the greater of the two in the first crossing, or the only one in the second) or the amount of time accumulated crossing the bridge (in the cross function).
RoundTrip: Represents two crossings, the first and the second, the one getting to safety and the one coming back to danger.
cross (Trip [1,2,5,10] [] 0) 16 should give an empty list for there is no solution that takes less than 17 minutes (or whatever time unit).
cross (Trip [1,2,5,10] [] 0) 17 should give the normal solution to the puzzle as a list of roundtrips.
I hope that makes it clearer.
Edit2:
I finally got it. I read Carsten's solution before I completed mine and we laid it out practically the same. He used fancier syntax and more complex structures but it's really similar:
module DarkBridgeST where
data Torch = Danger | Safety deriving (Eq,Show)
data State = State
  [Float] -- people in danger
  [Float] -- people safe
  Torch -- torch position
  Float -- remaining time
  deriving (Show)
type Crossing = [Float]
classic :: State
classic = State [1,2,5,10] [] Danger 17
next :: State -> [Crossing] -- List all possible moves
next (State [] _ _ _) = [] -- Finished
next (State _ [] Safety _) = [] -- No one can come back
next (State danger _ Danger rem) = [[a,b] | a <- danger, b <- danger, a /= b, a < b, max a b <= rem]
next (State _ safe Safety rem) = [[a] | a <- safe, a <= rem]
cross :: State -> Crossing -> State -- Crosses the bridge depending on where the torch is
cross (State danger safe Danger rem) cross = State (taking cross danger) (safe ++ cross) Safety (rem - (maximum cross))
cross (State danger safe Safety rem) cross = State (danger ++ cross) (taking cross safe) Danger (rem - (maximum cross))
taking :: [Float] -> [Float] -> [Float]
taking [] d = d
taking (x:xs) d = taking xs (filter (/=x) d)
solve :: State -> [[Crossing]]
solve (State [] _ _ _) = [[]]
solve sf = do
  c <- next sf
  let sn = cross sf c
  r <- solve sn
  return (c:r)
All in all thanks everyone. I'm new to Haskell programming and this helped me understand a lot of things. I hope this post can also help someone starting haskell like me one day :)
I'm not going to leave much of your code intact here.
The first problems are with the data structures. Crossing doesn't actually represent anything to do with crossing the bridge, but the state before or after a bridge crossing. And you can't use RoundTrip because the number of bridge crossings is always odd.
I'm renaming the data structure I'm actually keeping, but I'm not keeping it unmodified.
data Bank = Danger | Safety deriving (Eq,Show)
data PuzzleState = PuzzleState
  [Float] -- people still in danger
  [Float] -- people on the safe bank
  Bank -- current location of the torch
  Float -- remaining time
type Crossing = ([Float],Bank)
Modifying/writing these functions is left as an exercise for the reader
next :: PuzzleState -> [Crossing] -- Create a list of possible crossings
applyCrossing :: PuzzleState -> Crossing -> PuzzleState -- Create the next state
Then something like this function can put it all together (assuming next returns an empty list if the remaining time is too low):
cross (PuzzleState [] _ _ _) = [[]]
cross s1 = do
  c <- next s1
  let s2 = applyCrossing s1 c
  r <- cross s2
  return $ c : r
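For readers who want to check their own attempt, the two functions left as an exercise could be filled in along these lines. This is only one possible sketch: it assumes the Bank in a Crossing is the bank the walkers arrive on, and that all crossing times are distinct.

```haskell
import Data.List (delete)

data Bank = Danger | Safety deriving (Eq, Show)

data PuzzleState = PuzzleState
  [Float]  -- people still in danger
  [Float]  -- people on the safe bank
  Bank     -- current location of the torch
  Float    -- remaining time

type Crossing = ([Float], Bank)  -- who walks, and (assumed) where they arrive

-- Two people walk to safety, or one walks back; moves that would
-- exceed the remaining time are not generated.
next :: PuzzleState -> [Crossing]
next (PuzzleState danger _ Danger remaining) =
  [ ([a, b], Safety) | a <- danger, b <- danger, a < b, b <= remaining ]
next (PuzzleState _ safe Safety remaining) =
  [ ([a], Danger) | a <- safe, a <= remaining ]

-- Move the walkers (and the torch) across and charge the slowest walker's time.
applyCrossing :: PuzzleState -> Crossing -> PuzzleState
applyCrossing (PuzzleState danger safe _ remaining) (walkers, arrival) =
  case arrival of
    Safety -> PuzzleState (remove danger) (walkers ++ safe) Safety timeLeft
    Danger -> PuzzleState (walkers ++ danger) (remove safe) Danger timeLeft
  where
    remove bank = foldr delete bank walkers
    timeLeft = remaining - maximum walkers

cross :: PuzzleState -> [[Crossing]]
cross (PuzzleState [] _ _ _) = [[]]
cross s1 = do
  c <- next s1
  r <- cross (applyCrossing s1 c)
  return (c : r)
```

With the classic input, cross (PuzzleState [1,2,5,10] [] Danger 17) yields the two 17-minute solutions, while a budget of 16 yields the empty list.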
Just for the fun, an approach using a lazy tree:
import Data.List
import Data.Tree
type Pawn = (Char, Int)
data Direction = F | B
data Turn = Turn {
    _start :: [Pawn],
    _end :: [Pawn],
    _dir :: Direction,
    _total :: Int
    }
type Solution = ([String], Int)
-- generate a tree
mkTree :: [Pawn] -> Tree Turn
mkTree p = Node{ rootLabel = s, subForest = branches s }
  where s = Turn p [] F 0
-- generates a node for a Turn
mkNode :: Turn -> Tree Turn
mkNode t = Node{ rootLabel = t, subForest = branches t }
-- next possible moves
branches :: Turn -> [Tree Turn]
-- complete
branches (Turn [] e d t) = []
-- moving forward
branches (Turn s e F t) = map (mkNode.turn) (next s)
  where
    turn n = Turn (s\\n) (e++n) B (t+time n)
    time = maximum . map snd
    next xs = [x| x <- mapM (const xs) [1..2], head x < head (tail x)]
-- moving backward
branches (Turn s e B t) = map (mkNode.turn) e
  where
    turn n = Turn (n:s) (delete n e) F (t+time n)
    time (_,b) = b
solve :: Int -> Tree Turn -> [Solution]
solve limit tree = solve' [] [] limit tree
  where
    solve' :: [Solution] -> [String] -> Int -> Tree Turn -> [Solution]
    solve' sols cur limit (Node (Turn s e d t) f)
      | and [t <= limit, s == []] = sols ++ [(cur++[step],t)]
      | t <= limit = concat $ map (solve' sols (cur++[step]) limit) f
      | otherwise = []
      where step = "[" ++ (v s) ++ "|" ++ (v e) ++ "]"
            v = map fst
Then you can get a list of solutions:
solve 16 $ mkTree [('a',2), ('b',4), ('c',8)]
=> [(["[abc|]","[c|ab]","[ac|b]","[|bac]"],14),(["[abc|]","[c|ab]","[bc|a]","[|abc]"],16),(["[abc|]","[b|ac]","[ab|c]","[|cab]"],14),(["[abc|]","[a|bc]","[ba|c]","[|cab]"],16)]
Or also generate a tree of solutions:
draw :: Int -> Tree Turn -> Tree String
draw limit (Node (Turn s e d t) f)
  | t > limit = Node "Time Out" []
  | s == [] = Node ("Complete: " ++ step) []
  | otherwise = Node step (map (draw limit) f)
  where step = "[" ++ (v s) ++ "|" ++ (v e) ++ "]" ++ " - " ++ (show t)
        v = map fst
Then:
putStrLn $ drawTree $ draw 16 $ mkTree [('a',2), ('b',4), ('c',8)]
Will result in:
[abc|] - 0
|
+- [c|ab] - 4
|  |
|  +- [ac|b] - 6
|  |  |
|  |  `- Complete: [|bac] - 14
|  |
|  `- [bc|a] - 8
|     |
|     `- Complete: [|abc] - 16
|
+- [b|ac] - 8
|  |
|  +- [ab|c] - 10
|  |  |
|  |  `- Complete: [|cab] - 14
|  |
|  `- [cb|a] - 16
|     |
|     `- Time Out
|
`- [a|bc] - 8
   |
   +- [ba|c] - 12
   |  |
   |  `- Complete: [|cab] - 16
   |
   `- [ca|b] - 16
      |
      `- Time Out
I am trying to implement Kosaraju's graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a summary data structure that has the node and lists of its incoming and outgoing edges. The code below achieves that, but takes over a minute, whereas I can see from posts on the MOOC forum that people using other languages are completing in <<10s. (getLines is taking 10s compared to under 1s in benchmarks I read about.)
I'm new to Haskell and have implemented an accumulation method using foldl' (the ' was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slow. Moreover, I'm currently planning to use a similar pattern to conduct the depth-first-search, and I fear it will all just become too slow.
I have found this presentation and blog that talk about these sort of issues but at too expert a level.
import System.IO
import Control.Monad
import Data.Map.Strict as Map
import Data.List as L
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored (Edges, Edges) deriving (Show)
type Graph1 = Map NodeName Node
getLines :: FilePath -> IO [[Int]]
getLines = liftM (fmap (fmap read . words) . lines) . readFile
getLines' :: FilePath -> IO [(Int,Int)]
getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile
tuplify2 :: [a] -> (a,a)
tuplify2 [x,y] = (x,y)
main = do
    list <- getLines "testdata.txt" -- [String]
    --list <- getLines "SCC.txt" -- [String]
    let
      list' = createGraph list
    return list'
createGraph :: [[Int]] -> Graph1
createGraph xs = L.foldl' build Map.empty xs
  where
    build :: Graph1-> [Int] -> Graph1
    build = \acc (x:y:_) ->
      let tmpAcc = case Map.lookup x acc of
            Nothing -> Map.insert x (Node False ([y],[])) acc
            Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc
      in case Map.lookup y tmpAcc of
            Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
            Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc
Using maps:
Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.
Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.
Example for createGraph:
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
  where
    addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
    addFwd y _ = Just (Node False [y] [])
    addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
    addBwd x _ = Just (Node False [] [x])
    build :: Graph1 -> (Int, Int) -> Graph1
    build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
Using vectors:
Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.
In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
Use unboxed vectors when possible.
Use unsafe functions when you're absolutely sure about the bounds.
Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i. e. prefer mutable vectors to immutable vectors of mutable references).
Example for createGraph:
import qualified Data.Vector as V
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node
createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
  -- maxIndex + 1 slots, so that index maxIndex itself is in bounds
  graph = V.replicate (maxIndex + 1) (Node False [] [])
  graph' = V.accum (\(Node e f b) x -> Node e (x:f) b) graph edges
  graph'' = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
Note that if there are gaps in the range of the node indices, then it'd be wise to either
Contiguously relabel the indices before doing anything else.
Introduce an empty constructor to Node to signify a missing index.
Faster I/O:
Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.
Example:
import qualified Data.ByteString.Char8 as BS
import System.IO
getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
    return [(a, b) | [a, b] <- pairs]
Benchmarking:
Always do it, unlike me in this answer. Use criterion.
Based pretty much on András' suggestions, I've reduced a 113 second task down to 24 (measured by stopwatch, as I can't quite get Criterion to do anything yet), and then down to 10 by compiling with -O2! I've attended some courses this last year that talked about the challenge of optimising for large datasets, but this was the first time I faced a question that actually involved one, and it was as non-trivial as my instructors suggested. This is what I have now:
import System.IO
import Control.Monad
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
import qualified Data.ByteString.Char8 as BS
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
-- DFS uses a stack to store next points to explore, a list can do this
type Stack = [(NodeName, NodeName)]
getBytes :: FilePath -> IO [(Int, Int)]
getBytes path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let
      pairs = (map . map) (maybe (error "Can't read integers") fst . BS.readInt) lines
    return [(a,b) | [a,b] <- pairs]
main = do
    --list <- getLines' "testdata.txt" -- [String]
    list <- getBytes "SCC.txt" -- [String]
    let list' = createGraph' list
    putStrLn $ show $ list' IM.! 66
    -- return list'
bmark = defaultMain [
    bgroup "1" [
      bench "Sim test" $ whnf bmark' "SCC.txt"
      ]
    ]
bmark' :: FilePath -> IO ()
bmark' path = do
    list <- getLines path
    let
      list' = createGraph list
    putStrLn $ show $ list' IM.! 2
createGraph' :: [(Int, Int)] -> Graph1
createGraph' xs = foldl' build IM.empty xs
  where
    addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
    addFwd y _ = Just (Node False [y] [])
    addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
    addBwd x _ = Just (Node False [] [x])
    build :: Graph1 -> (Int, Int) -> Graph1
    build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
And now on with the rest of the exercise....
This is not really an answer; I would rather comment on András Kovács' post, if I had those 50 points...
I have implemented the loading of the graph in both IntMap and MVector, in an attempt to benchmark mutability vs. immutability.
Both programs use Attoparsec for the parsing. There is surely a more economical way to do it, but Attoparsec is relatively fast given its high level of abstraction (the parser fits on one line). The guideline is to avoid String and read: read is partial and slow, and [Char] is slow and not memory efficient unless properly fused.
As András Kovács noted, IntMap is better than Map for Int keys. My code provides another example of alter usage. If the node identifier mapping is dense, you may also want to use Vector or Array. They allow O(1) indexing by the identifier.
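As a rough illustration of the dense-identifier case, the adjacency lists can be built in one pass with accumArray from Data.Array. This is a sketch only, not the code from the Gist: outEdges and inEdges are names invented here, and node identifiers are assumed to lie in 0..maxNode.

```haskell
import Data.Array (Array, accumArray, (!))

-- Outgoing edges per node, for node ids in 0..maxNode (assumed dense).
outEdges :: Int -> [(Int, Int)] -> Array Int [Int]
outEdges maxNode = accumArray (flip (:)) [] (0, maxNode)

-- Incoming edges: the same construction on the reversed pairs.
inEdges :: Int -> [(Int, Int)] -> Array Int [Int]
inEdges maxNode edges =
  accumArray (flip (:)) [] (0, maxNode) [ (b, a) | (a, b) <- edges ]
```

Looking up a node's neighbours is then a plain index, for example outEdges 3 [(0,1),(0,2),(2,3)] ! 0. Any edge endpoint outside 0..maxNode raises an error, so the bound has to be known up front.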
The mutable version handles the exponential growth of the MVector on demand. This avoids having to specify an upper bound on node identifiers, but introduces more complexity (the reference to the vector may change).
I benchmarked with a file of 5M edges with identifiers in the range [0..2^16]. The MVector version is ~2x faster than the IntMap code (12s vs 25s on my computer).
The code is here [Gist].
I will edit when more profiling is done on my side.
module Main where
import Data.List
import Data.Function
type Raw = (String, String)
icards = [("the", "le"),("savage", "violent"),("work", "travail"),
          ("wild", "sauvage"),("chance", "occasion"),("than a", "qu'un")]
data Entry = Entry {wrd, def :: String, len :: Int, phr :: Bool}
  deriving Show
-- French-to-English, search-tree section
entries' :: [Entry]
entries' = map (\(x, y) -> Entry y x (length y) (' ' `elem` y)) icards
data Tree a = Empty | Tree a (Tree a) (Tree a)
tree :: Tree Entry
tree = build entries'
build :: [Entry] -> Tree Entry
build [] = Empty
build (e:es) = ins e (build es)
ins :: Entry -> Tree Entry -> Tree Entry
...
find :: Tree Entry -> Word -> String
...
translate' :: String -> String
translate' = unwords . (map (find tree)) . words
So I'm trying to design the functions ins and find, but I am not sure where to start. Any ideas?
I have no idea by which criteria the tree should be sorted, so I use just wrd. Then it would look like:
ins :: Entry -> Tree Entry -> Tree Entry
ins entry Empty = Tree entry Empty Empty
ins entry@(Entry w _ _ _) (Tree current@(Entry w1 _ _ _) left right)
  | w == w1 = error "duplicate entry"
  | w < w1 = Tree current (ins entry left) right
  | otherwise = Tree current left (ins entry right)
How to get there?
As always when using recursion, you need a base case. Here it is very simple: If the tree is empty, just replace it by a node containing your data. There are no children for the new node, so we use Empty.
The case if you have a full node looks more difficult, but this is just due to pattern matching, the idea is very simple: If the entry is "smaller" you need to replace the left child with a version that contains the entry, if it is "bigger" you need to replace the right child.
If both node and entry have the same "size" you have three options: keep the old node, replace it by the new one (keeping the children) or throw an error (which seems the cleanest solution, so I did it here).
A simple generalization of Landei's answer:
ins :: Ord a => a -> Tree a -> Tree a
ins x Empty = Tree x Empty Empty
ins x (Tree x' l r) = case compare x x' of
  EQ -> undefined
  LT -> Tree x' (ins x l) r
  GT -> Tree x' l (ins x r)
For this to work on Tree Entry, you will need to define an instance of Ord for Entry.
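One way to supply that instance, together with a lookup to go with ins, is sketched below. Comparing entries by wrd only is an assumption, as is keeping the existing node when a duplicate is inserted. findWord is a hypothetical name (the question calls it find), and returning unknown words unchanged is likewise just one possible choice.

```haskell
import Data.Ord (comparing)

data Entry = Entry { wrd, def :: String, len :: Int, phr :: Bool }
  deriving Show

-- Entries are compared by the word being looked up only (an assumption).
instance Eq Entry where
  a == b = wrd a == wrd b

instance Ord Entry where
  compare = comparing wrd

data Tree a = Empty | Tree a (Tree a) (Tree a)

ins :: Ord a => a -> Tree a -> Tree a
ins x Empty = Tree x Empty Empty
ins x t@(Tree x' l r) = case compare x x' of
  EQ -> t                      -- keep the existing entry (one of several options)
  LT -> Tree x' (ins x l) r
  GT -> Tree x' l (ins x r)

-- Hypothetical lookup: words that are not in the tree are returned unchanged.
findWord :: Tree Entry -> String -> String
findWord Empty w = w
findWord (Tree e l r) w = case compare w (wrd e) of
  EQ -> def e
  LT -> findWord l w
  GT -> findWord r w
```

With a tree built from the question's icards, findWord tree "sauvage" returns "wild".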