Building a suffix tree by inserting each suffix in Haskell - string

I am working with the following data type:
data SuffixTree = Leaf Int | Node [(String, SuffixTree)]
deriving (Eq, Show)
Each subtree has a corresponding label (string).
The idea is to build the corresponding suffix tree by adding each suffix and its index into an accumulating tree (at the beginning it is Node []).
This is already defined
buildTree s
= foldl (flip insert) (Node []) (zip (suffixes s) [0..length s-1])
where suffixes is correctly defined.
I've been trying to implement the insert function for a while but can't seem to succeed.
This is what I have now (the names and style are not the best since this is still work in progress):
insert :: (String, Int) -> SuffixTree -> SuffixTree
insert pair tree#(Node content)
= insert' pair tree content
where
insert' :: (String, Int) -> SuffixTree -> [(String, SuffixTree)] -> SuffixTree
insert' (s, n) (Node []) subtrees
= Node ((s, Leaf n) : subtrees)
insert' (s, n) (Node content#((a, tree) : pairs)) subtrees
| null p = insert' (s, n) (Node pairs) subtrees
| p == a = insert' (r, n) tree subtrees
| p /= a = Node ((p, newNode) : (subtrees \\ [(a, tree)]))
where
(p, r, r') = partition s a
newNode = Node [(r, (Leaf n)), (r', tree)]
The partition function takes two strings and returns a tuple consisting of:
The common prefix (if it exists)
The first string without the prefix
The second string without the prefix
I think I understand the rules needed to build the tree.
We start by comparing the label of the first subtree to the string we want to insert (say, str). If they don't have a prefix in common, we try to insert in the next subtree.
If the label is a prefix of str, we continue to look into that subtree, but instead of using str we try to insert str without the prefix.
If str is a prefix of label, then we replace the existing subtree with a new Node, having a Leaf and the old subtree. We also adjust the labels.
If we don't have a match between str and any label then we add a new Leaf to the list of subtrees.
However, the biggest problem that I have is that I need to return a new tree containing the changes, so I have to keep track of everything else in the tree (not sure how to do this or if I'm thinking correctly about this).
The code appears to be working correctly on this string: "banana":
Node [("a",Node [("",Leaf 5),("na",Node [("",Leaf 3),("na",Leaf 1)])]),
("na",Node [("",Leaf 4),("na",Leaf 2)]),("banana",Leaf 0)]
However, on this string "mississippi" I get an Exception: Non-exhaustive patterns in function insert'.
Any help or ideas are greatly appreciated!

You are using a quadratic algorithm; whereas optimally, suffix tree can be constructed in linear time. That said, sticking with the same algorithm, a possibly better approach would be to first build the (uncompressed) suffix trie (not tree) and then compress the resulting trie.
The advantage would be that a suffix trie can be represented using Data.Map:
data SuffixTrie
= Leaf' Int
| Node' (Map (Maybe Char) SuffixTrie)
which makes manipulations both more efficient and easier than list of pairs. Doing so, you may also completely bypass common prefix calculations, as it comes out by itself:
import Data.List (tails)
import Data.Maybe (maybeToList)
import Control.Arrow (first, second)
import Data.Map.Strict (Map, empty, insert, insertWith, assocs)
data SuffixTree
= Leaf Int
| Node [(String, SuffixTree)]
deriving Show
data SuffixTrie
= Leaf' Int
| Node' (Map (Maybe Char) SuffixTrie)
buildTrie :: String -> SuffixTrie
buildTrie s = foldl go (flip const) (init $ tails s) (length s) $ Node' empty
where
go run xs i (Node' ns) = run (i - 1) $ Node' tr
where tr = foldr loop (insert Nothing $ Leaf' (i - 1)) xs ns
loop x run = insertWith (+:) (Just x) . Node' $ run empty
where _ +: Node' ns = Node' $ run ns
buildTree :: String -> SuffixTree
buildTree = loop . buildTrie
where
loop (Leaf' i) = Leaf i
loop (Node' m) = Node $ con . second loop <$> assocs m
con (Just x, Node [(xs, tr)]) = (x:xs, tr) -- compress single-child nodes
con n = maybeToList `first` n
then:
\> buildTree "banana"
Node [("a",Node [("",Leaf 5),
("na",Node [("",Leaf 3),
("na",Leaf 1)])]),
("banana",Leaf 0),
("na",Node [("",Leaf 4),
("na",Leaf 2)])]
similarly:
\> buildTree "mississippi"
Node [("i",Node [("",Leaf 10),
("ppi",Leaf 7),
("ssi",Node [("ppi",Leaf 4),
("ssippi",Leaf 1)])]),
("mississippi",Leaf 0),
("p",Node [("i",Leaf 9),
("pi",Leaf 8)]),
("s",Node [("i",Node [("ppi",Leaf 6),
("ssippi",Leaf 3)]),
("si",Node [("ppi",Leaf 5),
("ssippi",Leaf 2)])])]

Here's how the problem is occurring.
Let's say you're processing buildTree "nanny". After you've inserted the suffixes "nanny", "anny", and "nny", your tree looks like t1 given by:
let t1 = Node t1_content
t1_content = [("n",t2),("anny",Leaf 1)]
t2 = Node [("ny",Leaf 2),("anny",Leaf 0)]
Next, you try to insert the prefix "ny":
insert ("ny", 3) t1
= insert' ("ny", 3) t1 t1_content
-- matches guard p == a with p="n", r="y", r'=""
= insert' ("y", 3) t2 t1_content
What you intend to do next is insert ("y", 3) into t2 to yield:
Node [("y", Leaf 3), ("ny",Leaf 2),("anny",Leaf 0)])
Instead, what happens is:
insert' ("y", 3) t2 t1_content
-- have s="y", a="ny", so p="", r="y", r'="ny"
-- which matches guard: null p
= insert' ("y", 3) (Node [("anny", Leaf 0)]) t1_content
-- have s="y", a="anny", so p="", r="y", r'="anny"
-- which matches guard: null p
= insert' ("y", 3) (Node []) t1_content
= Node [("y", Leaf 3), ("n",t2), ("anny",Leaf 1)]
and suffix "y" has been added to t1 instead of t2.
When you next try to insert suffix "y", the guard p==a case tries to insert ("y",3) into Leaf 3 and you get a pattern error.
The reason it works on banana is that you only ever insert a new node at the top level of the tree, so "adding to t2" and "adding to t1" are the same thing.
I suspect you'll need to substantially rethink the structure of your recursion to get this working.

Looks like this code does the job, although there may still be improvements to make. I hope that it's general enough to work on any string. I also tried to avoid using ++, but it's still better than nothing.
getContent (Node listOfPairs)
= listOfPairs
insert :: (String, Int) -> SuffixTree -> SuffixTree
insert (s, n) (Node [])
= Node [(s, Leaf n)]
insert (s, n) (Node (pair#(a, tree) : pairs))
| p == a = Node ((a, insert (r, n) tree) : pairs)
| null p = Node (pair : (getContent (insert (r, n) (Node pairs))))
| p /= a = Node ([(p, Node [(r, Leaf n), (r', tree)])] ++ pairs)
where
(p, r, r') = partition s a

Related

Why the does this shrink tree looks the way it does when using filter

I'm trying to understand what is the effect that filter has in the shrink tree of a generator when using hedgehog integrated shrinking.
Consider the following function:
{-# LANGUAGE OverloadedStrings #-}
import Hedgehog
import qualified Hedgehog.Gen as Gen
aFilteredchar:: Gen Char
aFilteredchar =
Gen.filter (`elem` ("x" :: String)) (Gen.element "yx")
When a print the shrink tree:
>>> Gen.printTree aFilteredchar
I'd get shrink trees that look as follow:
'x'
└╼'x'
└╼'x'
└╼'x'
...
└╼<discard>
this is, a very deep tree containing only x's, and a discard at the end.
Why does the shrink function keeps on returning x's, instead of an empty list, which signals that there are no further shrinks possible?
Gen is essentially a composition of a probability monad and a tree monad, and the behavior you observe mostly arises from the tree monad and the definition of Gen.filter.
Basically, Gen.filter p g is a simple monadic loop, try 0 where:
-- simplified body of filter
try k =
if k > 100 then
discard -- empty tree
else do
x <- g
if p x then
pure x -- singleton tree
else
try (k + 1) -- keep looping
So to understand the tree you got, you must understand the tree monad under the do notation here.
The tree monad
The Tree type in hedgehog that is internally used by Gen looks roughly like this (if you are looking at the linked implementation in hedgehog, set m ~ Maybe):
data Tree a = Empty | Node a [Tree a] -- node label and children
There are many other Tree-like types that are monads, and the monadic bind (>>=) generally takes the form of a tree substitution.
Say you have a tree t = Node x [t1, t2, ...] :: Tree a, and a continuation/substitution k :: a -> Tree b, which replaces every node/variable x :: a with the tree k x :: Tree b. We can describe t >>= k in two steps, fmap then join, as follows. First the fmap applies the substitution on every node label. So we obtain a tree where every node is labeled by another tree. For concreteness, say k x = Node y [u1, u2, ...]:
fmap k t
=
Node
(k x) -- node label
[fmap k t1, fmap k t2, ...] -- node children
=
Node
(Node y [u1, u2, ...]) -- node label
[fmap k t1, fmap k t2, ...] -- node children
Then the join step flattens the nested tree structure, concatenating the children from inside the label with those outside:
t >>= k
=
join (fmap k t)
=
Node
y
([join (fmap k t1), join (fmap k t2), ...] ++ [u1, u2, ...])
To complete the Monad instance, note that we have pure x = Node x [].
The try loop
Now that we have some intuition for the tree monad we can turn to your particular generator. We want to evaluate try k above, where p = (== 'x') and g = elements "yx". I'm waving my hands here, but you should imagine that g evaluates randomly to either the tree Node 'y' [] (generate 'y' with no shrinkings), aka. pure 'y', or Node 'x' [Node 'y' []] (generate 'x' and shrink to 'y'; indeed, "elements shrinks to the left"), and that every occurence of g is independent from others, so we get a different result when we retry.
Let's examine each case separately. What happens if g = pure 'y'? Assume k <= 100 so we're in the else branch of the toplevel if, already simplified away below:
-- simplified body of filter
try k = do
c <- pure 'y' -- g = pure 'y'
if c == 'x' then -- p c = (c == 'x')
pure c
else
try (k + 1)
-- since (do c <- pure 'y' ; s c) = s 'y' (monad law) and ('y' == 'x') = False
try k = try (k + 1)
So all the times where g evaluates to pure 'y' end up simplified away as the recursive term try (k + 1), and we are left with the cases where g evaluates to the other tree Node 'x' [Node 'y' []]:
try k = do
c <- Node 'x' [Node 'y' []] -- g
if c == 'x' then
pure c
else
try (k + 1)
As illustrated in the previous section, the monadic bind is equivalent to the following, and we finish with some equational reasoning.
try k = join (Node (s 'x') [Node (s 'y') []])
where
s c = if c == 'x' then pure c else try (k + 1)
try k = join (Node (pure 'x') [Node (try (k + 1)) []])
try k = join (Node (pure 'x') [pure (try (k + 1))] -- simplifying join
try k = Node 'x' [join (pure (try (k + 1)))] -- join . pure = id
try k = Node 'x' [try (k + 1)]
In summary, starting from try 0, with half probability try k = try (k + 1), and with the other half try k = Node 'x' [try (k + 1)], finally we stop at try 100. This explains the tree you observe.
try 0 = Node 'x' [Node 'x' [ ... ]] -- about 50 nodes
(I believe this also provides at least a partial answer to your other question, since this shows how shrinking a Gen.filter often amounts to rerunning the generator from scratch.)
While Li-yao Xia's detailed answer correctly describes how this happens, it doesn't address the why; why does it re-run the generator after each shrink? The answer is that it shouldn't; this is a bug. See bug report Improve Filter on GitHub.

Finding out number of even paths from root in a tree

I'm trying to get a function that counts all paths from the root to a leaf that has an even number of nodes ( counting the root and the leaf)
My tree looks like this:
data Tree = Leaf Int | Node Int Tree Tree
all i got so far is a function that counts ALL nodes in a tree, which is easy enough:
countNodes (Leaf _) = 1
countNodes (Node _ x y) = 1+ countNodes x + countNodes y
Now i saw a bunch of questions that deal with trees but i felt like no answer helped me much, so I'm just gonna ask myself. How do i make a part of a function stop when a leaf is reached? I know this has to do with my problem to think with recursions.
What I tried to do was to to make list of all paths from the root, but i always end up with a function that gets all elements in the tree and puts them together somehow.
I'm missing something simple, please help. (or link me an answer that does exactly what i want)
I think the easiest way would be to make a data type that can describe a path through a tree:
data Path = L Path | R Path | End deriving (Eq, Show)
This type is basically a list but with two prepend constructors to tell you either go Left or go Right. This conveniently lets you look up items by path, or you can write a function that gives you a list of all paths in the tree.
-- Note that this can fail: lookupNode (Leaf 1) (L End) == Nothing
lookupNode :: Tree -> Path -> Maybe Tree
allPaths :: Tree -> [Path]
If you can write the allPaths function, then you can write the function you want on top of it. To start, just begin by listing the base cases:
allPaths (Leaf _) = [End]
allPaths (Node _ left right) = _
To fill in the hole _, think about what it means to list all the paths starting at a Node and recursing down left. You would need to have a L at the beginning of all of those paths, so you can put the following in there
allPaths (Node _ left right) = (map L $ allPaths left)
Similarly, you would need to handle the right tree:
allPaths (Node _ left right) =
(map L $ allPaths left) ++
(map R $ allPaths right)
So now:
> let tree =
Node 1
(Node 2 -- L _
(Leaf 3) -- L (L End)
(Node 4 -- L (R _)
(Leaf 5) -- L (R (L End))
(Leaf 6) -- L (R (R End))
)
)
(Leaf 7) -- R End
> allPaths tree
[L (L End),L (R (L End)), L (R (R End)),R End]
Now, to find the Leafs with an even number of nodes above them, first write a function that calculates a path length:
pathLength :: Path -> Int
pathLength End = 0
pathLength (L rest) = 1 + pathlength rest
pathLength (R rest) = 1 + pathLength rest
evenNodeCountPaths :: Tree -> [Path]
evenNodeCountPaths tree = filter (even . pathLength) $ allPaths tree
Note: It is possible to do this with
data Dir = L | R | End
type Path = [Dir]
But that can lead to invalid paths like [End, End, L, R, End], which just doesn't make any sense. I chose to go for the list-like data Path for this reason. You have to write your own pathLength function, but this formulation makes it impossible to have invalid paths.
Probably it's easier to compute both the number of even and the number of odd paths.
evenAndOdd (Leaf _) = (0, 1)
evenAndOdd (Node _ l r) = let
(el, ol) = evenAndOdd l
(er, or) = evenAndOdd r
in (ol+or, el+er)
If you really must, you can then define a function in terms of this to count just the even paths.
evenOnly = fst . evenAndOdd

Why is the return value of head implicitly converted to Maybe?

I've just started to learn Haskell and for practice I've decided to create a function that takes a tree, whose elements have a position and size, and returns the element located at a certain position. My code looks like this:
import Data.List (dropWhile)
data Node = Node (Int,Int) (Int,Int) [Node] deriving(Eq, Show)
findNode :: (Int,Int) -> Node -> Maybe Node
findNode loc node#(Node (x, y) (width, height) []) = if loc >= (x, y) && loc <= (x+width, y+height) then Just node else Nothing
findNode loc node#(Node (x, y) (width, height) children) =
if loc >= (x, y) && loc <= (x+width, y+height)
then if not $ null nodes then
head nodes
else
Just node
else Nothing
where nodes = dropWhile (==Nothing) $ map (findNode loc) children
This code compiles and runs perfectly as far as I can tell, but I'm curious about one thing, Why is head nodes acceptable instead of Just $ head nodes?.
That's because your nodes has a type of [Maybe Node]. You can infact verify that by annotating it explicitly and compiling it again:
findNode loc node#(Node (x, y) (width, height) children) =
if loc >= (x, y) && loc <= (x+width, y+height)
then
if not $ null nodes
then head nodes
else Just node
else Nothing
where
nodes = dropWhile (==Nothing) $ map (findNode loc) children :: [Maybe Node]
So how does, nodes have [Maybe Node] type ?
The type of map (findNode loc) children is [Maybe Node]. (General type of map is map :: (a -> b) -> [a] -> [b] and in your case the type of findNode loc is Node -> Maybe Node and hence the resultant type is [Maybe Node] ). And the type of dropWhile is (a -> Bool) -> [a] -> [a]. In your case a already is Maybe Node and hence your nodes has a type of [Maybe Node].

Non-tree data structures in Haskell

Making tree like data structures is relatively easy in Haskell. However, what if I want a structure like the following:
A (root)
/ \
B C
/ \ / \
D E F
So if I traverse down the structure through B to update E, the returned new updated structure also has E updated if I traverse through C.
Could someone give me some hints about how to achieve this? You can assume there are no loops.
I would flatten the data structure to an array, and operate on this instead:
import Data.Array
type Tree = Array Int -- Bounds should start at (1) and go to sum [1..n]
data TreeTraverse = TLeft TreeTraverse | TRight TreeTraverse | TStop
Given some traverse directions (left, right, stop), it's easy to see that if we go left, we simply add the current level to our position, and if we go right, we also add the current position plus one:
getPosition :: TreeTraverse -> Int
getPosition = getPosition' 1 1
where
getPosition' level pos (TLeft ts) = getPosition' (level+1) (pos+level) ts
getPosition' level pos (TRight ts) = getPosition' (level+1) (pos+level + 1) ts
getPosition' _ pos (TStop) = pos
In your case, you want to traverse either ABE or ACE:
traverseABE = TLeft $ TRight TStop
traverseACE = TRight $ TLeft TStop
Since we already now how to get the position of your element, and Data.Array provides some functions to set/get specific elements, we can use the following functions to get/set tree values:
getElem :: TreeTraverse -> Tree a -> a
getElem tt t = t ! getPosition tt
setElem :: TreeTraverse -> Tree a -> a -> Tree a
setElem tt t x = t // [(getPosition tt, x)]
To complete the code, lets use your example:
example = "ABCDEF"
exampleTree :: Tree Char
exampleTree = listArray (1, length example) example
And put everything to action:
main :: IO ()
main = do
putStrLn $ "Traversing from A -> B -> E: " ++ [getElem traverseABE exampleTree]
putStrLn $ "Traversing from A -> C -> E: " ++ [getElem traverseACE exampleTree]
putStrLn $ "exampleTree: " ++ show exampleTree ++ "\n"
putStrLn $ "Setting element from A -> B -> E to 'X', "
let newTree = setElem traverseABE exampleTree 'X'
putStrLn $ "but show via A -> C -> E: " ++ [getElem traverseACE newTree]
putStrLn $ "newTree: " ++ show newTree ++ "\n"
Note that this is most-likely not the best way to do this, but the first thing that I had in mind.
Once you've established identity, it can be done.
But first you must establish identity.
In many languages, values can be distinct from each other, but equal. In Python, for example:
>>> a = [1]
>>> b = [1]
>>> a == b
True
>>> a is b
False
You want to update E in one branch of the tree, and also update all other elements for which that element is E. But Haskell is referentially transparent: it has no notion of things being the same object; only equality, and even that is not applicable for every object.
One way you could do this is equality. Say this was your tree:
__A__
/ \
B C
/ \ / \
1 2 2 3
Then we could go through the tree and update all the 2s to, say, four. But this isn't exactly what you want in some cases.
In Haskell, if you want to update one thing in multiple places, you'll have to be explicit about what is and isn't the same thing. Another way you could deal with this is to tag each different value with a unique integer, and use that integer to determine identity:
____________A___________
/ \
B C
/ \ / \
(id=1)"foo" (id=2)"bar" (id=2)"bar" (id=3)"baz"
Then we could update all values with an identity of 2. Accidental collisions cannot be a problem, as there can be no collisions except those that are intentional.
This is essentially what STRef and IORef do, except they hoist the actual value into the monad's state and hide the identities from you. The only downside of using these is you'll need to make much of your code monadic, but you're probably not going to get away from that easily whatever you do. (Modifying values rather than replacing them is an inherently effectful thing to do.)
The structure you gave was not specified in much detail so it's impossible to tailor an example to your use case, but here's a simple example using the ST monad and a Tree:
import Control.Monad
import Control.Monad.ST
import Data.Tree
import Data.Traversable (traverse)
import Data.STRef
createInitialTree :: ST s (Tree (STRef s String))
createInitialTree = do
[a, b, c, d, e, f] <- mapM newSTRef ["A", "B", "C", "D", "E", "F"]
return $ Node a [ Node b [Node d [], Node e []]
, Node c [Node e [], Node f []]
]
dereferenceTree :: Tree (STRef s a) -> ST s (Tree a)
dereferenceTree = traverse readSTRef
test :: ST s (Tree String, Tree String)
test = do
tree <- createInitialTree
before <- dereferenceTree tree
let leftE = subForest (subForest tree !! 0) !! 1
writeSTRef (rootLabel leftE) "new" -- look ma, single update!
after <- dereferenceTree tree
return (before, after)
main = do
let (before, after) = runST test
putStrLn $ drawTree before
putStrLn $ drawTree after
Observe that although we only explicitly modified the value of the left E value, it changed on the right side, too, as desired.
I should note that these are not the only ways. There are probably many other solutions to this same problem, but they all require you to define identity sensibly. Only once that has been done can one begin the next step.

haskell binary search tree

module Main where
import Data.List
import Data.Function
type Raw = (String, String)
icards = [("the", "le"),("savage", "violent"),("work", "travail"),
("wild", "sauvage"),("chance", "occasion"),("than a", "qu'un")]
data Entry = Entry {wrd, def :: String, len :: Int, phr :: Bool}
deriving Show
-- French-to-English, search-tree section
entries' :: [Entry]
entries' = map (\(x, y) -> Entry y x (length y) (' ' `elem` y)) icards
data Tree a = Empty | Tree a (Tree a) (Tree a)
tree :: Tree Entry
tree = build entries'
build :: [Entry] -> Tree Entry
build [] = Empty
build (e:es) = ins e (build es)
ins :: Entry -> Tree Entry -> Tree Entry
...
find :: Tree Entry -> Word -> String
...
translate' :: String -> String
translate' = unwords . (map (find tree)) . words
so i'm trying to design function ins and find but i am not sure where to start.any ideas?
I have no idea by which criteria the tree should be sorted, so I use just wrd. Then it would look like:
ins :: Entry -> Tree Entry -> Tree Entry
ins entry Empty = Tree entry Empty Empty
ins entry#(Entry w _ _ _) (Tree current#(Entry w1 _ _ _) left right)
| w == w1 = error "duplicate entry"
| w < w1 = Tree current (ins entry left) right
| otherwise = Tree current left (ins entry right)
How to get there?
As always when using recursion, you need a base case. Here it is very simple: If the tree is empty, just replace it by a node containing your data. There are no children for the new node, so we use Empty.
The case if you have a full node looks more difficult, but this is just due to pattern matching, the idea is very simple: If the entry is "smaller" you need to replace the left child with a version that contains the entry, if it is "bigger" you need to replace the right child.
If both node and entry have the same "size" you have three options: keep the old node, replace it by the new one (keeping the children) or throw an error (which seems the cleanest solution, so I did it here).
A simple generalization of Landei's answer:
ins :: Ord a => a -> Tree a -> Tree a
ins x Empty = Tree x Empty Empty
ins x (Tree x' l r) = case compare x x' of
EQ -> undefined
LT -> Tree x' (ins x l) r
GT -> Tree x' l (ins x r)
For this to work on Tree Entry, you will need to define an instance of Ord for Entry.

Resources