Can you implement Binary Search Tree in Haskell with O(log n) insertion? - haskell

If I understand correctly, modifying a Binary Search Tree in Haskell (insertion or deletion) requires copying the whole tree, making it effectively O(n). Is there a way to implement it in O(log n), or would the compiler optimize the O(n) insertion down to O(log n) "under the hood"?

If I understand correctly, modifying a Binary Search Tree in Haskell (insertion or deletion) requires copying the whole tree, making it effectively O(n).
You do not need to copy the entire tree. Indeed, let us work with a simple unbalanced binary search tree, like:
data Tree a = Node (Tree a) a (Tree a) | Empty deriving (Eq, Show)
then we can insert a value with:
insertIn :: Ord a => a -> Tree a -> Tree a
insertIn x = go
  where
    go Empty = Node Empty x Empty
    go n@(Node l v r)
      | x < v = Node (go l) v r
      | x > v = Node l v (go r)
      | otherwise = n
Here we reuse r when we construct Node (go l) v r, and we reuse l when we construct Node l v (go r). For each node we visit, we create a new node in which one of the two subtrees is reused unchanged. This means that the new tree will point to the same subtree objects as the original tree.
The number of new nodes thus scales as O(d), with d the depth of the tree. If the tree is fairly balanced, then insertion runs in O(log n).
Of course you can improve the algorithm and define an AVL tree or a red-black tree by storing balancing information in each node; in that case you can guarantee O(log n) insertion time.
The fact that all data is immutable here helps to reuse parts of the tree: we know that l and r cannot change, so the two trees will share a large number of nodes, reducing the memory needed if you want to keep using both the original and the new tree.
If no reference to the old tree is kept, the garbage collector will eventually collect the "old" nodes that have been replaced by the new tree.
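To see the sharing concretely, here is a small usage sketch (the values are arbitrary): after inserting 1, only the nodes on the search path (5 and 3) are rebuilt, and the subtree rooted at 8 is the very same heap object in both trees.
t0 :: Tree Int
t0 = insertIn 8 (insertIn 3 (insertIn 5 Empty))  -- root 5, left child 3, right child 8

t1 :: Tree Int
t1 = insertIn 1 t0  -- rebuilds only the path 5 -> 3; the node 8 is shared with t0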

Related

How can I implement an optimal, purely functional, double-ended priority queue?

Okasaki shows how to write purely functional priority queues with O(1) insert and O(log n) minView (some versions also offer O(log n) or even O(1) merge). Can any of these ideas be extended to double-ended priority queues? Khoong and Leong (in a paper I don't have access to) offer an ephemeral implementation based on binomial heaps, but from what I can see of their paper, that approach doesn't seem easy to make persistent, as it uses parent and sibling pointers.
As leftaroundabout points out, this can be done with a 2–3 finger tree. In particular, one annotated with the semigroup
data MinMax k = MinMax
  { smallest :: !k
  , largest  :: !k }

instance Ord k => Semigroup (MinMax k) where
  MinMax min1 max1 <> MinMax min2 max2 =
    MinMax (min min1 min2) (max max1 max2)
Such an annotated finger tree can be made a double-ended priority queue in basically the same way that the fingertree package defines priority queues (but adjusted slightly to avoid needing a Monoid). minView and maxView can be improved using the same implementation technique as Data.Sequence.deleteAt.
Why use a Semigroup and not add a neutral element to make it a Monoid? This way, we can unpack MinMax annotations into tree nodes and avoid an extra indirection at every step, along with extra allocation.
Performance bounds
insert: Amortized O(1) (note: this bound will hold up even in the face of persistence, thanks to careful use of laziness). Worst-case O(log n). Note that the fingertree package only claims O(log n) for insertion; this is a documentation bug which I have reported and which will be corrected in the next version.
minView/maxView: Worst-case O(1) to see the minimum/maximum; worst-case O(log n) to remove it.
meld: Worst-case O(log (min (m, n))), where m and n are the sizes of the queues.
Hinze-Paterson style 2–3 finger trees are actually a bit more than necessary. A one-fingered version will do the trick, with fewer digit sizes.
{-# OPTIONS_GHC -funbox-strict-fields #-}

data Node k a
  = Node2 !(MinMax k) !a !a
  | Node3 !(MinMax k) !a !a !a

data Tree k a
  = Empty
  -- the child of a Zero node may not be empty
  | Zero !(MinMax k) (Tree k (Node k a))
  | One !(MinMax k) !a (Tree k (Node k a))
I've been working on fleshing this out for the last few days. Fortunately, it's mostly quite straightforward. Unfortunately, it requires an awful lot of code. The fundamental challenge is that deletion in 2–3 trees is fairly involved. The version for finger trees adds another layer of complexity. And then the whole thing has to be written twice to deal with both minView and maxView.
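To make the sketch concrete, here is one way insertion could look for this structure. This is my own sketch, not code from the fingertree package: consTree and nodeMM are names I'm introducing, and the caller supplies a function measuring a single element. A One digit overflows by pairing its element with the new one into a Node2 and carrying it one level down, exactly like incrementing a binary counter, which is where the amortized O(1) bound comes from.
-- read off the cached annotation of an internal node
nodeMM :: Node k a -> MinMax k
nodeMM (Node2 mm _ _)   = mm
nodeMM (Node3 mm _ _ _) = mm

consTree :: Ord k => (a -> MinMax k) -> a -> Tree k a -> Tree k a
consTree f x Empty        = One (f x) x Empty
consTree f x (Zero mm t)  = One (f x <> mm) x t
consTree f x (One mm y t) =
  -- carry: pair the two elements into a Node2 and push it down a level
  Zero (f x <> mm) (consTree nodeMM (Node2 (f x <> f y) x y) t)
Note the polymorphic recursion: the recursive call works at element type Node k a, which is why the type signature is required.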

haskell create an unbalanced tree

My Tree definition is
data Tree = Leaf Integer | Node Tree Tree
This is a binary tree, with only values at the leaves.
I am given following definition for balanced trees:
We say that a tree is balanced if the number of leaves in the left and right subtree of every node differs by at most one, with leaves themselves being trivially balanced.
I try to create a balanced tree as follows:
t :: Tree
t = Node (Node (Node (Leaf 1) (Leaf 2)) (Node (Leaf 3) (Leaf 4))) (Node (Node (Leaf 5) (Leaf 6)) (Node (Leaf 7) (Leaf 8)))
Can you please let me know if t above is a balanced tree with values only at the leaves?
Another question: how do I create a tree with values only at the leaves that is unbalanced as per the above definition?
Thanks
Can you please let me know if t above is a balanced tree with values only at the leaves?
I can, but I won't. However, I hope I can guide you through the process of writing a function that will determine whether a given tree is balanced.
The following is certainly not the most efficient way to do it (see the bottom for a hint about that), but it is a very modular way. It's also a good example of the "computation by transformation" approach that functional programming (and especially lazy functional programming) encourages. It seems pretty clear to me that the first question to ask is "how many leaves descend from each node?" There's no way for us to write down the answers directly in the tree, but we can make a new tree that has the answers:
data CountedTree = CLeaf Integer | CNode Integer CountedTree CountedTree
Each node of a CountedTree has an integer field indicating how many leaves descend from it.
You should be able to write a function that reads off the total number of leaves from a CountedTree, whether it's a Leaf or a Node:
getSize :: CountedTree -> Integer
The next step is to determine whether a CountedTree is balanced. Here's a skeleton:
countedBalanced :: CountedTree -> Bool
countedBalanced (CLeaf _) = ?
countedBalanced (CNode _ left right)
  = ?? && ?? && abs (getSize left - getSize right) <= 1
I've left the first step for last: convert a Tree into a CountedTree:
countTree :: Tree -> CountedTree
And finally you can wrap it all up:
balanced :: Tree -> Bool
balanced t = ?? (?? t)
Now it turns out that you don't actually have to copy and annotate the tree to figure out whether or not it's balanced. You can do it much more directly. This is a much more efficient approach, but a somewhat less modular one. I'll give you the relevant types, and you can fill in the function.
-- The balance status of a tree. Either it's
-- unbalanced, or it's balanced and we store
-- its total number of leaves.
data Balance = Unbalanced | Balanced Integer
getBalance :: Tree -> Balance
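In case you want to check your work afterwards, here is one possible completion of getBalance (spoiler; other arrangements are equally valid):
getBalance :: Tree -> Balance
getBalance (Leaf _)   = Balanced 1
getBalance (Node l r) =
  -- a node is balanced iff both children are, and their
  -- leaf counts differ by at most one
  case (getBalance l, getBalance r) of
    (Balanced m, Balanced n)
      | abs (m - n) <= 1 -> Balanced (m + n)
    _                    -> Unbalanced
balanced t is then just a check that getBalance t produced a Balanced result.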

Remove root element from a heap tree

How do I remove the smallest element of a heap tree?
This element is at the root of the tree. If I remove that, I'm left with two independent subtrees.
data Heap a = Empty
            | Node a (Heap a) (Heap a)
The type of the function is:
removeMin :: Heap a -> (a, Heap a)
It should return the minimum and the heap with it removed.
Should I make an auxiliary function to build a new tree, or is there a faster way to do this?
Your type, as written, raises some questions:
Q: What's the output from removeMin Empty?
A: You can't produce an a from nothing, so the result should be wrapped in Maybe.
Q: If I've put (+), (-) and (*) in a Heap (Int -> Int -> Int), which one should be returned by removeMin?
A: Not all data types have an ordering (notably, functions lack one), so it makes sense to require that the data type have an Ord instance.
So the updated type becomes:
removeMin :: Ord a => Heap a -> Maybe (a, Heap a)
Now consider it case by case:
Empty has no min element:
removeMin Empty = Nothing
If one branch is empty, the remaining heap is the other branch
removeMin (Node a Empty r) = Just (a, r)
removeMin (Node a l Empty) = Just (a, l)
Convince yourself that this works for Node a Empty Empty.
If neither branch is empty, then the new minimum must be the root of one of the branches.
The branches of the resulting Heap are the branch whose root is larger, and the branch whose root is smaller with its minimum removed.
Fortunately, we already have a helper to remove the minimum from a Heap!
removeMin (Node a l@(Node la _ _) r@(Node ra _ _)) = Just (a, Node mina maxN minN')
  where
    (minN, maxN) = if la <= ra then (l, r) else (r, l)
    Just (mina, minN') = removeMin minN
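As a quick sanity check, repeatedly removing the minimum drains the heap in ascending order (a usage sketch):
-- heapsort, essentially
toSortedList :: Ord a => Heap a -> [a]
toSortedList h = case removeMin h of
  Nothing      -> []
  Just (x, h') -> x : toSortedList h'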
Now, while this produces a valid heap, it's not necessarily the best algorithm because it's not guaranteed to produce a balanced heap. A poorly balanced heap is no better than a linked list, giving you O(n) insertion and deletion times where a balanced heap can give you O(log n).
You should build an appropriate auxiliary function to construct the new tree, but don't worry: it will not perform poorly. GHC handles such patterns well, and thanks to laziness this works even for very large (or infinite, recursively defined) structures.
I take it you can write such an auxiliary function yourself? It is straightforward; anyway, in case of trouble I can write it later.
Think of it this way: After removing the top node, you're left with two heaps. So you need to implement (recursive) merging of two heaps, something like
merge :: (Ord a) => Heap a -> Heap a -> Heap a
You could also implement a Monoid instance for Heap:
instance (Ord a) => Monoid (Heap a) where
  mempty  = Empty
  mappend = -- the merging function
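Filling in that sketch: one common choice that needs no balance bookkeeping is the skew heap, which unconditionally swaps children after each merge (this is my choice of merging strategy, not one prescribed above). With merge in hand, removeMin is one line per case, and on modern GHC the Monoid instance needs a Semigroup instance first:
merge :: Ord a => Heap a -> Heap a -> Heap a
merge Empty h = h
merge h Empty = h
merge h1@(Node x l1 r1) h2@(Node y _ _)
  | x <= y    = Node x (merge r1 h2) l1  -- swap children: the skew-heap trick
  | otherwise = merge h2 h1

removeMin :: Ord a => Heap a -> Maybe (a, Heap a)
removeMin Empty        = Nothing
removeMin (Node x l r) = Just (x, merge l r)

instance Ord a => Semigroup (Heap a) where
  (<>) = merge

instance Ord a => Monoid (Heap a) where
  mempty = Empty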

Is there a way to avoid copying the whole search path of a binary tree on insert?

I've just started working my way through Okasaki's Purely Functional Data Structures, but have been doing things in Haskell rather than Standard ML. However, I've come across an early exercise (2.5) that's left me a bit stumped on how to do things in Haskell:
Inserting an existing element into a binary search tree copies the entire search path
even though the copied nodes are indistinguishable from the originals. Rewrite insert using exceptions to avoid this copying. Establish only one handler per insertion rather than one handler per iteration.
Now, my understanding is that ML, being an impure language, gets by with a conventional approach to exception handling, not so different from, say, Java's, so you can accomplish it with something like this:
datatype Tree = E | T of Tree * int * Tree

exception ElementPresent

fun insert (x, t) =
  let
    fun go E = T (E, x, E)
      | go (T (l, y, r)) =
          if x < y then T (go l, y, r)
          else if y < x then T (l, y, go r)
          else raise ElementPresent
  in
    go t handle ElementPresent => t
  end
I don't have an ML implementation, so this may not be quite right in terms of the syntax.
My issue is that I have no idea how this can be done in Haskell, outside of doing everything in the IO monad, which seems like cheating; even if it's not cheating, it would seriously limit the usefulness of a function that really doesn't do any mutation. I could use the Maybe monad:
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show)

insert :: (Ord a) => a -> Tree a -> Tree a
insert x t = maybe t id (go t)
  where
    go Empty = return (Fork Empty x Empty)
    go (Fork l y r)
      | x < y = do l' <- go l; return (Fork l' y r)
      | x > y = do r' <- go r; return (Fork l y r')
      | otherwise = Nothing
This means everything winds up wrapped in Just on the way back up when the element isn't found, which requires more heap allocation, and sort of defeats the purpose. Is this allocation just the price of purity?
EDIT to add: A lot of why I'm wondering about the suitability of the Maybe solution is that the optimization described only seems to save you all the constructor calls you would need in the case where the element already exists, which means heap allocations proportional to the length of the search path. The Maybe also avoids those constructor calls when the element already exists, but then you get a number of Just constructor calls equal to the length of the search path. I understand that a sufficiently smart compiler could elide all the Just allocations, but I don't know if, say, the current version of GHC is really that smart.
In terms of cost, the ML version is actually very similar to your Haskell version. Every recursive call in the ML version results in a stack frame. The same is true in the Haskell version. This is going to be proportional in size to the path that you traverse in the tree. Also, both versions will of course allocate new nodes for the entire path if an insertion is actually performed.
In your Haskell version, every recursive call might also eventually result in the allocation of a Just node. This will go on the minor heap, which is just a block of memory with a bump pointer. For all practical purposes, GHC's minor heap is roughly equivalent in cost to the stack. Since these are short-lived allocations, they won't normally end up being moved to the major heap at all.
GHC generally cannot elide path copying in cases like that. However, there is a way to do it manually, without incurring any of the indirection/allocation costs of Maybe. Here it is:
{-# LANGUAGE MagicHash #-}

import GHC.Prim (reallyUnsafePtrEquality#)

data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show)

insert :: (Ord a) => a -> Tree a -> Tree a
insert x Empty = Fork Empty x Empty
insert x node@(Fork l y r)
  | x < y = let l' = insert x l in
      case reallyUnsafePtrEquality# l l' of
        1# -> node
        _  -> Fork l' y r
  | x > y = let r' = insert x r in
      case reallyUnsafePtrEquality# r r' of
        1# -> node
        _  -> Fork l y r'
  | otherwise = node
The pointer equality function does exactly what its name says. It is safe here because even if it returns a false negative, we only do a bit of extra copying, and nothing worse happens.
It's not the most idiomatic or prettiest Haskell, but the performance benefits can be significant. In fact, this trick is used very frequently in unordered-containers.
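If you want to keep the unboxed comparison in one place, a small wrapper does the job (ptrEq is my name for it, though containers and unordered-containers define essentially the same helper internally):
{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- True only if both arguments are the same heap object;
-- a False result may be a false negative, which is safe here
ptrEq :: a -> a -> Bool
ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)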
As fizruk indicates, the Maybe approach is not significantly different from what you'd get in Standard ML. Yes, the whole path is copied, but the new copy is discarded if it turns out not to be needed. The Just constructor itself may not even be allocated on the heap—it can't escape from insert, let alone the module, and you don't do anything weird with it, so the compiler is free to analyze it to death.
Edit
There are efficiency problems, now that I think of it. Your use of Maybe conceals the fact that you're actually making two passes: one down to find the insertion point and one up to build the tree. The solution is to drop Maybe Tree in favor of (Tree, Bool) and use strictness annotations, or to switch to continuation-passing style. Also, if you choose to stay with the three-way logic, you may want to use the three-way comparison function compare. Alternatively, you can go all the way to the bottom each time and check for a duplicate afterwards.
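For reference, here is roughly what the (Tree, Bool) variant might look like; insertFlag is my name, and the Bool records whether anything below changed, so untouched subtrees are returned as-is:
insertFlag :: Ord a => a -> Tree a -> Tree a
insertFlag x t = fst (go t)
  where
    go Empty = (Fork Empty x Empty, True)
    go n@(Fork l y r)
      | x < y = case go l of
          (_,  False) -> (n, False)      -- nothing changed below: reuse n
          (l', True)  -> (Fork l' y r, True)
      | x > y = case go r of
          (_,  False) -> (n, False)
          (r', True)  -> (Fork l y r', True)
      | otherwise = (n, False)
With strictness annotations on the pair components, GHC's worker/wrapper transformation has a good chance of unboxing the tuple away.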
If you have a predicate that checks whether the key is already in the tree, you can look before you leap:
insert x t = if contains t x then t else insert' x t
This traverses the tree twice, of course. Whether that's as bad as it sounds should be determined empirically: it might just load the relevant part of the tree into the cache.
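The contains predicate itself is the obvious recursive search (a sketch, using the question's Tree type):
contains :: Ord a => Tree a -> a -> Bool
contains Empty _ = False
contains (Fork l y r) x
  | x < y     = contains l x
  | x > y     = contains r x
  | otherwise = True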

Get the parent of a node in Data.Tree (haskell)

I need a tree implementation where I can access the parent node of any node in the tree. Looking at Data.Tree I see the tree definition:
data Tree a = Node {
      rootLabel :: a,          -- ^ label value
      subForest :: Forest a    -- ^ zero or more child trees
  }
So if I have a tree node Tree a I can access its label and its children. But is it also possible to access its parent node? Do I have to choose a different implementation for my needs? Which package would you recommend?
If I'm not mistaken, what you're asking for is basically what gets implemented in LYAH's section on Zippers.
I won't attempt to explain it better than Miran did, but the basic idea is to keep track of which tree you're coming from, as well as which branch you're moving down, while you traverse the tree. You're not keeping track of the node's parent directly in the data structure, but all of the information will be available when you traverse the tree.
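A minimal version of such a zipper for Data.Tree might look like this (the names Crumb, Loc, down and up are mine; LYAH and existing packages spell things differently):
import Data.Tree (Tree(..), Forest)

-- the context of a focused subtree: the parent's label plus
-- the siblings to the left and to the right of the focus
data Crumb a = Crumb a (Forest a) (Forest a)
type Loc a = (Tree a, [Crumb a])

-- move to the i-th child, remembering how to get back up
down :: Int -> Loc a -> Maybe (Loc a)
down i (Node x ts, cs) = case splitAt i ts of
  (ls, t:rs) -> Just (t, Crumb x ls rs : cs)
  _          -> Nothing

-- move back to the parent, reassembling the tree around the focus
up :: Loc a -> Maybe (Loc a)
up (_, [])                 = Nothing
up (t, Crumb x ls rs : cs) = Just (Node x (ls ++ t : rs), cs)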
Data.Tree deliberately does not have nodes reference their own parents. This way, changes to parent nodes (or to nodes in other branches) do not need to re-create the entire tree, and nodes in memory may be shared between multiple trees. The term for structures like this is "persistent".
You may implement a node which knows its own parent; the resulting structure will not be persistent. The clear alternative is to always know where the root node of your tree is; that's good practice in any language.
One library that does allow navigating a Data.Tree while keeping track of parents is rosezipper.
Try this:
data Tree a = Leaf | Node a (Tree a) (Tree a) (Tree a) deriving (Eq, Show)
You can save the parent tree in the third (or any other) subtree field.
Example:
singleton :: (Ord a) => a -> Tree a
singleton x = Node x Leaf Leaf Leaf

insert :: (Ord a) => a -> Tree a -> Tree a
insert x Leaf = singleton x
insert x t@(Node a l r p) = insertIt x t p
  where
    insertIt x Leaf (Node a l r p) = Node x Leaf Leaf (Node a Leaf Leaf Leaf)  -- (*)
    insertIt x t@(Node a l r p) parent
      | x == a = t
      | x < a  = Node a (insertIt x l t) r p
      | x > a  = Node a l (insertIt x r t) p
(*) In this line you could instead save the whole parent:
insertIt x Leaf parent = Node x Leaf Leaf parent
