What Self Balancing Tree is simplest in Functional Programming? - haskell

I'm designing a self balancing tree in Haskell. As an exercise and because it is nice to have in your back hand.
Previously in C and Python I preferred Treaps and Splay Trees due to their simple balancing rules. I always disliked R/B Trees, since they seemed like more work than they were worth.
Now, due to the functional nature of Haskell, things seem to have changed. I can write a R/B insert function in 10 lines of code. Treaps on the other hand requires wrapping to store the random number generator, and Splay Trees are a pain to do top-down.
So I'm asking if you have experience with other types of trees?
Which ones are better at utilizing the pattern matching and top-down nature of functional languages?

Ok, I guess there wasn't a lot of references or research for answering this question. Instead I've taken the time to try your different ideas and trees. I didn't find anything a lot better than RB trees, but perhaps that's just search bias.
The RB tree can be (insertion) balanced with four simple rules, as shown by Chris Okasaki:
balance T (T R (T R a x b) y c) z d = T R (T B a x b) y (T B c z d)
balance T (T R a x (T R b y c)) z d = T R (T B a x b) y (T B c z d)
balance T a x (T R b y (T R c z d)) = T R (T B a x b) y (T B c z d)
balance T a x (T R (T R b y c) z d) = T R (T B a x b) y (T B c z d)
balance T a x b = T B a x b
AVL trees can be balanced in a similar pattern matching way. However the rules don't compress as well:
balance T (T (T a x b dx) y c (-1)) z d (-2) = T (T a x b dx) y (T c z d 0) 0
balance T a x (T b y (T c z d dz) 1 ) 2 = T (T a x b 0) y (T c z d dz) 0
balance T (T a x (T b y c 1 ) 1 ) z d (-2) = T (T a x b -1) y (T c z d 0) 0
balance T (T a x (T b y c (-1)) 1 ) z d (-2) = T (T a x b 0) y (T c z d 1) 0
balance T (T a x (T b y c _ ) 1 ) z d (-2) = T (T a x b 0) y (T c z d 0) 0
balance T a x (T (T b y c 1 ) z d (-1)) 2 = T (T a x b -1) y (T c z d 0) 0
balance T a x (T (T b y c (-1)) z d (-1)) 2 = T (T a x b 0) y (T c z d 1) 0
balance T a x (T (T b y c _ ) z d (-1)) 2 = T (T a x b 0) y (T c z d 0) 0
balance t = t
As AVL trees seams to generally be considered inferior to RB trees, they are probably not worth the extra hassle.
AA trees could theoretically be balanced nice and easily by:
balance T n (T n a x b) y c = T n a x (T n b y c) -- skew
balance T n a x (T n b y (T n c z d)) = T (n+1) (T n a x b) y (T n c z d) --split
balance T n a x b = T n a x b
But unfortunately Haskell don't like the overloading of n. It is possible that a less standard implementation of AA trees, not using ranks, but something more similar to R and B, would work well.
Splay trees are difficult because you need to focus on a single node, rather than the static structure of the tree. It can be done by merging insert and splay.
Treaps are also uneasy to do in a functional environment, as you don't have a global random generator, but need to keep instances in every node. This can be tackled by leaving the task of generating priorities to the client, but even then, you can't do priority comparison using pattern matching.

As you say Red Black trees aren't that hard to use. Have you given finger trees a look? You might be interested in augmenting your base data structure with something like a zipper. Another tree you might find interesting is the AA tree it is a simplification of Red Black Trees.

It's the one that's already implemented.
There are fine implementations in Haskell of balanced trees such as Data.Map and Data.Set. Don't they fulfill your needs? Don't reimplement, reuse.

The OCaml standard library uses an AVL tree for its map functor. It seems as though it's easier to implement than an RB-tree if you include a remove operation.

Related

Stack overflow when constructing/evaluating a red black tree in Haskell

I have the following Red Black tree:
data Tree a
= E
| S a
| C !Color !(Tree a) !(Tree a)
data Color = R | B
In case of this tree, all the data are stored in the leaves (the S constructor). I have written an insert function like the standard Okasaki red black trees[1] (modifying the parts where the values are stored in the internal nodes)
In this cases I populate the tree with 10 million elements:
l = go 10000000 E
where
go 0 t = insert 0 t
go n t = insert t $ go (n - 1) t
When I try to evaluate the left most element (leaf) of the tree like this:
left :: Tree a -> Maybe a
left E = Nothing
left (S x) = Just x
left (C _ _ l _) = left l
I encounter the following:
left l
*** Exception: stack overflow
Is this owing to the way that I am constructing the tree (non tail recursive) or is there some missing space leak that I cannot see.
Please note the function works fine for a million elements. Additionally I attempted a tail recursive way of the tree construction:
l = go 10000000 E
where
go 0 t = insert 0 t
go n t = go (n - 1) (insert n t)
but encountered the same stack overflow exception.
[1] https://www.cs.tufts.edu/~nr/cs257/archive/chris-okasaki/redblack99.pdf
EDIT
The insert and balance function for completeness:
insert :: Ord a => a -> Tree a -> Tree a
insert x xs = makeBlack $ ins xs
where
ins E = S x
ins (S a) = C R (S x) (S a)
ins (C c l r) = balance c (ins l) r -- always traverse left and trust the balancing
makeBlack (C _ l r) = C B l r
makeBlack a = a
balance :: Color -> Tree a -> Tree a -> Tree a
balance B (C R (C R a b) c) d = C R (C B a b) (C B c d)
balance B (C R a (C R b c)) d = C R (C B a b) (C B c d)
balance B a (C R (C R b c) d) = C R (C B a b) (C B c d)
balance B a (C R b (C R c d)) = C R (C B a b) (C B c d)
balance color a b = C color a b
There was mistyping from my end while typing in the insert code, it is insert n $ go (n - 1) t and not insert t $ go (n - 1) t. However when actually encountering the stack overflow the code was correct and the overflow happened in ghci.
The first example of insertion code has a bug: it tries to insert the tree itself as an element.
The second version
l = go 10000000 L.empty where
go 0 t = L.cons 0 t
go n t = go (n - 1) (L.cons n t)
Is indeed tail recursive, but it still has a problem: it doesn't at any step "force" the tree while it is being constructed. Due to Haskell's laziness, go will return a thunk that hides 10000000 pending applications of L.cons.
When the runtime tries to "pop" that thunk, it will put each n variable in the stack while the thunk below is being "popped" in its turn, causing the stack overflow. "Function calls don't add stack frames in Haskell; instead, stack frames come from nesting thunks."
The solution is to force each intermediate tree to WHNF, so that thunks don't accumulate. This should be enough (using the BangPatterns extension):
l :: Tree Int
l = go 10000000 L.empty
where
go 0 !t = L.cons 0 t
go n !t = go (n - 1) (L.cons n t)
This basically means: "before recursing to add another element, make sure the accumulator is in WHNF". The n need not be forced because it is scrutinized in the pattern-match.

Summing up all the nodes a tree with a generic type. (Haskell)

I have been trying to write a code which takes all the integers in a tree and return a sum of them. I'm trying to do this with type a, which is from a data time:
data Tree a = Nil | Value a (Tree a) (Tree a)
deriving Show
and we want to use:
tree = Value 2 (Value 2 (Value 2 Nil Nil) Nil) (Value 2 Nil Nil)
and my code is as follow:
countTree :: (a -> a -> a) -> a -> Tree a -> a
countTree p k (Nil) = h
countTree p k (Value x y z) = x (+) (countTree p k y) (+) (countTree p k z)
and I want to run my code as countTree (+) 0 tree and the results should return 8.
The problem is that when I run my code it tells me that x has four arguments but it's type a has zero which I honestly don't understand why. I've modifying sections of my code, but no success once so ever, I could really use some assistance.
x (+) (countTree p k y) (+) (countTree p k z)
is attempting to treat x as a function, and pass to it as arguments all of
(+) (countTree p k y) (+) (countTree p k z)
If you want to have "x + recur left + recur right", you'd want something like:
x + (countTree p k y) + (countTree p k z)
I'm pretty sure however you actually want to use p, not + hard coded. Using prefix notation, you'd have to rearrange it a bit to something like :
(p (p x (countTree p k y)) (countTree p k z))
Or, you could use backticks to inline the calls to p as #bipll suggested:
x `p` (countTree p k y) `p` (countTree p k z)
A side note, but I'm also pretty sure you want h to be k.

Difficulties in understanding algebraic data type

I am not quite sure what this ZInt is actually describing.
data Nat = Zero | S Nat
data ZInt = Z Nat Nat deriving Show
addZ :: ZInt -> ZInt -> ZInt
addZ (Z a b) (Z c d) = Z (add a c) (add b d)
with
add :: Nat -> Nat -> Nat
add a Zero = a
add a (S b) = S (add a b)
mult :: Nat -> Nat -> Nat
mult _ Zero = Zero
mult a (S b) = add a (mult a b)
At first glance i thought maybe it's a presentation of complex numbers, adding imaginary and real components (in function addZ) without displaying form of
a+b*i
But what is happening in this functions?
subZ :: ZInt -> ZInt -> ZInt
subZ (Z a b) (Z c d) = Z (add a d) (add b c)
multZ :: ZInt -> ZInt -> ZInt
multZ (Z a b) (Z c d) = Z (add (mult a d) (mult c b)) (add (mult a c) (mult b d))
So I do understand data Nat = Zero | S Nat and also the add and mult functions, but not addZ, subZ and multZ.
It's just integer numbers. Nat represents a natural number. ZInt represents an integer number. In Z a b if a >= b then integer is a - b else -(b - a).
For example:
ZInt representation | Traditional representation
Z Zero Zero | 0
Z (S Zero) Zero | 1
Z Zero (S Zero) | -1
Z (S Zero) (S Zero) | 0
...
As we can see, to negate an integer you just swap the Nat values in its representation:
negate :: ZInt -> ZInt
negate (Z n m) = Z m n
And we can define subZ like this:
a `subZ` b = a `addZ` negate b
This representation is not canonical, Z (S Zero) (S Zero) is the same integer as Z Zero Zero. So, we can define canonical form like this:
canonical :: ZInt -> ZInt
canonical (Z (S n) (S m)) = canonical (Z n m)
canonical x = x
What reason is to define integer numbers by this way?
First of all, it mathematically clear. If someone defined the set of natural number named N the we can easy define the set of integers named Z as Z = N * N where (*) is product of two sets.
In Haskell, I can see only one reason for that. By this way we can define integer numbers on type level.
First, ZInt is representing each integer as an ordered pair of natural numbers. #freestyle covers how this representation works well; I will just expand on how the arithmetic operators take advantage of this encoding.
addZ, subZ and multZ are simply manipulating the pair of natural numbers that represent each integer.
addZ (Z a b) (Z c d) = Z (add a c) (add b d)
(a - b) + (c - d) == a - b + c - d
== a + c - b - d
== (a + c) - (b + d)
subZ (Z a b) (Z c d) = Z (add a d) (add b c)
(a - b) - (c - d) == a - b - c + d
== a + d - b - c
== (a + d) - (b + c)
multZ (Z a b) (Z c d) = Z (add (mult a d) (mult c b)) (add (mult a c) (mult b d))
(a - b) * (c - d) == ac - ad - bc + bd
== ac + bd - ad - bc
== (ac + bd) - (ad + bc)
Note that the given definition of multZ can get the sign wrong; it should be
multZ (Z a b) (Z c d) = Z (add (mult a c) (mult b d)) (add (mult a d) (mult b c))
(For clarity, it should also use mult b c instead of mult c b, even though multiplication of natural numbers is commutative.)

F# version of haskell pattern match

How do I do this haskell in F# cleanly?
add 1 2 x = 3 + x
add 1 x y = 1 + x + y
add z x y = z + x + y
You can't overload the function itself, but you can use pattern matching directly:
let add z x y = // curried multiple parameters
match z, x, y with // convert to three-tuple to match on
| 1, 2, x -> 3 + x
| 1, x, y -> 1 + x + y
| z, x, y -> z + x + y
Usage is as expected: add 1 2 3
If you're willing to use tuples as arguments (ie forgo currying and partial application), you can even write it more shorthand:
let add = // expect three-tuple as first (and only) parameter
function // use that one value directly to match on
| 1, 2, x -> 3 + x
| 1, x, y -> 1 + x + y
| z, x, y -> z + x + y
Usage now is: add (1, 2, 3)
Recall in Haskell that the general form of functions as a list of declarations with patterns:
f pat1 ... = e1
f pat2 ... = e2
f pat3 ... = e3
is just sugar for the case analysis:
f x1 .. xn = case (x1, .. xn) of
(pat1, ..., patn) -> e1
(pat2, ..., patn) -> e2
(pat3, ..., patn) -> e3
so the same translation can be made to other languages with pattern matching but without declaration-level patterns.
This is purely syntactic. Languages like Haskell, Standard ML and Mathematica allow you to write out different match cases as if they were different functions:
factorial 0 = 1
factorial 1 = 1
factorial n = n * factorial(n-1)
whereas languages like OCaml and F# require you to have a single function definition and use match or equivalent in its body:
let factorial = function
| 0 -> 1
| 1 -> 1
| n -> n * factorial(n-1)
Note that you don't have to copy the function name over and over again using this syntax and you can factor match cases more easily:
let factorial = function
| 0 | 1 -> 1
| n -> n * factorial(n-1)
As yamen wrote, do currying with let f a b = match a, b with ... in F#.
In the classic red-black tree implementation, I find the duplication of the function names and right-hand sides in Standard ML and Haskell quite ugly:
balance :: RB a -> a -> RB a -> RB a
balance (T R a x b) y (T R c z d) = T R (T B a x b) y (T B c z d)
balance (T R (T R a x b) y c) z d = T R (T B a x b) y (T B c z d)
balance (T R a x (T R b y c)) z d = T R (T B a x b) y (T B c z d)
balance a x (T R b y (T R c z d)) = T R (T B a x b) y (T B c z d)
balance a x (T R (T R b y c) z d) = T R (T B a x b) y (T B c z d)
balance a x b = T B a x b
compared to the equivalent OCaml or F#:
let balance = function
| B, z, (T(R, y, T(R, x, a, b), c) | T(R, x, a, T(R, y, b, c))), d
| B, x, a, (T(R, z, T(R, y, b, c), d) | T(R, y, b, T(R, z, c, d))) ->
T(R, y, T(B, x, a, b), T(B, z, c, d))
| a, b, c, d -> T(a, b, c, d)

How can I check if a BST is valid?

How can I check if a BST is a valid one, given its definition and using a generalized version of fold for BST?
data(Ord a, Show a, Read a) => BST a = Void | Node {
val :: a,
left, right :: BST a
} deriving (Eq, Ord, Read, Show)
fold :: (Read a, Show a, Ord a) => (a -> b -> b -> b) -> b -> BST a -> b
fold _ z Void = z
fold f z (Node x l r) = f x (fold f z l) (fold f z r)
The idea is to check that a node value is greater then all values in left-subtree and smaller than all values in its right-subtree. This must be True for all nodes in the tree. A function bstList simply output the list of (ordered) values in the BST.
Of course something like this won't work:
--isBST :: (Read a, Show a, Ord a) => BST a -> Bool
isBST t = fold (\x l r -> all (<x) (bstList l) && all (>x) (bstList r)) (True) t
because, for example, applying the fold function to the node 19 ends up all (<19) (bstList True) && all (>19) (bstList True).
Your problem seems to be that you lose information because your function only returns a boolean when it examines the left and right subtrees. So change it to also return the minimum and maximum values of the subtrees. (This is probably more efficient as well, since you don't need to used bslist to check all elements anymore)
And make a wrapper function to ignore these "auxiliary" values after you are done, of course.
(Please don't put typeclass constraints on the data type.)
A BST is valid iff an in-order traversal is monotonically increasing.
flatten tree = fold (\a l r -> l . (a:) . r) id tree []
ordered list#(_:rest) = and $ zipWith (<) list rest
ordered _ = True
isBST = ordered . flatten
A nice way of encoding this is to lean on the traversal provided by Data.Foldable.
{-# LANGUAGE DeriveFunctor, DeriveFoldable #-}
import Data.Foldable
import Data.Monoid
We can derive an instance of it automatically using an extension, but we need to reorder the fields of the Node constructor to provide us an in-order traversal.
While we're at it, we should eliminate the constraints on the data type itself. They actually provide no benefit, and has been removed from the language as of Haskell 2011. (When you want to use such constraints you should put them on instances of classes, not on the data type.)
data BST a
= Void
| Node
{ left :: BST a
, val :: a
, right :: BST a
} deriving (Eq, Ord, Read, Show, Foldable)
First we define what it means for a list to be strictly sorted.
sorted :: Ord a => [a] -> Bool
sorted [] = True
sorted [x] = True
sorted (x:xs) = x < head xs && sorted xs
-- head is safe because of the preceeding match.
Then we can use the toList method provided by Data.Foldable and the above helper.
isBST :: Ord a => BST a -> Bool
isBST = sorted . toList
We can also implement this more directly, like you asked. Since we removed the spurious constraints on the data type, we can simplify the definition of your fold.
cata :: (b -> a -> b -> b) -> b -> BST a -> b
cata _ z Void = z
cata f z (Node l x r) = f (cata f z l) x (cata f z r)
Now we need a data type to model the result of our catamorphism, which is that we either have no nodes (Z), or a range of strictly increasing nodes (T) or have failed (X)
data T a = Z | T a a | X deriving Eq
And we can then implement isBST directly
isBST' :: Ord a => BST a -> Bool
isBST' b = cata phi Z b /= X where
phi X _ _ = X
phi _ _ X = X
phi Z a Z = T a a
phi Z a (T b c) = if a < b then T a c else X
phi (T a b) c Z = if b < c then T a c else X
phi (T a b) c (T d e) = if b < c && c < d then T a e else X
This is a bit tedious, so perhaps it would be better to decompose the way we compose the interim states a bit:
cons :: Ord a => a -> T a -> T a
cons _ X = X
cons a Z = T a a
cons a (T b c) = if a < b then T a c else X
instance Ord a => Monoid (T a) where
mempty = Z
Z `mappend` a = a
a `mappend` Z = a
X `mappend` _ = X
_ `mappend` X = X
T a b `mappend` T c d = if b < c then T a d else X
isBST'' :: Ord a => BST a -> Bool
isBST'' b = cata phi Z b /= X where
phi l a r = l `mappend` cons a r
Personally, I'd probably just use the Foldable instance.
If you don't insist on using a fold you can do it like this:
ord Void = True
ord (Node v l r) = every (< v) l && every (> v) r && ord l && ord r where
every p Void = True
every p (Node v l r) = p v && every p l && every p r

Resources