Trees with values on the leaves only - Haskell

A few years ago, during a C# course I learned to write a binary tree that looked more or less like this:
data Tree a = Branch a (Tree a) (Tree a) | Leaf
I saw the benefit of it: it had its values on the branches, which allowed for quick and easy lookup and insertion of values, because the search would encounter a value at the root of each subtree all the way down until it hit a leaf, which held no value.
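For instance, lookup in that first kind of tree can descend one side at each step (a minimal sketch of my own, assuming an Ord instance and sorted placement):

-- Compare against the value at each Branch and descend only one side.
member :: Ord a => a -> Tree a -> Bool
member _ Leaf = False
member x (Branch v l r)
  | x == v    = True
  | x < v     = member x l
  | otherwise = member x r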
Ever since I started learning Haskell, however, I've seen numerous examples of trees that are defined like this:
data Tree a = Branch (Tree a) (Tree a) | Leaf a
That definition puzzles me. I can't see the usefulness of having data only on the elements that don't branch, because the internal structure then carries no values at all, which to me seems like a poorly designed alternative to a List. It also makes me question its lookup time, since it can't assess which branch to go down to find the value it's looking for, but rather needs to go through every node until it finds what it's looking for.
So, can anyone shed some light on why the second version (value on leaves) is so much more prevalent in Haskell than the first version?

I think this depends on what you're trying to model and how you're trying to model it.
A tree where the internal nodes store values and the leaves are just leaves is essentially a standard binary tree (treat each leaf as NULL and you basically have an imperative-style binary tree). If the values are stored in sorted order, you now have a binary search tree. There are many specific advantages to storing data this way, most of which transfer directly over from imperative settings.
Trees where the leaves store the data and the internal nodes are just for structure do have their advantages. For example, red/black trees support two powerful operations called split and join that have advantages in some circumstances. split takes as input a key, then destructively modifies the tree to produce two trees, one of which contains all keys less than the specified input key and one containing the remaining keys. join is, in a sense, the opposite: it takes in two trees where one tree's values are all less than the other tree's values, then fuses them together into a single tree. These operations are particularly difficult to implement on most red/black trees, but are much simpler if all the data is stored in the leaves only rather than in the internal nodes. This paper detailing an imperative implementation of red/black trees mentions that some older implementations of red/black trees used this approach for this very reason.
As another potential advantage of storing keys in the leaves, suppose that you want to implement the concatenate operation, which joins two lists together. If you don't have data in the leaves, this is as simple as
concat first second = Branch first second
This works because no data is stored in those nodes. If the data is stored in the leaves, you need to somehow move a key from one of the leaves up to the new concatenation node, which takes more time and is trickier to work with.
Finally, in some cases, you might want to store the data in the leaves because the leaves are fundamentally different from internal nodes. Consider a parse tree, for example, where the leaves store specific terminals from the parse and the internal nodes store all the nonterminals in the production. In this case, there really are two different types of nodes, so it doesn't make sense to store arbitrary data in the internal nodes.
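For instance, a tiny expression grammar might be modelled like this (a sketch of my own, not from the original post):

-- Terminals (literals) live at the leaves, nonterminals (operators) at
-- the internal nodes: the two kinds of node really are different things.
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr
  deriving Show

eval :: Expr -> Int
eval (Lit n)   = n
eval (Add l r) = eval l + eval r
eval (Mul l r) = eval l * eval r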
Hope this helps!

You described a tree with data at the leaves as "a poorly designed alternative to a List."
I agree that this could be used as an alternative to a list, but it's not necessarily poorly designed! Consider the data type
data Tree t = Leaf t | Branch (Tree t) (Tree t)
You can define cons and snoc (append to end of list) operations -
cons :: t -> Tree t -> Tree t
cons t (Leaf s) = Branch (Leaf t) (Leaf s)
cons t (Branch l r) = Branch (cons t l) r
snoc :: Tree t -> t -> Tree t
snoc (Leaf s) t = Branch (Leaf s) (Leaf t)
snoc (Branch l r) t = Branch l (snoc r t)
These run (for roughly balanced lists) in O(log n) time where n is the length of the list. This contrasts with the standard linked list, which has O(1) cons and O(n) snoc operations. You can also define a constant-time append (as in templatetypedef's answer)
append :: Tree t -> Tree t -> Tree t
append l r = Branch l r
which is O(1) for two lists of any size, whereas the standard list is O(n) where n is the length of the left argument.
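To make the list analogy concrete, flattening such a tree back into an ordinary list is a straightforward fold (my own sketch):

-- Left-to-right traversal recovers the list the tree represents.
toList :: Tree t -> [t]
toList (Leaf t)     = [t]
toList (Branch l r) = toList l ++ toList r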
In practice you would want to define slightly smarter versions of these functions which attempt to keep the tree balanced. To do this, it is often useful to have some additional information at the branches, which can be done either by having multiple kinds of branch (as in a red-black tree, which has "red" and "black" nodes) or by explicitly including additional data at the branches, as in
data Tree b a = Leaf a | Branch b (Tree b a) (Tree b a)
For example, you can support an O(1) size operation by storing the total number of elements in both subtrees in the nodes. All of your operations on the tree become slightly more complicated since you need to correctly persist the information about subtree sizes -- in effect the work of computing the size of the tree is amortized over all the operations that construct the tree (and cleverly persisted, so that minimal work is done whenever you need to reconstruct a size later).
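A minimal sketch of that size-caching idea (the type and function names here are mine):

-- Cache the element count at every branch; the smart constructor keeps
-- the cached value consistent as trees are built.
data SizedSeq a = SLeaf a | SBranch !Int (SizedSeq a) (SizedSeq a)

branch :: SizedSeq a -> SizedSeq a -> SizedSeq a
branch l r = SBranch (size l + size r) l r

-- O(1): just read the cache at the root.
size :: SizedSeq a -> Int
size (SLeaf _)       = 1
size (SBranch n _ _) = n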

More isn't necessarily better. I'll explain just a couple of basic considerations to show why your intuition fails. The general idea, though, is that different data structures need different things.
Empty leaf nodes can actually be a space (and therefore time) problem in some contexts. If a node is represented by a bit of information and two pointers to its children, you'll end up with two null pointers per node whose children are both leaves. That's two machine words per leaf node, which can add up to quite a bit of space. Some structures avoid this by ensuring that each leaf holds at least one piece of information to justify its existence. In some cases (such as ropes), each leaf may have a fairly large and dense payload.
Making internal nodes bigger (by storing information in them) makes it more expensive to modify the tree. Changing a leaf in a balanced tree typically forces you to allocate replacements for O(log n) internal nodes. If each of those is larger, you've just allocated more space and spent extra time to copy more words. The extra size of the internal nodes also means that you can fit less of the tree structure into the CPU cache.

Related

How to define Type safe constrained rose trees

I am trying to define a data structure with these characteristics:
It is a rose tree
The nodes in the tree are of variable sort
The only difference between the sorts of node is a constraint on the number of children they may take
The complete set of constraints is: None; OneOnly; TwoOnly; AtLeastOne; AtLeastTwo
I want the relevant constraint to be type checkable and checked.
(E.g. when building or editing the tree, trying to add a second child to IamJustOne :: OneOnly is an error.)
I am having difficulty getting started defining this structure (especially points 3-5).
There is information on the web on the steps needed to define a rose tree.
There is information in Data.Tree.Rose sufficient to create a rose tree with variable nodes. (Though I am still not clear on the distinction in that module between Knuth trees and Knuth forests.)
There are research level papers on heterogeneous containers well above my comprehension grade
My initial approach was to attempt to create subtypes of MyRose (not working code) as:
data MyRose sub = MyRose {label :: String, subtype :: sub, children :: [MyRose sub]}
type AtLeastOne a = snoc a [a]
type AtLeastTwo a = snoc a ( snoc a [a] )
...
instance MyRose AtLeastOne where children = AtLeastOne MyRose -- instances to provide defaults
...
instance None STree where children = Nothing
I have tried various approaches using data, newtype, class, type, and am now investigating type family and data family. None of my approaches have been productive.
Could you suggest pointers to defining this data structure? Baby's first steps would be perfectly useful; it is difficult to underestimate my level of knowledge on this topic.
Before you go the crazy advanced route, I recommend making sure that the simple route isn't Good Enough. The simple route looks like this:
data Tree = Node { label :: String, children :: Children }

data Children
  = Zero                  -- None
  | One Tree              -- OneOnly
  | Two Tree Tree         -- TwoOnly
  | Positive Tree [Tree]  -- AtLeastOne
  | Many Tree Tree [Tree] -- AtLeastTwo
Here are your criteria:
Is a rose tree -- uh, I guess?
Nodes in the tree are of variable sort -- check, the five Children constructors indicate the sort, and each Node may make a different choice of constructor
The only difference between sorts is a constraint on the number of children they may take -- check
The complete set of constraints -- check
Relevant constraint is type checkable and checked -- check, e.g. the application One child1 child2 does not typecheck
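For instance, a well-formed tree under this encoding might look like this (my own example, using the types above):

-- A root with exactly two children, one of which has exactly one child.
example :: Tree
example = Node "root"
               (Two (Node "a" Zero)
                    (Node "b" (One (Node "c" Zero))))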
Even if you could define it, a tree of this sort seems very difficult to use. The type of the tree will have to reflect its entire structure, and a client will have to carry that type around everywhere, since all operations on the tree will need to know this type in order to do anything. They won't be able to just have a Rose String or something; they will need to know the exact shape.
Let's imagine you've succeeded in your goal. Then, you may have some example tree t:
t :: OnlyTwo (AtLeastOne None)
indicating a top level with 2 nodes, each of whom has at least one child, each of which is empty. What on Earth should be the type of insert t "hello"? Of deleteMin t? You can't really know which levels of the tree may need to collapse if you delete a single node, or where you may need to grow a level if you insert one.
Maybe you have answers to these questions, and some obscure use case where this is the best solution. But since you ask for baby's first solution: I think if I were you, I would step back and ask why I really want this. What do you hope to achieve with this level of type detail? What do you want client code to look like when it consumes or builds such a tree? Answers to these questions would make for a much clearer problem.

Avoid revisiting node in an invariant directed graph

This is perhaps related to functional data structures, but I found no tags about this topic.
Say I have a syntax tree type Tree, which is organised as a DAG by simply sharing common subexpressions. For example,
data Tree = Val Int | Plus Tree Tree
example :: Tree
example = let x = Val 42 in Plus x x
Then, on this syntax tree type, I have a pure function simplify :: Tree -> Tree which, when given the root node of a Tree, simplifies the whole tree by first simplifying the children of the root node, and then handling the operation of the root node itself.
Since simplify is a pure function, and some nodes are shared, we expect not to call simplify multiple times on those shared nodes.
Here comes the problem. The whole data structure is immutable, and the sharing is transparent to the programmer, so it seems impossible to determine whether or not two nodes are in fact the same node.
The same problem happens when handling so-called "tying-the-knot" structures. By tying the knot, we produce a finite representation of an otherwise infinite data structure, e.g. let xs = 1 : xs in xs. Here xs itself is finite, but calling map succ on it does not necessarily produce a finite representation.
These problems can be summarised as follows: when the data is organised as an immutable directed graph, how do we avoid revisiting the same node, doing duplicated work, or even non-termination when the graph happens to be cyclic?
Some ideas that I have thought of:
Extend the Tree type to Tree a, making every node hold an extra a. When generating the graph, associate each node with a unique a value. The memory address would have worked here, except that the garbage collector may move any heap object at any time.
For the syntax tree example, we may store a STRef (Maybe Tree) in every node for the simplified version, but this might not be extensible, and it injects an implementation detail of a specific operation into the whole data structure itself.
This is a problem with a lot of research behind it. In general, you cannot observe sharing in a pure language like Haskell, due to referential transparency. But in practice, you can safely observe sharing as long as you restrict yourself to doing the observing in the IO monad. Andy Gill (one of the legends of the old Glasgow school of FP!) wrote a wonderful paper about this about ten years ago:
http://ku-fpg.github.io/files/Gill-09-TypeSafeReification.pdf
Very well worth reading, and the bibliography will give you pointers to prior art in this area and many suggested solutions, from "poor-man's morally-safe" approaches to fully monadic knot-tying techniques. To my mind, Andy's solution and the corresponding data-reify package on Hackage:
https://hackage.haskell.org/package/data-reify
are the most practical solutions to this problem. And I can tell from experience that they work really well in practice.
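To give a feel for how this looks, here is a rough sketch against the data-reify interface, using the Tree from the question (hedged: the TreeF name is mine, and you should check the package's current API and exact output format before relying on the details):

{-# LANGUAGE TypeFamilies #-}
import Data.Reify

data Tree = Val Int | Plus Tree Tree

-- A "pattern functor" for Tree, with recursion replaced by a reference.
data TreeF r = ValF Int | PlusF r r deriving Show

instance MuRef Tree where
  type DeRef Tree = TreeF
  mapDeRef _ (Val n)    = pure (ValF n)
  mapDeRef f (Plus l r) = PlusF <$> f l <*> f r

-- reifyGraph runs in IO and returns nodes with explicit unique ids,
-- so shared subterms show up exactly once.
main :: IO ()
main = do
  let x = Val 42
  g <- reifyGraph (Plus x x)
  print g  -- something like: Graph [(1, PlusF 2 2), (2, ValF 42)] 1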

Eq testing for large DAG structures in Haskell

I'm new to Haskell (a couple of months). I have a Haskell program that assembles a large expression DAG (not a tree, a DAG), potentially deep and with multiple merging paths (i.e., the number of different paths from root to leaves is huge). I need a fast way to test these DAGs for equality. The default Eq derivation will just recurse, exploring the same nodes multiple times. Currently this causes my program to take 60 seconds for relatively small expressions, and not even finish for larger ones. The profiler indicates it is busy checking equality most of the time. I would like to implement a custom Eq that does not have this problem. I don't have a way to solve this problem that does not involve a lot of rewriting, so I want to hear your thoughts.
My first attempt was to 'instrument' tree nodes with a hash that I compute incrementally, using Data.Hashable.hash, as I build the tree. This approach gave me an easy way to test that two things aren't equal without looking deep into the structure. But often in this DAG, because of the paths merging, the structures are indeed equal. So the hashes are equal, and I revert to full-blown equality testing.
If I had a way to do physical equality, then a lot of my problems here would go away: if two nodes are physically equal, then that's it. Otherwise, if the hashes differ, then that's it. Only go deeper if they are physically not the same but their hashes agree.
I could also imitate git and compute a SHA-1 per node to decide if nodes are equal, period (no need to recurse). I know for a fact that this would help, because if I let equality be decided fully in terms of hash equality, the program runs in tens of milliseconds for the largest expressions. This approach also has the nice advantage that if for some reason two DAGs are content-equal but not physically equal, I would still be able to detect that fast. (With IDs, I'd still have to do a traversal at that point.) So I like the semantics more.
This approach, however, involves a lot more work than just calling the Data.Hashable.hash function, because I have to derive it for every variant of the DAG node type. Moreover, I have multiple DAG representations with slightly different node definitions, so I would basically need to do this hashing trick twice or more if I decide to add more representations.
What would you do?
Part of the problem here is that Haskell has no concept of object identity, so when you say you have a DAG where you refer to the same node twice, as far as Haskell is concerned it's just two values in different places in a tree. This is fundamentally different from the OO concept where an object is identified by its location in memory, so the distinction between "same object" and "different objects with equal fields" is meaningful.
To solve your problem you need to detect when you are visiting the same object that you saw earlier, and in order to do that you need to have a concept of "same object" that is independent of the value. There are two basic ways to attack this:
Store all your objects in a vector (i.e. an array), and use the vector index as an object identity. Replace values with indices throughout your data structure.
Give each object a unique "identity" field so you can tell if you have seen this one before when traversing the DAG.
The former is how the Data.Graph module in the containers package does it. One advantage is that, if you have a single mapping from DAG to vector, then DAG equality becomes just vector equality.
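A sketch of that first approach, using the syntax-tree example from the previous question (the names here are my own):

import qualified Data.Vector as V

-- Children are referred to by index instead of by direct reference.
type NodeIx = Int
data NodeF = ValF Int | PlusF NodeIx NodeIx deriving (Eq, Show)

data DAG = DAG { dagNodes :: V.Vector NodeF  -- payloads, indexed by position
               , dagRoot  :: NodeIx }
  deriving (Eq, Show)

-- let x = Val 42 in Plus x x becomes:
example :: DAG
example = DAG (V.fromList [ValF 42, PlusF 0 0]) 1

With a single canonical mapping from DAG to vector, the derived Eq on DAG is exactly the "vector equality" mentioned above.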
Any efficient way to test for equality will be intertwined with the way you build up the DAG values.
Here is an idea which keeps track of all nodes ever created in a Map. As new nodes are added to the Map, they are assigned a unique id. Creating nodes now becomes monadic, as you have to thread this Map (and the next available id) throughout your computation. In this example the nodes are implemented as rose trees, and the order of the children is not significant - hence the call to sort when deriving the key into the map.
import Control.Monad.State
import Data.List
import qualified Data.Map as M

data Node = Node { _eqIdent  :: Int    -- equality identifier
                 , _value    :: String -- value associated with the node
                 , _children :: [Node] -- children
                 }
  deriving (Show)

type BuildState = (Int, M.Map (String, [Int]) Node)

buildNode :: String -> [Node] -> State BuildState Node
buildNode value nodes = do
  (nextid, nodeMap) <- get
  let key = (value, sort (map _eqIdent nodes)) -- the identity of the node
  case M.lookup key nodeMap of
    Nothing -> do
      let n        = Node nextid value nodes
          nodeMap' = M.insert key n nodeMap
      put (nextid + 1, nodeMap')
      return n
    Just node -> return node

nodeEquality :: Node -> Node -> Bool
nodeEquality a b = _eqIdent a == _eqIdent b
One caveat -- this approach requires that you know all the children of a node when you build it.
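For example, a shared structure like the one in the earlier question could be built like this (my own usage sketch of the code above):

-- The second buildNode "42" [] call with the same key would return the
-- already-created node; here we just reuse x directly.
sharedExample :: State BuildState Node
sharedExample = do
  x <- buildNode "42" []
  buildNode "plus" [x, x]

runExample :: Node
runExample = evalState sharedExample (0, M.empty)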

Gather data about existing tree-like data

Let's say we have existing tree-like data and we would like to add information about the depth of each node. How can we easily achieve that?
data Tree = Node Tree Tree | Leaf
For each node we would like to know, in constant time, how deep it is. We have the data from an external module, so the information is exactly as shown above. A real-life example would be an external HTML parser which just provides the XML tree, where we would like to gather data such as how many hyperlinks each node contains.
Functional languages are made for traversing trees and gathering data; there should be an easy solution.
The obvious solution would be to create a parallel structure. Can we do better?
The standard trick, which I learned from Chris Okasaki's wonderful Purely Functional Data Structures is to cache the results of expensive operations at each node. (Perhaps this trick was known before Okasaki's thesis; I don't know.) You can provide smart constructors to manage this information for you so that constructing the tree need not be painful. For example, when the expensive operation is depth, you might write:
module SizedTree (SizedTree, sizedTree, node, leaf, depth) where
data SizedTree = Node !Int SizedTree SizedTree | Leaf
node l r = Node (max (depth l) (depth r) + 1) l r
leaf = Leaf
depth (Node d _ _) = d
depth Leaf = 0
-- since we don't expose the constructors, we should
-- provide a replacement for pattern matching
sizedTree f v (Node _ l r) = f l r
sizedTree f v Leaf = v
Constructing SizedTrees costs O(1) extra work at each node (hence it is O(n) work to convert an n-node Tree to a SizedTree), but the payoff is that checking the depth of a SizedTree -- or of any subtree -- is an O(1) operation.
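Converting an existing Tree from the question into the cached form is then a single O(n) pass (a small sketch, assuming the question's Tree type is in scope):

fromTree :: Tree -> SizedTree
fromTree Leaf       = leaf
fromTree (Node l r) = node (fromTree l) (fromTree r)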
You do need some other data type where you can store these Ints. Define Tree as
data Tree a = Node (Tree a) a (Tree a) | Leaf a
and then write a function
annDepth :: Tree a -> Tree (Int, a)
Your original Tree is Tree () and with pattern synonyms you can recover nice constructors.
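One possible implementation of annDepth, annotating each node with its distance from the root (a sketch of mine, matching the Tree definition above):

annDepth :: Tree a -> Tree (Int, a)
annDepth = go 0
  where
    go d (Leaf a)     = Leaf (d, a)
    go d (Node l a r) = Node (go (d + 1) l) (d, a) (go (d + 1) r)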
If you want to preserve the original tree for some reason, you can define a view:
{-# LANGUAGE GADTs, DataKinds #-}
data Shape = SNode Shape Shape | SLeaf
data Tree a sh where
  Leaf :: a -> Tree a SLeaf
  Node :: Tree a lsh -> a -> Tree a rsh -> Tree a (SNode lsh rsh)
With this you have a guarantee that an annotated tree has the same shape as the unannotated one. But this doesn't work well without proper dependent types.
Also, have a look at the question Boilerplate-free annotation of ASTs in Haskell?
The standard solution is what @DanielWagner suggested: just extend the data structure. This can be somewhat inconvenient, but it can be solved with smart constructors for creating instances and records for pattern matching.
Perhaps Data types à la carte could help, although I haven't used this approach myself. There is a library, compdata, based on that idea.
A completely different approach would be to efficiently memoize the values you need. I was trying to solve a similar problem and one of the solutions is provided by the library stable-memo. Note that this isn't a purely functional approach, as the library is internally based on object identity, but the interface is pure and works perfectly for the purpose.
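A rough sketch of the stable-memo approach, assuming its memo :: (a -> b) -> a -> b interface (hedged: check the package docs for the exact API before relying on this):

import Data.StableMemo (memo)

data Tree = Node Tree Tree | Leaf

-- memo keys on each node's identity (via its StableName), so a subtree
-- that is shared in memory is only traversed once.
depth :: Tree -> Int
depth = memo go
  where
    go (Node l r) = 1 + max (depth l) (depth r)
    go Leaf       = 0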

What are the names used in computer science for some of the following tree data types?

Sometimes I find myself using different types of trees in Haskell, and I don't know what they are called, where to get more information on algorithms using them or class instances for them, or whether there is some pre-existing code or library on Hackage.
Examples:
Binary trees where the labels are on the leaves or the branches:
data BinTree1 a = Leaf |
                  Branch {label :: a, leftChild :: BinTree1 a, rightChild :: BinTree1 a}
data BinTree2 a = Leaf {label :: a} |
                  Branch {leftChild :: BinTree2 a, rightChild :: BinTree2 a}
Similarly, trees with a label on each child node, or a general label for all the children:
data Tree1 a = Branch {label :: a, children :: [Tree1 a]}
data Tree2 a = Branch {labelledChildren :: [(a, Tree2 a)]}
Sometimes I start using Tree2 and somehow, in the course of development, it gets refactored into Tree1, which seems simpler to deal with, but I never gave it much thought. Is there some kind of duality here?
Also, if you can post some other different kinds of trees that you think are useful, please do.
In summary: everything you can tell me about those trees will be useful! :)
Thanks.
EDIT:
Clarification: this is not homework. It's just that I usually end up using these data types and creating instances (Functor, Monad, etc...), and maybe if I knew their names I would find libraries with this stuff already implemented and more theoretical information on them.
Usually when a library on Hackage has Tree in the name, it implements BinTree2 or some version of a non-binary tree with labels only on the leaves, so it seems to me that maybe Tree2 and BinTree2 have some other name or identifier.
Also I feel that there may be some kind of duality or isomorphism, or a way of turning code that uses Tree1 into code that uses Tree2 with some transformation. Is there? Maybe it's just an impression.
The names I've heard:
BinTree1 is a binary tree.
BinTree2: I don't know a name for it, but you can use such a tree to represent a prefix-free code, like Huffman coding, for example.
Tree1 is a rose tree.
Tree2 is isomorphic to [Tree1] (a forest of Tree1); another way to view it is as a Tree1 without a label at the root.
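That correspondence can be written out directly (constructors renamed Branch1/Branch2 here to avoid the name clash between the two definitions above):

data Tree1 a = Branch1 a [Tree1 a]
data Tree2 a = Branch2 [(a, Tree2 a)]

-- Each labelled child of a Tree2 becomes one rose tree in the forest.
toForest :: Tree2 a -> [Tree1 a]
toForest (Branch2 cs) = [ Branch1 a (toForest t) | (a, t) <- cs ]

fromForest :: [Tree1 a] -> Tree2 a
fromForest ts = Branch2 [ (a, fromForest cs) | Branch1 a cs <- ts ]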
A binary tree that only has labels in the leaves (BinTree2) is usually used for hash maps, because the tree structure itself doesn't offer any information other than the binary position of the leaves.
So, if you have 4 values with the following hash codes:
...000001 A
...000010 B
...000011 C
...000010 D
... you might store them in a binary tree (an implicit Patricia trie) like so:
        +        <- Bit #1 (least significant bit) of hash code
       / \          0 = left, 1 = right
      /   \
  [B, D]   +     <- Bit #2
          / \
         /   \
       [A]   [C]
We see that since the hash codes of B and D "start" with 0, they are stored in the left root child. They have exactly the same hash codes, so no more forks are necessary. The hash codes of A and C both "start" with 1, so another fork is necessary. A has bit 2 as 0, so it goes to the left, and C with 1 goes to the right.
This hash table implementation is kind of bad, because hashes might have to be recomputed when certain elements are inserted, but no matter.
BinTree1 is just an ordinary binary tree, and is used for fast order-based sets. Nothing more to say about it, really.
The only difference between Tree1 and Tree2 is that Tree2 can't have a label at the root node. This means that, if used as a prefix tree, it cannot contain the empty string. It has very limited use, and I haven't seen anything like it in practice. Tree1, however, obviously has a use as a non-binary prefix tree, as I said.
