Is it possible to represent Set as Tree in Haskell? - haskell

I'm reading Purely Functional Data Structres and trying to solve exercises they give in haskell.
I've defined Tree in a standard way data Tree a = Empty | Node a (Tree a) (Tree a)
I'd like to define Set as Tree where nodes are instance of Ord. Is there a way to express this in Haskell? Something like type Set = Tree Ord or I deemed to reimplement tree every time I want express some data structure as tree?

Do you intend to use the tree as a binary search tree (BST)? For that to work, you need all operations to preserve the strong BST property: for each Node constructor in the BST, all Nodes in the left subtree contain elements less than the one in the current Node, all Nodes in the right subtree contain elements greater.
You absolutely need to preserve that property. Once it's no longer true, all further BST operations lose correctness guarantees. This conclusion has consequences. You can't expose the constructors of the type. You can't expose operations to work on it that don't preserve the BST property.
So every operation that operates on a set needs to have access to the Ord instance for the type constrained in the node. (Well, other than a couple special cases like checking if a set is empty or creating a singleton set, which never have to deal with the order of children.)
At that point, what exactly can you share about the tree type with other uses? You can't share operations. You can't use the constructors. That leaves, well.. nothing useful. You could share the name of the type with no ways to do much of anything with it.
So, no.. You can't share a binary search tree data type with other uses that just seek an arbitrary binary tree. Attempting to do so just results in a broken BST.

This is exactly what Data.Set module from containers library does:
http://hackage.haskell.org/package/containers-0.5.10.2/docs/src/Data-Set-Internal.html#Set
You can try to read it's implementation. type Set = Tree Ord won't compile if you're not using some exotic language extensions. But you're probably don't want to use them and more interested in functions like:
insert :: Ord a => a -> Tree a -> Tree a
find :: Ord a => a -> Tree a -> Maybe a
delete :: Ord a => a -> Tree a -> Tree a
This is how type classes meant to be used.

Related

How to define Type safe constrained rose trees

I am trying to define a data structure with these characteristics:
It is a rose tree
The nodes in the tree are of variable sort
The only difference between the sorts of node is a constraint on the number of children they may take
The complete set of constraints is: None; OneOnly; TwoOnly; AtLeastOne; AtLeastTwo
I want the relevant constraint to be type checkable and checked.
(Eg when building, or editing the tree, trying to add a second child to IamJustOne :: OneOnly is an error)
I am having difficulty getting started defining this structure (especially points 3-5).
There is information on the web on the steps needed to define a rose tree.
There is information in Data.Tree.Rose sufficient to create a rose tree with variable nodes. (Though I am still not clear on the distinction in that module between Knuth trees, and Knuth forests.)
There are research level papers on heterogeneous containers well above my comprehension grade
My initial approach was to attempt to create subtypes of MyRose (not working code) as:
data MyRose sub = MyRose {label :: String, subtype :: sub, children :: [MyRose sub]}
type AtLeastOne a = snoc a [a]
type AtLeastTwo a = snoc a ( snoc a [a] )
...
instance MyRose AtLeastOne where children = AtLeastOne MyRose -- instances to provide defaults
...
instance None STree where children = Nothing
I have tried various approaches using data, newtype, class, type, and am now investigating type family and data family. None of my approaches have been productive.
Could you suggest pointers to defining this data structure. Babies first steps would be perfectly useful - it is difficult to underestimate my level of knowledge on this topic.
Before you go the crazy advanced route, I recommend making sure that the simple route isn't Good Enough. The simple route looks like this:
data Tree = Node { label :: String, children :: Children }
data Children
= Zero
| One Tree
| Two Tree Tree
| Positive Tree [Tree]
| Many Tree Tree [Tree]
Here's your criteria:
Is a rose tree -- uh, I guess?
Nodes in the tree are of variable sort -- check, the five Children constructors indicate the sort, and each Node may make a different choice of constructor
The only difference between sorts is a constraint on the number of children they may take -- check
The complete set of constraints -- check
Relevant constraint is type checkable and checked -- check, e.g. the application One child1 child2 does not typecheck
Even if you could define it, a tree of this sort seems very difficult to use. The type of the tree will have to reflect its entire structure, and a client will have to carry that type around everywhere, since all operations on the tree will need to know this type in order to do anything. They won't be able to just have a Rose String or something, they will need to know the exact shape.
Let's imagine you've succeeded in your goal. Then, you may have some example tree t:
t :: OnlyTwo (AtLeastOne None)
indicating a top level with 2 nodes, each of whom has at least one child, each of which is empty. What on Earth should be the type of insert t "hello"? Of deleteMin t? You can't really know which levels of the tree may need to collapse if you delete a single node, or where you may need to grow a level if you insert one.
Maybe you have answers to these questions, and some obscure use case where this is the best solution. But since you ask for baby's first solution: I think if I were you, I would step back and ask why I really want this. What do you hope to achieve with this level of type detail? What do you want client code to look like when it consumes or builds such a tree? Answers to these questions would make for a much clearer problem.

Traversals as multilenses: What would be an example?

I'm currently working on understanding the lens library in detail by following the explanation in https://en.wikibooks.org/wiki/Haskell/Lenses_and_functional_references#The_scenic_route_to_lenses, which starts with a Traversal and then introduces the Lens. The wiki makes the distinction that a Traversal can have multiple targets, while a Lens has a single target.
This arguments is also made in the documentation for Traversal, "[Traversals] have also been known as multilenses".
What would be example code to illustrate this distinction? E.g. how can I use a Traversal to get or set multiple values in a way I cannot do with a Lens?
Suppose you have a type
data Pair a = Pair a a
This offers two natural lenses and two additional natural traversals.
-- Work with the first or second component
_1, _2 :: Lens' (Pair a) a
-- Work with both components, either left to
-- right or right to left
forwards, backwards :: Traversal' (Pair a) a
There are fewer things you can do with a traversal than a lens, but you (often) get more traversals than lenses.

Gather data about existing tree-like data

Let's say we have existing tree-like data and we would like to add information about depth of each node. How can we easily achieve that?
Data Tree = Node Tree Tree | Leaf
For each node we would like to know in constant complexity how deep it is. We have the data from external module, so we have information as it is shown above. Real-life example would be external HTML parser which just provides the XML tree and we would like to gather data e.g. how many hyperlinks every node contains.
Functional languages are created for traversing trees and gathering data, there should be an easy solution.
Obvious solution would be creating parallel structure. Can we do better?
The standard trick, which I learned from Chris Okasaki's wonderful Purely Functional Data Structures is to cache the results of expensive operations at each node. (Perhaps this trick was known before Okasaki's thesis; I don't know.) You can provide smart constructors to manage this information for you so that constructing the tree need not be painful. For example, when the expensive operation is depth, you might write:
module SizedTree (SizedTree, sizedTree, node, leaf, depth) where
data SizedTree = Node !Int SizedTree SizedTree | Leaf
node l r = Node (max (depth l) (depth r) + 1) l r
leaf = Leaf
depth (Node d _ _) = d
depth Leaf = 0
-- since we don't expose the constructors, we should
-- provide a replacement for pattern matching
sizedTree f v (Node _ l r) = f l r
sizedTree f v Leaf = v
Constructing SizedTrees costs O(1) extra work at each node (hence it is O(n) work to convert an n-node Tree to a SizedTree), but the payoff is that checking the depth of a SizedTree -- or of any subtree -- is an O(1) operation.
You do need some another data where you can store these Ints. Define Tree as
data Tree a = Node Tree a Tree | Leaf a
and then write a function
annDepth :: Tree a -> Tree (Int, a)
Your original Tree is Tree () and with pattern synonyms you can recover nice constructors.
If you want to preserve the original tree for some reason, you can define a view:
{-# LANGUAGE GADTs, DataKinds #-}
data Shape = SNode Shape Shape | SLeaf
data Tree a sh where
Leaf :: a -> Tree a SLeaf
Node :: Tree a lsh -> a -> Tree a rsh -> Tree a (SNode lsh rsh)
With this you have a guarantee that an annotated tree has the same shape as the unannotated. But this doesn't work good without proper dependent types.
Also, have a look at the question Boilerplate-free annotation of ASTs in Haskell?
The standard solution is what #DanielWagner suggested, just extend the data structure. This can be somewhat inconvenient, but can be solved: Smart constructors for creating instances and using records for pattern matching.
Perhaps Data types a la carte could help, although I haven't used this approach myself. There is a library compdata based on that.
A completely different approach would be to efficiently memoize the values you need. I was trying to solve a similar problem and one of the solutions is provided by the library stable-memo. Note that this isn't a purely functional approach, as the library is internally based on object identity, but the interface is pure and works perfectly for the purpose.

When to expose constructors of a data type when designing data structures?

When designing data structures in functional languages there are 2 options:
Expose their constructors and pattern match on them.
Hide their constructors and use higher-level functions to examine the data structures.
In what cases, what is appropriate?
Pattern matching can make code much more readable or simpler. On the other hand, if we need to change something in the definition of a data type then all places where we pattern-match on them (or construct them) need to be updated.
I've been asking this question myself for some time. Often it happens to me that I start with a simple data structure (or even a type alias) and it seems that constructors + pattern matching will be the easiest approach and produce a clean and readable code. But later things get more complicated, I have to change the data type definition and refactor a big part of the code.
The essential factor for me is the answer to the following question:
Is the structure of my datatype relevant to the outside world?
For example, the internal structure of the list datatype is very much relevant to the outside world - it has an inductive structure that is certainly very useful to expose to consumers, because they construct functions that proceed by induction on the structure of the list. If the list is finite, then these functions are guaranteed to terminate. Also, defining functions in this way makes it easy to provide properties about them, again by induction.
By contrast, it is best for the Set datatype to be kept abstract. Internally, it is implemented as a tree in the containers package. However, it might as well have been implemented using arrays, or (more usefully in a functional setting) with a tree with a slightly different structure and respecting different invariants (balanced or unbalanced, branching factor, etc). The need to enforce any invariants above and over those that the constructors already enforce through their types, by the way, precludes letting the datatype be concrete.
The essential difference between the list example and the set example is that the Set datatype is only relevant for the operations that are possible on Set's. Whereas lists are relevant because the standard library already provides many functions to act on them, but in addition their structure is relevant.
As a sidenote, one might object that actually the inductive structure of lists, which is so fundamental to write functions whose termination and behaviour is easy to reason about, is captured abstractly by two functions that consume lists: foldr and foldl. Given these two basic list operators, most functions do not need to inspect the structure of a list at all, and so it could be argued that lists too coud be kept abstract. This argument generalizes to many other similar structures, such as all Traversable structures, all Foldable structures, etc. However, it is nigh impossible to capture all possible recursion patterns on lists, and in fact many functions aren't recursive at all. Given only foldr and foldl, one would, writing head for example would still be possible, though quite tedious:
head xs = fromJust $ foldl (\b x -> maybe (Just x) Just b) Nothing xs
We're much better off just giving away the internal structure of the list.
One final point is that sometimes the actual representation of a datatype isn't relevant to the outside world, because say it is some kind of optimised and might not be the canonical representation, or there isn't a single "canonical" representation. In these cases, you'll want to keep your datatype abstract, but offer "views" of your datatype, which do provide concrete representations that can be pattern matched on.
One example would be if wanted to define a Complex datatype of complex numbers, where both cartesian forms and polar forms can be considered canonical. In this case, you would keep Complex abstract, but export two views, ie functions polar and cartesian that return a pair of a length and an angle or a coordinate in the cartesian plane, respectively.
Well, the rule is pretty simple: If it's easy to construct wrong values by using the actual constructors, then don't allow them to be used directly, but instead provide smart constructors. This is the path followed by some data structures like Map and Set, which are easy to get wrong.
Then there are the types for which it's impossible or hard to construct inconsistent/wrong values either because the type doesn't allow that at all or because you would need to introduce bottoms. The length-indexed list type (commonly called Vec) and most monads are examples of that.
Ultimately this is your own decision. Put yourself into the user's perspective and make the tradeoff between convenience and safety. If there is no tradeoff, then always expose the constructors. Otherwise your library users will hate you for the unnecessary opacity.
If the data type serves a simple purpose (like Maybe a) and no (explicit or implicit) assumptions about the data type can be violated by directly constructing a value via the data constructors, I would expose the constructors.
On the other hand, if the data type is more complex (like a balanced tree) and/or it's internal representation is likely to change, I usually hide the constructors.
When using a package, there's an unwritten rule that the interface exposed by a non-internal module should be "safe" to use on the given data type. Considering the balanced tree example, exposing the data constructors allows one to (accidentally) construct an unbalanced tree, and so the assumed runtime guarantees for searching the tree etc might be violated.
If the type is used to represent values with a canonical definition and representation (many mathematical objects fall into this category), and it's not possible to construct "invalid" values using the type, then you should expose the constructors.
For example, if you're representing something like two dimensional points with your own type (including a newtype), you might as well expose the constructor. The reality is that a change to this datatype is not going to be a change in how 2d points are represented, it's going to be a change in your need to use 2d points (maybe you're generalising to 3d space, maybe you're adding a concept of layers, or whatever), and is almost certain to need attention in the parts of the code using values of this type no matter what you do.[1]
A complex type representing something specific to your application or field is quite likely to undergo changes to the representation while continuing to support similar operations. Therefore you only want other modules depending on the operations, not on the internal structure. So you shouldn't expose the constructors.
Other types represent things with canonical definitions but not canonical representations. Everyone knows the properties expected of maps and sets, but there are lots of different ways of representing values that support those properties. So you again only want other modules depending on the operations they support, not on the particular representations.
Some types, whether or not they are if simple with canonical representations, allow the construction of values in the program which don't represent a valid value in the abstract concept the type is supposed to represent. A simple example would be a type representing a self-balancing binary search tree; client code with access to the constructors could easily construct invalid trees. Exposing the constructors either means you need to assume that such values passed in from outside may be invalid and therefore you need to make something sensible happen even for bizarre values, or means that it's the responsibility of the programmers working with your interface to ensure they don't violate any assumptions. It's usually better to just keep such types from being constructed directly outside your module.
Basically it comes down to the concept your type is supposed to represent. If your concept maps in a very simple and obvious[2] way directly to values in some data type which isn't "more inclusive" than the concept due to the compiler being unable to check needed invariants, then the concept is pretty much "the same" as the data type, and exposing its structure is fine. If not, then you probably need to keep the structure hidden.
[1] A likely change though would be to change which numeric type you're using for the coordinate values, so you probably do have to think about how to minimise the impact of such changes. That's pretty orthogonal to whether or not you expose the constructors though.
[2] "Obvious" here meaning that if you asked 10 people independently to come up with a data type representing the concept they would all come back with the same thing, modulo changing the names.
I would propose a different, noticeably more restrictive rule than most people. The central criterion would be:
Do you guarantee that this type will never, ever change? If so, exposing the constructors might be a good idea. Good luck with that, though!
But the types for which you can make that guarantee tend to be very simple, generic "foundation" types like Maybe, Either or [], which one could arguably write once and then never revisit again.
Though even those can be questioned, because they do get revisited from time to time; there's people who have used Church-encoded versions of Maybe and List in various contexts for performance reasons, e.g.:
{-# LANGUAGE RankNTypes #-}
newtype Maybe' a = Maybe' { elimMaybe' :: forall r. r -> (a -> r) -> r }
nothing = Maybe' $ \z k -> z
just x = Maybe' $ \z k -> k x
newtype List' a = List' { elimList' :: forall r. (a -> r -> r) -> r -> r }
nil = List' $ \k z -> z
cons x xs = List' $ \k z -> k x (elimList' k z xs)
These two examples highlight something important: you can replace the Maybe' type's implementation shown above with any other implementation as long as it supports the following three functions:
nothing :: Maybe' a
just :: a -> Maybe' a
elimMaybe' :: Maybe' a -> r -> (a -> r) -> r
...and the following laws:
elimMaybe' nothing z x == z
elimMaybe' (just x) z f == f x
And this technique can be applied to any algebraic data type. Which to me says that pattern matching against concrete constructors is just insufficiently abstract; it doesn't really gain you anything that you can't get out of the abstract constructors + destructor pattern, and it loses implementation flexibility.

Data structure to represent automata

I'm currently trying to come up with a data structure that fits the needs of two automata learning algorithms I'd like to implement in Haskell: RPNI and EDSM.
Intuitively, something close to what zippers are to trees would be perfect: those algorithms are state merging algorithms that maintain some sort of focus (the Blue Fringe) on states and therefore would benefit of some kind of zippers to reach interesting points quickly. But I'm kinda lost because a DFA (Determinist Finite Automaton) is more a graph-like structure than a tree-like structure: transitions can make you go back in the structure, which is not likely to make zippers ok.
So my question is: how would you go about representing a DFA (or at least its transitions) so that you could manipulate it in a fast fashion?
Let me begin with the usual opaque representation of automata in Haskell:
newtype Auto a b = Auto (a -> (b, Auto a b))
This represents a function that takes some input and produces some output along with a new version of itself. For convenience it's a Category as well as an Arrow. It's also a family of applicative functors. Unfortunately this type is opaque. There is no way to analyze the internals of this automaton. However, if you replace the opaque function by a transparent expression type you should get automata that you can analyze and manipulate:
data Expr :: * -> * -> * where
-- Stateless
Id :: Expr a a
-- Combinators
Connect :: Expr a b -> Expr b c -> Expr a c
-- Stateful
Counter :: (Enum b) => b -> Expr a b
This gives you access to the structure of the computation. It is also a Category, but not an arrow. Once it becomes an arrow you have opaque functions somewhere.
Can you just use a graph to get started? I think the fgl package is part of the Haskell Platform.
Otherwise you can try defining your own structure with 'deriving (Data)' and use the "Scrap Your Zipper" library to get the Zipper.
If you don't need any fancy graph algorithms you can represent your DFA as a Map State State. This gives you fast access and manipulation. You also get focus by keeping track of the current state.
Take a look at the regex-tdfa package: http://hackage.haskell.org/package/regex-tdfa
The source is pretty complex, but it's an implementations of regexes with tagged DFAs tuned for performance, so it should illustrate some good practices for representing DFAs efficiently.

Resources