I am programming several search functions, for which I use the Node datatype:
data Node a =
Node { [...]
, getPath :: [a] -- ^ Previous states this node has visited
}
That field, getPath, is what I use to check whether I have previously visited a state: when expanding a new node, I check it by doing:
visited = (`elem` path)
It works, but it becomes incredibly costly when there are a lot of nodes expanded and the paths become too long. Is there a better way to keep track of the states I have visited? Maybe a different data structure (instead of a list) that performs better for this use?
The answer depends on what constraints your elements have (or can be given).
If your elements have an Ord constraint then you can use Data.Set from containers to get O(log n) membership tests.
If your elements have a Hashable constraint then you can use a HashSet from unordered-containers to get O(log n) membership tests with better constant factors* and a large enough base that they are effectively constant time for most use cases.
If your elements only have an Eq constraint then you can't do better than a list.
* Depending on the performance of the Hashable instance. When in doubt, benchmark.
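For instance, under the Ord constraint, a minimal sketch of the Set approach (the getVisited field, visited, and expand are names introduced here for illustration; the original Node has more fields, elided):
import qualified Data.Set as Set

data Node a = Node
  { getPath    :: [a]       -- previous states, in order (still useful for output)
  , getVisited :: Set.Set a -- the same states, for O(log n) membership tests
  }

visited :: Ord a => a -> Node a -> Bool
visited s = Set.member s . getVisited

expand :: Ord a => a -> Node a -> Node a
expand s (Node path seen) = Node (s : path) (Set.insert s seen)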
So far I've only found vector and sequences, but neither of those can replace an element of a list in O(1). Such a data structure would of course violate the immutable character of Haskell's structures, but maybe there still exists some dirty implementation? Any feedback is welcome.
As you suggest yourself – I'm also pretty sure a safe, purely-functional update in O(1) is not possible. What is possible is O(log n) with a tree-like implementation; for instance, instead of [a] you could use Data.Map.Map Int a with a contiguous region of indices. Also, it is possible to do a batch update of k ≤ n elements in a list or vector in only O(n), instead of the O(k·n) it would take to insert them one by one. Check out Data.Vector's // operator.
If none of that is fast enough for you, then yes, you will need to go into the dark realm of mutability. Fortunately, Haskell offers a good safety armour and flashlight for such journeys: the ST monad. The way it works is, you wrap the entire region where you need to do mutable updates in runST. Inside that region, you use MVectors, which support O(1) mutable element updates, much like you could in an imperative language. But thanks to a type-system trick, runST ensures that all these side-effects stay confined to within the local scope.
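Here is a minimal sketch of both ideas, assuming the vector package; bumpMany and bumpManyST are hypothetical names:
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- Batch update with (//): one O(n) pass instead of k separate O(n) copies.
-- (Assumes the vector has at least 4 elements.)
bumpMany :: V.Vector Int -> V.Vector Int
bumpMany v = v V.// [(0, 10), (3, 40)]

-- The same kind of update done mutably inside runST: each write is O(1),
-- and the type system guarantees the side effects stay local.
bumpManyST :: [(Int, Int)] -> V.Vector Int -> V.Vector Int
bumpManyST updates vec = runST $ do
  mv <- V.thaw vec                      -- O(n) copy into a mutable vector
  mapM_ (uncurry (MV.write mv)) updates -- O(1) per write
  V.freeze mv                           -- O(n) copy back to an immutable vector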
This is perhaps related to functional data structures, but I found no tags about this topic.
Say I have a syntax tree type Tree, which is organised as a DAG by simply sharing common subexpressions. For example,
data Tree = Val Int | Plus Tree Tree
example :: Tree
example = let x = Val 42 in Plus x x
Then, on this syntax tree type, I have a pure function simplify :: Tree -> Tree which, given the root node of a Tree, simplifies the whole tree by first simplifying the children of the root node and then handling the operation of the root node itself.
Since simplify is a pure function, and some nodes are shared, we expect not to call simplify multiple times on those shared nodes.
Here comes the problem. The whole data structure is immutable, and the sharing is transparent to the programmer, so it seems impossible to determine whether two nodes are in fact the same node.
The same problem happens when handling the so-called “tying-the-knot” structures. By tying the knot, we produce a finite data representation for an otherwise infinite data structure, e.g. let xs = 1 : xs in xs. Here xs itself is finite, but calling map succ on it does not necessarily produce a finite representation.
These problems can be summarised as follows: when the data is organised in an immutable directed graph, how do we avoid revisiting the same node, doing duplicated work, or even failing to terminate when the graph happens to be cyclic?
Some ideas that I have thought of:
Extend the Tree type to Tree a, making every node hold an extra a. When generating the graph, associate each node with a unique a value. The memory address would have been a natural choice here, except that the garbage collector may move any heap object at any time.
For the syntax tree example, we may store an STRef (Maybe Tree) in every node for the simplified version, but this might not be extensible, and it injects an implementation detail of one specific operation into the data structure itself.
This is a problem with a lot of research behind it. In general, you cannot observe sharing in a pure language like Haskell, due to referential transparency. But in practice, you can safely observe sharing as long as you restrict yourself to doing the observing in the IO monad. Andy Gill (one of the legends of the old Glasgow school of FP!) wrote a wonderful paper about this about 10 years ago:
http://ku-fpg.github.io/files/Gill-09-TypeSafeReification.pdf
It is very well worth reading, and the bibliography will give you pointers to prior art in this area and many suggested solutions, from "poor-man's morally-safe" approaches to fully monadic knot-tying techniques. To my mind, Andy's solution and the corresponding data-reify package on Hackage:
https://hackage.haskell.org/package/data-reify
are the most practical solutions to this problem. And I can tell from experience that they work really well in practice.
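Here is a minimal sketch of data-reify applied to the Tree type from the question. TreeF is a "pattern functor" introduced for illustration; MuRef, mapDeRef, reifyGraph, and Graph are the Data.Reify API:
{-# LANGUAGE TypeFamilies #-}
import Data.Reify

data Tree = Val Int | Plus Tree Tree

-- The same shape as Tree, but with the recursive positions abstracted out.
data TreeF r = ValF Int | PlusF r r
  deriving Show

instance MuRef Tree where
  type DeRef Tree = TreeF
  mapDeRef _ (Val n)    = pure (ValF n)
  mapDeRef f (Plus l r) = PlusF <$> f l <*> f r

main :: IO ()
main = do
  let x = Val 42
  Graph nodes root <- reifyGraph (Plus x x)
  print root          -- unique id of the root node
  mapM_ print nodes   -- both children of PlusF carry the same id,
                      -- so the sharing has become observable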
I'm new to Haskell (a couple of months). I have a Haskell program that assembles a large expression DAG (not a tree, a DAG), potentially deep and with multiple merging paths (i.e., the number of different paths from root to leaves is huge). I need a fast way to test these DAGs for equality. The default Eq derivation will just recurse, exploring the same nodes multiple times. Currently this causes my program to take 60 seconds for relatively small expressions, and not even finish for larger ones. The profiler indicates it is busy checking equality most of the time. I would like to implement a custom Eq that does not have this problem, but I don't have a way to solve this that doesn't involve a lot of rewriting. So I want to hear your thoughts.
My first attempt was to 'instrument' tree nodes with a hash that I compute incrementally, using Data.Hashable.hash, as I build the tree. This gave me an easy way to establish that two things aren't equal without looking deep into the structure. But often in this DAG, because paths merge, the structures really are equal; then the hashes are equal too, and I revert to full-blown equality testing.
If I had a way to test physical equality, a lot of my problems would go away: if two nodes are physically equal, that settles it; otherwise, if their hashes differ, that settles it too. I would only need to go deeper when they are physically distinct but their hashes agree.
I could also imitate git and compute a SHA-1 per node, deciding equality purely by hash (no need to recurse). I know for a fact that this would help, because if I let equality be decided fully in terms of hash equality, the program runs in tens of milliseconds for the largest expressions. This approach also has the nice advantage that if two DAGs are content-equal without being physically equal, I would detect that quickly as well. (With IDs, I'd still have to do a traversal in that case.) So I like the semantics more.
This approach, however, involves a lot more work than just calling the Data.Hashable.hash function, because I have to derive it for every variant of the DAG node type. Moreover, I have multiple DAG representations with slightly different node definitions, so I would need to repeat this hashing trick for each representation I add.
What would you do?
Part of the problem here is that Haskell has no concept of object identity, so when you say you have a DAG where you refer to the same node twice, as far as Haskell is concerned it's just two values in different places in a tree. This is fundamentally different from the OO concept, where an object is indexed by its location in memory, so the distinction between "same object" and "different objects with equal fields" is meaningful.
To solve your problem you need to detect when you are visiting the same object that you saw earlier, and in order to do that you need to have a concept of "same object" that is independent of the value. There are two basic ways to attack this:
Store all your objects in a vector (i.e. an array), and use the vector index as an object identity. Replace values with indices throughout your data structure.
Give each object a unique "identity" field so you can tell if you have seen this one before when traversing the DAG.
The former is how the Data.Graph module in the containers package does it. One advantage is that, if you have a single mapping from DAG to vector, then DAG equality becomes just vector equality.
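For instance, here is a rough sketch of the index-as-identity approach using Data.Graph; the ExprF type and the Int keys are hypothetical:
import Data.Graph (Graph, graphFromEdges)

-- Children are referenced by Int key rather than by value.
data ExprF = ValF Int | PlusF Int Int

-- graphFromEdges stores the nodes in an internal array; each vertex is an
-- index into that array, so "same node" simply means "same key".
exampleGraph :: Graph
exampleGraph = g
  where
    (g, _nodeFromVertex, _vertexFromKey) = graphFromEdges
      [ (ValF 42,   0 :: Int, [])  -- node 0: Val 42
      , (PlusF 0 0, 1,        [0]) -- node 1: Plus with node 0 as both children
      ]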
Any efficient way to test for equality will be intertwined with the way you build up the DAG values.
Here is an idea which keeps track of all nodes ever created in a Map. As new nodes are added to the Map, they are assigned a unique id. Creating nodes now becomes monadic, as you have to thread this Map (and the next available id) through your computation. In this example the nodes are implemented as rose trees, and the order of the children is not significant; hence the call to sort when deriving the key into the map.
import Control.Monad.State
import Data.List (sort)
import qualified Data.Map as M

data Node = Node { _eqIdent  :: Int    -- equality identifier
                 , _value    :: String -- value associated with the node
                 , _children :: [Node] -- children
                 }
  deriving (Show)

type BuildState = (Int, M.Map (String, [Int]) Node)

buildNode :: String -> [Node] -> State BuildState Node
buildNode value nodes = do
  (nextid, nodeMap) <- get
  let key = (value, sort (map _eqIdent nodes)) -- the identity of the node
  case M.lookup key nodeMap of
    Nothing -> do
      let n        = Node nextid value nodes
          nodeMap' = M.insert key n nodeMap
      put (nextid + 1, nodeMap')
      return n
    Just node -> return node

nodeEquality :: Node -> Node -> Bool
nodeEquality a b = _eqIdent a == _eqIdent b
One caveat -- this approach requires that you know all the children of a node when you build it.
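For example, a hypothetical use (demo is an illustrative name) that interns two identical leaves:
demo :: Bool
demo = evalState build (0, M.empty)
  where
    build = do
      a <- buildNode "42" []
      b <- buildNode "42" []    -- same key, so the existing node is returned
      _ <- buildNode "+" [a, b]
      return (nodeEquality a b) -- True: a and b share one identity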
A few years ago, during a C# course I learned to write a binary tree that looked more or less like this:
data Tree a = Branch a (Tree a) (Tree a) | Leaf
I saw the benefit of it: it had its values on the branches, which allowed for quick and easy lookup and insertion of values, because a search would encounter a value at the root of each subtree all the way down until it hit a leaf, which held no value.
Ever since I started learning Haskell, however; I've seen numerous examples of trees that are defined like this:
data Tree a = Branch (Tree a) (Tree a) | Leaf a
That definition puzzles me. I can't see the usefulness of having data only on the elements that don't branch, because all the values end up at the bottom of the tree. To me, that seems like a poorly designed alternative to a List. It also makes me question its lookup time, since it can't assess which branch to go down to find a value; rather, it needs to go through every node to find what it's looking for.
So, can anyone shed some light on why the second version (value on leaves) is so much more prevalent in Haskell than the first version?
I think this depends on what you're trying to model and how you're trying to model it.
A tree where the internal nodes store values and the leaves are just leaves is essentially a standard binary tree (treat each leaf as NULL and you basically have an imperative-style binary tree). If the values are stored in sorted order, you now have a binary search tree. There are many specific advantages to storing data this way, most of which transfer directly over from imperative settings.
Trees where the leaves store the data and the internal nodes are just for structure do have their advantages. For example, red/black trees support two powerful operations called split and join that have advantages in some circumstances. split takes as input a key, then destructively modifies the tree to produce two trees, one of which contains all keys less than the specified input key and one containing the remaining keys. join is, in a sense, the opposite: it takes in two trees where one tree's values are all less than the other tree's values, then fuses them together into a single tree. These operations are particularly difficult to implement on most red/black trees, but are much simpler if all the data is stored in the leaves only rather than in the internal nodes. This paper detailing an imperative implementation of red/black trees mentions that some older implementations of red/black trees used this approach for this very reason.
As another potential advantage of storing keys in the leaves, suppose that you want to implement the concatenate operation, which joins two lists together. With no data in the internal nodes, this is as simple as
concat first second = Branch first second
This works because the new node needs no data of its own. If the internal nodes carry data, you need to somehow move a key up from one of the subtrees into the new concatenation node, which takes more time and is trickier to work with.
Finally, in some cases, you might want to store the data in the leaves because the leaves are fundamentally different from internal nodes. Consider a parse tree, for example, where the leaves store specific terminals from the parse and the internal nodes store all the nonterminals in the production. In this case, there really are two different types of nodes, so it doesn't make sense to store arbitrary data in the internal nodes.
Hope this helps!
You described a tree with data at the leaves as "a poorly designed alternative to a List."
I agree that this could be used as an alternative to a list, but it's not necessarily poorly designed! Consider the data type
data Tree t = Leaf t | Branch (Tree t) (Tree t)
You can define cons and snoc (append to end of list) operations -
cons :: t -> Tree t -> Tree t
cons t (Leaf s) = Branch (Leaf t) (Leaf s)
cons t (Branch l r) = Branch (cons t l) r
snoc :: Tree t -> t -> Tree t
snoc (Leaf s) t = Branch (Leaf s) (Leaf t)
snoc (Branch l r) t = Branch l (snoc r t)
These run (for roughly balanced lists) in O(log n) time where n is the length of the list. This contrasts with the standard linked list, which has O(1) cons and O(n) snoc operations. You can also define a constant-time append (as in templatetypedef's answer)
append :: Tree t -> Tree t -> Tree t
append l r = Branch l r
which is O(1) for two lists of any size, whereas the standard list is O(n) where n is the length of the left argument.
In practice you would want to define slightly smarter versions of these functions which attempt to keep the tree balanced. To do this it is often useful to have some additional information at the branches, which could be done by having multiple kinds of branch (as in a red-black tree which has "red" and "black" nodes) or explicitly include additional data at the branches, as in
data Tree b a = Leaf a | Branch b (Tree b a) (Tree b a)
For example, you can support an O(1) size operation by storing the total number of elements in both subtrees in the nodes. All of your operations on the tree become slightly more complicated since you need to correctly persist the information about subtree sizes -- in effect the work of computing the size of the tree is amortized over all the operations that construct the tree (and cleverly persisted, so that minimal work is done whenever you need to reconstruct a size later).
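For example, a size-annotated variant might look like this (a sketch; SizedTree, size, and branch are illustrative names):
-- Cache the element count at every Branch.
data SizedTree a = Leaf a | Branch Int (SizedTree a) (SizedTree a)

-- O(1): read the cached count instead of walking the tree.
size :: SizedTree a -> Int
size (Leaf _)       = 1
size (Branch n _ _) = n

-- Smart constructor that keeps the cached size correct.
branch :: SizedTree a -> SizedTree a -> SizedTree a
branch l r = Branch (size l + size r) l r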
More data is not simply better or worse. I'll explain just a couple of basic considerations to show why your intuition fails; the general idea, though, is that different data structures need different things.
Empty leaf nodes can actually be a space (and therefore time) problem in some contexts. If a node is represented by a bit of information and two pointers to its children, you'll end up with two null pointers per node whose children are both leaves. That's two machine words per leaf node, which can add up to quite a bit of space. Some structures avoid this by ensuring that each leaf holds at least one piece of information to justify its existence. In some cases (such as ropes), each leaf may have a fairly large and dense payload.
Making internal nodes bigger (by storing information in them) makes it more expensive to modify the tree. Changing a leaf in a balanced tree typically forces you to allocate replacements for O(log n) internal nodes. If each of those is larger, you've just allocated more space and spent extra time to copy more words. The extra size of the internal nodes also means that you can fit less of the tree structure into the CPU cache.
Thinking of maps as representations of finite functions, a map of two or more variables can be given either in curried or uncurried form; that is, the types Map (a,b) c and Map a (Map b c) are isomorphic, or something close to it.
What practical considerations are there — efficiency, etc — for choosing between the two representations?
The Ord instance of tuples uses lexicographic order, so Map (a, b) c will sort by a first anyway, and the overall order is the same in both forms. Regarding practical considerations:
Because Data.Map is a binary search tree, splitting at a key is comparable to a lookup, so getting a submap for a given a in the uncurried form won't be significantly more expensive than in the curried form.
The curried form may produce a less balanced tree overall, for the obvious reason of having multiple trees instead of just one.
The curried form will have a bit of extra overhead to store the nested maps.
The nested maps of the curried form representing "partial applications" can be shared if some a values produce the same result.
Similarly, "partial application" of the curried form gives you the existing inner map, while the uncurried form must construct a new map.
So the uncurried form is clearly better in general, but the curried form may be better if you expect to do "partial application" often and would benefit from sharing of Map b c values.
Note that some care will be necessary to ensure you actually benefit from that potential sharing; you'll need to explicitly define any shared inner maps and reuse the single value when constructing the full map.
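A sketch of what that explicit sharing looks like (sharedInner and outer are hypothetical names):
import qualified Data.Map as M

-- Define the inner map once and reuse it under several outer keys.
sharedInner :: M.Map String Int
sharedInner = M.fromList [("x", 1), ("y", 2)]

outer :: M.Map String (M.Map String Int)
outer = M.fromList [("a", sharedInner), ("b", sharedInner)]
-- Both entries point at the same heap object, so the inner map is
-- stored only once.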
Edit: Tikhon Jelvis points out in the comments that the memory overhead of the tuple constructors--which I did not think to account for--is not at all negligible. There is certainly some overhead to the curried form, but that overhead is proportional to how many distinct a values there are. The tuple constructor overhead in the uncurried form, on the other hand, is proportional to the total number of keys.
So if, on average, any given value of a is used by three or more distinct keys, you'll probably save memory with the curried version. The concerns about unbalanced trees still apply, of course. The more I think about it, the more I suspect the curried form is unequivocally better, except perhaps if your keys are very sparse and unevenly distributed.
Note that because arity of definitions does matter to GHC, the same care is required when defining functions if you want subexpressions to be shared; this is one reason you sometimes see functions defined in a style like this:
foo x = go
  where
    z = expensiveComputation x
    go y = doStuff y z
Tuples are lazy in both elements, so the tuple version introduces a little extra laziness. Whether this is good or bad strongly depends on your usage. (In particular, comparisons may force the tuple elements, but only if there are lots of duplicate a values.)
Beyond that, I think it's going to depend on how many duplicates you have. If a is almost always different whenever b is, you're going to have a lot of small trees, so the tuple version might be better. On the other hand, if the opposite is true, the non-tuple version may save you a little time (not constantly recomparing a once you've found the appropriate subtree and you're looking for b).
I'm reminded of tries, and how they store common prefixes once. The non-tuple version seems to be a bit like that. A trie can be more efficient than a BST if there's lots of common prefixes, and less efficient if there aren't.
But the bottom line: benchmark it!! ;-)
Apart from the efficiency aspects, there's also a pragmatic side to this question: what do you want to do with this structure?
Do you, for instance, want to be able to store an empty map for a given value of type a? If so, then the curried version might be more practical!
Here's a simple example: let's say we want to store String-valued properties of persons - say the value of some fields on that person's stackoverflow profile page.
type Person = String
type Property = String
curriedMap :: Map Person (Map Property String)
curriedMap = fromList [
  ("yatima2975", fromList [("location","Utrecht"),("age","37")]),
  ("PLL", fromList []) ]

uncurriedMap :: Map (Person,Property) String
uncurriedMap = fromList [
  (("yatima2975","location"), "Utrecht"),
  (("yatima2975","age"), "37") ]
With the uncurried version, there is no nice way to record the fact that user "PLL" is known to the system but hasn't filled in any information. A person/property pair ("PLL", undefined) is going to cause runtime crashes, since Map is strict in the keys.
You could change the type of uncurriedMap to Map (Person,Property) (Maybe String) and store Nothings in there, and that might very well be the best solution in this case; but where there's an unknown or varying number of properties (e.g. depending on the kind of Person), that will also run into difficulties.
So, I guess it also depends on whether you need a query function like this:
data QueryResult = PersonUnknown | PropertyUnknownForPerson | Value String
query :: Person -> Property -> Map Person (Map Property String) -> QueryResult
This is easy to write in the curried version, but hard (if not impossible) in the uncurried version.
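A minimal sketch of that query, reusing the Person, Property, and QueryResult definitions above:
import Data.Map (Map)
import qualified Data.Map as M

query :: Person -> Property -> Map Person (Map Property String) -> QueryResult
query person prop m =
  case M.lookup person m of
    Nothing    -> PersonUnknown            -- person missing from the outer map
    Just props -> case M.lookup prop props of
      Nothing  -> PropertyUnknownForPerson -- person known, property not set
      Just val -> Value val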