Eq testing for large DAG structures in Haskell

I'm new to Haskell (a couple of months). I have a Haskell program that assembles a large expression DAG (not a tree, a DAG), potentially deep and with multiple merging paths (i.e., the number of distinct paths from root to leaves is huge). I need a fast way to test these DAGs for equality. The default Eq derivation just recurses, exploring the same nodes multiple times. Currently this causes my program to take 60 seconds for relatively small expressions, and to not finish at all for larger ones. The profiler indicates it spends most of its time checking equality. I would like to implement a custom Eq that does not have this problem, but I can't see a way to do that which doesn't involve a lot of rewriting. So I want to hear your thoughts.
My first attempt was to 'instrument' tree nodes with a hash, computed incrementally with Data.Hashable.hash as I build the tree. This gives me a cheap way to show two things are not equal without looking deep into the structure. But in this DAG, because paths merge, the structures often are equal; the hashes agree, and I fall back to full-blown equality testing.
If I had a way to test physical equality, a lot of my problems here would go away: if two nodes are physically equal, that settles it; if their hashes differ, that settles it too. I would only need to go deeper when they are physically distinct but their hashes agree.
I could also imitate git and compute a SHA-1 per node, letting hash equality decide equality outright (no need to recurse). I know for a fact that this would help: if I let equality be decided purely by hash equality, the program runs in tens of milliseconds for the largest expressions. This approach also has the nice advantage that if two DAGs are content-equal without being physically equal, I can detect that quickly as well. (With IDs, I'd still have to do a traversal in that case.) So I like the semantics more.
This approach, however, involves a lot more work than just calling Data.Hashable.hash, because I have to derive the hashing for every variant of the DAG node type. Moreover, I have multiple DAG representations with slightly different node definitions, so I would need to repeat the whole hashing exercise for each of them, and again for any representation I add later.
What would you do?

Part of the problem here is that Haskell has no concept of object identity, so when you say you have a DAG that refers to the same node twice, as far as Haskell is concerned it's just two values in different places in a tree. This is fundamentally different from the OO world, where an object is identified by its location in memory, so the distinction between "same object" and "different objects with equal fields" is meaningful.
To solve your problem you need to detect when you are visiting the same object that you saw earlier, and in order to do that you need to have a concept of "same object" that is independent of the value. There are two basic ways to attack this:
Store all your objects in a vector (i.e. an array), and use the vector index as an object identity. Replace values with indices throughout your data structure.
Give each object a unique "identity" field so you can tell if you have seen this one before when traversing the DAG.
The former is how the Data.Graph module in the containers package does it. One advantage is that, if you have a single mapping from DAG to vector, then DAG equality becomes just vector equality.
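For illustration, here is a minimal sketch of that first approach (the Expr type, NodeId, and dagEq are hypothetical names I've made up, not from the question):

import qualified Data.Vector as V

type NodeId = Int

-- Children are referenced by index rather than by value, so shared
-- subterms are stored exactly once.
data Expr = Val Int | Plus NodeId NodeId
  deriving (Eq, Show)

-- The index of a node in the vector is its identity.
type Dag = V.Vector Expr

-- Given a single canonical mapping from DAG to vector, DAG equality
-- reduces to vector equality, which visits each node exactly once.
dagEq :: Dag -> Dag -> Bool
dagEq = (==)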

Any efficient way to test for equality will be intertwined with the way you build up the DAG values. Here is an idea which keeps track of all nodes ever created in a Map. As new nodes are added to the Map, they are assigned a unique id. Creating nodes now becomes monadic, as you have to thread this Map (and the next available id) throughout your computation. In this example the nodes are implemented as rose trees, and the order of the children is not significant, hence the call to sort when deriving the key into the map.
import Control.Monad.State
import Data.List (sort)
import qualified Data.Map as M

data Node = Node
  { _eqIdent  :: Int    -- equality identifier
  , _value    :: String -- value associated with the node
  , _children :: [Node] -- children
  } deriving (Show)

type BuildState = (Int, M.Map (String, [Int]) Node)

buildNode :: String -> [Node] -> State BuildState Node
buildNode value nodes = do
  (nextid, nodeMap) <- get
  let key = (value, sort (map _eqIdent nodes)) -- the identity of the node
  case M.lookup key nodeMap of
    Nothing -> do
      let n        = Node nextid value nodes
          nodeMap' = M.insert key n nodeMap
      put (nextid + 1, nodeMap')
      return n
    Just node -> return node

nodeEquality :: Node -> Node -> Bool
nodeEquality a b = _eqIdent a == _eqIdent b
One caveat: this approach requires that you know all the children of a node when you build it.
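For example, a hypothetical usage sketch: building the same subtree twice yields the very node created the first time, so the subsequent equality test is O(1).

example :: Bool
example = evalState build (0, M.empty)
  where
    build = do
      a  <- buildNode "a" []
      b  <- buildNode "b" []
      t1 <- buildNode "plus" [a, b]
      t2 <- buildNode "plus" [b, a] -- child order is ignored by the key
      return (nodeEquality t1 t2)  -- True: t2 is the node built as t1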

Related

Avoid revisiting nodes in an immutable directed graph

This is perhaps related to functional data structures, but I found no tags about this topic.
Say I have a syntax tree type Tree, which is organised as a DAG by simply sharing common subexpressions. For example,
data Tree = Val Int | Plus Tree Tree
example :: Tree
example = let x = Val 42 in Plus x x
Then, on this syntax tree type, I have a pure function simplify :: Tree -> Tree which, when given the root node of a Tree, simplifies the whole tree by first simplifying the children of that root node, and then handling the operation of the root node itself.
Since simplify is a pure function, and some nodes are shared, we expect not to call simplify multiple times on those shared nodes.
Here comes the problem. The whole data structure is immutable, and the sharing is transparent to the programmer, so it seems impossible to determine whether two nodes are in fact the same node.
The same problem happens when handling the so-called “tying-the-knot” structures. By tying the knot, we produce a finite data representation for an otherwise infinite data structure, e.g. let xs = 1 : xs in xs. Here xs itself is finite, but calling map succ on it does not necessarily produce a finite representation.
These problems can be summarised as follows: when the data is organised in an immutable directed graph, how do we avoid revisiting the same node, doing duplicated work, or even failing to terminate when the graph happens to be cyclic?
Some ideas that I have thought of:
Extend the Tree type to Tree a, making every node hold an extra a. When generating the graph, associate each node with a unique a value. The memory address would have served here, except that the garbage collector may move any heap object at any time.
For the syntax tree example, we could store an STRef (Maybe Tree) in every node to cache the simplified version, but this might not be extensible, and it injects implementation details of one specific operation into the data structure itself.
This is a problem with a lot of research behind it. In general, you cannot observe sharing in a pure language like Haskell, due to referential transparency. But in practice, you can safely observe sharing so long as you restrict yourself to doing the observing in the IO monad. Andy Gill (one of the legends of the old Glasgow school of FP!) wrote a wonderful paper about this about ten years ago:
http://ku-fpg.github.io/files/Gill-09-TypeSafeReification.pdf
Very well worth reading, and the bibliography will give you pointers to prior art in this area and many suggested solutions, from "poor man's morally-safe" approaches to fully monadic knot-tying techniques. To my mind, Andy's solution and the corresponding data-reify package on Hackage:
https://hackage.haskell.org/package/data-reify
are the most practical solutions to this problem. And I can tell from experience that they work really well in practice.
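To give a flavour, here is a minimal sketch of reifying the asker's Tree type with data-reify. This is my reading of the library's API; TreeF is a pattern functor I introduce for illustration, with the recursive positions abstracted out.

{-# LANGUAGE TypeFamilies #-}
import Data.Reify

data Tree = Val Int | Plus Tree Tree

-- One layer of Tree, with children replaced by a type parameter.
data TreeF r = ValF Int | PlusF r r
  deriving Show

instance MuRef Tree where
  type DeRef Tree = TreeF
  mapDeRef _ (Val n)    = pure (ValF n)
  mapDeRef f (Plus l r) = PlusF <$> f l <*> f r

main :: IO ()
main = do
  let x = Val 42
  Graph nodes root <- reifyGraph (Plus x x) -- observe the sharing, in IO
  print root        -- unique id of the root
  mapM_ print nodes -- each node listed once, children given by unique ids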

Keep record of previously visited states when searching

I am programming several search functions, for which I use the Node datatype:
data Node a = Node
  { [...]
  , getPath :: [a] -- ^ Previous states this node has visited
  }
That field, getPath, is what I use to check whether I have previously visited a state on the way to a node: when expanding a new node, I check it by doing:
visited = (`elem` path)
It works, but it becomes incredibly costly when many nodes are expanded and the paths grow long. Is there a better way to keep track of the states I have visited? Maybe a different data structure (instead of a list) that performs better for this use?
The answer depends on what constraints your elements have (or can be given).
If your elements have an Ord constraint then you can use Data.Set from containers to get O(log n) membership tests (see the sketch below).
If your elements have a Hashable constraint then you can use a HashSet from unordered-containers to get O(log n) membership tests with better constant factors* and a large enough base that they are effectively constant time for most use cases.
If your elements only have an Eq constraint then you can't do better than a list.
* Depending on the performance of the Hashable instance. When in doubt, benchmark.
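For the Ord case, here is a minimal sketch of the change (the Node fields shown are hypothetical, since the original type is elided):

import qualified Data.Set as S

data Node a = Node
  { getPath    :: [a]     -- kept in order, for reconstructing the path
  , getVisited :: S.Set a -- the same states, as a set for fast lookup
  }

-- Expanding to a state is now O(log n) instead of O(n):
expand :: Ord a => a -> Node a -> Maybe (Node a)
expand s (Node path seen)
  | s `S.member` seen = Nothing -- already visited
  | otherwise         = Just (Node (s : path) (S.insert s seen))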

Haskell: `Map (a,b) c` versus `Map a (Map b c)`?

Thinking of maps as representations of finite functions, a map of two or more variables can be given either in curried or uncurried form; that is, the types Map (a,b) c and Map a (Map b c) are isomorphic, or something close to it.
What practical considerations are there (efficiency, etc.) for choosing between the two representations?
The Ord instance for tuples uses lexicographic order, so Map (a, b) c sorts by a first anyway; the overall ordering is the same in both representations. Regarding practical considerations:
Because Data.Map is a binary search tree, splitting at a key is comparable to a lookup, so getting a submap for a given a in the uncurried form won't be significantly more expensive than in the curried form.
The curried form may produce a less balanced tree overall, for the obvious reason of having multiple trees instead of just one.
The curried form will have a bit of extra overhead to store the nested maps.
The nested maps of the curried form representing "partial applications" can be shared if some a values produce the same result.
Similarly, "partial application" of the curried form gives you the existing inner map, while the uncurried form must construct a new map.
So the uncurried form is clearly better in general, but the curried form may be better if you expect to do "partial application" often and would benefit from sharing of Map b c values.
Note that some care will be necessary to ensure you actually benefit from that potential sharing; you'll need to explicitly define any shared inner maps and reuse the single value when constructing the full map.
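For instance (a small made-up illustration), this stores the inner map once and shares it between both outer keys:

import qualified Data.Map as M

sharedInner :: M.Map Char Int
sharedInner = M.fromList [('x', 1), ('y', 2)]

-- Both outer keys reference the same heap object for the inner map,
-- so it is stored only once.
outer :: M.Map String (M.Map Char Int)
outer = M.fromList [("a", sharedInner), ("b", sharedInner)]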
Edit: Tikhon Jelvis points out in the comments that the memory overhead of the tuple constructors (which I did not think to account for) is not at all negligible. There is certainly some overhead to the curried form, but that overhead is proportional to how many distinct a values there are. The tuple-constructor overhead in the uncurried form, on the other hand, is proportional to the total number of keys.
So if, on average, any given value of a is shared by three or more distinct keys, you'll probably save memory using the curried version. The concerns about unbalanced trees still apply, of course. The more I think about it, the more I suspect the curried form is unequivocally better, except perhaps when your keys are very sparse and unevenly distributed.
Note that because arity of definitions does matter to GHC, the same care is required when defining functions if you want subexpressions to be shared; this is one reason you sometimes see functions defined in a style like this:
foo x = go
  where
    z = expensiveComputation x
    go y = doStuff y z
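For contrast (a made-up counterpart using the same hypothetical helpers), the naive two-argument form below recomputes the expensive part on every call to the partially applied function. GHC's full-laziness pass can sometimes float it out for you, but that is not guaranteed:

-- With `let g = foo' x in map g ys`, expensiveComputation x may be
-- re-evaluated for every element of ys:
foo' x y = doStuff y (expensiveComputation x)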
Tuples are lazy in both elements, so the tuple version introduces a little extra laziness. Whether this is good or bad strongly depends on your usage. (In particular, comparisons may force the tuple elements, but only if there are lots of duplicate a values.)
Beyond that, I think it's going to depend on how many duplicates you have. If a is almost always different whenever b is, you're going to have a lot of small trees, so the tuple version might be better. On the other hand, if the opposite is true, the non-tuple version may save you a little time (not constantly recomparing a once you've found the appropriate subtree and you're looking for b).
I'm reminded of tries, and how they store common prefixes once. The non-tuple version seems to be a bit like that. A trie can be more efficient than a BST if there's lots of common prefixes, and less efficient if there aren't.
But the bottom line: benchmark it!! ;-)
Apart from the efficiency aspects, there's also a pragmatic side to this question: what do you want to do with this structure?
Do you, for instance, want to be able to store an empty map for a given value of type a? If so, then the curried version might be more practical!
Here's a simple example: let's say we want to store String-valued properties of persons; say, the values of some fields on that person's Stack Overflow profile page.
type Person   = String
type Property = String

curriedMap :: Map Person (Map Property String)
curriedMap = fromList
  [ ("yatima2975", fromList [("location","Utrecht"),("age","37")])
  , ("PLL",        fromList []) ]

uncurriedMap :: Map (Person,Property) String
uncurriedMap = fromList
  [ (("yatima2975","location"), "Utrecht")
  , (("yatima2975","age"),      "37") ]
With the uncurried version, there is no nice way to record the fact that user "PLL" is known to the system but hasn't filled in any information. A person/property pair ("PLL",undefined) is going to cause runtime crashes, since Map is strict in the keys.
You could change the type of uncurriedMap to Map (Person,Property) (Maybe String) and store Nothings in there, and that might very well be the best solution in this case; but where there's an unknown/varying number of properties (e.g. depending on the kind of Person) that will also run into difficulties.
So, I guess it also depends on whether you need a query function like this:
data QueryResult = PersonUnknown | PropertyUnknownForPerson | Value String
query :: Person -> Property -> Map (Person, Property) String -> QueryResult
This is hard to write (if not impossible) with the uncurried version, but easy with the curried version.
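By way of contrast, here is a sketch of the easy direction: the same query written against the curried form (a minimal sketch, reusing the Person, Property and QueryResult types above):

import qualified Data.Map as M

query :: Person -> Property -> M.Map Person (M.Map Property String) -> QueryResult
query person prop m =
  case M.lookup person m of
    Nothing    -> PersonUnknown            -- no inner map for this person
    Just inner -> case M.lookup prop inner of
      Nothing -> PropertyUnknownForPerson  -- person known, property missing
      Just v  -> Value v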

When to expose constructors of a data type when designing data structures?

When designing data structures in functional languages there are 2 options:
Expose their constructors and pattern match on them.
Hide their constructors and use higher-level functions to examine the data structures.
In what cases, what is appropriate?
Pattern matching can make code much more readable or simpler. On the other hand, if we need to change something in the definition of a data type then all places where we pattern-match on them (or construct them) need to be updated.
I've been asking this question myself for some time. Often it happens that I start with a simple data structure (or even a type alias), and constructors + pattern matching seem like the easiest approach, producing clean and readable code. But later things get more complicated; I have to change the data type definition and refactor a big part of the code.
The essential factor for me is the answer to the following question:
Is the structure of my datatype relevant to the outside world?
For example, the internal structure of the list datatype is very much relevant to the outside world - it has an inductive structure that is certainly very useful to expose to consumers, because they construct functions that proceed by induction on the structure of the list. If the list is finite, then these functions are guaranteed to terminate. Also, defining functions in this way makes it easy to provide properties about them, again by induction.
By contrast, it is best for the Set datatype to be kept abstract. Internally, it is implemented as a tree in the containers package. However, it might as well have been implemented using arrays, or (more usefully in a functional setting) with a tree of slightly different structure respecting different invariants (balanced or unbalanced, branching factor, etc.). The need to enforce invariants over and above those that the constructors already enforce through their types, by the way, precludes letting the datatype be concrete.
The essential difference between the list example and the set example is that the Set datatype is relevant only through the operations that are possible on Sets, whereas lists are relevant not just because the standard library already provides many functions acting on them, but because their structure itself is relevant.
As a sidenote, one might object that actually the inductive structure of lists, which is so fundamental to writing functions whose termination and behaviour are easy to reason about, is captured abstractly by two functions that consume lists: foldr and foldl. Given these two basic list operators, most functions do not need to inspect the structure of a list at all, so it could be argued that lists too could be kept abstract. This argument generalizes to many other similar structures, such as all Traversable structures, all Foldable structures, etc. However, it is nigh impossible to capture all possible recursion patterns on lists, and in fact many functions aren't recursive at all. Given only foldr and foldl, writing head, for example, would still be possible, though quite tedious:
head xs = fromJust $ foldl (\b x -> maybe (Just x) Just b) Nothing xs
-- (fromJust is from Data.Maybe; the fold keeps the first element it sees)
We're much better off just giving away the internal structure of the list.
One final point is that sometimes the actual representation of a datatype isn't relevant to the outside world, because, say, it is optimised in some way and isn't the canonical representation, or because there isn't a single "canonical" representation. In these cases, you'll want to keep your datatype abstract, but offer "views" of it, which do provide concrete representations that can be pattern matched on.
One example would be if you wanted to define a Complex datatype of complex numbers, where both Cartesian and polar forms can be considered canonical. In this case, you would keep Complex abstract, but export two views, i.e. functions polar and cartesian, returning a magnitude/angle pair or a pair of Cartesian coordinates, respectively.
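A sketch of what that might look like (all names here are made up for illustration):

module Complex (Complex, mkCartesian, mkPolar, cartesian, polar) where

-- The representation is hidden; Cartesian is chosen internally,
-- but callers cannot tell.
data Complex = Complex !Double !Double -- real and imaginary parts

mkCartesian :: Double -> Double -> Complex
mkCartesian = Complex

mkPolar :: Double -> Double -> Complex
mkPolar r theta = Complex (r * cos theta) (r * sin theta)

-- Two views of the same abstract value, each pattern-matchable as a pair.
cartesian :: Complex -> (Double, Double)
cartesian (Complex x y) = (x, y)

polar :: Complex -> (Double, Double)
polar (Complex x y) = (sqrt (x*x + y*y), atan2 y x)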
Well, the rule is pretty simple: If it's easy to construct wrong values by using the actual constructors, then don't allow them to be used directly, but instead provide smart constructors. This is the path followed by some data structures like Map and Set, which are easy to get wrong.
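A minimal illustration of the smart-constructor idea (a hypothetical NonEmpty type; the real constructor stays unexported):

module NonEmptyList (NonEmpty, mkNonEmpty, toList) where

newtype NonEmpty a = NonEmpty [a]

-- The only way for clients to build a NonEmpty, so the invariant
-- "the list is never empty" cannot be violated from outside.
mkNonEmpty :: [a] -> Maybe (NonEmpty a)
mkNonEmpty [] = Nothing
mkNonEmpty xs = Just (NonEmpty xs)

toList :: NonEmpty a -> [a]
toList (NonEmpty xs) = xs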
Then there are the types for which it's impossible or hard to construct inconsistent/wrong values, either because the type doesn't allow that at all or because you would need to introduce bottoms. The length-indexed list type (commonly called Vec) and most monads are examples of that.
Ultimately this is your own decision. Put yourself into the user's perspective and make the tradeoff between convenience and safety. If there is no tradeoff, then always expose the constructors. Otherwise your library users will hate you for the unnecessary opacity.
If the data type serves a simple purpose (like Maybe a) and no (explicit or implicit) assumptions about the data type can be violated by directly constructing a value via the data constructors, I would expose the constructors.
On the other hand, if the data type is more complex (like a balanced tree) and/or its internal representation is likely to change, I usually hide the constructors.
When using a package, there's an unwritten rule that the interface exposed by a non-internal module should be "safe" to use on the given data type. Considering the balanced tree example, exposing the data constructors allows one to (accidentally) construct an unbalanced tree, and so the assumed runtime guarantees for searching the tree etc might be violated.
If the type is used to represent values with a canonical definition and representation (many mathematical objects fall into this category), and it's not possible to construct "invalid" values using the type, then you should expose the constructors.
For example, if you're representing something like two dimensional points with your own type (including a newtype), you might as well expose the constructor. The reality is that a change to this datatype is not going to be a change in how 2d points are represented, it's going to be a change in your need to use 2d points (maybe you're generalising to 3d space, maybe you're adding a concept of layers, or whatever), and is almost certain to need attention in the parts of the code using values of this type no matter what you do.[1]
A complex type representing something specific to your application or field is quite likely to undergo changes to the representation while continuing to support similar operations. Therefore you only want other modules depending on the operations, not on the internal structure. So you shouldn't expose the constructors.
Other types represent things with canonical definitions but not canonical representations. Everyone knows the properties expected of maps and sets, but there are lots of different ways of representing values that support those properties. So you again only want other modules depending on the operations they support, not on the particular representations.
Some types, whether or not they are simple with canonical representations, allow the construction of values that don't represent a valid value of the abstract concept the type is supposed to capture. A simple example would be a type representing a self-balancing binary search tree: client code with access to the constructors could easily construct invalid trees. Exposing the constructors then means either that you must assume values passed in from outside may be invalid, and make something sensible happen even for bizarre values, or that it's the responsibility of the programmers working with your interface to ensure they don't violate any assumptions. It's usually better to just keep such types from being constructed directly outside your module.
Basically it comes down to the concept your type is supposed to represent. If your concept maps in a very simple and obvious[2] way onto values of some data type that isn't "more inclusive" than the concept (that is, there are no needed invariants the compiler cannot check), then the concept is pretty much "the same" as the data type, and exposing its structure is fine. If not, then you probably need to keep the structure hidden.
[1] A likely change though would be to change which numeric type you're using for the coordinate values, so you probably do have to think about how to minimise the impact of such changes. That's pretty orthogonal to whether or not you expose the constructors though.
[2] "Obvious" here meaning that if you asked 10 people independently to come up with a data type representing the concept they would all come back with the same thing, modulo changing the names.
I would propose a different, noticeably more restrictive rule than most people. The central criterion would be:
Do you guarantee that this type will never, ever change? If so, exposing the constructors might be a good idea. Good luck with that, though!
But the types for which you can make that guarantee tend to be very simple, generic "foundation" types like Maybe, Either or [], which one could arguably write once and then never revisit again.
Though even those can be questioned, because they do get revisited from time to time; there are people who have used Church-encoded versions of Maybe and List in various contexts for performance reasons, e.g.:
{-# LANGUAGE RankNTypes #-}
newtype Maybe' a = Maybe' { elimMaybe' :: forall r. r -> (a -> r) -> r }
nothing = Maybe' $ \z k -> z
just x = Maybe' $ \z k -> k x
newtype List' a = List' { elimList' :: forall r. (a -> r -> r) -> r -> r }
nil = List' $ \k z -> z
cons x xs = List' $ \k z -> k x (elimList' xs k z) -- the eliminator takes the list first
These two examples highlight something important: you can replace the Maybe' type's implementation shown above with any other implementation as long as it supports the following three functions:
nothing :: Maybe' a
just :: a -> Maybe' a
elimMaybe' :: Maybe' a -> r -> (a -> r) -> r
...and the following laws:
elimMaybe' nothing  z f == z
elimMaybe' (just x) z f == f x
And this technique can be applied to any algebraic data type. Which to me says that pattern matching against concrete constructors is just insufficiently abstract; it doesn't really gain you anything that you can't get out of the abstract constructors + destructor pattern, and it loses implementation flexibility.
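For instance, a consumer that would normally pattern match can be written against the eliminator alone (a small made-up helper):

-- Works for any implementation of Maybe' satisfying the laws above.
fromMaybe' :: a -> Maybe' a -> a
fromMaybe' def m = elimMaybe' m def id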

Are new vectors created even if the old ones aren't used anymore?

This question is about the Data.Vector package.
Given that I'll never use the old value of a cell once the cell is updated, will the update operation always create a new vector reflecting the update, or will it be done as an in-place update?
Note: I know about Data.Vector.Mutable
No, but something even better can happen.
Data.Vector is built using "stream fusion". This means that if the sequence of operations that you are performing to build up and then tear down the vector can be fused away, then the Vector itself will never even be constructed and your code will turn into an optimized loop.
Fusion works by turning code that would build vectors into code that builds up and tears down streams and then puts the streams into a form that the compiler can see to perform optimizations.
So code that looks like
import Prelude hiding (map, take, zipWith, replicate, sum)
import Data.Vector

foo :: Int
foo = sum as
  where
    as, bs, cs, ds, es :: Vector Int
    as = map (*100) bs
    bs = take 10 cs
    cs = zipWith (+) (generate 1000 id) ds
    ds = cons 1 $ cons 3 $ map (+2) es
    es = replicate 24000 0
despite appearing to build up and tear down quite a few very large vectors, can fuse all the way down to an inner loop that calculates and adds just 10 numbers.
Doing what you proposed is tricky, because it requires that you know no references to a term exist anywhere else, which imposes a cost on any attempt to copy a reference into an environment. Moreover, it interacts rather poorly with laziness: you would need to attach little affine addenda to the thunks you conspicuously haven't evaluated yet, and doing that in a multithreaded environment is scarily race-prone and hard to get right.
Well, how exactly should the compiler see that "the old vector is not used anywhere"? Say we have a function that changes a vector:
changeIt :: Vector Int -> Int -> Vector Int
changeIt vec n = vec // [(0,n)]
Just from this definition, the compiler cannot assume that vec is the only reference to the vector in question. We would have to annotate the function so that it can only be used in this way, which Haskell doesn't support (but Clean does, as far as I know).
So what can we do in Haskell? Let us say we have another silly function:
changeItTwice vec n = changeIt (changeIt vec n) (n+1)
Now GHC could inline changeIt, and indeed "see" that no reference to the intermediate structure escapes. But typically, you would use this information to not produce that intermediate data structure, instead directly generating the end result!
This is a pretty common optimization (for lists, for example, there is fusion), and I think it plays pretty much exactly the role you have in mind: limiting the number of times a data structure needs to be copied. Whether or not this approach is more flexible than in-place updates is up for debate, but you can definitely recover a lot of performance without having to break abstraction by annotating uniqueness properties.
(However, I think that Vector currently does not, in fact, perform this specific optimization. Might need a few more optimizer rules...)
IMHO this is impossible in general, as the GHC garbage collector could go haywire if you mutated an object behind its back (even one that is no longer used). The object may have been promoted to an older generation, and mutation could introduce a pointer into a younger generation; if the younger generation is then collected, the pointed-to object may move, leaving the pointer invalid.
AFAIK, all mutable objects in Haskell are located on a special heap that is treated differently by the GC, so that such problems can't occur.
Not necessarily. Data.Vector uses stream fusion, so depending on your use the vector may not be created at all and the program may compile to an efficient constant space loop.
This mostly applies to operations that transform the entire vector rather than just updating a single cell, though.
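If you do need to update individual cells efficiently without managing Data.Vector.Mutable yourself, one idiom worth knowing is Data.Vector's modify, which runs a destructive action on a copy of the vector (or in place, when the library can prove that is safe). A minimal sketch, with made-up names:

import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- At most one copy is made up front; the writes then happen
-- destructively on that copy, not once per update.
-- (Assumes the vector has at least two elements.)
setFirstTwo :: Int -> V.Vector Int -> V.Vector Int
setFirstTwo n vec = V.modify (\mv -> do
    MV.write mv 0 n
    MV.write mv 1 (n + 1)) vec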
