I'm looking for a way of storing graphs as strings. The strings are to be used as keys in a map, so that two topologically identical graphs will map to the same value in the map. Does anybody know of such an algorithm?
The nodes of the tree are labeled with duplicate labels being allowed.
The program is in java and an implementation in that would be neat, but any pointers to possible algorithms are appreciated.
if you have an algorithm that maps general graphs to strings, and so that two graphs map to the same string if and only if they are topologically equivalent, then you have an algorithm for GRAPH AUTOMORPHISM. Graph automorphism has no known polynomial-time algorithms. So you can't have (easily :) a polynomial-time algorithm that calculates the strings as you postulate them, because otherwise you'd have constructed a previously unknown and very efficient algorithm to graph automorphism.
This doesn't mean that it wouldn't be possible to solve the problem for your class of graphs; it just means that for the class of all graphs it's kind of difficult.
You may find the following question relevant...
Using finite automata as keys to a container
Basically, an automaton can be minimised using well-known algorithms from automata-theory textbooks. Hopcrofts is an example. There is precisely one minimal automaton that is equivalent to any given automaton. However, that minimal automaton may be represented in different ways. Constructing a safe canonical form is basically a matter of renumbering nodes and ordering the adjacency table using information that is significant in defining the automaton, and not by information that is specific to the representation.
The basic principle should extend to general graphs. Whether you can minimise your graphs depends on their semantics, but the basic idea of renumbering the nodes and sorting the adjacency list still applies.
Other answers here assume things about your graphs - for example that the nodes have unique labels that can be ordered and which are significant for the semantics of your graphs, that can be used to identify the nodes in an adjacency matrix or list. This simply won't work if you're interested in morphims of unlabelled graphs, for instance. Different ways of numbering the nodes (and thus ordering the adjacency list) will result in different canonical forms for equivalent graphs that just happen to be represented differently.
As for the renumbering etc, an approach is to borrow and adapt principles from automata minimisation algorithms. Basically...
Create a vector of blocks (sets of nodes). Initially, populate this with one block per class of nodes (ie per distinct node annotation). The modification here is that we order these by annotation details (not by representation-specific node IDs).
For each class (annotation) of edges in order, evaluate each block. If each node in the block can follow the current edge-type to reach the same set of next blocks, leave it untouched. Otherwise, split it as necessary to get maximal blocks that achieve this objective. Keep these split blocks clustered together in the vector (preserve the existing ordering, just refine it a bit), and order the split blocks based on a suitable ordering of the next-block sets. For example, use bitvectors as long as the current vector of blocks, with a set bit for each block reachable by following the current edge type. To order the bitvectors, treat them as big integers.
EDIT - I forgot to mention - in the second bullet, as soon as you split a block, you restart with the first block in the vector and first edge annotation. Obviously, a naive implementation will be slow, so take the principle and use it to adapt Hopcrofts minimisation algorithm.
If you end up with blocks that have multiple nodes in them, those nodes are equivalent. Whether that means they can be merged or not depends on your semantics, but the relative ordering of nodes within each such block clearly doesn't matter.
If dealing with graphs that can be minimised (e.g. automaton digraphs) I suspect it's best to minimise first, though I still haven't got around to implementing this myself.
The key thing is, of course, ensuring that your renumbering is sensitive only to the significant details of the graph - its structure and annotations - and not the things that are only there so that you can construct a representation such as node IDs/addresses etc.
Once you have the blocks ordered, deriving a canonical form should be easy.
gSpan introduced the 'minimum DFS code' which encodes graphs such that if two graphs have the same code, they must be isomorphic. gSpan has implementations in C++ and Java.
A common way to do this is using Adjacency lists
Beside an Adjacency list, there are adjacency matrices. Which one you choose should depend on which you use to implement your Graph class (adjacency lists are usually the better choice, but they both have strengths and weaknesses). If you have a totally different implementation of Graph, consider using one of these, as it makes many graph algorithms very easy to implement.
One other option is, if possible, overriding hashCode() and equals() on the Graph class and use the actual graph object as the key rather than converting to a string.
E: overriding the hashCode() and equals() is the route I would take if some vertices are not uniquely labeled. As noted in the comments, this can be expensive, but I think it would depend on the implementation of the Graph class.
If equals() is too expensive, then you should use an adjacency list or matrix, but don't just use the node names. You have to carefully specify exactly what it is that identifies individual graphs and vertices (and therefore what would make them equal), and then make your string representation of the adjacency list use those properties instead of the node names. I'd suggest you write this specification of your graph equals operation down.
Related
In Haskell or some other functional programming language, how would you implement a heuristic search?
Take as an example search space, the nine-puzzle, that is a 3x3 grid with 8 tiles and 1 hole, and you move tiles into the hole until you have correctly assembled a picture. The heuristic is the "Manhattan heuristic", which evaluates a board position adding up the distance each tile is from its target position, taking as the distance the number of squares horizontally plus the number of squares vertically each tile needs to be moved to get to the correct location.
I have been reading John Hughes paper on pretty printing as I know that pretty printer back-tracks to find better solutions. I am trying to understand how to generalise a heuristic search along these lines.
===
Note that my ultimate aim here is not to write a solver for the 9-puzzle, but to learn some general techniques for writing efficient heuristic searches in FP languages. I am also interested to learn if there is code that can be generalised and re-used across a wider class of such problems, rather than solving any specific problem.
For example, a search space can be characterised by a function that maps a State to a List of States together with some 'operation' that describes how one state is transitioned into another. There could also be a goal function, mapping a State to Bool, indicating when a goal State has been reached. And of course, the heuristic function mapping a State to a Number reflecting how well it is estimated to score. Other descriptions of the search are possible.
I don't think it's necessarily very specific to FP or Haskell (unless you utilize lists as "multiple possibility" monads, as in Learn You A Haskell For Great Good).
One way to do it would be by writing a recursive function taking the following:
the current state (that is the board configuration)
possibly some path metadata, e.g., the number of steps from the initial configuration (which is just the recursion depth), or a memoization-map of all the states already considered
possibly some decision, metadata, e.g., a pesudo-random number generator
Within each recursive call, the function would take the state, and check if it is the required result. If not it would
if it uses a memoization map, check if a choice was already considered
If it uses a recursive-step count, check whether to pursue the choices further
If it decides to recursively call itself on the possible choices emanating from this state (e.g., if there are different tiles which can be pushed into the hole), it could do so in the order based on the heuristic (or possibly pseudo-randomly based on the order based on the heuristic)
The function would return whether it succeeded, and, if they are used, updated versions of the memoization map and/or pseudo-random number generator.
I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, which exhibit O(1) updates (as well as appending, prepending, counting and slicing), are based on the Array Mapped Trie algorithm. The Vector data-structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data-structure that I know of is presented by the "persistent-vector" package.
This algorithm is very young, it was only first presented in the year 2000, which might be the reason why not so many people have ever heard about it. But the thing turned out to be such a universal solution that it got adapted for Hash-tables soon after. The adapted version of this algorithm is called Hash Array Mapped Trie. It is as well used in Clojure and Scala to implement the Set and Map data-structures. It is also more ubiquitous in Haskell with packages like "unordered-containers" and "stm-containers" revolving around it.
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
Haskell is a (nearly) pure functional language, so any data structure you update will need to make a new copy of the structure, and re-using the data elements is close to the best you can do. Also, the new list would be lazily evaluated and typically only the spine would need to be created until you need the data. If the number of updates is small compared to the number of elements, you could make a difference list that checks a sparse set of updates first, and only then looks in the original vector.
I need to cluster a large number of strings using ELKI based on the Edit Distance / Levenshtein Distance. Since the data set is too large, I'd like to avoid file based precomputed distance matrices. How can I
(a) load string data in ELKI from a file (only "Labels")?
(b) implement a distance function accessing the labels (extend AbstractDBIDDistanceFunction, but how to get the labels?)
Some code snippets or example input files would be helpful.
It's actually pretty straightforward:
A) write a Parser that is adequate for your input file format (why try to reuse a parser written for numerical vectors with labels?), probably subclassing AbstractStreamingParser, producing a relation of the desired data type (probably you can just use String. If you want to be a bit more general TokenSequence may be a more appropriate concept for these distances. Strings are just the simplest case.
B) implement a DistanceFunction based on this vector type instead of DBIDs, i.e. a PrimitiveDistanceFunction<String>. Again, subclassing AbstractPrimitiveDistanceFunction may be the easiest thing to do.
For performance reasons, you may also want to look into indexing algorithms to retrieve e.g. the k most similar strings efficiently. I'm not sure which index structures exist for string edit distance and levenshtein distance.
A colleague has a student that apparently has some working token edit distances, but I have not seen or reviewed the code yet. As he is processing log files, he will probably be using a token based approach instead of characters.
I'm looking for a functional data structure that represents finite bijections between two types, that is space-efficient and time-efficient.
For instance, I'd be happy if, considering a bijection f of size n:
extending f with a new pair of elements has complexity O(ln n)
querying f(x) or f^-1(x) has complexity O(ln n)
the internal representation for f is more space efficient than having 2 finite maps (representing f and its inverse)
I am aware of efficient representation of permutations, like this paper, but it does not seem to solve my problem.
Please have a look at my answer for a relatively similar question. The provided code can handle general NxM relations, but also be specialized to just bijections (just as you would for a binary search tree).
Pasting the answer here for completeness:
The simplest way is to use a pair of unidirectional maps. It has some cost, but you won't get much better (you could get a bit better using dedicated binary trees, but you have a huge complexity cost to pay if you have to implement it yourself). In essence, lookups will be just as fast, but addition and deletion will be twice as slow. Which isn't so bad for a logarithmic operation. Another advantage of this technique is that you can use specialized maps types for the key or value type if you have one available. You won't get as much flexibility with a specific generalist data structure.
A different solution is to use a quadtree (instead of considering a NxN relation as a pair of 1xN and Nx1 relations, you see it as a set of elements in the cartesian product (Key*Value) of your types, that is, a spatial plane), but it's not clear to me that the time and memory costs are better than with two maps. I suppose it needs to be tested.
Although it doesn't satisfy your third requirement, bimaps seem like the way to go. (They just make two finite maps, one in each direction, convenient to use.)
In C++ and other languages, add-on libraries implement a multi-index container, e.g. Boost.Multiindex. That is, a collection that stores one type of value but maintains multiple different indices over those values. These indices provide for different access methods and sorting behaviors, e.g. map, multimap, set, multiset, array, etc. Run-time complexity of the multi-index container is generally the sum of the individual indices' complexities.
Is there an equivalent for Haskell or do people compose their own? Specifically, what is the most idiomatic way to implement a collection of type T with both a set-type of index (T is an instance of Ord) as well as a map-type of index (assume that a key value of type K could be provided for each T, either explicitly or via a function T -> K)?
I just uploaded IxSet to hackage this morning,
http://hackage.haskell.org/package/ixset
ixset provides sets which have multiple indexes.
ixset has been around for a long time as happstack-ixset. This version removes the dependencies on anything happstack specific, and is the new official version of IxSet.
Another option would be kdtree:
darcs get http://darcs.monoid.at/kdtree
kdtree aims to improve on IxSet by offering greater type-safety and better time and space usage. The current version seems to do well on all three of those aspects -- but it is not yet ready for prime time. Additional contributors would be highly welcomed.
In the trivial case where every element has a unique key that's always available, you can just use a Map and extract the key to look up an element. In the slightly less trivial case where each value merely has a key available, a simple solution it would be something like Map K (Set T). Looking up an element directly would then involve first extracting the key, indexing the Map to find the set of elements that share that key, then looking up the one you want.
For the most part, if something can be done straightforwardly in the above fashion (simple transformation and nesting), it probably makes sense to do it that way. However, none of this generalizes well to, e.g., multiple independent keys or keys that may not be available, for obvious reasons.
Beyond that, I'm not aware of any widely-used standard implementations. Some examples do exist, for example IxSet from happstack seems to roughly fit the bill. I suspect one-size-kinda-fits-most solutions here are liable to have a poor benefit/complexity ratio, so people tend to just roll their own to suit specific needs.
Intuitively, this seems like a problem that might work better not as a single implementation, but rather a collection of primitives that could be composed more flexibly than Data.Map allows, to create ad-hoc specialized structures. But that's not really helpful for short-term needs.
For this specific question, you can use a Bimap. In general, though, I'm not aware of any common class for multimaps or multiply-indexed containers.
I believe that the simplest way to do this is simply with Data.Map. Although it is designed to use single indices, when you insert the same element multiple times, most compilers (certainly GHC) will make the values place to the same place. A separate implementation of a multimap wouldn't be that efficient, as you want to find elements based on their index, so you cannot naively associate each element with multiple indices - say [([key], value)] - as this would be very inefficient.
However, I have not looked at the Boost implementations of Multimaps to see, definitively, if there is an optimized way of doing so.
Have I got the problem straight? Both T and K have an order. There is a function key :: T -> K but it is not order-preserving. It is desired to manage a collection of Ts, indexed (for rapid access) both by the T order and the K order. More generally, one might want a collection of T elements indexed by a bunch of orders key1 :: T -> K1, .. keyn :: T -> Kn, and it so happens that here key1 = id. Is that the picture?
I think I agree with gereeter's suggestion that the basis for a solution is just to maintiain in sync a bunch of (Map K1 T, .. Map Kn T). Inserting a key-value pair in a map duplicates neither the key nor the value, allocating only the extra heap required to make a new entry in the right place in the index. Inserting the same value, suitably keyed, in multiple indices should not break sharing (even if one of the keys is the value). It is worth wrapping the structure in an API which ensures that any subsequent modifications to the value are computed once and shared, rather than recomputed for each entry in an index.
Bottom line: it should be possible to maintain multiple maps, ensuring that the values are shared, even though the key-orders are separate.