How to define a tree-like DAG in Haskell - haskell

How do you define a directed acyclic graph (DAG) (of strings) (with one root) best in Haskell?
I especially need to apply the following two functions on this data structure as fast as possible:
Find all (direct and indirect) ancestors of one element (including the parents of the parents etc.).
Find all (direct) children of one element.
I thought of [(String,[String])] where each pair is one element of the graph consisting of its name (String) and a list of strings ([String]) containing the names of (direct) parents of this element. The problem with this implementation is that it's hard to do the second task.
You could also use [(String,[String])] again while the list of strings ([String]) contain the names of the (direct) children. But here again, it's hard to do the first task.
What can I do? What alternatives are there? Which is the most efficient way?
EDIT: One more remark: I'd also like it to be defined easily. I have to define the instance of this data type myself "by hand", so i'd like to avoid unnecessary repetitions.

Have you looked at the tree implemention in Martin Erwig's Functional Graph Library? Each node is represented as a context containing both its children and its parents. See the graph type class for how to access this. It might not be as easy as you requested, but it is already there, well-tested and easy-to-use. I have used it for more than a decade in a large project.

Related

Interview prep - How does one access nodes in binary search trees in the real world?

I'm preparing for the data structures portion of an interview. I see that there are already questions about applications of binary trees in the real world - What are the applications of binary trees?
My question is different - How does one actually access nodes in data structures based on trees?
I understand BFS, DFS and such for explicit node traversal, I get a sense this is not how the real world works. Do people write their own traversal / search algorithms, or do they rely on iterators and similar access methods provided by the database?
What is the name of a high level abstraction of accessing structured tree data, if any? I'm thinking of iterators, but am not sure if this is the right term.
I am looking for a way to intelligently converse on the subject.
What is the name of a high level abstraction of accessing structured
tree data, if any?
Iterator is something more suitable for linear data-structure like linked list or dynamic array. As tree is a hierarchical data-structure, parent(), child(), children(), ancestor(), descendent() are more suitable. For instance, if you want to implement a file system, you will have RootDirectory and to get all the sub-directories you might call like getAllSubDirectories() which are actually children of root directory. So the naming is actually domain specific.
class Directory {
private Integer someAttribute;
....
....
private Directory parentDirectory;
List<Directory> subDirectories;
}
You can abstract as iterator and can call iterator.hasNext() or iterator.next() on a hierarchical data-structure as well. So for example, for a binary search tree, you can abstract away an iterator which can iterate the tree in preorder/inorder/postorder traversal in both direction. See in-order successor.

complex datatype in c#

i need some help for my problem related to complex datatype in C#. I have following type of data and i want to save it in variable, but it will be performance efficient as i have to use it for search and there will be lot of data in it. Data sample is as follow:
ParentNode1
ChildNode1
ChildNode2
ChildNode3
ParentNode2
ParentNode3
ParentNode4
ChildNode1
ChildNode2
Node1
Node2
Node3
Nth level Node1
ChildNode3
ParentNode5
Above data is just a sample to show hierarchy of data. I'm not sure nested List, Dictionary, ienumerable or link list which will be best related to performance. Thanks
If you know that a search will take place at a single level, then you might want a list of lists: one list for each level. If your hierarchy has N levels, then you have N lists. Each one contains nodes that are:
ListNode
Data // string
ParentIndex // index of parent in the previous list
So to search level 4, you go to the list for that level and do your contains or regex test on each node in that level. If it matches, then the ParentIndex value will get you the parent, and its ParentIndex will get you the grandparent, etc.
This way, you don't have to worry about navigating the hierarchy except when you find a match, and you don't have to write nested or recursive algorithms to traverse the tree.
You could maintain your hierarchy, as well, with each top-level node containing a list of child nodes, and build this secondary list only for searching.

Haskell data structure that is efficient for swapping elements?

I am looking for a Haskell data structure that stores an ordered list of elements and that is time-efficient at swapping pairs of elements at arbitrary locations within the list. It's not [a], obviously. It's not Vector because swapping creates new vectors. Which data structure is efficient at this?
The most efficient implementations of persistent data structures, which exhibit O(1) updates (as well as appending, prepending, counting and slicing), are based on the Array Mapped Trie algorithm. The Vector data-structures of Clojure and Scala are based on it, for instance. The only Haskell implementation of that data-structure that I know of is presented by the "persistent-vector" package.
This algorithm is very young, it was only first presented in the year 2000, which might be the reason why not so many people have ever heard about it. But the thing turned out to be such a universal solution that it got adapted for Hash-tables soon after. The adapted version of this algorithm is called Hash Array Mapped Trie. It is as well used in Clojure and Scala to implement the Set and Map data-structures. It is also more ubiquitous in Haskell with packages like "unordered-containers" and "stm-containers" revolving around it.
To learn more about the algorithm I recommend the following links:
http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html
http://lampwww.epfl.ch/papers/idealhashtrees.pdf
Data.Sequence from the containers package would likely be a not-terrible data structure to start with for this use case.
Haskell is a (nearly) pure functional language, so any data structure you update will need to make a new copy of the structure, and re-using the data elements is close to the best you can do. Also, the new list would be lazily evaluated and typically only the spine would need to be created until you need the data. If the number of updates is small compared to the number of elements, you could make a difference list that checks a sparse set of updates first, and only then looks in the original vector.

UML metamodel: derived, derived union and subsetting

If you have ever worked with the metamodel of UML, you propably know the concepts of unions and subsets - As far as I understand it:
Attributes and associations of an element/class marked as "derived union" cannot be used directly. In more specific sub-classes, you can possibly find subsets of them that can be used, as long as they are not marked as derived unions themselves.
"derived" (without union) attributes and associations have also subsets in more specific classes, but unlike above you can use them directly without having to look for subsets in more specific classes
My questions:
Does this make sense or am I on the wrong track here?
What is the meaning of the "/" (slash) you can find in front of some
attributes/associations, that they have subsets in child-classes?
E.g. /general : Classifier[*]
An union property is a property that consists of multiple other properties. You can only understand the union, when you combine all subsets. A list is almost by definition an union.
Almost, because it might be uninitialized.
A derived union is a property requiring a specific collection of subsets. I would not talk about accessing them directly, but about how direct you can understand them. You need all information before you can make an interpretation.
The difference between the two that a derived union requires a specific subset and an union might have a subset and might have different subsets in different contexts. A very simple example being the fields on a form. All required fields show the definition of a derived union. All other fields are part of the complete union.
Derived unions can contain derived unions in their subsets. It directs the creation of classes and their instances, it does not make them impossible.
All derived features require other features to be known. Temperature can be read directly, but to know if someone has fever requires more knowledge, like the time of the day, the place of collecting information etc..
The slash implies that it is being derived.

Converting graph to canonical string

I'm looking for a way of storing graphs as strings. The strings are to be used as keys in a map, so that two topologically identical graphs will map to the same value in the map. Does anybody know of such an algorithm?
The nodes of the tree are labeled with duplicate labels being allowed.
The program is in java and an implementation in that would be neat, but any pointers to possible algorithms are appreciated.
if you have an algorithm that maps general graphs to strings, and so that two graphs map to the same string if and only if they are topologically equivalent, then you have an algorithm for GRAPH AUTOMORPHISM. Graph automorphism has no known polynomial-time algorithms. So you can't have (easily :) a polynomial-time algorithm that calculates the strings as you postulate them, because otherwise you'd have constructed a previously unknown and very efficient algorithm to graph automorphism.
This doesn't mean that it wouldn't be possible to solve the problem for your class of graphs; it just means that for the class of all graphs it's kind of difficult.
You may find the following question relevant...
Using finite automata as keys to a container
Basically, an automaton can be minimised using well-known algorithms from automata-theory textbooks. Hopcrofts is an example. There is precisely one minimal automaton that is equivalent to any given automaton. However, that minimal automaton may be represented in different ways. Constructing a safe canonical form is basically a matter of renumbering nodes and ordering the adjacency table using information that is significant in defining the automaton, and not by information that is specific to the representation.
The basic principle should extend to general graphs. Whether you can minimise your graphs depends on their semantics, but the basic idea of renumbering the nodes and sorting the adjacency list still applies.
Other answers here assume things about your graphs - for example that the nodes have unique labels that can be ordered and which are significant for the semantics of your graphs, that can be used to identify the nodes in an adjacency matrix or list. This simply won't work if you're interested in morphims of unlabelled graphs, for instance. Different ways of numbering the nodes (and thus ordering the adjacency list) will result in different canonical forms for equivalent graphs that just happen to be represented differently.
As for the renumbering etc, an approach is to borrow and adapt principles from automata minimisation algorithms. Basically...
Create a vector of blocks (sets of nodes). Initially, populate this with one block per class of nodes (ie per distinct node annotation). The modification here is that we order these by annotation details (not by representation-specific node IDs).
For each class (annotation) of edges in order, evaluate each block. If each node in the block can follow the current edge-type to reach the same set of next blocks, leave it untouched. Otherwise, split it as necessary to get maximal blocks that achieve this objective. Keep these split blocks clustered together in the vector (preserve the existing ordering, just refine it a bit), and order the split blocks based on a suitable ordering of the next-block sets. For example, use bitvectors as long as the current vector of blocks, with a set bit for each block reachable by following the current edge type. To order the bitvectors, treat them as big integers.
EDIT - I forgot to mention - in the second bullet, as soon as you split a block, you restart with the first block in the vector and first edge annotation. Obviously, a naive implementation will be slow, so take the principle and use it to adapt Hopcrofts minimisation algorithm.
If you end up with blocks that have multiple nodes in them, those nodes are equivalent. Whether that means they can be merged or not depends on your semantics, but the relative ordering of nodes within each such block clearly doesn't matter.
If dealing with graphs that can be minimised (e.g. automaton digraphs) I suspect it's best to minimise first, though I still haven't got around to implementing this myself.
The key thing is, of course, ensuring that your renumbering is sensitive only to the significant details of the graph - its structure and annotations - and not the things that are only there so that you can construct a representation such as node IDs/addresses etc.
Once you have the blocks ordered, deriving a canonical form should be easy.
gSpan introduced the 'minimum DFS code' which encodes graphs such that if two graphs have the same code, they must be isomorphic. gSpan has implementations in C++ and Java.
A common way to do this is using Adjacency lists
Beside an Adjacency list, there are adjacency matrices. Which one you choose should depend on which you use to implement your Graph class (adjacency lists are usually the better choice, but they both have strengths and weaknesses). If you have a totally different implementation of Graph, consider using one of these, as it makes many graph algorithms very easy to implement.
One other option is, if possible, overriding hashCode() and equals() on the Graph class and use the actual graph object as the key rather than converting to a string.
E: overriding the hashCode() and equals() is the route I would take if some vertices are not uniquely labeled. As noted in the comments, this can be expensive, but I think it would depend on the implementation of the Graph class.
If equals() is too expensive, then you should use an adjacency list or matrix, but don't just use the node names. You have to carefully specify exactly what it is that identifies individual graphs and vertices (and therefore what would make them equal), and then make your string representation of the adjacency list use those properties instead of the node names. I'd suggest you write this specification of your graph equals operation down.

Resources