I want to implement an AST in Haskell. I need a parent reference so it seems impossible to use a functional data structure. I've seen the following in an article. We define a node as:
type Tree = Node -> Node
Node allows us to get attribute by key of type Key a.
Is there anything to read about such a pattern? Could you give me some further links?
If you want a pure data structure with cyclic self-references, then as delnan says in the comments the usual term for that is "tying the knot". Searching for that term should give you more information.
Do note that data structures built by tying the knot are difficult (or impossible) to "update" in the usual manner--with a non-cyclic structure you can keep pieces of the original when building a new structure based on it, but changing any piece of a cycle requires you to rebuild the entire cycle as well. Depending on what you're doing, this may or may not be a problem, of course.
Related
From the bird's view, my question is: Is there a universal mechanism for as-is data serialization in Haskell?
Introduction
The origin of the problem does not root in Haskell indeed. Once, I tried to serialize a python dictionary where a hash function of objects was quite heavy. I found that in python, the default dictionary serialization does not save the internal structure of the dictionary but just dumps a list of key-value pairs. As a result, the de-serialization process is time-consuming, and there is no way to struggle with it. I was certain that there is a way in Haskell because, at my glance, there should be no problem transferring a pure Haskell type to a byte-stream automatically using BFS or DFS. Surprisingly, but it does not. This problem was discussed here (citation below)
Currently, there is no way to make HashMap serializable without modifying the HashMap library itself. It is not possible to make Data.HashMap an instance of Generic (for use with cereal) using stand-alone deriving as described by #mergeconflict's answer, because Data.HashMap does not export all its constructors (this is a requirement for GHC). So, the only solution left to serialize the HashMap seems to be to use the toList/fromList interface.
Current Problem
I have quite the same problem with Data.Trie bytestring-trie package. Building a trie for my data is heavily time-consuming and I need a mechanism to serialize and de-serialize this tire. However, it looks like the previous case, I see no way how to make Data.Trie an instance of Generic (or, am I wrong)?
So the questions are:
Is there some kind of a universal mechanism to project a pure Haskell type to a byte string? If no, is it a fundamental restriction or just a lack of implementations?
If no, what is the most painless way to modify the bytestring-trie package to make it the instance of Generic and serialize with Data.Store
There is a way using compact regions, but there is a big restriction:
Our binary representation contains direct pointers to the info tables of objects in the region. This means that the info tables of the receiving process must be laid out in exactly the same way as from the original process; in practice, this means using static linking, using the exact same binary and turning off ASLR. This API does NOT do any safety checking and will probably segfault if you get it wrong. DO NOT run this on untrusted input.
This also gives insight into universal serialization is not possible currently. Data structures contain very specific pointers which can differ if you're using different binaries. Reading in the raw bytes into another binary will result in invalid pointers.
There is some discussion in this GitHub issue about weakening this requirement.
I think the proper way is to open an issue or pull request upstream to export the data constructors in the internal module. That is what happened with HashMap which is now fully accessible in its internal module.
Update: it seems there is already a similar open issue about this.
I'm a newbie Haskell programmer with imperative background.
I'm writing a program that parses an abstract syntax tree (or rather a graph) that has cycles. (This is actually GCC's Generic AST). I'm designing data types for representing this graph, but I'm facing difficulties:
Firstly, there are lots of nasty cycles in GCC AST. For example type declarations refer to the actual type descriptor which refers back to the type's declaration (and this declaration may sometimes be different from the original, sometimes it is just a reference to an identifier node). All tree nodes reference a so-called context node (a form of parent reference). However, this context node (for some node X) often is not the same as the node that originally referred to X. For example a declaration for some builtin function may be found from a C++ namespace node, but the function declaration's context points to a translation unit declaration. Perhpas this makes using zippers impossible? Also I can't just ignore these context nodes, because they are useful for finding out the scope of declarations.
So, when designing the data structure I should take into account the fact, that sometimes I don't know what kind of node I will be dealing with at run time. However, when I'm certain that a node will have a known type, I would like to be able to reflect this on type level and take advantage of static type checking (for example function's result type is always a type, not an integer constant, but it might be integer or pointer type, etc.).
Secondly, GCC tree is a hierarchical data type in object-oriented sense. All nodes have common information, such as what kind of node they are. All declarations have a name and many flags, and variable declarations have type information in addition. Many nodes have so much data that it would be incovenient to access this information through pattern matching only. So I most likely will be using accessor functions (and type classes to provide an uniform interface regardless of the node type).
Thirdly, I would like my graph to be purely functional, but I don't know how to build it. My input is text, which has a section for every node. Nodes are identified by unique ids, and there are lots of forward- and self-references.
So, with my background, I sense that I'm trying to force my Haskell interface to an imperative form. So I'm asking you for some concrete advice to guide me in designing my data types, their interface and how to build the graph.
So far I have decided that I will not be tying the knot. It would prevent me from doing transformations on the tree (which IMHO deserves some transformations). It also would make it hard to write this tree back to disc, if I want to do that some day.
EDIT:
Here is a sample of my input. It is in YAML, but the format is not yet set to stone (I'm generating this data my self from within GCC).
http://sange.fi/~aura/test.yaml The example contains the global namespace with one function declaration for (int main (int argc, char *argv[])).
Thanks in advance!
I was thinking about writing a browser in haskell. A central data structure will be a mutable tree representing the document. Apart from using a tree composing entirely of iorefs, is there a better solution?
I am hoping to avoid something like this: data DomNode = DomNode TagName NodeProperties (IORef DomNode) [IORef DomNode] (tag, properties, parent, children)
The problem is that javascript can hold onto references of nodes in the tree, and it can mutate (add children, modify properties) any node it has a reference to, as well as traverse to it's parent.
Edit:
I realized you would need to use mutable state somehow - because you can hold onto a reference to a node that is deleted from the tree, or moved in the tree. If you referred to the element via something based on the structure of the tree, this reference will be invalid.
It is natural for javascript to operate with mutable references, so you'll have to introduce them sooner or later (not necessarily IORefs, maybe some kind of lookup table living in state monad).
If most of operations on DOM will be performed from javascript, then it is better to select data structure natural for it.
Don't try to use pure data structure only for purity itself. Otherwise you'll finish with hand-made emulation of RAM :) Your task looks imperative for me, so why not to use all the imperative features haskell provides?
I don't think zippers, as Niklas B. suggested, can help you a lot. Usually they have only one "cursor", where you can mutate. (AFAIK in theory any number of cursors are possible, but in practice it is mostly unusable)
The usual Haskell approach is to use an immutable tree, and have operations such as addChild and so forth return a new, modified tree, rather than touching the existing tree.
Without knowing what you're actually trying to do with this tree, I would suggest that this is probably the simplest and easiest approach.
Problem specification:
I am currently searching for a elegant and/but efficient solution to a problem that i guess is quite common. Consider the following situation:
I defined a fileformat based on a BTree that is defined (in a simplified way) like this:
data FileTree = FileNode [Key] [FileOffset]
| FileLeaf [Key] [Data]
Reading and writing this from a file to a lazy data structure is implemented and works just fine. This will result in a instance of:
data MemTree = MemNode [Key] [MemTree]
| MemLeaf [Key] [Data]
Now my goal is to have a generic function updateFile :: FilePath -> (MemTree -> MemTree) -> IO () that will read in the FileTree and convert it into a MemTree, apply the MemTree -> MemTree function and write back the changes to the tree structure. The problem is that the FileOffsets have to be conserved somehow.
I have two approaches to this problem. Both of them lack in elegance and/or efficiency:
Approach 1: Extend MemTree to contain the offsets
This approach extends the MemTree to contain the offsets:
data MemTree = MemNode [Key] [(MemTree, Maybe FileOffset)]
| MemNode [Key] [Data]
The read function would then read in the FileTree and stores the FileOffset alongside the MemTree reference. Writing will checks if a reference already has an associated offset and if it does it just uses it.
Pros: easy to implement, no overhead to find the offset
Cons: exposes internal to the user who is responsible to set the offset to Nothing
Approach 2: Store offsets in a secondary structure
Another way to attack this problem is to read in the FileTree and create a StableName.Map that holds onto the FileOffsets. That way (and if i understand the semantics of StableName correctly) it should be possible to take the final MemTree and lookup the StableName of each node in the the StableName.Map. If there is an entry the node is clean and doesn't have to be written again.
Pros: doesn't expose the internals to the user
Cons: involves overhead for lookups in the map
Conclusion
These are the two approaches i can think of. The first one should be more efficient, the second one is more pleasant to the eye. I'd like your comments on my ideas, maybe someone even has a better approach in mind?
[Edit] Reasonal
There are two reasons i am searching for a solution like this:
On the one hand you should try to handle errors before they arise by using the type system. The aforementioned user is of course the designer of the next layer in the system (ie me). By working on the pure tree representation some kinds of bugs won't be able to happen. All changes to the tree in the file should be in one place. That should make reasoning easier.
On the other hand i could just implement something like insert :: FilePath -> Key -> Value -> IO () and be done with it. But then i'll lose a very nice trait that comes free when i keep a (kind of a) log by updating the tree in place. Transactions (ie merging of several inserts) are just a matter of working on the same tree in memory and writing just the differences back to the file.
I think that the package Data.Generic.Diff may do exactly what you wanted. It references somebody's thesis for the idea of how it works.
I am very new at Haskell so I won't be showing code, but hopefully my explanation may help for a solution.
First, why not just expose only the MemTree to the user, since that is what they will update, and the FileTree can be kept completely hidden. That way, later, if you want to change this to be going to a database, for example, the user doesn't see any difference.
So, since the FileTree is hidden, why not just read it in when you are going to update, then you have the offsets, so do the update, and close the file again.
One problem with keeping the offsets is that it prevents another program from making any changes to the file, and in your case that may be fine, but I think as a general rule it is a bad design.
The main change, that I see, is that the MemTree shouldn't be lazy, since the file won't be staying open.
I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp for Hashable and Data.Map for Data.HashMap and add the appropraite imports, you get a hash based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.
The one detail which I can see which one probably ought to worry about, but which they don't worry about, is thread safety---their code is apparently not threadsafe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following #luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform : Node -> Node
the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function
recursiveTransform : Node -> Node
whose output 'looks the same' as the data structure as you would get by doing:
recursiveTransform Node originalChildren annotations =
myTransform Node recursivelyTransformedChildren annotations
where
recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if say, the Nodes were numbered before I got them, or I could otherwise change the definition of a Node. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.
If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
data Foo = Foo { ... }
expensive :: Foo -> Bar
then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.
data Foo = Foo { ..., memo :: Bar }
To make it easier to use, add a smart constructor to tie the knot.
makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo
Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.
It seems that stable-memo would be just what you needed (although I'm not sure if it can handle multiple threads):
Whereas most memo combinators memoize based on equality, stable-memo does it based on whether the exact same argument has been passed to the function before (that is, is the same argument in memory).
stable-memo only evaluates keys to WHNF.
This can be more suitable for recursive functions over graphs with cycles.
stable-memo doesn't retain the keys it has seen so far, which allows them to be garbage collected if they will no longer be used. Finalizers are put in place to remove the corresponding entries from the memo table if this happens.
Data.StableMemo.Weak provides an alternative set of combinators that also avoid retaining the results of the function, only reusing results if they have not yet been garbage collected.
There is no type class constraint on the function's argument.
stable-memo will not work for arguments which happen to have the same value but are not the same heap object. This rules out many candidates for memoization, such as the most common example, the naive Fibonacci implementation whose domain is machine Ints; it can still be made to work for some domains, though, such as the lazy naturals.
Ekmett just uploaded a library that handles this and more (produced at HacPhi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom type interning library that works over arbitrary data structures (including recursive ones). It uses WeakPtrs internally to maintain the table. However, it uses Ints to index the values to avoid structural equality checks, which means packing them into the data type, when what you want are apparently actually StableNames. So I realize this answers a related question, but requires modifying your data type, which you want to avoid...