Can Haskell functions be serialized? - haskell

The best way to do it would be to get the representation of the function (if it can be recovered somehow). Binary serialization is preferred for efficiency reasons.
I think there is a way to do it in Clean, because it would be impossible to implement iTask, which relies on that tasks (and so functions) can be saved and continued when the server is running again.
This must be important for distributed haskell computations.
I'm not looking for parsing haskell code at runtime as described here: Serialization of functions in Haskell.
I also need to serialize not just deserialize.

Unfortunately, it's not possible with the current ghc runtime system.
Serialization of functions, and other arbitrary data, requires some low level runtime support that the ghc implementors have been reluctant to add.
Serializing functions requires that you can serialize anything, since arbitrary data (evaluated and unevaluated) can be part of a function (e.g., a partial application).

No. However, the CloudHaskell project is driving home the need for explicit closure serialization support in GHC. The closest thing CloudHaskell has to explicit closures is the distributed-static package. Another attempt is the HdpH closure representation. However, both use Template Haskell in the way Thomas describes below.
The limitation is a lack of static support in GHC, for which there is a currently unactioned GHC ticket. (Any takers?). There has been a discussion on the CloudHaskell mailing list about what static support should actually look like, but nothing has yet progressed as far as I know.
The closest anyone has come to a design and implementation is Jost Berthold, who has implemented function serialisation in Eden. See his IFL 2010 paper "Orthogonal Serialisation for Haskell". The serialisation support is baked in to the Eden runtime system. (Now available as separate library: packman. Not sure whether it can be used with GHC or needs a patched GHC as in the Eden fork...) Something similar would be needed for GHC. This is the serialisation support Eden, in the version forked from GHC 7.4:
data Serialized a = Serialized { packetSize :: Int , packetData :: ByteArray# }
serialize :: a -> IO (Serialized a)
deserialize :: Serialized a -> IO a
So: one can serialize functions and data structures. There is a Binary instance for Serialized a, allowing you to checkpoint a long-running computation to file! (See Secion 4.1).
Support for such a simple serialization API in the GHC base libraries would surely be the Holy Grail for distributed Haskell programming. It would likely simplify the composability between the distributed Haskell flavours (CloudHaskell, MetaPar, HdpH, Eden and so on...)

Check out Cloud Haskell. It has a concept called Closure which is used to send code to be executed on remote nodes in a type safe manner.

Eden probably comes closest and probably deserves a seperate answer: (De-)Serialization of unevaluated thunks is possible, see https://github.com/jberthold/packman.
Deserialization is however limited to the same program (where program is a "compilation result"). Since functions are serialized as code pointers, previously unknown functions cannot be deserialized.
Possible usage:
storing unevaluated work for later
distributing work (but no sharing of new code)

A pretty simple and practical, but maybe not as elegant solution would be to (preferably have GHC automatically) compile each function into a separate module of machine-independent bytecode, serialize that bytecode whenever serialization of that function is required, and use the dynamic-loader or plugins packages, to dynamically load them, so even previously unknown functions can be used.
Since a module notes all its dependencies, those could then be (de)serialized and loaded too. In practice, serializing index numbers and attaching an indexed list of the bytecode blobs would probably be the most efficient.
I think as long as you compile the modules yourself, this is already possible right now.
As I said, it would not be very pretty though. Not to mention the generally huge security risk of de-serializing code from insecure sources to run in an unsecured environment. :-)
(No problem if it is trustworthy, of course.)
I’m not going to code it up right here, right now though. ;-)

Related

Serialization in Haskell

From the bird's view, my question is: Is there a universal mechanism for as-is data serialization in Haskell?
Introduction
The origin of the problem does not root in Haskell indeed. Once, I tried to serialize a python dictionary where a hash function of objects was quite heavy. I found that in python, the default dictionary serialization does not save the internal structure of the dictionary but just dumps a list of key-value pairs. As a result, the de-serialization process is time-consuming, and there is no way to struggle with it. I was certain that there is a way in Haskell because, at my glance, there should be no problem transferring a pure Haskell type to a byte-stream automatically using BFS or DFS. Surprisingly, but it does not. This problem was discussed here (citation below)
Currently, there is no way to make HashMap serializable without modifying the HashMap library itself. It is not possible to make Data.HashMap an instance of Generic (for use with cereal) using stand-alone deriving as described by #mergeconflict's answer, because Data.HashMap does not export all its constructors (this is a requirement for GHC). So, the only solution left to serialize the HashMap seems to be to use the toList/fromList interface.
Current Problem
I have quite the same problem with Data.Trie bytestring-trie package. Building a trie for my data is heavily time-consuming and I need a mechanism to serialize and de-serialize this tire. However, it looks like the previous case, I see no way how to make Data.Trie an instance of Generic (or, am I wrong)?
So the questions are:
Is there some kind of a universal mechanism to project a pure Haskell type to a byte string? If no, is it a fundamental restriction or just a lack of implementations?
If no, what is the most painless way to modify the bytestring-trie package to make it the instance of Generic and serialize with Data.Store
There is a way using compact regions, but there is a big restriction:
Our binary representation contains direct pointers to the info tables of objects in the region. This means that the info tables of the receiving process must be laid out in exactly the same way as from the original process; in practice, this means using static linking, using the exact same binary and turning off ASLR. This API does NOT do any safety checking and will probably segfault if you get it wrong. DO NOT run this on untrusted input.
This also gives insight into universal serialization is not possible currently. Data structures contain very specific pointers which can differ if you're using different binaries. Reading in the raw bytes into another binary will result in invalid pointers.
There is some discussion in this GitHub issue about weakening this requirement.
I think the proper way is to open an issue or pull request upstream to export the data constructors in the internal module. That is what happened with HashMap which is now fully accessible in its internal module.
Update: it seems there is already a similar open issue about this.

Why can't I replace libraries distributed with GHC? What would happen if I did?

I see in this answer and this one that "everything will break horribly" and Stack won't let me replace base, but it will let me replace bytestring. What's the problem with this? Is there a way to do this safely without recompiling GHC? I'm debugging a problem with the base libraries and it'd be very convenient.
N.B. when I say I want to replace base I mean with a modified version of base from the same GHC version. I'm debugging the library, not testing a program against different GHC releases.
Most libraries are collections of Haskell modules containing Haskell code. The meaning of those libraries is determined by the code in the modules.
The base package, though, is a bit different. Many of the functions and data types it offers are not implemented in standard Haskell; their meaning is not given by the code contained in the package, but by the compiler itself. If you look at the source of the base package (and the other boot libraries), you will see many operations whose complete definition is simply undefined. Special code in the compiler's runtime system implements these operations and exposes them.
For example, if the compiler didn't offer seq as a primitive operation, there would be no way to implement seq after-the-fact: no Haskell term that you can write down will have the same type and semantics as seq unless it uses seq (or one of the Haskell extensions defined in terms of seq). Likewise many of the pointer operations, ST operations, concurrency primitives, and so forth are implemented in the compiler themselves.
Not only are these operations typically unimplementable, they also are typically very strongly tied to the compiler's internal data structures, which change from one release to the next. So even if you managed to convince GHC to use the base package from a different (version of the) compiler, the most likely outcome would simply be corrupted internal data structures with unpredictable (and potentially disastrous) results -- race conditions, trashing memory, space leaks, segfaults, that kind of thing.
If you need several versions of base, just install several versions of GHC. It's been carefully architected so that multiple versions can peacefully coexist on a single machine. (And in particular installing multiple versions definitely does not require recompiling GHC or even compiling GHC a first time, which seems to be your main concern.)

What's so bad about Template Haskell?

It seems that Template Haskell is often viewed by the Haskell community as an unfortunate convenience. It's hard to put into words exactly what I have observed in this regard, but consider these few examples
Template Haskell listed under "The Ugly (but necessary)" in response to the question Which Haskell (GHC) extensions should users use/avoid?
Template Haskell considered a temporary/inferior solution in Unboxed Vectors of newtype'd values thread (libraries mailing list)
Yesod is often criticized for relying too much on Template Haskell (see the blog post in response to this sentiment)
I've seen various blog posts where people do pretty neat stuff with Template Haskell, enabling prettier syntax that simply wouldn't be possible in regular Haskell, as well as tremendous boilerplate reduction. So why is it that Template Haskell is looked down upon in this way? What makes it undesirable? Under what circumstances should Template Haskell be avoided, and why?
One reason for avoiding Template Haskell is that it as a whole isn't type-safe, at all, thus going against much of "the spirit of Haskell." Here are some examples of this:
You have no control over what kind of Haskell AST a piece of TH code will generate, beyond where it will appear; you can have a value of type Exp, but you don't know if it is an expression that represents a [Char] or a (a -> (forall b . b -> c)) or whatever. TH would be more reliable if one could express that a function may only generate expressions of a certain type, or only function declarations, or only data-constructor-matching patterns, etc.
You can generate expressions that don't compile. You generated an expression that references a free variable foo that doesn't exist? Tough luck, you'll only see that when actually using your code generator, and only under the circumstances that trigger the generation of that particular code. It is very difficult to unit test, too.
TH is also outright dangerous:
Code that runs at compile-time can do arbitrary IO, including launching missiles or stealing your credit card. You don't want to have to look through every cabal package you ever download in search for TH exploits.
TH can access "module-private" functions and definitions, completely breaking encapsulation in some cases.
Then there are some problems that make TH functions less fun to use as a library developer:
TH code isn't always composable. Let's say someone makes a generator for lenses, and more often than not, that generator will be structured in such a way that it can only be called directly by the "end-user," and not by other TH code, by for example taking a list of type constructors to generate lenses for as the parameter. It is tricky to generate that list in code, while the user only has to write generateLenses [''Foo, ''Bar].
Developers don't even know that TH code can be composed. Did you know that you can write forM_ [''Foo, ''Bar] generateLens? Q is just a monad, so you can use all of the usual functions on it. Some people don't know this, and because of that, they create multiple overloaded versions of essentially the same functions with the same functionality, and these functions lead to a certain bloat effect. Also, most people write their generators in the Q monad even when they don't have to, which is like writing bla :: IO Int; bla = return 3; you are giving a function more "environment" than it needs, and clients of the function are required to provide that environment as an effect of that.
Finally, there are some things that make TH functions less fun to use as an end-user:
Opacity. When a TH function has type Q Dec, it can generate absolutely anything at the top-level of a module, and you have absolutely no control over what will be generated.
Monolithism. You can't control how much a TH function generates unless the developer allows it; if you find a function that generates a database interface and a JSON serialization interface, you can't say "No, I only want the database interface, thanks; I'll roll my own JSON interface"
Run time. TH code takes a relatively long time to run. The code is interpreted anew every time a file is compiled, and often, a ton of packages are required by the running TH code, that have to be loaded. This slows down compile time considerably.
This is solely my own opinion.
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The API is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The reform proposal addresses some of this (making quotes more widely applicable).
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
It's undisciplined. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. For type classes, you can often formulate laws which instances should obey and clients can assume. If you use GHC's new generics feature to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". Template Haskell, on the other hand, is just macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.*
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
There's a general principle that you should use the right tool for the job and the smallest one that will suffice, and in that analogy Template Haskell is something like this. If there's a way to do it that's not Template Haskell, it's generally preferable.
The advantage of Template Haskell is that you can do things with it that you couldn't do any other way, and it's a big one. Most of the time the things TH is used for could otherwise only be done if they were implemented directly as compiler features. TH is extremely beneficial to have both because it lets you do these things, and because it lets you prototype potential compiler extensions in a much more lightweight and reusable way (see the various lens packages, for example).
To summarize why I think there are negative feelings towards Template Haskell: It solves a lot of problems, but for any given problem that it solves, it feels like there should be a better, more elegant, disciplined solution better suited to solving that problem, one which doesn't solve the problem by automatically generating the boilerplate, but by removing the need to have the boilerplate.
* Though I often feel that CPP has a better power-to-weight ratio for those problems that it can solve.
EDIT 23-04-14: What I was frequently trying to get at in the above, and have only recently gotten at exactly, is that there's an important distinction between abstraction and deduplication. Proper abstraction often results in deduplication as a side effect, and duplication is often a telltale sign of inadequate abstraction, but that's not why it's valuable. Proper abstraction is what makes code correct, comprehensible, and maintainable. Deduplication only makes it shorter. Template Haskell, like macros in general, is a tool for deduplication.
I'd like to address a few of the points dflemstr brings up.
I don't find the fact that you can't typecheck TH to be that worrying. Why? Because even if there is an error, it will still be compile time. I'm not sure if this strengthens my argument, but this is similar in spirit to the errors that you receive when using templates in C++. I think these errors are more understandable than C++'s errors though, as you'll get a pretty printed version of the generated code.
If a TH expression / quasi-quoter does something that's so advanced that tricky corners can hide, then perhaps it's ill-advised?
I break this rule quite a bit with quasi-quoters I've been working on lately (using haskell-src-exts / meta) - https://github.com/mgsloan/quasi-extras/tree/master/examples . I know this introduces some bugs such as not being able to splice in the generalized list comprehensions. However, I think that there's a good chance that some of the ideas in http://hackage.haskell.org/trac/ghc/blog/Template%20Haskell%20Proposal will end up in the compiler. Until then, the libraries for parsing Haskell to TH trees are a nearly perfect approximation.
Regarding compilation speed / dependencies, we can use the "zeroth" package to inline the generated code. This is at least nice for the users of a given library, but we can't do much better for the case of editing the library. Can TH dependencies bloat generated binaries? I thought it left out everything that's not referenced by the compiled code.
The staging restriction / splitting of compilation steps of the Haskell module does suck.
RE Opacity: This is the same for any library function you call. You have no control over what Data.List.groupBy will do. You just have a reasonable "guarantee" / convention that the version numbers tell you something about the compatibility. It is somewhat of a different matter of change when.
This is where using zeroth pays off - you're already versioning the generated files - so you'll always know when the form of the generated code has changed. Looking at the diffs might be a bit gnarly, though, for large amounts of generated code, so that's one place where a better developer interface would be handy.
RE Monolithism: You can certainly post-process the results of a TH expression, using your own compile-time code. It wouldn't be very much code to filter on top-level declaration type / name. Heck, you could imagine writing a function that does this generically. For modifying / de-monolithisizing quasiquoters, you can pattern match on "QuasiQuoter" and extract out the transformations used, or make a new one in terms of the old.
This answer is in response to the issues brought up by illissius, point by point:
It's ugly to use. $(fooBar ''Asdf) just does not look nice. Superficial, sure, but it contributes.
I agree. I feel like $( ) was chosen to look like it was part of the language - using the familiar symbol pallet of Haskell. However, that's exactly what you /don't/ want in the symbols used for your macro splicing. They definitely blend in too much, and this cosmetic aspect is quite important. I like the look of {{ }} for splices, because they are quite visually distinct.
It's even uglier to write. Quoting works sometimes, but a lot of the time you have to do manual AST grafting and plumbing. The [API][1] is big and unwieldy, there's always a lot of cases you don't care about but still need to dispatch, and the cases you do care about tend to be present in multiple similar but not identical forms (data vs. newtype, record-style vs. normal constructors, and so on). It's boring and repetitive to write and complicated enough to not be mechanical. The [reform proposal][2] addresses some of this (making quotes more widely applicable).
I also agree with this, however, as some of the comments in "New Directions for TH" observe, the lack of good out-of-the-box AST quoting is not a critical flaw. In this WIP package, I seek to address these problems in library form: https://github.com/mgsloan/quasi-extras . So far I allow splicing in a few more places than usual and can pattern match on ASTs.
The stage restriction is hell. Not being able to splice functions defined in the same module is the smaller part of it: the other consequence is that if you have a top-level splice, everything after it in the module will be out of scope to anything before it. Other languages with this property (C, C++) make it workable by allowing you to forward declare things, but Haskell doesn't. If you need cyclic references between spliced declarations or their dependencies and dependents, you're usually just screwed.
I've run into the issue of cyclic TH definitions being impossible before... It's quite annoying. There is a solution, but it's ugly - wrap the things involved in the cyclic dependency in a TH expression that combines all of the generated declarations. One of these declarations generators could just be a quasi-quoter that accepts Haskell code.
It's unprincipled. What I mean by this is that most of the time when you express an abstraction, there is some kind of principle or concept behind that abstraction. For many abstractions, the principle behind them can be expressed in their types. When you define a type class, you can often formulate laws which instances should obey and clients can assume. If you use GHC's [new generics feature][3] to abstract the form of an instance declaration over any datatype (within bounds), you get to say "for sum types, it works like this, for product types, it works like that". But Template Haskell is just dumb macros. It's not abstraction at the level of ideas, but abstraction at the level of ASTs, which is better, but only modestly, than abstraction at the level of plain text.
It's only unprincipled if you do unprincipled things with it. The only difference is that with the compiler implemented mechanisms for abstraction, you have more confidence that the abstraction isn't leaky. Perhaps democratizing language design does sound a bit scary! Creators of TH libraries need to document well and clearly define the meaning and results of the tools they provide. A good example of principled TH is the derive package: http://hackage.haskell.org/package/derive - it uses a DSL such that the example of many of the derivations /specifies/ the actual derivation.
It ties you to GHC. In theory another compiler could implement it, but in practice I doubt this will ever happen. (This is in contrast to various type system extensions which, though they might only be implemented by GHC at the moment, I could easily imagine being adopted by other compilers down the road and eventually standardized.)
That's a pretty good point - the TH API is pretty big and clunky. Re-implementing it seems like it could be tough. However, there are only really only a few ways to slice the problem of representing Haskell ASTs. I imagine that copying the TH ADTs, and writing a converter to the internal AST representation would get you a good deal of the way there. This would be equivalent to the (not insignificant) effort of creating haskell-src-meta. It could also be simply re-implemented by pretty printing the TH AST and using the compiler's internal parser.
While I could be wrong, I don't see TH as being that complicated of a compiler extension, from an implementation perspective. This is actually one of the benefits of "keeping it simple" and not having the fundamental layer be some theoretically appealing, statically verifiable templating system.
The API isn't stable. When new language features are added to GHC and the template-haskell package is updated to support them, this often involves backwards-incompatible changes to the TH datatypes. If you want your TH code to be compatible with more than just one version of GHC you need to be very careful and possibly use CPP.
This is also a good point, but somewhat dramaticized. While there have been API additions lately, they haven't been extensively breakage inducing. Also, I think that with the superior AST quoting I mentioned earlier, the API that actually needs to be used can be very substantially reduced. If no construction / matching needs distinct functions, and are instead expressed as literals, then most of the API disappears. Moreover, the code you write would port more easily to AST representations for languages similar to Haskell.
In summary, I think that TH is a powerful, semi-neglected tool. Less hate could lead to a more lively eco-system of libraries, encouraging the implementation of more language feature prototypes. It's been observed that TH is an overpowered tool, that can let you /do/ almost anything. Anarchy! Well, it's my opinion that this power can allow you to overcome most of its limitations, and construct systems capable of quite principled meta-programming approaches. It's worth the usage of ugly hacks to simulate the "proper" implementation, as this way the design of the "proper" implementation will gradually become clear.
In my personal ideal version of nirvana, much of the language would actually move out of the compiler, into libraries of these variety. The fact that the features are implemented as libraries does not heavily influence their ability to faithfully abstract.
What's the typical Haskell answer to boilerplate code? Abstraction. What're our favorite abstractions? Functions and typeclasses!
Typeclasses let us define a set of methods, that can then be used in all manner of functions generic on that class. However, other than this, the only way classes help avoid boilerplate is by offering "default definitions". Now here is an example of an unprincipled feature!
Minimal binding sets are not declarable / compiler checkable. This could lead to inadvertent definitions that yield bottom due to mutual recursion.
Despite the great convenience and power this would yield, you cannot specify superclass defaults, due to orphan instances http://lukepalmer.wordpress.com/2009/01/25/a-world-without-orphans/ These would let us fix the numeric hierarchy gracefully!
Going after TH-like capabilities for method defaults led to http://www.haskell.org/haskellwiki/GHC.Generics . While this is cool stuff, my only experience debugging code using these generics was nigh-impossible, due to the size of the type induced for and ADT as complicated as an AST. https://github.com/mgsloan/th-extra/commit/d7784d95d396eb3abdb409a24360beb03731c88c
In other words, this went after the features provided by TH, but it had to lift an entire domain of the language, the construction language, into a type system representation. While I can see it working well for your common problem, for complex ones, it seems prone to yielding a pile of symbols far more terrifying than TH hackery.
TH gives you value-level compile-time computation of the output code, whereas generics forces you to lift the pattern matching / recursion part of the code into the type system. While this does restrict the user in a few fairly useful ways, I don't think the complexity is worth it.
I think that the rejection of TH and lisp-like metaprogramming led to the preference towards things like method-defaults instead of more flexible, macro-expansion like declarations of instances. The discipline of avoiding things that could lead to unforseen results is wise, however, we should not ignore that Haskell's capable type system allows for more reliable metaprogramming than in many other environments (by checking the generated code).
One rather pragmatic problem with Template Haskell is that it only works when GHC's bytecode interpreter is available, which is not the case on all architectures. So if your program uses Template Haskell or relies on libraries that use it, it will not run on machines with an ARM, MIPS, S390 or PowerPC CPU.
This is relevant in practice: git-annex is a tool written in Haskell that makes sense to run on machines worrying about storage, such machines often have non-i386-CPUs. Personally, I run git-annex on a NSLU 2 (32 MB of RAM, 266MHz CPU; did you know Haskell works fine on such hardware?) If it would use Template Haskell, this is not possible.
(The situation about GHC on ARM is improving these days a lot and I think 7.4.2 even works, but the point still stands).
Why is TH bad? For me, it comes down to this:
If you need to produce so much repetitive code that you find yourself trying to use TH to auto-generate it, you're doing it wrong!
Think about it. Half the appeal of Haskell is that its high-level design allows you to avoid huge amounts of useless boilerplate code that you have to write in other languages. If you need compile-time code generation, you're basically saying that either your language or your application design has failed you. And we programmers don't like to fail.
Sometimes, of course, it's necessary. But sometimes you can avoid needing TH by just being a bit more clever with your designs.
(The other thing is that TH is quite low-level. There's no grand high-level design; a lot of GHC's internal implementation details are exposed. And that makes the API prone to change...)

Is there an object-identity-based, thread-safe memoization library somewhere?

I know that memoization seems to be a perennial topic here on the haskell tag on stack overflow, but I think this question has not been asked before.
I'm aware of several different 'off the shelf' memoization libraries for Haskell:
The memo-combinators and memotrie packages, which make use of a beautiful trick involving lazy infinite data structures to achieve memoization in a purely functional way. (As I understand it, the former is slightly more flexible, while the latter is easier to use in simple cases: see this SO answer for discussion.)
The uglymemo package, which uses unsafePerformIO internally but still presents a referentially transparent interface. The use of unsafePerformIO internally results in better performance than the previous two packages. (Off the shelf, its implementation uses comparison-based search data structures, rather than perhaps-slightly-more-efficient hash functions; but I think that if you find and replace Cmp for Hashable and Data.Map for Data.HashMap and add the appropraite imports, you get a hash based version.)
However, I'm not aware of any library that looks answers up based on object identity rather than object value. This can be important, because sometimes the kinds of object which are being used as keys to your memo table (that is, as input to the function being memoized) can be large---so large that fully examining the object to determine whether you've seen it before is itself a slow operation. Slow, and also unnecessary, if you will be applying the memoized function again and again to an object which is stored at a given 'location in memory' 1. (This might happen, for example, if we're memoizing a function which is being called recursively over some large data structure with a lot of structural sharing.) If we've already computed our memoized function on that exact object before, we can already know the answer, even without looking at the object itself!
Implementing such a memoization library involves several subtle issues and doing it properly requires several special pieces of support from the language. Luckily, GHC provides all the special features that we need, and there is a paper by Peyton-Jones, Marlow and Elliott which basically worries about most of these issues for you, explaining how to build a solid implementation. They don't provide all details, but they get close.
The one detail which I can see which one probably ought to worry about, but which they don't worry about, is thread safety---their code is apparently not threadsafe at all.
My question is: does anyone know of a packaged library which does the kind of memoization discussed in the Peyton-Jones, Marlow and Elliott paper, filling in all the details (and preferably filling in proper thread-safety as well)?
Failing that, I guess I will have to code it up myself: does anyone have any ideas of other subtleties (beyond thread safety and the ones discussed in the paper) which the implementer of such a library would do well to bear in mind?
UPDATE
Following #luqui's suggestion below, here's a little more data on the exact problem I face. Let's suppose there's a type:
data Node = Node [Node] [Annotation]
This type can be used to represent a simple kind of rooted DAG in memory, where Nodes are DAG Nodes, the root is just a distinguished Node, and each node is annotated with some Annotations whose internal structure, I think, need not concern us (but if it matters, just ask and I'll be more specific.) If used in this way, note that there may well be significant structural sharing between Nodes in memory---there may be exponentially more paths which lead from the root to a node than there are nodes themselves. I am given a data structure of this form, from an external library with which I must interface; I cannot change the data type.
I have a function
myTransform : Node -> Node
the details of which need not concern us (or at least I think so; but again I can be more specific if needed). It maps nodes to nodes, examining the annotations of the node it is given, and the annotations its immediate children, to come up with a new Node with the same children but possibly different annotations. I wish to write a function
recursiveTransform : Node -> Node
whose output 'looks the same' as the data structure as you would get by doing:
recursiveTransform Node originalChildren annotations =
myTransform Node recursivelyTransformedChildren annotations
where
recursivelyTransformedChildren = map recursiveTransform originalChildren
except that it uses structural sharing in the obvious way so that it doesn't return an exponential data structure, but rather one on the order of the same size as its input.
I appreciate that this would all be easier if say, the Nodes were numbered before I got them, or I could otherwise change the definition of a Node. I can't (easily) do either of these things.
I am also interested in the general question of the existence of a library implementing the functionality I mention quite independently of the particular concrete problem I face right now: I feel like I've had to work around this kind of issue on a few occasions, and it would be nice to slay the dragon once and for all. The fact that SPJ et al felt that it was worth adding not one but three features to GHC to support the existence of libraries of this form suggests that the feature is genuinely useful and can't be worked around in all cases. (BUT I'd still also be very interested in hearing about workarounds which will help in this particular case too: the long term problem is not as urgent as the problem I face right now :-) )
1 Technically, I don't quite mean location in memory, since the garbage collector sometimes moves objects around a bit---what I really mean is 'object identity'. But we can think of this as being roughly the same as our intuitive idea of location in memory.
If you only want to memoize based on object identity, and not equality, you can just use the existing laziness mechanisms built into the language.
For example, if you have a data structure like this
data Foo = Foo { ... }
expensive :: Foo -> Bar
then you can just add the value to be memoized as an extra field and let the laziness take care of the rest for you.
data Foo = Foo { ..., memo :: Bar }
To make it easier to use, add a smart constructor to tie the knot.
makeFoo ... = let foo = Foo { ..., memo = expensive foo } in foo
Though this is somewhat less elegant than using a library, and requires modification of the data type to really be useful, it's a very simple technique and all thread-safety issues are already taken care of for you.
It seems that stable-memo would be just what you needed (although I'm not sure if it can handle multiple threads):
Whereas most memo combinators memoize based on equality, stable-memo does it based on whether the exact same argument has been passed to the function before (that is, is the same argument in memory).
stable-memo only evaluates keys to WHNF.
This can be more suitable for recursive functions over graphs with cycles.
stable-memo doesn't retain the keys it has seen so far, which allows them to be garbage collected if they will no longer be used. Finalizers are put in place to remove the corresponding entries from the memo table if this happens.
Data.StableMemo.Weak provides an alternative set of combinators that also avoid retaining the results of the function, only reusing results if they have not yet been garbage collected.
There is no type class constraint on the function's argument.
stable-memo will not work for arguments which happen to have the same value but are not the same heap object. This rules out many candidates for memoization, such as the most common example, the naive Fibonacci implementation whose domain is machine Ints; it can still be made to work for some domains, though, such as the lazy naturals.
Ekmett just uploaded a library that handles this and more (produced at HacPhi): http://hackage.haskell.org/package/intern. He assures me that it is thread safe.
Edit: Actually, strictly speaking I realize this does something rather different. But I think you can use it for your purposes. It's really more of a stringtable-atom type interning library that works over arbitrary data structures (including recursive ones). It uses WeakPtrs internally to maintain the table. However, it uses Ints to index the values to avoid structural equality checks, which means packing them into the data type, when what you want are apparently actually StableNames. So I realize this answers a related question, but requires modifying your data type, which you want to avoid...

How do I do automatic data serialization of data objects?

One of the huge benefits in languages that have some sort of reflection/introspecition is that objects can be automatically constructed from a variety of sources.
For example, in Java I can use the same objects for persisting to a db (with Hibernate), serializing to XML (with JAXB), and serializing to JSON (json-lib). You can do the same in Ruby and Python also usually following some simple rules for properties or annotations for Java.
Thus I don't need lots "Domain Transfer Objects". I can concentrate on the domain I am working in.
It seems in very strict FP like Haskell and Ocaml this is not possible.
Particularly Haskell. The only thing I have seen is doing some sort of preprocessing or meta-programming (ocaml). Is it just accepted that you have to do all the transformations from the bottom upwards?
In other words you have to do lots of boring work to turn a data type in haskell into a JSON/XML/DB Row object and back again into a data object.
I can't speak to OCaml, but I'd say that the main difficulty in Haskell is that deserialization requires knowing the type in advance--there's no universal way to mechanically deserialize from a format, figure out what the resulting value is, and go from there, as is possible in languages with unsound or dynamic type systems.
Setting aside the type issue, there are various approaches to serializing data in Haskell:
The built-in type classes Read/Show (de)serialize algebraic data types and most built-in types as strings. Well-behaved instances should generally be such that read . show is equivalent to id, and that the result of show can be parsed as Haskell source code constructing the serialized value.
Various serialization packages can be found on Hackage; typically these require that the type to be serialized be an instance of some type class, with the package providing instances for most built-in types. Sometimes they merely require an automatically derivable instance of the type-reifying, reflective metaprogramming Data class (the charming fully qualified name for which is Data.Data.Data), or provide Template Haskell code to auto-generate instances.
For truly unusual serialization formats--or to create your own package like the previously mentioned ones--one can reach for the biggest hammer available, sort of a "big brother" to Read and Show: parsing and pretty-printing. Numerous packages are available for both, and while it may sound intimidating at first, parsing and pretty-printing are in fact amazingly painless in Haskell.
A glance at Hackage indicates that serialization packages already exist for various formats, including binary data, JSON, YAML, and XML, though I've not used any of them so I can't personally attest to how well they work. Here's a non-exhaustive list to get you started:
binary: Performance-oriented serialization to lazy ByteStrings
cereal: Similar to binary, but a slightly different interface and uses strict ByteStrings
genericserialize: Serialization via built-in metaprogramming, output format is extensible, includes R5RS sexp output.
json: Lightweight serialization of JSON data
RJson: Serialization to JSON via built-in metaprogramming
hexpat-pickle: Combinators for serialization to XML, using the "hexpat" package
regular-xmlpickler: Serialization to XML of recursive data structures using the "regular" package
The only other problem is that, inevitably, not all types will be serializable--if nothing else, I suspect you're going to have a hard time serializing polymorphic types, existential types, and functions.
For what it's worth, I think the pre-processor solution found in OCaml (as exemplified by sexplib, binprot and json-wheel among others) is pretty great (and I think people do very similar things with Template Haskell). It's far more efficient than reflection, and can also be tuned to individual types in a natural way. If you don't like the auto-generated serializer for a given type foo, you can always just write your own, and it fits beautifully into the auto-generated serializers for types that include foo as a component.
The only downside is that you need to learn camlp4 to write one of these for yourself. But using them is quite easy, once you get your build-system set up to use the preprocessor. It's as simple as adding with sexp to the end of a type definition:
type t = { foo: int; bar: float }
with sexp
and now you have your serializer.
You wanted
to do lot of boring work to turn a data type in haskell into JSON/XML/DB Row object and back again into a data object.
There are many ways to serialize and unserialize data types in Haskell. You can use for example,
Data.Binary
Text.JSON
as well as other common formants (protocol buffers, thrift, xml)
Each package often/usually comes with a macro or deriving mechanism to allow you to e.g. derive JSON. For Data.Binary for example, see this previous answer: Erlang's term_to_binary in Haskell?
The general answer is: we have many great packages for serialization in Haskell, and we tend to use the existing class 'deriving' infrastructure (with either generics or template Haskell macros to do the actual deriving).
My understanding is that the simplest way to serialize and deserialize in Haskell is to derive from Read and Show. This is simple and isn't fullfilling your requirements.
However there are HXT and Text.JSON which seem to provide what you need.
The usual approach is to employ Data.Binary. This provides the basic serialisation capability. Binary instances for data types are easy to write and can easily be built out of smaller units.
If you want to generate the instances automatically then you can use Template Haskell. I don't know of any package to do this, but I wouldn't be surprised if one already exists.

Resources