How do I do automatic data serialization of data objects?

How do I do automatic data serialization of data objects? - haskell

One of the huge benefits in languages that have some sort of reflection/introspecition is that objects can be automatically constructed from a variety of sources.
For example, in Java I can use the same objects for persisting to a db (with Hibernate), serializing to XML (with JAXB), and serializing to JSON (json-lib). You can do the same in Ruby and Python also usually following some simple rules for properties or annotations for Java.
Thus I don't need lots "Domain Transfer Objects". I can concentrate on the domain I am working in.
It seems in very strict FP like Haskell and Ocaml this is not possible.
Particularly Haskell. The only thing I have seen is doing some sort of preprocessing or meta-programming (ocaml). Is it just accepted that you have to do all the transformations from the bottom upwards?
In other words you have to do lots of boring work to turn a data type in haskell into a JSON/XML/DB Row object and back again into a data object.

I can't speak to OCaml, but I'd say that the main difficulty in Haskell is that deserialization requires knowing the type in advance--there's no universal way to mechanically deserialize from a format, figure out what the resulting value is, and go from there, as is possible in languages with unsound or dynamic type systems.
Setting aside the type issue, there are various approaches to serializing data in Haskell:
The built-in type classes Read/Show (de)serialize algebraic data types and most built-in types as strings. Well-behaved instances should generally be such that read . show is equivalent to id, and that the result of show can be parsed as Haskell source code constructing the serialized value.
Various serialization packages can be found on Hackage; typically these require that the type to be serialized be an instance of some type class, with the package providing instances for most built-in types. Sometimes they merely require an automatically derivable instance of the type-reifying, reflective metaprogramming Data class (the charming fully qualified name for which is Data.Data.Data), or provide Template Haskell code to auto-generate instances.
For truly unusual serialization formats--or to create your own package like the previously mentioned ones--one can reach for the biggest hammer available, sort of a "big brother" to Read and Show: parsing and pretty-printing. Numerous packages are available for both, and while it may sound intimidating at first, parsing and pretty-printing are in fact amazingly painless in Haskell.
A glance at Hackage indicates that serialization packages already exist for various formats, including binary data, JSON, YAML, and XML, though I've not used any of them so I can't personally attest to how well they work. Here's a non-exhaustive list to get you started:
binary: Performance-oriented serialization to lazy ByteStrings
cereal: Similar to binary, but a slightly different interface and uses strict ByteStrings
genericserialize: Serialization via built-in metaprogramming, output format is extensible, includes R5RS sexp output.
json: Lightweight serialization of JSON data
RJson: Serialization to JSON via built-in metaprogramming
hexpat-pickle: Combinators for serialization to XML, using the "hexpat" package
regular-xmlpickler: Serialization to XML of recursive data structures using the "regular" package
The only other problem is that, inevitably, not all types will be serializable--if nothing else, I suspect you're going to have a hard time serializing polymorphic types, existential types, and functions.

For what it's worth, I think the pre-processor solution found in OCaml (as exemplified by sexplib, binprot and json-wheel among others) is pretty great (and I think people do very similar things with Template Haskell). It's far more efficient than reflection, and can also be tuned to individual types in a natural way. If you don't like the auto-generated serializer for a given type foo, you can always just write your own, and it fits beautifully into the auto-generated serializers for types that include foo as a component.
The only downside is that you need to learn camlp4 to write one of these for yourself. But using them is quite easy, once you get your build-system set up to use the preprocessor. It's as simple as adding with sexp to the end of a type definition:
type t = { foo: int; bar: float }
with sexp
and now you have your serializer.

You wanted
to do lot of boring work to turn a data type in haskell into JSON/XML/DB Row object and back again into a data object.
There are many ways to serialize and unserialize data types in Haskell. You can use for example,
Data.Binary
Text.JSON
as well as other common formants (protocol buffers, thrift, xml)
Each package often/usually comes with a macro or deriving mechanism to allow you to e.g. derive JSON. For Data.Binary for example, see this previous answer: Erlang's term_to_binary in Haskell?
The general answer is: we have many great packages for serialization in Haskell, and we tend to use the existing class 'deriving' infrastructure (with either generics or template Haskell macros to do the actual deriving).

My understanding is that the simplest way to serialize and deserialize in Haskell is to derive from Read and Show. This is simple and isn't fullfilling your requirements.
However there are HXT and Text.JSON which seem to provide what you need.

The usual approach is to employ Data.Binary. This provides the basic serialisation capability. Binary instances for data types are easy to write and can easily be built out of smaller units.
If you want to generate the instances automatically then you can use Template Haskell. I don't know of any package to do this, but I wouldn't be surprised if one already exists.

Related

Serialization in Haskell

From the bird's view, my question is: Is there a universal mechanism for as-is data serialization in Haskell?
Introduction
The origin of the problem does not root in Haskell indeed. Once, I tried to serialize a python dictionary where a hash function of objects was quite heavy. I found that in python, the default dictionary serialization does not save the internal structure of the dictionary but just dumps a list of key-value pairs. As a result, the de-serialization process is time-consuming, and there is no way to struggle with it. I was certain that there is a way in Haskell because, at my glance, there should be no problem transferring a pure Haskell type to a byte-stream automatically using BFS or DFS. Surprisingly, but it does not. This problem was discussed here (citation below)
Currently, there is no way to make HashMap serializable without modifying the HashMap library itself. It is not possible to make Data.HashMap an instance of Generic (for use with cereal) using stand-alone deriving as described by #mergeconflict's answer, because Data.HashMap does not export all its constructors (this is a requirement for GHC). So, the only solution left to serialize the HashMap seems to be to use the toList/fromList interface.
Current Problem
I have quite the same problem with Data.Trie bytestring-trie package. Building a trie for my data is heavily time-consuming and I need a mechanism to serialize and de-serialize this tire. However, it looks like the previous case, I see no way how to make Data.Trie an instance of Generic (or, am I wrong)?
So the questions are:
Is there some kind of a universal mechanism to project a pure Haskell type to a byte string? If no, is it a fundamental restriction or just a lack of implementations?
If no, what is the most painless way to modify the bytestring-trie package to make it the instance of Generic and serialize with Data.Store

There is a way using compact regions, but there is a big restriction:
Our binary representation contains direct pointers to the info tables of objects in the region. This means that the info tables of the receiving process must be laid out in exactly the same way as from the original process; in practice, this means using static linking, using the exact same binary and turning off ASLR. This API does NOT do any safety checking and will probably segfault if you get it wrong. DO NOT run this on untrusted input.
This also gives insight into universal serialization is not possible currently. Data structures contain very specific pointers which can differ if you're using different binaries. Reading in the raw bytes into another binary will result in invalid pointers.
There is some discussion in this GitHub issue about weakening this requirement.
I think the proper way is to open an issue or pull request upstream to export the data constructors in the internal module. That is what happened with HashMap which is now fully accessible in its internal module.
Update: it seems there is already a similar open issue about this.

What are module signatures in Haskell?

I have recently found Haskell's feature called "module signatures". As I have discovered they are put in .hsig files and begin with signature keyword instead of module.
The example syntax of such a file may look like
signature Str where
data Str
empty :: Str
append :: Str -> Str -> Str
However, I cannot imagine how and why one would use them. Could you explain me which problems do they solve and how to properly make use of them?
They strongly remind me the module system that one can see in OCaml (link), which also has modules signatures and separate implementations, but I can't decide how close are these two concepts. Is it somehow related?

They are related to the OCaml module system, but with some important differences:
Signatures are defined within the language (in .hsig files) but unlike in OCaml they are not instantiated within the language. Instead, the package manager controls instantiation (currently, only Cabal provides that). Modules never know if they are importing an abstract signature or an actual module.
Implementation modules know nothing about signatures and do not not reference them directly. Any existing module can implement a signature if the definitions happen to be compatible.
Instantiation is triggered by a coincidence of module name and signature name in the dependencies of some compilation unit (executable, library, test suite...) When the names coincide, a process called "signature matching" takes place that verifies that the types and definitions are compatible.
The "happy path" is that in your program you depend on some library having a signature "hole", and also on another library that provides an implementation module with the same name. Then signature matching happens automatically. When the names don't match, or we need multiple instantiations of the signature-using library, we have to rename signatures and/or modules in the mixins section of the Cabal file.
As for why module signatures might be useful, consider bytestring, the most popular library by far for handling binary data in Haskell. But there are others, for example stdio with its Bytes type.
Suppose you are writing your own library that uses binary data, and you don't want to force your users into either stdio or bytestring. What are your choices?
One would be to create something like a Bytelike class and parameterize all your functions with it. You would also need to add a type parameter to every data type that contains bytes.
Another would be to create a signature that defines an abstract binary data type and all the operations that are required of it. Your library would make use of the signature, and remain "indefinite" until the user depends both on your library and a suitable implementation when creating his own libraries.
From the perspective of the user, the typeclass solution is unsatisfactory. The user knows that he wants to use either ByteString or Bytes, just one of them. The decision will not depend on some runtime flag and will remain constant across the extent of his program. And yet he has to deal with a more complex API that reminds him of that already decided issue at every turn.
It's better if he makes the decision once, writes it in his .cabal file, and deals with a simpler API from then onwards.

As described here, they're quite closely related to OCaml module signatures. They allow you to create a package missing some modules and say these modules, containing such and such types and values, should be delivered by package's user. I haven't tested it myself, but I imagine that such a package works very much like an OCaml functor.

How to achieve data polymorphism for multiple external formats in Haskell?

I need to process multiple formats and versions for semantically equivalent data.
I can generate Haskell data types for each schema (XSD for example), they will be technically different, but semantically and structurally identical in many cases.
The data is complex, includes references, and service components must process whole graph and produce also similar response (a component might just update a field, but might need to analyze whole graph to collect all required information, might call other services as well).
How can I represent ns1:address and ns2:adress as one polymorphic type that has country and street elements and application needs process them as identical, but keeps serialization context for writing response in correct format (one representation might encode them in single string while other might carry also superfluous complex data)?
How close can I get to writing mostly code that defines semantic equivalence of data, business logic and generate all else? What features in Haskell language or libraries should I evaluate as building blocks for potential solution?

An option is to create a data type for each schema and create a function to map them to a common data type. Process it as you wish. You don't need to create polymorphic types.
This approach is similar to Pandoc's: you get a bunch of readers to parse documents to a common document structure, then use writers to convert that common structure to a particular format.
You just need the libraries to read your complex input data (and write it back, if necessary). The rest is functions and data types.
If you are really handling graphs, you can look at the Data.Graph module.

It sounds like this is a problems that is well served by the Type Class infrastructure, and the Lens library.
Use a Type Class to define a standard and consistent high-level interface to the various implementations. Make sure that you focus on the operations you wish to perform, not on the underlying implementation or process.
Use Lenses and Prisms to reach into the underlying datatypes and return answers to queries, and modify values "in-place", without resorting to full serialisation/de-serialisation.

How to simulate optional fields in an ADT in Haskell?

I have an algebraic data type that's being used all thoughout my program. I've realized that there's a point when I need to annotate all my structures of that type with a simple string.
I would like to not go through tons of code to account for adding a field to my commonly used type. Especially since there's no meaningful value for this annotation until quite far in my program, it seems excessive to refactor a thousand lines of code for a trivial change.
Also, as the type is pretty complex, it seems silly to just duplicate the type and make a slightly different version.
Is this just a weakness of Haskell, or am I missing the right way to handle this? I assume it's the latter, but I can't find anything akin to optional parameters to a type constructor.

I wouldn't go as far as declaring "a weakness of Haskell" - more like something that's needed to take into account when dealing with ADTs (and not specifically in Haskell).
Each program consists of "entities" and "actions". Your problem is that you need to modify an entity (an ADT in Haskell).
In an OOP language the solution will be straightforward - subclass you type. But if you'll decide to change the signature of a method in an existing class - you'll have no choice but to go through your entire codebase and deal with the damage.
In a functional language, functions are "cheap" - each function is independent from the ADT and refactoring it is often simple. Is that a weakness of OOP languages? No - it's just a tradeoff between ease of modifying existing "actions" (which is easy in functional languages) and ease of modifying "entities" (which is easy in OOP languages).
And in a practical note:
Designing an ADT is very important since changing can be very expensive.
#bheklilr's advice is priceless - follow it!

Can Haskell functions be serialized?

The best way to do it would be to get the representation of the function (if it can be recovered somehow). Binary serialization is preferred for efficiency reasons.
I think there is a way to do it in Clean, because it would be impossible to implement iTask, which relies on that tasks (and so functions) can be saved and continued when the server is running again.
This must be important for distributed haskell computations.
I'm not looking for parsing haskell code at runtime as described here: Serialization of functions in Haskell.
I also need to serialize not just deserialize.

Unfortunately, it's not possible with the current ghc runtime system.
Serialization of functions, and other arbitrary data, requires some low level runtime support that the ghc implementors have been reluctant to add.
Serializing functions requires that you can serialize anything, since arbitrary data (evaluated and unevaluated) can be part of a function (e.g., a partial application).

No. However, the CloudHaskell project is driving home the need for explicit closure serialization support in GHC. The closest thing CloudHaskell has to explicit closures is the distributed-static package. Another attempt is the HdpH closure representation. However, both use Template Haskell in the way Thomas describes below.
The limitation is a lack of static support in GHC, for which there is a currently unactioned GHC ticket. (Any takers?). There has been a discussion on the CloudHaskell mailing list about what static support should actually look like, but nothing has yet progressed as far as I know.
The closest anyone has come to a design and implementation is Jost Berthold, who has implemented function serialisation in Eden. See his IFL 2010 paper "Orthogonal Serialisation for Haskell". The serialisation support is baked in to the Eden runtime system. (Now available as separate library: packman. Not sure whether it can be used with GHC or needs a patched GHC as in the Eden fork...) Something similar would be needed for GHC. This is the serialisation support Eden, in the version forked from GHC 7.4:
data Serialized a = Serialized { packetSize :: Int , packetData :: ByteArray# }
serialize :: a -> IO (Serialized a)
deserialize :: Serialized a -> IO a
So: one can serialize functions and data structures. There is a Binary instance for Serialized a, allowing you to checkpoint a long-running computation to file! (See Secion 4.1).
Support for such a simple serialization API in the GHC base libraries would surely be the Holy Grail for distributed Haskell programming. It would likely simplify the composability between the distributed Haskell flavours (CloudHaskell, MetaPar, HdpH, Eden and so on...)

Check out Cloud Haskell. It has a concept called Closure which is used to send code to be executed on remote nodes in a type safe manner.

Eden probably comes closest and probably deserves a seperate answer: (De-)Serialization of unevaluated thunks is possible, see https://github.com/jberthold/packman.
Deserialization is however limited to the same program (where program is a "compilation result"). Since functions are serialized as code pointers, previously unknown functions cannot be deserialized.
Possible usage:
storing unevaluated work for later
distributing work (but no sharing of new code)

A pretty simple and practical, but maybe not as elegant solution would be to (preferably have GHC automatically) compile each function into a separate module of machine-independent bytecode, serialize that bytecode whenever serialization of that function is required, and use the dynamic-loader or plugins packages, to dynamically load them, so even previously unknown functions can be used.
Since a module notes all its dependencies, those could then be (de)serialized and loaded too. In practice, serializing index numbers and attaching an indexed list of the bytecode blobs would probably be the most efficient.
I think as long as you compile the modules yourself, this is already possible right now.
As I said, it would not be very pretty though. Not to mention the generally huge security risk of de-serializing code from insecure sources to run in an unsecured environment. :-)
(No problem if it is trustworthy, of course.)
I’m not going to code it up right here, right now though. ;-)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string