When to expose constructors of a data type when designing data structures? - haskell

When designing data structures in functional languages there are 2 options:
Expose their constructors and pattern match on them.
Hide their constructors and use higher-level functions to examine the data structures.
Which approach is appropriate in which cases?
Pattern matching can make code much more readable or simpler. On the other hand, if we need to change something in the definition of a data type, then all the places where we pattern-match on it (or construct it) need to be updated.
I've been asking this question myself for some time. Often it happens to me that I start with a simple data structure (or even a type alias) and it seems that constructors + pattern matching will be the easiest approach and produce a clean and readable code. But later things get more complicated, I have to change the data type definition and refactor a big part of the code.

The essential factor for me is the answer to the following question:
Is the structure of my datatype relevant to the outside world?
For example, the internal structure of the list datatype is very much relevant to the outside world - it has an inductive structure that is certainly very useful to expose to consumers, because they construct functions that proceed by induction on the structure of the list. If the list is finite, then these functions are guaranteed to terminate. Also, defining functions in this way makes it easy to provide properties about them, again by induction.
By contrast, it is best for the Set datatype to be kept abstract. Internally, it is implemented as a tree in the containers package. However, it might as well have been implemented using arrays, or (more usefully in a functional setting) with a tree of a slightly different structure respecting different invariants (balanced or unbalanced, branching factor, etc). The need to enforce invariants over and above those that the constructors already enforce through their types, by the way, precludes making the datatype concrete.
The essential difference between the list example and the set example is that the Set datatype is relevant only through the operations that are possible on Sets, whereas lists are relevant not only because the standard library already provides many functions to act on them, but also because their structure itself is relevant.
As a sidenote, one might object that the inductive structure of lists, which is so fundamental for writing functions whose termination and behaviour are easy to reason about, is actually captured abstractly by two functions that consume lists: foldr and foldl. Given these two basic list operators, most functions do not need to inspect the structure of a list at all, and so it could be argued that lists too could be kept abstract. This argument generalizes to many other similar structures, such as all Traversable structures, all Foldable structures, etc. However, it is nigh impossible to capture all possible recursion patterns on lists, and in fact many functions aren't recursive at all. Given only foldr and foldl, writing head, for example, would still be possible, though quite tedious:
import Data.Maybe (fromJust)

head xs = fromJust $ foldl (\b x -> maybe (Just x) Just b) Nothing xs
We're much better off just giving away the internal structure of the list.
One final point is that sometimes the actual representation of a datatype isn't relevant to the outside world, because, say, it is optimised in some way and might not be the canonical representation, or there isn't a single "canonical" representation. In these cases, you'll want to keep your datatype abstract, but offer "views" of your datatype, which do provide concrete representations that can be pattern matched on.
One example would be if you wanted to define a Complex datatype of complex numbers, where both cartesian and polar forms can be considered canonical. In this case, you would keep Complex abstract, but export two views, i.e. functions polar and cartesian that return a pair of a length and an angle, or a coordinate in the cartesian plane, respectively.
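A minimal sketch of that idea, with illustrative module and function names (the internal representation happens to be cartesian, but callers never see it):

module Complex (Complex, mkCartesian, mkPolar, cartesian, polar) where

-- Kept abstract: only the smart constructors and views below are exported.
data Complex = Complex !Double !Double   -- real part, imaginary part

mkCartesian :: Double -> Double -> Complex
mkCartesian = Complex

mkPolar :: Double -> Double -> Complex
mkPolar r theta = Complex (r * cos theta) (r * sin theta)

-- Views: concrete representations that callers can freely take apart.
cartesian :: Complex -> (Double, Double)
cartesian (Complex x y) = (x, y)

polar :: Complex -> (Double, Double)
polar (Complex x y) = (sqrt (x * x + y * y), atan2 y x)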

Well, the rule is pretty simple: If it's easy to construct wrong values by using the actual constructors, then don't allow them to be used directly, but instead provide smart constructors. This is the path followed by some data structures like Map and Set, which are easy to get wrong.
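For instance, a minimal sketch of the smart-constructor pattern, using a hypothetical NonEmpty wrapper whose invariant cannot be expressed by the constructor's type alone:

module NonEmpty (NonEmpty, fromList, toList) where

-- The constructor is not exported, so the only way to build a value from
-- outside this module is the validating smart constructor below.
newtype NonEmpty a = NonEmpty [a]

fromList :: [a] -> Maybe (NonEmpty a)
fromList [] = Nothing
fromList xs = Just (NonEmpty xs)

toList :: NonEmpty a -> [a]
toList (NonEmpty xs) = xs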
Then there are the types for which it's impossible or hard to construct inconsistent/wrong values either because the type doesn't allow that at all or because you would need to introduce bottoms. The length-indexed list type (commonly called Vec) and most monads are examples of that.
Ultimately this is your own decision. Put yourself into the user's perspective and make the tradeoff between convenience and safety. If there is no tradeoff, then always expose the constructors. Otherwise your library users will hate you for the unnecessary opacity.

If the data type serves a simple purpose (like Maybe a) and no (explicit or implicit) assumptions about the data type can be violated by directly constructing a value via the data constructors, I would expose the constructors.
On the other hand, if the data type is more complex (like a balanced tree) and/or its internal representation is likely to change, I usually hide the constructors.
When using a package, there's an unwritten rule that the interface exposed by a non-internal module should be "safe" to use on the given data type. Considering the balanced tree example, exposing the data constructors allows one to (accidentally) construct an unbalanced tree, and so the assumed runtime guarantees for searching the tree etc might be violated.
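A sketch of what that looks like at the module boundary; for brevity this is a plain (unbalanced) search tree rather than a balanced one, but the export list is the important part: exporting Tree rather than Tree(..) keeps the constructors private.

module MyTree
  ( Tree       -- abstract: writing Tree(..) here would expose the constructors
  , fromList
  , member
  ) where

data Tree a = Leaf | Node (Tree a) a (Tree a)

fromList :: Ord a => [a] -> Tree a
fromList = foldr insert Leaf
  where
    insert x Leaf = Node Leaf x Leaf
    insert x t@(Node l y r)
      | x < y     = Node (insert x l) y r
      | x > y     = Node l y (insert x r)
      | otherwise = t

member :: Ord a => a -> Tree a -> Bool
member _ Leaf = False
member x (Node l y r)
  | x < y     = member x l
  | x > y     = member x r
  | otherwise = True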

If the type is used to represent values with a canonical definition and representation (many mathematical objects fall into this category), and it's not possible to construct "invalid" values using the type, then you should expose the constructors.
For example, if you're representing something like two dimensional points with your own type (including a newtype), you might as well expose the constructor. The reality is that a change to this datatype is not going to be a change in how 2d points are represented, it's going to be a change in your need to use 2d points (maybe you're generalising to 3d space, maybe you're adding a concept of layers, or whatever), and is almost certain to need attention in the parts of the code using values of this type no matter what you do.[1]
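For instance, a hypothetical 2D point type, just to make this concrete; the representation is the concept, so exposing the constructor costs nothing:

data Point2D = Point2D { px :: Double, py :: Double }
  deriving (Eq, Show)

translate :: Double -> Double -> Point2D -> Point2D
translate dx dy (Point2D x y) = Point2D (x + dx) (y + dy)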
A complex type representing something specific to your application or field is quite likely to undergo changes to the representation while continuing to support similar operations. Therefore you only want other modules depending on the operations, not on the internal structure. So you shouldn't expose the constructors.
Other types represent things with canonical definitions but not canonical representations. Everyone knows the properties expected of maps and sets, but there are lots of different ways of representing values that support those properties. So you again only want other modules depending on the operations they support, not on the particular representations.
Some types, whether or not they are simple types with canonical representations, allow the construction of values in the program which don't represent a valid value in the abstract concept the type is supposed to represent. A simple example would be a type representing a self-balancing binary search tree; client code with access to the constructors could easily construct invalid trees. Exposing the constructors either means you need to assume that such values passed in from outside may be invalid, and therefore you need to make something sensible happen even for bizarre values, or it means that it's the responsibility of the programmers working with your interface to ensure they don't violate any assumptions. It's usually better to just keep such types from being constructed directly outside your module.
Basically it comes down to the concept your type is supposed to represent. If your concept maps in a very simple and obvious[2] way directly to values in some data type, and that data type isn't "more inclusive" than the concept (as happens when the compiler cannot check the needed invariants), then the concept is pretty much "the same" as the data type, and exposing its structure is fine. If not, then you probably need to keep the structure hidden.
[1] A likely change though would be to change which numeric type you're using for the coordinate values, so you probably do have to think about how to minimise the impact of such changes. That's pretty orthogonal to whether or not you expose the constructors though.
[2] "Obvious" here meaning that if you asked 10 people independently to come up with a data type representing the concept they would all come back with the same thing, modulo changing the names.

I would propose a different, noticeably more restrictive rule than most people. The central criterion would be:
Do you guarantee that this type will never, ever change? If so, exposing the constructors might be a good idea. Good luck with that, though!
But the types for which you can make that guarantee tend to be very simple, generic "foundation" types like Maybe, Either or [], which one could arguably write once and then never revisit again.
Though even those can be questioned, because they do get revisited from time to time; there are people who have used Church-encoded versions of Maybe and List in various contexts for performance reasons, e.g.:
{-# LANGUAGE RankNTypes #-}
newtype Maybe' a = Maybe' { elimMaybe' :: forall r. r -> (a -> r) -> r }
nothing = Maybe' $ \z k -> z
just x = Maybe' $ \z k -> k x
newtype List' a = List' { elimList' :: forall r. (a -> r -> r) -> r -> r }
nil = List' $ \k z -> z
cons x xs = List' $ \k z -> k x (elimList' xs k z)
These two examples highlight something important: you can replace the Maybe' type's implementation shown above with any other implementation as long as it supports the following three functions:
nothing :: Maybe' a
just :: a -> Maybe' a
elimMaybe' :: Maybe' a -> r -> (a -> r) -> r
...and the following laws:
elimMaybe' nothing z f == z
elimMaybe' (just x) z f == f x
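For instance, a small function written purely against that interface works for any implementation satisfying the laws (fromMaybe' is just an illustrative name):

-- Equivalent of Data.Maybe.fromMaybe, using only the abstract eliminator.
fromMaybe' :: a -> Maybe' a -> a
fromMaybe' def m = elimMaybe' m def id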
And this technique can be applied to any algebraic data type. Which to me says that pattern matching against concrete constructors is just insufficiently abstract; it doesn't really gain you anything that you can't get out of the abstract constructors + destructor pattern, and it loses implementation flexibility.

Related

Type constraints on dimensionality of vectors in F# and Haskell (Dependent Types)

I'm new to F# and Haskell and am implementing a project in order to determine which language I would prefer to devote more time to.
I have numerous situations where I expect a given numerical type to have given dimensions based on parameters given to a top-level function (i.e., at runtime). For example, in this F# snippet, I have
type DataStreamItem = LinearAlgebra.Vector<float32>

type Ball =
    {R : float32;
     X : DataStreamItem}
and I expect all instances of type DataStreamItem to have D dimensions.
My question is in the interests of algorithm development and debugging, since such shape-mismatch bugs can be a headache to pin down but should be a non-issue when the algorithm is up-and-running:
Is there a way, in either F# or Haskell, to constrain DataStreamItem and / or Ball to have dimensions of D? Or do I need to resort to pattern matching on every calculation?
If the latter is the case, are there any good, light-weight paradigms to catch such constraint violations as soon as they occur (and that can be removed when performance is critical)?
Edit:
To clarify the sense in which D is constrained:
D is defined such that, if you expressed the algorithm of the function main(DataStream) as a computation graph, all of the intermediate calculations would depend on the dimension D for the execution of main(DataStream). The simplest example I can think of would be a dot-product of M with DataStreamItem: the dimension of DataStream would determine the dimensions chosen for M.
Another Edit:
A week later, I found the following blog post outlining precisely what I was looking for in dependent types in Haskell:
https://blog.jle.im/entry/practical-dependent-types-in-haskell-1.html
And Another:
This Reddit thread contains some discussion on Dependent Types in Haskell and contains a link to the quite interesting dissertation proposal of R. Eisenberg.
Neither Haskell's nor F#'s type system is rich enough to (directly) express statements of the sort "N nested instances of a recursive type T, where N is between 2 and 6" or "a string of characters exactly 6 long". Not in those exact terms, at least.
I mean, sure, you can always express such a 6-long string type as type String6 = String6 of char*char*char*char*char*char or some variant of the sort (which technically should be enough for your particular example with vectors, unless you're not telling us the whole example), but you can't say something like type String6 = s:string{s.Length=6} and, more importantly, you can't define functions of the form concat: String<n> -> String<m> -> String<n+m>, where n and m represent string lengths.
But you're not the first person asking this question. This research direction does exist, and is called "dependent types", and I can express the gist of it most generally as "having higher-order, more powerful operations on types" (as opposed to just union and intersection, as we have in ML languages) - notice how in the example above I parametrize the type String with a number, not another type, and then do arithmetic on that number.
The most prominent language prototypes (that I know of) in this direction are Agda, Idris, F*, and Coq (not really the full deal AFAIK). Check them out, but beware: this is kind of the edge of tomorrow, and I wouldn't advise starting a big project based on those languages.
(edit: apparently you can do certain tricks in Haskell to simulate dependent types, but it's not very convenient, and you have to enable UndecidableInstances)
Alternatively, you could go with a weaker solution of doing the checks at runtime. The general gist is: wrap your vector types in a plain wrapper, don't allow direct construction of it, but provide constructor functions instead, and make those constructor functions ensure the desired property (i.e. length). Something like:
type Stream4 = private Stream4 of DataStreamItem
    with
    static member create (item: DataStreamItem) =
        if item.Length = 4 then Some (Stream4 item)
        else None
    // Alternatively, fail instead of returning an option:
    //     if item.Length <> 4 then failwith "Expected a 4-long vector."
    //     Stream4 item
Here is a fuller explanation of the approach from Scott Wlaschin: constrained strings.
So if I understood correctly, you're actually not doing any type-level arithmetic, you just have a “length tag” that's shared in a chain of function calls.
This has long been possible to do in Haskell; one way that I consider quite elegant is to annotate your arrays with a standard fixed-length type of the desired length:
import qualified Data.Vector.Unboxed as VU

newtype FixVect v s = FixVect { getFixVect :: VU.Vector s }
To ensure the correct length, you only provide (polymorphic) smart constructors that construct from the fixed-length type – perfectly safe, though the actual dimension number is nowhere mentioned!
class VectorSpace v => FiniteDimensional v where
  asFixVect :: v -> FixVect v (Scalar v)

instance FiniteDimensional Float where
  asFixVect s = FixVect $ VU.singleton s

instance (FiniteDimensional a, FiniteDimensional b, Scalar a ~ Scalar b) => FiniteDimensional (a,b) where
  asFixVect (a,b) = case (asFixVect a, asFixVect b) of
    (FixVect av, FixVect bv) -> FixVect $ av <> bv
This construction from unboxed tuples is really inefficient; however, that doesn't mean you can't write efficient programs with this paradigm – if the dimension always stays constant, you only need to wrap and unwrap once, and can do all the critical operations through safe yet runtime-unchecked zips, folds and LA combinations.
Regardless, this approach isn't really widely used. Perhaps the single constant dimension is in fact too limiting for most relevant operations, and if you need to unwrap to tuples often it's way too inefficient. Another approach that is taking off these days is to actually tag the vectors with type-level numbers. Such numbers have become available in a usable form with the introduction of data kinds in GHC-7.4. Up until now, they're still rather unwieldy and not fit for proper arithmetic, but the upcoming 8.0 will greatly improve many aspects of this dependently-typed programming in Haskell.
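As a rough sketch of what that tagging looks like with GHC.TypeLits (Vec, mkVec and dot are illustrative names, not from any particular library):

{-# LANGUAGE DataKinds, KindSignatures, ScopedTypeVariables #-}
import GHC.TypeLits (Nat, KnownNat, natVal)
import Data.Proxy (Proxy (..))
import qualified Data.Vector.Unboxed as VU

newtype Vec (n :: Nat) a = Vec (VU.Vector a)

-- Smart constructor: the runtime length must match the type-level tag.
mkVec :: forall n a. (KnownNat n, VU.Unbox a) => VU.Vector a -> Maybe (Vec n a)
mkVec v
  | VU.length v == fromIntegral (natVal (Proxy :: Proxy n)) = Just (Vec v)
  | otherwise = Nothing

-- Once both arguments carry the same n, no runtime length check is needed.
dot :: (VU.Unbox a, Num a) => Vec n a -> Vec n a -> a
dot (Vec xs) (Vec ys) = VU.sum (VU.zipWith (*) xs ys)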
A library that offers efficient length-indexed arrays is linear.

Haskell: `Map (a,b) c` versus `Map a (Map b c)`?

Thinking of maps as representations of finite functions, a map of two or more variables can be given either in curried or uncurried form; that is, the types Map (a,b) c (uncurried) and Map a (Map b c) (curried) are isomorphic, or something close to it.
What practical considerations are there — efficiency, etc — for choosing between the two representations?
The Ord instance of tuples uses lexicographic order, so Map (a, b) c is going to sort by a first anyway, so the overall order will be the same. Regarding practical considerations:
Because Data.Map is a binary search tree, splitting at a key is comparable to a lookup, so getting a submap for a given a in the uncurried form won't be significantly more expensive than in the curried form.
The curried form may produce a less balanced tree overall, for the obvious reason of having multiple trees instead of just one.
The curried form will have a bit of extra overhead to store the nested maps.
The nested maps of the curried form representing "partial applications" can be shared if some a values produce the same result.
Similarly, "partial application" of the curried form gives you the existing inner map, while the uncurried form must construct a new map.
So the uncurried form is clearly better in general, but the curried form may be better if you expect to do "partial application" often and would benefit from sharing of Map b c values.
Note that some care will be necessary to ensure you actually benefit from that potential sharing; you'll need to explicitly define any shared inner maps and reuse the single value when constructing the full map.
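A small sketch of that explicit sharing (the maps here are purely illustrative):

import qualified Data.Map as M

sharedInner :: M.Map Char Int
sharedInner = M.fromList [('x', 1), ('y', 2)]

-- Both outer keys reference the same inner map value, so its tree is
-- built once and shared on the heap.
outer :: M.Map String (M.Map Char Int)
outer = M.fromList [("a", sharedInner), ("b", sharedInner)]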
Edit: Tikhon Jelvis points out in the comments that the memory overhead of the tuple constructors--which I did not think to account for--is not at all negligible. There is certainly some overhead to the curried form, but that overhead is proportional to how many distinct a values there are. The tuple constructor overhead in the uncurried form, on the other hand, is proportional to the total number of keys.
So if, on average, for any given value of a there are three or more distinct keys using it you'll probably save memory using the curried version. The concerns about unbalanced trees still apply, of course. The more I think about it, the more I suspect the curried form is unequivocally better except perhaps if your keys are very sparse and unevenly distributed.
Note that because arity of definitions does matter to GHC, the same care is required when defining functions if you want subexpressions to be shared; this is one reason you sometimes see functions defined in a style like this:
foo x = go
  where
    z = expensiveComputation x
    go y = doStuff y z
Tuples are lazy in both elements, so the tuple version introduces a little extra laziness. Whether this is good or bad strongly depends on your usage. (In particular, comparisons may force the tuple elements, but only if there are lots of duplicate a values.)
Beyond that, I think it's going to depend on how many duplicates you have. If a is almost always different whenever b is, you're going to have a lot of small trees, so the tuple version might be better. On the other hand, if the opposite is true, the non-tuple version may save you a little time (not constantly recomparing a once you've found the appropriate subtree and you're looking for b).
I'm reminded of tries, and how they store common prefixes once. The non-tuple version seems to be a bit like that. A trie can be more efficient than a BST if there's lots of common prefixes, and less efficient if there aren't.
But the bottom line: benchmark it!! ;-)
Apart from the efficiency aspects, there's also a pragmatic side to this question: what do you want to do with this structure?
Do you, for instance, want to be able to store an empty map for a given value of type a? If so, then the curried (nested) version might be more practical!
Here's a simple example: let's say we want to store String-valued properties of persons - say the value of some fields on that person's stackoverflow profile page.
type Person = String
type Property = String
curriedMap :: Map Person (Map Property String)
curriedMap = fromList [
    ("yatima2975", fromList [("location","Utrecht"),("age","37")]),
    ("PLL", fromList []) ]

uncurriedMap :: Map (Person,Property) String
uncurriedMap = fromList [
    (("yatima2975","location"), "Utrecht"),
    (("yatima2975","age"), "37") ]
With the uncurried version, there is no nice way to record the fact that user "PLL" is known to the system, but hasn't filled in any information. A person/property pair ("PLL",undefined) is going to cause runtime crashes, since Map is strict in the keys.
You could change the type of uncurriedMap to Map (Person,Property) (Maybe String) and store Nothings in there, and that might very well be the best solution in this case; but where there's an unknown/varying number of properties (e.g. depending on the kind of Person) that will also run into difficulties.
So, I guess it also depends on whether you need a query function like this:
data QueryResult = PersonUnknown | PropertyUnknownForPerson | Value String
query :: Person -> Property -> Map (Person, Property) String -> QueryResult
This is hard to write (if not impossible) against the uncurried version, but easy against the curried version.
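For the record, here is roughly what it looks like against the curried (nested) map, reusing the Person, Property and QueryResult definitions above; queryNested is just an illustrative name, to avoid clashing with the signature given above:

import qualified Data.Map as M

queryNested :: Person -> Property -> M.Map Person (M.Map Property String) -> QueryResult
queryNested person prop m =
  case M.lookup person m of
    Nothing    -> PersonUnknown
    Just props -> maybe PropertyUnknownForPerson Value (M.lookup prop props)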

Given a Haskell type signature, is it possible to generate the code automatically?

What it says in the title. If I write a type signature, is it possible to algorithmically generate an expression which has that type signature?
It seems plausible that it might be possible to do this. We already know that if the type is a special-case of a library function's type signature, Hoogle can find that function algorithmically. On the other hand, many simple problems relating to general expressions are actually unsolvable (e.g., it is impossible to know if two functions do the same thing), so it's hardly implausible that this is one of them.
It's probably bad form to ask several questions all at once, but I'd like to know:
Can it be done?
If so, how?
If not, are there any restricted situations where it becomes possible?
It's quite possible for two distinct expressions to have the same type signature. Can you compute all of them? Or even some of them?
Does anybody have working code which does this stuff for real?
Djinn does this for a restricted subset of Haskell types, corresponding to a first-order logic. It can't manage recursive types or types that require recursion to implement, though; so, for instance, it can't write a term of type (a -> a) -> a (the type of fix), which corresponds to the proposition "if a implies a, then a". That proposition is not valid in general, and if you could prove it, you could use it to prove anything. Indeed, this is why fix gives rise to ⊥.
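For contrast, a type like the following corresponds to a valid first-order proposition ("if a implies b and b implies c, then a implies c"), so a total term exists and is essentially forced by the type; this is the sort of term a tool like Djinn can find (compose' is just an illustrative name):

compose' :: (a -> b) -> (b -> c) -> a -> c
compose' f g x = g (f x)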
If you do allow fix, then writing a program to give a term of any type is trivial; the program would simply print fix id for every type.
Djinn is mostly a toy, but it can do some fun things, like deriving the correct Monad instances for Reader and Cont given the types of return and (>>=). You can try it out by installing the djinn package, or using lambdabot, which integrates it as the #djinn command.
Oleg at okmij.org has an implementation of this. There is a short introduction here but the literate Haskell source contains the details and the description of the process. (I'm not sure how this corresponds to Djinn in power, but it is another example.)
There are cases where there is no unique function:
fst', snd' :: (a, a) -> a
fst' (a,_) = a
snd' (_,b) = b
Not only this; there are cases where there are an infinite number of functions:
list0, list1, list2 :: [a] -> a
list0 l = l !! 0
list1 l = l !! 1
list2 l = l !! 2
-- etc.
-- Or
mkList0, mkList1, mkList2 :: a -> [a]
mkList0 _ = []
mkList1 a = [a]
mkList2 a = [a,a]
-- etc.
(If you only want total functions, then consider [a] as restricted to infinite lists for list0, list1 etc, i.e. data List a = Cons a (List a))
In fact, if you have recursive types, any types involving these correspond to an infinite number of functions. However, at least in the case above, there is a countable number of functions, so it is possible to create an (infinite) list containing all of them. But, I think the type [a] -> [a] corresponds to an uncountably infinite number of functions (again restrict [a] to infinite lists) so you can't even enumerate them all!
(Summary: there are types that correspond to a finite, countably infinite and uncountably infinite number of functions.)
This is impossible in general (especially for languages like Haskell, which do not even have the strong normalization property), and only possible in some (very) special cases (and for more restricted languages), such as when the codomain type has only one constructor (for example, a function f :: forall a. a -> () can be determined uniquely). In order to reduce the set of possible definitions for a given signature to a singleton set with just one definition, you need to give more restrictions, in the form of additional properties (though it is still difficult to imagine how this could help without giving an example of use).
From the (n-)categorical point of view, types correspond to objects, terms correspond to arrows (constructors also correspond to arrows), and function definitions correspond to 2-arrows. The question is analogous to asking whether one can construct a 2-category with the required properties by specifying only a set of objects. That is impossible, since you need either an explicit construction for the arrows and 2-arrows (i.e., writing the terms and definitions), or a deductive system that can derive the necessary structure from a certain set of properties (which still need to be stated explicitly).
There is also an interesting question: given an ADT (i.e., a subcategory of Hask), is it possible to automatically derive instances for Typeable, Data (yes, using SYB), Traversable, Foldable, Functor, Pointed, Applicative, Monad, etc.? In this case, we have the necessary signatures as well as additional properties (for example, the monad laws; these properties cannot be expressed in Haskell, but they can be expressed in a language with dependent types). There are some interesting constructions:
http://ulissesaraujo.wordpress.com/2007/12/19/catamorphisms-in-haskell
which shows what can be done for the list ADT.
The question is actually rather deep and I'm not sure of the answer, if you're asking about the full glory of Haskell types including type families, GADT's, etc.
What you're asking is whether a program can automatically prove that an arbitrary type is inhabited (contains a value) by exhibiting such a value. A principle called the Curry-Howard Correspondence says that types can be interpreted as mathematical propositions, and the type is inhabited if the proposition is constructively provable. So you're asking if there is a program that can prove a certain class of propositions to be theorems. In a language like Agda, the type system is powerful enough to express arbitrary mathematical propositions, and proving arbitrary ones is undecidable by Gödel's incompleteness theorem. On the other hand, if you drop down to (say) pure Hindley-Milner, you get a much weaker and (I think) decidable system. With Haskell 98, I'm not sure, because type classes are supposed to be able to be equivalent to GADT's.
With GADT's, I don't know if it's decidable or not, though maybe some more knowledgeable folks here would know right away. For example it might be possible to encode the halting problem for a given Turing machine as a GADT, so there is a value of that type iff the machine halts. In that case, inhabitability is clearly undecidable. But, maybe such an encoding isn't quite possible, even with type families. I'm not currently fluent enough in this subject for it to be obvious to me either way, though as I said, maybe someone else here knows the answer.
(Update:) Oh a much simpler interpretation of your question occurs to me: you may be asking if every Haskell type is inhabited. The answer is obviously not. Consider the polymorphic type
a -> b
There is no function with that signature (not counting something like unsafeCoerce, which makes the type system inconsistent).

Choosing between a class and a record

Basic question: what design principles should one follow when choosing between using a class or using a record (with polymorphic fields) ?
First, we know that classes and records are essentially equivalent (since in Core, classes get desugared to dictionaries, which are just records). Nevertheless, there are differences: classes are passed implicitly, records must be explicit.
Looking a little deeper, classes are really useful when:
we have many different representations of 'the same thing', and
in actual usage, which representation is used can be inferred.
Classes are awkward when we have (up to parametric polymorphism) only one representation of our data, but we have multiple instances. This leads to the syntactic noise of having to use newtype to add extra tags (which exist only in our code, as we know such tags get erased at run time) if we don't want to turn on all sorts of troublesome extensions (i.e. overlapping and/or undecidable instances).
Of course, things get muddier: what if I want to have constraints on my types? Let's pick a real example:
class (Bounded i, Enum i) => Partition a i where
  index :: a -> i
I could just as easily have done
data Partition a i = Partition { index :: a -> i}
But now I've lost my constraints, and I will have to add them to specific functions instead.
Are there design guidelines that would help me out?
I tend to see no issue with only requiring constraints on functions. The issue is, I suppose, that your data structure no longer models precisely what you intend it to. On the other hand, if you think of it as a data structure first and foremost, then that should matter less.
I feel like I don't necessarily still have a good grasp on the question, and this is about as vague as can be, but my rule of thumb tends to be that typeclasses are things that obey laws (or model meaning), and datatypes are things that encode a certain quantity of information.
When we want to layer behavior in complex ways, I've found that typeclasses start off enticingly, but can get painful quickly and switching to dictionary-passing makes things more straightforward. Which is to say that when we want implementations to be interoperable, then we should fall back to a uniform dictionary type.
This is take two, expanding a bit on a concrete example, but still just sort of spinning ideas...
Suppose we want to model probability distributions over the reals. Two natural representations come to mind.
A) Typeclass-driven
class PDist a where
  sample :: a -> Gen -> Double
B) Dictionary-driven
data PDist = PDist (Gen -> Double)
The former lets us do
data NormalDist = NormalDist Double Double -- mean, var
instance PDist NormalDist where...
data LognormalDist = LognormalDist Double Double
instance PDist LognormalDist where...
The latter lets us do
mkNormalDist :: Double -> Double -> PDist...
mkLognormalDist :: Double -> Double -> PDist...
In the former, we can write
data SumDist a b = SumDist a b
instance (PDist a, PDist b) => PDist (SumDist a b)...
in the latter we can simply write
sumDist :: PDist -> PDist -> PDist
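Under the dictionary representation, that combinator might be written roughly as follows; since Gen is left abstract above, this self-contained sketch assumes it is a splittable generator like System.Random's StdGen and restates the PDist wrapper:

import System.Random (StdGen, split)

type Gen = StdGen
newtype PDist = PDist (Gen -> Double)

-- Sum of two independent random variables: give each distribution its own
-- half of the split generator and add the samples.
sumDist :: PDist -> PDist -> PDist
sumDist (PDist f) (PDist g) = PDist $ \gen ->
  let (g1, g2) = split gen
  in  f g1 + g g2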
So what are the tradeoffs? Typeclass-driven lets us specify what distributions we're given. The tradeoff is that we have to construct an algebra of distributions explicitly, including new types for their combinations. Data-driven doesn't let us restrict the distributions we're given (or even check that they're well-formed), but in return we can do whatever the heck we want.
Furthermore we can write a parseDist :: String -> PDist relatively easily, but we have to go through some angst to do the equivalent for the typeclass approach.
So this is, in a sense, the typed/untyped static/dynamic tradeoff at another level. We can give it a twist though, and argue that the typeclass, along with associated algebraic laws, specifies the semantics of a probability distribution. And the PDist type can indeed be made an instance of the PDist typeclass. Meanwhile, we can resign ourselves to using the PDist type (rather than the typeclass) nearly everywhere, while thinking of it as isomorphic to the tower of instances and datatypes necessary to use the typeclass more "richly."
In fact, we can even define basic PDist functions in terms of the typeclass functions, i.e. mkNormalDist m v = PDist (sample $ NormalDist m v). So there's lots of room in the design space to slide between the two representations as necessary...
Note: I'm not sure that I understand the OP exactly. Suggestions/comments for improvement appreciated!
Background:
When I first learned about typeclasses in Haskell, the general rule-of-thumb I picked up was that, in comparison to Java-like languages:
typeclasses are similar to interfaces
data are similar to classes
Here's another SO question and answer that describe guidelines for using interfaces (also some drawbacks of interface over-use). My interpretation:
records/Java-classes are what something is
interfaces/typeclasses are roles that a concretion can fulfil
multiple, unrelated concretions can fulfil the same role
I bet you already know all this.
The guidelines I try to follow for my own code are:
typeclasses are for abstractions
records are for concretions
So in practice this means:
let the needs of the data determine the records
let the client code determine what the interfaces are -- clients should depend on abstractions, and thereby drive the creation and design of typeclasses
Example:
typeclass Show, with function show :: (Show s) => s -> String: for data that can be represented as a String.
clients just want to turn data into strings
clients don't care what the data (concretion) is -- only care that it can be represented as a string
role of implementing data: can be string-ified
this could not be achieved without a typeclass -- each datatype would require a conversion function with a different name, what a pain to deal with!
Type-classes can sometimes provide additional type-safety (An example would be Ord with Data.Map.union). If you have similar circumstances where choosing type-classes may help your type-safety - then use type-classes.
I'll present a different example where I think type-classes would not provide additional safety:
class Drawing a where
  drawAsHtml :: a -> Html
  drawOpenGL :: a -> IO ()
exampleFunctionA :: Drawing a => a -> a -> Something
exampleFunctionB :: (Drawing a, Drawing b) => a -> b -> Something
There is nothing exampleFunctionA could do that exampleFunctionB could not do (I find it hard to explain why, insights are welcome).
In this case I see no benefit of using a type-class.
(Edited following feedback from Jacques and question from missingo)

Handling multiple types with the same internal representation and minimal boilerplate?

I find myself running into a problem commonly, when writing larger programs in Haskell. I find myself often wanting multiple distinct types that share an internal representation and several core operations.
There are two relatively obvious approaches to solving this problem.
One is using a type class and the GeneralizedNewtypeDeriving extension. Put enough logic into a type class to support the shared operations that the use case desires. Create a type with the desired representation, and create an instance of the type class for that type. Then, for each use case, create wrappers for it with newtype, and derive the common class.
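A sketch of that first approach, with illustrative names; the shared operations live in a type class, and GeneralizedNewtypeDeriving lifts the underlying instance to each wrapper:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- Shared operations, with one instance for the underlying representation.
class Quantity a where
  magnitude :: a -> Double

instance Quantity Double where
  magnitude = abs

-- Distinct use-case types sharing the representation and the operations.
newtype Velocity = Velocity Double deriving (Eq, Show, Num, Quantity)
newtype Mass     = Mass     Double deriving (Eq, Show, Num, Quantity)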
The other is to declare the type with a phantom type variable, and then use EmptyDataDecls to create distinct types for each different use case.
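And a sketch of the second approach, where Meters and Feet are illustrative empty tag types:

{-# LANGUAGE EmptyDataDecls #-}

data Meters
data Feet

-- One representation, tagged with a phantom type so different uses can't mix.
newtype Distance tag = Distance Double
  deriving (Eq, Ord, Show)

addDistance :: Distance tag -> Distance tag -> Distance tag
addDistance (Distance x) (Distance y) = Distance (x + y)

-- addDistance (Distance 1 :: Distance Meters) (Distance 2 :: Distance Feet)
-- is rejected by the type checker, even though both share the representation.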
My main concern is not mixing up values that share internal representation and operations, but have different meanings in my code. Both of those approaches solve that problem, but feel significantly clumsy. My second concern is reducing the amount of boilerplate required, and both approaches do well enough at that.
What are the advantages and disadvantages of each approach? Is there a technique that comes closer to doing what I want, providing type safety without boilerplate code?
There's another straightforward approach.
data MyGenType = Foo | Bar
op :: MyGenType -> MyGenType
op x = ...
op2 :: MyGenType -> MyGenType -> MyGenType
op2 x y = ...
newtype MySpecialType = MySpecialType { unMySpecial :: MyGenType }
inMySpecial f = MySpecialType . f . unMySpecial
inMySpecial2 f x y = ...
somefun = ... inMySpecial op x ...
someOtherFun = ... inMySpecial2 op2 x y ...
Alternately,
newtype MySpecial a = MySpecial a
instance Functor MySpecial where...
instance Applicative MySpecial where...
somefun = ... fmap op x ...
someOtherFun = ... liftA2 op2 x y ...
I think these approaches are nicer if you want to use your general type "naked" with any frequency, and only sometimes want to tag it. If, on the other hand, you generally want to use it tagged, then the phantom type approach more directly expresses what you want.
I've benchmarked toy examples and not found a performance difference between the two approaches, but usage does typically differ a bit.
For instance, in some cases you have a generic type whose constructors are exposed and you want to use newtype wrappers to indicate a more semantically specific type. Using newtypes then leads to call sites like,
s1 = Specific1 $ General "Bob" 23
s2 = Specific2 $ General "Joe" 19
Where the fact that the internal representations are the same between the different specific newtypes is transparent.
The type tag approach almost always goes along with representation constructor hiding,
data General2 a = General2 String Int
and the use of smart constructors, leading to a data type definition and call sites like,
mkSpecific1 "Bob" 23
Part of the reason being that you want some syntactically light way of indicating which tag you want. If you didn't provide smart constructors, then client code would often pick up type annotations to narrow things down, e.g.,
myValue = General2 "Bob" 23 :: General2 Specific1
Once you adopt smart constructors, you can easily add extra validation logic to catch misuses of the tag. A nice aspect of the phantom type approach is that pattern matching isn't changed at all for internal code that has access to the representation.
internalFun :: General2 a -> General2 a -> Int
internalFun (General2 _ age1) (General2 _ age2) = age1 + age2
Of course you can use the newtypes with smart constructors and an internal class for accessing the shared representation, but I think a key decision point in this design space is whether you want to keep your representation constructors exposed. If the sharing of representation should be transparent, and client code should be free to use whatever tag it wishes with no extra validation, then newtype wrappers with GeneralizedNewtypeDeriving work fine. But if you are going to adopt smart constructors for working with opaque representations, then I usually prefer phantom types.
Put enough logic into a type class to support the shared operations that the use case desires. Create a type with the desired representation, and create an instance of the type class for that type. Then, for each use case, create wrappers for it with newtype, and derive the common class.
This presents some pitfalls, depending on the nature of the type and what kind of operations are involved.
First, it forces a lot of functions to be unnecessarily polymorphic--even if in practice every instance does the same thing for different wrappers, the open world assumption for type classes means the compiler has to account for the possibility of other instances. While GHC is definitely smarter than the average compiler, the more information you can give it the more it can do to help you.
Second, this can create a bottleneck for more complicated data structures. Any generic function on the wrapped types will be constrained to the interface presented by the type class, so unless that interface is exhaustive in terms of both expressivity and efficiency, you run the risk of either hobbling algorithms that use the type or altering the type class repeatedly as you find missing functionality.
On the other hand, if the wrapped type is already kept abstract (i.e., it doesn't export constructors) the bottleneck issue is irrelevant, so a type class might make good sense. Otherwise, I'd probably go with the phantom type tags (or possibly the identity Functor approach that sclv described).
