Haskell: `Map (a,b) c` versus `Map a (Map b c)`? - haskell

Thinking of maps as representations of finite functions, a map of two or more variables can be given either in curried or uncurried form; that is, the types Map (a,b) c and Map a (Map b c) are isomorphic, or something close to it.
What practical considerations are there — efficiency, etc — for choosing between the two representations?

The Ord instance of tuples uses lexicographic order, so Map (a, b) c is going to sort by a first anyway, so the overall order will be the same. Regarding practical considerations:
Because Data.Map is a binary search tree splitting at a key is comparable to a lookup, so getting a submap for a given a in the uncurried form won't be significantly more expensive than in the curried form.
The curried form may produce a less balanced tree overall, for the obvious reason of having multiple trees instead of just one.
The curried form will have a bit of extra overhead to store the nested maps.
The nested maps of the curried form representing "partial applications" can be shared if some a values produce the same result.
Similarly, "partial application" of the curried form gives you the existing inner map, while the uncurried form must construct a new map.
So the uncurried form is clearly better in general, but the curried form may be better if you expect to do "partial application" often and would benefit from sharing of Map b c values.
Note that some care will be necessary to ensure you actually benefit from that potential sharing; you'll need to explicitly define any shared inner maps and reuse the single value when constructing the full map.
Edit: Tikhon Jelvis points out in the comments that the memory overhead of the tuple constructors--which I did not think to account for--is not at all negligible. There is certainly some overhead to the curried form, but that overhead is proportional to how many distinct a values there are. The tuple constructor overhead in the uncurried form, on the other hand, is proportional to the total number of keys.
So if, on average, for any given value of a there are three or more distinct keys using it you'll probably save memory using the curried version. The concerns about unbalanced trees still apply, of course. The more I think about it, the more I suspect the curried form is unequivocally better except perhaps if your keys are very sparse and unevenly distributed.
Note that because arity of definitions does matter to GHC, the same care is required when defining functions if you want subexpressions to be shared; this is one reason you sometimes see functions defined in a style like this:
foo x = go
where z = expensiveComputation x
go y = doStuff y z

Tuples are lazy in both elements, so the tuple version introduces a little extra laziness. Whether this is good or bad strongly depends on your usage. (In particular, comparisons may force the tuple elements, but only if there are lots of duplicate a values.)
Beyond that, I think it's going to depend on how many duplicates you have. If a is almost always different whenever b is, you're going to have a lot of small trees, so the tuple version might be better. On the other hand, if the opposite is true, the non-tuple version may save you a little time (not constantly recomparing a once you've found the appropriate subtree and you're looking for b).
I'm reminded of tries, and how they store common prefixes once. The non-tuple version seems to be a bit like that. A trie can be more efficient than a BST if there's lots of common prefixes, and less efficient if there aren't.
But the bottom line: benchmark it!! ;-)

Apart from the efficiency aspects, there's also a pragmatic side to this question: what do you want to do with this structure?
Do you, for instance, want to be able to store an empty map for a given value of type a? If so, then the uncurried version might be more practical!
Here's a simple example: let's say we want to store String-valued properties of persons - say the value of some fields on that person's stackoverflow profile page.
type Person = String
type Property = String
uncurriedMap :: Map Person (Map Property String)
uncurriedMap = fromList [
("yatima2975", fromList [("location","Utrecht"),("age","37")]),
("PLL", fromList []) ]
curriedMap :: Map (Person,Property) String
curriedMap = fromList [
(("yatima2975","location"), "Utrecht"),
(("yatima2975","age"), "37") ]
With the curried version, there is no nice way to record the fact that user "PLL" is known to the system, but hasn't filled in any information. A person/property pair ("PLL",undefined) is going to cause runtime crashes, since Map is strict in the keys.
You could change the type of curriedMap to Map (Person,Property) (Maybe String) and store Nothings in there, and that might very well be the best solution in this case; but where there's a unknown/varying number of properties (e.g. depending on the kind of Person) that will also run into difficulties.
So, I guess it also depends on whether you need a query function like this:
data QueryResult = PersonUnknown | PropertyUnknownForPerson | Value String
query :: Person -> Property -> Map (Person, Property) String -> QueryResult
This is hard to write (if not impossible) in the curried version, but easy in the uncurried version.

Related

Type constraints on dimensionality of vectors in F# and Haskell (Dependent Types)

I'm new to F# and Haskell and am implementing a project in order to determine which language I would prefer to devote more time to.
I have a numerous situations where I expect a given numerical type to have given dimensions based on parameters given to a top-level function (ie, at runtime). For example, in this F# snippet, I have
type DataStreamItem = LinearAlgebra.Vector<float32>
type Ball =
{R : float32;
X : DataStreamItem}
and I expect all instances of type DataStreamItem to have D dimensions.
My question is in the interests of algorithm development and debugging since such shape-mismatche-bugs can be a headache to pin down but should be a non-issue when the algorithm is up-and-running:
Is there a way, in either F# or Haskell, to constrain DataStreamItem and / or Ball to have dimensions of D? Or do I need to resort to pattern matching on every calculation?
If the latter is the case, are there any good, light-weight paradigms to catch such constraint violations as soon as they occur (and that can be removed when performance is critical)?
Edit:
To clarify the sense in which D is constrained:
D is defined such that if you expressed the algorithm of the function main(DataStream) as a computation graph, all of the intermediate calculations would depend on the dimension of D for the execution of main(DataStream). The simplest example I can think of would be a dot-product of M with DataStreamItem: the dimension of DataStream would determine the creation of dimension parameters of M
Another Edit:
A week later, I find the following blog outlining precisely what I was looking for in dependant types in Haskell:
https://blog.jle.im/entry/practical-dependent-types-in-haskell-1.html
And Another:
This reddit contains some discussion on Dependent Types in Haskell and contains a link to the quite interesting dissertation proposal of R. Eisenberg.
Neither Haskell not F# type system is rich enough to (directly) express statements of the sort "N nested instances of a recursive type T, where N is between 2 and 6" or "a string of characters exactly 6 long". Not in those exact terms, at least.
I mean, sure, you can always express such a 6-long string type as type String6 = String6 of char*char*char*char*char*char or some variant of the sort (which technically should be enough for your particular example with vectors, unless you're not telling us the whole example), but you can't say something like type String6 = s:string{s.Length=6} and, more importantly, you can't define functions of the form concat: String<n> -> String<m> -> String<n+m>, where n and m represent string lengths.
But you're not the first person asking this question. This research direction does exist, and is called "dependent types", and I can express the gist of it most generally as "having higher-order, more powerful operations on types" (as opposed to just union and intersection, as we have in ML languages) - notice how in the example above I parametrize the type String with a number, not another type, and then do arithmetic on that number.
The most prominent language prototypes (that I know of) in this direction are Agda, Idris, F*, and Coq (not really the full deal AFAIK). Check them out, but beware: this is kind of the edge of tomorrow, and I wouldn't advise starting a big project based on those languages.
(edit: apparently you can do certain tricks in Haskell to simulate dependent types, but it's not very convenient, and you have to enable UndecidableInstances)
Alternatively, you could go with a weaker solution of doing the checks at runtime. The general gist is: wrap your vector types in a plain wrapper, don't allow direct construction of it, but provide constructor functions instead, and make those constructor functions ensure the desired property (i.e. length). Something like:
type Stream4 = private Stream4 of DataStreamItem
with
static member create (item: DataStreamItem) =
if item.Length = 4 then Some (Stream4 item)
else None
// Alternatively:
if item.Length <> 4 then failwith "Expected a 4-long vector."
item
Here is a fuller explanation of the approach from Scott Wlaschin: constrained strings.
So if I understood correctly, you're actually not doing any type-level arithmetic, you just have a “length tag” that's shared in a chain of function calls.
This has long been possible to do in Haskell; one way that I consider quite elegant is to annotate your arrays with a standard fixed-length type of the desired length:
newtype FixVect v s = FixVect { getFixVect :: VU.Vector s }
To ensure the correct length, you only provide (polymorphic) smart constructors that construct from the fixed-length type – perfectly safe, though the actual dimension number is nowhere mentioned!
class VectorSpace v => FiniteDimensional v where
asFixVect :: v -> FixVect v (Scalar v)
instance FiniteDimensional Float where
asFixVect s = FixVect $ VU.singleton s
instance (FiniteDimensional a, FiniteDimensional b, Scalar a ~ Scalar b) => FiniteDimensional (a,b) where
asFixVect (a,b) = case (asFixVect a, asFixVect b) of
(FixVect av, FixVect bv) -> FixVect $ av<>bv
This construction from unboxed tuples is really inefficient, however this doesn't mean you can write efficient programs with this paradigm – if the dimension always stays constant, you only need to wrap and unwrap the once and can do all the critical operations through safe yet runtime-unchecked zips, folds and LA combinations.
Regardless, this approach isn't really widely used. Perhaps the single constant dimension is in fact too limiting for most relevant operations, and if you need to unwrap to tuples often it's way too inefficient. Another approach that is taking off these days is to actually tag the vectors with type-level numbers. Such numbers have become available in a usable form with the introduction of data kinds in GHC-7.4. Up until now, they're still rather unwieldy and not fit for proper arithmetic, but the upcoming 8.0 will greatly improve many aspects of this dependently-typed programming in Haskell.
A library that offers efficient length-indexed arrays is linear.

Is there a way to "preserve" results in Haskell?

I'm new to Haskell and understand that it is (basically) a pure functional language, which has the advantage that results to functions will not change across multiple evaluations. Given this, I'm puzzled by why I can't easily mark a function in such a way that its remembers the results of its first evaluation, and does not have to be evaluated again each time its value is required.
In Mathematica, for example, there is a simple idiom for accomplishing this:
f[x_]:=f[x]= ...
but in Haskell, the closest things I've found is something like
f' = (map f [0 ..] !!)
where f 0 = ...
f n = f' ...
which in addition to being far less clear (and apparently limited to Int arguments?) does not (seem to) preserve results within an interactive session.
Admittedly (and clearly), I don't understand exactly what's going on here; but naively, it seems like Haskel should have some way, at the function definition level, of
taking advantage of the fact that its functions are functions and skipping re-computation of their results once they have been computed, and
indicating a desire to do this at the function definition level with a simple and clean idiom.
Is there a way to accomplish this in Haskell that I'm missing? I understand (sort of) that Haskell can't store the evaluations as "state", but why can't it simply (in effect) redefine evaluated functions to be their computed value?
This grows out of this question, in which lack of this feature results in terrible performance.
Use a suitable library, such as MemoTrie.
import Data.MemoTrie
f' = memo f
where f 0 = ...
f n = f' ...
That's hardly less nice than the Mathematica version, is it?
Regarding
“why can't it simply (in effect) redefine evaluated functions to be their computed value?”
Well, it's not so easy in general. These values have to be stored somewhere. Even for an Int-valued function, you can't just allocate an array with all possible values – it wouldn't fit in memory. The list solution only works because Haskell is lazy and therefore allows infinite lists, but that's not particularly satisfying since lookup is O(n).
For other types it's simply hopeless – you'd need to somehow diagonalise an over-countably infinite domain.
You need some cleverer organisation. I don't know how Mathematica does this, but it probably uses a lot of “proprietary magic”. I wouldn't be so sure that it does really work the way you'd like, for any inputs.
Haskell fortunately has type classes, and these allow you to express exactly what a type needs in order to be quickly memoisable. HasTrie is such a class.

Memoize a Double function in Haskell

I have a function
slow :: Double -> Double
which gets called very often (hundreds of millions of times), but only gets called on about a thousand discrete values. This seems like an excellent candidate for memoization, but I can't figure out how to memoize a function of a Double.
The standard technique of making a list doesn't work, since it's not an integral type. I looked at Data.MemoCombinators, but it doesn't natively support Doubles. There was a bits function for handling more data types, but Double isn't an instance of Data.Bits.
Is there an elegant way to memoize `slow?
You could always use ugly-memo. The internals are impure, but it's fast and does what you need (except if the argument is NaN).
I think StableMemo should do exactly what you want, but I don't have any experience with that.
There are two main approaches: use the Ord property to store the keys in a tree structure, like Map. That doesn't require the integral property you'd need for e.g. a MemoTrie approach; it is thus slower but very simple.
The alternative, that works with yet much more general types, is to map unorderedly onto a large integral domain with a Hash function, in order to, well, store the keys in a hash map. This is going to be substantially faster but pretty much just as simple since the interface of HashMap largely matches that of Map, so you probably want to go that way.
Now, sadly neither is quite as simple to use as MemoCombinators. That builds directly on IntTrie, which is specialised for offering a lazy / infinite / pure interface. Both Map and particularly HashMap, in contrast, can be used very well for impure memoisation, but are not inherently able to do it purely. You may throw in some UnsafePerformIO (uh oh), or just do it openly in the IO monad (yuk!). Or use StableMemo.
But it's actually easy and safe if you already know which values it's going to be, at least most of the calls, at compile-time. Then you can just fill a local hash map with those values in the beginning, and at each call look up if it's there and otherwise simply call the expensive function directly:
import qualified Data.HashMap.Lazy as HM
type X = Double -- could really be anything else
notThatSlow :: Double -> X
notThatSlow = \v -> case HM.lookup v memo of
Just x -> x
Nothing -> slow v
where memo = HM.fromList [ (v, x) | v<-expectedValues, let x = slow v ]

When to expose constructors of a data type when designing data structures?

When designing data structures in functional languages there are 2 options:
Expose their constructors and pattern match on them.
Hide their constructors and use higher-level functions to examine the data structures.
In what cases, what is appropriate?
Pattern matching can make code much more readable or simpler. On the other hand, if we need to change something in the definition of a data type then all places where we pattern-match on them (or construct them) need to be updated.
I've been asking this question myself for some time. Often it happens to me that I start with a simple data structure (or even a type alias) and it seems that constructors + pattern matching will be the easiest approach and produce a clean and readable code. But later things get more complicated, I have to change the data type definition and refactor a big part of the code.
The essential factor for me is the answer to the following question:
Is the structure of my datatype relevant to the outside world?
For example, the internal structure of the list datatype is very much relevant to the outside world - it has an inductive structure that is certainly very useful to expose to consumers, because they construct functions that proceed by induction on the structure of the list. If the list is finite, then these functions are guaranteed to terminate. Also, defining functions in this way makes it easy to provide properties about them, again by induction.
By contrast, it is best for the Set datatype to be kept abstract. Internally, it is implemented as a tree in the containers package. However, it might as well have been implemented using arrays, or (more usefully in a functional setting) with a tree with a slightly different structure and respecting different invariants (balanced or unbalanced, branching factor, etc). The need to enforce any invariants above and over those that the constructors already enforce through their types, by the way, precludes letting the datatype be concrete.
The essential difference between the list example and the set example is that the Set datatype is only relevant for the operations that are possible on Set's. Whereas lists are relevant because the standard library already provides many functions to act on them, but in addition their structure is relevant.
As a sidenote, one might object that actually the inductive structure of lists, which is so fundamental to write functions whose termination and behaviour is easy to reason about, is captured abstractly by two functions that consume lists: foldr and foldl. Given these two basic list operators, most functions do not need to inspect the structure of a list at all, and so it could be argued that lists too coud be kept abstract. This argument generalizes to many other similar structures, such as all Traversable structures, all Foldable structures, etc. However, it is nigh impossible to capture all possible recursion patterns on lists, and in fact many functions aren't recursive at all. Given only foldr and foldl, one would, writing head for example would still be possible, though quite tedious:
head xs = fromJust $ foldl (\b x -> maybe (Just x) Just b) Nothing xs
We're much better off just giving away the internal structure of the list.
One final point is that sometimes the actual representation of a datatype isn't relevant to the outside world, because say it is some kind of optimised and might not be the canonical representation, or there isn't a single "canonical" representation. In these cases, you'll want to keep your datatype abstract, but offer "views" of your datatype, which do provide concrete representations that can be pattern matched on.
One example would be if wanted to define a Complex datatype of complex numbers, where both cartesian forms and polar forms can be considered canonical. In this case, you would keep Complex abstract, but export two views, ie functions polar and cartesian that return a pair of a length and an angle or a coordinate in the cartesian plane, respectively.
Well, the rule is pretty simple: If it's easy to construct wrong values by using the actual constructors, then don't allow them to be used directly, but instead provide smart constructors. This is the path followed by some data structures like Map and Set, which are easy to get wrong.
Then there are the types for which it's impossible or hard to construct inconsistent/wrong values either because the type doesn't allow that at all or because you would need to introduce bottoms. The length-indexed list type (commonly called Vec) and most monads are examples of that.
Ultimately this is your own decision. Put yourself into the user's perspective and make the tradeoff between convenience and safety. If there is no tradeoff, then always expose the constructors. Otherwise your library users will hate you for the unnecessary opacity.
If the data type serves a simple purpose (like Maybe a) and no (explicit or implicit) assumptions about the data type can be violated by directly constructing a value via the data constructors, I would expose the constructors.
On the other hand, if the data type is more complex (like a balanced tree) and/or it's internal representation is likely to change, I usually hide the constructors.
When using a package, there's an unwritten rule that the interface exposed by a non-internal module should be "safe" to use on the given data type. Considering the balanced tree example, exposing the data constructors allows one to (accidentally) construct an unbalanced tree, and so the assumed runtime guarantees for searching the tree etc might be violated.
If the type is used to represent values with a canonical definition and representation (many mathematical objects fall into this category), and it's not possible to construct "invalid" values using the type, then you should expose the constructors.
For example, if you're representing something like two dimensional points with your own type (including a newtype), you might as well expose the constructor. The reality is that a change to this datatype is not going to be a change in how 2d points are represented, it's going to be a change in your need to use 2d points (maybe you're generalising to 3d space, maybe you're adding a concept of layers, or whatever), and is almost certain to need attention in the parts of the code using values of this type no matter what you do.[1]
A complex type representing something specific to your application or field is quite likely to undergo changes to the representation while continuing to support similar operations. Therefore you only want other modules depending on the operations, not on the internal structure. So you shouldn't expose the constructors.
Other types represent things with canonical definitions but not canonical representations. Everyone knows the properties expected of maps and sets, but there are lots of different ways of representing values that support those properties. So you again only want other modules depending on the operations they support, not on the particular representations.
Some types, whether or not they are if simple with canonical representations, allow the construction of values in the program which don't represent a valid value in the abstract concept the type is supposed to represent. A simple example would be a type representing a self-balancing binary search tree; client code with access to the constructors could easily construct invalid trees. Exposing the constructors either means you need to assume that such values passed in from outside may be invalid and therefore you need to make something sensible happen even for bizarre values, or means that it's the responsibility of the programmers working with your interface to ensure they don't violate any assumptions. It's usually better to just keep such types from being constructed directly outside your module.
Basically it comes down to the concept your type is supposed to represent. If your concept maps in a very simple and obvious[2] way directly to values in some data type which isn't "more inclusive" than the concept due to the compiler being unable to check needed invariants, then the concept is pretty much "the same" as the data type, and exposing its structure is fine. If not, then you probably need to keep the structure hidden.
[1] A likely change though would be to change which numeric type you're using for the coordinate values, so you probably do have to think about how to minimise the impact of such changes. That's pretty orthogonal to whether or not you expose the constructors though.
[2] "Obvious" here meaning that if you asked 10 people independently to come up with a data type representing the concept they would all come back with the same thing, modulo changing the names.
I would propose a different, noticeably more restrictive rule than most people. The central criterion would be:
Do you guarantee that this type will never, ever change? If so, exposing the constructors might be a good idea. Good luck with that, though!
But the types for which you can make that guarantee tend to be very simple, generic "foundation" types like Maybe, Either or [], which one could arguably write once and then never revisit again.
Though even those can be questioned, because they do get revisited from time to time; there's people who have used Church-encoded versions of Maybe and List in various contexts for performance reasons, e.g.:
{-# LANGUAGE RankNTypes #-}
newtype Maybe' a = Maybe' { elimMaybe' :: forall r. r -> (a -> r) -> r }
nothing = Maybe' $ \z k -> z
just x = Maybe' $ \z k -> k x
newtype List' a = List' { elimList' :: forall r. (a -> r -> r) -> r -> r }
nil = List' $ \k z -> z
cons x xs = List' $ \k z -> k x (elimList' k z xs)
These two examples highlight something important: you can replace the Maybe' type's implementation shown above with any other implementation as long as it supports the following three functions:
nothing :: Maybe' a
just :: a -> Maybe' a
elimMaybe' :: Maybe' a -> r -> (a -> r) -> r
...and the following laws:
elimMaybe' nothing z x == z
elimMaybe' (just x) z f == f x
And this technique can be applied to any algebraic data type. Which to me says that pattern matching against concrete constructors is just insufficiently abstract; it doesn't really gain you anything that you can't get out of the abstract constructors + destructor pattern, and it loses implementation flexibility.

Memoization in Haskell using premade data structures

I find this answer and this wiki page to be excellent introductions to memoization in Haskell. They do, however, still leave me with a question that I hope to get answered:
It seems to me that the technique used requires you to "open up" (as in "access the internals of") the data structure you use to store your memoization. For example, 1 implements a table structure and 2 implements a tree in section 3. Is it possible to do something similar with a pre-made data structure? Suppose, for example, that you think that Data.Map is really awesome, and would like to store your memoized values in such a Map. Can one approach memoization with a pre-made data structure such as this, where one does not implement the structure itself, but rather use a pre-made one?
Hopefully someone will give me a hint on how to think, or, perhaps more likely, correct my misunderstanding of functional memoization in general.
Edit: I can think of one way to do it, but it's not at all elegant: If f :: a -> b, then one can probably easily make a memoized version f' :: Map a b -> a -> (Map a b, b), where the first argument is the memoization storage, and the output pair contains a potentially updated storage and the computed value. This state-passing is certainly not what I want (although I guess it could be wrapped in a monad, but it's several orders of magnitudes uglier than the approach in 1 and 2).
Edit 2: Maybe it helps to try and express my current way of (incorrect) thought. Currently, I seem to repeatedly pull myself, against my will, into the non-solution
import qualified Data.Map as Map
memo :: (Ord a) => [a] -> (a -> b) -> (a -> b)
memo domain f = (Map.!) storage
where
storage = Map.fromList (zip domain (map f domain))
The more I stare at this, the more I realize I've misunderstood something basic. You see, it feels to me that my memo [True, False] is equivalent to the bool memoizer of 1.
If you notice, Data.Memocombinators actually relies on the "pre-made" Data.IntTrie. I'm sure you could take the same code and replace uses of the IntTrie with another data structure, though it may not be as efficient.
The general idea of memoization is to save computed results. In Haskell, the easiest way to do this is to map your function onto a table where the table has one dimension per parameter. Since Haskell is lazy (well, most data structures in Haskell are), it will only evaluate the value of a given cell when you specifically ask for it. "table" basically means "map" since it takes you from key(s) to value.
[edit] Additional thoughts regarding Example 2
If I'm not mistaken, then the first time (Map.!) storage is forced to evaluate for a given key, the storage structure will be entirely wrung out (though the computation f won't happen for anything but the given key). So the first lookup will cause an additional O(n) operation, n being length domain. Subsequent lookups would not, afaik, incur this cost.
Lazier structures like typical int-indexed lists or the IntTrie similarly need to manifest their structure when a lookup is invoked, but unlike a Map, they need not do so all at once. Lists are wrung out until the indexed key is accessed. IntTries wring out only the integer keys that are "prefixes" (or suffixes? not sure. could be implemented either way) of the desired key. Index 11, (1011) would wring out 1 (1), 2 (10), 5 (101), and 11 (1011). Data.Memocombinators simply transforms all keys into Ints (or "bits") so that an IntTrie can be used.
p.s. is there a better phrase than "wring out" for this? The words "force", "spine", and "manifest" come to mind, but I can't quite think of the right word/phrase for this.

Resources