Why are type-safe relational operations so difficult? - haskell

I was trying to code a relational problem in Haskell when I had to find out that doing this in a type-safe manner is far from obvious. E.g. a humble
select 1, a, b from T
already raises a number of questions:
What is the type of this function?
What is the type of the projection 1, a, b? What is the type of a projection in general?
What is the result type, and how do I express the relationship between the result type and the projection?
What is the type of such a function which accepts any valid projection?
How can I detect invalid projections at compile time?
How would I add a column to a table or to a projection?
I believe even Oracle's PL/SQL language does not get this quite right. While invalid projections are mostly detected at compile time, there is a large number of type errors which only show up at runtime. Most other bindings to RDBMSs (e.g. Java's JDBC and Perl's DBI) use SQL contained in strings and thus give up type safety entirely.
Further research showed that there are some Haskell libraries (HList, vinyl and TRex) which provide type-safe extensible records, and more. But these libraries all require Haskell extensions like DataKinds, FlexibleContexts and many more. Furthermore, these libraries are not easy to use and have a smell of trickery, at least to uninitiated observers like me.
This suggests that type-safe relational operations do not fit in well with the functional paradigm, at least not as it is implemented in Haskell.
My questions are the following:
What are the fundamental causes of the difficulty of modelling relational operations in a type-safe way? Where does Hindley-Milner fall short? Or does the problem originate in typed lambda calculus already?
Is there a paradigm where relational operations are first-class citizens? And if so, is there a real-world implementation?

Let's define a table indexed on some columns as a type with two type parameters:
data IndexedTable k v = ???

groupBy :: (v -> k) -> Table v -> IndexedTable k v

-- A table without an index just has an empty key
type Table = IndexedTable ()
k will be a (possibly nested) tuple of all columns that the table is indexed on. v will be a (possibly nested) tuple of all columns that the table is not indexed on.
So, for example, if we had the following table
| Id | First Name | Last Name |
|----|------------|-----------|
| 0 | Gabriel | Gonzalez |
| 1 | Oscar | Boykin |
| 2 | Edgar | Codd |
... and it were indexed on the first column, then the type would be:
type Id = Int
type FirstName = String
type LastName = String
IndexedTable Int (FirstName, LastName)
However, if it were indexed on the first and second column, then the type would be:
IndexedTable (Int, FirstName) LastName
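For intuition only, here is a toy Map-backed stand-in for this kind of indexing (my own illustration, not the representation the answer builds towards):

import qualified Data.Map as Map

-- Group rows under a key computed from each row.
groupByKey :: Ord k => (v -> k) -> [v] -> Map.Map k [v]
groupByKey key = Map.fromListWith (++) . map (\v -> (key v, [v]))

-- groupByKey fst [(0, "Gabriel"), (0, "Oscar"), (1, "Edgar")]
--   == Map.fromList [(0, ["Oscar", "Gabriel"]), (1, ["Edgar"])]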
Table would implement the Functor, Applicative, and Alternative type classes. In other words:
instance Functor (IndexedTable k)
instance Applicative (IndexedTable k)
instance Alternative (IndexedTable k)
So joins would be implemented as:
join :: IndexedTable k v1 -> IndexedTable k v2 -> IndexedTable k (v1, v2)
join t1 t2 = liftA2 (,) t1 t2
leftJoin :: IndexedTable k v1 -> IndexedTable k v2 -> IndexedTable k (v1, Maybe v2)
leftJoin t1 t2 = liftA2 (,) t1 (optional t2)
rightJoin :: IndexedTable k v1 -> IndexedTable k v2 -> IndexedTable k (Maybe v1, v2)
rightJoin t1 t2 = liftA2 (,) (optional t1) t2
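The answer keeps IndexedTable abstract, but as a rough analogy of my own (not the answer's implementation): for a plain Data.Map keyed on the index columns, the inner join is just intersectionWith, pairing up the non-key columns whenever the keys match.

import qualified Data.Map as Map

joinMap :: Ord k => Map.Map k v1 -> Map.Map k v2 -> Map.Map k (v1, v2)
joinMap = Map.intersectionWith (,)

-- joinMap (Map.fromList [(0, "Gabriel"), (1, "Oscar")])
--         (Map.fromList [(0, "Gonzalez"), (2, "Codd")])
--   == Map.fromList [(0, ("Gabriel", "Gonzalez"))]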
Then you would have a separate type that we will call a Select. This type will also have two type parameters:
data Select v r = ???
A Select would consume a bunch of rows of type v from the table and produce a result of type r. In other words, we should have a function of type:
select :: IndexedTable k v -> Select v r -> r
Some example Selects that we might define would be:
count :: Select v Integer
sum :: Num a => Select a a
product :: Num a => Select a a
max :: Ord a => Select a a
This Select type would implement the Applicative interface, so we could combine multiple Selects into a single Select. For example:
liftA2 (,) count sum :: Select Integer (Integer, Integer)
That would be analogous to this SQL:
SELECT COUNT(*), SUM(*)
However, often our table will have multiple columns, so we need a way to focus a Select onto a single column. Let's call this function focus:
focus :: Lens' a b -> Select b r -> Select a r
So that we can write things like:
liftA3 (,,) (focus _1 sum) (focus _2 product) (focus _3 max)
:: (Num a, Num b, Ord c)
=> Select (a, b, c) (a, b, c)
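A concrete, hedged version of this tuple example using the foldl package's premap in place of the lens-based focus (note that foldl's maximum returns a Maybe, unlike the idealised max above):

import qualified Control.Foldl as L

-- Fold each component of a triple independently, combining via Applicative.
stats :: L.Fold (Int, Int, Int) (Int, Int, Maybe Int)
stats = (,,) <$> L.premap (\(a, _, _) -> a) L.sum
             <*> L.premap (\(_, b, _) -> b) L.product
             <*> L.premap (\(_, _, c) -> c) L.maximum

-- L.fold stats [(1, 2, 3), (4, 5, 6)]  ==  (5, 10, Just 6)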
So if we wanted to write something like:
SELECT COUNT(*), MAX(firstName) FROM t
That would be equivalent to this Haskell code:
firstName :: Lens' Row String
table :: Table Row
select table (liftA2 (,) count (focus firstName max)) :: (Integer, String)
So you might wonder how one might implement Select and Table.
I describe how to implement Table in this post:
http://www.haskellforall.com/2014/12/a-very-general-api-for-relational-joins.html
... and you can implement Select as just:
type Select = Control.Foldl.Fold
focus = Control.Foldl.pretraverse
-- Assuming you define a `Foldable` instance for `IndexedTable`
select t s = Control.Foldl.fold s t
Also, keep in mind that these are not the only ways to implement Table and Select. They are just a simple implementation to get you started and you can generalize them as necessary.
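To make that concrete, here is a minimal runnable sketch of the earlier COUNT/MAX example using this Fold-based Select, with a hypothetical Row type standing in for the answer's Row and premap in place of a lens-based focus:

import qualified Control.Foldl as L

-- Hypothetical row type, for illustration only.
data Row = Row { rowId :: Int, firstName :: String }

rows :: [Row]
rows = [Row 0 "Gabriel", Row 1 "Oscar", Row 2 "Edgar"]

-- SELECT COUNT(*), MAX(firstName) FROM rows
countAndMax :: (Int, Maybe String)
countAndMax = L.fold ((,) <$> L.length <*> L.premap firstName L.maximum) rows

-- countAndMax == (3, Just "Oscar")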
What about selecting columns from a table? Well, you can define:
column :: Select a (Table a)
column = Control.Foldl.list
So if you wanted to do:
SELECT col FROM t
... you would write:
field :: Lens' Row Field
table :: Table Row
select table (focus field column) :: [Field]
The important takeaway is that you can implement a relational API in Haskell just fine without any fancy type system extensions.

Related

Clarifying Data Constructor in Haskell

In the following:
data DataType a = Data a | Datum
I understand that data constructors are value-level functions. What we do above is define their types. They can be functions of multiple arity or constants. That's fine. I'm OK with saying Datum constructs Datum. What is not explicit and clear to me here is the difference between the constructor function and what it produces. Please let me know if I am getting it right:
1 - a) Basically, writing Data a is defining both a data structure and its constructor function (as in Scala or Java, where the class and the constructor usually have the same name)?
2 - b) So if I unpack and make an analogy: with Data a we are defining both a structure of objects (a data structure; I don't want to use "class" because a class already implies a type, I think, but maybe we could) and the constructor function (data constructor/value constructor), and the latter returns an object of that object structure. Finally, the type of that structure of objects is given by the type constructor. An object structure, in a sense, is just a tag surrounding a bunch of values of some type. Is my understanding correct?
3 - c) Can I formally say:
Data constructors that are nullary represent constant values -> they return the constant value itself, whose type is given by the type constructor at the definition site.
Data constructors that take an argument represent a class of values, where the class is a tag? -> they return an infinite number of objects of that class, whose type is given by the type constructor at the definition site.
Another way of writing this:
data DataType a = Data a | Datum
is with generalised algebraic data type (GADT) syntax, using the GADTSyntax extension, which lets us specify the types of the constructors explicitly:
{-# LANGUAGE GADTSyntax #-}

data DataType a where
  Data  :: a -> DataType a
  Datum :: DataType a
(The GADTs extension would work too; it would also allow us to specify constructors with different type arguments in the result, like DataType Int vs. DataType Bool, but that’s a more advanced topic, and we don’t need that functionality here.)
These are exactly the types you would see in GHCi if you asked for the types of the constructor functions with :type / :t:
> :{
| data DataType a where
|   Data  :: a -> DataType a
|   Datum :: DataType a
| :}
> :type Data
Data :: a -> DataType a
> :t Datum
Datum :: DataType a
With ExplicitForAll we can also specify the scope of the type variables explicitly, and make it clearer that the a in the data definition is a separate variable from the a in the constructor definitions by also giving them different names:
data DataType a where
  Data  :: forall b. b -> DataType b
  Datum :: forall c. DataType c
Some more examples of this notation with standard prelude types:
data Either a b where
  Left  :: forall a b. a -> Either a b
  Right :: forall a b. b -> Either a b

data Maybe a where
  Nothing :: Maybe a
  Just    :: a -> Maybe a

data Bool where
  False :: Bool
  True  :: Bool

data Ordering where
  LT, EQ, GT :: Ordering  -- Shorthand for repeated ‘:: Ordering’
I understand that data constructors are value-level functions. What we do above is define their types. They can be functions of multiple arity or constants. That's fine. I'm OK with saying Datum constructs Datum. What is not explicit and clear to me here is the difference between the constructor function and what it produces.
Datum and Data are both “constructors” of DataType a values; neither Datum nor Data is a type! These are just “tags” that select between the possible varieties of a DataType a value.
What is produced is always a value of type DataType a for a given a; the constructor selects which “shape” it takes.
A rough analogue of this is a union in languages like C or C++, plus an enumeration for the “tag”. In pseudocode:
enum Tag {
  DataTag,
  DatumTag,
}

// A single anonymous field.
struct DataFields<A> {
  A field1;
}

// No fields.
struct DatumFields<A> {};

// A union of the possible field types.
union Fields<A> {
  DataFields<A> data;
  DatumFields<A> datum;
}

// A pair of a tag with the fields for that tag.
struct DataType<A> {
  Tag tag;
  Fields<A> fields;
}
The constructors are then just functions returning a value with the appropriate tag and fields. Pseudocode:
<A> DataType<A> newData(A x) {
  DataType<A> result;
  result.tag = DataTag;
  result.fields.data.field1 = x;
  return result;
}

<A> DataType<A> newDatum() {
  DataType<A> result;
  result.tag = DatumTag;
  // No fields.
  return result;
}
Unions are unsafe, since the tag and fields can get out of sync, but sum types are safe because they couple these together.
A pattern-match like this in Haskell:
case someDT of
  Datum  -> f
  Data x -> g x
is a combination of testing the tag and extracting the fields. Again, in pseudocode:
if (someDT.tag == DatumTag) {
  f();
} else if (someDT.tag == DataTag) {
  var x = someDT.fields.data.field1;
  g(x);
}
Again this is coupled in Haskell to ensure that you can only ever access the fields if you have checked the tag by pattern-matching.
So, in answer to your questions:
1 - a) Basically, writing Data a is defining both a data structure and its constructor function (as in Scala or Java, where the class and the constructor usually have the same name)?
Data a in your original code is not defining a data structure, in that Data is not a separate type from DataType a, it’s just one of the possible tags that a DataType a value may have. Internally, a value of type DataType Int is one of the following:
The tag for Data (in GHC, a pointer to an “info table” for the constructor), and a reference to a value of type Int.
x = Data (1 :: Int) :: DataType Int
+----------+----------------+ +---------+----------------+
x ---->| Data tag | pointer to Int |---->| Int tag | unboxed Int# 1 |
+----------+----------------+ +---------+----------------+
The tag for Datum, and no other fields.
y = Datum :: DataType Int
+-----------+
y ----> | Datum tag |
+-----------+
In a language with unions, the size of a union is the maximum of all its alternatives, since the type must support representing any of the alternatives with mutation. In Haskell, since values are immutable, they don’t require any extra “padding” since they can’t be changed.
It’s a similar situation for standard data types, e.g., a product or sum type:
(x :: X, y :: Y) :: (X, Y)
+---------+--------------+--------------+
| (,) tag | pointer to X | pointer to Y |
+---------+--------------+--------------+
Left (m :: M) :: Either M N
+-----------+--------------+
| Left tag | pointer to M |
+-----------+--------------+
Right (n :: N) :: Either M N
+-----------+--------------+
| Right tag | pointer to N |
+-----------+--------------+
2 - b) So if I unpack and make an analogy: with Data a we are defining both a structure of objects (a data structure; I don't want to use "class" because a class already implies a type, I think, but maybe we could) and the constructor function (data constructor/value constructor), and the latter returns an object of that object structure. Finally, the type of that structure of objects is given by the type constructor. An object structure, in a sense, is just a tag surrounding a bunch of values of some type. Is my understanding correct?
This is sort of correct, but again, the constructors Data and Datum aren't "data structures" by themselves. They're just the names used to introduce (construct) and eliminate (match) values of type DataType a, for some type a that is chosen by the caller of the constructors to fill in the forall.
data DataType a = Data a | Datum says:
If some term e has type T, then the term Data e has type DataType T
Inversely, if some value of type DataType T matches the pattern Data x, then x has type T in the scope of the match (case branch or function equation)
The term Datum has type DataType T for any type T
3 - c) Can I formally say:
Data constructors that are nullary represent constant values -> they return the constant value itself, whose type is given by the type constructor at the definition site.
Data constructors that take an argument represent a class of values, where the class is a tag? -> they return an infinite number of objects of that class, whose type is given by the type constructor at the definition site.
Not exactly. A type constructor like DataType :: Type -> Type, Maybe :: Type -> Type, or Either :: Type -> Type -> Type, or [] :: Type -> Type (list), or a polymorphic data type, represents an “infinite” family of concrete types (Maybe Int, Maybe Char, Maybe (String -> String), …) but only in the same way that id :: forall a. a -> a represents an “infinite” family of functions (id :: Int -> Int, id :: Char -> Char, id :: String -> String, …).
That is, the type a here is a parameter filled in with an argument value given by the caller. Usually this is implicit, through type inference, but you can specify it explicitly with the TypeApplications extension:
-- Akin to: \ (a :: Type) -> \ (x :: a) -> x
id :: forall a. a -> a
id x = x

id @Int :: Int -> Int
id @Int 1 :: Int

Data :: forall a. a -> DataType a
Data @Char :: Char -> DataType Char
Data @Char 'x' :: DataType Char
The data constructors of each instantiation don’t really have anything to do with each other. There’s nothing in common between the instantiations Data :: Int -> DataType Int and Data :: Char -> DataType Char, apart from the fact that they share the same tag name.
Another way of thinking about this in Java terms is with the visitor pattern. DataType would be represented as a function that accepts a “DataType visitor”, and then the constructors don’t correspond to separate data types, they’re just the methods of the visitor which accept the fields and return some result. Writing the equivalent code in Java is a worthwhile exercise, but here it is in Haskell:
{-# LANGUAGE RankNTypes #-}
-- (Allows passing polymorphic functions as arguments.)

type DataType a
  = forall r.    -- A visitor with a generic result type
    r            -- With one “method” for the ‘Datum’ case (no fields)
    -> (a -> r)  -- And one for the ‘Data’ case (one field)
    -> r         -- Returning the result

newData :: a -> DataType a
newData field = \ _visitDatum visitData -> visitData field

newDatum :: DataType a
newDatum = \ visitDatum _visitData -> visitDatum
Pattern-matching is simply running the visitor:
matchDT :: DataType a -> b -> (a -> b) -> b
matchDT dt visitDatum visitData = dt visitDatum visitData
-- Or: matchDT dt = dt
-- Or: matchDT = id
-- case someDT of { Datum -> f; Data x -> g x }
-- f :: r
-- g :: a -> r
-- someDT :: DataType a
-- :: forall r. r -> (a -> r) -> r
someDT f (\ x -> g x)
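For example, a small usage sketch of this encoding (assuming the RankNTypes definitions of DataType, newData, newDatum and matchDT above; describe is my own illustrative name, not from the question):

-- Eliminate a visitor-encoded value by supplying both "methods".
describe :: Show a => DataType a -> String
describe dt = matchDT dt "Datum" (\ x -> "Data " ++ show x)

-- describe (newData (42 :: Int))       == "Data 42"
-- describe (newDatum :: DataType Int)  == "Datum"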
Similarly, in Haskell, data constructors are just the ways of introducing and eliminating values of a user-defined type.
What is not explicit and clear to me here is the difference between the constructor function and what it produces
I'm having trouble following your question, but I think you are complicating things. I would suggest not thinking too deeply about the "constructor" terminology.
But hopefully the following helps:
Starting simple:
data DataType = Data Int | Datum
The above reads "Declare a new type named DataType, which has the possible values Datum or Data <some_number> (e.g. Data 42)"
So e.g. Datum is a value of type DataType.
Going back to your example with a type parameter, I want to point out what the syntax is doing:
data DataType a = Data a | Datum
     ^        ^        ^            These appear in type signatures (type level)
                  ^        ^        These appear in code (value level)
There's a bit of punning happening here. So in the data declaration you might see "Data Int", and this is mixing type-level and value-level stuff in a way that you wouldn't see in code. In code you'd see e.g. Data 42 or Data someVal.
I hope that helps a little...

Anonymous records: what ways to type-level tag in Haskell?

I'm playing with lightweight anonymous record-alikes, more to explore the type theory for them than anything 'industrial strength'. I want the fields to be simply type-tagged.
myRec = (EmpId 54321, EmpName "Jo", EmpPhone "98-7654321") -- in which
newtype EmpPhone a = EmpPhone a -- and maybe
data EmpName a where EmpName :: IsString a => a -> EmpName a -- GADT
data EmpId a where EmpId :: Int -> EmpId Int -- GADT to same pattern
Although I could put newtype EmpId = EmpId Int, I want to follow the same pattern for all tags, so that I can go for example:
project (EmpId, EmpName) myRec -- use tags as field names
I'll also use StandaloneDeriving/DeriveAnyClass to derive instance Eq, Show, Num etc.
Other possible designs
For the records, rather than Haskell tuples I could use HList or make my own data types Tuple0, Tuple1, Tuple2, .... I don't think that would affect the typing issues below.
For the tags/fields I could pair a Symbol (type-level String) as phantom type with the value -- for example CTRex does something like that. Then use TypeApplications to build fields.
data Tag (tag :: Symbol) a = Tag a
myRec = (Tag @"EmpId" 54321, ...)
That makes the field syntax (and projection list) rather 'noisy'; also prevents any validation that EmpIds are Int, etc.
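For concreteness, a minimal, hedged sketch of that Symbol-tagged alternative (assuming DataKinds, KindSignatures and TypeApplications; myRec' is just an illustrative name):

{-# LANGUAGE DataKinds, KindSignatures, TypeApplications #-}

import GHC.TypeLits (Symbol)

-- A value tagged with a type-level string; the Symbol is phantom.
data Tag (tag :: Symbol) a = Tag a
  deriving Show

myRec' :: (Tag "EmpId" Int, Tag "EmpName" String, Tag "EmpPhone" String)
myRec' = (Tag @"EmpId" 54321, Tag @"EmpName" "Jo", Tag @"EmpPhone" "98-7654321")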
Three related lines of questions on typing for these:
How best to prevent
sillyRec = (EmpId 65432, Just "not my tag", "or [] as constructor",
            Right "or even worse" :: Either Int String)
I could declare a class, put my tags only in it (not too bad with DeriveAnyClass), put constraints everywhere. But my tags have a consistent structure: single data constructor named same as the type; single type parameter which is the only parameter to the data constructor.
How to express I want each record-alike to follow a consistent type pattern? That is prevent:
notaRec = (EmpId 76543, EmpName)
Bare EmpName is OK in a projection list, providing all the other fields are bare constructors. I want to say that notaRec is not well-Kinded, but bare EmpName is Kind * -> *, which is unifiable with *. So I mean more like: all fields in the record fit the same type pattern.
Then when I get to sets-of-records (aka tables/relations)
myTable = ( myRec,                                             -- tuple of tuples
            (EmpName "Kaz", EmpPhone 987654312, EmpId 87654),
            EmpId 98765, EmpPhone "21-4365879", EmpName "Bo")
Putting the fields in a different order is OK because we have a tuple-of-tuples. But EmpPhone is at two different types in the two records. And the last line isn't a record at all: it's fields at the 'wrong' pattern. (Same mis-match as with bare EmpName in 2.)
Again I want to say these are ill-Kinded. My field tags are appearing at different 'depths' or in differing type patterns.
I guess I could get there with a great deal of hard-coding for valid instances/combos of types. Is there a more generic way?
EDIT: In response to comments. (Yes, I'm mortal too. Thanks @duplode for figuring out the formatting.)
why not type Record = (EmpId Int, EmpName String, EmpPhone String)?
As a type synonym that's fine. But doesn't answer the question because I want it equivalent to any permutation of those tags. (I think I can verify that equivalence at type level using HList techniques.)
some sort of high-level overview of your objective [thank you David]
I want to treat the ( ... , ... , ... ) as a set. Because the Relational Database Model says relations are sets of 'tuples' [not Haskell tuples] and 'tuples' are sets of pairs of tag-value. I also want to treat the project function as having a first-class parameter which is a set of tags. (Contrast that in Codd's Relational Algebra, the π operator has its set of tags subscripted as if part of the operator.)
These couldn't be Haskell Sets because the elements are not the same type. I want to say the elements are the same Kind; and that a Haskell-tuple of same-Kinded elements represents a set-of that Kind. But I know that's abusing terminology. (The alternative design I considered using Symbol tags perhaps shows better there's a Kindiness aspect.)
If I can treat the Haskell tuples as set-ish, I can use well-known HList techniques to emulate the Relational Operators.
If this helps explain, I could do this with a lot of boilerplate:
class MyTag a                        -- type/kind-level predicate
deriving instance MyTag (EmpId Int)  -- uses DeriveAnyClass
-- etc for all my tags

class WellKinded tup
instance WellKinded ()
instance {-# OVERLAPPING #-}
         (MyTag (n1 a1), MyTag (n2 a2), MyTag (n3 a3))
      => WellKinded (n1 a1, n2 a2, n3 a3)  -- and so on for every arity of tuple
instance {-# OVERLAPPABLE #-}
         (MyTag (n1 a1), MyTag (n2 a2), MyTag (n3 a3))
      => WellKinded (a1 -> n1 a1, a2 -> n2 a2, a3 -> n3 a3)
All those instances for different arities are rapidly going to get tedious, so I could convert to HList; despatch an instance on the Kind of the first element; iterate down the list verifying all the same Kind.
For tuple-of-tuples, detect the Kind of the first element of the first sub-tuple; iterate both across and down. (Again needs OverlappingInstances: a tuple-of-tuples-of-tuples is still a tuple. This is what I mean by "a great deal of hard-coding" above.) It doesn't seem unachievable. But it does feel like going down the wrong rabbit-hole.
This is crazy enough it might just work. Pattern synonyms to the rescue:
newtype Label (n :: Symbol) (a :: *) = MkLab a  -- newtype yay!
  deriving (Eq, Ord, Show)

pattern EmpPhone x = MkLab x :: Label "EmpPhone" a
pattern EmpName  x = MkLab x :: IsString a => Label "EmpName" a
pattern EmpId    x = MkLab x :: Label "EmpId" Int
myRec = (EmpId 54321, EmpName "Jo", EmpPhone "98-7654321") -- works a treat
Then, to answer the questions:
To count as a record, all tuple elements must be of type Label s a.
To count as a projection list, all tuple elements must be of type a -> Label s a.
(That works, by the way.)
Those are the only types/kinds allowed in tuples-as-records.
So to parse a tuple-of-tuples at type level, I need only despatch on the type of the leftmost element.
I'm looking for the type constructor Label.
All the rest I can do with HList-style type matching.
For those patterns I did need to switch on a swag of extensions:
{-# LANGUAGE PatternSynonyms,
             KindSignatures, DataKinds,
             ScopedTypeVariables,  -- for the signatures on patterns
             RankNTypes #-}        -- for the signatures with contexts

import GHC.TypeLits  -- for the Symbols
Here's a kinda answer or at least explanation for 2., 3.; a partial answer to 1.
How to express I want each record-alike to follow a consistent type pattern? That is prevent:
notaRec = (EmpId 76543, EmpName)
On the face of it EmpId 76543 matches type pattern (n a); whereas EmpName :: a -> (n a). But Hindley-Milner doesn't "match" simplistically like that, it uses unifiability. So all of these unify with (n a):
                     -- all of these match the pattern `( n a )`:
a -> (n a)           -- as `( ((->) a) (n a) )`
(b, c)               -- as `( ((,) b) c )`
(b, c, d)            -- as `( ((,,) b c) d )`    -- etc. for all larger Haskell tuples
[ a ], Maybe a       -- as `( [] a )`, `( Maybe a )`
Either b c           -- as `( (Either b) c )`
b -> (Either b c)    -- as `( ((->) b) (Either b c) )`  -- for example, bare `Left`
To disagree with myself on the abuse of terminology:
I want to say these are ill-Kinded. My field tags are appearing at different 'depths' ...
But I know that's abusing terminology.
Any type with -> as its outermost constructor is at a different Kind vs one without. Either is at a different Kind vs EmpId, because it has a different arity. Type unification builds the 'most general unifier', and that makes them appear same-Kinded.
For the purposes here we want the opposite of the mgu -- call it the 'maximally specific Kind', MaSK for short.
We can express it with a closed Type Family and lots of overlapping equations (so the order of them is critical). This can also catch the Prelude's constructors that shouldn't count:
type family MaSK ( a :: * ) where
  -- presume the result is one from some pre-declared bunch of types
  -- use that result to verify all 'elements' of a set are same-kinded
  MaSK (_ -> _ _ _)      = No             -- e.g. bare `Left`
  MaSK (_ -> [ _ ])      = No             -- reject unwanted constructors
  MaSK (_ -> Maybe _)    = No             -- ditto
  MaSK (a' -> n a')      = YesAsBareTag   -- this we want
  MaSK (_ -> _ _)        = No
  MaSK (_ -> _)          = No
  MaSK ( _ , _ , _ , _ ) = YesAsSet       -- etc for greater arities
  MaSK ( _ , _ , _ )     = YesAsSet
  MaSK ( _ , _ )         = YesAsSet
  MaSK (_ _ _)           = No             -- too much arity, e.g. `Either b c`
  MaSK [ _ ]             = No             -- reject unwanted constructors
  MaSK (Maybe _)         = No             -- ditto
  MaSK (n a)             = YesAsTagValue  -- this we want, providing all the above are eliminated
  MaSK _                 = No             -- i.e. bare `Int, Bool, Char, ...`
Limitations: this approach can't check there's a single data constructor for the type, nor that other constructors for that type match the pattern, nor that the constructor is named same as the type, nor that the constructor might smuggle in existentially-quantified parameters. For that, go full metal generics.

Polymorphic Vector storage for in-memory column store type of task

"My brain is about to explode" tells me ghc and the feeling is mutual.
There's an aggregation function that works as a charm on polymorphic vectors (in-memory column store context) - takes 2 vectors and groups by unique values of the 1st one while applying function f to values of the 2nd. So, similar to SQL GROUP BY or mongo aggregation. Examples:
> groupColumns creg (+) cint
fromList [("EMEA",345),("NA",681),("RoW",988)]
> groupColumns cint (*) cdouble
fromList [(1,13.0),(2,46.0),(4,16.0),(23,5359.0),(24,528.0),(234,5475.599999999999),(43,18619.0),(12,14.399999999999999),(412,1697.44),(252,-6350.4)]
Relevant code:
groupColumns :: (Eq k, G.Vector v k, G.Vector v2 a, Hashable k) =>
                v k -> (a -> a -> a) -> v2 a -> Map.HashMap k a
groupColumns xxs f yys = ...
cint :: U.Vector Int64
cint = U.fromList [1,2,43,234,23,412,24,12,4,252,1,2,43,234,23,412,24,12,4,252]
cdouble :: U.Vector Double
cdouble = U.fromList [13,23,433,23.4,233,4.12,22,1.2,4,-25.2, 1,2,43,234,23,412,24,12,4,252]
creg :: V.Vector Text
creg = V.fromList ["EMEA", "NA", "EMEA", "RoW", "NA", "RoW", "EMEA", "EMEA", "NA", "RoW", "EMEA", "NA", "RoW", "NA", "RoW", "NA", "RoW", "EMEA", "NA", "EMEA"]
The problem arises when I want to parse a user's input and run an aggregation function built dynamically. Let's say there's a source data table of "Region" : Text, "Revenue" : Int, "Country" : Text, "Booking Date" : Int. A user may want to do (pseudocode): 1) groupby "Region", "Country" sum "Revenue" or 2) groupby "Country", "Booking Date" sum "Revenue" etc etc. The issue is that "Region" and "Country" vectors are V.Vector Text while "Booking Date" and "Revenue" are U.Vector Int64 - so I can't store them in one Hashtable or List and do an obvious thing: get abstract (Vector a) from this one Hashtable or List, pass them to groupColumns function (which already perfectly supports polymorphic vectors!!!) and get a result. I don't care about specific type here for groupColumns, I only care that whatever I'm passing supports Vector interface (by being part of the type family).
So, it boils down to: I need some sort of Storage type that 1) given Text (name of the vector) 2) gives back U.Vector a or V.Vector b without explicit type signature. In the ideal world it'd be just one line call: groupColumns (extractVec col1 cms) func (extractVec col2 cms), where col1 and col2 is Text parsed from user input and func is a function dynamically set from parsed user input.
In the real world I tried:
1) Heterogeneous tricks (of the data HV where HV :: (Vector v a) => v a -> HV sort, sketched just below), but there both my brain and GHC's start to explode, because various type variables escape their scope (e.g. in the f function that is passed to groupColumns) - even though getting (HV U.Vector Int64) from [(Text, HV a)] is straightforward.
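For reference, a minimal sketch of such an existential wrapper (my reading of the idea, not the exact code tried):

{-# LANGUAGE GADTs #-}

import qualified Data.Vector.Generic as G

-- An existential box around any vector. Unpacking it yields *some* element
-- type a that only exists inside the pattern match, which is why type
-- variables "escape their scope" as soon as the contents meet an aggregation
-- function chosen at runtime.
data HV where
  HV :: G.Vector v a => v a -> HV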
2) Typed vector storage, like this:
data ColumnMemoryStore = ColumnMemoryStore {
    intCols    :: Map.HashMap Text (U.Vector Int64),
    doubleCols :: Map.HashMap Text (U.Vector Double),
    textCols   :: Map.HashMap Text (V.Vector Text),
    typeSchema :: Map.HashMap Text SupportedTypes  -- helper map from column names to their types
  }
with polymorphic extractVector function, so I can do extractVector name cms :: U.Vector Int64 - and it automagically returns U.Vector Int64 from the intCols and respectively for others. Here, the problem is that after parsing user input I have to analyze what is the Type of the vectors he wants to aggregate by (by consulting typeSchema) and give corresponding type signatures to extractVector calls - which turns into absolutely, terrifyingly ugly spaghetti of case statements that makes me want to write everything in C as it will be 5x shorter. Here's a sample:
let t1 = checkColType' colname (cms gs)
case t1 of
  PText -> let col    = extractVec colname (cms gs) :: V.Vector T.Text
               result = groupColumns col (+) aggCol
           in outputStrLn $ show result
  PInt  -> let col    = extractVec colname (cms gs) :: U.Vector Int64
               result = groupColumns col (+) aggCol
           in outputStrLn $ show result
etc. etc. This compiles and works, but it's ugly, non-functional in style, and pure boilerplate. I mean, the ONLY reason for doing this is the need to specify the return type of extractVec explicitly, even though it is then never used by groupColumns, which simply expects anything that is Vector v a! There has to be a way around it...
3) Should I even think about Data.Reflection or something similar but no less scary? Template Haskell?
I am sorry for a long description, but I spent tons of time researching and feel like completely stuck - which probably (hopefully) means I'm missing something pretty obvious (like not enough abstraction levels) and those of you who think in Haskell can at least point me to the correct approach of solving this issue. Thanks a lot!

Polymorphic return types and "rigid type variable" error in Haskell

There's a simple record Column v a which holds a Vector from the Data.Vector family (so that v can be Vector.Unboxed, plain Vector, etc.), its name and its type (a simple enum-like ADT SupportedTypes). I would like to be able to serialize it using the binary package. To do that, I try to define a Binary instance below.
Now put works fine, however when I try to define deserialization in the get function and want to set a specific type to the rawVector that is being returned based on the colType (U.Vector Int64 when it's PInt, U.Vector Double when it's PDouble etc) - I get this error message:
Couldn't match type v with U.Vector
v is a rigid type variable bound by the instance declaration at src/Quark/Base/Column.hs:75:10
Expected type: v a
Actual type: U.Vector Int64
Is there a better way to achieve my goal - deserialize Vectors of different types based on the colType value or am I stuck with defining Binary instance for all possible Vector / primitive type combinations? Shouldn't be the case...
Somewhat new to Haskell and appreciate any help! Thanks!
{-# LANGUAGE OverloadedStrings, TransformListComp, RankNTypes,
             TypeSynonymInstances, FlexibleInstances, OverloadedLists, DeriveGeneric #-}
{-# LANGUAGE MultiParamTypeClasses, FlexibleContexts,
             TypeFamilies, ScopedTypeVariables, InstanceSigs #-}

import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as U

data Column v a = Column { rawVector :: G.Vector v a => v a, colName :: Text, colType :: SupportedTypes }

instance (G.Vector v a, Binary (v a)) => Binary (Column v a) where
    put Column {rawVector = vec, colName = cn, colType = ct} = put (fromEnum ct) >> put cn >> put vec

    get = do t  <- get :: Get Int
             nm <- get :: Get Text
             let pt = toEnum t :: SupportedTypes
             case pt of
               PInt    -> do vec <- get :: Get (U.Vector Int64)
                             return Column {rawVector = vec, colName = nm, colType = pt}
               PDouble -> do vec <- get :: Get (U.Vector Double)
                             return Column {rawVector = vec, colName = nm, colType = pt}
UPDATED Thank you for all the answers below, some pretty good ideas! It's quite clear that what I want to do is impossible to achieve head-on - so that is my answer. But the other suggested solutions are a good reading in itself, thanks a bunch!
The type you are really trying to represent is
data Column v = Column (Either (v Int) (v Double))
but this representation may be unsatisfactory to you. So how do you write this type with the vector itself at the 'top level' of the constructor?
First, start with a representation of your sum (Either Int Double) at the type level, as opposed to the value level:
data IsSupportedType a where
  TInt    :: IsSupportedType Int
  TDouble :: IsSupportedType Double
From here Column is actually quite simple:
data Column v a = Column (IsSupportedType a) (v a)
But you'll probably want a to be existentially quantified in order to use it the way you want:
data Column v = forall a . Column (IsSupportedType a) (v a)
The binary instance is as follows:
instance (Binary (v Int), Binary (v Double)) => Binary (Column v) where
  put (Column t v) = case t of
    TInt    -> put (0 :: Int) >> put v
    TDouble -> put (1 :: Int) >> put v

  get = do
    t :: Int <- get
    case t of
      0 -> Column TInt    <$> get
      1 -> Column TDouble <$> get
Note that there is no inherent reliance on Vector here - v could really be anything.
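For instance, a hedged usage sketch (assuming the GADTs extension and the Column/IsSupportedType just defined, with v instantiated at ordinary lists): matching on the witness recovers the element type, so the payload can be used concretely.

columnSum :: Column [] -> Double
columnSum (Column TInt    xs) = fromIntegral (sum xs)  -- here a ~ Int
columnSum (Column TDouble xs) = sum xs                 -- here a ~ Double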
The problem you're actually running into (or if you're not yet, that you will) is that you're trying to decide a resulting type from an input value. You cannot do that. At all. You could cleverly lock the result type in a box and throw away the key so the type appears to be normal from the outside, but then you cannot do anything much with it because you locked the type in a box and threw away the key. You can store extra information about it using GADTs and boxing it up with a type class instance, but even still this is not a great idea.
You could make your life far easier here if you simply had two constructors for Column to reflect whether there was a vector of Ints or Doubles.
But really, don't do any of that. Just let the automatically derivable Binary instance deserialize any deserializable value into your vector for you.
data Column a = ... deriving (Binary)
This uses the DeriveAnyClass extension, which lets you derive any class that has a Generic-based default implementation (which Binary has). Then just deserialize a Column Double or a Column Int when you need it.
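A minimal sketch of that route, on a simplified Column with a concrete list payload (field names are illustrative, not the asker's real types; it uses an empty instance relying on binary's Generic defaults rather than DeriveAnyClass, which comes to the same thing):

{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary)
import GHC.Generics (Generic)

data Column a = Column { rawVector :: [a], colName :: String }
  deriving (Generic, Show)

-- Binary's Generic defaults supply put and get; no hand-written body needed.
instance Binary a => Binary (Column a)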
As the comment says, you can simply not case on the type, and always call
vec <- get
return Column {rawVector = vec, colName = nm, colType = pt}
This fulfills your type signature properly. But note that colType is not useful to you here -- you have no way to enforce that it corresponds to the type within your vector, since it only exists at the value level. But that may be OK, and you may simply want to remove colType from your data structure altogether, since you can always derive it directly from the concrete type chosen for a in Column v a.
In fact, the constraint in the Column type isn't doing much good either, and I think it would be better to render it just as
data Column v a = Column {rawVector :: v a, colName :: Text}
Now you can just enforce the G.Vector constraint at call sites where necessary...

GADTs or phantom types to type-check function calls but keep homogeneity of type

I assume the following problem can be solved using type arithmetic but haven't found the solution yet.
Problem
I have a finite map from strings to values (using Tries as implementation) that I parse from a binary/text file (json, xml, ...).
type Value = ...
type Attributes = Data.Trie Value
data Object = Object Attributes
Each map has the same type of values but not the same set of keys.
I group maps with the same set of keys together, to avoid having to type-switch every time I have a specialised function that requires certain keys:
data T1
data T2
...
data Object a where
  T1 :: Attributes -> Object T1
  T2 :: Attributes -> Object T2
  ...
This allows me to write something like:
f1 :: Object T1 -> ...
instead of
f1 :: Object -> ...
f1 o | check_if_T1 o = ...
This works but has two disadvantages:
Homogeneous lists of Object now become heterogeneous, i.e. I cannot have a list [Object] anymore.
I need to write a lot of boilerplate to get/set attributes:
get :: Object a -> Attributes
get (T1 a) = a
get (T2 a) = a
...
Question
Is there a better way to specialise functions depending on the constructor of an ADT?
How could I regain the ability to have a list [Object]? Is there a specialized version of Dynamic that only allows certain types?
I thought about wrapping the Object again, but this would add a lot of boilerplate. For instance,
data TObject = TT1 T1 | TT2 T2 ...
What I need is:
get :: a -> TObject -> Object a
So that I can then derive:
collect :: a -> [TObject] -> [Object a]
I looked into HList but I don't think it fits my problem. Especially, since the order of types in [Object] is not known at compile time.
It sounds to me like this can be solved using functional dependency / type arithmetic but I simply haven't found a nice way yet.
If all the constructors return a monomorphic Object type and there's no recursion, you might want to think about just using separate types. Instead of
data T1
data T2
data Object a where
  T1 :: Attributes -> Object T1
  T2 :: Attributes -> Object T2
consider
data T1 = T1 Attributes
data T2 = T2 Attributes
Dynamic is one way, and using the above you could just add deriving Typeable and be done. Alternately, you can do it by hand:
data TSomething = It's1 T1 | It's2 T2
getT1s :: [TSomething] -> [T1]
getT2s :: [TSomething] -> [T2]
getT1s xs = [t1 | It's1 t1 <- xs]
getT2s xs = [t2 | It's2 t2 <- xs]
As you say, this involves a bit of boilerplate. The Typeable version looks a bit nicer:
deriving instance Typeable T1  -- with StandaloneDeriving (automatic in recent GHC)
deriving instance Typeable T2
-- can specialize at the call-site to
-- getTs :: [Dynamic] -> [T1] or
-- getTs :: [Dynamic] -> [T2]
getTs :: Typeable a => [Dynamic] -> [a]
getTs xs = [x | Just x <- map fromDynamic xs]
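A small usage sketch of the Dynamic route, with simplified stand-ins for T1 and T2 so it runs on its own (Typeable comes for free on modern GHCs):

import Data.Dynamic (Dynamic, toDyn, fromDynamic)
import Data.Maybe (mapMaybe)

newtype T1 = T1 String deriving Show
newtype T2 = T2 String deriving Show

bag :: [Dynamic]
bag = [toDyn (T1 "a"), toDyn (T2 "b"), toDyn (T1 "c")]

-- fromDynamic specialises at the use site, e.g. here to T1.
onlyT1s :: [T1]
onlyT1s = mapMaybe fromDynamic bag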
