Evolving data structure - Haskell

I'm trying to write a compiler for a C-like language in Haskell. The compiler progresses by transforming an AST. The first pass parses the input to create the AST, tying a knot with the symbol table to allow symbols to be located before they have been defined without the need for forward references.
The AST contains information about types and expressions, and there can be connections between them; e.g. sizeof(T) is an expression that depends on a type T, and T[e] is an array type that depends on a constant expression e.
Types and expressions are represented by Haskell data types like so:
data Type = TypeInt Id
          | TypePointer Id Type            -- target type
          | TypeArray Id Type Expr         -- elt type, elt count
          | TypeStruct Id [(String, Type)] -- [(field name, field type)]
          | TypeOf Id Expr
          | TypeDef Id Type

data Expr = ExprInt Int                -- literal int
          | ExprVar Var                -- variable
          | ExprSizeof Type
          | ExprUnop Unop Expr
          | ExprBinop Binop Expr Expr
          | ExprField Bool Expr String -- Bool gives true=>s.field, false=>p->field
Here Unop includes operators like address-of (&) and dereference (*), and Binop includes operators like plus (+) and times (*), etc.
Note that each type is assigned a unique Id. This is used to construct a type dependency graph in order to detect cycles which lead to infinite types. Once we are sure that there are no cycles in the type graph, it is safe to apply recursive functions over them without getting into an infinite loop.
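(For illustration, a cycle check along these lines might look like the sketch below. The typeId helper, the choice of which edges count as dependencies, and an Ord instance for Id are my assumptions, not part of the post.)

import qualified Data.Set as Set

typeId :: Type -> Id
typeId (TypeInt i)       = i
typeId (TypePointer i _) = i
typeId (TypeArray i _ _) = i
typeId (TypeStruct i _)  = i
typeId (TypeOf i _)      = i
typeId (TypeDef i _)     = i

-- Edges that make a type's definition depend on another type's
-- definition. Pointer targets are deliberately excluded: a struct may
-- point to itself without being infinite. (Dependencies through
-- expressions, e.g. TypeOf/sizeof, are omitted here for brevity.)
children :: Type -> [Type]
children (TypeArray _ t _) = [t]
children (TypeStruct _ fs) = map snd fs
children (TypeDef _ t)     = [t]
children _                 = []

-- Walk the graph, tracking the Ids seen on the current path; a
-- repeated Id means a type contains itself and is therefore infinite.
hasCycle :: Type -> Bool
hasCycle = go Set.empty
  where
    go seen t
      | typeId t `Set.member` seen = True
      | otherwise = any (go (Set.insert (typeId t) seen)) (children t)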
The next step is to determine the size of each type, assign offsets to struct fields, and replace ExprFields with pointer arithmetic. In doing so, we can determine the type of expressions, and eliminate ExprSizeofs, ExprFields, TypeDefs, and TypeOfs from the type graph, so our types and expressions have evolved, and now look more like this:
data Type' = TypeInt Id
           | TypePointer Id Type'
           | TypeArray Id Type' Int               -- constant expression has been eval'd
           | TypeStruct Id [(String, Int, Type')] -- field offset has been determined

data Expr' = ExprInt Type' Int
           | ExprVar Type' Var
           | ExprUnop Type' Unop Expr'
           | ExprBinop Type' Binop Expr' Expr'
Note that we've eliminated some of the data constructors, and slightly changed some of the others. In particular, Type' no longer contains any Expr', and every Expr' has determined its Type'.
So, finally, the question: Is it better to create two almost identical sets of data types, or try to unify them into a single data type?
Keeping two separate data types makes it explicit that certain constructors can no longer appear. However, the function which performs constant folding to evaluate constant expressions will have type:
foldConstants :: Expr -> Either String Expr'
But this means that we cannot perform constant folding later on with Expr's (imagine some pass that manipulates an Expr', and wants to fold any emerged constant expressions). We would need another implementation:
foldConstants' :: Expr' -> Either String Expr'
On the other hand, keeping a single type would solve the constant folding problem, but would prevent the type checker from enforcing static invariants.
Furthermore, what do we put into the unknown fields (like field offsets, array sizes, and expression types) during the first pass? We could plug the holes with undefined, or error "*hole*", but that feels like a disaster waiting to happen (like NULL pointers that you can't even check). We could change the unknown fields into Maybes, and plug the holes with Nothing (like NULL pointers that we can check), but it would be annoying in subsequent passes to have to keep pulling values out of Maybes that are always going to be Justs.
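(For concreteness, the Maybe variant being described would look something like this sketch, with every hole a Maybe that later passes must unwrap:)

data Expr = ExprInt (Maybe Type) Int
          | ExprVar (Maybe Type) Var
          | ExprSizeof (Maybe Type) Type
          | ExprUnop (Maybe Type) Unop Expr
          | ExprBinop (Maybe Type) Binop Expr Expr
          | ExprField (Maybe Type) Bool Expr String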

Hopefully someone with more experience will have a more polished, battle-tested and ready answer, but here's my shot at it.
You can have your pie and eat part of it too at relatively little cost with GADTs:
{-# LANGUAGE GADTs #-}

data P0 -- phase zero
data P1 -- phase one

data Type p where
  TypeInt     :: Id -> Type p
  TypePointer :: Id -> Type p -> Type p             -- target type
  TypeArray   :: Id -> Type p -> Expr p -> Type p   -- elt type, elt count
  TypeStruct  :: Id -> [(String, Type p)] -> Type p -- [(field name, field type)]
  TypeOf      :: Id -> Expr P0 -> Type P0
  TypeDef     :: Id -> Type P0 -> Type P0

data Expr p where
  ExprInt    :: Int -> Expr p -- literal int
  ExprVar    :: Var -> Expr p -- variable
  ExprSizeof :: Type P0 -> Expr P0
  ExprUnop   :: Unop -> Expr p -> Expr p
  ExprBinop  :: Binop -> Expr p -> Expr p -> Expr p
  ExprField  :: Bool -> Expr P0 -> String -> Expr P0 -- Bool gives true=>s.field, false=>p->field
Here the things we've changed are:
The datatypes now use GADT syntax. This means that constructors are declared using their type signatures. data Foo = Bar Int Char becomes data Foo where Bar :: Int -> Char -> Foo (aside from syntax, the two are completely equivalent).
We have added a type variable to both Type and Expr. This is a so-called phantom type variable: there is no actual data stored that is of type p, it is used only to enforce invariants in the type system.
We've declared dummy types to represent the phases before and after the transformation: phase zero and phase one. (In a more elaborate system with multiple phases we could potentially use type-level numbers to denote them.)
GADTs let us store type-level invariants in the data structure. Here we have two of them. The first is that recursive positions must be of the same phase as the structure containing them. For example, looking at TypePointer :: Id -> Type p -> Type p, you pass a Type p to the TypePointer constructor and get a Type p as result, and those ps must be the same type. (If we wanted to allow different types, we could use p and q.)
The second is that we enforce the fact that some constructors can only be used in the first phase. Most of the constructors are polymorphic in the phantom type variable p, but some of them require that it be P0. This means that those constructors can only be used to construct values of type Type P0 or Expr P0, not any other phase.
GADTs work in two directions. The first is that if you have a function which returns a Type P1, and try to use one of the constructors which returns a Type P0 to construct it, you will get a type error. This is what's called "correct by construction": it's statically impossible to construct an invalid structure (provided you can encode all of the relevant invariants in the type system). The flip side of it is that if you have a value of Type P1, you can be sure that it was constructed correctly: the TypeOf and TypeDef constructors can't have been used (in fact, the compiler will complain if you try to pattern match on them), and any recursive positions must also be of phase P1. Essentially, when you construct a GADT you store evidence that the type constraints are satisfied, and when you pattern match on it you retrieve that evidence and can take advantage of it.
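As a small illustration of that flip side, a function over phase-one types only needs equations for the constructors that can occur there; an equation for TypeOf or TypeDef here would be rejected as inaccessible (a sketch, with the String rendering being my own invention):

describe :: Type P1 -> String
describe (TypeInt _)       = "int"
describe (TypePointer _ t) = "pointer to " ++ describe t
describe (TypeArray _ t _) = "array of " ++ describe t
describe (TypeStruct _ fs) = "struct with " ++ show (length fs) ++ " fields"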
That was the easy part. Unfortunately, we have some differences between the two types beyond just which constructors are allowed: some of the constructor arguments are different between the phases, and some are only present in the post-transformation phase. We can again use GADTs to encode this, but it's not as low-cost and elegant. One solution would be to duplicate all of the constructors which are different, and have one each for P0 and P1. But duplication isn't nice. We can try to do it more fine-grained:
-- a couple of helper types
-- here I take advantage of the fact that of the things only present in one phase,
-- they're always present in P1 and not P0, and not vice versa
data MaybeP p a where
  NothingP :: MaybeP P0 a
  JustP    :: a -> MaybeP P1 a

data EitherP p a b where
  LeftP  :: a -> EitherP P0 a b
  RightP :: b -> EitherP P1 a b

data Type p where
  TypeInt     :: Id -> Type p
  TypePointer :: Id -> Type p -> Type p
  TypeArray   :: Id -> Type p -> EitherP p (Expr p) Int -> Type p
  TypeStruct  :: Id -> [(String, MaybeP p Int, Type p)] -> Type p
  TypeOf      :: Id -> Expr P0 -> Type P0
  TypeDef     :: Id -> Type P0 -> Type P0

-- for brevity
type MaybeType p = MaybeP p (Type p)

data Expr p where
  ExprInt    :: MaybeType p -> Int -> Expr p
  ExprVar    :: MaybeType p -> Var -> Expr p
  ExprSizeof :: Type P0 -> Expr P0
  ExprUnop   :: MaybeType p -> Unop -> Expr p -> Expr p
  ExprBinop  :: MaybeType p -> Binop -> Expr p -> Expr p -> Expr p
  ExprField  :: Bool -> Expr P0 -> String -> Expr P0
Here with some helper types we've enforced the fact that some constructor arguments can only be present in phase one (MaybeP) and some are different between the two phases (EitherP). While this makes us completely type-safe, it feels a bit ad-hoc and we still have to wrap things in and out of the MaybePs and EitherPs all the time. I don't know if there's a better solution in that respect. Complete type safety is something, though: we could write fromJustP :: MaybeP P1 a -> a and be sure that it is completely and utterly safe.
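For instance, that extractor is a one-liner, and GHC accepts the single equation as exhaustive, because NothingP can only build a MaybeP P0 a:

fromJustP :: MaybeP P1 a -> a
fromJustP (JustP a) = a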
Update: An alternative is to use TypeFamilies:
{-# LANGUAGE TypeFamilies #-}

data Proxy a = Proxy

class Phase p where
  type MaybeP p a
  type EitherP p a b
  maybeP  :: Proxy p -> MaybeP p a -> Maybe a
  eitherP :: Proxy p -> EitherP p a b -> Either a b

phase :: Proxy p
phase = Proxy

instance Phase P0 where
  type MaybeP P0 a = ()
  type EitherP P0 a b = a
  maybeP  _ _ = Nothing
  eitherP _ a = Left a

instance Phase P1 where
  type MaybeP P1 a = a
  type EitherP P1 a b = b
  maybeP  _ a = Just a
  eitherP _ a = Right a
The only change to Expr and Type relative to the previous version is that the constructors need an added Phase p constraint, e.g. ExprInt :: Phase p => MaybeType p -> Int -> Expr p.
Here if the type of p in a Type or Expr is known, you can statically know whether the MaybePs will be () or the given type and which of the types the EitherPs are, and can use them directly as that type without explicit unwrapping. When p is unknown you can use maybeP and eitherP from the Phase class to find out what they are. (The Proxy arguments are necessary, because otherwise the compiler wouldn't have any way to tell which phase you meant.) This is analogous to the GADT version where, if p is known, you can be sure of what MaybeP and EitherP contains, while otherwise you have to pattern match both possibilities. This solution isn't perfect either in the respect that the 'missing' arguments become () rather than disappearing entirely.
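For example, at a known phase the families reduce away entirely; a sketch (assuming the constructors carry the Phase p constraint as described above):

-- At P1, MaybeType P1 reduces to Type P1, so the field is used
-- directly, with no JustP-style unwrapping.
exprType :: Expr P1 -> Type P1
exprType (ExprInt   t _)     = t
exprType (ExprVar   t _)     = t
exprType (ExprUnop  t _ _)   = t
exprType (ExprBinop t _ _ _) = t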
Constructing Exprs and Types also seems to be broadly similar between the two versions: if the value you're constructing has anything phase-specific about it then it must specify that phase in its type. The trouble seems to come when you want to write a function polymorphic in p but still handling phase-specific parts. With GADTs this is straightforward:
asdf :: MaybeP p a -> MaybeP p a
asdf NothingP  = NothingP
asdf (JustP a) = JustP a
Note that if I had merely written asdf _ = NothingP the compiler would have complained, because the type of the output wouldn't be guaranteed to be the same as the input. By pattern matching we can figure out what type the input was and return a result of the same type.
With the TypeFamilies version, though, this is a lot harder. Just using maybeP and the resulting Maybe you can't prove anything to the compiler about types. You can get part of the way there by, instead of having maybeP and eitherP return Maybe and Either, making them deconstructor functions like maybe and either which also make a type equality available:
maybeP :: Proxy p -> (p ~ P0 => r) -> (p ~ P1 => a -> r) -> MaybeP p a -> r
eitherP :: Proxy p -> (p ~ P0 => a -> r) -> (p ~ P1 => b -> r) -> EitherP p a b -> r
(Note that we need Rank2Types for this, and note also that these are essentially the CPS-transformed versions of the GADT versions of MaybeP and EitherP.)
Then we can write:
asdf :: Phase p => MaybeP p a -> MaybeP p a
asdf a = maybeP phase () id a
But that's still not enough, because GHC says:
data.hs:116:29:
    Could not deduce (MaybeP p a ~ MaybeP p0 a0)
    from the context (Phase p)
      bound by the type signature for
               asdf :: Phase p => MaybeP p a -> MaybeP p a
      at data.hs:116:1-29
    NB: `MaybeP' is a type function, and may not be injective
    In the fourth argument of `maybeP', namely `a'
    In the expression: maybeP phase () id a
    In an equation for `asdf': asdf a = maybeP phase () id a
Maybe you could solve this with a type signature somewhere, but at that point it seems like more bother than it's worth. So pending further information from someone else I'm going to recommend using the GADT version, being the simpler and more robust one, at the cost of a little syntactic noise.
Update again: The problem here was that because MaybeP p a is a type function and there is no other information to go by, GHC has no way to know what p and a should be. If I pass in a Proxy p and use that instead of phase that solves p, but a is still unknown.

There's no ideal solution to this problem, since each one has different pros and cons.
I would personally go with using a single data type "tree" and adding separate data constructors for things that need to be differentiated. I.e.:
data Type
  = ...
  | TypeArray Id Type Expr
  | TypeResolvedArray Id Type Int
  | ...
This has the advantage that you can run the same phase multiple times on the same tree, as you say, but the reasoning goes deeper than that. Let's say you are implementing a syntax element that generates more AST (something like include, or a C++-template kind of deal), and it can depend on constant expressions like your TypeArray, so you can't evaluate it in the first iteration. With the unified data type approach, you can just insert the new AST into your existing tree. Not only can you run the same phases as before on that tree directly, but you get caching for free: if the new AST references an array by using sizeof(typeof(myarr)) or something, you don't have to determine the constant size of myarr again, because its type is already a TypeResolvedArray from your previous resolution phase.
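A sketch of such an idempotent resolution pass (evalConstExpr is a hypothetical helper that evaluates a constant Expr to an Int):

resolveType :: Type -> Either String Type
resolveType (TypeArray i t e)         = TypeResolvedArray i t <$> evalConstExpr e
resolveType (TypeResolvedArray i t n) = Right (TypeResolvedArray i t n) -- already done: caching for free
resolveType t                         = Right t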
You could use a different representation when you are past all of your compilation phases, and it is time to interpret the code (or something); then you are certain of the fact that no more AST changes will be necessary, and a more streamlined representation might be a good idea.
By the way, you should use Data.Word.Word instead of Data.Int.Int for array sizes. It is a common error in C to use int for array sizes and indices, even though they are naturally non-negative (which is why size_t is unsigned). Please don't make this mistake in your language too, unless you really want to support arrays with negative sizes.

Related

Multiple types for f in this picture?

https://youtu.be/brE_dyedGm0?t=1362
data T a where
  T1 :: Bool -> T Bool
  T2 :: T a

f x y = case x of
  T1 x -> True
  T2   -> y
Simon is saying that f could be typed as T a -> a -> a, but I would think the return value MUST be a Bool since that is an explicit result in a branch of the case expression. This is in regard to Haskell GADTs. Why is this the case?
This is kind of the whole point of GADTs. Matching on a constructor can cause additional type information to come into scope.
Let's look at what happens when GHC checks the following definition:
f :: T a -> a -> a
f x y = case x of
  T1 _ -> True
  T2   -> y
Let's look at the T2 case first, just to get it out of the way. y and the result have the same type, and T2 is polymorphic, so we can take its type to be T a. All good.
Then comes the trickier case. Note that I removed the binding of the name x inside the case as the shadowing might be confusing, the inner value isn't used, and it doesn't change anything in the explanation. When you match on a GADT constructor like T1, one that explicitly sets a type variable, it introduces additional constraints inside that branch which add that type information. In this case, matching on T1 introduces a (a ~ Bool) constraint. This type equality says that a and Bool match each other. Therefore the literal True with type Bool matches the a written in the type signature. y isn't used, so the branch is consistent with T a -> a -> a.
So it all matches up. T a -> a -> a is a valid type for that definition. But as Simon is saying, it's ambiguous. T a -> Bool -> Bool is also a valid type for that definition. Neither one is more general than the other, so the definition doesn't have a principal type. So the definition is rejected unless a type is provided, because inference cannot pick a single most-correct type for it.
A value of type T a with a different from Bool can never have the form T1 x (since that has only type T Bool).
Hence, in such case, the T1 x branch in the case becomes inaccessible and can be ignored during type checking/inference.
More concretely: GADTs allow the type checker to assume type-level equations during pattern matching, and exploit such equations later on. When checking
f :: T a -> a -> a
f x y = case x of
  T1 x -> True
  T2   -> y
the type checker performs the following reasoning:
f :: T a -> a -> a
f x y = case x of
  T1 x -> -- assume: a ~ Bool
    True  -- has type Bool, hence it also has type a
  T2 ->   -- assume: a ~ a (pointless)
    y     -- has type a
Thanks to GADTs, both branches of the case have type a, hence the whole case expression has type a and the function definition type checks.
More generally, when x :: T A and the GADT constructor was defined as K :: ... -> T B then, when type checking we can make the following assumption:
case x of
  K ... -> -- assume: A ~ B
Note that A and B can be types involving type variables (as in a ~ Bool above), so that allows one to obtain useful information about them and exploit it later on.

Clarifying Data Constructor in Haskell

In the following:
data DataType a = Data a | Datum
I understand that data constructors are value-level functions. What we do above is define their types. They can be functions of various arities, or constants. That's fine; I'm OK with saying Datum constructs Datum. What is not explicit and clear to me here is the difference between the constructor function and what it produces. Please let me know if I am getting it right:
1 - a) Basically, writing Data a defines both a data structure and its constructor function (as in Scala or Java, where the class and the constructor usually have the same name)?
2 - b) So, to unpack and make an analogy: with Data a we are defining both a structure of objects (I don't want to say class, because class already implies a type, though maybe we could) and the constructor function (data constructor/value constructor), where the latter returns an object of that structure. Finally, the type of that structure of objects is given by the type constructor. An object structure, in a sense, is just a tag surrounding a bunch of values of some type. Is my understanding correct?
3 - c) Can I formally say:
Nullary data constructors represent constant values -> they return the constant value itself, whose type is given by the type constructor at the definition site.
Data constructors that take an argument represent a class of values, where the class is a tag? -> they return an infinite number of objects of that class, whose type is given by the type constructor at the definition site.
Another way of writing this:
data DataType a = Data a | Datum
is with generalised algebraic data type (GADT) syntax, using the GADTSyntax extension, which lets us specify the types of the constructors explicitly:
{-# LANGUAGE GADTSyntax #-}

data DataType a where
  Data  :: a -> DataType a
  Datum :: DataType a
(The GADTs extension would work too; it would also allow us to specify constructors with different type arguments in the result, like DataType Int vs. DataType Bool, but that’s a more advanced topic, and we don’t need that functionality here.)
These are exactly the types you would see in GHCi if you asked for the types of the constructor functions with :type / :t:
> :{
| data DataType a where
|   Data  :: a -> DataType a
|   Datum :: DataType a
| :}
> :type Data
Data :: a -> DataType a
> :t Datum
Datum :: DataType a
With ExplicitForAll we can also specify the scope of the type variables explicitly, and make it clearer that the a in the data definition is a separate variable from the a in the constructor definitions by also giving them different names:
data DataType a where
  Data  :: forall b. b -> DataType b
  Datum :: forall c. DataType c
Some more examples of this notation with standard prelude types:
data Either a b where
  Left  :: forall a b. a -> Either a b
  Right :: forall a b. b -> Either a b

data Maybe a where
  Nothing :: Maybe a
  Just    :: a -> Maybe a

data Bool where
  False :: Bool
  True  :: Bool

data Ordering where
  LT, EQ, GT :: Ordering -- Shorthand for repeated ‘:: Ordering’
I understand that data constructors are value-level functions. What we do above is define their types. They can be functions of various arities, or constants. That's fine; I'm OK with saying Datum constructs Datum. What is not explicit and clear to me here is the difference between the constructor function and what it produces.
Datum and Data are both “constructors” of DataType a values; neither Datum nor Data is a type! These are just “tags” that select between the possible varieties of a DataType a value.
What is produced is always a value of type DataType a for a given a; the constructor selects which “shape” it takes.
A rough analogue of this is a union in languages like C or C++, plus an enumeration for the “tag”. In pseudocode:
enum Tag {
  DataTag,
  DatumTag,
}

// A single anonymous field.
struct DataFields<A> {
  A field1;
}

// No fields.
struct DatumFields<A> {};

// A union of the possible field types.
union Fields<A> {
  DataFields<A> data;
  DatumFields<A> datum;
}

// A pair of a tag with the fields for that tag.
struct DataType<A> {
  Tag tag;
  Fields<A> fields;
}
The constructors are then just functions returning a value with the appropriate tag and fields. Pseudocode:
<A> DataType<A> newData(A x) {
  DataType<A> result;
  result.tag = DataTag;
  result.fields.data.field1 = x;
  return result;
}

<A> DataType<A> newDatum() {
  DataType<A> result;
  result.tag = DatumTag;
  // No fields.
  return result;
}
Unions are unsafe, since the tag and fields can get out of sync, but sum types are safe because they couple these together.
A pattern-match like this in Haskell:
case someDT of
  Datum  -> f
  Data x -> g x

is a combination of testing the tag and extracting the fields. Again, in pseudocode:
if (someDT.tag == DatumTag) {
  f();
} else if (someDT.tag == DataTag) {
  var x = someDT.fields.data.field1;
  g(x);
}
Again this is coupled in Haskell to ensure that you can only ever access the fields if you have checked the tag by pattern-matching.
So, in answer to your questions:
1 - a) Basically, writing Data a defines both a data structure and its constructor function (as in Scala or Java, where the class and the constructor usually have the same name)?
Data a in your original code is not defining a data structure, in that Data is not a separate type from DataType a, it’s just one of the possible tags that a DataType a value may have. Internally, a value of type DataType Int is one of the following:
The tag for Data (in GHC, a pointer to an “info table” for the constructor), and a reference to a value of type Int.
x = Data (1 :: Int) :: DataType Int
+----------+----------------+ +---------+----------------+
x ---->| Data tag | pointer to Int |---->| Int tag | unboxed Int# 1 |
+----------+----------------+ +---------+----------------+
The tag for Datum, and no other fields.
y = Datum :: DataType Int
+-----------+
y ----> | Datum tag |
+-----------+
In a language with unions, the size of a union is the maximum of all its alternatives, since the type must support representing any of the alternatives with mutation. In Haskell, since values are immutable, they don’t require any extra “padding” since they can’t be changed.
It’s a similar situation for standard data types, e.g., a product or sum type:
(x :: X, y :: Y) :: (X, Y)
+---------+--------------+--------------+
| (,) tag | pointer to X | pointer to Y |
+---------+--------------+--------------+
Left (m :: M) :: Either M N
+-----------+--------------+
| Left tag | pointer to M |
+-----------+--------------+
Right (n :: N) :: Either M N
+-----------+--------------+
| Right tag | pointer to N |
+-----------+--------------+
2 - b) So, to unpack and make an analogy: with Data a we are defining both a structure of objects (I don't want to say class, because class already implies a type, though maybe we could) and the constructor function (data constructor/value constructor), where the latter returns an object of that structure. Finally, the type of that structure of objects is given by the type constructor. An object structure, in a sense, is just a tag surrounding a bunch of values of some type. Is my understanding correct?
This is sort of correct, but again, the constructors Data and Datum aren’t “data structures” by themselves. They’re just the names used to introduce (construct) and eliminate (match) values of type DataType a, for some type a that is chosen by the caller of the constructors to fill in the forall.
data DataType a = Data a | Datum says:
If some term e has type T, then the term Data e has type DataType T
Inversely, if some value of type DataType T matches the pattern Data x, then x has type T in the scope of the match (case branch or function equation)
The term Datum has type DataType T for any type T
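A few concrete instances of these rules (my examples, not from the post):

ex1 :: DataType Bool
ex1 = Data True  -- rule 1: True :: Bool, so Data True :: DataType Bool

ex2 :: DataType Char
ex2 = Datum      -- rule 3: Datum inhabits DataType t for every t

whatIsIt :: DataType t -> String
whatIsIt (Data _) = "data"   -- rule 2: matching Data x binds x :: t
whatIsIt Datum    = "datum"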
3 - c) Can I formally say:
Nullary data constructors represent constant values -> they return the constant value itself, whose type is given by the type constructor at the definition site.
Data constructors that take an argument represent a class of values, where the class is a tag? -> they return an infinite number of objects of that class, whose type is given by the type constructor at the definition site.
Not exactly. A type constructor like DataType :: Type -> Type, Maybe :: Type -> Type, or Either :: Type -> Type -> Type, or [] :: Type -> Type (list), or a polymorphic data type, represents an “infinite” family of concrete types (Maybe Int, Maybe Char, Maybe (String -> String), …) but only in the same way that id :: forall a. a -> a represents an “infinite” family of functions (id :: Int -> Int, id :: Char -> Char, id :: String -> String, …).
That is, the type a here is a parameter filled in with an argument value given by the caller. Usually this is implicit, through type inference, but you can specify it explicitly with the TypeApplications extension:
-- Akin to: \ (a :: Type) -> \ (x :: a) -> x
id :: forall a. a -> a
id x = x

id @Int :: Int -> Int
id @Int 1 :: Int

Data :: forall a. a -> DataType a
Data @Char :: Char -> DataType Char
Data @Char 'x' :: DataType Char
The data constructors of each instantiation don’t really have anything to do with each other. There’s nothing in common between the instantiations Data :: Int -> DataType Int and Data :: Char -> DataType Char, apart from the fact that they share the same tag name.
Another way of thinking about this in Java terms is with the visitor pattern. DataType would be represented as a function that accepts a “DataType visitor”, and then the constructors don’t correspond to separate data types, they’re just the methods of the visitor which accept the fields and return some result. Writing the equivalent code in Java is a worthwhile exercise, but here it is in Haskell:
{-# LANGUAGE RankNTypes #-}
-- (Allows passing polymorphic functions as arguments.)

type DataType a
  =  forall r. -- A visitor with a generic result type
     r         -- With one “method” for the ‘Datum’ case (no fields)
  -> (a -> r)  -- And one for the ‘Data’ case (one field)
  -> r         -- Returning the result

newData :: a -> DataType a
newData field = \ _visitDatum visitData -> visitData field

newDatum :: DataType a
newDatum = \ visitDatum _visitData -> visitDatum
Pattern-matching is simply running the visitor:
matchDT :: DataType a -> b -> (a -> b) -> b
matchDT dt visitDatum visitData = dt visitDatum visitData
-- Or: matchDT dt = dt
-- Or: matchDT = id

-- case someDT of { Datum -> f; Data x -> g x }
-- f :: r
-- g :: a -> r
-- someDT :: DataType a
--        :: forall r. r -> (a -> r) -> r
someDT f (\ x -> g x)
Similarly, in Haskell, data constructors are just the ways of introducing and eliminating values of a user-defined type.
What is not explicit and clear to me here is the difference between the constructor function and what it produces
I'm having trouble following your question, but I think you are complicating things. I would suggest not thinking too deeply about the "constructor" terminology.
But hopefully the following helps:
Starting simple:
data DataType = Data Int | Datum
The above reads "Declare a new type named DataType, which has the possible values Datum or Data <some_number> (e.g. Data 42)"
So e.g. Datum is a value of type DataType.
Going back to your example with a type parameter, I want to point out what the syntax is doing:
data DataType a = Data a | Datum
     ^        ^        ^           These things appear in type signatures (type level)
                  ^^^^     ^^^^^   These things appear in code (value level stuff)
There's a bit of punning happening here. So in a data declaration you might see Data Int, which mixes type-level and value-level names in a way that you wouldn't see in code. In code you'd see e.g. Data 42 or Data someVal.
I hope that helps a little...

Why does a wildcard match work when enumerating all cases doesn't?

Consider this code:
{-# LANGUAGE GADTs #-}

data P t where
  PA :: P Int
  PB :: P Double
  PC :: P Char

isA PA = True
isA _  = False
It compiles and works fine. Now consider this code:
{-# LANGUAGE GADTs #-}

data P t where
  PA :: P Int
  PB :: P Double
  PC :: P Char

isA PA = True
isA PB = False
isA PC = False
It fails to compile:
Main.hs:8:10: error:
    • Couldn't match expected type ‘p’ with actual type ‘Bool’
      ‘p’ is untouchable
        inside the constraints: t ~ Int
        bound by a pattern with constructor: PA :: P Int,
                 in an equation for ‘isA’
        at Main.hs:8:5-6
      ‘p’ is a rigid type variable bound by
        the inferred type of isA :: P t -> p
        at Main.hs:(8,1)-(10,14)
      Possible fix: add a type signature for ‘isA’
    • In the expression: True
      In an equation for ‘isA’: isA PA = True
    • Relevant bindings include isA :: P t -> p (bound at Main.hs:8:1)
  |
8 | isA PA = True
  |     ^^^^

Main.hs:9:10: error:
    • Couldn't match expected type ‘p’ with actual type ‘Bool’
      ‘p’ is untouchable
        inside the constraints: t ~ Double
        bound by a pattern with constructor: PB :: P Double,
                 in an equation for ‘isA’
        at Main.hs:9:5-6
      ‘p’ is a rigid type variable bound by
        the inferred type of isA :: P t -> p
        at Main.hs:(8,1)-(10,14)
      Possible fix: add a type signature for ‘isA’
    • In the expression: False
      In an equation for ‘isA’: isA PB = False
    • Relevant bindings include isA :: P t -> p (bound at Main.hs:8:1)
  |
9 | isA PB = False
  |     ^^^^^

Main.hs:10:10: error:
    • Couldn't match expected type ‘p’ with actual type ‘Bool’
      ‘p’ is untouchable
        inside the constraints: t ~ Char
        bound by a pattern with constructor: PC :: P Char,
                 in an equation for ‘isA’
        at Main.hs:10:5-6
      ‘p’ is a rigid type variable bound by
        the inferred type of isA :: P t -> p
        at Main.hs:(8,1)-(10,14)
      Possible fix: add a type signature for ‘isA’
    • In the expression: False
      In an equation for ‘isA’: isA PC = False
    • Relevant bindings include isA :: P t -> p (bound at Main.hs:8:1)
   |
10 | isA PC = False
   |     ^^^^^
Why? What's going on here?
Edit: Adding the type signature isA :: P t -> Bool makes it work, so my question now becomes: why doesn't type inference work in the second case, since it does in the first case?
In typing a case construct (whether an explicit case statement or an implicit pattern-based function definition) in the absence of GADTs, the individual alternatives:
pattern -> body
can be unified by typing all the patterns and unifying those with the type of the scrutinee, then typing all the bodies and unifying those with the type of the case expression as a whole. So, in a simple example like:
data U = UA | UB | UC

isA1 u = case u of
  UA -> True
  UB -> False
  x  -> False
we can initially type the patterns UA :: U, UB :: U, x :: a, unify them using the type equality a ~ U to infer the type of the scrutinee u :: U, and similarly unify True :: Bool and both of the False :: Bool to the type of the overall case expression Bool, unifying that with the type of isA to get isA :: U -> Bool.
Note that the process of unification can specialize the types. Here, the type of the pattern x :: a was general, but by the end of the unification process, it had been specialized to x :: U. This can happen with the bodies, too. For example:
len mstr = case mstr of
  Nothing  -> 0
  Just str -> length str
In general, through unification, the common, unified type of all the bodies (and so the type of the overall case expression) will be the "most general" / "least restrictive" type of which the types of the bodies are all generalizations. In some cases, this type might be the type of one of the bodies, but in general, all the bodies can be more general than the "most general" unified type, but no body can be more restrictive.
Things change in the presence of GADTs. When type-checking case constructs with GADTs, the patterns in an alternative can introduce a "type refinement", a set of additional bindings of type variables to be used in type-checking the bodies of the alternative. (This is what makes GADTs useful in the first place.)
Because different alternatives' bodies are typed under different refinements, naive unification isn't possible. For example, consider the tiny typed DSL and its interpreter:
data Term a where
  Lit :: Int -> Term Int
  IsZ :: Term Int -> Term Bool
  If  :: Term Bool -> Term a -> Term a -> Term a

eval :: Term a -> a
eval t = case t of
  Lit n    -> n
  IsZ t    -> eval t == 0
  If b t e -> if eval b then eval t else eval e
If we were to naively unify the bodies n :: Int, eval t == 0 :: Bool, and if eval b then eval t else eval e :: a, the program wouldn't type check (most obviously, because Int and Bool don't unify!).
In general, because type refinements allow the calculated types of the alternatives' bodies to be more specific than the final type, there's no obvious "most general" / "least restrictive" type to which all bodies can be unified, as there was for the case expression without GADTs.
Instead, we generally need to make available a "target" type for the overall case expression (e.g., for eval, the return type a in the type signature), and then determine if under each refinement introduced by a constructor (e.g., IsZ introducing the refinement a ~ Bool), the body eval t == 0 :: Bool has as its type the associated refinement of a.
If no target type is explicitly provided, then the best we can do -- in general -- is use a fresh type variable p as the target and try to check each refined type against that.
This means that, given the following definition without a type signature for isA2:
data P t where
  PA :: P Int
  PB :: P Double
  PC :: P Char

isA2 = \p -> case p of
  PA -> True
  PB -> False
  PC -> False
what GHC tries to do is type isA2 :: P t -> p. For the alternative:
PA -> True
it types PA :: P t, giving the refinement t ~ Int, and under this refinement it tries to type True :: p. Unfortunately, p is not Bool under any refinement of t (p is an unrelated, untouchable variable), and we get an error. Similar errors are generated for the other alternatives, too.
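As the question's edit already notes, supplying the signature gives the checker the rigid target type it needs, and everything goes through:

isA2 :: P t -> Bool
isA2 = \p -> case p of
  PA -> True
  PB -> False
  PC -> False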
Actually, there's one more thing we can do. If there are alternatives that don't introduce a type refinement, then the calculated types of their bodies are NOT more specific than the final type. So, if we unify the body types for "unrefined" alternatives, the resulting type provides a legitimate unification target for the refined alternatives.
This means that, for the example:
isA3 = \p -> case p of
  PA -> True
  x  -> False
the second alternative:
x -> False
is typed by matching the pattern x :: P t which introduces no type refinement. The unrefined type of the body is Bool, and this type is an appropriate target for unification of the other alternatives.
Specifically, the first alternative:
PA -> True
matches with a type refinement t ~ Int. Under this refinement, the actual type of the body True :: Bool matches the "refinement" of the target type Bool (which is "refined" to Bool, i.e., unchanged), and the alternative is determined to have a valid type.
So, the intuition is that, without a wildcard alternative, the inferred type for the case expression is an arbitrary type variable p, which is too general to be unified with the type refining alternatives. However, when you add a wildcard case alternative _ -> False, it introduces a more restrictive body type Bool into the unification process that, having been deduced without any type refinement by the pattern _, can inform the unification algorithm by providing a more restrictive type Bool to which the other, type refined alternatives, can be unified.
Above, I've made it sound like there's some two-phase approach, where the "non-refining" alternatives are examined first to determine a target type, and then the refining alternatives are checked against it.
In fact, what happens is that the refinement process introduces fresh variables into the unification process that, even when they're unified, don't affect the larger type context. So, all the alternatives are unified at once, but unification of the unrefining alternatives affects the larger type context while unification of the refined alternatives affects a bunch of fresh variables, giving the same end result as if the unrefined and refined alternatives were processed separately.
Disclaimer: I write this as an answer because it doesn't fit in a comment, but I might be wrong.
This behaviour is expected when pattern matching on GADTs. According to GHC's User Manual:
type refinement is only carried out based on user-supplied type
annotations. So if no type signature is supplied for eval, no type
refinement happens, and lots of obscure error messages will occur
Also from the User Manual:
When pattern-matching against data constructors drawn from a GADT, for
example in a case expression, the following rules apply:
The type of the scrutinee must be rigid.
The type of the entire case expression must be rigid.
The type of any free variable mentioned in any of the case alternatives must be rigid.
Note: a type variable is rigid iff it is specified by the user.
Given this, when pattern matching against a GADT you must provide a type signature (the reason is that type inference is difficult for GADTs). Hence, apparently, the first definition of isA should fail to compile too, but the paper in which type inference for GADTs is explained says (Section 6.4):
We remarked in Section 4.3 that in PCON-R it would be unsound to use
any unifier other than a most-general one. But must the refinement be
a unifier at all? For example, even though the case expression could
do refinement, no refinement is necessary to typecheck this function:
f :: Term a -> Int
f (Lit i) = i
f _       = 0
The above example is exactly your case! In the paper this is called a pre-unifier, and there is a very technical explanation of how this works, but as far as I can understand, when writing:
data P t where
  PA :: P Int
  PB :: P Double
  PC :: P Char

isA PA = True
isA PB = False
isA PC = False
the compiler starts by deducing isA :: P t -> p and refuses to continue, because the type variables aren't rigid (i.e. aren't user-specified),
whereas when writing:
data P t where
  PA :: P Int
  PB :: P Double
  PC :: P Char

isA PA = True
isA _  = False
the compiler can deduce that any inferred type will be less general than deducing Bool as the return type, hence it can safely deduce isA :: P t -> Bool.
Probably this seems as obscure to you as it does to me, but the two cases you ask about are actually documented, so presumably this is the behaviour the GHC developers intended and not a weird bug.

Haskell: Understanding generalised algebraic data types

I have the following code, which outlines a language of boolean and arithmetic expressions:
data Exp a where
  Plus    :: Exp Int -> Exp Int -> Exp Int
  Const   :: (Show a) => a -> Exp a
  Not     :: Exp Bool -> Exp Bool
  And     :: Exp Bool -> Exp Bool -> Exp Bool
  Greater :: Exp Int -> Exp Int -> Exp Bool
The evaluation function for the above language has the following type:
eval :: Exp a -> a
I am trying to understand what the possible return types of the eval function are. The above code uses GADTs, which enable the type Exp a to be defined in terms of the type signatures of its constructors.
The return type of the constructors is not always Exp a. I believe that the types eval can return include Int, Bool, or a value of any type implementing Show. However, is it possible for eval to return any other type besides the three I mentioned previously (since the return type of the function is a)? Any insights are appreciated.
Exp a -> a is a type that could have many possible functions. For example, a valid function that probably isn't the one you are thinking of as "the" eval function is
foo :: Exp a -> a
foo (Plus _ _)    = 3
foo (Const x)     = x
foo (Not _)       = True
foo (And _ _)     = True
foo (Greater _ _) = True
The image of a function is the set of values it can return. This example demonstrates that the function foo and the eval you are expecting have different images, despite having the same return type forall a. a.
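For contrast, the eval the question presumably has in mind (a standard interpreter sketch, not given in the post) is:

eval :: Exp a -> a
eval (Plus a b)    = eval a + eval b
eval (Const x)     = x
eval (Not b)       = not (eval b)
eval (And a b)     = eval a && eval b
eval (Greater a b) = eval a > eval b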
You are asking, in essence, what the union of all possible images of functions of type Exp a -> a is. This depends greatly on the actual definition of Exp. As currently defined, that would be the union of Int, Bool, and Show a => a.
Exp the type constructor, though, is capable of defining uninhabited types. The type Exp (Int -> Int) exists, even though you haven't defined a constructor which can create values of that type. Since you can't provide a value of type Exp (Int -> Int) for any potential eval function, it can't affect the image of any such function, either.
Changing the definition of Exp to include such a constructor would increase the set of values that could be passed to a function of type Exp a -> a, thus increase the set of values that could occur in the image of such a function.
To be clear, given your definition of Exp above, the answer is "yes": any function with type signature eval :: Exp a -> a can only return a (defined) value of type Int, Bool, or some other type with a Show instance. Because any type (of kind *) can be given a Show instance, that technically means that eval can return any type, but within a particular program that has a fixed set of Show instances, eval will be limited to returning values from this set of types.
You can see that this must be true as follows. Suppose that eval e returned a value of some fixed type t for some expression e. By the type signature of eval, that would imply that e must have type Exp t. However, data type declarations are "closed" meaning that the set of constructors given in the data Exp a declaration is exhaustive, and they represent the only methods of constructing a (defined) value of type Exp a. It's clear from the set of constructors that the only possible values of type Exp a are those that appear in the rightmost positions of the constructors' signatures: Exp Int, Exp Bool, and Exp a with the constraint Show a. Therefore, these are the only types possible for e implying that t must be Int, Bool, or some other a satisfying the constraint Show a.
As always in reasoning about Haskell types, we have to be a little bit careful in considering undefined/bottom values. If you consider "returning an undefined value" to be meaningful, then, indeed, eval can "return" an undefined value of any type, even one without a Show instance. For example, the following will typecheck:
stupid :: Exp (Int -> Int)
stupid = eval undefined
However, if your reason for asking the question is to determine whether you'd ever be in a position where the expression eval e might unexpectedly have a type other than Int, Bool, or some Show a => a that you would somehow have to handle, then no. The form of the GADT places limits on the possible types a in the signature eval :: Exp a -> a.

Typed abstract syntax and DSL design in Haskell

I'm designing a DSL in Haskell and I would like to have an assignment operation. Something like this (the code below is just to explain my problem in a limited context; I haven't type checked the Stmt type):
data Stmt = forall a. Assign String (Exp a) -- Assignment operation
          | forall a. Decl String a         -- Variable declaration

data Exp t where
  EBool :: Bool -> Exp Bool
  EInt  :: Int -> Exp Int
  EAdd  :: Exp Int -> Exp Int -> Exp Int
  ENot  :: Exp Bool -> Exp Bool
In the previous code, I'm able to use a GADT to enforce type constraints on expressions. My problem is how I can enforce that: 1) the left-hand side of an assignment is defined, i.e., a variable must be declared before it is used, and 2) the right-hand side has the same type as the left-hand side variable?
I know that in a full dependently typed language, I could define statements indexed by some sort of typing context, that is, a list of defined variables and their type. I believe that this would solve my problem. But, I'm wondering if there is some way to achieve this in Haskell.
Any pointer to example code or articles is highly appreciated.
Given that my work focuses on related issues of scope and type safety being encoded at the type-level, I stumbled upon this old-ish question whilst googling around and thought I'd give it a try.
This post provides, I think, an answer quite close to the original specification. The whole thing is surprisingly short once you have the right setup.
First, I'll start with a sample program to give you an idea of what the end result looks like:
program :: Program
program = Program
   $  Declare (Var :: Name "foo") (Of :: Type Int)
   :> Assign  (The (Var :: Name "foo")) (EInt 1)
   :> Declare (Var :: Name "bar") (Of :: Type Bool)
   :> increment (The (Var :: Name "foo"))
   :> Assign  (The (Var :: Name "bar")) (ENot $ EBool True)
   :> Done
Scoping
In order to ensure that we may only assign values to variables which have been declared before, we need a notion of scope.
GHC.TypeLits provides us with type-level strings (called Symbol), so we can very well use strings as variable names if we want. And because we want to ensure type safety, each variable declaration comes with a type annotation, which we will store together with the variable name. Our type of scopes is therefore: [(Symbol, *)].
We can use a type family to test whether a given Symbol is in scope and return its associated type if that is the case:
type family HasSymbol (g :: [(Symbol,*)]) (s :: Symbol) :: Maybe * where
  HasSymbol '[]            s = 'Nothing
  HasSymbol ('(s, a) ': g) s = 'Just a
  HasSymbol ('(t, a) ': g) s = HasSymbol g s
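We can check in GHCi that the family reduces as intended with :kind! (output roughly as follows, assuming DataKinds and TypeOperators are enabled):

> :kind! HasSymbol '[ '("foo", Int), '("bar", Bool) ] "bar"
HasSymbol '[ '("foo", Int), '("bar", Bool) ] "bar" :: Maybe *
= 'Just Bool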
From this definition we can define a notion of variable: a variable of type a in scope g is a symbol s such that HasSymbol g s returns 'Just a. This is what the ScopedSymbol data type represents by using an existential quantification to store the s.
data ScopedSymbol (g :: [(Symbol,*)]) (a :: *) = forall s.
  (HasSymbol g s ~ 'Just a) => The (Name s)

data Name (s :: Symbol) = Var
Here I am purposefully abusing notations all over the place: The is the constructor for the type ScopedSymbol and Name is a Proxy type with a nicer name and constructor. This allows us to write such niceties as:
example :: ScopedSymbol ('("foo", Int) ': '("bar", Bool) ': '[]) Bool
example = The (Var :: Name "bar")
Statements
Now that we have a notion of scope and of well-typed variables in that scope, we can start considering the effects Statements should have. Given that new variables can be declared in a Statement, we need to find a way to propagate this information in the scope. The key insight is to have two indices: an input and an output scope.
To Declare a new variable together with its type will expand the current scope with the pair of the variable name and the corresponding type.
Assignments on the other hand do not modify the scope. They merely associate a ScopedSymbol to an expression of the corresponding type.
data Statement (g :: [(Symbol, *)]) (h :: [(Symbol,*)]) where
  Declare :: Name s -> Type a -> Statement g ('(s, a) ': g)
  Assign  :: ScopedSymbol g a -> Exp g a -> Statement g g

data Type (a :: *) = Of
Once again we have introduced a proxy type to have a nicer user-level syntax.
example' :: Statement '[] ('("foo", Int) ': '[])
example' = Declare (Var :: Name "foo") (Of :: Type Int)
example'' :: Statement ('("foo", Int) ': '[]) ('("foo", Int) ': '[])
example'' = Assign (The (Var :: Name "foo")) (EInt 1)
Statements can be chained in a scope-preserving way by defining the following GADT of type-aligned sequences:
infixr 5 :>
data Statements (g :: [(Symbol, *)]) (h :: [(Symbol,*)]) where
  Done :: Statements g g
  (:>) :: Statement g h -> Statements h i -> Statements g i
Expressions
Expressions are mostly unchanged from your original definition except that they are now scoped and a new constructor EVar lets us dereference a previously-declared variable (using ScopedSymbol) giving us an expression of the appropriate type.
data Exp (g :: [(Symbol,*)]) (t :: *) where
  EVar  :: ScopedSymbol g a -> Exp g a
  EBool :: Bool -> Exp g Bool
  EInt  :: Int -> Exp g Int
  EAdd  :: Exp g Int -> Exp g Int -> Exp g Int
  ENot  :: Exp g Bool -> Exp g Bool
Programs
A Program is quite simply a sequence of statements starting in the empty scope. We use, once more, an existential quantification to hide the scope we end up with.
data Program = forall h. Program (Statements '[] h)
It is obviously possible to write subroutines in Haskell and use them in your programs. In the example, I have the very simple increment which can be defined like so:
increment :: ScopedSymbol g Int -> Statement g g
increment v = Assign v (EAdd (EVar v) (EInt 1))
I have uploaded the whole code snippet together with the right LANGUAGE pragmas and the examples listed here in a self-contained gist. I haven't however included any comments there.
You should know that your goals are quite lofty. I don't think you will get very far treating your variables exactly as strings. I'd do something slightly more annoying to use, but more practical. Define a monad for your DSL, which I'll call M:
newtype M a = ...

data Exp a where
  ... as before ...

data Var a -- a typed variable

assign  :: Var a -> Exp a -> M ()
declare :: String -> a -> M (Var a)
I'm not sure why you have Exp a for assignment and just a for declaration, but I reproduced that here. The String in declare is just for cosmetics, if you need it for code generation or error reporting or something -- the identity of the variable should really not be tied to that name. So it's usually used as
myFunc = do
  foobar <- declare "foobar" 42
which is the annoying redundant bit. Haskell doesn't really have a good way around this (though depending on what you're doing with your DSL, you may not need the string at all).
As for the implementation, maybe something like
data Stmt = forall a. Assign (Var a) (Exp a)
          | forall a. Declare (Var a) a

data Var a = Var String Integer -- string is auxiliary from before,
                                -- integer stores real identity
For M, we need a unique supply of names and a list of statements to output.
newtype M a = M { runM :: WriterT [Stmt] (StateT Integer Identity) a }
  deriving (Functor, Applicative, Monad)
Then the operations are fairly trivial.
assign v a = M $ tell [Assign v a]

declare name a = M $ do
  ident <- lift get
  lift . put $! ident + 1
  let var = Var name ident
  tell [Declare var a]
  return var
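To round things off, a sketch of running M to recover the statement list (assuming the newtype above and the usual mtl imports):

import Control.Monad.Writer (execWriterT)
import Control.Monad.State (evalStateT)
import Data.Functor.Identity (runIdentity)

-- Discard the result value, keep the emitted statements, and start the
-- name supply at 0.
runProgram :: M a -> [Stmt]
runProgram m = runIdentity (evalStateT (execWriterT (runM m)) 0)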
I've made a fairly large DSL for code generation in another language using a fairly similar design, and it scales well. I find it a good idea to stay "near the ground", just doing solid modeling without using too many fancy type-level magical features, and accepting minor linguistic annoyances. That way Haskell's main strength, its ability to abstract, can still be used for code in your DSL.
One drawback is that everything needs to be defined within a do block, which can be a hindrance to good organization as the amount of code grows. I'll steal declare to show a way around that:
declare :: String -> M a -> M a
used like
foo = declare "foo" $ do
  -- actual function body
then your M can have as a component of its state a cache from names to variables, and the first time you use a declaration with a certain name you render it and put it in a variable (this will require a bit more sophisticated monoid than [Stmt] as the target of your Writer). Later times you just look up the variable. It does have a rather floppy dependence on uniqueness of names, unfortunately; an explicit model of namespaces can help with that but never eliminate it entirely.
After seeing all the code by @Cactus and the Haskell suggestions by @luqui, I've managed to get a solution close to what I want in Idris. The complete code is available at the following gist:
(https://gist.github.com/rodrigogribeiro/33356c62e36bff54831d)
Some little things I need to fix in the previous solution:
I don't know (yet) if Idris supports integer literal overloading, which would be quite useful for building my DSL.
I've tried to define a prefix operator for program variables in the DSL syntax, but it didn't work as I'd like. I've got a solution (in the previous gist) that uses a keyword, use, for variable access.
I'll check these minor points with the guys in the Idris channel on freenode to see if these two points are possible.
