Factoring out recursion in a complex AST

Factoring out recursion in a complex AST - haskell

For a side project I am working on I currently have to deal with an abstract syntax tree and transform it according to rules (the specifics are unimportant).
The AST itself is nontrivial, meaning it has subexpressions which are restricted to some types only. (e.g. the operator A must take an argument which is of type B only, not any Expr. A drastically simplified reduced version of my datatype looks like this:
data Expr = List [Expr]
| Strange Str
| Literal Lit
data Str = A Expr
| B Expr
| C Lit
| D String
| E [Expr]
data Lit = Int Int
| String String
My goal is to factor out the explicit recursion and rely on recursion schemes instead, as demonstrated in these two excellent blog posts, which provide very powerful general-purpose tools to operate on my AST. Applying the necessary factoring, we end up with:
data ExprF a = List [a]
| Strange (StrF a)
| Literal (LitF a)
data StrF a = A a
| B a
| C (LitF a)
| D String
| E [a]
data LitF a = Int Int
| String String
If I didn't mess up, type Expr = Fix ExprF should now be isomorphic to the previously defined Expr.
However, writing cata for these cases becomes rather tedious, as I have to pattern match B a :: StrF a inside of an Str :: ExprF a for cata to be well-typed. For the entire original AST this is unfeasible.
I stumbled upon fixing GADTs, which seems to me like it is a solution to my problem, however the user-unfriendly interface of the duplicated higher-order type classes etc. is quite the unneccessary boilerplate.
So, to sum up my questions:
Is rewriting the AST as a GADT the correct way to go about this?
If yes, how could I transform the example into a well-working version? On a second note, is there better support for higher kinded Functors in GHC now?

If you've gone through the effort of to separate out the recursion in your data type, then you can just derive Functor and you're done. You don't need any fancy features to get the recursion scheme. (As a side note, there's no reason to parameterize the Lit data type.)
The fold is:
newtype Fix f = In { out :: f (Fix f) }
gfold :: (Functor f) => (f a -> a) -> Fix f -> a
gfold alg = alg . fmap (gfold alg) . out
To specify the algebra (the alg parameter), you need to do a case analysis against ExprF, but the alternative would be to have the fold have a dozen or more parameters: one for each data constructor. That wouldn't really save you much typing and would be much harder to read. If you want (and this may require rank-2 types in general), you can package all those parameters up into a record and then you could use record update to update "pre-made" records that provide "default" behavior in various circumstances. There's an old paper Dealing with Large Bananas that takes an approach like this. What I'm suggesting, to be clear, is just wrapping the gfold function above with a function that takes a record, and passes in an algebra that will do the case analysis and call the appropriate field of the record for each case.
Of course, you could use GHC Generics or the various "generic/polytypic" programming libraries like Scrap Your Boilerplate instead of this. You are basically recreating what they do.

Related

What is the best practice to generate data which satisfy specific property in QuickCheck?

When we are using QuickCheck to check our programs, we need to define generators for our data, there is some generic way to define them, but the generic way usually become useless when we need the generated data to satisfy some constraints to work.
e.g.
data Expr
= LitI Int
| LitB Bool
| Add Expr Expr
| And Expr Expr
data TyRep = Number | Boolean
typeInfer :: Expr -> Maybe TyRep
typeInfer e = case e of
LitI _ -> Number
LitB _ -> Boolean
Add e1 e2 -> case (typeInfer e1, typeInfer e2) of
(Just Number, Just Number) -> Just Number
_ -> Nothing
And e1 e2 -> case (typeInfer e1, typeInfer e2) of
(Just Boolean, Just Boolean) -> Just Boolean
_ -> Nothing
now I need to define generator of Expr (i.e. Gen Expr or instance Arbitrary Expr), but also want it generates the type correct ones (i.e. isJust (typeInfer generatedExpr))
a naive way to do that is use suchThat to filter out the invalid ones, but that is obviously inefficient when Expr and TyRep becomes complicated with more cases.
Another similar situation is about reference integrity, e.g.
data Expr
= LitI Int
| LitB Bool
| Add Expr Expr
| And Expr Expr
| Ref String -- ^ reference another Expr via it's name
type Context = Map String Expr
In this case, we want all the referenced names in the generated Expr are contained in some specific Context, now I have to generate Expr for specific Context:
arbExpr :: Context -> Gen Expr
but now shrink will be a problem, and to solve this problem, I have to define a specific version of shrink, and use forAllShrink everytime I use arbExpr, that means a lot of work.
So I want to know, is there a best practice to do such things?

For well-typed terms, a simple approach in many cases is to have one generator for each type, or, equivalently, a function TyRep -> Gen Expr. Adding variables on top of that, this usually turns into a function Context -> TyRep -> Gen Expr.
In the case of generating terms with variables (and with no or very simple types), indexing the type of terms by the context (e.g., like you would do using the bound library) should make it quite easy to derive a generator generically.
For shrinking, hedgehog's approach can work quite well, where Gen generates a value together with shrunk versions, sparing you from defining a separate shrinking function.
Note that as the well-formedness/typing relation becomes more complex, you start hitting the theoretical wall where generating terms is at least as hard as arbitrary proof search.
For more advanced techniques/related literature, with my own comments about possibly using it in Haskell:
Generating Constrained Data with Uniform Distribution, by Claessen et al., FLOPS'14 (PDF). I believe the Haskell package lazy-search has most of the machinery described by the paper, but it seems aimed at enumeration rather than random generation.
Making Random Judgments: Automatically Generating Well-Typed Terms from the Definition of a Type-System, by Fetscher et al., ESOP'15 (PDF), the title says it all. I don't know about a Haskell implementation though; you might want to ask the authors.
Beginner's Luck: A Language for Property-Based Generators, by Lampropoulos et al., POPL'17 (PDF) (disclaimer: I'm a coauthor). A language of properties (more concretely, functions T -> Bool, e.g., a typechecker) that can be interpreted as random generators (Gen T). The language's syntax is strongly inspired by Haskell, but there are still a few differences. The implementation has an interface to extract the generated values in Haskell (github repo).
Generating Good Generators for Inductive Relations, by Lampropoulos et al. POPL'18 (PDF). It's in Coq QuickChick, but tying it to Haskell QuickCheck by extraction seems reasonably feasible.

Nearest equivalent to Prolog atom or Lisp symbol in Haskell

I'm trying to write a simple program for manipulating expressions in propositional calculus and would like a nice syntax for proposition variables (e.g. 'P or something).
Strings get the job done, but the syntax is misleading in this context and they permit inappropriate operations like ++.
Syntactically, I'd like to be able to write down something that does not "look quoted" visually (something like 'P is okay, though). In terms of the supported operations, I'd like to be able to determine whether two symbols are equal and to convert them into a string matching their name via show. I'd also like these things to be open (ADTs with only nullary constructors are similar in principle to symbols, but require all variants to be declared in advance).
Here's a toy example using strings where something symbol-like would be more appropriate.
type Var = String
data Proposition =
Primitive Var |
Negated Proposition |
Implication Proposition Proposition
instance Show Proposition where
show (Primitive p) = p
show (Negated n) = "!" ++ show n
show (Implication ant cons) =
"(" ++ show ant ++ "->" ++ show cons ++ ")"
main = putStrLn $ show $ Implication (Primitive "A") (Primitive "B")

Typically the way this is done in Haskell is by parameterizing over the type of symbols. So your example would become:
data Proposition a =
Primitive a |
Negated (Proposition a) |
Implication (Proposition a) (Proposition a)
which then leaves it up to the user to decide the best representation their symbols. This has advantages over LISP-like symbols: symbols intended for different purposes will not be mixed up, and data structures involving symbols now admit transformations over all the symbols, which are more useful than you realize. For example, Functor changes between symbol representations, and Monad models substitution.
(=<<) :: (a -> Proposition b) -> Proposition a -> Proposition b
^ ^^^^^^^^^^^^^ ^^^^^^^^^^^^^
substitute each free var with an expression in this expression
You can get a form of type-safe openness too:
implyOpen :: Proposition a -> Proposition b -> Proposition (Either a b)
implyOpen p q = Implication (Left <$> p) (Right <$> q)
Another fun trick is using a non-regular recursive type to model variable bindings in a type-safe way.
data Proposition a =
... |
ForAll (Proposition (Maybe a))
Here we have added one "free variable" to the inner proposition -- Primitive Nothing is the variable being quantified over. It may seem awkward at first, but when you get to coding it's bomb, because the types make it very hard to get it wrong.
bound is an excellent package for modelling expression languages based on this idea (and a few other tricks).

Using different Ordering for Sets

I was reading a Chapter 2 of Purely Functional Data Structures, which talks about unordered sets implemented as binary search trees. The code is written in ML, and ends up showing a signature ORDERED and a functor UnbalancedSet(Element: ORDERED): SET. Coming from more of a C++ background, this makes sense to me; custom comparison function objects form part of the type and can be passed in at construction time, and this seems fairly analogous to the ML functor way of doing things.
When it comes to Haskell, it seems the behavior depends only on the Ord instance, so if I wanted to have a set that had its order reversed, it seems like I'd have to use a newtype instance, e.g.
newtype ReverseInt = ReverseInt Int deriving (Eq, Show)
instance Ord ReverseInt where
compare (ReverseInt a) (ReverseInt b)
| a == b = EQ
| a < b = GT
| a > b = LT
which I could then use in a set:
let x = Set.fromList $ map ReverseInt [1..5]
Is there any better way of doing this sort of thing that doesn't resort to using newtype to create a different Ord instance?

No, this is really the way to go. Yes, having a newtype is sometimes annoying but you get some big benefits:
When you see a Set a and you know a, you immediately know what type of comparison it uses (sort of the same way that purity makes code more readable by not making you have to trace execution). You don't have to know where that Set a comes from.
For many cases, you can coerce your way through multiple newtypes at once. For example, I can turn xs = [1,2,3] :: Int into ys = [ReverseInt 1, ReverseInt 2, ReverseInt 3] :: [ReverseInt] just using ys = coerce xs :: [ReverseInt]. Unfortunately, that isn't the case for Set (and it shouldn't - you'd need the coercion function to be monotonic to not screw up the data structure invariants, and there is not yet a way to express that in the type system).
newtypes end up being more composable than you expect. For example, the ReverseInt type you made already exists in a form that generalizes to reversing any type with an Ord constraint: it is called Down. To be explicit, you could use Down Int instead of ReversedInt, and you get the instance you wrote out for free!
Of course, if you still feel very strongly about this, nothing is stopping you from writing your version of Set which has to have a field which is the comparison function it uses. Something like
data Set a = Set { comparisionKey :: a -> a -> Ordering
, ...
}
Then, every time you make a Set, you would have to pass in the comparison key.

Extending algebraic data type

Note: if this question is somehow odd, this is because I was only recently exposed to Haskell and am still adapting to the functional mindset.
Considering a data type like Maybe:
data MyOwnMaybe a = MyOwnNothing | MyOwnJust a
everyone using my data type will write functions like
maybeToList :: MyOwnMaybe a -> [a]
maybeToList MyOwnNothing = []
maybeToList (MyOwnJust x) = [x]
Now, suppose that, at a later time, I wish to extend this data type
data MyOwnMaybe a = MyOwnNothing | MyOwnJust a | SuperpositionOfNothingAndJust a
how do I make sure that everyone's functions will break at compile-time?
Of course, there is the chance that somehow I'm not "getting" algebraic data types and maybe I shouldn't be doing this at all, but considering a data type Action
data Action = Reset | Send | Remove
it would seem that adding an extra Action like Add would not be so uncommon (and I wouldn't want to risk having all these functions around that possibly cannot handle my new Action)

Well, bad news first: sometimes you just can't do it. Period.
But that is language-agnostic; in any language you sometimes have to break interface. There is no way around it.
Now, good news: you can actually go a great length before you have to do that.
You just have to carefully consider what you export from your module. If, instead of exporting the internal workings of it, you export high-level functions, then there is a good chance you can rewrite those function using the new data type, and everything would go smooth.
In particular, be very careful when exporting data constructors. In this case, you don't just export functions that create your data; you are also exporting the possibility of pattern-matching; and that is not something that ties you pretty tight.
So, in your example, if you write functions like
myOwnNothing :: MyOwnMaybe a
myOwnJust :: a -> MyOwnMaybe a
and
fromMyOwnMaybe :: MyOwnMaybe a -> b -> (a -> b) -> b
fromMyOwnMaybe MyOwnNothing b _ = b
fromMyOwnMaybe (MyOwnJust a) _ f = f a
then it's reasonable to assume that you would be able to reimplement it for the updated MyOwnMaybe data type; so, just export those functions and the data type itself, but don't export constructors.
The only situation in which you would benefit from exporting constructors is when you are absolutely sure that your data type won't ever change. For example, Bool would always have only two (fully defined) values: True and False, it won't be extended by some FileNotFound or anything (although Edward Kmett might disagree). Ditto Maybe or [].
But the idea is more general: stay as high-level as you can.

You seem to know that GHC can warn about non-exhaustive pattern matches in function via the -W flag or explicitly with -fwarn-incomplete-patterns.
There is a good discussion about why these warnings are not automatically compile-time errors at this SO question:
In Haskell, why non-exhaustive patterns are not compile-time errors?
Also, consider this case where you have an ADT with a large number of constructors:
data Alphabet = A | B | C | ... | X | Y | Z
isVowel :: Alphabet -> Bool
isVowel A = True
isVowel E = True
isVowel I = True
isVowel O = True
isVowel U = True
isVowel _ = False
A default case is used as a convenience to avoid having to write out the other 21 cases.
Now if you add an addition constructor to Alphabet, should isVowel be flagged as "incomplete"?

One thing that a lot of modules do is not to export their constructors. Instead, they export functions that can be used (“smart constructors”). If you change your ADT later, you have to fix your functions in the module, but no one else's code gets broken.

What is "Scrap Your Boilerplate"?

I see people talking about Scrap Your Boilerplate and generic programming in Haskell. What do these terms mean? When would I want to use Scrap Your Boilerplate, and how do I use it?

Often when making transformations on complex data types, we only need to affect small pieces of the structure---in other words, we're targeting specific reducible expressions, redexes, alone.
The classic example is double-negation elimination over a type of integer expressions:
data Exp = Plus Exp Exp | Mult Exp Exp | Negate Exp | Pure Int
doubleNegSimpl :: Exp -> Exp
doubleNegSimpl (Negate (Negate e)) = e
...
Even in describing this example, I'd prefer not to write out the entirety of the ... part. It is completely mechanical---nothing more than the engine for continuing the recursion throughout the entirety of the Exp.
This "engine" is the boilerplate we intend to scrap.
To achieve this, Scrap Your Boilerplate suggests a mechanism by which we can construct "generic traversals" over data types. These traversals operate exactly correctly without knowing anything at all about the specific data type in question. To do this, very roughly, we have a notion of generic annotated trees. These are larger than ADTs in such a way that all ADTs can be projected into the type of annotated trees:
section :: Generic a => a -> AnnotatedTree
and "valid" annotated trees can be projected back into some brand of ADT
retract :: Generic a => AnnotatedTree -> Maybe a
Notably, I'm introducing the Generic typeclass to indicate types which have section and retract defined.
Using this generic, annotated tree representation of all data types, we can define a traversal once and for all. In particular, we provide an interface (using section and retract strategically) so that end users are never exposed to the AnnotatedTree type. Instead, it looks a bit like:
everywhere' :: Generic a => (a -> a) -> (AnnotatedTree -> AnnotatedTree)
such that, combined with final and initial section and retracts and the invariant that our annotated trees are always "valid", we have
everywhere :: Generic a => (a -> a) -> (a -> a)
everywhere f a0 = fromJust . retract . everywhere' f . section
What does everywhere f a do? It tries to apply the function f "everywhere" in the ADT a. In other words, we now write our double negation simplification as follows
doubleNegSimpl :: Exp -> Exp
doubleNegSimpl (Negate (Negate e)) = e
doubleNegSimpl e = e
In other words, it acts as id whenever the redex (Negate (Negate _)) fails to match. If we apply everywhere to this
simplify :: Exp -> Exp
simplify = everywhere doubleNegSimpl
then double negations will be eliminated "everywhere" via a generic traversal. The ... boilerplate is gone.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string