Converting Paulson's parser combinators to Haskell - haskell

I am trying to convert the code from Paulson's ML for the working programmer book chapter 9, Writing Interpreters for the λ-Calculus.
I was wondering if anyone can help me translate this to Haskell.
I'm struggling to understand the syntax.
fun list ph = ph -- repeat ("," $-- ph) >> (op::);
fun pack ph = "(" $-- list ph --$")" >> #1
| empty;

In porting this code to Haskell, I see two challenges: One is rewriting the combinators so they use the type Either SyntaxError rather than exceptions for flow control, and the other is preserving the modularity of ML's functors. That is, writing a parser combinator library that is modular with regards to what keywords / symbols / tokenizer it should use.
While the ML code has the two
functor Lexical (Keyword: KEYWORD) : LEXICAL
functor Parsing (Lex: LEXICAL) : PARSE
you could start by having
data Keyword = Keyword
{ alphas :: [String]
, symbols :: [String]
}
data Token
= Key String
| Id String
deriving (Show, Eq)
lex :: Keyword -> String -> [Token]
lex kw s = ...
where
alphaTok :: String -> Token
alphaTok a | a `elem` alphas kw = Key a
| otherwise = Id a
...
The ML code uses the types string and substring while Haskell's String is actually a [Char]. The lexer functions would look a little different because ML's String.getc could simply be the pattern match c : ss1 in Haskell, etc.
Paulson's parsers have type [Token] → (τ, [Token]) but allow for exceptions. The Haskell parsers could have type [Token] → Either SyntaxError (τ, [Token]):
newtype SyntaxError = SyntaxError String
deriving Show
newtype Parser a = Parser { runParser :: [Token] -> Either SyntaxError (a, [Token]) }
err :: String -> Either SyntaxError b
err msg = Left (SyntaxError msg)
The operators id, $, ||, !!, -- and >> need new names, since they collide with a bunch of built-in operators and single-line comments. Ideas for names could be: ident, kw, |||, +++ and >>>. I would skip implementing the !! operator initially.
Here are two combinators implemented a little differently,
ident :: Parser String
ident = Parser f
where
f :: [Token] -> Either SyntaxError (String, [Token])
f (Id x : toks) = Right (x, toks)
f (Key x : _) = err $ "Identifier expected, got keyword '" ++ x ++ "'"
f [] = err "Identifier expected, got EOF"
(+++) :: Parser a -> Parser b -> Parser (a, b)
(+++) pa pb = Parser $ \toks1 -> do (x, toks2) <- runP pa toks1
(y, toks3) <- runP pb toks2
return ((x, y), toks3)
...
Some final remarks:
Read the paper Monadic Parsing in Haskell (Hutton, Meijer).
You may be interested in SimpleParse by Ken Friis Larsen, an educational parser combinator library that is a simplification of ReadP by Koen Claessen, since its source code is very easy to read. They are both non-deterministic.
If you're interested in using parser combinators in Haskell, rather than porting some old-fashioned library for the learning experience, I encourage you too look at Megaparsec (tutorial), a modern fork of Parsec. The implementation is a little complex.
None of these three libraries (SimpleParse, ReadP, Megaparsec) split lexing and parsing into two separate steps. Rather, they simply build small tokenizing parsers that implicitly eat meaningless whitespace. (See the token combinator in SimpleParse, for example.) However, Megaparsec does allow an arbitrary token type, whether that is Char or some token you have lexed.

Related

Describing my Parser type as a series of monad transformers

I have to describe the Parser type as a series of monad transformers.
As far as I understand, monad transformers are used to wrap monads into another monad. But I don't understand what is the task here.
Instead of defining a new type for Parser, you can simply define it as a type alias for a type created by one or more monad transformers. That is, you definition would look something like
type Parser a = SomeMonadT <some set of monads and types>
Your task, then, is to determine which monad transformer(s) to use, and what the arguments to the transformer(s) should be.
Before we begin to define combinators that act on parsers, we must choose a representation for a parser first.
A parser takes a string and produces an output that can be just about anything. A list parser will produce a list as it's output, an integer parser will produce Ints, a JSON parser might return a custom ADT representing a JSON.
Therefore, it makes sense to make Parser a polymorphic type. It also makes sense to return a list of results instead of a single result since grammar can be ambiguous, and there may be several ways to parse the same input string.
An empty list, then, implies the parser failed to parse the provided input.
newtype Parser a = Parser { parse :: String -> [(a, String)] }
You might wonder why we return the tuple (a, String), not just a. Well, a parser might not be able to parse the entire input string. Often, a parser is only intended to parse some prefix of the input, and let another parser do the rest of the parsing. Thus, we return a pair containing the parse result a and the unconsumed string subsequent parsers can use.
We can start by describing some basic parsers that do very little work. A result parser always succeeds in parsing without consuming the input string.
result :: a -> Parser a
result val = Parser $ \inp -> [(val, inp)]
item unconditionally accepts the first character of any input string.
item :: Parser Char
item = Parser parseItem
where
parseItem [] = []
parseItem (x:xs) = [(x, xs)]
Let's try some of these parsers in GHCi:
*Main> parse (result 42) "abc"
[(42, "abc")]
*Main> parse item "abc"
[('a', "bc")]
Say we want a parser that consumes a string if its first character satisfies a predicate. We can generalize this idea by writing a function that takes a (Char -> Bool) predicate and returns a parser that only consumes an input string if its first character returns True when supplied to the predicate.
The simplest solution for this would be:
sat :: (Char -> Bool) -> Parser Char
sat p = Parser parseIfSat
where
parseIfSat (x : xs) = if p x then [(x, xs)] else []
Using the previously defined item parser (this requires a Monad instance for the type Parser, which I leave to you as an exercise):
sat p =
-- Apply `item`, if it fails on an empty string, we simply short circuit and get `[]`.
item >>= \x ->
if p x
then result x
else zero
parseIfSat [] = []
Now we can use the sat combinator to describe several useful parsers. For example, parser for ASCII digits:
-- import Data.Char (isDigit, isLower, isUpper)
digit :: Parser Char
digit = sat isDigit
You get the idea. You start by defining elementary parsers, and use those to build more complex parsers. The type Parser shown here is actually StateT Monad Transformer, it combines State and [] in this case.
The code shown was taken from here.

"string" Implementation in Text.Parser.Char

First, just some quick context. I'm going through the Haskell Programming From First Principles book, and ran into the following exercise.
Try writing a Parser that does what string does, but using char.
I couldn't figure it out, so I checked out the source for the implementation. I'm currently trying to wrap my head around it. Here it is:
class Parsing m => CharParsing m where
-- etc.
string :: CharParsing m => String -> m String
string s = s <$ try (traverse_ char s) <?> show s
My questions are as follows, from most to least specific.
Why is show necessary?
Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
What is going on with the traversal? I get what a list traversal does, so I guess I'm confused about the Applicative/Monad instances for Parser. On a high level, I get that the traversal applies char, which has type CharParsing m => Char -> m Char, to every character in string s, and then collects all the results into something of type Parser [Char]. So the types make sense, but I have no idea what's going on in the background.
Thanks in advance!
1) Why is show necessary?
Because showing a string (or a Text, etc.) escapes special characters, which makes sense for error messages:
GHCi> import Text.Parsec -- Simulating your scenario with Parsec.
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting foo
bar
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> show s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting "foo\nbar"
2) Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
The result of the parse is unnecessary because we know in advance that it would be s (if the parse were successful). traverse would needlessly reconstruct s from the results of parsing each individual character. In general, if the results are not needed it is a good idea to use traverse_ (which just combines the effects, discarding the results without trying to rebuild the data structure) rather than traverse, so that is likely why the function is written the way it is.
3) What is going on with the traversal?
traverse_ char s (traverse_, and not traverse, as explained above) is a parser. It tries to parse, in order, each character in s, while discarding the results, and it is built by sequencing parsers for each character in s. It may be helpful to remind that traverse_ is just a fold which uses (*>):
-- Slightly paraphrasing the definition in Data.Foldable:
traverse_ :: (Foldable t, Applicative f) => (a -> f b) -> t a -> f ()
traverse_ f = foldr (\x u -> f x *> u) (pure ())

Parsing to Free Monads

Say I have the following free monad:
data ExampleF a
= Foo Int a
| Bar String (Int -> a)
deriving Functor
type Example = Free ExampleF -- this is the free monad want to discuss
I know how I can work with this monad, eg. I could write some nice helpers:
foo :: Int -> Example ()
foo i = liftF $ Foo i ()
bar :: String -> Example Int
bar s = liftF $ Bar s id
So I can write programs in haskell like:
fooThenBar :: Example Int
fooThenBar =
do
foo 10
bar "nice"
I know how to print it, interpret it, etc. But what about parsing it?
Would it be possible to write a parser that could parse arbitrary
programs like:
foo 12
bar nice
foo 11
foo 42
So I can store them, serialize them, use them in cli programs etc.
The problem I keep running into is that the type of the program depends on which program is being parsed. If the program ends with a foo it's of
type Example () if it ends with a bar it's of type Example Int.
I do not feel like writing parsers for every possible permutation (it's simple here because there are only two possibilities, but imagine we add
Baz Int (String -> a), Doo (Int -> a), Moz Int a, Foz String a, .... This get's tedious and error-prone).
Perhaps I'm solving the wrong problem?
Boilerplate
To run the above examples, you need to add this to the beginning of the file:
{-# LANGUAGE DeriveFunctor #-}
import Control.Monad.Free
import Text.ParserCombinators.Parsec
Note: I put up a gist containing this code.
Not every Example value can be represented on the page without reimplementing some portion of Haskell. For example, return putStrLn has a type of Example (String -> IO ()), but I don't think it makes sense to attempt to parse that sort of Example value out of a file.
So let's restrict ourselves to parsing the examples you've given, which consist only of calls to foo and bar sequenced with >> (that is, no variable bindings and no arbitrary computations)*. The Backus-Naur form for our grammar looks approximately like this:
<program> ::= "" | <expr> "\n" <program>
<expr> ::= "foo " <integer> | "bar " <string>
It's straightforward enough to parse our two types of expression...
type Parser = Parsec String ()
int :: Parser Int
int = fmap read (many1 digit)
parseFoo :: Parser (Example ())
parseFoo = string "foo " *> fmap foo int
parseBar :: Parser (Example Int)
parseBar = string "bar " *> fmap bar (many1 alphaNum)
... but how can we give a type to the composition of these two parsers?
parseExpr :: Parser (Example ???)
parseExpr = parseFoo <|> parseBar
parseFoo and parseBar have different types, so we can't compose them with <|> :: Alternative f => f a -> f a -> f a. Moreover, there's no way to know ahead of time which type the program we're given will be: as you point out, the type of the parsed program depends on the value of the input string. "Types depending on values" is called dependent types; Haskell doesn't feature a proper dependent type system, but it comes close enough for us to have a stab at making this example work.
Let's start by forcing the expressions on either side of <|> to have the same type. This involves erasing Example's type parameter using existential quantification.†
data Ex a = forall i. Wrap (a i)
parseExpr :: Parser (Ex Example)
parseExpr = fmap Wrap parseFoo <|> fmap Wrap parseBar
This typechecks, but the parser now returns an Example containing a value of an unknown type. A value of unknown type is of course useless - but we do know something about Example's parameter: it must be either () or Int because those are the return types of parseFoo and parseBar. Programming is about getting knowledge out of your brain and onto the page, so we're going to wrap up the Example value with a bit of GADT evidence which, when unwrapped, will tell you whether a was Int or ().
data Ty a where
IntTy :: Ty Int
UnitTy :: Ty ()
data (a :*: b) i = a i :&: b i
type Sig a b = Ex (a :*: b)
pattern Sig x y = Wrap (x :&: y)
parseExpr :: Parser (Sig Ty Example)
parseExpr = fmap (\x -> Sig UnitTy x) parseFoo <|>
fmap (\x -> Sig IntTy x) parseBar
Ty is (something like) a runtime "singleton" representative of Example's type parameter. When you pattern match on IntTy, you learn that a ~ Int; when you pattern match on UnitTy you learn that a ~ (). (Information can be made to flow the other way, from types to values, using classes.) :*:, the functor product, pairs up two type constructors ensuring that their parameters are equal; thus, pattern matching on the Ty tells you about its accompanying Example.
Sig is therefore called a dependent pair or sigma type - the type of the second component of the pair depends on the value of the first. This is a common technique: when you erase a type parameter by existential quantification, it usually pays to make it recoverable by bundling up a runtime representative of that parameter.
Note that this use of Sig is equivalent to Either (Example Int) (Example ()) - a sigma type is a sum, after all - but this version scales better when you're summing over a large (or possibly infinite) set.
Now it's easy to build our expression parser into a program parser. We just have to repeatedly apply the expression parser, and then manipulate the dependent pairs in the list.
parseProgram :: Parser (Sig Ty Example)
parseProgram = fmap (foldr1 combine) $ parseExpr `sepBy1` (char '\n')
where combine (Sig _ val) (Sig ty acc) = Sig ty (val >> acc)
The code I've shown you is not exemplary. It doesn't separate the concerns of parsing and typechecking. In production code I would modularise this design by first parsing the data into an untyped syntax tree - a separate data type which doesn't enforce the typing invariant - then transform that into a typed version by type-checking it. The dependent pair technique would still be necessary to give a type to the output of the type-checker, but it wouldn't be tangled up in the parser.
*If binding is not a requirement, have you thought about using a free applicative to represent your data?
†Ex and :*: are reusable bits of machinery which I lifted from the Hasochism paper
So, I worry that this is the same sort of premature abstraction that you see in object-oriented languages, getting in the way of things. For example, I am not 100% sure that you are using the structure of the free monad -- your helpers for example simply seem to use id and () in a rather boring way, in fact I'm not sure if your Int -> x is ever anything other than either Pure :: Int -> Free ExampleF Int or const (something :: Free ExampleF Int).
The free monad for a functor F can basically be described as a tree whose data is stored in leaves and whose branching factor is controlled by the recursion in each constructor of the functor F. So for example Free Identity has no branching, hence only one leaf, and thus has the same structure as the monad:
data MonoidalFree m x = MF m x deriving (Functor)
instance Monoid m => Monad (MonoidalFree m) where
return x = MF mempty x
MF m x >>= my_x = case my_x x of MF n y -> MF (mappend m n) y
In fact Free Identity is isomorphic to MonoidalFree (Sum Integer), the difference is just that instead of MF (Sum 3) "Hello" you see Free . Identity . Free . Identity . Free . Identity $ Pure "Hello" as the means of tracking this integer. On the other hand if you have data E x = L x | R x deriving (Functor) then you get a sort of "path" of Ls and Rs before you hit this one leaf, Free E is going to be isomorphic to MonoidalFree [Bool].
The reason I'm going through this is that when you combine Free with an Integer -> x functor, you get an infinitely branching tree, and when I'm looking through your code to figure out how you're actually using this tree, all I see is that you use the id function with it. As far as I can tell, that restricts the recursion to either have the form Free (Bar "string" Pure) or else Free (Bar "string" (const subExpression)), in which case the system would seem to reduce completely to the MonoidalFree [Either Int String] monad.
(At this point I should pause to ask: Is that correct as far as you know? Was this what was intended?)
Anyway. Aside from my problems with your premature abstraction, the specific problem that you're citing with your monad (you can't tell the difference between () and Int has a bunch of really complicated solutions, but one really easy one. The really easy solution is to yield a value of type Example (Either () Int) and if you have a () you can fmap Left onto it and if you have an Int you can fmap Right onto it.
Without a much better understanding of how you're using this thing over TCP/IP we can't recommend a better structure for you than the generic free monads that you seem to be finding -- in particular we'd need to know how you're planning on using the infinite-branching of Int -> x options in practice.

Parsec parsing in Haskell

I have 2 parsers:
nexpr::Parser (Expr Double)
sexpr::Parser (Expr String)
How do I build a parser that tries one and then the other if it doesn't work? I can't figure out what to return. There must be a clever way to do this.
Thanks.
EDIT:
Adding a bit more info...
I'm learning Haskel, so I started with :
data Expr a where
N::Double -> Expr Double
S::String -> Expr String
Add::Expr Double -> Expr Double -> Expr Double
Cat::Expr String -> Expr String -> Expr String
then I read about F-algebra (here) and so I changed it to:
data ExprF :: (* -> *) -> * -> * where
N::Double -> ExprF r Double
S::String -> ExprF r String
Add::r Double -> r Double -> ExprF r Double
Cat::r String -> r String -> ExprF r String
with
type Expr = HFix ExprF
so my parse to:
Parser (Expr Double)
is actually:
Parser (ExprF HFix Double)
Maybe I'm biting off more than I can chew...
As noted in the comments, you can have a parser like this
nOrSexpr :: Parser (Either (Expr Double) (Expr String))
nOrSexpr = (Left <$> nexpr) <|> (Right <$> sexpr)
However, I think the reason that you are having this difficulty is because you are not representing your parse tree as a single type, which is the more usual thing to do. Something like this:
data Expr =
ExprDouble Double
| ExprInt Int
| ExprString String
That way you can have parsers for each kind of expression that are all of type Parser Expr. This is the same as using Either but more flexible and maintainable. So you might have
doubleParser :: Parser Expr
doubleParser = ...
intParser :: Parser Expr
intParser = ...
stringParser :: Parser Expr
stringParser = ...
exprParser :: Parser Expr
exprParser = intParser <|> doubleParser <|> stringParser
Note that the order of the parsers does matter and use can use Parsec's try function if backtracking is needed.
So, for example, if you want to have a sum expression now, you can add to the data type
data Expr =
ExprDouble Double
| ExprInt Int
| ExprString String
| ExprSum Expr Expr
and make the parser
sumParser :: Parser Expr
sumParser = do
a <- exprParser
string " + "
b <- exprParser
return $ ExprSum a b
UPDATE
Well, I take my hat off to you diving straight into GADTs if you are just starting with Haskell. I have been reading through the paper you linked and noticed this immediately in the first paragraph:
The jury is still out on whether the additional type-safety provided by GADTs is worth the added inconvenience of working with them.
There are three points worth taking away here I think. The first is simply that I would have a go with the simpler way of doing things first, to get an idea of how it works and why you might want to add more type safety, before trying to more complicated type theoretical stuff. That comment may not help so feel free to ignore it!
Secondly, and more importantly, your representation...
data ExprF :: (* -> *) -> * -> * where
N :: Double -> ExprF r Double
S :: String -> ExprF r String
Add :: r Double -> r Double -> ExprF r Double
Cat :: r String -> r String -> ExprF r String
...is specifically designed to not allow ill formed type expressions. Contrasted with mine which can, eg ExprSum (ExprDouble 5.0) (ExprString "test"). So the question you really want to ask is what should actually happen when the parser attempts to parse something like "5.0 + \"test\""? Do you want it to just not parse, or do you want it to return a nice message saying that this expression is the wrong type? Compilers are usually designed in multiple stages for this reason. The first pass turns the input into an abstract syntax tree (AST), and further passes annotate this tree with type judgements. This annotated AST can then be transformed into the semantic representation that you really want it in.
So in your case I would recommend two stages. first, parse into a dumb representation like mine, that will give you the correct tree shape but allow ill-typed expressions. Like
data ExprAST =
ExprASTDouble Double
| ExprASTInt Int
| ExprASTString String
| ExprASTAdd Expr Expr
Then have another function that will typecheck the ExprAST. Something like
typecheck :: ExprAST -> Maybe (ExprF HFix a)
(You could also use Either and return either the typechecked GADT or an error string saying what the problem is.) The further problem here is that you don't know what a is statically. The other answer solves this by using type tags and an existential wrapper, which you might find to be the best way to go. I feel like it might be simpler to have a top level expression in your GADT that all expressions must live in, so an entire parse will always have the same type. In the end there is usually only one program type.
My third, and last, point is related to this
The jury is still out on whether the additional type-safety provided by GADTs is worth the added inconvenience of working with them.
The more type safety, generally the more work you have to do to get it. You mention you are new to Haskell, yet this adventure has taken us right to the edge of what it is capable of doing. The type of the parsed expression cannot depend only on the input string in a Haskell function, because it does not allow for dependant types. If you want to go down this path, I might suggest you have a look at a language called Idris. A great introduction to what it is capable of can be found in this video, in which he constructs a typesafe printf.
The problem described looks to be using Parsec to parse into a GADT representation, for which probably the easiest solution would be parse into a monotype representation and then have a (likely partial) type checking phase to produce the well-typed GADT, if it can. The monotype representation could be an existential wrapper over a GADT term, with a type-tag to reify the GADT index.
EDIT: a quick example
Let's define a type for type-tags and an existential wrapper:
data Type :: * -> * where
TDouble :: Type Double
TString :: Type String
data Judgement f = forall ix. Judgement (f ix) (Type ix)
With the example GADT given in the original post, we only have a problem with the outer-most production, which we need to parse to a monotype as we don't know statically which expression type we will get at runtime:
pExpr :: Parser (Judgement Expr)
pExpr = Judgement <$> pDblExpr <*> pure TDouble
<|> Judgement <$> pStrExpr <*> pure TString
We can write a type check phase to produce a GADT or fail, depending on whether the type assertion succeeds or not:
typecheck :: Judgement Expr -> Type ix -> Maybe (Expr ix)
typecheck (Judgement e TDouble) TDouble = Just e
typecheck (Judgement e TString) TString = Just e
typecheck _ _ = Nothing

Type confusion in simple function

I've been away from Haskell for quite a while. I'm working through "Implementing Functional Languages A Tutorial" now to get back up to speed and learn more about what goes on under the hood. I can't seem to get how the argument "tokens" fits in here. The type signature clearly says this function takes a function and 2 parsers and returns a parser. Is "tokens" the result of the list comprehension being used in the list comprehension? Thanks.
pThen :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
pThen combine p1 p2 tokens =
[(combine v1 v2, tokens2) | (v1, tokens1) <- p1 tokens,
(v2, tokens2) <- p2 tokens1]
Edit: After reading the helpful answer below, I noticed a little earlier in the book this simpler example. It's even more obvious here, just in case this is helpful to someone else in the future.
pAlt :: Parser a -> Parser a -> Parser a
pAlt :: Parser a -> Parser a -> Parser a
pAlt p1 p2 toks = (p1 toks) ++ (p2 toks)
Parser is itself a type synonym for a function type, and tokens is the argument of the resulting Parser.

Resources