Abstract Syntax Tree representing a program in Haskell

I've been given the following Data types:
data Aexp = N Integer | V Var | Add Aexp Aexp | Mult Aexp Aexp | Sub Aexp Aexp
data Bexp = Bcon Bool | Eq Aexp Aexp | Le Aexp Aexp | Neg Bexp | And Bexp Bexp
data Stm = Ass Var Aexp | Skip | Comp Stm Stm | If Bexp Stm Stm | While Bexp Stm | Block DecV DecP Stm | Call Pname
These represent arithmetic expressions, Boolean expressions, and statements, respectively.
I have been asked to return a Stm representing the following program:
x:=5;
y:=1;
while ¬(x=1) do
y:=y*x;
x:=x-1
So far I have the following:
p :: Stm
p = (Ass x 5) (Ass y 1) (While (Neg (Eq x 1)) Ass x (Sub x 1) Ass y (Mult y x))
I did this simply by writing out the program on paper and then drawing a syntax tree from it. Firstly, I'm not sure if this is actually correct, as I don't really know how to prove it. And secondly, when I compile I get lots of errors like:
denotational.hs:122:10: Not in scope: `x'
denotational.hs:122:20: Not in scope: `y'
denotational.hs:122:41: Not in scope: `x'
denotational.hs:122:51: Not in scope: `x'
denotational.hs:122:58: Not in scope: `x'
denotational.hs:122:67: Not in scope: `y'
denotational.hs:122:75: Not in scope: `y'
denotational.hs:122:77: Not in scope: `x'
Any help with this would be greatly appreciated. Thanks

Your AST is a representation of the syntax of the (sub-)language. Everything that is in that sublanguage must be represented in your AST, and nothing from Haskell should leak into it.
So, assuming your language is fully represented by a list of Stms, you might write
x:=5;
as
fragx :: [Stm]
fragx = [ Ass "x" (N 5) ]
which captures, in the tree structure of your AST, all of the information in that fragment, from the sort of literal to the name of the variable being assigned. A larger example
x:=5;
y:=1;
is
frag2 :: [Stm]
frag2 = [ Ass "x" (N 5)
, Ass "y" (N 1)
]
fragy :: [Stm]
fragy = [ Ass "y" (N 1) ]
with
frag2 == fragx ++ fragy
which demonstrates how we can talk about appending two programs syntactically as the Haskell function (++) :: [Stm] -> [Stm] -> [Stm]. Again, these fragments are simple static representations of the syntax of the sublanguage. There is nothing tricky going on in the Haskell side, in particular, no bleeding of Haskell variables into sublanguage variables.
The final fragment, which adds the loop, might look like
fragAll :: [Stm]
fragAll = fragx ++ fragy
          ++ [ While (Neg (Eq (V "x") (N 1))) (Block ...)
             ]
but it cannot be completed without looking into the structure of the arguments to Block.
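For completeness, here is a sketch of the single Stm the original question asks for, sequencing with Comp instead of a Haskell list, and assuming Var is a synonym for String (which is what the "Not in scope" errors point to: the variables must be string literals, not Haskell identifiers). No Block or Call is needed, since this particular program introduces no declarations:
p :: Stm
p = Comp (Ass "x" (N 5))
         (Comp (Ass "y" (N 1))
               (While (Neg (Eq (V "x") (N 1)))
                      (Comp (Ass "y" (Mult (V "y") (V "x")))
                            (Ass "x" (Sub (V "x") (N 1))))))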

Related

Assigning Integers to Variables in Haskell

I'm sorry that this is a beginner question. I'm just trying to set x = 5, y = 2, and all other variables to zero in a function s. All this work is to verify that p computes 5! in the code below, where p represents
y:=1 ; while ¬(x=1) do (y:=y*x; x:=x-1);
type Num = Integer
type Var = String
type Z = Integer
type T = Bool
type State = Var -> Z
data Aexp = N Num | V Var | Add Aexp Aexp | Mult Aexp Aexp | Sub Aexp Aexp deriving (Show, Eq, Read)
data Bexp = TRUE | FALSE | Eq Aexp Aexp | Le Aexp Aexp | Neg Bexp | And Bexp Bexp deriving (Show, Eq, Read)
data Stm = Ass Var Aexp | Skip | Comp Stm Stm | If Bexp Stm Stm | While Bexp Stm deriving (Show, Eq, Read)
p::Stm
p = (Comp(Ass "y" (N 1))(While(Neg(Eq (V "x") (N 1)))(Comp (Ass "y" (Mult (V "y") (V "x")))(Ass "x" (Sub (V "x") (N 1))))))
s :: State {- It has to be Var -> Int-}
s x = 5
s y = 2
And when I try to compile this, GHC warns that the pattern matches are overlapped. I know this is a quite simple question, but there is not much information online for me to solve this. Could you give me any hints? Thanks!
You are pattern matching. This is typically done using different values of the argument type.
didSayHello :: String -> Bool
didSayHello "hello" = True
didSayHello x = False
This matches top-down, reading "if the string argument is 'hello', then True" and "if it is any other String argument, then False".
Your matches are overlapping because both of your patterns are variable patterns, which match any String whatsoever; one just happens to be called x and the other y.
See this link for more details
The two alternatives are doing exactly the same thing: matching on any value passed in and binding it to a name (x in one clause, y in the other). Since the first case matches anything, it always succeeds and returns 5.
To do what you seem to be trying to do, you cannot use pattern matching, you need to use equality (via the Eq class):
s :: Var -> Int
s v | v == x = 5
    | v == y = 3
You cannot pattern match against other values, only against patterns. x and y are values when you define them using let, but on the left-hand side of a function definition x and y are patterns (matching absolutely any input value).
You mentioned that Var is a String, so this won't work if x and y are defined by let x = 5 and let y = 2: there is no Num String instance, so x and y won't be Strings and can't be compared for equality with a Var.
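What does work is matching on the string literals themselves, since literals are patterns. A minimal sketch of what the question seems to be after (x to 5, y to 2, all other variables zero), assuming Var = String as declared:
s :: State
s "x" = 5  -- matches only the variable named "x"
s "y" = 2  -- matches only the variable named "y"
s _   = 0  -- every other variable is zero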
It's pretty unclear what you are trying to do, but know that assignment does not really exist in Haskell. Perhaps you want to look at
s string =
  let
    x = 5
    y = 2
  in
    -- something using x and y, e.g.
    x + y
In this case s will return 7 no matter what string you pass it, e.g.:
s "alpha" => 7
s "beta" => 7

LaTeX natural deduction proofs using Haskell

How can one create LaTeX source for natural deduction proof trees (like those shown here) via Haskell eg using HaTeX? I'd like to emulate LaTeX .stys like bussproofs.sty or proof.sty.
I'm using your question as an excuse to improve and demo a Haskell
call-tracing library I'm working
on. In the context of
tracing, an obvious way to create a proof tree is to trace a type
checker and then format the trace as a natural-deduction proof. To
keep things simple my example logic is the simply-typed lambda
calculus
(STLC),
which corresponds to the implicational fragment of propositional
intuitionistic
logic.
I am using proof.sty, but not via HaTeX or any other Haskell
Latex library. The Latex for proof trees is very simple and using a
Haskell Latex library would just complicate things.
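For instance, a single rule application in proof.sty is just an \infer whose first braced argument is the conclusion and whose second is an &-separated list of premises; nesting \infers yields trees. A hand-written sketch mirroring what the generator below emits:
\infer[ \to \text{E} ]{ \Gamma \vdash e \, e_1 : T_2 }
      { \Gamma \vdash e : T_1 {\to} T_2 & \Gamma \vdash e_1 : T_1 }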
I've written the proof-tree generation code twice:
in a self-contained way, by writing a type checker that also
returns a proof tree;
using my tracing library, by call-tracing a type checker and then
post processing the trace into a proof tree.
Since you didn't ask about call-tracing libraries, you may be less
interested in the call-trace based version, but I think it's
interesting to compare both versions.
Examples
Let's start with some output examples first, to see what all this gets us.
The first three examples are motivated
by an axiom system for implicational propositional
calculus;
the first two also happen to correspond to S and
K:
The first axiom, K, with proof terms; the second axiom, S, with proof
terms, but with premises in the context, not lambda bound; and the
fourth axiom, modus ponens, without proof terms. (The rendered
proof-tree images are omitted here.)
The third axiom in that Wikipedia article (Peirce's law) is
non-constructive and so we can't prove it here.
For a different kind of example, here's a failed type check of the Y
combinator (image likewise omitted):
The arrows are meant to lead you to the error, which is marked with a
bang (!).
Code
Now I'll describe the code which generated those examples. The code is
from this
file
unless otherwise noted. I'm not including every line of code here;
see that link if you want something you can actually build using GHC
7.6.3.
Most of the code -- the grammar, parser, and pretty printer -- is the
same for both versions; only the type checkers and proof-tree
generators differ. All of the common code is in the file just
referenced.
STLC grammar
The STLC grammar in ASCII:
-- Terms
e ::= x | \x:T.e | e e
-- Types
T ::= A | T -> T
-- Contexts
C ::= . | C,x:T
And the corresponding Haskell:
type TmVar = String
type TyVar = String
data Tm = Lam TmVar Ty Tm
| TmVar TmVar
| Tm :#: Tm
deriving Show
data Ty = TyVar TyVar
| Ty :->: Ty
deriving (Eq , Show)
type Ctx = [(TmVar,Ty)]
Type checking + proof tree generation
Both versions implement the same abstract STLC type checker. In ASCII:
(x:T) in C
---------- Axiom
C |- x : T
C,x:T1 |- e : T2
--------------------- -> Introduction
C |- \x:T1.e : T1->T2
C |- e : T1 -> T2 C |- e1 : T1
--------------------------------- -> Elimination
C |- e e1 : T2
Version 1: self-contained with inline proof-tree generation
The full code for this version is
here.
The proof-tree generation happens in the type checker, but the actual
proof-tree generation code is factored out into addProof and
conclusion.
Type checking
-- The mode is 'True' if proof terms should be included.
data R = R { _ctx :: Ctx , _mode :: Bool }
type M a = Reader R a
extendCtx :: TmVar -> Ty -> M a -> M a
extendCtx x t = local extend where
  extend r = r { _ctx = _ctx r ++ [(x,t)] }
-- These take the place of the inferred type when there is a type
-- error.
here , there :: String
here = "\\,!"
there = "\\,\\uparrow"
-- Return the inferred type---or error string if type inference
-- fails---and the latex proof-tree presentation of the inference.
--
-- This produces different output than 'infer' in the error case: here
-- all premises are always computed, whereas 'infer' stops at the
-- first failing premise.
inferProof :: Tm -> M (Either String Ty , String)
inferProof tm@(Lam x t e) = do
  (et' , p) <- extendCtx x t . inferProof $ e
  let et'' = (t :->:) <$> et'
  addProof et'' [p] tm
inferProof tm@(TmVar x) = do
  mt <- lookup x <$> asks _ctx
  let et = maybe (Left here) Right mt
  addProof et [] tm
inferProof tm@(e :#: e1) = do
  (et , p) <- inferProof e
  (et1 , p1) <- inferProof e1
  case (et , et1) of
    (Right t , Right t1) ->
      case t of
        t1' :->: t2 | t1' == t1 -> addProof (Right t2) [p , p1] tm
        _ -> addProof (Left here) [p , p1] tm
    _ -> addProof (Left there) [p , p1] tm
Proof tree generation
The addProof function corresponds to proofTree in the other version:
-- Given the inferred type, the proof-trees for all premise inferences
-- (subcalls), and the input term, annotate the inferred type with a
-- result proof tree.
addProof :: Either String Ty -> [String] -> Tm -> M (Either String Ty , String)
addProof et premises tm = do
  R { _mode , _ctx } <- ask
  let (judgment , rule) = conclusion _mode _ctx tm et
  let tex = "\\infer[ " ++ rule ++ " ]{ " ++
            judgment ++ " }{ " ++
            intercalate " & " premises ++ " }"
  return (et , tex)
The code for conclusion is common to both versions:
conclusion :: Mode -> Ctx -> Tm -> Either String Ty -> (String , String)
conclusion mode ctx tm e = (judgment mode , rule tm)
  where
    rule (TmVar _) = "\\textsc{Axiom}"
    rule (Lam {}) = "\\to \\text{I}"
    rule (_ :#: _) = "\\to \\text{E}"
    tyOrError = either id pp e
    judgment True = pp ctx ++ " \\vdash " ++ pp tm ++ " : " ++ tyOrError
    judgment False = ppCtxOnlyTypes ctx ++ " \\vdash " ++ tyOrError
Version 2: via call-tracing, with proof-tree generation as post processing
Here the type checker is not even aware of proof-tree generation, and
adding call-tracing is just one line.
Type checking
type Mode = Bool
type Stream = LogStream (ProofTree Mode)
type M a = ErrorT String (ReaderT Ctx (Writer Stream)) a
type InferTy = Tm -> M Ty
infer , infer' :: InferTy
infer = simpleLogger (Proxy::Proxy "infer") ask (return ()) infer'
infer' (TmVar x) = maybe err pure . lookup x =<< ask where
  err = throwError $ "Variable " ++ x ++ " not in context!"
infer' (Lam x t e) = (t :->:) <$> (local (++ [(x,t)]) . infer $ e)
infer' (e :#: e1) = do
  t <- infer e
  t1 <- infer e1
  case t of
    t1' :->: t2 | t1' == t1 -> pure t2
    _ -> throwError $ "Can't apply " ++ show t ++ " to " ++ show t1 ++ "!"
The LogStream
type
and ProofTree
class
are from the library. The LogStream is the type of log events that
the "magic"
simpleLogger
logs. Note the line
infer = simpleLogger (Proxy::Proxy "infer") ask (return ()) infer'
which defines infer to be a logged version of infer', the actual
type checker. That's all you have to do to trace a monadic function!
I won't get into how simpleLogger actually works here, but the
result is that each call to infer gets logged, including the
context, arguments, and return value, and these data get grouped
together with all logged subcalls (here only to infer). It would be
easy to manually write such logging code for infer, but it's nice
that with the library you don't have to.
Proof-tree generation
To generate the Latex proof trees, we implement ProofTree to post
process infer's call trace. The library provides a proofTree
function that calls the ProofTree methods and assembles the proof
trees; we just need to specify how the conclusions of the typing
judgments will be formatted:
instance ProofTree Mode (Proxy (SimpleCall "infer" Ctx InferTy ())) where
  callAndReturn mode t = conclusion mode ctx tm (Right ty)
    where
      (tm , ()) = _arg t
      ty = _ret t
      ctx = _before t
  callAndError mode t = conclusion mode ctx tm (Left error)
    where
      (tm , ()) = _arg' t
      how = _how t
      ctx = _before' t
      error = maybe "\\,!" (const "\\,\\uparrow") how
The pp calls are to a user-defined pretty printer; obviously, the
library can't know how to pretty print your data types.
Because calls can be erroneous -- the library detects errors
-- we have to say how to format
successful and failing calls. Refer to the Y-combinator example above
for an example of a failing type check, corresponding to the
callAndError case here.
The library's proofTree
function
is quite simple: it builds a proof.sty proof tree with the current
call as conclusion, and the subcalls as premises:
proofTree :: mode -> Ex2T (LogTree (ProofTree mode)) -> String
proofTree mode (Ex2T t@(CallAndReturn {})) =
  "\\infer[ " ++ rule ++ " ]{ " ++ conclusion ++ " }{ " ++ intercalate " & " premises ++ " }"
  where
    (conclusion , rule) = callAndReturn mode t
    premises = map (proofTree mode) (_children t)
proofTree mode (Ex2T t@(CallAndError {})) =
  "\\infer[ " ++ rule ++ " ]{ " ++ conclusion ++ " }{ " ++ intercalate " & " premises ++ " }"
  where
    (conclusion , rule) = callAndError mode t
    premises = map (proofTree mode)
               (_children' t ++ maybe [] (:[]) (_how t))
I use proof.sty in the library because it allows arbitrarily many
premises, although bussproofs.sty would work for this STLC example
since no rule has more than five premises (the limit for
bussproofs). Both Latex packages are described
here.
Pretty printing
Now we return to code that is common between both versions.
The pretty printer that defines the pp used above is rather long --
it handles precedence and associativity, and is written in a way that
should be extensible if more terms, e.g. products, are added to the
calculus -- but mostly straightforward. First, we set up a precedence
table and a precedence-and-associativity-aware parenthesizer:
-- Precedence: higher value means tighter binding.
type Prec = Double
between :: Prec -> Prec -> Prec
between x y = (x + y) / 2
lowest , highest , precLam , precApp , precArr :: Prec
highest = 1
lowest = 0
precLam = lowest
precApp = between precLam highest
precArr = lowest
-- Associativity: left, none, or right.
data Assoc = L | N | R deriving Eq
-- Wrap a pretty print when the context dictates.
wrap :: Pretty a => Assoc -> a -> a -> String
wrap side ctx x = if prec x `comp` prec ctx
                  then pp x
                  else parens . pp $ x
  where
    comp = if side == assoc x || assoc x == N
           then (>=)
           else (>)
    parens s = "(" ++ s ++ ")"
And then we define the individual pretty printers:
class Pretty t where
  pp :: t -> String
  prec :: t -> Prec
  prec _ = highest
  assoc :: t -> Assoc
  assoc _ = N
instance Pretty Ty where
  pp (TyVar v) = v
  pp t@(t1 :->: t2) = wrap L t t1 ++ " {\\to} " ++ wrap R t t2
  prec (_ :->: _) = precArr
  prec _ = highest
  assoc (_ :->: _) = R
  assoc _ = N
instance Pretty Tm where
  pp (TmVar v) = v
  pp (Lam x t e) = "\\lambda " ++ x ++ " {:} " ++ pp t ++ " . " ++ pp e
  pp e@(e1 :#: e2) = wrap L e e1 ++ " " ++ wrap R e e2
  prec (Lam {}) = precLam
  prec (_ :#: _) = precApp
  prec _ = highest
  assoc (_ :#: _) = L
  assoc _ = N
instance Pretty Ctx where
  pp [] = "\\cdot"
  pp ctx@(_:_) =
    intercalate " , " [ x ++ " {:} " ++ pp t | (x,t) <- ctx ]
By adding a "mode" argument, it would be easy to use the same pretty
printer to print plain ASCII, which would be useful with other
call-trace post processors, such as the (unfinished) UnixTree
processor.
Parsing
A parser is not essential to the example, but of course I did not enter the
example input terms directly as Haskell Tms.
Recall the STLC grammar in ASCII:
-- Terms
e ::= x | \x:T.e | e e
-- Types
T ::= A | T -> T
-- Contexts
C ::= . | C,x:T
This grammar is ambiguous: both the term application e e
and function type T -> T have no associativity given by the
grammar. But in STLC term application is left associative and function
types are right associative, and so the corresponding disambiguated
grammar we actually parse is
-- Terms
e ::= e' | \x:T.e | e e'
e' ::= x | ( e )
-- Types
T ::= T' | T' -> T
T' ::= A | ( T )
-- Contexts
C ::= . | C,x:T
The parser is maybe too simple -- I'm not using a languageDef and
it's whitespace sensitive -- but it gets the job done:
type P a = Parsec String () a
parens :: P a -> P a
parens = Text.Parsec.between (char '(') (char ')')
tmVar , tyVar :: P String
tmVar = (:[]) <$> lower
tyVar = (:[]) <$> upper
tyAtom , arrs , ty :: P Ty
tyAtom = parens ty
     <|> TyVar <$> tyVar
arrs = chainr1 tyAtom arrOp where
  arrOp = string "->" *> pure (:->:)
ty = arrs
tmAtom , apps , lam , tm :: P Tm
tmAtom = parens tm
     <|> TmVar <$> tmVar
apps = chainl1 tmAtom appOp where
  appOp = pure (:#:)
lam = uncurry Lam <$> (char '\\' *> typing)
                  <*> (char '.' *> tm)
tm = apps <|> lam
typing :: P (TmVar , Ty)
typing = (,) <$> tmVar
             <*> (char ':' *> ty)
ctx :: P Ctx
ctx = typing `sepBy` (char ',')
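As a quick sanity check (my example, not from the original post), the parsers can be exercised with Parsec's parseTest; given the deriving Show on Tm:
-- in GHCi:
parseTest tm "\\x:P.x"
-- prints: Lam "x" (TyVar "P") (TmVar "x")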
To clarify what the input terms look like, here are the examples from
the Makefile:
# OUTFILE CONTEXT TERM
./tm2latex.sh S.ctx 'x:P->Q->R,y:P->Q,z:P' 'xz(yz)'
./tm2latex.sh S.lam '' '\x:P->Q->R.\y:P->Q.\z:P.xz(yz)'
./tm2latex.sh S.err '' '\x:P->Q->R.\y:P->Q.\z:P.xzyz'
./tm2latex.sh K.ctx 'x:P,y:Q' 'x'
./tm2latex.sh K.lam '' '\x:P.\y:Q.x'
./tm2latex.sh I.ctx 'x:P' 'x'
./tm2latex.sh I.lam '' '\x:P.x'
./tm2latex.sh MP.ctx 'x:P,y:P->Q' 'yx'
./tm2latex.sh MP.lam '' '\x:P.\y:P->Q.yx'
./tm2latex.sh ZERO '' '\s:A->A.\z:A.z'
./tm2latex.sh SUCC '' '\n:(A->A)->(A->A).\s:A->A.\z:A.s(nsz)'
./tm2latex.sh ADD '' '\m:(A->A)->(A->A).\n:(A->A)->(A->A).\s:A->A.\z:A.ms(nsz)'
./tm2latex.sh MULT '' '\m:(A->A)->(A->A).\n:(A->A)->(A->A).\s:A->A.\z:A.m(ns)z'
./tm2latex.sh Y.err '' '\f:A->A.(\x:A.f(xx))(\x:A.f(xx))'
./tm2latex.sh Y.ctx 'a:A->(A->A),y:(A->A)->A' '\f:A->A.(\x:A.f(axx))(y(\x:A.f(axx)))'
Latex document generation
The ./tm2latex.sh script just calls pdflatex on the output of the
Haskell programs described above. The Haskell programs produce the proof
tree and then wrap it in a minimal Latex document:
unlines
[ "\\documentclass[10pt]{article}"
, "\\usepackage{proof}"
, "\\usepackage{amsmath}"
, "\\usepackage[landscape]{geometry}"
, "\\usepackage[cm]{fullpage}"
-- The most slender font I could find:
-- http://www.tug.dk/FontCatalogue/iwonalc/
, "\\usepackage[light,condensed,math]{iwona}"
, "\\usepackage[T1]{fontenc}"
, "\\begin{document}"
, "\\tiny"
, "\\[" ++ tex ++ "\\]"
, "\\end{document}"
]
As you can see, most of the Latex is devoted to making the proof trees
as small as possible; I plan to also write an ASCII proof tree post
processor, which may be more useful in practice when the examples are
larger.
Conclusion
As always, it takes a bit of code to write a parser, type checker, and
pretty printer. On top of that, adding proof-tree generation is
pretty simple in both versions. This is a fun toy example, but I
expect to do something similar in the context of a "real"
unification-based type checker for a dependently-typed language; there
I expect call tracing and proof-tree generation (in ASCII) to provide
significant help in debugging the type checker.

How to make the Parsec chainl1 function follow operator precedence rules

I'm programming a converter from standard math notation to the POSIX dc format. It takes the input string, parses it into an intermediary data type, and then turns it into the output string by showing it.
This is the Data type used. I have no problems with the Data type -> Output String conversion, it works flawlessly:
data Expression = Expression :+ Expression
                | Expression :- Expression
                | Expression :* Expression
                | Expression :/ Expression
                | Expression :^ Expression
                | Cons String
infixr 0 :+
infixr 0 :-
infixr 1 :*
infixr 1 :/
infixr 2 :^
instance Show Expression where
  show (x :+ y) = unwords [show x, show y, "+"]
  show (x :- y) = unwords [show x, show y, "-"]
  show (x :* y) = unwords [show x, show y, "*"]
  show (x :/ y) = unwords [show x, show y, "/"]
  show (x :^ y) = unwords [show x, show y, "^"]
  show (Cons y) = y
The Parsec parser part, however, refuses to respect the defined operator precedence rules, clearly because of the way chainl1 is used in the subexpression parser definition:
expression :: Parser Expression
expression = do
  spaces
  x <- subexpression
  spaces >> eof >> return x
subexpression :: Parser Expression
subexpression = (
    (bracketed subexpression) <|>
    constant
  ) `chainl1` (
    try addition <|>
    try substraction <|>
    try multiplication <|>
    try division <|>
    try exponentiation
  )
addition = operator '+' (:+)
substraction = operator '-' (:-)
multiplication = operator '*' (:*)
division = operator '/' (:/)
exponentiation = operator '^' (:^)
operator :: Char -> (a -> a -> a) -> Parser (a -> a -> a)
operator c op = do
  spaces >> char c >> spaces
  return op
bracketed :: Parser a -> Parser a
bracketed parser = do
  char '('
  x <- parser
  char ')'
  return x
constant :: Parser Expression
constant = do
  parity <- optionMaybe $ oneOf "-+"
  constant <- many1 (digit <|> char '.')
  return (if parity == Just '-'
            then (Cons $ '_':constant)
            else Cons constant)
Is there a way of making the parser take into account the operator precedence rules without having to rewrite the entirety of my code?
Well, you don't need to rewrite your entire code, but since your subexpression parser doesn't take precedence into account at all, you have to rewrite that - substantially.
One possibility is to build it in layers, from parsers for subexpressions whose top-level operators all have the same precedence,
atom :: Parser Expression
atom = bracketed subexpression <|> constant
-- highest precedence operator is exponentiation, usually that's
-- right-associative, hence I use chainr1 here
powers :: Parser Expression
powers = atom `chainr1` try exponentiation
-- a multiplicative expression is a product or quotient of powers,
-- left-associative
multis :: Parser Expression
multis = powers `chainl1` (try multiplication <|> try division)
-- a subexpression is a sum (or difference) of multiplicative expressions
subexpression :: Parser Expression
subexpression = multis `chainl1` (try addition <|> try substraction)
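As a quick check (my example) that precedence now comes out right, using the Show instance from the question to render the postfix output:
-- parseTest expression "1+2*3"   prints: 1 2 3 * +
-- parseTest expression "(1+2)*3" prints: 1 2 + 3 *
The original single-level chainl1 version would instead have parsed 1+2*3 as (1+2)*3.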
Another option is to let the precedence and associativities be taken care of by the library and use Text.Parsec.Expr, namely buildExpressionParser:
table = [ [binary "^" (:^) AssocRight]
, [binary "*" (:*) AssocLeft, binary "/" (:/) AssocLeft]
, [binary "+" (:+) AssocLeft, binary "-" (:-) AssocLeft]
]
binary name fun assoc = Infix (do{ string name; spaces; return fun }) assoc
subexpression = buildExpressionParser table atom
(which requires importing Text.Parsec.Expr, and that the bracketed parser and constant consume the spaces after the tokens they use).

Splicing arbitrary expressions in a Haskell quasiquoter

Reading through Why It’s Nice to be Quoted, in section 3 there's an example of splicing a variable identifier in a quasiquote.
subst [:lam | $exp:e1 $exp:e2 |] x y =
  let e1' = subst e1 x y
      e2' = subst e2 x y
  in
    [:lam | $exp:e1' $exp:e2' |]
I see why the recursive calls to subst are done outside the [:lam| ... |]: it's because the function antiVarE in section 3.2 builds a TH.varE out of the variable name.
My question is how much work would be required to support arbitrary expression splices beyond just a variable name?
For example:
subst [:lam | $exp:e1 $exp:e2 |] x y =
  [:lam | $exp:(subst e1 x y) $exp:(subst e2 x y) |]
Answering my own question for posterity.
Turns out it was quite simple. Using the parseExp function from the haskell-src-meta package, I was able to easily convert a string to an AST fragment.
In the original paper, aside from the parser changes required to capture an expression string between the parentheses, the antiExpE function could be rewritten as follows.
antiExpE :: Exp -> Maybe TH.ExpQ
antiExpE (AE v) =
  case parseExp v of
    Right exp -> Just . return $ exp
    Left _ -> Nothing
antiExpE _ = Nothing
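For reference: parseExp here is Language.Haskell.Meta.Parse.parseExp :: String -> Either String Exp from the haskell-src-meta package, so the snippet assumes imports roughly like
import Language.Haskell.Meta.Parse (parseExp)
import qualified Language.Haskell.TH as TH
where the Exp being destructured (the AE constructor) is the quasiquoted language's own expression type from the paper, not Template Haskell's.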

Can parser combinators be made efficient?

Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec vs a simple hand-rolled precedence-climbing parser written in F#, and was surprised to find the F# version to be 25× faster than the Haskell.
Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:
import Control.Applicative
import Text.Parsec hiding ((<|>))
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact = read <$> many1 digit <|> char '(' *> expr <* char ')'
eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')
main :: IO ()
main = do
file <- readFile "expr"
putStr $ show $ eval file
putStr "\n"
and here's my self-contained precedence climbing parser in F#:
let rec (|Expr|) = function
  | P(f, xs) -> Expr(loop (' ', f, xs))
  | xs -> invalidArg "Expr" (sprintf "%A" xs)
and loop = function
  | ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
  | (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
      let h, xs = loop (op, g, xs)
      match op with
      | '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
      |> fun op -> loop (oop, op f h, xs)
  | _, f, xs -> f, xs
and (|P|_|) = function
  | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
  | c::_ as xs when '0' <= c && c <= '9' ->
      let rec loop n = function
        | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
        | xs -> Some(P(n, xs))
      loop 0 xs
  | _ -> None
My impression is that even state-of-the-art parser combinators waste a lot of time backtracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance, or is it necessary to use code generation?
EDIT:
Here's the OCaml script I used to generate a ~2Mb expression for benchmarking:
open Printf
let rec f ff n =
if n=0 then fprintf ff "1" else
fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)
let () =
let n = try int_of_string Sys.argv.(1) with _ -> 3 in
fprintf stdout "%a\n" f n
I've come up with a Haskell solution that is 30× faster than the Haskell solution you posted (with my concocted test expression).
Major changes:
Change Parsec/String to Attoparsec/ByteString
In the fact function, change read & many1 digit to decimal
Made the chainl1 recursion strict (remove $! for the lazier version).
I tried to keep everything else you had as similar as possible.
import Control.Applicative
import Data.Attoparsec
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
expr :: Parser Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term :: Parser Int
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact :: Parser Int
fact = decimal <|> char '(' *> expr <* char ')'
eval :: B.ByteString -> Int
eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')
chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
chainl1 p op = p >>= rest where
  rest x = (do f <- op
               y <- p
               rest $! (f x y))
           <|> pure x
main :: IO ()
main = B.readFile "expr" >>= (print . eval)
I guess what I concluded from this is that the majority of the slowdown for the parser combinator was that it was sitting on an inefficient base, not that it was a parser combinator, per se.
I imagine with more time and profiling this could go faster, as I stopped when I went past the 25× mark.
I don't know if this would be faster than the precedence climbing parser ported to Haskell. Maybe that would be an interesting test?
I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.
[Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]
I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).
I measured the following numbers (updated numbers for v. 0.9 in parens):
Jon's hand-rolled parser: ~230ms
FParsec parser #1: ~270ms (~235ms)
FParsec parser #2: ~110ms (~102ms)
In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec
automatically generates highly readable error messages,
supports very large files as input (with arbitrary backtracking), and
comes with a declarative, runtime-configurable operator-precedence parser module.
Here's the code for the two FParsec implementations:
Parser #1 (Translation of djahandarie's parser):
open FParsec
let str s = pstring s
let expr, exprRef = createParserForwardedToRef()
let fact = pint32 <|> between (str "(") (str ")") expr
let term = chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))
let parse str = run expr str
Parser #2 (Idiomatic FParsec implementation):
open FParsec
let opp = new OperatorPrecedenceParser<_,_,_>()
type Assoc = Associativity
let str s = pstring s
let noWS = preturn () // dummy whitespace parser
opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))
let expr = opp.ExpressionParser
let term = pint32 <|> between (str "(") (str ")") expr
opp.TermParser <- term
let parse str = run expr str
In a nutshell, parser combinators are slow for lexing.
There was a Haskell combinator library for building lexers (see "Lazy Lexing is Fast" Manuel M. T. Chakravarty) - as the tables were generated at runtime, there wasn't the hassle of code generation. The library got used a bit - it was initially used in one of the FFI preprocessors, but I don't think it ever got uploaded to Hackage, so maybe it was a little too inconvenient for regular use.
In the OCaml code above, the parser is directly matching on char-lists, so it can be as fast as list destructuring is in the host language (it would be much faster than Parsec if it were re-implemented in Haskell). Christian Lindig had an OCaml library that had a set of parser combinators and a set of lexer combinators - the lexer combinators were certainly much simpler than Manuel Chakravarty's, and it might be worthwhile tracking down this library and benchmarking it before writing a lexer generator.
Have you tried one of the known fast parser libraries? Parsec's aims have never really been speed, but ease of use and clarity. Comparing to something like attoparsec may be a more fair comparison, especially because the string types are likely to be more equal (ByteString instead of String).
I also wonder which compile flags were used. This being another trolling post by the infamous Jon Harrop, it would not surprise me if no optimisations were used at all for the Haskell code.
