I'm writing a arithmetic parser to treat expressions like "1+2-3". I use this blog post as reference. To treat left associativity and precedence, I write a parser with Parsec according to this BNF (from blog post).
<exp> ::= <term> { ("+" | "-") <term> }
<term> ::= <factor> { ("*" | "/") <factor> }
<factor> ::= "(" <exp> ")" | <unary_op> <factor> | <int>
This is my parser code.
parseExp :: Parser Exp
parseExp = do
t1 <- parseTerm
loop t1
where termSuffix t1 = do
op <- lexeme $ oneOf "+-"
t2 <- parseTerm
case op of
'+' -> termSuffix (Binary Plus t1 t2)
'-' -> termSuffix (Binary Minus t1 t2)
loop t = termSuffix t <|> return t
parseTerm :: Parser Exp
parseTerm = do
f1 <- parseFactor
loop f1
where factorSuffix f1 = do
op <- lexeme $ oneOf "*/"
f2 <- parseFactor
case op of
'*' -> factorSuffix (Binary Mul f1 f2)
'/' -> factorSuffix (Binary Div f1 f2)
loop t = factorSuffix t <|> return t
parseFactor :: Parser Exp
parseFactor = parseConst <|> parseParen <|> parseUnary
parseParen = do
void $ lexeme $ char '('
e <- parseExp
void $ lexeme $ char ')'
return e
parseUnary :: Parser Exp
parseUnary = do
op <- lexeme $ oneOf "!~-"
f <- parseFactor
case op of
'!' -> return $ Unary LogNeg f
'~' -> return $ Unary BitCompl f
'-' -> return $ Unary ArithNeg f
parseConst :: Parser Exp
parseConst = do
i <- many1 digit
return (Const $ read i)
I also used this tutorial code as reference. tutorial
simpleExpr7 :: Parser SimpleExpr
simpleExpr7 = do
-- first parse a term
e <- term7
-- then see if it is followed by an '+ expr' suffix
maybeAddSuffix e
where
-- this function takes an expression, and parses a
-- '+ expr' suffix, returning an Add expression
-- it recursively calls itself via the maybeAddSuffix function
addSuffix e0 = do
void $ lexeme $ char '+'
e1 <- term7
maybeAddSuffix (Add e0 e1)
-- this is the wrapper for addSuffix, which adapts it so that if
-- addSuffix fails, it returns just the original expression
maybeAddSuffix e = addSuffix e <|> return e
My code doesn't work. This code works like this.
*Main CodeGen Parser> parseWithEof parseExp "-2"
Right (Unary ArithNeg (Const 2))
*Main CodeGen Parser> parseWithEof parseExp "(2)"
Right (Const 2)
*Main CodeGen Parser> parseWithEof parseExp "-!(((2)))"
Right (Unary ArithNeg (Unary LogNeg (Const 2)))
*Main CodeGen Parser> parseWithEof parseExp "1+2"
Left (line 1, column 4):
unexpected end of input
expecting digit
*Main CodeGen Parser> parseWithEof parseExp "1+2+3"
Left (line 1, column 6):
unexpected end of input
expecting digit
*Main CodeGen Parser> parseWithEof parseExp "1+2*3"
Left (line 1, column 6):
unexpected end of input
expecting digit
I can't understand why this results unexpected end of input.
Consider parsing 1+2. In parseExp this parses 1 into t1 = Const 1 and then enters the loop loop (Const 1). The loop tries the first alternative termSuffix (Const 1) which succesfully parses the operator +, the next term t2 = Const 2, and then loops back into termSuffix (Binary Plus (Const 1) (Const 2)) which expects either a + or -. The parse fails. Instead of looping back into termSuffix, you should loop back into loop to allow a single term after the first +:
parseExp :: Parser Exp
parseExp = do
t1 <- parseTerm
loop t1
where termSuffix t1 = do
op <- lexeme $ oneOf "+-"
t2 <- parseTerm
case op of
-- *** use `loop` here, not `termSuffix` ***
'+' -> loop (Binary Plus t1 t2)
'-' -> loop (Binary Minus t1 t2)
loop t = termSuffix t <|> return t
After making a similar change to parseTerm, your test cases all work fine.
So ... I messed up a recording in CSV format:
23,95489,0,20,9888
Due to language settings floating point numbers were written with commas as seperator ... in a comma separated value file ...
Problem is that the file does not have a nice formatting for every float. Some have no point at all and the number of numbers behind the point varies too.
My idea was to build a MegaParsec parser that would try to read every possible floating point formatting, move on and if back track if it finds an error.
Eg for the example above:
read 23,95489 -> good
read 0,20 -> good (so far)
read 9888 -> error (because value is too high for column (checked by guard))
(back tracking to 2.) read 0 -> good again
read 20,9888 -> good
done
I've implemented that as (pseudo code here):
floatP = try pointyFloatP <|> unpointyFloatP
lineP = (,,) <$> floatP <* comma <*> floatP <* comma <*> floatP <* comma
My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?
And if so ... how would I go about implementing further back tracking?
How far does “try” back track?
The parser try p consumes exactly as much input as p if p parses successfully, otherwise it does not consume any input at all. So if you look at that in terms of backtracking, it backtracks to the point where you were when you invoked it.
My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?
Yes, try does not "unconsume" input. All it does is to recover from a failure in the parser you give it without consuming any input. It does not undo the effects of any parsers that you've applied previously, nor does it affect subsequent parsers that you apply after try p succeeded.
And if so ... how would I go about implementing further back tracking?
Basically what you want is to not only know whether pointyFloatP succeeds on the current input, but also whether the rest of your lineP would succeed after successfully pointyFloatP - and if it doesn't you want to backtrack back to before you applied pointyFloatP. So basically you want the parser for the whole remaining line in the try, not just the float parser.
To achieve that you can make floatP take the parser for the remaining line as an argument like this:
floatP restP = try (pointyFloatP <*> restP) <|> unpointyFloatP <*> restP
Note that this kind of backtracking isn't going to be very efficient (but I assume you knew that going in).
Update: Include a custom monadic parser for more complex rows.
Using the List Monad for Simple Parsing
The list monad makes a better backtracking "parser" than Megaparsec. For example, to parse the cells:
row :: [String]
row = ["23", "95489", "0", "20", "9888"]
into exactly three columns of values satisfying a particular bound (e.g., less than 30), you can generate all possible parses with:
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
rowResults :: [String] -> [[Double]]
rowResults = cols 3
where cols :: Int -> [String] -> [[Double]]
cols 0 [] = pure [] -- good, finished on time
cols 0 _ = empty -- bad, didn't use all the data
-- otherwise, parse exactly #n# columns from cells #xs#
cols n xs = do
-- form #d# from one or two cells
(d, ys) <- num1 xs <|> num2 xs
-- only accept #d < 30#
guard $ d < 30
ds <- cols (n-1) ys
return $ d : ds
-- read number from a single cell
num1 (x:xs) | ok1 x = pure (read x, xs)
num1 _ = empty
-- read number from two cells
num2 (x:y:zs) | ok1 x && ok2 y = pure (read (x ++ "." ++ y), zs)
num2 _ = empty
-- first cell: "0" is okay, but otherwise can't start with "0"
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- second cell: can't end with "0" (or *be* "0")
ok2 xs = last xs /= '0'
The above list-based parser tries to reduce ambiguity by assuming that if "xxx,yyy" is a number, the "xxx" won't start with zeros (unless it's just "0"), and the "yyy" won't end with a zero (or, for that matter, be a single "0"). If this isn't right, just modify ok1 and ok2 as appropriate.
Applied to row, this gives the single unambiguous parse:
> rowResults row
[[23.95489,0.0,20.9888]]
Applied to an ambiguous row, it gives all parses:
> rowResults ["0", "12", "5", "0", "8601"]
[[0.0,12.5,0.8601],[0.0,12.5,0.8601],[0.12,5.0,0.8601]]
Anyway, I'd suggest using a standard CSV parser to parse your file into a matrix of String cells like so:
dat :: [[String]]
dat = [ ["23", "95489", "0", "20", "9888"]
, ["0", "12", "5", "0", "8601"]
, ["23", "2611", "2", "233", "14", "422"]
]
and then use rowResults above get the row numbers of rows that were ambiguous:
> map fst . filter ((>1) . snd) . zip [1..] . map (length . rowResults) $ dat
[2]
>
or unparsable:
> map fst . filter ((==0) . snd) . zip [1..] . map (length . rowResults) $ dat
[]
>
Assuming there are no unparsable rows, you can regenerate one possible fixed file, even if some rows are ambiguous, but just grabbing the first successful parse for each row:
> putStr $ unlines . map (intercalate "," . map show . head . rowResults) $ dat
23.95489,0.0,20.9888
0.0,12.5,0.8601
23.2611,2.233,14.422
>
Using a Custom Monad based on the List Monad for More Complex Parsing
For more complex parsing, for example if you wanted to parse a row like:
type Stream = [String]
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
with a mixture of strings and numbers, it's actually not that difficult to write a monadic parser, based on the list monad, that generates all possible parses.
The key idea is to define a parser as a function that takes a stream and generates a list of possible parses, with each possible parse represented as a tuple of the object successfully parsed from the beginning of the stream paired with the remainder of the stream. Wrapped in a newtype, our parallel parser would look like:
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
Note the similarity to the type ReadS from Text.ParserCombinators.ReadP, which is also technically an "all possible parses" parser (though you usually only expect one, unambiguous parse back from a reads call):
type ReadS a = String -> [(a, String)]
Anyway, we can define a Monad instance for PParser like so:
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
There's nothing too tricky here: pure x returns a single possible parse, namely the result x with an unchanged stream s, while p >>= f applies the first parser p to generate a list of possible parses, takes them one by one within the list monad to calculate the next parser q to use (which, as per usual for a monadic operation, can depend on the result of the first parse), and generates a list of possible final parses that are returned.
The Alternative and MonadPlus instances are pretty straightforward -- they just lift emptiness and alternation from the list monad:
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
To run our parser, we have:
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
and now we can introduce primitives:
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
and combinators:
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
and parsers for various "terms" in our CSV row:
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
Then, for a particular row schema, we can write a row parser as we normally would with a monadic parser:
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
and get all possible parses:
> parse rowResults row0
[Row "Apple" 15 1.5016 2.0 5.3 1801 "11/13/2018" "X101"
,Row "Apple" 15 1.5016 2.5 3.0 1801 "11/13/2018" "X101"]
>
The full program is:
{-# LANGUAGE DeriveFunctor #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Control.Applicative
type Stream = [String]
newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
instance Applicative PParser where
pure x = PParser (\s -> [(x, s)])
(<*>) = ap
instance Monad PParser where
PParser p >>= f = PParser $ \s1 -> do -- in list monad
(x, s2) <- p s1
let PParser q = f x
(y, s3) <- q s2
return (y, s3)
instance Alternative PParser where
empty = PParser (const empty)
PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where
parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)
-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
(x:xs) -> pure (x, xs)
_ -> empty
-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
[] -> pure ((), s)
_ -> empty
-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
(x, s2) <- p s1 -- for each possible String
(y, "") <- reads x -- get each possible full read
-- (normally only one)
return (y, s2)
-- read a string from a single cell
str :: PParser String
str = token
-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)
-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
where dbl1 = convert (mfilter ok1 token)
dbl2 = convert $ do
t1 <- mfilter ok1 token
t2 <- mfilter ok2 token
return $ t1 ++ "." ++ t2
-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
x <- dbl
guard $ x < 30
return x
-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False
-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'
-- a row
data Row = Row String Int Double Double Double
Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
<*> int <*> str <*> str <* eof
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]
main = print $ parse rowResults row0
Off-the-shelf Solutions
I find it a little surprising I can't find an existing parser library out there that provides this kind of "all possible parses" parser. The stuff in Text.ParserCombinators.ReadP takes the right approach, but it assumes that you're parsing characters from a String rather than arbitrary tokens from some other stream (in our case, Strings from a [String]).
Maybe someone else can point out an off-the-shelf solution that would save you from having to role your own parser type, instances, and primitives.
I wrote such permutation parsing example:
data Entry = Entry {
first_name :: String
, last_name :: String
, date_of_birth :: Maybe String
, nationality :: Maybe String
, parentage :: Maybe String
} deriving (Show)
nameParser :: Parser (String, String)
nameParser = do
first_name <- many1 upper
endOfLine
last_name <- many1 letter
endOfLine
return $ (first_name, last_name)
attributeParser :: String -> Parser String
attributeParser field = do
string $ field ++ ": "
value <- many1 (noneOf "\n")
endOfLine
return value
entryParser :: Parser Entry
entryParser = do
(f, l) <- nameParser
(d, n, p) <- permute ((,,)
<$?> (Nothing, liftM Just (try $ attributeParser "Date of Birth"))
<|?> (Nothing, liftM Just (try $ attributeParser "Nationality"))
<|?> (Nothing, liftM Just (try $ attributeParser "Parentage"))
)
return $ Entry f l d n p
main = do
mapM_ putStrLn . map (show . parse entryParser "") $ goodTests
goodTests =
"AAKVAAG\nTorvild\nDate of Birth: 1 July\nNationality: Norwegian\nParentage: business executive\n" :
"AAKVAAG\nTorvild\nNationality: Norwegian\nParentage: business executive\n" :
"AAKVAAG\nTorvild\nParentage: business executive\nNationality: Norwegian\n" :
"AAKVAAG\nTorvild\nParentage: business executive\n" :
"AAKVAAG\nTorvild\nNationality: Norwegian\n" : []
It would be good to extend Entry data with new fields in future, but doing that will require to put even more repetitive code in entryParser function. Is there a way to make this function accept list of parsers?
I started with this:
attributeParsers =
map attributeParser ["Date of Birth", "Nationality", "Parentage"]
permuteParams =
map (\p -> (Nothing, liftM Just (try p))) attributeParsers
But could not come of with correct way to fold permuteParams together with <|?> operator (I guess it would require something smarter than (,,) tuple constructor then).
As a first step, you can abstract the stuff you do for every component:
attr txt = (Nothing, liftM Just (try $ attributeParser txt))
With this, you can go to:
entryParser :: Parser Entry
entryParser = do
(f, l) <- nameParser
(d, n, p) <- permute ((,,)
<$?> attr "Date of Birth"
<|?> attr "Nationality"
<|?> attr "Parentage"
)
return $ Entry f l d n p
Then, if you want, you can combine the infix combinators and the attr calls:
f .$ x = f <$?> attr x
f .| x = f <|?> attr x
infixl 2 .$
infixl 2 .|
This gives you:
entryParser :: Parser Entry
entryParser = do
(f, l) <- nameParser
(d, n, p) <- permute ((,,)
.$ "Date of Birth"
.| "Nationality"
.| "Parentage"
)
return $ Entry f l d n p
Then you can further simplify by getting rid of the intermediate triple. All you're doing is to build it and then apply its components to Entry f l, so you can as well apply the result of the permutation parser to Entry f l directly:
entryParser :: Parser Entry
entryParser = do
(f, l) <- nameParser
permute (Entry f l
.$ "Date of Birth"
.| "Nationality"
.| "Parentage"
)
I think this is compact enough. If you really want some kind of fold, you'll either have to introduce an intermediate list and collect the permutation results in a list. This, however, only works as long as all the permutable attributes are of the same type (they currently are), and is not so nice because you'll make assumptions about the number of elements in this list. Or you'll have to use a heterogeneous list / some type class magic, which will lead to more complexity with the types and is, I think, not worth it here.
(<|?>) does not play nicely with folding because the type of the StreamPermParser you pass as its first argument isn't the same than that of the StreamPermParser result. For a simpler yet analogous issue, you would run into similar problems if you were trying to use (,,) with (<$>) and (<*>) in applicative style (e.g. (,,) <$> foo <*> bar <*> baz).
If you want to cut down some of the repetition, my prosaic suggestion would be using a local definition:
entryParser :: Parser Entry
entryParser = do
(f, l) <- nameParser
(d, n, p) <- permute ((,,)
<$?> optField "Date of Birth"
<|?> optField "Nationality"
<|?> optField "Parentage"
)
return $ Entry f l d n p
where
optField fieldName = (Nothing, liftM Just (try $ attributeParser fieldName))
Assume there is a data structure representing a text with comments inside.
data TWC
= T Text TWC -- text
| C Text TWC -- comment
| E -- end
deriving Show
Thus string like
"Text, {-comment-}, and something else"
could be encoded as
T "Text, " (C "comment" (T ", and something else" E))
Parsers for comment chunk and for E are pretty trivial:
twcP :: Parser TWC
twcP = eP <|> cP <|> tP
cP :: Parser TWC
cP = do
_ <- string "{-"
c <- manyTill anyChar (string "-}")
rest <- cP <|> tP <|> eP
return (C (pack c) rest)
eP :: Parser TWC
eP = do
endOfInput
return E
Implementing parser for text chunk in such a trivial way
tP :: Parser TWC
tP = do
t <- many1 anyChar
rest <- cP <|> eP
return (T (pack t) rest)
make it consume comments section as a text because of its greedy nature
> parseOnly twcP "text{-comment-}"
Right (T "text{-comment-}" E)
it ∷ Either String TWC
So, the question is how to express the logic of parsing until end of input or until comment section? In other words, how to implement conditional lookahead parser?
You're right, the problematic code is the first line of of tP, which parses text greedily without stopping at comments:
tP = do
t <- many1 anyChar
Before addressing that, I first want to refactor your code a little to introduce helpers and use applicative style, with the problematic code isolated into the text helper:
-- Like manyTill, but pack the result to Text.
textTill :: Alternative f => f Char -> f b -> f Text
textTill p end = pack <$> manyTill p end
-- Parse one comment string
comment :: Parser Text
comment = string "{-" *> textTill anyChar (string "-}")
-- Parse one non-comment text string (problematic implementation)
text :: Parser Text
text = pack <$> many1 anyChar
-- TWC parsers:
twcP :: Parser TWC
twcP = eP <|> cP <|> tP
cP :: Parser TWC
cP = C <$> comment <*> twcP
eP :: Parser TWC
eP = E <$ endOfInput
tP :: Parser TWC
tP = T <$> text <*> twcP
To implement lookahead, we can use the lookAhead combinator, which applies a parser without consuming the input. That allows us to make text parse until it reaches either a comment (without consuming it), or endOfInput:
-- Parse one non-comment text string (working implementation)
text :: Parser Text
text = textTill anyChar (void (lookAhead comment) <|> endOfInput)
With that implementation, twcP behaves as expected:
ghci> parseOnly twcP "text{-comment-} post"
Right (T "text" (C "comment" (T " post" E)))
Just a simple question that I cannot solve.
I want to parse a string as either a String or a Maybe Double, where an empty string or an "n/a" is parsed as a Nothing. For example something like:
data Value = S String | N (Maybe Double)
value::CharParser () Value
value = val <* spaces
where val = N <$> v_number
<|> S <$> v_string
<|> N <$> v_nothing
I am having trouble with the v_nothing (and also leading and training white space).
Thanks.
EDIT:
v_number :: CharParser () (Maybe Double)
v_number = do s <- getInput
case readSigned readFloat s of
[(n, s')] -> Just n <$ setInput s'
_ -> empty
v_string :: CharParser () String
v_string = (many1 jchar)
where jchar = char '\\' *> (p_escape <|> p_unicode)
<|> satisfy (`notElem` "\"\\")
I tried all sort sorts of things for v_nothing to no avail.
Maybe something like this?
value = do skipMany space
choice $ map try [
do string "n/a" <|> (eof >> return [])
return $ N Nothing,
do d <- many digit
return $ N $ Just (read d)
-- do ...
]