Parsec: multiple possible options on one line - haskell

All,
I'm trying to write a parser using parsec. The goal is to eventually be able to parse a toy language.
Right now I'm struggling to make parsec recognise two different possible options, for example assignment and function invocation.
How would one write a "parseCode" function to parse the following:
x = 3
y = 4
plus(x,y)
into:
(Assignment "x" "3")
(Assignment "y" "4")
(Invocation "plus" ["x","y"])
Thanks
EDIT:
** omitted for brevity **
EDIT 2:
I built a bit upon your suggestions and now have the following problem
Running parse parseTester "bla" "{plus(3,4)\nmin(2,3)\nx=3\n" gives the expected solution: Right (Body [Invocation "plus",Invocation "min",Assignment "x" "3"]).
But running the functionally (almost) equivalent parse parseBody "bla" "{plus(3,4)\nmin(2,3)\nx=3\n}" results in an error:
Left "bla" (line 4, column 2):
unexpected end of input
expecting white space or "="
I don't see the problem. Is the parser suddenly looking for an assignment where it should be looking for an invocation? Any suggestions?
Code:
data Body = Body [Statement]
deriving (Show)
data Arguments = Arguments [String]
deriving (Show)
data Statement = Assignment String String
| Invocation String
deriving (Show)
parseBody :: Parser Body
parseBody = do
char '{'
statements <- many1 parseStatement
char '}'
return $ Body statements
parseTester :: Parser Body
parseTester = do
char '{'
x <- many1 parseStatement
return $ Body x
parseStatement :: Parser Statement
parseStatement = do
x <- try parseInvocation <|> parseAssignment <?> "statement"
return x
parseInvocation :: Parser Statement
parseInvocation = do
spaces
name <- many1 (noneOf " (")
spaces
char '('
spaces
bla <- many1 (noneOf " )")
spaces
char ')'
char '\n'
return $ Invocation name
parseAssignment :: Parser Statement
parseAssignment = do
spaces
var <- many1 (noneOf " =")
spaces
char '=' <?> "equal in assignment"
spaces
value <- many1 (noneOf "\n")
char '\n'
spaces
return $ Assignment var value

If we need to parse some choices, you could use choice from Text.ParserCombinators.Parsec.Combinator
choice [parseInvocation, parseAssignmen]
or much simplier: try parseInvocation <|> try parseAssignmen
P.S.
You could use form Text.ParserCombinators.Parsec.Char:
many (oneOf " ") == spaces
oneOf " " == space

Related

Reading list of statements and ending with a single expression, when statements can be expressions

I've run into a problem where I want to parse a block of code with the following syntax
{
<stmt>;
<stmt>;
<stmt>;
<expr>
}
A statement can be of the form <expr>;. This trips up Parsec in a way which I don't know how to fix. This is probably just me being kinda new to Haskell and the Parsec library, but I don't know where to search for a solution to the problem. I've written an example that captures my exact problem.
With the input { 5; 5; 5 } it fails on the third 5, because it expects there to be a ; present. How do I get around this?
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Combinator
parseIdentifier = do
first <- letter
rest <- many $ letter <|> digit <|> char '_'
return $ first : rest
parseExpr = parseIdentifier <|> many1 digit
parseStmt = parseExpr <* char ';'
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- try $ parseStmt `sepBy` spaces
parseExpr
)
readParser :: Parser String -> String -> String
readParser parser input = case parse parser "dusk" input of
Left err -> show err
Right val -> val
main = interact $ readParser parseBlock
Instead of sepBy, this sort of problems often can be solved by manyTill, the tricky point is to keep the input that don't be consumed by manyTill, it have to use try $ lookAhead
Side note: the reason can be found in source code of Parsec.
Internally, manyTill use <|>, so why try take effect, and
lookAhead can retain the input when apply monad bind >>=, >>
So, the correction look like:
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- manyTill (parseStmt <* spaces)
(try $ lookAhead (parseExpr >> space))
parseExpr
)
The above parser just return the output of parseExpr, i.e. 5, if this is your intent, it can be simpified by:
manyTill (parseStmt <* spaces) (try $ lookAhead (parseExpr >> space)) >> parseExpr
if you actually need the parsed string of statements as well, it become:
(do
stmts <- manyTill (parseStmt <* spaces)
(try $ lookAhead (parseExpr >> space))
expr <- parseExpr
return (concat (stmts ++ [expr]))
)
it return 555
The problem with your code is that sepBy have certain expectation about its parameters. In case separator successfully parsed it doesn't expect element parser to fail.
To fix that I suggest following improvement
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- try $ many $ spaces *> parseStmt
spaces
parseExpr
)

Parsec3 Text parser for quoted string, where everything is allowed in between quotes

I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.
Ok, the parser should allow for EVERY type of char inbetween quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Text.Parsec
import Text.Parsec.Text
quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
spaces
string "'"
lab <- liftM pack $ endBy1 anyChar endOfQuote
return lab
endOfQuote = do
string "'"
try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation falls.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser String
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = char '\''
quotedLabel :: Parser String
quotedLabel = do
char sq
chs <- many validChar
char sq
return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try(eof) <|> try( oneOf "| " *> return ())

End of file unexpected in haskell

I think I have researched very hard about my problem so here I am.
I have a "end of file unexpected" error at line
6 colunm 33.
I have already tried many solutions to resolve my problem.
Here is my code and the file I am trying to parse.
Here is the text I am trying to parse :
ifc.txt :
#9512= IFCBUILDINGSTOREY('3y21AUC9X4yAqzLGUny16E',#16,'Story',$,$,#9509,$,$,.ELEMENT.,6200.);
#9509= IFCLOCALPLACEMENT(#115,#9506);
#9506= IFCAXIS2PLACEMENT3D(#9502,#9498,#9494);
#9502= IFCCARTESIANPOINT((0.,0.,6200.));
#9498= IFCDIRECTION((0.,0.,1.));
#9494= IFCDIRECTION((1.,0.,0.));
Here is the code :
code.hs :
import Text.ParserCombinators.Parsec
main = do
f <- readFile "ifc.txt"
let m = (parse ifc "" f)
print m
ifc :: Parser IfcModel
ifc = many ifcentry
ifcentry = do
string "#"
i <- idt
string "= "
name <- idt
string "("
prop <- idt
string ")"
string ";"
string "\n"
return (i,name,prop)
idt = many (letter <|> digit <|> char ','
<|> char '$' <|> char ')' <|> char '\''
<|> char '=' <|> char ';' <|> char '\n'
<|> char ' ' <|> char '(' <|> char '#'
<|> char '.' <|> char '\r')
Thanks for your help, i should have checked a bit earlier my anwser because i worked on my own and i found asolution i will post it when i can (8hours left for a newbie like me who has less than 10 in reputation).
Thanks again.
Solution: use sepBy instead of including the newline in ifcentry
Your ifcentry expects a newline at the end, and your input doesn't have one, which is why the EOF was unexpected.
Drop the string "\n" from ifcentry and instead define
ifc :: Parser IfcModel
ifc = ifcentry `sepBy` (char '\n')
Also, your idt parser is needlessly long. It would be clearer as
idt = many (letter <|> digit <|> oneOf ".,;' =#$()\n\r")
Clearer ifcentry
And while I'm at it, I'd write
ifcentry = do
char '#'
i <- idt
string "= "
name <- idt
prop <- parens idt
char ';'
return (i,name,prop)
Because parens (which parses an open bracket, your idt content, then a close bracket) tidies it up and makes it clearer.
Less verbose main
I'd also write
main = fmap (parse ifc "") (readFile "ifc.txt") >>= print
certainly there's no need for
let m = (parse ifc "" f)
print m
because you may as well do
print (parse ifc "" f)
In addtion to #enough rep to comment's answer
I would go much further and declare something in the line of
data IFCType = IFCBuildingStorey ....
| IFCLocalPlacement ....
| IFCAxis2Placement3D ....
| IFCCartesianpoint Double Double Double
| IFCDirection ....
deriving Show
and
type ID = Integer
type IFCElement = (ID,IFCType)
where i will show the CartesianPoint as an example
ifctype :: Parser IFCType
ifctype = do string "IFC"
buildingStorey
<|> localPlacement
<|> axis2Placement3D
<|> cartesianpoint
<|> direction
buildingStorey :: Parser IFCType
buildingStorey = do string "BUILDINGSTOREY"
return IFCBuildingStorey
localPlacement :: Parser IFCType
localPlacement = do string "LOCALPLACEMENT"
return IFCLocalPlacement
axis2Placement3D :: Parser IFCType
axis2Placement3D = do string "AXIS2PLACEMENT3D"
return IFCAxis2Placement3D
cartesianpoint :: Parser IFCType
cartesianpoint = do string "CARTESIANPOINT"
char '('
char '('
x <- double
char ','
y <- double
char ','
z <- double
char ')'
char ')'
return $ IFCCartesianpoint x y z
double :: Parser Double
double = do d <- many1 (digit <|> char '.')
return $ read d
direction :: Parser IFCType
direction = do string "DIRECTION"
return IFCDirection
this has the additional advantage that you have typed models.
Thanks for your help everyone, i should have checked a bit earlier my anwser because i worked on my own and finally found a solution :
import Text.ParserCombinators.Parsec
main = do
f <- readFile "ifc.txt"
let m = (parse ifc "" f)
print m
type IfcIdt = String
type IfcName = String
type IfcProp = [String]
type IfcModel = [(IfcIdt,IfcName,IfcProp)]
ifc :: Parser IfcModel
ifc = many ifcentry
ifcentry = do
string "#"
i <- idtnumber
string "= "
name <- idtname
opening
prop <- ifcprop
closing
eol
return (i,name,prop)
idtnumber = many digit
idtname = many (letter <|> digit)
ifcprop = sepBy prop (char ',')
prop = many (noneOf "=,();\n")
eol = string ";\n"
opening = try (string "((")
<|> string "("
closing = try (string "))")
<|> string ")"

Parsec and sequence of commaSep input

I took the example below partially from SO and changed it to my needs. It almost fits, but what I want to do is that always the first string in the commaSep expr is parsed as identifier whilst all subsequent strings should be strings only.
Currently they are all parsed as Identifiers.
*Parser> parse expr "" "rd (isFib, test2, 100.1, ?BOOL)"
Right (FuncCall "rd" [Identifier "isFib",Identifier "test2",Number 100.1,Query "?BOOL"])
I have tried a number of solutions that in the end all would break down to parsing the whole input without using commaSep. Means I would have to ignore the structure and do something like
expr_parse = do
name <- resvd_cmd
char '('
skipMany space
worker <- ident
char ','
skipMany1 space
args <- commaSep expr --not fully worked this out yet
query <- theQuery
skipMany space
char ')'
return (name, worker, args, query)
that looks less optimal and very clunky to me. Is there any way to refactor expr in the code below, achive what I need and keep it simple?
module Parser where
import Control.Monad (liftM)
import Text.Parsec
import Text.Parsec.String (Parser)
import Lexer
import AST
expr = ident <|> astring <|> number <|> theQuery <|> callOrIdent
astring = liftM String stringLiteral <?> "String"
number = liftM Number float <?> "Number"
ident = liftM Identifier identifier <?> "WorkerName"
questionm :: Parser Char
questionm = oneOf "?"
theQuery :: Parser AST
theQuery = do first <- questionm
rest <- many1 letter
let query = first:rest
return ( Query query )
resvd_cmd = do { reserved "rd"; return ("rd") }
<|> do { reserved "eval"; return ("eval") }
<|> do { reserved "read"; return ("read") }
<|> do { reserved "in"; return ("in") }
<|> do { reserved "out"; return ("out") }
<?> "LINDA-like Tuple"
callOrIdent = do
name <- resvd_cmd
liftM (FuncCall name)(parens $ commaSep expr) <|> return (Identifier name)
AST.hs
{-# LANGUAGE DeriveDataTypeable #-}
module AST where
import Data.Typeable
data AST
= Number Double
| Identifier String
| String String
| FuncCall String [AST]
| Query String
deriving (Show, Eq, Typeable)
Lexer.hs
module Lexer (
identifier, reserved, operator, reservedOp, charLiteral, stringLiteral,
natural, integer, float, naturalOrFloat, decimal, hexadecimal, octal,
symbol, lexeme, whiteSpace, parens, braces, angles, brackets, semi,
comma, colon, dot, semiSep, semiSep1, commaSep, commaSep1
)where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellStyle)
lexer = P.makeTokenParser ( haskellStyle
{P.reservedNames = ["rd", "in", "out", "eval", "take"]}
)
identifier = P.identifier lexer
reserved = P.reserved lexer
operator = P.operator lexer
reservedOp = P.reservedOp lexer
charLiteral = P.charLiteral lexer
stringLiteral = P.stringLiteral lexer
natural = P.natural lexer
integer = P.integer lexer
float = P.float lexer
naturalOrFloat = P.naturalOrFloat lexer
decimal = P.decimal lexer
hexadecimal = P.hexadecimal lexer
octal = P.octal lexer
symbol = P.symbol lexer
lexeme = P.lexeme lexer
whiteSpace = P.whiteSpace lexer
parens = P.parens lexer
braces = P.braces lexer
angles = P.angles lexer
brackets = P.brackets lexer
semi = P.semi lexer
comma = P.comma lexer
colon = P.colon lexer
dot = P.dot lexer
semiSep = P.semiSep lexer
semiSep1 = P.semiSep1 lexer
commaSep = P.commaSep lexer
commaSep1 = P.commaSep1 lexer
First, I'd like to introduce you to the function lexeme which alters a parser to eat trailing whitespace. You're encouraged to use it rather than explicitly eating the whitespace. The difficulty is with commaSep because it eats the , and then fails. It would be nice to write a less optimistic commaSep, but let's solve your problem directly.
Let's apply lexeme to comma
acomma = lexeme comma
One of the problems with your code was you were expecting it to see test2 as String "test2" but the astring parser expects its strings to begin and end with ". Let's make a parser for bald strings, but make sure they don't start with ? and don't contain spaces or commas:
baldString = lexeme $ do
x <- noneOf "? ,)"
xs <- many (noneOf " ,)") -- problematic - see comment below
return . String $ x:xs
The breakthrough came when I realised that because there has to be a query at the end, there was always a comma after a baldString:
baldStringComma = do
s <- baldString
acomma
return s
Now let's make a parser for one or more queries at the end of the tuple:
queries = commaSep1 (lexeme theQuery)
And now we can take the identifier, the baldStrings and the queries
therest = do
name <- lexeme ident
acomma
args <- many baldStringComma
qs <- queries
return (name,args,qs)
finally giving
tuple = do
name <- lexeme resvd_cmd
stuff <- parens therest
return (name,stuff)
So you get
*Parser> parseTest tuple "rd (isFib, test2, 100.1, ?BOOL)"
("rd",(Identifier "isFib",[String "test2",String "100.1"],[Query "?BOOL"]))
But if you want to lump the strings with the queries, you can return (name,args++qs) at the end of therest.
Applicative is Less Ugly
I found it frustrating to be tied to the Monad interface, when there are lovely things like <$>, <*> etc, so first
import Control.Applicative hiding (many, (<|>))
Then
baldString = lexeme . fmap String $
(:) <$> noneOf "? ,)"
<*> many (noneOf " ,)") -- problematic - see comment below
Here <$> is an infix version of fmap, so (:) will be applied to the output of noneOf "? ,", giving a parser that returns something like ('c':). This can then be applied to the output of many (noneOf " ,") using <*> to give the string we want.
baldStringComma = baldString <* acomma
This one's nice because we got the <*> operator to ignore the output of acomma and just return the output of baldString, using <*. If we wanted it the other way round, we could do *>, but you may as well use >> for that, which already ignores the output of the first parser.
therest = (,,) <$>
lexeme ident <* acomma
<*> many baldStringComma
<*> queries
and
tuple = (,) <$> lexeme resvd_cmd
<*> parens therest
But wouldn't it be nicer if we did
data Tuple = Tuple {cmd :: String,
id :: AST,
argumentList :: [AST],
queryList :: [AST]} deriving Show
so we could do
niceTuple = Tuple <$> lexeme resvd_cmd <* lexeme (char '(')
<*> lexeme ident <* acomma
<*> many baldStringComma
<*> queries <* lexeme (char ')')
which gives (with a little manual pretty-printing to get it into the width)
*Parser> parseTest niceTuple "rd (isFib, test2, 100.1, ?BOOL)"
Tuple {cmd = "rd",
id = Identifier "isFib",
argumentList = [String "test2",String "100.1"],
queryList = [Query "?BOOL"]}
I also think your current AST is more of an abstract syntax store than an abstract syntax tree, and that you might get more milage from designing your own Tuple type and use that. Use
newtype Command = Cmd String deriving Show
and suchlike to ensure type safety, then roll them together into your Tuple type with a parser to generate them.

Parsing FB2(XML) in Haskell

Started to learn Haskell, I decided to get acquainted with Parsec, but there were problems. I'm trying to implement the parsing of the books in the format of FB2. On conventional tags ( text ) is good, but when the tag within a tag - does not work.
import Text.ParserCombinators.Parsec
data FB2Doc = Node String FB2Doc
| InnText String
deriving (Eq,Show)
parseFB2 :: GenParser Char st [FB2Doc]
parseFB2 = many test
test :: GenParser Char st FB2Doc
test = do name <- nodeStart
value <- getvalue
nodeEnd
return $ Node name value
nodeStart = do char '<'
name <- many (letter <|> digit <|> oneOf "-_")
char '>'
return name
nodeEnd = do string "</"
many (letter <|> digit)
char '>'
spaces
gettext = do x <- many (letter <|> digit <|> oneOf "-_")
return $ InnText x
getvalue = do (nodeStart >> test) <|> gettext <|> return (Node "" (InnText ""))
main = do
print $ parse parseFB2 "" "<h1><a2>ge</a2></h1> <genre>history_russia</genre>"
I think you want this:
getvalue = try test <|> gettext
The try is needed for empty nodes: "<bla></bla>". test will consume the '<' of </bla>, and the try allows for backtracking.

Resources