I am currently trying to design a Parser in Haskell using Parsec.
The syntax for declaring a type should look something like this:
Fruit is a Apple
Types should also be able to have parameters:
Fruit a b is a Apple
Where
Fruit has type Name
a b has type [Parameter]
Apple has type Value
The problem here is that my parser currently does not know when to stop parsing the parameters and start parsing the value.
The code is as follows:
newtype Name = Name String deriving (Show)
newtype Parameter = Parameter String deriving (Show)
newtype Value = Value String deriving (Show)
data TypeAssignment = TypeAssignment Name [Parameter] Value deriving (Show)
-- first variant using `sepBy`
typeAssigment :: Parser TypeAssignment
typeAssigment =
  TypeAssignment
    <$> name
    <*> (space *> parameter `sepBy` space)
    <*> (string "is a" *> value)
-- second variant using manyTill
typeAssigment2 :: Parser TypeAssignment
typeAssigment2 =
  TypeAssignment
    <$> name
    <*> (space *> manyTill parameter (string "is a"))
    <*> value
name :: Parser Name
name = Name <$> word
parameter :: Parser Parameter
parameter = Parameter <$> word
value :: Parser Value
value = Value <$> word
word :: Parser String
word = (:) <$> letter <*> many (letter <|> digit)
I have tried parsing the parameters/value in the two ways I know (once with sepBy and once with manyTill). Both failed with nearly the same parse error:
*EParser> parseTest typeAssigment "Fruit a b is a Apple"
parse error at (line 1, column 21):
unexpected end of input
expecting space or "is a"
*EParser> parseTest typeAssigment2 "Fruit a b is a Apple"
parse error at (line 1, column 8):
unexpected " "
expecting letter, digit or "is a"
The problem with the first variant (typeAssigment) is that "is" and "a" are perfectly valid parameter parses. So the parameter parser gobbles up the whole input until nothing is left, and then you get an error. In fact, if you look closely at that error you can see this to be true: the parser is expecting either a space (for more parameters) or "is a" (the terminator of the parameter list).
On the other hand, typeAssigment2 is really close, but it seems that you're not handling spaces properly. In order to parse many parameters, you need to parse all the spaces between those parameters, not just the first one.
I think the following alternative should do the trick:
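typeAssigment3 :: Parser TypeAssignment
typeAssigment3 =
  TypeAssignment
    <$> (name <* space)
    -- each parameter consumes its trailing space, so the "is a" check
    -- always starts at the beginning of a word
    <*> manyTill (parameter <* space) (try $ string "is a")
    <*> (space *> value)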
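For experimenting, here is a self-contained sketch of a manyTill-based parser along these lines. It is an illustration under assumptions rather than a definitive fix: the names isA, typeAssignment and parseTypeAssignment are made up for this example, exactly one space is assumed between words, and each parameter consumes its own trailing space so that the "is a" check always starts at the beginning of a word.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

newtype Name = Name String deriving (Show, Eq)
newtype Parameter = Parameter String deriving (Show, Eq)
newtype Value = Value String deriving (Show, Eq)
data TypeAssignment = TypeAssignment Name [Parameter] Value deriving (Show, Eq)

-- A word is a letter followed by letters or digits (as in the question).
word :: Parser String
word = (:) <$> letter <*> many (letter <|> digit)

-- The terminator of the parameter list. `try` makes a partial match
-- (for example on a parameter called "isabel") backtrack without
-- consuming input.
isA :: Parser String
isA = try (string "is a")

-- Hypothetical name for this sketch; mirrors the manyTill approach.
typeAssignment :: Parser TypeAssignment
typeAssignment =
  TypeAssignment
    <$> (Name <$> word <* space)
    <*> manyTill (Parameter <$> word <* space) isA
    <*> (space *> (Value <$> word))

-- Run the parser and insist that the whole input is consumed.
parseTypeAssignment :: String -> Either ParseError TypeAssignment
parseTypeAssignment = parse (typeAssignment <* eof) "<input>"
```

With this sketch, parseTypeAssignment "Fruit a b is a Apple" should yield the name Fruit, the parameters a and b, and the value Apple, while "Fruit is a Apple" should yield an empty parameter list.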
I have a string, for example "MMMMABCNNNXYZPPPPP". I know that this string may have ABC in it and may have XYZ in it, but it is not required to have either. Additionally, the XYZ may be swapped for DEF (e.g. "MMMMABCNNNDEFPPPPP") and the behavior should remain the same.
I would like to parse the string and return the sequences between them, as well as which one of XYZ or DEF was present. Example:
data Divider1 = Abc
data Divider2 = Xyz | Def
--"MMMMABCNNNXYZPPPPP" should return ("MMMM", Just Abc, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNDEFPPPPP" should return ("MMMM", Just Abc, "NNN", Just Def, "PPPPP")
Note that if ABC is not present, I would like to return everything before the divider2 and if XYZ and DEF are both not present, I would like to return everything after divider 1.
Example:
--"MMMMNNNXYZPPPPP" should return ("MMMM", Nothing, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNPPPPP" should return ("MMMM", Just Abc, "NNN", Nothing, "PPPPP")
If neither ABC nor XYZ is present then I don't care if it returns nothing, or if it returns the entire string.
Currently my code is
parseEverything = many $ satisfy someGeneralCondition -- check if all characters are valid
parseAbc = (\str -> Abc) <$> string "ABC"
parseXyz = (\str -> Xyz) <$> string "XYZ"
parseDef = (\str -> Def) <$> string "DEF"
parseFull = do
  beforeAbc <- gather parseEverything
  parseAbc <- (Just <$> parseAbc) <++ return Nothing
  beforeDivider2 <- gather parseEverything
  parseDivider2 <- (Just <$> parseXyz) <++ (Just <$> parseDef) <++ return Nothing
  everythingElse <- look
  return (beforeAbc, parseAbc, beforeDivider2, parseDivider2, everythingElse)
But when I run this on the example string "MMMMABCNNNXYZPPPPP", I get mostly failed results with just one result that I want. The problem is that I need to return everything in beforeAbc if parseAbc fails, but if parseAbc passes then I just need to return that. And the same thing with parseXyz and parseDef. I don't think that <++ is the correct operator to do this. I also tried a variant of this code using option, but it gave the same result. Is there a simple solution that I am missing, and/or should I set up the parsers in a different way?
Thanks in advance!
Edit: does this have anything to do with chainl or chainr or manyTill?
Updated: See note on applicative parsers below.
Here's what's going wrong with your current approach. As you undoubtedly know, the parsers in Text.ParserCombinators.ReadP generate all possible valid parses of all possible prefixes of the string. If you write a parser:
letterAndOther = do
  letters <- many (satisfy isLetter)
  others <- many get
  return (letters, others)
which grabs an initial string of letters followed by the "rest" of the string and run it on a simple test string, you'll usually get way more than you bargained for:
> readP_to_S letterAndOther "abc"
[(("",""),"abc"),(("","a"),"bc"),(("a",""),"bc"),(("","ab"),"c"),
(("a","b"),"c"),(("ab",""),"c"),(("","abc"),""),(("a","bc"),""),
(("ab","c"),""),(("abc",""),"")]
In other words, in a do-block, each monadic action will typically generate a tree of possible parses. In your current code, the very first line of the do-block:
beforeAbc <- gather parseEverything
introduces a whole tree of parse branches, one branch for each possible initial prefix. These branches only get pruned if a later line of the do-block introduces a parse that fails. But, every line of your do-block represents a parser that always succeeds. For example, this always succeeds:
parseAbc <- (Just <$> parseAbc) <++ return Nothing
because even if the first divider isn't found, the right-hand side parser return Nothing will always succeed.
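To see this effect in isolation, here is a small sketch (the names prefixThenOptionalAbc and allParses are made up for illustration). It pairs a non-deterministic prefix parser with an optional divider that can never fail, and collects every parse that ReadP produces.

```haskell
import Data.Char (isLetter)
import Text.ParserCombinators.ReadP

-- A prefix of letters followed by an optional "ABC" divider.
-- The second step always succeeds, so it never prunes a branch
-- opened by the first step.
prefixThenOptionalAbc :: ReadP (String, Maybe String)
prefixThenOptionalAbc = do
  before <- many (satisfy isLetter)
  abc <- (Just <$> string "ABC") <++ return Nothing
  return (before, abc)

-- All parse results, ignoring the unconsumed remainder.
allParses :: String -> [(String, Maybe String)]
allParses = map fst . readP_to_S prefixThenOptionalAbc
```

On the input "MMABC" this should produce one result per possible prefix (six in total), and only the branch whose prefix is "MM" actually finds the divider. All the other branches survive with Nothing, which is exactly the flood of unwanted results described above.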
I would suggest the following approach. First, as we discovered in the comments, the first thing you want to do is figure out what your parser should return. Instead of trying to shoehorn the result into a weird tuple, it's a good idea to leverage Haskell's best feature, its algebraic data types. Define a return type for your parse:
data Result
  = TwoDividers String Divider1 String Divider2 String
  | FirstDivider String Divider1 String
  | SecondDivider String Divider2 String
  | NoDividers String
This is unambiguous and covers all possibilities. Admittedly, including Divider1 in the first two constructors is redundant, since there's only one possible Divider1, but programs are for humans to read, too, and keeping Divider1 explicit improves readability.
Now, let's define parsers for the first and second dividers:
divider1 = Abc <$ string "ABC"
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
Note that I've chosen to define a single divider2 instead of separate parsers for Def and Xyz. Since, in your grammar, it's always the case that "DEF" can appear anywhere "XYZ" can and vice versa, it makes sense to combine them into one parser.
We'll also want a parser for arbitrary strings (basically your parseEverything):
anything = many $ satisfy isLetter -- valid characters
Now, let's write a parser for the full string. A key insight here is that we have four alternatives (i.e., the four constructors for our Result type). It's true that they share some structure, but a first crack at a parser can just treat them as independent alternatives. We'll use the <++ operator to choose the best match:
result =
  (TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
  <++ (FirstDivider <$> anything <*> divider1 <*> anything)
  <++ (SecondDivider <$> anything <*> divider2 <*> anything)
  <++ (NoDividers <$> anything)
A quick test of this will show we've forgotten something:
> readP_to_S result "MMMMABCNNNXYZPPPPP"
[(TwoDividers "MMMM" Abc "NNN" Xyz "","PPPPP"),...]
By default, the parser combinators will try every possible prefix of the input string, leaving more for later parsers. So, we should wrap this up in a final parser function that checks for the end-of-string:
parseResult = readP_to_S (result <* eof)
and with the tests:
main = mapM_ (print . parseResult)
  [ "MMMMABCNNNXYZPPPPP"
  , "MMMMABCNNNDEFPPPPP"
  , "MMMMNNNXYZPPPPP"
  , "MMMMABCNNNPPPPP"
  ]
we get the expected unique parsed output:
[(TwoDividers "MMMM" Abc "NNN" Xyz "PPPPP","")]
[(TwoDividers "MMMM" Abc "NNN" Def "PPPPP","")]
[(SecondDivider "MMMMNNN" Xyz "PPPPP","")]
[(FirstDivider "MMMM" Abc "NNNPPPPP","")]
Note on Applicative Parsers. I've used applicative syntax here, rather than the monad syntax. The difference isn't purely syntactical -- you can always write an applicative expression in monadic form, but there are monadic operations that can't be expressed applicatively, so the monadic syntax is strictly more powerful. However, when an expression can be written both ways, often the applicative syntax is more succinct and easier to write and understand, at least once you get used to it.
In a nutshell, the expression p <*> x <*> y <*> z creates a new parser that applies the parsers p, x, y, and z in order, and then applies the result from parser p (which needs to be a function f) to the results from the rest of the parsers (which must be appropriate arguments for f). In many cases, the function f is a known function and doesn't need to be returned by a parser, so a common variant is to write f <$> x <*> y <*> z. This applies the parsers x, y, and z in order, and then applies f (given directly instead of returned by a parser) to the results from those parsers. For example, the expression:
FirstDivider <$> anything <*> divider1 <*> anything
runs three parsers in order to get anything, followed by a divider1, followed by anything, and then applies the function/constructor FirstDivider to the three arguments resulting from those parsers.
The operators <* and *> can be thought of as variants of <*>. The expression p <*> x first parses p, then parses x, then applies the result of the former to the latter. The expression p <* x first parses p, then parses x, but instead of applying the former to the latter, it returns the value the arrow is pointing to (i.e., whatever p produced) and throws away the other value. Similarly p *> x parses p then parses x, then returns whatever x produced. In particular:
someParser <* eof
first runs someParser, then parses (i.e., checks for) EOF, then returns whatever someParser produced.
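As a small illustration of these operators, here is a sketch using ReadP. The parsers number and bracketed and the helper runComplete are hypothetical names introduced only for this example.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- One or more digits, read as an Int. munch1 is greedy, so this
-- parser yields a single result rather than every prefix.
number :: ReadP Int
number = read <$> munch1 isDigit

-- `*>` discards the opening bracket, `<*` discards the closing one;
-- only the number in the middle is returned.
bracketed :: ReadP Int
bracketed = char '[' *> number <* char ']'

-- Keep only the parses that consume the entire input.
runComplete :: ReadP a -> String -> [a]
runComplete p s = [x | (x, "") <- readP_to_S (p <* eof) s]
```

Here runComplete bracketed "[42]" should return the single value 42, while an unterminated input such as "[42" should return no results at all.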
This syntax can really shine when parsing more traditional languages into an abstract syntax tree. If you want to parse statements like:
let x = 1 + 5
into a Statement type like:
data Statement = ... | Let Var Expr | ...
you can write a Parsec parser that looks like:
statement = ...
        <|> Let <$ string "let" <*> var <* symbol "=" <*> expr
        ...
The monadic equivalent in do-notation looks like this:
do string "let"
   v <- var
   symbol "="
   e <- expr
   return $ Let v e
which is fine, I suppose, but kind of obscures the simple structure of the parse. The applicative version is basically just the list of tokens to parse, with a little bit of syntactic sugar sprinkled in.
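To make the comparison concrete, here is a runnable sketch of both styles in Parsec. Everything beyond the Let constructor is an assumption made for illustration: Var is taken to be a String, Expr supports only integer literals and addition, and lexeme, symbol, var and expr are minimal hand-written helpers rather than the ones from Text.Parsec.Token.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

type Var = String
data Expr = Num Int | Add Expr Expr deriving (Show, Eq)
data Statement = Let Var Expr deriving (Show, Eq)

-- Run a parser, then skip any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- A fixed token followed by optional whitespace.
symbol :: String -> Parser String
symbol = lexeme . try . string

var :: Parser Var
var = lexeme (many1 letter)

-- Integer literals joined by left-associative "+".
expr :: Parser Expr
expr = chainl1 (Num . read <$> lexeme (many1 digit)) (Add <$ symbol "+")

-- Applicative style: reads like the list of tokens to parse.
letApplicative :: Parser Statement
letApplicative = Let <$ symbol "let" <*> var <* symbol "=" <*> expr

-- The same parser written monadically.
letMonadic :: Parser Statement
letMonadic = do
  _ <- symbol "let"
  v <- var
  _ <- symbol "="
  e <- expr
  return (Let v e)
```

Both parsers should turn "let x = 1 + 5" into Let "x" (Add (Num 1) (Num 5)).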
Anyway, here's the full program:
import Data.Char
import Text.ParserCombinators.ReadP
data Divider1 = Abc deriving (Show)
data Divider2 = Xyz | Def deriving (Show)
data Result
  = TwoDividers String Divider1 String Divider2 String
  | FirstDivider String Divider1 String
  | SecondDivider String Divider2 String
  | NoDividers String
  deriving (Show)
anything :: ReadP String
anything = many $ satisfy isLetter -- valid characters
divider1 :: ReadP Divider1
divider1 = Abc <$ string "ABC"
divider2 :: ReadP Divider2
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
result :: ReadP Result
result =
  (TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
  <++ (FirstDivider <$> anything <*> divider1 <*> anything)
  <++ (SecondDivider <$> anything <*> divider2 <*> anything)
  <++ (NoDividers <$> anything)
parseResult :: String -> [(Result, String)]
parseResult = readP_to_S (result <* eof)
main :: IO ()
main = mapM_ (print . parseResult)
  [ "MMMMABCNNNXYZPPPPP"
  , "MMMMABCNNNDEFPPPPP"
  , "MMMMNNNXYZPPPPP"
  , "MMMMABCNNNPPPPP"
  ]
I have complicated command line options, as
data Arguments = Arguments Bool (Maybe SubArguments)
data SubArguments = SubArguments String String
I want to parse these subarguments with a flag:
programName --someflag --subarguments "a" "b"
programName --someflag
I already have
subArgParser = SubArguments <$> argument str <*> argument str
mainParser = MainArgs <$> switch
    (long "someflag"
    <> help "Some argument flag")
  <*> ???
    (long "subarguments"
    <> help "Sub arguments")
What do I have to write at the ???
Your question turned out to be more complicated than you might think. The current optparse-applicative API is not designed for cases like this, so you may want to change the way you handle CLI arguments or switch to another CLI parsing library. But I will describe the closest way of achieving your goal.
First, you need to read other two SO questions:
1. How to parse Maybe with optparse-applicative
2. Is it possible to have a optparse-applicative option with several parameters?
From the first question you know how to parse optional arguments using the optional function. From the second you learn about some problems with parsing multiple arguments. So I will describe several approaches to working around this problem.
1. Naive and ugly
You can represent the pair of strings as a value of type (String, String) and simply read it back in the format that show produces for such a pair. Here is the code:
mainParser :: Parser Arguments
mainParser = Arguments
  <$> switch (long "someflag" <> help "Some argument flag")
  <*> optional (uncurry SubArguments <$>
        (option auto $ long "subarguments" <> help "some desc"))
getArguments :: IO Arguments
getArguments = do
  (res, ()) <- simpleOptions "main example" "" "desc" mainParser empty
  return res
main :: IO ()
main = getArguments >>= print
Here is result in ghci:
ghci> :run main --someflag --subarguments "(\"a\",\"b\")"
Arguments True (Just (SubArguments "a" "b"))
2. Less naive
From the answer to the second question you should learn how to pass multiple arguments inside one string. Here is the code for parsing:
subArgParser :: ReadM SubArguments
subArgParser = do
  input <- str
  -- no error checking, don't actually do this
  let [a,b] = words input
  pure $ SubArguments a b
mainParser :: Parser Arguments
mainParser = Arguments
  <$> switch (long "someflag" <> help "Some argument flag")
  <*> optional (option subArgParser $ long "subarguments" <> help "some desc")
And here is ghci output:
ghci> :run main --someflag --subarguments "x yyy"
Arguments True (Just (SubArguments "x" "yyy"))
The only bad thing about the second solution is that error checking is absent. You could therefore use a general-purpose parsing library, for example megaparsec, instead of just let [a,b] = words input.
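Short of pulling in a full parsing library, the splitting step can also be made total with a plain function that returns Either. The sketch below uses only base; parseSubArgs is a hypothetical name, and the wiring shown in the comment assumes optparse-applicative's eitherReader, which turns a String -> Either String a function into a ReadM a.

```haskell
data SubArguments = SubArguments String String deriving (Show, Eq)

-- Split the option's single string into exactly two words,
-- reporting a readable error instead of crashing on a failed
-- pattern match.
parseSubArgs :: String -> Either String SubArguments
parseSubArgs input =
  case words input of
    [a, b] -> Right (SubArguments a b)
    ws     -> Left ("expected exactly two words, got " ++ show (length ws))

-- Assumed wiring (not compiled here, needs optparse-applicative):
--   optional (option (eitherReader parseSubArgs)
--                    (long "subarguments" <> help "some desc"))
```

With this in place, an input such as "x yyy" should give Right (SubArguments "x" "yyy"), and anything with more or fewer than two words should give a Left with an error message.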
It's not possible, at least not directly. You might find some indirect encoding that works for you, but I'm not sure. Options take arguments, not subparsers. You can have subparsers, but they are introduced by a "command", not an option (i.e. without the leading --).
I've investigated many answers to other rigid type variable error questions; but, alas, none of them, to my knowledge, apply to my case. So I'll ask yet another question.
Here's the relevant code:
module MultipartMIMEParser where
import Control.Applicative ((<$>), (<*>), (<*))
import Text.ParserCombinators.Parsec hiding (Line)
data Header = Header { hName :: String
                     , hValue :: String
                     , hAddl :: [(String,String)] } deriving (Eq, Show)
data Content a = Content a | Posts [Post a] deriving (Eq, Show)
data Post a = Post { pHeaders :: [Header]
                   , pContent :: [Content a] } deriving (Eq, Show)
post :: Parser (Post a)
post = do
  hs <- headers
  c <- case boundary hs of
         "" -> content >>= \s->return [s]
         b -> newline >> (string b) >> newline >>
              manyTill content (string b)
  return $ Post { pHeaders=hs, pContent=c }
boundary hs = case lookup "boundary" $ concatMap hAddl hs of
                Just b -> "--" ++ b
                Nothing -> ""
-- TODO: lookup "boundary" needs to be case-insensitive.
content :: Parser (Content a)
content = do
  xs <- manyTill line blankField
  return $ Content $ unlines xs -- N.b. This is the line the error message refers to.
  where line = manyTill anyChar newline
headers :: Parser [Header]
headers = manyTill header blankField
blankField = newline
header :: Parser Header
header =
  Header <$> fieldName <* string ":"
         <*> fieldValue <* optional (try newline)
         <*> nameValuePairs
  where fieldName = many $ noneOf ":"
        fieldValue = spaces >> many (noneOf "\r\n;")
        nameValuePairs = option [] $ many nameValuePair
nameValuePair :: Parser (String,String)
nameValuePair = do
  try $ do n <- name
           v <- value
           return $ (n,v)
name :: Parser String
name = string ";" >> spaces >> many (noneOf "=")
value :: Parser String
value = string "=" >> between quote quote (many (noneOf "\r\n;\""))
  where quote = string "\""
And the error message:
Couldn't match type `a' with `String'
`a' is a rigid type variable bound by
the type signature for content :: Parser (Content a)
at MultipartMIMEParser.hs:(See comment in code.)
Expected type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content a)
Actual type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content String)
Relevant bindings include
content :: Parser (Content a)
(bound at MultipartMIMEParser.hs:72:1)
In a stmt of a 'do' block: return $ Content $ unlines xs
In the expression:
do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
In an equation for `content':
content
= do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
where
line = manyTill anyChar newline
From what I've seen, the problem is that I'm explicitly returning a String using unlines xs, and that breaks the generic nature of a in the type signature. Am I close to understanding?
I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all my as, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point.
Is it clear from the code what I'm trying to do? If so, any suggestions on how to do it best?
You're telling the compiler that content has type Parser (Content a), but the line causing the error is
return $ Content $ unlines xs
Since unlines returns a String, and the Content constructor has type a -> Content a, here you would have String ~ a, so the value Content $ unlines xs has type Content String. If you change the type signature of content to Parser (Content String) then it should compile.
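As a rough sketch of the two ways this can be resolved, the block below uses a simplified Content type (the Posts constructor is omitted) and a hypothetical contentWith helper that is not part of the original code. The first parser fixes the type to String. The second stays generic by letting the caller supply the conversion from String to a.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Simplified stand-in for the question's Content type.
data Content a = Content a deriving (Eq, Show)

-- One line of input, without its terminating newline.
line :: Parser String
line = manyTill anyChar newline

-- Option 1: say what is really produced, a Content String.
content :: Parser (Content String)
content = Content . unlines <$> manyTill line newline

-- Option 2 (hypothetical helper): remain polymorphic in `a`, with the
-- caller deciding how a String becomes an `a`.
contentWith :: (String -> a) -> Parser (Content a)
contentWith convert = Content . convert . unlines <$> manyTill line newline
```

In the second version the signature is honest about being generic, because nothing inside the parser commits to a particular a. The choice is made wherever contentWith is called, for example contentWith id for strings or contentWith length for a character count.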
I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all my as, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point.
It's fine to declare Content to be generic, and in many cases it is exactly the right way to solve the problem. The issue is that while your container is generic, whenever you fill it with something concrete, the type variables also have to be concrete. In particular:
> :t Container (1 :: Int)
Container 1 :: Container Int
> :t Container "test"
Container "test" :: Container String
> :t Container (Container "test")
Container (Container "test") :: Container (Container String)
Notice how all of these have their types inferred without any type variables left. You can use the container to hold whatever you want, you just have to make sure that you're accurately telling the compiler what it is.
I have an abstract syntax tree in Haskell made with Parsec. I want to be able to query its structure while traversing it at the same time in order to translate it into intermediate code. For example, I need to know how many parameters any given function of my AST takes in order to make this translation. What I am currently doing is passing in the AST to every single function so I can call it whenever I need to do a lookup, and I have helper functions in another file to do the lookups for me. This is polluting my type signatures, especially when I begin to add more things like an accumulator.
Instead of passing in the AST to every function I've heard this would be a good job for the Reader Monad (for state that doesn't change, the AST) and the State Monad (for state that does change, the accumulator).
How can I take the ast out of the IO monad (gulp) and use it in a Reader Monad to do global lookups?
main = do
  putStrLn "Please enter the name of your jack file (i.e. Main)"
  fileName <- getLine
  file <- readFile (fileName++".jack")
  let ast = parseString file
  writeFile (fileName++".xml") (toClass ast) --I need to query this globally
  putStrLn $ "Completed Parsing, " ++ fileName ++ ".vm created..."
type VM = String
toClass :: Jack -> VM
toClass c = case c of
  (Class ident decs) ->
    toDecs decs
toDecs ::[Declaration] -> VM -- I don't want to add the ast in every function arg...
toDecs [] = ""
toDecs (x:xs) = case x of
  (SubDec keyword typ subname params subbody) ->
    case keyword of
      "constructor" -> --use the above ast to query the # of local variables here...
        toSubBody subbody ++
        toDecs xs
      otherwise -> []
UPDATE on Reader Monad progress:
I have transformed the above example into something like this (see below). But now I'm wondering: due to all this accumulation of string output, should I use a Writer monad as well? And if so, how should I go about composing the two? Should ReaderT wrap Writer, or vice versa? Should I make a type that just accepts a Reader and a Writer without attempting to compose them as a monad transformer?
main = do
  putStrLn "Please enter the name of your jack file (i.e. Main)"
  fileName <- getLine
  file <- readFile (fileName++".jack")
  writeFile (fileName++".xml") (runReader toClass $ parseString file)
  putStrLn $ "Completed Parsing, " ++ fileName ++ ".xml created..."
toClass = do
  env <- ask
  case env of Class ident decs -> return $ toDecs decs env
toDecs [] = return ""
toDecs ((SubDec keyword typ subname params subbody):xs) = do
  env <- ask
  res <- (case keyword of
            "method" -> do return "push this 0\n"
            "constructor" -> do return "pop pointer 0\nMemory.alloc 1\n"
            otherwise -> do return "")
  return $ res ++ toSubBody subbody env ++ toDecs xs env
toDecs (_:xs) = do
  decs <- ask
  return $ toDecs xs decs
toSubBody (SubBodyStatement states) = do
  return $ toStatement states
toSubBody (SubBody _ states) = do
  return $ toStatement states
http://hpaste.org/83595 --for declarations
Without knowing a bit more about the Jack and Declaration types it's hard to see how to transform it into a Reader monad. If the idea is to perform a "map" or a "fold" over something while having the ast :: Jack object in scope, you might write
f :: [Declaration] -> Reader Jack [Something]
f decls = mapM go decls where
  go :: Declaration -> Reader Jack Something
  go (SubDec keyword typ subname params subbody) =
    case keyword of
      "constructor" -> do
        ast <- ask
        return (doSomething subbody ast)
and then execute it in context with your ast as runReader (f decls) ast.
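Since the real Jack and Declaration types are not shown, the following is only a sketch with heavily simplified, hypothetical stand-ins for them. It illustrates the shape of the idea: every translation function runs in Reader Jack, looks things up in the AST via asks, and never takes the AST as an explicit argument.

```haskell
import Control.Monad.Reader (Reader, asks, runReader)

-- Hypothetical, simplified AST: a class is a list of subroutine
-- declarations, each with a keyword, a name and parameter names.
data Declaration = SubDec String String [String] deriving (Show)
newtype Jack = Class [Declaration] deriving (Show)

-- A global lookup: how many parameters does the named subroutine take?
paramCount :: String -> Jack -> Maybe Int
paramCount name (Class decs) =
  lookup name [(n, length ps) | SubDec _ n ps <- decs]

-- Translate one declaration, querying the whole AST from the environment.
toDec :: Declaration -> Reader Jack String
toDec (SubDec keyword name _) = do
  n <- asks (paramCount name)
  return (keyword ++ " " ++ name ++ " " ++ maybe "?" show n ++ "\n")

-- Translate the whole class; the AST is both the input and the environment.
toClass :: Reader Jack String
toClass = do
  decs <- asks (\(Class ds) -> ds)
  concat <$> mapM toDec decs
```

The result is obtained with runReader toClass ast, where ast is the value produced by the parser. No function other than the lookup helper mentions the AST in its argument list.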