I have a string, for example "MMMMABCNNNXYZPPPPP". I know that this string may have ABC in it and may have XYZ in it, but it is not required to have either. Additionally, the XYZ may be swapped for DEF (e.g. "MMMMABCNNNDEFPPPPP") and the behavior should remain the same.
I would like to parse the string and return the sequences between them, as well as which one of XYZ or DEF was present. Example:
data Divider1 = Abc
data Divider2 = Xyz | Def
--"MMMMABCNNNXYZPPPPP" should return ("MMMM", Just Abc, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNDEFPPPPP" should return ("MMMM", Just Abc, "NNN", Just Def, "PPPPP")
Note that if ABC is not present, I would like to return everything before divider2, and if neither XYZ nor DEF is present, I would like to return everything after divider1.
Example:
--"MMMMNNNXYZPPPPP" should return ("MMMM", Nothing, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNPPPPP" should return ("MMMM", Just Abc, "NNN", Nothing, "PPPPP")
If neither ABC nor XYZ is present then I don't care if it returns nothing, or if it returns the entire string.
Currently my code is
parseEverything = many $ satisfy someGeneralCondition -- check if all characters are valid
parseAbc = (\str -> Abc) <$> string "ABC"
parseXyz = (\str -> Xyz) <$> string "XYZ"
parseDef = (\str -> Def) <$> string "DEF"

parseFull = do
  beforeAbc <- gather parseEverything
  parseAbc <- (Just <$> parseAbc) <++ return Nothing
  beforeDivider2 <- gather parseEverything
  parseDivider2 <- (Just <$> parseXyz) <++ (Just <$> parseDef) <++ return Nothing
  everythingElse <- look
  return (beforeAbc, parseAbc, beforeDivider2, parseDivider2, everythingElse)
But when I run this on the example string "MMMMABCNNNXYZPPPPP", I get mostly failed results with just one result that I want. The problem is that I need to return everything in beforeAbc if parseAbc fails, but if parseAbc passes then I just need to return that. And the same thing with parseXyz and parseDef. I don't think that <++ is the correct operator to do this. I also tried a variant of this code using option, but it gave the same result. Is there a simple solution that I am missing, and/or should I set up the parsers in a different way?
Thanks in advance!
Edit: does this have anything to do with chainl or chainr or manyTill?
Updated: See note on applicative parsers below.
Here's what's going wrong with your current approach. As you undoubtedly know, the parsers in Text.ParserCombinators.ReadP generate all possible valid parses of all possible prefixes of the string. If you write a parser:
letterAndOther = do
  letters <- many (satisfy isLetter)
  others <- many get
  return (letters, others)
which grabs an initial string of letters followed by the "rest" of the string, and run it on a simple test string, you'll usually get way more than you bargained for:
> readP_to_S letterAndOther "abc"
[(("",""),"abc"),(("","a"),"bc"),(("a",""),"bc"),(("","ab"),"c"),
(("a","b"),"c"),(("ab",""),"c"),(("","abc"),""),(("a","bc"),""),
(("ab","c"),""),(("abc",""),"")]
In other words, in a do-block, each monadic action will typically generate a tree of possible parses. In your current code, the very first line of the do-block:
beforeAbc <- gather parseEverything
introduces a whole tree of parse branches, one branch for each possible initial prefix. These branches only get pruned if a later line of the do-block introduces a parse that fails. But, every line of your do-block represents a parser that always succeeds. For example, this always succeeds:
parseAbc <- (Just <$> parseAbc) <++ return Nothing
because even if the first divider isn't found, the right-hand side parser return Nothing will always succeed.
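You can watch this branch explosion directly. Here's a stripped-down parser of my own along the same lines (branchy is a hypothetical name, not from the question):

branchy :: ReadP (String, Maybe String)
branchy = do
  pre <- many get
  d   <- (Just <$> string "ABC") <++ return Nothing
  return (pre, d)

Running it gives one surviving result per prefix consumed by many get (the order in which readP_to_S lists them may vary):

> readP_to_S branchy "xABCy"
[(("",Nothing),"xABCy"),(("x",Just "ABC"),"y"),(("xA",Nothing),"BCy"),
 (("xAB",Nothing),"Cy"),(("xABC",Nothing),"y"),(("xABCy",Nothing),"")]

Only the pre = "x" branch finds the divider; every other branch falls through to return Nothing and survives anyway.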
I would suggest the following approach. First, as we discovered in the comments, the first thing you want to do is figure out what your parser should return. Instead of trying to shoehorn the result into a weird tuple, it's a good idea to leverage Haskell's best feature, its algebraic data types. Define a return type for your parse:
data Result
  = TwoDividers String Divider1 String Divider2 String
  | FirstDivider String Divider1 String
  | SecondDivider String Divider2 String
  | NoDividers String
This is unambiguous and covers all possibilities. Admittedly, including Divider1 in the first two constructors is redundant, since there's only one possible Divider1, but programs are for humans to read, too, and keeping Divider1 explicit improves readability.
Now, let's define parsers for the first and second dividers:
divider1 = Abc <$ string "ABC"
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
Note that I've chosen to define a single divider2 instead of separate parsers for Def and Xyz. Since, in your grammar, it's always the case that "DEF" can appear anywhere "XYZ" can and vice versa, it makes sense to combine them into one parser.
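A quick GHCi check of my own (assuming a derived Show instance, as in the full program at the end) shows that either divider is accepted:

> readP_to_S divider2 "DEFrest"
[(Def,"rest")]
> readP_to_S divider2 "XYZrest"
[(Xyz,"rest")]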
We'll also want a parser for arbitrary strings (basically your parseEverything):
anything = many $ satisfy isLetter -- valid characters
Now, let's write a parser for the full string. A key insight here is that we have four alternatives (i.e., the four constructors for our Result type). It's true that they share some structure, but a first crack at a parser can just treat them as independent alternatives. We'll use the <++ operator to choose the best match:
result =
  (TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
  <++ (FirstDivider <$> anything <*> divider1 <*> anything)
  <++ (SecondDivider <$> anything <*> divider2 <*> anything)
  <++ (NoDividers <$> anything)
A quick test of this will show we've forgotten something:
> readP_to_S result "MMMMABCNNNXYZPPPPP"
[(TwoDividers "MMMM" Abc "NNN" Xyz "","PPPPP"),...]
By default, the parser combinators will try every possible prefix of the input string, leaving more for later parsers. So, we should wrap this up in a final parser function that checks for the end-of-string:
parseResult = readP_to_S (result <* eof)
and with the tests:
main = mapM_ (print . parseResult)
  [ "MMMMABCNNNXYZPPPPP"
  , "MMMMABCNNNDEFPPPPP"
  , "MMMMNNNXYZPPPPP"
  , "MMMMABCNNNPPPPP"
  ]
we get the expected unique parsed output:
[(TwoDividers "MMMM" Abc "NNN" Xyz "PPPPP","")]
[(TwoDividers "MMMM" Abc "NNN" Def "PPPPP","")]
[(SecondDivider "MMMMNNN" Xyz "PPPPP","")]
[(FirstDivider "MMMM" Abc "NNNPPPPP","")]
Note on Applicative Parsers. I've used applicative syntax here, rather than the monad syntax. The difference isn't purely syntactical -- you can always write an applicative expression in monadic form, but there are monadic operations that can't be expressed applicatively, so the monadic syntax is strictly more powerful. However, when an expression can be written both ways, often the applicative syntax is more succinct and easier to write and understand, at least once you get used to it.
In a nutshell, the expression p <*> x <*> y <*> z creates a new parser that applies the parsers p, x, y, and z in order, and then applies the result from parser p (which needs to be a function f) to the results from the rest of the parsers (which must be appropriate arguments for f). In many cases, the function f is a known function and doesn't need to be returned by a parser, so a common variant is to write f <$> x <*> y <*> z. This applies the parsers x, y, and z in order, and then applies f (given directly instead of returned by a parser) to the results from those parsers. For example, the expression:
FirstDivider <$> anything <*> divider1 <*> anything
runs three parsers in order to get anything, followed by a divider1, followed by anything, and then applies the function/constructor FirstDivider to the three arguments resulting from those parsers.
The operators <* and *> can be thought of as variants of <*>. The expression p <*> x first parses p, then parses x, then applies the result of the former to the latter. The expression p <* x first parses p, then parses x, but instead of applying the former to the latter, it returns the value the arrow is pointing to (i.e., whatever p produced) and throws away the other value. Similarly p *> x parses p then parses x, then returns whatever x produced. In particular:
someParser <* eof
first runs someParser, then parses (i.e., checks for) EOF, then returns whatever someParser produced.
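A small GHCi illustration of my own, again with ReadP:

> readP_to_S (string "AB" <* string "CD") "ABCD"
[("AB","")]
> readP_to_S (string "AB" *> string "CD") "ABCD"
[("CD","")]

Both parsers consume the full input; they differ only in which result they keep.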
This syntax can really shine when parsing more traditional languages into an abstract syntax tree. If you want to parse statements like:
let x = 1 + 5
into a Statement type like:
data Statement = ... | Let Var Expr | ...
you can write a Parsec parser that looks like:
statement = ...
  <|> Let <$ string "let" <*> var <* symbol "=" <*> expr
  ...
The monadic equivalent in do-notation looks like this:
do string "let"
   v <- var
   symbol "="
   e <- expr
   return $ Let v e
which is fine, I suppose, but kind of obscures the simple structure of the parse. The applicative version is basically just the list of tokens to parse, with a little bit of syntactic sugar sprinkled in.
Anyway, here's the full program:
import Data.Char
import Text.ParserCombinators.ReadP

data Divider1 = Abc deriving (Show)
data Divider2 = Xyz | Def deriving (Show)

data Result
  = TwoDividers String Divider1 String Divider2 String
  | FirstDivider String Divider1 String
  | SecondDivider String Divider2 String
  | NoDividers String
  deriving (Show)

anything :: ReadP String
anything = many $ satisfy isLetter -- valid characters

divider1 :: ReadP Divider1
divider1 = Abc <$ string "ABC"

divider2 :: ReadP Divider2
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")

result :: ReadP Result
result =
  (TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
  <++ (FirstDivider <$> anything <*> divider1 <*> anything)
  <++ (SecondDivider <$> anything <*> divider2 <*> anything)
  <++ (NoDividers <$> anything)

parseResult :: String -> [(Result, String)]
parseResult = readP_to_S (result <* eof)

main :: IO ()
main = mapM_ (print . parseResult)
  [ "MMMMABCNNNXYZPPPPP"
  , "MMMMABCNNNDEFPPPPP"
  , "MMMMNNNXYZPPPPP"
  , "MMMMABCNNNPPPPP"
  ]
I'm trying to use parsec to read a C/C++/java source file and do a series of transformations on the entire file. The first phase removes strings and the second phase removes comments. (That's because you might get a /* inside a string.)
So each phase transforms a String into an Either ParseError String, and I want to bind (in the sense of Either) the phases together to make a pipeline of transformations of the whole file. This seems like a fairly general requirement.
import Text.ParserCombinators.Parsec
commentless, stringless :: Parser String
stringless = fmap concat ( (many (noneOf "\"")) `sepBy` quotedString )
quotedString = (char '"') >> (many quotedChar) >> (char '"')
quotedChar = try (string "\\\"" >> return '"' ) <|> (noneOf "\"")
commentless = fmap concat $ notComment `sepBy` comment
notComment = manyTill anyChar (lookAhead (comment <|> eof))
comment = (string "//" >> manyTill anyChar newline >> spaces >> return ())
      <|> (string "/*" >> manyTill anyChar (string "*/") >> spaces >> return ())
main = do
  c <- getContents
  case parse commentless "(stdin)" c of -- THIS WORKS
  -- case parse stringless "(stdin)" c of -- THIS WORKS TOO
  -- case parse (stringless `THISISWHATIWANT` commentless) "(stdin)" c of
    Left e -> do putStrLn "Error parsing input:"
                 print e
    Right r -> print r
So how can I do this? I tried parserBind but it didn't work.
(In case anybody cares why, I'm trying to do a kind of light parse where I just extract what I want but avoid parsing the entire grammar or even knowing whether it's C++ or Java. All I need to extract is the starting and ending line numbers of all classes and functions. So I envisage a bunch of preprocessing phases that just scrub out comments, #defines/ifdefs, template preambles and contents of parentheses (because of the semicolons in for clauses), then I'll parse for snippets preceding {s (or following }s because of typedefs) and stuff those snippets through yet another phase to get the type and name of whatever it is, then recurse to just the second level to get java member functions.)
You need to bind the Either values, not the Parsers. Move the bind outside the parse, and use multiple parses:
parse stringless "(stdin)" input >>= parse commentless "(stdin)"
There is probably a better approach than what you are using, but this will do what you want.
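If you later add more phases, Kleisli composition (>=>) from Control.Monad expresses the same pipeline more directly. A sketch of my own under that assumption (the name pipeline is mine, not from the question):

import Control.Monad ((>=>))

-- each phase has type String -> Either ParseError String,
-- so the phases compose left to right with >=>
pipeline :: String -> Either ParseError String
pipeline = parse stringless "(stdin)" >=> parse commentless "(stdin)"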
Well in fact I'm pretty sure I'm using the wrong terminology. Here is the problem I want to solve: a parser for the markdown format, well a subset of it.
My problem is with the blockquote feature. Each line in a blockquote starts with >; otherwise everything is the normal structure in a markdown file.
You can't look at individual lines separately, because you need to separate paragraphs from normal lines, eg
> a
> b
is not the same as
> a
>
> b
and things like that (likewise, if a list is blockquoted, you don't want x lists but one list with x elements). A natural and trivial way is to "take off" the > signs, parse the blockquote on its own, ignoring anything around it, wrap it with a BlockQuote type constructor, put that in the outer AST, and resume parsing of the original input. It's what pandoc does, if I'm not wrong:
https://hackage.haskell.org/package/pandoc-1.14.0.4/docs/src/Text-Pandoc-Readers-Markdown.html#blockQuote
blockQuote :: MarkdownParser (F Blocks)
blockQuote = do
  raw <- emailBlockQuote
  -- parse the extracted block, which may contain various block elements:
  contents <- parseFromString parseBlocks $ (intercalate "\n" raw) ++ "\n\n"
  return $ B.blockQuote <$> contents
And then:
http://hackage.haskell.org/package/pandoc-1.5.1/docs/src/Text-Pandoc-Shared.html#parseFromString
-- | Parse contents of 'str' using 'parser' and return result.
parseFromString :: GenParser tok st a -> [tok] -> GenParser tok st a
parseFromString parser str = do
  oldPos <- getPosition
  oldInput <- getInput
  setInput str
  result <- parser
  setInput oldInput
  setPosition oldPos
  return result
Now parseFromString looks quite hacky to me, and besides, it's Parsec, not attoparsec, so I can't use it in my project. I'm not sure how I could take that Text from the blockquote, parse it, and return the parsing result so that it "fits" within the current parsing. Seems impossible?
I've been googling on the issue and I think that pipes-parse and conduit can help on that area although I struggle to find examples and what I see appears considerably less nice to look at than "pure" parsec/attoparsec parsers.
Other options to parse blockquotes would be to rewrite the usual parsers but with the > catch, which means complicating and duplicating a lot; or to parse blockquotes counting each line separately and write some messy "merge" function; or to parse to a first AST that would contain the blockquotes as Text inside a BlockquoteText type constructor, waiting for a transformation in which they would be parsed separately. The last is not very elegant, but it has the benefit of simplicity, which does count for something.
I would probably go for the latter, but surely there's a better way?
I have asked myself the same question. Why is there no standard combinator for nested parsers like the one you describe? My default mode is to trust the package author, especially when that author also co-wrote "Real World Haskell". If such an obvious capability is missing, perhaps it is by design and I should look for a better way. However, I've managed to convince myself that such a convenient combinator is mostly harmless, and useful whenever an all-or-nothing parser is appropriate for the inner parse.
Implementation
import Data.Attoparsec.Text
import qualified Data.Text as T
import Data.Text(Text)
import Control.Applicative
I've divided the required functionality into two parsers. The first, constP, performs an "in place" parse of some given text. It replaces the constant parser's fail with empty (from Alternative), but otherwise has no side effects.
constP :: Parser a -> Text -> Parser a
constP p t = case parseOnly p t of
  Left _  -> empty
  Right a -> return a
The second part comes from parseOf, which performs the constant, inner parse based on the result of the outer parse. The empty alternative here allows a failed parse to return without consuming any input.
parseOf :: Parser Text -> Parser a -> Parser a
parseOf ptxt pa = bothParse <|> empty
  where
    bothParse = ptxt >>= constP pa
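As a quick sanity check of my own (with OverloadedStrings enabled) that the nesting behaves as expected:

> parseOnly (parseOf (string "ab") (char 'a' *> char 'b')) "ab"
Right 'b'

The outer parse recognizes the text "ab", and the inner parser is then run on exactly that text.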
The block quote markdown parser can now be written in the desired fashion. This implementation requires the resulting block to be totally parsed.
blockQuoteMarkdown :: Parser [[Double]]
blockQuoteMarkdown = parseOf blockQuote (markdownSurrogate <* endOfInput)
Instead of the actual markdown parser, I just implemented a quick parser of space-separated doubles. The complication in the parser comes from allowing the last non-empty line to end either with or without a newline.
markdownSurrogate :: Parser [[Double]]
markdownSurrogate = do
  lns <- many (mdLine <* endOfLine)
  option lns ((lns ++) . pure <$> mdLine1)
  where
    mdLine  = sepBy double (satisfy (==' '))
    mdLine1 = sepBy1 double (satisfy (==' '))
These two parsers are responsible for returning the text internal to block quotes.
blockQuote :: Parser Text
blockQuote = T.unlines <$> many blockLine
blockLine :: Parser Text
blockLine = char '>' *> takeTill isEndOfLine <* endOfLine
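A quick GHCi check of my own (again with OverloadedStrings) that the markers are stripped:

> parseOnly blockQuote ">a\n>b\n"
Right "a\nb\n"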
Finally, a test of the parser.
parseMain :: IO ()
parseMain = do
  putStrLn ""
  doParse "a" markdownSurrogate a
  doParse "_" markdownSurrogate ""
  doParse "b" markdownSurrogate b
  doParse "ab" markdownSurrogate ab
  doParse "a_b" markdownSurrogate a_b
  doParse "badMarkdown x" markdownSurrogate x
  doParse "badMarkdown axb" markdownSurrogate axb
  putStrLn ""
  doParse "BlockQuote ab" blockQuoteMarkdown $ toBlockQuote ab
  doParse "BlockQuote a_b" blockQuoteMarkdown $ toBlockQuote a_b
  doParse "BlockQuote axb" blockQuoteMarkdown $ toBlockQuote axb
  where
    a = "7 3 1"
    b = "4 4 4"
    x = "a b c"
    ab  = T.unlines [a,b]
    a_b = T.unlines [a,"",b]
    axb = T.unlines [a,x,b]
    doParse desc p str = do
      print $ T.concat ["Parsing ",desc,": \"",str,"\""]
      let i = parse (p <* endOfInput) str
      print $ feed i ""
    toBlockQuote = T.unlines
                 . map (T.cons '>')
                 . T.lines
*Main> parseMain
"Parsing a: \"7 3 1\""
Done "" [[7.0,3.0,1.0]]
"Parsing _: \"\""
Done "" []
"Parsing b: \"4 4 4\""
Done "" [[4.0,4.0,4.0]]
"Parsing ab: \"7 3 1\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing a_b: \"7 3 1\n\n4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing badMarkdown x: \"a b c\""
Fail "a b c" [] "endOfInput"
"Parsing badMarkdown axb: \"7 3 1\na b c\n4 4 4\n\""
Fail "a b c\n4 4 4\n" [] "endOfInput"
"Parsing BlockQuote ab: \">7 3 1\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[4.0,4.0,4.0]]
"Parsing BlockQuote a_b: \">7 3 1\n>\n>4 4 4\n\""
Done "" [[7.0,3.0,1.0],[],[4.0,4.0,4.0]]
"Parsing BlockQuote axb: \">7 3 1\n>a b c\n>4 4 4\n\""
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
Discussion
The notable difference comes in the semantics of failure. For instance, when parsing axb and blockquoted axb, which are the following two strings, respectively
7 3 1
a b c
4 4 4
and
>7 3 1
>a b c
>4 4 4
the markdown parse results in
Fail "a b c\n4 4 4\n" [] "endOfInput"
whereas the quoted results in
Fail ">7 3 1\n>a b c\n>4 4 4\n" [] "Failed reading: empty"
The markdown consumes "7 3 1\n", but this is nowhere reported in the quoted failure. Instead, fail becomes all or nothing.
Likewise, there is no allowance for handling unparsed text in the case of partial success. But I don't see a need for this, given the use case. For example, if a parse looked something like the following
"{ <tok> unhandled }more to parse"
where {} denotes the recognized block quote context, and <tok> is parsed within that inner context. A partial success then would have to lift "unhandled" out of that block quote context and somehow combine it with "more to parse".
I see no general way to do this, but it is allowed through the choice of inner parser return type, for instance with a parser parseOf blockP innP :: Parser (<tok>, Maybe Text). However, if this need arises I would expect that there is a better way to handle the situation than with nested parsers.
There may also be concerns about the loss of attoparsec's Partial parsing. That is, the implementation of constP uses parseOnly, which collapses the Fail and Partial results into a single Left failure state. In other words, we lose the ability to feed our inner parser with more text as it becomes available. However, note that the text to parse is itself the result of an outer parse; it will only be available after enough text has been fed to the outer parse. So this shouldn't be an issue either.
I'm learning Haskell, and as an exercise I'm trying to convert the following code (the read_from function and its helpers, taken from Peter Norvig's Scheme interpreter) to Haskell.
Is there a straightforward way to do this?
def read(s):
    "Read a Scheme expression from a string."
    return read_from(tokenize(s))

parse = read

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
  = SInt Int
  | SFloat Float
  | SSymbol String
  | SList [SchemeExpr]
  deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x

tokenize s = s |> replace "(" " ( "
               |> replace ")" " ) "
               |> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)

replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
                      |> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
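For instance, a GHCi check of my own using the definitions above, where replace "(" " ( " is exactly such a partial application waiting for its final string argument:

> "a(b)c" |> replace "(" "[" |> replace ")" "]"
"a[b]c"
> tokenize "(+ 1 2)"
["(","+","1","2",")"]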
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
                |> tryReadInto SFloat
                |> orElse (SSymbol s)
Haskell doesn't have the automagical coercion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
-- readMay comes from the "safe" package; readMaybe from Text.Read works too
import Safe (readMay)

tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
  Just x  -> Right (f x)
  Nothing -> Left s

orElse :: a -> Either err a -> a
orElse a (Left _)  = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueErrors in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
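A quick GHCi check of my own:

> atom "42"
SInt 42
> atom "3.14"
SFloat 3.14
> atom "foo"
SSymbol "foo"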
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)

raise = lift . Left

whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
  b <- mb
  if b
    then m >> whileM mb m
    else return ()
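To see whileM in isolation, here's a tiny countdown of my own in the same kind of monad stack (countdown is a hypothetical name):

countdown :: StateT Int (Either e) ()
countdown = whileM ((> 0) `fmap` get) (modify (subtract 1))

> execStateT countdown 3 :: Either () Int
Right 0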
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
  deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
  tokens <- get
  case tokens of
    [] -> raise (SyntaxError "unexpected EOF while reading")
    (token:tokens') -> do
      put tokens' -- here we overwrite the state with the "rest" of the tokens
      case token of
        "(" -> (SList . reverse) `fmap` execStateT readWithList []
        ")" -> raise (SyntaxError "unexpected close paren")
        _   -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code, because I want you to see the type signature. This portion of code introduces a new scope, so we simply layer another StateT on top of the monad stack that we had before. Now, the get, put, and modify operations refer to the thing called L in the Python code. If we want to perform these operations on the tokens, then we can simply preface the operation with lift in order to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
  whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
    innerExpr <- lift readFrom'
    modify (innerExpr:)
  lift $ modify (drop 1) -- this seems to be missing from the Python
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
Here is the complete file: http://hpaste.org/77852
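As a quick end-to-end check of my own:

> parse "(1 2.5 foo)"
Right (SList [SInt 1,SFloat 2.5,SSymbol "foo"])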
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. The solution below uses the MissingH package, because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)

data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)

type ParserStack = (Tree Atom, Tree Atom)

tokenize = words . replace "(" " ( " . replace ")" " ) "

atom :: String -> Atom
atom tok =
  case reads tok :: [(Int, String)] of
    [(int, _)] -> NInt int
    _ -> case reads tok :: [(Double, String)] of
      [(dbl, _)] -> NDouble dbl
      _ -> Sym tok

empty = Node $ Sym "dummy"
para  = Node Para

parseToken (Node _ stack, Node _ out) "(" =
  (empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
  (empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
  (stack, empty $ out ++ [Node (atom tok) []])

main = do
  (file:_) <- getArgs
  contents <- readFile file
  let tokens = tokenize contents
      parseStack = foldl parseToken (empty [], empty []) tokens
      schemeTree = head $ subForest $ snd parseStack
  putStrLn $ drawTree $ fmap show schemeTree
foldl is the Haskeller's basic structured recursion tool, and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above back to Python:
from pprint import pprint
from sys import argv

def atom(tok):
    try:
        return 'int', int(tok)
    except ValueError:
        try:
            return 'float', float(tok)
        except ValueError:
            return 'sym', tok

def tokenize(s):
    return s.replace('(',' ( ').replace(')',' ) ').split()

def handle_tok((stack, out), tok):
    if tok == '(':
        return stack + [out], []
    if tok == ')':
        return stack[:-1], stack[-1] + [out]
    return stack, out + [atom(tok)]

if __name__ == '__main__':
    tokens = tokenize(open(argv[1]).read())
    tree = reduce(handle_tok, tokens, ([], []))[1][0]
    pprint(tree)
I'm currently using the Text.Parsec.Expr module to parse a subset of a scripting language.
Basically, there are two kinds of commands in this language: Assignment of the form $var = expr and a Command of the form $var = $array[$index] - there are of course other commands, but this suffices to explain my problem.
I've created a type Command, to represent this, along with corresponding parsers, where expr for the assignment is handled by Parsec's buildExpressionParser.
Now, the problem. First the parsing code:
main = case parse p "" "$c = $a[$b]" of
    Left err -> putStrLn . show $ err
    Right r -> putStrLn . show $ r
  where p = (try assignment <|> command) <* eof -- (1)
The whole code (50 lines) is pasted here: Link (it should compile if you have parsec installed).
The problem is, that parsing fails, since assignment doesn't successfully parse, even though there is a try before. Reversing the parsing order (try command <|> assignment) solves the problem, but is not possible in my case.
Of course I tried to locate the problem further, and it appears to me that the problem is the expression parser (built by buildExpressionParser), since parsing succeeds if I say expr = fail "". However, I can't find anything in the Parsec sources that would explain this behaviour.
Your parser fails because assignment in fact does succeed here, consuming $c = $a (try it with plain where p = assignment). Then eof (or the rest of the expr from assignment) is expected, hence the error. The beginning of your command is identical to your assignment in the case where assignment's right-hand side is just a var (like $c = $a).
Not sure why you can't reverse command and assignment, but another way to make this particular example work would be:
main = case parse p "" "$c = $a[$b]" of
    Left err -> putStrLn . show $ err
    Right r -> putStrLn . show $ r
  where p = try (assignment <* eof) <|> (command <* eof)