I'm trying to use parsec to read a C/C++/java source file and do a series of transformations on the entire file. The first phase removes strings and the second phase removes comments. (That's because you might get a /* inside a string.)
So each phase transforms a string onto Either String Error, and I want to bind (in the sense of Either) them together to make a pipeline of transformations of the whole file. This seems like a fairly general requirement.
import Text.ParserCombinators.Parsec
commentless, stringless :: Parser String
stringless = fmap concat ( (many (noneOf "\"")) `sepBy` quotedString )
quotedString = (char '"') >> (many quotedChar) >> (char '"')
quotedChar = try (string "\\\"" >> return '"' ) <|> (noneOf "\"")
commentless = fmap concat $ notComment `sepBy` comment
notComment = manyTill anyChar (lookAhead (comment <|> eof))
comment = (string "//" >> manyTill anyChar newline >> spaces >> return ())
<|> (string "/*" >> manyTill anyChar (string "*/") >> spaces >> return ())
main =
do c <- getContents
case parse commentless "(stdin)" c of -- THIS WORKS
-- case parse stringless "(stdin)" c of -- THIS WORKS TOO
-- case parse (stringless `THISISWHATIWANT` commentless) "(stdin)" c of
Left e -> do putStrLn "Error parsing input:"
print e
Right r -> print r
So how can I do this? I tried parserBind but it didn't work.
(In case anybody cares why, I'm trying to do a kind of light parse where I just extract what I want but avoid parsing the entire grammar or even knowing whether it's C++ or Java. All I need to extract is the starting and ending line numbers of all classes and functions. So I envisage a bunch of preprocessing phases that just scrub out comments, #defines/ifdefs, template preambles and contents of parentheses (because of the semicolons in for clauses), then I'll parse for snippets preceding {s (or following }s because of typedefs) and stuff those snippets through yet another phase to get the type and name of whatever it is, then recurse to just the second level to get java member functions.)
You need to bind Either Error, not Parser. You need to move the bind outside the parse, and use multiple parses:
parse stringless "(stdin)" input >>= parse commentless "(stdin)"
There is probably a better approach than what you are using, but this will do what you want.
I'm using Parsec to parse some expressions (see this question for more context), and most relevant part of my code is:
statement :: Parser Stmt
statement = assignStmt <|> simpleStmt
assignStmt :: Parser Stmt
assignStmt =
do var <- identifier
reservedOp "="
expr <- expression
return $ Assign var expr
simpleStmt :: Parser Stmt
simpleStmt =
do expr <- expression
return $ Simple expr
In action:
boobla> foo = 100 + ~100
167
boobla> foo
parser error: (line 1, column 4):
unexpected end of input
expecting letter or digit or "="
Second expression should have evaluated to 167, value of foo.
I thought that when Parsec would try to extract token reservedOp "=", it should have failed because there is no such token in the string, then it was to try second function simpleStmt and succeed with it. But it works differently: it expects more input and just throws this exception.
What should I use to make assignStmt fail if there is no more characters in the string (or in current line). foo = 10 should be parsed with assignStmt and foo should be parsed with simpleStmt.
You're missing the try function.
By default, the <|> operator will try the left parser, and if it fails without consuming any characters, it will try the right parser.
However, if — as in your case — the parser fails after having consumed some characters, and the right parser is never tried. Note that this is often the behavior you want; if you had something like
parseForLoop <|> parseWhileLoop
then if the input is something like "for break", then that's not a valid for-loop, and there's no point trying to parse it as a while-loop, since that will surely fail also.
The try combinator changes this behaviour. Specifically, it makes a failed parser appear to have consumed no input. (This has a space penalty; the input could have been thrown away, but try makes it hang around.)
I have been trying to create a parser using details from the following tutorial
much of the code is copied directly from the tutorial with only a few names changed.
import qualified Text.ParserCombinators.Parsec.Token as P
reserved = P.reserved lexer
integer = P.integer lexer
whiteSpace = P.whiteSpace lexer
identifier = P.identifier lexer
data Express = Seq [Express]
| ID String
| Num Integer
| BoolConst Bool
deriving (Show)
whileParser :: Parser Express
whileParser = whiteSpace >> expr7
expr7 = seqOfStmt
<|> expr8
seqOfStmt =
do list <- (sepBy1 expr8 whiteSpace)
return $ if length list == 1 then head list else Seq list
expr8 :: Parser Express
expr8 = name
<|> number
<|> bTerm
name :: Parser Express
name = fmap ID identifier
number :: Parser Express
number = fmap Num integer
bTerm :: Parser Express
bTerm = (reserved "True" >> return (BoolConst True ))
<|> (reserved "False" >> return (BoolConst False))
I understand that this code might be laughable but I would really like to learn a bit more about where I'm going wrong. I also think that this should provide enough info but if not let me know.
Error:
parse error on input `return'
I believe that the error has something to do with different return types, which is strange because I have tried to use the tutorial at the start of the post as a basis for all that I am attempting.
Thanks in advance,
Seán
If you are not comfortable with the layout rules, you may also use different syntax:
seqOfStmt =
do { list
<- (sepBy1 expr8 whiteSpace);
return $ if length
list == 1
then head list else Seq list;}
The layout without braces and semicolons is regarded superior, though, for 2 reasons:
You don't need to type ugly ; and braces
It forces you to write (mostly) readable code, unlike the distorted crap I gave as example above.
And the rules are really easy:
Don't use tabs, use spaces. Always. (Your editor can do that, if not, throw it away, it's crapware.)
Things that belong together must be aligned in the same column.
For example, you have 2 statements that belong to the do block, hence they must be aligned in the same column. But you have aligned the return with the do, hence the compiler sees this as:
do { list <- sepBy1 expr8 whiteSpace; };
return $ ....;
but what you want is this:
do {
list <- sepBy1 ....;
return $ .....;
}
(Note that you can just leave out the braces and the semicolon and it will be ok as long as you leave the indentation intact.
I'm learning Haskell and as an exercise I'm trying to convert write the read_from function following code to Haskell. Taken from Peter Norvig's Scheme interpreter.
Is there a straightforward way do this?
def read(s):
"Read a Scheme expression from a string."
return read_from(tokenize(s))
parse = read
def tokenize(s):
"Convert a string into a list of tokens."
return s.replace('(',' ( ').replace(')',' ) ').split()
def read_from(tokens):
"Read an expression from a sequence of tokens."
if len(tokens) == 0:
raise SyntaxError('unexpected EOF while reading')
token = tokens.pop(0)
if '(' == token:
L = []
while tokens[0] != ')':
L.append(read_from(tokens))
tokens.pop(0) # pop off ')'
return L
elif ')' == token:
raise SyntaxError('unexpected )')
else:
return atom(token)
def atom(token):
"Numbers become numbers; every other token is a symbol."
try: return int(token)
except ValueError:
try: return float(token)
except ValueError:
return Symbol(token)
There is a straightforward way to "transliterate" Python into Haskell. This can be done by clever usage of monad transformers, which sounds scary, but it's really not. You see, due to purity, in Haskell when you want to use effects such as mutable state (e.g. the append and pop operations are performing mutation) or exceptions, you have to make it a little more explicit. Let's start at the top.
parse :: String -> SchemeExpr
parse s = readFrom (tokenize s)
The Python docstring said "Read a Scheme expression from a string", so I just took the liberty of encoding this as the type signature (String -> SchemeExpr). That docstring becomes obsolete because the type conveys the same information. Now... what is a SchemeExpr? According to your code, a scheme expression can be an int, float, symbol, or list of scheme expressions. Let's create a data type that represents these options.
data SchemeExpr
= SInt Int
| SFloat Float
| SSymbol String
| SList [SchemeExpr]
deriving (Eq, Show)
In order to tell Haskell that the Int we are dealing with should be treated as a SchemeExpr, we need to tag it with SInt. Likewise with the other possibilities. Let's move on to tokenize.
tokenize :: String -> [Token]
Again, the docstring turns into a type signature: turn a String into a list of Tokens. Well, what's a Token? If you look at the code, you'll notice that the left and right paren characters are apparently special tokens, which signal particular behaviors. Anything else is... unspecial. While we could create a data type to more clearly distinguish parens from other tokens, let's just use Strings, to stick a little closer to the original Python code.
type Token = String
Now let's try writing tokenize. First, let's write a quick little operator for making function chaining look a bit more like Python. In Haskell, you can define your own operators.
(|>) :: a -> (a -> b) -> b
x |> f = f x
tokenize s = s |> replace "(" " ( "
|> replace ")" " ) "
|> words
words is Haskell's version of split. However, Haskell has no pre-cooked version of replace that I know of. Here's one that should do the trick:
-- add imports to top of file
import Data.List.Split (splitOn)
import Data.List (intercalate)
replace :: String -> String -> String -> String
replace old new s = s |> splitOn old
|> intercalate new
If you read the docs for splitOn and intercalate, this simple algorithm should make perfect sense. Haskellers would typically write this as replace old new = intercalate new . splitOn old, but I used |> here for easier Python audience understanding.
Note that replace takes three arguments, but above I only invoked it with two. In Haskell you can partially apply any function, which is pretty neat. |> works sort of like the unix pipe, if you couldn't tell, except with more type safety.
Still with me? Let's skip over to atom. That nested logic is a bit ugly, so let's try a slightly different approach to clean it up. We'll use the Either type for a much nicer presentation.
atom :: Token -> SchemeExpr
atom s = Left s |> tryReadInto SInt
|> tryReadInto SFloat
|> orElse (SSymbol s)
Haskell doesn't have the automagical coersion functions int and float, so instead we will build tryReadInto. Here's how it works: we're going to thread Either values around. An Either value is either a Left or a Right. Conventionally, Left is used to signal error or failure, while Right signals success or completion. In Haskell, to simulate the Python-esque function call chaining, you just place the "self" argument as the last one.
tryReadInto :: Read a => (a -> b) -> Either String b -> Either String b
tryReadInto f (Right x) = Right x
tryReadInto f (Left s) = case readMay s of
Just x -> Right (f x)
Nothing -> Left s
orElse :: a -> Either err a -> a
orElse a (Left _) = a
orElse _ (Right a) = a
tryReadInto relies on type inference in order to determine which type it is trying to parse the string into. If the parse fails, it simply reproduces the same string in the Left position. If it succeeds, then it performs whatever function is desired and places the result in the Right position. orElse allows us to eliminate the Either by supplying a value in case the former computations failed. Can you see how Either acts as a replacement for exceptions here? Since the ValueExceptions in the Python code are always caught inside the function itself, we know that atom will never raise an exception. Similarly, in the Haskell code, even though we used Either on the inside of the function, the interface that we expose is pure: Token -> SchemeExpr, no outwardly-visible side effects.
OK, let's move on to read_from. First, ask yourself the question: what side effects does this function have? It mutates its argument tokens via pop, and it has internal mutation on the list named L. It also raises the SyntaxError exception. At this point, most Haskellers will be throwing up their hands saying "oh noes! side effects! gross!" But the truth is that Haskellers use side effects all the time as well. We just call them "monads" in order to scare people away and avoid success at all costs. Mutation can be accomplished with the State monad, and exceptions with the Either monad (surprise!). We will want to use both at the same time, so we'll in fact use "monad transformers", which I'll explain in a bit. It's not that scary, once you learn to see past the cruft.
First, a few utilities. These are just some simple plumbing operations. raise will let us "raise exceptions" as in Python, and whileM will let us write a while loop as in Python. For the latter, we simply have to make it explicit in what order the effects should happen: first perform the effect to compute the condition, then if it's True, perform the effects of the body and loop again.
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
raise = lift . Left
whileM :: Monad m => m Bool -> m () -> m ()
whileM mb m = do
b <- mb
if b
then m >> whileM mb m
else return ()
We again want to expose a pure interface. However, there is a chance that there will be a SyntaxError, so we will indicate in the type signature that the result will be either a SchemeExpr or a SyntaxError. This is reminiscent of how in Java you can annotate which exceptions a method will raise. Note that the type signature of parse has to change as well, since it might raise the SyntaxError.
data SyntaxError = SyntaxError String
deriving (Show)
parse :: String -> Either SyntaxError SchemeExpr
readFrom :: [Token] -> Either SyntaxError SchemeExpr
readFrom = evalStateT readFrom'
We are going to perform a stateful computation on the token list that is passed in. Unlike the Python, however, we are not going to be rude to the caller and mutate the very list passed to us. Instead, we will establish our own state space and initialize it to the token list we are given. We will use do notation, which provides syntactic sugar to make it look like we're programming imperatively. The StateT monad transformer gives us the get, put, and modify state operations.
readFrom' :: StateT [Token] (Either SyntaxError) SchemeExpr
readFrom' = do
tokens <- get
case tokens of
[] -> raise (SyntaxError "unexpected EOF while reading")
(token:tokens') -> do
put tokens' -- here we overwrite the state with the "rest" of the tokens
case token of
"(" -> (SList . reverse) `fmap` execStateT readWithList []
")" -> raise (SyntaxError "unexpected close paren")
_ -> return (atom token)
I've broken out the readWithList portion into a separate chunk of code,
because I want you to see the type signature. This portion of code introduces
a new scope, so we simply layer another StateT on top of the monad stack
that we had before. Now, the get, put, and modify operations refer
to the thing called L in the Python code. If we want to perform these operations
on the tokens, then we can simply preface the operation with lift in order
to strip away one layer of the monad stack.
readWithList :: StateT [SchemeExpr] (StateT [Token] (Either SyntaxError)) ()
readWithList = do
whileM ((\toks -> toks !! 0 /= ")") `fmap` lift get) $ do
innerExpr <- lift readFrom'
modify (innerExpr:)
lift $ modify (drop 1) -- this seems to be missing from the Python
In Haskell, appending to the end of a list is inefficient, so I instead prepended, and then reversed the list afterwards. If you are interested in performance, then there are better list-like data structures you can use.
Here is the complete file: http://hpaste.org/77852
So if you're new to Haskell, then this probably looks terrifying. My advice is to just give it some time. The Monad abstraction is not nearly as scary as people make it out to be. You just have to learn that what most languages have baked in (mutation, exceptions, etc), Haskell instead provides via libraries. In Haskell, you must explicitly specify which effects you want, and controlling those effects is a little less convenient. In exchange, however, Haskell provides more safety so you don't accidentally mix up the wrong effects, and more power, because you are in complete control of how to combine and refactor effects.
In Haskell, you wouldn't use an algorithm that mutates the data it operates on. So no, there is no straightforward way to do that. However, the code can be rewritten using recursion to avoid updating variables. Solution below uses the MissingH package because Haskell annoyingly doesn't have a replace function that works on strings.
import Data.String.Utils (replace)
import Data.Tree
import System.Environment (getArgs)
data Atom = Sym String | NInt Int | NDouble Double | Para deriving (Eq, Show)
type ParserStack = (Tree Atom, Tree Atom)
tokenize = words . replace "(" " ( " . replace ")" " ) "
atom :: String -> Atom
atom tok =
case reads tok :: [(Int, String)] of
[(int, _)] -> NInt int
_ -> case reads tok :: [(Double, String)] of
[(dbl, _)] -> NDouble dbl
_ -> Sym tok
empty = Node $ Sym "dummy"
para = Node Para
parseToken (Node _ stack, Node _ out) "(" =
(empty $ stack ++ [empty out], empty [])
parseToken (Node _ stack, Node _ out) ")" =
(empty $ init stack, empty $ (subForest (last stack)) ++ [para out])
parseToken (stack, Node _ out) tok =
(stack, empty $ out ++ [Node (atom tok) []])
main = do
(file:_) <- getArgs
contents <- readFile file
let tokens = tokenize contents
parseStack = foldl parseToken (empty [], empty []) tokens
schemeTree = head $ subForest $ snd parseStack
putStrLn $ drawTree $ fmap show schemeTree
foldl is the haskeller's basic structured recursion tool and it serves the same purpose as your while loop and recursive call to read_from. I think the code can be improved a lot, but I'm not so used to Haskell. Below is an almost straight transliteration of the above to Python:
from pprint import pprint
from sys import argv
def atom(tok):
try:
return 'int', int(tok)
except ValueError:
try:
return 'float', float(tok)
except ValueError:
return 'sym', tok
def tokenize(s):
return s.replace('(',' ( ').replace(')',' ) ').split()
def handle_tok((stack, out), tok):
if tok == '(':
return stack + [out], []
if tok == ')':
return stack[:-1], stack[-1] + [out]
return stack, out + [atom(tok)]
if __name__ == '__main__':
tokens = tokenize(open(argv[1]).read())
tree = reduce(handle_tok, tokens, ([], []))[1][0]
pprint(tree)
I am reading Programming In Haskell, in the 8th chapter, the author gives an example of writing parsers.
The full source is here: http://www.cs.nott.ac.uk/~gmh/Parsing.lhs
I can't understand the following part: many permits zero or more applications of p,
whereas many1 requires at least one successful application:
many :: Parser a → Parser [a ]
many p = many1 p +++ return [ ]
many1 :: Parser a → Parser [a ]
many1 p = do v ← p
vs ← many p
return (v : vs)
How the recursive call happens at
vs <- many p
vs is the result value of many p, but many p called many1 p, all many1 has in its definition is a do notation, and again has result value v, and vs, when does the recursive call return?
Why does the following snippet can return [("123","abc")] ?
> parse (many digit) "123abc"
[("123", "abc")]
The recursion stops at the v <- p line. The monadic behavior of the Parser will just propagate a [] to the end of the computation when p cannot be parsed anymore.
p >>= f = P (\inp -> case parse p inp of
[] -> [] -- this line here does not call f
[(v,out)] -> parse (f v) out)
The second function is written in do-notation, which is just a nice syntax for the following:
many1 p = p >>= (\v -> many p >>= (\vs -> return (v : vs)))
If parsing p produces an empty list [] the function \v -> many p >>= (\vs -> return (v : vs)) will not be called, stopping the recursion.
For the last question:
> parse (many digit) "123abc"
[("123", "abc")]
Means that parsing has been successful as at least one result has been returned in the answer list. Hutton parsers always return a list - the empty list means parsing failure.
The result ("123", "abc") means that parsing has found three digits "123" and stopped at 'a' which is not a digit - so the "rest of the input" is "abc".
Note that many means "as many as possibly" not "one or more". If it were "one or more" you'd get this result instead:
[("1", "23abc"), ("12", "3abc"), ("123", "abc")]
This behaviour wouldn't be very good for deterministic parsing, though it might sometimes be needed for natural language parsing.
Let me strip this down to the barest bones to make absolutely clear why do-blocks can be misunderstood if they're read simply as imperative code. Consider this snippet:
doStuff :: Maybe Int
doStuff = do
a <- Nothing
doStuff
It looks like doStuff will recurse forever, after all, it's defined to do a sequence of things ending with doStuff. But the sequence of lines in a do-block is not simply a sequence of operations that is performed in order. If you're at a point in a do-block, the way the rest of the block is handled is determined by the definition of >>=. In my example, the second argument to >>= is only used if the first argument isn't Nothing. So the recursion never happens.
Something similar can happen in many different monads. Your example is just a little more complex: when there are no more ways to parse something, the stuff after the >>= is ignored.