My use of Text.Parsec is a little rusty. If I just want to return the matched string is this idiomatic?
category :: Stream s m Char => ParsecT s u m [Char]
category = concat <$> (many1 $ (:) <$> char '/' <*> (many1 $ noneOf "/\n"))
I feel like there might be an existing operator for liftM concat . many1 or (:) <$> p1 <*> p2 that I'm overlooking, but I'm not sure.
That's fine, I think. A little judicious naming would make it prettier:
category = concat <$> many1 segment
  where
    segment = (:) <$> char '/' <*> many1 (noneOf "/\n")
I think it would be slightly more idiomatic use of Parsec to return something more structured, for example, the list of strings:
catList :: Parser [String]
catList = char '/' *> many1 alphaNum `sepBy1` char '/'
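A quick sanity check in GHCi (the sample path is mine, and the result shown is what I'd expect rather than a transcript from the thread):
ghci> parseTest catList "/usr/local/bin"
["usr","local","bin"]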
I don't think there's a combinator quite like the one you were hoping for, but this is Haskell, and rolling your own control structure or combinator is always an option:
concatMany1 :: Parser [a] -> Parser [a]
concatMany1 p = concat <$> many1 p
catConcat = concatMany1 $ (:) <$> char '/' <*> many1 alphaNum
But this next combinator is even nicer, and definitely idiomatic Haskell at least:
infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a]
hd <:> tl = (:) <$> hd <*> tl
So now we can write
catCons :: Parser String
catCons = concatMany1 (char '/' <:> many1 alphaNum)
but incidentally also
contrivedExample :: IO String
contrivedExample = getChar <:> getLine
moreContrived :: String -> Maybe String
moreContrived name = find isLetter name <:> lookup name symbolTable
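(To make the contrived examples actually compile, you'd need something like the following in scope; symbolTable here is entirely made up for illustration:)
import Data.Char (isLetter)
import Data.List (find)

symbolTable :: [(String, String)]
symbolTable = [("pi", "3.14159"), ("e", "2.71828")]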
noneOf
You'll notice I've used alphaNum where you used noneOf "/\n". I think noneOf is not good practice; parsers should be really careful to accept only the right thing. Are you absolutely sure you want your parser to accept /qwerty/12345/!"£$%^&*()#:?><.,#{}[] \/ "/" /-=_+~? Should it really be happy with /usr\local\bin?
As it stands, your parser accepts any string as long as it starts with / and ends before \n with something that's not /. I think you should rewrite it with alphaNum <|> oneOf "_-.',~+" or similar instead of using noneOf. Using noneOf lets you avoid thinking about what you should allow: it focuses on getting positive examples to parse, rather than on making sure that only the right examples parse.
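For example, a stricter version of the original parser along those lines might look like this sketch (strictCategory and the exact whitelist are mine, to be adjusted for your data):
-- assumes the usual imports: Text.Parsec plus Text.Parsec.String (Parser)
segmentChar :: Parser Char
segmentChar = alphaNum <|> oneOf "_-.',~+"

strictCategory :: Parser String
strictCategory = concat <$> many1 ((:) <$> char '/' <*> many1 segmentChar)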
Parser
I've also always gone for Parser a instead of Stream s m t => ParsecT s u m a. That's just lazy typing, but let's pretend I did it to make it clearer what my code was doing, shall we? :) Use what type signature suits you, of course.
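(Concretely, the Parser synonym lives in Text.Parsec.String, so a minimal self-contained version of the original parser with the simpler signature would be this sketch:)
import Text.Parsec
import Text.Parsec.String (Parser)  -- type Parser = Parsec String ()

category :: Parser String
category = concat <$> many1 ((:) <$> char '/' <*> many1 (noneOf "/\n"))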
Related
I am trying to distinguish between ints and floats in a parser. I have two parsers, one for ints and one for floats. However, I am having trouble getting the int parser to fail on a '.'. I looked into negation and lookahead and didn't get anywhere.
I hope I am not duplicating any questions.
I had it working by looking ahead at the next character and checking that it is not a '.', but that is an ugly solution.
EDIT: Added more code.
--Int--------------------------------------------------------------------
findInt :: Parser String
findInt = plus <|> minus <|> number

number :: Parser String
number = many1 digit

plus :: Parser String
plus = char '+' *> number

minus :: Parser String
minus = char '-' <:> number

makeInt :: Parser Int
makeInt = prepareResult (findInt <* many (noneOf ".") <* endOfLine)
  where readInt = read :: String -> Int
        prepareResult = liftA readInt

makeInt2 :: Parser Int
makeInt2 = do
    numberFound <- (findInt <* many (noneOf ".") <* endOfLine)
    match <- char '.'
    return (prepareResult numberFound)
  where readInt = read :: String -> Int
        prepareResult = readInt
--End Int----------------------------------------------------------------
I think you are best off actually combining the two parsers into one. Try something like this:
import Text.Parsec.String (Parser)
import Control.Applicative ((<|>))
import Text.Parsec.Char (char,digit)
import Text.Parsec.Combinator (many1,optionMaybe)
makeIntOrFloat :: Parser (Either Int Float)
makeIntOrFloat = do
  sign <- optionMaybe (char '-' <|> char '+')
  n <- many1 digit
  m <- optionMaybe (char '.' *> many1 digit)
  return $ case (m,sign) of
    (Nothing, Just '-') -> Left (negate (read n))
    (Nothing, _)        -> Left (read n)
    (Just m, Just '-')  -> Right (negate (read n + read m / 10.0^(length m)))
    (Just m, _)         -> Right (read n + read m / 10.0^(length m))
ErikR has a correct solution, but the use of try means that parsec has to keep track of the possibility of backtracking (which is a bit inefficient) when in fact that is unnecessary in this case.
Here, the key difference is that we can actually tell right away if we have a float or not - if we don't have a float, the char '.' *> many1 digit parser in optionMaybe will fail immediately (without consuming input), so there is no need to consider backtracking.
In GHCi:
ghci> import Text.Parsec.Prim
ghci> parseTest makeIntOrFloat "1234.012"
Right 1234.012
ghci> parseTest makeIntOrFloat "1234"
Left 1234
I would use notFollowedBy - e.g.:
import Text.Parsec
import Text.Parsec.String
import Text.Parsec.Combinator
int :: Parser String
int = many1 digit <* notFollowedBy (char '.')
float :: Parser (String,String)
float = do whole <- many1 digit
           fracpart <- try (char '.' *> many digit) <|> (return "")
           return (whole, fracpart)
intOrFloat :: Parser (Either String (String,String))
intOrFloat = try (fmap Left int) <|> (fmap Right float)
test1 = parseTest (intOrFloat <* eof) "123"
test2 = parseTest (intOrFloat <* eof) "123.456"
test3 = parseTest (intOrFloat <* eof) "123."
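To run the three checks in one go, a tiny driver like this works (the wrapper is mine, not part of the answer):
main :: IO ()
main = test1 >> test2 >> test3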
It is typically easiest to use applicative combinators to build your parsers; this makes them easier to reason about, and often you do not need the monadic or backtracking features of the parser at all.
For example, a parser for integers could be written as such:
import Text.Parsec hiding ((<|>), optional)
import Text.Parsec.String
import Numeric.Natural
import Control.Applicative
import Data.Foldable
natural :: Parser Natural
natural = read <$> many1 digit
sign :: Num a => Parser (a -> a)
sign = asum [ id <$ char '+'
            , negate <$ char '-'
            , pure id
            ]
integer :: Parser Integer
integer = sign <*> (fromIntegral <$> natural)
A decimal number is an integer optionally followed by a decimal portion (a '.' followed by another integer), which is itself a number proper, so your parser can be written as
decimalPart :: Parser Double
decimalPart = read . ("0."++) <$> (char '.' *> many1 digit)
integerOrDecimal :: Parser (Either Integer Double)
integerOrDecimal = liftA2 cmb integer (optional decimalPart) where
  cmb :: Integer -> Maybe Double -> Either Integer Double
  cmb x Nothing  = Left x
  cmb x (Just d) = Right (fromIntegral x + d)
The definition of cmb is obvious: if there is no decimal part, produce an Integer, and if there is, produce a Double by adding the integer part to the decimal part.
You can also define a parser for decimals in terms of the above:
decimal :: Parser Double
decimal = either fromIntegral id <$> integerOrDecimal
Note that none of the above parsers directly use monadic functions (i.e. >>=) or backtracking - making them simple and efficient.
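As a quick usage sketch (the inputs and the expected results in the comments are my own, not output from the thread):
main :: IO ()
main = do
  parseTest integerOrDecimal "42"    -- Left 42
  parseTest integerOrDecimal "3.5"   -- Right 3.5
  parseTest decimal "10"             -- 10.0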
Assume there is a data structure representing a text with comments inside.
data TWC
  = T Text TWC -- text
  | C Text TWC -- comment
  | E          -- end
  deriving Show
Thus a string like
"Text, {-comment-}, and something else"
could be encoded as
T "Text, " (C "comment" (T ", and something else" E))
Parsers for a comment chunk and for E are pretty trivial:
twcP :: Parser TWC
twcP = eP <|> cP <|> tP
cP :: Parser TWC
cP = do
  _ <- string "{-"
  c <- manyTill anyChar (string "-}")
  rest <- cP <|> tP <|> eP
  return (C (pack c) rest)

eP :: Parser TWC
eP = do
  endOfInput
  return E
Implementing the parser for a text chunk in the same trivial way
tP :: Parser TWC
tP = do
  t <- many1 anyChar
  rest <- cP <|> eP
  return (T (pack t) rest)
makes it consume the comment section as text, because of its greedy nature:
> parseOnly twcP "text{-comment-}"
Right (T "text{-comment-}" E)
it ∷ Either String TWC
So, the question is how to express the logic of parsing until the end of input or until a comment section. In other words, how do I implement a conditional lookahead parser?
You're right, the problematic code is the first line of tP, which parses text greedily without stopping at comments:
tP = do
  t <- many1 anyChar
Before addressing that, I first want to refactor your code a little to introduce helpers and use applicative style, with the problematic code isolated into the text helper:
-- Like manyTill, but pack the result to Text.
textTill :: Alternative f => f Char -> f b -> f Text
textTill p end = pack <$> manyTill p end
-- Parse one comment string
comment :: Parser Text
comment = string "{-" *> textTill anyChar (string "-}")
-- Parse one non-comment text string (problematic implementation)
text :: Parser Text
text = pack <$> many1 anyChar
-- TWC parsers:
twcP :: Parser TWC
twcP = eP <|> cP <|> tP
cP :: Parser TWC
cP = C <$> comment <*> twcP
eP :: Parser TWC
eP = E <$ endOfInput
tP :: Parser TWC
tP = T <$> text <*> twcP
To implement lookahead, we can use the lookAhead combinator, which applies a parser without consuming the input. That allows us to make text parse until it reaches either a comment (without consuming it), or endOfInput:
-- Parse one non-comment text string (working implementation)
text :: Parser Text
text = textTill anyChar (void (lookAhead comment) <|> endOfInput)
With that implementation, twcP behaves as expected:
ghci> parseOnly twcP "text{-comment-} post"
Right (T "text" (C "comment" (T " post" E)))
After having read Anthony's response on a style-related parser question, I was trying to convince myself that writing monadic parsers can still be rather compact.
So instead of
reference :: Parser Transc
reference = try $ do string "#{"
                     a <- number
                     char ','
                     b <- number
                     char ','
                     c <- number
                     char '}'
                     return $ Outside (a,b,c)
We can simply have:
reference3 :: Parser Transc
reference3 = liftM3 (((Outside .).) . (,,))
                    (string "#{" >> number <<! char ',')
                    number
                    (char ',' >> number <<! char '}')
  where (<<!) = liftM2 const
This is very similar to the applicative version provided by Anthony:
reference2 :: Parser Transc
reference2 = ((Outside .) .) . (,,)
             <$> (string "#{" *> number2 <* char ',')
             <*> number2
             <*> (char ',' *> number2 <* char '}')
...except for the <<! operator, which is conceptually similar to <*: it is defined as liftA2 const, meaning "sequence, but discard the second value and use the value provided on the left".
Of course, << would have been a bad name for liftM2 const; it would have suggested that << is equivalent to flip >>, by analogy with >>= and =<<.
I can't find "liftM2 const" under a single name. Is this because it is not that useful?
I don't quite see the problem. Every monad is also an applicative functor, so you can simply use (<*) in the monadic expressions as well.
(At the time of this answer (year 2011), Applicative was not a superclass of Monad, so it may have been necessary to add a corresponding class instance.)
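In other words, the single name you're after already exists as (<*) from Control.Applicative, and it can be used directly inside monadic code too. A sketch using the question's own number, Transc and Outside (reference4 is just a name I made up):
reference4 :: Parser Transc
reference4 = do a <- string "#{" *> number <* char ','
                b <- number
                c <- char ',' *> number <* char '}'
                return (Outside (a, b, c))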
Around 6 years ago, I benchmarked my own parser combinators in OCaml and found that they were ~5× slower than the parser generators on offer at the time. I recently revisited this subject and benchmarked Haskell's Parsec vs a simple hand-rolled precedence climbing parser written in F# and was surprised to find the F# to be 25× faster than the Haskell.
Here's the Haskell code I used to read a large mathematical expression from file, parse and evaluate it:
import Control.Applicative
import Text.Parsec hiding ((<|>))
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact = read <$> many1 digit <|> char '(' *> expr <* char ')'
eval :: String -> Int
eval = either (error . show) id . parse expr "" . filter (/= ' ')
main :: IO ()
main = do
  file <- readFile "expr"
  putStr $ show $ eval file
  putStr "\n"
and here's my self-contained precedence climbing parser in F#:
let rec (|Expr|) = function
  | P(f, xs) -> Expr(loop (' ', f, xs))
  | xs -> invalidArg "Expr" (sprintf "%A" xs)
and loop = function
  | ' ' as oop, f, ('+' | '-' as op)::P(g, xs)
  | (' ' | '+' | '-' as oop), f, ('*' | '/' as op)::P(g, xs) ->
      let h, xs = loop (op, g, xs)
      match op with
      | '+' -> (+) | '-' -> (-) | '*' -> (*) | '/' | _ -> (/)
      |> fun op -> loop (oop, op f h, xs)
  | _, f, xs -> f, xs
and (|P|_|) = function
  | '('::Expr(f, ')'::xs) -> Some(P(f, xs))
  | c::_ as xs when '0' <= c && c <= '9' ->
      let rec loop n = function
        | c2::xs when '0' <= c2 && c2 <= '9' -> loop (10*n + int(string c2)) xs
        | xs -> Some(P(n, xs))
      loop 0 xs
  | _ -> None
My impression is that even state-of-the-art parser combinators waste a lot of time backtracking. Is that correct? If so, is it possible to write parser combinators that generate state machines to obtain competitive performance, or is it necessary to use code generation?
EDIT:
Here's the OCaml script I used to generate a ~2Mb expression for benchmarking:
open Printf
let rec f ff n =
  if n=0 then fprintf ff "1" else
  fprintf ff "%a+%a*(%a-%a)" f (n-1) f (n-1) f (n-1) f (n-1)

let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 3 in
  fprintf stdout "%a\n" f n
I've come up with a Haskell solution that is 30× faster than the Haskell solution you posted (with my concocted test expression).
Major changes:
Change Parsec/String to Attoparsec/ByteString
In the fact function, change read & many1 digit to decimal
Made the chainl1 recursion strict (remove $! for the lazier version).
I tried to keep everything else you had as similar as possible.
import Control.Applicative
import Data.Attoparsec
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
expr :: Parser Int
expr = chainl1 term ((+) <$ char '+' <|> (-) <$ char '-')
term :: Parser Int
term = chainl1 fact ((*) <$ char '*' <|> div <$ char '/')
fact :: Parser Int
fact = decimal <|> char '(' *> expr <* char ')'
eval :: B.ByteString -> Int
eval = either (error . show) id . eitherResult . parse expr . B.filter (/= ' ')
chainl1 :: (Monad f, Alternative f) => f a -> f (a -> a -> a) -> f a
chainl1 p op = p >>= rest where
  rest x = do f <- op
              y <- p
              rest $! (f x y)
           <|> pure x
main :: IO ()
main = B.readFile "expr" >>= (print . eval)
I guess what I concluded from this is that the majority of the slowdown for the parser combinator was that it was sitting on an inefficient base, not that it was a parser combinator, per se.
I imagine with more time and profiling this could go faster, as I stopped when I went past the 25× mark.
I don't know if this would be faster than the precedence climbing parser ported to Haskell. Maybe that would be an interesting test?
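For what it's worth, here is roughly what a hand-rolled, non-combinator parser for the same grammar could look like in Haskell over a plain String. This is my own sketch (plain recursive descent rather than a faithful port of the F# precedence climber) and hasn't been benchmarked:
-- Each function returns the value parsed so far plus the remaining input.
parseExpr :: String -> (Int, String)
parseExpr s = go (parseTerm s)
  where
    go (x, '+':rest) = let (y, rest') = parseTerm rest in go (x + y, rest')
    go (x, '-':rest) = let (y, rest') = parseTerm rest in go (x - y, rest')
    go acc           = acc

parseTerm :: String -> (Int, String)
parseTerm s = go (parseFact s)
  where
    go (x, '*':rest) = let (y, rest') = parseFact rest in go (x * y, rest')
    go (x, '/':rest) = let (y, rest') = parseFact rest in go (x `div` y, rest')
    go acc           = acc

parseFact :: String -> (Int, String)
parseFact ('(':rest) =
  case parseExpr rest of
    (x, ')':rest') -> (x, rest')
    _              -> error "expected ')'"
parseFact s =
  let (ds, rest) = span (`elem` "0123456789") s
  in (read ds, rest)

-- Strip spaces and evaluate, mirroring eval in the Parsec version.
evalStr :: String -> Int
evalStr = fst . parseExpr . filter (/= ' ')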
I'm currently working on the next version of FParsec (v. 0.9), which will in many situations improve performance by up to a factor of 2 relative to the current version.
[Update: FParsec 0.9 has been released, see http://www.quanttec.com/fparsec ]
I've tested Jon's F# parser implementation against two FParsec implementations. The first FParsec parser is a direct translation of djahandarie's parser. The second one uses FParsec's embeddable operator precedence component. As the input I used a string generated with Jon's OCaml script with parameter 10, which gives me an input size of about 2.66MB. All parsers were compiled in release mode and were run on the 32-bit .NET 4 CLR. I only measured the pure parsing time and didn't include startup time or the time needed for constructing the input string (for the FParsec parsers) or the char list (Jon's parser).
I measured the following numbers (updated numbers for v. 0.9 in parens):
Jon's hand-rolled parser: ~230ms
FParsec parser #1: ~270ms (~235ms)
FParsec parser #2: ~110ms (~102ms)
In light of these numbers, I'd say that parser combinators can definitely offer competitive performance, at least for this particular problem, especially if you take into account that FParsec
automatically generates highly readable error messages,
supports very large files as input (with arbitrary backtracking), and
comes with a declarative, runtime-configurable operator-precedence parser module.
Here's the code for the two FParsec implementations:
Parser #1 (Translation of djahandarie's parser):
open FParsec
let str s = pstring s
let expr, exprRef = createParserForwardedToRef()
let fact = pint32 <|> between (str "(") (str ")") expr
let term = chainl1 fact ((str "*" >>% (*)) <|> (str "/" >>% (/)))
do exprRef:= chainl1 term ((str "+" >>% (+)) <|> (str "-" >>% (-)))
let parse str = run expr str
Parser #2 (Idiomatic FParsec implementation):
open FParsec
let opp = new OperatorPrecedenceParser<_,_,_>()
type Assoc = Associativity
let str s = pstring s
let noWS = preturn () // dummy whitespace parser
opp.AddOperator(InfixOperator("-", noWS, 1, Assoc.Left, (-)))
opp.AddOperator(InfixOperator("+", noWS, 1, Assoc.Left, (+)))
opp.AddOperator(InfixOperator("*", noWS, 2, Assoc.Left, (*)))
opp.AddOperator(InfixOperator("/", noWS, 2, Assoc.Left, (/)))
let expr = opp.ExpressionParser
let term = pint32 <|> between (str "(") (str ")") expr
opp.TermParser <- term
let parse str = run expr str
In a nutshell, parser combinators are slow for lexing.
There was a Haskell combinator library for building lexers (see "Lazy Lexing is Fast" Manuel M. T. Chakravarty) - as the tables were generated at runtime, there wasn't the hassle of code generation. The library got used a bit - it was initially used in one of the FFI preprocessors, but I don't think it ever got uploaded to Hackage, so maybe it was a little too inconvenient for regular use.
In the OCaml code above, the parser is directly matching on char lists, so it can be as fast as list destructuring is in the host language (it would be much faster than Parsec if it were re-implemented in Haskell). Christian Lindig had an OCaml library with a set of parser combinators and a set of lexer combinators; the lexer combinators were certainly much simpler than Manuel Chakravarty's, and it might be worthwhile tracking down this library and benchmarking it before writing a lexer generator.
Have you tried one of the known fast parser libraries? Parsec's aims have never really been speed, but ease of use and clarity. Comparing to something like attoparsec may be a more fair comparison, especially because the string types are likely to be more equal (ByteString instead of String).
I also wonder which compile flags were used. This being another trolling post by the infamous Jon Harrop, it would not surprise me if no optimisations were used at all for the Haskell code.
I have a list that I need to parse where all but the last element needs to be parsed by one parser, and the last element needs to be parsed by another parser.
a = "p1 p1b ... p2"
or
a = "p2"
Originally I tried
parser = do parse1 <- many parser1
            parse2 <- parser2
            return (AParse parse1 parse2)
The problem is that parser1 can consume input meant for parser2, so parser1 always consumes the entire list and leaves parser2 with nothing.
Is there a way to apply parser1 to everything except the last element of the string, and then apply parser2?
How about:
parseTrain car caboose = choice
  [ fmap (:[]) $ try (caboose `endBy` eof)
  , liftM2 (:) car (parseTrain car caboose)
  ]
The eof bugs me, since that makes this parser not compositional. I.e. you couldn't say:
char '(' >> parseTrain p1 p2 >> char ')'
Doing this compositionally is very hard for a parser. How is it supposed to know to move on to char ')', without trying it at every opportunity and seeing if it fails? Doing so could take exponential time.
If you need it to be compositional, does your problem have some additional structure you can exploit? Can you, for example, parse a list of all elements and then process the last one after the fact?
If you can factor parser1 so that it is defined like so:
parser1 = (try parser2) <|> parser1extra
Then the problem becomes a list of parser1extra or parser2 that must end in the latter. You can code that as:
parserList =
  liftM2 (:) (try parser1extra) parserList
  <|>
  liftM2 (:) (try parser2) (option [] parserList)
You may or may not need the try calls depending on if those parsers have any prefix overlap.
If you don't want the return value to be a list, but instead your AParse datum, then you could re-write it this way:
parserList =
  do a <- try parser1extra
     prefix a parserList
  <|>
  do a <- try parser2
     option (AParse [] a) (prefix a parserList)
  where prefix a p = do
          (AParse as t) <- p
          return $ (AParse (a:as) t)
Or, a full example:
import Control.Monad
import Text.ParserCombinators.Parsec
parseNum = do { v <- many1 digit; spaces; return v }
parseWord = do { v <- many1 letter; spaces; return v }
parsePart = parseNum <|> parseWord
parsePartListEndingInWord =
  liftM2 (:) (try parseNum) parsePartListEndingInWord
  <|>
  liftM2 (:) (try parseWord) (option [] parsePartListEndingInWord)
Actually, the calls to try aren't needed in this case, as parseNum and parseWord match no common prefix. Notice that parsePartListEndingInWord doesn't actually reference parsePart, but instead uses the two options that make up parsePart's definition.
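A small driver to try it out (the input strings and expected results in the comments are my own):
main :: IO ()
main = do
  parseTest (parsePartListEndingInWord <* eof) "12 34 end"  -- ["12","34","end"]
  parseTest (parsePartListEndingInWord <* eof) "end"        -- ["end"]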
(Original answer, solving a somewhat different situation:)
How about something like:
parserTest = between (char '[') (char ']') $ do
  p1s <- try parser1 `endBy` char ','
  p2 <- parser2
  return $ AParse p1s p2
Taking the punctuation out of your parsers and up into parserTest allows you to use the combinators between and endBy to do the work for you. Lastly, the try is there so that if parser1 and parser2 match a common prefix, endBy will perform the correct full backup to the beginning of the common prefix.
Depending on your parsers, it is possible that you can leave the punctuation matching inside your sub-parsers, and all you might need is a try around parser1:
parseTest = do parse1 <- many (try parser1)
               parse2 <- parser2
               return (AParse parse1 parse2)
I kind of combined the two approaches:
parserList = try (do a <- parser2
                     eof
                     return $ AParse [] a)
             <|>
             do a <- parser1
                prefix a parserList
  where
    prefix a p = do
      (AParse as t) <- p
      return $ AParse (a:as) t
I think that this will work for my purposes.
Thanks!
This will do the trick:
parser1 `manyTill` (try parser2)
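One caveat: manyTill discards the result of its end parser, so if you also need the value parser2 produces (to build the AParse from the question), you can peek at it first instead of consuming it. The lookAhead step here is my own addition, not part of the answer above:
parser :: Parser AParse
parser = do
  xs <- manyTill parser1 (lookAhead (try parser2))  -- stop just before the final element
  y  <- parser2                                     -- then consume it for real
  return (AParse xs y)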