I've written several compilers and am familiar with lexers, regexes/NFAs/DFAs, parsers and semantic rules in flex/bison, JavaCC, JavaCup, antlr4 and so on.
Is there some sort of magical monadic operator that seamlessly grows/combines a token with a mix of Parser Char (i.e. Text.Megaparsec.Char) vs. Parser String?
Is there a way / best practices to represent a clean separation of lexing tokens and nonterminal expectations?
Typically, one uses applicative operations to directly combine Parser Char and Parser Strings, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar
If you were doing something more complicated, like parsing dollar amounts with optional cents, for example, you might write:
dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
  <**> pure (++)
  <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
If you find yourself trying to build a Parser String out of a complicated sequence of Parser Char and Parser String parsers in a lot of situations, then you could define a few helper operators. If you find the variety of operators annoying, you could just define (<++>) and a short-form for charToStr like c :: Parser Char -> Parser String.
(<.+>) :: Parser Char -> Parser String -> Parser String
p <.+> q = (:) <$> p <*> q
infixr 5 <.+>
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>
(<..>) :: Parser Char -> Parser Char -> Parser String
p <..> q = p <.+> fmap (:[]) q
infixr 5 <..>
so you can write something like:
dollars' :: Parser String
dollars' = char '$' <.+> some digitChar
  <++> option "" (char '.' <.+> digitChar <..> digitChar)
As @leftroundabout says, there's nothing hackish about fmap (:[]). If you think it looks clearer, write fmap (\c -> [c]) instead.
There's nothing nasty or hackish about fmap (: []) (or fmap pure or pure <$>) – it's the natural thing to do, performing a conversion that's concise, safe, expressive and transparent all at the same time.
An alternative that I wouldn't really recommend, but for some situations it might express the intent best: sequence [charParser]. This makes it clear that you're executing “all” of the parsers in a list of character-parsers, and gathering the result“s” as a list of character“s”.
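For what it's worth, the spellings mentioned above can be checked side by side. This is only a sketch, and it assumes Parsec (Text.Parsec, with digit) rather than Megaparsec's digitChar; the names viaSection, viaPure and viaSequence are made up for the illustration.

```haskell
import Text.Parsec (digit, parse)
import Text.Parsec.String (Parser)

-- Three ways to lift a one-character parser to a one-element string parser.
-- (Hypothetical names; Parsec is assumed here instead of Megaparsec.)
viaSection, viaPure, viaSequence :: Parser Char -> Parser String
viaSection p  = fmap (:[]) p
viaPure p     = fmap pure p
viaSequence p = sequence [p]

main :: IO ()
main = mapM_ (\f -> print (parse (f digit) "" "7x"))
             [viaSection, viaPure, viaSequence]
-- prints Right "7" three times
```

All three consume exactly one digit and leave the rest of the input alone, so which one to use is a matter of taste.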
I want to parse strings like "0-9" into ('0', '9') but I think my two attempts look a bit clumsy.
numRange :: Parser (Char, Char)
numRange = (,) <$> digitChar <* char '-' <*> digitChar
numRange' :: Parser (Char, Char)
numRange' = liftM2 (,) (digitChar <* char '-') digitChar
I kind of expected that there already is an operator that sequences two parsers and returns both results in a tuple. If there is then I can't find it. I'm also having a hard time figuring out the desired signature in order to search on hoogle.
I tried Applicative f => f a -> f b -> f (a, b) based off the signature of <* but that only gives unrelated results.
The applicative form:
numRange = (,) <$> digitChar <* char '-' <*> digitChar
is standard. Anyone familiar with monadic parsers will immediately understand what this does.
The disadvantage of the liftM2 (or equivalently liftA2) form, or of a function with signature:
pair :: Applicative f => f a -> f b -> f (a, b)
pair = liftA2 (,)
is that the resulting parser expressions:
pair (digitChar <* char '-') digitChar
pair digitChar (char '-' *> digitChar)
obscure the fact that the char '-' syntax is not actually part of either digit parser. As a result, I think this is more likely to be confusing than the admittedly ugly applicative syntax.
I kind of expected that there already is an operator that sequences two parsers and returns both results in a tuple.
There is; it's liftA2 (,) as you noticed. However, you aren't sequencing two parsers, you are sequencing three parsers. Even though you can treat this as a "metasequence" of two two-parser sequencing operations, those two operations are different:
In digitChar <* char '-', you ignore the result of the second parser (and in my opinion, <* always looks like a typo for <*>).
In ... <*> digitChar, you use both results.
If you don't like using the applicative operators directly, consider using do syntax along with the ApplicativeDo extension and write
numRange :: Parser (Char, Char)
numRange = do
  x <- digitChar
  char '-'
  y <- digitChar
  return (x,y)
It's longer, but it's arguably more readable than either of the two using <*, which I always think looks like a typo for <*>.
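If you want to convince yourself that the spellings discussed above are interchangeable, here is a small sketch. It assumes Parsec (digit and char from Text.Parsec) instead of Megaparsec, and the three names are hypothetical.

```haskell
import Control.Applicative (liftA2)
import Text.Parsec (char, digit, parse)
import Text.Parsec.String (Parser)

-- The same "digit, dash, digit" parser written three ways.
-- (Parsec is assumed here; with Megaparsec, digit would be digitChar.)
rangeApplicative, rangeLifted, rangeDo :: Parser (Char, Char)
rangeApplicative = (,) <$> digit <* char '-' <*> digit
rangeLifted      = liftA2 (,) (digit <* char '-') digit
rangeDo = do
  x <- digit
  _ <- char '-'   -- the separator's result is discarded explicitly
  y <- digit
  return (x, y)

main :: IO ()
main = mapM_ (\p -> print (parse p "" "0-9"))
             [rangeApplicative, rangeLifted, rangeDo]
-- prints Right ('0','9') three times
```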
Suppose I have a function of this type:
once :: (a, b) -> Parser (a, b)
Now, I would like to repeatedly apply this parser (somewhat like using >>=) and use its last output to feed it in the next iteration.
Using something like
sequence :: (a, b) -> Parser (a, b)
sequence inp = once inp >>= sequence
and supplying the initial values to the first call doesn't work, because it keeps going until it inevitably fails. Instead, I would like it to stop just before the step that would fail (somewhat like many).
Trying to fix it using try makes the computation too complex (adding try in each iteration).
sequence :: (a, b) -> Parser (a, b)
sequence inp = try (once inp >>= sequence) <|> pure inp
In other words, I am looking for a function somewhat similar to foldl on Parsers, which stops when the next Parser would fail.
If your once parser fails immediately without consuming input, you don't need try. As a concrete example, consider a rather silly once parser that uses a pair of delimiters to parse the next pair of delimiters:
once :: (Char, Char) -> Parser (Char, Char)
once (c1, c2) = (,) <$ char c1 <*> anyChar <*> anyChar <* char c2
You can parse a nested sequence using:
onces :: (Char, Char) -> Parser (Char, Char)
onces inp = (once inp >>= onces) <|> pure inp
which works fine:
> parseTest (onces ('(',')')) "([])[{}]{xy}xabyDONE"
('a','b')
You only need try if your once might fail after parsing input. For example, the following won't parse without try:
> parseTest (onces ('(',')')) "([])[not valid]"
parse error at (line 1, column 8):
unexpected "t"
expecting "]"
because we start parsing the opening delimiter [ before discovering not valid].
(With try, it returns the correct ('[',']').)
All that being said, I have no idea how you came to the conclusion that using try makes the computation "too complex". If you are just guessing from something you've read about try being potentially inefficient, then you've misunderstood. try can cause problems if it's used in a manner that can result in a big cascade of backtracking. That's not a problem here -- at most, you're backtracking a single once, so don't worry about it.
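For reference, here is a sketch of the two variants next to each other, using Parsec as in the examples above. oncesTry is a hypothetical name, and wrapping only once inp in try (rather than the whole recursive call) is my own choice; the recursive call always succeeds thanks to its pure fallback, so only once can trigger the backtracking either way.

```haskell
import Text.Parsec (anyChar, char, parse, try, (<|>))
import Text.Parsec.String (Parser)

once :: (Char, Char) -> Parser (Char, Char)
once (c1, c2) = (,) <$ char c1 <*> anyChar <*> anyChar <* char c2

-- No try: fails outright if `once` consumes input and then fails.
onces :: (Char, Char) -> Parser (Char, Char)
onces inp = (once inp >>= onces) <|> pure inp

-- With try: a partially consumed `once` is undone, and the last
-- successfully parsed pair is returned instead.
oncesTry :: (Char, Char) -> Parser (Char, Char)
oncesTry inp = (try (once inp) >>= oncesTry) <|> pure inp

main :: IO ()
main = do
  let input = "([])[not valid]"
      render = either (const "parse error") show
  putStrLn (render (parse (onces ('(', ')')) "" input))
  putStrLn (render (parse (oncesTry ('(', ')')) "" input))
-- prints:
--   parse error
--   ('[',']')
```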
As an exercise¹, I've written a string parser that only uses char parsers and Trifecta:
import Text.Trifecta
import Control.Applicative ( pure )
stringParserWithChar :: String -> Parser Char
stringParserWithChar stringToParse =
  foldr (\c otherParser -> otherParser >> char c) identityParser
    $ reverse stringToParse
  where identityParser = pure '?' -- ← This works but I think I can do better
The parser does its job just fine:
parseString (stringParserWithChar "123") mempty "1234"
-- Yields: Success '3'
Yet, I'm not happy with the specific identityParser to which I applied foldr. It seems hacky to have to choose an arbitrary character for pure.
My first intuition was to use mempty but Parser is not a monoid. It is an applicative but empty constitutes an unsuccessful parser².
What I'm looking for instead is a parser that works as a neutral element when combined with other parsers. It should successfully do nothing, i.e., not advance the cursor and let the next parser consume the character.
Is there an identity parser as described above in Trifecta or in another library? Or are parsers not meant to be used in a fold?
¹ The exercise is from the parser combinators chapter of the book Haskell Programming from first principles.
² As helpfully pointed out by cole, Parser is an Alternative and thus a monoid. The empty function stems from Alternative, not Parser's applicative instance.
Don't you want this to parse a String? Right now, as you can tell from the function signature, it parses a Char, returning the last character. Just because you only have a Char parser doesn't mean you can't make a String parser.
I'm going to assume that you want to parse a string, in which case your base case is simple: your identityParser is just pure "".
I think something like this should work (and it should be in the right order but might be reversed).
stringParserWithChar :: String -> Parser String
stringParserWithChar = traverse char
Unrolled, you get something like
stringParserWithChar' :: String -> Parser String
stringParserWithChar' "" = pure ""
stringParserWithChar' (c:cs) = liftA2 (:) (char c) (stringParserWithChar' cs)
-- the above with do notation, note that you can also just sequence the results of
-- 'char c' and 'stringParserWithChar' cs' and instead just return 'pure (c:cs)'
-- stringParserWithChar' (c:cs) = do
-- c' <- char c
-- cs' <- stringParserWithChar' cs
-- pure (c':cs')
Let me know if they don't work since I can't test them right now…
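Here is a version of the idea above that can be run as-is. It is a sketch that assumes Parsec in place of Trifecta (both provide a char parser with this shape), and stringViaChar / stringViaFold are hypothetical names. The fold makes the neutral starting value explicit: pure "" succeeds without consuming anything.

```haskell
import Text.Parsec (anyChar, char, parse)
import Text.Parsec.String (Parser)

-- Parse each character in order and collect the results.
stringViaChar :: String -> Parser String
stringViaChar = traverse char

-- The same parser as an explicit fold; `pure ""` is the neutral element.
stringViaFold :: String -> Parser String
stringViaFold = foldr (\c rest -> (:) <$> char c <*> rest) (pure "")

main :: IO ()
main = do
  print (parse (stringViaChar "123") "" "1234")
  print (parse (stringViaFold "123") "" "1234")
  -- The empty string parser consumes nothing, so anyChar still sees '1'.
  print (parse ((,) <$> stringViaChar "" <*> anyChar) "" "1234")
-- prints:
--   Right "123"
--   Right "123"
--   Right ("",'1')
```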
A digression on monoids
My first intuition was to use mempty but Parser is not a monoid.
Ah, but that is not quite the case. Parser is an Alternative, which is a monoid on applicative functors. But you don't really need to look at the Alt newtype of Data.Monoid to understand this; Alternative's typeclass definition looks just like a Monoid's:
class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a
  -- more definitions...
class Semigroup a => Monoid a where
  mempty :: a
  mappend :: a -> a -> a
  -- more definitions...
Unfortunately, you want something that acts more like a product instead of an Alt, but that's what the default behavior of Parser does.
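To make the difference concrete, here is a sketch that folds the same list of characters in both ways. It assumes Parsec rather than Trifecta, and anyOf / allOf are hypothetical names. Folding with (<|>) and empty gives a choice, while folding with (*>) and pure () gives the sequencing ("product-like") behaviour that a string parser needs.

```haskell
import Control.Applicative (empty)
import Text.Parsec (char, parse, (<|>))
import Text.Parsec.String (Parser)

-- Choice: accepts any ONE of the characters; `empty` always fails.
anyOf :: String -> Parser Char
anyOf = foldr (\c rest -> char c <|> rest) empty

-- Sequence: accepts ALL of the characters in order; `pure ()` always succeeds.
allOf :: String -> Parser ()
allOf = foldr (\c rest -> char c *> rest) (pure ())

render :: Show a => Either e a -> String
render = either (const "parse error") show

main :: IO ()
main = do
  putStrLn (render (parse (anyOf "abc") "" "b"))
  putStrLn (render (parse (allOf "abc") "" "abc"))
  putStrLn (render (parse (anyOf "") "" "abc"))  -- neutral element of choice
  putStrLn (render (parse (allOf "") "" "abc"))  -- neutral element of sequencing
-- prints:
--   'b'
--   ()
--   parse error
--   ()
```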
Let's rewrite your fold+reverse into just a fold to clarify what's going on:
stringParserWithChar :: String -> Parser Char
stringParserWithChar =
  foldl (\otherParser c -> otherParser >> char c) identityParser
  where identityParser = pure '?'
Any time you see foldl used to build up something using its Monad instance, that's a bit suspicious[*]. It hints that you really want a monadic fold of some sort. Let's see here...
import Control.Monad
-- foldM :: (Foldable t, Monad m) => (b -> a -> m b) -> b -> t a -> m b
attempt1 :: String -> Parser Char
attempt1 = foldM _f _acc
This is going to run into the same sort of trouble you saw before: what can you use for a starting value? So let's use a standard trick and start with Maybe:
-- (Control.Monad.<=<)
-- :: Monad m => (b -> m c) -> (a -> m b) -> a -> m c
stringParserWithChar :: String -> Parser Char
stringParserWithChar =
  maybe empty pure <=< foldM _f _acc
Now we can start our fold off with Nothing, and immediately switch to Just and stay there. I'll let you fill in the blanks; GHC will helpfully show you their types.
[*] The main exception is when it's a "lazy monad" like Reader, lazy Writer, lazy State, etc. But parser monads are generally strict.
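If you get stuck on the holes, here is one possible way to fill them. Treat it as a sketch: it assumes Parsec instead of Trifecta, and lastCharOf and step are hypothetical names. The accumulator starts as Nothing and becomes Just as soon as one character has been parsed, so the empty string ends up as a failing parser.

```haskell
import Control.Applicative (empty)
import Control.Monad (foldM, (<=<))
import Text.Parsec (char, parse)
import Text.Parsec.String (Parser)

-- Parses every character of the given string and returns the last one.
lastCharOf :: String -> Parser Char
lastCharOf = maybe empty pure <=< foldM step Nothing
  where
    -- Ignore the previous result and remember the character just parsed.
    step :: Maybe Char -> Char -> Parser (Maybe Char)
    step _ c = Just <$> char c

render :: Show a => Either e a -> String
render = either (const "parse error") show

main :: IO ()
main = do
  putStrLn (render (parse (lastCharOf "123") "" "1234"))
  putStrLn (render (parse (lastCharOf "") "" "1234"))
-- prints:
--   '3'
--   parse error
```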
First, just some quick context. I'm going through the Haskell Programming From First Principles book, and ran into the following exercise.
Try writing a Parser that does what string does, but using char.
I couldn't figure it out, so I checked out the source for the implementation. I'm currently trying to wrap my head around it. Here it is:
class Parsing m => CharParsing m where
-- etc.
string :: CharParsing m => String -> m String
string s = s <$ try (traverse_ char s) <?> show s
My questions are as follows, from most to least specific.
Why is show necessary?
Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
What is going on with the traversal? I get what a list traversal does, so I guess I'm confused about the Applicative/Monad instances for Parser. On a high level, I get that the traversal applies char, which has type CharParsing m => Char -> m Char, to every character in string s, and then collects all the results into something of type Parser [Char]. So the types make sense, but I have no idea what's going on in the background.
Thanks in advance!
1) Why is show necessary?
Because showing a string (or a Text, etc.) escapes special characters, which makes sense for error messages:
GHCi> import Text.Parsec -- Simulating your scenario with Parsec.
GHCi> import Data.Foldable (traverse_)
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting foo
bar
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> show s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting "foo\nbar"
2) Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
The result of the parse is unnecessary because we know in advance that it would be s (if the parse were successful). traverse would needlessly reconstruct s from the results of parsing each individual character. In general, if the results are not needed it is a good idea to use traverse_ (which just combines the effects, discarding the results without trying to rebuild the data structure) rather than traverse, so that is likely why the function is written the way it is.
3) What is going on with the traversal?
traverse_ char s (traverse_, and not traverse, as explained above) is a parser. It tries to parse, in order, each character in s, while discarding the results, and it is built by sequencing parsers for each character in s. It may be helpful to recall that traverse_ is just a fold which uses (*>):
-- Slightly paraphrasing the definition in Data.Foldable:
traverse_ :: (Foldable t, Applicative f) => (a -> f b) -> t a -> f ()
traverse_ f = foldr (\x u -> f x *> u) (pure ())
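The points above can be tried out with a small sketch. It assumes Parsec (as in the GHCi session earlier) rather than the parsers package, and stringRebuilt, stringKnown and traverseFold are hypothetical names. Both string variants return the same value; the second one simply never rebuilds the list.

```haskell
import Data.Foldable (traverse_)
import Text.Parsec (char, parse, try, (<?>))
import Text.Parsec.String (Parser)

-- Rebuilds the string from the individual char results.
stringRebuilt :: String -> Parser String
stringRebuilt s = try (traverse char s) <?> show s

-- Discards the char results and returns the already-known string s.
stringKnown :: String -> Parser String
stringKnown s = s <$ try (traverse_ char s) <?> show s

-- traverse_ specialised to strings, written as the fold quoted above.
traverseFold :: (Char -> Parser b) -> String -> Parser ()
traverseFold f = foldr (\x u -> f x *> u) (pure ())

main :: IO ()
main = do
  print (parse (stringRebuilt "foo") "" "foobar")
  print (parse (stringKnown "foo") "" "foobar")
  print (parse (traverseFold char "foo") "" "foobar")
-- prints:
--   Right "foo"
--   Right "foo"
--   Right ()
```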
I am trying to convert the code from chapter 9, Writing Interpreters for the λ-Calculus, of Paulson's book ML for the Working Programmer.
I was wondering if anyone can help me translate this to Haskell.
I'm struggling to understand the syntax.
fun list ph = ph -- repeat ("," $-- ph) >> (op::);
fun pack ph = "(" $-- list ph --$")" >> #1
| empty;
In porting this code to Haskell, I see two challenges: One is rewriting the combinators so they use the type Either SyntaxError rather than exceptions for flow control, and the other is preserving the modularity of ML's functors. That is, writing a parser combinator library that is modular with regard to what keywords / symbols / tokenizer it should use.
While the ML code has the two
functor Lexical (Keyword: KEYWORD) : LEXICAL
functor Parsing (Lex: LEXICAL) : PARSE
you could start by having
data Keyword = Keyword
  { alphas :: [String]
  , symbols :: [String]
  }
data Token
  = Key String
  | Id String
  deriving (Show, Eq)
lex :: Keyword -> String -> [Token]
lex kw s = ...
  where
    alphaTok :: String -> Token
    alphaTok a | a `elem` alphas kw = Key a
               | otherwise = Id a
    ...
The ML code uses the types string and substring while Haskell's String is actually a [Char]. The lexer functions would look a little different because ML's String.getc could simply be the pattern match c : ss1 in Haskell, etc.
Paulson's parsers have type [Token] → (τ, [Token]) but allow for exceptions. The Haskell parsers could have type [Token] → Either SyntaxError (τ, [Token]):
newtype SyntaxError = SyntaxError String
deriving Show
newtype Parser a = Parser { runParser :: [Token] -> Either SyntaxError (a, [Token]) }
err :: String -> Either SyntaxError b
err msg = Left (SyntaxError msg)
The operators id, $, ||, !!, -- and >> need new names, since they collide with a bunch of built-in operators and single-line comments. Ideas for names could be: ident, kw, |||, +++ and >>>. I would skip implementing the !! operator initially.
Here are two combinators implemented a little differently,
ident :: Parser String
ident = Parser f
  where
    f :: [Token] -> Either SyntaxError (String, [Token])
    f (Id x : toks) = Right (x, toks)
    f (Key x : _) = err $ "Identifier expected, got keyword '" ++ x ++ "'"
    f [] = err "Identifier expected, got EOF"
(+++) :: Parser a -> Parser b -> Parser (a, b)
(+++) pa pb = Parser $ \toks1 -> do
  (x, toks2) <- runParser pa toks1
  (y, toks3) <- runParser pb toks2
  return ((x, y), toks3)
...
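Putting the pieces together, the list and pack functions from the question could be sketched roughly as follows. This is not Paulson's code, just one possible translation under a few assumptions: I read the `| empty` in the question as `|| empty` and `--$")"` as `-- $")"`; the names kw, kwThen, repeatP and emptyP and the fixities are my own inventions; and the error messages are simplified.

```haskell
data Token = Key String | Id String
  deriving (Show, Eq)

newtype SyntaxError = SyntaxError String
  deriving (Show, Eq)

newtype Parser a = Parser
  { runParser :: [Token] -> Either SyntaxError (a, [Token]) }

-- Hypothetical fixities, chosen to mirror the relative ML precedences.
infixl 6 `kwThen`
infixl 5 +++
infixl 3 >>>
infixl 2 |||

-- Stands in for Paulson's `$`: accept exactly the keyword k.
kw :: String -> Parser String
kw k = Parser f
  where
    f (Key x : toks) | x == k = Right (x, toks)
    f _ = Left (SyntaxError ("Symbol '" ++ k ++ "' expected"))

-- Stands in for Paulson's `id` (simplified error message).
ident :: Parser String
ident = Parser f
  where
    f (Id x : toks) = Right (x, toks)
    f _ = Left (SyntaxError "Identifier expected")

-- Stands in for `--`: run two parsers in sequence and pair the results.
(+++) :: Parser a -> Parser b -> Parser (a, b)
pa +++ pb = Parser $ \toks1 -> do
  (x, toks2) <- runParser pa toks1
  (y, toks3) <- runParser pb toks2
  return ((x, y), toks3)

-- Stands in for `>>`: apply a function to the parse result.
(>>>) :: Parser a -> (a -> b) -> Parser b
pa >>> f = Parser $ \toks -> do
  (x, rest) <- runParser pa toks
  return (f x, rest)

-- Stands in for `||`: try the second parser if the first one fails.
(|||) :: Parser a -> Parser a -> Parser a
pa ||| pb = Parser $ \toks ->
  either (const (runParser pb toks)) Right (runParser pa toks)

-- Stands in for `empty`: succeed with [] without consuming tokens.
emptyP :: Parser [a]
emptyP = Parser $ \toks -> Right ([], toks)

-- Stands in for `$--`: a keyword, then ph, keeping only ph's result.
kwThen :: String -> Parser a -> Parser a
kwThen k ph = kw k +++ ph >>> snd

-- Stands in for `repeat`: zero or more occurrences of ph.
repeatP :: Parser a -> Parser [a]
repeatP ph = ph +++ repeatP ph >>> uncurry (:) ||| emptyP

-- A non-empty, comma-separated list of ph.
list :: Parser a -> Parser [a]
list ph = ph +++ repeatP ("," `kwThen` ph) >>> uncurry (:)

-- A parenthesised list of ph, or nothing at all.
pack :: Parser a -> Parser [a]
pack ph = "(" `kwThen` list ph +++ kw ")" >>> fst ||| emptyP

main :: IO ()
main = do
  let toks = [Key "(", Id "a", Key ",", Id "b", Key ",", Id "c", Key ")"]
  print (runParser (pack ident) toks)
  print (runParser (pack ident) [Id "z"])
-- prints:
--   Right (["a","b","c"],[])
--   Right ([],[Id "z"])
```

Because the parsers are pure functions over the token list, ||| gets backtracking for free: on failure it simply reruns the alternative on the original tokens.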
Some final remarks:
Read the paper Monadic Parsing in Haskell (Hutton, Meijer).
You may be interested in SimpleParse by Ken Friis Larsen, an educational parser combinator library that is a simplification of ReadP by Koen Claessen, since its source code is very easy to read. They are both non-deterministic.
If you're interested in using parser combinators in Haskell, rather than porting some old-fashioned library for the learning experience, I encourage you to look at Megaparsec (tutorial), a modern fork of Parsec. The implementation is a little complex.
None of these three libraries (SimpleParse, ReadP, Megaparsec) split lexing and parsing into two separate steps. Rather, they simply build small tokenizing parsers that implicitly eat meaningless whitespace. (See the token combinator in SimpleParse, for example.) However, Megaparsec does allow an arbitrary token type, whether that is Char or some token you have lexed.
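The "small tokenizing parsers that eat whitespace" approach can be sketched with ReadP from base. The helpers lexeme, symbol, identifier and tuple are hypothetical names written for this example (they are not part of ReadP), and skipping whitespace after each token is just one common convention.

```haskell
import Data.Char (isAlpha, isAlphaNum)
import Text.ParserCombinators.ReadP

-- Run a parser, then skip any whitespace that follows the token.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

-- A fixed piece of punctuation or a keyword, as a token.
symbol :: String -> ReadP String
symbol = lexeme . string

-- An identifier token: a letter followed by letters or digits.
identifier :: ReadP String
identifier = lexeme ((:) <$> satisfy isAlpha <*> munch isAlphaNum)

-- The "grammar" level only mentions tokens, never whitespace.
tuple :: ReadP [String]
tuple = between (symbol "(") (symbol ")") (identifier `sepBy` symbol ",")

main :: IO ()
main = print (readP_to_S (skipSpaces *> tuple <* eof) "  ( foo , bar1,baz )  ")
-- prints [(["foo","bar1","baz"],"")]
```

The separation between "lexing" and "parsing" is then a matter of discipline: only the lexeme-level helpers touch characters, and everything built on top of them reads like a token-level grammar.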