How to parse a character range into a tuple - haskell

I want to parse strings like "0-9" into ('0', '9') but I think my two attempts look a bit clumsy.
numRange :: Parser (Char, Char)
numRange = (,) <$> digitChar <* char '-' <*> digitChar
numRange' :: Parser (Char, Char)
numRange' = liftM2 (,) (digitChar <* char '-') digitChar
I kind of expected that there already is an operator that sequences two parsers and returns both results in a tuple. If there is then I can't find it. I'm also having a hard time figuring out the desired signature in order to search on hoogle.
I tried Applicative f => f a -> f b -> f (a, b) based off the signature of <* but that only gives unrelated results.

The applicative form:
numRange = (,) <$> digitChar <* char '-' <*> digitChar
is standard. Anyone familiar with monadic parsers will immediately understand what this does.
The disadvantage of the liftM2 (or equivalently liftA2) form, or of a function with signature:
pair :: Applicative f => f a -> f b -> f (a, b)
pair = liftA2 (,)
is that the resulting parser expressions:
pair (digitChar <* char '-') digitChar
pair digitChar (char '-' *> digitChar)
obscure the fact that the char '-' parser is not actually part of either digit parser. As a result, I think this is more likely to be confusing than the admittedly ugly applicative syntax.

I kind of expected that there already is an operator that sequences two parsers and returns both results in a tuple.
There is; it's liftA2 (,), as you noticed. However, you aren't sequencing two parsers, you are sequencing three parsers. Even though you can treat this as a "metasequence" of two two-parser sequencing operations, those two operations are different:
In digitChar <* char '-', you ignore the result of the second parser (and in my opinion, <* always looks like a typo for <*>).
In ... <*> digitChar, you use both results.
If you don't like using the applicative operators directly, consider using do syntax along with the ApplicativeDo extension and write
numRange :: Parser (Char, Char)
numRange = do
  x <- digitChar
  char '-'
  y <- digitChar
  return (x, y)
It's longer, but it's arguably more readable than either of the two versions that use <*.
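For reference, here is the applicative version as a minimal, self-contained sketch, assuming megaparsec; the imports, the Parser type alias and the test input are my own additions, not part of the question:
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseTest)
import Text.Megaparsec.Char (char, digitChar)

type Parser = Parsec Void String

numRange :: Parser (Char, Char)
numRange = (,) <$> digitChar <* char '-' <*> digitChar

main :: IO ()
main = parseTest numRange "0-9"  -- should print ('0','9')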

Related

Foldl-like operator for Parsec

Suppose I have a function of this type:
once :: (a, b) -> Parser (a, b)
Now, I would like to repeatedly apply this parser (somewhat like using >>=) and use its last output to feed it in the next iteration.
Using something like
sequence :: (a, b) -> Parser (a, b)
sequence inp = once inp >>= sequence
and supplying the initial values to the first parser doesn't work, because it keeps going until it inevitably fails. Instead, I would like it to stop when it would fail (somewhat like many).
Trying to fix it using try makes the computation too complex (adding try in each iteration).
sequence :: (a, b) -> Parser (a, b)
sequence inp = try (once inp >>= sequence) <|> pure inp
In other words, I am looking for a function somewhat similar to foldl on Parsers, which stops when the next Parser would fail.
If your once parser fails immediately without consuming input, you don't need try. As a concrete example, consider a rather silly once parser that uses a pair of delimiters to parse the next pair of delimiters:
once :: (Char, Char) -> Parser (Char, Char)
once (c1, c2) = (,) <$ char c1 <*> anyChar <*> anyChar <* char c2
You can parse a nested sequence using:
onces :: (Char, Char) -> Parser (Char, Char)
onces inp = (once inp >>= onces) <|> pure inp
which works fine:
> parseTest (onces ('(',')')) "([])[{}]{xy}xabyDONE"
('a','b')
You only need try if your once might fail after having consumed some input. For example, the following won't parse without try:
> parseTest (onces ('(',')')) "([])[not valid]"
parse error at (line 1, column 8):
unexpected "t"
expecting "]"
because we consume the opening delimiter [ before discovering that not valid] doesn't match.
(With try, it returns the correct ('[',']').)
All that being said, I have no idea how you came to the conclusion that using try makes the computation "too complex". If you are just guessing from something you've read about try being potentially inefficient, then you've misunderstood. try can cause problems if it's used in a manner that can result in a big cascade of backtracking. That's not a problem here -- at most, you're backtracking a single once, so don't worry about it.
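For completeness, here is the example above as a runnable sketch, assuming parsec; the imports and the main wrapper are mine, the parsers are taken verbatim from the answer:
import Text.Parsec (anyChar, char, parseTest, (<|>))
import Text.Parsec.String (Parser)

once :: (Char, Char) -> Parser (Char, Char)
once (c1, c2) = (,) <$ char c1 <*> anyChar <*> anyChar <* char c2

onces :: (Char, Char) -> Parser (Char, Char)
onces inp = (once inp >>= onces) <|> pure inp

main :: IO ()
main = parseTest (onces ('(', ')')) "([])[{}]{xy}xabyDONE"
-- should print ('a','b')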

Identity parser

As an exercise¹, I've written a string parser that only uses char parsers and Trifecta:
import Text.Trifecta
import Control.Applicative ( pure )
stringParserWithChar :: String -> Parser Char
stringParserWithChar stringToParse =
  foldr (\c otherParser -> otherParser >> char c) identityParser
    $ reverse stringToParse
  where identityParser = pure '?' -- ← This works but I think I can do better
The parser does its job just fine:
parseString (stringParserWithChar "123") mempty "1234"
-- Yields: Success '3'
Yet, I'm not happy with the specific identityParser to which I applied foldr. It seems hacky to have to choose an arbitrary character for pure.
My first intuition was to use mempty but Parser is not a monoid. It is an applicative but empty constitutes an unsuccessful parser².
What I'm looking for instead is a parser that works as a neutral element when combined with other parsers. It should successfully do nothing, i.e., not advance the cursor and let the next parser consume the character.
Is there an identity parser as described above in Trifecta or in another library? Or are parsers not meant to be used in a fold?
¹ The exercise is from the parser combinators chapter of the book Haskell Programming from first principles.
² As helpfully pointed out by cole, Parser is an Alternative and thus a monoid. The empty function stems from Alternative, not Parser's applicative instance.
Don't you want this to parse a String? Right now, as you can tell from the function signature, it parses a Char, returning the last character. Just because you only have a Char parser doesn't mean you can't make a String parser.
I'm going to assume that you want to parse a string, in which case your base case is simple: your identityParser is just pure "".
I think something like this should work (and it should be in the right order but might be reversed).
stringParserWithChar :: String -> Parser String
stringParserWithChar = traverse char
Unrolled, you get something like
stringParserWithChar' :: String -> Parser String
stringParserWithChar' "" = pure ""
stringParserWithChar' (c:cs) = liftA2 (:) (char c) (stringParserWithChar' cs)
-- the same parser with do notation; note that you could also sequence
-- 'char c' and 'stringParserWithChar' cs' purely for their effects and
-- finish with 'pure (c:cs)'
-- stringParserWithChar' (c:cs) = do
--   c' <- char c
--   cs' <- stringParserWithChar' cs
--   pure (c':cs')
Let me know if they don't work since I can't test them right now…
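If they do work, the question's original test, run against the String-returning version, should now give back the whole match rather than just the last character (my expectation, not a verified run):
parseString (stringParserWithChar "123") mempty "1234"
-- should yield: Success "123"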
A digression on monoids
My first intuition was to use mempty but Parser is not a monoid.
Ah, but that is not quite the case. Parser is an Alternative, which is a Monoid. But you don't really need to look at the Alt typeclass of Data.Monoid to understand this; Alternative's typeclass definition looks just like a Monoid's:
class Applicative f => Alternative f where
  empty :: f a
  (<|>) :: f a -> f a -> f a
  -- more definitions...

class Semigroup a => Monoid a where
  mempty :: a
  mappend :: a -> a -> a
  -- more definitions...
Unfortunately, you want something that acts more like a product than like Alt, and that product-like sequencing is exactly what Parser's default Applicative/Monad behavior already gives you.
Let's rewrite your fold+reverse into just a fold to clarify what's going on:
stringParserWithChar :: String -> Parser Char
stringParserWithChar =
  foldl (\otherParser c -> otherParser >> char c) identityParser
  where identityParser = pure '?'
Any time you see foldl used to build up something using its Monad instance, that's a bit suspicious[*]. It hints that you really want a monadic fold of some sort. Let's see here...
import Control.Monad
-- foldM :: (Foldable t, Monad m) => (b -> a -> m b) -> b -> t a -> m b
attempt1 :: String -> Parser Char
attempt1 = foldM _f _acc
This is going to run into the same sort of trouble you saw before: what can you use for a starting value? So let's use a standard trick and start with Maybe:
-- (Control.Monad.<=<)
-- :: Monad m => (b -> m c) -> (a -> m b) -> a -> m c
stringParserWithChar :: String -> Parser Char
stringParserWithChar =
  maybe empty pure <=< foldM _f _acc
Now we can start our fold off with Nothing, and immediately switch to Just and stay there. I'll let you fill in the blanks; GHC will helpfully show you their types.
[*] The main exception is when it's a "lazy monad" like Reader, lazy Writer, lazy State, etc. But parser monads are generally strict.
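One way to fill in those blanks, following the hint above (this is my own sketch, not part of the original answer), is to ignore the accumulator and wrap each successfully parsed character in Just:
import Control.Applicative (empty)
import Control.Monad (foldM, (<=<))
import Text.Trifecta (Parser, char)

-- Each step parses the expected character and keeps it as the new accumulator.
step :: Maybe Char -> Char -> Parser (Maybe Char)
step _ c = Just <$> char c

stringParserWithChar :: String -> Parser Char
stringParserWithChar = maybe empty pure <=< foldM step Nothing
-- foldM starts from Nothing; for an empty input string the accumulator stays
-- Nothing, and 'maybe empty pure' turns that into a failing parser.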

Mixing Parser Char (lexer?) vs. Parser String

I've written several compilers and am familiar with lexers, regexs/NFAs/DFAs, parsers and semantic rules in flex/bison, JavaCC, JavaCup, antlr4 and so on.
Is there some sort of magical monadic operator that seamlessly grows/combines a token with a mix of Parser Char (i.e., Text.Megaparsec.Char) vs. Parser String?
Is there a way / best practices to represent a clean separation of lexing tokens and nonterminal expectations?
Typically, one uses applicative operations to combine Parser Char and Parser String parsers directly, rather than "upgrading" the former. For example, a parser for alphanumeric identifiers that must start with a letter would probably look like:
ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar
If you were doing something more complicated, like parsing dollar amounts with optional cents, for example, you might write:
dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
          <**> pure (++)
          <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)
If you find yourself building a Parser String out of a complicated mix of Parser Char and Parser String parsers in a lot of places, you could define a few helper operators. If the variety of operators below seems annoying, you could get by with just (<++>) plus a short helper of type Parser Char -> Parser String (for example fmap (:[])).
(<.+>) :: Parser Char -> Parser String -> Parser String
p <.+> q = (:) <$> p <*> q
infixr 5 <.+>
(<++>) :: Parser String -> Parser String -> Parser String
p <++> q = (++) <$> p <*> q
infixr 5 <++>
(<..>) :: Parser Char -> Parser Char -> Parser String
p <..> q = p <.+> fmap (:[]) q
infixr 5 <..>
so you can write something like:
dollars' :: Parser String
dollars' = char '$' <.+> some digitChar
           <++> option "" (char '.' <.+> digitChar <..> digitChar)
As leftroundabout's answer below says, there's nothing hackish about fmap (:[]). If you prefer, write fmap (\c -> [c]) if you think it looks clearer.
There's nothing nasty or hackish about fmap (: []) (or fmap pure or pure <$>) – it's the natural thing to do, performing a conversion that's concise, safe, expressive and transparent all at the same time.
An alternative that I wouldn't really recommend, but that might express the intent best in some situations: sequence [charParser]. This makes it clear that you're executing all of the parsers in a list of character parsers and gathering the results as a list of characters.
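To make the identifier and dollar examples concrete, here is a self-contained sketch assuming megaparsec; the imports, the Parser alias and the test inputs are my additions:
import Control.Applicative (many, some, (<**>))
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, parseTest)
import Text.Megaparsec.Char (alphaNumChar, char, digitChar, letterChar)

type Parser = Parsec Void String

ident :: Parser String
ident = (:) <$> letterChar <*> many alphaNumChar

dollars :: Parser String
dollars = (:) <$> char '$' <*> some digitChar
          <**> pure (++)
          <*> option "" ((:) <$> char '.' <*> replicateM 2 digitChar)

main :: IO ()
main = do
  parseTest ident "x42"       -- should print "x42"
  parseTest dollars "$12.34"  -- should print "$12.34"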

What are Applicative left and right star sequencing operators expected to do?

I looked up the implementation and it's even more mysterious:
-- | Sequence actions, discarding the value of the first argument.
(*>) :: f a -> f b -> f b
a1 *> a2 = (id <$ a1) <*> a2
-- This is essentially the same as liftA2 (flip const), but if the
-- Functor instance has an optimized (<$), it may be better to use
-- that instead. Before liftA2 became a method, this definition
-- was strictly better, but now it depends on the functor. For a
-- functor supporting a sharing-enhancing (<$), this definition
-- may reduce allocation by preventing a1 from ever being fully
-- realized. In an implementation with a boring (<$) but an optimizing
-- liftA2, it would likely be better to define (*>) using liftA2.
-- | Sequence actions, discarding the value of the second argument.
(<*) :: f a -> f b -> f a
(<*) = liftA2 const
I don't even understand why <$ deserves a place in a typeclass. It looks like there is some sharing-enhancing effect which fmap . const might not have, and that a1 might not be "fully realized". How is that related to the meaning of the Applicative sequencing operators?
These operators sequence two applicative actions and provide the result of the action that the arrow points to. For example,
> Just 1 *> Just 2
Just 2
> Just 1 <* Just 2
Just 1
Another example in writing parser combinators is
brackets p = char '(' *> p <* char ')'
which will be a parser that matches p contained in brackets and gives the result of parsing p.
In fact, (*>) is the same as (>>) but only requires an Applicative constraint instead of a Monad constraint.
I don't even understand why <$ deserves a place in a typeclass.
The answer is given by the Functor documentation: (<$) can sometimes have more efficient implementations than its default, which is fmap . const.
How is that related to the meaning of Applicative sequencing operators?
In cases where (<$) is more efficient, you want to maintain that efficiency in the definition of (*>).
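As a small worked example of the default definition quoted in the question (my own unfolding, using Maybe), you can see how id <$ a1 keeps a1's effect but discards its result:
-- Just 1 *> Just 2
--   = (id <$ Just 1) <*> Just 2   -- default definition of (*>)
--   = Just id <*> Just 2          -- (<$) replaces the wrapped value, keeping the "shape"
--   = Just 2
-- For a parser, the "shape" is the input it consumes, which is why
-- char '(' *> p runs the char '(' parser but returns p's result.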

"string" Implementation in Text.Parser.Char

First, just some quick context. I'm going through the Haskell Programming From First Principles book, and ran into the following exercise.
Try writing a Parser that does what string does, but using char.
I couldn't figure it out, so I checked out the source for the implementation. I'm currently trying to wrap my head around it. Here it is:
class Parsing m => CharParsing m where
  -- etc.
  string :: CharParsing m => String -> m String
  string s = s <$ try (traverse_ char s) <?> show s
My questions are as follows, from most to least specific.
Why is show necessary?
Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
What is going on with the traversal? I get what a list traversal does, so I guess I'm confused about the Applicative/Monad instances for Parser. On a high level, I get that the traversal applies char, which has type CharParsing m => Char -> m Char, to every character in string s, and then collects all the results into something of type Parser [Char]. So the types make sense, but I have no idea what's going on in the background.
Thanks in advance!
1) Why is show necessary?
Because showing a string (or a Text, etc.) escapes special characters, which makes sense for error messages:
GHCi> import Text.Parsec -- Simulating your scenario with Parsec.
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting foo
bar
GHCi> runParser ((\s -> s <$ try (traverse_ char s) <?> show s) "foo\nbar") () "" "foo"
Left (line 1, column 4):
unexpected end of input
expecting "foo\nbar"
2) Why is s <$ necessary? Doesn't traverse char s <?> s work the same? In other words, why do we throw away the results of the traversal?
The result of the parse is unnecessary because we know in advance that it would be s (if the parse were successful). traverse would needlessly reconstruct s from the results of parsing each individual character. In general, if the results are not needed it is a good idea to use traverse_ (which just combines the effects, discarding the results without trying to rebuild the data structure) rather than traverse, so that is likely why the function is written the way it is.
3) What is going on with the traversal?
traverse_ char s (traverse_, and not traverse, as explained above) is a parser. It tries to parse each character of s in order, discarding the results, and it is built by sequencing a char parser for every character of s. It may be helpful to recall that traverse_ is just a fold that uses (*>):
-- Slightly paraphrasing the definition in Data.Foldable:
traverse_ :: (Foldable t, Applicative f) => (a -> f b) -> t a -> f ()
traverse_ f = foldr (\x u -> f x *> u) (pure ())
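Coming back to the book exercise itself, a short way to write string in terms of char, using the same CharParsing vocabulary (a sketch of the exercise answer, not the library's actual definition, which additionally wraps the traversal in try and attaches a label with <?>):
import Data.Foldable (traverse_)
import Text.Parser.Char (CharParsing, char)

-- Collect the parsed characters back into a string:
string' :: CharParsing m => String -> m String
string' = traverse char

-- Or, closer to the library version: run the character parsers only for their
-- effect and return the original string directly.
string'' :: CharParsing m => String -> m String
string'' s = s <$ traverse_ char s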
