Megaparsec: skip space and non-alphanumeric - haskell

I'm a beginner with Megaparsec and Haskell in general, and trying to write a parser for the following grammar:
A word will always be one of:
A number composed of one or more ASCII digits (ie "0" or "1234") OR
A simple word composed of one or more ASCII letters (ie "a" or "they") OR
A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")
So far, I've got the following (this can probably be simplified):
data Word = Number String | SimpleWord String | Contraction String deriving (Show)
word :: Parser MyParser.Word
word = M.choice
[ Number <$> number
, Contraction <$> contraction
, SimpleWord <$> simpleWord
]
number :: Parser String
number = M.some C.numberChar
simpleWord :: Parser String
simpleWord = M.some C.letterChar
contraction :: Parser String
contraction = do
left <- simpleWord
void $ C.char '\''
right <- simpleWord
return (left ++ "'" ++ right)
But I'm having problem with defining a parser for skipping white spaces and anything that is non-alphanumeric. For example, given the input 'abc', the parser should discard the apostrophes and just take the "simple word".
The following doesn't compile:
filler :: Parser Char
filler = M.some (C.spaceChar A.<|> not C.alphaNumChar)
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer

Here is the complete working code that I came up with.
type Parser =
M.Parsec
-- The type for custom error messages. We have none, so use `Void`.
Void
-- The input stream type. Let's use `String` for now.
String
data Word = Number String | SimpleWord String | Contraction String deriving (Eq)
instance Show WordCount.Word where
show (Number x) = x
show (SimpleWord x) = x
show (Contraction x) = x
words :: String -> Either String [String]
-- Force parser to consume entire input
-- <* Sequence actions, discarding the value of the second argument.
words input = case M.parse (M.some WordCount.word A.<* M.eof) "" input of
-- :t err = M.ParseErrorBundle String Void
Left err ->
let e = M.errorBundlePretty err
_ = putStr e
in Left e
Right (x) -> Right $ map (show) x
word :: Parser WordCount.Word
word =
M.skipManyTill filler $
lexeme $
M.choice
-- <$> is infix for 'fmap'
[ Number <$> number,
Contraction <$> M.try contraction,
SimpleWord <$> simpleWord
]
number :: Parser String
number = M.some MC.numberChar
simpleWord :: Parser String
simpleWord = M.some MC.letterChar
contraction :: Parser String
contraction = do
left <- simpleWord
void $ MC.char '\''
right <- simpleWord
return $ left ++ "'" ++ right
-- Define separator characters
isSep :: Char -> Bool
isSep x = C.isSpace x || (not . C.isAlphaNum) x
-- Fillers fill the space between tokens
filler :: Parser ()
filler = void $ M.some $ M.satisfy isSep
-- 3rd and 4th arguments are for ignoring comments
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
-- A parser that discards trailing space
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer

First, you probably want to use some1 for number and simple words, otherwise "" would be a number.
Your filler parser is good. That should use some because you want to allow for e.g. "they1234" to parse as SimpleWord "they" and Number "1234".
What you need to say for the overall parser is that your text consists of zero or more words separated by filler, with optional filler before and after. Fortunately megaparsec re-exports lots of useful stuff from Control.Monad.Combinators for doing this.
So we can use sepBy for the words separated by filler:
document :: Parser [Word]
document = do
_ <- filler -- Throw away any filler at the start.
result <- word `sepBy` filler
_ <- filler -- Throw away any filler at the end.
return result
We don't need optional for the start and end filler because filler can be zero length.
Finally, a style point: in a real parser you would want to make the Word type a bit more sophisticated. Something like:
data SimpleWord = Number String | SimpleWord String
data Word = Word SimpleWord | Contraction SimpleWord SimpleWord
That way whatever bit of code deals with Contraction downstream doesn't have to find the apostrophe all over again or deal with the "impossible" case where there isn't one. Once you've found the structure information in your input, don't throw it away. But that's a side issue for this exercise.

Related

Monadic Parser - handling string with one character

I was reading this Monadic Parsing article while I was trying to implement a pretty simple string parser in Haskell and also get a better understanding of using monads. Down below you can see my code, implementing functions for matching a single character or a whole string. It works as expected, but I observed two strange behaviors that I can't explain.
I have to handle single characters in string, otherwise, the parser will return only empty lists. To be exact, if I remove this line string [c] = do char c; return [c] it won't work anymore. I was expecting that string (c:s) would handle string (c:[]) properly. What could be the cause here?
In my opinion, string definition should be equivalent to string s = mapM char s as it would create a list of [Parser Char] for each character in s and collect the results as Parser [Char]. If I use the definition based on mapM, the program would get stuck in an infinite loop and won't print anything. Is something about lazy evalutation that I miss here?
.
module Main where
newtype Parser a = Parser { apply :: String->[(a, String)] }
instance Monad Parser where
return a = Parser $ \s -> [(a, s)]
ma >>= k = Parser $ \s -> concat [apply (k a) s' | (a, s') <- apply ma s]
instance Applicative Parser where
pure = return
mf <*> ma = do { f <- mf; f <$> ma; }
instance Functor Parser where
fmap f ma = f <$> ma
empty :: Parser a
empty = Parser $ const []
anychar :: Parser Char
anychar = Parser f where
f [] = []
f (c:s) = [(c, s)]
satisfy :: (Char -> Bool) -> Parser Char
satisfy prop = do
c <- anychar
if prop c then return c
else empty
char :: Char -> Parser Char
char c = satisfy (== c)
string :: String -> Parser String
string [] = empty
string [c] = do char c; return [c] --- if I remove this line, all results will be []
string (c:s) = do char c; string s; return (c:s)
main = do
let s = "12345"
print $ apply (string "123") s
print $ apply (string "12") s
print $ apply (string "1") s
print $ apply (string []) s
PS. I think the title of the question is not suggestive enough, please propose an edit if you have a better idea.
Since you did string [] = empty instead of string [] = return [], you can't use it as a base case for recursion that builds up a list.
fmap f ma = f <$> ma is wrong, since <$> is defined in terms of fmap. If you want to define fmap in terms of your other instances, then do fmap = liftA or fmap = liftM. Since mapM uses fmap internally but your original string didn't, this problem didn't come up in your first simple test.
string [] = empty
means: "If you need to parse an empty string, fail -- it can not be parsed at all, no matter what's the input string".
By comparison,
string [] = return ""
means: "If you need to parse an empty string, succeed and return the empty string -- it can always be parsed, no matter what's the input string".
By using the first equation, when you recurse in the case string (c:cs) you need to stop at one character (string [c]) since reaching zero characters will run empty and make the whole parser fail.
Hence, you need to either use that string [c] = return [c] equation, or modify the base "empty string" case so that it succeeds. Arguably, the latter would be more natural.

How to use parsec to get sub-strings of specific pattern in a string

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,
"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on,
but F alone should not be chosen, that is, choose those which start with F
followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++
['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."
the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.
With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:
hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
; many space
; c1 <- digit
; cs <- many hao
; return (c:c1:cs)
}
parse tuhao "" str -- can parse the str and get one sub-string.
However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.
So I appreciate any tip! ^_^
F12, F 12, F1(a), F2a, F5-A, F34-5
This is an incomplete description, so I'll make some guesses.
I would start by defining a type that can contain the logical parts of these expressions. E.g.
newtype F = F (Int, Maybe String) deriving Show
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.
Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]
I would then proceed to write parsers for each of these sub-parts and compose them:
module Main where
import Text.Parsec
import Data.Maybe (catMaybes)
symbol :: String -> Parser String
symbol s = string s <* spaces
parens :: Parser a -> Parser a
parens = between (string "(") (string ")")
digits :: Parser Int
digits = read <$> many1 digit
parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
where
firstPart :: Parser Int
firstPart = symbol "F" >> digits
secondPart :: Parser (Maybe String)
secondPart = optionMaybe $ choice
[ many1 letter
, parens (many1 letter)
, string "-" >> many1 alphaNum
]
(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,
extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
<|> do anyChar >> extract p
<|> do eof >> return []
readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s
main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
This prints:
Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
Takeaways:
You can parse optional whitespace using token parsing (symbol).
You can parse optional parts with option, optionMaybe or optional.
You can alternate between combinators using a <|> b <|> c or choice [a, b, c].
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)
[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:
newtype F = F (Int, Maybe String, Maybe String)
We can get sub-strings of specific pattern in a string with the
findAll
combinator from
replace-megaparsec.
Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
let tuhao :: Parsec Void String ()
tuhao = do
void $ single 'F'
void $ space
void $ digitChar
void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."
rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

how to parse a string of comma-separated values into a list of strings in haskell?

so if I have a string "(this, is, a, story, all, about, how)" into a list of the words inside it ["this", "is", "a", "story", "all", "about", "how"] as an instance of ReadP String? I've tried a bunch of different ways, one of which being this:
parseStr :: ReadP String
parseStr = do
skipSpaces
n <- munch1 isAlphaOrDigit
skipComma
return $ n
which parses all values but the last. I thought if I combined it with this parse:
parseLast :: ReadP String
parseLast = do
skipSpaces
n <- munch1 isAlphaOrDigit
return $ n
as
parseLet = (many parseStr) +++ parseLast
but that didn't work either. Any tips?
edit: more definitions
isAlphaOrDigit :: Char -> Bool
isAlphaOrDigit a = (isDigit a) || (isAlpha a)
comma = satisfy (','==)
skipComma = const () <$> some comma
The parser a +++ b sends the entire input string to a and the entire input string to b, producing all the results that either parser produced. You instead want a parser that sends the first part of the string to a and the second part to b, then lets you combine the results. Try this instead:
parseLet = liftA2 (\ss s -> ss ++ [s]) (many parseStr) parseLast
Many parser libraries also offer a manySepBy combinator (perhaps with a slightly different name) for this exact use case; you might consider looking through the ReadP library for such a thing.

Parsec3 Text parser for quoted string, where everything is allowed in between quotes

I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.
Ok, the parser should allow for EVERY type of char inbetween quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Text.Parsec
import Text.Parsec.Text
quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
spaces
string "'"
lab <- liftM pack $ endBy1 anyChar endOfQuote
return lab
endOfQuote = do
string "'"
try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation falls.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser String
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = char '\''
quotedLabel :: Parser String
quotedLabel = do
char sq
chs <- many validChar
char sq
return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try(eof) <|> try( oneOf "| " *> return ())

Haskell "parse" not terminating for specific type of string

parse' :: Parser a -> String -> [(a,String)]
parse' p inp = p `with` inp
parse :: Parser a -> String -> [a]
parse p inp = [ v | (v,[]) <- parse' p inp ]
mkMany1 :: (Parser a -> Parser [a]) -> Parser a -> Parser [a]
mkMany1 many p = do x <- p
xs <- many p
return (x:xs)
many1L :: Parser a -> Parser [a]
many1L = mkMany1 manyL
manyL :: Parser a -> Parser [a]
manyL p = (many1L p) ||| (success [])
I'm trying to parse a String for a number of substrings that doesn't include the characters '<', '>' or ' '(space) but my parser doesn't seem to terminate. Can someone give me some pointers on what I'm missing?
textValid :: Char -> Bool
textValid c = c /= '<' && c /= '>' && not (isSpace c)
text :: Parser String
text = manyL (sat textValid)
When I try to run the following command, it never terminates.
parse (manyL text) "abc def <"
The problem is that manyL parser can succeed without consuming input (returning an empty list).
And one must not pass a parser that can succeed without consuming input as the argument of manyL, because in that case, you get precisely such an infinite loop as you are in.
After the first text consumed the "abc" prefix of the input, you are left with " def <" a String beginning with a space. So trying text on that, it consumes as many textValid characters as there are at the beginning of the String - namely 0 - and returns them - []. That leaves the same input. Now manyL text tries text another time to see if that succeeds too ...
You should probably define
text = many1L (sat textValid)
so that text doesn't succeed without consuming input, and probably it is a good idea to consume spaces from the beginning of the remaining input after each successful parse, like
text = do
result <- many1L (sat textValid)
skipSpaces
return result
(skipSpaces left to implement).

Resources