If I have a parser that reads a string of numbers separated by spaces into a list of Ints, how do I handle a trailing space? At the moment I have:
row :: Parser [Int]
row = do
  optional spaces
  f <- many (oneOf "0123456789")
  r <- (char ' ' >> row) <|> pure []
  pure (read f : r)
Which works fine with a string that does not have a trailing space but fails with a trailing space.
>λ= parse row "" " 2 0 12 3 7"
Right [2,0,12,3,7]
>λ= parse row "" " 2 0 12 3 7 "
Right [2,0,12,3,7,*** Exception: Prelude.read: no parse
What is the solution to this problem and, more to the point, how would I add a condition so that the parser returns [] once '\n' is consumed?
EDIT:
After reading @amalloy's answer and the Parsec source code, I thought it useful to add a version that works here (although @amalloy's advice not to re-implement existing functions makes more sense):
row :: Parser [Int]
row = (do
    spaces
    f <- read <$> many1 digit
    -- either more numbers follow after at least one space, or f is the last one
    (do _ <- many1 (char ' ')
        r <- row
        pure (f : r))
      <|> pure [f])
  <|> pure []
Instead of implementing all this low-level stuff yourself, I suggest just using sepEndBy. For example,
row :: Parser [Int]
row = spaces *> (int `sepEndBy` many1 space)
  where int = read <$> many1 digit
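One caveat: spaces and space also match newlines, so the version above reads straight across line breaks. If a row should stop at '\n' (the second part of the question), a minimal sketch along these lines may help. It assumes the only separator is the literal ' ' character; the name int is kept from the answer above, and everything else is illustrative.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One non-negative integer.
int :: Parser Int
int = read <$> many1 digit

-- Skip leading blanks, then read ints separated (and optionally ended)
-- by one or more blanks. A newline is not a separator here, so the row
-- stops in front of it and leaves it in the input.
row :: Parser [Int]
row = skipMany (char ' ') *> (int `sepEndBy` many1 (char ' '))
```

With this definition, parse row "" " 2 0 12 3 7 " gives Right [2,0,12,3,7], and a row followed by '\n' stops before the newline, so a caller can decide what to do with it.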
Related
I'm a newbie to Haskell, and now I'm learning to use parsec. I'm stuck on one problem: I want to get all the sub-strings which satisfy some specific pattern in a string. For example, from the following string,
"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on,
but F alone should not be chosen, that is, choose those which start with F
followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++
['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."
the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.
With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:
hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
; many space
; c1 <- digit
; cs <- many hao
; return (c:c1:cs)
}
parse tuhao "" str -- can parse the str and get one sub-string.
However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.
So I appreciate any tip! ^_^
F12, F 12, F1(a), F2a, F5-A, F34-5
This is an incomplete description, so I'll make some guesses.
I would start by defining a type that can contain the logical parts of these expressions. E.g.
newtype F = F (Int, Maybe String) deriving Show
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.
Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]
I would then proceed to write parsers for each of these sub-parts and compose them:
module Main where
import Text.Parsec
import Text.Parsec.String (Parser) -- the Parser synonym used below
-- Parse a literal string, then skip any whitespace after it.
symbol :: String -> Parser String
symbol s = string s <* spaces
parens :: Parser a -> Parser a
parens = between (string "(") (string ")")
digits :: Parser Int
digits = read <$> many1 digit
parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
  where
    firstPart :: Parser Int
    firstPart = symbol "F" >> digits
    secondPart :: Parser (Maybe String)
    secondPart = optionMaybe $ choice
      [ many1 letter
      , parens (many1 letter)
      , string "-" >> many1 alphaNum
      ]
To use this parser on a string and get multiple matches (as Jon Purdy writes in a comment), skip one character whenever it fails and try again:
extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
        <|> do anyChar >> extract p
        <|> do eof >> return []
readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s
main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
This prints:
Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
Takeaways:
You can parse optional whitespace using token parsing (symbol).
You can parse optional parts with option, optionMaybe or optional.
You can alternate between combinators using a <|> b <|> c or choice [a, b, c].
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)
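To make the last takeaway concrete, here is a small sketch of what goes wrong when FIRST sets overlap. The two keywords are made up for illustration; the point is only that string consumes input on a partial match, so the second alternative is never tried unless the first is wrapped in try.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Both alternatives begin with "le", so their FIRST sets overlap.
-- On "lexeme" the first branch consumes "le", fails at 'x', and
-- (<|>) does not fall through because input was consumed.
overlapping :: Parser String
overlapping = string "let" <|> string "lexeme"

-- try rewinds the input when the first branch fails part-way,
-- which lets the second branch start from the beginning.
withTry :: Parser String
withTry = try (string "let") <|> string "lexeme"
```

Here parse overlapping "" "lexeme" is a Left, while parse withTry "" "lexeme" is Right "lexeme".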
[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:
newtype F = F (Int, Maybe String, Maybe String)
We can get sub-strings of a specific pattern in a string with the findAll combinator from replace-megaparsec.
Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
import Data.Void (Void)     -- the Void error type used in the signature
import Data.Functor (void)  -- discards a parser's result
let tuhao :: Parsec Void String ()
    tuhao = do
      void $ single 'F'
      void $ space
      void $ digitChar
      void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."
rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]
I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.
Ok, the parser should allow for EVERY type of char in between quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Text.Parsec
import Text.Parsec.Text
quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
  spaces
  string "'"
  lab <- liftM pack $ endBy1 anyChar endOfQuote
  return lab
endOfQuote = do
  string "'"
  try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation fails.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define a combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
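As a small illustration of the lexeme idea, here is a sketch that reads a run of numbers with arbitrary whitespace between and after them. The name numbers is only for this example, and number returns an Int here rather than the raw digits.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Run p, then discard any whitespace that follows it.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

number :: Parser Int
number = read <$> many1 digit

-- Leading whitespace is skipped once up front; after that every token
-- cleans up after itself, so no parser has to look for spaces before it.
numbers :: Parser [Int]
numbers = spaces *> many (lexeme number) <* eof
```

For example, parse numbers "" "  12 34   5 " yields Right [12,34,5].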
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = '\''
quotedLabel :: Parser String
quotedLabel = do
  char sq
  chs <- many validChar
  char sq
  return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
  char sq
  notFollowedBy eof
  notFollowedBy (char '|')
  return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
      do { a <- try p; return (unexpected (show a)) }
  <|> return (return ())
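To see the workaround in action, here is a minimal sketch that specialises notFollowedBy' to the String-based Parser type (an assumption made only to keep the signature short) and uses it to accept a single quote only when more input follows. The name quoteBeforeMore is hypothetical.

```haskell
import Control.Monad (join)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Same body as above, with a simpler, monomorphic signature.
-- If p succeeds, the inner action returned is a failure; if p fails,
-- the inner action returned is a success. join then runs that action.
notFollowedBy' :: Show a => Parser a -> Parser ()
notFollowedBy' p = try $ join $
      do { a <- try p; return (unexpected (show a)) }
  <|> return (return ())

-- A single quote that is not the last character of the input.
quoteBeforeMore :: Parser Char
quoteBeforeMore = try (char '\'' <* notFollowedBy' eof)
```

With these definitions, parse quoteBeforeMore "" "'x" succeeds with '\'' while parse quoteBeforeMore "" "'" fails, which is the behaviour the plain notFollowedBy eof was expected to give.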
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
      do { a <- try p; return (unexpected (show a)) }
  <|> return (return ())
sq = '\''
validChar = do
  noneOf "'" <|> try validQuote
validQuote = do
  char sq
  notFollowedBy' eof
  notFollowedBy (char '|')
  return sq
quotedLabel :: Parser String
quotedLabel = do
  char sq
  str <- many validChar
  char sq
  return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
  case parse (labels <* eof) "" input of
    Left e -> putStrLn $ "error: " ++ show e
    Right v -> if v == expected
                 then putStrLn $ "OK - got: " ++ show v
                 else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try(eof) <|> try( oneOf "| " *> return ())
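Put together, the endOfQuote parser from the question could be sketched like this. It uses void from Control.Monad, which is the same as fmap (const ()), and it uses the String-based Parser so the example runs without extra packages; Text.Parsec.Text exports a Parser synonym of the same name for Text input.

```haskell
import Control.Monad (void)
import Text.Parsec
import Text.Parsec.String (Parser)

-- A closing quote: a single quote followed by end of input, '|' or a space.
-- void turns the Parser Char into a Parser () so both branches agree.
-- The outer try means a failed attempt leaves the quote unconsumed.
endOfQuote :: Parser ()
endOfQuote = try (char '\'' *> (eof <|> void (oneOf "| ")))
```

Here parse endOfQuote "" "'|" and parse endOfQuote "" "'" both give Right (), while parse endOfQuote "" "'a" fails.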
This time I'm trying to parse a text file into [[String]] using Parsec. Result is a list consisting of lists that represent lines of the file. Every line is a list that contains words which may be separated by any number of spaces, (optionally) commas, and spaces after commas as well.
Here is my code and it even works.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Applicative ((<$>))
import System.IO
import System.Environment
myParser :: Parser [[String]]
myParser =
  do x <- sepBy parseColl eol
     eof
     return x
eol :: Parser String
eol = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"
spaces :: Parser ()
spaces = skipMany (char ' ') >> return ()
parseColl :: Parser [String]
parseColl = many parseItem
parseItem :: Parser String
parseItem =
  do optional spaces
     x <- many1 (noneOf " ,\n\r")
     optional spaces
     optional (char ',')
     return x
parseText :: String -> String
parseText str =
  case parse myParser "" str of
    Left e -> "parser error: " ++ show e
    Right x -> show x
main :: IO ()
main =
  do fileName <- head <$> getArgs
     handle <- openFile fileName ReadMode
     contents <- hGetContents handle
     putStr $ parseText contents
     hClose handle
Test file:
this is my test file
this, line, is, separated, by, commas
and this is another, line
Result:
[["this","is","my","test","file"],
["this","line","is","separated","by","commas"],
["and","this","is","another","line"],
[]] -- well, this is a bit unexpected, but I can filter things
Now, to make my life harder, I wish to be able to 'escape' eol if there is a comma , before it, even if the comma is followed by spaces. So this should be considered one line:
this is, spaces may be here
my line
What is the best strategy (most idiomatic and elegant) to implement this syntax (without losing the ability to ignore commas inside a line)?
A couple of solutions come to mind.... One is easy, the other is medium difficulty.
The medium-difficulty solution is to define an itemSeparator to be a comma followed by whitespace, and a lineSeparator to be a '\n' or '\r' followed by whitespace.... Make sure to skip non '\n', '\r'-whitespace, but no further, at the end of the item parse, so that the very next char after an item must be either a '\n', '\r', or ',', which determines, without backtracking, whether a new item or line is coming.
Then use sepBy1 to define parseLine (i.e. parseLine = parseItem `sepBy1` parseItemSeparator), and endBy to define parseFile (i.e. parseFile = parseLine `endBy` parseLineSeparator).
You really do need that sepBy1 on the inside, vs sepBy, else you will have a list of zero sized items, which causes an infinite loop at parse time. endBy works like sepBy, but allows extra '\n', '\r' at the end of the file....
An easier way would be to canonicalize the input by running it though a simple transformation before parsing. You can write a function to remove whitespace after a comma (using dropWhile and isSpace), and perhaps even simplify the different cases of '\n', '\r'.... then run the output through a simplified parser.
Something like this would do the trick (this is untested....)
import Data.Char (isSpace)
canonicalize :: String -> String
canonicalize [] = []
canonicalize (',':rest) = ',' : canonicalize (dropWhile isSpace rest)   -- also drops a line break after a comma
canonicalize ('\n':rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n' : canonicalize (dropWhile isSpace rest) -- all '\r' will become '\n'
canonicalize (c:rest) = c : canonicalize rest
Because Haskell is lazy, this transformation will work on streaming data as the data comes in, so this really won't slow anything down at all (depending on how much you simplify the parser, it could even speed things up.... Although most likely it will be close to a wash)
I don't know how complicated the full question is, but perhaps a few rules added to a canonicalization function will in fact allow you to use lines and words after all....
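For what it's worth, here is one way the "lines and words after all" idea might look. This is only a sketch under the assumption that commas carry no meaning beyond separating words and continuing a line; splitRows and commaToSpace are names made up for the example.

```haskell
import Data.Char (isSpace)

-- Drop whitespace (including line breaks) after a comma, normalise
-- '\r' to '\n', and drop whitespace at the start of each line.
canonicalize :: String -> String
canonicalize [] = []
canonicalize (',' : rest)  = ',' : canonicalize (dropWhile isSpace rest)
canonicalize ('\n' : rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize ('\r' : rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize (c : rest)    = c : canonicalize rest

-- After canonicalizing, a comma is just another word separator,
-- so the standard lines and words functions finish the job.
splitRows :: String -> [[String]]
splitRows = map words . lines . map commaToSpace . canonicalize
  where
    commaToSpace c = if c == ',' then ' ' else c
```

On the input "this, line, is\nthis is,   \nmy line\n" this gives [["this","line","is"],["this","is","my","line"]]: the comma at the end of the second line pulls the third line into the same row.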
Just use optional spaces in parseColl, like this:
parseColl :: Parser [String]
parseColl = optional spaces >> many parseItem
parseItem :: Parser String
parseItem =
  do
    x <- many1 (noneOf " ,\n\r")
    optional spaces
    optional (char ',')
    return x
Second, divide separator from item
parseColl :: Parser [String]
parseColl = do
  optional spaces
  items <- parseItem `sepBy` parseSeparator
  optional spaces
  return items
parseItem :: Parser String
parseItem = many1 $ noneOf " ,\n\r"
parseSeparator = try (optional spaces >> char ',' >> optional spaces) <|> spaces
Third, we recreate a bit eol and spaces:
eol :: Parser String
eol = try (string "\n\r")
  <|> try (string "\r\n")   -- try, so a lone '\r' can still match below
  <|> string "\n"
  <|> string "\r"
  <|> (eof >> return "")    -- eof returns (), so supply an empty String
  <?> "end of line"
spaces :: Parser ()
spaces = skipMany1 $ char ' '
parseColl :: Parser [String]
parseColl = do
  optional spaces
  items <- parseItem `sepBy` parseSeparator
  optional spaces
  eol
  return items
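Finally, let's rewrite myParser. Since eol now also matches end of input, parseColl can succeed without consuming anything, which many rejects at run time, so stop at eof explicitly:
myParser = manyTill parseColl eof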
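Putting the three steps together, a self-contained sketch might look like this. It departs from the snippets above in a few places, all of them my own assumptions rather than part of the answer: eol returns () so that eof fits without a dummy value, sepEndBy replaces sepBy so that trailing spaces or a trailing comma do not break a line, the comma separator also swallows line breaks (which is what implements the "comma escapes eol" rule from the question), and the top level uses manyTill. The name spaces1 is made up to avoid clashing with Parsec's spaces.

```haskell
import Control.Monad (void)
import Text.Parsec
import Text.Parsec.String (Parser)

-- One or more literal spaces (never newlines).
spaces1 :: Parser ()
spaces1 = skipMany1 (char ' ')

-- A line ends at a newline sequence or at end of input.
eol :: Parser ()
eol = (void (try (string "\r\n") <|> try (string "\n\r")
             <|> string "\n" <|> string "\r")
       <|> eof)
  <?> "end of line"

parseItem :: Parser String
parseItem = many1 (noneOf " ,\n\r")

-- Either a comma (with optional spaces before it, and any spaces or
-- line breaks after it), or plain spaces.
parseSeparator :: Parser ()
parseSeparator =
      try (optional spaces1 *> char ',' *> skipMany (oneOf " \n\r"))
  <|> spaces1

parseColl :: Parser [String]
parseColl = optional spaces1 *> (parseItem `sepEndBy` parseSeparator) <* eol

-- Check for eof before each line, so no empty row is produced at the end.
myParser :: Parser [[String]]
myParser = manyTill parseColl eof
```

On the input "this, line, is\nthis is,   \nmy line\n" this returns [["this","line","is"],["this","is","my","line"]], with no trailing empty list.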
I am trying to parse some text, but I can't understand how to parse a list of symbols separated by some separator, which may or may not occur also at the end of the list.
Example (numbers separated by spaces):
set A = 1 2 3 4 5;
set B =6 7 8 9;
set C = 10 11 12 ;
If I use sepBy, after the last space I got an error because it expects another digit, even if I try to read also many whitespace after the list. If I use endBy, I got an error when the space is missing.
import Text.ParserCombinators.Parsec
main :: IO ()
main = do
  let input = "set A = 1 2 3 4 5;\n" ++
              "set B =6 7 8 9;\n" ++
              "set C = 10 11 12 ;\n"
  case parse parseInput "(unknown)" input of
    Left msg ->
      print msg
    Right rss ->
      mapM_ (\(n, vs) -> putStrLn (n ++ " = " ++ show vs)) rss
whitespace :: GenParser Char st Char
whitespace = oneOf " \t"
parseInput :: GenParser Char st [(String, [Int])]
parseInput = parseRow `endBy` newline
parseRow :: GenParser Char st (String, [Int])
parseRow = do
  string "set"
  many1 whitespace
  name <- many1 alphaNum
  many whitespace
  string "="
  many whitespace
  values <- many1 digit `sepBy` many1 whitespace
  many whitespace
  string ";"
  return (name, map read values)
The combinator I think you want is sepEndBy. Using it gives you
-- I'm using the type synonym
-- type Parser = GenParser Char ()
-- from Text.ParserCombinators.Parsec.Prim
parseRow :: Parser (String, [Int])
parseRow = do
    string "set" >> many1 whitespace
    name <- many1 alphaNum
    spaces >> char '=' >> spaces
    values <- many1 digit `sepEndBy` many1 whitespace
    char ';'
    return (name, map read values)
  where spaces = many whitespace
I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).
I've made a lot of progress, but am currently stuck with the following.
I need to be able to parse a line somewhat like this:
4 0.123 1.452 0.667 * 3.460 149 - -
Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.
Here's what I have so far:
This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.
emAnnotationSet = (,,) <$> p_int <*>
                  (reqSpaces *> char '-') <*>
                  (reqSpaces *> char '-')
The nodeNum at the beginning of the line is handled by another parser that works and I need not get into.
The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.
The parser for p_logProb looks like this:
p_logProb = liftA mkScore (lp <?> "logProb")
  where lp = try dub <|> string "*"
        dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit
And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:
hmmMatchEmissions = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces)
                    <* optSpaces <* emAnnotationSet <* eol
                    <?> "matchEmissions"
So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).
I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:
Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb
Column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?
The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme combinator converts a parser into one that skips trailing spaces.
import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )
fractional :: Fractional a => Parser a
fractional = try $ do
  n <- fromIntegral <$> decimal
  char '.'
  f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit
  return $ n + f
decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')
data Row = Row Int [Maybe Double]
  deriving ( Show )
probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')
junk = lexeme decimal <* count 2 (lexeme $ char '-')
row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk
rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof
Usage:
*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -"
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]
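One way to read why this works where the sepBy version in the question did not: with sepBy, the separator is parsed before the next element, so by the time the element parser meets the integer and fails, the spaces in front of it are already consumed and sepBy fails as a whole. With lexeme, each element eats the spaces after itself, so the failing attempt on the integer consumes nothing and the repetition just stops. The sketch below contrasts the two on a cut-down line; all names in it are illustrative, and log probabilities are kept as plain strings.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')

-- digits '.' digits, or a lone '*'; try makes the float attempt atomic.
logProb :: Parser String
logProb = try dub <|> string "*"
  where
    dub = do
      whole <- many1 digit
      _ <- char '.'
      frac <- many1 digit
      pure (whole ++ "." ++ frac)

int :: Parser Int
int = read <$> many1 digit

-- Separator first: the spaces before the integer are consumed, then
-- logProb fails, and sepBy cannot recover.
viaSepBy :: Parser ([String], Int)
viaSepBy = (,) <$> sepBy logProb (many1 (char ' ')) <* skipMany (char ' ') <*> int

-- Whitespace last: logProb fails on the integer without consuming
-- anything, so many stops cleanly and int takes over.
viaLexeme :: Parser ([String], Int)
viaLexeme = (,) <$> many (lexeme logProb) <*> lexeme int
```

On the input "0.1 * 149", viaSepBy fails while viaLexeme returns Right (["0.1","*"],149).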
I'm not exactly sure of your problem. However, to parse the line given based on your description, it would be much easier to use the existing lexers defined in Text.Parsec.Token and join them together.
The below code parses the line into a Line data type, you can process it further from there if necessary. Instead of attempting to filter out the - and integers before parsing, it uses a parseEntry parser that returns a Just Double if it is a Float value, Just 0 for *, and Nothing for integers and dashes. This is then very simply filtered using catMaybes.
Here is the code:
module Test where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)
import Control.Applicative ((<$>))
import Data.Maybe (catMaybes)
lexer = P.makeTokenParser haskellDef
parseFloat = P.float lexer
parseInteger = P.natural lexer
whiteSpace = P.whiteSpace lexer
parseEntry = try (Just <$> parseFloat)
         <|> try (const (Just 0) <$> (char '*' >> whiteSpace))
         <|> try (const Nothing <$> (char '-' >> whiteSpace))
         <|> (const Nothing <$> parseInteger)
data Line = Line {
    lineNodeNum :: Integer
  , negativeLogProbabilities :: [Double]
  } deriving (Show)
parseLine = do
  nodeNum <- parseInteger
  whiteSpace
  probabilities <- catMaybes <$> many1 parseEntry
  return $ Line { lineNodeNum = nodeNum, negativeLogProbabilities = probabilities }
Example usage:
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46]}
The only issue that may (or may not) be a problem is it will parse *- as two different tokens, rather than fail at parsing. Eg
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -*"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46,0.0]}
Note the extra 0.0 at the end of the log probabilities.