Parsing string with parsec that needs to end with particular words? - haskell

I am working on some programming exercises. The one I am working on has following input format:
Give xxxxxxxxx as yyyy.
xxxxxxxx can be in several formats that repeatedly show up during these exercises. In particular its either binary (groups of 8 separated by spaces), hexadecimal (without spaces) or octal (groups of up to 3 numbers). I have already written parsers for these formats - however they all stumble over the "as". They looked like this
binaryParser = BinaryQuestion <$> (count 8 ( oneOf "01") ) `sepBy1` space
I solved using this monstrosity (trimmed unnecessary code)
{-# LANGUAGE OverloadedStrings #-}
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Char
import Data.ByteString.Char8 (pack, unpack, dropWhile, drop, snoc)
import qualified Data.ByteString as B
data Input = BinaryQuestion [String]
| HexQuestion [String]
| OctalQuestion [String]
deriving Show
data Question = Question {input :: Input, target :: Target} deriving Show
data Target = Word deriving Show
test1 :: B.ByteString
test1 = "Give 01110100 01110101 01110010 01110100 01101100 01100101 as a word."
test2 :: B.ByteString
test2 = "Give 646f63746f72 as a word."
test3 :: B.ByteString
test3 = "Give 164 151 155 145 as a word."
targetParser :: Parser Target
targetParser = string "word" >> return Word
wrapAs :: Parser a -> Parser [a]
wrapAs kind = manyTill kind (try (string " as"))
inputParser :: Parser Input
inputParser = choice [try binaryParser, try (space >> hexParser), try octParser]
binaryParser :: Parser Input
binaryParser = BinaryQuestion <$> wrapAs (space >> count 8 ( oneOf "01") )
hexParser :: Parser Input
hexParser = HexQuestion <$> wrapAs (count 2 hexDigit)
octParser :: Parser Input
octParser = OctalQuestion <$> wrapAs (many1 space >> many1 (oneOf ['0'..'7']))
questionParser :: Parser Question
questionParser = do
string "Give"
inp <- inputParser
string " a "
tar <- targetParser
char '.'
eof
return $ Question inp tar
I don't like that I need to use the following string "as" inside the parsing of Input, and they generally are less readable. I mean using regex it would be trivial to have a trailing string. So I am not satisfied with my solution.
Is there a way I can reuse the 'nice' parsers - or at least use more readable parsers?
additional notes
The code I along the lines I wish I could get working would look like this:
{-# LANGUAGE OverloadedStrings #-}
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Char
import Data.ByteString.Char8 (pack, unpack, dropWhile, drop, snoc)
import qualified Data.ByteString as B
data Input = BinaryQuestion [String]
| HexQuestion [String]
| OctalQuestion [String]
deriving Show
data Question = Question {input :: Input, target :: Target} deriving Show
data Target = Word deriving Show
test1 :: B.ByteString
test1 = "Give 01110100 01110101 01110010 01110100 01101100 01100101 as a word."
test2 :: B.ByteString
test2 = "Give 646f63746f72 as a word."
test3 :: B.ByteString
test3 = "Give 164 151 155 145 as a word."
targetParser :: Parser Target
targetParser = string "word" >> return Word
inputParser :: Parser Input
inputParser = choice [try binaryParser, try hexParser, try octParser]
binaryParser :: Parser Input
binaryParser = BinaryQuestion <$> count 8 ( oneOf "01") `sepBy1` space
hexParser :: Parser Input
hexParser = HexQuestion <$> many1 (count 2 hexDigit)
octParser :: Parser Input
octParser = OctalQuestion <$> (many1 (oneOf ['0'..'7'])) `sepBy1` space
questionParser :: Parser Question
questionParser = do
string "Give"
many1 space
inp <- inputParser
many1 space
string "as a"
many1 space
tar <- targetParser
char '.'
eof
return $ Question inp tar
but parseTest questionParser test3 will return me parse error at (line 1, column 22):
unexpected "a"
I suppose the problem is that space is used as separator inside the input but also comes in the as a string. I don't see any function inside parsec that would fit. In frustration I tried adding try in various places - however no success.

You are working with the pattern: Give {source} as a {target}.
So you can pipe:
Parser for Give a
Parser for {source}
Parser for as a
Parser for {target}
No need to wrap the parser for {source} with the parser for as a.

EDIT:
As said in comment, the clean parser cannot be reused by Previouse solution stated at the end of this post.
It led to develop a small parser using Parsec to handle all the possible situations for end parsing of numeric string separated by space i.e.
end with a space followed by non-required-digit character, e.g. "..11 as"
end with a space, e.g. "..11 "
end with eof, e.g. "..11"
and such a parser as below:
numParser:: (Parser Char->Parser String)->[Char]->Parser [String]
numParser repeatParser digits =
let digitParser = repeatParser $ oneOf digits
endParser = (try $ lookAhead $ (space >> noneOf digits)) <|>
(try $ lookAhead $ (space <* eof)) <|>
(eof >> return ' ')
in do init <- digitParser
rest <- manyTill (space >> digitParser) endParser
return (init : rest)
And binaryParser and octParser need to be modified as below:
binaryParser = BinaryQuestion <$> numParser (count 8) "01"
octParser = OctalQuestion <$> numParser many1 ['0'..'7']
And Nothing need to change of questionParser stated in question, for reference, I state it again here:
questionParser = do
string "Give"
many1 space
inp <- inputParser
many1 space --no need change to many
string "as a"
many1 space
tar <- targetParser
char '.'
eof
return $ Question inp tar
Previous Solution:
The functions endBy1 and many in Text.Parsec are helpful in this situation.
To replace sepBy1 by endBy1 as
binaryParser = BinaryQuestion <$> count 8 ( oneOf "01") `endBy1` space
and
octParser = OctalQuestion <$> (many1 (oneOf ['0'..'7'])) `endBy1` space
Unlike sepBy1, endBy1 will read next some chars to determine whether end the parsing, and therefor, one space after the last digit will be consumed, i.e.
Give 164 151 155 145 as a word.
^ this space will be consumed
So, instead of checking one or many space before "as a...", it need check zero or many space, so why use many function instead of many1, now the code become:
...
inp <- inputParser
many space -- change to many
string "as a"
....

Related

Parsec3 Text parser for quoted string, where everything is allowed in between quotes

I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.
Ok, the parser should allow for EVERY type of char inbetween quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Text.Parsec
import Text.Parsec.Text
quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
spaces
string "'"
lab <- liftM pack $ endBy1 anyChar endOfQuote
return lab
endOfQuote = do
string "'"
try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation falls.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser String
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = char '\''
quotedLabel :: Parser String
quotedLabel = do
char sq
chs <- many validChar
char sq
return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try(eof) <|> try( oneOf "| " *> return ())

Parser for JSON String

I'm trying to write a parser for a JSON String.
A valid example, per my parser, would be: "\"foobar\"" or "\"foo\"bar\"".
Here's what I attempted, but it does not terminate:
parseEscapedQuotes :: Parser String
parseEscapedQuotes = Parser f
where
f ('"':xs) = Just ("\"", xs)
f _ = Nothing
parseStringJValue :: Parser JValue
parseStringJValue = (\x -> S (concat x)) <$>
((char '"') *>
(zeroOrMore (alt parseEscapedQuotes (oneOrMore (notChar '"'))))
<* (char '"'))
My reasoning is that, I can have a repetition of either escaped quotes "\"" or characters not equal to ".
But it's not working as I expected:
ghci> runParser parseStringJValue "\"foobar\""
Nothing
I don't know what parser combinator library you are using, but here is a working example using Parsec. I'm using monadic style to make it clearer what's going on, but it is easily translated to applicative style.
import Text.Parsec
import Text.Parsec.String
jchar :: Parser Char
jchar = escaped <|> anyChar
escaped :: Parser Char
escaped = do
char '\\'
c <- oneOf ['"', '\\', 'r', 't' ] -- etc.
return $ case c of
'r' -> '\r'
't' -> '\t'
_ -> c
jstringLiteral :: Parser String
jstringLiteral = do
char '"'
cs <- manyTill jchar (char '"')
return cs
test1 = parse jstringLiteral "" "\"This is a test\""
test2 = parse jstringLiteral "" "\"This is an embedded quote: \\\" after quote\""
test3 = parse jstringLiteral "" "\"Embedded return: \\r\""
Note the extra level of backslashes needed to represent parser input as Haskell string literals. Reading the input from a file would make creating the parser input more convenient.
The definition of the manyTill combinator is:
manyTill p end = scan
where
scan = do{ end; return [] }
<|>
do{ x <- p; xs <- scan; return (x:xs) }
and this might help you figure out why your definitions aren't working.

Parsec: Skipping first line

I have written a parsec code which works perfectly for what I want. It parses as expected the following file:
4,5
6,7
The corresponding code is like this:
import Text.ParserCombinators.Parsec
import Control.Applicative hiding ((<|>))
import Control.Monad
data Test = Test Integer Integer deriving Show
integer :: Parser Integer
integer = rd <$> many1 digit
where rd = read :: String -> Integer
testParser :: Parser Test
testParser = do
a <- integer
char ','
b <- integer
return $ Test a b
testParserFile = endBy testParser eol
eol :: Parser Char
eol = char '\n'
main = do
a <- parseFromFile testParserFile "./jack.txt"
print a
But my actual files are like this:
col 1,col 2
4,5
6,7
Is there a way to make the above parser, just skip the first line ?
testParserFile = manyTill anyChar newline *> endBy testParser eol
manyTill p end applies p until end succeeds. *> sequences two actions and discards the first value.
Note: if your actual file doesn't contain a newline at the end, then you need to use sepEndBy instead of endBy. However, this could be a result of the Markdown parser on StackOverflow.

How do I sepBy ambiguous parse with Parsec?

I am trying to separate a string using a delimiter consisting of multiple characters, but the problem is that each of those characters can occur by itself in non-delimiting string. For example, I have foo*X*bar*X*baz, where the delimiter is *X*, so I want to get [foo, bar, baz], but each one of those can contain * or X.
I have tried
sepBy (many anyChar) delimiter
but that just swallows the whole string, giving "foo*X*bar*X*baz", if I do
sepBy anyChar (optional delimiter)
it filters out the delimiters correctly, but doesn't partition the list, returning "foobarbaz". I don't know which other combination I could try.
Perhaps you want something like this,
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
The anyToken prevents us from looping forever at the end of input, the try lets us avoid being over-eager in consuming separator characters.
Full code for a test,
module ParsecTest where
import Control.Applicative ((<$), (<$>), (<*>))
import Data.List (intercalate)
import Text.Parsec
import Text.Parsec.String
sep,msg :: String
sep = "*X*"
msg = intercalate "*X*" ["foXo", "ba*Xr", "bX*az"]
tok :: Parser String
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
toks :: Parser [String]
toks = many tok
test :: Either ParseError [String]
test = runP toks () "" msg

Parsec lookahead to handle ints

I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).
I've made a lot of progress, but am currently stuck with the following.
I need to be able to parse a line somewhat like this:
4 0.123 1.452 0.667 * 3.460 149 - -
Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.
Here's what I have so far:
This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.
emAnnotationSet = (,,) <$> p_int <*>
(reqSpaces *> char '-') <*>
(reqSpaces *> char '-')
the nodeNum at the beginning of the line is handled by another parser that works and I need not get into.
The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.
the parser for p_logProb looks like this:
p_logProb = liftA mkScore (lp <?> "logProb")
where lp = try dub <|> string "*"
dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit
And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:
hmmMatchEmissions = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces)
<* optSpaces <* emAnnotationSet <* eol
<?> "matchEmissions"
So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).
I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:
Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb
column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?
The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme combinator converts a parser into one that skips trailing spaces.
import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )
fractional :: Fractional a => Parser a
fractional = try $ do
n <- fromIntegral <$> decimal
char '.'
f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit
return $ n + f
decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')
data Row = Row Int [Maybe Double]
deriving ( Show )
probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')
junk = lexeme decimal <* count 2 (lexeme $ char '-')
row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk
rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof
Usage:
*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -"
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]
I'm not exactly sure of your problem. However, to parse the line given based on your description, it would be much easier to use existing lexers define in Text.Parsec.Token1, and join them together.
The below code parses the line into a Line data type, you can process it further from there if necessary. Instead of attempting to filter out the - and integers before parsing, it uses a parseEntry parser that returns a Just Double if it is a Float value, Just 0 for *, and Nothing for integers and dashes. This is then very simply filtered using catMaybes.
Here is the code:
module Test where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)
import Control.Applicative ((<$>))
import Data.Maybe (catMaybes)
lexer = P.makeTokenParser haskellDef
parseFloat = P.float lexer
parseInteger = P.natural lexer
whiteSpace = P.whiteSpace lexer
parseEntry = try (Just <$> parseFloat)
<|> try (const (Just 0) <$> (char '*' >> whiteSpace))
<|> try (const Nothing <$> (char '-' >> whiteSpace))
<|> (const Nothing <$> parseInteger)
data Line = Line {
lineNodeNum :: Integer
, negativeLogProbabilities :: [Double]
} deriving (Show)
parseLine = do
nodeNum <- parseInteger
whiteSpace
probabilities <- catMaybes <$> many1 parseEntry
return $ Line { lineNodeNum = nodeNum, negativeLogProbabilities = probabilities }
Example usage:
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46]}
The only issue that may (or may not) be a problem is it will parse *- as two different tokens, rather than fail at parsing. Eg
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -*"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46,0.0]}
Note the extra 0.0 at the end of the log probabilities.

Resources