Escaping end of line with Parsec - haskell

This time I'm trying to parse a text file into [[String]] using Parsec. Result is a list consisting of lists that represent lines of the file. Every line is a list that contains words which may be separated by any number of spaces, (optionally) commas, and spaces after commas as well.
Here is my code and it even works.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Applicative ((<$>))
import System.IO
import System.Environment
myParser :: Parser [[String]]
myParser =
do x <- sepBy parseColl eol
eof
return x
eol :: Parser String
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<?> "end of line"
spaces :: Parser ()
spaces = skipMany (char ' ') >> return ()
parseColl :: Parser [String]
parseColl = many parseItem
parseItem :: Parser String
parseItem =
do optional spaces
x <- many1 (noneOf " ,\n\r")
optional spaces
optional (char ',')
return x
parseText :: String -> String
parseText str =
case parse myParser "" str of
Left e -> "parser error: " ++ show e
Right x -> show x
main :: IO ()
main =
do fileName <- head <$> getArgs
handle <- openFile fileName ReadMode
contents <- hGetContents handle
putStr $ parseText contents
hClose handle
Test file:
this is my test file
this, line, is, separated, by, commas
and this is another, line
Result:
[["this","is","my","test","file"],
["this","line","is","separated","by","commas"],
["and","this","is","another","line"],
[]] -- well, this is a bit unexpected, but I can filter things
Now, to make my life harder, I wish to be able to 'escape' eol if there is a comma , before it, even if the comma is followed by spaces. So this is should be considered one line:
this is, spaces may be here
my line
What is best strategy (most idiomatic and elegant) to implement this syntax (without losing the ability to ignore commas inside a line).

A couple of solutions come to mind.... One is easy, the other is medium difficulty.
The medium-difficulty solution is to define an itemSeparator to be a comma followed by whitespace, and a lineSeparator to be a '\n' or '\r' followed by whitespace.... Make sure to skip non '\n', '\r'-whitespace, but no further, at the end of the item parse, so that the very next char after an item must be either a '\n', '\r', or ',', which determines, without backtracking, whether a new item or line is coming.
Then use sepBy1 to define parseLine (ie- parseLine = parseItem sepBy1 parseItemSeparator), and endBy to define parseFile (ie- parseFile = parseLine endBy parseLineSeparator).
You really do need that sepBy1 on the inside, vs sepBy, else you will have a list of zero sized items, which causes an infinite loop at parse time. endBy works like sepBy, but allows extra '\n', '\r' at the end of the file....
An easier way would be to canonicalize the input by running it though a simple transformation before parsing. You can write a function to remove whitespace after a comma (using dropWhile and isSpace), and perhaps even simplify the different cases of '\n', '\r'.... then run the output through a simplified parser.
Something like this would do the trick (this is untested....)
canonicalize::String->String
canonicalize [] == []
canonicalize (',':rest) = ',':canonicalize (dropWhile isSpace rest)
canonicalize ('\n':rest) = '\n':canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n':canonicalize (dropWhile isSpace rest) --all '\r' will become '\n'
canonicalize (c:rest) = c:canonicalize rest
Because Haskell is lazy, this transformation will work on streaming data as the data comes in, so this really won't slow anything down at all (depending on how much you simplify the parser, it could even speed things up.... Although most likely it will be close to a wash)
I don't know how complicated the full question is, but perhaps a few rules added to a canonicalization function will in fact allow you to use lines and words after all....

Just use optional spaces in parseColl, like this:
parseColl :: Parser [String]
parseColl = optional spaces >> many parseItem
parseItem :: Parser String
parseItem =
do
x <- many1 (noneOf " ,\n\r")
optional spaces
optional (char ',')
return x
Second, divide separator from item
parseColl :: Parser [String]
parseColl = do
optional spaces
items <- parseItem `sepBy` parseSeparator
optional spaces
return items
parseItem :: Parser String
parseItem = many1 $ noneOf " ,\n\r"
parseSeparator = try (optional spaces >> char ',' >> optional spaces) <|> spaces
Third, we recreate a bit eol and spaces:
eol :: Parser String
eol = try (string "\n\r")
<|> string "\r\n"
<|> string "\n"
<|> string "\r"
<|> eof
<?> "end of line"
spaces :: Parser ()
spaces = skipMany1 $ char ' '
parseColl :: Parser [String]
parseColl = do
optional spaces
items <- parseItem `sepBy` parseSeparator
optional spaces
eol
return items
Finally, let's rewrite myParser:
myParser = many parseColl

Related

Conditions in Parsec

If I have a parser than reads a string of numbers separated by spaces into a list of Ints, how do I handle a trailing space? At the moment I have:
row :: Parser [Int]
row = do
optional spaces
f <- (many (oneOf "0123456789"))
r <- ((char ' ') >> row) <|> pure []
pure (read f:r)
Which works fine with a string that does not have a trailing space but fails with a trailing space.
>λ= parse row "" " 2 0 12 3 7"
Right [2,0,12,3,7]
>λ= parse row "" " 2 0 12 3 7 "
Right [2,0,12,3,7,*** Exception: Prelude.read: no parse
What is the solution to this problem and more so, how would I have a condition where if '\n' is consumed then the parser returns []
EDIT:
From reading #amalloy's answer and the parsec source code, I thought it useful to add a version that works here (although, #amalloy's advice to not try and roll existing functions makes more sense)
row :: Parser [Int]
row = do
spaces
f <- (read <$> many1 digit)
do
many1 $ char ' '
r <- row
pure (f:r) <|> pure [x]
<|> pure []
Instead of implementing all this low-level stuff yourself, I suggest just using sepEndBy. For example,
row :: Parser [Int]
row = spaces *> (int `sepEndBy` many1 space)
where int = read <$> many1 digit

Reading list of statements and ending with a single expression, when statements can be expressions

I've run into a problem where I want to parse a block of code with the following syntax
{
<stmt>;
<stmt>;
<stmt>;
<expr>
}
A statement can be of the form <expr>;. This trips up Parsec in a way which I don't know how to fix. This is probably just me being kinda new to Haskell and the Parsec library, but I don't know where to search for a solution to the problem. I've written an example that captures my exact problem.
With the input { 5; 5; 5 } it fails on the third 5, because it expects there to be a ; present. How do I get around this?
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Combinator
parseIdentifier = do
first <- letter
rest <- many $ letter <|> digit <|> char '_'
return $ first : rest
parseExpr = parseIdentifier <|> many1 digit
parseStmt = parseExpr <* char ';'
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- try $ parseStmt `sepBy` spaces
parseExpr
)
readParser :: Parser String -> String -> String
readParser parser input = case parse parser "dusk" input of
Left err -> show err
Right val -> val
main = interact $ readParser parseBlock
Instead of sepBy, this sort of problems often can be solved by manyTill, the tricky point is to keep the input that don't be consumed by manyTill, it have to use try $ lookAhead
Side note: the reason can be found in source code of Parsec.
Internally, manyTill use <|>, so why try take effect, and
lookAhead can retain the input when apply monad bind >>=, >>
So, the correction look like:
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- manyTill (parseStmt <* spaces)
(try $ lookAhead (parseExpr >> space))
parseExpr
)
The above parser just return the output of parseExpr, i.e. 5, if this is your intent, it can be simpified by:
manyTill (parseStmt <* spaces) (try $ lookAhead (parseExpr >> space)) >> parseExpr
if you actually need the parsed string of statements as well, it become:
(do
stmts <- manyTill (parseStmt <* spaces)
(try $ lookAhead (parseExpr >> space))
expr <- parseExpr
return (concat (stmts ++ [expr]))
)
it return 555
The problem with your code is that sepBy have certain expectation about its parameters. In case separator successfully parsed it doesn't expect element parser to fail.
To fix that I suggest following improvement
parseBlock = between
(char '{' >> spaces)
(spaces >> char '}')
(do
stmts <- try $ many $ spaces *> parseStmt
spaces
parseExpr
)

Skip blank lines with Parsec

New to Parsec, a beginner's question. How can one parse a file of lines where some lines may be blank, consisting only of whitespace followed by a newline? I just want to skip them, not have them in the parsed output.
import Text.ParserCombinators.Parsec
-- alias for parseTest
run :: Show a => Parser a -> String -> IO ()
run = parseTest
-- parse lines
p :: Parser [[String]]
p = lineP `endBy` newline <* eof
where lineP = wordP `sepBy` (char ' ')
wordP = many $ noneOf "\n"
Example parse with blank line:
*Main> run p "z x c\n1 2 3\n \na\n"
[["z x c"],["1 2 3"],[" "],["a"]]
I suspect I am going about this all wrong.
Instead of using newline, you could define a custom parser that captures your notion of the end of a line, which would parse at least one newline, and then optionally many empty lines (i.e. whitespaces followed by another newline). You will need the try operator to backtrack if the whitespace is not followed by another newline (or the end of input, I guess):
Code:
-- parse lines
p :: Parser [[String]]
p = lineP `endBy` lineEnd <* eof
where lineP = wordP `sepBy` (char ' ')
wordP = many $ noneOf " \n"
lineEnd :: Parser ()
lineEnd = do
newline
many (try (many (oneOf " \t") >> newline))
return ()
Output:
*Main> run p "z x c\n1 2 3\n \na\n"
[["z","x","c"],["1","2","3"],["a"]]
One approach might be to think of a file as a series of lines that are either blank or non-blank. The following expresses this idea with the expression line <|> emptyLine. The following uses the Maybe datatype to distinguish between the result of parsing a non-blank line, using catMaybes to filter out the Nothings at the end.
#!/usr/bin/env stack
{- stack
--resolver lts-7.0
--install-ghc
runghc
--package parsec
-}
import Prelude hiding (lines)
import Data.Maybe (catMaybes)
import Text.ParserCombinators.Parsec
-- parse lines
p :: Parser [[String]]
p = catMaybes <$> lines
where lines = (line <|> emptyLine) `endBy` newline <* eof
line = Just <$> word `sepBy1` spaces1
emptyLine = spaces1 >> pure Nothing
word = many1 $ noneOf ['\n', ' ']
spaces1 = skipMany1 (char ' ')
main = parseTest p "z x c\n1 2 3\n \na\n"
Output is:
[["z","x","c"],["1","2","3"],["a"]]
Another approach might be to use Prelude functions along with Data.Char.isSpace to collect the non-blank lines before you get started:
#!/usr/bin/env stack
{- stack
--resolver lts-7.0
--install-ghc
runghc
--package parsec
-}
import Data.Char
import Text.ParserCombinators.Parsec
p :: Parser [[String]]
p = line `endBy` newline <* eof where
line = word `sepBy1` spaces1
word = many1 $ noneOf ['\n', ' ']
spaces1 = skipMany1 (char ' ')
main = parseTest p (unlines nonBlankLines)
where input = "z x c\n1 2 3\n \na\n"
nonBlankLines = filter (not . all isSpace) $ lines input
Output is:
[["z","x","c"],["1","2","3"],["a"]]
This is pretty simple and has the additional benefit that using lines will not require a newline at the end of each line (this helps with portability).
Note, there was a small bug with your wordP parser. Also note that, as specified, these parsers do not cope with preceding or trailing spaces (on non-blank lines). I'm imaging that your non-minimal code is more resilient.

Parsec3 Text parser for quoted string, where everything is allowed in between quotes

I have actually asked this question before (here) but it turns out that the solution provided did not handle all test cases. Also, I need 'Text' parser rather than 'String', so I need parsec3.
Ok, the parser should allow for EVERY type of char inbetween quotes, even quotes. The end of the quoted text is marked by a ' character, followed by |, a space or end of input.
So,
'aa''''|
should return a string
aa'''
This is what I have:
import Text.Parsec
import Text.Parsec.Text
quotedLabel :: Parser Text
quotedLabel = do -- reads the first quote.
spaces
string "'"
lab <- liftM pack $ endBy1 anyChar endOfQuote
return lab
endOfQuote = do
string "'"
try(eof) <|> try( oneOf "| ")
Now, the problem here is of course that eof has a different type than oneOf "| ", so compilation falls.
How do I fix this? Is there a better way to achieve what I am trying to do?
Whitespace
First a comment on handling white space...
Generally the practice is to write your parsers so that they
consume the whitespace following a token
or syntactic unit. It's common to define combinator like:
lexeme p = p <* spaces
to easily convert a parser p to one that discards the whitespace
following whatever p parses. E.g., if you have
number = many1 digit
simply use lexeme number whenever you want to eat up the
whitespace following the number.
For more on this approach to handling whitespace and other advice
on parsing languages, see this Megaparsec tutorial.
Label expressions
Based on your previous SO question it appears you want
to parse expressions of the form:
label1 | label2 | ... | labeln
where each label may be a simple label or a quoted label.
The idiomatic way to parse this pattern is to use sepBy like this:
labels :: Parser String
labels = sepBy1 (try quotedLabel <|> simpleLabel) (char '|')
We define both simpleLabel and quotedLabel in terms of
what characters may occur in them. For simpleLabel a valid
character is a non-| and non-space:
simpleLabel :: Parser String
simpleLabel = many (noneOf "| ")
A quotedLabel is a single quote followed by a run
of valid quotedLabel-characters followed by an ending
single quote:
sq = char '\''
quotedLabel :: Parser String
quotedLabel = do
char sq
chs <- many validChar
char sq
return chs
A validChar is either a non-single quote or a single
quote not followed by eof or a vertical bar:
validChar = noneOf [sq] <|> try validQuote
validQuote = do
char sq
notFollowedBy eof
notFollowedBy (char '|')
return sq
The first notFollowedBy will fail if the single quote appears just
before the end of input. The second notFollowedBy will fail if
next character is a vertical bar. Therefore the sequence of the two
will succeed only if there is a non-vertical bar character following
the single quote. In this case the single quote should be interpreted
as part of the string and not the terminating single quote.
Unfortunately this doesn't quite work because the
current implementation of notFollowedBy
will always succeed with a parser which does not consume any
input -- i.e. like eof. (See this issue for more details.)
To work around this problem we can use this alternate
implementation:
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
Here is the complete solution with some tests. By adding a few lexeme
calls you can make this parser eat up any white space where you decide
it is not significant.
import Text.Parsec hiding (labels)
import Text.Parsec.String
import Control.Monad
notFollowedBy' :: (Stream s m t, Show a) => ParsecT s u m a -> ParsecT s u m ()
notFollowedBy' p = try $ join $
do {a <- try p; return (unexpected (show a));}
<|> return (return ())
sq = '\''
validChar = do
noneOf "'" <|> try validQuote
validQuote = do
char sq
notFollowedBy' eof
notFollowedBy (char '|')
return sq
quotedLabel :: Parser String
quotedLabel = do
char sq
str <- many validChar
char sq
return str
plainLabel :: Parser String
plainLabel = many (noneOf "| ")
labels :: Parser [String]
labels = sepBy1 (try quotedLabel <|> try plainLabel) (char '|')
test input expected = do
case parse (labels <* eof) "" input of
Left e -> putStrLn $ "error: " ++ show e
Right v -> if v == expected
then putStrLn $ "OK - got: " ++ show v
else putStrLn $ "NOT OK - got: " ++ show v ++ " expected: " ++ show expected
test1 = test "a|b|c" ["a","b","c"]
test2 = test "a|'b b'|c" ["a", "b b", "c"]
test3 = test "'abc''|def" ["abc'", "def" ]
test4 = test "'abc'" ["abc"]
test5 = test "x|'abc'" ["x","abc"]
To change the result of any functor computation you can just use:
fmap (const x) functor_comp
e.g.:
getLine :: IO String
fmap (const ()) getLine :: IO ()
eof :: Parser ()
oneOf "| " :: Parser Char
fmap (const ()) (oneOf "| ") :: Parser ()
Another option is to use operators from Control.Applicative:
getLine *> return 3 :: IO Integer
This performs getLine, discards the result and returns 3.
In your case, you might use:
try(eof) <|> try( oneOf "| " *> return ())

How do I sepBy ambiguous parse with Parsec?

I am trying to separate a string using a delimiter consisting of multiple characters, but the problem is that each of those characters can occur by itself in non-delimiting string. For example, I have foo*X*bar*X*baz, where the delimiter is *X*, so I want to get [foo, bar, baz], but each one of those can contain * or X.
I have tried
sepBy (many anyChar) delimiter
but that just swallows the whole string, giving "foo*X*bar*X*baz", if I do
sepBy anyChar (optional delimiter)
it filters out the delimiters correctly, but doesn't partition the list, returning "foobarbaz". I don't know which other combination I could try.
Perhaps you want something like this,
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
The anyToken prevents us from looping forever at the end of input, the try lets us avoid being over-eager in consuming separator characters.
Full code for a test,
module ParsecTest where
import Control.Applicative ((<$), (<$>), (<*>))
import Data.List (intercalate)
import Text.Parsec
import Text.Parsec.String
sep,msg :: String
sep = "*X*"
msg = intercalate "*X*" ["foXo", "ba*Xr", "bX*az"]
tok :: Parser String
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
toks :: Parser [String]
toks = many tok
test :: Either ParseError [String]
test = runP toks () "" msg

Resources