I have written some Parsec code that works perfectly for what I want. It parses the following file as expected:
4,5
6,7
The corresponding code is like this:
import Text.ParserCombinators.Parsec
import Control.Applicative hiding ((<|>))
import Control.Monad
data Test = Test Integer Integer deriving Show
integer :: Parser Integer
integer = rd <$> many1 digit
  where rd = read :: String -> Integer
testParser :: Parser Test
testParser = do
  a <- integer
  char ','
  b <- integer
  return $ Test a b
testParserFile = endBy testParser eol
eol :: Parser Char
eol = char '\n'
main = do
  a <- parseFromFile testParserFile "./jack.txt"
  print a
But my actual files are like this:
col 1,col 2
4,5
6,7
Is there a way to make the above parser just skip the first line?
testParserFile = manyTill anyChar newline *> endBy testParser eol
manyTill p end applies p until end succeeds. *> sequences two actions and discards the first value.
Note: if your actual file doesn't contain a newline at the end, then you need to use sepEndBy instead of endBy. However, the missing newline could just be an artifact of the Markdown parser on StackOverflow.
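A minimal sketch of the header-skipping idea, run against an in-memory string rather than a file. The Pair type and the names skipHeader and pairsFile are made up for illustration; sepEndBy is used here on the assumption that the final newline may be missing.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical record type standing in for the question's Test.
data Pair = Pair Integer Integer deriving (Show, Eq)

integer :: Parser Integer
integer = read <$> many1 digit

pair :: Parser Pair
pair = Pair <$> integer <* char ',' <*> integer

-- Discard everything up to and including the first newline.
skipHeader :: Parser ()
skipHeader = () <$ manyTill anyChar newline

-- sepEndBy accepts input with or without a newline after the last record.
pairsFile :: Parser [Pair]
pairsFile = skipHeader *> (pair `sepEndBy` newline) <* eof
```

With these definitions, parse pairsFile "" "col 1,col 2\n4,5\n6,7" gives Right [Pair 4 5,Pair 6 7], and the same result comes back when the input ends with a newline.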
I am working on some programming exercises. The one I am working on has following input format:
Give xxxxxxxxx as yyyy.
xxxxxxxx can be in several formats that repeatedly show up during these exercises. In particular, it is either binary (groups of 8 digits separated by spaces), hexadecimal (without spaces) or octal (groups of up to 3 digits). I have already written parsers for these formats, but they all stumble over the "as". They looked like this:
binaryParser = BinaryQuestion <$> (count 8 ( oneOf "01") ) `sepBy1` space
I solved it using this monstrosity (unnecessary code trimmed):
{-# LANGUAGE OverloadedStrings #-}
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Char
import Data.ByteString.Char8 (pack, unpack, dropWhile, drop, snoc)
import qualified Data.ByteString as B
data Input = BinaryQuestion [String]
           | HexQuestion [String]
           | OctalQuestion [String]
           deriving Show
data Question = Question {input :: Input, target :: Target} deriving Show
data Target = Word deriving Show
test1 :: B.ByteString
test1 = "Give 01110100 01110101 01110010 01110100 01101100 01100101 as a word."
test2 :: B.ByteString
test2 = "Give 646f63746f72 as a word."
test3 :: B.ByteString
test3 = "Give 164 151 155 145 as a word."
targetParser :: Parser Target
targetParser = string "word" >> return Word
wrapAs :: Parser a -> Parser [a]
wrapAs kind = manyTill kind (try (string " as"))
inputParser :: Parser Input
inputParser = choice [try binaryParser, try (space >> hexParser), try octParser]
binaryParser :: Parser Input
binaryParser = BinaryQuestion <$> wrapAs (space >> count 8 ( oneOf "01") )
hexParser :: Parser Input
hexParser = HexQuestion <$> wrapAs (count 2 hexDigit)
octParser :: Parser Input
octParser = OctalQuestion <$> wrapAs (many1 space >> many1 (oneOf ['0'..'7']))
questionParser :: Parser Question
questionParser = do
  string "Give"
  inp <- inputParser
  string " a "
  tar <- targetParser
  char '.'
  eof
  return $ Question inp tar
I don't like that I need to put the trailing string "as" inside the parsing of Input, and the parsers are generally less readable. With a regex it would be trivial to match a trailing string, so I am not satisfied with my solution.
Is there a way I can reuse the 'nice' parsers - or at least use more readable parsers?
additional notes
The code I wish I could get working would look along these lines:
{-# LANGUAGE OverloadedStrings #-}
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Char
import Data.ByteString.Char8 (pack, unpack, dropWhile, drop, snoc)
import qualified Data.ByteString as B
data Input = BinaryQuestion [String]
           | HexQuestion [String]
           | OctalQuestion [String]
           deriving Show
data Question = Question {input :: Input, target :: Target} deriving Show
data Target = Word deriving Show
test1 :: B.ByteString
test1 = "Give 01110100 01110101 01110010 01110100 01101100 01100101 as a word."
test2 :: B.ByteString
test2 = "Give 646f63746f72 as a word."
test3 :: B.ByteString
test3 = "Give 164 151 155 145 as a word."
targetParser :: Parser Target
targetParser = string "word" >> return Word
inputParser :: Parser Input
inputParser = choice [try binaryParser, try hexParser, try octParser]
binaryParser :: Parser Input
binaryParser = BinaryQuestion <$> count 8 ( oneOf "01") `sepBy1` space
hexParser :: Parser Input
hexParser = HexQuestion <$> many1 (count 2 hexDigit)
octParser :: Parser Input
octParser = OctalQuestion <$> (many1 (oneOf ['0'..'7'])) `sepBy1` space
questionParser :: Parser Question
questionParser = do
  string "Give"
  many1 space
  inp <- inputParser
  many1 space
  string "as a"
  many1 space
  tar <- targetParser
  char '.'
  eof
  return $ Question inp tar
but parseTest questionParser test3 gives me parse error at (line 1, column 22):
unexpected "a"
I suppose the problem is that space is used as a separator inside the input but also comes before the "as a" string. I don't see any function in Parsec that would fit. In frustration I tried adding try in various places, without success.
You are working with the pattern: Give {source} as a {target}.
So you can chain these parsers in sequence:
Parser for Give a
Parser for {source}
Parser for as a
Parser for {target}
No need to wrap the parser for {source} with the parser for as a.
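The chain above can be sketched with applicative combinators. This is only an illustration under some assumptions: it uses String input instead of ByteString, the source parser is a stand-in that accepts pairs of hex digits, and the names source, target and question are hypothetical.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Target = Word deriving (Show, Eq)

-- Stand-in source parser: a run of hex digit pairs, no spaces.
source :: Parser [String]
source = many1 (count 2 hexDigit)

target :: Parser Target
target = Word <$ string "word"

-- "Give", then the source, then "as a", then the target, then the final dot.
question :: Parser ([String], Target)
question = (,) <$> (string "Give" *> many1 space *> source)
               <*> (many1 space *> string "as a" *> many1 space *> target)
               <*  char '.' <* eof
```

Here the source parser knows nothing about "as a". That works for the hex format because its digits are not separated by spaces; the space-separated formats need more care.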
EDIT:
As said in the comments, the clean parsers cannot be reused with the previous solution stated at the end of this post.
This led me to write a small Parsec parser that handles all the ways a space-separated numeric string can end, i.e.
end with a space followed by non-required-digit character, e.g. "..11 as"
end with a space, e.g. "..11 "
end with eof, e.g. "..11"
The parser looks like this:
numParser :: (Parser Char -> Parser String) -> [Char] -> Parser [String]
numParser repeatParser digits =
  let digitParser = repeatParser $ oneOf digits
      endParser = (try $ lookAhead $ (space >> noneOf digits)) <|>
                  (try $ lookAhead $ (space <* eof)) <|>
                  (eof >> return ' ')
  in do init <- digitParser
        rest <- manyTill (space >> digitParser) endParser
        return (init : rest)
And binaryParser and octParser need to be modified as below:
binaryParser = BinaryQuestion <$> numParser (count 8) "01"
octParser = OctalQuestion <$> numParser many1 ['0'..'7']
Nothing needs to change in the questionParser stated in the question; for reference, here it is again:
questionParser = do
  string "Give"
  many1 space
  inp <- inputParser
  many1 space -- no need to change this to many
  string "as a"
  many1 space
  tar <- targetParser
  char '.'
  eof
  return $ Question inp tar
Previous Solution:
The functions endBy1 and many in Text.Parsec are helpful in this situation.
Replace sepBy1 with endBy1:
binaryParser = BinaryQuestion <$> count 8 ( oneOf "01") `endBy1` space
and
octParser = OctalQuestion <$> (many1 (oneOf ['0'..'7'])) `endBy1` space
Unlike sepBy1, endBy1 requires the separator after every element, including the last one, and therefore the space after the last digit group is consumed, i.e.
Give 164 151 155 145 as a word.
                    ^ this space will be consumed
So, instead of expecting one or more spaces before "as a...", the parser has to expect zero or more spaces, which is why many is used instead of many1. The code becomes:
...
inp <- inputParser
many space -- change to many
string "as a"
....
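Another option, sketched here as an alternative rather than as the approach of the answer above, is to keep sepBy1 but make the separator itself refuse a space that is not followed by another digit. The names groupsOf, octGroups, binGroups and givenAs are hypothetical, and String input is assumed.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Digit groups separated by single spaces. The separator only matches a space
-- that is directly followed by a character from the alphabet; otherwise try
-- rewinds, so the space is left for whatever parser comes next.
groupsOf :: String -> Parser String -> Parser [String]
groupsOf alphabet group = group `sepBy1` separator
  where
    separator = try (space <* lookAhead (oneOf alphabet))

octGroups :: Parser [String]
octGroups = groupsOf ['0'..'7'] (many1 (oneOf ['0'..'7']))

binGroups :: Parser [String]
binGroups = groupsOf "01" (count 8 (oneOf "01"))

-- Hypothetical wrapper: "Give", the source, then "as a".
givenAs :: Parser a -> Parser a
givenAs src = string "Give" *> many1 space *> src <* many1 space <* string "as a"
```

With this separator, many1 space before "as a" can stay as it is, because the space after the last group is never consumed by the group parser.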
New to Parsec, a beginner's question. How can one parse a file of lines where some lines may be blank, consisting only of whitespace followed by a newline? I just want to skip them, not have them in the parsed output.
import Text.ParserCombinators.Parsec
-- alias for parseTest
run :: Show a => Parser a -> String -> IO ()
run = parseTest
-- parse lines
p :: Parser [[String]]
p = lineP `endBy` newline <* eof
  where lineP = wordP `sepBy` (char ' ')
        wordP = many $ noneOf "\n"
Example parse with blank line:
*Main> run p "z x c\n1 2 3\n \na\n"
[["z x c"],["1 2 3"],[" "],["a"]]
I suspect I am going about this all wrong.
Instead of using newline, you could define a custom parser that captures your notion of the end of a line: it parses at least one newline, and then optionally many empty lines (i.e. whitespace followed by another newline). You will need the try combinator to backtrack if the whitespace is not followed by another newline (or the end of input, I guess):
Code:
-- parse lines
p :: Parser [[String]]
p = lineP `endBy` lineEnd <* eof
  where lineP = wordP `sepBy` (char ' ')
        wordP = many $ noneOf " \n"
lineEnd :: Parser ()
lineEnd = do
  newline
  many (try (many (oneOf " \t") >> newline))
  return ()
Output:
*Main> run p "z x c\n1 2 3\n \na\n"
[["z","x","c"],["1","2","3"],["a"]]
One approach might be to think of a file as a series of lines that are either blank or non-blank. The following expresses this idea with the expression line <|> emptyLine. It uses the Maybe datatype to distinguish the result of parsing a non-blank line from that of a blank one, and catMaybes to filter out the Nothings at the end.
#!/usr/bin/env stack
{- stack
--resolver lts-7.0
--install-ghc
runghc
--package parsec
-}
import Prelude hiding (lines)
import Data.Maybe (catMaybes)
import Text.ParserCombinators.Parsec
-- parse lines
p :: Parser [[String]]
p = catMaybes <$> lines
  where lines = (line <|> emptyLine) `endBy` newline <* eof
        line = Just <$> word `sepBy1` spaces1
        emptyLine = spaces1 >> pure Nothing
        word = many1 $ noneOf ['\n', ' ']
        spaces1 = skipMany1 (char ' ')
main = parseTest p "z x c\n1 2 3\n \na\n"
Output is:
[["z","x","c"],["1","2","3"],["a"]]
Another approach might be to use Prelude functions along with Data.Char.isSpace to collect the non-blank lines before you get started:
#!/usr/bin/env stack
{- stack
--resolver lts-7.0
--install-ghc
runghc
--package parsec
-}
import Data.Char
import Text.ParserCombinators.Parsec
p :: Parser [[String]]
p = line `endBy` newline <* eof where
  line = word `sepBy1` spaces1
  word = many1 $ noneOf ['\n', ' ']
  spaces1 = skipMany1 (char ' ')
main = parseTest p (unlines nonBlankLines)
  where input = "z x c\n1 2 3\n \na\n"
        nonBlankLines = filter (not . all isSpace) $ lines input
Output is:
[["z","x","c"],["1","2","3"],["a"]]
This is pretty simple and has the additional benefit that, because lines and unlines normalise the input, the last line does not need to end with a newline (this helps with portability).
Note, there was a small bug in your wordP parser: it did not exclude the space character, so each line came back as a single word. Also note that, as specified, these parsers do not cope with leading or trailing spaces (on non-blank lines). I imagine that your non-minimal code is more resilient.
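For what it's worth, a variant that does tolerate leading and trailing spaces could look roughly like this. It is a sketch, assuming that only the space character (not tabs) separates words; hspaces, lineWords and nonBlankLines are names invented for the example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Zero or more plain spaces (tabs are deliberately not handled here).
hspaces :: Parser ()
hspaces = skipMany (char ' ')

word :: Parser String
word = many1 (noneOf " \n")

-- One line without its newline: optional leading spaces, then words that are
-- each followed by optional spaces. A blank line yields the empty list.
lineWords :: Parser [String]
lineWords = hspaces *> many (word <* hspaces)

-- Parse every line, then drop the ones that contained no words.
nonBlankLines :: Parser [[String]]
nonBlankLines = filter (not . null) <$> (lineWords `sepEndBy` newline) <* eof
```

Because blank lines simply parse to an empty list, there is no need for try or for a separate emptyLine parser; the filtering happens after parsing.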
I am trying to distinguish between ints and floats in a parser. I have two parsers, one for ints and one for floats. However, I am having trouble getting the int parser to fail on a '.'. I looked into negation and lookahead, but that didn't bear any fruit.
I hope I am not duplicating any questions.
I had it working by checking that the next character is not a '.', but that is an ugly solution.
EDIT: Added more code.
--Int--------------------------------------------------------------------
findInt :: Parser String
findInt = plus <|> minus <|> number
number :: Parser String
number = many1 digit
plus :: Parser String
plus = char '+' *> number
minus :: Parser String
minus = char '-' <:> number
makeInt :: Parser Int
makeInt = prepareResult (findInt <* many (noneOf ".") <* endOfLine)
  where readInt = read :: String -> Int
        prepareResult = liftA readInt
makeInt2 :: Parser Int
makeInt2 = do
  numberFound <- (findInt <* many (noneOf ".") <* endOfLine)
  match <- char '.'
  return (prepareResult numberFound)
  where readInt = read :: String -> Int
        prepareResult = readInt
--End Int----------------------------------------------------------------
I think you are best off actually combining the two parsers into one. Try something like this:
import Text.Parsec.String (Parser)
import Control.Applicative ((<|>))
import Text.Parsec.Char (char,digit)
import Text.Parsec.Combinator (many1,optionMaybe)
makeIntOrFloat :: Parser (Either Int Float)
makeIntOrFloat = do
  sign <- optionMaybe (char '-' <|> char '+')
  n <- many1 digit
  m <- optionMaybe (char '.' *> many1 digit)
  return $ case (m,sign) of
    (Nothing, Just '-') -> Left (negate (read n))
    (Nothing, _) -> Left (read n)
    (Just m, Just '-') -> Right (negate (read n + read m / 10.0^(length m)))
    (Just m, _) -> Right (read n + read m / 10.0^(length m))
ErikR has a correct solution, but the use of try means that parsec has to keep track of the possibility of backtracking (which is a bit inefficient) when in fact that is unnecessary in this case.
Here, the key difference is that we can actually tell right away if we have a float or not - if we don't have a float, the char '.' *> many1 digit parser in optionMaybe will fail immediately (without consuming input), so there is no need to consider backtracking.
At GHCi
ghci> import Text.Parsec.Prim
ghci> parseTest makeIntOrFloat "1234.012"
Right 1234.012
ghci> parseTest makeIntOrFloat "1234"
Left 1234
I would use notFollowedBy - e.g.:
import Text.Parsec
import Text.Parsec.String
import Text.Parsec.Combinator
int :: Parser String
int = many1 digit <* notFollowedBy (char '.')
float :: Parser (String,String)
float = do whole <- many1 digit
           fracpart <- try (char '.' *> many digit) <|> (return "")
           return (whole, fracpart)
intOrFloat :: Parser (Either String (String,String))
intOrFloat = try (fmap Left int) <|> (fmap Right float)
test1 = parseTest (intOrFloat <* eof) "123"
test2 = parseTest (intOrFloat <* eof) "123.456"
test3 = parseTest (intOrFloat <* eof) "123."
It is typically easiest to use applicative combinators to build your parsers - this makes your parsers easier to reason about and often you do not need monadic and backtracking functions of the parser.
For example, a parser for integers could be written as such:
import Text.Parsec hiding ((<|>), optional)
import Text.Parsec.String
import Numeric.Natural
import Control.Applicative
import Data.Foldable
natural :: Parser Natural
natural = read <$> many1 digit
sign :: Num a => Parser (a -> a)
sign = asum [ id <$ char '+'
            , negate <$ char '-'
            , pure id
            ]
integer :: Parser Integer
integer = sign <*> (fromIntegral <$> natural)
A decimal number is an integer optionally followed by a decimal portion (a '.' followed by another integer), which is itself a number proper, so your parser can be written as
decimalPart :: Parser Double
decimalPart = read . ("0."++) <$> (char '.' *> many1 digit)
integerOrDecimal :: Parser (Either Integer Double)
integerOrDecimal = liftA2 cmb integer (optional decimalPart) where
  cmb :: Integer -> Maybe Double -> Either Integer Double
  cmb x Nothing = Left x
  cmb x (Just d) = Right (fromIntegral x + d)
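The definition of cmb is straightforward: if there is no decimal part, it produces an Integer, and if there is one, it produces a Double by adding the integer part to the decimal part. Note that this sum is only right for non-negative input: for "-1.5" the integer part is -1 and the decimal part is 0.5, which add up to -0.5.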
You can also define a parser for decimals in terms of the above:
decimal :: Parser Double
decimal = either fromIntegral id <$> integerOrDecimal
Note that none of the above parsers directly use monadic functions (i.e. >>=) or backtracking - making them simple and efficient.
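One caveat with adding the integer and decimal parts: a leading minus sign only reaches the integer part, so "-1.5" would come out as -0.5. A possible way around this, sketched below with made-up names (negative, applySign, signedNumber), is to parse the sign separately and apply it to the combined value. It stays applicative and needs no try.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- True when the number carries a leading minus sign; a plus or no sign is False.
negative :: Parser Bool
negative = (True <$ char '-') <|> (False <$ char '+') <|> pure False

applySign :: Num a => Bool -> a -> a
applySign neg x = if neg then negate x else x

-- The sign is applied after the whole and fractional digits are combined.
signedNumber :: Parser (Either Integer Double)
signedNumber = build <$> negative <*> many1 digit <*> optionMaybe (char '.' *> many1 digit)
  where
    build neg whole Nothing     = Left  (applySign neg (read whole))
    build neg whole (Just frac) = Right (applySign neg (read (whole ++ "." ++ frac)))
```

For example, parse signedNumber "" "-1.5" gives Right (Right (-1.5)) and parse signedNumber "" "42" gives Right (Left 42).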
I am trying to parse the file [1], but the code below is not working properly.
import Control.Applicative hiding ( many , ( <|> ) )
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as T
import Text.Parsec.Language (emptyDef)
--Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Repurchase Price;Sale Price;Date
--120523;INF846K01ET8;INF846K01EU6;Axis Triple Advantage Fund - Direct Plan - Dividend Option;13.3660;13.2323;13.3660;19-Jun-2015
data MFund = MFund Integer String String String Double Double Double String deriving (Show) --String String String deriving (Show)
lexer :: T.TokenParser st
lexer = T.makeTokenParser emptyDef
natural :: Parser Integer
natural = T.natural lexer
float :: Parser Double
float = T.float lexer
eol :: Parser String
eol = try (string "\r\n")
  <|> try (string "\n")
  <|> try (string "\r")
parseFund :: Parsec String () MFund
parseFund = MFund <$> natural
  <*> (char ';' *> (many1 alphaNum <|> string "-"))
  <*> (char ';' *> (many1 alphaNum <|> string "-"))
  <*> (char ';' *> manyTill anyChar (char ';'))
  <*> float
  <*> (char ';' *> float)
  <*> (char ';' *> float)
  <*> (char ';' *> many1 (alphaNum <|> char '-'))
parseBlockFund :: Parsec String () [MFund]
parseBlockFund = manyTill anyChar eol *> space *> eol *> endBy parseFund eol
parseMutual :: Parsec String () [MFund]
parseMutual = concat <$> (manyTill anyChar eol *> space *> eol *>
  sepBy parseBlockFund (space *> eol))
{- each scheme is separated by space and end of line -}
parseSchemeBlock :: Parsec String () [MFund]
parseSchemeBlock = concat <$> (sepBy parseMutual (space *> eol))
parseFile :: Parsec String () [MFund]
parseFile =
  string "Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Repurchase Price;Sale Price;Date" *>
  eol *> space *> eol *> parseSchemeBlock <* eof
main :: IO ()
main = do
  input <- readFile "data.txt"
  print input
  case parse parseFile "" input of
    Left err -> print err
    Right val -> print val
I start parsing the file with the parseFile function, which consumes the first line, an end of line, a space and another end of line. Then I parse the scheme blocks, which are separated by a space and an end of line. Within each scheme block there are many mutual fund blocks, which are also separated by a space and an end of line, so I first consume the string up to the end of line, followed by a space and an end of line, and then call parseBlockFund.
The problem, I suspect, is that when parseMutual reaches the end of a scheme block, it eats the space and end of line that are supposed to be consumed by parseSchemeBlock. I tried the try function but it did not help, so I am not sure whether I am thinking about this correctly. Could someone please tell me what is wrong with this code?
One interesting thing: when I remove the eof, this code is able to parse at least one scheme block [2].
[1] http://portal.amfiindia.com/spages/NAV0.txt
[2] https://github.com/mukeshtiwari/Puzzles/blob/master/Assigment/Api.hs
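If the block structure is not needed in the result, one way to sidestep the overlapping separators is to split the input into lines first and keep only the lines that parse as a fund record. The following is a rough sketch of that idea, not a fix for the code above: Fund is a hypothetical, trimmed-down record, and it assumes every record has eight ';'-separated fields with a numeric net asset value (records with a non-numeric value are silently dropped).

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Either (rights)

-- Hypothetical, trimmed-down record: scheme code, scheme name, net asset value.
data Fund = Fund Integer String Double deriving (Show, Eq)

-- Any run of characters up to the next ';' or line ending.
field :: Parser String
field = many (noneOf ";\r\n")

number :: Parser Double
number = do
  whole <- many1 digit
  frac  <- char '.' *> many1 digit
  return (read (whole ++ "." ++ frac))

fundLine :: Parser Fund
fundLine = do
  code <- read <$> many1 digit
  _    <- count 2 (char ';' *> field)   -- the two ISIN columns
  name <- char ';' *> field
  nav  <- char ';' *> number
  _    <- count 3 (char ';' *> field)   -- repurchase price, sale price, date
  optional (char '\r')                  -- tolerate Windows line endings
  eof
  return (Fund code name nav)

-- Headers, blank lines and scheme titles fail to parse and are skipped.
funds :: String -> [Fund]
funds = rights . map (parse fundLine "") . lines
```

Usage would be along the lines of funds <$> readFile "data.txt". The trade-off is that lines which fail to parse are dropped without any error message.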
I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).
I've made a lot of progress, but am currently stuck with the following.
I need to be able to parse a line somewhat like this:
4 0.123 1.452 0.667 * 3.460 149 - -
Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.
Here's what I have so far:
This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.
emAnnotationSet = (,,) <$> p_int <*>
                  (reqSpaces *> char '-') <*>
                  (reqSpaces *> char '-')
The nodeNum at the beginning of the line is handled by another parser that works, so I need not get into it.
The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.
The parser p_logProb looks like this:
p_logProb = liftA mkScore (lp <?> "logProb")
  where lp = try dub <|> string "*"
        dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit
And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:
hmmMatchEmissions = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces)
                    <* optSpaces <* emAnnotationSet <* eol
                    <?> "matchEmissions"
So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).
I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:
Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb
Column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?
The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme combinator converts a parser into one that skips trailing spaces.
import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )
fractional :: Fractional a => Parser a
fractional = try $ do
  n <- fromIntegral <$> decimal
  char '.'
  f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit
  return $ n + f
decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')
data Row = Row Int [Maybe Double]
  deriving ( Show )
probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')
junk = lexeme decimal <* count 2 (lexeme $ char '-')
row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk
rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof
Usage:
*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -"
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]
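An alternative that stays closer to the sepBy structure of the question: the likely culprit there is that sepBy commits to the separator. By the time p_logProb fails on the integer, reqSpaces has already consumed the spaces in front of it, and the try inside p_logProb only rewinds the digits. Wrapping separator and element in a single try avoids this. The sketch below uses simplified stand-ins (logProb returns the raw string rather than a score, and junk replaces emAnnotationSet), so treat the names as assumptions.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

reqSpaces :: Parser ()
reqSpaces = skipMany1 (char ' ')

-- Either digits '.' digits, or a single "*".
logProb :: Parser String
logProb = try dub <|> string "*"
  where
    dub = (\w f -> w ++ "." ++ f) <$> many1 digit <*> (char '.' *> many1 digit)

-- try covers the separator too, so a failed element also gives the spaces back.
logProbs :: Parser [String]
logProbs = (:) <$> logProb <*> many (try (reqSpaces *> logProb))

-- Stand-in for the trailing annotation: an integer followed by two dashes.
junk :: Parser ()
junk = () <$ (many1 digit *> reqSpaces *> char '-' *> reqSpaces *> char '-')

rowProbs :: Parser [String]
rowProbs = logProbs <* reqSpaces <* junk
```

On "0.123 1.452 0.667 * 3.460 149 - -" this yields ["0.123","1.452","0.667","*","3.460"], leaving the 149 for junk.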
I'm not exactly sure what your problem is. However, to parse the given line based on your description, it would be much easier to use the existing lexers defined in Text.Parsec.Token and join them together.
The code below parses the line into a Line data type; you can process it further from there if necessary. Instead of attempting to filter out the - and integers before parsing, it uses a parseEntry parser that returns Just the Double for a float value, Just 0 for *, and Nothing for integers and dashes. The results are then simply filtered using catMaybes.
Here is the code:
module Test where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)
import Control.Applicative ((<$>))
import Data.Maybe (catMaybes)
lexer = P.makeTokenParser haskellDef
parseFloat = P.float lexer
parseInteger = P.natural lexer
whiteSpace = P.whiteSpace lexer
parseEntry = try (Just <$> parseFloat)
         <|> try (const (Just 0) <$> (char '*' >> whiteSpace))
         <|> try (const Nothing <$> (char '-' >> whiteSpace))
         <|> (const Nothing <$> parseInteger)
data Line = Line {
    lineNodeNum :: Integer
  , negativeLogProbabilities :: [Double]
  } deriving (Show)
parseLine = do
  nodeNum <- parseInteger
  whiteSpace
  probabilities <- catMaybes <$> many1 parseEntry
  return $ Line { lineNodeNum = nodeNum, negativeLogProbabilities = probabilities }
Example usage:
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46]}
The only issue that may (or may not) be a problem is that it will parse -* as two separate tokens rather than fail. E.g.:
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -*"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46,0.0]}
Note the extra 0.0 at the end of the log probabilities.