Parsing file with parsec

I am trying to parse the file at [1], but the code below is not working properly.
import Control.Applicative hiding ( many , ( <|> ) )
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as T
import Text.Parsec.Language (emptyDef)
--Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Repurchase Price;Sale Price;Date
--120523;INF846K01ET8;INF846K01EU6;Axis Triple Advantage Fund - Direct Plan - Dividend Option;13.3660;13.2323;13.3660;19-Jun-2015
data MFund = MFund Integer String String String Double Double Double String deriving (Show) --String String String deriving (Show)
lexer :: T.TokenParser st
lexer = T.makeTokenParser emptyDef
natural :: Parser Integer
natural = T.natural lexer
float :: Parser Double
float = T.float lexer
eol :: Parser String
eol = try (string "\r\n")
  <|> try (string "\n")
  <|> try (string "\r")
parseFund :: Parsec String () MFund
parseFund = MFund <$> natural
  <*> (char ';' *> (many1 alphaNum <|> string "-"))
  <*> (char ';' *> (many1 alphaNum <|> string "-"))
  <*> (char ';' *> manyTill anyChar (char ';'))
  <*> float
  <*> (char ';' *> float)
  <*> (char ';' *> float)
  <*> (char ';' *> many1 (alphaNum <|> char '-'))
parseBlockFund :: Parsec String () [MFund]
parseBlockFund = manyTill anyChar eol *> space *> eol *> endBy parseFund eol
parseMutual :: Parsec String () [MFund]
parseMutual = concat <$> (manyTill anyChar eol *> space *> eol *>
                          sepBy parseBlockFund (space *> eol))
{- each scheme is separated by space and end of line -}
parseSchemeBlock :: Parsec String () [MFund]
parseSchemeBlock = concat <$> (sepBy parseMutual (space *> eol))
parseFile :: Parsec String () [MFund]
parseFile =
  string "Scheme Code;ISIN Div Payout/ ISIN Growth;ISIN Div Reinvestment;Scheme Name;Net Asset Value;Repurchase Price;Sale Price;Date" *>
  eol *> space *> eol *> parseSchemeBlock <* eof
main :: IO ()
main = do
  input <- readFile "data.txt"
  print input
  case parse parseFile "" input of
    Left err -> print err
    Right val -> print val
I start parsing in parseFile, which consumes the first line, an end of line, a space, and another end of line. Then I parse the scheme blocks, which are separated by a space and an end of line. Within each scheme block there are many mutual fund blocks, which are also separated by a space and an end of line, so I first consume the string up to the end of line, followed by a space and an end of line, and then call parseBlockFund.
The problem, I suspect, is that when parseMutual reaches the end of a scheme block, it eats the space and end of line that are supposed to be consumed by parseSchemeBlock. I tried using try, but it did not help, so I am not sure whether my reasoning is correct. Could someone please tell me what is wrong with this code?
Interestingly, when I remove the eof, this code is able to parse at least one scheme block [2].
[1] http://portal.amfiindia.com/spages/NAV0.txt
[2] https://github.com/mukeshtiwari/Puzzles/blob/master/Assigment/Api.hs
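The suspicion described above matches a general property of Parsec's sepBy family: once the separator has consumed input, a failure of the following item is a hard failure, and the separator is not handed back to an outer parser. The miniature below is only a sketch of that behaviour, not a fix for the fund parser itself. The names item, plain, sepBy1Try and careful, and the ",end" trailer, are made up for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical miniature of the problem: numbers separated by commas,
-- followed by a trailer that also starts with a comma.
item :: Parser String
item = many1 digit

-- Plain sepBy1 commits as soon as the separator has consumed input,
-- so the comma in front of "end" is lost and the parse fails.
plain :: Parser [String]
plain = (item `sepBy1` char ',') <* string ",end"

-- Backtracking variant: separator and item are tried as one unit,
-- so a separator that is not followed by an item is given back.
sepBy1Try :: Parser a -> Parser sep -> Parser [a]
sepBy1Try p sep = (:) <$> p <*> many (try (sep *> p))

careful :: Parser [String]
careful = (item `sepBy1Try` char ',') <* string ",end"
```

On the input "1,2,end", plain fails while careful returns the two numbers. Under this reading, wrapping only an inner parser in try would not help, because it is the separator's consumed input that needs to be undone.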

Related

Parse Text input and get Text output (not String) with Parsec3

I see that Parsec3 handles Text (not String) input, so I would like to convert an old String parser to get Text output. The other libraries I am using also use Text, so that would reduce the number of conversions needed.
Now, the parsec3 library seems to do what it says (handle both Text and String input); this example is from within ghci:
Text.Parsec.Text Text.Parsec Data.Text> parseTest (many1 $ char 's') (pack "sss")
"sss"
Text.Parsec.Text Text.Parsec Data.Text> parseTest (many1 $ char 's') "sss"
"sss"
So both Text (first case) and String (second case) work.
Now, in my real, converted parser (sorry, I have to piece together some remote parts of the code here to make a complete example):
{-# LANGUAGE OverloadedStrings #-}
data UmeQueryPart = MidQuery Text Text MatchType
data MatchType = Strict | Fuzzy deriving Show
funcMT :: Text -> MatchType
funcMT mt = case mt of
  "~" -> Fuzzy
  _ -> Strict
midOfQuery :: Parser UmeQueryPart
midOfQuery = do
  spaces
  string "MidOf"
  spaces
  char '('
  spaces
  clabeltype <- many1 alphaNum
  spaces
  sep <- try( char ',') <|> char '~'
  spaces
  plabeltype <- many1 alphaNum
  spaces
  char ')'
  spaces
  return $ MidQuery (pack plabeltype) (pack clabeltype) (funcMT sep)
I find myself with a lot of errors like this with regards to the funcMT call
UmeQueryParser.hs:456:96:
Couldn't match type ‘[Char]’ with ‘Text’
Expected type: Text
Actual type: String
In the first argument of ‘funcMT’, namely ‘sep’
In the fifth argument of ‘ midOfQuery’, namely ‘(funcMT sep)’
and if I don't explicitly pack the captured text in the code sample above, this:
UmeQueryParser.hs:288:26:
Couldn't match expected type ‘Text’ with actual type ‘[Char]’
In the first argument of ‘ midOfQuery’, namely ‘(plabeltype)’
In the second argument of ‘($)’, namely
‘StartQuery (plabeltype) (clabeltype) (funcMT sep)’
So it seems that I need to convert captured strings explicitly to Text in the output.
Why do I need to go through a step converting from String or Char to Text when the point was to do Text -> Text parsing?

You could just make your own Text parser, something simple like
midOfQuery :: Parser UmeQueryPart
midOfQuery = do
  spaces
  lexeme $ string "MidOf"
  lexeme $ char '('
  clabeltype <- lexeme alphaNums
  sep <- lexeme $ try (char ',') <|> char '~'
  plabeltype <- lexeme alphaNums
  lexeme $ char ')'
  -- sep is a Char, so wrap it in a Text for funcMT (singleton is from Data.Text)
  return $ MidQuery plabeltype clabeltype (funcMT (singleton sep))
  where
    alphaNums = pack <$> many1 alphaNum
    lexeme p = p <* spaces
or, slightly more compact (but I think still more readable):
midOfQuery :: Parser UmeQueryPart
midOfQuery = spaces *> lexeme (string "MidOf") *> parens (toQuery <$> lexeme alphaNums <*> lexeme matchType <*> lexeme alphaNums)
  where
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces
    alphaNums = pack <$> many1 alphaNum
    parens = between (lexeme $ char '(') (lexeme $ char ')')
    matchType = Fuzzy <$ char '~' <|>
                Strict <$ char ','
    toQuery cLabelType sep pLabelType = MidQuery pLabelType cLabelType sep
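As for why the conversion is needed at all: Parsec's combinators are polymorphic in the input stream, but many1 always builds a list and string always returns a String, so Text input does not by itself give Text output. The following is a self-contained sketch of the compact version above, assuming the parsec and text packages and the Parser type from Text.Parsec.Text. The sample input in example is made up.

```haskell
import Data.Text (Text, pack)
import Text.Parsec
import Text.Parsec.Text (Parser)

data MatchType = Strict | Fuzzy deriving (Show, Eq)
data UmeQueryPart = MidQuery Text Text MatchType deriving (Show, Eq)

-- Run a parser, then skip any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- many1 yields a String even on Text input, so pack it once, here.
alphaNums :: Parser Text
alphaNums = pack <$> many1 alphaNum

-- Parse the separator straight into a MatchType instead of a Char.
matchType :: Parser MatchType
matchType = Fuzzy <$ char '~' <|> Strict <$ char ','

midOfQuery :: Parser UmeQueryPart
midOfQuery =
  spaces *> lexeme (string "MidOf") *>
  between (lexeme (char '(')) (lexeme (char ')'))
    (toQuery <$> lexeme alphaNums <*> lexeme matchType <*> lexeme alphaNums)
  where
    toQuery c m p = MidQuery p c m

-- Hypothetical sample input, only for illustration.
example :: Either ParseError UmeQueryPart
example = parse midOfQuery "" (pack "MidOf ( syl ~ word )")
```

Keeping pack inside one small helper means the rest of the parser only ever sees Text values.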

Parser for JSON String

I'm trying to write a parser for a JSON String.
A valid example, per my parser, would be: "\"foobar\"" or "\"foo\"bar\"".
Here's what I attempted, but it does not terminate:
parseEscapedQuotes :: Parser String
parseEscapedQuotes = Parser f
  where
    f ('"':xs) = Just ("\"", xs)
    f _ = Nothing
parseStringJValue :: Parser JValue
parseStringJValue = (\x -> S (concat x)) <$>
  ((char '"') *>
   (zeroOrMore (alt parseEscapedQuotes (oneOrMore (notChar '"'))))
   <* (char '"'))
My reasoning is that, I can have a repetition of either escaped quotes "\"" or characters not equal to ".
But it's not working as I expected:
ghci> runParser parseStringJValue "\"foobar\""
Nothing

I don't know what parser combinator library you are using, but here is a working example using Parsec. I'm using monadic style to make it clearer what's going on, but it is easily translated to applicative style.
import Text.Parsec
import Text.Parsec.String
jchar :: Parser Char
jchar = escaped <|> anyChar
escaped :: Parser Char
escaped = do
  char '\\'
  c <- oneOf ['"', '\\', 'r', 't' ] -- etc.
  return $ case c of
    'r' -> '\r'
    't' -> '\t'
    _ -> c
jstringLiteral :: Parser String
jstringLiteral = do
  char '"'
  cs <- manyTill jchar (char '"')
  return cs
test1 = parse jstringLiteral "" "\"This is a test\""
test2 = parse jstringLiteral "" "\"This is an embedded quote: \\\" after quote\""
test3 = parse jstringLiteral "" "\"Embedded return: \\r\""
Note the extra level of backslashes needed to represent parser input as Haskell string literals. Reading the input from a file would make creating the parser input more convenient.
The definition of the manyTill combinator is:
manyTill p end = scan
  where
    scan = do{ end; return [] }
         <|>
           do{ x <- p; xs <- scan; return (x:xs) }
and this might help you figure out why your definitions aren't working.
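For comparison, here is a sketch of the same Parsec parser in applicative style. It keeps the names used above; the helper unescape is an addition for illustration, and the set of escapes is still deliberately incomplete.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A backslash followed by one of a few known escape characters.
escaped :: Parser Char
escaped = char '\\' *> (unescape <$> oneOf "\"\\rt")
  where
    unescape 'r' = '\r'
    unescape 't' = '\t'
    unescape c   = c

-- Either an escape sequence or any single character.
jchar :: Parser Char
jchar = escaped <|> anyChar

-- Opening quote, then characters until an unescaped closing quote.
jstringLiteral :: Parser String
jstringLiteral = char '"' *> manyTill jchar (char '"')
```

Because manyTill tries the closing quote before each character, and an escaped quote always starts with a backslash, the loop only stops at an unescaped quote.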

Escaping end of line with Parsec

This time I'm trying to parse a text file into [[String]] using Parsec. Result is a list consisting of lists that represent lines of the file. Every line is a list that contains words which may be separated by any number of spaces, (optionally) commas, and spaces after commas as well.
Here is my code and it even works.
import Text.ParserCombinators.Parsec hiding (spaces)
import Control.Applicative ((<$>))
import System.IO
import System.Environment
myParser :: Parser [[String]]
myParser =
  do x <- sepBy parseColl eol
     eof
     return x
eol :: Parser String
eol = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"
spaces :: Parser ()
spaces = skipMany (char ' ') >> return ()
parseColl :: Parser [String]
parseColl = many parseItem
parseItem :: Parser String
parseItem =
  do optional spaces
     x <- many1 (noneOf " ,\n\r")
     optional spaces
     optional (char ',')
     return x
parseText :: String -> String
parseText str =
  case parse myParser "" str of
    Left e -> "parser error: " ++ show e
    Right x -> show x
main :: IO ()
main =
  do fileName <- head <$> getArgs
     handle <- openFile fileName ReadMode
     contents <- hGetContents handle
     putStr $ parseText contents
     hClose handle
Test file:
this is my test file
this, line, is, separated, by, commas
and this is another, line
Result:
[["this","is","my","test","file"],
["this","line","is","separated","by","commas"],
["and","this","is","another","line"],
[]] -- well, this is a bit unexpected, but I can filter things
Now, to make my life harder, I wish to be able to 'escape' an eol if there is a comma before it, even if the comma is followed by spaces. So this should be considered one line:
this is, spaces may be here
my line
What is the best strategy (most idiomatic and elegant) to implement this syntax, without losing the ability to ignore commas inside a line?

A couple of solutions come to mind.... One is easy, the other is medium difficulty.
The medium-difficulty solution is to define an itemSeparator to be a comma followed by whitespace, and a lineSeparator to be a '\n' or '\r' followed by whitespace.... Make sure to skip non '\n', '\r'-whitespace, but no further, at the end of the item parse, so that the very next char after an item must be either a '\n', '\r', or ',', which determines, without backtracking, whether a new item or line is coming.
Then use sepBy1 to define parseLine (i.e. parseLine = parseItem `sepBy1` parseItemSeparator), and endBy to define parseFile (i.e. parseFile = parseLine `endBy` parseLineSeparator).
You really do need that sepBy1 on the inside, vs sepBy, else you will have a list of zero sized items, which causes an infinite loop at parse time. endBy works like sepBy, but allows extra '\n', '\r' at the end of the file....
An easier way would be to canonicalize the input by running it through a simple transformation before parsing. You can write a function to remove whitespace after a comma (using dropWhile and isSpace), and perhaps even simplify the different cases of '\n', '\r'.... then run the output through a simplified parser.
Something like this would do the trick (this is untested....)
import Data.Char (isSpace)
canonicalize :: String -> String
canonicalize [] = []
canonicalize (',':rest) = ',':canonicalize (dropWhile isSpace rest)
canonicalize ('\n':rest) = '\n':canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n':canonicalize (dropWhile isSpace rest) --all '\r' will become '\n'
canonicalize (c:rest) = c:canonicalize rest
Because Haskell is lazy, this transformation will work on streaming data as the data comes in, so this really won't slow anything down at all (depending on how much you simplify the parser, it could even speed things up.... Although most likely it will be close to a wash)
I don't know how complicated the full question is, but perhaps a few rules added to a canonicalization function will in fact allow you to use lines and words after all....
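As a rough illustration of that last idea, the sketch below canonicalizes the input and then splits it with lines and words. The function parseLines and its helper commaToSpace are hypothetical names, and the sketch assumes a comma inside a line should behave exactly like a space.

```haskell
import Data.Char (isSpace)

-- Drop all whitespace (including line breaks) after a comma, and
-- collapse every line break plus following whitespace into one '\n'.
canonicalize :: String -> String
canonicalize [] = []
canonicalize (',':rest)  = ',' : canonicalize (dropWhile isSpace rest)
canonicalize ('\n':rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize ('\r':rest) = '\n' : canonicalize (dropWhile isSpace rest)
canonicalize (c:rest)    = c : canonicalize rest

-- After canonicalization a comma can no longer end a line, so the
-- remaining commas are treated as ordinary word separators.
parseLines :: String -> [[String]]
parseLines = map (words . map commaToSpace) . lines . canonicalize
  where
    commaToSpace c = if c == ',' then ' ' else c
```

A side effect of dropping whitespace after each line break is that blank lines disappear, which also removes the trailing empty list from the original result.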

First, use optional spaces in parseColl, like this:
parseColl :: Parser [String]
parseColl = optional spaces >> many parseItem
parseItem :: Parser String
parseItem =
  do
    x <- many1 (noneOf " ,\n\r")
    optional spaces
    optional (char ',')
    return x
Second, separate the separator from the item:
parseColl :: Parser [String]
parseColl = do
  optional spaces
  items <- parseItem `sepBy` parseSeparator
  optional spaces
  return items
parseItem :: Parser String
parseItem = many1 $ noneOf " ,\n\r"
parseSeparator = try (optional spaces >> char ',' >> optional spaces) <|> spaces
Third, redefine eol and spaces slightly:
eol :: Parser String
eol = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <|> (eof >> return "") -- eof returns (), so map it to an empty String
  <?> "end of line"
spaces :: Parser ()
spaces = skipMany1 $ char ' '
parseColl :: Parser [String]
parseColl = do
  optional spaces
  items <- parseItem `sepBy` parseSeparator
  optional spaces
  eol
  return items
Finally, let's rewrite myParser:
myParser = manyTill parseColl eof -- parseColl can succeed at eof without consuming input, which many rejects
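The snippets above do not yet treat a comma at the end of a line as a continuation. One way to get that behaviour directly in Parsec is sketched below, under the assumption that a comma only ever follows a word. The names blanks, word, item, line and file are chosen for this sketch and are not from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Horizontal blanks only; line breaks are handled separately.
blanks :: Parser ()
blanks = skipMany (char ' ')

eol :: Parser String
eol = try (string "\r\n") <|> string "\n" <|> string "\r" <?> "end of line"

word :: Parser String
word = many1 (noneOf " ,\n\r")

-- A word, trailing blanks, and an optional comma. After a comma,
-- blanks AND line breaks are skipped, which is what joins the lines.
item :: Parser String
item = word <* blanks <* optional (char ',' *> skipMany (oneOf " \n\r"))

line :: Parser [String]
line = blanks *> many item

-- Empty lines (including the one after the final eol) are dropped.
file :: Parser [[String]]
file = filter (not . null) <$> (line `sepEndBy` eol) <* eof
```

Because item decides what follows a comma, the line-level parser never has to backtrack over a line break.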

How do I sepBy ambiguous parse with Parsec?

I am trying to separate a string using a delimiter consisting of multiple characters, but the problem is that each of those characters can occur by itself in non-delimiting string. For example, I have foo*X*bar*X*baz, where the delimiter is *X*, so I want to get [foo, bar, baz], but each one of those can contain * or X.
I have tried
sepBy (many anyChar) delimiter
but that just swallows the whole string, giving "foo*X*bar*X*baz". If I do
sepBy anyChar (optional delimiter)
it filters out the delimiters correctly, but doesn't partition the list, returning "foobarbaz". I don't know which other combination I could try.

Perhaps you want something like this,
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
The anyToken prevents us from looping forever at the end of input, the try lets us avoid being over-eager in consuming separator characters.
Full code for a test,
module ParsecTest where
import Control.Applicative ((<$), (<$>), (<*>))
import Data.List (intercalate)
import Text.Parsec
import Text.Parsec.String
sep,msg :: String
sep = "*X*"
msg = intercalate "*X*" ["foXo", "ba*Xr", "bX*az"]
tok :: Parser String
tok = (:) <$> anyToken <*> manyTill anyChar (try (() <$ string sep) <|> eof)
toks :: Parser [String]
toks = many tok
test :: Either ParseError [String]
test = runP toks () "" msg
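An alternative that keeps the sepBy shape from the question is to stop each field just before the delimiter with notFollowedBy. This is a sketch rather than a replacement for the code above; the names field and fields are made up for it.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

sep :: String
sep = "*X*"

-- Take characters one at a time, but only while the full delimiter
-- does not start at the current position.
field :: Parser String
field = many (notFollowedBy (string sep) *> anyChar)

-- try makes a partial delimiter match (such as "*Xr") backtrack cleanly.
fields :: Parser [String]
fields = field `sepBy` try (string sep)
```

Unlike the anyToken version, a field here may be empty, so two adjacent delimiters produce an empty string between them.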

Parsec and sequence of commaSep input

I took the example below partially from SO and changed it to my needs. It almost fits, but I want the first string in the commaSep expr always to be parsed as an identifier, while all subsequent strings should be plain strings.
Currently they are all parsed as Identifiers.
*Parser> parse expr "" "rd (isFib, test2, 100.1, ?BOOL)"
Right (FuncCall "rd" [Identifier "isFib",Identifier "test2",Number 100.1,Query "?BOOL"])
I have tried a number of solutions that in the end all boil down to parsing the whole input without using commaSep. That means I would have to ignore the structure and do something like
expr_parse = do
  name <- resvd_cmd
  char '('
  skipMany space
  worker <- ident
  char ','
  skipMany1 space
  args <- commaSep expr --not fully worked this out yet
  query <- theQuery
  skipMany space
  char ')'
  return (name, worker, args, query)
That looks suboptimal and very clunky to me. Is there any way to refactor expr in the code below to achieve what I need and keep it simple?
module Parser where
import Control.Monad (liftM)
import Text.Parsec
import Text.Parsec.String (Parser)
import Lexer
import AST
expr = ident <|> astring <|> number <|> theQuery <|> callOrIdent
astring = liftM String stringLiteral <?> "String"
number = liftM Number float <?> "Number"
ident = liftM Identifier identifier <?> "WorkerName"
questionm :: Parser Char
questionm = oneOf "?"
theQuery :: Parser AST
theQuery = do first <- questionm
              rest <- many1 letter
              let query = first:rest
              return ( Query query )
resvd_cmd = do { reserved "rd"; return ("rd") }
        <|> do { reserved "eval"; return ("eval") }
        <|> do { reserved "read"; return ("read") }
        <|> do { reserved "in"; return ("in") }
        <|> do { reserved "out"; return ("out") }
        <?> "LINDA-like Tuple"
callOrIdent = do
  name <- resvd_cmd
  liftM (FuncCall name)(parens $ commaSep expr) <|> return (Identifier name)
AST.hs
{-# LANGUAGE DeriveDataTypeable #-}
module AST where
import Data.Typeable
data AST
  = Number Double
  | Identifier String
  | String String
  | FuncCall String [AST]
  | Query String
  deriving (Show, Eq, Typeable)
Lexer.hs
module Lexer (
identifier, reserved, operator, reservedOp, charLiteral, stringLiteral,
natural, integer, float, naturalOrFloat, decimal, hexadecimal, octal,
symbol, lexeme, whiteSpace, parens, braces, angles, brackets, semi,
comma, colon, dot, semiSep, semiSep1, commaSep, commaSep1
)where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellStyle)
lexer = P.makeTokenParser ( haskellStyle
    {P.reservedNames = ["rd", "in", "out", "eval", "take"]}
  )
identifier = P.identifier lexer
reserved = P.reserved lexer
operator = P.operator lexer
reservedOp = P.reservedOp lexer
charLiteral = P.charLiteral lexer
stringLiteral = P.stringLiteral lexer
natural = P.natural lexer
integer = P.integer lexer
float = P.float lexer
naturalOrFloat = P.naturalOrFloat lexer
decimal = P.decimal lexer
hexadecimal = P.hexadecimal lexer
octal = P.octal lexer
symbol = P.symbol lexer
lexeme = P.lexeme lexer
whiteSpace = P.whiteSpace lexer
parens = P.parens lexer
braces = P.braces lexer
angles = P.angles lexer
brackets = P.brackets lexer
semi = P.semi lexer
comma = P.comma lexer
colon = P.colon lexer
dot = P.dot lexer
semiSep = P.semiSep lexer
semiSep1 = P.semiSep1 lexer
commaSep = P.commaSep lexer
commaSep1 = P.commaSep1 lexer

First, I'd like to introduce you to the function lexeme which alters a parser to eat trailing whitespace. You're encouraged to use it rather than explicitly eating the whitespace. The difficulty is with commaSep because it eats the , and then fails. It would be nice to write a less optimistic commaSep, but let's solve your problem directly.
Let's apply lexeme to comma
acomma = lexeme comma
One of the problems with your code was you were expecting it to see test2 as String "test2" but the astring parser expects its strings to begin and end with ". Let's make a parser for bald strings, but make sure they don't start with ? and don't contain spaces or commas:
baldString = lexeme $ do
  x <- noneOf "? ,)"
  xs <- many (noneOf " ,)") -- problematic - see comment below
  return . String $ x:xs
The breakthrough came when I realised that because there has to be a query at the end, there was always a comma after a baldString:
baldStringComma = do
  s <- baldString
  acomma
  return s
Now let's make a parser for one or more queries at the end of the tuple:
queries = commaSep1 (lexeme theQuery)
And now we can take the identifier, the baldStrings and the queries
therest = do
  name <- lexeme ident
  acomma
  args <- many baldStringComma
  qs <- queries
  return (name,args,qs)
finally giving
tuple = do
  name <- lexeme resvd_cmd
  stuff <- parens therest
  return (name,stuff)
So you get
*Parser> parseTest tuple "rd (isFib, test2, 100.1, ?BOOL)"
("rd",(Identifier "isFib",[String "test2",String "100.1"],[Query "?BOOL"]))
But if you want to lump the strings with the queries, you can return (name,args++qs) at the end of therest.
Applicative is Less Ugly
I found it frustrating to be tied to the Monad interface, when there are lovely things like <$>, <*> etc, so first
import Control.Applicative hiding (many, (<|>))
Then
baldString = lexeme . fmap String $
  (:) <$> noneOf "? ,)"
      <*> many (noneOf " ,)") -- problematic - see comment below
Here <$> is an infix version of fmap, so (:) will be applied to the output of noneOf "? ,)", giving a parser that returns something like ('c':). This can then be applied to the output of many (noneOf " ,)") using <*> to give the string we want.
baldStringComma = baldString <* acomma
This one's nice because <* runs acomma but ignores its output and just returns the output of baldString. If we wanted it the other way round, we could use *>, but you may as well use >> for that, which already ignores the output of the first parser.
therest = (,,) <$>
  lexeme ident <* acomma
  <*> many baldStringComma
  <*> queries
and
tuple = (,) <$> lexeme resvd_cmd
            <*> parens therest
But wouldn't it be nicer if we did
data Tuple = Tuple {cmd :: String,
                    id :: AST,
                    argumentList :: [AST],
                    queryList :: [AST]} deriving Show
so we could do
niceTuple = Tuple <$> lexeme resvd_cmd <* lexeme (char '(')
                  <*> lexeme ident <* acomma
                  <*> many baldStringComma
                  <*> queries <* lexeme (char ')')
which gives (with a little manual pretty-printing to get it into the width)
*Parser> parseTest niceTuple "rd (isFib, test2, 100.1, ?BOOL)"
Tuple {cmd = "rd",
       id = Identifier "isFib",
       argumentList = [String "test2",String "100.1"],
       queryList = [Query "?BOOL"]}
I also think your current AST is more of an abstract syntax store than an abstract syntax tree, and that you might get more mileage from designing your own Tuple type and using that. Use
newtype Command = Cmd String deriving Show
and suchlike to ensure type safety, then roll them together into your Tuple type with a parser to generate them.
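To make that suggestion concrete, here is a self-contained sketch that uses plain Parsec instead of the Lexer module. The types Worker and Query, the tuple* field names, and the parsers command, worker, bareArg and query are assumptions made for this example, not part of the question's code.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

newtype Command = Cmd String deriving (Show, Eq)
newtype Worker  = Worker String deriving (Show, Eq)
newtype Query   = Query String deriving (Show, Eq)

data Tuple = Tuple
  { tupleCmd     :: Command
  , tupleWorker  :: Worker
  , tupleArgs    :: [String]
  , tupleQueries :: [Query]
  } deriving (Show, Eq)

-- Run a parser, then skip trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

comma :: Parser Char
comma = lexeme (char ',')

-- One of the reserved command names, not followed by more letters.
command :: Parser Command
command = lexeme $
  Cmd <$> choice (map (try . string) ["rd", "eval", "read", "in", "out"])
      <* notFollowedBy alphaNum

worker :: Parser Worker
worker = lexeme $ Worker <$> ((:) <$> letter <*> many alphaNum)

-- An unquoted argument: must not start with '?', stops at space, ',' or ')'.
bareArg :: Parser String
bareArg = lexeme $ (:) <$> noneOf "? ,)" <*> many (noneOf " ,)")

query :: Parser Query
query = lexeme $ Query <$> ((:) <$> char '?' <*> many1 letter)

tuple :: Parser Tuple
tuple = Tuple
  <$> command <* lexeme (char '(')
  <*> worker <* comma
  <*> many (bareArg <* comma)
  <*> (query `sepBy1` comma) <* lexeme (char ')')
```

With distinct newtypes, passing a worker name where a query is expected becomes a compile-time error rather than a runtime surprise.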
