using parsec to pick data out of text file


As a learning exercise I'm using Parsec to look for values in a text file. I'd normally use a regexp for this particular case, but I want to see whether Parsec makes sense as well. Unfortunately, I'm running into some problems.
The data file consists of repeating sections that look similar to the following. The 'SHEF' is one of six values and changes from page to page, and I want to use it in constructing a data type.
Part A SHEF Nov/14/2011 (10:52)
-------------------
Portfolio Valuation
-------------------
FOREIGN COMMON STOCK 6,087,152.65
FOREIGN COMMON STOCK - USA 7,803,858.84
RIGHTS 0.00
I'm constructing a data type of the amounts in each asset class:
type Sector = String
type Amount = Double
type FundCode = String
data SectorAmount = SectorAmount (Sector,Amount) deriving (Show, Eq)
data FundSectors = FundSectors {
      fund :: FundCode
    , sectorAmounts :: [SectorAmount]
    } deriving (Show, Eq)
My code, which compiles successfully, is as shown below. It parses the file and correctly retrieves the values in each asset class, but I'm never able to set the state correctly in the fundValue parser. I've tested the fundValue parser with an input string and it does successfully parse it, but for some reason the line function isn't working the way I thought it would. I want it to look for lines in the file which start with "Part A", find the code and store it in state for later use when the tag parser successfully parses a line.
Is the use of fail causing the problem?
allocationParser :: String -> Either ParseError [FundSectors]
allocationParser input = do
    runParser allocationFile "" "" input
allocationFile :: GenParser Char FundCode [FundSectors]
allocationFile = do
    secAmt <- many line
    return secAmt
line :: GenParser Char FundCode FundSectors
line = try (do fund <- try fundValue
               eol
               fail "")
       <|> do result <- try tag
              eol
              f <- getState
              return $ FundSectors {fund=f, sectorAmounts = [result]}
fundValue :: GenParser Char FundCode FundCode
fundValue = do manyTill anyChar . try $ lookAhead (string "Part A ")
               string "Part A "
               fCode <- try fundCode
               setState fCode
               v <- many (noneOf "\n\r")
               eol
               return fCode
fundCode :: GenParser Char FundCode String
fundCode = try (string "SHSF")
       <|> try (string "SHIF")
       <|> try (string "SHFF")
       <|> try (string "SHEF")
       <|> try (string "SHGE")
       <|> try (string "SHSE")
       <|> fail "Couldn't match fundCode"
tag :: GenParser Char FundCode SectorAmount
tag = do manyTill anyChar . try $ lookAhead tagName
         name <- tagName
         v <- many (noneOf "\n\r")
         let value = read ([x | x <- v, x /= ',']) :: Double -- remove commas from currency
         return $ SectorAmount (name,value)
eol :: GenParser Char FundCode String
eol = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <|> fail "Couldn't find EOL"
Thanks in advance.

Yes, the fail in the "try fundValue" block undoes the setState.
You will need to slightly redesign the parser, but you seem close.
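One possible redesign, as a minimal sketch rather than the asker's actual code: let the header line succeed (returning Nothing) so that the state change is kept, and let every other line either produce a sector amount or be skipped. The names headerLine, sectorLine, otherLine, amount and parseAllocation are made up for this illustration, and the sketch assumes that a header line starts with "Part A ", that a sector line is a name followed by one space and an amount, that every line ends with a newline, and that the installed parsec provides endOfLine.

```haskell
import Data.Maybe (catMaybes)
import Text.Parsec

type FundCode = String
type Amount   = Double

data SectorAmount = SectorAmount (String, Amount) deriving (Show, Eq)
data FundSectors  = FundSectors
  { fund          :: FundCode
  , sectorAmounts :: [SectorAmount]
  } deriving (Show, Eq)

-- Parser over String input, with the current fund code as user state.
type P = Parsec String FundCode

-- "Part A SHEF Nov/14/2011 (10:52)": remember the code, produce no result.
-- The parser succeeds, so the state change is not rolled back.
headerLine :: P (Maybe FundSectors)
headerLine = do
  _    <- try (string "Part A ")
  code <- many1 upper
  putState code
  _    <- manyTill anyChar endOfLine
  return Nothing

-- An amount such as 7,803,858.84 (assumed format: digits, commas, decimals).
amount :: P Amount
amount = do
  whole <- (:) <$> digit <*> many (digit <|> char ',')
  frac  <- char '.' *> many1 digit
  return (read (filter (/= ',') whole ++ "." ++ frac))

-- "FOREIGN COMMON STOCK - USA 7,803,858.84": a name, a space, an amount.
sectorLine :: P (Maybe FundSectors)
sectorLine = try $ do
  name <- manyTill (noneOf "\r\n")
                   (lookAhead (try (char ' ' *> amount <* endOfLine)))
  amt  <- char ' ' *> amount <* endOfLine
  f    <- getState
  return (Just (FundSectors f [SectorAmount (name, amt)]))

-- Any other line (rulers, titles, blank lines) is skipped.
otherLine :: P (Maybe FundSectors)
otherLine = manyTill anyChar endOfLine >> return Nothing

allocation :: P [FundSectors]
allocation = catMaybes <$> many (headerLine <|> sectorLine <|> otherLine) <* eof

parseAllocation :: String -> Either ParseError [FundSectors]
parseAllocation = runParser allocation "" "(input)"
```

The point of the design is that no parser which changes the state is ever followed by fail, so there is nothing for try or <|> to undo.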

Related

Haskell - Parsec :: Parse spaces until string literal

I am currently trying to design a Parser in Haskell using Parsec.
The syntax for declaring type should look something like this:
Fruit is a Apple
Types should also be able have parameters:
Fruit a b is a Apple
Where
Fruit has type Name
a b has type [Parameter]
Apple has type Value
The problem here is that my parser currently does not know when to stop parsing the parameters and start parsing the value.
The code is as follows:
newtype Name = Name String deriving (Show)
newtype Parameter = Parameter String deriving (Show)
newtype Value = Value String deriving (Show)
data TypeAssignment = TypeAssignment Name [Parameter] Value deriving (Show)
-- first variant using `sepBy`
typeAssigment :: Parser TypeAssignment
typeAssigment =
  TypeAssignment
    <$> name
    <*> (space *> parameter `sepBy` space)
    <*> (string "is a" *> value)
-- second variant using manyTill
typeAssigment2 :: Parser TypeAssignment
typeAssigment2 =
  TypeAssignment
    <$> name
    <*> (space *> manyTill parameter (string "is a"))
    <*> value
name :: Parser Name
name = Name <$> word
parameter :: Parser Parameter
parameter = Parameter <$> word
value :: Parser Value
value = Value <$> word
word :: Parser String
word = (:) <$> letter <*> many (letter <|> digit)
I have tried parsing the parameters/value in the two ways I know how to (once with sepBy and once with manyTill). Both failed with nearly the same parse error:
*EParser> parseTest typeAssigment "Fruit a b is a Apple"
parse error at (line 1, column 21):
unexpected end of input
expecting space or "is a"
*EParser> parseTest typeAssigment2 "Fruit a b is a Apple"
parse error at (line 1, column 8):
unexpected " "
expecting letter, digit or "is a"

The problem with typeAssigment is that "is" and "a" are perfectly valid parameter parses. So, the parameter parsing gobbles up the whole input until nothing is left, and then you get an error. In fact, if you look closely at that error you can see this: the parser is expecting either a space (for more parameters) or "is a" (the terminator of your whole parser).
On the other hand, typeAssigment2 is really close, but you're not handling spaces properly. In order to parse many parameters, you need to parse all the spaces between those parameters, not just the first one.
I think the following alternative should do the trick:
typeAssigment3 :: Parser TypeAssignment
typeAssigment3 =
  TypeAssignment
    <$> name
    -- the terminator includes the leading space; otherwise "is" and "a" are read as parameters
    <*> manyTill (space *> parameter) (try $ string " is a")
    <*> (space *> value)
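For reference, here is a self-contained sketch of the manyTill approach that can be loaded into GHCi. It is an illustration under a few assumptions, not a drop-in replacement: the helper isA is a made-up name, the separator is matched together with the space in front of it, and a lookAhead is added so that a parameter that merely begins with "a" is not mistaken for the separator.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

newtype Name      = Name String      deriving (Show, Eq)
newtype Parameter = Parameter String deriving (Show, Eq)
newtype Value     = Value String     deriving (Show, Eq)

data TypeAssignment = TypeAssignment Name [Parameter] Value
  deriving (Show, Eq)

word :: Parser String
word = (:) <$> letter <*> many (letter <|> digit)

-- The separator, including the space before it. `try` makes a partial
-- match (for example " i" of " item") backtrack without consuming input.
isA :: Parser String
isA = try (string " is a" <* lookAhead space)

typeAssignment :: Parser TypeAssignment
typeAssignment =
  TypeAssignment
    <$> (Name <$> word)
    <*> manyTill (Parameter <$> (space *> word)) isA
    <*> (Value <$> (space *> word))
```

Because manyTill tries the terminator before each parameter, "is" and "a" are only read as the separator when they appear together in that position.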

Making an 'optional' parser using optparse-applicative and constructing value for recursive data type

I have a data type called EntrySearchableInfo written like this
type EntryDate = UTCTime -- From Data.Time
type EntryTag = Tag -- String
type EntryName = Name -- String
type EntryDescription = Description -- String
type EntryId = Int
data EntrySearchableInfo
  = SearchableEntryDate EntryDate
  | SearchableEntryTag EntryTag
  | SearchableEntryName EntryName
  | SearchableEntryDescription EntryDescription
  | SearchableEntryId EntryId
It basically represents things that make sense in a 'search' context.
I want to write a function with this type
entrySearchableInfoParser :: Parser (Either String EntrySearchableInfo)
which (I think) will be a combination of several primitive Parser <Type> functions I have already written
entryDateParser :: Parser (Either String UTCTime)
entryDateParser = parseStringToUTCTime <$> strOption
  (long "date" <> short 'd' <> metavar "DATE" <> help entryDateParserHelp)
searchableEntryDateParser :: Parser (Either String EntrySearchableInfo)
searchableEntryDateParser = SearchableEntryDate <$$> entryDateParser -- <$$> is just (fmap . fmap)
searchableEntryTagParser :: Parser (Either String EntrySearchableInfo)
searchableEntryTagParser = ...
...
So I have two questions:
How do I combine those parsers to make the entrySearchableInfoParser function?
EntrySearchableInfo type is a part of a larger Entry type defined like this
data Entry
  = Add EntryDate EntryInfo EntryTag EntryNote EntryId
  | Replace EntrySearchableInfo Entry
  | ...
...
I already have a function with type
entryAdd :: Parser (Either String Entry)
which constructs Entry using Add.
But I'm not sure how to make Entry type using Replace with entrySearchableInfoParser and entryAdd.

So combining those parsers was a lot simpler than I imagined.
I just had to use <|>:
entrySearchableInfoParser :: Parser (Either String EntrySearchableInfo)
entrySearchableInfoParser =
  searchableEntryDateParser
    <|> searchableEntryTagParser
    <|> searchableEntryNameParser
    <|> searchableEntryDescriptionParser
    <|> searchableEntryIdParser
Constructing an Entry using Replace with entrySearchableInfoParser and entryAdd was simple too:
entryAdd :: Parser (Either String Entry)
entryAdd = ...
entryReplace :: Parser (Either String Entry)
entryReplace = liftA2 Replace <$> entrySearchableInfoParser <*> entryAdd
Now it works perfectly!
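To see why liftA2 followed by <$> and <*> type-checks, here is a small sketch that uses only base. It is not optparse-applicative code: the types Info and Entry are simplified stand-ins, and replaceWith is a hypothetical helper that works for any Applicative f, which is the only property of Parser this pattern relies on.

```haskell
import Control.Applicative (liftA2)

-- Simplified stand-ins for the types in the question.
data Info  = ById Int | ByName String        deriving (Show, Eq)
data Entry = Add String | Replace Info Entry deriving (Show, Eq)

-- liftA2 Replace combines the two Either values (inner layer);
-- <$> and <*> then combine the two outer f values.
replaceWith
  :: Applicative f
  => f (Either String Info)
  -> f (Either String Entry)
  -> f (Either String Entry)
replaceWith info entry = liftA2 Replace <$> info <*> entry
```

With Maybe as the outer layer, replaceWith (Just (Right (ById 7))) (Just (Right (Add "note"))) evaluates to Just (Right (Replace (ById 7) (Add "note"))), and a Left in either argument is passed through as the result.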

Why do I get "unexpected end of input" when my parser is explicitly looking for it?

import Control.Applicative hiding (many)
import Text.Parsec
import Text.Parsec.String
lexeme :: Parser a -> Parser a
lexeme p = many (oneOf " \n\r") *> p
identifier :: Parser String
identifier = lexeme $ many1 $ oneOf (['a'..'z'] ++ ['A'..'Z'])
operator :: String -> Parser String
operator = lexeme . string
field :: Parser (String, String)
field = (,) <$> identifier <* operator ":" <*> identifier <* operator ";"
fields :: Parser [(String, String)]
fields = many (try field) <* eof
testInput :: String
testInput = unlines
  [ " FCheckErrors : Boolean ;"
  , " FAcl : TStrings ;"
  ]
main :: IO ()
main = parseTest fields testInput
When run, this yields:
parse error at (line 3, column 1): unexpected end of input
When I remove the explicit eof matching there is no such parse error:
fields = many (try field)
I also tried try field `manyTill` eof, but that will result in the same behavior as the original code.
I want to make sure the parser consumes the whole input, how can I do that?

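The problem is that there is a newline before the end of input (inserted by unlines), so eof must be run through lexeme:
fields = many (try field) <* lexeme eof
Otherwise the trailing newline is never consumed. The last try field skips it, finds no identifier at the end of input, and backtracks, so the bare eof then sees the newline and fails. Parsec reports the failure that got furthest into the input, which is why the message says "unexpected end of input" at line 3.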
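An alternative that avoids wrapping eof is the common convention of skipping whitespace after each token instead of before it, plus one skip at the very start. The following is a sketch of that convention, not the code from the question: symbol and parseFields are names chosen here, and identifier uses letter, which also accepts non-ASCII letters.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Skip whitespace *after* each token, so no token starts on whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

identifier :: Parser String
identifier = lexeme (many1 letter)

symbol :: String -> Parser String
symbol = lexeme . string

field :: Parser (String, String)
field = (,) <$> identifier <* symbol ":" <*> identifier <* symbol ";"

-- Leading whitespace is skipped once; after the last field the input is
-- already at its end, so a plain eof succeeds.
fields :: Parser [(String, String)]
fields = spaces *> many field <* eof

parseFields :: String -> Either ParseError [(String, String)]
parseFields = parse fields "(input)"
```

With this layout field fails without consuming input when no identifier follows, so try is not needed around it.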

Yet Another Haskell Rigid Type Variable Error

I've investigated many answers to other rigid type variable error questions; but, alas, none of them, to my knowledge, apply to my case. So I'll ask yet another question.
Here's the relevant code:
module MultipartMIMEParser where
import Control.Applicative ((<$>), (<*>), (<*))
import Text.ParserCombinators.Parsec hiding (Line)
data Header = Header { hName  :: String
                     , hValue :: String
                     , hAddl  :: [(String,String)] } deriving (Eq, Show)
data Content a = Content a | Posts [Post a] deriving (Eq, Show)
data Post a = Post { pHeaders :: [Header]
                   , pContent :: [Content a] } deriving (Eq, Show)
post :: Parser (Post a)
post = do
  hs <- headers
  c  <- case boundary hs of
          "" -> content >>= \s->return [s]
          b  -> newline >> (string b) >> newline >>
                manyTill content (string b)
  return $ Post { pHeaders=hs, pContent=c }
boundary hs = case lookup "boundary" $ concatMap hAddl hs of
                Just b  -> "--" ++ b
                Nothing -> ""
-- TODO: lookup "boundary" needs to be case-insensitive.
content :: Parser (Content a)
content = do
  xs <- manyTill line blankField
  return $ Content $ unlines xs -- N.b. This is the line the error message refers to.
  where line = manyTill anyChar newline
headers :: Parser [Header]
headers = manyTill header blankField
blankField = newline
header :: Parser Header
header =
  Header <$> fieldName <* string ":"
         <*> fieldValue <* optional (try newline)
         <*> nameValuePairs
  where fieldName = many $ noneOf ":"
        fieldValue = spaces >> many (noneOf "\r\n;")
        nameValuePairs = option [] $ many nameValuePair
nameValuePair :: Parser (String,String)
nameValuePair = do
  try $ do n <- name
           v <- value
           return $ (n,v)
name :: Parser String
name = string ";" >> spaces >> many (noneOf "=")
value :: Parser String
value = string "=" >> between quote quote (many (noneOf "\r\n;\""))
  where quote = string "\""
And the error message:
Couldn't match type `a' with `String'
`a' is a rigid type variable bound by
the type signature for content :: Parser (Content a)
at MultipartMIMEParser.hs:(See comment in code.)
Expected type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content a)
Actual type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content String)
Relevant bindings include
content :: Parser (Content a)
(bound at MultipartMIMEParser.hs:72:1)
In a stmt of a 'do' block: return $ Content $ unlines xs
In the expression:
do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
In an equation for `content':
content
= do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
where
line = manyTill anyChar newline
From what I've seen, the problem is that I'm explicitly returning a String using unlines xs, and that breaks the generic nature of a in the type signature. Am I close to understanding?
I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all the a type parameters, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point.
Is it clear from the code what I'm trying to do? If so, any suggestions on how to do it best?

You're telling the compiler that content has type Parser (Content a), but the line causing the error is
return $ Content $ unlines xs
Since unlines returns a String, and the Content constructor has type a -> Content a, here you would have String ~ a, so the value Content $ unlines xs has type Content String. If you change the type signature of content to Parser (Content String) then it should compile.
"I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all the a type parameters, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point."
It's fine to declare Content to be generic, and in many cases it is exactly the right way to solve the problem. The issue is that while your container is generic, whenever you fill it with something concrete, the type variable becomes concrete as well. In particular:
> :t Content (1 :: Int)
Content (1 :: Int) :: Content Int
> :t Content "test"
Content "test" :: Content String
> :t Content (Content "test")
Content (Content "test") :: Content (Content String)
Notice how all of these have their types inferred without any type variables left. You can use the container to hold whatever you want; you just have to make sure that you're accurately telling the compiler what it is.
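If the goal is to keep the parser usable with payload types other than String, one option is to let the caller supply the conversion from the raw text. This is only a sketch of that idea and not something proposed in the thread: contentWith is a hypothetical name, and Content is reduced to a single constructor to keep the example short.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Simplified: the Posts constructor from the question is left out.
data Content a = Content a deriving (Eq, Show)

-- The caller decides what `a` is by passing a function from the raw text.
-- Lines are collected until a blank line, as in the original `content`.
contentWith :: (String -> a) -> Parser (Content a)
contentWith convert = do
  xs <- manyTill line newline
  return (Content (convert (unlines xs)))
  where line = manyTill anyChar newline
```

Here contentWith id has type Parser (Content String), while contentWith length has type Parser (Content Int). The signature can stay generic because the body no longer fixes a to String.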

Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I've written the following, sort-of-inspired by the examples in RWH:
import Text.ParserCombinators.Parsec
main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results
data Element
    = Text String
    | Haskell String
    deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
    = parse literateFile "(unknown)" input
literateFile
    = many codeOrProse
codeOrProse
    = code <|> prose
code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content
prose
    = do content <- many anyChar
         return $ Text content
eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"
Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?

As sth pointed out many anyChar is the problem. But not just in prose but also in code. The problem with code is, that content <- many anyChar will consume everything: The newlines and the \end{code} tag.
So, you need to have some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so, is to look for backslashes:
literateFile = many codeOrProse <* eof
code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content
prose = do content <- many1 $ noneOf "\\"
           return $ Text content
Now, you don't completely have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines, you could say content <- filterNewlines <$> (many $ noneOf "\\")).
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):
import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))
main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results
data Element
    = Text String
    | Haskell String
    deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
    = parse literateFile "" input
literateFile
    = many codeOrProse
codeOrProse = code <|> prose
code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c
prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t
untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'
untilP p parses a line, then checks if the beginning of the next line can be successfully parsed by p. If so, it returns the empty string, otherwise it goes on. The lookAhead is needed, because otherwise the begin\end-tags would be consumed and code couldn't recognize them.
I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).

I haven't tested it, but:
many anyChar can match an empty string
Therefore prose can match an empty string
Therefore codeOrProse can match an empty string
Therefore literateFile can loop forever, matching infinitely many empty strings
Changing prose to match many1 characters might fix this problem.
(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)

For reference, here's another version I came up with (slightly expanded to handle other cases):
import Text.ParserCombinators.Parsec
main
    = do contents <- readFile "test.tex"
         let results = parseLiterate contents
         print results
data Element
    = Text String
    | Haskell String
    | Section String
    deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
    = parse literateFile "(unknown)" input
literateFile
    = do es <- many elements
         eof
         return es
elements
    =   try section
    <|> try quotedBackslash
    <|> try code
    <|> prose
code
    = do string "\\begin{code}"
         c <- anyChar `manyTill` try (string "\\end{code}")
         return $ Haskell c
quotedBackslash
    = do string "\\\\"
         return $ Text "\\\\"
prose
    = do t <- many1 (noneOf "\\")
         return $ Text t
section
    = do string "\\section{"
         content <- many1 (noneOf "}")
         char '}'
         return $ Section content
