Haskell - Parsec :: Parse spaces until string literal

Haskell - Parsec :: Parse spaces until string literal - haskell

I am currently trying to design a Parser in Haskell using Parsec.
The syntax for declaring type should look something like this:
Fruit is a Apple
Types should also be able have parameters:
Fruit a b is a Apple
Where
Fruit has type Name
a b has type [Parameter]
Apple has type Value
The problem here is that my parser currently does not know when to stop parsing the parameters and start parsing the value.
The code is as follows:
newtype Name = Name String deriving (Show)
newtype Parameter = Parameter String deriving (Show)
newtype Value = Value String deriving (Show)
data TypeAssignment = TypeAssignment Name [Parameter] Value deriving (Show)
-- first variant using `sepBy`
typeAssigment :: Parser TypeAssignment
typeAssigment =
TypeAssignment
<$> name
<*> (space *> parameter `sepBy` space)
<*> (string "is a" *> value)
-- second variant using manyTill
typeAssigment2 :: Parser TypeAssignment
typeAssigment2 =
TypeAssignment
<$> name
<*> (space *> manyTill parameter (string "is a"))
<*> value
name :: Parser Name
name = Name <$> word
parameter :: Parser Parameter
parameter = Parameter <$> word
value :: Parser Value
value = Value <$> word
word :: Parser String
word = (:) <$> letter <*> many (letter <|> digit)
I have tried parsing the parameters/value in the two ways I would know how to (once with sepBy and once with manyTill) both failed with the nearly the parse-error:
*EParser> parseTest typeAssigment "Fruit a b is a Apple"
parse error at (line 1, column 21):
unexpected end of input
expecting space or "is a"
*EParser> parseTest typeAssigment2 "Fruit a b is a Apple"
parse error at (line 1, column 8):
unexpected " "
expecting letter, digit or "is a"

The problem with typeAssignment1 is that "is" and "a" are perfectly valid parameter parses. So, the parameter parsing gobbles up the whole input until nothing is left, and then you get an error. In fact, if you look closely at that error you see this to be true: the parser is expecting either a space (for more parameters) or "is a" (the terminal of your whole parser).
On the other hand, typeAssignment2 is really close, but it seems that you're not handling spaces properly. In order to parse many parameters, you need to parse all the spaces between those parameter, not just the first one.
I think the following alternative should do the trick:
typeAssigment3 :: Parser TypeAssignment
typeAssigment3 =
TypeAssignment
<$> name
<*> manyTill (space *> parameter) (try $ string "is a")
<*> (space *> value)

Related

Making an 'optional' parser using optparse-applicative and constructing value for recursive data type

I have a data type called EntrySearchableInfo written like this
type EntryDate = UTCTime -- From Data.Time
type EntryTag = Tag -- String
type EntryName = Name -- String
type EntryDescription = Description -- String
type EntryId = Int
data EntrySearchableInfo
= SearchableEntryDate EntryDate
| SearchableEntryTag EntryTag
| SearchableEntryName EntryName
| SearchableEntryDescription EntryDescription
| SearchableEntryId EntryId
Basically represents things that make sense in 'search' context.
I want to write a function with this type
entrySearchableInfoParser :: Parser (Either String EntrySearchableInfo)
which (I think) will be a combination of several primitive Parser <Type> functions I have already written
entryDateParser :: Parser (Either String UTCTime)
entryDateParser = parseStringToUTCTime <$> strOption
(long "date" <> short 'd' <> metavar "DATE" <> help entryDateParserHelp)
searchableEntryDateParser :: Parser (Either String EntrySearchableInfo)
searchableEntryDateParser = SearchableEntryDate <$$> entryDateParser -- <$$> is just (fmap . fmap)
searchableEntryTagParser :: Parser (Either String EntrySearchableInfo)
searchableEntryTagParser = ...
...
So I have two questions:
How do I combine those parsers to make entrySearchableInfoParser functions.
EntrySearchableInfo type is a part of a larger Entry type defined like this
data Entry
= Add EntryDate EntryInfo EntryTag EntryNote EntryId
| Replace EntrySearchableInfo Entry
| ...
...
I already have a function with type
entryAdd :: Parser (Either String Entry)
which constructs Entry using Add.
But I'm not sure how to make Entry type using Replace with entrySearchableInfoParser and entryAdd.

So combining those parsers were a lot simpler than I imagined.
I just had to use <|>
entrySearchableInfoParser :: Parser (Either String EntrySearchableInfo)
entrySearchableInfoParser =
searchableEntryDateParser
<|> searchableEntryTagParser
<|> searchableEntryNameParser
<|> searchableEntryDescriptionParser
<|> searchableEntryIdParser
and constructing Entry type using Replace with entrySearchableInfoParser and entryAdd was too.
entryAdd :: Parser (Either String Entry)
entryAdd = ...
entryReplace :: Parser (Either String Entry)
entryReplace = liftA2 Edit <$> entrySearchableInfoParser <*> entryAdd
Now it works perfectly!

Parser Combinators (ReadP) - return entire list if it fails, otherwise return only the passing one

I have a string, for example "MMMMABCNNNXYZPPPPP". I know that this string may have ABC in it and may have XYZ in it, but it is not required to have either. Additionally, the XYZ may be swapped for DEF (e.g. "MMMMABCNNNDEFPPPPP") and the behavior should remain the same.
I would like to parse the string and return the sequences between them, as well as which one of XYZ or DEF was present. Example:
data Divider1 = Abc
data Divider2 = Xyz | Def
--"MMMMABCNNNXYZPPPPP" should return ("MMMM", Just Abc, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNDEFPPPPP" should return ("MMMM", Just Abc, "NNN", Just Def, "PPPPP")
Note that if ABC is not present, I would like to return everything before the divider2 and if XYZ and DEF are both not present, I would like to return everything after divider 1.
Example:
--"MMMMNNNXYZPPPPP" should return ("MMMM", Nothing, "NNN", Just Xyz, "PPPPP")
--"MMMMABCNNNPPPPP" should return ("MMMM", Just Abc, "NNN", Nothing, "PPPPP")
If neither ABC nor XYZ is present then I don't care if it returns nothing, or if it returns the entire string.
Currently my code is
parseEverything = many $ satisfy someGeneralCondition--check if all characters are valid
parseAbc = (\str -> Abc) <$> string "ABC"
parseXyz = (\str -> Xyz) <$> string "XYZ"
parseDef = (\str -> Def) <$> string "DEF"
parseFull = do
beforeAbc <- gather parseEverything
parseAbc <- (Just <$> parseAbc) <++ return Nothing
beforeDivider2 <- gather parseEverything
parseDivider2 <- (Just <$> parseXyz) <++ (Just <$> parseDef) <++ (Just <$> Nothing)
everythingElse <- look
return (beforeAbc, parseAbc, beforeDivider2, parseDivider2, everythingElse)
But when I run this on the example string "MMMMABCNNNXYZPPPPP", I get mostly failed results with just one result that I want. The problem is that I need to return everything in beforeAbc if parseAbc fails, but if parseAbc passes then I just need to return that. And the same thing with parseXyz and parseDef. I don't think that <++ is the correct operator to do this. I also tried a variant of this code using option, but it gave the same result. Is there a simple solution that I am missing, and/or should I set up the parsers in a different way?
Thanks in advance!
Edit: does this have anything to do with chainl or chainr or manyTill?

Updated: See note on applicative parsers below.
Here's what's going wrong with your current approach. As you undoubtedly know, the parsers in Text.ParserCombinators.ReadP generate all possible valid parses of all possible prefixes of the string. If you write a parser:
letterAndOther = do
letters <- many (satisfy isLetter)
others <- many get
return (letters, others)
which grabs an initial string of letters followed by the "rest" of the string and run it on a simple test string, you'll usually get way more than you bargained for:
> readP_to_S letterAndOther "abc"
[(("",""),"abc"),(("","a"),"bc"),(("a",""),"bc"),(("","ab"),"c"),
(("a","b"),"c"),(("ab",""),"c"),(("","abc"),""),(("a","bc"),""),
(("ab","c"),""),(("abc",""),"")]
In other words, in a do-block, each monadic action will typically generate a tree of possible parses. In your current code, the very first line of the do-block:
beforeAbc <- gather parseEverything
introduces a whole tree of parse branches, one branch for each possible initial prefix. These branches only get pruned if a later line of the do-block introduces a parse that fails. But, every line of your do-block represents a parser that always succeeds. For example, this always succeeds:
parseAbc <- (Just <$> parseAbc) <++ return Nothing
because even if the first divider isn't found, the right-hand side parser return Nothing will always succeed.
I would suggest the following approach. First, as we discovered in the comments, the first thing you want to do is figure out what your parser should return. Instead of trying to shoehorn the result into a weird tuple, it's a good idea to leverage Haskell's best feature, it's algebraic data types. Define a return type for your parse:
data Result
= TwoDividers String Divider1 String Divider2 String
| FirstDivider String Divider1 String
| SecondDivider String Divider2 String
| NoDividers String
This is unambiguous and covers all possibilities. Admittedly, including Divider1 in the first two constructors is redundant, since there's only one possible Divider1, but programs are for humans to read, too, and keeping Divider1 explicit improves readability.
Now, let's define parsers for the first and second dividers:
divider1 = Abc <$ string "ABC"
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
Note that I've chosen to define a single divider2 instead of separate parsers for Def and Xyz. Since, in your grammar, it's always the case that "DEF" can appear anywhere "XYZ" can and vice versa, it makes sense to combine them into one parser.
We'll also want a parser for arbitrary strings (basically your parseEverything):
anything = many $ satisfy isLetter -- valid characters
Now, let's write a parser for the full string. A key insight here is that we have four alternatives (i.e., the four constructors for our Result type). It's true that they share some structure, but a first crack at a parser can just treat them as independent alternatives. We'll use the <++ operator to choose the best match:
result =
(TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
<++ (FirstDivider <$> anything <*> divider1 <*> anything)
<++ (SecondDivider <$> anything <*> divider2 <*> anything)
<++ (NoDividers <$> anything)
A quick test of this will show we've forgotten something:
> readP_to_S result "MMMMABCNNNXYZPPPPP"
[(TwoDividers "MMMM" Abc "NNN" Xyz "","PPPPP"),...]
By default, the parser combinators will try every possible prefix of the input string, leaving more for later parsers. So, we should wrap this up in a final parser function that checks for the end-of-string:
parseResult = readP_to_S (result <* eof)
and with the tests:
main = mapM_ (print . parseResult)
[ "MMMMABCNNNXYZPPPPP"
, "MMMMABCNNNDEFPPPPP"
, "MMMMNNNXYZPPPPP"
, "MMMMABCNNNPPPPP"
]
we get the expected unique parsed output:
[(TwoDividers "MMMM" Abc "NNN" Xyz "PPPPP","")]
[(TwoDividers "MMMM" Abc "NNN" Def "PPPPP","")]
[(SecondDivider "MMMMNNN" Xyz "PPPPP","")]
[(FirstDivider "MMMM" Abc "NNNPPPPP","")]
Note on Applicative Parsers. I've used applicative syntax here, rather than the monad syntax. The difference isn't purely syntactical -- you can always write an applicative expression in monadic form, but there are monadic operations that can't be expressed applicatively, so the monadic syntax is strictly more powerful. However, when an expression can be written both ways, often the applicative syntax is more succinct and easier to write and understand, at least once you get used to it.
In a nutshell, the expression p <*> x <*> y <*> z creates a new parser that applies the parsers p, x, y, and z in order, and then applies the result from parser p (which needs to be a function f) to the results from the rest of the parsers (which must be appropriate arguments for f). In many cases, the function f is a known function and doesn't need to be returned by a parser, so a common variant is to write f <$> x <*> y <*> z. This applies the parsers x, y, and z in order, and then applies f (given directly instead of returned by a parser) to the results from those parsers. For example, the expression:
FirstDivider <$> anything <*> divider1 <*> anything
runs three parsers in order to get anything, followed by a divider1, followed by anything, and then applies the function/contructor FirstDivider to the three arguments resulting from those parsers.
The operators <* and *> can be thought of as variants of <*>. The expression p <*> x first parses p, then parses x, then applies the result of the former to the latter. The expression p <* x first parses p, then parses x, but instead of applying the former to the latter, it returns the value the arrow is pointing to (i.e., whatever p produced) and throws away the other value. Similarly p *> x parses p then parses x, then returns whatever x produced. In particular:
someParser <* eof
first runs someParser, then parses (i.e., checks for) EOF, then returns whatever someParser produced.
This syntax can really shine when parsing more traditional languages into an abstract syntax tree. If you want to parse statements like:
let x = 1 + 5
into a Statement type like:
data Statement = ... | Let Var Expr | ...
you can write a Parsec parser that looks like:
statement = ...
<|> Let <$ string "let" <*> var <* symbol "=" <*> expr
...
The monadic equivalent in do-notation looks like this:
do string "let"
v <- var
symbol "="
e <- expr
return $ Let v e
which is fine, I suppose, but kind of obscures the simple structure of the parse. The applicative version is basically just the list of tokens to parse, with a little bit of syntactic sugar sprinkled in.
Anyway, here's the full program:
import Data.Char
import Text.ParserCombinators.ReadP
data Divider1 = Abc deriving (Show)
data Divider2 = Xyz | Def deriving (Show)
data Result
= TwoDividers String Divider1 String Divider2 String
| FirstDivider String Divider1 String
| SecondDivider String Divider2 String
| NoDividers String
deriving (Show)
anything :: ReadP String
anything = many $ satisfy isLetter -- valid characters
divider1 :: ReadP Divider1
divider1 = Abc <$ string "ABC"
divider2 :: ReadP Divider2
divider2 = (Def <$ string "DEF") +++ (Xyz <$ string "XYZ")
result :: ReadP Result
result =
(TwoDividers <$> anything <*> divider1 <*> anything <*> divider2 <*> anything)
<++ (FirstDivider <$> anything <*> divider1 <*> anything)
<++ (SecondDivider <$> anything <*> divider2 <*> anything)
<++ (NoDividers <$> anything)
parseResult :: String -> [(Result, String)]
parseResult = readP_to_S (result <* eof)
main :: IO ()
main = mapM_ (print . parseResult)
[ "MMMMABCNNNXYZPPPPP"
, "MMMMABCNNNDEFPPPPP"
, "MMMMNNNXYZPPPPP"
, "MMMMABCNNNPPPPP"
]

Checking if the remaining input is whitespace with parsec

I'm trying out the parsec library and I'm not sure how to handle this basic task.
Suppose I have the following:
data Foo = A | AB
and I want the string "a" to be parsed as A and "a b" AB. If I just do this:
parseA :: parser Foo
parseA = do
reserved "a"
return A
parseAB :: parser Foo
parseAB = do
reserved "a"
reserved "b"
return AB
parseFoo :: parser Foo
parseFoo = parseA
<|> parseAB
then parseFoo will parse "a b" as A since parseA doesn't care that there is non-whitespace still left after consuming the 'a'. How can this be fixed?

You need to change grammar to AB | A and use try from parsec, which gives lookahead capability to your parser.
This should work
parseFoo = try Parse AB <|> parse A

Yet Another Haskell Rigid Type Variable Error

I've investigated many answers to other rigid type variable error questions; but, alas, none of them, to my knowledge, apply to my case. So I'll ask yet another question.
Here's the relevant code:
module MultipartMIMEParser where
import Control.Applicative ((<$>), (<*>), (<*))
import Text.ParserCombinators.Parsec hiding (Line)
data Header = Header { hName :: String
, hValue :: String
, hAddl :: [(String,String)] } deriving (Eq, Show)
data Content a = Content a | Posts [Post a] deriving (Eq, Show)
data Post a = Post { pHeaders :: [Header]
, pContent :: [Content a] } deriving (Eq, Show)
post :: Parser (Post a)
post = do
hs <- headers
c <- case boundary hs of
"" -> content >>= \s->return [s]
b -> newline >> (string b) >> newline >>
manyTill content (string b)
return $ Post { pHeaders=hs, pContent=c }
boundary hs = case lookup "boundary" $ concatMap hAddl hs of
Just b -> "--" ++ b
Nothing -> ""
-- TODO: lookup "boundary" needs to be case-insensitive.
content :: Parser (Content a)
content = do
xs <- manyTill line blankField
return $ Content $ unlines xs -- N.b. This is the line the error message refers to.
where line = manyTill anyChar newline
headers :: Parser [Header]
headers = manyTill header blankField
blankField = newline
header :: Parser Header
header =
Header <$> fieldName <* string ":"
<*> fieldValue <* optional (try newline)
<*> nameValuePairs
where fieldName = many $ noneOf ":"
fieldValue = spaces >> many (noneOf "\r\n;")
nameValuePairs = option [] $ many nameValuePair
nameValuePair :: Parser (String,String)
nameValuePair = do
try $ do n <- name
v <- value
return $ (n,v)
name :: Parser String
name = string ";" >> spaces >> many (noneOf "=")
value :: Parser String
value = string "=" >> between quote quote (many (noneOf "\r\n;\""))
where quote = string "\""
And the error message:
Couldn't match type `a' with `String'
`a' is a rigid type variable bound by
the type signature for content :: Parser (Content a)
at MultipartMIMEParser.hs:(See comment in code.)
Expected type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content a)
Actual type: Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity (Content String)
Relevant bindings include
content :: Parser (Content a)
(bound at MultipartMIMEParser.hs:72:1)
In a stmt of a 'do' block: return $ Content $ unlines xs
In the expression:
do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
In an equation for `content':
content
= do { xs <- manyTill line blankField;
return $ Content $ unlines xs }
where
line = manyTill anyChar newline
From what I've seen, the problem is that I'm explicitly returning a String using unlines xs, and that breaks the generic nature of a in the type signature. Am I close to understanding?
I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all my as, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point.
Is it clear from the code what I'm trying to do? If so, any suggestions on how to do it best?

You're telling the compiler that content has type Parser (Content a), but the line causing the error is
return $ Content $ unlines xs
Since unlines returns a String, and the Content constructor has type a -> Content a, here you would have String ~ a, so the value Content $ unlines xs has type Content String. If you change the type signature of content to Parser (Content String) then it should compile.
I've declared Content to be generic because, presumably, this parser might eventually be used on types other than String. Perhaps I'm abstracting prematurely. I did try removing all my as, but I started getting many more compile errors. I think I'd like to stick with the generic approach, if that's reasonable at this point.
It's fine to declare Content to be generic, and in many cases it is the exact right way to solve the problem, the issue is that while your container is generic, whenever you fill your container with something concrete, the type variables also have to be concrete. In particular:
> :t Container (1 :: Int)
Container 1 :: Container Int
> :t Container "test"
Container "test" :: Container String
> :t Container (Container "test")
Container (Container "test") :: Container (Container String)
Notice how all of these have their types inferred without any type variables left. You can use the container to hold whatever you want, you just have to make sure that you're accurately telling the compiler what it is.

using parsec to pick data out of text file

As a learning exercise I'm using parsec to look for values in a test file. I'd normally use regexp for this particular case, but want to see if parsec makes sense as well. Unfortunately, I'm running into some problems.
The data file consists of repeating sections that look similar to the following. The 'SHEF' is one of six values and changes from page to page, and I want to use it in constructing a data type.
Part A SHEF Nov/14/2011 (10:52)
-------------------
Portfolio Valuation
-------------------
FOREIGN COMMON STOCK 6,087,152.65
FOREIGN COMMON STOCK - USA 7,803,858.84
RIGHTS 0.00
I'm constructing a data type of the amounts in each asset class:
type Sector = String
type Amount = Double
type FundCode = String
data SectorAmount = SectorAmount (Sector,Amount) deriving (Show, Eq)
data FundSectors = FundSectors {
fund :: FundCode
, sectorAmounts :: [SectorAmount]
} deriving (Show, Eq)
My code, which compiles successfully, is as shown below. It parses the file and correctly retrieves the values in each asset class, but I'm never able to set the state correctly in the fundValue parser. I've tested the fundValue parser with an input string and it does successfully parse it, but for some reason the line function isn't working the way I thought it would. I want it to look for lines in the file which start with "Part A", find the code and store it in state for later use when the tag parser successfully parses a line.
Is the use of fail causing the problem?
allocationParser :: String -> Either ParseError [FundSectors]
allocationParser input = do
runParser allocationFile "" "" input
allocationFile :: GenParser Char FundCode [FundSectors]
allocationFile = do
secAmt <- many line
return secAmt
line :: GenParser Char FundCode FundSectors
line = try (do fund <- try fundValue
eol
fail "")
<|> do result <- try tag
eol
f <- getState
return $ FundSectors {fund=f, sectorAmounts = [result]}
fundValue :: GenParser Char FundCode FundCode
fundValue = do manyTill anyChar . try $ lookAhead (string "Part A ")
string "Part A "
fCode <- try fundCode
setState fCode
v <- many (noneOf "\n\r")
eol
return fCode
fundCode :: GenParser Char FundCode String
fundCode = try (string "SHSF")
<|> try (string "SHIF")
<|> try (string "SHFF")
<|> try (string "SHEF")
<|> try (string "SHGE")
<|> try (string "SHSE")
<|> fail "Couldn't match fundCode"
tag :: GenParser Char FundCode SectorAmount
tag = do manyTill anyChar . try $ lookAhead tagName
name <- tagName
v <- many (noneOf "\n\r")
let value = read ([x | x <- v, x /= ',']) :: Double -- remove commas from currency
return $ SectorAmount (name,value)
eol :: GenParser Char FundCode String
eol = try (string "\n\r")
<|> try (string "\r\n")
<|> string "\n"
<|> string "\r"
<|> fail "Couldn't find EOL"
Thanks in advance.

Yes, the fail in the "try fundValue" block undoes the setState.
You will need to slightly redesign the parser, but you seem close.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Haskell - Parsec :: Parse spaces until string literal - haskell

Related

Making an 'optional' parser using optparse-applicative and constructing value for recursive data type

Parser Combinators (ReadP) - return entire list if it fails, otherwise return only the passing one

Checking if the remaining input is whitespace with parsec

Yet Another Haskell Rigid Type Variable Error

using parsec to pick data out of text file

Categories

Resources