Permutation parsing with megaparsec + parser-combinators too lenient - haskell

I'm attempting to parse permutations of flags. The behavior I want is "one or more flags in any order, without repetition". I'm using the following packages:
megaparsec
parser-combinators
The code I have is outputting what I want, but is too lenient on inputs. I don't understand why it's accepting multiples of the same flags. What am I doing wrong here?
pFlags :: Parser [Flag]
pFlags = runPermutation $ f <$>
toPermutation (optional (GroupFlag <$ char '\'')) <*>
toPermutation (optional (LeftJustifyFlag <$ char '-'))
where f a b = catMaybes [a, b]
Examples:
"'-" = [GroupFlag, LeftJustifyFlag] -- CORRECT
"-'" = [LeftJustifyFlag, GroupFlag] -- CORRECT
"''''-" = [GroupFlag, LeftJustifyFlag] -- INCORRECT, should fail if there's more than one of the same flag.

Instead of toPermutation with optional, I believe you need to use toPermutationWithDefault, something like this (untested):
toPermutationWithDefault Nothing (Just GroupFlag <$ char '\'')
The reasoning is described in the paper “Parsing Permutation Phrases” (PDF) in §4, “adding optional elements” (emph. added):
Consider, for example […] all permutations of a, b and c. Suppose b can be empty and we want to recognise ac. This can be done in three different ways since the empty b can be recognised before a, after a or after c. Fortunately, it is irrelevant for the result of a parse where exactly the empty b is derived, since order is not important. This allows us to use a strategy similar to the one proposed by Cameron: parse nonempty constituents as they are seen and allow the parser to stop if all remaining elements are optional. When the parser stops the default values are returned for all optional elements that have not been recognised.
To implement this strategy we need to be able to determine whether a parser can derive the empty string and split it into its default value and its non-empty part, i.e. a parser that behaves the same except that it does not recognise the empty string.
That is, the permutation parser needs to know which elements can succeed without consuming input, otherwise it will be too eager to commit to a branch. I don’t know why this would lead to accepting multiples of an element, though; perhaps you’re also missing an eof?

Related

Haskell # type annotation [duplicate]

I've come across a piece of Haskell code that looks like this:
ps#(p:pt)
What does the # symbol mean in this context? I can't seem to find any info on Google (it's unfortunately hard to search for symbols on Google), and I can't find the function in the Prelude documentation, so I imagine it must be some sort of syntactic sugar instead.
Yes, it's just syntactic sugar, with # read aloud as "as". ps#(p:pt) gives you names for
the list: ps
the list's head : p
the list's tail: pt
Without the #, you'd have to choose between (1) or (2):(3).
This syntax actually works for any constructor; if you have data Tree a = Tree a [Tree a], then t#(Tree _ kids) gives you access to both the tree and its children.
The # Symbol is used to both give a name to a parameter and match that parameter against a pattern that follows the #. It's not specific to lists and can also be used with other data structures.
This is useful if you want to "decompose" a parameter into it's parts while still needing the parameter as a whole somewhere in your function. One example where this is the case is the tails function from the standard library:
tails :: [a] -> [[a]]
tails [] = [[]]
tails xxs#(_:xs) = xxs : tails xs
I want to add that # works at all levels, meaning you can do this:
let a#(b#(Just c), Just d) = (Just 1, Just 2) in (a, b, c, d)
Which will then produce this: ((Just 1, Just 2), Just 1, 1, 2)
So basically it's a way for you to bind a pattern to a value. This also means that it works with any kind of pattern, not just lists, as demonstrated above. This is a very useful thing to know, as it means you can use it in many more cases.
In this case, a is the entire Maybe Tuple, b is just the first Just in the tuple, and c and d are the values contained in the first and second Just in the tuple respectively
To add to what the other people have said, they are called as-patterns (in ML the syntax uses the keyword "as"), and are described in the section of the Haskell Report on patterns.

How can I write a parser using Parsec that only accepts unique elements?

I have recently started learning Haskell and have been trying my hand at Parsec. However, for the past couple of days I have been stuck with a problem that I have been unable to find the solution to. So what I am trying to do is write a parser that can parse a string like this:
<"apple", "pear", "pineapple", "orange">
The code that I wrote to do that is:
collection :: Parser [String]
collection = (char '<') *> (string `sepBy` char ',')) <* (char '>')
string :: Parser String
string = char '"' *> (many (noneOf ['\"', '\r', '\n', '"'])) <* char '"'
This works fine for me as it is able to parse the string that I have defined above. Nevertheless, I would now like to enforce the rule that every element in this collection must be unique and that is where I am having trouble. One of the first results I found when searching on the internet was this one, which suggest the usage of the nub function. Although the problem stated in that question is not the same, it would in theory solve my problem. But what I don't understand is how I can apply this function within a Parser. I have tried adding the nub function to several parts of the code above without any success. Later I also tried doing it the following way:
collection :: Parser [String]
collection = do
char '<'
value <- (string `sepBy` char ','))
char '>'
return nub value
But this does not work as the type does not match what nub is expecting, which I believe is one of the problems I am struggling with. I am also not entirely sure whether nub is the right way to go. My fear is that I am going in the wrong direction and that I won't be able to solve my problem like this. Is there perhaps something I am missing? Any advice or help anyone could provide would be greatly appreciated.
The Parsec Parser type is an instance of MonadPlus which means that we can always fail (ie cause a parse error) whenever we want. A handy function for this is guard:
guard :: MonadPlus m => Bool -> m ()
This function takes a boolean. If it's true, it return () and the whole computation (a parse in this case) does not fail. If it's false, the whole thing fails.
So, as long as you don't care about efficiency, here's a reasonable approach: parse the whole list, check for whether all the elements are unique and fail if they aren't.
To do this, the first thing we have to do is write a predicate that checks if every element of a list is unique. nub does not quite do the right thing: it return a list with all the duplicates taken out. But if we don't care much about performance, we can use it to check:
allUnique ls = length (nub ls) == length ls
With this predicate in hand, we can write a function unique that wraps any parser that produces a list and ensures that list is unique:
unique parser = do res <- parser
guard (allUnique res)
return res
Again, if guard is give True, it doesn't affect the rest of the parse. But if it's given False, it will cause an error.
Here's how we could use it:
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\">"
Right ["apple","pear","pineapple","orange"]
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\",\"apple\">"
Left "<interactive>" (line 1, column 46):unknown parse error
This does what you want. However, there's a problem: there is no error message supplied. That's not very user friendly! Happily, we can fix this using <?>. This is an operator provided by Parsec that lets us set the error message of a parser.
unique parser = do res <- parser
guard (allUnique res) <?> "unique elements"
return res
Ahhh, much better:
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\",\"apple\">"
Left "<interactive>" (line 1, column 46):
expecting unique elements
All this works but, again, it's worth noting that it isn't efficient. It parses the whole list before realizing elements aren't unique, and nub takes quadratic time. However, this works and it's probably more than good enough for parsing small to medium-sized files: ie most things written by hand rather than autogenerated.

Parsec: key value pairs where value type depends on key

I am trying to parse sgf files (files that describe games of go). In these files there are key value pairs of the form (in the sgf specifiction they are called property id's and property values, but i am using key and value in the hope that people will know what I am talking about when reading the title):
key[value]
or
key[value1][value2]...[valuen]
That is, there might be 1 or many values. The catch is that the type of the value depends in the key. So for example, if the key is B (for play a black stone in go). The value is supposed to be a coordinate described by two letters for example: B[ab]. It might also be that the key is AB (for adding a number of black stones, for setting up the board), then the value is a list of coordinates like this: AB[ab][cd][fg]. It could also be that the key is C (for a comment). Then the value is just a string C[this is a comment].
Of course this could be described by the type
type Property = (String, [String])
But i think it would be nicer to have something like
data Property = B Coordinate | AB [Coordinate] | C String ...
Or maybe some other type that better utilizes the type system and won't require that I convert to and from strings all the time.
The problem is that then I would need a parser that depending on the key type returns a different value type, but I think that would cause type problems since a parser can only return one type of value.
How would you parse something like this?
This is actually a straightforward choice, and doesn't need monadic parsing. I'll use the applicative interface to demonstrate this point.
Build a parser for each property id and its property values somewhat like this:
black = B <$> (char 'C' *> coordinate)
white = W <$> (char 'W' *> coordinate)
addBlack = AB <$> (string "AB" *> many1 coordinate)
(assuming you've built a coordinate parser that eats the brackets and returns something of type Coordinate).
Each one of those has type Parser Property (with your second, better structured data type), so now we just get the parser to choose between them. If the property ids all have different first letters, when you use the parser for the wrong id, it will fail without consuming input, which is ideal for the choice operator:
myparser = black <|> white <|> addBlack
But I suspect there's an AW id for adding white stones, so we'd need to warn that they overlap using try, which backtracks when a parser fails:
mybetterparser = black <|> white <|> (try addBlack <|> try addWhite)
I've bracketed together parsers with a common start and used try on them to go back to the beginning.

What makes a good name for a helper function?

Consider the following problem: given a list of length three of tuples (String,Int), is there a pair of elements having the same "Int" part? (For example, [("bob",5),("gertrude",3),("al",5)] contains such a pair, but [("bob",5),("gertrude",3),("al",1)] does not.)
This is how I would implement such a function:
import Data.List (sortBy)
import Data.Function (on)
hasPair::[(String,Int)]->Bool
hasPair = napkin . sortBy (compare `on` snd)
where napkin [(_, a),(_, b),(_, c)] | a == b = True
| b == c = True
| otherwise = False
I've used pattern matching to bind names to the "Int" part of the tuples, but I want to sort first (in order to group like members), so I've put the pattern-matching function inside a where clause. But this brings me to my question: what's a good strategy for picking names for functions that live inside where clauses? I want to be able to think of such names quickly. For this example, "hasPair" seems like a good choice, but it's already taken! I find that pattern comes up a lot - the natural-seeming name for a helper function is already taken by the outer function that calls it. So I have, at times, called such helper functions things like "op", "foo", and even "helper" - here I have chosen "napkin" to emphasize its use-it-once, throw-it-away nature.
So, dear Stackoverflow readers, what would you have called "napkin"? And more importantly, how do you approach this issue in general?
General rules for locally-scoped variable naming.
f , k, g, h for super simple local, semi-anonymous things
go for (tail) recursive helpers (precedent)
n , m, i, j for length and size and other numeric values
v for results of map lookups and other dictionary types
s and t for strings.
a:as and x:xs and y:ys for lists.
(a,b,c,_) for tuple fields.
These generally only apply for arguments to HOFs. For your case, I'd go with something like k or eq3.
Use apostrophes sparingly, for derived values.
I tend to call boolean valued functions p for predicate. pred, unfortunately, is already taken.
In cases like this, where the inner function is basically the same as the outer function, but with different preconditions (requiring that the list is sorted), I sometimes use the same name with a prime, e.g. hasPairs'.
However, in this case, I would rather try to break down the problem into parts that are useful by themselves at the top level. That usually also makes naming them easier.
hasPair :: [(String, Int)] -> Bool
hasPair = hasDuplicate . map snd
hasDuplicate :: Ord a => [a] -> Bool
hasDuplicate = not . isStrictlySorted . sort
isStrictlySorted :: Ord a => [a] -> Bool
isStrictlySorted xs = and $ zipWith (<) xs (tail xs)
My strategy follows Don's suggestions fairly closely:
If there is an obvious name for it, use that.
Use go if it is the "worker" or otherwise very similar in purpose to the original function.
Follow personal conventions based on context, e.g. step and start for args to a fold.
If all else fails, just go with a generic name, like f
There are two techniques that I personally avoid. One is using the apostrophe version of the original function, e.g. hasPair' in the where clause of hasPair. It's too easy to accidentally write one when you meant the other; I prefer to use go in such cases. But this isn't a huge deal as long as the functions have different types. The other is using names that might connote something, but not anything that has to do with what the function actually does. napkin would fall into this category. When you revisit this code, this naming choice will probably baffle you, as you will have forgotten the original reason that you named it napkin. (Because napkins have 4 corners? Because they are easily folded? Because they clean up messes? They're found at restaurants?) Other offenders are things like bob and myCoolFunc.
If you have given a function a name that is more descriptive than go or h, then you should be able to look at either the context in which it is used, or the body of the function, and in both situations get a pretty good idea of why that name was chosen. This is where my point #3 comes in: personal conventions. Much of Don's advice applies. If you are using Haskell in a collaborative situation, then coordinate with your team and decide on certain conventions for common situations.

What does the "#" symbol mean in reference to lists in Haskell?

I've come across a piece of Haskell code that looks like this:
ps#(p:pt)
What does the # symbol mean in this context? I can't seem to find any info on Google (it's unfortunately hard to search for symbols on Google), and I can't find the function in the Prelude documentation, so I imagine it must be some sort of syntactic sugar instead.
Yes, it's just syntactic sugar, with # read aloud as "as". ps#(p:pt) gives you names for
the list: ps
the list's head : p
the list's tail: pt
Without the #, you'd have to choose between (1) or (2):(3).
This syntax actually works for any constructor; if you have data Tree a = Tree a [Tree a], then t#(Tree _ kids) gives you access to both the tree and its children.
The # Symbol is used to both give a name to a parameter and match that parameter against a pattern that follows the #. It's not specific to lists and can also be used with other data structures.
This is useful if you want to "decompose" a parameter into it's parts while still needing the parameter as a whole somewhere in your function. One example where this is the case is the tails function from the standard library:
tails :: [a] -> [[a]]
tails [] = [[]]
tails xxs#(_:xs) = xxs : tails xs
I want to add that # works at all levels, meaning you can do this:
let a#(b#(Just c), Just d) = (Just 1, Just 2) in (a, b, c, d)
Which will then produce this: ((Just 1, Just 2), Just 1, 1, 2)
So basically it's a way for you to bind a pattern to a value. This also means that it works with any kind of pattern, not just lists, as demonstrated above. This is a very useful thing to know, as it means you can use it in many more cases.
In this case, a is the entire Maybe Tuple, b is just the first Just in the tuple, and c and d are the values contained in the first and second Just in the tuple respectively
To add to what the other people have said, they are called as-patterns (in ML the syntax uses the keyword "as"), and are described in the section of the Haskell Report on patterns.

Resources