Parsec: Consume all input (Haskell)

One common problem I have with Parsec is that it tends to ignore invalid input if it occurs in the "right" place.
As a concrete example, suppose we have integer :: Parser Int, and I write
expression = sepBy integer (char '+')
(Ignore whitespace issues for a moment.)
This correctly parses something like "123+456+789". However, if I feed it "123+456-789", it merrily ignores the illegal "-" character and the trailing part of the expression; I actually wanted an error message telling me about the invalid input, not just having it silently ignore that part.
I understand why this happens; what I'm not sure about is how to fix it. What is the general method for designing parsers that consume all supplied input and succeed only if all of it is a valid expression?

It's actually pretty simple - just ensure the parser is followed by eof:
parse (expression <* eof) "<interactive>" "123+456-789"
eof matches the end of the input, even if the input is just a string and not a file.
Obviously, this only makes sense at the top level of your parser.
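For reference, here is a complete, minimal sketch of the fix (the integer parser below is an assumption - just a bare run of digits, with no sign or whitespace handling):

import Text.Parsec
import Text.Parsec.String (Parser)

-- assumed definition of the question's integer parser
integer :: Parser Int
integer = read <$> many1 digit

expression :: Parser [Int]
expression = sepBy integer (char '+')

main :: IO ()
main = do
  print $ parse (expression <* eof) "<interactive>" "123+456+789"
  -- Right [123,456,789]
  print $ parse (expression <* eof) "<interactive>" "123+456-789"
  -- Left: a parse error pointing at the unexpected '-'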

Related

Can I get escape characters to still behave as such for a string provided by f:read() in Lua?

I'm working on a simple localization function for my scripts and, although it's starting to work quite well so far, I don't know how to keep escape/special characters from being shown in the UI as part of the text after feeding the widgets with the strings returned by f:read().
For example, if a certain line of Strings.ES.txt contains: Ignorar \"Etiquetas de capa\", I'd expect the backslashes not to end up showing, just like when I feed the widget with a normal string between double quotes such as: "Ignorar \"Etiquetas de capa\"", or at least to have a way to avoid it. I've been trial-and-erroring with the tostring() and load() functions and different (surely nonsensical 🙄) concatenations like: load(tostring("[[" .. f:read()" .. ]]")) and such, without any success, so here I am again...
Does anyone know if there is a way to make escape characters in a string returned by f:read() behave as specially as they do in a regular string literal?
I don't know how to keep escape/special characters from being shown in the UI as part of the text
What you want is to "unescape" or "unquote" a string to interpret escape sequences as if it were parsed as a quoted string by Lua.
[...] with the strings returned by f:read() [...]
The fact that this string was obtained using f:read() can be ignored; all that matters is that it is a string literal without quotes using quoted string escapes.
I've been trial-and-erroring with tostring() and load() functions and different [...] concatenations like: load(tostring("[[" .. f:read()" .. ]]")) and such without any success [...]
This is almost how to do it, except you chose the wrong string literal type: "long" strings using pairs of square brackets ([ and ]) do not interpret escape sequences at all. They are intended for including long, raw, possibly multiline strings in Lua programs, and often come in handy when you need to represent literal strings with backslashes (e.g. regular expressions - not to be confused with Lua patterns, which use % for escapes and lack the alternation operator of regular expressions).
If you instead use single or double quotes to wrap the string, it will work fine:
-- wrap the escaped contents in double quotes and let Lua evaluate it as a string literal
-- (on Lua 5.1, use loadstring instead of load)
local function unescape_string(escaped)
    return assert(load(('return "%s"'):format(escaped)))()
end
This will produce a tiny Lua program (a "chunk") for each string, which just consists of return "<contents>". Recall that Lua chunks are just functions. Thus you can simply call the function to obtain the value of the string it returns. That way, Lua will interpret the escape sequences for us. The same approach is often used to read data serialized as Lua code.
Note also the use of assert for error handling: load returns nil, err if there is a syntax error. To deal with this gracefully, we can wrap the call to load in assert: assert returns its first argument (the chunk returned by load) if it is truthy; otherwise, if it is falsy (e.g. nil in this case), assert errors, using its second argument as an error message. If you omit the assert and your input causes a syntax error, you will instead get a cryptic "attempt to call a nil value" error.
You probably want to do additional validation, especially if these escaped strings are user-provided - otherwise a malicious string like " .. os.execute("...") .. " can trivially turn this into a remote code execution (RCE) vulnerability, allowing an attacker to execute arbitrary Lua - e.g. to block (while 1 do end), slow down, or hijack your application - as well as shell commands through os.execute. To guard against this, searching for an unescaped closing quote should be sufficient (syntax errors, e.g. through invalid escapes, will still be possible, but RCE should not be, barring Lua interpreter bugs):
local function unescape_string(escaped)
    -- match start & end of each run of zero or more backslashes followed by a double quote
    for from, to in escaped:gmatch'()\\*()"' do
        -- the number of preceding backslashes must be odd for the double quote to be escaped
        assert((to - from) % 2 ~= 0, "unescaped double quote")
    end
    return assert(load(('return "%s"'):format(escaped)))()
end
Alternatively, a more robust (but also more complex) and presumably more efficient way of unescaping would be to implement the escape sequences manually through string.gsub; that way you get full control, which is more suitable for user-provided input:
-- Single-character backslash escapes of Lua 5.1 according to the reference manual: https://www.lua.org/manual/5.1/manual.html#2.1
local escapes = {a = '\a', b = '\b', f = '\f', n = '\n', r = '\r', t = '\t', v = '\v', ['\\'] = '\\', ["'"] = "'", ['"'] = '"'}
local function unescape_string(escaped)
    -- parentheses discard gsub's second return value (the match count)
    return (escaped:gsub("\\(.)", escapes))
end
You may implement the escapes here as you see fit; for example, this table misses decimal escapes, which could easily be implemented as escaped:gsub("\\(%d%d?%d?)", string.char) (this uses the coercion of strings to numbers in string.char and a replacement function as the second argument to string.gsub).
This function can finally be used straightforwardly as unescape_string(f:read()).

How do I know if a (mega)parsec parser may consume input without experimentation - appears not to be documented

When using the <|> combinator in a Parsec or Megaparsec parser, 'try' may be needed to force backtracking in case the first parser fails after having consumed input. In Parsec I need to use 'try' when parsing strings:
λ: parse (try (string "abc") <|> string "abd") "" "abd"
Right "abd"
Without the 'try', the parse fails as the first parser consumes the 'a', leaving just 'bd' for the second parser which then naturally fails as well.
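For reference, the failing case looks like this (exact error text elided):
λ: parse (string "abc" <|> string "abd") "" "abd"
Left ...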
In Megaparsec, the 'try' is not needed:
λ: parse (string "abc" <|> string "abd") "" "abd"
Right "abd"
and so somehow in Megaparsec the string parser helpfully does not consume input when it fails.
My questions are:
How might I have found out, other than by experimenting, that the behavior of string differs between Parsec and Megaparsec? I have not seen it documented.
How can I easily (that is, without experimentation) tell in general whether a parser consumes input if it fails?
Thank you.

SML Pattern Matching on char lists

I'm trying to pattern match on char lists in SML. I pass in a char list generated from a string as an argument to the helper function, but I get an error saying "non-constructor applied to argument in pattern". The error goes away if instead of
#"a"::#"b"::#"c"::#"d"::_::nil
I use:
#"a"::_::nil.
Any explanation regarding why this happens would be much appreciated, as would workarounds, if any. I'm guessing I could use the substring function to check for this specific substring in the original string, but I find pattern matching intriguing and wanted to take a shot. Also, I need specific information located somewhere later in the char list, and I was wondering if my pattern could be:
#"some useless characters"::#"list of characters I want"::#"newline character"
I checked out How to do pattern matching on string in SML? but it didn't help.
fun somefunction(#"a"::#"b"::#"c"::#"d"::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
If you add parentheses around the characters the problem goes away:
fun somefunction((#"a")::(#"b")::(#"c")::(#"d")::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
Then somefunction (explode "abcde") prints true and somefunction (explode "abcdef") prints false.
I'm not quite sure why the SML parser had difficulty parsing the original definition. The error message suggests that it was interpreting # as a function applied to strings. The problem doesn't arise only in pattern matching: SML also has difficulty with an expression like #"a"::#"b"::[]. At first it looks like a precedence problem (between # and ::), but that isn't the issue, since #"a"::explode "bc" works as expected (matching your observation of how your definition worked when only one # appeared). I suspect the problem traces back to the fact that characters were added to the language in SML '97; the earlier SML '90 viewed characters as strings of length 1. Perhaps there is some sort of behind-the-scenes kludge in the way the symbol #, as part of character literals, was grafted onto the language.

attoparsec: succeeding on part of the input instead of failing

I have an attoparsec parser, and tests for it. What annoys me is that if I comment out part of the parser and run the tests, the parser doesn't return Left "parse error at line ..." but instead I get Right [].
Note that I'm using parseOnly to make it clear that there'll be no more input.
Otherwise it's nice to get the partially parsed input; it can definitely be useful and I'm glad to have it. However, I'd like to be informed that the whole input was not consumed - maybe by getting the character offset of the last consumed character, or, if that's what it takes, at least an option to be returned a Left.
If it's relevant, the parser can be found there.
If I comment for instance the line:
<|> PlainText <$> choice (string <$> ["[", "]", "*", "`"])
And run the tests, I get for instance:
1) notes parsing tests parses notes properly
simple test
expected: Right [NormalLine [PlainText "one line* # hello world"]]
but got: Right []
This is from that test.
Depending on whether consuming the whole input should be a property of parseNoteDocument itself or just of the tests, I'd extend one or the other with endOfInput or atEnd.
I'd suggest defining a proper Parser for your documents, like
parseNoteDocument' :: Parser NoteDocument
parseNoteDocument' = many parseLine
and then define parseNoteDocument in terms of it. Then you can use parseNoteDocument' in the tests by defining a helper that parses a given piece of text using
parseNoteDocument' <* endOfInput
to ensure that the whole input is consumed.
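As a self-contained sketch of that split (the NoteDocument type and parseLine below are simplified stand-ins for the ones in the linked parser, not the actual definitions):

import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, parseOnly, endOfInput, endOfLine, takeWhile1, many')
import Data.Text (Text)

-- hypothetical stand-ins for the real document type and line parser
newtype NoteDocument = NoteDocument [Text] deriving Show

parseLine :: Parser Text
parseLine = takeWhile1 (/= '\n') <* (endOfLine <|> endOfInput)

parseNoteDocument' :: Parser NoteDocument
parseNoteDocument' = NoteDocument <$> many' parseLine

-- lenient entry point: may silently succeed on a prefix of the input
parseNoteDocument :: Text -> Either String NoteDocument
parseNoteDocument = parseOnly parseNoteDocument'

-- strict variant for the tests: fails unless all input is consumed
parseNoteDocumentStrict :: Text -> Either String NoteDocument
parseNoteDocumentStrict = parseOnly (parseNoteDocument' <* endOfInput)

With this split, an input the parser only partially understands turns the silent Right [] into a Left from the strict variant, while the lenient entry point keeps its old behaviour.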

Parser skips lines

I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow all following lines. For example, the Jade input of
.foo.bar
| Foo
| Bar
| Baz
when tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never with more than one line. What did I miss?
If Parsec cannot match the remaining input, it will stop parsing at that point and simply ignore that input. Here, the problem is that after having parsed a tag, you don't consume the whitespace at the beginning of the line before the next tag, so Parsec cannot parse the remaining input and bails. (There might also be other issues; I can't test the code right now.)
There are many ways of adding something that consumes the spaces, but I am not familiar with Jade, so I cannot tell you which way is the "correct" one (I don't know how the indentation syntax works); just adding whiteSpace somewhere at the end of tag should do it, as in the sketch below.
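As a toy reconstruction of the failure mode (the item parser and the input are made up for illustration; this is not the asker's actual grammar):

import Text.Parsec
import Text.Parsec.String (Parser)

-- parses one "| Foo"-style line but leaves the trailing newline unconsumed
item :: Parser String
item = string "| " *> many1 (noneOf "\n")

-- stops after the first line: the next item fails on the '\n',
-- so many succeeds early and the rest of the input is ignored
broken :: Parser [String]
broken = many item

-- consuming trailing whitespace after each item repairs it
fixed :: Parser [String]
fixed = many (item <* spaces)

main :: IO ()
main = do
  parseTest broken "| Foo\n| Bar\n| Baz"  -- ["Foo"]
  parseTest fixed  "| Foo\n| Bar\n| Baz"  -- ["Foo","Bar","Baz"]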
By the way, you should consider splitting your parser into a lexer and a parser. The lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...], and the parser parses that token stream (yes, Parsec can parse lists of anything). I think that would make your job easier and less confusing.
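For illustration, a minimal sketch of what parsing such a token stream with Parsec can look like (the Token type and the matchTok helper are invented for this example; a real lexer would also attach source positions to its tokens rather than using the no-op position update below):

import Text.Parsec

-- a hypothetical token type such a lexer might produce
data Token
  = Ident String
  | OpenParen
  | CloseParen
  | Equals
  | StringLiteral String
  | Indentation Int
  deriving (Show, Eq)

-- lift a token-matching function into a Parsec parser over [Token];
-- the position update is a no-op here for brevity
matchTok :: (Token -> Maybe a) -> Parsec [Token] () a
matchTok = tokenPrim show (\pos _ _ -> pos)

ident :: Parsec [Token] () String
ident = matchTok $ \t -> case t of
  Ident s -> Just s
  _       -> Nothing

stringLit :: Parsec [Token] () String
stringLit = matchTok $ \t -> case t of
  StringLiteral s -> Just s
  _               -> Nothing

main :: IO ()
main = parseTest ((,) <$> ident <*> stringLit)
                 [Ident "bind", StringLiteral "longname"]
-- prints ("bind","longname")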
