I have an antlr4 grammar that takes a very long time to parse long expressions like f()()()()...()()() due to superlinear runtime of the predictor. The expression rule also has the usual set of arithmetic-like operators with different precedences. I can cut runtime down to a reasonable level by switching to SLL mode. My problem is, a significant fraction of input programs will have parse errors. I have to retry with LL to distinguish real errors from spurious errors generated by the less powerful parser. That means average performance will still be poor.
How can I tell if a grammar is safe to parse in SLL mode, without retrying in LL?
Are there any general tricks to rewriting a grammar so long expressions can be parsed less slowly?
Related
In the yacc.py file I defined the output of a grammar and also an error, like this:
def p_error(p):
if p:
print("Error when trying to read the symbol '%s' (Token type: %s)" % (p.value, p.type))
else:
print("Syntax error at EOF")
exit()
In addition to this error message, I also want to print what was the production read at the time of the error, something like:
print("Error in production: ifstat -> IF LPAREN expression RPAREN statement elsestat")
How can I do this?
Really, you can't. You particularly can't with a bottom-up parser like the one generated by Ply, but even with top-down parsers "the production read at the time" is not a very well-defined concept.
For example, consider the erroneous code:
if (x < y return 42;
in which the error is a missing parentheses. At least, that's how I'd describe the error. But a closing parenthesis is not the only thing which could follow the 0. For example, a correct program might include any of the following:
if (x < y) return 42;
if (x < y + 10) return 42;
if (x < y && give_up_early) return 42;
and many more.
So which production is the parser trying to complete when it sees the token return? Evidently, it's still trying to complete expression (which might actually have a hierarchy of different expression types, or which might be relying on precedence declarations to be a single non-terminal, or some combination of the two.) But that doesn't really help identify the error as a missing close parenthesis.
In a top-down parser, it would be possible to walk up the parser stack to get a list of partially-completed productions in inclusion order. (At least, that would be possible if the parser maintained its own stack. If it were a recursive-descent parser, examining the stack would be more complicated.)
But in a bottom-up parser, the parser state is more complicated. Bottom-up parsers are more flexible than top-down parsers precisely because they can, in effect, consider multiple productions at the same time. So there often really isn't one single partial production; the parser will decide which production it is looking at by gradually eliminating all the possibilities which don't work.
That description makes it sound like the bottom-up parser is doing a lot of work, which is misleading. The work was already done by the parser generator, which compiles a simple state transition table to guide the parse. What that means in practice is that the parser knows how to handle every possibly-correct token at each moment in the parse. So, for example, when it sees a ) following if (x < y, it immediately knows that it must finish up the expression production and proceed with the rest of the if statement.
Bison -- a C implementation of yacc -- has an optional feature which allows it to list the possible correct tokens when an error is encountered. That's not as simple as it sounds, and implementing it correctly creates a noticeable overhead in parsing time, but it is sometimes useful. (It's often not useful, though, because the list of possible tokens can be very long. In the case of the error I'm using as an example, the list would include every single arithmetic operator, as well as those tokens which could start a postfix operator. The bison extended error handler stops trying when it reaches the sixth possible token, which means that it will rarely generated an extended error message if the parse is in the middle of an expression.) In any case, Ply does not have such a feature.
Ply, like bison, does implement error recovery through the error pseudo-token. The error-recovery algorithm works best with languages which have an unambiguous resynchronisation point, as in languages with a definite statement terminator (unlike C-like languages, in which many statements do not end with ;). But you can use error productions to force the parser to pop its stack back to some containing production in order to produce a better error message. In my experience, a lot of experimentation is needed to get this strategy right.
In short, producing meaningful error messages is hard. My recommendation is to first focus on getting your parser working on correct inputs.
Given the following expression
x = a + 3 + b * 5
I would like to write that in the following data structure, where I'm only interested to capture the variables used on the RHS and keep the string intact. Not interesting in parsing a more specific structure since I'm doing a transformation from language to language, and not handling the evaluation
Variable "x" (Expr ["a","b"] "a + 3 + b * 5")
I've been using this tutorial as my starting point, but I'm not sure how to write an expression parser without buildExpressionParser. That doesn't seem to be the way I should approach this.
I am not sure why you want to avoid buildExpressionParser, as it hides a lot of the complexity in parsing expressions with infix operators. It is the right way to do things....
Sorry about that, but now that I got that nag out of the way, I can answer your question.
First, here is some background-
The main reason writing a parser for expressions with infix operators is hard is because of operator precedence. You want to make sure that this
x+y*z
parses as this
+
/ \
x *
/\
y z
and not this
*
/ \
+ z
/ \
x y
Choosing the correct parsetree isn't a very hard problem to solve.... But if you aren't paying attention, you can write some really bad code. Why? Performance....
The number of possible parsetrees, ignoring precedence, grows exponentially with the size of the input. For instance, if you write code to try all possibilities then throw away all but the ones with the proper precedence, you will have a nasty surprise when your parser tackles anything in the real world (remember, exponential complexity often ain't just slow, it is basically not a solution at all.... You may find that you are waiting half an hour for a simple parse, no one will use that parser).
I won't repeat the details of the "proper" solution here (a google search will give the details), except to note that the proper solution runs at O(n) with the size of the input, and that buildExpressionParser hides all the complexity of writing such a parser for you.
So, back to your original question....
Do you need to use buildExpressionParser to get the variables out of the RHS, or is there a better way?
You don't need it....
Since all you care about is getting the variables used in the right side, you don't care about operator precedence. You can just make everything left associative and write a simple O(n) parser. The parsetrees will be wrong, but who cares? You will still get the same variables out. You don't even need a context free grammar for this, this regular expression basically does it
<variable>(<operator><variable>)*
(where <variable> and <operator> are defined in the obvious way).
However....
I wouldn't recommend this, because, as simple as it is, it still will be more work than using buildExpressionParser. And it will be trickier to extend (like adding parenthesis). But most important, later on, you may accidentally use it somewhere where you do need a full parsetree, and be confused for a while why the operator precedence is so completely messed up.
Another solution is, you could rewrite your grammar to remove the ambiguity (again, google will tell you how).... This would be good as a learning exercise, but you basically would be repeating what buildExpressionParser is doing internally.
When writing a parser in a parser combinator library like Haskell's Parsec, you usually have 2 choices:
Write a lexer to split your String input into tokens, then perform parsing on [Token]
Directly write parser combinators on String
The first method often seems to make sense given that many parsing inputs can be understood as tokens separated by whitespace.
In other places, I have seen people recommend against tokenizing (or scanning or lexing, how some call it), with simplicity being quoted as the main reason.
What are general trade-offs between lexing and not doing it?
The most important difference is that lexing will translate your input domain.
A nice result of this is that
You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers in all places where whitespace is allowed to be, which is easy to forget and it clutters your code if whitespace must separate all your tokens anyway.
You can think about your input in a piece-by-piece manner, which is easy for humans.
However, if you do perform lexing, you get the problems that
You cannot use common parsers on String anymore - e.g. for parsing a number with a library Function parseFloat :: Parsec String s Float (that operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on it, inspecting the parse result (usually Either ErrorMessage a). This is messy to write and limits composability.
You have adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You'll have to manually map error locations back to the input string, which is tedious (in Parsec this means to adjust all SourcePos values).
Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that there is expected whitespace missing after the hello, while a lexer will just claim to have found an "invalid token".
Many things that one would expect to be atomic units and to be separated by a lexer are actually pretty "too hard" for a lexer to identify. Take for example String literals - suddenly "hello world" are not two tokens "hello and world" anymore (but only, of course, if quotes are not escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer.
You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, export it and the rest of the world can use it; they cannot run your (specialized) tokenizer first.
You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions, fixing things that you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point, you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible due to the lexer interpreting whitespace as token separator. Using a parser-only approach, you can stay more flexible, making changes to your language easier. This is not really surprising since a parser is a more complex tool that inherently encodes rules.
To summarize, I would recommend writing lexer-free parsers for most cases.
In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.
* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitve (monadic parsers like Parsec can express context-sensitiveness).
I'm having trouble articulating the difference between Chomsky type 2 (context free languages) and Chomsky type 3 (Regular languages).
Can someone out there give me an answer in plain English? I'm having trouble understanding the whole hierarchy thing.
A Type II grammar is a Type III grammar with a stack
A Type II grammar is basically a Type III grammar with nesting.
Type III grammar (Regular):
Use Case - CSV (Comma Separated Values)
Characteristics:
can be read with a using a FSM (Finite State Machine)
requires no intermediate storage
can be read with Regular Expressions
usually expressed using a 1D or 2D data structure
is flat, meaning no nesting or recursive properties
Ex:
this,is,,"an "" example",\r\n
"of, a",type,"III\n",grammar\r\n
As long as you can figure out all of the rules and edge cases for the above text you can parse CSV.
Type II grammar (Context Free):
Use Case - HTML (Hyper Text Markup Language) or SGML in general
Characteristics:
can be read using a DPDA (Deterministic Pushdown Automata)
will require a stack for intermediate storage
may be expressed as an AST (Abstract Syntax Tree)
may contain nesting and/or recursive properties
HTML could be expressed as a regular grammar:
<h1>Useless Example</h1>
<p>Some stuff written here</p>
<p>Isn't this fun</p>
But it's try parsing this using a FSM:
<body>
<div id=titlebar>
<h1>XHTML 1.0</h1>
<h2>W3C's failed attempt to enforce HTML as a context-free language</h2>
</div>
<p>Back when the web was still pretty boring, the W3C attempted to standardize away the quirkiness of HTML by introducing a strict specification</p
<p>Unfortunately, everybody ignored it.</p>
</body>
See the difference? Imagine you were writing a parser, you could start on an open tag and finish on a closing tag but what happens when you encounter a second opening tag before reaching the closing tag?
It's simple, you push the first opening tag onto a stack and start parsing the second tag. Repeat this process for as many levels of nesting that exist and if the syntax is well-structured, the stack can be un-rolled one layer at a time in the opposite level that it was built
Due to the strict nature of 'pure' context-free languages, they're relatively rare unless they're generated by a program. JSON, is a prime example.
The benefit of context-free languages is that, while very expressive, they're still relatively simple to parse.
But wait, didn't I just say HTML is context-free. Yep, if it is well-formed (ie XHTML).
While XHTML may be considered context-free, the looser-defined HTML would actually considered Type I (Ie Context Sensitive). The reason being, when the parser reaches poorly structured code it actually makes decisions about how to interpret the code based on the surrounding context. For example if an element is missing its closing tags, it would need to determine where that element exists in the hierarchy before it can decide where the closing tag should be placed.
Other features that could make a context-free language context-sensitive include, templates, imports, preprocessors, macros, etc.
In short, context-sensitive languages look a lot like context-free languages but the elements of a context-sensitive languages may be interpreted in different ways depending on the program state.
Disclaimer: I am not formally trained in CompSci so this answer may contain errors or assumptions. If you asked me the difference between a terminal and a non-terminal you'll earn yourself a blank stare. I learned this much by actually building a Type III (Regular) parser and by reading extensively about the rest.
The wikipedia page has a good picture and bullet points.
Roughly, the underlying machine that can describe a regular language does not need memory. It runs as a statemachine (DFA/NFA) on the input. Regular languages can also be expressed with regular expressions.
A language with the "next" level of complexity added to it is a context free language. The underlying machine describing this kind of language will need some memory to be able to represent the languages that are context free and not regular. Note that adding memory to your machine makes it a little more powerful, so it can still express languages (e.g. regular languages) that didn't need the memory to begin with. The underlying machine is typically a push-down automaton.
Type 3 grammars consist of a series of states. They cannot express embedding. For example, a Type 3 grammar cannot require matching parentheses because it has no way to show that the parentheses should be "wrapped around" their contents. This is because, as Derek points out, a Type 3 grammar does not "remember" anything about the previous states that it passed through to get to the current state.
Type 2 grammars consist of a set of "productions" (you can think of them as patterns) that can have other productions embedded within them. Thus, they are recursively defined. A production can only be defined in terms of what it contains, and cannot "see" outside of itself; this is what makes the grammar context-free.
I spent a week or two programming a simple logic solver. Having built it, I found myself wondering whether the language it solves is Turing-complete or not. So I coded up a small set of equations which accept any valid expression in the SKI combinator calculus, and produce a result set which contains the normal form of that expression. Since SKI is Turing-complete, proving that my language can execute SKI would demonstrate its Turing-completeness.
There is a glitch, however. The solver does not reduce the expression in normal order. Actually what it does is to try every possible reduction order. Which means that the solution set is typically huge. If a normal form exists, it will be in there somewhere, but it's difficult to tell where.
This brings me to two questions:
Is my language Turing-complete? Or do I need to find a better proof?
Is the number of solutions a computable function of the input?
(At first I assumed that the size of the solution set was exponential or factorial in the input size. But on closer inspection, this is not true. You can write huge expressions which are already in normal form, and tiny expressions which do not terminate. I have a feeling that determining the size of the solution set might be equivilent to solving the Halting Problem, but I'm not completely sure...)
A) As augustss says, your system is clearly turing complete.
B) You're right that determining solution size is the same as the halting problem. If a sequence doesn't terminate, then you get an infinite solution set. So to determine if the set is infinite, you need to determine if a reduction sequence terminates. But that's precisely the halting problem!
C) As I recall, a system that, given a set of instructions to a turing machine, merely says how many steps they take to terminate (which is, I suppose, the cardinality of your solution set) or fails to terminate if the instructions themselves fail to terminate is in itself turing complete. So that should help with the intuition here.
In answer to my own question... I found that by tweaking the source code, I can make it so that if the input SKI expression has a normal form, that normal form will always be solution #1. So if you just ignore any further solutions, the program reduces any SKI expression to normal form.
I believe this constitutes a "better proof". ;-)