With ANTLR 4.6, snapshot of 11/23/2016.
I have two rules which are each left-recursive. I expanded several of the alternatives to expose the left recursion. ANTLR4 handles this, because the left recursion is explicit. However, the two rules are also mutually left-recursive.
How do I resolve the mutual left recursion, and do so in a way that doesn't turn the rules into a total mess? Right now I have nice comments showing what was expanded, and I moved that material into primary2 and constant_primary2, which aren't involved in the mutual left recursion.
constant_primary
    : constant_primary2
    | primary '.' method_call_body
    | constant_primary '\'' '(' constant_expr ')'
    ;

primary
    : primary2
    | primary '.' method_call_body
    | constant_primary '\'' '(' expr ')'
    ;
One option is to switch to using my fork of ANTLR 4, which is available through Maven using the Group ID com.tunnelvisionlabs. This fork handles mutual left recursion while producing parse trees that match the form you actually wrote in the grammar.
Note that this feature is somewhat experimental. If you run into problems feel free to post an issue on the issue tracker for my fork.
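If switching to the fork isn't an option, a common workaround in stock ANTLR 4 is to fold the two mutually left-recursive rules into a single directly left-recursive rule and enforce the constant/non-constant distinction in a later semantic pass (e.g. in a listener). A rough sketch only, reusing the rule names from the question; the merged rule name and the folding of constant_expr into expr are assumptions to be checked against the real grammar:
primary_or_constant
    : primary2
    | constant_primary2
    | primary_or_constant '.' method_call_body
    | primary_or_constant '\'' '(' expr ')'   // constant_expr treated as an expr here; re-check constantness semantically
    ;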
Related
I am using the antlr4 C grammar as inspiration for my own grammar. I came across one thing I don't really get. Why are there lexer rules for datatypes when they are not used? For example the rule Double : 'double'; is never used, but the parser rule typeSpecifier:('double' | ... ); (other datatypes have been removed to simplify) is used in several places. Is there a reason why the parser rule typeSpecifier is not using the lexer rule Double?
All the grammars on that page are volunteer submissions and not part of ANTLR4. It's clearly a mistake, but the way lexer rules are matched, it won't make a difference in lexing. You can choose to implement either the explicit rule:
Double : 'double';
or the implicit one:
typeSpecifier
    : ('void'
    | 'char'
    | 'short'
    | 'int'
    | 'long'
    | 'float'
    | 'double'
    // ... remaining alternatives elided
    )
    ;
with no ill effects either way, even if you mix methods. In fact, if you take a more global look at that whole grammar, the author did the same thing with numerous other lexer rules, Register for example. It makes no difference in actual practice.
Bottom line? Choose whichever method you like and apply it consistently. My personal preference is toward brevity, so I like the implicit tokens as long as they are used in only one place in the grammar. As soon as a token might be used in two places, I prefer to make an explicit token out of it and update the two or more locations where it's used.
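For example (a hypothetical sketch; castTarget is an invented rule, not one from the published C grammar), once 'double' is needed in more than one parser rule I'd promote it to an explicit token and reference it everywhere:
Double : 'double';

typeSpecifier
    : 'void' | 'char' | 'short' | 'int' | 'long' | 'float' | Double
    ;

castTarget
    : 'int'
    | Double
    ;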
I'm designing a grammar for a markdown-based language, but without the context awareness.
For example I want to detect tokens like ## ##.
I found two different ways of designing rules for that and I'm not quite sure which way could be the best approach.
The first way: Defining more complex tokens and a simple rule.
fragment
HEAD
: '#'
;
fragment
HEADING_TEXT
: (~[#]|'\\#')+?
;
SUBHEADLINE
: HEAD HEAD HEADING_TEXT HEAD HEAD
;
subheadline
: SUBHEADLINE
;
Because HEAD and HEADING_TEXT are fragments, they never reach the parser on their own; only SUBHEADLINE does. I'm prototyping within IntelliJ and the parsing works well. The error messages show something like "missing SUBHEADLINE", which is great for the main application (I think I can easily turn those errors into human-readable ones).
The second approach: Much simpler tokens and more complex rules for the parser.
HEAD
: '#'
;
HEADING_TEXT
: (~[#]|'\\#')+?
;
subheadline
: HEAD HEAD HEADING_TEXT HEAD HEAD
;
Works fine, too. The errors are more specific, which may make them harder to transform into human-readable messages.
But overall I'm not sure which approach I should follow, and why. The more complex tokens are easier to write in this case, because there won't be any complex rules like the ones normal programming languages contain. But it doesn't feel like the correct way of doing it.
Both ways have their own behavior, and what you need determines which one to use. Defining the subheadline in the lexer the way you did does not allow skipped/hidden tokens between the '#' characters, which is probably what you intend. Doing it in the parser instead allows input like # /*a comment*/ headline##, which is probably not the intended behavior. Also, I would combine things that strictly belong together into one rule. For instance, HEADING_TEXT in your second variant may match input that you want matched in a different way. Instead, define the subheading exactly as the language dictates:
SUBHEADING: '##' .*? '##';
This is even more concise than your simpler variant, while still not allowing skipped input between the markers.
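If you also want the escaped '\#' from your original HEADING_TEXT to be allowed inside the markers, a sketch along the same lines (untested, and assuming '\#' is your escape sequence) would be:
SUBHEADING : '##' ( '\\#' | ~[#] )+ '##' ;
Because '#' is excluded from the loop, the greedy '+' stops at the closing markers on its own, so the non-greedy modifier isn't needed here.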
I have written a tokenizer and expression evaluator for a preprocessor language that I plan to use in later projects. I started thinking that maybe I should describe the language with EBNF (Extended Backus–Naur Form) to keep the syntax more maintainable, or even use it to generate later versions of the parser.
My first impression was that EBNF is used for the tokenizing process and syntax validation. Later I discovered that it can also be used to describe operator precedence, as in this post or in the Wikipedia article:
expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary
I can see how that allows a generator to produce code with operator precedence built in, but is this really how precedence should be expressed? Isn't operator precedence more about semantics, and EBNF about syntax? If I decide to write a description of my language in EBNF, should I write it with operator precedence taken into account, or document that in a separate section?
I did something similar for my college degree.
I suggest you do NOT use the operator-precedence feature, even if it looks like easier "syntactic sugar".
Why? Because most languages described by EBNF use a lot of operators with different features, and those are easier to describe and update with EBNF expressions than with operator-precedence declarations.
Some operators are unary prefix, some unary postfix, some binary (a.k.a. "infix"); some binary operators are evaluated from left to right, and some from right to left. Some symbols are operators in one context and other kinds of tokens in another, for example "+" and "-", which can be binary operators ("x - y"), unary prefix operators ("x - -y"), or part of a literal ("x + -5").
In my experience it's safer to describe them with EBNF expressions, unless the language you are describing is very small, with very few and similar operators (for example: all binary, or all unary prefix).
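As a concrete illustration of that layering (a sketch only; the unary minus and the right-associative '^' operator are invented for the example, not part of the grammar quoted above), prefix and right-to-left operators fit into the same style of EBNF rules as the left-to-right ones:
additive-expression       ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= unary-expression ( ( '*' | '/' ) unary-expression ) *
unary-expression          ::= '-' unary-expression | power-expression
power-expression          ::= primary [ '^' unary-expression ]      (* right-associative: recurses on the right *)
primary                   ::= '(' additive-expression ')' | NUMBER | VARIABLE
Each operator's associativity and precedence is visible directly in the shape of the rules, which is exactly the property the sample grammar in the question relies on.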
Just my 2 cents.
In the Alloy grammar spec on the Alloy Web site, I find myself confused by the use of square brackets.
In a production like the following, things seem clear.
specification ::= [module] open* paragraph*
I guess the square brackets indicate optionality and that the asterisks are Kleene closures, so that the rule just quoted means a specification consists of at most one module statement, zero or more open clauses, and zero or more paragraphs. This makes sense to me (though I am gradually coming to use Wirth's EBNF notation wherever possible, so my notes show this as [module] {open} {paragraph}).
In the following production, though, the brackets are confusing me.
cmdDecl ::= [name ":"] ["run"|"check"] [name|block] scope
It would surprise me a great deal if the keywords run and check were optional in commands, and ditto for the name of the predicate to be run, the name of the assertion to be checked, or the anonymous block to be run or checked. But that's what this rule appears to say.
So question 1: what do square brackets indicate in the grammar?
Question 2: is the use of square brackets where some readers might expect parentheses a typo? I.e. should this rule instead take the following form?
cmdDecl ::= [name ":"] ("run"|"check") (name|block) scope
Maybe I'm just not familiar enough with the variety of grammatical notations to be found in the wild; perhaps it would be helpful to indicate the tool, or point to a description of the notation.
Question 3: is this notation used by some parser generation tool? Which?
question 1: what do square brackets indicate in the grammar?
You rightly pointed out that the use of square brackets is inconsistent in the grammar you referred to. I think that grammar was copied from the first edition of the "Software Abstractions" book; I'm not sure if the second edition of the book contains the same grammar.
Question 2: is the use of square brackets where some readers might expect parentheses a typo?
Exactly right.
Question 3: is this notation used by some parser generation tool? Which?
It is not. The Alloy Analyzer uses a grammar written in Cup. The .lex and .cup files (Alloy.lex and Alloy.cup) are included in the Alloy distribution jar file (located in "edu/mit/csail/sdg/alloy4compiler/parser/").
Thanks, Michael. The production for cmdDecl was indeed wrong in the book, so I've posted an erratum. Aleks has also updated the grammar on the Alloy website, which had a couple of other errors.
I'm reading the book Formal Syntax and Semantics of Programming Languages. I don't understand this exercise:
Consider the following two grammars, each of which generates strings of
correctly balanced parentheses and brackets. Determine if either or both
is ambiguous. The Greek letter ε represents an empty string.
<string> ::= <string> <string> | ( <string> ) | [ <string> ] | ε
<string> ::= ( <string> ) <string> | [ <string> ] <string> | ε
The first is ambiguous and the second is not. This is a question about how a context-free grammar (CFG) can be turned into a parse tree. In the first CFG, the first production is the source of the ambiguity. If I write the string "()()()" it is unclear which part of this string could match the left non-terminal and which could match the right non-terminal.
One valid parse tree has the first two characters "()" matching the left non-terminal (via the second production) and the rest of the string "()()" matching the right non-terminal, which in turn matches the first production again.
Another valid parse tree has the first four characters "()()" matching the left non-terminal and the rest "()" matching the right non-terminal. Both are equally valid, so there is an ambiguity. LR parser generators typically report this kind of ambiguity as a shift/reduce conflict.
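To make the two parse trees concrete, here are the corresponding leftmost derivations of "()()()" under the first grammar; they differ in where the outermost <string> <string> split falls:
First tree  ("()" then "()()"):
<string> => <string> <string>
         => ( <string> ) <string>
         => ( ) <string>
         => ( ) <string> <string>
         => ( ) ( <string> ) <string>
         => ( ) ( ) <string>
         => ( ) ( ) ( <string> )
         => ( ) ( ) ( )
Second tree ("()()" then "()"):
<string> => <string> <string>
         => <string> <string> <string>
         => ( <string> ) <string> <string>
         => ( ) <string> <string>
         => ( ) ( <string> ) <string>
         => ( ) ( ) <string>
         => ( ) ( ) ( <string> )
         => ( ) ( ) ( )
Two distinct leftmost derivations of the same string is exactly what it means for the grammar to be ambiguous.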
This is no problem at all if you just want to check whether a string belongs to the language: if any parse works, you're good. It becomes really problematic, however, if you're trying to create a parse tree to use as, for example, an abstract syntax tree for a programming language.
To show why this is a problem for parsing a language, take a look at this example.
<expression> ::= NUMBER | <expression> + <expression> | <expression> * <expression>
How do you parse "1+2*3"? Is it "(1+2)*3" or "1+(2*3)"? The grammar I gave has a shift/reduce conflict, so the choice is not specified. Most LR parser tools will resolve this conflict for you automatically and arbitrarily. This is dangerous: if I'm writing a programming language, there should be a well-defined understanding of which parse the programmer will get. Since this is a typical arithmetic expression, we should probably follow the math convention and have the answer be "1+(2*3)".
The solution is to rewrite the grammar so that it's unambiguous; alternatively, many parser tools let you explicitly specify the associativity and precedence of your operators, which is very convenient for keeping your grammar nice and readable.
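For the toy grammar above, a hedged sketch of such a rewrite (the <term> non-terminal is introduced just for this example) that bakes in the usual precedence and left associativity is:
<expression> ::= <expression> + <term> | <term>
<term>       ::= <term> * NUMBER | NUMBER
With this version "1+2*3" has exactly one parse, 1+(2*3), and a chain like "1+2+3" groups as (1+2)+3.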