The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule 'notExpression' with exactly the same definition. Yet these grammars produce different trees when parsing the input '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with the identifier, not with the AND expression), but I still need 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' <EOF>;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' <EOF>;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I kind of got an answer to the second part of my question, though it still does not quite satisfy me, because using this approach in a real, elaborate grammar is going to be ugly. As to the first part (the WHY), I still have no idea, so more answers are welcome.
Anyway, to fix the precedence in the presence of the referenced rule, the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
This produces a tree different from both of those shown in the original question, but at least NOT does acquire higher precedence.
Parse tree
The ANTLR book has the following sample code to resolve grammar ambiguities using semantic predicates:
// predicates/PredCppStat.g4
@parser::members {
Set<String> types = new HashSet<String>() {{add("T");}};
boolean istype() { return types.contains(getCurrentToken().getText());}
}
stat: decl ';' {System.out.println("decl "+$decl.text);}
| expr ';' {System.out.println("expr "+$expr.text);}
;
decl: ID ID
| {istype()}? ID '(' ID ')'
;
expr: INT
| ID
| {!istype()}? ID '(' expr ')'
;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
WS : [ \t\n\r]+ -> skip ;
Here, the predicate is the first element of its alternative, determining whether that alternative may be applied or not, and it uses getCurrentToken() to make its decision.
However, if we alter the grammar slightly, to use hierarchical names instead of simple ID, like this:
decl: ID ID
| {istype()}? hier_id '(' ID ')'
;
expr: INT
| ID
| {!istype()}? hier_id '(' expr ')'
;
hier_id : ID ('.' ID)* ;
Then the istype() predicate can no longer use getCurrentToken() to make its decision. It will need the entire chain of tokens in the hier_id to determine whether the chain is a type symbol or not.
That means we will need to do one of the following:
(1) Put the predicate after hier_id, and access its value from istype(). Is this possible? I tried it, and I got compiler errors in the generated code.
(2) Break up the grammar into sub-rules, and then place istype() after the hier_id tokens are consumed. But this would wreck the readability of the grammar, and I would not like to do it.
What is the best way to solve this problem? I am using antlr-4.6.
One solution is to make ID itself contain '.', thereby making hier_id a lexer token. In that case, the semantic predicate's call to getCurrentToken() will have access to the full chain of names.
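For concreteness, a minimal sketch of that lexer-level change, assuming the ID pattern from the book sample (the fragment name LETTERS is my invention):
// ID now matches the whole dotted chain, so inside istype()
// getCurrentToken().getText() returns e.g. "a.b.T".
ID : LETTERS ('.' LETTERS)* ;
fragment LETTERS : [a-zA-Z]+ ;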
Note that hier_id will subsume ID if it becomes a lexer token, and that comes at a cost: if the grammar has other references to plain ID (and I guess it will), then you have to add predicates in all those places to avoid false matches. This will slow down the parser.
So I guess the question, in its general sense (i.e. how can rules be restricted by predicates when the current-token information is not enough to make the decision), still needs to be answered by ANTLR4 experts.
I've got the following (heavily stripped down) Happy grammar
%token
'{' { Langle }
'}' { Rangle }
'..' { DotDot }
'::' { ColonColon }
'#' { At }
mut { Mut }
ident { Ident }
%%
pattern
: binding_mode ident at_pat { error "identifier pattern" }
| expr_path { error "constant expression" }
| expr_path '{' '..' '}' { error "struct pattern" }
binding_mode
: mut { }
| { }
at_pat
: '#' pat { }
| { }
expr_path
: expr_path '::' ident { }
| ident { }
This grammar has shift/reduce conflicts around identifiers in patterns. By default, Happy chooses to shift, but in this case that isn't what I want: it tries to shoehorn everything into constant expression even when it could be an identifier pattern.
I've read that precedence/associativity is the way to solve this sort of problem, but nothing I've added has been able to budge the grammar in the right direction (to be fair, I've mostly been taking shots in the dark).
Using some obvious tokenization, I would like to have:
x to yield identifier pattern
mut x to yield identifier pattern
std::pi to yield constant expression
point{..} to yield struct pattern
std::point{..} to yield struct pattern
Basically, unless there is a { or :: token waiting to be consumed, an identifier should go to the identifier pattern case.
I apologize if my question is unclear - part of the problem is that I'm having a tough time pinpointing what the problem even is. :(
First, it's important to understand what a shift is. A shift is the result of accepting the next input token, and putting it on the parser stack (where it will eventually be part of a production, but it is not necessary to know which one yet.) A reduction is taking zero or more tokens off the top of the stack which match the right-hand side of some production, and replacing them with the left-hand side.
When the parser decides to create an identifier pattern out of binding_mode ident at_pat where at_pat is empty, it is not shifting; it is reducing. In fact, it reduces twice: first it reduces zero stacked symbols into an empty at_pat and then it reduces the top three stack symbols into an identifier pattern. Had there been no binding_mode, it could have reduced ident to expr_path and then reduced the expr_path into a constant_expression. So that would be a reduce/reduce conflict.
But there is another issue, precisely because binding_mode is nullable. When the parser sees an ident, it does not know whether or not a binding_mode would be possible, so it doesn't know whether to reduce an empty binding_mode or to shift the ident. That is a shift/reduce conflict. Since it prefers shift to reduce, it chooses to shift the ident, which means that an empty binding_mode cannot be produced, which in turn precludes the reduce/reduce conflict (and prevents ident # pat from being recognized at all.)
So to untangle all of that, we need to start by avoiding the necessity to reduce an empty binding_mode. We do that by the usual nullable-production elimination algorithm, which involves making two copies of the right-hand side, one with the nullable nonterminal, and the other without; we then remove the nullable production. Once we do that, the reduce/reduce conflict appears.
To avoid the reduce/reduce conflict, we need to be explicit about which production is preferred. Reduce/reduce conflicts cannot be resolved through precedence declarations because the precedence algorithm always involves the comparison between a production (which could be reduced) and a terminal (which could be shifted). So the resolution must be explicit, and that means that we need to say that a bare ident is a pattern, while an expr_path which is not an ident is a constant expression. That leaves us with the following:
(Note that I've used non-terminals to label the three different productions for pattern, rather than relying on actions. For me that makes it easier to think about, and read.)
pattern: identifier_pattern | constant_expression | struct_pattern
Here is the null production elimination:
identifier_pattern: ident at_pat
| binding_mode ident at_pat
Here is the explicit prohibition on idents:
constant_expression: complex_expr_path
struct_pattern: expr_path '{' '..' '}'
binding_mode is no longer nullable:
binding_mode: mut
at_pat
: '#' pat
| %empty
Here we create the two different expr_paths:
complex_expr_path
: complex_expr_path '::' ident
| ident '::' ident
expr_path: ident | complex_expr_path
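As a quick hand-check against the examples from the question (derivations sketched by me, so verify against Happy's actual output):
x              -> identifier_pattern  (ident with empty at_pat)
mut x          -> identifier_pattern  (binding_mode ident with empty at_pat)
std::pi        -> constant_expression (ident '::' ident is a complex_expr_path)
point{..}      -> struct_pattern      (expr_path is a bare ident)
std::point{..} -> struct_pattern      (expr_path is a complex_expr_path)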
I hope that solution has some relationship to your original grammar.
I must define a rule which expresses the following statement: {x in y | x > 0}.
For the first part of that comprehension, "x in y", I have the subrule:
FIRSTPART: Name "in" Name
where Name can be anything.
My problem is that I do not want greedy behaviour, so it should parse until the "|" sign and then stop. Since I am new to ANTLR4, I do not know how to achieve that.
Normally, the lexer/parser rules should represent the allowable syntax of the source input stream.
The evaluation (and consequences) of how the source matches any rule or subrule is a matter of semantics -- whether the input matches a particular subrule and whether that should control how the rule is finally evaluated.
Normally, semantics are implemented as part of the tree-walker analysis. You can use alternative labels (#inExpr, etc.) to create easily distinguishable tree nodes for analysis purposes:
comprehension : LBrace expression property? RBrace ;
expression : ....
| Name In Name #inExpr
| Name BinOp Name #binExpr
| ....
;
property : Provided expression ;
BinOp : GT | LT | GTE | .... ;
Provided : '|' ;
In : 'in' ;
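For illustration, a hedged sketch of the analysis side, using the listener class names ANTLR would generate for a grammar named, say, Comp (all of these names are assumptions):
// Labeled alternatives (#inExpr, #binExpr) produce distinct context
// classes and listener callbacks, so each node kind is easy to analyze.
public class ComprehensionAnalyzer extends CompBaseListener {
    @Override
    public void exitInExpr(CompParser.InExprContext ctx) {
        // semantic handling of "x in y" membership expressions
    }
    @Override
    public void exitBinExpr(CompParser.BinExprContext ctx) {
        // semantic handling of binary comparisons like "x > 0"
    }
}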
I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so the identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get reduce/reduce conflicts if I include identifier in both term and boolean_expr. Is there any way to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably be something like:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the following two declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up a scanned identifier token in the symbol table and use the information found there to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
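On the scanner side, a hedged flex sketch of that lookup (symtab_lookup and the SYM_* codes are assumed helpers, not a real API):
[a-zA-Z_][a-zA-Z0-9_]*   {
                           /* Classify the identifier by its declared type. */
                           switch (symtab_lookup(yytext)) {
                             case SYM_INT:  return INTEGER_VARIABLE;
                             case SYM_BOOL: return BOOLEAN_VARIABLE;
                             default:       return UNDEFINED_VARIABLE;
                           }
                         }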
That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics are not only mixed up with the syntax, they have even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!my_x_object) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: sum | sum '<' sum ;
sum: product | sum '+' product ;
product: term | product '*' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
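For example, the first production might carry actions like these (NewNode and OP_OR are assumed AST helpers):
expr: conjunction           { $$ = $1; }
    | expr "or" conjunction { $$ = NewNode(OP_OR, $1, $3); }
    ;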
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}
I am parsing a C++-like declaration with this scaled-down grammar (many details removed to make it a fully working example). It fails to work, mysteriously (at least to me). Is it related to the use of a context-dependent predicate? If yes, what is the proper way to implement the "counting the number of child nodes" logic?
grammar CPPProcessor;
cppCompilationUnit : decl_specifier_seq? init_declarator* ';' EOF;
init_declarator: declarator initializer?;
declarator: identifier;
initializer: '=0';
decl_specifier_seq
locals [int cnt=0]
@init { $cnt=0; }
: decl_specifier+ ;
decl_specifier
@init { System.out.println($decl_specifier_seq::cnt); }
: 'const'
| {$decl_specifier_seq::cnt < 1}? type_specifier {$decl_specifier_seq::cnt += 1;} ;
type_specifier: identifier ;
identifier:IDENTIFIER;
CRLF: '\r'? '\n' -> channel(2);
WS: [ \t\f]+ -> channel(1);
IDENTIFIER:[_a-zA-Z] [0-9_a-zA-Z]* ;
I need to implement the standard C++ rule that no more than one type_specifier is allowed under a decl_specifier_seq.
A semantic predicate before type_specifier seems to be the solution, and the count is naturally declared as a local variable in decl_specifier_seq, since nested decl_specifier_seqs are possible.
But it seems that a context-dependent semantic predicate like the one I used, i.e. a semantic predicate that references $attributes, produces an incorrect parse. First, an input file with a correct result (to illustrate what a normal parse tree looks like):
int t=0;
and the parse tree:
But given an input without the '=0' to aid the parsing:
int t;
0
1
line 1:4 no viable alternative at input 't'
1
the parsing failed with the 'no viable alternative' error (the numbers printed on the console are debug prints of the $decl_specifier_seq::cnt value, as a verification of the test condition). That is, the semantic predicate cannot prevent t from being parsed as a type_specifier, so t is no longer considered an init_declarator. What is the problem here? Is it because the predicate is context-dependent, i.e. it references $decl_specifier_seq::cnt?
Does this mean a context-dependent predicate cannot be used to implement "counting the number of child nodes" logic?
EDIT
I tried new versions whose predicate uses a member variable instead of $decl_specifier_seq::cnt, and surprisingly the grammar now works, proving that the context-dependent predicate did cause the previous grammar to fail:
....
@parser::members {
public int cnt=0;
}
decl_specifier
@init { System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 }? type_specifier {cnt++;} ;
A normal parse tree results:
This raises the question of how to support nested rules if we must use member variables in place of local variables to avoid context-dependent predicates.
And a weird result: if I add a /*$ctx*/ inside the predicate, it fails again:
decl_specifier
@init { System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 /*$ctx*/ }? type_specifier {cnt++;} ;
line 1:4 no viable alternative at input 't'
The parsing fails with 'no viable alternative'. Why does /*$ctx*/ cause the parsing to fail, just as when $decl_specifier_seq::cnt is used, even though the actual logic uses only a member variable?
And without the /*$ctx*/, another issue appears, related to the predicate being called before the @init block (described here).
ANTLR 4 evaluates semantic predicates in two cases.
The generated code evaluates a semantic predicate during parsing, and throws an exception if the evaluation returns false. All predicates traversed during parsing are evaluated in this way, including context-dependent predicates and predicates which do not appear at the left edge of a decision.
The prediction method evaluates predicates in order to make correct decisions during parsing. In this case, predicates which appear anywhere other than the left edge of the decision being evaluated are assumed to return true (i.e. they are ignored). In addition, context-dependent predicates are only evaluated if the context data is available. The prediction algorithm will not create context structures that were not already provided by the parsing code. If a context-dependent predicate is encountered during prediction and no context is available, the predicate is assumed to return true (i.e. it is ignored for that decision).
The code generator does not evaluate the semantics of the target language, so it has no way to know that $ctx is semantically irrelevant when it appears in /*$ctx*/. Both cases result in the predicate being treated as context-dependent.
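To make that concrete, a hedged contrast based on the grammar above (the comments are mine):
decl_specifier
: 'const'
// Context-independent form: the predicate is evaluated during
// prediction and can veto this alternative up front.
| {cnt<1}? type_specifier {cnt++;}
;
// With the comment added, the tool sees the literal text $ctx and marks
// the predicate context-dependent; during prediction it is assumed true,
// so the alternative can be chosen even when cnt >= 1:
// | {cnt<1 /*$ctx*/}? type_specifier {cnt++;}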