I am writing a parser with bison/flex.
This is part of my code:
I want to implement the assignment production so that an identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any way to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably look something like this:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up a scanned identifier token in the symbol table and use that information to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even become intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
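To make the scanner-side lookup concrete, here is a minimal sketch in C of what the identifier rule in the scanner could call. All of the names (token_kind, symbol, declare_variable, classify_identifier) are hypothetical, the fixed-size array stands in for whatever symbol table you already have, and scoping is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds the scanner would return for an identifier. */
enum token_kind { UNDEFINED_VARIABLE, INTEGER_VARIABLE, BOOLEAN_VARIABLE };

/* Hypothetical symbol-table entry: a name and the token kind for its type. */
struct symbol {
    const char *name;
    enum token_kind kind;
};

/* A fixed-size array keeps the sketch short; a real table would be a hash map. */
#define MAX_SYMBOLS 64
static struct symbol symtab[MAX_SYMBOLS];
static size_t symtab_size = 0;

/* Would be called from the parser action for a declaration.
   The sketch stores the pointer, so `name` must outlive the table. */
static void declare_variable(const char *name, enum token_kind kind)
{
    assert(symtab_size < MAX_SYMBOLS);
    symtab[symtab_size].name = name;
    symtab[symtab_size].kind = kind;
    symtab_size++;
}

/* Would be called from the scanner action for an identifier: the result is
   the token type handed to the parser. Unknown names are "undefined". */
static enum token_kind classify_identifier(const char *name)
{
    for (size_t i = 0; i < symtab_size; i++) {
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].kind;
    }
    return UNDEFINED_VARIABLE;
}
```

Note that every new datatype needs a new enumerator here as well as new grammar rules, which is exactly the scalability problem described above.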
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!my_x_object) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: sum | sum '<' sum ;
sum: product | sum '+' product ;
product: term | product '*' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
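As a rough illustration of that AST walk, here is a minimal sketch in C. The node layout and the names (node, check_type, TYPE_ERROR and so on) are assumptions made for the example, not anything Bison provides, and identifiers are left out because they would need a symbol-table lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of type-checking a subtree. */
enum type { TYPE_INT, TYPE_BOOL, TYPE_ERROR };

/* Hypothetical node kinds, one per operator in the grammar above. */
enum node_kind {
    NODE_INT_CONST, NODE_BOOL_CONST,
    NODE_ADD, NODE_MUL, NODE_LESS, NODE_AND, NODE_OR
};

struct node {
    enum node_kind kind;
    struct node *left;   /* NULL for constants */
    struct node *right;  /* NULL for constants */
};

/* Walks the tree bottom-up and returns the type of the expression,
   or TYPE_ERROR if some operator is applied to the wrong operand types. */
static enum type check_type(const struct node *n)
{
    assert(n != NULL);
    switch (n->kind) {
    case NODE_INT_CONST:  return TYPE_INT;
    case NODE_BOOL_CONST: return TYPE_BOOL;
    default: break;
    }

    enum type l = check_type(n->left);
    enum type r = check_type(n->right);

    switch (n->kind) {
    case NODE_ADD:
    case NODE_MUL:   /* int op int -> int */
        return (l == TYPE_INT && r == TYPE_INT) ? TYPE_INT : TYPE_ERROR;
    case NODE_LESS:  /* int < int -> boolean */
        return (l == TYPE_INT && r == TYPE_INT) ? TYPE_BOOL : TYPE_ERROR;
    case NODE_AND:
    case NODE_OR:    /* boolean op boolean -> boolean */
        return (l == TYPE_BOOL && r == TYPE_BOOL) ? TYPE_BOOL : TYPE_ERROR;
    default:
        return TYPE_ERROR;
    }
}
```

The parser never needs to know about any of this; it only builds nodes, and a declaration like boolean x = (y) < 7 is accepted or rejected by the walk.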
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}
I'm planning on writing a Parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I thought about including comments into the AST so that I could implement a code formatter in the end.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a Expr Expr
and use a for whatever annotation (e.g. for comments that come after the expression).
However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now have to include at least 10 more parameters or records for annotations.
This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution. Or is writing a formatter just that messy?
You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.
The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule 'notExpression' with just the same definition. But these grammars produce different trees when parsing this: '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with the identifier, not with the AND expression), but I still need 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' <EOF>;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' <EOF>;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I found a partial answer to the second part of my question, although it does not quite satisfy me, because using this approach in a real, elaborate grammar is going to be ugly. As for the first part (why), I still have no idea, so more answers are welcome.
Anyway, to fix the precedence in the presence of a referenced rule, the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
This produces a tree different from both of those shown in the original question, but at least NOT does acquire higher precedence.
Parse tree
I'm making a simple calculator that prints postfix notation in order to learn Bison. I got the postfix part working, but now I need to handle assignments to variables (a-z). For example, a=3+2; should print a32+=;. I'm trying to modify my working postfix code so that it can read a char too.
If I understand correctly, to be able to put different types in $$ I need a union, and to make the nodes I should use a struct, because in my case 'expr' can be an int or a char.
This is my parser:
%code requires{
struct nPtr{
char *val;
int num;
};
}
%union {
int iValue;
char sIndex;
struct nPtr *e;
};
%token PLUS MINUS STAR LPAREN RPAREN NEWLINE DIV ID NUMBER POW EQL
%type <iValue> NUMBER
%type <sIndex> ID
%type <e> expr line
%left PLUS MINUS
%left STAR DIV
%left POW
%left EQL
%%
line : /* empty */
|line expr NEWLINE { printf("=\n%d\n", $2.num); }
expr : LPAREN expr RPAREN { $$.num = $2.num; }
| expr PLUS expr { $$.num = $1.num + $3.num; printf("+"); }
| expr MINUS expr { $$.num = $1.num - $3.num; printf("-"); }
| expr STAR expr { $$.num = $1.num * $3.num; printf("*"); }
| expr DIV expr { $$.num = $1.num / $3.num; printf("/");}
| expr POW expr { $$.num = pow($1.num, $3.num); printf("**");}
| NUMBER { $$.num = $1.num; printf("%d", yylval); }
| ID EQL expr { printf("%c", yylval); }
;
%%
I have this in the lex to handle the "=" and variables
"=" { return EQL; }
[a-z] { yylval.sIndex = strdup(yytext); return ID; }
I get an error:
warning empty rule for typed nonterminal and no action
The only answer I found here about that is this:
Bison warning: Empty rule for typed nonterminal
It says to just remove the /* empty */ part in:
line: /* empty */
| line expr NEWLINE { printf("=\n%d\n", $2.num); }
When I do that, I get 3 new warnings:
warning 3 nonterminal useless in grammar
warning 10 rules useless in grammar
fatal error: start symbol line does not derive any sentence
I googled and found some solutions that just gave me other problems. For example, when I change:
line:line expr NEWLINE { printf("=\n%d\n", $2.num); }
to
line:expr NEWLINE { printf("=\n%d\n", $2.num); }
Bison works, but when I try to run the code in Visual Studio I get a lot of errors like:
left of '.e' must have struct/union type
'pow': too few arguments for call
'=': cannot convert from 'int' to 'YYSTYPE'
That's as far as I got. I can't find a simple example similar to my needs. I just want 'expr' to be able to read a char and print it. If someone could check my code and recommend some changes, it would be really appreciated.
It is important to actually understand what you are asking Bison to do, and what it is telling you when it suggests that your instructions are wrong.
Applying a fix from someone else's problem will only work if they made the same mistake as you did. Just making random changes based on Google searching is neither a very structured way to debug nor to learn a new tool.
Now, what you said was:
%type <e> expr line
Which means that expr and line both produce a value whose union tag is e. (Tag e is a pointer to a struct nPtr. I doubt that's what you wanted either, but we'll start with the Bison warnings.)
If a non-terminal produces a value, it must produce a value in every production. To produce a value, the semantic rule needs to assign a value (of the correct tag-type) to $$. For convenience, if there is no semantic action and if $1 has the correct tag-type, then Bison will provide the default rule $$ = $1. (That sentence is not quite correct but it's a useful approximation.) Many people think that you should not actually rely on this default, but I think it's OK as long as you know that it is happening, and have verified that the preconditions are valid.
In your case, you have two productions for line. Neither of them assigns a value to $$. This might suggest that you didn't really intend for line to even have a semantic value, and if we look more closely, we'll see that you never attempt to use a semantic value for any instance of the non-terminal line. Bison does not, as far as I know, attempt such a detailed code examination, but it is capable of noting that the production:
line : /* empty */
has no semantic action, and
does not have any symbols on the right-hand side, so no default action can be constructed.
Consequently, it warns you that there is some circumstance in which line will not have its value set. Think of this as the same warning as your C compiler might give you if it notices that a variable you use has never had a value assigned to it.
In this case, as I said, Bison has not noticed that you never use the value of line. It just assumes that you wouldn't have declared the type of the value if you didn't intend to somehow provide a value. You did declare the type, and you never provide a value. So your instructions to Bison were not clear, right?
Now, the solution is not to remove the production. There is no problem with the production, and if you remove it then line is left with only one production:
line : line expr NEWLINE
That's a recursive production, and the recursion can only terminate if there is some other production for line which does not recurse. That would be the production line: %empty, which you just deleted. So now Bison can see that the non-terminal line is useless. It is useless because it can never derive any string without non-terminals.
But that's a big problem, because that's the non-terminal your grammar is supposed to recognize. If line can't derive any string, then your grammar cannot recognize anything. Also, because line is useless, and expr is only used by a production in line (other than recursive uses), expr itself can never actually be used. So it, too, becomes useless. (You can think of this as the equivalent of your C compiler complaining that you have unreachable code.)
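If you want to convince yourself of the "cannot derive any string" part, the check is a small fixed-point computation. The following is a sketch in C and not Bison's actual implementation; the single-letter encoding of the grammar (uppercase for non-terminals, anything else for terminals), struct production and find_productive are all made up for the illustration. It only covers productivity, not the second step where expr becomes useless because nothing useful refers to it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* One production: lhs is an uppercase ASCII letter; rhs is a string of
   symbols (uppercase = non-terminal, anything else = terminal, "" = empty). */
struct production {
    char lhs;
    const char *rhs;
};

/* Sets productive[c - 'A'] for every non-terminal c that can derive a string
   made only of terminals. Repeats until a full pass changes nothing. */
static void find_productive(const struct production *rules, size_t n_rules,
                            bool productive[26])
{
    for (int i = 0; i < 26; i++)
        productive[i] = false;

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t r = 0; r < n_rules; r++) {
            assert(isupper((unsigned char)rules[r].lhs));
            if (productive[rules[r].lhs - 'A'])
                continue;
            /* A rule is usable once every non-terminal on its right-hand
               side is already known to be productive. */
            bool all_ok = true;
            for (const char *s = rules[r].rhs; *s != '\0'; s++) {
                if (isupper((unsigned char)*s) && !productive[*s - 'A'])
                    all_ok = false;
            }
            if (all_ok) {
                productive[rules[r].lhs - 'A'] = true;
                changed = true;
            }
        }
    }
}
```

Encoding line as L and expr as E, the rules L -> "" and L -> "LEn" make L productive, while L -> "LEn" on its own never does, because the only rule for L needs L to be productive first.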
So then you attempt to fix the problem by making (the only remaining) rule for line non-recursive. Now you have:
line : expr NEWLINE
That's fine, and it makes line useful again. But it does not recognise the language you originally set out to recognise, because it will now only recognise a single line. (That is, an expr followed by a NEWLINE, as the production says.)
I hope you have now figured out that your original problem was not the grammar, which was always correct, but rather the fact that you told Bison that line produced a value, but you did not actually do anything to provide that value. In other words, the solution would have been to not tell Bison that line produces a value. (You could, instead, provide a value for line in the semantic rule associated with every production for line, but why would you do that if you have no intention of ever using that value?)
Now, let's go back to the actual type declarations, to see why the C compiler complains. We can start with a simple one:
expr : NUMBER { $$.num = $1.num; }
$$ refers to the expr itself, which is declared as having type tag e, which is a struct nPtr *. The star is important: it indicates that the semantic value is a pointer. In order to get at a field in a struct from a pointer to the struct, you need to use the -> operator, just like this simple C example:
struct nPtr the_value; /* The actual struct */
struct nPtr *p = &the_value; /* A pointer to the struct */
p.num = 3; /* Error: p is not a struct */
p->num = 3; /* Fine. p points to a struct */
/* which has a field called num */
(*p).num = 3; /* Also fine, means exactly the same thing. */
/* But don't write this. p->num is much */
/* more readable. */
It's also worth noting that in the C example, we had to make sure that p was initialised to the address of some actual memory. If we hadn't done that and we attempted to assign something to p->num, then we would be trying to poke a value into an undefined address. If you're lucky, that would segfault. Otherwise, it might overwrite anything in your executable.
If any of that was not obvious, please go back to your C textbook and make sure that you understand how C pointers work, before you try to use them.
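For a semantic value that is a pointer, "some actual memory" usually means an allocation made in the action. Here is a minimal sketch in C of what such helpers might look like; new_number_node and add_nodes are hypothetical names, and only struct nPtr is taken from your code.

```c
#include <assert.h>
#include <stdlib.h>

/* Same layout as the struct in the question. */
struct nPtr {
    char *val;
    int num;
};

/* Hypothetical helper: allocates a node, so the returned pointer refers to
   real memory. Returns NULL if the allocation fails. */
static struct nPtr *new_number_node(int num)
{
    struct nPtr *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->val = NULL;   /* no name attached to a plain number */
    p->num = num;    /* -> because p is a pointer, not a struct */
    return p;
}

/* Hypothetical helper: builds a new node holding the sum of two nodes.
   Both arguments must point to valid nodes. */
static struct nPtr *add_nodes(const struct nPtr *a, const struct nPtr *b)
{
    assert(a != NULL && b != NULL);
    return new_number_node(a->num + b->num);
}
```

An action could then assign the result of such a helper to $$ instead of writing through an uninitialised pointer. Somebody also has to free these nodes eventually, which the sketch leaves out.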
So, that was the $$.num part. The actual statement was
$$.num = $1.num;
So now let's look at the right-hand side. $1 refers to a NUMBER, which has tag type iValue, which is an int. An int is not a struct and it has no named fields. $$.num = $1.num is going to be processed into something like:
int i = <some value>;
$$.num = i.num;
which is totally meaningless. Here, the compiler will again complain that iValue is not a struct, but for a different reason: on the left-hand side, e was a pointer to a struct (which is not a struct); on the right-hand side iValue is an integer (which is also not a struct).
I hope that gives you some ideas on what you need to do. If in doubt, there are a variety of fully-worked examples in the Bison manual.
The antlr book has the following sample code to resolve grammar ambiguities using semantic predicates:
// predicates/PredCppStat.g4
@parser::members {
Set<String> types = new HashSet<String>() {{add("T");}};
boolean istype() { return types.contains(getCurrentToken().getText());}
}
stat: decl ';' {System.out.println("decl "+$decl.text);}
| expr ';' {System.out.println("expr "+$expr.text);}
;
decl: ID ID
| {istype()}? ID '(' ID ')'
;
expr: INT
| ID
| {!istype()}? ID '(' expr ')'
;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
WS : [ \t\n\r]+ -> skip ;
Here, the predicate is the first function called in a rule, determining whether the rule should be fired or not. It uses getCurrentToken() to make its decision.
However, if we alter the grammar slightly, to use hierarchical names instead of simple ID, like this:
decl: ID ID
| {istype()}? hier_id '(' ID ')'
;
expr: INT
| ID
| {!istype()}? hier_id '(' expr ')'
;
hier_id : ID ('.' ID)* ;
Then the istype() predicate can no longer use getCurrentToken() to make its decision. It will need the entire chain of tokens in the hier_id to determine whether the chain is a type symbol or not.
That means, that we will need to do one of the following:
(1) Put the predicate after hier_id, and access these values from istype(). Is this possible? I tried it, and I am getting compiler errors in the generated code.
(2) break up the grammar into sub-rules, and then place istype() after hier_id tokens are consumed. But this will wreck the readability of the grammar, and I would not like to do it.
What is the best way to solve this problem? I am using antlr-4.6.
One solution is to let ID itself contain '.', thereby making hier_id a lexer token. In that case, the semantic predicate's call to getCurrentToken() will have access to the full chain of names.
Note that hier_id will subsume ID if it becomes a lexer token. And that comes at a cost. If the grammar has other references to ID only (and I guess it will have), then you have to add predicates in all those situations to avoid false matches. This will slow down the parser.
So I guess the question, in its general sense (i.e. how can rules be restricted by predicates if the current token alone is not enough to make the decision), still needs to be answered by ANTLR 4 experts.
I am parsing a C++ like declaration with this scaled down grammar (many details removed to make it a fully working example). It fails to work mysteriously (at least to me). Is it related to the use of context dependent predicate? If yes, what is the proper way to implement the "counting the number of child nodes logic"?
grammar CPPProcessor;
cppCompilationUnit : decl_specifier_seq? init_declarator* ';' EOF;
init_declarator: declarator initializer?;
declarator: identifier;
initializer: '=0';
decl_specifier_seq
locals [int cnt=0]
@init { $cnt=0; }
: decl_specifier+ ;
decl_specifier @init { System.out.println($decl_specifier_seq::cnt); } :
'const'
| {$decl_specifier_seq::cnt < 1}? type_specifier {$decl_specifier_seq::cnt += 1;} ;
type_specifier: identifier ;
identifier:IDENTIFIER;
CRLF: '\r'? '\n' -> channel(2);
WS: [ \t\f]+ -> channel(1);
IDENTIFIER:[_a-zA-Z] [0-9_a-zA-Z]* ;
I need to implement the standard C++ rule that no more than one type_specifier is allowed in a decl_specifier_seq.
A semantic predicate before type_specifier seems to be the solution. The count is naturally declared as a local variable in decl_specifier_seq, since nested decl_specifier_seq are possible.
But it seems that a context-dependent semantic predicate like the one I used (i.e. a semantic predicate that references $attributes) will produce incorrect parsing. First, an input file with the correct result (to illustrate what a normal parse tree looks like):
int t=0;
and the parse tree:
But with an input that lacks the '=0' to aid the parsing:
int t;
0
1
line 1:4 no viable alternative at input 't'
1
The parsing failed with the 'no viable alternative' error (the numbers printed in the console are debug prints of the $decl_specifier_seq::cnt value, as verification of the test condition). That is, the semantic predicate cannot prevent t from being parsed as a type_specifier, and t is no longer considered an init_declarator. What is the problem here? Is it because a context-dependent predicate that references $decl_specifier_seq::cnt is used?
Does this mean that context-dependent predicates cannot be used to implement "counting the number of child nodes" logic?
EDIT
I tried a new version whose predicate uses a member variable instead of $decl_specifier_seq::cnt and, surprisingly, the grammar now works, which shows that the context-dependent predicate did cause the previous grammar to fail:
....
@parser::members {
public int cnt=0;
}
decl_specifier
@init {System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 }? type_specifier {cnt++;} ;
A normal parse tree results:
This raises the question of how to support nested rules if we must use member variables in place of local variables to avoid context-sensitive predicates.
And a weird result is that if I add a /*$ctx*/ after the predicate, it fails again:
decl_specifier
@init {System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 /*$ctx*/ }? type_specifier {cnt++;} ;
line 1:4 no viable alternative at input 't'
The parsing failed with no viable alternative. Why does /*$ctx*/ cause the parsing to fail, just as when $decl_specifier_seq::cnt is used, although the actual logic uses only a member variable?
And, without the /*$ctx*/, another issue appears, related to the predicate being called before the @init block (described here).
ANTLR 4 evaluates semantic predicates in two cases.
The generated code evaluates a semantic predicate during parsing, and throws an exception if the evaluation returns false. All predicates traversed during parsing are evaluated in this way, including context-dependent predicates and predicates which do not appear at the left side of a decision.
The prediction method evaluates predicates in order to make correct decisions during parsing. In this case, predicates which appear anywhere other than the left edge of the decision being evaluated are assumed to return true (i.e. they are ignored). In addition, context-dependent predicates are only evaluated if the context data is available. The prediction algorithm will not create context structures that were not already provided by the parsing code. If a context-dependent predicate is encountered during prediction and no context is available, the predicate is assumed to return true (i.e. it is ignored for that decision).
The code generator does not evaluate the semantics of the target language, so it has no way to know that $ctx is semantically irrelevant when it appears in /*$ctx*/. Both cases result in the predicate being treated as context-dependent.