Using Flex and Bison, I have a grammar specification for a boolean query language, which supports logical "and", "or", and "not" operations, as well as nested subexpressions using "()".
All was well until I noticed that queries like "A and B or C and D" which I'd like parsed as "(A & B) | (C & D)" was actually being interpreted as "A & ( B | ( C & D ) )". I'm nearly certain this is an associativity issue, but can't seem to find a proper explanation or example anywhere - that or I'm missing something important.
Pertinent information from boolpars.y:
%token TOKEN
%token OPEN_PAREN CLOSE_PAREN
%right NOT
%left AND
%left OR
%%
query: expression { ... }
;
expression: expression AND expression { ... }
| expression OR expression { ... }
| NOT expression { ... }
| OPEN_PAREN expression CLOSE_PAREN { ... }
| TOKEN { ... }
;
Can anyone find the flaw? I can't see why Bison isn't giving "or" appropriate precedence.
From bison docs:
Operator precedence is determined by
the line ordering of the declarations;
the higher the line number of the
declaration (lower on the page or
screen), the higher the precedence.
So in your case OR is lower on the screen and has higher precedence.
Change the order to
%left OR
%left AND
(I haven't tested it though)
Why not split up the productions, as in this snippet from
a C-ish language
logical_AND_expression:
inclusive_OR_expression
| logical_AND_expression ANDAND inclusive_OR_expression
{$$ = N2(__logand__, $1, $3);}
;
logical_OR_expression:
logical_AND_expression
| logical_OR_expression OROR logical_AND_expression
{$$ = N2(__logor__, $1, $3);}
;
I've performed tests on my own implementation, and from my tests, marcin's answer is correct. If I define the precedence as:
%left OR
%left AND
Then the expression A&B|C&D will be reduced to ((A&B)|(C&D))
If I define the precedence as:
%left AND
%left OR
Then the expression A&B|C&D will be reduced to ((A&(B|C))&D)
One differentiating expression would be:
true & true | true & false
The former precedence definition would render this as true, whereas the latter would render it as false. I've tested both scenarios and both work as explained.
Double check your tests to make sure. Also note that it is the order of the %left, %right, etc. definitions in the header portion that define the precedence, not the order that you define your rules themselves. If it's still not working, maybe it's some other area in your code that's messing it up, or maybe your version of bison is different (I'm just shooting in the dark at this point).
Related
I'm currently reading through John C. Mitchell's Foundations for Programming Languages. Exercise 2.2.3, in essence, asks the reader to show that the (natural-number) exponentiation function cannot be implicitly defined via an expression in a small language. The language consists of natural numbers and addition on said numbers (as well as boolean values, a natural-number equality predicate, & ternary conditionals). There are no loops, recursive constructs, or fixed-point combinators. Here is the precise syntax:
<bool_exp> ::= <bool_var> | true | false | Eq? <nat_exp> <nat_exp> |
if <bool_exp> then <bool_exp> else <bool_exp>
<nat_exp> ::= <nat_var> | 0 | 1 | 2 | … | <nat_exp> + <nat_exp> |
if <bool_exp> then <nat_exp> else <nat_exp>
Again, the object is to show that the exponentiation function n^m cannot be implicitly defined via an expression in this language.
Intuitively, I'm willing to accept this. If we think of exponentiation as repeated multiplication, it seems like we "just can't" express that with this language. But how does one formally prove this? More broadly, how do you prove that an expression from one language cannot be expressed in another?
Here's a simple way to think about it: the expression has a fixed, finite size, and the only arithmetic operation it can do to produce numbers not written as literals or provided as the values of variables is addition. So the largest number it can possibly produce is limited by the number of additions plus 1, multiplied by the largest number involved in the expression.
So, given a proposed expression, let k be the number of additions in it, let c be the largest literal (or 1 if there is none) and choose m and n such that n^m > (k+1)*max(m,n,c). Then the result of the expression for that input cannot be the correct one.
Note that this proof relies on the language allowing arbitrarily large numbers, as noted in the other answer.
No solution, only hints:
First, let me point out that if there are finitely many numbers in the language, then exponentiation is definable as an expression. (You'd have to define what it should produce when the true result is unrepresentable, eg wraparound.) Think about why.
Hint: Imagine that there are only two numbers, 0 and 1. Can you write an expression involving m and n whose result is n^m? What if there were three numbers: 0, 1, and 2? What if there were four? And so on...
Why don't any of those solutions work? Let's index them and call the solution for {0,1} partial_solution_1, the solution for {0,1,2} partial_solution_2, and so on. Why isn't partial_solution_n a solution for the set of all natural numbers?
Maybe you can generalize that somehow with some metric f : Expression -> Nat so that every expression expr with f(expr) < n is wrong somehow...
You may find some inspiration from the strategy of Euclid's proof that there are infinitely many primes.
The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule 'notExpression' with just the same definition. But this grammars produce different trees parsing this: '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with identifier, not with AND expression) but still need to have 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' <EOF>;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' <EOF>;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I kind of got an answer to the second part of my question, which still do not quite give me a satisfaction because using this approach in real elaborate grammar is going to be ugly. As to the first part (WHY) I still have no idea, so more answers are welcome.
Anyway, to fix precedence in presence of referenced rule the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
Which produces the tree different from both shown in original question, but at least the NOT does acquire higher precedence.
Parse tree
Im making a simple calculator that prints postfix to learn bison. i could make the postfix part work ,but now i need to do assigments to variables (a-z) tex: a=3+2; should print: a32+=; for example. Im trying modified my working postfix code to be able to read a char too.
If I understand correctly, to be able to put different types in the $$ i need a union and to make the nodes i should use a struct because in my case 'expr' can be a int or a char.
This is my parser:
%code requires{
struct nPtr{
char *val;
int num;
};
}
%union {
int iValue;
char sIndex;
struct nPtr *e;
};
%token PLUS MINUS STAR LPAREN RPAREN NEWLINE DIV ID NUMBER POW EQL
%type <iValue> NUMBER
%type <sIndex> ID
%type <e> expr line
%left PLUS MINUS
%left STAR DIV
%left POW
%left EQL
%%
line : /* empty */
|line expr NEWLINE { printf("=\n%d\n", $2.num); }
expr : LPAREN expr RPAREN { $$.num = $2.num; }
| expr PLUS expr { $$.num = $1.num + $3.num; printf("+"); }
| expr MINUS expr { $$.num = $1.num - $3.num; printf("-"); }
| expr STAR expr { $$.num = $1.num * $3.num; printf("*"); }
| expr DIV expr { $$.num = $1.num / $3.num; printf("/");}
| expr POW expr { $$.num = pow($1.num, $3.num); printf("**");}
| NUMBER { $$.num = $1.num; printf("%d", yylval); }
| ID EQL expr { printf("%c", yylval); }
;
%%
I have this in the lex to handle the "=" and variables
"=" { return EQL; }
[a-z] { yylval.sIndex = strdup(yytext); return ID; }
i get and error
warning empty rule for typed nonterminal and no action
the only answer i found here about that is this:
Bison warning: Empty rule for typed nonterminal
It says to just remove the /* empty */ part in:
line: /* empty */
| line expr NEWLINE { printf("=\n%d\n", $2.num); }
when i do that i get 3 new warnings:
warning 3 nonterminal useless in grammar
warning 10 rules useless in grammar
fatal error: start symbol line does not derive any sentence
I googled and got some solutions that just gave me other problems, for example.
when i change:
line:line expr NEWLINE { printf("=\n%d\n", $2.num); }
to
line:expr NEWLINE { printf("=\n%d\n", $2.num); }
bison works but when i try to run the code in visual studio i get a lot of errors like:
left of '.e' must have struct/union type
'pow': too few arguments for call
'=': cannot convert from 'int' to 'YYSTYPE'
thats as afar as i got. I cant find a simple example similar to my needs. I just want to make 'expr' be able to read a char and print it. If someone could check my code and recommend some changes . it will be really really appreciated.
It is important to actually understand what you are asking Bison to do, and what it is telling you when it suggests that your instructions are wrong.
Applying a fix from someone else's problem will only work if they made the same mistake as you did. Just making random changes based on Google searching is neither a very structured way to debug nor to learn a new tool.
Now, what you said was:
%type <e> expr line
Which means that expr and line both produce a value whose union tag is e. (Tag e is a pointer to a struct nPtr. I doubt that's what you wanted either, but we'll start with the Bison warnings.)
If a non-terminal produces a value, it produce a value in every production. To produce a value, the semantic rule needs to assign a value (of the correct tag-type) to $$. For convenience, if there is no semantic action and if $1 has the correct tag-type, then Bison will provide the default rule $$ = $1. (That sentence is not quite correct but it's a useful approximation.) Many people think that you should not actually rely on this default, but I think it's OK as long as you know that it is happening, and have verified that the preconditions are valid.
In your case, you have two productions for line. Neither of them assigns a value to $$. This might suggest that you didn't really intend for line to even have a semantic value, and if we look more closely, we'll see that you never attempt to use a semantic value for any instance of the non-terminal line. Bison does not, as far as I know, attempt such a detailed code examination, but it is capable of noting that the production:
line : /* empty */
has no semantic action, and
does not have any symbols on the right-hand side, so no default action can be constructed.
Consequently, it warns you that there is some circumstance in which line will not have its value set. Think of this as the same warning as your C compiler might give you if it notices that a variable you use has never had a value assigned to it.
In this case, as I said, Bison has not noticed that you never use value of line. It just assumes that you wouldn't have declared the type of the value if you don't intend to somehow provide a value. You did declare the type, and you never provide a value. So your instructions to Bison were not clear, right?
Now, the solution is not to remove the production. There is no problem with the production, and if you remove it then line is left with only one production:
line : line expr NEWLINE
That's a recursive production, and the recursion can only terminate if there is some other production for line which does not recurse. That would be the production line: %empty, which you just deleted. So now Bison can see that the non-terminal line is useless. It is useless because it can never derive any string without non-terminals.
But that's a big problem, because that's the non-terminal your grammar is supposed to recognize. If line can't derive any string, then your grammar cannot recognize anything. Also, because line is useless, and expr is only used by a production in line (other than recursive uses), expr itself can never actually be used. So it, too, becomes useless. (You can think of this as the equivalent of your C compiler complaining that you have unreachable code.)
So then you attempt to fix the problem by making (the only remaining) rule for line non-recursive. Now you have:
line : expr NEWLINE
That's fine, and it makes line useful again. But it does not recognise the language you originally set out to recognise, because it will now only recognise a single line. (That is, an expr followed by a NEWLINE, as the production says.)
I hope you have now figured out that your original problem was not the grammar, which was always correct, but rather the fact that you told Bison that line produced a value, but you did not actually do anything to provide that value. In other words, the solution would have been to not tell Bison that line produces a value. (You could, instead, provide a value for line in the semantic rule associated with every production for line, but why would you do that if you have no intention of every using that value?)
Now, let's go back to the actual type declarations, to see why the C compiler complains. We can start with a simple one:
expr : NUM { $$.num = $1.num; }
$$ refers the the expr itself, which is declared as having type tag e, which is a struct nPtr *. The star is important: it indicates that the semantic value is a pointer. In order to get at a field in a struct from a pointer to the struct, you need to use the -> operator, just like this simple C example:
struct nPtr the_value; /* The actual struct */
struct nPtr *p = &the_value; /* A pointer to the struct */
p.num = 3; /* Error: p is not a struct */
p->num = 3; /* Fine. p points to a struct */
/* which has a field called num */
(*p).num = 3; /* Also fine, means exactly the same thing. */
/* But don't write this. p->num is much */
/* more readable. */
It's also worth noting that in the C example, we had to make sure that p was initialised to the address of some actual memory. If we hadn't done that and we attempted to assign something to p->num, then we would be trying to poke a value into an undefined address. If you're lucky, that would segfault. Otherwise, it might overwrite anything in your executable.
If any of that was not obvious, please go back to your C textbook and make sure that you understand how C pointers work, before you try to use them.
So, that was the $$.num part. The actual statement was
$$.num = $1.num;
So now lets look at the right-hand side. $1 refers to a NUM which has tag type iValue, which is an int. An int is not a struct and it has no named fields. $$.num = $1.num is going to be processed into something like:
int i = <some value>;
$$.num = i.num;
which is totally meaningless. Here, the compiler will again complain that iValue is not a struct, but for a different reason: on the left-hand side, e was a pointer to a struct (which is not a struct); on the right-hand side iValue is an integer (which is also not a struct).
I hope that gives you some ideas on what you need to do. If in doubt, there are a variety of fully-worked examples in the Bison manual.
I must define a rule which expresses the following statement: {x in y | x > 0}.
For the first part of that comprehension "x in y", i have the subrule:
FIRSTPART: Name "in" Name
, whereas Name can be everything.
My problem is that I do not want a greedy behaviour. So, it should parse until the "|" sign and then stop. Since I am new in ANTLR4, I do not know how to achieve that.
best regards,
Normally, the lexer/parser rules should represent the allowable syntax of the source input stream.
The evaluation (and consequences) of how the source matches any rule or subrule is a matter of semantics -- whether the input matches a particular subrule and whether that should control how the rule is finally evaluated.
Normally, semantics are implemented as part of the tree-walker analysis. You can use alternate subrule lables (#inExpr, etc) to create easily distinguishable tree nodes for analysis purposes:
comprehension : LBrace expression property? RBrace ;
expression : ....
| Name In Name #inExpr
| Name BinOp Name #binExpr
| ....
;
property : Provided expression ;
BinOp : GT | LT | GTE | .... ;
Provided : '|' ;
In : 'in' ;
I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so the identifier can be both boolean_expr or expr, its type will be checked by a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, it is reduce/reduce if I include identifier in both term and boolean_expr, any solution to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the part of the grammar shown, that production is ambiguous, because:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look a scanned identifier token in the symbol table and use the information in the symbol table to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because the now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!x) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: product | product '<' product ;
product: factor | product '*' factor ;
factor: term | factor '+' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}