Solving this Shift/Reduce Conflict in Happy/Bison - haskell

I am making a simple parser in Happy (Bison equivalent for Haskell) and I stumbled upon a shift/reduce conflict in these rules:
ClassBlock :
"{" ClassAttributes ClassConstructor ClassFunctions "}" {ClassBlock $2 $3 $4}
ClassAttributes :
{- empty -} { ClassAttributesEmpty }
| ClassAttributes ClassAttribute {ClassAttributes $1 $2}
ClassAttribute :
"[+]" Variable {ClassAttributePublic $2 }
| "[-]" Variable {ClassAttributePrivate $2 }
ClassFunctions :
{- empty -} { ClassFunctionsEmpty }
| ClassFunctions ClassFunction {ClassFunctions $1 $2}
ClassFunction :
"[+]" Function {ClassFunctionPublic $2}
| "[-]" Function {ClassFunctionPrivate $2}
ClassConstructor :
{- empty -} { ClassConstructorEmpty }
| TypeFuncParams var_identifier Params Block {ClassConstructor $1 $2 $3 $4}
TypeFuncParams :
Primitive ClosingBracketsNoIdentifier { TypeFuncParamsPrimitive $1 $2}
| class_identifier ClosingBracketsNoIdentifier { TypeFuncParamsClassId $1 $2}
| ListType {TypeFuncParamsList $1}
The info file states the shift/reduce conflict:
ClassBlock -> "{" ClassAttributes . ClassConstructor ClassFunctions "}" (rule 52)
ClassAttributes -> ClassAttributes . ClassAttribute (rule 54)
"[+]" shift, and enter state 85
(reduce using rule 61)
"[-]" shift, and enter state 86
(reduce using rule 61)
Rule 61 is this one:
ClassConstructor :
{- empty -} { ClassConstructorEmpty }
I am not really sure how to solve this problem. I tried using precedence rules to silence the warning, but it didn't work out as I expected.

Below is a simplified grammar which exhibits the same problem.
To construct it, I removed:
- all actions
- the prefix "Class" from all nonterminal names
I also simplified most of the rules. I did this to illustrate how to construct a minimal, complete, verifiable example, as the StackOverflow guidelines suggest; stripping the grammar down makes it easier to focus on the problem while still allowing an actual test. (I used bison, not happy, but the syntax is very similar.)
Block : "{" Attributes Constructor Functions "}"
Attributes : {- empty -} | Attributes Attribute
Constructor: {- empty -} | "constructor"
Functions : {- empty -} | Functions Function
Attribute : "[+]" "attribute"
Function : "[+]" "function"
Now, let's play parser, and suppose that we've (somehow) identified a prefix which could match Attributes. (Attributes can match the empty string, so we could be right at the beginning of the input.) And suppose the next token is [+].
At this point, we cannot tell if the [+] will later turn out to be the beginning of an Attribute or if it is the start of a Function which follows an empty Constructor. However, we need to know that in order to continue the parse.
If we've finished with the Attributes and about to start on the Functions, this is the moment where we have to reduce the empty nonterminal Constructor. Unless we do that now, we cannot then go on to recognize a Function. On the other hand, if we haven't seen the last Attribute but we do reduce a Constructor, then the parse will eventually fail because the next Attribute cannot follow the Constructor we just reduced.
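Playing parser by hand makes the problem concrete. Below is a Python sketch of the simplified grammar (token spellings are assumptions, and this is a hand-written LL-style parser, not Happy or bison). Note that it needs TWO tokens of lookahead to decide between an Attribute and a Function, which is precisely the information an LALR(1) parser does not have at the moment it must decide whether to reduce the empty Constructor:

```python
def parse_block(tokens):
    """Parse: "{" Attribute* "constructor"? Function* "}" with 2-token lookahead."""
    pos = 0

    def peek(k=0):
        return tokens[pos + k] if pos + k < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    expect("{")
    attrs = []
    # "[+]" alone is ambiguous; the token AFTER it makes the decision.
    while peek() == "[+]" and peek(1) == "attribute":
        pos += 2
        attrs.append("attribute")
    has_ctor = peek() == "constructor"
    if has_ctor:
        pos += 1
    funcs = []
    while peek() == "[+]" and peek(1) == "function":
        pos += 2
        funcs.append("function")
    expect("}")
    return attrs, has_ctor, funcs

print(parse_block(["{", "[+]", "attribute", "[+]", "function", "}"]))
# (['attribute'], False, ['function'])
```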
In cases like this, it is often useful to remove the empty production by factoring the options into the places where the non-terminal is used:
Block : "{" Attributes "constructor" Functions "}"
| "{" Attributes Functions "}"
Attributes : {- empty -} | Attributes Attribute
Functions : {- empty -} | Functions Function
Attribute : "[+]" "attribute"
Function : "[+]" "function"
But just removing Constructor isn't enough here. In order to start parsing the list of Functions, we need to first reduce an empty Functions to provide the base case of the Functions recursion, so we still need to guess where the Functions start in order to find the correct parse. And if we wrote the two lists as right-recursions instead of left-recursions, we'd instead need an empty Attributes to terminate the Attributes recursion.
What we could do, in this particular case, is use a cunning combination of left- and right-recursion:
Block : "{" Attributes "constructor" Functions "}"
| "{" Attributes Functions "}"
Attributes : {- empty -} | Attributes Attribute
Functions : {- empty -} | Function Functions
Attribute : "[+]" "attribute"
Function : "[+]" "function"
By making the first list left-recursive and the second list right-recursive, we avoid the need to reduce an empty non-terminal between the two lists. That, in turn, allows the parser to decide whether a phrase was an Attribute or a Function after it has seen the phrase, at which point it is no longer necessary to consult an oracle.
However, that solution is not very pretty for a number of reasons, not the least of which is that it only works for the concatenation of two optional lists. If we wanted to add another list of a different kind of item which could also start with the [+] token, a different solution would be needed.
The simplest one, which is used by a number of languages, is to allow the programmer to intermingle the various list elements. You might consider that bad style, but it is not always necessary to castigate bad style by making it a syntax error.
A simple solution would be:
Block : "{" Things "}"
Things : {- empty -} | Things Attribute | Things Function | Things Constructor
Attribute : "[+]" "attribute"
Constructor: "constructor"
Function : "[+]" "function"
but that doesn't limit a Block to at most one Constructor, which seems to be a syntactic requirement. However, as long as Constructor cannot start with a [+], you could implement the "at most one Constructor" limitation with:
Block : "{" Things Constructor Things "}"
| "{" Things "}"
Things : {- empty -} | Things Attribute | Things Function
Attribute : "[+]" "attribute"
Constructor: "constructor"
Function : "[+]" "function"
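For what it's worth, this final grammar really is decidable with a single token of lookahead. A quick Python sketch following its shape (token spellings are assumptions, and "at most one Constructor" is enforced with a flag rather than by the two-Things split):

```python
def parse_block(tokens):
    """Parse a block of intermingled "[+]" items with at most one "constructor"."""
    pos = 0
    items = []
    seen_ctor = False
    if tokens[pos] != "{":
        raise SyntaxError("expected '{'")
    pos += 1
    while pos < len(tokens) and tokens[pos] != "}":
        if tokens[pos] == "constructor":
            if seen_ctor:
                raise SyntaxError("at most one constructor per block")
            seen_ctor = True
            pos += 1
        elif tokens[pos] == "[+]":
            items.append(tokens[pos + 1])  # "attribute" or "function"
            pos += 2
        else:
            raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    if pos >= len(tokens):
        raise SyntaxError("expected '}'")
    return items, seen_ctor

print(parse_block(["{", "[+]", "function", "constructor", "[+]", "attribute", "}"]))
# (['function', 'attribute'], True)
```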

Related

ANTLR4 different precedence in two seemingly equivalent grammars

The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule 'notExpression' with exactly the same definition. But these grammars produce different trees when parsing '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with identifier, not with AND expression) but still need to have 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' EOF;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' EOF;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I kind of got an answer to the second part of my question, though it still doesn't quite satisfy me, because using this approach in a real, elaborate grammar is going to be ugly. As to the first part (the why), I still have no idea, so more answers are welcome.
Anyway, to fix precedence in presence of referenced rule the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
Which produces the tree different from both shown in original question, but at least the NOT does acquire higher precedence.
Parse tree

Dynamic operator precedence and associativity in ANTLR4?

I've been working on an ANTLR4 grammar for Z Notation (the ISO UTF version). The specification calls for a lex phase and then a "2 phased" parse:
you first lex the input into a bunch of NAME (or DECORWORD) tokens, then parse the resulting tokens against the operatorTemplate rules in the spec's parser grammar, replace the appropriate tokens, and finally parse the new modified token stream to get the AST.
I have the above working, but I can't figure out how to set the precedence and associativity of the parser rules dynamically, so the parse trees are wrong.
the operator syntax looks like (numbers are precedence):
-generic 5 rightassoc (_ → _)
-function 65 rightassoc (_ ◁ _)
I don't see any API to set the associativity on a rule, so I tried semantic predicates, something like:
expression
: {ZSupport.isLeftAssociative()}? expression I expression
| <assoc=right> expression I expression
;
or
expression
: {ZSupport.isLeftAssociative()}? expression I expression
| <assoc=right> {ZSupport.isRightAssociative()}? expression I expression
;
But then I get "The following sets of rules are mutually left-recursive [expression]".
Can this be done?
I was able to accomplish this by moving the semantic predicate:
expression
: expression {ZSupport.isLeftAssociative()}? I expression
| <assoc=right> expression I expression
;
I was under the impression that this wasn't going to work based on this discussion:
https://stackoverflow.com/a/23677069/7711235
...but it does seem to work correctly in all my test cases...

Express a rule with ANTLR4

I must define a rule which expresses the following statement: {x in y | x > 0}.
For the first part of that comprehension, "x in y", I have the subrule:
FIRSTPART: Name "in" Name
where Name can match anything.
My problem is that I do not want greedy behaviour: the rule should consume input up to the "|" sign and then stop. Since I am new to ANTLR4, I do not know how to achieve that.
Normally, the lexer/parser rules should represent the allowable syntax of the source input stream.
The evaluation (and consequences) of how the source matches any rule or subrule is a matter of semantics -- whether the input matches a particular subrule and whether that should control how the rule is finally evaluated.
Normally, semantics are implemented as part of the tree-walker analysis. You can use alternate subrule labels (#inExpr, etc.) to create easily distinguishable tree nodes for analysis purposes:
comprehension : LBrace expression property? RBrace ;
expression : ....
| Name In Name #inExpr
| Name BinOp Name #binExpr
| ....
;
property : Provided expression ;
BinOp : GT | LT | GTE | .... ;
Provided : '|' ;
In : 'in' ;

Bison/Flex, reduce/reduce, identifier in different production

I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so an identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any way to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably look something like this:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in
x = y
without knowing anything about y, y could be reduced to either a term or a boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
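(As an aside: if keeping the type-split grammar really mattered, bison's GLR mode can parse it anyway. The %glr-parser directive lets the parser pursue both possible reductions in parallel until later input disambiguates them, at the cost of extra runtime machinery. A sketch of the idea, not a drop-in fix:

```yacc
%glr-parser

%%

term: '(' expr ')' | ... ;
boolean_expr: '(' boolean_expr ')' | ... ;
```

This sidesteps the LALR(1) limitation but does nothing about the deeper problem of semantics leaking into the syntax, discussed below.)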
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look a scanned identifier up in the symbol table and use the stored information to choose between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
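To make the lexer-feedback idea concrete, here is a minimal Python sketch. The names (symbol_table, classify) and the three token-type strings mirror the grammar above but are otherwise illustrative assumptions:

```python
# The scanner consults the symbol table to classify each identifier token.
symbol_table = {}

def classify(name):
    """Return a token type for an identifier based on its declared type."""
    kind = symbol_table.get(name)
    if kind == "int":
        return "integer_variable"
    if kind == "boolean":
        return "boolean_variable"
    return "undefined_variable"

# A declaration like "int a = 1;" would record the type as a side effect:
symbol_table["a"] = "int"
symbol_table["b"] = "boolean"
print(classify("a"), classify("b"), classify("c"))
# integer_variable boolean_variable undefined_variable
```

Every new datatype means another branch here and another token in the grammar, which is exactly the scalability problem described above.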
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
  X(int n = 0) : data_is_available_(n) {}
  operator bool() const { return data_is_available_; }
  // ...
private:
  bool data_is_available_;
  // ...
};

X my_x_object();
// ...
if (!my_x_object) {
  // This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: sum | sum '<' sum ;
sum: product | sum '+' product ;
product: term | product '*' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
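A minimal sketch of such a separate checking pass, in Python rather than C for brevity (the node shape and type names are hypothetical, not from the grammar above):

```python
class Node:
    """Tiny AST node: op is "num", "bool", "var", or an operator name."""
    def __init__(self, op, *kids):
        self.op, self.kids = op, kids

def typeof(node, env):
    """Walk the AST bottom-up, returning each node's type or raising on error."""
    if node.op == "num":
        return "int"
    if node.op == "bool":
        return "boolean"
    if node.op == "var":
        return env[node.kids[0]]          # env: name -> declared type
    if node.op in ("+", "*", "<"):
        if any(typeof(k, env) != "int" for k in node.kids):
            raise TypeError(f"'{node.op}' needs int operands")
        return "boolean" if node.op == "<" else "int"
    if node.op in ("and", "or"):
        if any(typeof(k, env) != "boolean" for k in node.kids):
            raise TypeError(f"'{node.op}' needs boolean operands")
        return "boolean"
    raise ValueError(f"unknown op {node.op!r}")

# (x < 7) and flag  -- type-checked after parsing, not during the parse
tree = Node("and", Node("<", Node("var", "x"), Node("num", 7)), Node("var", "flag"))
print(typeof(tree, {"x": "int", "flag": "boolean"}))
# boolean
```

The parser stays purely syntactic; all the type logic lives in this one walk, and adding a datatype touches only this pass.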
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser). In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Flex and Bison Associativity difficulty

Using Flex and Bison, I have a grammar specification for a boolean query language, which supports logical "and", "or", and "not" operations, as well as nested subexpressions using "()".
All was well until I noticed that queries like "A and B or C and D", which I'd like parsed as "(A & B) | (C & D)", were actually being interpreted as "A & ( B | ( C & D ) )". I'm nearly certain this is an associativity issue, but I can't seem to find a proper explanation or example anywhere; either that, or I'm missing something important.
Pertinent information from boolpars.y:
%token TOKEN
%token OPEN_PAREN CLOSE_PAREN
%right NOT
%left AND
%left OR
%%
query: expression { ... }
;
expression: expression AND expression { ... }
| expression OR expression { ... }
| NOT expression { ... }
| OPEN_PAREN expression CLOSE_PAREN { ... }
| TOKEN { ... }
;
Can anyone find the flaw? I can't see why Bison isn't giving "or" appropriate precedence.
From the bison docs:
"Operator precedence is determined by the line ordering of the declarations; the higher the line number of the declaration (lower on the page or screen), the higher the precedence."
So in your case OR is lower on the screen and has higher precedence.
Change the order to
%left OR
%left AND
(I haven't tested it though)
Why not split up the productions, as in this snippet from a C-ish language:
logical_AND_expression:
inclusive_OR_expression
| logical_AND_expression ANDAND inclusive_OR_expression
{$$ = N2(__logand__, $1, $3);}
;
logical_OR_expression:
logical_AND_expression
| logical_OR_expression OROR logical_AND_expression
{$$ = N2(__logor__, $1, $3);}
;
I've performed tests on my own implementation, and from my tests, marcin's answer is correct. If I define the precedence as:
%left OR
%left AND
Then the expression A&B|C&D will be reduced to ((A&B)|(C&D))
If I define the precedence as:
%left AND
%left OR
Then the expression A&B|C&D will be reduced to ((A&(B|C))&D)
One differentiating expression would be:
true & true | true & false
The former precedence definition would render this as true, whereas the latter would render it as false. I've tested both scenarios and both work as explained.
Double check your tests to make sure. Also note that it is the order of the %left, %right, etc. definitions in the header portion that define the precedence, not the order that you define your rules themselves. If it's still not working, maybe it's some other area in your code that's messing it up, or maybe your version of bison is different (I'm just shooting in the dark at this point).
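The effect is easy to reproduce outside bison. In this Python sketch (not the asker's code), a precedence-climbing evaluator's prec table plays the role of the %left declaration order, with a higher number binding tighter:

```python
def evaluate(tokens, prec):
    """Evaluate "&"/"|" over "true"/"false" literals by precedence climbing."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse(min_prec):
        nonlocal pos
        lhs = tokens[pos] == "true"        # operand: "true" or "false"
        pos += 1
        while peek() in prec and prec[peek()] >= min_prec:
            op = tokens[pos]
            pos += 1
            rhs = parse(prec[op] + 1)      # +1 makes the operator left-associative
            lhs = (lhs and rhs) if op == "&" else (lhs or rhs)
        return lhs

    return parse(0)

toks = "true & true | true & false".split()
print(evaluate(toks, {"&": 2, "|": 1}))  # AND binds tighter: True
print(evaluate(toks, {"&": 1, "|": 2}))  # OR binds tighter:  False
```

The two prec tables reproduce exactly the two readings described above: ((A&B)|(C&D)) versus ((A&(B|C))&D).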
