Shift reduce conflicts in happy grammar - haskell

Hello good programmers,
I have built following grammar in happy (haskell):
P : program C {Prog $2}
E : int {Num $1}
| ident {Id $1}
| true {BoolConst True}
| false {BoolConst False}
| read {ReadInput}
| '(' E ')' {Parens $2}
| E '+' E { Add $1 $3 }
| E '-' E { Sub $1 $3 }
| E '*' E { Mult $1 $3 }
| E '/' E { Div $1 $3 }
| E '=' E { Eq $1 $3 }
| E '>' E { Gt $1 $3 }
| E '<' E { Lt $1 $3 }
C : '(' C ')' {$2}
| ident assign E {Assign $1 $3}
| if E then C else C {Cond $2 $4 $6}
| output E {OutputComm $2}
| while E do C {While $2 $4 }
| begin D ';' C end {Declare $2 $4}
| C ';' C {Seq $1 $3 }
D : D ';' D {DSeq $1 $3 }
| '(' D ')' {$2}
| var ident assign E {Var $2 $4}
Now, While loop includes all the commands that follow after 'do'. How to change this behavior? I've already tried %left, %right... :(

Invent two kinds of C, one that allows sequencing, and one that doesn't. Something like this, perhaps:
Cnoseq : '(' Cseq ')'
| ident assign E
| while E do Cnoseq
Cseq : Cnoseq ';' Cseq
| Cnoseq

Consider this code fragment:
while True do output 1 ; output 2
This could be parsed as either
(while True do output 1) ; (output 2)
or
while True do (output 1 ; output 2)
This sort of ambiguity is the source of the conflict.
If you don't want to change the grammar as #DanielWagner suggested, you can use precedence rules to resolve the ambiguity. Exactly how depends on what the precedence should be, however it's probably something like this:
%right do
%right then else
%right ';'
Precedences are listed from low to high, so in this case the above example would be parsed in the latter manner. Just add the precedence rules below the tokens but before the %% line.

Related

Antlr4 Grammar fails to parse

I am new to ANTLR, and here is a grammar that I am working on and its failing for a given input string - A.B() && ((C.D() || E.F())).
I have tried a number of combinations but its failing on the same place.
grammar Expressions;
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
;
logicBlock
: logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: NOT? '(' method ')'
| NOT? method
;
method
: object '.' function ('.' function)*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: (object | function | method | logicUnit | logicBlock)
(
','
(object | function | method | logicUnit | logicBlock)
)*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
THINGS
: [a-zA-Z] [a-zA-Z0-9]*
| '"' .*? '"'
| [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
The error that I getting for this input - A.B() && ((C.D() || E.F())) is below. any help and/or suggestion to improve would be highly appreciated.
This rule looks odd to me:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? '('? logicBlock ')'? (logicalConnector NOT? '('? logicBlock ')'?)*
// ^ ^ ^ ^
// | | | |
;
All the optional parenthesis can't be right: this means the the first can be present, and the other 3 would be omitted (and any other combination), leaving your expression with unbalanced parenthesis.
The get your grammar working in the input A.B() && ((C.D() || E.F())) , you'll want to do something like this:
expression
: logicBlock (logicalConnector logicBlock)*
| NOT? logicBlock (logicalConnector NOT? logicBlock)*
;
logicBlock
: '(' expression ')'
| logicUnit comparator THINGS
| logicUnit comparator logicUnit
| logicUnit
;
logicUnit
: '(' expression ')'
| NOT? '(' method ')'
| NOT? method
;
But ANTLR4 supports left recursive rules, allowing you to define the expression rule like this:
expression
: '(' expression ')'
| NOT expression
| expression comparator expression
| expression logicalConnector expression
| method
| object
| THINGS
;
method
: object '.' function ( '.' function )*
;
object
: THINGS
|'(' THINGS ')'
;
function
: THINGS '(' arguments? ')'
;
arguments
: expression ( ',' expression )*
;
logicalConnector
: AND | OR | PLUS | MINUS
;
comparator
: GT | LT | GTE | LTE | EQUALS | NOTEQUALS
;
making it much more readable. Sure, it doesn't produce the exact parse tree your original grammar was producing, but you might be flexible in that regard. This proposed grammar also matches more than yours: like NOT A, which your grammar does not allow, where my proposed grammar does accept it.

Making pairs of words based on one column

I want to make pairs of words based on the third column (identifier). My file is similar to this example:
A ID.1
B ID.2
C ID.1
D ID.1
E ID.2
F ID.3
The result I want is:
A C ID.1
A D ID.1
B E ID.2
C D ID.1
Note that I don't want to obtain the same word pair in the opposite order. In my real file some words appear more than one time with different identifiers.
I tried this code which works well but requires a lot of time (and I don't know if there are redundancies):
counter=2
cat filtered_go_annotation.txt | while read f1 f2; do
tail -n +$counter go_annotation.txt | grep $f2 | awk '{print "'$f1' " $1}';
((counter++))
done > go_network2.txt
The 'tail' is used to delete a line when it's read.
Awk solution:
awk '{ a[$2] = ($2 in a? a[$2] FS : "") $1 }
END {
for (k in a) {
len = split(a[k], items);
for (i = 1; i <= len; i++)
for (j = i+1; j <= len; j++)
print items[i], items[j], k
}
}' filtered_go_annotation.txt
The output:
A C ID.1
A D ID.1
C D ID.1
B E ID.2
With GNU awk for sorted_in and true multi-dimensional arrays:
$ cat tst.awk
{ vals[$2][$1] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in vals) {
for (j in vals[i]) {
for (k in vals[i]) {
if (j != k) {
print j, k, i
}
}
delete vals[i][j]
}
}
}
$ awk -f tst.awk file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
I wonder if this would work (in GNU awk):
$ awk '
($2 in a) && !($1 in a[$2]) { # if ID.x is found in a and A not in a[ID.X]
for(i in a[$2]) # loop all existing a[ID.x]
print i,$1,$2 # and output combination of current and all previous matching
}
{
a[$2][$1] # hash to a
}' file
A C ID.1
A D ID.1
C D ID.1
B E ID.2
in two steps
$ sort -k2 file > file.s
$ join -j2 file.s{,} | awk '!(a[$2,$3]++ + a[$3,$2]++){print $2,$3,$1}'
A C ID.1
A D ID.1
C D ID.1
B E ID.2
If your input is large, it may be faster to solve it in steps, e.g.:
# Create temporary directory for generated data
mkdir workspace; cd workspace
# Split original file
awk '{ print $1 > $2 }' ../infile
# Find all combinations
perl -MMath::Combinatorics \
-n0777aE \
'
$c=Math::Combinatorics->new(count=>2, data=>[#F]);
while(#C = $c->next_combination) {
say join(" ", #C) . " " . $ARGV
}
' *
Output:
C D ID.1
C A ID.1
D A ID.1
B E ID.2
Perl
solution using regex backtracking
perl -n0777E '/^([^ ]*) (.*)\n(?:.*\n)*?([^ ]*) (\2)\n(?{say"$1 $3 $2"})(?!)/mg' foo.txt
flags see perl -h.
^([^ ]*) (.*)\n : matches a line with at least one space first capturing group at the left side of first space, second capturing group the right side.
(?:.*\n)*?: matches (without capturing) 0 or more lines lazily to try following pattern first before matching more lines.
([^ ]*) (\2)\n : similar to first match using backreference \2 to match a line with the same key.
(?{say"$1 $3 $2"}) : code to display the captured groups
(?!) : to make the match fail to backtrack.
Note that it could be shortened a bit
perl -n0777E '/^(\S+)(.+)[\s\S]*?^((?1))(\2)$(?{say"$1 $3$2"})(?!)/mg' foo.txt
Yet another awk making use of the redefinition of $0. This makes the solution of RomanPerekhrest a bit shorter :
{a[$2]=a[$2] FS $1}
END { for(i in a) { $0=a[i]; for(j=1;j<NF;j++)for(k=j+1;k<=NF;++k) print $j,$k,i} }

Antlr4 expressions are chaining

from what i have read the way i defined my 'expression' this should provide me with the following:
This is my input:
xyz = a + b + c + d
This should be my output:
xyz = ( ( a + b ) + ( c + d) )
But instead i get:
xyz = ( a + (b + (c + d) ) )
I bet this has been solved before and i just wasnt able to find the solution.
statementList : s=statement sl=statementList #multipleStatementList
| s=statement #singleStatementList
;
statement : statementAssign
| statementIf
;
statementAssign : var=VAR ASSIGN expr=expression #overwriteStatementAssign
| var=VAR PLUS ASSIGN expr=expression #addStatementAssign
| var=VAR MINUS ASSIGN expr=expression #subStatementAssign
;
;
expression : BRACKET_OPEN expr=expression BRACKET_CLOSE #priorityExp
| left=expression operand=('*'|'/') right=expression #mulDivExp
| left=expression operand=('+'|'-') right=expression #addSubExp
| <assoc=right> left=expression POWER right=expression #powExp
| variable=VAR #varExp
| number=NUMBER #numExp
;
BRACKET_OPEN : '(' ;
BRACKET_CLOSE : ')' ;
ASTERISK : '*' ;
SLASH : '/' ;
PLUS : '+' ;
MINUS : '-' ;
POWER : '^' ;
MODULO : '%' ;
ASSIGN : '=' ;
NUMBER : [0-9]+ ;
VAR : [a-z][a-zA-Z0-9\-]* ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
If you want the expression
xyz = ( ( a + b ) + ( c + d) )
Then you'll need to control the binding of the + operator with parentheses just as you've done in your expected output. Otherwise, with
xyz = ( a + (b + (c + d) ) )
is the way the parser is going to parse it because all the + operators have the same precedence, and the parser continues parsing until it reaches the end of the expression.
It recursively applies
left=expression operand=('+'|'-') right=expression
until the expression is completed.
and you get the grouping you got. So use those parentheses, that's what they're for if you want to force the order of expression evaluation. ;)
If you change your input to
xyz = a * b + c + d
you'll see what I mean about the precedence, because the multiplication rule appears before the addition rule -- and hence binds earlier -- which is a mathematical convention (lacking parentheses to group terms.)
You're doing it right and the parser is too. Just group what you want if you want a specific binding order.

ANTLR single grammar input mismatch

So far I've been testing with ANTLR4, I've tested with this single grammar:
grammar LivingDSLParser;
options{
language = Java;
//tokenVocab = LivingDSLLexer;
}
living
: query #QUERY
;
query
: K_QUERY entity K_WITH expr
;
entity
: STAR #ALL
| D_FUAS #FUAS
| D_RESOURCES #RESOURCES
;
field
: ((D_FIELD | D_PROPERTY | D_METAINFO) DOT)? IDENTIFIER
| STAR
;
expr
: field
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<<' | '>>' | '&' | '|' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '==' | '!=' | '<>' ) expr
| expr K_AND expr
| expr K_OR expr
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
K_QUERY : Q U E R Y;
K_WITH: W I T H;
K_OR: O R;
K_AND: A N D;
D_FUAS : F U A S;
D_RESOURCES : R E S O U R C E S;
D_METAINFO: M E T A I N F O;
D_PROPERTY: P R O P E R T Y;
D_FIELD: F I E L D;
STAR : '*';
PLUS : '+';
MINUS : '-';
PIPE2 : '||';
DIV : '/';
MOD : '%';
LT2 : '<<';
GT2 : '>>';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '==';
NOT_EQ1 : '!=';
NOT_EQ2 : '<>';
OPEN_PAR : '(';
CLOSE_PAR : ')';
SCOL : ';';
DOT : '.';
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
//so on...
As far I've been able to figure out, when I write some input like this:
query fuas with field.xxx == property.yyy
, it should match.
However I recive this message:
LivingDSLParser::living:1:0: mismatched input 'query' expecting K_QUERY
I have no idea where's the problem and neither what this message means.
Whenever ANTLR can match 2 or more rules to some input, it chooses the first rule. Since both IDENTIFIER and K_QUERY match the input "query"
, and IDENTIFIER is defined before K_QUERY, IDENTIFIER is matched.
Solution: move your IDENTIFIER rule below your keyword definitions.

Advice on handling an ambiguous operator in an ANTLR 4 grammar

I am writing an antlr grammar file for a dialect of basic. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character which is used for both equality tests as well as assignment.
For example, this is a valid statement
t = (x = 5) And (y = 3)
This evaluates if x is EQUAL to 5, if y is EQUAL to 3 then performs a logical AND on those results and ASSIGNS the result to t.
My grammar will parse this; albeit incorrectly, but I think that will resolve itself once the ambiguity is resolved .
How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs equality test) in my visitor and\or listener implementation during code generation
2) Is there a better way to define the grammar such that it is already sorted out
Would someone be able to simply point me in the right direction as to how best implement this language "feature"?
Also, I have been reading through the Definitive guide to ANTLR4 as well as Language Implementation Patterns looking for a solution to this. It may be there but I have not yet found it.
Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.
parser grammar wlParser;
options { tokenVocab=wlLexer; }
program
: multistatement (NEWLINE multistatement)* NEWLINE?
;
multistatement
: statement (COLON statement)*
;
statement
: declarationStat
| defTypeStat
| assignment
| expression
;
assignment
: lvalue op=ASSIGN expression
;
expression
: <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
| (PLUS|MINUS) expression #signExprStat
| IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
| left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
| left=expression op=BSLASH right=expression #integerDivExprStat
| left=expression op=KW_MOD right=expression #modulusDivExprStat
| left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
| left=string op=AMPERSAND right=string #stringConcatenation
| left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
| left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
| op=KW_LIKE patternString #likeExprStat
| LPAREN expression RPAREN #groupingExprStat
| NUMBER #atom
| string #atom
| IDENTIFIER DATATYPESUFFIX? #atom
;
lvalue
: (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
;
string
: STRING
;
patternString
: DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
;
referenceType
: DATATYPE
;
declarationStat
: constDecl
| varDecl
;
constDecl
: CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
;
varDecl
: VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
;
varDeclPart
: IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
;
listDeclPart
: IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
;
arrayBounds
: LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
;
arrayDimension
: INTEGER (KW_TO INTEGER)?
;
defTypeStat
: DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
;
This is the lexer grammar.
lexer grammar wlLexer;
NUMBER
: INTEGER
| REAL
| BINARY
| OCTAL
| HEXIDECIMAL
;
RELATIONALOPERATORS
: EQUAL
| NEQUAL
| LT
| LTE
| GT
| GTE
;
LOGICALOPERATORS
: KW_OR
| KW_XOR
| KW_AND
| KW_NOT
| KW_IMP
| KW_EQV
;
INSTANCEOF
: KW_IS
| KW_ISA
;
CONSTDECL
: KW_PUBLIC
| KW_PRIVATE
;
DATATYPE
: KW_BOOLEAN
| KW_BYTE
| KW_INTEGER
| KW_LONG
| KW_SINGLE
| KW_DOUBLE
| KW_CURRENCY
| KW_STRING
;
VARDECL
: KW_DIM
| KW_STATIC
| KW_PUBLIC
| KW_PRIVATE
;
LABEL
: IDENTIFIER COLON
;
DEFTYPERANGE
: [a-zA-Z] MINUS [a-zA-Z]
;
DEFTYPES
: KW_DEFBOOL
| KW_DEFBYTE
| KW_DEFCUR
| KW_DEFDBL
| KW_DEFINT
| KW_DEFLNG
| KW_DEFSNG
| KW_DEFSTR
| KW_DEFVAR
;
DATATYPESUFFIX
: PERCENT
| AMPERSAND
| BANG
| POUND
| AT
| DOLLARSIGN
;
STRING
: (DQUOT (DQUOTESC|.)*? DQUOT)
| (LBRACE (RBRACEESC|.)*? RBRACE)
| (PIPE (PIPESC|.|NEWLINE)*? PIPE)
;
fragment DQUOTESC: '\"\"' ;
fragment RBRACEESC: '}}' ;
fragment PIPESC: '||' ;
INTEGER
: DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
REAL
: DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
BINARY
: AMPERSAND B BINARYDIGIT+
;
OCTAL
: AMPERSAND O OCTALDIGIT+
;
HEXIDECIMAL
: AMPERSAND H HEXDIGIT+
;
QUESTIONMARK: '?' ;
COLON: ':' ;
ASSIGN: '=';
SEMICOLON: ';' ;
AT: '#' ;
LPAREN: '(' ;
RPAREN: ')' ;
DQUOT: '"' ;
LBRACE: '{' ;
RBRACE: '}' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
CARAT: '^' ;
PLUS: '+' ;
MINUS: '-' ;
ASTERISK: '*' ;
FSLASH: '/' ;
BSLASH: '\\' ;
AMPERSAND: '&' ;
BANG: '!' ;
POUND: '#' ;
DOLLARSIGN: '$' ;
PERCENT: '%' ;
COMMA: ',' ;
APOSTROPHE: '\'' ;
TWOPERIODS: '..' ;
PERIOD: '.' ;
UNDERSCORE: '_' ;
PIPE: '|' ;
NEWLINE: '\r\n' | '\r' | '\n';
EQUAL: '==' ;
NEQUAL: '<>' | '><' ;
LT: '<' ;
LTE: '<=' | '=<';
GT: '>' ;
GTE: '=<'|'<=' ;
KW_AND: A N D ;
KW_BINARY: B I N A R Y ;
KW_BOOLEAN: B O O L E A N ;
KW_BYTE: B Y T E ;
KW_DATATYPE: D A T A T Y P E ;
KW_DATE: D A T E ;
KW_INTEGER: I N T E G E R ;
KW_IS: I S ;
KW_ISA: I S A ;
KW_LIKE: L I K E ;
KW_LONG: L O N G ;
KW_MOD: M O D ;
KW_NOT: N O T ;
KW_TO: T O ;
KW_FALSE: F A L S E ;
KW_TRUE: T R U E ;
KW_SINGLE: S I N G L E ;
KW_DOUBLE: D O U B L E ;
KW_CURRENCY: C U R R E N C Y ;
KW_STRING: S T R I N G ;
fragment BINARYDIGIT: ('0'|'1') ;
fragment OCTALDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT: '0'..'9' ;
fragment HEXDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A: ('a'|'A');
fragment B: ('b'|'B');
fragment C: ('c'|'C');
fragment D: ('d'|'D');
fragment E: ('e'|'E');
fragment F: ('f'|'F');
fragment G: ('g'|'G');
fragment H: ('h'|'H');
fragment I: ('i'|'I');
fragment J: ('j'|'J');
fragment K: ('k'|'K');
fragment L: ('l'|'L');
fragment M: ('m'|'M');
fragment N: ('n'|'N');
fragment O: ('o'|'O');
fragment P: ('p'|'P');
fragment Q: ('q'|'Q');
fragment R: ('r'|'R');
fragment S: ('s'|'S');
fragment T: ('t'|'T');
fragment U: ('u'|'U');
fragment V: ('v'|'V');
fragment W: ('w'|'W');
fragment X: ('x'|'X');
fragment Y: ('y'|'Y');
fragment Z: ('z'|'Z');
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_~]*
;
LINE_ESCAPE
: (' ' | '\t') UNDERSCORE ('\r'? | '\n')
;
WS
: [ \t] -> skip
;
Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):
grammar Foo;
program:
(statement | exprOtherThanEquality)*
;
statement:
assignment
;
expr:
equality | exprOtherThanEquality
;
exprOtherThanEquality:
boolAndOr
;
boolAndOr:
atom (BOOL_OP expr)*
;
equality:
atom EQUAL expr
;
assignment:
VAR EQUAL expr ENDL
;
atom:
BOOL |
VAR |
INT |
group
;
group:
LEFT_PARENTH expr RGHT_PARENTH
;
ENDL : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL : '=' ;
BOOL:
'true' | 'false'
;
BOOL_OP:
'and' | 'or'
;
VAR:
[A-Za-z_]+ [A-Za-z_0-9]*
;
INT:
'-'? [0-9]+
;
WS:
[ \t\r\n] -> skip
;
Here is the parse tree for the input: t = (x = 5) and (y = 2);
In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:

Resources