Error when generating a grammar for chess PGN files - antlr4

I made this ANTLR4 grammar in order to parse a PGN inside my Java programm, but I can't manage to solve the ambiguity in it :
grammar Pgn;
file: game (NEWLINE+ game)*;
game: (tag+ NEWLINE+)? notation;
tag: [TAG_TYPE "TAG_VALUE"];
notation: move+ END_RESULT?;
move: MOVE_NUMBER\. MOVE_DESC MOVE_DESC #CompleteMove
| MOVE_NUMBER\. MOVE_DESC #OnlyWhiteMove
| MOVE_NUMBER\.\.\. MOVE_DESC #OnlyBlackMove
;
END_RESULT: '1-0'
| '0-1'
| '1/2-1/2'
;
TAG_TYPE: LETTER+;
TAG_VALUE: .*;
MOVE_NUMBER: DIGIT+;
MOVE_DESC: .*;
NEWLINE: \r? \n;
SPACES: [ \t]+ -> skip;
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
And this is the error output :
$ antlr4 Pgn.g4
error(50): Pgn.g4:6:6: syntax error: 'TAG_TYPE "TAG_VALUE"' came as a complete surprise to me while matching alternative
I think the error come from the fact that " [ ", " ] " and ' " ' can't be used freely, neither in Grammar nor Lexer.
Helps or advices are welcome.

Looking at the specs for PGN, http://www.thechessdrum.net/PGN_Reference.txt, I see there's a formal definition of the PGN format there:
18: Formal syntax
<PGN-database> ::= <PGN-game> <PGN-database>
<empty>
<PGN-game> ::= <tag-section> <movetext-section>
<tag-section> ::= <tag-pair> <tag-section>
<empty>
<tag-pair> ::= [ <tag-name> <tag-value> ]
<tag-name> ::= <identifier>
<tag-value> ::= <string>
<movetext-section> ::= <element-sequence> <game-termination>
<element-sequence> ::= <element> <element-sequence>
<recursive-variation> <element-sequence>
<empty>
<element> ::= <move-number-indication>
<SAN-move>
<numeric-annotation-glyph>
<recursive-variation> ::= ( <element-sequence> )
<game-termination> ::= 1-0
0-1
1/2-1/2
*
<empty> ::=
I highly recommend you to let your ANTLR grammar resemble that as much as possible. I made a small project with ANTLR 4 on Github which you can try out: https://github.com/bkiers/PGN-parser
The grammar (without comments):
parse
: pgn_database EOF
;
pgn_database
: pgn_game*
;
pgn_game
: tag_section movetext_section
;
tag_section
: tag_pair*
;
tag_pair
: LEFT_BRACKET tag_name tag_value RIGHT_BRACKET
;
tag_name
: SYMBOL
;
tag_value
: STRING
;
movetext_section
: element_sequence game_termination
;
element_sequence
: (element | recursive_variation)*
;
element
: move_number_indication
| san_move
| NUMERIC_ANNOTATION_GLYPH
;
move_number_indication
: INTEGER PERIOD?
;
san_move
: SYMBOL
;
recursive_variation
: LEFT_PARENTHESIS element_sequence RIGHT_PARENTHESIS
;
game_termination
: WHITE_WINS
| BLACK_WINS
| DRAWN_GAME
| ASTERISK
;
WHITE_WINS
: '1-0'
;
BLACK_WINS
: '0-1'
;
DRAWN_GAME
: '1/2-1/2'
;
REST_OF_LINE_COMMENT
: ';' ~[\r\n]* -> skip
;
BRACE_COMMENT
: '{' ~'}'* '}' -> skip
;
ESCAPE
: {getCharPositionInLine() == 0}? '%' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
STRING
: '"' ('\\\\' | '\\"' | ~[\\"])* '"'
;
INTEGER
: [0-9]+
;
PERIOD
: '.'
;
ASTERISK
: '*'
;
LEFT_BRACKET
: '['
;
RIGHT_BRACKET
: ']'
;
LEFT_PARENTHESIS
: '('
;
RIGHT_PARENTHESIS
: ')'
;
LEFT_ANGLE_BRACKET
: '<'
;
RIGHT_ANGLE_BRACKET
: '>'
;
NUMERIC_ANNOTATION_GLYPH
: '$' [0-9]+
;
SYMBOL
: [a-zA-Z0-9] [a-zA-Z0-9_+#=:-]*
;
SUFFIX_ANNOTATION
: [?!] [?!]?
;
UNEXPECTED_CHAR
: .
;
For a version with comments, see: https://github.com/bkiers/PGN-parser/blob/master/src/main/antlr4/nl/bigo/pp/PGN.g4

Related

ANTLR4 Grammar Issue with Decimal Numbers

I'm new to ANTLR and using ANTLR4 (4.7.2 Jar file). I'm currently working on Oracle Parser.
I'm having issues with Decimal numbers. I have kept only the relevant parts.
My grammar file is as below.
Now when I parse the below statement it is fine. ".1" is a valid number in my case.
BEGIN a NUMBER:=.1; END;
I haven't shown the grammar but the below are valid cases for me in Oracle.
a NUMBER:= .1; // with Space after operator
a NUMBER:=1.1; // without Space after operator
a NUMBER:=1; // without Space after operator
a NUMER:= 3; // with Space after operator
Now I need to create a tablespace as below.
CREATE TABLESPACE tbs_01 DATAFILE +DATA/BR/CONTROLFILE/Current.260.750;
Here the Digits 260 & 750 are tokenized along with the DOT (as per the definition of NUMERIC_LITERAL). I would want this to be 2 separate digits separated by DOT (and assigned to filenumber and incarnation_number resp as shown in the grammar).
How do I do this?
I have tried using _input.LA(-1)!='.'}? etc but was not working correctly for me.
I tried many other steps mentioned (most solutions were for ANTLR3 and not working in ANTLR4). Is there a simple way to do this in LEXER? I do not want to write a Parser rule to split the decimal digits.
grammar Oracle;
parse
: ( sql_statements | error )* EOF
;
error
: UNEXPECTED_CHAR
{
throw new RuntimeException("UNEXPECTED_CHAR=" + $UNEXPECTED_CHAR.text);
}
;
sql_statements
: 'CREATE' 'TABLESPACE' tablespace_name 'DATAFILE' fully_qualified_file_name ';'
| 'BEGIN' var1 'NUMBER' ':=' num1 ';' 'END' ';'
;
tablespace_name : IDENTIFIER;
fully_qualified_file_name : K_PLUS_SIGN diskgroup_name K_SOLIDUS db_name K_SOLIDUS file_type K_SOLIDUS file_type_tag '.' filenumber '.' incarnation_number;
diskgroup_name : IDENTIFIER;
db_name : IDENTIFIER;
file_type : IDENTIFIER;
file_type_tag : IDENTIFIER;
filenumber : NUMERIC_LITERAL;
incarnation_number : NUMERIC_LITERAL;
var1 : IDENTIFIER;
num1 : NUMERIC_LITERAL;
IDENTIFIER : [a-zA-Z_] ([a-zA-Z] | '$' | '_' | '#' | DIGIT)* ;
K_PLUS_SIGN : '+';
K_SOLIDUS : '/';
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
;
SPACES : [ \u000B\t\r\n] -> skip;
WS : [ \t\r\n]+ -> skip;
UNEXPECTED_CHAR : . ;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
Your Dsl has a natural ambiguity: in some instances, numbers are integers and in others, decimals.
If the Dsl provides sufficient guard conditions, Antlr modes can be used to isolate the instances. For example, in the given Dsl, decimal numbers appear to always occur between := and ; guards.
...
K_ASSIGN : ':=' -> pushMode(Decimals);
K_SEMI : ';' ;
NUMERIC_LITERAL : DIGIT+ ;
...
mode Decimals;
D_SEMI : ';' -> type(K_SEMI), popMode ;
NUMERIC:
DIGIT+ ( '.' DIGIT+ )? ( E ('+'|'-')? DIGIT+ )? 'D'
| 'F')?
| '.' DIGIT+ ( E ('+'|'-')? DIGIT+ )? ('D' | 'F')?
-> type(NUMERIC_LITERAL);

Problem matching single digits when integers are defined as tokens

I'm having problem trying to get a grammar working. Here is the simplified version. The language I try to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testint(9);. It matches as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work fine.
I think your problem comes from not using lexer rules clearly defining what you want.
When you added this rule :
decMin : '0' | '1' ;
You in fact created an unnamed lexer rule that matches '0' and another one matching '1' :
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
And your parser rule became :
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
Problem : now, when your lexer see
testing7(1);
**it doesn't see **
callName '(' callParameter ')' ';'
anymore, it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that.
And that is because lexer rules are effective before the parser rules.
To solve your problem, define your lexer rules efficiently, It would probably look like that :
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it will cut your integers at the lexer step (cutting 1245 into 1 2 4 5 in lexer rules, and the considering the parser rule as uniting 1 2 4 and 5).
I think it would be better to be less precise and simply write :
decMin: integer | identifier;
But then it depends on what you do with your grammar...

Antlr4 Mismatch input

First of all, I have read the solutions for the following similar questions: q1 q2 q3
Still I don't understand why I get the following message:
line 1:0 missing 'PROGRAM' at 'PROGRAM'
when I try to match the following:
PROGRAM test
BEGIN
END
My grammar:
grammar Wengo;
program : PROGRAM id BEGIN pgm_body END ;
id : IDENTIFIER ;
pgm_body : decl func_declarations ;
decl : string_decl decl | var_decl decl | empty ;
string_decl : STRING id ASSIGN str SEMICOLON ;
str : STRINGLITERAL ;
var_decl : var_type id_list SEMICOLON ;
var_type : FLOAT | INT ;
any_type : var_type | VOID ;
id_list : id id_tail ;
id_tail : COMA id id_tail | empty ;
param_decl_list : param_decl param_decl_tail | empty ;
param_decl : var_type id ;
param_decl_tail : COMA param_decl param_decl_tail | empty ;
func_declarations : func_decl func_declarations | empty ;
func_decl : FUNCTION any_type id (param_decl_list) BEGIN func_body END ;
func_body : decl stmt_list ;
stmt_list : stmt stmt_list | empty ;
stmt : base_stmt | if_stmt | loop_stmt ;
base_stmt : assign_stmt | read_stmt | write_stmt | control_stmt ;
assign_stmt : assign_expr SEMICOLON ;
assign_expr : id ASSIGN expr ;
read_stmt : READ ( id_list )SEMICOLON ;
write_stmt : WRITE ( id_list )SEMICOLON ;
return_stmt : RETURN expr SEMICOLON ;
expr : expr_prefix factor ;
expr_prefix : expr_prefix factor addop | empty ;
factor : factor_prefix postfix_expr ;
factor_prefix : factor_prefix postfix_expr mulop | empty ;
postfix_expr : primary | call_expr ;
call_expr : id ( expr_list ) ;
expr_list : expr expr_list_tail | empty ;
expr_list_tail : COMA expr expr_list_tail | empty ;
primary : ( expr ) | id | INTLITERAL | FLOATLITERAL ;
addop : ADD | MIN ;
mulop : MUL | DIV ;
if_stmt : IF ( cond ) decl stmt_list else_part ENDIF ;
else_part : ELSE decl stmt_list | empty ;
cond : expr compop expr | TRUE | FALSE ;
compop : LESS | GREAT | EQUAL | NOTEQUAL | LESSEQ | GREATEQ ;
while_stmt : WHILE ( cond ) decl stmt_list ENDWHILE ;
control_stmt : return_stmt | CONTINUE SEMICOLON | BREAK SEMICOLON ;
loop_stmt : while_stmt | for_stmt ;
init_stmt : assign_expr | empty ;
incr_stmt : assign_expr | empty ;
for_stmt : FOR ( init_stmt SEMICOLON cond SEMICOLON incr_stmt ) decl stmt_list ENDFOR ;
COMMENT : '--' ~[\r\n]* -> skip ;
WS : [ \t\r\n]+ -> skip ;
NEWLINE : [ \n] ;
EMPTY : $ ;
KEYWORD : PROGRAM|BEGIN|END|FUNCTION|READ|WRITE|IF|ELSE|ENDIF|WHILE|ENDWHILE|RETURN|INT|VOID|STRING|FLOAT|TRUE|FALSE|FOR|ENDFOR|CONTINUE|BREAK ;
OPERATOR : ASSIGN|ADD|MIN|MUL|DIV|EQUAL|NOTEQUAL|LESS|GREAT|LBRACKET|RBRACKET|SEMICOLON|COMA|LESSEQ|GREATEQ ;
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]* ;
INTLITERAL : [0-9]+ ;
FLOATLITERAL : [0-9]*'.'[0-9]+ ;
STRINGLITERAL : '"' (~[\r\n"] | '""')* '"' ;
PROGRAM : 'PROGRAM';
BEGIN : 'BEGIN';
END : 'END';
FUNCTION : 'FUNCTION';
READ : 'READ';
WRITE : 'WRITE';
IF : 'IF';
ELSE : 'ELSE';
ENDIF : 'ENDIF';
WHILE : 'WHILE';
ENDWHILE : 'ENDWHILE';
RETURN : 'RETURN';
INT : 'INT';
VOID : 'VOID';
STRING : 'STRING';
FLOAT : 'FLOAT' ;
TRUE : 'TRUE';
FALSE : 'FALSE';
FOR : 'FOR';
ENDFOR : 'ENDFOR';
CONTINUE : 'CONTINUE';
BREAK : 'BREAK';
ASSIGN : ':=';
ADD : '+';
MIN : '-';
MUL : '*';
DIV : '/';
EQUAL : '=';
NOTEQUAL : '!=';
LESS : '<';
GREAT : '>';
LBRACKET : '(';
RBRACKET : ')';
SEMICOLON : ';';
COMA : ',';
LESSEQ : '<=';
GREATEQ : '>=';
From what I've read, I think there's a mismatch between KEYWORD and PROGRAM, but removing KEYWORD altogether does not solve the problem.
EDIT:
Removing KEYWORD gives the following message:
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
This my grun output when KEYWORD is available:
[#0,0:6='PROGRAM',<KEYWORD>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<KEYWORD>,2:0]
[#3,19:21='END',<KEYWORD>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 1:0 mismatched input 'PROGRAM' expecting 'PROGRAM'
(program PROGRAM test BEGIN END)
This is the output when KEYWORD is removed:
[#0,0:6='PROGRAM',<'PROGRAM'>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<'BEGIN'>,2:0]
[#3,19:21='END',<'END'>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
(program PROGRAM (id test) BEGIN (pgm_body decl func_declarations) END)
The error about "missing 'PROGRAM'" has been solved when you removed the KEYWORD rule (note that you should also remove the OPERATOR rule for the same reasons).
The error you're encountering now is completely unrelated.
Your current problem concerns the definition of empty, which you didn't show. You've said that you tried both EMPTY : $ ; and EMPTY : ^$ ; (and then presumably empty: EMPTY;), but none of those even compile, so they wouldn't cause the parse error you posted. Either way, the concept of an EMPTY token can't work. When would such a token be generated? Once between every other token? In that case, you'd get a lot of "unexpected EMPTY" errors. No, the whole point of an empty rule is that it should succeed without consuming any tokens.
To achieve that, you can just define empty : ; and remove EMPTY altogether. Alternatively you could remove empty as well and just use an empty alternative (i.e. | ;) wherever you're currently using empty. Either approach will make your code work, but there's a better way:
You're using empty as the base case for rules that basically amount to lists. ANTLR offers the repetition operators * (0 or more) , + (1 or more) as well as the ? operator to make things optional. These allow you to define lists non-recursively and without an empty rule. For example stmt_list could be defined like this:
stmt_list : stmt* ;
And id_list like this:
id_list : (id (',' id)*)? ;
On an unrelated note, your grammar can simplified greatly by making use of the fact that ANTLR 4 supports direct left recursion, so you can get rid of all the different expression rules and just have one that's left-recursive.
That'd give you:
expr : primary
| id '(' expr_list ')'
| expr mulop expr
| expr addop expr
;
And the rules expr_prefix, factor, factor_prefix and postfix_expr and call_expr could all be removed.

how can I refactor this ANTLR4 grammar so that it isn't mutually left recursive?

I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
#header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left recursive invocation needs to be the first, so no parenthesis can be placed before it.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correct. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notis that Arb.BorI contains the word 'or'.
I think my problem is that my 'booleanOps' in the grammar override 'datakildefelt'
So... My problem is how do I get my grammar correct - I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
Rules that come first always have precedence. In your case you need to move AND and OR before LETTER. Also there is the same problem with GTEQ and LTEQ, maybe somewhere else too.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start with capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. Input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with lowercase letter) do not come to play yet. So, to make "BorI" parse as single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ |);
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Resources