ANTLR4 line comments and text parsing issue - antlr4

I'm writing a parser for a C++-header-style file and am having trouble handling line comments correctly.
CustomLexer.g4:
lexer grammar CustomLexer;
SPACES : [ \r\n\t]+ -> skip;
COMMENT_START : '//' -> pushMode(COMMENT_MODE);
PRAGMA : '#pragma';
SECTION : '#section';
DEFINE : '#define';
UNDEF : '#undef';
IF : '#if';
ELIF : '#elif';
ELSE : '#else';
IFDEF : '#ifdef';
IFNDEF : '#ifndef';
ENDIF : '#endif';
ENABLED : 'ENABLED';
DISABLED : 'DISABLED';
EITHER : 'EITHER';
ANY : 'ANY';
DEFINED : 'defined';
BOTH : 'BOTH';
BOOLEAN_LITERAL : 'true' | 'false';
STRING : '"' .*? '"';
HEXADECIMAL : '0x' ([a-fA-F0-9])+;
LITERAL_SUFFIX : 'L'|'u'|'U'|'Lu'|'LU'|'uL'|'UL'|'f'|'F';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT : '/**' .*? '*/';
NUMBER : ('-')? Int ('.' Digit*)? | '0';
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE : '{' .*? '}';
OPAREN : '(';
CPAREN : ')';
OBRACE : '{';
CBRACE : '}';
ADD : '+';
SUBTRACT : '-';
MULTIPLY : '*';
DIVIDE : '/';
MODULUS : '%';
OR : '||';
AND : '&&';
EQUALS : '==';
NEQUALS : '!=';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
EXCL : '!';
QMARK : '?';
COLON : ':';
COMA : ',';
OTHER : .;
fragment Int : [0-9] Digit* | '0';
fragment Digit : [0-9];
mode COMMENT_MODE;
COMMENT_MODE_DEFINE : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION : '#section' -> type(SECTION), popMode;
COMMENT_MODE_IF : '#if' -> type(IF), popMode;
COMMENT_MODE_ENDIF : '#endif' -> type(ENDIF), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
COMMENT_MODE_PART : ~[\r\n];
CustomParser.g4:
parser grammar CustomParser;
options { tokenVocab=CustomLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| comment? undefDirective
| comment? ifDirective
| comment? ifdefDirective
| comment? ifndefDirective
| sectionLineComment
| comment
;
pragmaDirective
: PRAGMA char_sequence
;
subDirectives
: ifDirective+
| ifdefDirective+
| ifndefDirective+
| defineDirective+
| undefDirective+
| comment+
;
ifdefDirective
: IFDEF IDENTIFIER subDirectives+ ENDIF
;
ifndefDirective
: IFNDEF IDENTIFIER subDirectives+ ENDIF
;
ifDirective
: ifStatement elseIfStatement* elseStatement? ENDIF
;
ifStatement
: IF expression (subDirectives)*
;
elseIfStatement
: ELIF expression (subDirectives)*
;
elseStatement
: ELSE (subDirectives)*
;
defineDirective
: BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER BOOLEAN_LITERAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER (char_sequence COMA?)+ info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OPAREN? NUMBER LITERAL_SUFFIX? CPAREN? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER HEXADECIMAL info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER STRING info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER OBRACE? (ARRAY_SEQUENCE COMA?)+ CBRACE? info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER expression info_comment?
| BLOCK_COMMENT? COMMENT_START? DEFINE IDENTIFIER info_comment?
;
undefDirective
: BLOCK_COMMENT? COMMENT_START? UNDEF IDENTIFIER info_comment?;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment+
;
expression
: simpleExpression
| customExpression
| enabledExpression
| disabledExpression
| bothExpression
| eitherExpression
| anyExpression
| definedExpression
| comparisonExpression
| arithmeticExpression
;
arithmeticExpression
: arithmeticExpression (MULTIPLY | DIVIDE) arithmeticExpression
| arithmeticExpression (ADD | SUBTRACT) arithmeticExpression
| OPAREN arithmeticExpression CPAREN
| expressionIdentifier
;
comparisonExpression
: comparisonExpression (EQUALS | NEQUALS | GTEQUALS | LTEQUALS | GT | LT) comparisonExpression
| comparisonExpression (AND | OR) comparisonExpression
| EXCL? OPAREN comparisonExpression CPAREN
| eitherExpression
| enabledExpression
| bothExpression
| anyExpression
| definedExpression
| disabledExpression
| customExpression
| simpleExpression
| expressionIdentifier
;
enabledExpression : EXCL? OPAREN? ENABLED OPAREN IDENTIFIER CPAREN CPAREN?;
disabledExpression : EXCL? OPAREN? DISABLED OPAREN IDENTIFIER CPAREN CPAREN?;
bothExpression : EXCL? OPAREN? BOTH OPAREN identifiers identifiers CPAREN CPAREN?;
eitherExpression : EXCL? OPAREN? EITHER OPAREN identifiers+ CPAREN CPAREN?;
anyExpression : EXCL? OPAREN? ANY OPAREN identifiers+ CPAREN CPAREN?;
definedExpression : EXCL? OPAREN? DEFINED OPAREN IDENTIFIER CPAREN CPAREN?;
customExpression : EXCL? IDENTIFIER OPAREN IDENTIFIER CPAREN;
simpleExpression : EXCL? IDENTIFIER;
expressionIdentifier : IDENTIFIER | NUMBER;
identifiers
: IDENTIFIER COMA?
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
info_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
It works for about 95% of the directives and comments in my header file, but a few scenarios are still not handled correctly:
1. Line comments
Input:
//1
//#define ID1 //2
This is the resulting parse tree:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
I want the line_comment on line 07 to become part of the defineDirective on line 09, with its "//" resolved as that directive's COMMENT_START token.
2. Define directive with text
Other define rules are working correctly but:
#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)
These define directives fail to parse and raise an exception.
I would appreciate any help with resolving these 2 problems I have at this moment or any recommendations on how my lexer/parser can be optimized.
Thanks in advance!
=================================Update===================================
First test case:
Input:
//1
//#define ID1 //2
Current result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. line_comment
08. COMMENT_START: "//"
09. defineDirective:8
10. DEFINE: "#define"
11. IDENTIFIER: "ID1"
12. info_comment
13. COMMENT_START: "//"
14. COMMENT_MODE_PART: "2"
15.<EOF>
Expected result:
01. compilationUnit
02. statement:2
03. comment:2
04. line_comment
05. COMMENT_START: "//"
06. COMMENT_MODE_PART: "1"
07. defineDirective:8
08. COMMENT_START: "//"
09. DEFINE: "#define"
10. IDENTIFIER: "ID1"
11. info_comment
12. COMMENT_START: "//"
13. COMMENT_MODE_PART: "2"
14.<EOF>
Second test case:
Input:
#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL
Current result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \""
07. IDENTIFIER: "PREHEAT_1_LABEL"
<EOF>
Expected result:
01.compilationUnit
02. statement:2
03. defineDirective:5
04. DEFINE: "#define"
05. IDENTIFIER: "USER_DESC_2"
06. STRING: "\"Preheat for \" PREHEAT_1_LABEL"
<EOF>
In the expected result, STRING holds the whole replacement text. I don't know whether it is better to extend the STRING lexer token definition or to introduce a new parser rule to cover this case.

Combining this post, your previous question and Bart's answer, and supposing that a define directive has the form
optional_// #define IDENTIFIER replacement_value optional_line_comment
and given the input file input.txt
/**
* BLOCK COMMENT
*/
#pragma once
//#pragma once
/**
* BLOCK COMMENT
*/
#define CONFIGURATION_H_VERSION 12345
#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd
#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30 { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0
//================================================================
//============================= INFO =============================
//================================================================
/**
* SEPARATE BLOCK COMMENT
*/
// Line 1
// Line 2
//
//======================= this is a section ======================
// #section test
// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5
// Line 6
#define IDENTIFIER_THREE
//1
//#define ID1 //2
#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL
#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)
#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)
If I have understood your two questions correctly, the grammar must produce a statement for each directive, and for each comment that is not followed by a directive. A directive can be preceded by a comment, which becomes part of the statement. A directive can be commented out, and it can be followed by an inline line comment (that is, a comment on the same line).
Grammar Header.g4 (without trace) :
grammar Header;
compilationUnit
@init {System.out.println("Last update 1253");}
: ( statement {System.out.println("Statement found : `" + $statement.text + "`");}
)* EOF
;
statement
: comment? pragma_directive
| comment? define_directive
| section
| comment
;
pragma_directive
: PRAGMA char_sequence
;
define_directive
: define_identifier replacement_comment[$define_identifier.statement_line]
;
define_identifier returns [int statement_line]
: LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
;
replacement_comment [int statement_line]
: anything+ line_comment?
| {getCurrentToken().getLine() == $statement_line}? line_comment
| {getCurrentToken().getLine() != $statement_line}?
;
section
: LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment
| SEPARATOR ( IDENTIFIER | EQUALS )*
;
line_comment
: LINE_COMMENT_DELIMITER anything*
;
anything
: IDENTIFIER
| CHAR_SEQUENCE
| STRING
| NUMBER
| OTHER
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA : '#pragma';
SECTION : '#section';
DEFINE : '#define';
STRING : '"' .*? '"';
EQUALS : '='+ ;
SEPARATOR : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS : [ \t]+ -> channel(HIDDEN) ;
NL : ( '\r' '\n'?
| '\n'
) -> channel(HIDDEN) ;
OTHER : . ;
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.9-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Header.g4
$ javac Header*.java
$ grun Header compilationUnit -tokens input.txt
[#0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[#1,24:24='\n',<NL>,channel=1,3:3]
[#2,25:31='#pragma',<'#pragma'>,4:0]
[#3,32:32=' ',<WS>,channel=1,4:7]
[#4,33:36='once',<IDENTIFIER>,4:8]
[#5,37:37='\n',<NL>,channel=1,4:12]
...
[#84,315:321='#define',<'#define'>,19:0]
[#85,322:322=' ',<WS>,channel=1,19:7]
[#86,323:340='IDENTIFIER_20_30_A',<IDENTIFIER>,19:8]
[#87,341:343=' ',<WS>,channel=1,19:26]
[#88,344:344='[',<OTHER>,19:29]
[#89,345:345=' ',<WS>,channel=1,19:30]
[#90,346:346='1',<NUMBER>,19:31]
[#91,347:347=',',<OTHER>,19:32]
...
[#139,644:668='//=======================',<SEPARATOR>,34:0]
[#140,669:669=' ',<WS>,channel=1,34:25]
[#141,670:673='this',<IDENTIFIER>,34:26]
...
[#257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1253
Statement found : `/**
* BLOCK COMMENT
*/
#pragma once`
Statement found : `//#pragma once`
...
Statement found : `#define DEFAULT_A 10.0`
...
Statement found : `// Line 2`
Statement found : `//`
...
Statement found : `//#define IDENTIFIER_3 Version.h // Line 5`
Statement found : `// Line 6
#define IDENTIFIER_THREE`
Statement found : `//1
//#define ID1 //2`
Statement found : `#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
Statement found : `#define USER_DESC_2 "abc " DEF "ABC2 \" M100 (100)`
Statement found : `#define USER_GCODE_2 "M140 S" STRINGIFY(PREHEAT_1_TEMP_BED) "\nM104 S" STRINGIFY(PREHEAT_1_TEMP_HOTEND)`
Grammar Header_trace.g4 (with trace) :
grammar Header_trace;
compilationUnit
@init {System.out.println("Last update 1137");}
: statement[this.getRuleNames() /* parser rule names */]* EOF
;
statement [String[] rule_names]
locals [String rule_name, int start_line, int end_line]
@after { System.out.print("The next statement is a " + $rule_name);
$start_line = $start.getLine();
$end_line = $stop.getLine();
if ($start_line == $end_line)
System.out.print(" on line " + $start_line);
else
System.out.print(" on lines " + $start_line + " to " + $end_line);
System.out.println(" : ");
System.out.println("`" + $text + "`");
}
: comment? pragma_directive [rule_names] {$rule_name = $pragma_directive.rule_name;}
| comment? define_directive [rule_names] {$rule_name = $define_directive.rule_name;}
| section [rule_names] {$rule_name = $section.rule_name;}
| comment_only [rule_names] {$rule_name = $comment_only.rule_name;}
// comment_only can be replaced by comment when the trace is removed
;
pragma_directive [String[] rule_names] returns [String rule_name]
: PRAGMA char_sequence
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
define_directive [String[] rule_names] returns [String rule_name]
locals [String dir_rule_name, int statement_line = 0]
@init {$dir_rule_name = rule_names[_localctx.getRuleIndex()];}
: define_identifier replacement_comment[$dir_rule_name, $define_identifier.statement_line]
{ $rule_name = $replacement_comment.rule_name; }
;
define_identifier returns [int statement_line]
: LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();} IDENTIFIER
;
replacement_comment [String dir_rule_name, int statement_line] returns [String rule_name]
: any+=anything+ line_comment?
{ $rule_name = $dir_rule_name + " with replacement value";
System.out.print(" anything matched : " );
if ($any.size() > 0)
for (AnythingContext r : $any)
System.out.print(r.getText());
else
System.out.print("(nothing)");
System.out.println();
}
| {getCurrentToken().getLine() == $statement_line}?
line_comment
{ $rule_name = $dir_rule_name + " WITHOUT replacement value and with inline line comment"; }
| {getCurrentToken().getLine() != $statement_line}?
{ $rule_name = $dir_rule_name + " WITHOUT replacement value"; }
;
section [String[] rule_names] returns [String rule_name]
: LINE_COMMENT_DELIMITER OTHER? SECTION char_sequence
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
comment_only [String[] rule_names] returns [String rule_name]
: comment
{ $rule_name = rule_names[$ctx.getRuleIndex()]; }
;
comment
: BLOCK_COMMENT
| line_comment
| SEPARATOR ( IDENTIFIER | EQUALS )*
;
line_comment
: LINE_COMMENT_DELIMITER anything*
;
anything
: IDENTIFIER
| CHAR_SEQUENCE
| STRING
| NUMBER
| OTHER
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
LINE_COMMENT_DELIMITER : '//' ;
PRAGMA : '#pragma';
SECTION : '#section';
DEFINE : '#define';
STRING : '"' .*? '"';
EQUALS : '='+ ;
SEPARATOR : LINE_COMMENT_DELIMITER EQUALS ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
NUMBER : [0-9.]+ ;
BLOCK_COMMENT : '/**' .*? '*/';
WS : [ \t]+ -> channel(HIDDEN) ;
NL : ( '\r' '\n'?
| '\n'
) -> channel(HIDDEN) ;
OTHER : .;
Execution :
$ a4 Header_trace.g4
$ javac Header*.java
$ grun Header_trace compilationUnit -tokens input.txt
[#0,0:23='/**\n * BLOCK COMMENT\n */',<BLOCK_COMMENT>,1:0]
[#1,24:24='\n',<NL>,channel=1,3:3]
[#2,25:31='#pragma',<'#pragma'>,4:0]
[#3,32:32=' ',<WS>,channel=1,4:7]
[#4,33:36='once',<IDENTIFIER>,4:8]
[#5,37:37='\n',<NL>,channel=1,4:12]
...
[#257,1103:1102='<EOF>',<EOF>,51:0]
Last update 1137
The next statement is a pragma_directive on lines 1 to 4 :
`/**
* BLOCK COMMENT
*/
#pragma once`
...
anything matched : 10.0
The next statement is a define_directive with replacement value on line 20 :
`#define DEFAULT_A 10.0`
The next statement is a comment_only on line 22 :
`//================================================================`
...
The next statement is a comment_only on line 31 :
`// Line 2`
The next statement is a comment_only on line 32 :
`//`
...
anything matched : Version.h
The next statement is a define_directive with replacement value on line 39 :
`//#define IDENTIFIER_3 Version.h // Line 5`
The next statement is a define_directive WITHOUT replacement value on lines 41 to 42 :
`// Line 6
#define IDENTIFIER_THREE`
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 44 to 45 :
`//1
//#define ID1 //2`
anything matched : "Preheat for "PREHEAT_1_LABEL
The next statement is a define_directive with replacement value on line 47 :
`#define USER_DESC_2 "Preheat for " PREHEAT_1_LABEL`
...
Thanks to the optional LINE_COMMENT_DELIMITER? at the beginning of the define directive rule (as you did with COMMENT_START?), and because no special token follows //, it is no longer necessary to switch to COMMENT_MODE when a line comment delimiter is encountered.
There was one difficulty with this first approach :
define_directive
locals [int statement_line]
: LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER anything+ line_comment?
| LINE_COMMENT_DELIMITER? DEFINE {$statement_line = getCurrentToken().getLine();}
IDENTIFIER same_line_line_comment[$statement_line]
| LINE_COMMENT_DELIMITER? DEFINE IDENTIFIER
;
same_line_line_comment [int statement_line]
: {getCurrentToken().getLine() == $statement_line}?
line_comment
;
The following lines
// Line 6
#define IDENTIFIER_THREE
//1
were parsed with the second alternative instead of the third :
compare statement line 42 with comment line 44
line 44:0 rule same_line_line_comment failed predicate: {getCurrentToken().getLine() == $statement_line}?
The next statement is a define_directive WITHOUT replacement value and with inline line comment on lines 41 to 42 :
`// Line 6
#define IDENTIFIER_THREE`
Even though the predicate guarding the subrule same_line_line_comment evaluated to false, it had no effect on the choice of alternative: the parser still entered the second alternative and then reported a failed predicate. The FailedPredicateException was undesirable and the trace message was wrong. It may have to do with Finding Visible Predicates.
The solution was to split the processing of the #define directive into a fixed part define_identifier rule and a variable part replacement_comment rule with the semantic predicate (which, to be effective in the parsing decision, must be placed at the beginning of the alternative).
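The decision that replacement_comment makes once the predicates sit at the start of their alternatives can be modelled in plain Java. This is only a minimal sketch that does not use the ANTLR runtime: the class ReplacementCommentChoice, the enums Next and Alt and the method choose are hypothetical names, and the token stream is reduced to the kind and line of the first visible token after the IDENTIFIER.

```java
/** Plain-Java model of the choice made in replacement_comment (no ANTLR runtime involved). */
final class ReplacementCommentChoice {
    /** Kind of the first visible token after the IDENTIFIER (hypothetical model). */
    enum Next { VALUE_TOKEN, LINE_COMMENT_DELIMITER, SOMETHING_ELSE }

    /** The three alternatives of replacement_comment. */
    enum Alt { REPLACEMENT_VALUE, INLINE_COMMENT_ONLY, NO_VALUE }

    /**
     * @param statementLine line stored by define_identifier
     * @param next          kind of the next visible token
     * @param nextLine      line of that token
     */
    static Alt choose(int statementLine, Next next, int nextLine) {
        if (next == Next.VALUE_TOKEN) {
            return Alt.REPLACEMENT_VALUE;   // anything+ line_comment? (no predicate)
        }
        if (next == Next.LINE_COMMENT_DELIMITER && nextLine == statementLine) {
            return Alt.INLINE_COMMENT_ONLY; // {line == statement_line}? line_comment
        }
        if (nextLine != statementLine) {
            return Alt.NO_VALUE;            // {line != statement_line}? (empty alternative)
        }
        throw new IllegalStateException("no viable alternative");
    }

    public static void main(String[] args) {
        // "#define IDENTIFIER_THREE" on line 42, next token "//" on line 44.
        System.out.println(choose(42, Next.LINE_COMMENT_DELIMITER, 44));
        // "//#define ID1 //2": directive and trailing comment both on line 45.
        System.out.println(choose(45, Next.LINE_COMMENT_DELIMITER, 45));
        // "#define DEFAULT_A 10.0" on line 20.
        System.out.println(choose(20, Next.VALUE_TOKEN, 20));
    }
}
```

With the line numbers taken from the trace above, the three calls print NO_VALUE, INLINE_COMMENT_ONLY and REPLACEMENT_VALUE, which matches the three trace messages for those statements.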

Related

Antlr4 Mismatch input

First of all, I have read the solutions for the following similar questions: q1 q2 q3
Still I don't understand why I get the following message:
line 1:0 missing 'PROGRAM' at 'PROGRAM'
when I try to match the following:
PROGRAM test
BEGIN
END
My grammar:
grammar Wengo;
program : PROGRAM id BEGIN pgm_body END ;
id : IDENTIFIER ;
pgm_body : decl func_declarations ;
decl : string_decl decl | var_decl decl | empty ;
string_decl : STRING id ASSIGN str SEMICOLON ;
str : STRINGLITERAL ;
var_decl : var_type id_list SEMICOLON ;
var_type : FLOAT | INT ;
any_type : var_type | VOID ;
id_list : id id_tail ;
id_tail : COMA id id_tail | empty ;
param_decl_list : param_decl param_decl_tail | empty ;
param_decl : var_type id ;
param_decl_tail : COMA param_decl param_decl_tail | empty ;
func_declarations : func_decl func_declarations | empty ;
func_decl : FUNCTION any_type id (param_decl_list) BEGIN func_body END ;
func_body : decl stmt_list ;
stmt_list : stmt stmt_list | empty ;
stmt : base_stmt | if_stmt | loop_stmt ;
base_stmt : assign_stmt | read_stmt | write_stmt | control_stmt ;
assign_stmt : assign_expr SEMICOLON ;
assign_expr : id ASSIGN expr ;
read_stmt : READ ( id_list )SEMICOLON ;
write_stmt : WRITE ( id_list )SEMICOLON ;
return_stmt : RETURN expr SEMICOLON ;
expr : expr_prefix factor ;
expr_prefix : expr_prefix factor addop | empty ;
factor : factor_prefix postfix_expr ;
factor_prefix : factor_prefix postfix_expr mulop | empty ;
postfix_expr : primary | call_expr ;
call_expr : id ( expr_list ) ;
expr_list : expr expr_list_tail | empty ;
expr_list_tail : COMA expr expr_list_tail | empty ;
primary : ( expr ) | id | INTLITERAL | FLOATLITERAL ;
addop : ADD | MIN ;
mulop : MUL | DIV ;
if_stmt : IF ( cond ) decl stmt_list else_part ENDIF ;
else_part : ELSE decl stmt_list | empty ;
cond : expr compop expr | TRUE | FALSE ;
compop : LESS | GREAT | EQUAL | NOTEQUAL | LESSEQ | GREATEQ ;
while_stmt : WHILE ( cond ) decl stmt_list ENDWHILE ;
control_stmt : return_stmt | CONTINUE SEMICOLON | BREAK SEMICOLON ;
loop_stmt : while_stmt | for_stmt ;
init_stmt : assign_expr | empty ;
incr_stmt : assign_expr | empty ;
for_stmt : FOR ( init_stmt SEMICOLON cond SEMICOLON incr_stmt ) decl stmt_list ENDFOR ;
COMMENT : '--' ~[\r\n]* -> skip ;
WS : [ \t\r\n]+ -> skip ;
NEWLINE : [ \n] ;
EMPTY : $ ;
KEYWORD : PROGRAM|BEGIN|END|FUNCTION|READ|WRITE|IF|ELSE|ENDIF|WHILE|ENDWHILE|RETURN|INT|VOID|STRING|FLOAT|TRUE|FALSE|FOR|ENDFOR|CONTINUE|BREAK ;
OPERATOR : ASSIGN|ADD|MIN|MUL|DIV|EQUAL|NOTEQUAL|LESS|GREAT|LBRACKET|RBRACKET|SEMICOLON|COMA|LESSEQ|GREATEQ ;
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]* ;
INTLITERAL : [0-9]+ ;
FLOATLITERAL : [0-9]*'.'[0-9]+ ;
STRINGLITERAL : '"' (~[\r\n"] | '""')* '"' ;
PROGRAM : 'PROGRAM';
BEGIN : 'BEGIN';
END : 'END';
FUNCTION : 'FUNCTION';
READ : 'READ';
WRITE : 'WRITE';
IF : 'IF';
ELSE : 'ELSE';
ENDIF : 'ENDIF';
WHILE : 'WHILE';
ENDWHILE : 'ENDWHILE';
RETURN : 'RETURN';
INT : 'INT';
VOID : 'VOID';
STRING : 'STRING';
FLOAT : 'FLOAT' ;
TRUE : 'TRUE';
FALSE : 'FALSE';
FOR : 'FOR';
ENDFOR : 'ENDFOR';
CONTINUE : 'CONTINUE';
BREAK : 'BREAK';
ASSIGN : ':=';
ADD : '+';
MIN : '-';
MUL : '*';
DIV : '/';
EQUAL : '=';
NOTEQUAL : '!=';
LESS : '<';
GREAT : '>';
LBRACKET : '(';
RBRACKET : ')';
SEMICOLON : ';';
COMA : ',';
LESSEQ : '<=';
GREATEQ : '>=';
From what I've read, I think there's a mismatch between KEYWORD and PROGRAM, but removing KEYWORD altogether does not solve the problem.
EDIT:
Removing KEYWORD gives the following message:
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
This is my grun output when KEYWORD is present:
[#0,0:6='PROGRAM',<KEYWORD>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<KEYWORD>,2:0]
[#3,19:21='END',<KEYWORD>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 1:0 mismatched input 'PROGRAM' expecting 'PROGRAM'
(program PROGRAM test BEGIN END)
This is the output when KEYWORD is removed:
[#0,0:6='PROGRAM',<'PROGRAM'>,1:0]
[#1,8:11='test',<IDENTIFIER>,1:8]
[#2,13:17='BEGIN',<'BEGIN'>,2:0]
[#3,19:21='END',<'END'>,3:0]
[#4,23:22='<EOF>',<EOF>,4:0]
line 3:0 mismatched input 'END' expecting {'INT', 'STRING', 'FLOAT', '+'}
(program PROGRAM (id test) BEGIN (pgm_body decl func_declarations) END)
The error about "missing 'PROGRAM'" was solved when you removed the KEYWORD rule (note that you should also remove the OPERATOR rule for the same reason).
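The reason KEYWORD hid PROGRAM is that the ANTLR lexer takes the longest match and, when several rules match the same text, the rule listed first wins. The parser then receives a KEYWORD token where it expects a PROGRAM token. The following is a minimal plain-Java sketch of that selection rule, not ANTLR's implementation: LexerTieBreak, Rule and firstTokenType are hypothetical names, the rules are cut down from the grammar above, and the rule order is an assumption chosen to reproduce the token types shown in the two grun outputs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of lexer rule selection: longest match wins, ties go to the rule listed first. */
final class LexerTieBreak {
    /** A lexer rule reduced to a name and a regular expression (hypothetical helper). */
    static final class Rule {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Returns the name of the rule that wins at the start of the input, or null if none matches. */
    static String firstTokenType(List<Rule> rulesInGrammarOrder, String input) {
        String winner = null;
        int longest = 0;
        for (Rule rule : rulesInGrammarOrder) {
            Matcher m = rule.pattern.matcher(input);
            // Only a strictly longer match replaces the winner, so ties keep the earlier rule.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.name;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        Rule keyword = new Rule("KEYWORD", "PROGRAM|BEGIN|END");
        Rule program = new Rule("PROGRAM", "PROGRAM");
        Rule identifier = new Rule("IDENTIFIER", "[a-zA-Z][a-zA-Z0-9]*");

        List<Rule> withKeyword = Arrays.asList(keyword, program, identifier);
        List<Rule> withoutKeyword = Arrays.asList(program, identifier);

        System.out.println(firstTokenType(withKeyword, "PROGRAM test"));    // KEYWORD
        System.out.println(firstTokenType(withoutKeyword, "PROGRAM test")); // PROGRAM
        System.out.println(firstTokenType(withoutKeyword, "PROGRAMS"));     // IDENTIFIER
    }
}
```

The last call shows the longest-match half of the rule: IDENTIFIER wins for PROGRAMS even though PROGRAM is listed first, because it matches one character more.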
The error you're encountering now is completely unrelated.
Your current problem concerns the definition of empty, which you didn't show. You've said that you tried both EMPTY : $ ; and EMPTY : ^$ ; (and then presumably empty: EMPTY;), but none of those even compile, so they wouldn't cause the parse error you posted. Either way, the concept of an EMPTY token can't work. When would such a token be generated? Once between every other token? In that case, you'd get a lot of "unexpected EMPTY" errors. No, the whole point of an empty rule is that it should succeed without consuming any tokens.
To achieve that, you can just define empty : ; and remove EMPTY altogether. Alternatively you could remove empty as well and just use an empty alternative (i.e. | ;) wherever you're currently using empty. Either approach will make your code work, but there's a better way:
You're using empty as the base case for rules that basically amount to lists. ANTLR offers the repetition operators * (0 or more) and + (1 or more), as well as the ? operator to make things optional. These allow you to define lists non-recursively and without an empty rule. For example, stmt_list could be defined like this:
stmt_list : stmt* ;
And id_list, which requires at least one id in your grammar, like this:
id_list : id (',' id)* ;
On an unrelated note, your grammar can be simplified greatly by making use of the fact that ANTLR 4 supports direct left recursion, so you can get rid of all the different expression rules and just have one that's left-recursive.
That'd give you:
expr : primary
| id '(' expr_list ')'
| expr mulop expr
| expr addop expr
;
The rules expr_prefix, factor, factor_prefix, postfix_expr and call_expr could then all be removed.
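In a left-recursive ANTLR 4 rule, alternatives listed earlier bind tighter, so placing expr mulop expr before expr addop expr gives multiplication and division the higher precedence, and binary operators are left-associative by default. The hand-written precedence-climbing sketch below shows the same grouping in plain Java. It is not the code ANTLR generates: the class PrecedenceClimbing and its methods are hypothetical, and it handles only integers, the four operators and parentheses.

```java
/** Evaluates integer expressions with the grouping the left-recursive expr rule produces. */
final class PrecedenceClimbing {
    private final String src;
    private int pos;

    private PrecedenceClimbing(String src) { this.src = src; }

    /** Evaluates an expression made of integers, + - * / and parentheses. */
    static int eval(String text) {
        PrecedenceClimbing p = new PrecedenceClimbing(text.replace(" ", ""));
        int value = p.expr(1);
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    /** Binding strength: mulop (2) binds tighter than addop (1); 0 means "not an operator". */
    private static int level(char op) {
        switch (op) {
            case '*': case '/': return 2;
            case '+': case '-': return 1;
            default: return 0;
        }
    }

    /** Parses a primary followed by operators whose level is at least minLevel. */
    private int expr(int minLevel) {
        int left = primary();
        while (pos < src.length() && level(src.charAt(pos)) >= minLevel) {
            char op = src.charAt(pos++);
            int right = expr(level(op) + 1); // the +1 makes operators left-associative
            left = apply(op, left, right);
        }
        return left;
    }

    private int primary() {
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++;
            int value = expr(1);
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    private static int apply(char op, int left, int right) {
        switch (op) {
            case '*': return left * right;
            case '/': return left / right;
            case '+': return left + right;
            default:  return left - right;
        }
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalArgumentException("'" + c + "' expected at " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(eval("1 + 2 * 3"));   // 7: * groups before +
        System.out.println(eval("8 - 3 - 2"));   // 3: (8 - 3) - 2
        System.out.println(eval("(1 + 2) * 3")); // 9
    }
}
```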

How to express a required 'RETURN' statement in the grammar

I am still a newbie to ANTLR, so sorry if I am posting an obvious question.
I have a relatively simple grammar. What I need is for the user to be able to enter something like the following:
if (condition)
{
return true
}
else if (condition)
{
return false
}
else
{
if (condition)
{
return true
}
return false
}
In my grammar below, is there a way to make sure that an error will be flagged if the input string does not contain a 'return' statement? If not, can I do it via the Listener, and if so, how?
grammar Evaluator;
parse
: block EOF
;
block
: statement
;
statement
: return_statement
| if_statement
;
return_statement
: RETURN (TRUE | FALSE)
;
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
statement_block
: OBRACE block CBRACE
;
expression
: MINUS expression #unaryMinusExpression
| NOT expression #notExpression
| expression op=(MULT | DIV) expression #multiplicationExpression
| expression op=(PLUS | MINUS) expression #additiveExpression
| expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: function #functionAtom
| OPAR expression CPAR #parenExpression
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| ID #idAtom
;
function
: ID OPAR (parameter (',' parameter)*)? CPAR
;
parameter
: expression #expressionParameter
;
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
NOT : '!';
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
ASSIGN : '=';
RETURN : 'return';
TRUE : 'true';
FALSE : 'false';
IF : 'if';
ELSE : 'else';
// ID either starts with a letter then followed by any number of a-zA-Z_0-9
// or starts with one or more numbers, then followed by at least one a-zA-Z_ then followed
// by any number of a-zA-Z_0-9
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
// Anything not recognized above will be an error
ErrChar
: .
;
Ross' answer is perfectly correct. You design your grammar to accept a certain input. If the input stream does not correspond, the parser will complain.
Allow me to rewrite your grammar like this :
grammar Question;
/* enforce each block to end with a return statement */
a_grammar
: if_statement EOF
;
if_statement
: 'if' expression statement+ ( 'else' statement+ )?
;
statement
: if_statement
// other statements
| statement_block
;
statement_block
: '{' statement* return_statement '}'
;
return_statement
: 'return' ( 'true' | 'false' )
;
expression // reduced to a strict minimum to answer the OP question
: atom
| atom '<=' atom
| '(' expression ')'
;
atom
: ID
| INT
;
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT : [0-9]+ ;
WS : [ \t\r\n] -> skip ;
// Anything not recognized above will be an error
ErrChar
: .
;
With the following input
if (a <= 7)
{
return true
}
else
if (xyz <= 99)
{
return false
}
else incor##!$rect
{
if (b <= a)
{
return true
}
return false
}
you get these tokens
[#0,0:1='if',<'if'>,1:0]
[#1,3:3='(',<'('>,1:3]
[#2,4:4='a',<ID>,1:4]
[#3,6:7='<=',<'<='>,1:6]
...
[#21,82:85='else',<'else'>,10:1]
[#22,87:91='incor',<ID>,10:6]
[#23,92:92='#',<ErrChar>,10:11]
[#24,93:93='#',<ErrChar>,10:12]
[#25,94:94='!',<ErrChar>,10:13]
[#26,95:95='$',<ErrChar>,10:14]
[#27,96:99='rect',<ID>,10:15]
[#28,102:102='{',<'{'>,11:1]
...
line 10:6 mismatched input 'incor' expecting {'if', '{'}
If you run the test rig with the -gui option, it displays the parse tree with erroneous tokens nicely displayed in pink !
grun Question a_grammar -gui data.txt
I've never played with the Listener before.
Via the Visitor, in the VisitStatement(StatementContext context) method, check if the context.return_statement() (ReturnStatementContext) is null. If it is null, throw an exception.
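The same idea can be made a little stricter than testing for a single return_statement: every path through the input should end in a return. The sketch below shows that check on a hand-built tree in plain Java. The node classes Return, Block and If are hypothetical stand-ins, not ANTLR-generated context classes; in a real visitor or listener the same logic would be applied to the generated context objects, whose names depend on the grammar.

```java
import java.util.Arrays;
import java.util.List;

/** Checks on a hand-built tree that every path through a block ends in a return. */
final class ReturnCheck {
    /** Minimal stand-ins for parse-tree nodes; these are not ANTLR-generated classes. */
    interface Node {}

    static final class Return implements Node {}

    static final class Block implements Node {
        final List<Node> statements;
        Block(Node... statements) { this.statements = Arrays.asList(statements); }
    }

    static final class If implements Node {
        final List<Block> conditionalBlocks; // bodies of "if" and every "else if"
        final Block elseBlock;               // null when there is no "else"

        If(Block elseBlock, Block... conditionalBlocks) {
            this.elseBlock = elseBlock;
            this.conditionalBlocks = Arrays.asList(conditionalBlocks);
        }
    }

    /** True when control cannot leave the node without passing a return. */
    static boolean alwaysReturns(Node node) {
        if (node instanceof Return) {
            return true;
        }
        if (node instanceof Block) {
            for (Node statement : ((Block) node).statements) {
                if (alwaysReturns(statement)) return true;
            }
            return false;
        }
        If ifNode = (If) node;
        if (ifNode.elseBlock == null) {
            return false; // without an else, no branch may be taken at all
        }
        for (Block block : ifNode.conditionalBlocks) {
            if (!alwaysReturns(block)) return false;
        }
        return alwaysReturns(ifNode.elseBlock);
    }

    public static void main(String[] args) {
        Block justReturn = new Block(new Return());
        // Shape of the question's input: if {return} else if {return} else { if {return} return }
        Node example = new If(
                new Block(new If(null, justReturn), new Return()),
                justReturn, justReturn);
        System.out.println(alwaysReturns(example));                  // true
        System.out.println(alwaysReturns(new If(null, justReturn))); // false
    }
}
```

A caller could throw an exception or record an error when alwaysReturns is false for the root node.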
I'm a newbie as well. I was thinking of forcing the parser to report an error by requiring a return statement, so instead of:
statement
: return_statement
| if_statement
;
which says a statement is EITHER an if_statement OR a return_statement, I would try something like:
statement
: (if_statement)? return_statement
;
Which (I believe) says the if_statement is optional but the return_statement MUST always occur. But you might want to try something like:
block_data : statements+ return_statement;
Where statements could be if_statements etc, and one or more of those are allowed.
I would take everything above with a grain of salt, as I have only been working with ANTLR4 a week or so. I have 4 .g4 files working, and am happy with ANTLR, but you may actually have more ANTLR stick time than I.
-Regards

how can I refactor this ANTLR4 grammar so that it isn't mutually left recursive?

I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
@header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left-recursive invocation must come first in its alternative, so it cannot be placed inside a parenthesized group such as (primary '.')?.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.
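For some intuition about why the rule has to be written this way: ANTLR 4 handles direct left recursion by rewriting the rule into a non-recursive start followed by a loop over the suffixes, and it can only do that when the recursive call is the first element of the alternative. The following is a hand-written Java sketch of that loop shape for a cut-down version of the rule. It is not ANTLR's generated code; the class PostfixSketch, its pre-split token list, and the bracketed output format are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written sketch, not ANTLR-generated code. It parses the toy rule
//   primary : NAME | NAME '(' ')' | primary '.' NAME | primary '.' NAME '(' ')'
// the way a left-recursive rule is handled after rewriting: match a
// non-recursive alternative first, then loop over the '.' suffixes.
final class PostfixSketch {
    private final List<String> tokens;
    private int pos = 0;

    PostfixSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private String next() {
        String t = peek();
        pos++;
        return t;
    }

    private void expect(String expected) {
        String t = next();
        if (!t.equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but found " + t);
        }
    }

    // Reads a NAME, optionally followed by "()", and describes what it saw.
    private String nameOrCall() {
        String name = next();
        if (!name.matches("[A-Za-z_]\\w*")) {
            throw new IllegalStateException("expected a name but found " + name);
        }
        if (peek().equals("(")) {
            next();
            expect(")");
            return "call(" + name + ")";
        }
        return name;
    }

    // Returns a bracketed string that shows how the input was grouped.
    String primary() {
        String left = nameOrCall();      // the non-recursive alternatives
        while (peek().equals(".")) {     // the formerly left-recursive alternatives
            next();
            left = "(" + left + "." + nameOrCall() + ")";
        }
        return left;
    }

    public static void main(String[] args) {
        // prints ((a.b).call(f))
        System.out.println(new PostfixSketch(
                Arrays.asList("a", ".", "b", ".", "f", "(", ")")).primary());
        // prints call(f)
        System.out.println(new PostfixSketch(Arrays.asList("f", "(", ")")).primary());
    }
}
```

The bare call f() and the member call a.b.f() go through different branches, which mirrors the two separate alternatives (functionCall and primary '.' functionCall) in the rewritten rule.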

ANTLR4 Grammar picks up 'and' and 'or' in variable names

Please help me with my ANTLR4 Grammar.
Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') &
(Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato;
'+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correctly. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notice that Arb.BorI contains the word 'or'.
I think my problem is that 'booleanOps' in the grammar overrides 'datakildefelt'.
So my question is: how do I get my grammar right? I am stuck, so any help will be appreciated.
My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr
| expression # Expr
| '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId
| function # Funktion
| int # Integer
| decimal # Real
| string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
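When several lexer rules match the same text, the rule defined first wins; when the matches differ in length, the longest one wins. In your case you need to move AND and OR before LETTER. It is also worth listing GTEQ and LTEQ before GT and LT, and checking for similar overlaps elsewhere.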
EDIT
Additionally, you should make identifyer a lexer rule, i.e. give it a name that starts with a capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. The input is initially a stream of characters, which is first turned into a stream of tokens using only the lexer rules. At that point the parser rules (those starting with a lowercase letter) have not come into play yet. So, to make "BorI" parse as a single entity (token), you need a lexer rule that matches identifiers. Currently it is split into 3 tokens: LETTER (B) OR (or) LETTER (I).
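To make the tokenization above concrete, here is a small Java simulation of how an ANTLR-style lexer picks tokens: at each position it takes the rule with the longest match, and on a tie it takes the rule listed first. This is a simplified model built on java.util.regex, not the ANTLR runtime. LexerSketch, tokenize, letterRules and idRules are names invented for this sketch, the letter range is reduced to ASCII, and whitespace handling is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of lexer rule selection: longest match wins,
// and on a tie the rule that was defined first wins.
final class LexerSketch {

    // rules maps a token name to a regex; insertion order is the definition order.
    static List<String> tokenize(Map<String, String> rules, String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                Matcher m = Pattern.compile(rule.getValue()).matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on equal length the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            out.add(bestName + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return out;
    }

    // Mirrors the original grammar: single letters plus the keyword 'or'.
    static Map<String, String> letterRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("LETTER", "[a-zA-Z]");
        rules.put("OR", "or");
        return rules;
    }

    // Mirrors the suggested fix: the keyword first, then a whole-identifier token.
    static Map<String, String> idRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("OR", "or");
        rules.put("ID", "[a-zA-Z][a-zA-Z0-9]*");
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(letterRules(), "BorI")); // [LETTER(B), OR(or), LETTER(I)]
        System.out.println(tokenize(idRules(), "BorI"));     // [ID(BorI)]
        System.out.println(tokenize(idRules(), "or"));       // [OR(or)]
    }
}
```

With the identifier rule in place, "BorI" is consumed as one four-character token, while a standalone "or" still becomes OR because it ties with ID on length and OR is defined first.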
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes
| expression # Expr
| boolExpression # BooleanExpr
;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+
| '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId
| function # Funktion
| INT # Integer
| DECIMAL # Real
| STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ );
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;

Error when generating a grammar for chess PGN files

I made this ANTLR4 grammar in order to parse PGN inside my Java program, but I can't manage to resolve the ambiguity in it:
grammar Pgn;
file: game (NEWLINE+ game)*;
game: (tag+ NEWLINE+)? notation;
tag: [TAG_TYPE "TAG_VALUE"];
notation: move+ END_RESULT?;
move: MOVE_NUMBER\. MOVE_DESC MOVE_DESC #CompleteMove
| MOVE_NUMBER\. MOVE_DESC #OnlyWhiteMove
| MOVE_NUMBER\.\.\. MOVE_DESC #OnlyBlackMove
;
END_RESULT: '1-0'
| '0-1'
| '1/2-1/2'
;
TAG_TYPE: LETTER+;
TAG_VALUE: .*;
MOVE_NUMBER: DIGIT+;
MOVE_DESC: .*;
NEWLINE: \r? \n;
SPACES: [ \t]+ -> skip;
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
And this is the error output:
$ antlr4 Pgn.g4
error(50): Pgn.g4:6:6: syntax error: 'TAG_TYPE "TAG_VALUE"' came as a complete surprise to me while matching alternative
I think the error comes from the fact that " [ ", " ] " and ' " ' can't be used freely, in neither parser rules nor lexer rules.
Any help or advice is welcome.
Looking at the specs for PGN, http://www.thechessdrum.net/PGN_Reference.txt, I see there's a formal definition of the PGN format there:
18: Formal syntax
<PGN-database> ::= <PGN-game> <PGN-database>
                   <empty>
<PGN-game> ::= <tag-section> <movetext-section>
<tag-section> ::= <tag-pair> <tag-section>
                  <empty>
<tag-pair> ::= [ <tag-name> <tag-value> ]
<tag-name> ::= <identifier>
<tag-value> ::= <string>
<movetext-section> ::= <element-sequence> <game-termination>
<element-sequence> ::= <element> <element-sequence>
                       <recursive-variation> <element-sequence>
                       <empty>
<element> ::= <move-number-indication>
              <SAN-move>
              <numeric-annotation-glyph>
<recursive-variation> ::= ( <element-sequence> )
<game-termination> ::= 1-0
                       0-1
                       1/2-1/2
                       *
<empty> ::=
I highly recommend making your ANTLR grammar resemble that as closely as possible. I made a small project with ANTLR 4 on GitHub which you can try out: https://github.com/bkiers/PGN-parser
The grammar (without comments):
parse
: pgn_database EOF
;
pgn_database
: pgn_game*
;
pgn_game
: tag_section movetext_section
;
tag_section
: tag_pair*
;
tag_pair
: LEFT_BRACKET tag_name tag_value RIGHT_BRACKET
;
tag_name
: SYMBOL
;
tag_value
: STRING
;
movetext_section
: element_sequence game_termination
;
element_sequence
: (element | recursive_variation)*
;
element
: move_number_indication
| san_move
| NUMERIC_ANNOTATION_GLYPH
;
move_number_indication
: INTEGER PERIOD?
;
san_move
: SYMBOL
;
recursive_variation
: LEFT_PARENTHESIS element_sequence RIGHT_PARENTHESIS
;
game_termination
: WHITE_WINS
| BLACK_WINS
| DRAWN_GAME
| ASTERISK
;
WHITE_WINS
: '1-0'
;
BLACK_WINS
: '0-1'
;
DRAWN_GAME
: '1/2-1/2'
;
REST_OF_LINE_COMMENT
: ';' ~[\r\n]* -> skip
;
BRACE_COMMENT
: '{' ~'}'* '}' -> skip
;
ESCAPE
: {getCharPositionInLine() == 0}? '%' ~[\r\n]* -> skip
;
SPACES
: [ \t\r\n]+ -> skip
;
STRING
: '"' ('\\\\' | '\\"' | ~[\\"])* '"'
;
INTEGER
: [0-9]+
;
PERIOD
: '.'
;
ASTERISK
: '*'
;
LEFT_BRACKET
: '['
;
RIGHT_BRACKET
: ']'
;
LEFT_PARENTHESIS
: '('
;
RIGHT_PARENTHESIS
: ')'
;
LEFT_ANGLE_BRACKET
: '<'
;
RIGHT_ANGLE_BRACKET
: '>'
;
NUMERIC_ANNOTATION_GLYPH
: '$' [0-9]+
;
SYMBOL
: [a-zA-Z0-9] [a-zA-Z0-9_+#=:-]*
;
SUFFIX_ANNOTATION
: [?!] [?!]?
;
UNEXPECTED_CHAR
: .
;
For a version with comments, see: https://github.com/bkiers/PGN-parser/blob/master/src/main/antlr4/nl/bigo/pp/PGN.g4
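If you want a quick feel for what the tag_pair, SYMBOL and STRING rules above accept without generating a parser, a rough Java regex equivalent can help. This is only a sketch for experimenting and not part of the linked project: the class TagPairSketch and the method parseTag are invented for the example, it covers a single tag pair only, and the returned value is the raw text between the quotes with escapes left as-is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex approximation of:  tag_pair : LEFT_BRACKET tag_name tag_value RIGHT_BRACKET
// where tag_name is a SYMBOL and tag_value is a STRING.
final class TagPairSketch {
    // Same characters as the skipped SPACES rule.
    private static final String WS = "[ \\t\\r\\n]*";
    // SYMBOL : [a-zA-Z0-9] [a-zA-Z0-9_+#=:-]*
    private static final String SYMBOL = "[a-zA-Z0-9][a-zA-Z0-9_+#=:-]*";
    // Inside of STRING: an escaped backslash, an escaped quote,
    // or any character other than a backslash or a quote.
    private static final String STRING_BODY = "(?:\\\\\\\\|\\\\\"|[^\\\\\"])*";

    private static final Pattern TAG_PAIR = Pattern.compile(
            "\\[" + WS + "(" + SYMBOL + ")" + WS
            + "\"(" + STRING_BODY + ")\"" + WS + "\\]");

    // Returns {name, rawValue}, or null if the text is not a single tag pair.
    static String[] parseTag(String text) {
        Matcher m = TAG_PAIR.matcher(text);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] tag = parseTag("[Event \"F/S Return Match\"]");
        System.out.println(tag[0] + " = " + tag[1]);        // Event = F/S Return Match
        System.out.println(parseTag("[TAG_TYPE]") == null); // true: the quoted value is missing
    }
}
```

Note how the brackets and the quote are matched as literal characters here; in the grammar the same job is done by the dedicated LEFT_BRACKET, RIGHT_BRACKET and STRING lexer rules.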

Resources