from what i have read the way i defined my 'expression' this should provide me with the following:
This is my input:
xyz = a + b + c + d
This should be my output:
xyz = ( ( a + b ) + ( c + d) )
But instead i get:
xyz = ( a + (b + (c + d) ) )
I bet this has been solved before and i just wasnt able to find the solution.
statementList : s=statement sl=statementList #multipleStatementList
| s=statement #singleStatementList
;
statement : statementAssign
| statementIf
;
statementAssign : var=VAR ASSIGN expr=expression #overwriteStatementAssign
| var=VAR PLUS ASSIGN expr=expression #addStatementAssign
| var=VAR MINUS ASSIGN expr=expression #subStatementAssign
;
;
expression : BRACKET_OPEN expr=expression BRACKET_CLOSE #priorityExp
| left=expression operand=('*'|'/') right=expression #mulDivExp
| left=expression operand=('+'|'-') right=expression #addSubExp
| <assoc=right> left=expression POWER right=expression #powExp
| variable=VAR #varExp
| number=NUMBER #numExp
;
BRACKET_OPEN : '(' ;
BRACKET_CLOSE : ')' ;
ASTERISK : '*' ;
SLASH : '/' ;
PLUS : '+' ;
MINUS : '-' ;
POWER : '^' ;
MODULO : '%' ;
ASSIGN : '=' ;
NUMBER : [0-9]+ ;
VAR : [a-z][a-zA-Z0-9\-]* ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
If you want the expression
xyz = ( ( a + b ) + ( c + d) )
Then you'll need to control the binding of the + operator with parentheses just as you've done in your expected output. Otherwise, with
xyz = ( a + (b + (c + d) ) )
is the way the parser is going to parse it because all the + operators have the same precedence, and the parser continues parsing until it reaches the end of the expression.
It recursively applies
left=expression operand=('+'|'-') right=expression
until the expression is completed.
and you get the grouping you got. So use those parentheses, that's what they're for if you want to force the order of expression evaluation. ;)
If you change your input to
xyz = a * b + c + d
you'll see what I mean about the precedence, because the multiplication rule appears before the addition rule -- and hence binds earlier -- which is a mathematical convention (lacking parentheses to group terms.)
You're doing it right and the parser is too. Just group what you want if you want a specific binding order.
Related
I am still a newbie to ANTLR, so sorry if I am posting an obvious question.
I have a relatively simple grammar. What I need is for the user to be able to enter something like the following:
if (condition)
{
return true
}
else if (condition)
{
return false
}
else
{
if (condition)
{
return true
}
return false
}
In my grammar below, is there a way to make sure that an error will be flagged if the input string does not contain a 'return' statement? If not, can I do it via the Listener, and if so, how?
grammar Evaluator;
parse
: block EOF
;
block
: statement
;
statement
: return_statement
| if_statement
;
return_statement
: RETURN (TRUE | FALSE)
;
if_statement
: IF condition_block (ELSE IF condition_block)* (ELSE statement_block)?
;
condition_block
: expression statement_block
;
statement_block
: OBRACE block CBRACE
;
expression
: MINUS expression #unaryMinusExpression
| NOT expression #notExpression
| expression op=(MULT | DIV) expression #multiplicationExpression
| expression op=(PLUS | MINUS) expression #additiveExpression
| expression op=(LTEQ | GTEQ | LT | GT) expression #relationalExpression
| expression op=(EQ | NEQ) expression #equalityExpression
| expression AND expression #andExpression
| expression OR expression #orExpression
| atom #atomExpression
;
atom
: function #functionAtom
| OPAR expression CPAR #parenExpression
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| ID #idAtom
;
function
: ID OPAR (parameter (',' parameter)*)? CPAR
;
parameter
: expression #expressionParameter
;
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
NOT : '!';
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
ASSIGN : '=';
RETURN : 'return';
TRUE : 'true';
FALSE : 'false';
IF : 'if';
ELSE : 'else';
// ID either starts with a letter then followed by any number of a-zA-Z_0-9
// or starts with one or more numbers, then followed by at least one a-zA-Z_ then followed
// by any number of a-zA-Z_0-9
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
SPACE
: [ \t\r\n] -> skip
;
// Anything not recognized above will be an error
ErrChar
: .
;
Ross' answer is perfectly correct. You design your grammar to accept a certain input. If the input stream does not correspond, the parser will complain.
Allow me to rewrite your grammar like this :
grammar Question;
/* enforce each block to end with a return statement */
a_grammar
: if_statement EOF
;
if_statement
: 'if' expression statement+ ( 'else' statement+ )?
;
statement
: if_statement
// other statements
| statement_block
;
statement_block
: '{' statement* return_statement '}'
;
return_statement
: 'return' ( 'true' | 'false' )
;
expression // reduced to a strict minimum to answer the OP question
: atom
| atom '<=' atom
| '(' expression ')'
;
atom
: ID
| INT
;
ID
: [a-zA-Z] [a-zA-Z_0-9]*
| [0-9]+ [a-zA-Z_]+ [a-zA-Z_0-9]*
;
INT : [0-9]+ ;
WS : [ \t\r\n] -> skip ;
// Anything not recognized above will be an error
ErrChar
: .
;
With the following input
if (a <= 7)
{
return true
}
else
if (xyz <= 99)
{
return false
}
else incor##!$rect
{
if (b <= a)
{
return true
}
return false
}
you get these tokens
[#0,0:1='if',<'if'>,1:0]
[#1,3:3='(',<'('>,1:3]
[#2,4:4='a',<ID>,1:4]
[#3,6:7='<=',<'<='>,1:6]
...
[#21,82:85='else',<'else'>,10:1]
[#22,87:91='incor',<ID>,10:6]
[#23,92:92='#',<ErrChar>,10:11]
[#24,93:93='#',<ErrChar>,10:12]
[#25,94:94='!',<ErrChar>,10:13]
[#26,95:95='$',<ErrChar>,10:14]
[#27,96:99='rect',<ID>,10:15]
[#28,102:102='{',<'{'>,11:1]
...
line 10:6 mismatched input 'incor' expecting {'if', '{'}
If you run the test rig with the -gui option, it displays the parse tree with erroneous tokens nicely displayed in pink !
grun Question a_grammar -gui data.txt
I've never played with the Listener before.
Via the Visitor, in the VisitStatement(StatementContext context) method, check if the context.return_statement() (ReturnStatementContext) is null. If it is null, throw an exception.
I'm a newbie as well. I was thinking of forcing the lexer to barf by
requiring a return statement, so instead of:
statement
: return_statement
| if_statement
;
Which says a statement is EITHER a if_statement OR a return_statement I would try something like:
statement
: (if_statement)? return_statement
;
Which (I believe), says the if_statement is optional but the return_statement MUST always occur. But you might want to try something like:
block_data : statements+ return_statement;
Where statements could be if_statements etc, and one or more of those are allowed.
I would take everything above with a grain of salt, as I have only been working with ANTLR4 a week or so. I have 4 .g4 files working, and am happy with ANTLR, but you may actually have more ANTLR stick time than I.
-Regards
So I defined a grammar to parse an C style syntax language:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I want to generate ANTLR Recogonizers in intelliJ, I got this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is caused by my ID could be an empty string?
There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also not that this (statement)* can simply be written as statement*
It's about your expression and relation rules. The expression rule can match relation in one alt, which in turn recurses back to expression. Rule relation additionally can potentially match nothing because of (op=('<'|'>') expression)*
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).
So far I've been testing with ANTLR4, I've tested with this single grammar:
grammar LivingDSLParser;
options{
language = Java;
//tokenVocab = LivingDSLLexer;
}
living
: query #QUERY
;
query
: K_QUERY entity K_WITH expr
;
entity
: STAR #ALL
| D_FUAS #FUAS
| D_RESOURCES #RESOURCES
;
field
: ((D_FIELD | D_PROPERTY | D_METAINFO) DOT)? IDENTIFIER
| STAR
;
expr
: field
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<<' | '>>' | '&' | '|' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '==' | '!=' | '<>' ) expr
| expr K_AND expr
| expr K_OR expr
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
K_QUERY : Q U E R Y;
K_WITH: W I T H;
K_OR: O R;
K_AND: A N D;
D_FUAS : F U A S;
D_RESOURCES : R E S O U R C E S;
D_METAINFO: M E T A I N F O;
D_PROPERTY: P R O P E R T Y;
D_FIELD: F I E L D;
STAR : '*';
PLUS : '+';
MINUS : '-';
PIPE2 : '||';
DIV : '/';
MOD : '%';
LT2 : '<<';
GT2 : '>>';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '==';
NOT_EQ1 : '!=';
NOT_EQ2 : '<>';
OPEN_PAR : '(';
CLOSE_PAR : ')';
SCOL : ';';
DOT : '.';
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
//so on...
As far I've been able to figure out, when I write some input like this:
query fuas with field.xxx == property.yyy
, it should match.
However I recive this message:
LivingDSLParser::living:1:0: mismatched input 'query' expecting K_QUERY
I have no idea where's the problem and neither what this message means.
Whenever ANTLR can match 2 or more rules to some input, it chooses the first rule. Since both IDENTIFIER and K_QUERY match the input "query"
, and IDENTIFIER is defined before K_QUERY, IDENTIFIER is matched.
Solution: move your IDENTIFIER rule below your keyword definitions.
I am writing an antlr grammar file for a dialect of basic. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character which is used for both equality tests as well as assignment.
For example, this is a valid statement
t = (x = 5) And (y = 3)
This evaluates if x is EQUAL to 5, if y is EQUAL to 3 then performs a logical AND on those results and ASSIGNS the result to t.
My grammar will parse this; albeit incorrectly, but I think that will resolve itself once the ambiguity is resolved .
How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs equality test) in my visitor and\or listener implementation during code generation
2) Is there a better way to define the grammar such that it is already sorted out
Would someone be able to simply point me in the right direction as to how best implement this language "feature"?
Also, I have been reading through the Definitive guide to ANTLR4 as well as Language Implementation Patterns looking for a solution to this. It may be there but I have not yet found it.
Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.
parser grammar wlParser;
options { tokenVocab=wlLexer; }
program
: multistatement (NEWLINE multistatement)* NEWLINE?
;
multistatement
: statement (COLON statement)*
;
statement
: declarationStat
| defTypeStat
| assignment
| expression
;
assignment
: lvalue op=ASSIGN expression
;
expression
: <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
| (PLUS|MINUS) expression #signExprStat
| IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
| left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
| left=expression op=BSLASH right=expression #integerDivExprStat
| left=expression op=KW_MOD right=expression #modulusDivExprStat
| left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
| left=string op=AMPERSAND right=string #stringConcatenation
| left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
| left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
| op=KW_LIKE patternString #likeExprStat
| LPAREN expression RPAREN #groupingExprStat
| NUMBER #atom
| string #atom
| IDENTIFIER DATATYPESUFFIX? #atom
;
lvalue
: (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
;
string
: STRING
;
patternString
: DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
;
referenceType
: DATATYPE
;
declarationStat
: constDecl
| varDecl
;
constDecl
: CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
;
varDecl
: VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
;
varDeclPart
: IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
;
listDeclPart
: IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
;
arrayBounds
: LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
;
arrayDimension
: INTEGER (KW_TO INTEGER)?
;
defTypeStat
: DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
;
This is the lexer grammar.
lexer grammar wlLexer;
NUMBER
: INTEGER
| REAL
| BINARY
| OCTAL
| HEXIDECIMAL
;
RELATIONALOPERATORS
: EQUAL
| NEQUAL
| LT
| LTE
| GT
| GTE
;
LOGICALOPERATORS
: KW_OR
| KW_XOR
| KW_AND
| KW_NOT
| KW_IMP
| KW_EQV
;
INSTANCEOF
: KW_IS
| KW_ISA
;
CONSTDECL
: KW_PUBLIC
| KW_PRIVATE
;
DATATYPE
: KW_BOOLEAN
| KW_BYTE
| KW_INTEGER
| KW_LONG
| KW_SINGLE
| KW_DOUBLE
| KW_CURRENCY
| KW_STRING
;
VARDECL
: KW_DIM
| KW_STATIC
| KW_PUBLIC
| KW_PRIVATE
;
LABEL
: IDENTIFIER COLON
;
DEFTYPERANGE
: [a-zA-Z] MINUS [a-zA-Z]
;
DEFTYPES
: KW_DEFBOOL
| KW_DEFBYTE
| KW_DEFCUR
| KW_DEFDBL
| KW_DEFINT
| KW_DEFLNG
| KW_DEFSNG
| KW_DEFSTR
| KW_DEFVAR
;
DATATYPESUFFIX
: PERCENT
| AMPERSAND
| BANG
| POUND
| AT
| DOLLARSIGN
;
STRING
: (DQUOT (DQUOTESC|.)*? DQUOT)
| (LBRACE (RBRACEESC|.)*? RBRACE)
| (PIPE (PIPESC|.|NEWLINE)*? PIPE)
;
fragment DQUOTESC: '\"\"' ;
fragment RBRACEESC: '}}' ;
fragment PIPESC: '||' ;
INTEGER
: DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
REAL
: DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
;
BINARY
: AMPERSAND B BINARYDIGIT+
;
OCTAL
: AMPERSAND O OCTALDIGIT+
;
HEXIDECIMAL
: AMPERSAND H HEXDIGIT+
;
QUESTIONMARK: '?' ;
COLON: ':' ;
ASSIGN: '=';
SEMICOLON: ';' ;
AT: '#' ;
LPAREN: '(' ;
RPAREN: ')' ;
DQUOT: '"' ;
LBRACE: '{' ;
RBRACE: '}' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
CARAT: '^' ;
PLUS: '+' ;
MINUS: '-' ;
ASTERISK: '*' ;
FSLASH: '/' ;
BSLASH: '\\' ;
AMPERSAND: '&' ;
BANG: '!' ;
POUND: '#' ;
DOLLARSIGN: '$' ;
PERCENT: '%' ;
COMMA: ',' ;
APOSTROPHE: '\'' ;
TWOPERIODS: '..' ;
PERIOD: '.' ;
UNDERSCORE: '_' ;
PIPE: '|' ;
NEWLINE: '\r\n' | '\r' | '\n';
EQUAL: '==' ;
NEQUAL: '<>' | '><' ;
LT: '<' ;
LTE: '<=' | '=<';
GT: '>' ;
GTE: '=<'|'<=' ;
KW_AND: A N D ;
KW_BINARY: B I N A R Y ;
KW_BOOLEAN: B O O L E A N ;
KW_BYTE: B Y T E ;
KW_DATATYPE: D A T A T Y P E ;
KW_DATE: D A T E ;
KW_INTEGER: I N T E G E R ;
KW_IS: I S ;
KW_ISA: I S A ;
KW_LIKE: L I K E ;
KW_LONG: L O N G ;
KW_MOD: M O D ;
KW_NOT: N O T ;
KW_TO: T O ;
KW_FALSE: F A L S E ;
KW_TRUE: T R U E ;
KW_SINGLE: S I N G L E ;
KW_DOUBLE: D O U B L E ;
KW_CURRENCY: C U R R E N C Y ;
KW_STRING: S T R I N G ;
fragment BINARYDIGIT: ('0'|'1') ;
fragment OCTALDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT: '0'..'9' ;
fragment HEXDIGIT: ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A: ('a'|'A');
fragment B: ('b'|'B');
fragment C: ('c'|'C');
fragment D: ('d'|'D');
fragment E: ('e'|'E');
fragment F: ('f'|'F');
fragment G: ('g'|'G');
fragment H: ('h'|'H');
fragment I: ('i'|'I');
fragment J: ('j'|'J');
fragment K: ('k'|'K');
fragment L: ('l'|'L');
fragment M: ('m'|'M');
fragment N: ('n'|'N');
fragment O: ('o'|'O');
fragment P: ('p'|'P');
fragment Q: ('q'|'Q');
fragment R: ('r'|'R');
fragment S: ('s'|'S');
fragment T: ('t'|'T');
fragment U: ('u'|'U');
fragment V: ('v'|'V');
fragment W: ('w'|'W');
fragment X: ('x'|'X');
fragment Y: ('y'|'Y');
fragment Z: ('z'|'Z');
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_~]*
;
LINE_ESCAPE
: (' ' | '\t') UNDERSCORE ('\r'? | '\n')
;
WS
: [ \t] -> skip
;
Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):
grammar Foo;
program:
(statement | exprOtherThanEquality)*
;
statement:
assignment
;
expr:
equality | exprOtherThanEquality
;
exprOtherThanEquality:
boolAndOr
;
boolAndOr:
atom (BOOL_OP expr)*
;
equality:
atom EQUAL expr
;
assignment:
VAR EQUAL expr ENDL
;
atom:
BOOL |
VAR |
INT |
group
;
group:
LEFT_PARENTH expr RGHT_PARENTH
;
ENDL : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL : '=' ;
BOOL:
'true' | 'false'
;
BOOL_OP:
'and' | 'or'
;
VAR:
[A-Za-z_]+ [A-Za-z_0-9]*
;
INT:
'-'? [0-9]+
;
WS:
[ \t\r\n] -> skip
;
Here is the parse tree for the input: t = (x = 5) and (y = 2);
In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:
The grammar below parses ( left part = right part # comment ), # comment is optional.
Two questions:
Sometimes warning (ANTLRWorks 1.4.2):
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
The next extension should be that the comment (id2) can contain chars '(' and ')'.
The grammar:
grammar NestedBrackets1a1;
//==========================================================
// Lexer Rules
//==========================================================
Int
: Digit+
;
fragment Digit
: '0'..'9'
;
Special
: ( TCUnderscore | TCQuote )
;
TCListStart : '(' ;
TCListEnd : ')' ;
fragment TCUnderscore : '_' ;
fragment TCQuote : '"' ;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
//==========================================================
// Parser Rules
//==========================================================
assignment
: TCListStart id1 '=' id1 ( comment )? TCListEnd
;
id1
: Word+
;
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )+
;
ANTLRStarter wrote:
Sometimes warning (ANTLRWorks 1.4.2): Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2 (referencing id2)
But only sometimes!
No, the grammar you posted will always produce this warning. Perhaps you don't always notice it (your IDE-plugin or ANTLRWorks might show it in a tab you don't have opened), but the warning is there. Convince yourself by creating a lexer/parser from the command line:
java -cp antlr-3.4-complete.jar org.antlr.Tool NestedBrackets1a1.g
will produce:
warning(200): NestedBrackets1a1.g:49:19:
Decision can match input such as "{Int, Word}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
This is because you have a * after ( id2 ) inside your comment rule, and id2 also is a repetition of tokens: ( Word | Int )+. Let's say your input is "# foo bar" (a # followed by two Word tokens). ANTLR can now parse the input in more than 1 way: the 2 tokens "foo" and "bar" could be matched by ( id2 )*, where id2 matches a single Word token at a time, but "foo" and "bar" could also be matches in one go of the id2 rule.
Look at the merged rules:
comment
: '#' ( ( Word | Int )+ )*
;
See how you're repeating a repetition: ( ( ... )+ )*? This is usually a problem, as it is in your case.
Resolve this problem by either replacing the * with a ?:
comment
: '#' ( id2 )?
;
id2
: ( Word | Int )+
;
or by removing the +:
comment
: '#' ( id2 )*
;
id2
: ( Word | Int )
;
ANTLRStarter wrote:
The next extension should be that the comment (id2) can contain chars '(' and ')'.
That is asking for trouble since a comment is followed by a TCListEnd, which is a ). I don't recommend letting a comment match ).
EDIT
Note that comments are usually stripped from the source file while tokenizing the input source. That way you don't need to account for them in your parser rules. You can do that by "skipping" these tokens in a lexer rule:
Comment
: '#' ~('\r' | '\n')* {skip();}
;