I am using ANTLR 'org.antlr:antlr4:4.9.2' and have come across the "dangling else" ambiguity problem; see the following grammar, IfStat.g4.
// file: IfStat.g4
grammar IfStat;
stat : 'if' expr 'then' stat
| 'if' expr 'then' stat 'else' stat
| expr
;
expr : ID ;
ID : LETTER (LETTER | [0-9])* ;
fragment LETTER : [a-zA-Z] ;
WS : [ \t\n\r]+ -> skip ;
I tested this grammar against the input "if a then if b then c else d". It is parsed as "if a then (if b then c else d)", as expected. How does ANTLR4 resolve this ambiguity?
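When more than one alternative can successfully match the same input, ANTLR resolves the ambiguity by choosing the alternative that is listed first in the rule. Here both alternatives of the outer stat can match the whole input, so the first one ('if' expr 'then' stat) wins and the else attaches to the inner if.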
You can enable ANTLR to report such ambiguities in your grammar. Check this Q&A for that: Ambiguity in grammar not reported by ANTLR
My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression, which is what you're trying to do.
Hard to explain with the title, but here's the problem:
I'm trying to parse an equation with ANTLR4.
My current definition of a number:
NUMBER: SUBTRACTION? DIGIT+ ([.,] DIGIT+)? ;
Here DIGIT matches any digit and SUBTRACTION matches '-'.
My parser rule for subtraction:
subtraction: value? SUBTRACTION (value|operation)?;
The idea is the parser still works with a missing value.
Let's say I have this input
1-2
The problem is that, with this input, ANTLR says that 1 is a number and -2 is a number. It doesn't group the input as a subtraction, like 1 SUBTRACTION 2.
What can I do to get the correct grouping?
When you define rules like this:
SUBTRACTION : '-';
NUMBER : SUBTRACTION? DIGIT+ ([.,] DIGIT+)? ;
input like -2 will always be a single NUMBER token: it will never be tokenised as separate SUBTRACTION and NUMBER tokens. The lexer always tries to match as many characters as possible.
You should not glue the - to the NUMBER in the lexer, but do that in the parser like SUBTRACTION expr:
expr
: SUBTRACTION expr
| expr SUBTRACTION expr
| NUMBER
;
SUBTRACTION : '-';
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
fragment DIGIT : [0-9];
which will parse both 1-2 and -2 correctly.
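To make the "as many characters as possible" behaviour concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR itself: the tokenize helper and the rule tables glued and split are made up for illustration, and the regexes are hand translations of the lexer rules above.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# The '-' glued to NUMBER, as in the question (hypothetical regex translation).
glued = [("SUBTRACTION", r"-"), ("NUMBER", r"-?[0-9]+([.,][0-9]+)?")]
# The '-' left out of NUMBER, as in the suggested fix.
split = [("SUBTRACTION", r"-"), ("NUMBER", r"[0-9]+([.,][0-9]+)?")]

glued_tokens = tokenize("1-2", glued)
split_tokens = tokenize("1-2", split)
print(glued_tokens)
print(split_tokens)
```

With the glued rules, "-2" is a longer match than "-" alone, so no SUBTRACTION token is ever produced for this input. With the split rules the parser receives three tokens and can decide between unary and binary minus itself.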
I want to write a grammar using Antlr4 that will parse a definition, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
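The tie-breaking described above can be imitated in a few lines of Python. This is only a sketch under assumptions: winning_rule is a made-up helper, and the regexes are hand translations of the lexer rules in the question, not anything ANTLR generates.

```python
import re

IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"

# Same order as in the grammar: the keyword first, then three identical rules.
RULES = [
    ("TYPE_KEYWORD", r"type"),
    ("TYPE_NAME", IDENTIFIER),
    ("PROPERTY_NAME", IDENTIFIER),
    ("PROPERTY_TYPE", IDENTIFIER),
]

def winning_rule(lexeme, rules):
    """Return the rule with the longest match; earlier rules win ties."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, lexeme)
        # Strictly greater, so a later rule with the same length never wins.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name

winners = {w: winning_rule(w, RULES) for w in ["type", "SimpleType", "intProp1", "int"]}
print(winners)
```

PROPERTY_NAME and PROPERTY_TYPE never win in this sketch, which agrees with the token dump above where every identifier shows up as TYPE_NAME.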
I am stuck trying to parse an ISO 8601 duration string (e.g. "P3M2D"). Note that this does not allow embedded spaces. I am using ANTLR 4.7.
When I tried using a lexer rule
ISO8601_INTERVAL
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( d=NUMBER_INT 'D' )?
| 'T' etc
;
I get a compile time warning like "labels in lexer rules are not supported in ANTLR 4; actions cannot reference elements of lexical rules but you can use getText() to get the entire text matched for the rule".
I would like to avoid this manual parsing.
When I tried using a parser rule
iso8601_INTERVAL
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( d=NUMBER_INT 'D' )?
| 'T' etc
;
I get an error like "line 8:39 mismatched input 'P2D' expecting {'P'..."
Is it because the lexer is expecting tokens to be separated by WS? If yes, how to temporarily suspend that?
What's the right way to have ANTLR4 parse out the parts of the duration input? I am rather new to ANTLR and compilers.
No, ANTLR doesn't expect lexer tokens to be separated by whitespace.
From what you provided in your question the following grammar should do the job:
specs:
iso*
;
iso:
P (y=INT Y)? (m=INT M)? (d=INT D)?
;
P: 'P' ;
Y: 'Y' ;
M: 'M' ;
D: 'D' ;
INT: [0-9]+ ;
As you can see, I didn't really change anything in your grammar. That is because I suspect that the error lies somewhere else in your grammar, but as you haven't provided the whole grammar, the only thing I can give you is this small but (hopefully) working stand-alone grammar.
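One plausible cause, and this is only a guess since the full grammar isn't shown, is an identifier-like lexer rule elsewhere in the grammar that swallows the whole string. The quoted error mentions a single token with the text 'P2D', which is what such a rule would produce. The Python sketch below imitates longest-match lexing with regular expressions; tokenize, answer_rules, and the ID rule are hypothetical names used for illustration, and INT is assumed to accept one or more digits.

```python
import re

def tokenize(text, rules):
    """Toy lexer: prefer the longest match, then the rule listed earliest."""
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for index, (name, pattern) in enumerate(rules):
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                # max() picks the largest length, then the smallest index.
                candidates.append((m.end() - pos, -index, name))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        length, _, name = max(candidates)
        tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens

# Rules of the stand-alone grammar (INT assumed to be one or more digits).
answer_rules = [("P", r"P"), ("Y", r"Y"), ("M", r"M"), ("D", r"D"), ("INT", r"[0-9]+")]
# Hypothetical extra rule that a larger grammar might contain.
with_id = answer_rules + [("ID", r"[A-Za-z][A-Za-z0-9]*")]

alone = tokenize("P3M2D", answer_rules)
shadowed = tokenize("P3M2D", with_id)
print(alone)
print(shadowed)
```

If something like this is what happens in the real grammar, the parser rule never gets to see a separate 'P' token, no matter how the rule itself is written.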