antlr4 grammar fails to match initial identifier - antlr4

I have a grammar that is failing on the first token. I've stripped it down some to reduce the choices, but still have the error:
line 1:0 mismatched input 'main' expecting {<EOF>, '#', 'def', IDENTIFIER}
I expect the token 'main' to match IDENTIFIER, which has this lexical production:
IDENTIFIER : [a-zA-Z][a-zA-Z0-9]*;
Why would that be failing?

One of the following is happening:
You have another rule in the grammar located before IDENTIFIER that also matches the input main.
You have a combined grammar (declared as grammar T instead of parser grammar T or lexer grammar T), where one of the parser rules contains the literal 'main' which is causing a separate lexer rule to be implicitly created for this literal.
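Both cases come down to the same tie-break: when several lexer rules match the same longest piece of input, the rule declared first wins, and literals used in the parser rules of a combined grammar act as if they were declared before every named lexer rule. Here is a toy model of that tie-break in Python. It is only a sketch, not ANTLR's real implementation; the helper pick_rule and the rule name NAME are made up for illustration.

```python
import re

def pick_rule(text, rules):
    """Return the name of the rule that wins at the start of `text`.

    Toy model: the longest match wins, and among equally long
    matches the rule listed first wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly greater, so an earlier rule keeps a tie.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*")

# Case 1: a hypothetical earlier rule (NAME) also matches "main".
earlier_rule_first = [("NAME", r"[a-z]+"), IDENTIFIER]
# Case 2: a literal 'main' in a parser rule becomes an implicit token
# that is ordered before the named lexer rules.
implicit_literal = [("'main'", r"main"), IDENTIFIER]
# Control: IDENTIFIER on its own.
only_identifier = [IDENTIFIER]

print(pick_rule("main", earlier_rule_first))  # NAME
print(pick_rule("main", implicit_literal))    # 'main'
print(pick_rule("main", only_identifier))     # IDENTIFIER
```

In both problem cases the text main never reaches the parser as IDENTIFIER, which is what the error message reports.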

Related

ANTLR 4: Recognises 'and' but not 'or' without a space

I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:
// Take care of whitespace.
WS : [ \r\t\f\n]+ -> skip;
OTHER: . -> skip;
STRING
: '"' [A-z ]+ '"'
;
evaluate // starting rule.
: textbox? // could be an empty textbox.
;
textbox
: (row '\n')*
;
row
: ability
| ability_list
;
ability
: activated_ability
| triggered_ability
| static_ability
;
triggered_ability
: trigger_words ',' STRING
;
trigger_words
: ('when'|'whenever'|'as') whenever_triggers|'at'
;
whenever_triggers
: triggerer (('or'|'and') triggerer)* // this line has the issue.
;
triggerer
: self
;
self: '~';
I pass it this text: whenever ~ or ~, and it fails on the or, saying line 1:10 mismatched input ' or' expecting {'or', 'and'}. However, if I add a space to the whenever_triggers rule's or string (making it ' or'|'and'), it works fine.
The weirdest thing is that if I try whenever ~ and ~, it works fine even without the rule having a space in the and string. This doesn't change if I make 'and'|'or' a lexer rule either. It's just bizarre. I've confirmed this bug happens when running the 'test rig' in Antlrworks 2, so it's not just an IntelliJ thing.
[Image: parse tree at the point where the error occurs]

Alright, you have more or less found the answer yourself, so I will focus on explaining why the problem occurred in the first place.
First of all, for everyone stumbling upon this question: the problem was that the asker had another implicit lexer rule defined that looked like this: ' or' (notice the leading space). Changing that to 'or' resolved the problem.
But why was that a problem?
To understand that, you have to know what ANTLR does when you write '<something>' in one of your parser rules: when it compiles the grammar, it generates a new lexer rule for each of those literals. These implicit lexer rules are placed before the lexer rules defined in your grammar. The lexer then splits the given input into tokens. At each position it takes the rule that matches the longest stretch of input; if several rules match the same length, the one declared first wins, so the implicit token definitions take precedence over the topmost "real" lexer rule.
The problem is that the lexer isn't too clever about this process: it knows nothing about what the parser expects. Once it has matched some input with a lexer rule, it creates the corresponding token and moves on to the remaining input.
As a result, another lexer rule that would have matched part of that input (but as a different token type) is never considered, so the input might not get the token type you expect because one lexer rule has shadowed another.
In your example the conflicting rules are ' or' (Token 1) and 'or' (Token 2). Each of those literals results in its own implicit lexer rule, and the first one wins here because it matches more input: it also consumes the preceding space, whereas WS on its own matches only that single space.
Now look at your input: whenever ~ or ~. Once whenever and the first ~ have been matched, the lexer is at the space in front of or. The rule ' or' matches three characters there, which beats WS, so the lexer emits Token 1.
The parser, on the other hand, is expecting Token 2 at this point, so it complains about the given input (although what it is really complaining about is the wrong token type). Altering the input to whenever ~or ~ results in the correct interpretation.
That is exactly why you shouldn't use implicit token definitions in your grammar (unless it is really small). Create an explicit lexer rule for every token and start with the most specific rules: rules that match fixed character sequences (e.g. keywords) should be declared before general lexer rules such as ID or STRING. Catch-all rules, which match any character so that the lexer does not report an error on unrecognized input, have to be declared last, as they would otherwise take precedence over the lexer rules after them.
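To make the effect concrete, here is a toy lexer in Python that follows the rule ANTLR's lexer uses to choose between candidates: the longest match wins, and declaration order only breaks ties. It is a sketch for illustration, not ANTLR's actual algorithm, and the function name tokenize and the rule tables are invented here.

```python
import re

def tokenize(text, rules, skip=("WS",)):
    """Toy maximal-munch lexer: longest match wins, first rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# Implicit literal tokens come first, then the named rules.
with_stray_space = [
    ("'whenever'", r"whenever"), ("' or'", r" or"), ("'or'", r"or"),
    ("'and'", r"and"), ("'~'", r"~"), ("WS", r"[ \r\t\f\n]+"),
]
# The same table without the accidental ' or' literal.
fixed = [rule for rule in with_stray_space if rule[0] != "' or'"]

print(tokenize("whenever ~ or ~", with_stray_space))
print(tokenize("whenever ~ or ~", fixed))
print(tokenize("whenever ~ and ~", with_stray_space))
```

In this model the three-character match ' or' beats the one-character WS match at column 10, which is the column reported in the error message. There is no ' and' literal, so the and input is unaffected.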

Simple Xtext example generates grammar that Antlr4 doesn't like - who's to blame?

While using XText, I have come across a problem and I am not sure if Antlr4 or XText is at fault or if I'm just missing something. I understand that Antlr4 is not supported by Xtext, but it seems like this particular case should not cause a problem.
Here is a simple Xtext file:
grammar com.github.jsculley.antlr4.Test with org.eclipse.xtext.common.Terminals
generate test "http://www.github.com/jsculley/antlr4/test"
aRule:
name=STRING
;
STRING is defined in the XText rule from org.eclipse.xtext.common.Terminals:
terminal STRING :
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
The generated Antlr grammar has the following rule:
RULE_STRING : ('"' ('\\' .|~(('\\'|'"')))* '"'|'\'' ('\\' .|~(('\\'|'\'')))* '\'');
The Antlr 3.5.2 tool has no problem with this rule, but the Antlr4 tool spits out the following errors:
error(50): InternalTest.g:102:29: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:102:62: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:102:74: syntax error: mismatched input ')' expecting SEMI while matching a lexer rule
error(50): InternalTest.g:106:25: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:106:36: syntax error: mismatched input ')' expecting SEMI while matching a lexer rule
Antlr4 doesn't like the extra (and seemingly unnecessary) sets of parentheses around the group after each '~' operator. So the question is: is Xtext generating a bad grammar, or is Antlr4 not handling a valid construct?

Xtext generates an ANTLR 3.x grammar, and ANTLR 3 and ANTLR 4 grammars are not compatible.
It seems that ANTLR 4 does not handle parentheses correctly: Parser issues mutual left recursion error when the left-recursive part of a rule is in parenthesis.
So just remove the redundant parentheses, and ANTLR 4 should generate a parser that is fully compatible with the ANTLR 3 one. I ported a PL/SQL grammar from ANTLR 3 to ANTLR 4. Moreover, ANTLR 4 has a more powerful parsing algorithm than the previous version.
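If the generated grammar is large, the redundant parentheses can be stripped mechanically. The following Python sketch handles only the pattern seen in the rule above: one extra pair of parentheses directly after '~', with no nested parentheses inside the set. The function name is invented for this example, and this is not a general ANTLR 3 to ANTLR 4 converter.

```python
import re

# One level of redundant parentheses directly after '~': ~((...)) -> ~(...)
# Assumes the set itself contains no parentheses, as in the rule above.
REDUNDANT_SET_PARENS = re.compile(r"~\(\(([^()]*)\)\)")

def strip_redundant_set_parens(grammar_text):
    """Rewrite ~((a|b)) as ~(a|b); leave everything else untouched."""
    return REDUNDANT_SET_PARENS.sub(r"~(\1)", grammar_text)

rule = r"""RULE_STRING : ('"' ('\\' .|~(('\\'|'"')))* '"'|'\'' ('\\' .|~(('\\'|'\'')))* '\'');"""
fixed_rule = strip_redundant_set_parens(rule)
print(fixed_rule)
```

The rest of the generated grammar may still use other ANTLR 3 constructs that ANTLR 4 rejects, so treat this as a first pass only.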

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This is because the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)

First, you need to know that the word target is matched as an ID token and not as a TARGET token: since you have written the rule ID before TARGET, the lexer will always recognize it as ID. Notice that the word target fully satisfies both the ID and the TARGET lexer rules. I'm going to suppose that you are writing a language in which target, which is a keyword, can also be used as an id. In the book "The Definitive ANTLR Reference" there is a section titled "Treating Keywords As Identifiers" that deals with exactly these kinds of issues; I suggest you take a look at it. If you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into a parser grammar and a lexer grammar.

As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to tokenize a string as TARGET only if it cannot be tokenized as ID. This is all made more obscure by the fact that ANTLR lexer rules look like ANTLR parser rules, though they are really quite different beasts.
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
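id : TARGET
{ // Java target assumed: complain if not a legal identifier (e.g., contains slashes).
if (!$TARGET.text.matches("[a-zA-Z_][a-zA-Z0-9_]*"))
notifyErrorListeners("not a legal identifier: " + $TARGET.text);
}
;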
assignment
: id '=' STRING
;
That seems like it would solve the lexing issue, and it would allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
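The idea of deferring the ID/TARGET decision can be sketched outside ANTLR as well. The Python below lexes every name as TARGET and only checks that it is a legal identifier once the parser has seen the '=' separator. It is a hand-written illustration under simplified assumptions (the token names and the lex and parse helpers are made up), not generated ANTLR code. To sidestep the ambiguity mentioned above, a name followed by '=' or ':' is treated as the start of the next statement.

```python
import re

TOKEN = re.compile(
    r'(?P<TARGET>[A-Za-z0-9_./-]+)|(?P<STRING>"[A-Za-z ]*")'
    r'|(?P<EQ>=)|(?P<COLON>:)|(?P<WS>\s+)'
)
LEGAL_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def lex(text):
    """Split text into (kind, text) pairs; every name is a TARGET."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(text):
    """Return ('assign', id, string) and ('rule', target, deps) tuples."""
    tokens = lex(text)

    def kind(i):
        return tokens[i][0] if i < len(tokens) else None

    out, i = [], 0
    while i < len(tokens):
        if kind(i) != "TARGET":
            raise SyntaxError(f"expected a name, got {tokens[i][1]!r}")
        name = tokens[i][1]
        if kind(i + 1) == "EQ":
            # Only now do we insist that the name is a legal identifier.
            if not LEGAL_ID.fullmatch(name):
                raise SyntaxError(f"{name!r} is not a legal identifier")
            if kind(i + 2) != "STRING":
                raise SyntaxError(f"expected a string after {name!r} =")
            out.append(("assign", name, tokens[i + 2][1]))
            i += 3
        elif kind(i + 1) == "COLON":
            i += 2
            deps = []
            # A name followed by '=' or ':' starts the next statement.
            while kind(i) == "TARGET" and kind(i + 1) not in ("EQ", "COLON"):
                deps.append(tokens[i][1])
                i += 1
            out.append(("rule", name, deps))
        else:
            raise SyntaxError(f"expected '=' or ':' after {name!r}")
    return out

print(parse('hello="world"\ntarget: CLASSES'))
```

An input such as a/b="x" lexes without trouble but is rejected by parse with a message that names the offending identifier, which is the more helpful error the answer has in mind.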

why whitespace not allowed between two keywords/constants written in xtext file for DSL

Whitespace between IF and ( is not accepted. For example, IF( works, but IF ( causes a parser error.
The Rule is:
Condition returns ResultExpression:
'IF' '(' condition=BooleanExpression ')' '{' then=ResultExpressionRhs '}'
(=> 'ELSE' '{' else=ResultExpression '}')?;

It's hard to tell what's going on from just this minimal grammar snippet.
Please check your xtext file for the following things:
A proper hidden clause that includes the WS
A keyword 'IF(' that may have been introduced by accident
Warnings when executing the workflow.

Troubles with returns declaration on the first parser rule in an ANTLR4 grammar

I am using returns for my parser rules, which works for all parser rules except the first one. If the first parser rule in my grammar uses the returns declaration, ANTLR4 complains as follows:
expecting ARG_ACTION while matching a rule
If I add another parser rule above which does not use "returns" ANTLR does not complain.
Here you have a grammar reduced to the problem:
grammar FirstParserRuleReturnIssue;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
aRule returns [String s]: ID { $s = $ID.text; };
I searched for a special role of the first rule that could explain this behaviour but did not find anything. Is it a bug? Am I missing something?

You need to place parser rules (which start with a lowercase letter) before lexer rules (which start with an uppercase letter) in your grammar. After encountering a lexer rule, the [ triggers a LEXER_CHAR_SET instead of ARG_ACTION, so the token stream seen by the compiler looks like you're passing a set of characters where the return value should be.

Resources