Antlr failing to match [a-z]+ - antlr4

I'm trying to split off the first word of the input, so that "apple a type of fruit" would be split into "apple" and "a type of fruit".
grammar Hello;
entry
: headword definition
;
headword
: HEADWORD
;
definition
: ANYTHING
;
HEADWORD : [a-z]+ ;
SPACE : [ \t]+ -> skip ;
ANYTHING : .+;
Using the Getting Started walkthrough:
C:\Code\antlr\hello>java org.antlr.v4.Tool Hello.g4
warning(131): Hello.g4:17:12: greedy block ()+ contains wildcard; the non-greedy syntax ()+? may be preferred
C:\Code\antlr\hello>javac Hello*.java
C:\Code\antlr\hello>grun Hello entry -tree
C:\Code\antlr\hello>java org.antlr.v4.gui.TestRig Hello entry -tree
asdf asdf
^Z
line 1:0 missing HEADWORD at 'asdf asdf\r\n'
(entry (headword <missing HEADWORD>) (definition asdf asdf\r\n))
Why is it failing to match HEADWORD?
Even trying to match HEADWORD directly doesn't work:
C:\Code\antlr\hello>grun Hello HEADWORD -tree
asdf^Z
Terminate batch job (Y/N)? y
It seems to loop forever so I had to kill it with Ctrl-C.

An ANTLR lexer rule consumes as many characters as possible. Your rule ANYTHING : .+; therefore consumes the entire input and becomes the only token created. That is why no HEADWORD token is produced.
Yes, ANTLR's lexer rules are matched top to bottom, but the order only matters when two (or more) lexer rules match the same number of characters. Only in that case does the rule defined first "win". In your case, ANYTHING will (almost) always match the most characters, and will therefore be the sole token created.
If your input had been just "apple", a HEADWORD token would have been created, because both HEADWORD and ANYTHING match the whole input, but HEADWORD is defined first, so it takes precedence.
Changing .+ to .+? will just cause ANYTHING never to be matched for your input apple a type of fruit: only 5 HEADWORD tokens will be created.
As a rule of thumb, it is never a good idea to let a lexer rule end with .* or .+ (or .*? or .+?).
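The longest-match rule with its first-declared tie-break can be sketched in a few lines of Python. This is only a simulation built on the re module, not ANTLR itself: the tokenize helper and the rule tables are hypothetical names, and the non-greedy variant is modelled as a single-character rule on the assumption that a trailing .+? stops after one character.

```python
import re

# Rules in declaration order, mirroring the grammar in the question.
GREEDY_RULES = [
    ("HEADWORD", re.compile(r"[a-z]+")),
    ("SPACE", re.compile(r"[ \t]+")),
    ("ANYTHING", re.compile(r".+", re.DOTALL)),
]

# Assumption: a trailing non-greedy .+? behaves like a single character.
NON_GREEDY_RULES = GREEDY_RULES[:2] + [("ANYTHING", re.compile(r".", re.DOTALL))]

SKIPPED = {"SPACE"}


def tokenize(text, rules):
    """Longest match wins; on a tie the rule declared first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("apple a type of fruit", GREEDY_RULES))
print(tokenize("apple", GREEDY_RULES))
print(tokenize("apple a type of fruit", NON_GREEDY_RULES))
```

With the greedy rules the whole sentence comes back as one ANYTHING token, the single word comes back as HEADWORD, and with the single-character stand-in the sentence comes back as five HEADWORD tokens.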

Related

ANTLR 4: Recognises 'and' but not 'or' without a space

I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:
// Take care of whitespace.
WS : [ \r\t\f\n]+ -> skip;
OTHER: . -> skip;
STRING
: '"' [A-z ]+ '"'
;
evaluate // starting rule.
: textbox? // could be an empty textbox.
;
textbox
: (row '\n')*
;
row
: ability
| ability_list
;
ability
: activated_ability
| triggered_ability
| static_ability
;
triggered_ability
: trigger_words ',' STRING
;
trigger_words
: ('when'|'whenever'|'as') whenever_triggers|'at'
;
whenever_triggers
: triggerer (('or'|'and') triggerer)* // this line has the issue.
;
triggerer
: self
;
self: '~'
;
I pass it this text: whenever ~ or ~, and it fails on the or, saying line 1:10 mismatched input ' or' expecting {'or', 'and'}. However, if I add a space to the whenever_triggers rule's or string (making it ' or'|'and'), it works fine.
The weirdest thing is that if I try whenever ~ and ~, it works fine even without the rule having a space in the and string. This doesn't change if I make 'and'|'or' a lexer rule either. It's just bizarre. I've confirmed this bug happens when running the 'test rig' in Antlrworks 2, so it's not just an IntelliJ thing.
This is an image of the parse tree when the error occurs:
Alright, you have more or less found the answer yourself, so in this answer I will focus on explaining why the problem occurred in the first place.
First of all, for everyone stumbling upon this question: the problem was that the asker had another implicit lexer rule defined, ' or' (notice the leading space). Changing that to 'or' resolved the problem.
But why was that a problem?
To understand that, you have to know what ANTLR does when you write '<something>' in one of your parser rules: when compiling the grammar it generates an implicit lexer rule for each distinct literal. These implicit rules are placed before the lexer rules defined explicitly in your grammar. The lexer then splits the input into tokens: at each position it picks the rule that matches the most characters, and only when several rules match the same number of characters does the one declared first win. On such a tie the implicit token definitions therefore beat the "real" lexer rules.
The problem is that the lexer makes this decision without knowing what the parser expects: once it has matched some input with a lexer rule, it creates the corresponding token and moves on with the remaining input.
As a result, a different lexer rule that could also have matched part of that input (but as another token type) never gets a chance, so the input may end up with a token type the parser did not expect.
In your example the conflicting rules are ' or' (Token 1) and 'or' (Token 2). Each of those implicit declarations results in a different lexer rule. At the space before or, ' or' is the longest match (three characters, against one character for WS), so it wins regardless of where it is declared.
Now look at your input: whenever ~ or ~. Once whenever, the skipped space and the first ~ have been matched, the lexer is positioned at the space before or. The rule ' or' matches there, as there really is a space before the or, so the text is matched as Token 1.
The parser, on the other hand, is expecting a Token 2 at this point, so it complains about the given input (although what it is really complaining about is the wrong token type). Altering the input to whenever ~or ~ will result in the correct interpretation.
This is exactly why you shouldn't use implicit token definitions in your grammar (unless it is really small). Create an explicit lexer rule for every token and start with the most specific rules. That is, rules that match special character sequences (e.g. keywords) should be declared before general lexer rules like ID or STRING. Catch-all rules that match any character, in order to prevent the lexer from throwing an error upon unrecognized input, have to be declared last, because on a tie they would otherwise win over every lexer rule declared after them.
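The effect of the stray ' or' literal can be reproduced with a small Python simulation of the lexer. This is a sketch under assumptions, not ANTLR code: literal, tokenize and the two rule lists are hypothetical names, only the literals needed for the example input are included, and the lexer is assumed to pick the longest match and to fall back to declaration order only on ties.

```python
import re


def literal(text):
    # Stand-in for the implicit token ANTLR creates for a quoted literal.
    return (f"'{text}'", re.compile(re.escape(text)))


WS = ("WS", re.compile(r"[ \r\t\f\n]+"))

# ' or' is deliberately declared last to show that order is not what decides here.
WITH_SPACE = [literal(t) for t in ("whenever", "~", "or", "and", " or")] + [WS]
WITHOUT_SPACE = [literal(t) for t in ("whenever", "~", "or", "and")] + [WS]


def tokenize(text, rules):
    """Return (token type, start offset) pairs; WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Longest match wins; an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, pos))
        pos += best_len
    return tokens


print(tokenize("whenever ~ or ~", WITH_SPACE))
print(tokenize("whenever ~ or ~", WITHOUT_SPACE))
```

In the first run the token at offset 10 has type ' or', which is the column reported in the error message; in the second run the space is skipped as WS and the token 'or' starts at offset 11.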

Antlr4 match whole input string or bust

I am new to Antlr4 and have been wracking my brain for some days now about a behaviour that I simply don't understand. I have the following combined grammar and expect it to fail and report an error, but it doesn't:
grammar MWE;
parse: cell EOF;
cell: WORD;
WORD: ('a'..'z')+;
If I feed it the input
a4
I expect the parse to fail, because I want it to match the whole input string and not just a part of it, as signified by the EOF. Instead it reports no error (I listen for errors with an error listener implementing the IAntlrErrorListener interface) and gives me the following parse tree:
(parse (cell a) <EOF>)
Why is this?
When the lexer reaches input that no lexer rule matches, its error recovery mechanism is to drop a character and continue with the next one. In your case, the lexer drops the 4 character, so your parser sees the equivalent of this input:
a
The solution is to instruct the lexer to create a token for the dropped character rather than ignore it, and pass that token on to the parser where an error will be reported. In the grammar, this rule takes the following form and is always added as the last rule in the grammar. If you have multiple lexer modes, a rule with this form should appear as the last rule in the default mode as well as the last rule in each extra mode.
ErrChar
: .
;
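The difference the catch-all rule makes can be illustrated with a toy lexer and parser in Python. This is a sketch only: lex, parses and the catch_all flag are hypothetical names, and the toy parser simply checks that the token types are exactly WORD followed by EOF, as in parse: cell EOF.

```python
import re

WORD = re.compile(r"[a-z]+")


def lex(text, catch_all):
    """Tokenize with the single WORD rule, optionally with an ErrChar catch-all."""
    tokens, pos = [], 0
    while pos < len(text):
        m = WORD.match(text, pos)
        if m:
            tokens.append(("WORD", m.group()))
            pos = m.end()
        elif catch_all:
            # ErrChar : . ;  turns the stray character into a token.
            tokens.append(("ErrChar", text[pos]))
            pos += 1
        else:
            # No rule matches: drop the character and carry on.
            pos += 1
    tokens.append(("EOF", ""))
    return tokens


def parses(tokens):
    # parse: cell EOF;  cell: WORD;
    return [kind for kind, _ in tokens] == ["WORD", "EOF"]


print(lex("a4", catch_all=False), parses(lex("a4", catch_all=False)))
print(lex("a4", catch_all=True), parses(lex("a4", catch_all=True)))
```

Without the catch-all the parser only ever sees WORD and EOF, so it has nothing to complain about; with it, the ErrChar token sits between them and the parse fails as the asker wanted.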

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This happens because "target" is being tokenized as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as an ID token and not as a TARGET token: since you have written the rule ID before TARGET, it will always be recognized as ID by the lexer. Notice that the word target fully satisfies both the ID and the TARGET lexer rule (I'm going to suppose that you are writing a language), meaning that target, which is a keyword, can also be used as an id. In the book "The definitive ANTLR reference" there is a section titled "Treating Keywords As Identifiers" that deals with exactly these kinds of issues. I suggest you take a look at that. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into a parser grammar and a lexer grammar.
As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to only tokenize a string as TARGET if it cannot be tokenized as ID. This is all made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
id : TARGET
{ /* complain if not a legal identifier (e.g., contains slashes, etc.) */ }
;
assignment
: id '=' STRING
;
That should solve the lexing issue and allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
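The idea of lexing every word as TARGET and only checking identifier legality once the separator is known can be sketched in Python. This is an illustration rather than generated ANTLR code: classify is a hypothetical helper, the two regular expressions are transcribed from the ID and TARGET rules in the question, and raising ValueError stands in for whatever error reporting the real parser action would use.

```python
import re

TARGET = re.compile(r"[a-zA-Z0-9_\-/.]+")         # the question's TARGET rule
LEGAL_ID = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")  # the question's ID rule


def classify(line):
    """Lex the first word as TARGET, then let the separator pick the rule."""
    m = TARGET.match(line)
    if not m:
        raise ValueError("expected a word at the start of the line")
    word = m.group()
    rest = line[m.end():].lstrip(" \t")
    if rest.startswith("="):
        # Only an assignment requires the word to be a legal identifier.
        if not LEGAL_ID.fullmatch(word):
            raise ValueError(f"'{word}' is not a legal identifier")
        return ("assignment", word)
    if rest.startswith(":"):
        return ("targetRule", word)
    raise ValueError(f"expected '=' or ':' after '{word}'")


print(classify('hello="world"'))
print(classify("target: CLASSES"))
```

A line such as src/main="x" is rejected with a message that names the offending word, which is the more specific error the answer has in mind.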

ANTLR 4 Lexer rule : how to ignore a part?

Here is a related topic for previous ANTLR version :
Java ANTLR how to ignore part of rule? ignore part after subrule
With a lexer rule like :
R1
: [a-zA-Z0-9]* ';'
;
For example, I have this input text:
test;rezrezr
zrezrzerz
It will match "test;", which is correct, but I only need the "test" string.
Do I need to take care of the ';' character manually, in a custom listener for example? Or is there a way to specify in the grammar that I want to leave it out (using only lexer rules)?
UPDATE
test1;rezrezr
zrezrzerz
test2;rezrezr
zrezrzerz
If you want to avoid the ; character, simply remove it from the lexer rule. Note that I also changed the * to a + to ensure that R1 is never a zero-length token.
R1
: [a-zA-Z0-9]+
;

Lexers w/ Phrase Tokens

I'm experimenting with ANTLR4 on a grammar that would best be tokenized into phrases rather than words (i.e., most of the tokens may contain spaces). In some cases, however, I want to capture specific substring phrases as individual tokens. Consider the following example:
Occurrence A of Encounter Performed
The phrase "Occurrence A of" is special-- whenever I see it, I want to pull it out. The rest of the statement ("Encounter Performed") is fairly arbitrary and for the purposes of this example, could be anything.
For this example, I've whipped up this quick grammar:
grammar test;
stat: OCCURRENCE PHRASE;
OCCURRENCE: 'Occurrence' LABEL 'of' ;
fragment LABEL: [A-Z] ;
PHRASE: (WORD ' ')* WORD ;
fragment WORD: [a-zA-Z\-]+ ;
WS: [ \t\n\r]+ -> skip ;
If I test it against the statement above, it fails ("line 1:0 missing OCCURRENCE at 'Occurrence A of Encounter Performed'"). I believe this is because the lexer will match on the token that can consume the most consecutive characters (PHRASE, in this case).
So... I understand the problem-- I'm just not clear on the best solution. Is it possible? Or do I need to just live with a lexer that matches on word boundaries and a parser that puts them together into phrases? I prefer doing it in the lexer because the phrase (like "Encounter Performed") is really intended to be a single unit.
I'm new to ANTLR (and lexers/parsers in general), so please forgive me if the solution is easy! So far, however, I haven't been able to find an answer. Thanks for your help!
While there is a way to do what you wish in the lexer**, on such a simple grammar it is unlikely to be worth the effort. Also, by packing it all into a single token, you set yourself up to eventually being forced to dig around manually in the token string just to pick out the value of the LABEL.
You can still define semantically appropriate rules (rules that reflect what you consider to be 'tokens') as simple, 'lower level' parser rules:
stat: occurrence phrase ;
occurrence: OCCURRENCE label=WORD OF ;
phrase: WORD+ ;
OCCURRENCE: 'Occurrence' ;
OF: 'of' ;
WORD: [a-zA-Z\-]+ ;
WS: [ \t\n\r]+ -> skip ;
** If you really want to, you can implement a lexer mode and, using the 'more' operator, consume the OCCURRENCE... string into a single token. This is untested -- I think "more" will work as shown, but if not you will need to pack the token text yourself. In any event, it illustrates the potential complexity of what you stated you wished to do.
OCCURRENCE: 'Occurrence' -> pushMode(stuff), more ;
mode stuff ;
OF: 'of' -> popMode, more ;
OTHER: . -> more ;
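To see why the question's original rules fail while the word-level rules from the answer work, both rule sets can be run through a small longest-match simulation in Python. This is a sketch, not ANTLR: tokenize and the two rule tables are hypothetical names, fragments are inlined into the regular expressions, and the simulation assumes longest match with first-declared tie-break.

```python
import re

# The question's lexer rules, with fragments inlined.
PHRASE_RULES = [
    ("OCCURRENCE", re.compile(r"Occurrence[A-Z]of")),
    ("PHRASE", re.compile(r"(?:[a-zA-Z\-]+ )*[a-zA-Z\-]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]

# The word-level lexer rules suggested in the answer.
WORD_RULES = [
    ("OCCURRENCE", re.compile(r"Occurrence")),
    ("OF", re.compile(r"of")),
    ("WORD", re.compile(r"[a-zA-Z\-]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]


def tokenize(text, rules):
    """Longest match wins; an earlier rule keeps a tie; WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


SENTENCE = "Occurrence A of Encounter Performed"
print(tokenize(SENTENCE, PHRASE_RULES))
print(tokenize(SENTENCE, WORD_RULES))
```

With the original rules the whole sentence becomes a single PHRASE token; note also that OCCURRENCE as written contains no spaces, so in this simulation it could only ever match text like OccurrenceAof. With the word-level rules, OCCURRENCE and OF win their ties against WORD because they are declared first, and the parser rules can then group the remaining WORD tokens into a phrase.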
