ANTLR4: mismatched input

I would like to match input of the following form:
commit a1b2c3
Author: Michael <michael#test.com>
commit d3g4
Author: David <david#test.com>
Here is the grammar I have written:
grammar commit;
file : commitinfo+;
commitinfo : commitdesc authordesc;
commitdesc : 'commit' COMMITHASH NEWLINE;
authordesc : 'Author:' AUTHORNAME '<' EMAIL '>' NEWLINE;
COMMITHASH : [a-z0-9]+;
AUTHORNAME : [a-zA-Z]+;
EMAIL : [a-zA-Z0-9.#]+;
NEWLINE : '\r'?'\n';
WHITESPACE : [ \t]->skip;
The grammar above parses that input perfectly. But when the input changes to:
commit c1d2
Author: michael <michael#test.com>
it reports an error like:
line 2:8 mismatched input 'michael' expecting AUTHORNAME.
When I print the tokens, it seems the string 'michael' is matched as a COMMITHASH token instead of AUTHORNAME.
How can I fix this?

The ANTLR4 lexer picks the rule that matches the longest stretch of input; when several rules match the same length, the one written first in the grammar wins.
'michael' is matched equally well by COMMITHASH : [a-z0-9]+ ; and by AUTHORNAME, and COMMITHASH appears before AUTHORNAME, hence the error.
I can think of the following options to resolve the issue you are facing:
You can use the 'mode' feature in ANTLR: in ANTLR 4, one lexer mode is active at a time, and only the non-fragment lexer rules of that mode compete to produce the next token. Your grammar only includes the default mode, so all the lexer rules are active at once, and 'michael' becomes a COMMITHASH: the match length is the same for COMMITHASH and AUTHORNAME, but COMMITHASH appears before AUTHORNAME in the grammar. Note that modes are only allowed in a separate lexer grammar, not in a combined one.
You can reorder your lexer rules. Assuming a COMMITHASH always contains at least one digit, put AUTHORNAME before COMMITHASH in the following way:
grammar commit;
...
AUTHORNAME : [a-zA-Z]+;
COMMITHASH : [a-z0-9]+;
...
Note: I strongly feel that your lexer rules are not precise enough. Are you sure that your COMMITHASH rule should be [a-z0-9]+? This means a token like 'abhdks' is also matched by COMMITHASH. But that's a different issue altogether.
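To make the tie-breaking rule concrete, here is a small Java sketch that imitates the lexer's choice with java.util.regex. It is not ANTLR code: the class TokenChoiceDemo and its methods are made up for illustration, and only the three character-set rules from the question are modelled (the 'commit' and 'Author:' literals and whitespace are left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of how a lexer picks a token: longest match wins, ties go to
// the rule declared first. An illustration only, not the ANTLR runtime.
class TokenChoiceDemo {

    // Builds an insertion-ordered rule table from (name, regex) pairs.
    static Map<String, Pattern> rules(String... nameRegexPairs) {
        Map<String, Pattern> table = new LinkedHashMap<>();
        for (int i = 0; i < nameRegexPairs.length; i += 2) {
            table.put(nameRegexPairs[i], Pattern.compile(nameRegexPairs[i + 1]));
        }
        return table;
    }

    // Lexer rules in the order used in the question.
    static Map<String, Pattern> originalOrder() {
        return rules("COMMITHASH", "[a-z0-9]+",
                     "AUTHORNAME", "[a-zA-Z]+",
                     "EMAIL", "[a-zA-Z0-9.#]+");
    }

    // Same rules with AUTHORNAME moved ahead of COMMITHASH.
    static Map<String, Pattern> swappedOrder() {
        return rules("AUTHORNAME", "[a-zA-Z]+",
                     "COMMITHASH", "[a-z0-9]+",
                     "EMAIL", "[a-zA-Z0-9.#]+");
    }

    // Name of the rule that wins at the start of input, or null if none matches.
    static String winner(Map<String, Pattern> table, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : table.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strict '>' means an equally long later rule never displaces an earlier one.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner(originalOrder(), "michael")); // COMMITHASH
        System.out.println(winner(originalOrder(), "Michael")); // AUTHORNAME
        System.out.println(winner(swappedOrder(), "michael"));  // AUTHORNAME
        System.out.println(winner(swappedOrder(), "a1b2c3"));   // COMMITHASH
    }
}
```

With the swapped order a hash made only of letters would also come out as AUTHORNAME, which is why the reordering relies on every hash containing a digit.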

Related

What is the problem with the following antlr4 grammar

I'm having trouble parsing a simple grammar. I believe the issue is that there are conflicting rules. Here is the text I'm trying to parse:
redis 6.2.6-debian-10-r49 Running
account-migrator 0.83.0 Pending
This represents services that have a name, version and status. Here is the grammar that isn't working:
main : statusLine+;
statusLine : serviceName versionNumber status;
serviceName : SERVICE_NAME;
versionNumber : VERSION_NUMBER;
status : STATUS;
SERVICE_NAME : [a-zA-Z-]+;
VERSION_NUMBER : [a-zA-Z0-9-]+ ('.' [a-zA-Z0-9-]+)*;
STATUS : [a-zA-Z]+;
WS : [ \n\t]+ -> skip;
I believe my grammar confuses the status for a service name because my visitor finds nothing for status on the first visit, but the second visit gets the status of the first line as the service name of the second.
So the question I have is, what can I do to parse these lines correctly?
The problem is that you have multiple rules that match the same input. Everything which STATUS could match is actually matched by SERVICE_NAME, because SERVICE_NAME is defined first and wins the tie.
In the resulting parse tree, the token STATUS is never produced; Running became a SERVICE_NAME token.
So, instead of trying to add semantics to the lexer (by using different names, and hence meanings, for the same input), use a common lexer rule:
main : statusLine+;
statusLine : serviceName versionNumber status;
serviceName : IDENTIFIER;
versionNumber : VERSION_NUMBER;
status : IDENTIFIER;
IDENTIFIER: [a-zA-Z-]+;
VERSION_NUMBER: [a-zA-Z0-9-]+ ('.' [a-zA-Z0-9-]+)*;
WHITE_SPACE: [ \u000B\t\r\n] -> skip;
which then gives you the proper parse tree.

Odd ANTLR4 error handling behavior with very simple (trivial) grammar

Given the below super simple grammar:
ddlStatement
: defineStatement
;
defineStatement
: 'define' tableNameToken=Identifier ';'?
;
and the input "add 1 to bob"
I would expect to get an error. However, the parser matches the "defineStatement" rule with a missing "define" token. The following Listener will fire
@Override
public void exitDefineStatement(DDLParser.DefineStatementContext ctx) {
    log.info(MessageFormat.format("Defining {0}", ctx.tableNameToken.getText()));
}
and log "Defining add".
I can assign 'define' to a variable and test that variable for NULL but that seems like work I shouldn't have to do.
BTW if the grammar becomes more complete - specifically with the addition of alternatives to the ddlStatement rule - error handling works as I would expect.
This is ANTLR's error recovery in action.
In many cases, it's VERY beneficial for ANTLR to assume either a missing token, or ignore a token, if it allows parsing to continue. The missing "define" token should have been reported as an error.
Without this capability, ANTLR would frequently get "stumped" at the first sign of problems. With this, ANTLR is saying "Well, if I assume X, then I can make sense of your input. So I'm assuming X and reporting that as an error so I can continue on."
(Filling a few details to get this to build)
grammar Test;
ddlStatement: defineStatement;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
If I run ANTLR on this and compile the Java output, the following command:
echo "add 1 to bob" | grun Test ddlStatement -gui
yields the error:
line 1:0 missing 'define' at 'add'
and produces the parse tree:
The highlighted node is the error node in the tree.
The reason it stops after "add" is that the input (assuming a missing "define") would then be a complete ddlStatement.
ANTLR stops processing input once it has recognized your start rule.
To get it to "pay attention" to the entire input, add an EOF token to your start rule:
grammar Test;
ddlStatement: defineStatement EOF;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
gives these errors:
line 1:0 missing 'define' at 'add'
line 1:4 mismatched input '1' expecting {<EOF>, ';'}
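The effect of the trailing EOF can be imitated with a hand-written stand-in for the defineStatement rule. This is a rough sketch only: StartRuleEofDemo, parse and the requireEof flag are hypothetical names, the input is assumed to be already split into tokens, and the recovery shown is a crude imitation of ANTLR's single-token insertion, with simplified message texts.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for the defineStatement rule. It only mimics the
// observable behaviour discussed above; ANTLR's real recovery is more general.
class StartRuleEofDemo {

    // Quotes the token at pos, or reports end of input.
    static String at(List<String> tokens, int pos) {
        return pos < tokens.size() ? "'" + tokens.get(pos) + "'" : "<EOF>";
    }

    // Parses: 'define' Identifier ';'?  and, if requireEof is set, then EOF.
    // Returns the list of reported errors.
    static List<String> parse(List<String> tokens, boolean requireEof) {
        List<String> errors = new ArrayList<>();
        int pos = 0;

        if (pos < tokens.size() && tokens.get(pos).equals("define")) {
            pos++;
        } else {
            // Report the problem, pretend the keyword was there, and carry on.
            errors.add("missing 'define' at " + at(tokens, pos));
        }

        if (pos < tokens.size() && tokens.get(pos).matches("[a-zA-Z]+")) {
            pos++;
        } else {
            errors.add("missing Identifier at " + at(tokens, pos));
        }

        // Optional semicolon.
        if (pos < tokens.size() && tokens.get(pos).equals(";")) {
            pos++;
        }

        // Without this check, whatever follows the rule is silently ignored.
        if (requireEof && pos < tokens.size()) {
            errors.add("mismatched input " + at(tokens, pos) + " expecting <EOF>");
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> input = List.of("add", "1", "to", "bob");
        System.out.println(parse(input, false)); // one error, rest of input ignored
        System.out.println(parse(input, true));  // two errors, leftover '1' is reported
    }
}
```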

The lexer chooses the wrong Token

Hi, I am new to ANTLR and have a problem that I have not been able to solve over the last few days:
I want to write a grammar that recognizes this text (in reality I want to parse something different, but I simplified it for this question):
100abc
150100
200def
Here each row starts with 3 digits that identify the type of the line (header, content, trailer); then 3 characters follow, which are the payload of the line.
I thought I could recognize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100"), it reads it as a single token with the value 100 and not as three tokens of type A. So the parser sees a "100" token where it expects an A token.
I am pretty sure this happens because the lexer wants to match the longest phrase for one token, so it clusters the '1', '0', '0' together. I found no way to solve this. Putting the rule A above the parser rule that contains the string literal "100" did not work. Factoring the '100' out into a rule of its own, as follows, did not work either.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I don't think they solve my problem, or at least I don't see how they could help me.
One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical bar inside a [] character set, where it is a literal | character rather than an alternation. A correct definition is: A : [a-zA-Z0-9];. ANTLR 4.6 and later will warn about the duplicated | characters inside the set.
As I understand it, you have mixed up the concepts of tokens and parser rules. Token names start with an upper-case letter, whereas parser rule names start with a lower-case letter. Your header, content and trailer should be tokens, not parser rules.
So the final, corrected grammar, in my opinion, is:
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)
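Why the question's grammar fails and this one works can be sketched with a toy longest-match tokenizer in plain Java. This is an assumption-laden illustration, not ANTLR: MaximalMunchDemo and the token names T100, T150 and T200 (standing in for the implicit literals) are invented here, and NL is reduced to '\n' for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy longest-match tokenizer; an illustration only, not the ANTLR runtime.
class MaximalMunchDemo {

    static Map<String, Pattern> rules(String... nameRegexPairs) {
        Map<String, Pattern> table = new LinkedHashMap<>();
        for (int i = 0; i < nameRegexPairs.length; i += 2) {
            table.put(nameRegexPairs[i], Pattern.compile(nameRegexPairs[i + 1]));
        }
        return table;
    }

    // Tokens visible to the lexer in the question's grammar.
    static Map<String, Pattern> questionRules() {
        return rules("A", "[a-zA-Z0-9]", "NL", "\\n",
                     "T100", "100", "T150", "150", "T200", "200");
    }

    // Tokens in the answer's grammar, where whole lines are lexer rules.
    static Map<String, Pattern> answerRules() {
        return rules("A", "[a-zA-Z0-9]", "NL", "\\n",
                     "Header", "100[a-zA-Z0-9]{3}\\n",
                     "Content", "150[a-zA-Z0-9]{3}\\n",
                     "Trailer", "200[a-zA-Z0-9]{3}\\n");
    }

    // Returns the token type names produced for input.
    static List<String> tokenize(Map<String, Pattern> table, String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String best = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : table.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // The longest match wins, even over a rule listed earlier.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            types.add(best);
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        // The payload "100" is swallowed as one literal token, not three A tokens.
        System.out.println(tokenize(questionRules(), "150100\n")); // [T150, T100, NL]
        // With line-level lexer rules, the whole line is the longest match.
        System.out.println(tokenize(answerRules(), "150100\n"));   // [Content]
    }
}
```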

ANTLR4 DefaultErrorStrategy fails to inject missing token

I'm trying to run in TestRig the following grammar:
grammar COBOLfragment;
// hidden tokens
WS : [ ]+ -> channel(HIDDEN);
NL : '\n' -> channel(HIDDEN);
// keywords
PERIOD : '.';
DIVISION : 'DIVISION';
SECTION : 'SECTION';
DATA : 'DATA';
WORKING_STORAGE : 'WORKING-STORAGE';
FILE : 'FILE';
FD : 'FD';
EXTERNAL : 'EXTERNAL';
GLOBAL : 'GLOBAL';
BLOCK : 'BLOCK';
CONTAINS : 'CONTAINS';
CHARACTERS : 'CHARACTERS';
// data
INTEGER : [0-9]+;
ID : [A-Z][A-Z0-9]*;
dataDivision :
DATA DIVISION PERIOD
fileSection?
workingStorageSection?
;
fileSection :
FILE SECTION PERIOD
fileDescription*
;
fileDescription :
FD fileName=ID
// (IS? GLOBAL)? // 1. IS GLOBAL clause
// (IS? EXTERNAL)? // 2. IS EXTERNAL clause
blockClause?
PERIOD
;
blockClause :
BLOCK CONTAINS? blockSize=INTEGER CHARACTERS
;
workingStorageSection :
WORKING_STORAGE SECTION PERIOD
;
with the following input:
DATA DIVISION.
FILE SECTION.
FD FD01
WORKING-STORAGE SECTION.
Clearly the third line of input ("FD FD01") is missing the terminator PERIOD asked for in fileDescription rule.
The DefaultErrorStrategy correctly acknowledges this and conjures up the missing token:
On stderr the correct report is displayed: line 4:0 missing '.' at 'WORKING-STORAGE'.
But if the fragments commented out are enabled (that is, the clauses 'IS EXTERNAL' and 'IS GLOBAL' are brought in the grammar again), then single token insertion fails:
On stderr the misleading report is displayed: line 4:0 no viable alternative at input 'WORKING-STORAGE'
How can I enable the full grammar (with the IS EXTERNAL and IS GLOBAL clauses) while retaining the ability to recover from the missing PERIOD?
Side note 1: if I enable either IS EXTERNAL or IS GLOBAL, but not both clauses, then the DefaultErrorStrategy works nicely and injects the missing token.
Side note 2: the code generated for a grammar with both clauses enabled has the following extra code (compared to a grammar with just one of them enabled):
public final FileDescriptionContext fileDescription() ... {
...
try {
...
switch ( getInterpreter().adaptivePredict(_input,4,_ctx) ) {
case 1:
{
setState(31);
_la = _input.LA(1);
if (_la==IS) {
{
setState(30); match(IS);
}
}
setState(33); match(GLOBAL);
}
break;
}
...
}
catch (RecognitionException re) {
...
And the adaptivePredict() call is the culprit, because it throws no viable alternative at input 'WORKING-STORAGE' before the parser has a chance to match(PERIOD) (in the generated code, not pasted here).
I've managed to solve it by adding a new rule that covers both IS clauses:
(here just the fragments changed)
...
fileDescription :
FD fileName=ID
isClauses?
blockClause?
PERIOD
;
isClauses :
IS? GLOBAL (IS? EXTERNAL)?
| IS? EXTERNAL
;
...
Now the DefaultErrorStrategy does its work and injects the missing PERIOD.
Why not isClauses : (IS? GLOBAL?) (IS? EXTERNAL)?;
Well, I tried that first, of course. But got a warning (warning(154): rule 'fileDescription' contains an optional block with at least one alternative that can match an empty string) and no missing PERIOD injected.
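The difference between the two isClauses variants can be checked outside ANTLR by treating token sequences as strings. The sketch below is only an approximation under that assumption: IsClausesDemo is a made-up class, each token is encoded as its name followed by one space, and ordinary regular expressions stand in for the two rule bodies.

```java
import java.util.List;
import java.util.regex.Pattern;

// Models the two isClauses variants as regular expressions over token names.
// An illustration only; ANTLR itself is not involved.
class IsClausesDemo {

    // IS? GLOBAL (IS? EXTERNAL)? | IS? EXTERNAL
    static final Pattern WORKING =
            Pattern.compile("(IS )?GLOBAL ((IS )?EXTERNAL )?|(IS )?EXTERNAL ");

    // (IS? GLOBAL?) (IS? EXTERNAL)?
    static final Pattern REJECTED =
            Pattern.compile("(IS )?(GLOBAL )?((IS )?EXTERNAL )?");

    // Encodes a token list as "TOKEN TOKEN ... " (one trailing space per token).
    static String encode(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append(' ');
        }
        return sb.toString();
    }

    // True if the rule matches the whole token sequence.
    static boolean accepts(Pattern rule, List<String> tokens) {
        return rule.matcher(encode(tokens)).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(WORKING, List.of()));      // false
        System.out.println(accepts(REJECTED, List.of()));     // true
        System.out.println(accepts(REJECTED, List.of("IS"))); // true
        System.out.println(accepts(WORKING, List.of("IS", "GLOBAL", "IS", "EXTERNAL"))); // true
    }
}
```

The rejected variant accepts the empty sequence, which is what warning 154 complains about once the rule is used as isClauses?, and it also accepts a bare IS.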

Inconsistent token handling in ANTLR4

The ANTLR4 book references a multi-mode example
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsLexer.g4
lexer grammar ModeTagsLexer;
// Default mode rules (the SEA)
OPEN : '<' -> mode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> mode(DEFAULT_MODE) ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsParser.g4
parser grammar ModeTagsParser;
options { tokenVocab=ModeTagsLexer; } // use tokens from ModeTagsLexer.g4
file: (tag | TEXT)* ;
tag : '<' ID '>'
| '<' '/' ID '>'
;
I'm trying to build on this example, but using the « and » characters as delimiters. If I simply substitute them, I get error 126:
cannot create implicit token for string literal in non-combined grammar: '«'
In fact, this seems to occur as soon as I have the « character in the parser tag rule.
tag : '«' ID '>';
with
OPEN : '«' -> pushMode(ISLAND);
TEXT : ~'«'+;
Is there some antlr foo I'm missing? This is using antlr4-maven-plugin 4.2.
The wiki mentions something along these lines, but the way I read it that's contradicting the example on github and anecdotal experience when using <. See "Redundant String Literals" at https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
One of the following is happening:
You forgot to update the OPEN rule in ModeTagsLexer.g4 to use the following form:
OPEN : '«' -> mode(ISLAND) ;
You found a bug in ANTLR 4, which should be reported to the issue tracker.
Have you specified the file encoding that ANTLR should use when reading the grammar? It should be okay with European characters less than 255 but...
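On the encoding point, a short plain-Java sketch shows what can go wrong, assuming the grammar file is saved as UTF-8 but read as ISO-8859-1 (whether that is what happens in this particular build is not known). The class GrammarEncodingDemo is made up for illustration; the ANTLR tool itself accepts an -encoding option for stating the grammar file encoding explicitly.

```java
import java.nio.charset.StandardCharsets;

// Shows how '«' changes when UTF-8 bytes are decoded with the wrong charset.
class GrammarEncodingDemo {

    // Encodes text as UTF-8, then decodes those bytes as ISO-8859-1.
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String open = "\u00AB"; // the « character
        String misread = misreadUtf8AsLatin1(open);

        // One character went in, two come out: U+00C2 followed by U+00AB.
        System.out.println(misread.length());
        misread.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

        // ASCII delimiters such as '<' survive the same round trip unchanged.
        System.out.println(misreadUtf8AsLatin1("<").equals("<"));
    }
}
```

If the lexer and parser grammars were decoded differently, the literal in the parser's tag rule would no longer equal the one in OPEN, which would be consistent with '<' working while '«' does not.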

Resources