Inconsistent token handling in ANTLR4 - antlr4

The ANTLR4 book references a multi-mode example
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsLexer.g4
lexer grammar ModeTagsLexer;
// Default mode rules (the SEA)
OPEN : '<' -> mode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> mode(DEFAULT_MODE) ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsParser.g4
parser grammar ModeTagsParser;
options { tokenVocab=ModeTagsLexer; } // use tokens from ModeTagsLexer.g4
file: (tag | TEXT)* ;
tag : '<' ID '>'
| '<' '/' ID '>'
;
I'm trying to build on this example, but using the « and » characters as delimiters. If I simply substitute them, I get error 126:
cannot create implicit token for string literal in non-combined grammar: '«'
In fact, this seems to occur as soon as I have the « character in the parser tag rule.
tag : '«' ID '>';
with
OPEN : '«' -> pushMode(ISLAND);
TEXT : ~'«'+;
Is there some antlr foo I'm missing? This is using antlr4-maven-plugin 4.2.
The wiki mentions something along these lines, but as I read it, that contradicts the example on GitHub and my own experience when using <. See "Redundant String Literals" at https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules

One of the following is happening:
You forgot to update the OPEN rule in ModeTagsLexer.g4 to use the following form:
OPEN : '«' -> mode(ISLAND) ;
You found a bug in ANTLR 4, which should be reported to the issue tracker.

Have you specified the file encoding that ANTLR should use when reading the grammar? It should be okay with European characters less than 255 but...
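To make the encoding question concrete, here is a minimal Java sketch of what happens to « when a file saved as UTF-8 is read as ISO-8859-1. Whether this is what actually happened in the question is an assumption; the thread does not confirm it. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Encode the text as UTF-8, then decode those bytes as ISO-8859-1,
    // which is what a tool does when it assumes the wrong file encoding.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String open = "\u00AB";                   // the « character
        String seen = misread(open);
        System.out.println(seen.length());        // 2: one character became two
        System.out.println((int) seen.charAt(0)); // 194 (U+00C2)
        System.out.println((int) seen.charAt(1)); // 171 (U+00AB)
        System.out.println(misread("<"));         // ASCII such as < is unaffected
    }
}
```

This would explain why < behaves and « does not: ASCII characters have the same bytes in both encodings. If encoding does turn out to be the cause, the ANTLR tool accepts an -encoding option, and writing the literal as the escape '\u00AB' keeps the grammar file pure ASCII.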

Related

How to parse "Leg 1: Jun 25" with ANTLR

I am starting with antlr4 and after following some tutorials, I started to make my own grammar. For now, I wanted to parse a simple input Leg 1: Jun 25.
fragment DIGIT
: [0-9];
fragment MONTH
: [A-Z][a-z][a-z];
DATE
: MONTH ' ' DIGIT+;
LEG_NUMBER
: DIGIT+;
leg
: 'Leg ' LEG_NUMBER ': ' DATE;
But it doesn't work; I get the following error:
line 1:0 mismatched input 'Leg 1' expecting 'Leg '
I don't understand even the output message... Here is the parse tree in IntelliJ ANTLR plugin
The parse tree is showing you that the Lexer has recognized your input as three tokens: a DATE ("Leg 1"), your : (implicitly defined) token, and then another DATE ("Jun 25").
The first thing to understand is that the Lexer will first tokenize your input stream of characters into a stream of tokens. At this point in the processing, parser rules have absolutely no impact. Parser rules match against the stream of tokens (not your input stream of characters).
Since your DATE rule says "Upper case letter, lowercase letter, lowercase letter, space, one or more numbers", then "Leg 1" is a match, and is recognized as a DATE token. The Lexer doesn't know (or care) that your parser rule wants to start by matching "Leg ".
It's always a good idea to run your input through a tool that shows you the token stream, so you can validate your Lexer rules. You can use the grun alias with the -tokens option, or view the token stream in the IntelliJ ANTLR plugin. (With some experience you'll also recognize that the parse tree diagram is telling you the same thing.)
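The tokenization described above can be approximated in plain Java. This is only a sketch of the "longest match wins" rule using java.util.regex, not ANTLR's actual lexer; the class name and the way implicit literal tokens are listed are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // Token rules for the grammar in the question. The literals used in the
    // parser rule ('Leg ' and ': ') are modeled as implicit tokens.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'Leg '", Pattern.compile("Leg "));
        RULES.put("': '", Pattern.compile(": "));
        RULES.put("DATE", Pattern.compile("[A-Z][a-z][a-z] [0-9]+"));
        RULES.put("LEG_NUMBER", Pattern.compile("[0-9]+"));
    }

    // At each position, try every rule and keep the longest match.
    // The parser rules play no part in this decision.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [DATE(Leg 1), ': '(: ), DATE(Jun 25)]
        System.out.println(tokenize("Leg 1: Jun 25"));
    }
}
```

At position 0 the literal 'Leg ' matches four characters but DATE matches five ("Leg 1"), so DATE wins, which is exactly the token the parser then complains about.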
One way to fix this would be to tighten up the MONTH fragment:
fragment MONTH
: (
'Jan'
| 'Feb'
| 'Mar'
| 'Apr'
| 'May'
| 'Jun'
| 'Jul'
| 'Aug'
| 'Sep'
| 'Oct'
| 'Nov'
| 'Dec'
)
;
That will prevent "Leg 1" from matching. I'm not recommending that as a good path forward with a "real" grammar, but it does resolve this immediate issue as you start to work with ANTLR.

Odd ANTLR4 error handling behavior with a very simple (trivial) grammar

Given the below super simple grammar:
ddlStatement
: defineStatement
;
defineStatement
: 'define' tableNameToken=Identifier ';'?
;
and the input "add 1 to bob"
I would expect to get an error. However, the parser matches the "defineStatement" rule with a missing "define" token. The following Listener will fire
@Override
public void exitDefineStatement(DDLParser.DefineStatementContext ctx) {
log.info(MessageFormat.format("Defining {0}", ctx.tableNameToken.getText()));
}
and log "Defining add".
I can assign 'define' to a variable and test that variable for NULL but that seems like work I shouldn't have to do.
BTW if the grammar becomes more complete - specifically with the addition of alternatives to the ddlStatement rule - error handling works as I would expect.
This is ANTLR's error recovery in action.
In many cases, it's VERY beneficial for ANTLR to assume either a missing token, or ignore a token, if it allows parsing to continue. The missing "define" token should have been reported as an error.
Without this capability, ANTLR would frequently get "stumped" at the first sign of problems. With this, ANTLR is saying "Well, if I assume X, then I can make sense of your input. So I'm assuming X and reporting that as an error so I can continue on."
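As a rough illustration of that "assume X and carry on" behavior, here is a toy hand-written matcher for defineStatement. It is a sketch of the idea of single-token insertion only, not ANTLR's DefaultErrorStrategy; the class, method and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SingleTokenInsertionSketch {
    // Errors reported during the most recent call to defineStatement.
    static final List<String> errors = new ArrayList<>();

    static boolean isIdentifier(String token) {
        return token.matches("[a-zA-Z]+");
    }

    // Toy version of: defineStatement : 'define' Identifier ;
    // Returns the Identifier text, or null if no recovery is possible.
    static String defineStatement(List<String> tokens) {
        errors.clear();
        int pos = 0;
        if (pos < tokens.size() && tokens.get(pos).equals("define")) {
            pos++; // normal case: consume 'define'
        } else if (pos < tokens.size() && isIdentifier(tokens.get(pos))) {
            // The current token is what should FOLLOW 'define', so pretend
            // 'define' was there, report it, and do not consume anything.
            errors.add("missing 'define' at '" + tokens.get(pos) + "'");
        } else {
            return null;
        }
        if (pos < tokens.size() && isIdentifier(tokens.get(pos))) {
            return tokens.get(pos);
        }
        return null;
    }

    public static void main(String[] args) {
        String name = defineStatement(Arrays.asList("add", "1", "to", "bob"));
        System.out.println(name);   // add
        System.out.println(errors); // [missing 'define' at 'add']
    }
}
```

The rule still "succeeds" and yields add, which is why the listener logs "Defining add"; the problem is only visible in the reported errors. In a real ANTLR parser you can check parser.getNumberOfSyntaxErrors() after parsing, or install a BailErrorStrategy if you would rather fail on the first error than recover.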
(Filling a few details to get this to build)
grammar Test;
ddlStatement: defineStatement;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
If I run ANTLR on this and compile the Java output, the following command:
echo "add 1 to bob" | grun Test ddlStatement -gui
yields the error:
line 1:0 missing 'define' at 'add'
and produces the parse tree:
The highlighted node is the error node in the tree.
The reason it stops after "add" is that the input consumed so far (assuming a missing "define") already forms a complete ddlStatement.
ANTLR stops processing input once it has recognized your start rule.
To get it to "pay attention" to the entire input, add an EOF token to your start rule:
grammar Test;
ddlStatement: defineStatement EOF;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
gives these errors:
line 1:0 missing 'define' at 'add'
line 1:4 mismatched input '1' expecting {<EOF>, ';'}
and this tree:

The lexer chooses the wrong Token

Hi, I am new to ANTLR and have a problem that I have not been able to solve over the last few days:
I wanted to write a grammar that recognizes this text (in reality I want to parse something different, but for the case of this question I simplified it)
100abc
150100
200def
Here each row starts with 3 digits that identify the type of the line (header, content, trailer), followed by 3 characters that are the payload of the line.
I thought I could recognize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100"), it produces a single token with the value 100 instead of three tokens of type A. So the parser sees a '100' token where it expects an A token.
I am pretty sure this happens because the lexer wants to match the longest possible text for a token, so it clusters the '1', '0', '0' together. I found no way to solve this. Putting rule A above the parser rule that contains the string literal '100' did not work. Factoring the '100' out into its own rule, as follows, did not work either.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I did not think, that it solves my problem, or at least I don't see how that could help me.
One of your token definitions is incorrect: A : [a-z|A-Z|0-9];. Don't use a vertical bar inside a [] character set; there it is a literal character, not an alternation. A correct definition is A : [a-zA-Z0-9];. ANTLR versions >= 4.6 warn about the duplicated | character inside the set.
As I understand it, you have mixed up the concepts of tokens and rules. Token (lexer rule) names start with an upper-case letter, whereas parser rule names start with a lower-case letter. Your header, content and trailer should be tokens, not parser rules.
So the final version of the grammar, in my opinion, is:
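The effect of a | inside a character set can be checked quickly with Java's regex engine, which treats it the same way. This is a stand-in for illustration, not ANTLR itself, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class CharSetPipeDemo {
    // True if the single character c is a member of the given character set.
    static boolean inSet(String set, char c) {
        return Pattern.matches(set, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(inSet("[a-z|A-Z|0-9]", '|')); // true: | is just another member
        System.out.println(inSet("[a-zA-Z0-9]", '|'));   // false
        System.out.println(inSet("[a-zA-Z0-9]", 'q'));   // true
    }
}
```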
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)

ANTLR4 DefaultErrorStrategy fails to inject missing token

I'm trying to run in TestRig the following grammar:
grammar COBOLfragment;
// hidden tokens
WS : [ ]+ -> channel(HIDDEN);
NL : '\n' -> channel(HIDDEN);
// keywords
PERIOD : '.';
DIVISION : 'DIVISION';
SECTION : 'SECTION';
DATA : 'DATA';
WORKING_STORAGE : 'WORKING-STORAGE';
FILE : 'FILE';
FD : 'FD';
EXTERNAL : 'EXTERNAL';
GLOBAL : 'GLOBAL';
BLOCK : 'BLOCK';
CONTAINS : 'CONTAINS';
CHARACTERS : 'CHARACTERS';
// data
INTEGER : [0-9]+;
ID : [A-Z][A-Z0-9]*;
dataDivision :
DATA DIVISION PERIOD
fileSection?
workingStorageSection?
;
fileSection :
FILE SECTION PERIOD
fileDescription*
;
fileDescription :
FD fileName=ID
// (IS? GLOBAL)? // 1. IS GLOBAL clause
// (IS? EXTERNAL)? // 2. IS EXTERNAL clause
blockClause?
PERIOD
;
blockClause :
BLOCK CONTAINS? blockSize=INTEGER CHARACTERS
;
workingStorageSection :
WORKING_STORAGE SECTION PERIOD
;
with the following input:
DATA DIVISION.
FILE SECTION.
FD FD01
WORKING-STORAGE SECTION.
Clearly the third line of input ("FD FD01") is missing the terminator PERIOD asked for in fileDescription rule.
The DefaultErrorStrategy correctly acknowledges this and conjures up the missing token:
On stderr the correct report is displayed: line 4:0 missing '.' at 'WORKING-STORAGE'.
But if the fragments commented out are enabled (that is, the clauses 'IS EXTERNAL' and 'IS GLOBAL' are brought in the grammar again), then single token insertion fails:
On stderr the misleading report is displayed: line 4:0 no viable alternative at input 'WORKING-STORAGE'
How can I enable the full grammar (with the IS EXTERNAL and IS GLOBAL clauses) while retaining the ability to recover from the missing PERIOD?
Side note 1: if I enable either IS EXTERNAL or IS GLOBAL, but not both clauses, then the DefaultErrorStrategy works nicely and injects the missing token.
Side note 2: the code generated for a grammar with both clauses enabled has the following extra code (compared to a grammar with just one of them enabled):
public final FileDescriptionContext fileDescription() ... {
...
try {
...
switch ( getInterpreter().adaptivePredict(_input,4,_ctx) ) {
case 1:
{
setState(31);
_la = _input.LA(1);
if (_la==IS) {
{
setState(30); match(IS);
}
}
setState(33); match(GLOBAL);
}
break;
}
...
}
catch (RecognitionException re) {
...
And the adaptivePredict() call is the culprit, because it throws no viable alternative at input 'WORKING-STORAGE' before the parser has a chance to match(PERIOD) (in the generated code, not pasted here).
I've managed to solve it by adding a new rule that covers both IS clauses:
(here just the fragments changed)
...
fileDescription :
FD fileName=ID
isClauses?
blockClause?
PERIOD
;
isClauses :
IS? GLOBAL (IS? EXTERNAL)?
| IS? EXTERNAL
;
...
Now the DefaultErrorStrategy does its work and injects the missing PERIOD.
Why not isClauses : (IS? GLOBAL?) (IS? EXTERNAL)?;
Well, I tried that first, of course. But got a warning (warning(154): rule 'fileDescription' contains an optional block with at least one alternative that can match an empty string) and no missing PERIOD injected.
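The difference between the two formulations can be seen by modeling each token as a single letter (I = IS, G = GLOBAL, E = EXTERNAL) and each rule as a regular expression. This is only an illustrative analogy in Java regex, with hypothetical names; it says nothing about how ANTLR's prediction works internally.

```java
import java.util.regex.Pattern;

class EmptyAlternativeDemo {
    // isClauses : IS? GLOBAL (IS? EXTERNAL)? | IS? EXTERNAL ;
    static final String WORKING = "I?G(I?E)?|I?E";
    // isClauses : (IS? GLOBAL?) (IS? EXTERNAL)? ;
    static final String REJECTED = "(I?G?)(I?E)?";

    static boolean accepts(String rule, String tokens) {
        return Pattern.matches(rule, tokens);
    }

    public static void main(String[] args) {
        System.out.println(accepts(WORKING, ""));      // false: the rule always consumes something
        System.out.println(accepts(REJECTED, ""));     // true: the rule can match nothing at all
        System.out.println(accepts(WORKING, "IGIE"));  // true: IS GLOBAL IS EXTERNAL
        System.out.println(accepts(REJECTED, "I"));    // true: a lone IS slips through
    }
}
```

Because the rejected form can match the empty string, wrapping it in isClauses? makes "nothing" optional twice over, which is what warning 154 points at. It also accepts a bare IS, which is presumably not intended.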

ANTLR4 : mismatched input

I would like to match input of the form:
commit a1b2c3
Author: Michael <michael@test.com>
commit d3g4
Author: David <david@test.com>
Here is the grammar I have written:
grammar commit;
file : commitinfo+;
commitinfo : commitdesc authordesc;
commitdesc : 'commit' COMMITHASH NEWLINE;
authordesc : 'Author:' AUTHORNAME '<' EMAIL '>' NEWLINE;
COMMITHASH : [a-z0-9]+;
AUTHORNAME : [a-zA-Z]+;
EMAIL : [a-zA-Z0-9.@]+;
NEWLINE : '\r'?'\n';
WHITESPACE : [ \t]->skip;
This grammar parses the above input without errors, but when the input changes to:
commit c1d2
Author: michael <michael@test.com>
it reports an error like:
line 2:8 mismatched input 'michael' expecting AUTHORNAME.
When I print the tokens, it seems the string 'michael' gets matched by the token COMMITHASH instead of AUTHORNAME.
How can I fix this?
When two lexer rules match the same number of characters, ANTLR 4 picks the rule that appears first in the grammar. (A longer match always wins, regardless of order.)
'michael' is matched by both COMMITHASH : [a-z0-9]+ ; and AUTHORNAME, with the same length. COMMITHASH appears before AUTHORNAME, so it wins, and hence you get the error.
I can think of the following options to resolve the issue you are facing:
Use the 'mode' feature in ANTLR. In ANTLR 4, one lexer mode is active at a time, and only the non-fragment lexer rules of that mode compete; the one with the longest match determines which token is created. Your grammar only includes the default mode, so all the lexer rules are active at once, and 'michael' becomes a COMMITHASH because the match length is the same for COMMITHASH and AUTHORNAME but COMMITHASH appears first in the grammar.
Reorder the lexer rules. Assuming a COMMITHASH always contains at least one digit, put AUTHORNAME before COMMITHASH in the following way:
grammar commit;
...
AUTHORNAME : [a-zA-Z]+;
COMMITHASH : [a-z0-9]+;
...
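The effect of rule order can be sketched in plain Java. This models only the two rules involved, using java.util.regex rather than ANTLR, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderSketch {
    static final String COMMITHASH = "[a-z0-9]+";
    static final String AUTHORNAME = "[a-zA-Z]+";

    // names[i] goes with patterns[i]; array order stands for grammar order.
    // Returns the name of the rule that would produce the next token.
    static String winner(String[] names, String[] patterns, String input) {
        String best = null;
        int bestLen = 0;
        for (int i = 0; i < names.length; i++) {
            Matcher m = Pattern.compile(patterns[i]).matcher(input);
            // Strict '>' means an earlier rule keeps the win on a tie.
            if (m.lookingAt() && m.end() > bestLen) {
                best = names[i];
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] origNames = {"COMMITHASH", "AUTHORNAME"};
        String[] origPatterns = {COMMITHASH, AUTHORNAME};
        String[] swapNames = {"AUTHORNAME", "COMMITHASH"};
        String[] swapPatterns = {AUTHORNAME, COMMITHASH};

        System.out.println(winner(origNames, origPatterns, "michael"));  // COMMITHASH
        System.out.println(winner(swapNames, swapPatterns, "michael"));  // AUTHORNAME
        System.out.println(winner(swapNames, swapPatterns, "a1b2c3"));   // COMMITHASH (longer match)
        System.out.println(winner(swapNames, swapPatterns, "deadbeef")); // AUTHORNAME
    }
}
```

The last line shows the limit of reordering: a hash that happens to contain only letters would now be lexed as AUTHORNAME, which is why the assumption about digits matters.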
Note: I strongly feel that your lexer rules are not crisply written. Are you sure that your COMMITHASH rule should be [a-z0-9]+? This means a token like 'abhdks' will also be matched by your COMMITHASH rule. But that's a different issue altogether.
