ANTLR4 DefaultErrorStrategy fails to inject missing token

ANTLR4 DefaultErrorStrategy fails to inject missing token - antlr4

I'm trying to run in TestRig the following grammar:
grammar COBOLfragment;
// hidden tokens
WS : [ ]+ -> channel(HIDDEN);
NL : '\n' -> channel(HIDDEN);
// keywords
PERIOD : '.';
DIVISION : 'DIVISION';
SECTION : 'SECTION';
DATA : 'DATA';
WORKING_STORAGE : 'WORKING-STORAGE';
FILE : 'FILE';
FD : 'FD';
EXTERNAL : 'EXTERNAL';
GLOBAL : 'GLOBAL';
BLOCK : 'BLOCK';
CONTAINS : 'CONTAINS';
CHARACTERS : 'CHARACTERS';
// data
INTEGER : [0-9]+;
ID : [A-Z][A-Z0-9]*;
dataDivision :
DATA DIVISION PERIOD
fileSection?
workingStorageSection?
;
fileSection :
FILE SECTION PERIOD
fileDescription*
;
fileDescription :
FD fileName=ID
// (IS? GLOBAL)? // 1. IS GLOBAL clause
// (IS? EXTERNAL)? // 2. IS EXTERNAL clause
blockClause?
PERIOD
;
blockClause :
BLOCK CONTAINS? blockSize=INTEGER CHARACTERS
;
workingStorageSection :
WORKING_STORAGE SECTION PERIOD
;
with the following input:
DATA DIVISION.
FILE SECTION.
FD FD01
WORKING-STORAGE SECTION.
Clearly the third line of input ("FD FD01") is missing the terminator PERIOD asked for in fileDescription rule.
The DefaultErrorStrategy correctly acknowledges this and conjures up the missing token:
On stderr the correct report is displayed: line 4:0 missing '.' at 'WORKING-STORAGE'.
But if the fragments commented out are enabled (that is, the clauses 'IS EXTERNAL' and 'IS GLOBAL' are brought in the grammar again), then single token insertion fails:
On stderr the misleading report is displayed: line 4:0 no viable alternative at input 'WORKING-STORAGE'
How to enable the full grammar (with IS EXTERNAL and IS GLOBAL clauses) retaining the ability to correct the missing PERIOD?
Side note 1: if I enable either IS EXTERNAL or IS GLOBAL, but not both clauses, then the DefaultErrorStrategy works nicely and injects the missing token.
Side note 2: the code generated for a grammar with both clauses enabled has the following extra code (compared to a grammar with just one of them enabled):
public final FileDescriptionContext fileDescription() ... {
...
try {
...
switch ( getInterpreter().adaptivePredict(_input,4,_ctx) ) {
case 1:
{
setState(31);
_la = _input.LA(1);
if (_la==IS) {
{
setState(30); match(IS);
}
}
setState(33); match(GLOBAL);
}
break;
}
...
}
catch (RecognitionException re) {
...
And the adaptivePredict() call is the culprit, because it throws no viable alternative at input 'WORKING-STORAGE' before the parser has a chance to match(PERIOD) (in the generated code, not pasted here).

I've managed to solve it adding a new clause for both IS clauses:
(here just the fragments changed)
...
fileDescription :
FD fileName=ID
isClauses?
blockClause?
PERIOD
;
isClauses :
IS? GLOBAL (IS? EXTERNAL)?
| IS? EXTERNAL
;
...
Now the DefaultErrorStrategy does its work and injects the missing PERIOD.
Why not isClauses : (IS? GLOBAL?) (IS? EXTERNAL)?;
Well, I tried that first, of course. But got a warning (warning(154): rule 'fileDescription' contains an optional block with at least one alternative that can match an empty string) and no missing PERIOD injected.

Related

What is the problem with the following antlr4 grammar

I'm having trouble parsing a simple grammar. I believe the issue is that there are conflicting rules. Here is the text I'm trying to parse:
redis 6.2.6-debian-10-r49 Running
account-migrator 0.83.0 Pending
This represents services that have a name, version and status. Here is the grammar that isn't working:
main : statusLine+;
statusLine : serviceName versionNumber status;
serviceName : SERVICE_NAME;
versionNumber : VERSION_NUMBER;
status : STATUS;
SERVICE_NAME : [a-zA-Z-]+;
VERSION_NUMBER : [a-zA-Z0-9-]+ ('.' [a-zA-Z0-9-]+)*;
STATUS : [a-zA-Z]+;
WS : [ \n\t]+ -> skip;
I believe my grammar confuses the status for a service name because my visitor finds nothing for status on the first visit, but the second visit gets the status of the first line as the service name of the second.
So the question I have is, what can I do to parse these lines correctly?

The problem is that you have multiple rules that match the same input. Everything which STATUS could match, will actually be matched by SERVICE_NAME. Check out this parse tree:
The token STATUS is never produced, but Running became a SERVICE_NAME token.
So, instead of trying to add semantics to the lexer (by using different names, and hence meaning, for the same input) use a common lexer rule:
main : statusLine+;
statusLine : serviceName versionNumber status;
serviceName : IDENTIFIER;
versionNumber : VERSION_NUMBER;
status : IDENTIFIER;
IDENTIFIER: [a-zA-Z-]+;
VERSION_NUMBER: [a-zA-Z0-9-]+ ('.' [a-zA-Z0-9-]+)*;
WHITE_SPACE: [ \u000B\t\r\n] -> skip;
which then gives you the proper parse tree:

Odd ANTLR4 error handling behavior with very simply (trivial) behavior

Given the below super simple grammar:
ddlStatement
: defineStatement
;
defineStatement
: 'define' tableNameToken=Identifier ';'?
;
and the input "add 1 to bob"
I would expect to get an error. However, the parser matches the "defineStatement" rule with a missing "define" token. The following Listener will fire
#Override
public void exitDefineStatement(DDLParser.DefineStatementContext ctx) {
log.info(MessageFormat.format("Defining {0}", ctx.tableNameToken.getText()));
}
and log "Defining add".
I can assign 'define' to a variable and test that variable for NULL but that seems like work I shouldn't have to do.
BTW if the grammar becomes more complete - specifically with the addition of alternatives to the ddlStatement rule - error handling works as I would expect.

This ANTLR's error recovery in action.
In many cases, it's VERY beneficial for ANTLR to assume either a missing token, or ignore a token, if it allows parsing to continue. The missing "define" token should have been reported as an error.
Without this capability, ANTLR would frequently get "stumped" at the first sign of problems. With this, ANTLR is saying "Well, if I assume X, then I can make sense of your input. So I'm assuming X and reporting that as an error so I can continue on.
(Filling a few details to get this to build)
grammar Test
;
ddlStatement: defineStatement;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
if I run antlr on this and compile the Java output. The following command:
echo "add 1 to bob" | grun Test ddlStatement -gui
yields the error:
line 1:0 missing 'define' at 'add'
and produces the parse tree:
The highlighted node is the error node in the tree.
The reason it stops after "add" is that input (assuming a missing "define", would be a ddlStatement
ANTLR will stop processing input once it has recognized your stop rule.
To get it to "pay attention" to the entire input, add an EOF token to your start rule:
grammar Test
;
ddlStatement: defineStatement EOF;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
gives these errors:
line 1:0 missing 'define' at 'add'
line 1:4 mismatched input '1' expecting {<EOF>, ';'}
and this tree:

The lexer chooses the wrong Token

Hi I am new to antrl and have a problem that I am not able to solve during the last days:
I wanted to write a grammar that recognizes this text (in reality I want to parse something different, but for the case of this question I simplified it)
100abc
150100
200def
Here each rows starts with 3 digits, that identifiy the type of the line (header, content, trailer), than 3 characters follow, that are the payload of the line.
I thought I could recogize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100") it reads it into one token with 100 as the value and not as three Tokens of type A. So the parser sees a "100" token where it expects an A Token.
I am pretty sure that this happens because the Lexer wants to match the longest phrase for one Token, so it cluster together the '1','0','0'. I found no way to solve this. Putting the Rule A above the parser Rule that contains the string literal "100" did not work. And also factoring the '100' into a fragement as follows did not work.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I did not think, that it solves my problem, or at least I don't see how that could help me.

One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical line inside a range [] set. A correct definition is: A : [a-zA-Z0-9];. ANTLR with version >= 4.6 will notify about duplicated chars | inside range set.
As I understand you mixed tokens and rules concept. Tokens defined with UPPER first letter unlike rules that defined with lower case first letter. Your header, content and trailer are tokens, not rules.
So, the final version of correct grammar on my opinion is
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)

Inconsistent token handling in ANTLR4

The ANTLR4 book references a multi-mode example
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsLexer.g4
lexer grammar ModeTagsLexer;
// Default mode rules (the SEA)
OPEN : '<' -> mode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> mode(DEFAULT_MODE) ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsParser.g4
parser grammar ModeTagsParser;
options { tokenVocab=ModeTagsLexer; } // use tokens from ModeTagsLexer.g4
file: (tag | TEXT)* ;
tag : '<' ID '>'
| '<' '/' ID '>'
;
I'm trying to build on this example, but using the « and » characters for delimiters. If I simply substitute I'm getting error 126
cannot create implicit token for string literal in non-combined grammar: '«'
In fact, this seems to occur as soon as I have the « character in the parser tag rule.
tag : '«' ID '>';
with
OPEN : '«' -> pushMode(ISLAND);
TEXT : ~'«'+;
Is there some antlr foo I'm missing? This is using antlr4-maven-plugin 4.2.
The wiki mentions something along these lines, but the way I read it that's contradicting the example on github and anecdotal experience when using <. See "Redundant String Literals" at https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules

One of the following is happening:
You forgot to update the OPEN rule in ModeTagsLexer.g4 to use the following form:
OPEN : '«' -> mode(ISLAND) ;
You found a bug in ANTLR 4, which should be reported to the issue tracker.

Have you specified the file encoding that ANTLR should use when reading the grammar? It should be okay with European characters less than 255 but...

how to handle conditionally existing components in action code?

This is another problem I am facing while migrating from antlr3 to antlr4. This problem is with the java action code for handling conditional components of rules. One example is shown below.
The following grammar+code worked in antlr3. Here, if the unary operator is not present, then a value of '0' is returned, and the java code checks for this value and takes appropriate action.
exprUnary returns [Expr e]
: (unaryOp)? e1=exprAtom
{if($unaryOp.i==0) $e = $e1.e;
else $e = new ExprUnary($unaryOp.i, $e1.e);
}
;
unaryOp returns [int i]
: '-' {$i = 1;}
| '~' {$i = 2;}
;
In antlr4, this code results in a null pointer exception during a run, because 'unaryOp' is 'null' if it is not present. But if I change the code like below, then antlr generation itself reports an error:
if($unaryOp==null) ...
java org.antlr.v4.Tool try.g4
error(67): missing attribute access on rule reference 'unaryOp' in '$unaryOp'
How should the action be coded for antlr4?
Another example of this situation is in if-then-[else] - here $s2 is null in antlr4:
ifStmt returns [Stmt s]
: 'if' '(' e=cond ')' s1=stmt ('else' s2=stmt)?
{$s = new StmtIf($e.e, $s1.s, $s2.s);}
;
NOTE: question 16392152 provides a solution to this question with listeners, but I am not using listeners, my requirement is for this to be handled in the action code.

There are at least two potential ways to correct this:
The "ANTLR 4" way to do it is to create a listener or visitor instead of placing the Java code inside of actions embedded in the grammar itself. This is the only way I would even consider solving the problem in my own grammars.
If you still use an embedded action, the most efficient way to check if the item exists or not is to access the ctx property, e.g. $unaryOp.ctx. This property resolves to the UnaryOpContext you were assuming would be accessible by $unaryOp by itself.

ANTLR expects you access an attribute. Try its text attribute instead: $unaryOp.text==null

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

ANTLR4 DefaultErrorStrategy fails to inject missing token - antlr4

Related

What is the problem with the following antlr4 grammar

Odd ANTLR4 error handling behavior with very simply (trivial) behavior

The lexer chooses the wrong Token

Inconsistent token handling in ANTLR4

how to handle conditionally existing components in action code?

Categories

Resources