I have the following grammar and am trying to start out slowly, working up to more complex arguments.
grammar Command;
commands : command+ EOF;
command : NAME args NL;
args : arg | ;
arg : DASH LOWER | LOWER;
//arg : DASH 'a' | 'x';
NAME : [_a-zA-Z0-9]+;
NL : '\n';
WS : [ \t\r]+ -> skip ; // spaces, tabs, newlines
DASH : '-';
LOWER: [a-z];//'a' .. 'z';
I was hoping (for now) to parse files like this:
cmd1
cmd3 -a
If I run that input through grun I get an error:
$ java org.antlr.v4.gui.TestRig Command commands -tree
...
`line 3:6 mismatched input 'a' expecting LOWER`
It seems like LOWER should match 'a'. If I change the arg definition to the commented-out line, it works fine and I get '-a' as an arg. What's the difference between using LOWER and using 'a' explicitly?
As soon as you have a "mismatched" error, add -tokens to grun to display the tokens; it helps you find the discrepancy between what you THINK the lexer will do and what it actually DOES. With your grammar:
$ alias grun='java org.antlr.v4.gui.TestRig'
$ grun Command commands -tokens -diagnostics t.text
[#0,0:3='cmd1',<NAME>,1:0]
[#1,4:4='\n',<'
'>,1:4]
[#2,5:8='cmd3',<NAME>,2:0]
[#3,10:10='-',<'-'>,2:5]
[#4,11:11='a',<NAME>,2:6]
[#5,12:12='\n',<'
'>,2:7]
[#6,13:12='<EOF>',<EOF>,3:0]
line 2:6 mismatched input 'a' expecting LOWER
You immediately see that the letter a is a NAME and not the expected LOWER.
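The reason, which this answer leaves implicit: both lexer rules can match the single character a, and when two lexer rules match input of the same length, the one defined first in the grammar wins. In the question's grammar NAME comes before LOWER:
// Original ordering: for the input "a", NAME and LOWER both match exactly one
// character, so the earlier rule (NAME) produces the token and 'a' becomes a NAME.
NAME : [_a-zA-Z0-9]+;
LOWER: [a-z];
The solution below therefore declares LOWER before NAME, so a lone lowercase letter is lexed as LOWER.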
Also watch out for rules with an empty alternative:
args
: arg
|
;
may lead to problems in some circumstances. I prefer to explicitly add the ? suffix, which means zero or one occurrence. So my solution would be:
grammar Command;
commands
@init {System.out.println("Question last update 1829");}
: command+ EOF
;
command
: NAME args? NL
;
args
: arg
;
arg : DASH? LOWER ;
LOWER : [a-z] ;
NAME : [_a-zA-Z0-9]+;
DASH : '-' ;
NL : '\n' ;
WS : [ \t\r]+ -> skip ;
Execution :
$ grun Command commands -tokens -diagnostics t.text
[#0,0:3='cmd1',<NAME>,1:0]
[#1,4:4='\n',<'
'>,1:4]
[#2,5:8='cmd3',<NAME>,2:0]
[#3,10:10='-',<'-'>,2:5]
[#4,11:11='a',<LOWER>,2:6]
[#5,12:12='\n',<'
'>,2:7]
[#6,13:12='<EOF>',<EOF>,3:0]
Question last update 1829
I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa", followed by a value like "bbbb", an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".
There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens, no matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
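To make the token-level negation concrete, here is a minimal sketch (the rule name notPercentTokens is hypothetical, not part of the question's grammar):
// In a parser rule the negation operates on TOKEN TYPES: with the lexer above,
// (~'%')+ matches one or more whole tokens (Identifier, DecimalConstant, '=', ';', ...)
// of any type other than the '%' token; it never matches individual characters.
notPercentTokens : (~'%')+ ;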
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large second token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as many characters as possible (no matter what the parser rule is trying to match).
I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
I am attempting to build a simple command processor for a legacy language.
I am working in C# with ANTLR 4 (version 4.6.6).
I am unable to make progress on one scenario out of several.
The following examples show various sample invocations of the command PKS.
PKS
PKS?
PKStext_that_is_a_filename
The scenario that I cannot solve is the PKS command followed by a filename.
Command:
PKS
(block (line (expr (command PKS)) (eol \r\n)) <EOF>)
Command:
PKS?
(block (line (expr (command PKS) (query ?)) (eol \r\n)) <EOF>)
Command:
PKSFILENAME
line 1:0 mismatched input 'PKSFILENAME' expecting COMMAND
(block PKSFILENAME \r\n)
Here is what I believe to be the relevant snippet of the grammar:
block : line+ EOF;
line : (expr eol)+;
expr : command file
| command listOfDouble
| command query
| command
;
command : COMMAND
;
query : QUERY;
file : TEXT ;
eol : EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)* ;
From the lexer:
COMMAND : PKS;
PKS :'PKS' ;
QUERY : '?'
;
fragment LETTER : [A-Z];
fragment DIGIT : [0-9];
fragment UNDER : [_];
TEXT : (LETTER) (LETTER|DIGIT|UNDER)* ;
The main problem here is that your TEXT rule also matches what PKS is supposed to match. And since PKStext_that_is_a_filename can be matched entirely by that TEXT rule, it is preferred over the PKS rule, even though PKS appears first in the grammar (the rule listed first only wins when two rules match the same length of input; the longer match always takes precedence).
In order to fix that problem you have 2 options:
Require whitespace(s) between the keyword (PKS) and the rest of the expression.
Change the TEXT rule to explicitly exclude "PKS" as valid input.
Option 2 is certainly possible, but will get very messy if you have more keywords (as they would all have to be excluded). With whitespace between the keywords and the text, the lexer does that for you automatically (a sketch follows below).
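A sketch of option 1 (my own illustration, assuming the input is changed to PKS text_that_is_a_filename and that whitespace is skipped):
// With a separating space, 'PKS' is lexed on its own: COMMAND and TEXT both
// match those 3 characters, and COMMAND wins the tie because it is listed first.
// The filename then becomes a separate TEXT token.
COMMAND : 'PKS' ;
TEXT    : [a-zA-Z] [a-zA-Z0-9_]* ;
WS      : [ \t]+ -> skip ;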
And let me give you a hint for approaching this kind of problem: always check the token list produced by the lexer to see whether it generated the tokens you expected. I reworked your grammar a bit, added the missing tokens, and ran it through my ANTLR4 debugger, which gave me:
Parser error (5, 1): extraneous input 'PKStext_that_is_a_filename' expecting {<EOF>, COMMAND, EOL}
Tokens:
[#0,0:2='PKS',<1>,1:0]
[#1,3:3='\n',<8>,1:3]
[#2,4:4='\n',<8>,2:0]
[#3,5:7='PKS',<1>,3:0]
[#4,8:8='?',<3>,3:3]
[#5,9:9='\n',<8>,3:4]
[#6,10:10='\n',<8>,4:0]
[#7,11:36='PKStext_that_is_a_filename',<7>,5:0]
[#8,37:37='\n',<8>,5:26]
[#9,38:37='<EOF>',<-1>,6:0]
For this input:
PKS
PKS?
PKStext_that_is_a_filename
Here's the grammar I used:
grammar Example;
start: block;
block: line+ EOF;
line: expr? eol;
expr: command (file | listOfDouble | query)?;
command: COMMAND;
query: QUERY;
file: TEXT;
eol: EOL;
listOfDouble: DOUBLE (COMMA DOUBLE)*;
COMMAND: PKS;
PKS: 'PKS';
QUERY: '?';
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
fragment UNDER: [_];
COMMA: ',';
DOUBLE: DIGIT+ (DOT DIGIT*)?;
DOT: '.';
TEXT: LETTER (LETTER | DIGIT | UNDER)*;
EOL: [\n\r];
and the generated visual parse tree (shown as an image in the original answer).
I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?
Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving these lexer rules above ID like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.
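As a tiny illustration of both halves of that rule (my own reduction, not taken from the question):
// For the input "big", E_BIG and ID both match 3 characters; the tie goes to
// E_BIG because it is listed first. For the input "bigger", ID wins outright:
// it matches 6 characters while E_BIG can only match 3.
E_BIG : 'big' ;
ID    : [a-zA-Z][a-zA-Z0-9_]* ;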
In a little test parser I just wrote, I encountered a weird problem which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine: we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, ID and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test input gets wrongly dissected into two statement_lists, each continuing to a statement with a "missing ';'" warning, the first leading to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement of "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -tokens output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
I double-checked, and everything is defined through lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reproduce anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is lexed as token type 1, which - according to the lexer .tokens output - is Assignment_operator, while the expected EQUAL would be 13.
Might this be the problem?
OK, it seems the main takeaway here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parser has no chance of doing its job.
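For the Assignment_operator issue specifically, a minimal sketch of that advice (my assumption about one possible fix, not the asker's final grammar): turn the combined operator into a parser rule, so EQUAL, PLUS_EQ and MINUS_EQ stay distinct token types and the '=' in a declaration reaches the parser as the EQUAL token it expects.
// Each literal gets its own lexer rule; the grouping happens in a parser rule,
// so no lexer rule ever swallows '=' into a different token type.
assignment_operator : EQUAL | PLUS_EQ | MINUS_EQ ;
PLUS_EQ  : '+=' ;
MINUS_EQ : '-=' ;
EQUAL    : '=' ;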
I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse "empty line" I receive the error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and since no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you actually mean by code. Right now it's way too broad.
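One way to narrow it, as a minimal sketch (my own suggestion, not part of the original answer; it uses a separate lexer grammar because lexer modes are not allowed in a combined grammar): switch to a dedicated mode after the indent number, so CODE can only match the remainder of a code line. The parser grammar would then pick these tokens up via a tokenVocab option.
lexer grammar AlignerLexer;   // hypothetical grammar name
COLON    : ':' ;
MINUS    : '-' ;
PLUS     : '+' ;
KW_EMPTY : 'empty' ;
KW_LINE  : 'line' ;
NUMBER   : [0-9]+ -> pushMode(CODE_MODE) ;   // the indent ends with a number; the rest of the line is code
EOL      : '\n' | '\r\n' ;
SPACING  : [ \t\u000C]+ -> skip ;
mode CODE_MODE;
// Leading blanks after the indent end up inside CODE; trim them in a listener if needed.
CODE     : ~[\r\n]+ ;
CODE_EOL : ('\n' | '\r\n') -> type(EOL), popMode ;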