ANTLR4: line 1:14 extraneous input 'w' expecting {<EOF>, ';', '\n'}

I am trying to write a grammar that checks whether the following data (a CSV file) is valid:
w;w;w;s;s;s;s
w;s;w;w;w;w;w
w;s;w;w;w;w;w
w;s;w;s;s;s;w
w;s;w;w;w;w;w
w;s;w;w;w;w;w
w;w;s;s;w;w;w
/**
 * Define a grammar Battlefield
 */
grammar Battlefield;
file : row* EOF;
row : value (Separator value)* (LineFeed |EOF) ;
value : SimpleValue ;
Separator : ';' ;
// line feed
LineFeed : '\n';
// w or s is allowed
SimpleValue : ('s'|'w'|'\n')+ ;
WS : [ \t\r]+ -> skip ; // skip spaces, tabs
When running the grammar I get the following error code:
line 1:14 extraneous input 'w' expecting {<EOF>, ';', '
'}
What is wrong?

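Your SimpleValue rule also matches '\n', so the lexer folds a line break into the neighbouring SimpleValue token (the longest match wins) and the parser never sees a LineFeed token between rows. Just remove the '\n' alternative from the SimpleValue rule.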
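To see the effect without running ANTLR, here is a minimal Python sketch that imitates the lexer's longest-match behaviour on a shortened two-row input. The `lex` helper and the regular expressions are assumptions written for this illustration; they only approximate the token rules of the grammar above.

```python
import re

def lex(rules, text, skip=("WS",)):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Rule order follows the grammar: Separator, LineFeed, SimpleValue, WS.
broken = [("Separator", r";"), ("LineFeed", r"\n"),
          ("SimpleValue", r"[sw\n]+"), ("WS", r"[ \t\r]+")]
# Same rules, with the '\n' alternative removed from SimpleValue.
fixed = [("Separator", r";"), ("LineFeed", r"\n"),
         ("SimpleValue", r"[sw]+"), ("WS", r"[ \t\r]+")]

data = "w;s\nw;w\n"  # two short rows, hypothetical sample
broken_tokens = lex(broken, data)
fixed_tokens = lex(fixed, data)
print(broken_tokens)  # the line breaks end up inside SimpleValue tokens
print(fixed_tokens)   # each row now ends with a LineFeed token
```

With the original rule the text `s\nw` comes out as one SimpleValue token; with the corrected rule every row ends in its own LineFeed token.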

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".
There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens. No matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by one large second token containing the rest: a=5;\naa bbbb. This is how ANTLR's lexer works: it matches as many characters as possible, no matter what the parser rule is trying to match.
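Both tokenisations can be reproduced with a small Python sketch of a longest-match lexer. This is not ANTLR itself: the `lex` helper, the rule names for the literal tokens, and the regular expressions are assumptions made for illustration (the character classes are simplified to `$#_.`).

```python
import re

def lex(rules, text, skip=("WS",)):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Literal tokens used in the parser rules, then the explicit lexer rules.
base = [
    ("'%'", r"%"), ("'='", r"="), ("';'", r";"),
    ("WS", r"[ \t\n\r\f]+"),
    ("DecimalConstant", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z$#_.][a-zA-Z0-9$#_.]*"),
]
# The same lexer with a catch-all InputText rule added.
with_input_text = base + [("InputText", r"[^%]+")]

sample = "%a=5;\naa bbbb"
plain_tokens = lex(base, sample)
greedy_tokens = lex(with_input_text, sample)
print(plain_tokens)   # 'aa' and 'bbbb' are two separate Identifier tokens
print(greedy_tokens)  # '%' followed by one InputText token
```

In the first run the parser never gets a chance to see "aa bbbb" as one unit; in the second run the statement after '%' is swallowed as well, so neither lexer gives the result the question asks for.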
I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

Want to parse same structure

I would like to make ANTLR4 parse this:
FSM
name type String
state type State
Relation
name type String
And I am using this grammar:
grammar Generator;
classToGenerate:
name=Name NL
(attributes NL)+
classToGenerate| EOF;
attributes: attribute=Name WS 'type' WS type=Name;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
I would like the input to be read successfully, but each time I run my program I get this error and I don't know why:
line 6:18 no viable alternative at input '<EOF>'
Any fix?
The trailing EOF is messing things up for you. Try creating a separate rule that matches the EOF token, preceded by one or more classToGenerate (the parse rule in my example):
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name NL (attributes NL)+
;
attributes
: attribute=Name WS 'type' WS type=Name
;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
And do you really need to keep the spaces and line breaks? You could let the lexer discard them, which makes your grammar a whole lot easier to read:
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name attributes+
;
attributes
: attribute=Name 'type' type=Name
;
Name : [a-zA-Z]+;
Spaces : [ \t\r\n] -> skip;
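As a rough illustration of what the simplified grammar accepts, here is a hand-written Python sketch of the same structure. It is not generated by ANTLR; the `tokenize` and `parse` functions and the sample text are assumptions for illustration only.

```python
import re

def tokenize(text):
    # Name : [a-zA-Z]+ ; spaces and line breaks are skipped.
    # Unlike ANTLR, this sketch also silently skips any other character.
    return re.findall(r"[a-zA-Z]+", text)

def parse(tokens):
    """parse : classToGenerate+ EOF"""
    classes, i = [], 0
    while i < len(tokens):
        # classToGenerate : name=Name attributes+
        name, i = tokens[i], i + 1
        attributes = []
        # attributes : attribute=Name 'type' type=Name
        # Looking one token past the next Name tells an attribute
        # apart from the name of the following class.
        while i + 2 < len(tokens) and tokens[i + 1] == "type":
            attributes.append((tokens[i], tokens[i + 2]))
            i += 3
        if not attributes:
            raise SyntaxError(f"class {name!r} has no attributes")
        classes.append((name, attributes))
    if not classes:
        raise SyntaxError("expected at least one class")
    return classes

SAMPLE = "FSM\n  name type String\n  state type State\nRelation\n  name type String\n"
result = parse(tokenize(SAMPLE))
print(result)
```

Because end of input is checked in exactly one place (the outer loop), there is no competition between "another class" and "EOF" inside the class rule, which mirrors the reason for moving EOF into a separate rule.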

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?
Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[@0,0:5='endian',<'endian'>,1:0]
[@1,7:7='=',<'='>,1:7]
[@2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[@3,15:15=';',<';'>,1:15]
[@4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving these lexer rules above ID, like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[@0,0:5='endian',<'endian'>,1:0]
[@1,7:7='=',<'='>,1:7]
[@2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[@3,15:15=';',<';'>,1:15]
[@4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.
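The tie-breaking rule can be imitated in a few lines of Python. This is only a sketch of the behaviour, not ANTLR's implementation: the `lex` helper and the regular expressions are assumptions, and the literal tokens ('endian', '=', ';') are listed first because the token dump above shows 'endian' being lexed as its literal rather than as an ID.

```python
import re

def lex(rules, text, skip=("WS",)):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only: an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

literals = [("'endian'", r"endian"), ("'='", r"="), ("';'", r";")]
ident = ("ID", r"[a-zA-Z][a-zA-Z0-9_]*")
ws = ("WS", r"[ \t\n\r]+")
keywords = [("E_BIG", r"big"), ("E_LITTLE", r"little")]

original = literals + [ident, ws] + keywords   # ID before the keywords
reordered = literals + keywords + [ident, ws]  # keywords before ID

before = lex(original, "endian = little;")
after = lex(reordered, "endian = little;")
longer = lex(reordered, "littles")
print(before)  # 'little' comes out as ID
print(after)   # 'little' comes out as E_LITTLE
print(longer)  # 'littles' is still an ID: the longer match wins
```

The last call shows why moving the keywords up is safe: an identifier that merely starts with a keyword is longer than the keyword, so it is still lexed as an ID.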

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse the input 'empty line', I receive the error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?
ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR can match empty line as one single token of type CODE, and since no other rule can produce a token of length 10, the CODE rule consumes the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.
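To make the comparison concrete, the following Python sketch measures how many characters each rule could match at the start of the input. The regular expressions are hand-written approximations of the lexer rules in the question, and `match_lengths` and `winner` are hypothetical helper names, not part of ANTLR.

```python
import re

# Regex approximations of the Aligner lexer rules, in grammar order.
RULES = [
    ("COLON", r":"), ("MINUS", r"-"), ("PLUS", r"\+"),
    ("KW_EMPTY", r"empty"), ("KW_LINE", r"line"),
    ("NUMBER", r"[0-9]+"), ("EOL", r"\r\n|\n"),
    ("SPACING", r"[ \t\f]"), ("CODE", r"[^\n\r]+"),
]

def match_lengths(text, pos=0):
    """Return how many characters each rule can match starting at pos."""
    lengths = {}
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        lengths[name] = m.end() - pos if m else 0
    return lengths

def winner(text, pos=0):
    """Pick the rule with the longest match; the earliest rule wins a tie."""
    lengths = match_lengths(text, pos)
    # max() returns the first maximal key, and dicts keep rule order.
    return max(lengths, key=lengths.get)

lengths = match_lengths("empty line")
print(lengths["KW_EMPTY"], lengths["CODE"])
print(winner("empty line"))
print(winner("line"))
```

For 'empty line' the keyword rule can only cover 5 characters while CODE covers all 10, so CODE is chosen. For the bare input 'line' both KW_LINE and CODE match 4 characters, and the keyword wins only because it is listed first.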

Parsing quoted string with escape chars

I'm having a problem parsing a list of lines in the following format with ANTLR4:
* this is a string
* "first" this is "quoted"
* this is "quoted with \" "
I want to build a parse tree like
(list
(line * (value (string this is a string)))
(line * (value (parameter first) (string this is) (parameter quoted)))
(line * (value (string this is) (parameter quoted with " )))
)
I have an antlr4 grammar of this format
grammar List;
list : line+;
line : '*' (WS)+ value* NEWLINE;
value : string
| parameter
;
string : ((WORD) (WS)*)+;
parameter : '"'((WORD) (WS)*)+ '"';
WORD : (~'\n')+;
WS : '\t' | ' ';
NEWLINE : '\n';
But this is failing in the first character recognition of '*' itself, which baffles me.
line 1:0 mismatched input '* this is a string' expecting '*'
The problem is that your lexer is too greedy. The rule
WORD : (~'\n')+;
matches almost everything. This causes the lexer to produce the following tokens for your input:
token 1: WORD (* this is a string)
token 2: NEWLINE
token 3: WORD (* "first" this is "quoted")
token 4: NEWLINE
token 5: WORD (* this is "quoted with \" ")
Yes, that is correct: only WORD and NEWLINE tokens. ANTLR's lexer tries to construct tokens with as many characters as possible; it does not "listen" to what the parser is trying to match.
The error message:
line 1:0 mismatched input '* this is a string' expecting '*'
is telling you this: on line 1, index 0 the token with text '* this is a string' (type WORD) is encountered, but the parser is trying to match the token: '*'
Try something like this instead:
grammar List;
parse
: NEWLINE* list* NEWLINE* EOF
;
list
: item (NEWLINE item)*
;
item
: '*' (STRING | WORD)*
;
BULLET : '*';
STRING : '"' (~[\\"] | '\\' [\\"])* '"';
WORD : ~[ \t\r\n"*]+;
NEWLINE : '\r'? '\n' | '\r';
SPACE : [ \t]+ -> skip;
which parses your example input as follows:
(parse
(list
(item
* this is a string) \n
(item
* "first" this is "quoted") \n
(item
* this is "quoted with \" "))
\n
<EOF>)
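The token rules of this grammar can be tried out on their own with a short Python sketch. It is only an approximation under stated assumptions: the regular expressions are hand translations of the lexer rules, `tokenize` is a hypothetical helper, and plain ordered alternation is enough here because no two of these rules can start with the same character.

```python
import re

# Hand-translated versions of the lexer rules above.
RULES = [
    ("BULLET", r"\*"),
    ("STRING", r'"(?:[^\\"]|\\[\\"])*"'),  # allows \" and \\ inside quotes
    ("WORD", r'[^ \t\r\n"*]+'),
    ("NEWLINE", r"\r?\n|\r"),
    ("SPACE", r"[ \t]+"),                  # skipped, like "-> skip"
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def tokenize(text):
    """Split text into (rule name, text) pairs, dropping SPACE tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

escaped = tokenize('* this is "quoted with \\" "\n')
mixed = tokenize('* "first" this is "quoted"')
print(escaped)
print(mixed)
```

The bullet is now a token of its own, and the quoted text with the escaped quote stays together as a single STRING token.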
