ANTLR - Period character not matched when matching "anything"

I have a simple rule like so:
ifClause: 'if' '(' condition ')' '{' (structField)+ '}' ;
condition: .*?;
This works for parsing:
if (abc == def) {
<something>
}
But errors out on:
if (abc.xyz == def) {
<something>
}
with the error:
line NN:MM token recognition error at: '.'
Why does .*? not consume the '.' character?
I am using ANTLR 4.5.3 with the Python target.

First, the parser rule
condition: .*?;
consumes tokens produced by the lexer, not raw characters.
Second, 'token recognition' errors come from the lexer, not the parser. They are reported when, as here, a character cannot be matched by any lexer rule. By default the lexer reports the error, skips the unrecognized character without producing a token for the parser, and continues matching the input stream.
To fix this, make sure that '.' is matched by a lexer rule.
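The behaviour described above can be imitated with a few lines of plain Python. This is only a toy model of a lexer, not ANTLR's implementation: the tokenize helper, the rule names (EQ, ID, WS, DOT) and the regular expressions are all assumptions made for illustration. It shows that a character with no matching rule yields an error and no token, and that adding a rule for '.' makes the error go away.

```python
import re

def tokenize(text, rules):
    """Toy lexer. `rules` is an ordered list of (name, regex) pairs."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            # No rule matches here: report it, drop one character, go on.
            errors.append(f"token recognition error at: '{text[pos]}'")
            pos += 1
            continue
        if best_name != "WS":  # stands in for `-> skip`
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens, errors

# Hypothetical rule set that has no rule for '.'
rules = [("EQ", r"=="), ("ID", r"[A-Za-z_]+"), ("WS", r"\s+")]
tokens, errors = tokenize("abc.xyz == def", rules)
print(errors)
print(tokens)

# Same input once a rule for '.' exists (DOT is an assumed name)
fixed_tokens, fixed_errors = tokenize("abc.xyz == def", rules + [("DOT", r"\.")])
print(fixed_errors)
print(fixed_tokens)
```

In the first run the parser side never sees the '.', which is why a parser rule such as .*? cannot consume it.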

Related

Why does my antlr grammar give me an error?

I have the little grammar below. node is the start production. When my input is (a:b) I get an error: line 1:1 extraneous input 'a' expecting {':', INAME}
Why is this?
EDIT - I forgot that the lexer and parser run as separate phases. By the time the parser runs, the lexer has completed. When the lexer runs it has no knowledge of the parser rules. It has already made the TYPE/INAME decision, choosing TYPE per @bart's reasoning below.
grammar g1;
TYPE: [A-Za-z_];
INAME: [A-Za-z_];
node: '(' namesAndTypes ')';
namesAndTypes:
INAME ':' TYPE
| ':' TYPE
| INAME
;

That is because the lexer will never produce an INAME token. The lexer works in the following way:
try to match as many characters as possible
when 2 or more lexer rules match the same characters, let the one defined first "win"
Because the input "a" and "b" both match the TYPE and INAME rules, the TYPE rule wins because it is defined first. It doesn't matter if the parser is trying to match an INAME rule, the lexer will not produce it. The lexer does not "listen" to the parser.
You could create some sort of ID rule, and then define type and iname parser rules instead:
ID: [A-Za-z_];
node
: '(' namesAndTypes ')'
;
namesAndTypes
: iname ':' type
| ':' type
| iname
;
type
: ID
;
iname
: ID
;

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse some definitions, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?

Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximal munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximal munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
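As a rough illustration of that tie-breaking, here is a small Python model of "longest match, first rule wins" applied to the rules from the question. It is a sketch, not ANTLR itself: the lex function, the COLON name for the implicit ':' token and the regular expressions are assumptions chosen to mirror the grammar.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Rules in the same order as in the grammar; the three name rules
# share one definition, exactly as in the question.
RULES = [
    ("TYPE_KEYWORD", r"type"),
    ("TYPE_NAME", IDENT),
    ("PROPERTY_NAME", IDENT),
    ("PROPERTY_TYPE", IDENT),
    ("COLON", r":"),          # assumed name for the implicit ':' token
    ("NEWLINE", r"\r?\n"),
    ("WS", r"[ \t]"),
]

def lex(text):
    out, pos = [], 0
    while pos < len(text):
        name, end = None, pos
        for rule, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so equal-length matches keep
            # whichever rule was listed first.
            if m and m.end() > end:
                name, end = rule, m.end()
        if name is None:
            raise ValueError(f"no rule matches {text[pos]!r}")
        if name != "WS":      # stands in for `-> skip`
            out.append((name, text[pos:end]))
        pos = end
    return out

token_names = [name for name, _ in lex("type SimpleType\nintProp1: int\n")]
print(token_names)
```

Every identifier comes out as TYPE_NAME and neither PROPERTY_NAME nor PROPERTY_TYPE is ever produced, which is the same pattern as in the token dump above.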

Simple grammar for fluentd?

I am new to antlr4 and I am trying to create a grammar to parse fluentd config files into a tree. Can you point me to what I am doing wrong here?
The fluentd syntax looks a lot like Apache's (pseudo-xml, shell-style comments, kv-pairs in a tag), for example:
# Receive events from 24224/tcp
<source>
#type forward
port 24224
</source>
# example
<match>
# file or memory
buffer_type file
<copy>
file /path
</copy>
</match>
This is my grammar so far:
grammar Fluentd;
// root element
content: (entry | comment)*;
entry: '<' name tag? '>' (entry | comment | param)* '<' '/' close_ '>';
name: NAME;
close_: NAME;
tag: TAG;
comment: '#' NL;
param: name value NL;
value: ANY;
ANY: .*?;
NL: ('\r'?'\n'|'\n') -> skip;
TAG: ('a'..'z' | 'A'..'Z' | '_' | '0'..'9'| '$' |'.' | '*' | '{' | '}')+;
NAME: ('a'..'z'| 'A..Z' | '#' | '_' | '0'..'9')+;
WS: (' '|'\t') -> skip;
...And it fails miserably on the above input:
line 2:2 mismatched input 'Receive' expecting NL
line 3:1 missing NAME at 'source'
line 4:8 mismatched input 'forward' expecting ANY
line 6:2 mismatched input 'source' expecting NAME
line 8:2 mismatched input 'example' expecting NL
line 9:1 missing NAME at 'match'
line 10:6 mismatched input 'file' expecting NL
line 12:2 mismatched input 'match' expecting NAME

The first thing you must realise is that the lexer works independently from the parser. The lexer simply creates tokens by trying to match as many characters as possible. If two or more lexer rules match the same number of characters, the rule defined first will "win".
Having said that, the input source can therefore never be tokenised as a NAME, since the TAG rule also matches it and is defined before NAME.
A solution to this could be:
tag : SIMPLE_ID | TAG;
name : SIMPLE_ID | NAME;
SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG : [a-zA-Z_0-9$.*{}]+ ;
NAME : [a-zA-Z_0-9#]+ ;
That way, foobar would become a SIMPLE_ID, foo.bar a TAG and #mu a NAME.
There are more things incorrect in your grammar:
in your lexer, you're skipping NL tokens, but you're using them in parser rules as well: you can't do that (since such tokens will never be created)
ANY: .*?; can potentially match an empty string (of which there are infinitely many): lexer rules must always match at least 1 character! However, if you change .*? to .+?, it will always match just 1 character, since the trailing ? makes it non-greedy. And you cannot use .+ because then it will match the entire input. You should do something like this:
// Use a parser rule to "glue" all single ANY tokens to each other
any : ANY+ ;
// all other lexer rules
// This must be very last rule!
ANY : . ;
If you don't define ANY as the last rule, input like X would not be tokenised as a TAG, but as an ANY token (remember my first paragraph).
the rule comment: '#' NL; makes no sense: a comment isn't a # followed by a line break. I'd expect a lexer rule for such a thing:
COMMENT : '#' ~[\r\n]* -> skip;
And there's no need to include a line break in this rule: line breaks are already handled by NL.
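The greedy versus non-greedy point can be tried out with Python's re module. This is only an analogy: Python's regex engine is not ANTLR's lexer, and the sample strings are made up. The re.DOTALL flag is used so that . also matches line breaks, which is closer to what the wildcard does in a lexer rule.

```python
import re

text = "port 24224"

# Non-greedy star is satisfied by matching nothing at all.
lazy_star = re.match(r".*?", text, re.DOTALL).group()

# Non-greedy plus stops as soon as it has its one required character.
lazy_plus = re.match(r".+?", text, re.DOTALL).group()

# Greedy plus swallows everything, including the line break.
greedy_plus = re.match(r".+", "a\nb", re.DOTALL).group()

print(repr(lazy_star), repr(lazy_plus), repr(greedy_plus))

# Matching one character at a time and joining the pieces afterwards
# mirrors the idea of `ANY : . ;` plus a parser rule `any : ANY+ ;`.
pieces = re.findall(r".", "24224", re.DOTALL)
glued = "".join(pieces)
print(pieces, glued)
```

The last two lines correspond to letting the lexer emit single-character tokens and leaving it to the parser to group them.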

Antlr Lexer rule excluding tokens

Here is a fragment from my ANTLR4 grammar:
Lexer Rules:
AND : ('a'|'A') ('n'|'N') ('d'|'D');
OR : ('o'|'O') ('r'|'R') ;
NOT : ('n'|'N') ('o'|'O') ('t'|'T') ;
TERM : ('a'..'z'|'A'..'Z'|'0'..'9'|[1-9])+ ;
Parser Rules:
negation: NOT;
logical: AND|OR;
term: TERM;
search
: negation? term (logical negation? term)* ;
;
Essentially I am trying to get it to parse the "you and me" string such that the TERM token would match "you" and "me", and I would like "and" to be recognized by the AND rule, not the TERM rule.
Right now I am getting: line 1:4 missing TERM at 'and' error.
I understand that my input is being matched by both AND and TERM lexer rules, but I would like to be able to specify that TERM is anything except what matches AND rule.

Try adding the following to your lexer rules:
WS : [ \r\n\t\u000C]+ -> skip ;
Basically this is a token that matches any whitespace (space, carriage return, newline, tab, form feed), and with skip you're telling ANTLR to discard it.
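To see how the keyword rules and TERM interact once whitespace is skipped, here is a small Python model of the lexer rules from the question plus the WS rule. It is a sketch under assumptions: the token_types helper and the regular expressions are illustrative stand-ins, with (?i) playing the role of the ('a'|'A') style alternatives.

```python
import re

# Rules in grammar order, with WS added last.
RULES = [
    ("AND", r"(?i)and"),
    ("OR", r"(?i)or"),
    ("NOT", r"(?i)not"),
    ("TERM", r"[A-Za-z0-9]+"),
    ("WS", r"[ \r\n\t\f]+"),
]

def token_types(text):
    types, pos = [], 0
    while pos < len(text):
        winner, end = None, pos
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Longest match wins; equal lengths keep the earlier rule.
            if m and m.end() > end:
                winner, end = name, m.end()
        if winner is None:
            raise ValueError(f"no rule matches {text[pos]!r}")
        if winner != "WS":    # stands in for `-> skip`
            types.append(winner)
        pos = end
    return types

print(token_types("you and me"))
print(token_types("android or not"))
```

Because AND is listed before TERM, "and" on its own becomes an AND token, while a longer word such as "android" is still a TERM, since the longer match takes priority over rule order.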

Incorrect Result When ANTLR4 Lexer Action Invokes getText()

It seems that getText() in a lexer action cannot correctly retrieve the token being matched. Is this normal behaviour? For example, part of my grammar has these rules for parsing a C++ style identifier that supports a \u sequence to embed Unicode characters as part of the identifier name:
grammar CPPDefine;
cppCompilationUnit: (id_token|ALL_OTHER_SYMBOL)+ EOF;
id_token:IDENTIFIER //{System.out.println($text);}
;
CRLF: '\r'? '\n' -> skip;
ALL_OTHER_SYMBOL: '\\';
IDENTIFIER: (NONDIGIT (NONDIGIT | DIGIT)*)
{System.out.println(getText());}
;
fragment DIGIT: [0-9];
fragment NONDIGIT: [_a-zA-Z] | UNIVERSAL_CHARACTER_NAME ;
fragment UNIVERSAL_CHARACTER_NAME: ('\\u' HEX_QUAD | '\\U' HEX_QUAD HEX_QUAD ) ;
fragment HEX_QUAD: [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f] [0-9A-Fa-f];
Tested with this 1 line input containing an identifier with incorrect unicode escape sequence:
dkk\uzzzz
The $text of the id_token parser rule action produces this correct result:
dkk
uzzzz
i.e. input interpreted as 2 identifiers separated by a symbol '\' (symbol '\' not printed by any parser rule).
However, the getText() of IDENTIFIER lexer rule action produces this incorrect result:
dkk\u
uzzzz
Why is the getText() of the lexer rule IDENTIFIER different from the $text of the parser rule id_token? After all, the parser rule contains only this lexer rule.
EDIT:
Issue observed in ANTLR 4.1 but not in ANTLR 4.2, so it could have been fixed already.

It's hard to tell based on your example, but my instinct is you are using an old version of ANTLR. I am unable to reproduce this issue in ANTLR 4.2.
