Remove unknown text usign ANTLR4 - antlr4

In my grammar, there are key-value pairs separated by :. An example is like this:
rule1 : string
rule2 : string
I know some directives like rule1 and rule2, and I need to ignore other text (or unknown key-value pairs) found in the file like rule3, randomtext etc...
In my ANTLR4 grammar I have the following:
group : rule (comment | rule)* ;
rule : rule1 | rule2
rule1 : RULE1':' ruleName+ comment* ;
rule2 : RULE2':' ruleName+ comment* ;
ruleName : ID ID*;
comment : '#' ID* ;
When I test with valid files, all is ok, but when I test for malformed input, like this:
rule1: asd
rule2 somestring
It detects the first line as correct (as expected), and the second one with and error that say:
line 2:5 missing ':' at 'somestring'
It is detected as rule2 parser rule.
The question is, how can I ignore all unknown text from my parser? The unknown rules are the ones that are not contemplated in my parser file.
Thanks

Related

Antlr4: Skip line when it start with * unless the second char is

In my input, a line start with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
This is my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Commnent : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip it. But I don't know to to define lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
What a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not mandate the * to be at the start of the line. If you want that, checkout this Q&A: Handle strings starting with whitespaces

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a some definition but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum much rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).

Match most specific rule

In my grammar, I want to have both "variable identifiers" and "function identifiers". Essentially, I want to be less restrictive on the characters allowed in function identifiers. However, I am running in to the issue that all variable identifiers are valid function identifiers.
As an example, say I want to allow uppercase letters in a function identifier but not in a variable identifier. My current (presumably naive) might look like:
prog : 'func' FunctionId
| 'var' VariableId
;
FunctionId : [a-zA-Z]+ ;
VariableId : [a-z]+ ;
With the above rules, var hello fails to parse. If I understand correctly, this is because FunctionId is defined first, so "hello" is treated as a FunctionId.
Can I make antlr choose the more specific valid rule?
An explanation why your grammar does not work as expected could be found here.
You can solve this with semantic predicates:
grammar Test;
prog : 'func' functionId
| 'var' variableId
;
functionId : Id;
variableId : {isVariableId(getCurrentToken().getText())}? Id ;
Id : [a-zA-Z]+;
On the lexer level there will be only ids. On the parser level you can restrict an id to lowercase characters. isVariableId(String) would look like:
public boolean isVariableId(String text) {
return text.matches("[a-z]+");
}
Can I make antlr choose the more specific valid rule?
No (as already mentioned). The lexer merely matches as much as it can, and in case 2 or more rules match the same, the one defined first "wins". There is no way around this.
I'd go for something like this:
prog : 'func' functionId
| 'var' variableId
;
functionId : LowerCaseId | UpperCaseId ;
variableId : LowerCaseId ;
LowerCaseId : [a-z]+ ;
UpperCaseId : [A-Z] [a-zA-Z]* ;

Simple grammar for fluentd?

I am new to antlr4 and I am trying to create grammar to parse a fluentd config files to a tree. Can you point me to what I am doing wrong here?
The fluentd syntax looks a lot like Apache's (pseudo-xml, shell-style comments, kv-pairs in a tag), for example:
# Receive events from 24224/tcp
<source>
#type forward
port 24224
</source>
# example
<match>
# file or memory
buffer_type file
<copy>
file /path
</copy>
</match>
This is my grammar so far:
grammar Fluentd;
// root element
content: (entry | comment)*;
entry: '<' name tag? '>' (entry | comment | param)* '<' '/' close_ '>';
name: NAME;
close_: NAME;
tag: TAG;
comment: '#' NL;
param: name value NL;
value: ANY;
ANY: .*?;
NL: ('\r'?'\n'|'\n') -> skip;
TAG: ('a'..'z' | 'A'..'Z' | '_' | '0'..'9'| '$' |'.' | '*' | '{' | '}')+;
NAME: ('a'..'z'| 'A..Z' | '#' | '_' | '0'..'9')+;
WS: (' '|'\t') -> skip;
...And it fails miserably on the above input:
line 2:2 mismatched input 'Receive' expecting NL
line 3:1 missing NAME at 'source'
line 4:8 mismatched input 'forward' expecting ANY
line 6:2 mismatched input 'source' expecting NAME
line 8:2 mismatched input 'example' expecting NL
line 9:1 missing NAME at 'match'
line 10:6 mismatched input 'file' expecting NL
line 12:2 mismatched input 'match' expecting NAME
The first thing you must realise is that the lexer works independently from the parser. The lexer simply creates tokens by trying to match as much characters as possible. If two or more lexer rules match the same amount of characters, the rule defined first will "win".
Having said that, the input source can therefor never be tokenised as a NAME since the TAG rule also matches this, and is defined before NAME.
A solution to this could be:
tag : SIMPLE_ID | TAG;
name : SIMPLE_ID | NAME;
SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG : [a-zA-Z_0-9$.*{}]+ ;
NAME : [a-zA-Z_0-9#]+ ;
That way, foobar would become a SIMPLE_ID, foo.bar a TAG and #mu a NAME.
There are more things incorrect in your grammar:
in your lexer, you're skipping NL tokens, but you're using them in parser rules as well: you can't do that (since such tokens will never be created)
ANY: .*?; can potentially match an empty string (of which there are an infinite amount): lexer rules must always match at least 1 character! However, if you change .*? to .+?, it will always match just 1 character since you made it match ungreedy (the trailing ?). And you cannot do .+ because then it will match the entire input. You should do something like this:
// Use a parser rule to "glue" all single ANY tokens to each other
any : ANY+ ;
// all other lexer rules
// This must be very last rule!
ANY : . ;
If you don't define ANY as the last rule, input like X would not be tokenised as a TAG, but an an ANY token (remember my first paragraph).
the rule comment: '#' NL; makes no sense: a comment isn't a # followed by a line break. I'd expect a lexer rule for such a thing:
COMMENT : '#' ~[\r\n]* -> skip;
And there's not need to include a linebreak in this rule: these are already handled in NL.

Antlr Lexer rule excluding tokens

Here is a fragment from my ANTLR4 grammar:
Lexer Rules:
AND : ('a'|'A') ('n'|'N') ('d'|'D');
OR : ('o'|'O') ('r'|'R') ;
NOT : ('n'|'N') ('o'|'O') ('t'|'T') ;
TERM : ('a'..'z'|'A'..'Z'|'0'..'9'|[1-9])+ ;
Parser Rules:
negation: NOT;
logical: AND|OR;
term: TERM;
search
: negation? term (logical negation? term)* ;
;
Essentially I am trying to get it parse the "you and me" string such that the TERM token would match "you", "me" and I would like "and" to be recognized by the AND rule, not the TERM rule.
Right now I am getting: line 1:4 missing TERM at 'and' error.
I understand that my input is being matched by both AND and TERM lexer rules, but I would like to be able to specify that TERM is anything except what matches AND rule.
Try adding the following to your lexer rules:
WS : [ \r\n\t\u000C]+ -> skip ;
Basically this is a token that matches any whitspace, tab, newline, tr and with skip you're telling ANTLR to skip it.

Resources