Antlr4 grammar wouldn't parse multiline input - antlr4

I want to write a grammar using Antlr4 that will parse a some definition but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?

Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum much rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).

Related

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (error message says:'"' came as a complete surprise)
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parenthesis (mode(...) or pushMode(...)) must be an identifier, or an integer literal. It cannot be an expression (what you're trying to do).

Match most specific rule

In my grammar, I want to have both "variable identifiers" and "function identifiers". Essentially, I want to be less restrictive on the characters allowed in function identifiers. However, I am running in to the issue that all variable identifiers are valid function identifiers.
As an example, say I want to allow uppercase letters in a function identifier but not in a variable identifier. My current (presumably naive) might look like:
prog : 'func' FunctionId
| 'var' VariableId
;
FunctionId : [a-zA-Z]+ ;
VariableId : [a-z]+ ;
With the above rules, var hello fails to parse. If I understand correctly, this is because FunctionId is defined first, so "hello" is treated as a FunctionId.
Can I make antlr choose the more specific valid rule?
An explanation why your grammar does not work as expected could be found here.
You can solve this with semantic predicates:
grammar Test;
prog : 'func' functionId
| 'var' variableId
;
functionId : Id;
variableId : {isVariableId(getCurrentToken().getText())}? Id ;
Id : [a-zA-Z]+;
On the lexer level there will be only ids. On the parser level you can restrict an id to lowercase characters. isVariableId(String) would look like:
public boolean isVariableId(String text) {
return text.matches("[a-z]+");
}
Can I make antlr choose the more specific valid rule?
No (as already mentioned). The lexer merely matches as much as it can, and in case 2 or more rules match the same, the one defined first "wins". There is no way around this.
I'd go for something like this:
prog : 'func' functionId
| 'var' variableId
;
functionId : LowerCaseId | UpperCaseId ;
variableId : LowerCaseId ;
LowerCaseId : [a-z]+ ;
UpperCaseId : [A-Z] [a-zA-Z]* ;

Simple grammar for fluentd?

I am new to antlr4 and I am trying to create grammar to parse a fluentd config files to a tree. Can you point me to what I am doing wrong here?
The fluentd syntax looks a lot like Apache's (pseudo-xml, shell-style comments, kv-pairs in a tag), for example:
# Receive events from 24224/tcp
<source>
#type forward
port 24224
</source>
# example
<match>
# file or memory
buffer_type file
<copy>
file /path
</copy>
</match>
This is my grammar so far:
grammar Fluentd;
// root element
content: (entry | comment)*;
entry: '<' name tag? '>' (entry | comment | param)* '<' '/' close_ '>';
name: NAME;
close_: NAME;
tag: TAG;
comment: '#' NL;
param: name value NL;
value: ANY;
ANY: .*?;
NL: ('\r'?'\n'|'\n') -> skip;
TAG: ('a'..'z' | 'A'..'Z' | '_' | '0'..'9'| '$' |'.' | '*' | '{' | '}')+;
NAME: ('a'..'z'| 'A..Z' | '#' | '_' | '0'..'9')+;
WS: (' '|'\t') -> skip;
...And it fails miserably on the above input:
line 2:2 mismatched input 'Receive' expecting NL
line 3:1 missing NAME at 'source'
line 4:8 mismatched input 'forward' expecting ANY
line 6:2 mismatched input 'source' expecting NAME
line 8:2 mismatched input 'example' expecting NL
line 9:1 missing NAME at 'match'
line 10:6 mismatched input 'file' expecting NL
line 12:2 mismatched input 'match' expecting NAME
The first thing you must realise is that the lexer works independently from the parser. The lexer simply creates tokens by trying to match as much characters as possible. If two or more lexer rules match the same amount of characters, the rule defined first will "win".
Having said that, the input source can therefor never be tokenised as a NAME since the TAG rule also matches this, and is defined before NAME.
A solution to this could be:
tag : SIMPLE_ID | TAG;
name : SIMPLE_ID | NAME;
SIMPLE_ID : [a-zA-Z_0-9]+ ;
TAG : [a-zA-Z_0-9$.*{}]+ ;
NAME : [a-zA-Z_0-9#]+ ;
That way, foobar would become a SIMPLE_ID, foo.bar a TAG and #mu a NAME.
There are more things incorrect in your grammar:
in your lexer, you're skipping NL tokens, but you're using them in parser rules as well: you can't do that (since such tokens will never be created)
ANY: .*?; can potentially match an empty string (of which there are an infinite amount): lexer rules must always match at least 1 character! However, if you change .*? to .+?, it will always match just 1 character since you made it match ungreedy (the trailing ?). And you cannot do .+ because then it will match the entire input. You should do something like this:
// Use a parser rule to "glue" all single ANY tokens to each other
any : ANY+ ;
// all other lexer rules
// This must be very last rule!
ANY : . ;
If you don't define ANY as the last rule, input like X would not be tokenised as a TAG, but an an ANY token (remember my first paragraph).
the rule comment: '#' NL; makes no sense: a comment isn't a # followed by a line break. I'd expect a lexer rule for such a thing:
COMMENT : '#' ~[\r\n]* -> skip;
And there's not need to include a linebreak in this rule: these are already handled in NL.

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR​ expression)+
| expression (NOT​ expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer will use the first rule that can match the given input. Only if that rule can't match it, it will try the next one.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case becaue otherwise once invoked the WORD rule will consume all following input.
As WORD is specified last, input will only be tried to be consumed by it if all previous lexer rules (all that have been defined above in the source code) have failed.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
Maybe you want to use escape sequences in QUOTEDWORD, then look in this example how to do this.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

antlr does not parse when token only mentions one other token

I am trying to learn EBNF grammars with ANTLR. So I thought I would convert the Wikipedia EBNF grammar to ANTLR 4 and play with it. However I have had a terrible time at it. I was able to reduce the grammar to the one step that generates the problem.
It seems if I have one token reference solely another token then ANTLR 4 can't parse the input.
Here is my grammar:
grammar Hello;
program : statement+ ;
statement : IDENTIFIER STATEMENTEND /*| LETTERS STATEMENTEND */ ;
LETTERS : [a-z]+ ;
IDENTIFIER : LETTERS ;
SEMICOLON : [;] ;
STATEMENTEND : SEMICOLON NEWLINE* | NEWLINE+ ;
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Notice IDENTIFIER refers only to LETTERS.
If I provide this input:
a;
Then I get this error:
line 1:0 mismatched input 'a' expecting IDENTIFIER
(program a ;\n)
However if I uncomment the code and provide the same input I get legit output:
(program (statement a ;\n))
I do not understand why one works and the other does not.
The token a will only be assigned one token type. Since this input text matches both the LETTERS and IDENTIFIER rules, ANTLR 4 will assign the type according to the first rule appearing in the lexer, which means the input a will be a token of type LETTERS.
If you only meant for LETTERS to be a sub-part of other lexer rules, and not form LETTERS tokens themselves, you can declare it as a fragment rule.
fragment LETTERS : [a-z]+;
IDENTIFIER : LETTERS;
In this case, a would be assigned the token type IDENTIFIER and the original parser rule would work.

Resources