Match most specific rule - antlr4

In my grammar, I want to have both "variable identifiers" and "function identifiers". Essentially, I want to be less restrictive on the characters allowed in function identifiers. However, I am running into the issue that all variable identifiers are also valid function identifiers.
As an example, say I want to allow uppercase letters in a function identifier but not in a variable identifier. My current (presumably naive) grammar might look like:
prog : 'func' FunctionId
| 'var' VariableId
;
FunctionId : [a-zA-Z]+ ;
VariableId : [a-z]+ ;
With the above rules, var hello fails to parse. If I understand correctly, this is because FunctionId is defined first, so "hello" is treated as a FunctionId.
Can I make antlr choose the more specific valid rule?

An explanation of why your grammar does not work as expected can be found here.
You can solve this with semantic predicates:
grammar Test;
prog : 'func' functionId
| 'var' variableId
;
functionId : Id;
variableId : {isVariableId(getCurrentToken().getText())}? Id ;
Id : [a-zA-Z]+;
On the lexer level there will be only ids. On the parser level you can restrict an id to lowercase characters. isVariableId(String) would look like:
public boolean isVariableId(String text) {
return text.matches("[a-z]+");
}
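As a quick sanity check of the predicate's logic outside the parser, the sketch below runs the same regular expression on a few sample names. The class name VariableIdCheck and the precompiled LOWERCASE_ID constant are illustrative assumptions, not part of the grammar above; in a real grammar the method would sit in the parser's members section.

```java
import java.util.regex.Pattern;

// Hypothetical helper class used only to exercise the predicate logic in isolation.
final class VariableIdCheck {
    // Same pattern as text.matches("[a-z]+"), compiled once instead of on every call.
    private static final Pattern LOWERCASE_ID = Pattern.compile("[a-z]+");

    static boolean isVariableId(String text) {
        // matches() succeeds only if the entire text consists of lowercase letters.
        return LOWERCASE_ID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isVariableId("hello")); // true
        System.out.println(isVariableId("Hello")); // false
        System.out.println(isVariableId(""));      // false
    }
}
```

Because matches() tests the whole string, no ^ or $ anchors are needed in the pattern.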

Can I make antlr choose the more specific valid rule?
No (as already mentioned). The lexer simply matches as many characters as it can, and when two or more rules match the same number of characters, the one defined first "wins". There is no way around this.
I'd go for something like this:
prog : 'func' functionId
| 'var' variableId
;
functionId : LowerCaseId | UpperCaseId ;
variableId : LowerCaseId ;
LowerCaseId : [a-z]+ ;
UpperCaseId : [A-Z] [a-zA-Z]* ;
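To see why splitting the rules helps, here is a toy Java model of the lexer's choice between rules: longest match wins, and ties go to the rule listed first. It is only a sketch using java.util.regex in place of ANTLR's lexer, and it ignores the implicit 'func' and 'var' tokens; the class and method names (LexerTieBreakDemo, pickRule, original, split) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of rule selection: longest match wins, ties go to the rule listed first.
final class LexerTieBreakDemo {
    static String pickRule(Map<String, Pattern> rulesInOrder, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() matches a prefix of the input; the strict '>' keeps the earlier rule on ties.
            if (m.lookingAt() && m.end() > longest) {
                winner = rule.getKey();
                longest = m.end();
            }
        }
        return winner;
    }

    // The two overlapping rules from the question, in grammar order.
    static Map<String, Pattern> original() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("FunctionId", Pattern.compile("[a-zA-Z]+"));
        rules.put("VariableId", Pattern.compile("[a-z]+"));
        return rules;
    }

    // The non-overlapping rules suggested in this answer.
    static Map<String, Pattern> split() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("LowerCaseId", Pattern.compile("[a-z]+"));
        rules.put("UpperCaseId", Pattern.compile("[A-Z][a-zA-Z]*"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(pickRule(original(), "hello")); // FunctionId
        System.out.println(pickRule(split(), "hello"));    // LowerCaseId
        System.out.println(pickRule(split(), "Hello"));    // UpperCaseId
    }
}
```

With the original rules, "hello" can never become a VariableId because FunctionId matches the same five characters and comes first. With the split rules, no input can be matched in full by both rules, so the tie never arises.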

Related

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression (which is what you're trying to do).
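The restriction can be illustrated with a small Java check that mimics the lexerCommandExpr rule. This is only an approximation for illustration: the regular expression is a simplified stand-in for ANTLR's identifier and INT tokens (real identifiers allow more characters), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical checker mirroring: lexerCommandExpr : identifier | INT ;
final class LexerCommandArgCheck {
    // Simplified stand-in for "identifier | INT"; real ANTLR identifiers also allow some Unicode characters.
    private static final Pattern IDENTIFIER_OR_INT =
            Pattern.compile("[A-Za-z][A-Za-z0-9_]*|[0-9]+");

    static boolean isValidCommandArg(String arg) {
        // The whole argument must be a single identifier or a single integer literal.
        return IDENTIFIER_OR_INT.matcher(arg.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidCommandArg("MODE_STRING"));           // true
        System.out.println(isValidCommandArg("1"));                     // true
        System.out.println(isValidCommandArg("current_string_mode()")); // false
        System.out.println(isValidCommandArg(
                "current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING")); // false
    }
}
```

Both attempts from the question fail this check because they contain parentheses or operators, which is why the tool rejects them.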

ANTLR grammar: Boolean literal which can occur as qualified variable name while ignoring whitespace

I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of a variable name that allows names to be built from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as part of a variable name, as in varName.nestedVar.false: the rule that marks false as falseLiteral takes precedence.
I tried to play with the whitespace, using everything I found on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I tried to tell the boolLiteral rule that it must have some whitespace before the actual literal, as in WHITESPACE* trueLiteral, but this doesn't work when whitespace is sent to the HIDDEN channel. Disabling that altogether would again mean a lot of code rewriting, since I often rely on the order of tokens. I also tried reordering the non-terminals in the literal rule, but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition to a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct: varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition this doesn't work, and I really don't know why
varName AND false
Qualified name without . returns
line 1:8 no viable alternative at input 'varName AND'
The expected solution would be to either enable/disable whitespace -> channel(HIDDEN) for specific rules only,
tell the boolLiteral rule that it cannot start with a dot, something like ~POINT falseLiteral (I tried this as well, with no luck),
or get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
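To make the effect of the anyId rule concrete, the following Java sketch hand-codes a recognizer for the same shape over a list of token types. It is not ANTLR-generated code; the enum T and the class QualifiedNameSketch are hypothetical names, and the token types simply mirror the lexer rules from the question.

```java
import java.util.Arrays;
import java.util.List;

final class QualifiedNameSketch {
    // Hypothetical token types mirroring the lexer rules in the question.
    enum T { ID, INT, TRUE, FALSE, POINT }

    // Mirrors: qualifiedName : ID ('.' (anyId | INT))* ;   anyId : ID | TRUE | FALSE ;
    static boolean isQualifiedName(List<T> tokens) {
        if (tokens.isEmpty() || tokens.get(0) != T.ID) {
            return false; // the first segment must still be a plain ID
        }
        int i = 1;
        while (i < tokens.size()) {
            if (tokens.get(i) != T.POINT || i + 1 >= tokens.size()) {
                return false; // every further segment needs a '.' followed by one token
            }
            T segment = tokens.get(i + 1);
            boolean anyId = segment == T.ID || segment == T.TRUE || segment == T.FALSE;
            if (!anyId && segment != T.INT) {
                return false;
            }
            i += 2;
        }
        return true;
    }

    public static void main(String[] args) {
        // varName.nestedVar.false
        System.out.println(isQualifiedName(
                Arrays.asList(T.ID, T.POINT, T.ID, T.POINT, T.FALSE))); // true
        // false.varName: a keyword is still not accepted as the first segment
        System.out.println(isQualifiedName(
                Arrays.asList(T.FALSE, T.POINT, T.ID))); // false
    }
}
```

The lexer keeps producing FALSE tokens as before; only the parser rule becomes more permissive after a dot, so the hidden-channel whitespace handling does not need to change.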

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a definition, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, those rules will always match on the same input and will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
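The idea that an identifier's role comes from its position rather than from its spelling can be sketched in plain Java. This is a regex-based stand-in, not an ANTLR parser: every name has the same lexical shape (the IDENT pattern), and only the line structure decides whether it is a type name, a property name, or a property type. All class, field, and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierRolesSketch {
    // One shared identifier shape, like a single IDENTIFIER lexer rule.
    private static final String IDENT = "[A-Za-z_][A-Za-z0-9_]*";
    // The surrounding structure, not the spelling, decides each identifier's role.
    private static final Pattern TYPE_LINE =
            Pattern.compile("type[ \\t]+(" + IDENT + ")");
    private static final Pattern PROPERTY_LINE =
            Pattern.compile("(" + IDENT + ")[ \\t]*:[ \\t]*(" + IDENT + ")");

    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        for (String line : input.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines correspond to the bare NEWLINE statement
            }
            Matcher type = TYPE_LINE.matcher(trimmed);
            Matcher prop = PROPERTY_LINE.matcher(trimmed);
            if (type.matches()) {
                out.add("type " + type.group(1));
            } else if (prop.matches()) {
                out.add("property " + prop.group(1) + " of type " + prop.group(2));
            } else {
                out.add("error: " + trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "type SimpleType\n\nintProp1: int\nstringProp2 : String\n";
        for (String line : describe(input)) {
            System.out.println(line);
        }
    }
}
```

In the grammar itself, the equivalent change is to make IDENTIFIER a real (non-fragment) lexer rule and let the type and property parser rules refer to it directly.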

How to parse keywords as normal words some of the time in ANTLR4

I have a language with keywords like hello that are only keywords in certain types of sentences. In other types of sentences, these words should be matched as an ID, for example. Here's a super simple grammar that tells the story:
grammar Hello;
file : ( sentence )* ;
sentence : 'hello' ID PERIOD
| INT ID PERIOD;
ID : [a-z]+ ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
PERIOD : '.' ;
I'd like these sentences to be valid:
hello fred.
31 cheeseburgers.
6 hello.
but that last sentence doesn't work in this grammar. The word hello is a token of type hello and not of type ID. It seems like the lexer grabs all the hellos and turns them into tokens of that type.
Here's a crazy way to do it, to explain what I want:
sentence : 'hello' ID PERIOD
| INT crazyID PERIOD;
crazyID : ID | 'hello' ;
but in my real language, there are a lot of keywords like hello to deal with, so, yeah, that way seems crazy.
Is there a reasonable, compact, target-language-independent way to handle this?
A standard way of handling keywords:
file : ( sentence )* EOF ;
sentence : key=( KEYWORD | INT ) id=( KEYWORD | ID ) PERIOD ;
KEYWORD : 'hello' | 'goodbye' ; // list others as alts
PERIOD : '.' ;
ID : [a-z]+ ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
The seeming ambiguity between the KEYWORD and ID rules is resolved based on the KEYWORD rule being listed before the ID rule.
In the generated SentenceContext class, Token fields key and id will be generated and, after parsing, will hold the matched tokens, allowing easy positional identification.

Is there a language-agnostic way to do simple predicates in the parser?

Goal
I want to reduce (or eliminate) the Java-specific actions and predicates in my parser. Perhaps it isn't possible, but I wanted to ask here just in case there's some ANTLR4 feature I've missed. (The language itself is third-party, so I don't have control over that.)
Simplified example
The predicates I want to use are mostly exact (or perhaps case-insensitive) string-matching. I could make big parallel sets of parser rules, but I'd rather not since the real-life example is considerably more convoluted.
Suppose I'm given something like:
isWidget(int) : "Whether it is a widget" : 4 ;
ownerFirstName(string) : "john" ;
ownerLastName(string) : "This is the last-name of the owner" : "doe" ;
I want the parser to look at the default-value (the last item on the line, like 4, "john" or "doe") and parse it based on the earlier type (int), (string), (string).
main
: stmt SEMIC (stmt SEMIC)* EOF
;
stmt
: propname=IDENTIFIER LPAREN datatype=IDENTIFIER RPAREN (COLON description=QUOTSTRING)? COLON df=defaultVal
;
defaultVal
: QUOTSTRING //TODO only this alt if datatype=string
| NUM //TODO only this alt if datatype=int
;
fragment Letter : 'a'..'z' | 'A'..'Z' ;
fragment Digit : '0'..'9' ;
fragment Underscore : '_' ;
SEMIC : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
IDENTIFIER : (Letter|Underscore) (Letter|Underscore|Digit)* ;
QUOTSTRING : '"' ~('"' |'\n' | '\r' | '\u2029' | '\u2028')* '"' ;
NUM : Digit+ ;
WS : [ \t\n\r]+ -> skip ;
I know I can do it with predicates and rule inputs, but then I'm crossing the line from a language-agnostic grammar to one with embedded Java code.
Your parser should handle things like the following without a problem:
isWidget(int) : "Whether it is a widget" : "foo" ;
In other words, do not add a predicate that would fail in this case, or you will lose the ability to report sane error messages. Instead, use a language-specific listener or visitor implementation after the parse is complete to report a semantic error if the type of the default value does not match the declared type.
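A post-parse check of this kind might look roughly like the Java sketch below. It deliberately avoids the ANTLR runtime so that it is self-contained: the Stmt class is a hypothetical holder for the values a listener or visitor would collect from each stmt context, and the error message wording is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DefaultValueCheck {
    // Hypothetical holder for one parsed stmt, as a listener or visitor might collect it.
    static final class Stmt {
        final String propName;
        final String dataType;
        final String defaultText; // raw token text, e.g. 4 or "john" (quotes included)

        Stmt(String propName, String dataType, String defaultText) {
            this.propName = propName;
            this.dataType = dataType;
            this.defaultText = defaultText;
        }
    }

    // Returns one message per statement whose default value does not fit its declared type.
    static List<String> check(List<Stmt> stmts) {
        List<String> errors = new ArrayList<>();
        for (Stmt s : stmts) {
            boolean isQuoted = s.defaultText.startsWith("\"");   // came from QUOTSTRING
            boolean isNumber = s.defaultText.matches("[0-9]+");  // came from NUM
            if (s.dataType.equals("string") && !isQuoted) {
                errors.add(s.propName + ": expected a string default, found " + s.defaultText);
            } else if (s.dataType.equals("int") && !isNumber) {
                errors.add(s.propName + ": expected an int default, found " + s.defaultText);
            }
            // Other datatypes are left unchecked in this sketch.
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Stmt> stmts = Arrays.asList(
                new Stmt("isWidget", "int", "\"foo\""),
                new Stmt("ownerFirstName", "string", "\"john\""));
        // Prints a single error, for isWidget.
        for (String error : check(stmts)) {
            System.out.println(error);
        }
    }
}
```

The grammar stays free of target-language code, and the type mismatch is reported as a semantic error with the property name attached rather than as a confusing syntax error.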

Resources