Overcoming ambiguity in antlr4?

I have the grammar below. It is an extract from something I am working on, and it highlights a problem I can't overcome.
In my grammar an expression is either a literal (a number) or an expression, "+", and another expression. So I want to parse:
1 + 2 + 3 + 4
etc.
However my definition of a number means that it can have an optional sign e.g.:
1, +1 or -1
So it's conceivable that I may need to parse:
1 + +1 or 1 + -1
What I am finding is that 1 + 1 (or bigger numbers) are fine.
What I am struggling to parse are inputs without spaces or with extra signs e.g.:
1+2
This causes real problems as the lexer picks up +2 as a Number when actually I want 2 as the number and + to be picked up as the sign in the expression.
How do I get antlr4 to recognise the difference?
grammar example;
example : expression* EOF;
expression
: expression '+' expression
| literal
;
literal : Number;
Number : Sign? Digits;
Sign : [-+];
Digits : Digit+;
Digit : [0-9];
WS : [ \t\r\n\u000C]+ -> skip;

You can remove the optional Sign from the Number token. That postpones the handling of signs to the parser stage, where you have more information about the context of the input. The idea is to introduce unary operators for the minus sign (-) and the plus sign (+), and to keep the number itself unsigned.
grammar example;
example : expression* EOF;
expression
: ('+'|'-') expression # unaryOp
| expression ('+'|'-') expression # binaryOp
| Number # number
;
Number : [0-9]+;
WS : [ \t\r\n\u000C]+ -> skip;
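To see why moving the sign into the parser helps, here is a minimal Python sketch of the same idea. It is not ANTLR output: `tokenize` and `evaluate` are hypothetical helper names, and the sketch evaluates a single expression rather than building a parse tree. The lexer only ever produces unsigned numbers and sign tokens, and the parser decides from context whether a sign is unary or binary.

```python
import re

# NUM is an unsigned run of digits; '+' and '-' are always separate OP tokens.
TOKEN = re.compile(r"\s*(?:([0-9]+)|([-+]))")


def tokenize(text):
    """Split text into ("NUM", int) and ("OP", str) tokens, skipping whitespace."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.group(1) is not None:
            tokens.append(("NUM", int(match.group(1))))
        else:
            tokens.append(("OP", match.group(2)))
        pos = match.end()
    return tokens


def evaluate(text):
    """Evaluate one expression; a sign in operand position is treated as unary."""
    tokens = tokenize(text)
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        kind, value = tokens[pos]
        pos += 1
        if kind == "OP":  # sign where an operand is expected: unary operator
            operand = unary()
            return -operand if value == "-" else operand
        return value

    def binary():
        nonlocal pos
        result = unary()
        while pos < len(tokens) and tokens[pos][0] == "OP":
            op = tokens[pos][1]  # sign after an operand: binary operator
            pos += 1
            rhs = unary()
            result = result + rhs if op == "+" else result - rhs
        return result

    result = binary()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result
```

With this split, `1+2`, `1 + +1` and `1 + -1` all tokenize the same way regardless of spacing, and only the parser's position in the expression decides what each sign means.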

Not sure if it's still relevant, but here goes:
Your expression rule seems faulty: it cannot match a "literal + literal" string, because it always expects an expression on the left.
Your rule should look something like:
expression
: literal '+' literal
| expression '+' literal
;

Related

Why does my antlr grammar give me an error?

I have the little grammar below. node is the start production. When my input is (a:b) I get an error: line 1:1 extraneous input 'a' expecting {':', INAME}
Why is this?
EDIT - I forgot that the lexer and parser run as separate phases. By the time the parser runs, the lexer has completed. When the lexer runs, it has no knowledge of the parser rules: it has already made the TYPE/INAME decision, choosing TYPE per @bart's reasoning below.
grammar g1;
TYPE: [A-Za-z_];
INAME: [A-Za-z_];
node: '(' namesAndTypes ')';
namesAndTypes:
INAME ':' TYPE
| ':' TYPE
| INAME
;

That is because the lexer will never produce an INAME token. The lexer works in the following way:
try to match as many characters as possible
when 2 or more lexer rules match the same characters, let the one defined first "win"
Because the inputs "a" and "b" both match the TYPE and INAME rules, the TYPE rule wins because it is defined first. It doesn't matter that the parser is trying to match an INAME token; the lexer will not produce it. The lexer does not "listen" to the parser.
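The two lexer principles above can be imitated in a few lines of Python. This is only a rough model of the rule selection, not ANTLR's actual implementation; `next_token` and the regex translations of the lexer rules are assumptions made for illustration.

```python
import re


def next_token(rules, text, pos=0):
    """Pick a token at pos: the longest match wins, ties go to the earliest rule."""
    best = None
    for name, pattern in rules:
        match = re.compile(pattern).match(text, pos)
        if match and match.end() > pos:
            lexeme = match.group()
            # Strictly longer only, so an earlier rule keeps the win on a tie.
            if best is None or len(lexeme) > len(best[1]):
                best = (name, lexeme)
    return best


# Same order as in the question: TYPE is defined before INAME.
QUESTION_RULES = [("TYPE", r"[A-Za-z_]"), ("INAME", r"[A-Za-z_]")]

# Swapping the order only moves the problem: now TYPE is never produced.
SWAPPED_RULES = [("INAME", r"[A-Za-z_]"), ("TYPE", r"[A-Za-z_]")]
```

Calling `next_token(QUESTION_RULES, "a")` yields a TYPE token and `next_token(SWAPPED_RULES, "a")` yields an INAME token, so no ordering of two identical rules can produce both token types.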
You could create some sort of ID rule, and then define type and iname parser rules instead:
ID: [A-Za-z_];
node
: '(' namesAndTypes ')'
;
namesAndTypes
: iname ':' type
| ':' type
| iname
;
type
: ID
;
iname
: ID
;

ANTLR prioritizing optional tag over non-optional tag

Hard to explain with the title, but here's the problem:
I'm trying to parse an equation with ANTLR4.
My current definition of a number:
NUMBER: SUBTRACTION? DIGIT+ ([.,] DIGIT+)? ;
where DIGIT matches a single digit and SUBTRACTION matches '-'.
My parser rule for subtraction:
subtraction: value? SUBTRACTION (value|operation)?;
The idea is the parser still works with a missing value.
Let's say I have this input
1-2
The problem is with this input, ANTLR says that 1 is a number, and -2 is a number. ANTLR doesn't group it as subtraction, like 1 SUBTRACTION 2.
What can I do to get the correct grouping?

When you define rules like this:
SUBTRACTION : '-';
NUMBER : SUBTRACTION? DIGIT+ ([.,] DIGIT+)? ;
input like -2 will always be a single NUMBER token: it will never be tokenised as separate SUBTRACTION and NUMBER tokens. ANTLR always tries to match as many characters as possible.
You should not glue the - to the NUMBER in the lexer, but handle it in the parser with an alternative like SUBTRACTION expr:
expr
: SUBTRACTION expr
| expr SUBTRACTION expr
| NUMBER
;
SUBTRACTION : '-';
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
fragment DIGIT : [0-9];
which will parse both 1-2 and -2 correctly.
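The difference between the two NUMBER definitions shows up directly in the token stream. The following Python sketch approximates the lexer's longest-match behaviour with regular expressions; `tokenize`, `SIGNED` and `UNSIGNED` are hypothetical names, and whitespace handling is left out to keep it short.

```python
import re


def tokenize(rules, text):
    """Tokenize text: at each position the longest match wins, ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > pos:
                if best is None or match.end() > best[1].end():
                    best = (name, match)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


# NUMBER with the optional sign glued on, as in the question.
SIGNED = [
    ("SUBTRACTION", r"-"),
    ("NUMBER", r"-?[0-9]+(?:[.,][0-9]+)?"),
]

# NUMBER without the sign, as in the answer.
UNSIGNED = [
    ("SUBTRACTION", r"-"),
    ("NUMBER", r"[0-9]+(?:[.,][0-9]+)?"),
]
```

For the input `1-2`, `tokenize(SIGNED, "1-2")` gives two NUMBER tokens (`1` and `-2`), while `tokenize(UNSIGNED, "1-2")` gives NUMBER, SUBTRACTION, NUMBER, which is the stream the parser rule needs.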

ANTLR grammar: Boolean literal which can occur as qualified variable name while ignoring whitespace

I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of a variable name that allows names to be built from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as part of a variable name, e.g. varName.nestedVar.false: the rule that marks false as falseLiteral takes precedence.
I tried to play with the whitespace, using everything I found on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I tried to tell the boolLiteral rule that it must have some space before the actual literal, like WHITESPACE* trueLiteral, but this doesn't work when whitespace is sent to the HIDDEN channel, and disabling that altogether again means a lot of code rewriting (since I often rely on the order of tokens). I also tried to reorder the alternatives in the literal rule, but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition into a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct: varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a dot returns
line 1:8 no viable alternative at input 'varName AND'
An expected solution would be either to enable/disable whitespace -> channel(HIDDEN) for specific rules only,
or to tell the boolLiteral rule that it cannot start with a dot, something like ~POINT falseLiteral (I tried this as well, with no luck),
or to get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.

You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
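The point of anyId is that the lexer keeps emitting TRUE and FALSE tokens as before, and only the parser accepts them as name segments after a dot. A small Python sketch of that behaviour follows. The function names are hypothetical, LETTER is assumed to be [A-Za-z] because the question elides its definition, and the `\b` word boundary is used here to approximate the longest-match rule so that `falsey` stays an ID.

```python
import re

# Keyword alternatives come first, mirroring the rule order in the grammar.
TOKEN = re.compile(
    r"(?P<TRUE>[Tt][Rr][Uu][Ee]\b)"
    r"|(?P<FALSE>[Ff][Aa][Ll][Ss][Ee]\b)"
    r"|(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<INT>[0-9]+)"
    r"|(?P<POINT>\.)"
    r"|(?P<WS>[ \t\r\n]+)"
)


def tokenize(text):
    """Return (type, lexeme) pairs; whitespace is dropped instead of hidden."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def qualified_name(tokens):
    """qualifiedName : ID ('.' (anyId | INT))* ;  anyId : ID | TRUE | FALSE ;"""
    if not tokens or tokens[0][0] != "ID":
        return None
    parts, pos = [tokens[0][1]], 1
    while (pos + 1 < len(tokens)
           and tokens[pos][0] == "POINT"
           and tokens[pos + 1][0] in ("ID", "TRUE", "FALSE", "INT")):
        parts.append(tokens[pos + 1][1])
        pos += 2
    # Only succeed if the whole token list was a qualified name.
    return parts if pos == len(tokens) else None
```

Here `varName.nestedVar.false` is accepted as a qualified name even though its last segment is a FALSE token, while a bare `false` is still not a qualified name and remains available to boolLiteral.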

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse some definitions, but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?

Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
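As a rough illustration of that division of labour, the Python sketch below uses a single IDENTIFIER token type and lets the statement structure decide whether an identifier is a type name, a property name or a property type. The names `tokenize_line` and `parse_statement` are hypothetical, and the sketch works line by line instead of using NEWLINE tokens.

```python
import re

# One identifier pattern for every kind of name; ':' is the only other token.
TOKEN = re.compile(r"[ \t]*(?:(?P<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)|(?P<COLON>:))")


def tokenize_line(line):
    """Return (type, lexeme) pairs for one line, skipping spaces and tabs."""
    line = line.rstrip()
    tokens, pos = [], 0
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = match.lastgroup
        lexeme = match.group(kind)
        if kind == "IDENTIFIER" and lexeme == "type":
            kind = "TYPE_KEYWORD"  # the keyword beats IDENTIFIER on a tie
        tokens.append((kind, lexeme))
        pos = match.end()
    return tokens


def parse_statement(line):
    """Classify a line by its token structure, not by the token type of its names."""
    tokens = tokenize_line(line)
    kinds = [kind for kind, _ in tokens]
    if not kinds:
        return None  # blank line
    if kinds == ["TYPE_KEYWORD", "IDENTIFIER"]:
        return ("type", tokens[1][1])
    if kinds == ["IDENTIFIER", "COLON", "IDENTIFIER"]:
        return ("property", tokens[0][1], tokens[2][1])
    raise ValueError(f"cannot parse line: {line!r}")
```

Applied to the sample input, `type SimpleType` comes back as a type statement and both property lines come back as properties, because the role of each identifier follows from where it appears.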

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).

When several lexer rules match the same input, the lexer uses the one that is defined first; a later rule only wins if it matches a longer stretch of input.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise the WORD rule, once invoked, will consume all following input.
As WORD is specified last, it only gets to consume input when none of the lexer rules defined above it matches that input.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
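One way to picture "anything except the operators" is a WORD pattern defined as the complement of the operator and whitespace characters, with AND, OR and NOT taking priority when a whole word equals a keyword. The Python sketch below models that; it is an assumption-laden illustration, not ANTLR behaviour verbatim, and it roughly corresponds to a lexer rule such as WORD : ~[ \t\r\n()",]+ ; placed after the keyword rules. Unlike the WORD rule in the question, it does not allow single spaces inside a word.

```python
import re

# Every character belongs to exactly one alternative, so nothing is skipped silently.
TOKEN = re.compile(
    r'(?P<LPAREN>\()'
    r'|(?P<RPAREN>\))'
    r'|(?P<DQUOTE>")'
    r'|(?P<OR>,)'
    r'|(?P<WS>[ \t\r\n]+)'
    r'|(?P<WORD>[^ \t\r\n()",]+)'
)

KEYWORDS = {"AND", "OR", "NOT"}


def tokenize(text):
    """Return (type, lexeme) pairs; a WORD that is exactly a keyword becomes that keyword."""
    tokens = []
    for match in TOKEN.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "WS":
            continue  # whitespace is skipped, as in the grammar
        if kind == "WORD" and lexeme in KEYWORDS:
            kind = lexeme  # keyword rules are defined first, so they win the tie
        tokens.append((kind, lexeme))
    return tokens
```

With this definition `foo AND bar` yields WORD, AND, WORD, while `ANDROID` stays a single WORD because the longer match beats the keyword.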

Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to support escape sequences in QUOTEDWORD, look at this example to see how to do it.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)
