I am writing a grammar to handle scalar and vector expressions. The grammar below is simplified to show the problem I have where a scalar expression can be derived from a vector and a vector can be derived from a scalar. For example, a vector could be a literal [1, 2, 3] or the product of a scalar and a vector 2 * [1, 2, 3] (equivalent to [2, 4, 6]). A scalar could be a literal 2 or an index into a vector [1, 2, 3][1] (equivalent to 2).
grammar LeftRecursion;
Integer
: [0-9]+
;
WhiteSpace
: [ \t\r\n]+ -> skip
;
input
: expression EOF;
expression
: scalar
| vector
;
scalar
: Integer
| vector '[' Integer ']'
;
vector
: '[' Integer ',' Integer ',' Integer ']'
| scalar '*' vector
;
ANTLR4 gives me the error: The following sets of rules are mutually left-recursive [scalar, vector]. This makes sense because scalar references vector and vice-versa, but at the same time it should be deterministic.
How would I refactor this grammar to avoid the mutual (indirect) left-recursion? I could expand one of the terms in place, but that would introduce a lot of duplication in the full grammar where there are more alternatives for vector and scalar. I could also refactor the grammar to have a primary expression, but I don't want to allow scalar '*' scalar as a valid vector alternative. Are there other options?
AFAIK, the only way around it is to expand one rule in place so that the indirect left recursion is eliminated:
expression
: scalar
| vector
;
scalar
: '[' Integer ',' Integer ',' Integer ']' '[' Integer ']'
| scalar '*' vector '[' Integer ']'
| Integer
;
vector
: '[' Integer ',' Integer ',' Integer ']'
| scalar '*' vector
;
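To see why the expansion removes the error, here is a minimal Python sketch that follows only the first symbol of each alternative. It is not ANTLR's actual algorithm; the dictionary encoding of the grammar and all function names are assumptions made for illustration, and it ignores alternatives that could start with an empty match.

```python
def left_edges(grammar):
    """Map each rule to the rules that can appear first in one of its alternatives."""
    return {
        rule: {alt[0] for alt in alts if alt and alt[0] in grammar}
        for rule, alts in grammar.items()
    }


def left_reachable(edges, start):
    """Rules reachable from `start` by repeatedly following left-edge references."""
    seen, stack = set(), list(edges[start])
    while stack:
        rule = stack.pop()
        if rule not in seen:
            seen.add(rule)
            stack.extend(edges[rule])
    return seen


def mutually_left_recursive(grammar):
    """Sorted groups of two or more rules that reach each other on the left edge."""
    edges = left_edges(grammar)
    reach = {rule: left_reachable(edges, rule) for rule in grammar}
    groups = set()
    for rule in grammar:
        group = frozenset(other for other in reach[rule] if rule in reach[other])
        if len(group) > 1:
            groups.add(group)
    return sorted(sorted(group) for group in groups)


def directly_left_recursive(grammar):
    """Rules that name themselves as the first symbol of an alternative."""
    edges = left_edges(grammar)
    return sorted(rule for rule in grammar if rule in edges[rule])


# The grammar from the question: scalar starts with vector and vice versa.
ORIGINAL = {
    "expression": [["scalar"], ["vector"]],
    "scalar": [["Integer"], ["vector", "'['", "Integer", "']'"]],
    "vector": [["'['", "Integer", "','", "Integer", "','", "Integer", "']'"],
               ["scalar", "'*'", "vector"]],
}

# The expanded grammar: vector no longer appears on the left edge of scalar.
EXPANDED = {
    "expression": [["scalar"], ["vector"]],
    "scalar": [["'['", "Integer", "','", "Integer", "','", "Integer", "']'", "'['", "Integer", "']'"],
               ["scalar", "'*'", "vector", "'['", "Integer", "']'"],
               ["Integer"]],
    "vector": [["'['", "Integer", "','", "Integer", "','", "Integer", "']'"],
               ["scalar", "'*'", "vector"]],
}

print(mutually_left_recursive(ORIGINAL))
print(mutually_left_recursive(EXPANDED))
print(directly_left_recursive(EXPANDED))
```

In the expanded grammar only scalar refers to itself on the left edge, which is the direct form of left recursion that ANTLR4 accepts.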
scalar
: Integer
| vector '[' Integer ']'
;
vector
: '[' Integer ',' Integer ',' Integer ']'
| scalar '*' vector
;
means that you could write an expression such as
[i,i,i][i] * [i,i,i][i] * ... * [i,i,i]
which can overflow the parser's stack in Java and other targets with a limited stack depth.
I think you should create a separate grammar rule for vector lookups. A lookup is not a scalar, it just results in a scalar, and that distinction should be handled when you process the parse tree, not in the ANTLR grammar.
Related
I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more characters in one stretch than the other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;
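A rough Python sketch of why this works, assuming the usual lexer behaviour of picking the longest match and preferring the rule defined first on a tie. The regexes are approximations of the lexer rules above, PUNCT stands in for the grammar's literal tokens, and the tokenize helper is hypothetical, not part of ANTLR.

```python
import re

# Regex approximations of the lexer rules, in grammar order (earlier wins ties).
ATOM = r"[^ \t\r\n():|,.]"
RULES = [
    ("RANGE", re.compile(r"\.\.")),
    ("NUMBER", re.compile(r"(?:[0-9]*\.)?[0-9]+")),
    ("STRING", re.compile(rf"{ATOM}+(?:\.{ATOM}+)*")),
    ("PUNCT", re.compile(r"[():|,]")),  # simplification of the literal tokens
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Longest match wins; on equal length the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if match and len(match.group()) > len(best_text):
                best_name, best_text = name, match.group()
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name != "WS":  # WS is skipped, as with '-> skip'
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


print(tokenize("2..4.3"))
print(tokenize("bob.smith, 57"))
```

Because STRING can no longer contain two dots in a row, it stops after the 2 in 2..4.3, ties with NUMBER, and loses the tie; bob.smith still lexes as a single STRING.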
I'm working on parsing PDF content streams. I'm having trouble defining an array. The definition of an array in the PDF reference (PDF 32000-1:2008) is:
An array object is a one-dimensional collection of objects arranged sequentially. …an array’s elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. An array may have zero elements.
An array shall be written as a sequence of objects enclosed in SQUARE BRACKETS (using LEFT SQUARE BRACKET (5Bh) and RIGHT SQUARE BRACKET (5Dh)).
EXAMPLE: [549 3.14 false (Ralph) /SomeName]
Here's a stripped-down version of my grammar:
grammar PdfStream;
/*
* Parser Rules
*/
content : stat* ;
stat
: array
| string
;
array: ARRAY ;
string: STRING ;
/*
* Lexer Rules
*/
ARRAY: '[' (ARRAY | DICTIONARY | OBJECT)* ']' ;
DICTIONARY: '<<' (NAME (ARRAY | DICTIONARY | OBJECT))* '>>' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
STRING: (LITERAL_STRING | HEX_STRING) ;
NAME: '/' ID ;
INT: DIGIT+ ;
LITERAL_STRING: '(' .*? ')' ;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
FLOAT: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
OBJECT
: NULL
| BOOLEAN
| NUMBER
| STRING
| NAME
;
fragment DIGIT: [0-9] ;
// All characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
And here's the test file I'm processing.
<AE93>
(String1)
( String2 )
[]
[549 3.14 false (Ralph) /SomeName]
When I process the file with grun PdfStream tokens -tokens stream.txt I get this output:
line 5:0 token recognition error at: '[549 '
line 5:33 token recognition error at: ']'
[#0,0:5='<AE93>',<STRING>,1:0]
[#1,7:15='(String1)',<STRING>,2:0]
[#2,17:27='( String2 )',<STRING>,3:0]
[#3,29:30='[]',<ARRAY>,4:0]
[#4,37:40='3.14',<NUMBER>,5:5]
[#5,42:46='false',<BOOLEAN>,5:10]
[#6,48:54='(Ralph)',<STRING>,5:16]
[#7,56:64='/SomeName',<NAME>,5:24]
[#8,67:66='<EOF>',<EOF>,6:0]
What's wrong with my grammar that's causing the token recognition errors?
[549 3.14 false (Ralph) /SomeName] isn't recognized as an ARRAY because it contains spaces and the rule for ARRAY does not allow any spaces. If you want spaces to be ignored between the elements of an array, you should turn it into a parser rule instead of a lexer rule (the same applies to DICTIONARY).
You'll also need to make OBJECT a parser rule. As a lexer rule it will never be matched: any input that matches, say, NUMBER will always produce a NUMBER token instead of an OBJECT token, because NUMBER is defined before OBJECT in the grammar and the earlier rule wins when two rules match the same text. Generally you never want multiple lexer rules where everything that can be matched by one of them can also always be matched by at least one other. This also means that you want to turn INT and FLOAT into fragments.
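The split between lexer and parser can be sketched in Python as follows. This is only an illustration of the idea, not generated ANTLR code: the regexes approximate the token rules from the question, dictionaries are left out, and tokenize and parse_content are hypothetical names. The lexer produces flat tokens and drops whitespace; the nesting of arrays is handled one level up, by the parser.

```python
import re

# Flat tokens only; whitespace is recognised here and then discarded.
TOKEN_RE = re.compile(r"""
    (?P<WS>[ \t\r\n\x0c\x00]+)
  | (?P<LBRACKET>\[)
  | (?P<RBRACKET>\])
  | (?P<STRING>\(.*?\)|<[0-9A-Za-z]+>)
  | (?P<NAME>/[^ \t\r\n\x0c\x00()<>\[\]{}/%]+)
  | (?P<NUMBER>[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+))
  | (?P<KEYWORD>true|false|null)
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"token recognition error at: {text[pos]!r}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", ""))
    return tokens


def parse_content(text):
    """content : object* ; object : array | simple token ; array : '[' object* ']'"""
    tokens, pos = tokenize(text), 0

    def parse_object():
        nonlocal pos
        kind, value = tokens[pos]
        pos += 1
        if kind == "LBRACKET":
            items = []
            while tokens[pos][0] != "RBRACKET":
                if tokens[pos][0] == "EOF":
                    raise SyntaxError("unterminated array")
                items.append(parse_object())
            pos += 1  # consume ']'
            return items
        if kind in ("RBRACKET", "EOF"):
            raise SyntaxError(f"unexpected token {value!r}")
        return value

    objects = []
    while tokens[pos][0] != "EOF":
        objects.append(parse_object())
    return objects


print(parse_content("[549 3.14 false (Ralph) /SomeName]"))
```

Since the spaces never reach the array rule, the array from the test file parses without any special handling of whitespace.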
So I defined a grammar to parse a language with C-style syntax:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression (op=('*'|'/') expression)*
| expression (op=('+'|'-') expression)*
| relation
| INT
| ID
| '(' expression ')'
;
relation
: expression (op=('<'|'>') expression)*
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
And I can parse a simple test as this:
int i = (2+3)*3/2*(3+36);
int j = i;
int k = 2*1+i*3;
if (k > 2)
k = k + 1;
i = i / 3;
j = j / 3;
fi;
loop (i < 10)
i = i + 1 * (i+k);
j = (j + 1) * (j-k);
k = i + j;
print(k);
pool;
However, when I try to generate the ANTLR recognizer in IntelliJ, I get this error:
sCalc.g4:19:0: left recursive rule expression contains a left recursive alternative which can be followed by the empty string
I wonder if this is caused by my ID rule possibly matching an empty string?
There are a couple of issues with your grammar:
you have INT as an alternative inside expression while you probably want INTEGER instead
there is no need to do expression (op=('+'|'-') expression)*: this will do: expression op=('+'|'-') expression
ANTLR4 does not support indirect left recursive rules: you must include relation inside expression
Something like this ought to do it:
grammar mygrammar;
program
: (declaration)*
(statement)*
EOF
;
declaration
: INT ID '=' expression ';'
;
assignment
: ID '=' expression ';'
;
expression
: expression op=('*'|'/') expression
| expression op=('+'|'-') expression
| expression op=('<'|'>') expression
| INTEGER
| ID
| '(' expression ')'
;
statement
: expression ';'
| ifstatement
| loopstatement
| printstatement
| assignment
;
ifstatement
: IF '(' expression ')' (statement)* FI ';'
;
loopstatement
: LOOP '(' expression ')' (statement)* POOL ';'
;
printstatement
: PRINT '(' expression ')' ';'
;
IF : 'if';
FI : 'fi';
LOOP : 'loop';
POOL : 'pool';
INT : 'int';
PRINT : 'print';
ID : [a-zA-Z][a-zA-Z0-9]*;
INTEGER : [0-9]+;
WS : [ \r\n\t] -> skip;
Also note that (statement)* can simply be written as statement*
It's about your expression and relation rules. The expression rule can match relation in one alt, which in turn recurses back to expression. Additionally, the (op=('<'|'>') expression)* part of relation can match nothing, so relation can reduce to a bare expression.
A better approach is probably to have relation call expression and remove the relation alt from expression. Then use relation everywhere you used expression now. That's a typical scenario in expressions, starting out with low precedence operations as top level rules and drilling down to higher precedence rules, ultimately ending at a simple expression rule (or similar).
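The layering described here can be sketched as a small hand-written recursive-descent parser in Python. It is only meant to illustrate the rule structure (relation calls additive, which calls multiplicative, which calls primary); the class, the method names and the tuple-based tree are assumptions for this example and have nothing to do with ANTLR's generated code.

```python
import re


def tokenize(text):
    """Split into identifiers, integers and single-character operators."""
    return re.findall(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[-+*/<>()]", text)


class Parser:
    OPERATORS = {"+", "-", "*", "/", "<", ">", ")"}

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def binary(self, operand, ops):
        """operand (op operand)*, folded into a left-associative tree."""
        left = operand()
        while self.peek() in ops:
            op = self.take()
            left = (op, left, operand())
        return left

    # Lowest precedence at the top, highest at the bottom.
    def relation(self):
        return self.binary(self.additive, ("<", ">"))

    def additive(self):
        return self.binary(self.multiplicative, ("+", "-"))

    def multiplicative(self):
        return self.binary(self.primary, ("*", "/"))

    def primary(self):
        token = self.take()
        if token == "(":
            inner = self.relation()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if token is None or token in self.OPERATORS:
            raise SyntaxError(f"unexpected token {token!r}")
        return token


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.relation()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("2*1+i*3"))
print(parse("k > 2"))
```

No rule here calls itself on its left edge, so there is no left recursion at all; precedence comes from the order in which the rules call each other.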
I read "The Definitive ANTLR 4 Reference" and it says
While ANTLR v4 can handle direct left recursion, it can’t handle indirect left
recursion.
on page 71.
But in the JSON grammar on page 90 I see the following:
grammar JSON;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair: STRING ':' value ;
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // indirect recursion
| array // indirect recursion
| 'true'
| 'false'
| 'null'
;
Is it correct?
The JSON grammar you mentioned is not a problem because it actually doesn't contain any indirect left recursion.
The rule value can produce array, and array can in turn produce something which contains value, but not as its leftmost part (there is a [ preceding value).
The value rule would only be a problem if some chain of rules could produce value followed by other terminals and non-terminals, that is, with value as the leftmost symbol.
From the book
A left-recursive rule is one that
either directly or indirectly invokes itself on the left edge of an alternative.
Example:
expr : expr '*' expr // match subexpressions joined with '*'
| expr '+' expr // match subexpressions joined with '+' operator
| INT // matches simple integer atom
;
It is left recursion because at least one alternative starts immediately with expr. More precisely, it is direct left recursion.
Example of indirect left recursion:
expr : addition // indirectly invokes expr left recursively via addition
| ...
;
addition : expr '+' expr
;
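The difference between plain recursion and left recursion can be checked mechanically. Below is a small Python sketch, assuming a simplified dictionary encoding of both grammars: repetitions such as (',' pair)* are dropped, and the elided "| ..." alternative of expr is left out. The function name is made up for this example. It looks only at the first symbol of each alternative and reports the rules that can reach themselves that way.

```python
def left_recursive_rules(grammar):
    """Sorted names of rules that can reach themselves via first symbols only."""
    first = {
        rule: {alt[0] for alt in alts if alt[0] in grammar}
        for rule, alts in grammar.items()
    }
    result = []
    for rule in grammar:
        seen, stack = set(), list(first[rule])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(first[current])
        if rule in seen:
            result.append(rule)
    return sorted(result)


# JSON grammar from the book, simplified: every recursive reference to value
# sits behind a '{', '[' or STRING ':' prefix.
JSON = {
    "json": [["object"], ["array"]],
    "object": [["'{'", "pair", "'}'"], ["'{'", "'}'"]],
    "pair": [["STRING", "':'", "value"]],
    "array": [["'['", "value", "']'"], ["'['", "']'"]],
    "value": [["STRING"], ["NUMBER"], ["object"], ["array"],
              ["'true'"], ["'false'"], ["'null'"]],
}

# Indirect left recursion example: expr -> addition -> expr on the left edge.
INDIRECT = {
    "expr": [["addition"]],
    "addition": [["expr", "'+'", "expr"]],
}

print(left_recursive_rules(JSON))
print(left_recursive_rules(INDIRECT))
```

The JSON grammar is recursive, but none of its rules is left-recursive, which is why ANTLR4 accepts it.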
I have the grammar below, it's an extract out of something I am working on which is highlighting a problem I can't overcome.
In my grammar an expression is either a literal (a number) or an expression followed by "+" and another expression. So I want to parse:
1 + 2 + 3 + 4
etc.
However my definition of a number means that it can have an optional sign e.g.:
1, +1 or -1
So it's conceivable that I may need to parse:
1 + +1 or 1 + -1
What I am finding is that 1 + 1 (or bigger numbers) are fine.
What I am struggling to parse are inputs without spaces or with extra signs e.g.:
1+2
This causes real problems as the lexer picks up +2 as a Number when actually I want 2 as the number and + to be picked up as the sign in the expression.
How do I get antlr4 to recognise the difference?
grammar example;
example : expression* EOF;
expression
: expression '+' expression
| literal
;
literal : Number;
Number : Sign? Digits;
Sign : [-+];
Digits : Digit+;
Digit : [0-9];
WS : [ \t\r\n\u000C]+ -> skip;
You can remove the optional Sign from the Number token. This postpones the processing of signs to the parser stage, where you have more information about the context of the input. The idea is to introduce unary operators for the minus sign (-) and the plus sign (+), so that the number token itself stays unsigned.
grammar example;
example : expression* EOF;
expression
: ('+'|'-') expression # unaryOp
| expression ('+'|'-') expression # binaryOp
| Number # number
;
Number : [0-9]+;
WS : [ \t\r\n\u000C]+ -> skip;
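A short Python sketch of the difference, using regexes as a stand-in for the lexer and a tiny hand-written parser as a stand-in for the expression rule. All function names are hypothetical, and unlike the grammar above the sketch parses a single expression rather than expression*.

```python
import re


def lex_signed(text):
    """Number includes an optional sign, as in the question's lexer."""
    return re.findall(r"[-+]?[0-9]+|[-+]", text)


def lex_unsigned(text):
    """Number is digits only; signs are always separate tokens."""
    return re.findall(r"[0-9]+|[-+]", text)


def parse(tokens):
    """expression : unary (('+'|'-') unary)* ; unary : ('+'|'-') unary | Number"""
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token in ("+", "-"):
            return (token, unary())  # sign applied to the following operand
        return token

    def expression():
        nonlocal pos
        left = unary()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            left = (op, left, unary())
        return left

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(lex_signed("1+2"))            # the sign is swallowed by the number
print(lex_unsigned("1+2"))
print(parse(lex_unsigned("1+2")))
print(parse(lex_unsigned("1 + -1")))
```

With the signed lexer, 1+2 comes out as two number tokens with no operator between them; with the unsigned lexer, the parser decides from position whether a sign is a binary or a unary operator.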
Not sure if it's still relevant, but here goes:
Your expression rule seems faulty, it can not match on a "literal + literal" string, because it always expects an expression on the left.
Your rule should look something like:
expression
: literal '+' literal
| expression '+' literal
;