antlr4 json grammar and indirect left recursion - antlr4

I read "The Definite ANTLR4 Reference" and it says
While ANTLR v4 can handle direct left recursion, it can’t handle indirect left
recursion.
on page 71.
But in json grammar on page 90 i see next
grammar JSON;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair: STRING ':' value ;
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // indirect recursion
| array // indirec recursion
| 'true'
| 'false'
| 'null'
;
Is it correct?

The JSON grammar you mentioned is not a problem because it actually doesn't contain any indirect left recursion.
The rule value can produce array and array can again produce something which contains value, but not as it's leftmost part. (there is a [ preceding value)
The value rule would only be a problem if there would be some way to produce value folowed by any terminals and non-terminals.
From the book
A left-recursive rule is one that
either directly or indirectly invokes itself on the left edge of an alternative.
Example:
expr : expr '*' expr // match subexpressions joined with '*'
| expr '+' expr // match subexpressions joined with '+' operator
| INT // matches simple integer atom
;
It is left recursion because there is at least one alternative immediatly started with expr. Also it is direct left recursion.
Example of indirect left recursion:
expr : addition // indirectly invokes expr left recursively via addition
| ...
;
addition : expr '+' expr
;

Related

ANTLR parser to throw exception for "true and or false" statement

I'm using ANTLR 4 and have a fairly complex grammar. I'm trying to simplify here...
Given an expression like: true and or false I want a parsing error since the operands defined expect expressions on either side and this has an expr operand operand expr
My reduced grammar is:
grammar MappingExpression;
/* The start rule; begin parsing here.
operator precedence is implied by the ordering in this list */
// =======================
// = PARSER RULES
// =======================
expr:
| op=(TRUE|FALSE) # boolean
| expr op=AND expr # logand
| expr op=OR expr # logor
;
TRUE : 'true';
FALSE : 'false';
WS : [ \t\r\n]+ -> skip; // ignore whitespace
AND : 'and';
OR : 'or';
however, it seems that the parser stops after evaluating true even though it has all four tokens identified (e.g., alt state returned becomes 2 in the parser).
If I can't get a parsing exception (because it is seeing what I deem operands as expressions), if I got the entire parse tree I could throw a runtime exception for two operands in a row (e.g., 'and' and 'or').
Originally, I'd just had:
expr 'and' expr #logand
expr 'or' expr #logor
and this suffered the same parsing problem (stopping early).
You should get a parsing error if you force the parser to consume all tokens by "anchoring" a rule with the built-in EOF
parse
: expr EOF
;
This is what I get when parsing the input true and or false:
See the error in the lower left corner:
line 1:9 extraneous input 'or' expecting {'true', 'false'}
line 1:17 missing {'true', 'false'} at '<EOF>'
Bart Kiers answer above is correct. I just wanted to provide more details for people working with Java who have experienced incomplete parsing issues.
I'd had a fairly complex g4 file that defined an expr as a series of OR'ed rules associated with tags (e.g., following a # that become the method name in the ExpressionsVisitor). While this seemed to work there were situations where I'd expected parsing errors but received none. I also had situations where only part of an input to the parser was interpreted making it impossible to process the entire input statement.
I repaired the g4 file as follows (the full version is here):
// =======================
// = PARSER RULES
// =======================
expr_to_eof : expr EOF ;
expr:
ID # id
| '*' # field_values
| DESCEND # descendant
| DOLLAR # context_ref
| ROOT # root_path
| ARR_OPEN exprOrSeqList? ARR_CLOSE # array_constructor
| OBJ_OPEN fieldList? OBJ_CLOSE # object_constructor
| expr '.' expr # path
| expr ARR_OPEN ARR_CLOSE # to_array
| expr ARR_OPEN expr ARR_CLOSE # array
| expr OBJ_OPEN fieldList? OBJ_CLOSE # object
| VAR_ID (emptyValues | exprValues) # function_call
| FUNCTIONID varList '{' exprList? '}' # function_decl
| VAR_ID ASSIGN (expr | (FUNCTIONID varList '{' exprList? '}')) # var_assign
| (FUNCTIONID varList '{' exprList? '}') exprValues # function_exec
| op=(TRUE|FALSE) # boolean
| op='-' expr # unary_op
| expr op=('*'|'/'|'%') expr # muldiv_op
| expr op=('+'|'-') expr # addsub_op
| expr op='&' expr # concat_op
| expr op=('<'|'<='|'>'|'>='|'!='|'=') expr # comp_op
| expr 'in' expr # membership
| expr 'and' expr #logand
| expr 'or' expr # logor
| expr '?' expr (':' expr)? # conditional
| expr CHAIN expr # fct_chain
| '(' (expr (';' (expr)?)*)? ')' # parens
| VAR_ID # var_recall
| NUMBER # number
| STRING # string
| 'null' # null
;
Based on Bart's suggestion I added the top rule for expr_to_eof that resulted in that method being added to the MappingExpressionParser. So, in my Expressions class where before I'd called tree = parser.expr(); I now needed to call tree = parser.expr_to_eof(); which resulted in a ParseTree that included a last child for the Token.EOF.
Because my code needed to check some conditions for the first and last step performed it was easiest for me to add the following to strip out the <EOF> and get back the ParseTree (ExprContext rather than Expr_to_eofContext) I had been using by adding this statement:
newTree = ((Expr_to_eofContext)tree).expr();
So, overall, it was quite easy to fix a long standing bug (and others I'd postponed addressing) just by adding the new rule in the .g4 file and changing the parser so it would parse to end of file () and then extract the entire expression that was parsed.
I expect this will allow me to add considerably more functions to JSONata4Java to match the JavaScript version jsonata.js
Thanks again Bart!

How do I parse an array in antlr?

I'm working on parsing PDF content streams. I'm having trouble defining an array. The definition of an array in the PDF reference (PDF 32000-1:2008) is:
An array object is a one-dimensional collection of objects arranged sequentially. …an array’s elements may be any combination of numbers, strings, dictionaries, or any other objects, including other arrays. An array may have zero elements.
An array shall be written as a sequence of objects enclosed in SQUARE BRACKETS (using LEFT SQUARE BRACKET (5Bh) and RIGHT SQUARE BRACKET (5Dh)).
EXAMPLE: [549 3.14 false (Ralph) /SomeName]
Here's a stripped-down version of my grammar:
grammar PdfStream;
/*
* Parser Rules
*/
content : stat* ;
stat
: array
| string
;
array: ARRAY ;
string: STRING ;
/*
* Lexer Rules
*/
ARRAY: '[' (ARRAY | DICTIONARY | OBJECT)* ']' ;
DICTIONARY: '<<' (NAME (ARRAY | DICTIONARY | OBJECT))* '>>' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
STRING: (LITERAL_STRING | HEX_STRING) ;
NAME: '/' ID ;
INT: DIGIT+ ;
LITERAL_STRING: '(' .*? ')' ;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
FLOAT: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
OBJECT
: NULL
| BOOLEAN
| NUMBER
| STRING
| NAME
;
fragment DIGIT: [0-9] ;
// All characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
And here's the test file I'm processing.
<AE93>
(String1)
( String2 )
[]
[549 3.14 false (Ralph) /SomeName]
When I process the file with grun PdfStream tokens -tokens stream.txt I get this output:
line 5:0 token recognition error at: '[549 '
line 5:33 token recognition error at: ']'
[#0,0:5='<AE93>',<STRING>,1:0]
[#1,7:15='(String1)',<STRING>,2:0]
[#2,17:27='( String2 )',<STRING>,3:0]
[#3,29:30='[]',<ARRAY>,4:0]
[#4,37:40='3.14',<NUMBER>,5:5]
[#5,42:46='false',<BOOLEAN>,5:10]
[#6,48:54='(Ralph)',<STRING>,5:16]
[#7,56:64='/SomeName',<NAME>,5:24]
[#8,67:66='<EOF>',<EOF>,6:0]
What's wrong with my grammar that's causing the token recognition errors?
[549 3.14 false (Ralph) /SomeName] isn't recognized as an ARRAY because it contains spaces and the rule for ARRAY does not allow any spaces. If you want spaces to be ignored between the elements of an array, you should turn it into a parser rule instead of a lexer rule (the same applies to DICTIONARY).
You'll also need to make OBJECT a parser rule because otherwise it will never be matched because any input that matches, say, NUMBER will always produce a NUMBER token instead of an OBJECT token because OBJECT comes last in the grammar. Generally you never want multiple lexer rules where everything that can be matched by one of them can also always be matched by at least one other. This also means that you want to turn INT and FLOAT into fragments.

ANTLR grammar: Boolean literal which can occur as qualified variable name while ignoring whitespace

I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar is 'false' defined as a literal, and there is also definition of variable name which allows to build variable names from digits, numbers, underscores and dots (see the definition bellow).
The problem is - when I use 'false' as a variable name.
varName.nestedVar.false. The rule which marks false as falseLiteral takes precedence.
I tried to play with the white spaces, using everything I found on the internet. Solution when I would remove WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and use explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I try to tell boolLiteral rule that it has to have some space before the actual literal like WHITESPACE* trueLiteral, but this doesn't work when the white spaces are sent to the HIDDEN channel. And again disable it altogether = lot of code rewriting. (Since I often rely on the order of tokens.) I also tried to reorder non-terminals in the literal rule but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move qualifiedName definition to the lexer lure
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
Result is correct -> varExpression->quilafiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition this doesn't work, and I really don't know why
varName AND false
Qualified name without . returns
line 1:8 no viable alternative at input 'varName AND'
Expected solution would be ether enable/disable whitespace -> channel{hiddne} for specific rules only
Tell the boolLiteral rule that it canNOT start start with dot, someting like ~POINT falseLiteral, but I tried this as well and with no luck.
Or get qualifiedName working without dot when the rule is moved to the lexer rule.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;

Antlr4 parsing inconsistency

in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?
Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.

string recursion antlr lexer token

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
alt text http://img185.imageshack.us/img185/5471/recq.png
ANTLR's lexers do support recursion, as #BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.

Resources