Antlr4 parsing inconsistency

Antlr4 parsing inconsistency - antlr4

in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?

Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.

Related

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?

Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

Problem matching single digits when integers are defined as tokens

I'm having problem trying to get a grammar working. Here is the simplified version. The language I try to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testint(9);. It matches as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work fine.

I think your problem comes from not using lexer rules clearly defining what you want.
When you added this rule :
decMin : '0' | '1' ;
You in fact created an unnamed lexer rule that matches '0' and another one matching '1' :
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
And your parser rule became :
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
Problem : now, when your lexer see
testing7(1);
**it doesn't see **
callName '(' callParameter ')' ';'
anymore, it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that.
And that is because lexer rules are effective before the parser rules.
To solve your problem, define your lexer rules efficiently, It would probably look like that :
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it will cut your integers at the lexer step (cutting 1245 into 1 2 4 5 in lexer rules, and the considering the parser rule as uniting 1 2 4 and 5).
I think it would be better to be less precise and simply write :
decMin: integer | identifier;
But then it depends on what you do with your grammar...

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?

Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving the these lexer tokens above ID like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.

Can't figure out grammar tweak

I'm trying to build a v4 grammar for an existing DSL, and am a bit out of my depth. I've tried everything I could think with no luck. We can have a function call like foo(param1, param2);, which I have working. There is an optional construct like foo(y, z) x 100; which means to call the fx 100 times (the x is the literal token, great choice eh!) That's what I can't get to work.
My func_call now looks like this: func_call: Identifier '(' arg_list ')';
Adding a (('x'|'X') expr)? and variations thereof didn't work. It starts to get confused by variables named x.
If it helps, an old yacc grammar for this language had this: rep: func_call REP expr; (where REP is x) Any help would be appreciated. thanks!

Make Identifier a parser rule rather than a lexer rule. This way, the lexer always matches x as a Rep, even if it is contained in identifier. Here is one solution:
grammar Foo;
func_call : identifier '(' arg_list? ')' (Rep expr)? ;
arg_list : identifier (',' identifier)* ;
expr : //TODO implement
;
identifier : idFront (idFront | Digit)* ;
idFront : Rep | OtherThanRep | '_' ;
Digit : [0-9] ;
Rep : 'x' | 'X';
OtherThanRep : [a-wA-W] | 'y' | 'z' | 'Y' | 'Z' ;
WS : [ \t\f\r\n] ->skip;
The generated parser successfully parses x(x,x) x 100

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Antlr4 parsing inconsistency - antlr4

Related

How to use the reserved words inside the string in ANTLR4?

Problem matching single digits when integers are defined as tokens

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

Can't figure out grammar tweak

antlr tokenizer starts with the last token

Categories

Resources