Beginner: ANTLR does misbehave?

Beginner: ANTLR does misbehave? - antlr4

I have a pretty simple grammer to parse Dice-expressions.
grammar Dice;
function : ( dice | binaryOp | DIGIT );
binaryOp: dice OPERATOR function | DIGIT OPERATOR function;
dice : DIGIT DSEPERATOR DIGIT EXPLODING?;
DSEPERATOR : ( 'd' | 'D' | 'w' | 'W' );
EXPLODING : ( '*' );
OPERATOR : ( ADD | SUB );
ADD : '+';
SUB : '-';
DIGIT : ('0'..'9')+;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It SHOULD parse something like 3D6+2D10 but it doesn't. I get a no viable alternative at input '2d10' with this partial result:
(function (binaryOp (dice 3 W 6) + (function 2 d 10)))
and I do not understand why. Could you please help me understanding this?

Since the function rule is recursive (eventually), you need to add a rule explicitly for the top level so that ANTLR can infer "oh, this top level rule should be the one that implicitly can accept an EOF at the end of input", and then parse from that.
I added a line:
start : function ;
to your grammar, and can now parse:
$ echo "3D6+2D10" | java -cp antlr-4.7.1-complete.jar:. org.antlr.v4.gui.TestRig Dice start -tree
(start (function (binaryOp (dice 3 D 6) + (function (dice 2 D 10)))))

Related

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?

Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

Antlr4 parsing inconsistency

in a little test-parser I just wrote, I encountered a weird problem, which I don't quite understand.
Stripping it down to the smallest example showing the problem, let's start with the following grammar:
Testing.g4:
grammar Testing;
cscript // This is the construct I shortened
: (statement_list)* ;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
block : '{' statement_list? '}' ;
expression
: left=expression op=('*'|'/') right=expression # arithmeticExpression
| left=expression op=('+'|'-') right=expression # arithmeticExpression
| left=expression op=Comparison_operator right=expression # comparisonExpression
| ID # variableValueExpression
| constant # ignore // will be executed with the rule name
;
assignment_statement
: ID op=Assignment_operator expression
;
constant
: INT
| REAL;
Assignment_operator : ('=' | '+=' | '-=') ;
Comparison_operator : ('<' | '>' | '==' | '!=') ;
Comment : '//' .*? '\n' -> skip;
fragment NUM : [0-9];
INT : NUM+;
REAL
: NUM* '.' NUM+
| '.' NUM+
| INT
;
ID : [a-zA-Z_] [a-zA-Z_0-9]*;
WS : [ \t\r\n]+ -> skip;
Using the input
z = x + y;
everything is fine, we get a parse tree which goes from cscript to statement_list, statement, assignment_statement, id and expression. Great!
Now, if I add the possibility to declare variables, all goes down the drain:
This is the change to the grammar:
cscript
: (statement_list | variable_declaration ';')* ;
variable_declaration
: type ID ('=' expression)?
;
type
: 'int'
| 'real'
;
statement_list
: statement ';' statement_list?
| block
;
statement
: assignment_statement
;
// (continue as before)
All of a sudden, the same test-input gets wrongly dissected into two statement_lists, each continued to a statement with a "missing ';'" warning, the first going to an incomplete assignment_statement of "z =" and the second to an incomplete assignment_statement "x +".
My attempt to show the parse tree in text-form:
cscript
statement_list
statement
assignment_statement
'z'
'=' [marked as error]
[warning: missing ';']
statement_list
statement
assignment_statement
'x'
'+' [marked as error]
'y' [marked as error]
';'
Can anyone tell me what the problem is? (And how to fix it? ;-))
Edit on 2016-12-26, after Mike's comment:
After replacing all implicit lexer rules with explicit declarations, all of a sudden, the input "z = x + y" worked. (thumbs up)
The next thing I did was restoring more of the original example I had in mind, and adding a new input line
int x = 22;
to the input (which worked previously, but did not make it into the minimal example). Now, that line fails. This is the -token output of the test rig:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
[#3,8:9='22',<20>,1:8]
[#4,10:10=';',<12>,1:10]
[#5,13:13='z',<22>,2:0]
[#6,15:15='=',<1>,2:2]
[#7,17:17='x',<22>,2:4]
[#8,19:19='+',<18>,2:6]
[#9,21:21='y',<22>,2:8]
[#10,22:22=';',<12>,2:9]
[#11,25:24='<EOF>',<-1>,3:0]
line 1:6 mismatched input '=' expecting '='
As the problem seemed to be in the variable_declaration part, I even tried to split this into two parsing rules like this:
cscript
: (statement_list | variable_declaration_and_assignment SEMICOLON | variable_declaration SEMICOLON)* ;
variable_declaration_and_assignment
: type ID EQUAL expression
;
variable_declaration
: type ID
;
With the result:
line 1:6 no viable alternative at input 'intx='
Still stuck :-(
BTW: Splitting the "int x = 22;" into "int x;" and "x = 22;" works. sigh
Edit on 2016-12-26, after Mike's next comment:
Double-checked, and everything is lexer rules. Still, the mismatch between '=' and '=' (which I unfortunately cannot reconstruct anymore) gave me the idea to check the token types. The current status is:
(Shortened grammar)
cscript
: (statement_list | variable_declaration)* ;
...
variable_declaration
: type ID (EQUAL expression)? SEMICOLON
;
...
Assignment_operator : (EQUAL | PLUS_EQ | MINUS_EQ) ;
// among others
PLUS_EQ : '+=';
MINUS_EQ : '-=';
EQUAL: '=';
...
Shortened output:
[#0,0:2='int',<4>,1:0]
[#1,4:4='x',<22>,1:4]
[#2,6:6='=',<1>,1:6]
...
line 1:6 mismatched input '=' expecting ';'
Here, if I understand this correctly, the '=' is parsed to token type 1, which - according to the lexer.tokens output - is Assignment_Operator, while the expected EQUAL would be 13.
Might this be the problem?

Ok, seems the main take away here is: think about your definitions and how you define them. Create explicit lexer rules for your literals instead of defining them implicitly in the parser rules. Check the token values you get from the lexer if the parser gives you weird errors, because they must be correct in the first place or your parse has no chance to do its job.

Overcoming ambiguity in antlr4?

I have the grammar below, it's an extract out of something I am working on which is highlighting a problem I can't overcome.
In my grammar an expression is either a literal, which is a number or an expression "+" another expression. So I want to parse:
1 + 2 + 3 + 4
etc.
However my definition of a number means that it can have an optional sign e.g.:
1, +1 or -1
So it's conceivable that I may need to parse:
1 + +1 or 1 + -1
What I am finding is that 1 + 1 (or bigger numbers) are fine.
What I am struggling to parse are inputs without spaces or with extra signs e.g.:
1+2
This causes real problems as the lexer picks up +2 as a Number when actually I want 2 as the number and + to be picked up as the sign in the expression.
How do I get antlr4 to recognise the difference?
grammar example;
example : expression* EOF;
expression
: expression '+' expression
| literal
;
literal : Number;
Number : Sign? Digits;
Sign : [-+];
Digits : Digit+;
Digit : [0-9];
WS : [ \t\r\n\u000C]+ -> skip;

You can delete optional Sign lexem in the Number token. This way you will postpone processing of signs to parser stage, when you have more information about the context of the input. The idea here is to create unary operators for negation, minus sign (-) and plus sign (+) for keeping the number intact.
grammar example;
example : expression* EOF;
expression
: ('+'|'-') expression # unaryOp
| expression ('+'|'-') expression # binaryOp
| Number # number
;
Number : [0-9]+;
WS : [ \t\r\n\u000C]+ -> skip;

Not sure if it's still relevant, but here goes:
Your expression rule seems faulty, it can not match on a "literal + literal" string, because it always expects an expression on the left.
Your rule should look something like:
expression:
literal + literal
| expression + literal;

Can't figure out grammar tweak

I'm trying to build a v4 grammar for an existing DSL, and am a bit out of my depth. I've tried everything I could think with no luck. We can have a function call like foo(param1, param2);, which I have working. There is an optional construct like foo(y, z) x 100; which means to call the fx 100 times (the x is the literal token, great choice eh!) That's what I can't get to work.
My func_call now looks like this: func_call: Identifier '(' arg_list ')';
Adding a (('x'|'X') expr)? and variations thereof didn't work. It starts to get confused by variables named x.
If it helps, an old yacc grammar for this language had this: rep: func_call REP expr; (where REP is x) Any help would be appreciated. thanks!

Make Identifier a parser rule rather than a lexer rule. This way, the lexer always matches x as a Rep, even if it is contained in identifier. Here is one solution:
grammar Foo;
func_call : identifier '(' arg_list? ')' (Rep expr)? ;
arg_list : identifier (',' identifier)* ;
expr : //TODO implement
;
identifier : idFront (idFront | Digit)* ;
idFront : Rep | OtherThanRep | '_' ;
Digit : [0-9] ;
Rep : 'x' | 'X';
OtherThanRep : [a-wA-W] | 'y' | 'z' | 'Y' | 'Z' ;
WS : [ \t\f\r\n] ->skip;
The generated parser successfully parses x(x,x) x 100

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Beginner: ANTLR does misbehave? - antlr4

Related

How to use the reserved words inside the string in ANTLR4?

Antlr4 parsing inconsistency

Overcoming ambiguity in antlr4?

Can't figure out grammar tweak

antlr tokenizer starts with the last token

Categories

Resources