Same character used in several Lexer rules? - antlr4

I have (probably) a quite simple question. I am writing an ANTLR grammar to evaluate dice expressions. I would like to parse something like 4d6 as well as 4d6d2. The first means "roll 4 six-sided dice" and the second means "roll 4 six-sided dice and drop the 2 lowest". My current grammar is:
grammar Dice;
start : dice ;
dice : NUMBER? DSEPERATOR NUMBER ( KEEP | DROP NUMBER )?;
KEEP : 'k';
DROP : 'd';
DSEPERATOR : [dD];
NUMBER : [0-9]+;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I seem to have a problem with the definitions of DROP and DSEPERATOR, as they both match the letter d. The parser stops after the first NUMBER in the dice expression. What is the workaround here? What do I have to change in my grammar?

The implicit structure being parsed is
e - number of executions
d - type of object
s - number of object sides
o - op to apply*
a - number of op applications*
So, the match rule is:
decode : e d s ( o a )? ;
Consolidating numbers and expanding:
decode : e=NUM type s=NUM ( op a=NUM )? ; // label for convenience
type : dice | .... ;
op : drop | .... ;
dice : D ;
drop : D ;
NUM : [0-9]+ ;
D : 'd' ;
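To illustrate the idea outside ANTLR, here is a minimal Python sketch in which every d is lexed as the same token type, and a hand-written parser decides from position whether it is the dice separator or the drop operator. The names tokenize and parse_dice, the default count of 1, and the k (keep) operator borrowed from the question are assumptions for illustration; they are not part of the grammar above.

```python
import re

# One token type per character class; 'd' is always the same token, D.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([dD])|([kK]))")

def tokenize(text):
    """Split a dice expression into (type, text) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.group(1):
            tokens.append(("NUM", m.group(1)))
        elif m.group(2):
            tokens.append(("D", m.group(2)))
        else:
            tokens.append(("K", m.group(3)))
        pos = m.end()
    return tokens

def parse_dice(text):
    """decode : e=NUM? D s=NUM ( op a=NUM )? ;  with op being D (drop) or K (keep)."""
    toks = tokenize(text)
    i = 0

    def expect(kind, what):
        nonlocal i
        if i >= len(toks) or toks[i][0] != kind:
            raise ValueError(f"expected {what}")
        value = toks[i][1]
        i += 1
        return value

    count = 1  # assumed default when the leading number is omitted
    if toks and toks[0][0] == "NUM":
        count = int(expect("NUM", "dice count"))
    expect("D", "'d' separator")           # first D: the dice separator
    sides = int(expect("NUM", "number of sides"))
    op = amount = None
    if i < len(toks):
        if toks[i][0] not in ("D", "K"):
            raise ValueError("expected 'd' (drop) or 'k' (keep)")
        op = "drop" if toks[i][0] == "D" else "keep"  # second D: the drop operator
        i += 1
        amount = int(expect("NUM", "operator amount"))
    if i != len(toks):
        raise ValueError("trailing input")
    return {"count": count, "sides": sides, "op": op, "amount": amount}
```

Calling parse_dice("4d6d2") treats the first d as the separator and the second as drop, purely because of where each one appears in the rule.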

Related

ANTLR : How to parse fixed length text file based on index position using ANTLR 4?

Input:
101 04200001312345678981107291600A094101US FORD NA TEST COMPANY101
5225TEST COMPANY 11234567898PPDTEST BUYS 110801110801 1098765430000001
The lines above have a fixed length of 94 characters.
Expected output: based on this input, the ANTLR grammar should parse by index position.
For example: if the parser identifies '1' as the first character of line one, it should recognize the entire line as a single string, HEADER1.
Likewise, if the parser finds '5' at the start of line two, it should recognize the entire line as a single string, HEADER2.
fragment Digit: '0'..'9' ;
fragment Alpha: '_' | 'A'..'Z';
Number: Digit+ ;
Alphanumeric: (Alpha | Digit)+ ;
header1: '1' Alphanumeric+ ;
header2: '5' Alphanumeric+ ;
WS
: (' ' | '\t') -> skip //channel (HIDDEN)
;
Which tool are you using for parsing?
I get the following tree while parsing with your grammar using Antlr v4 plugin in Android studio.

Why does the "point" rule match more than one number?

I am trying to write my first parser with ANTLR4. One of the rules in an already fairly large grammar file is supposed to match two numbers as a 2D point. Here is a cut-down example of the grammar:
grammar example;
WS: [ \t\r\n]+ -> channel(HIDDEN);
INT: [0-9]+;
FLOAT: [0-9]*'.'?[0-9]+ ;
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
NUMBER: (INT | FLOAT) ;
id: IDSTRING;
num: NUMBER;
sem: ';' ;
point: num num;
macro: 'MACRO' id macroprops* 'END ' id;
macroprops: macroorigin ;
macroorigin: 'ORIGIN' point sem;
When I now run a basic example like this:
antlr4 example.g4 -o example/
cd example
javac *.java
echo -e "MACRO m_1\n ORIGIN 7 2.0 ;\nEND m_1" | grun example macro -tree
the first num in point matches both numbers, and an error is reported saying that an integer (here 0) is not a NUMBER:
line 3:9 mismatched input '0' expecting NUMBER
(macro MACRO (id m_1) (macroprops (macroorigin ORIGIN (point (num 0 0) (num <missing NUMBER>)) (sem ;))) END (id m_1))
I tried several different definitions of NUMBER and point, but I suppose it should work like this. I don't even understand how num can match two tokens. Can anybody help?
It seems that when several lexer rules match the same text, ANTLR4 picks the one that is defined first in the grammar. Adding fragment to INT and FLOAT fixed the problem, because NUMBER is then the only token that matches a number, covering both floats and ints.
grammar example2;
WS: [ \t\r\n]+ -> channel(HIDDEN);
NUMBER: (INT | FLOAT) ;
fragment INT: [0-9]+;
fragment FLOAT: [0-9]*'.'?[0-9]+ ;
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING;
num: NUMBER;
sem: ';' ;
point: num num;
macro: 'MACRO' id macroprops* 'END ' id;
macroprops: macroorigin ;
macroorigin: 'ORIGIN' point sem;
Thank you very much for pointing out to watch the token stream. But I still don't understand why it then matches both numbers to the num rule in the original question.
EDIT: Another fix, as mentioned by GRosenberg, is simply to define the grammar elements in the right order, so that NUMBER has a higher priority than its subrules.

How to make antlr4 fully tokenize terminal nodes

I'm trying to use Antlr to make a very simple parser, that basically tokenizes a series of .-delimited identifiers.
I've made a simple grammar:
r : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;
WS : [ \t\r\n]+ -> skip ;
When the parser is generated, I end up with a single terminal node that represents the string instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?
As an example:
. would yield one terminal node whose text is .
.foobar would yield two nodes, a parent with text . and a child with text foobar
.foobar.baz would yield four nodes, a parent with text ., a child with text foobar, a second-level child with text ., and a third-level child with text baz.
Rules starting with a capital letter are Lexer rules.
With the following input file t.text
.
.foobar
.foobar.baz
your grammar (in file Question.g4) produces the following tokens
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[#1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[#2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[#3,22:21='<EOF>',<EOF>,4:0]
The lexer is greedy: it tries to read as many input characters as it can with a rule (the parser does the same with tokens). The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot, an ID, and then another dot and ID (because the rule refers to itself recursively), up to the newline. That's why each line ends up in a single token.
When compiling the grammar, the warning
warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string
is issued because the repetition marker of ID is * (zero or more times) instead of + (one or more times).
Now try this grammar:
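The greedy behaviour can be reproduced with plain regular expressions. The Python sketch below is only an approximation: GREEDY_SELECTOR is an assumed regex equivalent of the recursive lexer rule, and DOT_OR_ID stands in for a lexer that has separate '.' and ID tokens. Both names, and the two helper functions, are made up for illustration.

```python
import re

# Assumed regex equivalent of:  STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
# with ID : [_a-z0-9$]* ;  The recursion unrolls into "one or more dot-plus-id groups".
GREEDY_SELECTOR = re.compile(r"(?:\.[_a-z0-9$]*)+")

# Stand-in for a lexer with two small token kinds: a lone dot, or a non-empty ID.
DOT_OR_ID = re.compile(r"\.|[_a-z0-9$]+")

def greedy_tokens(line):
    """Tokens produced when the whole selector chain is a single lexer rule."""
    return [m.group(0) for m in GREEDY_SELECTOR.finditer(line)]

def split_tokens(line):
    """Tokens produced when '.' and ID are separate lexer rules."""
    return DOT_OR_ID.findall(line)

for line in [".", ".foobar", ".foobar.baz"]:
    print(line, greedy_tokens(line), split_tokens(line))
```

With the single recursive rule each line collapses into one token, whereas the split version leaves the structure for a parser rule to assemble.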
grammar Question;
r
@init {System.out.println("Question last update 2135");}
: ( structure_selector NL )+ EOF
;
structure_selector
: '.'
| '.' ID structure_selector*
;
ID : [_a-z0-9$]+ ;
NL : [\r\n]+ ;
WS : [ \t]+ -> skip ;
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<'.'>,1:0]
[#1,1:1='\n',<NL>,1:1]
[#2,2:2='.',<'.'>,2:0]
[#3,3:8='foobar',<ID>,2:1]
[#4,9:9='\n',<NL>,2:7]
[#5,10:10='.',<'.'>,3:0]
[#6,11:16='foobar',<ID>,3:1]
[#7,17:17='.',<'.'>,3:7]
[#8,18:20='baz',<ID>,3:8]
[#9,21:21='\n',<NL>,3:11]
[#10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'
and $ grun Question r -gui t.text displays the hierarchical tree structure you are expecting.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
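The two rules at work here (longest match wins, and on a tie the rule listed first wins) can be simulated in a few lines of Python. This is a sketch of the selection logic only, not of how ANTLR is implemented. The function lex and the regex translations ID_WORD and FIGURATIVE are assumptions made for illustration, and only the two competing rules are modelled.

```python
import re

# Hand translations of the two competing lexer rules (assumed equivalents).
ID_WORD = r"[a-zA-Z0-9](?:-*_*[A-Za-z0-9]+)*"
FIGURATIVE = r"ZEROES|ZERO|SPACES|SPACE"

def lex(rules, text):
    """Tokenize text with rules given as (name, pattern) pairs in grammar order."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():          # whitespace is skipped, as in the WS rule
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, rx in compiled:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and len(m.group(0)) > best_len:
                best_name, best_len = name, len(m.group(0))
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

id_first = [("IdWord", ID_WORD), ("FigurativeConstant", FIGURATIVE)]
fig_first = [("FigurativeConstant", FIGURATIVE), ("IdWord", ID_WORD)]

print(lex(id_first, "BY SPACES"))
print(lex(fig_first, "BY SPACES"))
print(lex(fig_first, "SPACESHIP"))
```

With IdWord listed first, SPACES comes out as IdWord; with FigurativeConstant listed first, it comes out as FigurativeConstant. A longer word such as SPACESHIP is still an IdWord in both orders, because the longer match wins before the ordering is consulted.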
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

Parsing fortran-style .op. operators

I'm trying to write an ANTLR4 grammar for a fortran-inspired DSL. I'm having difficulty with the 'ole classic ".op." operators:
if (1.and.1) then
where both "1"s should be interpreted as integers. I looked at the OpenFortranParser for insight, but I can't make sense out of it.
Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.
if (1.and.1) # INT/INT
if (1..and..1) # REAL/REAL
...etc...
I of course want to recognize variable-names in such statements:
if (a.and.b)
and have an appropriate rule for ID. In the small grammar below, however, any literals in quotes (ex, 'and', 'if', all the single-character numerical suffixes) are not accepted as an ID, and I get an error; any other ID-conforming string is accepted:
if (a.and.b) # errs, as 'b' is valid INTEGER suffix
if (a.and.c) # OK
Any insights into this behavior, or better suggestions on how to parse the .op. operators in fortran would be greatly appreciated -- Thanks!
grammar Foo;
start : ('if' expr | ID)+ ;
DOT : '.' ;
DIGITS: [0-9]+;
ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;
andOp : DOT 'and' DOT ;
SIGN : [+-];
expr
: ID
| expr andOp expr
| numeric
| '(' expr ')'
;
integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
real
: DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
| DOT DIGITS (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
;
numeric : integer | real;
EOLN : '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.
DIT : DOT { isDIT() }? ;
DOT : '.' ;
Change the 'andOp'
andOp : DIT 'and' DIT ;
Then add a predicate method
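@lexer::header {
import org.antlr.v4.runtime.misc.Interval;
}
@lexer::members {
// True when the '.' being matched opens or closes a ".and." operator.
public boolean isDIT() {
    int offset = _tokenStartCharIndex;
    // Interval bounds are inclusive, so each slice is at most five characters long.
    String r = _input.getText(Interval.of(Math.max(0, offset - 4), offset));
    String s = _input.getText(Interval.of(offset, offset + 4));
    return ".and.".equals(s) || ".and.".equals(r);
}
}
But that is not really the source of your current problem. The integer parser rule uses quoted literals such as 'b', and ANTLR turns each of them into an implicit lexer token that takes precedence over ID, which is why 'b' is not matched as an ID.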
Change it to
integer : INT ;
INT: DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
and the lexer will figure out the rest.
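For comparison, the same disambiguation can be sketched in a hand-written Python tokenizer that uses regex lookahead in place of an ANTLR predicate. This is a different technique from the isDIT() approach above and is simplified: exponents and the integer and real suffixes are left out, and the token names and the tokenize function are assumptions for illustration.

```python
import re

# Alternatives are tried in order, so a dotted operator is tested before numbers.
TOKEN_RE = re.compile(r"""
    (?P<OP>\.[A-Za-z]+\.)                      # dotted operator such as .and.
  | (?P<REAL>\d+\.(?![A-Za-z]+\.)\d*|\.\d+)    # real, unless the dot opens an operator
  | (?P<INT>\d+)
  | (?P<ID>[A-Za-z][A-Za-z0-9_]*)
  | (?P<PUNCT>[()])
""", re.VERBOSE)

def tokenize(text):
    """Return (type, text) pairs, skipping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens

print(tokenize("if (1.and.1)"))
print(tokenize("if (1..and..1)"))
print(tokenize("if (a.and.b)"))
```

The negative lookahead in REAL is what keeps the 1 in 1.and.1 an integer: the dot is only absorbed into the number when it is not immediately followed by letters and a closing dot.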
