Why does the "point" rule match more than one number? - antlr4

I am trying to write my first parser with ANTLR4. One of the rules in an already large grammar file is supposed to match two numbers as a 2D point. Here is a cut-down example of the grammar:
grammar example;
WS: [ \t\r\n]+ -> channel(HIDDEN);
INT: [0-9]+;
FLOAT: [0-9]*'.'?[0-9]+ ;
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
NUMBER: (INT | FLOAT) ;
id: IDSTRING;
num: NUMBER;
sem: ';' ;
point: num num;
macro: 'MACRO' id macroprops* 'END ' id;
macroprops: macroorigin ;
macroorigin: 'ORIGIN' point sem;
When I now run a basic example like this:
antlr4 example.g4 -o example/
cd example
javac *.java
echo -e "MACRO m_1\n ORIGIN 7 2.0 ;\nEND m_1" | grun example macro -tree
the first num in point matches both numbers, and an error is reported saying that an integer (here 0) is not a NUMBER:
line 3:9 mismatched input '0' expecting NUMBER
(macro MACRO (id m_1) (macroprops (macroorigin ORIGIN (point (num 0 0) (num <missing NUMBER>)) (sem ;))) END (id m_1))
I tried several different definitions of NUMBER and point, but I think it should work like this. I don't even understand how num can match two tokens. Can anybody help?

It seems that ANTLR4 matches tokens in the order in which they are defined in the grammar. Adding fragment to INT and FLOAT fixed the problem, because NUMBER is then the only token that matches a number, and it allows both floats and ints.
grammar example2;
WS: [ \t\r\n]+ -> channel(HIDDEN);
NUMBER: (INT | FLOAT) ;
fragment INT: [0-9]+;
fragment FLOAT: [0-9]*'.'?[0-9]+ ;
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING;
num: NUMBER;
sem: ';' ;
point: num num;
macro: 'MACRO' id macroprops* 'END ' id;
macroprops: macroorigin ;
macroorigin: 'ORIGIN' point sem;
Thank you very much for pointing out that I should look at the token stream. But I still don't understand why both numbers were matched by the num rule in the original question.
EDIT: As GRosenberg mentioned, another mistake was the order of the grammar elements: they should be defined in the right order, so that NUMBER has a higher priority than its subrules.
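The priority rule discussed above (longest match wins, and on a tie the rule defined first wins) can be illustrated with a small simulation. This is only a sketch in Python, not ANTLR itself: the tokenize helper and the regular expressions are assumptions that approximate the lexer rules of the two grammars.

```python
import re

def tokenize(rules, text):
    """Tokenize text with ANTLR-like priorities.

    rules is an ordered list of (name, regex). At each position the longest
    match wins; when two rules match the same length, the earlier one wins.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule only wins with a longer match.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is hidden from the parser
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

WS = r"[ \t\r\n]+"
INT = r"[0-9]+"
FLOAT = r"[0-9]*\.?[0-9]+"  # also covers every INT, so it stands in for (INT | FLOAT)

# Original grammar: INT and FLOAT are real tokens defined before NUMBER.
original = [("WS", WS), ("INT", INT), ("FLOAT", FLOAT), ("NUMBER", FLOAT)]
# Fixed grammar: INT and FLOAT are fragments, so only NUMBER produces tokens.
fixed = [("WS", WS), ("NUMBER", FLOAT)]

original_tokens = tokenize(original, "7 2.0")
fixed_tokens = tokenize(fixed, "7 2.0")
print(original_tokens)  # [('INT', '7'), ('FLOAT', '2.0')]
print(fixed_tokens)     # [('NUMBER', '7'), ('NUMBER', '2.0')]
```

With the original rule order the parser never receives a NUMBER token, so the num rule cannot match as intended.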

Related

How to define ANTLR Parser Rule for concatenated tokens

I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser does not match and stops, expecting a STRING token.
How do I tell the parser that token has two parts?
BR
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as a single ID token and not as a STRING token followed by a NUMERIC_LITERAL token.
Also, trying to match this sort of thing in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding whitespace in the lexer. If the input were "string 64" (string + space + 64), it could possibly be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
: TYPE_ID_STRING
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z]+ [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
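The effect described above can be checked with a small simulation. This is a hedged sketch in Python rather than ANTLR: first_token is a hypothetical helper, NUMERIC_LITERAL is simplified to plain digits, and TYPE_ID_STRING is assumed to mean one or more letters followed by one or more digits.

```python
import re

def first_token(rules, text):
    """Return (rule name, matched text) for the token at the start of text.

    The longest match wins; on a tie the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.group() and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

STRING = r"[sS][tT][rR][iI][nN][gG]"
NUMERIC_LITERAL = r"[0-9]+"          # simplified for this sketch
ID = r"[a-zA-Z_][a-zA-Z0-9_]*"
TYPE_ID_STRING = r"[a-zA-Z]+[0-9]+"  # assumed shape: letters, then digits

without_type_rule = [("STRING", STRING), ("NUMERIC_LITERAL", NUMERIC_LITERAL), ("ID", ID)]
with_type_rule = [("STRING", STRING), ("NUMERIC_LITERAL", NUMERIC_LITERAL),
                  ("TYPE_ID_STRING", TYPE_ID_STRING), ("ID", ID)]

print(first_token(without_type_rule, "string64"))  # ('ID', 'string64')
print(first_token(with_type_rule, "string64"))     # ('TYPE_ID_STRING', 'string64')
print(first_token(with_type_rule, "fubar1"))       # ('TYPE_ID_STRING', 'fubar1')
print(first_token(with_type_rule, "fubar"))        # ('ID', 'fubar')
```

ID beats STRING on string64 because it matches more characters, while TYPE_ID_STRING beats ID only because it is listed first and ties on length.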

Same character used in several Lexer rules?

I have a (probably) quite simple question. I am writing an ANTLR grammar to evaluate dice expressions. With it I would like to parse something like 4d6 as well as 4d6d2. The first means "roll 4 six-sided dice" and the second means "roll 4 six-sided dice and drop the 2 lowest". My current grammar is:
grammar Dice;
start : dice ;
dice : NUMBER? DSEPERATOR NUMBER ( KEEP | DROP NUMBER )?;
KEEP : 'k';
DROP : 'd';
DSEPERATOR : [dD];
NUMBER : [0-9]+;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I seem to have a problem with the definitions of DROP and DSEPERATOR, as they both use the letter d. The parser stops after the first NUMBER in the dice expression. What is the workaround here? What do I have to change in my grammar?
The implicit structure being parsed is
e - number of executions
d - type of object
s - number of object sides
o - op to apply*
a - number of op applications*
So, the match rule is:
decode : e d s ( o a )? ;
Consolidating numbers and expanding:
decode : e=NUM type s=NUM ( op a=NUM )? ; // label for convenience
type : dice | .... ;
op : drop | .... ;
dice : D ;
drop : D ;
NUM : [0-9]+ ;
D : 'd' ;
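The idea in the rule above is that the lexer produces a single token for the letter d and the position in the rule decides whether it means "dice" or "drop". A minimal sketch of that positional decoding, written in Python with a regular expression instead of ANTLR; the decode function and the result keys are hypothetical names chosen for this illustration.

```python
import re

# e=NUM 'd' s=NUM ( 'd' a=NUM )?
# The same letter is used twice; only its position gives it a meaning.
DICE = re.compile(r"(?P<e>[0-9]+)d(?P<s>[0-9]+)(?:d(?P<a>[0-9]+))?")

def decode(expr):
    """Split a dice expression such as '4d6d2' into its numeric parts."""
    m = DICE.fullmatch(expr)
    if m is None:
        raise ValueError(f"not a dice expression: {expr!r}")
    return {
        "count": int(m["e"]),                    # number of executions
        "sides": int(m["s"]),                    # number of object sides
        "drop": int(m["a"]) if m["a"] else 0,    # optional op applications
    }

print(decode("4d6"))    # {'count': 4, 'sides': 6, 'drop': 0}
print(decode("4d6d2"))  # {'count': 4, 'sides': 6, 'drop': 2}
```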

Token collision (??) writing ANTLR4 grammar

I have what I thought a very simple grammar to write:
I want it to allow tokens called facts. A fact starts with a letter, followed by any number of these: letters, digits, % or _
I want to concatenate two facts with a . but the second fact does not have to start with a letter (a digit, % or _ is also valid at the start of the second fact)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such grammar but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I tried to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR first finds that Bar matches INITIALFACT and stops there. How can I fix this?
If it is relevant, I am using Antlr4cs.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

Parsing fortran-style .op. operators

I'm trying to write an ANTLR4 grammar for a fortran-inspired DSL. I'm having difficulty with the 'ole classic ".op." operators:
if (1.and.1) then
where both "1"s should be interpreted as integers. I looked at the OpenFortranParser for insight, but I can't make sense of it.
Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.
if (1.and.1) # INT/INT
if (1..and..1) # REAL/REAL
...etc...
I of course want to recognize variable-names in such statements:
if (a.and.b)
and have an appropriate rule for ID. In the small grammar below, however, any literal that appears in quotes (e.g. 'and', 'if', all the single-character numerical suffixes) is not accepted as an ID, and I get an error; any other ID-conforming string is accepted:
if (a.and.b) # errs, as 'b' is valid INTEGER suffix
if (a.and.c) # OK
Any insights into this behavior, or better suggestions on how to parse the .op. operators in fortran would be greatly appreciated -- Thanks!
grammar Foo;
start : ('if' expr | ID)+ ;
DOT : '.' ;
DIGITS: [0-9]+;
ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;
andOp : DOT 'and' DOT ;
SIGN : [+-];
expr
: ID
| expr andOp expr
| numeric
| '(' expr ')'
;
integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
real
: DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
| DOT DIGITS (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
;
numeric : integer | real;
EOLN : '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.
DIT : DOT { isDIT() }? ;
DOT : '.' ;
Change the 'andOp'
andOp : DIT 'and' DIT ;
Then add a predicate method
@lexer::members {
public boolean isDIT() {
int offset = _tokenStartCharIndex;
String r = _input.getText(Interval.of(offset-4, offset));
String s = _input.getText(Interval.of(offset, offset+4));
if (".and.".equals(s) || ".and.".equals(r)) {
return true;
}
return false;
}
}
But that is not really the source of your current problem. The quoted literals in the integer parser rule implicitly define lexer tokens of their own, effectively outside of the lexer, and these take precedence over ID; that is why 'b' is not matched as an ID.
Change it to
integer : INT ;
INT: DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
and the lexer will figure out the rest.
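The look-around check performed by isDIT() above can be sketched outside of ANTLR. This Python version is only an illustration under the assumption that Interval.of(a, b) is inclusive at both ends, so each window is five characters long; is_dit is a hypothetical name, and unlike the Java snippet the start of the look-behind window is clamped at 0.

```python
def is_dit(text, offset):
    """True if the '.' at text[offset] opens or closes a '.and.' operator."""
    ahead = text[offset:offset + 5]                # the dot and the next 4 characters
    behind = text[max(0, offset - 4):offset + 1]   # the previous 4 characters and the dot
    return ahead == ".and." or behind == ".and."

line = "if (1.and.1)"
first_dot = line.index(".")
second_dot = line.index(".", first_dot + 1)
print(is_dit(line, first_dot))    # True, the dot opens '.and.'
print(is_dit(line, second_dot))   # True, the dot closes '.and.'
print(is_dit("x = 1.5", 5))       # False, an ordinary decimal point
```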
