How to define ANTLR Parser Rule for concatenated tokens/ - antlr4

I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser doesn't not match and stop at expecting STRING token
How do I tell the parser that token has two parts?
BR

You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as an ID token and not as a STRING and NUMERIC_LITERAL tokens.
Also trying to match these sort of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding white spaces in the lexer. If the input would then be "string 64" (string + space + 64) it could possible be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
: ID
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z] [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!

Related

Defining Antlr lexer rule with termination condition

There is a case to parse 2 tokens which are separated by ‘2/’ . Both tokens can be of alphanumeric characters with no fixed length.
Examples: Abcd34D22/ERTD34D or ABCD2/DEF
Desired output : TOKEN1 = ‘Abcd34D2’, SEPARATOR: ‘2/’ , TOKEN2 = ‘ERTD34D’
I would like to know if there is a way to define lexer rule for TOKEN1 and manage the ambiguity so that if 2 is followed by /, it should qualified to be matched as SEPARATOR. Below is the sample token definitions for illustration.
fragment ALPHANUM: [0-9A-Za-z];
fragment SLASH: '/';
TOKEN1 : (ALPHANUM)+;
SEPARATOR : '2' SLASH -> mode(TOKEN2_MODE);
mode TOKEN2_MODE;
TOKEN2 : (ALPHANUM)+;
AFAIK, you'll have to use a predicate, which means you'll have to add some target specific code to your grammar. If your target language is Java, you could do something like this:
TOKEN1
: TOKEN1_ATOM+
;
fragment TOKEN1_ATOM
: [013-9A-Za-z] // match a single alpha-num except '2'
| '2' {_input.LA(1) != '/'}? // only match `2` if there's no '/' after it
;

Token collision (??) writing ANTLR4 grammar

I have what I thought a very simple grammar to write:
I want it to allow token called fact. These token can start with a letter and then allow a any kind of these: letter, digit, % or _
I want to concat two facts with a . but the the second fact does not have to start by a letter (a digit, % or _ are also valid from the second token)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such grammar but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I tried to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR first finds Bar match INITIALFACT and stops here. How can I fix this ?
If it is relevent, I am using Antlr4cs.

Matching the finite closure pattern ({x,y}) of regular expressions

I am trying to write a grammar that will match the finite closure pattern for regular expressions ( i.e foo{1,3} matches 1 to 3 'o' appearances after the 'fo' prefix )
To identify the string {x,y} as finite closure it must not include spaces for example { 1, 3} is recognized as a sequence of seven characters.
I have written the following lexer and parser file but i am not sure if this is the best solution. I am using a lexical mode for the closure pattern which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
#header { using System;
using System.IO; }
#lexer::members{
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
#header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably that without whitespaces, if you usually skip them) and the parser matches the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains e.g. whitespaces. So, if your lexer does not match CLOSURE you will get all the individual tokens like the curly braces and integers ending up in your parser for matching (where you then can treat them as something different).
NB: doesn't the closure definition also allow {,n} (same as {n})? That requires an additional alt in the CLOSURE rule.
And finally a hint: your OTHER rule will probably give you trouble as it matches any char and is even located before other rules. If you have a whildcard rule then it should be the last in your grammar, matching everything not matched by any other rule.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointer greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

Parsing fortran-style .op. operators

I'm trying to write an ANTLR4 grammar for a fortran-inspired DSL. I'm having difficulty with the 'ole classic ".op." operators:
if (1.and.1) then
where both "1"s should be intepreted as integer. I looked at the OpenFortranParser for insight, but I can't make sense out of it.
Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.
if (1.and.1) # INT/INT
if (1..and..1) # REAL/REAL
...etc...
I of course want to recognize variable-names in such statements:
if (a.and.b)
and have an appropriate rule for ID. In the small grammar below, however, any literals in quotes (ex, 'and', 'if', all the single-character numerical suffixes) are not accepted as an ID, and I get an error; any other ID-conforming string is accepted:
if (a.and.b) # errs, as 'b' is valid INTEGER suffix
if (a.and.c) # OK
Any insights into this behavior, or better suggestions on how to parse the .op. operators in fortran would be greatly appreciated -- Thanks!
grammar Foo;
start : ('if' expr | ID)+ ;
DOT : '.' ;
DIGITS: [0-9]+;
ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;
andOp : DOT 'and' DOT ;
SIGN : [+-];
expr
: ID
| expr andOp expr
| numeric
| '(' expr ')'
;
integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
real
: DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
| DOT DIGITS (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
;
numeric : integer | real;
EOLN : '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.
DIT : DOT { isDIT() }? ;
DOT : '.' ;
Change the 'andOp'
andOp : DIT 'and' DIT ;
Then add a predicate method
#lexer::members {
public boolean isDIT() {
int offset = _tokenStartCharIndex;
String r = _input.getText(Interval.of(offset-4, offset));
String s = _input.getText(Interval.of(offset, offset+4));
if (".and.".equals(s) || ".and.".equals(r)) {
return true;
}
return false;
}
}
But, that is not really the source of your current problem. The integer parser rule defines lexer constants effectively outside of the lexer, which is why 'b' is not matched to an ID.
Change it to
integer : INT ;
INT: DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
and the lexer will figure out the rest.

Resources