Defining Antlr lexer rule with termination condition - antlr4

There is a case to parse 2 tokens which are separated by ‘2/’ . Both tokens can be of alphanumeric characters with no fixed length.
Examples: Abcd34D22/ERTD34D or ABCD2/DEF
Desired output : TOKEN1 = ‘Abcd34D2’, SEPARATOR: ‘2/’ , TOKEN2 = ‘ERTD34D’
I would like to know if there is a way to define lexer rule for TOKEN1 and manage the ambiguity so that if 2 is followed by /, it should qualified to be matched as SEPARATOR. Below is the sample token definitions for illustration.
fragment ALPHANUM: [0-9A-Za-z];
fragment SLASH: '/';
TOKEN1 : (ALPHANUM)+;
SEPARATOR : '2' SLASH -> mode(TOKEN2_MODE);
mode TOKEN2_MODE;
TOKEN2 : (ALPHANUM)+;

AFAIK, you'll have to use a predicate, which means you'll have to add some target specific code to your grammar. If your target language is Java, you could do something like this:
TOKEN1
: TOKEN1_ATOM+
;
fragment TOKEN1_ATOM
: [013-9A-Za-z] // match a single alpha-num except '2'
| '2' {_input.LA(1) != '/'}? // only match `2` if there's no '/' after it
;

Related

How to define ANTLR Parser Rule for concatenated tokens/

I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser doesn't not match and stop at expecting STRING token
How do I tell the parser that token has two parts?
BR
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as an ID token and not as a STRING and NUMERIC_LITERAL tokens.
Also trying to match these sort of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding white spaces in the lexer. If the input would then be "string 64" (string + space + 64) it could possible be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
: ID
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z] [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!

how to match a sequence that has no separation with whitespace

The rule I am trying to match is: hello followed by a sequence of characters. If that sequence contains an alphabet in it, that should match the str rule, else it should match the num rule.
For e.g.
hello123 - 123 should be matched by num rule
hello1a3 - 1a3 should be matched by the str rule
The grammar I wrote is below:
grammar Hello;
r: 'hello'seq;
// seq: str | integ;
seq: num | str;
num : DIGITS;
str : CHARS;
DIGITS: [0-9]+;
CHARS : [0-9a-zA-Z]+;
WS : [ \t\n\r]+ -> skip;
While trying to visualize the parse tree (using grun) (against the first input example above) I got the below parse tree:
However if the input had space in between there was no problem. Please explain why the error.
Lexing in ANTLR (as well as most lexer generators) works according to the maximum munch rule, which says that it always applies the lexer rule that could match the longest prefix of the current input. For the input hello123, the rule 'hello' would match hello, whereas the rule CHARS would match the entire input hello123. Therefore CHARS produces the longer match and is chosen over 'hello'.
If your CHARS and DIGITS tokens can only appear after a 'hello' token, you can use lexer modes to make it so that these rules are only available after a 'hello' has been matched.
Otherwise, to get the behaviour you want, your best bet would probably be to create a single lexer rule that matches 'hello' [0-9a-zA-Z]* and then take apart the tokens generated by that in a separate step. Though it all depends on why you need this.

Matching the finite closure pattern ({x,y}) of regular expressions

I am trying to write a grammar that will match the finite closure pattern for regular expressions ( i.e foo{1,3} matches 1 to 3 'o' appearances after the 'fo' prefix )
To identify the string {x,y} as finite closure it must not include spaces for example { 1, 3} is recognized as a sequence of seven characters.
I have written the following lexer and parser file but i am not sure if this is the best solution. I am using a lexical mode for the closure pattern which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
#header { using System;
using System.IO; }
#lexer::members{
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
#header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably that without whitespaces, if you usually skip them) and the parser matches the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains e.g. whitespaces. So, if your lexer does not match CLOSURE you will get all the individual tokens like the curly braces and integers ending up in your parser for matching (where you then can treat them as something different).
NB: doesn't the closure definition also allow {,n} (same as {n})? That requires an additional alt in the CLOSURE rule.
And finally a hint: your OTHER rule will probably give you trouble as it matches any char and is even located before other rules. If you have a whildcard rule then it should be the last in your grammar, matching everything not matched by any other rule.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointer greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?
The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.

Resources