Token Aliases in Antlr - antlr4

I have rules that look something like this:
INTEGER : [0-9]+;
field3 : INTEGER COMMA INTEGER;
In the parsed tree I get an List called INTEGER with two elements.
I would rather find a way for each of the elements to be named.
But if I do this:
INTEGER : [0-9]+;
DOS : INTEGER;
UNO : INTEGER;
field3 : UNO COMMA DOS;
I still get the array of INTEGERs.
Am I doing it right and I just need to dig deeper to figure out what is wrong?
Is there some kind of syntax to alias INTEGER as UNO just for this command (that is actually what I would prefer)?

Just use labeling to identify the subterms:
field : a=INTEGER COMMA b=INTEGER;
The FieldContext class will be generated with two additional class fields:
TerminalNode a;
TerminalNode b;
The corresponding INTEGER instances will be assigned to these fields. So, no aliasing is actually required in most cases.
However, there can be valid reasons to change the named type of a token and typically is handled in the lexer through the use of modes, actions, and predicates. For example, using modes, if INTEGER alternates between UNO and DOS types:
lexer grammar UD ;
UNO : INT -> mode(two);
mode two;
DOS : INT -> mode(default);
fragment INT : [0-9]+ ;
When to do the mode switch and whether a different specific approach might be more appropriate will depend on details not provided yet.

Related

Is this just a flawed grammar?

I was looking through a grammar for focal and found someone had defined their numbers as follows:
number
: mantissa ('e' signed_)?
;
mantissa
: signed_
| (signed_ '.')
| ('.' signed_)
| (signed_ '.' signed_)
;
signed_
: PLUSMIN? INTEGER
;
PLUSMIN
: '+'
| '-'
;
I was curious because I thought this would mean that, for example, 1.-1 would get identified as a number by the grammar rather than subtraction. Would a branch with unsigned_ be worth it to prevent this issue? I guess this is more of a question for the author, but are there any benefits to structuring it this way (besides the obvious avoiding floats vs ints)?
It’s not necessarily flawed.
It does appear that it will recognize 1.-1 as a mantissa. However, that doesn’t mean that some post-parse validation doesn’t catch this problem.
It would be flawed if there’s an alternative, valid interpretation of 1.-1.
Sometimes, it’s just useful to recognize an invalid construct and produce a parse tree for “the only way to interpret this input”, and then you can detect it in a listener and give the user an error message that might be more meaningful than the default message that ANTLR would produce.
And, then again, it could also just be an oversight.
The `signed_` rule on the other hand, being:
signed_ : PLUSMIN? INTEGER;
Instead of
signed_ : PLUSMIN? INTEGER+;
does make this grammar somewhat suspect as a good example to work from.
Your analyze looks correct to me saying that :
1.-1 is recognized as a number
a branch with unsigned_ could fix it
Saying it's "flawd" taste like a value judgement, which seems not relevant.
If that was for my own usage, I would prefer to :
recognize 0.-4 as an invalid number
recognize -.4 as a valid number
So I do prefer something like :
number
: signed_float('e' signed_integer)?
;
signed_float
: PLUSMIN? unsigned_float
;
unsigned_float
: integer
| (integer '.')
| ('.' integer)
| (integer'.' integer)
;
signed_integer
: PLUSMIN? unsigned_integer
;
PLUSMIN
: '+'
| '-'
;

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointer greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?
The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.

Ada Return Concatenated String of Strings

I am trying to finish up a homework assignment, and am down to the last part. First, I'll show you the type that I am dealing with:
TYPE Book_Collection IS
RECORD
Books : Book_Collection_Array;
Max_Size : Integer;
Size : Integer;
END RECORD;
TYPE Book_Type IS
RECORD
Title,
Author,
Publisher : Title_Str;
Year : Year_Type;
Edition : Natural;
Isbn : Isbn_Type;
Price : Dollars;
Stock : Natural;
Format : Format_Type;
END RECORD;
Book_Collection_Array is an array of book_type. These are private types, so the array is bounded (1..200).
There is a function called ToString in a separate package that was provided to us, that takes a book_type as input, and returns a string of all the elements of book_type. What I need to create is a function that takes book_collection is a parameter, and returns a string concatenating all of the strings that are returned by the ToString function that was provided, for the book_types that exist in that book_collection. I have made multiple attempts, but am constantly getting range check failures. Can anyone point me in the right direction?
*Edit:
Thank you to both of you for your help. I went the route of using an unbounded string, and appending each string to it, then declaring an output string and setting it as a constant string equal to the the To_String of the unbounded_string.*
I'll give you a hint.
Ada strings ideally aren't treated or handled much like C or Java strings at all. C strings count on a trailing nul (0) character to designate the end of data in a buffer. Java strings keep track of their own length, and will dynamically reallocate themselves to keep to the proper length if need be. So typical string-handling idioms in those languages think nothing of progressively modifying a string variable.
Ada strings instead are expected to be perfectly-sized when created. Most routines will assume that every element in a string array contains valid character data, and any destination string you assign data into will be perfectly sized to hold it. If that isn't the case, usually an exception is raised (and most likely your program crashes).
There are several ways to deal with this when you are building a string. One way is to create a really big string object as a buffer, and keep a separate length variable to tell your code how much data is really in there at all times. Then when you call Ada string routines you can feed them just the slice of data from the string that is valid. eg: Put_line (My_New_String(1..My_String_Length));
A better way is to just deal with perfectly-sized constant strings. For example, if you want to tack String1 and String2 together, the safe Ada way to do this is:
My_New_String : constant String := String1 & String2;
Then if you later want a string that is this string with String3 tacked on:
My_New_New_String : constant String := My_New_String & String3;
For more info on this, I suggest you look at some of the links over on the right side of this browser window under the heading "Related". I see a lot of good stuff in there.

ada split() method

I am trying to write an Ada equivalent to the split() method in Java or C++. I am to intake a string and an integer and output two seperate string values. For example:
split of "hello" and 2 would return:
"The first part is he
and the second part is llo"
The code I have is as follows:
-- split.adb splits an input string about a specified position.
--
-- Input: Astring, a string,
-- Pos, an integer.
-- Precondition: pos is in Astring'Range.
-- Output: The substrings Astring(Astring'First..Pos) and
-- Astring(Pos+1..Astring'Last).
--------------------------------------------------------------
with Ada.Text_IO, Ada.Integer_Text_IO, Ada.Strings.Fixed;
use Ada.Text_IO, Ada.Integer_Text_IO, Ada.Strings.Fixed;
procedure Split is
EMPTY_STRING : String := " ";
Astring, Part1, Part2 : String := EMPTY_STRING;
Pos, Chars_Read : Natural;
------------------------------------------------
-- Split() splits a string in two.
-- Receive: The_String, the string to be split,
-- Position, the split index.
-- PRE: 0 < Position <= The_String.length().
-- (Ada arrays are 1-relative by default)
-- Passback: First_Part - the first substring,
-- Last_Part - the second substring.
------------------------------------------------
function Split(TheString : in String ; Pos : in Integer; Part1 : out String ; Part2 : out String) return String is
begin
Move(TheString(TheString'First .. Pos), Part1);
Move(TheString(Pos .. TheString'Last), Part2);
return Part1, Part2;
end Split;
begin -- Prompt for input
Put("To split a string, enter the string: ");
Get_Line(Astring, Chars_Read);
Put("Enter the split position: ");
Get(Pos);
Split(Astring, Pos, Part1, Part2);
Put("The first part is ");
Put_Line(Part1);
Put(" and the second part is ");
Put_Line(Part2);
end Split;
The main part I am having trouble with is returning the two separate string values and in general the whole split() function. Any pointers or help is appreciated. Thank you
Instead of a function, consider making Split a procedure having two out parameters, as you've shown. Then decide if Pos is the last index of Part1 or the first index of Part2; I've chosen the latter.
procedure Split(
TheString : in String; Pos : in Integer;
Part1 : out String; Part2 : out String) is
begin
Move(TheString(TheString'First .. Pos - 1), Part1);
Move(TheString(Pos .. TheString'Last), Part2);
end Split;
Note that String indexes are Positive:
type String is array(Positive range <>) of Character;
subtype Positive is Integer range 1 .. Integer'Last;
Doing this is so trivial, I'm not sure why you'd bother making a routine for it. Just about any routine you could come up with is going to be much harder to use anyway.
Front_Half : constant String := Original(Original'first..Index);
Back_Half : constant String := Original(Index+1..Original'last);
Done.
Note that static Ada strings are very different than strings in other languages like C or Java. Due to their static nature, they are best built either inline like I've done above, or as return values from functions. Since functions cannot return more than one value, a single unified "split" routine is just plain not a good fit for static Ada string handling. Instead, you should either do what I did above, call the corresponding routines from Ada.Strings.Fixed (Head and Tail), or switch to using Ada.Strings.Unbounded.Unbounded_String instead of String.
The latter is probably the easiest option, if you want to keep your Java mindset about string handling. If you want to really learn Ada though, I'd highly suggest you learn to deal with static fixed Strings the Ada way.
From looking over your code you really need to read up in general on the String type, because you're dragging in a lot of expectations in from other languages on how to work with them--which aren't going to work with them. Ada's String type is not one of its more flexible features, in that they are always fixed length. While there are ways of working around the limitations in a situation such as you're describing, it would be much easier to simply use Unbounded_Strings.
The input String to your function could remain of type String, which will adjust to the length of the string that you provide to it. The two output Unbounded_Strings then are simply set to the sliced string components after invoking To_Unbounded_String() on each of them.
Given the constraints of your main program, with all strings bounded by the size of EMPTY_STRING. the procedure with out parameters is the correct approach, with the out parameter storage allocated by the caller (on the stack as it happens)
That is not always the case, so it is worth knowing another way. The problem is how to deal with data whose size is unknown until runtime.
Some languages can only offer runtime allocation on the heap (via "new" or "malloc") and can only access the data via pointers, leaving a variety of messy problems including accesses off the end of the data (buffer overruns) or releasing the storage correctly (memory leaks, accessing freed pointers etc)
Ada will allow this method too, but it is usually unnecessary and strongly discouraged. Unbounded_String is a wrapper over this method, while Bounded_String avoids heap allocation where you can accept an upper bound on the string length.
But also, Ada allows variable sized data structures to be created on the stack; the technique just involves creating a new stack frame and declaring new variables where you need to, with "declare". The new variables can be initialised with function calls.
Each function can only return one object, but that object's size can be determined at runtime. So either "Split" can be implemented as 2 functions, returning Part1 or Part2, or it can return a record containing both strings. It would be a record with two size discriminants, so I have chosen the simpler option here. The function results are usually built in place (avoids copying).
The flow in your example would require two nested Declare blocks; if "Pos" could be identified first, they could be collapsed into one...
procedure Split is
function StringBefore( Input : String; Pos : Natural) return String is
begin
return Input(1 .. Pos-1);
end StringBefore;
function StringFrom ...
begin
Put("To split a string, enter the string: ");
declare
AString : String := Get_Line;
Pos : Natural;
begin
Put("Enter the split position: ");
Get(Pos);
declare
Part1 : String := StringBefore(AString, Pos);
Part2 : String := StringFrom(AString, Pos);
begin
Put("The first part is ");
Put_Line(Part1);
Put(" and the second part is ");
Put_Line(Part2);
end; -- Part1 and Part2 are now out of scope
end; -- AString is now out of scope
end Split;
This can obviously be wrapped in a loop, with different size strings each time, with no memory management issues.
Look at the Head and Tail functions in Ada.Strings.Fixed.
function Head (Source : in String; Count : in Natural; Pad : in Character := Space) return String;
function Tail (Source : in String; Count : in Natural; Pad : in Character := Space)
return String;
Here's an approach that just uses slices of the string.
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.Fixed; use Ada.Strings.Fixed;
procedure Main is
str : String := "one,two,three,four,five,six,seven,eight";
pattern : String := ",";
idx, b_idx : Integer;
begin
b_idx := 1;
for i in 1..Ada.Strings.Fixed.Count ( Source => str, Pattern => pattern ) loop
idx := Ada.Strings.Fixed.Index( Source => str(b_idx..str'Last), Pattern => pattern);
Put_Line(str(b_idx..idx-1)); -- process string slice in any way
b_idx := idx + pattern'Length;
end loop;
-- process last string
Put_Line(str(b_idx..str'Last));
end Main;

Resources