Token recognition order

Token recognition order - antlr4

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex

Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
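The tie-break described above can be imitated outside ANTLR. The following Python sketch is not ANTLR itself: `first_token` is a hypothetical helper, and the two regular expressions are hand translations of the IdWord and FigurativeConstant rules from the question. It only illustrates the policy "longest match wins, and on equal length the rule listed first wins".

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the first token of text.

    Policy being illustrated: the longest match wins, and when two
    rules match the same length, the rule listed earlier wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Replace only on a strictly longer match, so an earlier rule keeps a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Hand-translated from the grammar in the question (an approximation).
ID_WORD = r"[a-zA-Z0-9](?:-*_*[A-Za-z0-9]+)*"
FIGURATIVE = r"ZEROES|ZERO|SPACES|SPACE"

id_first = [("IdWord", ID_WORD), ("FigurativeConstant", FIGURATIVE)]
fig_first = [("FigurativeConstant", FIGURATIVE), ("IdWord", ID_WORD)]

print(first_token(id_first, "SPACES"))   # IdWord is listed first and keeps the tie
print(first_token(fig_first, "SPACES"))  # FigurativeConstant is listed first and keeps the tie
```

With the FigurativeConstant rule listed first, an input such as SPACESHIP would still come out as an IdWord in this sketch, because the longer match takes priority over rule order.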
As a matter of style, the order in which you list your rules obscures the significance of that order, particularly from the point of view of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples: typically, in a combined grammar, the parser rules come first in top-down order, followed by the lexer rules. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

Related

Is this just a flawed grammar?

I was looking through a grammar for focal and found someone had defined their numbers as follows:
number
: mantissa ('e' signed_)?
;
mantissa
: signed_
| (signed_ '.')
| ('.' signed_)
| (signed_ '.' signed_)
;
signed_
: PLUSMIN? INTEGER
;
PLUSMIN
: '+'
| '-'
;
I was curious because I thought this would mean that, for example, 1.-1 would get identified as a number by the grammar rather than subtraction. Would a branch with unsigned_ be worth it to prevent this issue? I guess this is more of a question for the author, but are there any benefits to structuring it this way (besides the obvious avoiding floats vs ints)?

It’s not necessarily flawed.
It does appear that it will recognize 1.-1 as a mantissa. However, that doesn’t mean that some post-parse validation doesn’t catch this problem.
It would be flawed if there’s an alternative, valid interpretation of 1.-1.
Sometimes, it’s just useful to recognize an invalid construct and produce a parse tree for “the only way to interpret this input”, and then you can detect it in a listener and give the user an error message that might be more meaningful than the default message that ANTLR would produce.
And, then again, it could also just be an oversight.
The `signed_` rule on the other hand, being:
signed_ : PLUSMIN? INTEGER;
Instead of
signed_ : PLUSMIN? INTEGER+;
does make this grammar somewhat suspect as a good example to work from.

Your analysis looks correct to me:
1.-1 is recognized as a number
a branch with unsigned_ could fix it
Calling it "flawed" sounds like a value judgement, which does not seem relevant.
If this were for my own use, I would prefer to:
recognize 0.-4 as an invalid number
recognize -.4 as a valid number
So I would prefer something like:
number
: signed_float ('e' signed_integer)?
;
signed_float
: PLUSMIN? unsigned_float
;
unsigned_float
: unsigned_integer
| (unsigned_integer '.')
| ('.' unsigned_integer)
| (unsigned_integer '.' unsigned_integer)
;
signed_integer
: PLUSMIN? unsigned_integer
;
unsigned_integer
: INTEGER
;
PLUSMIN
: '+'
| '-'
;
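To check which inputs a grammar shaped like this accepts, the rules can be approximated with a regular expression. This is only a sketch: it assumes the integer part is one or more ASCII digits (the original grammar's INTEGER token is not shown), and `is_number` is a hypothetical helper name.

```python
import re

# Optional sign, then one of the four unsigned float shapes,
# then an optional exponent with its own optional sign.
NUMBER = re.compile(
    r"[+-]?"
    r"(?:[0-9]+\.[0-9]+|[0-9]+\.|\.[0-9]+|[0-9]+)"
    r"(?:e[+-]?[0-9]+)?"
)

def is_number(text):
    """True if the whole string has the shape of the proposed number rule."""
    return NUMBER.fullmatch(text) is not None

for sample in ["-.4", "0.-4", "1.-1", "1.5e-3", "1."]:
    print(sample, is_number(sample))
```

Because the sign is only allowed in front of the whole mantissa and in front of the exponent, inputs such as 0.-4 and 1.-1 are rejected here, while -.4 is accepted.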

ANTLR4 lexer rule ensuring expression does not end with character

I have a syntax where I need to match given the following example:
some-Text->more-Text
From this example, I need ANTLR4 lexer rules that would match 'some-Text' and 'more-Text' into one lexer rule, and the '->' as another rule.
I am using the lexer rules shown below as my starting point, but the trouble is, the '-' character is allowed in the NAMEDELEMENT rule, which causes the first NAMEDELEMENT match to become 'some-Text-', which then causes the '->' to not be captured by the EDGE rule.
I'm looking for a way to ensure that the '-' is not captured as the last character in the NAMEDELEMENT rule (or some other alternative that produces the desired result).
EDGE
: '->'
;
NAMEDELEMENT
: ('a'..'z'|'A'..'Z'|'_'|'#') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* { _input.LA(1) != '-' && _input.LA(2) != '>' }?
;
I'm trying to use the predicate above to look ahead for a sequence of '-' and '>', but it doesn't seem to work. It doesn't seem to do anything at all, actually, as I get the same parsing results both with and without the predicate.
The parser rules are as follows, where I am matching on 'selector' rules:
selector
: namedelement (edge namedelement)*
;
edge
: EDGE
;
namedelement
: NAMEDELEMENT
;
Thanks in advance!

After messing around with this for hours, I have a syntax that works, though I fail to see how it is functionally any different than what I posted in the original question.
(I use the uncommented version so that I can put a break point in the generated lexer to ensure that the equality test is evaluating correctly.)
NAMEDELEMENT
//: [a-zA-Z_#] [a-zA-Z_-]* { String.fromCharCode(this._input.LA(1)) != ">" }?
: [a-zA-Z_#] [a-zA-Z_-]* { (function(a){
var c = String.fromCharCode(a._input.LA(1));
return c != ">";
})(this)
}?
;
My target language is JavaScript and both the commented and uncommented forms of the predicate work fine.

Try this:
NAMEDELEMENT
: [a-zA-Z_#] ( '-' {_input.LA(1) != '>'}? | [a-zA-Z0-9_] )*
;
Not sure if _input.LA(1) != '>' is OK with the JavaScript runtime, but in Java it properly tokenises "some-->more" into "some-", "->" and "more".
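The idea behind that rule (consume a '-' only when the character after it is not '>') can be traced with a small hand-written scanner. This Python sketch is not generated ANTLR code; `tokenize` and the character sets are illustrative assumptions modelled on the rule above.

```python
import string

NAME_START = set(string.ascii_letters + "_#")
NAME_PART = set(string.ascii_letters + string.digits + "_")

def tokenize(text):
    """Split text into EDGE ('->') and NAMEDELEMENT tokens."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith("->", i):
            tokens.append(("EDGE", "->"))
            i += 2
        elif text[i] in NAME_START:
            j = i + 1
            while j < len(text):
                c = text[j]
                if c == "-":
                    # Same test as the predicate: a '-' is part of the
                    # name only if the next character is not '>'.
                    if text[j + 1 : j + 2] == ">":
                        break
                elif c not in NAME_PART:
                    break
                j += 1
            tokens.append(("NAMEDELEMENT", text[i:j]))
            i = j
        else:
            raise ValueError(f"unexpected character {text[i]!r} at offset {i}")
    return tokens

print(tokenize("some-Text->more-Text"))
print(tokenize("some-->more"))
```

The key design point is that the check sits on the individual '-' alternative inside the loop rather than at the end of the whole rule, so the scan stops before the '-' that belongs to the edge.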

ANTLR - Handle whitespace in identifier

I am trying to build a simple search expression, and I can't get the grammar below to work.
Here are my sample search texts:
LOB WHERE
Line of Business WHERE
Line of Business WHERE
As you can see in the searches above, the first few words are the search keyword, followed by the WHERE condition. I want to capture the search keyword, which can include whitespace. Here is my sample grammar, which doesn't seem to parse properly:
sqlsyntax : identifierws 'WHERE';
identifierws : (WSID)+;
WSID: [a-zA-Z0-9 ] ; // match identifiers with space
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Any help in this regard is appreciated.
This is what is happening when I try to parse
Line of Business WHERE
I get the following error
line 1:0 no viable alternative at input 'Line'
I get back the text LineofBusiness, but the whitespace got trimmed. I want the exact text Line of Business; that is where I am struggling a bit.

The identifierws rule is consuming all text. Better to prioritize identification of keywords in the lexer:
sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ;
identifierws : (WSID)+;
WHERE: 'WHERE';
EQ : '=' ;
STRING : '\'' .*? '\'' ;
WSID: [a-zA-Z0-9 ] ;
WS : [ \t\r\n]+ -> skip ;

For such a simple case I wouldn't use a parser. That's just overkill. All you need to do is take the current position in the input and search for WHERE (a simple Boyer-Moore search). Then take the text between the start position and the WHERE position as your input. Jump over WHERE, set the start position to where you are then, and repeat the same search.
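A minimal sketch of that search-and-slice approach, using Python's built-in `str.find` in place of a hand-written Boyer-Moore search. The helper name `split_on_where` is hypothetical, and the sketch assumes the keyword is always the upper-case literal WHERE; it does not check word boundaries, so a search term containing that exact letter sequence would be split in the wrong place.

```python
def split_on_where(text, keyword="WHERE"):
    """Return (search_term, condition) split at the first keyword.

    Inner whitespace of the search term is preserved; only leading and
    trailing whitespace is stripped. condition is None when the keyword
    does not occur at all.
    """
    pos = text.find(keyword)
    if pos == -1:
        return text.strip(), None
    return text[:pos].strip(), text[pos + len(keyword):].strip()

print(split_on_where("Line of Business WHERE"))
print(split_on_where("LOB WHERE name = 'x'"))
```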

No viable alternative at input ' '

I know this question has been asked before, but I haven't found any solution to my specific problem. I am using ANTLR4 with the C# target and I have the following lexer rules:
INT : [0-9]+
;
LETTER : [a-zA-Z_]+
;
WS : [ \t\r\n\u000C]+ -> skip
;
LineComment
: '#' ~[\r\n]* -> skip
;
Those are all of the lexer rules. There are also many parser rules, which I will not post here since I don't think they are relevant.
The problem I have is that whitespace does not get skipped. When I inspect the token stream after the lexer has run on my input, the whitespace is still in there and therefore causes an error. The input I use is relatively basic:
"fd 100"
It parses completely until it reaches this parser rule:
noSignFactor
: ':' ident #NoSignFactorArg
| integer #NoSignFactorInt
| float #NoSignFactorFloat
| BOOLEAN #NoSignFactorBool
| '(' expr ')' #NoSignFactorExpr
| 'not' factor #NoSignFactorNot
;
integer : INT #IntegerInt
;

Start by separating your grammar into a separate lexer grammar and parser grammar. For example, if you have a grammar Foo;, create the following:
Create a file FooLexer.g4, and move all of the lexer rules from Foo.g4 into FooLexer.g4.
Create a file FooParser.g4, and move all of the parser rules from Foo.g4 into FooParser.g4.
Include the following option in FooParser.g4:
options {
tokenVocab=FooLexer;
}
This separation will ensure that your parser isn't silently creating lexer rules for you. In a combined grammar, using a literal such as 'not' in a parser rule will create a lexer rule for you if one does not already exist. When this happens, it's easy to lose track of what kinds of tokens your lexer is able to produce. When you use a separate lexer grammar, you will need to explicitly declare a rule like the following in order to use 'not' in a parser rule.
NOT : 'not';
This should solve the whitespace problem if you have included the literal ' ' somewhere in a parser rule.

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?

The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
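To see the longest-match policy applied to the input from the question, here is a small Python simulation. It is only a sketch of the policy, not of ANTLR's actual implementation; `tokenize` is a hypothetical helper and the regular expressions are hand translations of the Tiny grammar's lexer rules (the WS rule is left out because the test input has no whitespace).

```python
import re

# Order mirrors the lexer rules of the Tiny grammar in the question.
RULES = [
    ("RRV", re.compile(r"RRV")),
    ("N", re.compile(r"[0-9]+")),
    ("HASH", re.compile(r"#")),
    ("ID", re.compile(r"[a-zA-Z0-9]+")),
]

def tokenize(text, rules=RULES):
    """Longest match wins; rule order only breaks ties of equal length."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None  # (rule name, end offset)
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strictly longer only, so the earlier rule keeps a tie.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(tokenize("RRV0#ABC"))  # first token is ID 'RRV0': 4 characters beat 3
print(tokenize("RRV#ABC"))   # first token is RRV: equal length, RRV is listed first
```

In this simulation the only ways to get an RRV token out of RRV0#ABC are to stop ID from matching the digit that follows, or to restrict what ID may start with; reordering the rules alone changes nothing.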
