ANTLR - Handle whitespace in identifier

ANTLR - Handle whitespace in identifier - antlr4

I am trying to build simple search expression, and couldn't get right answer to below grammar.
Here are my sample search text
LOB WHERE
Line of Business WHERE
Line of Business WHERE
As you can see in above search, first few words reflect search keyword followed by where condition, i want to capture search keyword that can include whitespace. Sharing following sample grammar but doesn't seems to parse properly
sqlsyntax : identifierws 'WHERE';
identifierws : (WSID)+;
WSID: [a-zA-Z0-9 ] ; // match identifiers with space
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Any help in this regard is appreciated.
This is what is happening when I try to parse
Line of Business WHERE
I get following error
line 1:0 no viable alternative at input 'Line'
I get back LineofBusiness text but whitespace got trimmed, i want exact text Line of Business, that is where I am struggling a bit.

The identeriferws rule is consuming all text. Better to prioritize identification of keywords in the lexer:
sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ;
identifierws : (WSID)+;
WHERE: 'WHERE';
EQ : '=' ;
STRING : '\'' .? '\'' ;
WSID: [a-zA-Z0-9 ] ;
WS : [ \t\r\n]+ -> skip ;

For such a simple case I wouldn't use a parser. That's just overkill. All you need to do is to get the current position in the input then search for WHERE (a simple boyer-moore search). Then take the text between start position and WHERE position as your input. Jump over WHERE and set the start position to where you are then. After that do the same search again etc.

Related

How to get ANTLR4 grammar to parse over a single line without requiring line break in the middle?

I'm currently relearning ANTLR and I'm having a bit of an issue with my grammar and parsing is. I'm editing it in IntelliJ IDEA with the ANTLR plugin and I'm using ANTLR version 4.9.2.
My grammar is as follows
grammar Pattern;
pattern:
patternName
patternMeaning
patternMoves;
patternName : 'Name:' NAME ;
patternMeaning : 'Meaning:' NAME ;
patternMoves : 'Moves:' (patternStep)+ ;
patternStep : 'Turn' angle stance;
stance : 'Walking Stance';
angle : ('90'|'180'|'270'|'360') '°' 'anti-'? 'clockwise';
NAME : WORD (' ' WORD)*;
fragment WORD : [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
now when I try and parse the following text, I get the following error line 2:9 mismatched input 'clockwise Walking Stance' expecting {'anti-', 'clockwise'}
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise Walking Stance
However, if I change the text to the below it works without any issues. How can I tweak my grammar to allow me to parse it on one line?
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise
Walking Stance

Your problem is that clockwise Walking Stance is a valid NAME, so it's interpreted as such rather than as an instance of the clockwise keyword followed by the NAME Walking Stance. Adding a line break fixes this because line breaks can't appear in names.
To fix this, you should turn WORD into a lexer rule and NAME into a parser rule. That way the name rule will only be tried in places where the parser actually expects a name, so it won't try to interpret clockwise as part of a name. And the WORD rule won't eat keywords because the match produced by the WORD rule won't be longer than the keyword, so the keyword wins.

If this is your entire grammar, then there are no lexer rules defining the handling of whaitespace. In fact, the are no explicit lexer rules. (ANTLR will create implicit lexer rules for any literal strings in your parser rules (unless the match an already define grammar rule.))
Your grammar is essentially (in ANTLR’s perception)
grammar Pattern;
patternMoves : T_1 (patternStep)+ ;
patternStep : T_2 angle stance;
stance : T_3;
angle : (T_4|T_5|T_6|T_7) T_8 T_9? T_10;
T_1: ‘Moves:’;
T_2: ‘Turn’;
T_3: 'Walking Stance';
T_4: '90';
T_5: '180';
T_6: '270';
T_7: '360';
T_8: '°';
T_9: 'anti-';
T_10: 'clockwise';
ANTLR’s processing takes a stream of characters, passes them to a lexer, which must decide what to do with all characters (even whitespace). The lexer produces a stream of tokens that the parser rules process.
You need some lexer rule that prescribes how to handle whatespace:
WS: [ \t\r\n]+ -> skip;
Is a common way of handling this. It tokenized all whitespace as a WS token, but then skips handing that token to the parser. (This is very handy as you won’t have to sprinkle WS or WS? items all through your grammar where whitespace is expected.
That your plugin accepts you input would imply to me that it may be treating each line of input as a new parse.

How to make antlr4 fully tokenize terminal nodes

I'm trying to use Antlr to make a very simple parser, that basically tokenizes a series of .-delimited identifiers.
I've made a simple grammar:
r : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;
WS : [ \t\r\n]+ -> skip ;
When the parser is generated, I end up with a single terminal node that represents the string instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?
As an example:
. would yield one terminal node whose text is .
.foobar would yield two nodes, a parent with text . and a child with text foobar
.foobar.baz would yield four nodes, a parent with text ., a child with text foobar, a second-level child with text ., and a third-level child with text baz.

Rules starting with a capital letter are Lexer rules.
With the following input file t.text
.
.foobar
.foobar.baz
your grammar (in file Question.g4) produces the following tokens
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[#1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[#2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[#3,22:21='<EOF>',<EOF>,4:0]
The lexer (parser) is greedy. It tries to read as many input characters (tokens) as it can with the rule. The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot, an ID, and again a dot and an ID (due to repetition marker ?), till the NL. That's why each line ends up in a single token.
When compiling the grammar, the error
warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string
comes because the repetition marker of ID is * (which means 0 or more times) instead of +(one or more times).
Now try this grammar :
grammar Question;
r
#init {System.out.println("Question last update 2135");}
: ( structure_selector NL )+ EOF
;
structure_selector
: '.'
| '.' ID structure_selector*
;
ID : [_a-z0-9$]+ ;
NL : [\r\n]+ ;
WS : [ \t]+ -> skip ;
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<'.'>,1:0]
[#1,1:1='\n',<NL>,1:1]
[#2,2:2='.',<'.'>,2:0]
[#3,3:8='foobar',<ID>,2:1]
[#4,9:9='\n',<NL>,2:7]
[#5,10:10='.',<'.'>,3:0]
[#6,11:16='foobar',<ID>,3:1]
[#7,17:17='.',<'.'>,3:7]
[#8,18:20='baz',<ID>,3:8]
[#9,21:21='\n',<NL>,3:11]
[#10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'
and $ grun Question r -gui t.text displays the hierarchical tree structure you are expecting.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointer greatly appreciated,
TIA - Alex

Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

No viable alternative at input ' '

I know this question has been asked before, but I haven't found any solution to my specific problem. I am using Antlr4 with the C# target and I have the following lexer rules:
INT : [0-9]+
;
LETTER : [a-zA-Z_]+
;
WS : [ \t\r\n\u000C]+ -> skip
;
LineComment
: '#' ~[\r\n]* -> skip
;
That are all lexer rules, but there are many parser rules which I will not post here since I don't think it is relevant.
The problem I have is that whitespaces do not get skipped. When I inspect the token stream after the lexer ran my input, the whitespaces are still in there and therefore cause an error. The input I use is relatively basic:
"fd 100"
it parses complete until it reaches this parser rule:
noSignFactor
: ':' ident #NoSignFactorArg
| integer #NoSignFactorInt
| float #NoSignFactorFloat
| BOOLEAN #NoSignFactorBool
| '(' expr ')' #NoSignFactorExpr
| 'not' factor #NoSignFactorNot
;
integer : INT #IntegerInt
;

Start by separating your grammar into a separate lexer grammar and parser grammar. For example, if you have a grammar Foo;, create the following:
Create a file FooLexer.g4, and move all of the lexer rules from Foo.g4 into FooLexer.g4.
Create a file FooParser.g4, and move all of the parser rules from Foo.g4 into FooParser.g4.
Include the following option in FooParser.g4:
options {
tokenVocab=FooLexer;
}
This separation will ensure that your parser isn't silently creating lexer rules for you. In a combined grammar, using a literal such as 'not' in a parser rule will create a lexer rule for you if one does not already exist. When this happens, it's easy to lose track of what kinds of tokens your lexer is able to produce. When you use a separate lexer grammar, you will need to explicitly declare a rule like the following in order to use 'not' in a parser rule.
NOT : 'not';
This should solve the problems with whitespace should you have included the literal ' ' somewhere in a parser rule.

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?

The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

ANTLR - Handle whitespace in identifier - antlr4

The identeriferws rule is consuming all text. Better to prioritize identification of keywords in the lexer: sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ; identifierws : (WSID)+; WHERE: 'WHERE'; EQ : '=' ; STRING : '\'' .? '\'' ; WSID: [a-zA-Z0-9 ] ; WS : [ \t\r\n]+ -> skip ;

Related

How to get ANTLR4 grammar to parse over a single line without requiring line break in the middle?

How to make antlr4 fully tokenize terminal nodes

Token recognition order

No viable alternative at input ' '

ANTLR4 lexer not resolving ambiguity in grammar order

Categories

Resources