Python ANTLR4 extraneous input plus tokens removal

I am trying to parse a text file, and I want to create a grammar that catches specific text blocks, for example:
a) the word 'specificWordA' or 'specWordB' followed by zero or more digits, or
b) the word 'testC' followed by 1 or more digits.
My grammar looks like this:
grammar Hello;
catchExpr : expr+ EOF;
expr : matchAB | matchC;
matchAB : TEXTAB DIGIT*;
matchC : TEXTC DIGIT+;
TEXTAB : ('specificWordA' | 'specWordB') ;
TEXTC : ('testC') ;
DIGIT : NUMBER+ ;
CHARS : ('a'..'z' | 'A'..'Z')+ ;
SPACES : [ \r\t\n] ->skip;
fragment NUMBER: '0'..'9'+ ;
I am using ANTLR4 and I have generated the parser for both Java (to use the TestRig gui command for the AST) and Python 2 (to provide a custom listener to traverse the tree). My file contains the following text:
specificWordA 11
specWordB specWordB specWordB testC 22 not not testD
testD 11
testC teeeeeeeeeest
testD 2
end here
Could someone please help me with the following questions:
1) Does ANTLR4 create nodes by default for every token I have defined in my grammar? How can I remove them to get a simplified version of the AST (see the image below: there are nodes for every sequence of characters that matches the CHARS token)?
2) Why does "testC teeeeeeeeeest testD 2 end here"
match an expression? My rule is the text block 'testC' followed by at least one digit!
3) When I run my code I get the following messages:
line 3:39 extraneous input 'not' expecting {<EOF>, TEXTAB, 'testC'}
line 7:6 mismatched input 'teeeeeeeeeest' expecting {<EOF>, TEXTAB, 'testC'}
What does extraneous input mean? Do I have to change my grammar, or is it just a warning?
Based on these questions,
ANTLR4 extraneous input
ANTLR4: Extraneous Input error
I cannot figure out what is wrong with my grammar!

Related

Antlr4 token recognition error and extraneous input

I'm trying to create an SQL interpreter for my project. I ran into these errors when I ran my program.
line 2:28 token recognition error at: ''a'
line 2:33 token recognition error at: '','
line 2:30 extraneous input 'nna' expecting Value
This is my test sql query:
INSERT INTO teacher VALUES ('Anna', 21);
And this is part of my grammar:
insert: INSERT INTO ValidName VALUES '(' Value (',' Value)* ')' ';' ;
Value: Number | String;
ValidName: [a-z][a-z0-9_]*;
Number: [0-9]+;
String: '\''[^']+'\'';
I tried printing out ctx.children and got this:
[INSERT, INTO, teacher, VALUES, (, nna, 21, ), ;]
Could anyone please help me see what I did wrong?

A couple of things should help:
1 - Value should probably be a parser rule rather than a Lexer rule:
value: Number | String;
(and change the occurrences of Value in your parser rules to value)
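2 - For your String rule, note that ANTLR 4 negates a character set with ~ (as in ~[']), so [^'] does not mean "anything but a quote". It's a bit simpler to use the non-greedy operator to pick up everything until you match the next quote character: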
STRING: '\'' .*? '\'';
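As a rough illustration of what the non-greedy rule buys you, here is a sketch that uses Python's re module as a stand-in for the ANTLR lexer. This is an assumption made purely for illustration: ANTLR does not use Python regexes, and the query below is a hypothetical variation of the one in the question with a second quoted value added. The non-greedy pattern stops at the nearest closing quote, so each quoted value becomes its own match.

```python
import re

# Python analogue of the lexer rule  STRING: '\'' .*? '\'';
# (illustrative only; this is not how ANTLR is implemented)
STRING = re.compile(r"'.*?'")

# Hypothetical query with two quoted values, to make the difference visible.
query = "INSERT INTO teacher VALUES ('Anna', 'Smith', 21);"

# Non-greedy: each match ends at the nearest closing quote.
non_greedy = STRING.findall(query)

# Greedy variant for comparison: runs on to the last quote in the input.
greedy = re.findall(r"'.*'", query)

print(non_greedy)  # ["'Anna'", "'Smith'"]
print(greedy)      # ["'Anna', 'Smith'"]
```

With the greedy form, two adjacent string literals would be swallowed into a single token, which is why the non-greedy operator is the usual choice for quoted strings.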

ANTLR : How to parse fixed length text file based on index position using ANTLR 4?

Input:
101 04200001312345678981107291600A094101US FORD NA TEST COMPANY101
5225TEST COMPANY 11234567898PPDTEST BUYS 110801110801 1098765430000001
The lines above have a fixed length of 94 characters.
Expected output: the ANTLR grammar should parse this input based on index positions.
For example, if the parser finds '1' as the first character of line one, it should recognize the entire line as a separate string, HEADER1.
Likewise, if the parser finds '5' as the first character of line two, it should recognize the entire line as a separate string, HEADER2.

fragment Digit: '0'..'9' ;
fragment Alpha: '_' | 'A'..'Z';
Number: Digit+ ;
Alphanumeric: (Letter | Digit)+ ;
header1: '1' Alphanumeric+
header2: '5' Alphanumeric+
WS
: (' ' | '\t') -> skip //channel (HIDDEN)
;

Which tool are you using for parsing?
I get the following tree when parsing with your grammar using the ANTLR v4 plugin in Android Studio.

How to make antlr4 fully tokenize terminal nodes

I'm trying to use Antlr to make a very simple parser, that basically tokenizes a series of .-delimited identifiers.
I've made a simple grammar:
r : STRUCTURE_SELECTOR ;
STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
ID : [_a-z0-9$]* ;
WS : [ \t\r\n]+ -> skip ;
When the parser is generated, I end up with a single terminal node that represents the string instead of being able to find further STRUCTURE_SELECTORs. I'd like instead to see a sequence (perhaps represented as children of the current node). How can I accomplish this?
As an example:
. would yield one terminal node whose text is .
.foobar would yield two nodes, a parent with text . and a child with text foobar
.foobar.baz would yield four nodes, a parent with text ., a child with text foobar, a second-level child with text ., and a third-level child with text baz.

Rules starting with a capital letter are Lexer rules.
With the following input file t.text
.
.foobar
.foobar.baz
your grammar (in file Question.g4) produces the following tokens
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<STRUCTURE_SELECTOR>,1:0]
[#1,2:8='.foobar',<STRUCTURE_SELECTOR>,2:0]
[#2,10:20='.foobar.baz',<STRUCTURE_SELECTOR>,3:0]
[#3,22:21='<EOF>',<EOF>,4:0]
The lexer (like the parser) is greedy: it tries to consume as many input characters (tokens, for the parser) as the rule allows. The lexer rule STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? can read a dot, an ID, and then, through the optional recursive reference to STRUCTURE_SELECTOR, another dot and ID, and so on until the NL. That's why each line ends up in a single token.
When compiling the grammar, the warning
warning(146): Question.g4:5:0: non-fragment lexer rule ID can match the empty string
appears because the repetition marker of ID is * (zero or more times) instead of + (one or more times).
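To see why each line collapses into one token, here is a minimal sketch that mimics the original lexer rule with a Python regular expression. The regex is my own unrolling of the recursive rule and the helper name first_token is hypothetical; it only imitates the greedy matching and is not how ANTLR works internally.

```python
import re

# Regex equivalent of the recursive lexer rule
#   STRUCTURE_SELECTOR: '.' (ID STRUCTURE_SELECTOR?)? ;
#   ID: [_a-z0-9$]* ;
# Unrolling the recursion gives one or more ('.' followed by a possibly empty ID).
STRUCTURE_SELECTOR = re.compile(r"(?:\.[_a-z0-9$]*)+")

def first_token(line):
    """Return the longest STRUCTURE_SELECTOR match at the start of line."""
    m = STRUCTURE_SELECTOR.match(line)
    return m.group(0) if m else None

for line in [".", ".foobar", ".foobar.baz"]:
    # Each line is consumed whole, matching the single-token output of grun.
    print(repr(line), "->", repr(first_token(line)))
```

Because the whole selector chain is matched inside one lexer rule, the parser never sees the individual dots and identifiers.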
Now try this grammar:
grammar Question;
r
@init {System.out.println("Question last update 2135");}
: ( structure_selector NL )+ EOF
;
structure_selector
: '.'
| '.' ID structure_selector*
;
ID : [_a-z0-9$]+ ;
NL : [\r\n]+ ;
WS : [ \t]+ -> skip ;
$ grun Question r -tokens -diagnostics t.text
[#0,0:0='.',<'.'>,1:0]
[#1,1:1='\n',<NL>,1:1]
[#2,2:2='.',<'.'>,2:0]
[#3,3:8='foobar',<ID>,2:1]
[#4,9:9='\n',<NL>,2:7]
[#5,10:10='.',<'.'>,3:0]
[#6,11:16='foobar',<ID>,3:1]
[#7,17:17='.',<'.'>,3:7]
[#8,18:20='baz',<ID>,3:8]
[#9,21:21='\n',<NL>,3:11]
[#10,22:21='<EOF>',<EOF>,4:0]
Question last update 2135
line 3:7 reportAttemptingFullContext d=1 (structure_selector), input='.'
line 3:7 reportContextSensitivity d=1 (structure_selector), input='.'
and $ grun Question r -gui t.text displays the hierarchical tree structure you are expecting.
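For comparison, the token stream above can be approximated in Python with one regex alternative per token type. This is only a sketch under the assumption that a named-group regex is a fair stand-in for the lexer; the names TOKEN and tokenize are hypothetical, and unlike ANTLR this version silently ignores characters that match no rule.

```python
import re

# One alternative per token of the revised grammar:
#   '.'  |  ID : [_a-z0-9$]+  |  NL : [\r\n]+  |  WS : [ \t]+ -> skip
TOKEN = re.compile(
    r"(?P<DOT>\.)|(?P<ID>[_a-z0-9$]+)|(?P<NL>[\r\n]+)|(?P<WS>[ \t]+)"
)

def tokenize(text):
    """Return (type, text) pairs, dropping WS like the '-> skip' command."""
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group(0)))
    return tokens

print(tokenize(".foobar.baz\n"))
```

Since dots and identifiers are now separate tokens, the nesting is left to the parser rule structure_selector, which is what produces the hierarchical tree.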

ANTLR - Handle whitespace in identifier

I am trying to build a simple search expression, and I couldn't get the grammar below to work.
Here is my sample search text:
LOB WHERE
Line of Business WHERE
Line of Business WHERE
As you can see in the searches above, the first few words are the search keyword, followed by the WHERE condition. I want to capture the search keyword, which can include whitespace. Here is my sample grammar, but it doesn't seem to parse properly:
sqlsyntax : identifierws 'WHERE';
identifierws : (WSID)+;
WSID: [a-zA-Z0-9 ] ; // match identifiers with space
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Any help in this regard is appreciated.
This is what happens when I try to parse
Line of Business WHERE
I get the following error
line 1:0 no viable alternative at input 'Line'
I get back the text LineofBusiness, but the whitespace has been trimmed. I want the exact text Line of Business; that is where I am struggling.

The identifierws rule is consuming all text. Better to prioritize identification of keywords in the lexer:
sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ;
identifierws : (WSID)+;
WHERE: 'WHERE';
EQ : '=' ;
STRING : '\'' .*? '\'' ;
WSID: [a-zA-Z0-9 ] ;
WS : [ \t\r\n]+ -> skip ;

For such a simple case I wouldn't use a parser. That's just overkill. All you need to do is take the current position in the input and search for WHERE (a simple Boyer-Moore search). Then take the text between the start position and the WHERE position as your input. Jump over WHERE, set the start position to where you are then, and repeat the same search, and so on.
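The search loop described above could look roughly like this in Python. It is a minimal sketch: it relies on the built-in str.find rather than a hand-written Boyer-Moore search, and the function name split_on_keyword is hypothetical.

```python
def split_on_keyword(text, keyword="WHERE"):
    """Return (segments before each keyword occurrence, remaining text)."""
    segments = []
    start = 0
    while True:
        pos = text.find(keyword, start)   # search from the current position
        if pos == -1:
            break
        # Text between the start position and the keyword, spaces preserved inside.
        segments.append(text[start:pos].strip())
        start = pos + len(keyword)        # jump over the keyword
    return segments, text[start:].strip()

print(split_on_keyword("Line of Business WHERE"))
# (['Line of Business'], '')
```

Note that this plain substring search would also match WHERE inside a longer word such as ELSEWHERE, so real input may need a word-boundary check.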

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex

Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
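The tie-break described here (longest match wins; when lengths are equal, the rule listed first wins) can be imitated with a few lines of Python. This is a sketch under the assumption that regexes approximate the two lexer rules closely enough; pick_token and the pattern names are hypothetical and not part of ANTLR.

```python
import re

# Regex approximations of the two competing lexer rules.
ID_WORD = r"[A-Za-z0-9](?:-*_*[A-Za-z0-9]+)*"
FIGURATIVE = r"ZEROES|ZERO|SPACES|SPACE"

def pick_token(text, rules):
    """Longest match wins; on equal length the earlier rule in the list wins."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly greater, so a later rule never displaces an equal-length match.
        if m and len(m.group(0)) > best_len:
            best_name, best_len = name, len(m.group(0))
    return best_name

id_first = [("IdWord", ID_WORD), ("FigurativeConstant", FIGURATIVE)]
fig_first = [("FigurativeConstant", FIGURATIVE), ("IdWord", ID_WORD)]

print(pick_token("SPACES", id_first))      # IdWord
print(pick_token("SPACES", fig_first))     # FigurativeConstant
print(pick_token("SPACESHIP", fig_first))  # IdWord (longer match still wins)
```

The last call shows that moving FigurativeConstant up does not break ordinary identifiers that merely start with a reserved word, because the longer match takes precedence over rule order.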
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.
