ANTLR4 token recognition error and extraneous input

I'm trying to create an SQL interpreter for my project. I ran into these errors when I ran my program:
line 2:28 token recognition error at: ''a'
line 2:33 token recognition error at: '','
line 2:30 extraneous input 'nna' expecting Value
This is my test sql query:
INSERT INTO teacher VALUES ('Anna', 21);
And the relevant part of my grammar is:
insert: INSERT INTO ValidName VALUES '(' Value (',' Value)* ')' ';' ;
Value: Number | String;
ValidName: [a-z][a-z0-9_]*;
Number: [0-9]+;
String: '\''[^']+'\'';
I tried printing ctx.children and got this:
[INSERT, INTO, teacher, VALUES, (, nna, 21, ), ;]
Could anyone help me see what I did wrong?

A couple of things should help:
1 - Value should probably be a parser rule rather than a lexer rule:
value: Number | String;
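(and change the references to Value in your other rules to value).
2 - For your String rule, it's a bit simpler to use the non-greedy operator to pick up everything until the closing quote. Note that [^'] is not a negated set in ANTLR: inside a lexer character set, ^ is a literal character, so [^'] matches either ^ or '. A negated set is written ~['].
String: '\'' .*? '\'';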
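To see why the non-greedy form matters, here is a rough analogy using Python's re module. This is only a sketch: ANTLR's lexer is not a regex engine, and the query with two string values is a made-up variation on the one in the question. The greedy versus non-greedy behavior of .* and .*? is comparable, though.

```python
import re

# Hypothetical query with two quoted values (the original had 'Anna', 21).
query = "INSERT INTO teacher VALUES ('Anna', 'Bob');"

# Greedy: runs from the first quote to the LAST quote on the line.
greedy = re.findall(r"'.*'", query)

# Non-greedy: stops at the first closing quote, like '\'' .*? '\'' in ANTLR.
lazy = re.findall(r"'.*?'", query)

print(greedy)
print(lazy)
```

The greedy pattern swallows both values and the comma between them as one match, while the non-greedy one yields each quoted string separately.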

Related

Mismatched input with binary operator parsing

I'm trying to parse an existing language in ANTLR that's currently being parsed using the Ruby library Parslet.
Here is a stripped down version of my grammar:
grammar FilterMin;
filter : condition_set;
condition_set: condition_set_type (property_condition)?;
condition_set_type: '=' | '^=';
property_condition: property_lhs CONDITION_SEPARATOR property_rhs;
property_lhs: QUOTED_STRING;
property_rhs: entity_rhs | contains_rhs;
contains_rhs: CONTAINS_OP '(' contains_value ')';
contains_value: QUOTED_STRING;
entity_rhs: NOT_OP? MATCH_OP? QUOTED_STRING;
// operators
MATCH_OP: '~';
NOT_OP: '^';
CONTAINS_OP: 'contains';
QUOTED_STRING: QUOTE STRING QUOTE;
STRING: (~['\\])*;
QUOTE: '\'';
CONDITION_SEPARATOR: ':';
This fails to parse both ='foo':'bar' and ='foo':contains('bar'), with either: mismatched input ':' expecting ':' or mismatched input ':contains(' expecting ':'.
Why aren't these inputs parsing?
Your STRING rule matches everything that isn't a backslash or a single quote, so it overlaps with all of your other lexical rules except QUOTED_STRING. Since the lexer will always pick the rule that produces the longest match, and that's almost always STRING, your lexer will produce a bunch of STRING tokens and never any CONDITION_SEPARATOR tokens.
Since you never use STRING in your parser rules, it doesn't need to be an actual type of token. In fact, you never want STRING tokens to be generated, you only ever want it to be matched as part of a QUOTED_STRING token. Therefore it should be a fragment.
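The effect can be reproduced with a small Python model of the lexer. This is a sketch under assumptions, not ANTLR itself: tokenize is a hypothetical helper, the regexes approximate the grammar's rules, and the literal tokens from the parser rules are listed first under made-up names. It applies the two rules that matter here: the longest match wins, and on a tie the rule listed first wins.

```python
import re

def tokenize(rules, text):
    """Model of a lexer: longest match wins, ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so earlier rules keep ties and
            # empty matches (STRING can match nothing) are ignored.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at offset %d" % pos)
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Approximation of the question's lexer, in definition order.
STRING_AS_TOKEN = [
    ("'='", r"="), ("'^='", r"\^="), ("'('", r"\("), ("')'", r"\)"),
    ("MATCH_OP", r"~"), ("NOT_OP", r"\^"), ("CONTAINS_OP", r"contains"),
    ("QUOTED_STRING", r"'[^'\\]*'"),
    ("STRING", r"[^'\\]*"),
    ("QUOTE", r"'"),
    ("CONDITION_SEPARATOR", r":"),
]

# With STRING as a fragment it only exists inside QUOTED_STRING.
STRING_AS_FRAGMENT = [r for r in STRING_AS_TOKEN if r[0] != "STRING"]

for rules in (STRING_AS_TOKEN, STRING_AS_FRAGMENT):
    print(tokenize(rules, "='foo':'bar'"))
    print(tokenize(rules, "='foo':contains('bar')"))
```

In the first rule set the ':' comes out as a STRING token (and ':contains(' as one long STRING), which matches the two error messages in the question. With STRING removed as a standalone token, the ':' is a CONDITION_SEPARATOR.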

Python ANTLR4 extraneous input plus tokens removal

I am trying to parse a text file and I want to create a grammar to catch specific text blocks, for example:
a) the word 'specificWordA' or 'specWordB' followed by zero or more digits, or
b) the word 'testC' followed by 1 or more digits.
My grammar looks like this:
grammar Hello;
catchExpr : expr+ EOF;
expr : matchAB | matchC;
matchAB : TEXTAB DIGIT*;
matchC : TEXTC DIGIT+;
TEXTAB : ('specificWordA' | 'specWordB') ;
TEXTC : ('testC') ;
DIGIT : NUMBER+ ;
CHARS : ('a'..'z' | 'A'..'Z')+ ;
SPACES : [ \r\t\n] ->skip;
fragment NUMBER: '0'..'9'+ ;
I am using ANTLR4 and I have compiled the code for both Java (to use the TestRig gui command for the AST) and Python2 (to provide a custom listener to traverse the tree). My file contains the following text:
specificWordA 11
specWordB specWordB specWordB testC 22 not not testD
testD 11
testC teeeeeeeeeest
testD 2
end here
Could someone please help me with the following questions:
1) Does ANTLR4 create nodes by default for every token I have defined in my grammar? How can I remove them so as to get a simplified version of the AST (see image below; there are nodes for every sequence of characters that matches the token CHARS)?
2) Why does "testC teeeeeeeeeest testD 2 end here" match an expression? My rule is the text 'testC' followed by at least one digit!
3) When I run my code I get the following messages:
line 3:39 extraneous input 'not' expecting {<EOF>, TEXTAB, 'testC'}
line 7:6 mismatched input 'teeeeeeeeeest' expecting {<EOF>, TEXTAB, 'testC'}
What does extraneous input mean? Do I have to change my grammar, or is it just a warning?
Based on these questions,
ANTLR4 extraneous input
ANTLR4: Extraneous Input error
I cannot figure out what is wrong with my grammar!

ANTLR - Handle whitespace in identifier

I am trying to build a simple search expression, and couldn't get the right result with the grammar below.
Here is my sample search text:
LOB WHERE
Line of Business WHERE
Line of Business WHERE
As you can see in the searches above, the first few words are the search keyword, followed by the WHERE condition. I want to capture the search keyword, which can include whitespace. Here is my sample grammar, but it doesn't seem to parse properly:
sqlsyntax : identifierws 'WHERE';
identifierws : (WSID)+;
WSID: [a-zA-Z0-9 ] ; // match identifiers with space
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Any help in this regard is appreciated.
This is what is happening when I try to parse
Line of Business WHERE
I get the following error:
line 1:0 no viable alternative at input 'Line'
I get back the text LineofBusiness, but the whitespace was trimmed. I want the exact text Line of Business, and that is where I am struggling.
The identifierws rule is consuming all text. Better to prioritize identification of keywords in the lexer:
sqlsyntax : identifierws WHERE identifierws EQ STRING EOF ;
identifierws : (WSID)+;
WHERE: 'WHERE';
EQ : '=' ;
STRING : '\'' .*? '\'' ;
WSID: [a-zA-Z0-9 ] ;
WS : [ \t\r\n]+ -> skip ;
For such a simple case I wouldn't use a parser; that's overkill. All you need to do is take the current position in the input and search for WHERE (a simple Boyer-Moore search will do). The text between the start position and the WHERE position is your search keyword. Then jump over WHERE, set the start position to just past it, and repeat the search for the next occurrence.
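A minimal sketch of that approach in Python, assuming the keyword is always the literal uppercase WHERE. The function name split_on_keyword is made up for illustration, and str.find stands in for the hand-written Boyer-Moore search.

```python
def split_on_keyword(text, keyword="WHERE"):
    """Return (segments before each keyword, remainder after the last one)."""
    segments, start = [], 0
    while True:
        hit = text.find(keyword, start)      # search from the current position
        if hit == -1:
            break
        # Text between the start position and the keyword, inner spaces kept.
        segments.append(text[start:hit].strip())
        start = hit + len(keyword)           # jump over the keyword
    return segments, text[start:].strip()

print(split_on_keyword("Line of Business WHERE"))
print(split_on_keyword("Line of Business WHERE name = 'x'"))
```

Note that this plain substring search would also hit WHERE inside a longer word such as ANYWHERE. If that can occur in your input, check the characters around the hit before accepting it.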

ANTLR v4 grammar fails to parse due to mismatched EOF

Below is a simple grammar for ANTLR v4. This grammar, when walked, produces the error message
line 1:14 mismatched input '<EOF>' expecting DimensionName
for trivial input such as "sdarsfd integer" (without quotation marks).
SO has mentions of similar errors, and a bug was perhaps filed in the 4.3 timeframe.
I have been using ANTLR 4.5.
Any help/pointer/solution?
/**
A simple parser for a dimension declaration
*/
grammar Simple;
definition : dim;
dim : DimensionName DataType;
DimensionName : LETTER (LETTER)*; // greedy
DataType: 'integer' | 'decimal';
LETTER : [a-zA-Z];
DIGIT : [0-9];
WS: [ \t\n\r]+ -> skip;
You just have to switch the two lexer rules DataType and DimensionName:
...
DataType: 'integer' | 'decimal';
DimensionName : LETTER (LETTER)*; // greedy
...
Because DimensionName matches any run of letters, 'integer' matches both DimensionName and DataType with the same length. When two lexer rules match the same text, ANTLR picks the one defined first, so 'integer' is typed as a DimensionName instead of a DataType. For "sdarsfd integer", the lexer produces two DimensionName tokens, so the dim rule cannot be matched. By switching the two lexer rules, the lexer produces a DimensionName token and a DataType token, which match the dim rule.
Also, you can define LETTER and DIGIT as fragments:
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
Unless you want them to be matched as independent tokens (in your grammar, "a" will be typed as a LETTER).
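The effect of rule order can be modeled in a few lines of Python. This is a simplified sketch, not ANTLR's actual implementation: lex is a hypothetical helper, and the regexes approximate the lexer rules from the question. It keeps the longest match and resolves ties in favor of the rule listed first.

```python
import re

def lex(rules, text):
    """Return (rule name, lexeme) pairs; WS is dropped like '-> skip'."""
    tokens, pos = [], 0
    while pos < len(text):
        matches = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                matches.append((name, m.group()))
        if not matches:
            raise ValueError("no rule matches at offset %d" % pos)
        # max() keeps the first of several equally long matches,
        # which models "the rule defined first wins a tie".
        name, lexeme = max(matches, key=lambda nm: len(nm[1]))
        if name != "WS":
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

ORIGINAL = [
    ("DimensionName", r"[a-zA-Z]+"),
    ("DataType", r"integer|decimal"),
    ("LETTER", r"[a-zA-Z]"),
    ("DIGIT", r"[0-9]"),
    ("WS", r"[ \t\n\r]+"),
]
# Same rules with DataType moved in front of DimensionName.
SWAPPED = [ORIGINAL[1], ORIGINAL[0]] + ORIGINAL[2:]

print(lex(ORIGINAL, "sdarsfd integer"))
print(lex(SWAPPED, "sdarsfd integer"))
print(lex(SWAPPED, "sdarsfd integers"))
```

With the original order both words come out as DimensionName; with the swapped order the second one becomes a DataType. The last call shows that the swap does not break longer identifiers: "integers" is still a DimensionName, because the longest match takes priority over rule order.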

What do parentheses without quantifiers do in lexer rules?

Assume the following grammar:
grammar Demo;
start: START_BLOCK SEPERATOR;
START_BLOCK: '-.-.-';
ID: ( LETTER SEPERATOR ) (LETTER SEPERATOR)+;
fragment LETTER: L_A|L_K;
fragment L_A: '.-';
fragment L_K: '-.-';
SEPERATOR: '!';
I pass the following input to the grammar: -.-.-!
I'd expect ANTLR to recognize the tokens START_BLOCK and SEPERATOR, but instead it finds a single token of type ID.
I figured out that I can fix the problem by removing the first pair of parentheses in the lexer rule "ID":
ID: LETTER SEPERATOR (LETTER SEPERATOR)+;
Now everything works fine, but why? What did the parentheses above do to my grammar?
This is a bug in ANTLR 4 which is fixed for the 4.0.1 release. See: https://github.com/antlr/antlr4/issues/224
