What do parentheses without quantifiers do in lexer rules?

Assume the following grammar:
grammar Demo;
start: START_BLOCK SEPERATOR;
START_BLOCK: '-.-.-';
ID: ( LETTER SEPERATOR ) (LETTER SEPERATOR)+;
fragment LETTER: L_A|L_K;
fragment L_A: '.-';
fragment L_K: '-.-';
SEPERATOR: '!';
I pass the following input to the grammar: -.-.-!
I'd expect ANTLR to recognize the tokens START_BLOCK and SEPERATOR. Instead, it finds a single token of type ID.
I found that I can fix the problem by removing the first pair of parentheses in lexer rule "ID":
ID: LETTER SEPERATOR (LETTER SEPERATOR)+;
Now everything works fine, but why? What did the parentheses above do to my grammar?

This is a bug in ANTLR 4 which is fixed for the 4.0.1 release. See: https://github.com/antlr/antlr4/issues/224

Related

Recognizing euler's constant (e) only when relevant

I'm learning ANTLR4 to write a parser for a simple language specific to the app developed by the company. So far I've managed to get arithmetic operations, logic operations, and conditional branches working. When tackling variables, though, I ran into a problem. The language defines multiple mathematical constants, such as 'e'. When parsing variables, the parser would recognize the letter e as the constant and not as part of the variable.
Below is a small test grammar I wrote to test this specific case, the euler and letter parser rules are there for visual clarity in the trees below
grammar Test;
r: str '\r\n' EOF;
str: euler | (letter)* ;
euler: EULER;
letter: LETTER;
EULER: 'e';
LETTER: [a-zA-Z];
Recognition of different strings with this grammar:
"e"
"test"
"qsdf"
"eee"
I thought maybe parser rule precedence had something to do with it, but whatever order the parser rules are in, the output is the same. Swapping the lexer rules allows for correct recognition of "test", but recognizes "e" using the letter rule and not the euler rule. I also thought about defining EULER as:
EULER: ~[a-zA-Z] 'e' ~[a-zA-Z]
but this wouldn't recognize var a=e correctly. Another rule I have in my lexer is ELSE: 'else', which recognizes the 'else' keyword and doesn't conflict with rule EULER. This is because ANTLR recognizes the longest input possible, but then why doesn't it recognize "test" as (r (str (letter t) (letter e) (letter s) (letter t)) \r\n <EOF>) as it would for "qsdf"?

You should not have a lexer rule like LETTER that matches a single letter and then "glue" these letters together in a parser rule. Instead, match a variable (consisting of multiple letters) as a single lexer rule:
EULER: 'e';
VARIABLE: [a-zA-Z]+;
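To see why a multi-letter lexer rule behaves differently, here is a minimal Python sketch of the two rules an ANTLR 4 lexer uses to pick a token: the longest match wins, and on a tie the rule defined first wins. The `tokenize` helper is hypothetical, written only for this illustration; it models those two rules with Python regexes and is not ANTLR itself.

```python
import re

def tokenize(rules, text):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule only wins with a longer match.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Lexer rules from the question: every 'e' becomes its own EULER token.
question_rules = [("EULER", "e"), ("LETTER", "[a-zA-Z]")]
# Lexer rules from this answer: a whole word becomes one VARIABLE token.
answer_rules = [("EULER", "e"), ("VARIABLE", "[a-zA-Z]+")]

print(tokenize(question_rules, "test"))
# [('LETTER', 't'), ('EULER', 'e'), ('LETTER', 's'), ('LETTER', 't')]
print(tokenize(answer_rules, "test"))  # [('VARIABLE', 'test')]
print(tokenize(answer_rules, "e"))     # [('EULER', 'e')]
```

With the single-letter rules, the 'e' inside "test" is a tie between EULER and LETTER, and EULER is listed first. With VARIABLE, the whole word is a longer match than 'e' alone, so EULER only wins for a stand-alone "e".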

I suggest changing your grammar to this:
grammar Test;
r: str '\n' EOF;
str: euler | WORD ;
euler: EULER;
EULER: 'e';
WORD: [a-zA-Z]+;

It appears you wanted a stand-alone "e" to be an euler element, and any other word to be a letter element, but that's not what you coded. Your grammar is doing exactly what you told it to do: Match every "e" as an EULER token (and therefore an euler element), and any other letter as a LETTER token (and therefore a letter element), and build strs out of those two tokens.
An ANTLR4 lexer tokenizes the input stream, trying to build the longest tokens possible, and processing the tokenization rules in the order you code them. Thus EULER will capture every "e", and LETTER will capture "a"-"d", "f"-"z", and "A"-"Z". An ANTLR4 parser maps the stream of tokens (from the lexer) into elements based on the order of tokens and the rules you code. Since the parser will never get a LETTER token for "e", your str elements will always get chopped apart at the "e"s.
The fix for this is to code a lexer rule that collects sequences of letters that aren't stand-alone "e"s into a LETTER token (or, as @pavel-ganelin says, a WORD), and to present that to the parser instead of the individual letters. It's a little more complicated than that, though, because you probably want "easy" to be the WORD "easy", not an EULER ("e") followed by the WORD "asy". So, you need to ensure that the "e" starting a string of letters isn't captured as an EULER token. You do that by ensuring that the WORD lexer rule comes before the EULER rule, and that it ignores stand-alone "e"s:
grammar Test;
r: str '\r\n' EOF;
str: euler | word ;
euler: EULER;
word: WORD;
WORD: ('e' [a-zA-Z]+) | ([a-df-zA-Z] [a-zA-Z]*);
EULER: 'e';

Antlr4 token recognition error and extraneous input

I'm trying to create SQL interpreter for my project. I ran into these errors when I run my program.
line 2:28 token recognition error at: ''a'
line 2:33 token recognition error at: '','
line 2:30 extraneous input 'nna' expecting Value
This is my test sql query:
INSERT INTO teacher VALUES ('Anna', 21);
And the partial of my grammar is:
insert: INSERT INTO ValidName VALUES '(' Value (',' Value)* ')' ';' ;
Value: Number | String;
ValidName: [a-z][a-z0-9_]*;
Number: [0-9]+;
String: '\''[^']+'\'';
I try to print out ctx.children and got this:
[INSERT, INTO, teacher, VALUES, (, nna, 21, ), ;]
Could anyone please help me see what I did wrong?

A couple of things should help:
1 - Value should probably be a parser rule rather than a Lexer rule:
value: Number | String;
(and change the Value references in your rules to value)
2 - In ANTLR 4, [^'] is not a negated set: it matches only the characters ^ and ' (negation is written ~[']). For your String rule, it's a bit simpler to use the non-greedy operator to pick up everything until the next quote:
String: '\'' .*? '\'';
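As a rough analogy for what the non-greedy operator does, here is a small Python sketch. It uses Python's `re` module, whose `.*?` behaves similarly to ANTLR's for this purpose; the two-string input is a made-up variation of the query, chosen to show the difference.

```python
import re

# Hypothetical input with two quoted strings on one line.
values = "('Anna', 'Bob')"

# Non-greedy: stop at the first closing quote after each opening quote.
non_greedy = re.findall(r"'.*?'", values)
# Greedy: run on to the last quote on the line.
greedy = re.findall(r"'.*'", values)

print(non_greedy)  # ["'Anna'", "'Bob'"]
print(greedy)      # ["'Anna', 'Bob'"]
```

The greedy form swallows both strings and the comma between them as a single match, which is why the non-greedy form is the one to use for quoted strings.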

Mismatched input with binary operator parsing

I'm trying to parse an existing language in ANTLR that's currently being parsed using the Ruby library Parslet.
Here is a stripped down version of my grammar:
grammar FilterMin;
filter : condition_set;
condition_set: condition_set_type (property_condition)?;
condition_set_type: '=' | '^=';
property_condition: property_lhs CONDITION_SEPARATOR property_rhs;
property_lhs: QUOTED_STRING;
property_rhs: entity_rhs | contains_rhs;
contains_rhs: CONTAINS_OP '(' contains_value ')';
contains_value: QUOTED_STRING;
entity_rhs: NOT_OP? MATCH_OP? QUOTED_STRING;
// operators
MATCH_OP: '~';
NOT_OP: '^';
CONTAINS_OP: 'contains';
QUOTED_STRING: QUOTE STRING QUOTE;
STRING: (~['\\])*;
QUOTE: '\'';
CONDITION_SEPARATOR: ':';
This fails to parse both ='foo':'bar' and ='foo':contains('bar'), with either mismatched input ':' expecting ':' or mismatched input ':contains(' expecting ':'.
Why aren't these inputs parsing?

Your STRING rule matches everything that isn't a backslash or a single quote. So it overlaps with all of your other lexical rules except QUOTED_STRING. Since the lexer will always pick the rule that produces the longest match, and that's almost always STRING, your lexer will produce a bunch of STRING tokens and never any CONDITION_SEPARATOR tokens.
Since you never use STRING in your parser rules, it doesn't need to be an actual type of token. In fact, you never want STRING tokens to be generated, you only ever want it to be matched as part of a QUOTED_STRING token. Therefore it should be a fragment.
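The effect can be reproduced with a minimal Python sketch of the lexer's selection rule (longest match wins, ties go to the rule defined first). The `tokenize` helper is hypothetical and only models that rule. `T__0` stands in for the implicit '=' token, which is assumed here to take priority over the explicit lexer rules. Only the rules needed for ='foo':'bar' are included.

```python
import re

def tokenize(rules, text):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Empty matches never win because best_len starts at 0.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

QUOTED = ("QUOTED_STRING", r"'[^'\\]*'")
# STRING as a real token rule: it can match ':' on its own.
as_token = [("T__0", "="), QUOTED, ("STRING", r"[^'\\]*"),
            ("CONDITION_SEPARATOR", ":")]
# STRING as a fragment: it no longer produces tokens by itself.
as_fragment = [("T__0", "="), QUOTED, ("CONDITION_SEPARATOR", ":")]

print(tokenize(as_token, "='foo':'bar'")[2])     # ('STRING', ':')
print(tokenize(as_fragment, "='foo':'bar'")[2])  # ('CONDITION_SEPARATOR', ':')
```

In the first case the ':' is a tie between STRING and CONDITION_SEPARATOR, and STRING is defined first, which matches the puzzling "mismatched input ':' expecting ':'" message: the text is ':' but the token type is STRING.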

ANTLR v4 grammar fails to parse due to mismatched EOF

Below is a simple ANTLR v4 grammar. When walked, this grammar produces the error message
line 1:14 mismatched input '<EOF>' expecting DimensionName
for trivial input such as "sdarsfd integer" (without quotation marks).
SO mentions similar errors, and a bug may have been filed in the 4.3 timeframe.
I have been using ANTLR 4.5.
Any help/pointer/solution?
/**
A simple parser for a dimension declaration
*/
grammar Simple;
definition : dim;
dim : DimensionName DataType;
DimensionName : LETTER (LETTER)*; // greedy
DataType: 'integer' | 'decimal';
LETTER : [a-zA-Z];
DIGIT : [0-9];
WS: [ \t\n\r]+ -> skip;

You just have to switch the two lexer rules DataType and DimensionName
...
DataType: 'integer' | 'decimal';
DimensionName : LETTER (LETTER)*; // greedy
...
Because DimensionName matches any sequence of letters, 'integer' matches both DimensionName and DataType with the same length. ANTLR then picks the rule defined first, so 'integer' is typed as a DimensionName instead of a DataType. For "sdarsfd integer", the lexer produces two DimensionName tokens, so the dim rule cannot be matched. After switching the two lexer rules, the lexer produces a DimensionName token and a DataType token, which match the dim rule.
Also, you can define LETTER and DIGIT as fragment:
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
Do this unless you want them matched as independent tokens (in your grammar, "1" is typed as a DIGIT; a lone "a" is still a DimensionName, since that rule also matches it and comes first).
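The effect of swapping the two rules can be checked with a minimal Python sketch. The `tokenize` helper is hypothetical: it models only the lexer's selection rule (longest match wins, ties go to the rule listed first) plus a crude stand-in for `-> skip`.

```python
import re

def tokenize(rules, text, skip=("WS",)):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() - pos > best_len:  # ties keep the earlier rule
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name not in skip:  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

DIMENSION = ("DimensionName", "[a-zA-Z]+")
DATATYPE = ("DataType", "integer|decimal")
WS = ("WS", r"[ \t\n\r]+")

original = tokenize([DIMENSION, DATATYPE, WS], "sdarsfd integer")
swapped = tokenize([DATATYPE, DIMENSION, WS], "sdarsfd integer")
print(original)  # [('DimensionName', 'sdarsfd'), ('DimensionName', 'integer')]
print(swapped)   # [('DimensionName', 'sdarsfd'), ('DataType', 'integer')]
```

Note that a longer word such as "integers" would still come out as a DimensionName after the swap, because the longest match is considered before rule order.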

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?

The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
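Both cases can be traced with a minimal Python sketch of that policy. The `tokenize` helper is hypothetical and models only "longest match first, then grammar order"; the rule lists are transcribed from the Java example above and from the Tiny grammar in the question.

```python
import re

def tokenize(rules, text):
    """Toy lexer: longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Grammar order only matters when the lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

java_rules = [("CLASS", "class"), ("ID", "[a-zA-Z_][a-zA-Z0-9_]*")]
tiny_rules = [("RRV", "RRV"), ("N", "[0-9]+"), ("HASH", "#"),
              ("ID", "[a-zA-Z0-9]+")]

print(tokenize(java_rules, "class"))      # [('CLASS', 'class')]
print(tokenize(java_rules, "className"))  # [('ID', 'className')]
print(tokenize(tiny_rules, "RRV0#ABC"))
# [('ID', 'RRV0'), ('HASH', '#'), ('ID', 'ABC')]
```

The last line reproduces the token stream from the test rig: ID matches four characters of RRV0 while RRV matches only three, so rule order never comes into play.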
