Recognizing euler's constant (e) only when relevant - antlr4

I'm learning ANTLR4 to write a parser for a simple language specific to the app developed by the company. So far I've managed to get arithmetic operations, logic operations, and conditional branches working. When tackling variables, though, I ran into a problem. The language defines multiple mathematical constants, such as 'e'. When parsing variables, the parser recognizes the letter e as the constant and not as part of the variable name.
Below is a small test grammar I wrote to test this specific case. The euler and letter parser rules are there for visual clarity in the trees below:
grammar Test;
r: str '\r\n' EOF;
str: euler | (letter)* ;
euler: EULER;
letter: LETTER;
EULER: 'e';
LETTER: [a-zA-Z];
Recognition of different strings with this grammar:
"e"
"test"
"qsdf"
"eee"
I thought maybe parser rule precedence had something to do with it, but whatever order the parser rules are in, the output is the same. Swapping the lexer rules allows for correct recognition of "test", but recognizes "e" using the letter rule and not the euler rule. I also thought about defining EULER as:
EULER: ~[a-zA-Z] 'e' ~[a-zA-Z]
but this wouldn't recognize var a=e correctly. Another rule I have in my lexer is ELSE: 'else', which recognizes the 'else' keyword. It works and doesn't conflict with the EULER rule. This is because ANTLR recognizes the longest input possible, but then why doesn't it recognize "test" as (r (str (letter t) (letter e) (letter s) (letter t)) \r\n <EOF>) as it would for "qsdf"?

You should not have a lexer rule like LETTER that matches a single letter and then "glue" these letters together in a parser rule. Instead, match a variable (consisting of multiple letters) as a single lexer rule:
EULER: 'e';
VARIABLE: [a-zA-Z]+;
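The effect of these two rules can be checked with a toy tokenizer. This is only a sketch in Python, not ANTLR itself: it assumes the usual lexer behaviour (longest match wins, and on a tie the rule defined first wins), and the `RULES` table and `tokenize` function are hypothetical names made up for the illustration.

```python
import re

# Lexer rules in definition order, as (token name, regex) pairs.
# Hypothetical stand-ins for: EULER: 'e';  VARIABLE: [a-zA-Z]+;
RULES = [
    ("EULER", re.compile(r"e")),
    ("VARIABLE", re.compile(r"[a-zA-Z]+")),
]

def tokenize(text):
    """Longest match wins; on a tie, the rule defined first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("e"))     # [('EULER', 'e')]
print(tokenize("test"))  # [('VARIABLE', 'test')]
print(tokenize("eee"))   # [('VARIABLE', 'eee')]
```

A lone "e" is a tie between both rules, so the earlier EULER rule takes it; anything longer can only be matched in full by VARIABLE.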

I suggest changing your grammar to this:
grammar Test;
r: str '\n' EOF;
str: euler | WORD ;
euler: EULER;
EULER: 'e';
WORD: [a-zA-Z]+;

It appears you wanted a stand-alone "e" to be an euler element, and any other word to be a letter element, but that's not what you coded. Your grammar is doing exactly what you told it to do: Match every "e" as an EULER token (and therefore an euler element), and any other letter as a LETTER token (and therefore a letter element), and build strs out of those two tokens.
An ANTLR4 lexer tokenizes the input stream, trying to build the longest tokens possible, and processing the tokenization rules in the order you code them. Thus EULER will capture every "e", and LETTER will capture "a"-"d", "f"-"z", and "A"-"Z". An ANTLR4 parser maps the stream of tokens (from the lexer) into elements based on the order of tokens and the rules you code. Since the parser will never get a LETTER token for "e", your str elements will always get chopped apart at the "e"s.
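To make the tokenization described above concrete, here is a rough Python simulation of the question's two lexer rules. It is a sketch under the assumption that longest match wins and ties go to the earlier rule; `tokenize`, `EULER_FIRST` and `LETTER_FIRST` are hypothetical names, not part of ANTLR.

```python
import re

def tokenize(text, rules):
    """Toy maximal-munch lexer: longest match wins, ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in rules:
            m = re.compile(regex).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# The question's rules in both orders.
EULER_FIRST = [("EULER", r"e"), ("LETTER", r"[a-zA-Z]")]
LETTER_FIRST = [("LETTER", r"[a-zA-Z]"), ("EULER", r"e")]

print(tokenize("test", EULER_FIRST))
# [('LETTER', 't'), ('EULER', 'e'), ('LETTER', 's'), ('LETTER', 't')]
print(tokenize("e", LETTER_FIRST))
# [('LETTER', 'e')]
```

Both rules only ever match one character, so the order of the rules is the only thing that decides which token an "e" becomes. That is why swapping them fixes "test" but breaks a stand-alone "e".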
The fix for this is to code a lexer rule that collects sequences of letters that aren't stand-alone "e"s into a LETTER token (or, as #pavel-ganelin says, a WORD), and to present that to the parser instead of the individual letters. It's a little more complicated than that, though, because you probably want "easy" to be the WORD "easy", not an EULER ("e") followed by the WORD "asy". So, you need to ensure that the "e" starting a string of letters isn't captured as an EULER token. You do that by ensuring that the WORD lexer rule comes before the EULER rule, and that it ignores stand-alone "e"s:
grammar Test;
r: str '\r\n' EOF;
str: euler | word ;
euler: EULER;
word: WORD;
WORD: ('e' [a-zA-Z]+) | ([a-df-zA-Z] [a-zA-Z]*);
EULER: 'e';

Related

How to get ANTLR4 grammar to parse over a single line without requiring line break in the middle?

I'm currently relearning ANTLR and I'm having a bit of an issue with my grammar and parsing. I'm editing it in IntelliJ IDEA with the ANTLR plugin and I'm using ANTLR version 4.9.2.
My grammar is as follows:
grammar Pattern;
pattern:
patternName
patternMeaning
patternMoves;
patternName : 'Name:' NAME ;
patternMeaning : 'Meaning:' NAME ;
patternMoves : 'Moves:' (patternStep)+ ;
patternStep : 'Turn' angle stance;
stance : 'Walking Stance';
angle : ('90'|'180'|'270'|'360') '°' 'anti-'? 'clockwise';
NAME : WORD (' ' WORD)*;
fragment WORD : [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Now when I try to parse the following text, I get this error: line 2:9 mismatched input 'clockwise Walking Stance' expecting {'anti-', 'clockwise'}
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise Walking Stance
However, if I change the text to the below it works without any issues. How can I tweak my grammar to allow me to parse it on one line?
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise
Walking Stance
Your problem is that clockwise Walking Stance is a valid NAME, so it's interpreted as such rather than as an instance of the clockwise keyword followed by the NAME Walking Stance. Adding a line break fixes this because line breaks can't appear in names.
To fix this, you should turn WORD into a lexer rule and NAME into a parser rule. That way the name rule will only be tried in places where the parser actually expects a name, so it won't try to interpret clockwise as part of a name. And the WORD rule won't eat keywords because the match produced by the WORD rule won't be longer than the keyword, so the keyword wins.
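The difference between the two designs can be sketched with a toy tokenizer in Python. This is not ANTLR; it assumes longest-match-wins with ties going to the earlier rule, and that the literals from the parser rules are ordered before the named lexer rules. The rule tables and the `tokenize` helper are hypothetical and only cover the tokens needed for this one line of input.

```python
import re

def tokenize(text, rules, skip=("WS",)):
    """Toy maximal-munch lexer; tokens named in `skip` are dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in rules:
            m = re.compile(regex).match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

LITERALS = [("'clockwise'", r"clockwise"), ("'Walking Stance'", r"Walking Stance")]
WS = [("WS", r"[ \t\r\n]+")]

# Original design: NAME is a lexer rule that may contain spaces.
WITH_NAME = LITERALS + [("NAME", r"[a-zA-Z]+(?: [a-zA-Z]+)*")] + WS
# Suggested design: WORD is the lexer rule; names are assembled by the parser.
WITH_WORD = LITERALS + [("WORD", r"[a-zA-Z]+")] + WS

line = "clockwise Walking Stance"
print(tokenize(line, WITH_NAME))
# [('NAME', 'clockwise Walking Stance')]
print(tokenize(line, WITH_WORD))
# [("'clockwise'", 'clockwise'), ("'Walking Stance'", 'Walking Stance')]
```

With NAME as a lexer rule, the 24-character match beats the 9-character keyword. With WORD, the keyword ties with WORD on "clockwise" and wins by being defined first.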
If this is your entire grammar, then there are no lexer rules defining the handling of whitespace. In fact, there are no explicit lexer rules. (ANTLR will create implicit lexer rules for any literal strings in your parser rules, unless they match an already defined grammar rule.)
Your grammar is essentially (in ANTLR's perception):
grammar Pattern;
patternMoves : T_1 (patternStep)+ ;
patternStep : T_2 angle stance;
stance : T_3;
angle : (T_4|T_5|T_6|T_7) T_8 T_9? T_10;
T_1: 'Moves:';
T_2: 'Turn';
T_3: 'Walking Stance';
T_4: '90';
T_5: '180';
T_6: '270';
T_7: '360';
T_8: '°';
T_9: 'anti-';
T_10: 'clockwise';
ANTLR’s processing takes a stream of characters, passes them to a lexer, which must decide what to do with all characters (even whitespace). The lexer produces a stream of tokens that the parser rules process.
You need some lexer rule that prescribes how to handle whitespace:
WS: [ \t\r\n]+ -> skip;
This is a common way of handling it. It tokenizes all whitespace as WS tokens, but then skips handing those tokens to the parser. (This is very handy, as you won't have to sprinkle WS or WS? items all through your grammar wherever whitespace is expected.)
That your plugin accepts your input would imply to me that it may be treating each line of input as a new parse.

Error in the recognizer grammar for analyzing data type and variable descriptions in ANTLR4

I need to implement a parser for this type of logic: [image: the specified grammar]
The S character is the initial character of the grammar; L, T, R, V, K, D, F, and E denote nonterminal characters. The terminal character c corresponds to one of the two scalar types specified in the task. The terminal character t corresponds to one of the data types that can be described in the type section.
I created the following grammar:
grammar Parse;
compileString: S+;
S: TYPE L VAR R;
L: T (SEPARATOR|SEPARATOR L);
R: V (SEPARATOR|SEPARATOR R);
V: [a-zA-Z] ([a-zA-Z]| [0-9]|'_')* DEFINITION (D|C);
T: D|C;
TYPE:'type';
VAR:'var';
D: // acceptable data types
'struct'
| 'union'
| 'array'
;
C: 'byte'
|'word' //scalar type
;
SEPARATOR:';';
DEFINITION :':';
WS : [ \t\n\r]+ -> skip ; // whitespaces
But when I try to execute it for the construction: "type byte; var p1:word;", I get the following output:
Tokens:
[#0,0:3='type',<6>,1:0]
[#1,5:9='byte;',<2>,1:5]
[#2,11:13='var',<7>,1:11]
[#3,15:22='p1:word;',<3>,1:15]
[#4,23:22='<EOF>',<-1>,1:23]
Parse Tree:
compileString (
<Error>"type"
<Error>"byte;"
<Error>"var"
<Error>"p1:word;"
)
I do not understand what the problem may be. Debugging was performed in VS Code with the ANTLR plugin. I would be grateful for any answer!
In ANTLR, lexer rules start with capital letters and parser rules with lower-case letters. So all of your rules except compileString are lexer rules.
S: TYPE L VAR R; does not match the input type byte; var p1:word; because there are spaces in it and nothing in the definition of S matches spaces. You're probably thinking that shouldn't matter because you're skipping spaces, but tokens are only skipped between lexer rules not inside of them. So it would work if S were a parser rule, but not as a lexer rule.
The same applies to spaces between the separator and L/R in L and R.
PS: I strongly suggest giving your rules longer names, as it is quite hard to follow your grammar. You might also consider using the + operator in L and R instead of recursion.
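The token stream from the question can be reproduced with a toy tokenizer. The following Python sketch is not ANTLR: it assumes longest match wins with ties going to the earlier rule, rewrites the recursive L and R rules as repetitions, and uses made-up names (`RULES`, `tokenize`).

```python
import re

# The question's rules rewritten as regexes; since every rule starts with a
# capital letter, all of them are lexer rules, and none of them allows spaces.
D = r"(?:struct|union|array)"
C = r"(?:byte|word)"
T = rf"(?:{D}|{C})"
L = rf"(?:{T};)+"                          # T SEPARATOR, repeated
V = rf"[a-zA-Z][a-zA-Z0-9_]*:(?:{D}|{C})"
R = rf"(?:{V};)+"                          # V SEPARATOR, repeated
S = rf"type{L}var{R}"

RULES = [(name, re.compile(rx)) for name, rx in [
    ("S", S), ("L", L), ("R", R), ("V", V), ("T", T),
    ("TYPE", r"type"), ("VAR", r"var"), ("D", D), ("C", C),
    ("SEPARATOR", r";"), ("DEFINITION", r":"), ("WS", r"[ \t\n\r]+"),
]]

def tokenize(text):
    """Longest match wins, ties go to the earlier rule, WS tokens are skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("type byte; var p1:word;"))
# [('TYPE', 'type'), ('L', 'byte;'), ('VAR', 'var'), ('R', 'p1:word;')]
print(tokenize("typebyte;varp1:word;"))
# [('S', 'typebyte;varp1:word;')]
```

The first result has the same four lexemes as the token dump in the question. S only matches once every space is removed, which is what "tokens are only skipped between lexer rules" means in practice.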

how to match a sequence that has no separation with whitespace

The rule I am trying to match is: hello followed by a sequence of characters. If that sequence contains a letter, it should match the str rule; otherwise it should match the num rule.
For example:
hello123 - 123 should be matched by num rule
hello1a3 - 1a3 should be matched by the str rule
The grammar I wrote is below:
grammar Hello;
r: 'hello'seq;
// seq: str | integ;
seq: num | str;
num : DIGITS;
str : CHARS;
DIGITS: [0-9]+;
CHARS : [0-9a-zA-Z]+;
WS : [ \t\n\r]+ -> skip;
While trying to visualize the parse tree (using grun) (against the first input example above) I got the below parse tree:
However, if the input had a space in between, there was no problem. Please explain why the error occurs.
Lexing in ANTLR (as well as most lexer generators) works according to the maximum munch rule, which says that it always applies the lexer rule that could match the longest prefix of the current input. For the input hello123, the rule 'hello' would match hello, whereas the rule CHARS would match the entire input hello123. Therefore CHARS produces the longer match and is chosen over 'hello'.
If your CHARS and DIGITS tokens can only appear after a 'hello' token, you can use lexer modes to make it so that these rules are only available after a 'hello' has been matched.
Otherwise, to get the behaviour you want, your best bet would probably be to create a single lexer rule that matches 'hello' [0-9a-zA-Z]* and then take apart the tokens generated by that in a separate step. Though it all depends on why you need this.
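As an illustration of the "separate step" idea, here is a minimal Python sketch that takes the text of one combined token apart after lexing. The function name `split_hello` and the "num"/"str" labels are assumptions for the example. Rejecting an empty tail is also an assumption, made because the original DIGITS and CHARS rules require at least one character.

```python
import re

# Mirrors the suggested single lexer rule: 'hello' [0-9a-zA-Z]*
HELLO_SEQ = re.compile(r"hello([0-9a-zA-Z]*)")

def split_hello(token_text):
    """Split a combined token into its tail and classify the tail."""
    m = HELLO_SEQ.fullmatch(token_text)
    if m is None:
        raise ValueError(f"not a hello token: {token_text!r}")
    rest = m.group(1)
    if not rest:
        raise ValueError("nothing follows 'hello'")
    # The regex only admits ASCII letters and digits, so isdigit() is True
    # exactly when the tail contains no letter.
    kind = "num" if rest.isdigit() else "str"
    return kind, rest

print(split_hello("hello123"))  # ('num', '123')
print(split_hello("hello1a3"))  # ('str', '1a3')
```

In an ANTLR setup this kind of check would typically live in a listener or visitor that inspects the token text, but where it goes depends on why the distinction is needed.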

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say in the first case a letter followed by zero or more alphanumeric characters, or i
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
[L],
{ letter(L) },
identifier_rest.
identifier_rest --> [].
identifier_rest -->
[I],
{ letter_or_digit(I) },
identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.
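For readers more at home outside Prolog, the same "letter, then letters or digits" pattern can be sketched in Python. This is only an analogue of the DCG above, not a translation that runs in SWI-Prolog, and it assumes ASCII letters and digits, which may be narrower than what char_type/2 accepts. The names `is_identifier`, `LETTERS` and `LETTERS_OR_DIGITS` are made up for the example.

```python
import string

LETTERS = set(string.ascii_letters)
LETTERS_OR_DIGITS = LETTERS | set(string.digits)

def is_identifier(word):
    """True if word is a letter followed by zero or more letters or digits."""
    chars = list(word)  # plays the role of atom_chars/2
    if not chars or chars[0] not in LETTERS:
        return False    # the first element must satisfy letter/1
    # The remaining elements must all satisfy letter_or_digit/1.
    return all(c in LETTERS_OR_DIGITS for c in chars[1:])

words = ["someIdentifier1", "=", "3"]
print([w for w in words if is_identifier(w)])  # ['someIdentifier1']
```

As in the DCG, the split into characters happens only inside the identifier check, so the rest of the parser can keep working on whole words.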

What do parentheses without quantifiers do in lexer rules?

Assume the following grammar:
grammar Demo;
start: START_BLOCK SEPERATOR;
START_BLOCK: '-.-.-';
ID: ( LETTER SEPERATOR ) (LETTER SEPERATOR)+;
fragment LETTER: L_A|L_K;
fragment L_A: '.-';
fragment L_K: '-.-';
SEPERATOR: '!';
I pass the following input to the grammar: -.-.-!
I'd expect that ANTLR recognizes the tokens START_BLOCK and SEPERATOR. But instead it finds a single Token of type ID.
I figured out that I can fix the problem by removing the first pair of parentheses in lexer rule "ID":
ID: LETTER SEPERATOR (LETTER SEPERATOR)+;
Now everything works fine, but why? What did the parentheses above do to my grammar?
This is a bug in ANTLR 4 which is fixed for the 4.0.1 release. See: https://github.com/antlr/antlr4/issues/224
