how to match a sequence that has no separation with whitespace - antlr4

The rule I am trying to match is: hello followed by a sequence of characters. If that sequence contains a letter, it should match the str rule; otherwise it should match the num rule.
For example:
hello123 - 123 should be matched by num rule
hello1a3 - 1a3 should be matched by the str rule
The grammar I wrote is below:
grammar Hello;
r: 'hello'seq;
// seq: str | integ;
seq: num | str;
num : DIGITS;
str : CHARS;
DIGITS: [0-9]+;
CHARS : [0-9a-zA-Z]+;
WS : [ \t\n\r]+ -> skip;
While trying to visualize the parse tree (using grun) (against the first input example above) I got the below parse tree:
However, if the input had a space in between, there was no problem. Please explain why the error occurs.

Lexing in ANTLR (as well as most lexer generators) works according to the maximum munch rule, which says that it always applies the lexer rule that could match the longest prefix of the current input. For the input hello123, the rule 'hello' would match hello, whereas the rule CHARS would match the entire input hello123. Therefore CHARS produces the longer match and is chosen over 'hello'.
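The maximum munch behaviour can be imitated outside ANTLR with a toy lexer. The Python sketch below is not how ANTLR is implemented; it only tries each rule of the question's grammar at the current position and keeps the longest match, with ties going to the rule listed first. The names RULES and tokenize are made up for this illustration.

```python
import re

# Lexer rules in grammar order (the implicit 'hello' literal comes first).
RULES = [
    ("HELLO", re.compile(r"hello")),
    ("DIGITS", re.compile(r"[0-9]+")),
    ("CHARS", re.compile(r"[0-9a-zA-Z]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]

def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("hello123"))   # one CHARS token covering the whole input
print(tokenize("hello 123"))  # HELLO, then DIGITS
```

With the space, CHARS can only match hello (5 characters), which ties with the 'hello' literal, so the literal wins and 123 is lexed separately.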
If your CHARS and DIGITS tokens can only appear after a 'hello' token, you can use lexer modes to make it so that these rules are only available after a 'hello' has been matched.
Otherwise, to get the behaviour you want, your best bet would probably be to create a single lexer rule that matches 'hello' [0-9a-zA-Z]* and then take apart the tokens generated by that in a separate step. Though it all depends on why you need this.
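The "separate step" could look roughly like the following Python sketch. It assumes the lexer hands over the text of one combined token (here called HELLO_SEQ, a hypothetical name) and splits it after the fact; classify is likewise an illustrative helper, not part of ANTLR.

```python
import re

# Assumed shape of the combined token, i.e. a lexer rule such as
#   HELLO_SEQ : 'hello' [0-9a-zA-Z]* ;
HELLO_SEQ = re.compile(r"hello([0-9a-zA-Z]*)")

def classify(token_text):
    """Return ("num", seq) or ("str", seq) for the part after 'hello'."""
    m = HELLO_SEQ.fullmatch(token_text)
    if m is None:
        raise ValueError(f"not a hello token: {token_text!r}")
    seq = m.group(1)
    if not seq:
        raise ValueError("'hello' is not followed by a sequence")
    # Only ASCII digits count as a number, mirroring DIGITS: [0-9]+
    kind = "num" if re.fullmatch(r"[0-9]+", seq) else "str"
    return kind, seq

print(classify("hello123"))  # ('num', '123')
print(classify("hello1a3"))  # ('str', '1a3')
```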

Related

Recognizing euler's constant (e) only when relevant

I'm learning ANTLR4 to write a parser for a simple language specific to the app developed by the company. So far I've managed to have working arithmetic operations, logic operations, and conditional branchments. When tackling variables though, I ran into a problem. The language defines multiple mathematical constants, such as 'e'. When parsing variables, the parser would recognize the letter e as the constant and not part of the variable.
Below is a small test grammar I wrote to test this specific case, the euler and letter parser rules are there for visual clarity in the trees below
grammar Test;
r: str '\r\n' EOF;
str: euler | (letter)* ;
euler: EULER;
letter: LETTER;
EULER: 'e';
LETTER: [a-zA-Z];
Recognition of different strings with this grammar:
"e"
"test"
"qsdf"
"eee"
I thought maybe parser rule precedence had something to do with it, but whatever order the parser rules are in, the output is the same. Swapping the lexer rules allows for correct recognition of "test", but recognizes "e" using the letter rule and not the euler rule. I also thought about defining EULER as:
EULER: ~[a-zA-Z] 'e' ~[a-zA-Z]
but this wouldn't recognize var a=e correctly. Another rule I have in my lexer is the ELSE: 'else' rule, which recognizes the 'else' keyword; it works and doesn't conflict with the EULER rule. This is because ANTLR recognizes the longest input possible, but then why doesn't it recognize "test" as (r (str (letter t) (letter e) (letter s) (letter t)) \r\n <EOF>) as it would for "qsdf"?
You should not have a lexer rule like LETTER that matches a single letter and then "glue" these letters together in a parser rule. Instead, match a variable (consisting of multiple letters) as a single lexer rule:
EULER: 'e';
VARIABLE: [a-zA-Z]+;
I suggest changing your grammar to this:
grammar Test;
r: str '\n' EOF;
str: euler | WORD ;
euler: EULER;
EULER: 'e';
WORD: [a-zA-Z]+;
It appears you wanted a stand-alone "e" to be an euler element, and any other word to be a letter element, but that's not what you coded. Your grammar is doing exactly what you told it to do: Match every "e" as an EULER token (and therefore an euler element), and any other letter as a LETTER token (and therefore a letter element), and build strs out of those two tokens.
An ANTLR4 lexer tokenizes the input stream, trying to build the longest tokens possible, and processing the tokenization rules in the order you code them. Thus EULER will capture every "e", and LETTER will capture "a"-"d", "f"-"z", and "A"-"Z". An ANTLR4 parser maps the stream of tokens (from the lexer) into elements based on the order of tokens and the rules you code. Since the parser will never get a LETTER token for "e", your str elements will always get chopped apart at the "e"s.
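The chopping effect can be reproduced with a few lines of Python. This is only a simulation under simplifying assumptions (input consists of letters only, and the function names are invented here): lex_original mimics the EULER/LETTER rules of the question, where every token is one character long, and lex_with_word mimics the EULER/WORD rules suggested in the earlier answer.

```python
import re

def lex_original(text):
    # EULER: 'e';  LETTER: [a-zA-Z];
    # Both rules match exactly one character, so EULER (listed first) takes every "e".
    return [("EULER" if ch == "e" else "LETTER", ch) for ch in text]

def lex_with_word(text):
    # EULER: 'e';  WORD: [a-zA-Z]+;
    # The longest match wins; only a lone "e" ties, and the tie goes to EULER.
    tokens = []
    for m in re.finditer(r"[a-zA-Z]+", text):
        word = m.group()
        tokens.append(("EULER" if word == "e" else "WORD", word))
    return tokens

print(lex_original("test"))   # LETTER t, EULER e, LETTER s, LETTER t
print(lex_with_word("test"))  # a single WORD token
print(lex_with_word("e"))     # a single EULER token
```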
The fix for this is to code a lexer rule that collects sequences of letters that aren't stand-alone "e"s into a LETTER token (or, as @pavel-ganelin says, a WORD), and to present that to the parser instead of the individual letters. It's a little more complicated than that, though, because you probably want "easy" to be the WORD "easy", not an EULER ("e") followed by the WORD "asy". So, you need to ensure that the "e" starting a string of letters isn't captured as an EULER token. You do that by ensuring that the WORD lexer rule comes before the EULER rule, and that it ignores stand-alone "e"s:
grammar Test;
r: str '\r\n' EOF;
str: euler | word ;
euler: EULER;
word: WORD;
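WORD: ('e' [a-zA-Z]+) | [a-df-zA-Z] [a-zA-Z]*; // the second alternative must not start with 'e', otherwise a lone "e" would also be a WORD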
EULER: 'e';

How to get ANTLR4 grammar to parse over a single line without requiring line break in the middle?

I'm currently relearning ANTLR and I'm having a bit of an issue with my grammar and parsing is. I'm editing it in IntelliJ IDEA with the ANTLR plugin and I'm using ANTLR version 4.9.2.
My grammar is as follows
grammar Pattern;
pattern:
patternName
patternMeaning
patternMoves;
patternName : 'Name:' NAME ;
patternMeaning : 'Meaning:' NAME ;
patternMoves : 'Moves:' (patternStep)+ ;
patternStep : 'Turn' angle stance;
stance : 'Walking Stance';
angle : ('90'|'180'|'270'|'360') '°' 'anti-'? 'clockwise';
NAME : WORD (' ' WORD)*;
fragment WORD : [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Now, when I try to parse the following text, I get this error: line 2:9 mismatched input 'clockwise Walking Stance' expecting {'anti-', 'clockwise'}
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise Walking Stance
However, if I change the text to the below it works without any issues. How can I tweak my grammar to allow me to parse it on one line?
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise
Walking Stance
Your problem is that clockwise Walking Stance is a valid NAME, so it's interpreted as such rather than as an instance of the clockwise keyword followed by the NAME Walking Stance. Adding a line break fixes this because line breaks can't appear in names.
To fix this, you should turn WORD into a lexer rule and NAME into a parser rule. That way the name rule will only be tried in places where the parser actually expects a name, so it won't try to interpret clockwise as part of a name. And the WORD rule won't eat keywords because the match produced by the WORD rule won't be longer than the keyword, so the keyword wins.
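To see why the one-line input fails, it helps to compare the lengths of the two competing matches at the position of clockwise. The Python sketch below only imitates the two lexer rules with a regular expression and a string comparison; longest_candidates is a made-up helper, not an ANTLR API.

```python
import re

# NAME : WORD (' ' WORD)* ;  with  fragment WORD : [a-zA-Z]+ ;
NAME = re.compile(r"[a-zA-Z]+(?: [a-zA-Z]+)*")
KEYWORD = "clockwise"

def longest_candidates(text, pos):
    """Return (keyword match length, NAME match length) starting at pos."""
    keyword_len = len(KEYWORD) if text.startswith(KEYWORD, pos) else 0
    m = NAME.match(text, pos)
    name_len = len(m.group()) if m else 0
    return keyword_len, name_len

one_line = "Turn 90° clockwise Walking Stance"
two_lines = "Turn 90° clockwise\nWalking Stance"
start = one_line.index(KEYWORD)

print(longest_candidates(one_line, start))   # (9, 24): NAME is longer and wins
print(longest_candidates(two_lines, start))  # (9, 9): a tie, so the keyword wins
```

Once NAME is a parser rule built from WORD tokens, the lexer can never produce a match longer than a single word, so the first situation no longer arises.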
If this is your entire grammar, then there are no lexer rules defining the handling of whitespace. In fact, there are no explicit lexer rules. (ANTLR will create implicit lexer rules for any literal strings in your parser rules, unless they match an already defined grammar rule.)
Your grammar is essentially (in ANTLR’s perception)
grammar Pattern;
patternMoves : T_1 (patternStep)+ ;
patternStep : T_2 angle stance;
stance : T_3;
angle : (T_4|T_5|T_6|T_7) T_8 T_9? T_10;
T_1: 'Moves:';
T_2: 'Turn';
T_3: 'Walking Stance';
T_4: '90';
T_5: '180';
T_6: '270';
T_7: '360';
T_8: '°';
T_9: 'anti-';
T_10: 'clockwise';
ANTLR’s processing takes a stream of characters, passes them to a lexer, which must decide what to do with all characters (even whitespace). The lexer produces a stream of tokens that the parser rules process.
You need some lexer rule that prescribes how to handle whitespace:
WS: [ \t\r\n]+ -> skip;
is a common way of handling this. It tokenizes all whitespace as a WS token, but then skips handing that token to the parser. (This is very handy, as you won’t have to sprinkle WS or WS? items all through your grammar where whitespace is expected.)
That your plugin accepts your input would imply to me that it may be treating each line of input as a new parse.

Regular expression to capture n lines of text between two regex patterns

Need help with a regular expression to grab exactly n lines of text between two regex matches. For example, I need 17 lines of text and I used the example below, which does not work.
Please see sample code below:
import re
match_string = re.search(r'^.*MDC_IDC_RAW_MARKER((.*?\r?\n){17})Stored_EGM_Trigger.*\n', t, re.DOTALL).group()
value1 = re.search(r'value="(\d+)"', match_string).group(1)
value2 = re.search(r'value="(\d+\.\d+)"', match_string).group(1)
print(match_string)
print(value1)
print(value2)
I added a sample string to here, because SO does not allow long code string:
https://hastebin.com/aqowusijuc.xml
You are getting false positives because you are using the re.DOTALL flag, which allows the . character to match newline characters. That is, when you are matching ((.*?\r?\n){17}), the . could eat up many extra newline characters just to satisfy your required count of 17. You also now realize that the \r is superfluous. Also, starting your regex with ^.*? is superfluous because you are forcing the search to start from the beginning but then saying that the search engine should skip as many characters as necessary to find MDC_IDC_RAW_MARKER. So, a simplified and correct regex would be:
match_string = re.search(r'MDC_IDC_RAW_MARKER.*\n((.*\n){17})Stored_EGM_Trigger.*\n', t)
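The effect of re.DOTALL is easier to see on a small input. The sketch below uses the same regex shape as above with the line count as a parameter (3 instead of 17) and made-up sample text; between_markers and sample are illustrative helpers, and the row contents are not taken from the real data.

```python
import re

def between_markers(n, flags=0):
    # Same shape as the simplified regex, with the line count as a parameter.
    return re.compile(
        r"MDC_IDC_RAW_MARKER.*\n((.*\n){%d})Stored_EGM_Trigger.*\n" % n, flags
    )

def sample(line_count):
    # Hypothetical input: a marker line, line_count rows, then the closing line.
    body = "".join(f"row {i}\n" for i in range(line_count))
    return "MDC_IDC_RAW_MARKER begin\n" + body + "Stored_EGM_Trigger end\n"

strict = between_markers(3)
loose = between_markers(3, re.DOTALL)

print(strict.search(sample(3)) is not None)  # True: exactly 3 lines in between
print(strict.search(sample(5)) is not None)  # False: 5 lines are rejected
print(loose.search(sample(5)) is not None)   # True: "." crosses newlines, a false positive
```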

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable length lookbacks in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
The pattern you tried does not match because this part (\d)(?<!\1) can not match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured
digit, assert what is captured should not be on the left.
You could make the pattern work by adding for example a dot after the backreference (?<!\1.) to assert that the value before what you have just matched is not the same as group 1
Pattern
(\d)(?<!\1.)\1(?!\1)
Note that you have selected ECMAscript on regex101.
Python re does not support variable width lookbehind.
To make this work in Python, you need the PyPi regex module.
Example code
import regex
pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
"1151223\n\n"
"112223\n"
"123544")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
    print(match.group())
Output
22
11
22
11
44
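For the stated end goal (finding pairs, and only pairs), a regex is not strictly required. As an alternative technique, not the one used in this answer, the sketch below groups consecutive equal characters with itertools.groupby and keeps the runs of the wanted length. The helper name runs_of_length is invented, and it is meant to be applied to one line at a time, since a run of two newlines would otherwise also count as a pair.

```python
from itertools import groupby

def runs_of_length(line, length):
    """Return every maximal run of one repeated character that has exactly `length` characters."""
    runs = ("".join(group) for _, group in groupby(line))
    return [run for run in runs if len(run) == length]

for line in ["1111223", "1151223", "112223", "123544"]:
    print(line, runs_of_length(line, 2))

# Isolated characters are simply runs of length 1.
print(runs_of_length("1151223", 1))  # ['5', '1', '3']
```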
@Theforthbird has provided a good explanation for why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Python's re regex engine performs the following operations.
^.$         match the first char if it is the only char in the line
|           or
^           match beginning of line
(.)         match a char in capture group 1...
(?!\1)      ...that is not followed by the same character
|           or
(?<=(.))    save the previous char in capture group 2...
(?!\2)      ...that is not equal to the next char
(.)         match a character and save to capture group 3...
(?!\3)      ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
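The walkthrough above can be checked with a short script using the standard re module. The wrapper isolated_chars is a name chosen for this sketch; it works on a single line, because ^ and $ are used without re.MULTILINE.

```python
import re

# The pattern from this answer, unchanged.
ISOLATED = re.compile(r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)')

def isolated_chars(line):
    """Return the characters that are neither preceded nor followed by themselves."""
    return [m.group() for m in ISOLATED.finditer(line)]

print(isolated_chars("cat"))      # ['c', 'a', 't']
print(isolated_chars("1111223"))  # ['3']
print(isolated_chars("1151223"))  # ['5', '1', '3']
```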

ANTLR4 lexer not resolving ambiguity in grammar order

Using ANTLR 4.2, I'm trying a very simple parse of this test data:
RRV0#ABC
Using a minimal grammar:
grammar Tiny;
thing : RRV N HASH ID ;
RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [\t\r\n]+ -> skip ; // match 1-or-more whitespace but discard
I expect the lexer RRV to match before ID, based on the excerpt below from Terence Parr's Definitive ANTLR 4 reference:
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
Running the ANTLR4 test rig with the test data above, the output is
[#0,0:3='RRV0',<4>,1:0]
[#1,4:4='#',<3>,1:4]
[#2,5:7='ABC',<4>,1:5]
[#3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'
I can see the first token is <4> for ID, with the value 'RRV0'
I have tried rearranging the lexer item order. I have also tried using implicit lexer items by explicitly matching in the grammar rule (rather than through an explicit lexer item). I tried making matches non greedy too. Those were not successful for me.
If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.
I started in ANTLR 4.1 with the same issue.
I checked in ANTLRWorks and from the command line, with the same result both ways.
How can I change the grammar to match lexer item RRV in preference to ID ?
The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
This strategy is especially important in languages like Java. Consider the following input:
String className = "";
Along with the following two grammar rules (slightly simplified):
CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
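The difference between the two policies can be made concrete with a small simulation. The Python sketch below is a toy model, not ANTLR's implementation: lex is a hypothetical function that either picks the longest match (earlier rule on ties) or, with longest_match=False, simply the first rule that matches.

```python
import re

JAVA_RULES = [("CLASS", r"class"), ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*")]
TINY_RULES = [("RRV", r"RRV"), ("N", r"[0-9]+"), ("HASH", r"#"), ("ID", r"[a-zA-Z0-9]+")]

def lex(text, rules, longest_match=True):
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        if longest_match:
            # max() returns the first maximal item, so earlier rules win ties.
            name, value = max(candidates, key=lambda c: len(c[1]))
        else:
            name, value = candidates[0]  # grammar order only
        tokens.append((name, value))
        pos += len(value)
    return tokens

print(lex("className", JAVA_RULES))                       # one ID token
print(lex("class", JAVA_RULES))                           # CLASS wins the tie
print(lex("className", JAVA_RULES, longest_match=False))  # CLASS, then ID 'Name'
print(lex("RRV0#ABC", TINY_RULES))                        # ID 'RRV0', HASH, ID 'ABC'
```

The last call reproduces the token stream from the question: ID matches four characters where RRV matches only three, so rule order never comes into play.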

Resources