ANTLR4 lexer rules don't work as expected

I want to write a lexer rule for a month and a year. Expressed as a regular expression, the rule is:
"hello"[0-9]{1,2}"ever"([0-9]{2}([0-9]{2})?)?
The "hello" and "ever" literals are just for debugging.
That is, one or two digits for the month and two or four digits for the year. The year part is optional.
For example:
Aug 2015 -> hello08ever2015 or hello8ever2015 or hello8ever15 or hello8ever or hello08ever;
Oct 2015 -> hello10ever2015 or hello10ever15 or hello10ever;
My lexer rules (ANTLR4) are as follows:
grammar Hello;
r : 'hello' TimeDate 'ever' TimeYear? ;
TimeDate : Digit Digit?;
TimeYear : TwoDigit TwoDigit?;
TwoDigit : Digit Digit;
Digit : [0-9] ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
But it does not seem to work.
Here are some logs from my testing:
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever2014
^Z
(r hello 20 ever 2014)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever20
^Z
(r hello 2 ever)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello20ever14
^Z
(r hello 20 ever)
C:\antlr\workspace\demo>grun Hello r -tree -gui
C:\antlr\workspace\demo>java org.antlr.v4.runtime.misc.TestRig Hello r -tree -gui
hello2ever2014
^Z
(r hello 2 ever 2014)
For input hello2ever20, it fails to recognize the year part '20';
for input hello20ever14, it fails to recognize the year part '14'.
Could anyone help with this?
Thanks!

You must realise that ANTLR's lexer rules are matched according to their position in the grammar file. The lexer does not "listen" to what the parser might need at a certain position in a parser rule. The lexer tries to match as many characters as possible, and when two (or more) rules match the same number of characters, the rule defined first wins.
In your case that means that 15 will always be tokenized as a TimeDate and never as a TimeYear, because both rules match 15 but TimeDate is defined first. 2015 will be tokenized as a TimeYear because no other rule matches 4 digits.
A solution would be to change TimeYear into a parser rule:
timeYear
: TimeDate TimeDate?
;
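The two rules described above (longest match wins; on a tie, the rule defined first wins) can be simulated outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual implementation. The rule table mirrors the question's grammar, the tokenize helper is a hypothetical name introduced for illustration, and the sketch assumes that the implicit tokens for 'hello' and 'ever' are ordered before the named lexer rules.

```python
import re

# Lexer rules in grammar order. Assumption: the implicit literal tokens
# ('hello', 'ever') come before the named lexer rules.
RULES = [
    ("'hello'", re.compile(r"hello")),
    ("'ever'", re.compile(r"ever")),
    ("TimeDate", re.compile(r"[0-9][0-9]?")),
    ("TimeYear", re.compile(r"[0-9]{2}(?:[0-9]{2})?")),
    ("TwoDigit", re.compile(r"[0-9]{2}")),
    ("Digit", re.compile(r"[0-9]")),
]


def tokenize(text):
    """Split text into (rule, lexeme) pairs: longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly greater: an earlier rule keeps the win on equal length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for sample in ("hello20ever2014", "hello20ever14", "hello2ever20"):
        print(sample, tokenize(sample))
```

In this simulation the two-digit years 14 and 20 come out as TimeDate, while 2014 comes out as TimeYear, which matches the behaviour shown in the question's logs.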

Related

Simple grammar looping infinitely

I would expect this simple grammar to match strings such as 'abc':
grammar Hello;
entry
: LETTER+
;
LETTER : [a-z] ;
But it seems to enter an infinite loop:
C:\Code\antlr\hello>antlr4 Hello.g4 -encoding utf8
C:\Code\antlr\hello>javac Hello*.java
C:\Code\antlr\hello>grun Hello entry -tree
asdf^Z
Terminate batch job (Y/N)? y
Why?

Handling line feed in ANTLR4 grammar with Python target

I am working on an ANTLR4 grammar for parsing Python DSL scripts (basically a subset of Python), with Python 3 as the target. I am having difficulties handling line feeds.
In my grammar, I use lexer::members and NEWLINE embedded code based on Bart Kiers's Python3 grammar for ANTLR4 which are ported to Python so that they can be used with Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same used in the Python 3 spec) since in my DSL I need to target only certain elements of Python. Based on extensive testing of my grammar, I do think that the Python part of the grammar in itself is not the source of the problem and so I won't post it here in full for now.
The input for the grammar is a file, matched by the file_input rule:
file_input: (NEWLINE | statement)* EOF;
The grammar performs rather well on my DSL and produces correct ASTs. The only problem I have is that my lexer rule NEWLINE clutters the AST with \r\n nodes and proves troublesome when trying to extend the generated MyGrammarListener with my own ExtendedListener which inherits from it.
Here is my NEWLINE lexer rule:
NEWLINE
: ( {self.at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
{
import re
from MyParser import MyParser
new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input))
spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input))
next = self._input.LA(1)
if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
    self.skip()
else:
    self.emit_token(self.common_token(self.NEWLINE, new_line))
    indent = self.get_indentation_count(spaces)
    if len(self.indents) == 0:
        previous = 0
    else:
        previous = self.indents[-1]
    if indent == previous:
        self.skip()
    elif indent > previous:
        self.indents.append(indent)
        self.emit_token(self.common_token(MyParser.INDENT, spaces))
    else:
        while len(self.indents) > 0 and self.indents[-1] > indent:
            self.emit_token(self.create_dedent())
            del self.indents[-1]
};
The SPACES lexer rule fragment that NEWLINE uses is here:
fragment SPACES
: [ \t]+
;
I should also add that both SPACES and comments are ultimately skipped by the grammar, but the rule that does so is declared only after the NEWLINE lexer rule. As far as I know that ordering should have no adverse effects, but I wanted to include it just in case.
SKIP_
: ( SPACES | COMMENT ) -> skip
;
When the input file has no empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignment), I get the following errors:
line 15:4 extraneous input '\r\n ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
As I said before, when line feeds are omitted in my input file, the grammar and my ExtendedListener perform as they should, so the problem is definitely with the \r\n not being matched by the NEWLINE lexer rule - even the error statement I get says that it does not match alternative NEWLINE.
The AST produced by my grammar looks like this:
I would really appreciate any help with this, since I cannot see why my NEWLINE lexer rule would fail to match \r\n as it should, and I would like to allow empty lines in my DSL.

so the problem is definitely with the \r\n not being matched by the
NEWLINE lexer rule
There is another explanation. An LL(1) parser would stop at the first mismatch, but ANTLR4's adaptive LL(*) parser has built-in error recovery: it reports the mismatch and tries to keep matching the input past it.
As I don't have your statement rule or your input around line 15, I'll demonstrate a possible case with the following grammar:
grammar Question;
/* Extraneous input parsing NL and spaces. */
@lexer::members {
public boolean at_start_of_input() {return true;}; // even if it always returns true, it's not the cause of the problem
}
question
@init {System.out.println("Question last update 2108");}
: ( NEWLINE
| statement
{System.out.println("found <<" + $statement.text + ">>");}
)* EOF
;
statement
: 'line ' NUMBER NEWLINE 'something else' NEWLINE
;
NUMBER : [0-9]+ ;
NEWLINE
: ( {at_start_of_input()}? SPACES
| ( '\r'? '\n' | '\r' | '\f' ) SPACES?
)
;
SKIP_
: SPACES -> skip
;
fragment SPACES
: [ \t]+
;
Input file t.text :
line 1
something else
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ hexdump -C t.text
00000000  6c 69 6e 65 20 31 0a 20  20 20 73 6f 6d 65 74 68  |line 1.   someth|
00000010  69 6e 67 20 65 6c 73 65  0a                       |ing else.|
00000019
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n   ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2108
found <<line 1
something else
>>
Now change statement like so :
statement
// : 'line ' NUMBER NEWLINE 'something else' NEWLINE
: 'line ' NUMBER 'something else' NEWLINE // now NL will be extraneous
;
and execute again :
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:4='line ',<'line '>,1:0]
[#1,5:5='1',<NUMBER>,1:5]
[#2,6:9='\n   ',<NEWLINE>,1:6]
[#3,10:23='something else',<'something else'>,2:3]
[#4,24:24='\n',<NEWLINE>,2:17]
[#5,25:24='<EOF>',<EOF>,3:0]
Question last update 2114
line 1:6 extraneous input '\n   ' expecting 'something else'
found <<line 1
something else
>>
Note that the NL character and spaces have been correctly matched by the NEWLINE lexer rule.
You can find the explanation in section 9.1 of The Definitive ANTLR 4 Reference :
$ grun Simple prog
➾ class T ; { int i; }
➾ EOF
❮ line 1:8 extraneous input ';' expecting '{'
The parser reports an error at the ; but gives a slightly more informative answer because it knows that the next token is what it was actually looking for. This feature is called single-token deletion because the parser can simply pretend the extraneous token isn’t there and keep going.
Similarly, the parser can do single-token insertion when it detects a missing token.
In other words, ANTLR4 can resynchronize the input with the grammar even when some tokens do not match. If you run with the -gui option
$ grun Question question -gui t.text
you can see that ANTLR4 has parsed the whole file, despite the fact that a NEWLINE is missing in the statement rule, and that the input does not match exactly the grammar.
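The single-token deletion described in the quoted passage can be illustrated with a toy matcher. This is a deliberately simplified Python sketch, not ANTLR's recovery code: match_sequence and the token-type strings are hypothetical names, and it only handles the case where one unexpected token sits directly in front of the expected one.

```python
def match_sequence(expected, tokens):
    """Match token types against an expected sequence, deleting one stray token
    whenever the token right after it is the one that was expected."""
    errors = []
    pos = 0
    for want in expected:
        if pos < len(tokens) and tokens[pos] == want:
            pos += 1
            continue
        following = tokens[pos + 1] if pos + 1 < len(tokens) else None
        if pos < len(tokens) and following == want:
            # Single-token deletion: report the stray token, then skip it.
            errors.append(f"extraneous input {tokens[pos]!r} expecting {want!r}")
            pos += 2
        else:
            got = tokens[pos] if pos < len(tokens) else "<EOF>"
            errors.append(f"mismatched input {got!r} expecting {want!r}")
            break
    return errors


if __name__ == "__main__":
    # The modified statement rule no longer expects a NEWLINE after NUMBER,
    # but the token stream from t.text still contains one.
    rule = ["LINE", "NUMBER", "SOMETHING_ELSE", "NEWLINE"]
    stream = ["LINE", "NUMBER", "NEWLINE", "SOMETHING_ELSE", "NEWLINE"]
    print(match_sequence(rule, stream))
```

The stray NEWLINE is reported once and skipped, and the rest of the sequence still matches, which is the same shape as the "extraneous input" message in the second run.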
To summarize: extraneous input is quite a common error when developing a grammar. It can come from a mismatch between the input being parsed and what the rule expects, or from some piece of input having been matched as a different token than the one we expect. The latter can be detected by examining the list of tokens produced by the -tokens option.
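One further thing that may be worth checking in the question's NEWLINE action, offered as an assumption rather than a diagnosis: in the Python 3 runtime, as far as I know, self._input.LA(1) returns an integer code point rather than a one-character string. If so, comparisons such as next == '\r' are always false, and blank lines never reach self.skip(). A minimal sketch of the difference, with is_line_break_or_comment as a hypothetical helper name:

```python
def is_line_break_or_comment(la):
    """Return True if a look-ahead code point is CR, LF, FF or '#'."""
    return la in (ord("\r"), ord("\n"), ord("\f"), ord("#"))


if __name__ == "__main__":
    la = ord("\n")  # stand-in for an integer look-ahead value
    print(la == "\n")                    # int compared with str
    print(is_line_break_or_comment(la))  # int compared with code points
```

Comparing an int with a str never raises in Python, it simply evaluates to False, so a check like this can fail silently.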

Get Last 5 Sequential Numbers from Perl Text and 3 Preceding Characters

How can I get the last run of 5 consecutive digits from a string in Perl, and additionally get the 3 characters that immediately precede that run? For example, if the string is "This is just a bunch of ran 00000 Dom text. It has no 11111 meaning." then I would want to get "11111" and then "no ".

Use a regular expression:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'This is just a bunch of ran 00000 Dom text. It has no 11111 meaning.';
my ($pre, $digits) = $string =~ /.*(...)([0-9]{5})/;
say "<$pre>\t<$digits>";
[0-9] matches any digit
{5} means the previous thing should match five times
() parentheses create a capture group
. matches any character (except newline)
* means the previous thing should match zero or more times, as much as possible. The .* therefore consumes as much of the string as it can, so the capture groups land on the last five-digit run (11111) rather than on 00000.
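For comparison, the same pattern behaves the same way in Python's re module. This is only a sketch to show the greedy backtracking; last_five_digits is a hypothetical helper name and is not part of the Perl answer.

```python
import re


def last_five_digits(text):
    """Return (three preceding characters, last five-digit run), or None."""
    # Greedy .* consumes as much as it can, then backtracks just far enough
    # for three arbitrary characters followed by five digits to match.
    match = re.search(r".*(...)([0-9]{5})", text)
    return match.groups() if match else None


if __name__ == "__main__":
    sample = "This is just a bunch of ran 00000 Dom text. It has no 11111 meaning."
    print(last_five_digits(sample))
```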

How can I insert a new line after each character in shell script?

Assuming I have the following string:
abcdefghi
Which command can I use so that the outcome is:
a
b
c
d
e
f
g
h
i
I just started coding so I hope someone can help me.

There is a tool called fold which inserts line breaks, and you can tell it to add one after every character:
$ fold -w 1 <<< 'abcdefghi'
a
b
c
d
e
f
g
h
i
<<< is used to indicate a here string. If your shell doesn't support that, you can pipe to fold instead:
echo 'abcdefghi' | fold -w 1

You can use sed, although it will add an extra newline after the last letter so you get a blank line at the end:
$ sed 's/./&\
/g' <<<abcdefghi
a
b
c
d
e
f
g
h
i
$
s/old/new/ is the sed "substitute" command. On the old side, the pattern . matches any character at all. On the new side, the symbol & means "whatever the old pattern matched" - we include what we match in the replacement so we are adding things, not removing them.
We want to follow each matched character with a newline, but the newline will terminate the sed command and result in a syntax error unless we put a backslash in front of it.
So we are replacing any character at all (.) with that same character (&) followed by a newline (\ + newline). The g on the end means to replace every occurrence, not just the first one on each line.
The demonstration uses a here-string, which is part of most modern shells but not all; you could also do it with echo abcdefghi | sed '...'.
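The substitution logic explained above (match any character, put the match back, append a newline) can also be sketched in Python, where \g<0> plays the role of sed's &. Both function names are hypothetical, and this is an illustration of the idea rather than a replacement for the shell commands.

```python
import re


def newline_after_each_char(text):
    """Append a newline after every character, like the sed substitution."""
    # "." matches any character except a newline; \g<0> is the whole match.
    return re.sub(r".", r"\g<0>\n", text)


def one_char_per_line(text):
    """Separate the characters with newlines, without a trailing newline."""
    return "\n".join(text)


if __name__ == "__main__":
    print(repr(newline_after_each_char("abcdefghi")))
    print(repr(one_char_per_line("abcdefghi")))
```

The first function ends its result with a newline after the last letter, as the sed version does. The second does not.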

grep -o . <<< "abcdefghi"

Overcoming ambiguity in antlr4?

I have the grammar below. It's an extract from something I am working on, and it highlights a problem I can't overcome.
In my grammar an expression is either a literal (a number) or an expression "+" another expression. So I want to parse:
1 + 2 + 3 + 4
etc.
However my definition of a number means that it can have an optional sign e.g.:
1, +1 or -1
So it's conceivable that I may need to parse:
1 + +1 or 1 + -1
What I am finding is that 1 + 1 (or bigger numbers) are fine.
What I am struggling to parse are inputs without spaces or with extra signs e.g.:
1+2
This causes real problems as the lexer picks up +2 as a Number when actually I want 2 as the number and + to be picked up as the sign in the expression.
How do I get antlr4 to recognise the difference?
grammar example;
example : expression* EOF;
expression
: expression '+' expression
| literal
;
literal : Number;
Number : Sign? Digits;
Sign : [-+];
Digits : Digit+;
Digit : [0-9];
WS : [ \t\r\n\u000C]+ -> skip;

You can remove the optional Sign from the Number token. That postpones the handling of signs to the parser stage, where you have more information about the context of the input. The idea is to create unary operators for the minus sign (-) and plus sign (+), keeping the number itself unsigned.
grammar example;
example : expression* EOF;
expression
: ('+'|'-') expression # unaryOp
| expression ('+'|'-') expression # binaryOp
| Number # number
;
Number : [0-9]+;
WS : [ \t\r\n\u000C]+ -> skip;
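The same split of responsibilities can be sketched outside ANTLR: the tokenizer only ever produces unsigned numbers and bare signs, and the parser decides from context whether a sign is unary or binary. This is a hand-written Python sketch under that assumption, not code generated from the grammar above. The tokenize and evaluate names are hypothetical, and it evaluates a single expression rather than a sequence of them.

```python
import re

TOKEN = re.compile(r"\s*(?:([0-9]+)|([-+]))")


def tokenize(text):
    """Produce unsigned numbers and single signs; a sign never joins digits."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        number, sign = match.groups()
        tokens.append(("Number", int(number)) if number else ("Sign", sign))
        pos = match.end()
    return tokens


def evaluate(text):
    """Evaluate: unary signs bind tighter than binary '+' and '-'."""
    tokens = tokenize(text)
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, value = tokens[pos]
        pos += 1
        if kind == "Sign":
            # A sign where an operand is expected is a unary operator.
            operand = unary()
            return -operand if value == "-" else operand
        return value

    result = unary()
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind != "Sign":
            raise SyntaxError("expected '+' or '-'")
        pos += 1
        # A sign after a complete operand is a binary operator.
        right = unary()
        result = result + right if value == "+" else result - right
    return result


if __name__ == "__main__":
    for sample in ("1+2", "1 + -1", "1++1", "-1+2"):
        print(sample, "=", evaluate(sample))
```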

Not sure if it's still relevant, but here goes:
Your expression rule seems faulty, it can not match on a "literal + literal" string, because it always expects an expression on the left.
Your rule should look something like:
expression:
literal '+' literal
| expression '+' literal;
