Converting from Antlr3 to Antlr4

I am in the process of converting an ANTLR3 grammar to ANTLR4. I have stripped out all the syntactic predicates, but I am struggling to convert this rule correctly:
relaxed_date_month_first
: relaxed_day_of_week? relaxed_month COMMA? WHITE_SPACE relaxed_day_of_month (relaxed_year_prefix relaxed_year)?
-> ^(EXPLICIT_DATE relaxed_day_of_month relaxed_month relaxed_day_of_week? relaxed_year?)
Every time the ANTLR4 tool runs into the "->" characters, it says "extraneous input '->' expecting {TOKEN_REF, RULE_REF...ACTION}".
How do I fix this?

ANTLR4 has no tree rewriting (it builds a parse tree automatically instead), so remove the -> ... part entirely:
relaxed_date_month_first
: relaxed_day_of_week? relaxed_month COMMA? WHITE_SPACE relaxed_day_of_month (relaxed_year_prefix relaxed_year)?
;
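
If the old rewrite was there to pick out the day, month, and year, one option in ANTLR4 is to label those elements and read them from the generated rule context. A minimal sketch, assuming the label names dow, month, dom, and year (they are illustrative and do not come from the original grammar):

```antlr
// Labels (dow, month, dom, year) are hypothetical names chosen for this sketch.
relaxed_date_month_first
    : dow=relaxed_day_of_week? month=relaxed_month COMMA? WHITE_SPACE
      dom=relaxed_day_of_month (relaxed_year_prefix year=relaxed_year)?
    ;
```

Each label becomes a field on the context object that ANTLR4 generates for this rule, so a listener or visitor can read the parts directly instead of walking an EXPLICIT_DATE tree.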

Related

How to get ANTLR4 grammar to parse over a single line without requiring line break in the middle?

I'm currently relearning ANTLR and I'm having a bit of an issue with my grammar and parsing. I'm editing it in IntelliJ IDEA with the ANTLR plugin and I'm using ANTLR version 4.9.2.
My grammar is as follows:
grammar Pattern;
pattern:
patternName
patternMeaning
patternMoves;
patternName : 'Name:' NAME ;
patternMeaning : 'Meaning:' NAME ;
patternMoves : 'Moves:' (patternStep)+ ;
patternStep : 'Turn' angle stance;
stance : 'Walking Stance';
angle : ('90'|'180'|'270'|'360') '°' 'anti-'? 'clockwise';
NAME : WORD (' ' WORD)*;
fragment WORD : [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;
Now, when I try to parse the following text, I get the error line 2:9 mismatched input 'clockwise Walking Stance' expecting {'anti-', 'clockwise'}:
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise Walking Stance
However, if I change the text to the following, it works without any issues. How can I tweak my grammar so that the step can be parsed on one line?
Name: Il Jang
Meaning: Heaven and light
Moves:
Turn 90° clockwise
Walking Stance

Your problem is that clockwise Walking Stance is a valid NAME, so it's interpreted as such rather than as an instance of the clockwise keyword followed by the NAME Walking Stance. Adding a line break fixes this because line breaks can't appear in names.
To fix this, you should turn WORD into a regular (non-fragment) lexer rule and NAME into a parser rule. That way the name rule will only be tried in places where the parser actually expects a name, so it won't try to interpret clockwise as part of a name. And the WORD rule won't eat keywords, because the match produced by WORD is never longer than the keyword itself, and when two matches have the same length the keyword wins: the implicit tokens for literals take precedence over the lexer rules written below them.
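
A sketch of what the grammar could look like after that change. The parser rule name is called name here as an assumption; everything else is taken from the question:

```antlr
grammar Pattern;

pattern        : patternName patternMeaning patternMoves ;
patternName    : 'Name:' name ;
patternMeaning : 'Meaning:' name ;
patternMoves   : 'Moves:' patternStep+ ;
patternStep    : 'Turn' angle stance ;
stance         : 'Walking Stance' ;
angle          : ('90' | '180' | '270' | '360') '°' 'anti-'? 'clockwise' ;

// A name is now a parser rule: one or more WORD tokens.
name : WORD+ ;

// WORD is a real token now (no longer a fragment).
WORD : [a-zA-Z]+ ;
WS   : [ \t\r\n]+ -> skip ;
```

One trade-off of this design: a name can no longer contain a word that is also a keyword, such as Turn or clockwise, because the lexer will emit the keyword token for it.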

If this is your entire grammar, then there are no lexer rules defining the handling of whitespace. In fact, there are no explicit lexer rules. (ANTLR will create implicit lexer rules for any literal strings in your parser rules, unless they match an already defined lexer rule.)
As ANTLR sees it, your grammar is essentially:
grammar Pattern;
patternMoves : T_1 (patternStep)+ ;
patternStep : T_2 angle stance;
stance : T_3;
angle : (T_4|T_5|T_6|T_7) T_8 T_9? T_10;
T_1: 'Moves:';
T_2: 'Turn';
T_3: 'Walking Stance';
T_4: '90';
T_5: '180';
T_6: '270';
T_7: '360';
T_8: '°';
T_9: 'anti-';
T_10: 'clockwise';
ANTLR’s processing takes a stream of characters, passes them to a lexer, which must decide what to do with all characters (even whitespace). The lexer produces a stream of tokens that the parser rules process.
You need a lexer rule that prescribes how to handle whitespace:
WS: [ \t\r\n]+ -> skip;
This is a common way of handling it. It tokenizes all whitespace as WS tokens, but then skips handing those tokens to the parser. (This is very handy, as you won't have to sprinkle WS or WS? items all through your grammar where whitespace is expected.)
That your plugin accepts your input suggests to me that it may be treating each line of input as a new parse.

Mismatched input with binary operator parsing

I'm trying to parse an existing language in ANTLR that's currently being parsed using the Ruby library Parslet.
Here is a stripped down version of my grammar:
grammar FilterMin;
filter : condition_set;
condition_set: condition_set_type (property_condition)?;
condition_set_type: '=' | '^=';
property_condition: property_lhs CONDITION_SEPARATOR property_rhs;
property_lhs: QUOTED_STRING;
property_rhs: entity_rhs | contains_rhs;
contains_rhs: CONTAINS_OP '(' contains_value ')';
contains_value: QUOTED_STRING;
entity_rhs: NOT_OP? MATCH_OP? QUOTED_STRING;
// operators
MATCH_OP: '~';
NOT_OP: '^';
CONTAINS_OP: 'contains';
QUOTED_STRING: QUOTE STRING QUOTE;
STRING: (~['\\])*;
QUOTE: '\'';
CONDITION_SEPARATOR: ':';
This fails to parse both ='foo':'bar' and ='foo':contains('bar'), with either mismatched input ':' expecting ':' or mismatched input ':contains(' expecting ':'.
Why aren't these inputs parsing?

Your STRING rule matches everything that isn't a backslash or a single quote, so it overlaps with all of your other lexer rules except QUOTED_STRING. Since the lexer always picks the rule that produces the longest match, and that is almost always STRING, your lexer will produce a bunch of STRING tokens and never any CONDITION_SEPARATOR tokens.
Since you never use STRING in your parser rules, it doesn't need to be an actual type of token. In fact, you never want STRING tokens to be generated, you only ever want it to be matched as part of a QUOTED_STRING token. Therefore it should be a fragment.
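
A minimal sketch of that change, showing only the string-related lexer rules; the rest of the grammar is assumed to stay as in the question:

```antlr
QUOTED_STRING : QUOTE STRING QUOTE ;

// As a fragment, STRING can only be used inside other lexer rules
// and is never emitted as a token of its own.
fragment STRING : (~['\\])* ;

QUOTE : '\'' ;
```

With STRING out of the way, ':' and contains are no longer swallowed by a longer STRING match, so they can be tokenized as CONDITION_SEPARATOR and CONTAINS_OP.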

No viable alternative at input ' '

I know this question has been asked before, but I haven't found any solution to my specific problem. I am using Antlr4 with the C# target and I have the following lexer rules:
INT : [0-9]+
;
LETTER : [a-zA-Z_]+
;
WS : [ \t\r\n\u000C]+ -> skip
;
LineComment
: '#' ~[\r\n]* -> skip
;
Those are all the lexer rules. There are also many parser rules, which I will not post here since I don't think they are relevant.
The problem I have is that whitespace does not get skipped. When I inspect the token stream after the lexer has run on my input, the whitespace is still in there and therefore causes an error. The input I use is relatively basic:
"fd 100"
It parses completely until it reaches this parser rule:
noSignFactor
: ':' ident #NoSignFactorArg
| integer #NoSignFactorInt
| float #NoSignFactorFloat
| BOOLEAN #NoSignFactorBool
| '(' expr ')' #NoSignFactorExpr
| 'not' factor #NoSignFactorNot
;
integer : INT #IntegerInt
;

Start by separating your grammar into a separate lexer grammar and parser grammar. For example, if you have a grammar Foo;, create the following:
Create a file FooLexer.g4, and move all of the lexer rules from Foo.g4 into FooLexer.g4.
Create a file FooParser.g4, and move all of the parser rules from Foo.g4 into FooParser.g4.
Include the following option in FooParser.g4:
options {
tokenVocab=FooLexer;
}
This separation will ensure that your parser isn't silently creating lexer rules for you. In a combined grammar, using a literal such as 'not' in a parser rule will create a lexer rule for you if one does not already exist. When this happens, it's easy to lose track of what kinds of tokens your lexer is able to produce. When you use a separate lexer grammar, you will need to explicitly declare a rule like the following in order to use 'not' in a parser rule.
NOT : 'not';
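This should solve the whitespace problem if you have included the literal ' ' somewhere in a parser rule: such a literal becomes an implicit token that competes with WS, so a single space is emitted as a token instead of being skipped.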
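
The split described above could look roughly like the following. This is a sketch only: the names Foo, FooLexer, and FooParser follow the example in this answer, and the parser rule is a reduced, hypothetical version of noSignFactor, not the asker's full rule.

```antlr
// ---- FooLexer.g4 ----
lexer grammar FooLexer;

// Keywords go before LETTER: on a tie in match length, the first rule wins.
NOT         : 'not' ;
INT         : [0-9]+ ;
LETTER      : [a-zA-Z_]+ ;
WS          : [ \t\r\n\u000C]+ -> skip ;
LineComment : '#' ~[\r\n]* -> skip ;

// ---- FooParser.g4 ----
parser grammar FooParser;

options {
    tokenVocab=FooLexer;
}

// Reduced, hypothetical version of the rule from the question.
noSignFactor
    : integer      # NoSignFactorInt
    | NOT integer  # NoSignFactorNot
    ;

integer : INT ;
```

The two grammars belong in separate files. With this layout, a literal in the parser grammar that has no matching lexer rule is reported as an error by the tool rather than silently becoming a new token type.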

Can an SLR grammar have empty productions?

I've written the following grammar:
S->S ( S ) S
S->e
e stands for "empty string"
So the language this grammar recognizes includes all strings with matching left and right parentheses, like (), (()), (()()), etc.
And this grammar is not SLR; here is how I construct the SLR parse table:
Augment this grammar:
S1->S
S->S(S)S
S->e
Then construct LR(0) automaton for it:
I0:
S1->.S
S->.S(S)S
S->.e
I1:
S1->S.
S->S.(S)S
...
Please note that for I0, there is no shift or reduce action for the input symbol '(', which is the first token of any non-empty string this grammar generates.
So the SLR parser will report an error, since in state I0 it doesn't know what to do when parsing the string (()).
My question is:
What is the culprit that makes this grammar NOT SLR? Is it the empty string production? That is:
S->e. ?
And in a general sense, can an SLR grammar have empty productions, like S->e in this example?
Thanks.

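The answer is yes, an SLR grammar can have empty productions. If no shift is available for the current input and the state contains an item for an empty production, the parser takes that empty production. The item S->.e is really the complete item S->. (e is the empty string, not a symbol to shift), so it calls for a reduce by S->e on every token in FOLLOW(S), which here is {(, ), $}. In I0 with '(' as the current input, the parser therefore reduces by S->e without consuming anything, moves to I1 via the goto on S, and shifts '(' from there.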

What do parentheses without quantifiers do in lexer rules?

Assume the following grammar:
grammar Demo;
start: START_BLOCK SEPERATOR;
START_BLOCK: '-.-.-';
ID: ( LETTER SEPERATOR ) (LETTER SEPERATOR)+;
fragment LETTER: L_A|L_K;
fragment L_A: '.-';
fragment L_K: '-.-';
SEPERATOR: '!';
I pass the following input to the grammar: -.-.-!
I'd expect that ANTLR recognizes the tokens START_BLOCK and SEPERATOR. But instead it finds a single Token of type ID.
I figured out that I can fix the problem by removing the first pair of parentheses in the lexer rule "ID":
ID: LETTER SEPERATOR (LETTER SEPERATOR)+;
Now everything works fine, but why? What did the parentheses above do to my grammar?

This is a bug in ANTLR 4 which is fixed for the 4.0.1 release. See: https://github.com/antlr/antlr4/issues/224
