Curbing ANTLR4 greediness (Building ANTLR4 Grammar for existing DSL)

Curbing ANTLR4 greediness (Building ANTLR4 Grammar for existing DSL) - antlr4

I already have a DSL and would like to build ANTLR4 grammar for it.
Here is an exaple of that DSL:
rule isC {
true when O_M in [5, 6, 17, 34]
false in other cases
}
rule isContract {
true when O_C in ['XX','XY','YY']
false in other cases
}
rule isFixed {
true when F3 ==~ '.*/.*/.*-F.*/.*'
false in other cases
}
rule temp[1].future {
false when O_OF in ['C','P']
true in other cases
}
rule temp[0].scale {
10 when O_M == 5 && O_C in ['YX']
1 in other cases
}
How the DSL is parsed simply by using regular expressions that have became a total mess - so a grammar is needed.
The way it works is the following: it extracts left (before when) and right parts and they're evaluated by Groovy.
I would still like to have it evaluated by Groovy, but organize the parsing process by using grammar. So, in essence, what I need is to extract these left and right parts using some kind of wildcards.
I unfortunatelly cannot figure out how to do that. Here is what I have so far:
grammar RuleDSL;
rules: basic_rule+ EOF;
basic_rule: 'rule' rule_name '{' condition_expr+ '}';
name: CHAR+;
list_index: '[' DIGIT+ ']';
name_expr: name list_index*;
rule_name: name_expr ('.' name_expr)*;
condition_expr: when_condition_expr | otherwise_condition_expr;
condition: .*?;
result: .*?;
when_condition_expr: result WHEN condition;
otherwise_condition_expr: result IN_OTHER_CASES;
WHEN: 'when';
IN_OTHER_CASES: 'in other cases';
DIGIT: '0'..'9';
CHAR: 'a'..'z' | 'A'..'Z';
SYMBOL: '?' | '!' | '&' | '.' | ',' | '(' | ')' | '[' | ']' | '\\' | '/' | '%'
| '*' | '-' | '+' | '=' | '<' | '>' | '_' | '|' | '"' | '\'' | '~';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> skip;
COMMENT: '/*' .*? '*/' -> skip;
This grammar is "too" greedy, and only one rule is processed. I mean, if I listen to parsing with
#Override
public void enterBasic_rule(Basic_ruleContext ctx) {
System.out.println("ENTERING RULE");
}
#Override
public void exitBasic_rule(Basic_ruleContext ctx) {
System.out.println(ctx.getText());
System.out.println("LEAVING RULE");
}
I have the following as output
ENTERING RULE
-- tons of text
LEAVING RULE
How I can make it less greedy, so if I parse this given input, I'll get 5 rules? The greediness comes from condition and result I suppose.
UPDATE:
It turned out that skipping whitespaces wasn't the best idea, so after a while I ended up with the following: link to gist
Thanks 280Z28 for the hint!

Instead of using .*? in your parser rules, try using ~'}'* to ensure that those rules won't try to read past the end of the rule.
Also, you skip whitespace in your lexer but use CHAR+ and DIGIT+ in your parser rules. This means the following are equivalent:
rule temp[1].future
rule t e m p [ 1 ] . f u t u r e
Beyond that, you made in other cases a single token instead of 3, so the following are not equivalent:
true in other cases
true in other cases
You should probably start by making the following lexer rules, and then making the CHAR and DIGIT rules fragment rules:
ID : CHAR+;
INT : DIGIT+;

Related

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?

Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

Antlr4 Match Force Priority

I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more character in one stretch than other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.

You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;

ANTLR: Lexer RULE does not match

I want to match the following text:
test.define_shared_constant(:testConst, "12", false)
With this grammar it matches correctly:
grammar test;
statement: shared_constant_defioniton | method_call;
KEY: ':' ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?'|'!'|'|'|'-'|'()')+;
expr: STRING;
STRING: '"' (~'"')* ('"' | NEWLINE) | '\'' (~'\'')* ('\'' | NEWLINE);
NEWLINE: '\r'? '\n' | '\r';
BOOLEAN: 'true' | 'false';
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
shared_constant_defioniton
: ID('.define_shared_constant' '(' KEY ',' expr ',' (BOOLEAN) ')')
;
method_call
: ID '.' ID? '('expr*(',' expr)*')'
;
With this grammar it does not match. It matches to method_call which is not even correct.
shared_constant_defioniton
: ID('.' DEF_SHARED_CONSTANT '(' KEY ',' expr ',' (BOOLEAN) ')')
;
It is interpreting 'define_shared_constant' as ID. So I have to specify that ID should not contain 'define_'. But how can I do that?

ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
Here both ID and DEF_SHARED_CONSTANT could match the input define_shared_constant. In cases like this where multiple rules could match and would produce a match of the same length, the rule that's defined first wins. So defined_shared_constant is recognized as an ID token because ID is defined first.
To get the behaviour you want, you should move the definition of DEF_SHARED_CONSTANT before the definition of ID. If you don't define a named lexer rule for it at all and instead use 'define_shared_constant' directly in the parser rule, that also works because implicitly defined lexer rules act as if they had been defined at the beginning of the file.

This worked according to ANTLR specification. But running it as an IntelliJ Language plugin did not. I had use a predicated and the final solution looks like this:
ID: { getText().indexOf("define") == 0}? ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;

How to break out of a lexical mode

I've been playing around with modes in an attempt to parse a message like this:
-MSGTXT (DO NOT TOKENIZE (THERE CAN BE PARENS HERE) THIS PART)
-END END OF MESSAGE
-TEST 123
The contents of MSGTXT can be any character so I set up my lexer grammar as follows:
lexer grammar ADEXPLexer;
// Fields
MSGTYP: 'MSGTYP';
ADEP: 'ADEP';
TITLE: 'TITLE';
FILTIM: 'FILTIM';
ORIGINDT: 'ORIGINDT';
IFPLID: 'IFPLID';
MSGTXT: 'MSGTXT' -> pushMode(MSG);
COMMENT: 'COMMENT';
// Message types.
ACK: 'ACK';
IFPL: 'IFPL';
// Lexical rules.
SEP: HYPHEN;
WS: [ \t\n\r] + -> skip;
KEYWORD: (ALPHA|DIGIT)+;
mode MSG;
TEXT: CLOSE_MSG | (ALPHA|DIGIT|SPECIAL|WS|HYPHEN)+;
CLOSE_MSG: ')' -> popMode;
fragment HYPHEN: '-';
fragment ALPHA: [A-Z];
fragment DIGIT: [0-9];
fragment SPECIAL
: '('
| '?'
| ':'
| '.'
| ','
| '\''
| '='
| '+'
| '/'
| ')'
;
The problem now however is that the last closing ')' is never used to break out back into the default mode so it continues on into other parts of the message. The parser rule itself looks like this:
msgtxt: SEP MSGTXT TEXT;
I'm looking for a way to get around this which doesn't involve TokenStreamRewriter as there's no such thing in the JavaScript runtime.
Any help appreciated!

Not sure what you need exactly, but if you don't need to check if contents of the TEXT is one of the (ALPHA|DIGIT|SPECIAL|WS|HYPHEN) just use this:
mode MSG;
TEXT: ~[)]+;
CLOSE_MSG: ')' -> popMode;
if you do, just exclude ')' from fragment SPECIAL

ANTLR4 non-greedy rules with stop token

In the lexer I have :
AT: '#' -> mode(OPERATOR);
DOUBLE_AT: '##' ;
CURLY_CLOSE: '}' { block_nesting > 0 && block_nesting >= curly_nesting }? { curly_nesting--; block_nesting--; };
NORMAL_ELSE: 'else' { previous_is_parenthesis_close() }? { block_nesting++; tokens.clear(); setType(ELSE); } -> mode(RYTHM);
NWS: [\t\r\n ]+ { setType(WS); };
CONTENT: .+? ('#' | '}' | 'else' | '\t' | '\r' | '\n' ) ;
The CONTENT rule matches all, but including the tokens that terminate it. Which is not what the grammar needs : It needs to match all, until the terminator, not including.
Is there a way to do what I want ?

You can try to use the following code:
CONTENT: ~('#' | '}' | 'e' | '\t' | '\r' | '\n' )+;
NOT_ELSE: 'e' -> type(CONTENT);
But in this case, you will have several CONTENT rules instead of one. You can resolve this problem on the parser rules level.

For now, I've created a method that merges centain tokens into one, and used reflection to feed that new list to the parser. Not exactly pretty, but it works for now.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Curbing ANTLR4 greediness (Building ANTLR4 Grammar for existing DSL) - antlr4

Related

How to use the reserved words inside the string in ANTLR4?

Antlr4 Match Force Priority

ANTLR: Lexer RULE does not match

How to break out of a lexical mode

ANTLR4 non-greedy rules with stop token

Categories

Resources