I have an ANTLR grammar like the following:
accepted: appendix '$' pin;
pin: alphanums (connectors alphanums)+;
appendix: LOWERCASE | UPPERCASE;
alphanums: (LOWERCASE | UPPERCASE | INT)+;
connectors: CONNECTOR+;
LOWERCASE: [a-z]+;
UPPERCASE: [A-Z]+;
INT: [0-9]+;
CONNECTOR: ',' | 'and' | 'or';
WS: [ \t\r\n]+ -> skip;
It is expected to accept patterns like "a $ 100a, 101b", but unfortunately it also accepts patterns like "a $ 100a of sth unacceptable"; here "of sth unacceptable" is recognized as part of alphanums. What I really intended the rule "alphanums" to recognize is just letters and digits, with no spaces.
If I change alphanums to a lexer rule, like
accepted: appendix '$' pin;
pin: ALPHANUMS (connectors ALPHANUMS)+;
appendix: LOWERCASE | UPPERCASE;
ALPHANUMS: (LOWERCASE | UPPERCASE | INT)+;
connectors: CONNECTOR+;
LOWERCASE: [a-z]+;
UPPERCASE: [A-Z]+;
INT: [0-9]+;
CONNECTOR: ',' | 'and' | 'or';
WS: [ \t\r\n]+ -> skip;
The appendix rule no longer recognizes "a", since "a" is now tokenized as ALPHANUMS.
I don't really want to change the appendix rule to take ALPHANUMS, like
appendix: ALPHANUMS;
since I only intend letters for appendix, no digits. To use ALPHANUMS I'd have to put validation code in a listener, which is an extra piece of logic that also makes the grammar harder to understand.
Is there any way out?
If you skip spaces in the lexer, then "a a" will be treated the same as "aa" in rules like alphanums. There's no way around it. Either don't skip spaces and account for them in the parser (usually not a viable solution), or demote alphanums to the lexer as you already tried (this is the way to go).
How about something like this:
accepted : appendix '$' pin;
pin : alphanums (connectors alphanums)+;
appendix : LETTERS | AND | OR; // perhaps without the AND and OR?
connectors : connector+;
connector : COMMA | AND | OR;
alphanums : ALPHANUMS | LETTERS | AND | OR; // perhaps without the AND and OR?
AND : 'and';
OR : 'or';
COMMA : ',';
LETTERS : [a-zA-Z]+;
ALPHANUMS : [a-zA-Z0-9]+;
WS : [ \t\r\n]+ -> skip;
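By my reading of that grammar (not an actual tool run), the two sample inputs from the question then behave as intended:
a $ 100a, 101b is tokenized as LETTERS '$' ALPHANUMS COMMA ALPHANUMS and is accepted.
a $ 100a of sth unacceptable is now rejected by the parser: "of" becomes its own LETTERS token, and since pin requires a COMMA, AND or OR at that point, it can no longer be absorbed into alphanums.
If you also want the parser to reject trailing text after an otherwise valid pin, end the start rule with an explicit EOF, e.g. accepted : appendix '$' pin EOF;.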
I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 with Java. I have a small problem with parsing strings: the reserved words/tokens are getting matched instead of the string. For example, IF is a keyword token in my lexer, but how can I use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
Parser grammar file:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed to share the full grammar, but I think I have shared enough information. As you can see, words are formed from LETTER and symbol characters. One workaround I can do is make the word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens and I don't want to add every one of them individually. Is there a nicer way to accomplish this?
(Examples of a valid expression and of an error expression omitted.)
How to make the parser ignore the keywords inside the string?
Your grammar, exactly as you posted it, does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of parsing your input, ANTLR will, naturally, attempt to match your lexer rules. If multiple rules match the current characters of your input stream, it follows two rules to determine a "winner":
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
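As a small illustration of those two rules (a made-up grammar, not yours):
IF : 'if' ;
ID : [a-z]+ ;
Given the input iffy, the lexer produces a single ID token, because ID matches 4 characters while IF only matches 2 (rule 1). Given the input if, both rules match 2 characters, so the lexer produces an IF token because IF is defined first (rule 2).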
In your case, neither really comes into play: when the lexer reaches the ', it will attempt to complete the STRING_LITERAL rule and will find a match for the characters 'if'. It will never even attempt to match your IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);
I have a language where I want to parse Unicode characters. Those characters are preceded by %.
So this text: %,, this: a=&, or even this: (a,b)=%, should detect the ',' as a Unicode character.
It does so until I add the pattern for (a,b).
Here is the grammar that works, without (a,b):
grammar example;
test: expr | decl;
decl: (VARIABLE_DECLARATION? ID ) '=' expr
;
VARIABLE_DECLARATION
: 'public' | 'private'
;
expr: unicode;
unicode: '%' CHAR;
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
CHAR: // Other_Punctuation
'\u{0021}'..'\u{0023}' // [!..#] Basic Latin
| '\u{0025}'..'\u{0027}' // [%..'] Basic Latin
| '\u{002a}' // [*] Basic Latin
| '\u{002c}' // [,] Basic Latin
| '\u{002e}'..'\u{002f}' // [.../] Basic Latin
| '\u{003a}'..'\u{003b}' // [:..;] Basic Latin
| '\u{003f}'..'\u{0040}' // [?..#] Basic Latin
| '\u{005c}' // [\] Basic Latin
;
When I add the (a,b) pattern, as in the grammar below, I get the following error: mismatched input ',' expecting CHAR
grammar example;
test: expr | decl;
decl: (VARIABLE_DECLARATION? ID | '('ID (',' ID)* ')' ) '=' expr
;
VARIABLE_DECLARATION
: 'public' | 'private'
;
expr: unicode;
unicode: '%' CHAR;
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
CHAR: // Other_Punctuation
'\u{0021}'..'\u{0023}' // [!..#] Basic Latin
| '\u{0025}'..'\u{0027}' // [%..'] Basic Latin
| '\u{002a}' // [*] Basic Latin
| '\u{002c}' // [,] Basic Latin
| '\u{002e}'..'\u{002f}' // [.../] Basic Latin
| '\u{003a}'..'\u{003b}' // [:..;] Basic Latin
| '\u{003f}'..'\u{0040}' // [?..#] Basic Latin
| '\u{005c}' // [\] Basic Latin
;
How can I solve that?
'\u{002c}' does indeed match , (though I don't understand why you'd write it as a Unicode escape instead of just ','). The problem is that you're also using ',' as a literal in your parser rules. This implicitly defines a lexer rule that matches only ,.
Lexer rules that are implicitly defined through the use of literals have higher priority than named lexer rules, so whenever the lexer sees a comma, it chooses to create a ',' token instead of a CHAR token.
To fix this I suggest you remove , from the set of characters matched by CHAR and instead use (CHAR | ',') wherever you want to allow both. You could even define a non-terminal char: CHAR | ','; and use that.
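Applied to your grammar, that might look something like this (a sketch, untested; I named the extra parser rule anyChar instead of char, since char would likely clash with the Java keyword in the generated code):
grammar example;
test    : expr | decl;
decl    : (VARIABLE_DECLARATION? ID | '(' ID (',' ID)* ')') '=' expr;
expr    : unicode;
unicode : '%' anyChar;
anyChar : CHAR | ',';           // the comma is allowed again here, as its own token
VARIABLE_DECLARATION
        : 'public' | 'private'
        ;
ID      : ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
CHAR    : // Other_Punctuation, with ',' removed
          '\u{0021}'..'\u{0023}'
        | '\u{0025}'..'\u{0027}'
        | '\u{002a}'
        | '\u{002e}'..'\u{002f}'
        | '\u{003a}'..'\u{003b}'
        | '\u{003f}'..'\u{0040}'
        | '\u{005c}'
        ;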
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of variable names which allows them to be built from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as part of a variable name, e.g. varName.nestedVar.false: the rule which marks false as a falseLiteral takes precedence.
I tried to play with the whitespace handling, using everything I found on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors (I often rely on the order of tokens). I tried to tell the boolLiteral rule that it has to have some space before the actual literal, like WHITESPACE* trueLiteral, but that doesn't work when the whitespace is sent to the HIDDEN channel, and disabling that altogether again means a lot of code rewriting. I also tried to reorder the alternatives in the literal rule, but that had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition into a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct: varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a '.' returns
line 1:8 no viable alternative at input 'varName AND'
The solution I expect would either enable/disable the whitespace -> channel(HIDDEN) behaviour for specific rules only,
tell the boolLiteral rule that it canNOT start with a dot, something like ~POINT falseLiteral (but I tried this as well, with no luck),
or get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
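Put into a minimal self-contained grammar (simplified from yours, and untested), that would look roughly like this:
grammar QualifiedNames;   // hypothetical grammar name
literal       : boolLiteral | varExpression;
boolLiteral   : TRUE | FALSE;
varExpression : qualifiedName;
qualifiedName : ID ('.' (anyId | INT))*;
anyId         : ID | TRUE | FALSE;      // keywords are only allowed after a dot
TRUE       : [Tt] [Rr] [Uu] [Ee];
FALSE      : [Ff] [Aa] [Ll] [Ss] [Ee];
ID         : [a-zA-Z_] [a-zA-Z0-9_]*;
INT        : [0-9]+;
WHITESPACE : [ \t\r\n]+ -> channel(HIDDEN);
This way a standalone false is still lexed as FALSE and matched by boolLiteral, while in varName.nestedVar.false the trailing FALSE token is accepted by anyId inside qualifiedName.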
grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for ANTLR4.
I've got a long way to go, but for now my question is:
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer will use the first rule that can match the given input. Only if that rule can't match it will it try the next one.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise, once invoked, the WORD rule will consume all of the following input.
As WORD is specified last, the lexer will only try to consume input with it after all the lexer rules defined above it in the source code have failed.
EDIT: If you don't want your WORD rule to match just about any input, then you simply have to narrow down the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input, as long as you get the order in the source code right.
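For example, one narrower version (just a sketch based on the character set in your question, untested) excludes whitespace and the operator characters explicitly:
WORD : ~[ \t\r\n(),"]+ ;   // one or more characters that are not whitespace, parentheses, commas or quotes
Since this version can never run across whitespace, quotes or parentheses, the non-greedy ? is not needed here, and AND, OR and NOT still win whenever they match the same number of characters.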
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
Maybe you want to use escape sequences in QUOTEDWORD; in that case, look at this example for how to do it (a generic sketch is also shown after the list below).
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)
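As for the escape sequences mentioned above, one common pattern (my own sketch, not taken from the linked example) is to factor the escape into a fragment:
QUOTEDWORD   : '"' (ESC | ~["\\\r\n])* '"' ;
fragment ESC : '\\' ["\\nrt] ;   // permits \" \\ \n \r \t inside a quoted word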
I already have a DSL and would like to build an ANTLR4 grammar for it.
Here is an example of that DSL:
rule isC {
true when O_M in [5, 6, 17, 34]
false in other cases
}
rule isContract {
true when O_C in ['XX','XY','YY']
false in other cases
}
rule isFixed {
true when F3 ==~ '.*/.*/.*-F.*/.*'
false in other cases
}
rule temp[1].future {
false when O_OF in ['C','P']
true in other cases
}
rule temp[0].scale {
10 when O_M == 5 && O_C in ['YX']
1 in other cases
}
Right now the DSL is parsed simply by using regular expressions, which have become a total mess, so a grammar is needed.
The way it works is the following: it extracts the left part (before when) and the right part, and they are evaluated by Groovy.
I would still like to have them evaluated by Groovy, but organize the parsing process using a grammar. So, in essence, what I need is to extract these left and right parts using some kind of wildcards.
Unfortunately, I cannot figure out how to do that. Here is what I have so far:
grammar RuleDSL;
rules: basic_rule+ EOF;
basic_rule: 'rule' rule_name '{' condition_expr+ '}';
name: CHAR+;
list_index: '[' DIGIT+ ']';
name_expr: name list_index*;
rule_name: name_expr ('.' name_expr)*;
condition_expr: when_condition_expr | otherwise_condition_expr;
condition: .*?;
result: .*?;
when_condition_expr: result WHEN condition;
otherwise_condition_expr: result IN_OTHER_CASES;
WHEN: 'when';
IN_OTHER_CASES: 'in other cases';
DIGIT: '0'..'9';
CHAR: 'a'..'z' | 'A'..'Z';
SYMBOL: '?' | '!' | '&' | '.' | ',' | '(' | ')' | '[' | ']' | '\\' | '/' | '%'
| '*' | '-' | '+' | '=' | '<' | '>' | '_' | '|' | '"' | '\'' | '~';
// Whitespace and comments
WS: [ \t\r\n\u000C]+ -> skip;
COMMENT: '/*' .*? '*/' -> skip;
This grammar is "too" greedy, and only one rule is processed. I mean, if I listen to the parse with
@Override
public void enterBasic_rule(Basic_ruleContext ctx) {
System.out.println("ENTERING RULE");
}
@Override
public void exitBasic_rule(Basic_ruleContext ctx) {
System.out.println(ctx.getText());
System.out.println("LEAVING RULE");
}
I have the following as output
ENTERING RULE
-- tons of text
LEAVING RULE
How can I make it less greedy, so that if I parse the given input, I'll get 5 rules? The greediness comes from condition and result, I suppose.
UPDATE:
It turned out that skipping whitespace wasn't the best idea, so after a while I ended up with the following: link to gist
Thanks 280Z28 for the hint!
Instead of using .*? in your parser rules, try using ~'}'* to ensure that those rules won't try to read past the end of the rule.
Also, you skip whitespace in your lexer but use CHAR+ and DIGIT+ in your parser rules. This means the following are equivalent:
rule temp[1].future
rule t e m p [ 1 ] . f u t u r e
Beyond that, you made in other cases a single token instead of 3, so the following are not equivalent:
true in other cases
true in  other   cases (note the extra spaces)
You should probably start by making the following lexer rules, and then making the CHAR and DIGIT rules fragment rules:
ID : CHAR+;
INT : DIGIT+;
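The existing CHAR and DIGIT rules would then become fragments, roughly:
fragment CHAR  : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';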