I inherited a scripting language that I'm trying to port over to ANTLR4. Part of the scripting language uses curly braces to identify variables.
set {myVariable} = "5";
I'm using the Java grammar as a guideline: variableExpression is new, while Identifier and expression are copied from the Java grammar. I have:
variableExpression
: '{' Identifier '}'
;
parExpression
: '(' expression ')'
;
but I get an error that { is missing when I have
set {foo} = "5";
If I change the curly braces to (), it works. ! works. $ does not. Are there special characters that need to be escaped in a certain way to make this work? No, I can't change the use of curly braces (legacy code issues).
I'm currently digging through the doc and web for guidance but if someone already knows the answer, please let me know.
thanks!
There is no such restriction on using these characters {}()!$. As an example let's look at a simple grammar:
WS : [ \r\n\t] -> skip;
NAME : [a-zA-Z0-9]+;
STRING : '\"' .*? '\"';
script
: varDeclaration+ EOF;
varDeclaration
: 'set' variable '=' STRING;
variable
: '(' NAME ')'
| '{' NAME '}'
| '!' NAME '!'
| '$' NAME '$'
;
This grammar matches the code below:
set {var1} = "value1"
set (var2) = "value2"
set !var3! = "value3"
set $var4$ = "value4"
The result AST looks like this:
My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (the error message says: '"' came as a complete surprise).
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parentheses (mode(...) or pushMode(...)) must be an identifier or an integer literal. It cannot be an expression (which is what you're trying to do).
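To make the restriction concrete, here is a rough Python sketch that mimics the lexerCommandExpr rule quoted above: an argument is accepted only if it looks like a plain identifier or an integer literal. The two regular expressions and the function name is_valid_command_arg are my own approximations, not part of ANTLR.

```python
import re

# Assumed approximations of ANTLR's `identifier` and `INT` tokens.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
INT = re.compile(r"[0-9]+")


def is_valid_command_arg(arg: str) -> bool:
    """Return True if `arg` has the shape of a lexerCommandExpr."""
    return bool(IDENTIFIER.fullmatch(arg) or INT.fullmatch(arg))


print(is_valid_command_arg("MODE_STRING"))            # True
print(is_valid_command_arg("2"))                      # True
print(is_valid_command_arg("current_string_mode()"))  # False
```

Both attempts from the question fail this check because their arguments contain a call or a conditional expression.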
I want to match the following text:
test.define_shared_constant(:testConst, "12", false)
With this grammar it matches correctly:
grammar test;
statement: shared_constant_defioniton | method_call;
KEY: ':' ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?'|'!'|'|'|'-'|'()')+;
expr: STRING;
STRING: '"' (~'"')* ('"' | NEWLINE) | '\'' (~'\'')* ('\'' | NEWLINE);
NEWLINE: '\r'? '\n' | '\r';
BOOLEAN: 'true' | 'false';
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
shared_constant_defioniton
: ID('.define_shared_constant' '(' KEY ',' expr ',' (BOOLEAN) ')')
;
method_call
: ID '.' ID? '('expr*(',' expr)*')'
;
With this grammar it does not match. It matches to method_call which is not even correct.
shared_constant_defioniton
: ID('.' DEF_SHARED_CONSTANT '(' KEY ',' expr ',' (BOOLEAN) ')')
;
It is interpreting 'define_shared_constant' as ID. So I have to specify that ID should not contain 'define_'. But how can I do that?
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
Here both ID and DEF_SHARED_CONSTANT could match the input define_shared_constant. In cases like this, where multiple rules could match and would produce a match of the same length, the rule that's defined first wins. So define_shared_constant is recognized as an ID token because ID is defined first.
To get the behaviour you want, you should move the definition of DEF_SHARED_CONSTANT before the definition of ID. If you don't define a named lexer rule for it at all and instead use 'define_shared_constant' directly in the parser rule, that also works because implicitly defined lexer rules act as if they had been defined at the beginning of the file.
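The tie-breaking described above can be imitated with a small Python sketch. This is not ANTLR's implementation; next_token and the two regular expressions are hypothetical stand-ins for the ID and DEF_SHARED_CONSTANT rules, used only to show "longest match, then first rule".

```python
import re


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match at `pos`; ties go to the earliest rule."""
    best = None  # (rule name, matched text)
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule keeps the token when lengths are equal.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


ID = ("ID", r"[A-Za-z!][A-Za-z0-9_!?]*")
DEF = ("DEF_SHARED_CONSTANT", r"define_shared_constant")

print(next_token([ID, DEF], "define_shared_constant("))  # ('ID', 'define_shared_constant')
print(next_token([DEF, ID], "define_shared_constant("))  # ('DEF_SHARED_CONSTANT', 'define_shared_constant')
print(next_token([DEF, ID], "define_shared_constants"))  # ('ID', 'define_shared_constants')
```

The last call shows that moving the keyword rule first does not stop longer identifiers from being recognized as ID.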
This worked according to the ANTLR specification, but running it as an IntelliJ language plugin did not. I had to use a predicate, and the final solution looks like this:
ID: { getText().indexOf("define") == 0}? ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer picks the rule that matches the longest stretch of input. When several rules match the same length, the one defined first wins.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise the WORD rule, once invoked, will consume all of the following input.
As WORD is specified last, it only gets the input when none of the lexer rules defined above it can match at least as much.
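The difference between the greedy and non-greedy forms can be seen with Python's re module, which is not ANTLR's lexer but has analogous semantics for these two operators. The sample text is made up for illustration.

```python
import re

text = "foo AND bar"

greedy = re.match(r".+", text).group()       # takes as much as it can
non_greedy = re.match(r".+?", text).group()  # takes as little as it can

print(repr(greedy))      # 'foo AND bar'
print(repr(non_greedy))  # 'f'
```

Note that the non-greedy form settles for a single character here, so I would expect a rule ending in .+? to produce one-character WORD tokens rather than whole words.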
EDIT: If you don't want your WORD rule to match arbitrary input, you just have to modify the rule I provided. The essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input, as long as you get their order in the source code right.
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to allow escape sequences in QUOTEDWORD, look at this example to see how to do it.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)
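As a rough illustration of those two properties, here is a Python sketch of a tokenizer with stand-ins for the QUOTEDWORD, WS and WORD rules. The regular expressions are simplified assumptions (for example, WORD accepts any character above 0x7F), and tokenize is a hypothetical helper, not generated ANTLR code.

```python
import re

# Assumed simplifications of the lexer rules above.
TOKEN = re.compile(r'''
    (?P<QUOTEDWORD>"[^"\\\r\n]*")
  | (?P<WS>[ \t\r\n]+)
  | (?P<WORD>[A-Za-z$_\u0080-\U0010FFFF]+)
''', re.VERBOSE)


def tokenize(text):
    """Split `text` into (rule name, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":  # mirrors `WS -> skip`
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize('hello   wörld "two  spaces"'))
# [('WORD', 'hello'), ('WORD', 'wörld'), ('QUOTEDWORD', '"two  spaces"')]
```

The whitespace between the words disappears, while the two spaces inside the quotes survive as part of the QUOTEDWORD token.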
I'm writing a DSL in Xtext in which people can declare some variables. The grammar is as follows:
Cosem:
cosem+=ID '=' 'COSEM' '(' class=INT ',' version=INT ',' obis=STRING ')' ;
Attributes :
attribute+=ID '=' 'ATTRIBUTE' '(' object=ID ',' attribute_name=STRING ')' ;
Action:
action+=ID '=' 'ACTION' '(' object=ID ',' action_name=STRING ')';
The DSL has some methods, like the print method:
Print:
'PRINT' '(' var0=(STRING|ID) (','var1+=(STRING|ID) )* ')' |
'PRINT' '(' ')'
;
I put all my variables in maps so I can use them later in my code. The key identifying them is their ID, which is a string.
However, in my interpreter I can't tell the difference between a string and an ID:
def dispatch void exec(Print p) {
if (LocalMapAttribute.containsKey(p.var0) )
{print(LocalMapAttribute.get(p.var0))}
else if (LocalMapAction.containsKey(p.var0)){print(LocalMapAction.get(p.var0))}
else if (LocalMapCosem.containsKey(p.var0)){print(LocalMapCosem.get(p.var0))}
else
{print("erreeeur Print")}
p.var1.forEach[v
| if (LocalMapAttribute.containsKey(v)){print(LocalMapAttribute.get(v))}
else if (LocalMapAction.containsKey(v)){print(LocalMapAction.get(v))}
else if (LocalMapCosem.containsKey(v)){print(LocalMapCosem.get(v))}
else{print("erreur entre print")} ]
}
For example, when I write PRINT ("attribut2",attribut2) the result should be
attribut2 "the value of attribut2"
but I get
"the value of attribut2" "the value of attribut2"
Your current grammar structure makes this hard to do, since you throw away the information at the point where you fill the map.
You can use org.eclipse.xtext.nodemodel.util.NodeModelUtils.findNodesForFeature(EObject, EStructuralFeature) to obtain the actual text (which may still contain the original value, including the quotes),
or you can change your grammar to
var0=Value ...
Value: IDValue | StringValue;
IDValue: value=ID;
StringValue: value=STRING;
Then you can look at the type (IDValue or StringValue) to decide whether you need to put the text in quotes; org.eclipse.xtext.util.Strings.convertToJavaString(String, boolean) might be helpful.
Or you can try a replacement for STRINGValueConverter that does not strip the quotation marks.
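The idea behind the Value: IDValue | StringValue split can be sketched in Python. This is not Xtext or Xtend code; the classes only mirror the two grammar rules, and render, exec_print and the variables dictionary are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class IDValue:
    value: str  # name of a declared variable


@dataclass
class StringValue:
    value: str  # literal text, already without quotes


def render(arg, variables):
    """Literals print as-is; identifiers are looked up in the variable map."""
    if isinstance(arg, StringValue):
        return arg.value
    if isinstance(arg, IDValue):
        if arg.value not in variables:
            raise KeyError(f"unknown variable: {arg.value}")
        return variables[arg.value]
    raise TypeError(f"unsupported argument: {arg!r}")


def exec_print(args, variables):
    return " ".join(str(render(a, variables)) for a in args)


variables = {"attribut2": "the value of attribut2"}
print(exec_print([StringValue("attribut2"), IDValue("attribut2")], variables))
# attribut2 the value of attribut2
```

Because the type carries the string-versus-identifier distinction, the interpreter no longer has to guess from the text alone.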
How do I build a lexer token that can handle recursion inside it, as in this string?
${*anything*${*anything*}*anything*}
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
[screenshot: http://img185.imageshack.us/img185/5471/recq.png]
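For readers without ANTLRWorks at hand, the same recursive rule can be approximated by a hand-written Python recognizer. This is only a sketch of the DollarVar rule above; match_dollar_var is a hypothetical helper that returns the index just past the match, so the whole nested construct is still treated as one token.

```python
SPECIAL = set("\\${}")  # the characters of the `Special` fragment


def match_dollar_var(text, pos=0):
    """Return the index just past a DollarVar starting at `pos`, or None."""
    if not text.startswith("${", pos):
        return None
    i = pos + 2
    items = 0  # the rule requires at least one inner element: (...)+
    while i < len(text):
        ch = text[i]
        if ch == "}":
            return i + 1 if items > 0 else None
        if ch == "$":                      # nested DollarVar (recursion)
            end = match_dollar_var(text, i)
            if end is None:
                return None
            i = end
        elif ch == "\\":                   # EscapeSequence: '\\' Special
            if i + 1 < len(text) and text[i + 1] in SPECIAL:
                i += 2
            else:
                return None
        elif ch == "{":                    # a bare '{' is not allowed
            return None
        else:                              # ~Special
            i += 1
        items += 1
    return None                            # ran out of input before '}'


sample = "${a ${b} ${c ${ddd} c} a}"
print(match_dollar_var(sample) == len(sample))  # True
print(match_dollar_var("${a ${b}"))             # None
```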
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
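To show what "seeing all the pieces" might look like, here is a small Python sketch that parses a nested ${...} string into a tree of text fragments and nested lists. It is not derived from the parser rule above; parse_variable and the list-based tree shape are assumptions chosen for illustration.

```python
def parse_variable(text, pos=0):
    """Parse '${...}' at `pos`; return (tree, end) where tree mixes str and nested lists."""
    if not text.startswith("${", pos):
        raise ValueError(f"expected '${{' at {pos}")
    parts, buf, i = [], [], pos + 2
    while i < len(text):
        if text.startswith("${", i):
            if buf:                       # flush text collected so far
                parts.append("".join(buf))
                buf = []
            child, i = parse_variable(text, i)
            parts.append(child)
        elif text[i] == "}":
            if buf:
                parts.append("".join(buf))
            return parts, i + 1
        else:
            buf.append(text[i])
            i += 1
    raise ValueError("missing closing '}'")


tree, end = parse_variable("${a ${b} c}")
print(tree)  # ['a ', ['b'], ' c']
```

Each nested variable becomes its own sub-list, which is the kind of structure a parser-level rule would expose and a single lexer token would hide.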