Xtext, making the difference between ID and String at the interpreter - dsl

I'm writing a dsl in a text in which people can declare some variables. the grammar is as follows:
Cosem:
cosem+=ID '=' 'COSEM' '(' class=INT ',' version=INT ',' obis=STRING ')' ;
Attributes :
attribute+=ID '=' 'ATTRIBUTE' '(' object=ID ',' attribute_name=STRING ')' ;
Action:
action+=ID '=' 'ACTION' '(' object=ID ',' action_name=STRING ')';
the Dsl has some methods like the print method:
Print:
'PRINT' '(' var0=(STRING|ID) (','var1+=(STRING|ID) )* ')' |
'PRINT' '(' ')'
;
I put all my variables in map so I can use them later in my code. the key is identifying them is their ID which is a string.
However, in my interpreter I can't make the différence between a string and an ID
def dispatch void exec(Print p) {
if (LocalMapAttribute.containsKey(p.var0) )
{print(LocalMapAttribute.get(p.var0))}
else if (LocalMapAction.containsKey(p.var0)){print(LocalMapAction.get(p.var0))}
else if (LocalMapCosem.containsKey(p.var0)){print(LocalMapCosem.get(p.var0))}
else
{print("erreeeur Print")}
p.var1.forEach[v
| if (LocalMapAttribute.containsKey(v)){print(LocalMapAttribute.get(v))}
else if (LocalMapAction.containsKey(v)){print(LocalMapAction.get(v))}
else if (LocalMapCosem.containsKey(v)){print(LocalMapCosem.get(v))}
else{print("erreur entre print")} ]
}
For example when I write PRINT ("attribut2",attribut2) the result shoud be
attribut2 "the value of attribut2"
but I get
"the value of attribut2" "the value of attribut2"

your current grammar structure makes it hard to do this since you throw away the information at the point where you fill the map.
you can use org.eclipse.xtext.nodemodel.util.NodeModelUtils.findNodesForFeature(EObject, EStructuralFeature) to obtain the actual text (which still may contain the original value including the ""
or you change your grammar to
var0=Value ...
Value: IDValue | StringValue;
IDValue: value=ID;
StringValue: value=STRING;
then you can have a look at the type (IDValue or StringValue) to decide wheather you need to put the text into "" (org.eclipse.xtext.util.Strings.convertToJavaString(String, boolean)) might be helpful
Or you can try to use a special replacement for STRINGValueaConcerter that does not strip the quotation marks

Related

Parsing key / value subparameters

I'm a bit clueless as to how I can parse (more or less) "free form" parameter lists, suppose the syntax allows for
PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE)
I have managed to basically parse combos of positional and key/value parameters, but as soon as I hit a lexer token like PARM= the parser bails out with a "mismatched input", and I can't specifically allow for or expect anything because these parameters passed to a function are completely arbitrary.
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode, the delimiters are PARM=( on the left and the closing ) on the right, but as the "data" itself can contain (pairs of) brackets how would I identify the correct closing paren so I don't prematurely end the lexer mode?
TIA - Alex
Edit 1:
Minimal grammar showing the issue with keywords being used where they shouldn't, as this is part of a complex grammar I can't change the order of tokens to put ID in front of everything else, for example, as it would catch too much. So I don't see how this can work short of breaking out into a different lexer mode.
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
COMMA : ',' ;
EQUALS : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
PARM : 'PARM=' ;
ID : ID_LITERAL ;
fragment ID_LITERAL : [A-Za-z]+ ;
.
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parms : PARM LPAREN parm+ RPAREN ;
parm : (pkey=ID EQUALS)? pval=ID COMMA? ;
Input:
PARM=( TEST, KEY=VAL, PARM=X)
Results in
line 1:22 extraneous input 'PARM=' expecting {')', ID}
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode
Instead of switching to modes (with -> mode(...)), you can push your "special" mode on a stack (with -> pushMode(...)) and then when encountering a ) you pop a mode from the stack. That way, you can have multiple nested lists (..(..(..).)..). A quick demo:
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN);
EQUALS : '=' ;
LPAREN : '(' -> pushMode(InList);
PARM : 'PARM';
ID : [A-Za-z] [A-Za-z0-9]*;
mode InList;
LST_LPAREN : '(' -> type(LPAREN), pushMode(InList);
RPAREN : ')' -> popMode;
COMMA : ',';
LST_EQUALS : '=' -> type(EQUALS);
STRING : '\'' ~['\r\n]* '\'';
LST_ID : [A-Za-z] [A-Za-z0-9]* -> type(ID);
LST_SPACE : [ \t\r\n]+ -> channel(HIDDEN);
and:
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parse
: PARM EQUALS list EOF
;
list
: LPAREN ( value ( COMMA value )* )? RPAREN
;
value
: ID
| STRING
| key_value
| ID list
;
key_value
: ID EQUALS value
;
which will parse your example input PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE) like this:
You don't have a rule (alternative) that recognizes a PARM token in your parm rule.
Bart has provided an answer using Lexer modes (and assuming that LPAREN and RPAREN always control those modes), but you can also just set up a parser rule that matches all of your keywords:
lexer grammar ParmLexer
;
SPACE: [ \t\r\n]+ -> channel(HIDDEN);
COMMA: ',';
EQUALS: '=';
LPAREN: '(';
RPAREN: ')';
PARM: 'PARM';
KW1: 'KW1';
KW2: 'KW2';
ID: ID_LITERAL;
fragment ID_LITERAL: [A-Za-z]+;
parser grammar ParmParser
;
options {
tokenVocab = ParmLexer;
}
parms: PARM EQUALS LPAREN parm (COMMA parm)* RPAREN;
parm: ((pkey = ID | kwid = kw) EQUALS)? pval = ID;
kw: PARM | KW1 | KW2;
input
"PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)"
yields:
(parms PARM = ( (parm TEST) , (parm KEY = VAL) , (parm (kw KW2) = v) , (parm (kw PARM) = X) ))

Choosing lexer mode based on variable

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (error message says:'"' came as a complete surprise)
Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parenthesis (mode(...) or pushMode(...)) must be an identifier, or an integer literal. It cannot be an expression (what you're trying to do).

ANTLR: Lexer RULE does not match

I want to match the following text:
test.define_shared_constant(:testConst, "12", false)
With this grammar it matches correctly:
grammar test;
statement: shared_constant_defioniton | method_call;
KEY: ':' ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?'|'!'|'|'|'-'|'()')+;
expr: STRING;
STRING: '"' (~'"')* ('"' | NEWLINE) | '\'' (~'\'')* ('\'' | NEWLINE);
NEWLINE: '\r'? '\n' | '\r';
BOOLEAN: 'true' | 'false';
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
shared_constant_defioniton
: ID('.define_shared_constant' '(' KEY ',' expr ',' (BOOLEAN) ')')
;
method_call
: ID '.' ID? '('expr*(',' expr)*')'
;
With this grammar it does not match. It matches to method_call which is not even correct.
shared_constant_defioniton
: ID('.' DEF_SHARED_CONSTANT '(' KEY ',' expr ',' (BOOLEAN) ')')
;
It is interpreting 'define_shared_constant' as ID. So I have to specify that ID should not contain 'define_'. But how can I do that?
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
Here both ID and DEF_SHARED_CONSTANT could match the input define_shared_constant. In cases like this where multiple rules could match and would produce a match of the same length, the rule that's defined first wins. So defined_shared_constant is recognized as an ID token because ID is defined first.
To get the behaviour you want, you should move the definition of DEF_SHARED_CONSTANT before the definition of ID. If you don't define a named lexer rule for it at all and instead use 'define_shared_constant' directly in the parser rule, that also works because implicitly defined lexer rules act as if they had been defined at the beginning of the file.
This worked according to ANTLR specification. But running it as an IntelliJ Language plugin did not. I had use a predicated and the final solution looks like this:
ID: { getText().indexOf("define") == 0}? ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;

How to break out of a lexical mode

I've been playing around with modes in an attempt to parse a message like this:
-MSGTXT (DO NOT TOKENIZE (THERE CAN BE PARENS HERE) THIS PART)
-END END OF MESSAGE
-TEST 123
The contents of MSGTXT can be any character so I set up my lexer grammar as follows:
lexer grammar ADEXPLexer;
// Fields
MSGTYP: 'MSGTYP';
ADEP: 'ADEP';
TITLE: 'TITLE';
FILTIM: 'FILTIM';
ORIGINDT: 'ORIGINDT';
IFPLID: 'IFPLID';
MSGTXT: 'MSGTXT' -> pushMode(MSG);
COMMENT: 'COMMENT';
// Message types.
ACK: 'ACK';
IFPL: 'IFPL';
// Lexical rules.
SEP: HYPHEN;
WS: [ \t\n\r] + -> skip;
KEYWORD: (ALPHA|DIGIT)+;
mode MSG;
TEXT: CLOSE_MSG | (ALPHA|DIGIT|SPECIAL|WS|HYPHEN)+;
CLOSE_MSG: ')' -> popMode;
fragment HYPHEN: '-';
fragment ALPHA: [A-Z];
fragment DIGIT: [0-9];
fragment SPECIAL
: '('
| '?'
| ':'
| '.'
| ','
| '\''
| '='
| '+'
| '/'
| ')'
;
The problem now however is that the last closing ')' is never used to break out back into the default mode so it continues on into other parts of the message. The parser rule itself looks like this:
msgtxt: SEP MSGTXT TEXT;
I'm looking for a way to get around this which doesn't involve TokenStreamRewriter as there's no such thing in the JavaScript runtime.
Any help appreciated!
Not sure what you need exactly, but if you don't need to check if contents of the TEXT is one of the (ALPHA|DIGIT|SPECIAL|WS|HYPHEN) just use this:
mode MSG;
TEXT: ~[)]+;
CLOSE_MSG: ')' -> popMode;
if you do, just exclude ')' from fragment SPECIAL

Antlr 4 curly brace usage

I inherited a scripting language that I'm trying to port over to antlr4. Part of the scripting language uses curly braces to identify variables.
set {myVariable} = "5";
I'm using the java grammar a bit as a guideline where variableExpression is new but Identifier and expression are just copies from java. I have:
variableExpression
: '{' Identifier '}'
;
parExpression
: '(' expression ')'
;
but I get an error that { is missing when I have
set {foo} = "5";
If i change the curly brace to (), it works. ! works. $ does not. Are there special characters that we need to escaped in a certain way to make this work? No, i can't change the use of curly braces (legacy code issues).
I'm currently digging through the doc and web for guidance but if someone already knows the answer, please let me know.
thanks!
There is no such restriction on using these characters {}()!$. As an example let's look at a simple grammar:
WS : [ \r\n\t] -> skip;
NAME : [a-zA-Z0-9]+;
STRING : '\"' .*? '\"';
script
: varDeclaration+ EOF;
varDeclaration
: 'set' variable '=' STRING;
variable
: '(' NAME ')'
| '{' NAME '}'
| '!' NAME '!'
| '$' NAME '$'
;
This grammar allows to match code below:
set {var1} = "value1"
set (var2) = "value2"
set !var3! = "value3"
set $var4$ = "value4"
The result AST looks like this:

Resources