Xtext grammar more abstract classes instantiation - dsl

I'm building an expressions DSL using Xtext and I want some classes to inherit from abstract ones.
Hierarchy:
Expression is an abstract class extended by BinaryOperation, UnaryOperation, Number and Atomic.
BinaryOperation is an abstract class extended by Add, Sub, Mul, Div, Power
UnaryOperation is an abstract class extended by UnaryPlus, UnaryMinus and Factorial.
Whole grammar:
Expressions:
elements+=EvalExpr;
EvalExpr:
'eval' expression=Expression ';';
Expression: AddOrSub;
UnaryOperation:
Expression;
BinaryOperation:
Expression;
AddOrSub returns BinaryOperation:
MulOrDivOrPower (( {Add.left=current} '+' |
{Sub.left=current} '-'
) right=MulOrDivOrPower)*;
MulOrDivOrPower returns BinaryOperation:
UnaryPlusOrMinus (( {Mul.left=current} '*' |
{Div.left=current} '/' |
{Power.left=current} '^'
) right=UnaryPlusOrMinus)*;
UnaryPlusOrMinus returns UnaryOperation:
'-' {UnaryMinus} expression=Factorial | '+' {UnaryPlus} expression=Factorial | Factorial;
Factorial returns UnaryOperation:
Atomic ({Factorial.expression=current} '!')?;
Atomic returns Expression:
'(' Expression ')' | Number;
Number returns Expression:
{IntConstant} value=INT;
But I get the error "A class may not be a super type of itself". How can I set up such a hierarchy?

Related

What's difference between expr*, expr? and expr in Python3.6 AST Abstract Grammar?

The Python 3.6 AST abstract grammar uses expr*, expr? and expr. What's the difference between them? For example: expr* targets, expr? value and expr target.
Typically those suffixes mean the same as in regular expressions:
expr - exactly one instance of the expression
expr? - zero or one instances
expr* - zero or more instances
For example, for the following:
FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list, expr? returns)
a function definition consists of exactly one name, exactly one arguments node, zero or more statements, zero or more decorators, and an optional returns expression (the return annotation, which is absent for functions that have none).
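These multiplicities can be observed directly with Python's standard ast module. The following is a small sketch; the functions f and g are made up purely for illustration.

```python
import ast

# Two illustrative functions: f has no decorators and no return annotation,
# g has one decorator and a return annotation.
source = """
def f(x):
    pass

@staticmethod
def g(x) -> int:
    return x
"""

f_node, g_node = ast.parse(source).body

# identifier name: exactly one value, a plain string
print(f_node.name)

# stmt* body and expr* decorator_list: lists holding zero or more nodes
print(len(f_node.body), len(f_node.decorator_list), len(g_node.decorator_list))

# expr? returns: either None or a single expression node
print(f_node.returns, type(g_node.returns).__name__)
```

A starred field always shows up as a Python list (possibly empty), a field with a question mark is either None or one node, and a plain field is always present.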

ANTLR4 different precedence in two seemingly equivalent grammars

The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule, 'notExpression', with exactly the same definition. But these grammars produce different trees when parsing this input: '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with identifier, not with AND expression) but still need to have 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' <EOF>;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' <EOF>;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I found a partial answer to the second part of my question, but it does not fully satisfy me, because using this approach in a real, elaborate grammar is going to be ugly. As for the first part (why), I still have no idea, so more answers are welcome.
Anyway, to fix the precedence when the rule is referenced rather than inlined, the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
This produces a tree that differs from both trees shown in the original question, but at least the NOT does acquire higher precedence.
Parse tree

can antlr semantic predicates access grammar symbols

The antlr book has the following sample code to resolve grammar ambiguities using semantic predicates:
// predicates/PredCppStat.g4
@parser::members {
Set<String> types = new HashSet<String>() {{add("T");}};
boolean istype() { return types.contains(getCurrentToken().getText());}
}
stat: decl ';' {System.out.println("decl "+$decl.text);}
| expr ';' {System.out.println("expr "+$expr.text);}
;
decl: ID ID
| {istype()}? ID '(' ID ')'
;
expr: INT
| ID
| {!istype()}? ID '(' expr ')'
;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
WS : [ \t\n\r]+ -> skip ;
Here, the predicate is the first thing evaluated in an alternative, determining whether that alternative should be taken or not. It uses getCurrentToken() to make its decision.
However, if we alter the grammar slightly, to use hierarchical names instead of simple ID, like this:
decl: ID ID
| {istype()}? hier_id '(' ID ')'
;
expr: INT
| ID
| {!istype()}? hier_id '(' expr ')'
;
hier_id : ID ('.' ID)* ;
Then the istype() predicate can no longer use getCurrentToken() to make its decision. It needs the entire chain of tokens in the hier_id to determine whether the chain is a type symbol or not.
That means that we will need to do one of the following:
(1) put the predicate after hier_id, and access those values from istype(). Is this possible? I tried it, and I am getting compiler errors in the generated code.
(2) break up the grammar into sub-rules, and then place istype() after hier_id tokens are consumed. But this will wreck the readability of the grammar, and I would not like to do it.
What is the best way to solve this problem? I am using antlr-4.6.
One solution is to let ID itself contain '.', thereby making hier_id a lexer token. In that case, the semantic predicate's call to getCurrentToken() will have access to the full chain of names.
Note that hier_id will subsume ID if it becomes a lexer token. And that comes at a cost. If the grammar has other references to ID only (and I guess it will have), then you have to add predicates in all those situations to avoid false matches. This will slow down the parser.
So I guess the question, in its general sense (i.e. how can rules be restricted by predicates if the current token alone is not enough to make the decision), still needs to be answered by ANTLR 4 experts.
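One direction for the general question is to let the predicate peek past the current token and assemble the dotted name itself, without consuming anything. In ANTLR 4's Java target the parser's token stream supports this kind of lookahead through _input.LT(k). The sketch below only models the idea, with the token stream as a plain Python list; dotted_name_ahead, is_type_ahead, and the (type, text) token tuples are hypothetical stand-ins, not ANTLR API.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (token type, text), e.g. ("ID", "ns") or (".", ".")


def dotted_name_ahead(tokens: List[Token], pos: int) -> str:
    """Collect ID ('.' ID)* starting at pos without consuming any token."""
    if pos >= len(tokens) or tokens[pos][0] != "ID":
        return ""
    parts = [tokens[pos][1]]
    i = pos + 1
    # Continue only while a '.' is immediately followed by another ID.
    while i + 1 < len(tokens) and tokens[i][0] == "." and tokens[i + 1][0] == "ID":
        parts.append(tokens[i + 1][1])
        i += 2
    return ".".join(parts)


def is_type_ahead(tokens: List[Token], pos: int, types: Set[str]) -> bool:
    """Predicate body: is the whole upcoming hier_id a known type name?"""
    return dotted_name_ahead(tokens, pos) in types


types = {"T", "ns.T"}
cast_like = [("ID", "ns"), (".", "."), ("ID", "T"), ("(", "("), ("ID", "i"), (")", ")")]
call_like = [("ID", "ns"), (".", "."), ("ID", "f"), ("(", "("), ("ID", "i"), (")", ")")]

print(is_type_ahead(cast_like, 0, types))
print(is_type_ahead(call_like, 0, types))
```

The predicate stays in front of hier_id, as in the original grammar, so the readability of the rules is preserved; the cost is that the lookahead loop duplicates the shape of hier_id and must be kept in sync with it.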

Range Specification in Xtext

I am new to Xtext and want to define a language element for specifying ranges of values. Examples: [1-2] or ]0.1-0.3[
I have the following rule for this purpose:
Range returns Expression:
Atomic (leftBracket=('[' | ']') left=Atomic '-' right=Atomic rightBracket=('[' | ']'))*;
Atomic here refers basically to the primitive float and int types. I have two problems:
I get the warning "The assigned value of feature 'leftBracket' will possibly override itself because it is used inside of a loop" and the same for rightBracket. What does this mean in this context?
The expression works only in standalone manner (in one row), but not in connection with the rest of the language elements. E.g. in connection with the element right before:
Comparison returns Expression:
Range ({Comparison.left=current} op=(">="|"<="|">"|"<"|"=>"|"<=>"|"xor"|"=") right=Range)*;
This means, if such an operation is before the Range element in my input of the second Eclipse window, I get the error "No viable alternative at input".
Any ideas? Thanks for any hints and advice!
Some more information:
I took this example and changed it: https://github.com/LorenzoBettini/packtpub-xtext-book-examples/blob/master/org.example.expressions/src/org/example/expressions/Expressions.xtext
Full code:
grammar org.example.expressions.Expressions with org.eclipse.xtext.common.Terminals
generate expressions "http://www.example.org/expressions/Expressions"
ExpressionsModel:
expressions+=Expression*;
Expression: Or;
Or returns Expression:
And ({Or.left=current} "||" right=And)*
;
And returns Expression:
Equality ({And.left=current} "&&" right=Equality)*
;
Equality returns Expression:
Comparison (
{Equality.left=current} op=("==")
right=Comparison
)*
;
Comparison returns Expression:
Range ({Comparison.left=current} op=(">="|"<="|">"|"<"|"=>"|"<=>"|"xor"|"=") right=Range)*
;
Range returns Expression:
Primary (leftBracket=('[' | ']') left=Primary '-' right=Primary rightBracket=('[' | ']'))*
;
Primary returns Expression:
'(' Expression ')' |
{Not} "!" expression=Primary |
Atomic
;
Atomic returns Expression:
{IntConstant} value=INT |
{StringConstant} value=STRING |
{BoolConstant} value=('true'|'false')
;
Example where it fails: (1 = [1-2]) however [1-2] in a row works fine.
I cannot really follow you, but your grammar looks strange to me. The following grammar
Model:
(expressions+=Comparison ";")*;
Comparison returns Expression:
Range ({Comparison.left=current} op=(">=" | "<=" | ">" | "<" | "=>" | "<=>" | "xor" | "=") right=Range)*;
Range:
(leftBracket=('[' | ']') left=Atomic '-' right=Atomic rightBracket=('[' | ']'))
|
Atomic;
Atomic:
value=INT;
works fine with this input:
[1-2];
]3-5[;
[1-4[ < ]1-6];
6;
1 < 2;
So can you give some more context?

Bison/Flex, reduce/reduce, identifier in different production

I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production so that an identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any way to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably look something like this:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up a scanned identifier token in the symbol table and use the information found there to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even become intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
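The scanner side of that scheme might look roughly like the following. It is only a sketch, written in Python rather than flex; classify_identifier, TOKEN_FOR_TYPE, and the shape of the symbol table are assumptions made for illustration, while the token names are the ones used above.

```python
from typing import Dict

# Symbol table, assumed to be filled in by the parser's declaration actions:
# variable name -> declared type.
symbol_table: Dict[str, str] = {}

# One token type per datatype; this table is the part that has to grow
# (together with the grammar) every time a new type is added.
TOKEN_FOR_TYPE = {
    "int": "integer_variable",
    "boolean": "boolean_variable",
}


def classify_identifier(name: str) -> str:
    """Pick the token type the scanner should return for an identifier."""
    declared_type = symbol_table.get(name)
    if declared_type is None:
        return "undefined_variable"
    return TOKEN_FOR_TYPE[declared_type]


before = classify_identifier("a")   # not declared yet
symbol_table["a"] = "int"
symbol_table["b"] = "boolean"
print(before, classify_identifier("a"), classify_identifier("b"))
```

Note that the scanner's answer for the same spelling changes once the declaration has been processed, which is exactly the coupling between scanner, parser, and symbol table described above.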
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!my_x_object) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: sum | sum '<' sum ;
sum: product | sum '+' product ;
product: term | product '*' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
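A separate type-checking pass over such an AST can be quite small. The sketch below uses Python rather than C; the node classes, the OPERATORS table, and type_of are all hypothetical names chosen for illustration, and the symbol table is assumed to map each declared variable to "int" or "boolean".

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Ident:
    name: str


@dataclass
class Const:
    value: Union[int, bool]


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Ident, Const, BinOp]

# For each operator: (required operand type, result type).
OPERATORS = {
    "+": ("int", "int"),
    "*": ("int", "int"),
    "<": ("int", "boolean"),
    "and": ("boolean", "boolean"),
    "or": ("boolean", "boolean"),
}


def type_of(node: Node, symbols: Dict[str, str]) -> str:
    """Return the type of node, raising TypeError on a mismatch."""
    if isinstance(node, Ident):
        return symbols[node.name]
    if isinstance(node, Const):
        # Test for bool first: bool is a subclass of int in Python.
        return "boolean" if isinstance(node.value, bool) else "int"
    operand_type, result_type = OPERATORS[node.op]
    for child in (node.left, node.right):
        actual = type_of(child, symbols)
        if actual != operand_type:
            raise TypeError(f"'{node.op}' expects {operand_type}, got {actual}")
    return result_type


symbols = {"a": "int", "b": "boolean"}
tree = BinOp("and", Ident("b"), BinOp("<", Ident("a"), Const(7)))  # b and a < 7
print(type_of(tree, symbols))
```

Because the checker only sees the finished tree, adding a new datatype means touching the OPERATORS table and the symbol table, not the grammar or the scanner.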
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser). In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Resources