ANTLR4 Semantic Predicates That Are Context Dependent Do Not Work - antlr4

I am parsing a C++-like declaration with this scaled-down grammar (many details removed, but it is still a fully working example). It fails mysteriously (at least to me). Is the failure related to the use of a context-dependent predicate? If yes, what is the proper way to implement the "count the number of child nodes" logic?
grammar CPPProcessor;
cppCompilationUnit : decl_specifier_seq? init_declarator* ';' EOF;
init_declarator: declarator initializer?;
declarator: identifier;
initializer: '=0';
decl_specifier_seq
locals [int cnt=0]
@init { $cnt=0; }
: decl_specifier+ ;
decl_specifier
@init { System.out.println($decl_specifier_seq::cnt); }
:
'const'
| {$decl_specifier_seq::cnt < 1}? type_specifier {$decl_specifier_seq::cnt += 1;} ;
type_specifier: identifier ;
identifier:IDENTIFIER;
CRLF: '\r'? '\n' -> channel(2);
WS: [ \t\f]+ -> channel(1);
IDENTIFIER:[_a-zA-Z] [0-9_a-zA-Z]* ;
I need to implement the standard C++ rule that no more than one type_specifier is allowed under a decl_specifier_seq.
A semantic predicate before type_specifier seems to be the solution, and the count is naturally declared as a local variable in decl_specifier_seq since nested decl_specifier_seqs are possible.
But it seems that a context-dependent semantic predicate like the one I used, i.e. a semantic predicate that references $attributes, produces incorrect parsing. First, an input file with a correct result (to illustrate what a normal parse tree looks like):
int t=0;
and the parse tree is as expected (tree image not reproduced here).
But for an input without the '=0' to aid the parsing:
int t;
0
1
line 1:4 no viable alternative at input 't'
1
The parsing fails with the 'no viable alternative' error (the numbers printed to the console are debug prints of the $decl_specifier_seq::cnt value, verifying the test condition); i.e. the semantic predicate cannot prevent t from being parsed as a type_specifier, so t is no longer considered an init_declarator. What is the problem here? Is it because a context-dependent predicate referencing $decl_specifier_seq::cnt is used?
Does this mean a context-dependent predicate cannot be used to implement the "count the number of child nodes" logic?
EDIT
I tried a new version whose predicate uses a member variable instead of $decl_specifier_seq::cnt, and surprisingly the grammar now works, showing that the context-dependent predicate did cause the previous grammar to fail:
....
@parser::members {
public int cnt=0;
}
decl_specifier
@init { System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 }? type_specifier {cnt++;} ;
A normal parse tree results (tree image not reproduced here).
This raises the question: how do we support nested rules if we must replace the local variables with member variables to avoid context-dependent predicates?
And a weird result: if I add a /*$ctx*/ comment inside the predicate, it fails again:
decl_specifier
@init { System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 /*$ctx*/ }? type_specifier {cnt++;} ;
line 1:4 no viable alternative at input 't'
The parsing fails with 'no viable alternative'. Why does /*$ctx*/ cause the parsing to fail just as when $decl_specifier_seq::cnt is used, even though the actual logic uses only a member variable?
And without the /*$ctx*/, another issue appears, related to the predicate being called before the @init block (described here).

ANTLR 4 evaluates semantic predicates in two cases.
The generated code evaluates a semantic predicate during parsing, and throws an exception if the evaluation returns false. All predicates traversed during parsing are evaluated in this way, including context-dependent predicates and predicates which do not appear at the left side of a decision.
The prediction method evaluates predicates in order to make correct decisions during parsing. In this case, predicates which appear anywhere other than the left edge of the decision being evaluated are assumed to return true (i.e. they are ignored). In addition, context-dependent predicates are only evaluated if the context data is available. The prediction algorithm will not create context structures that were not already provided by the parsing code. If a context-dependent predicate is encountered during prediction and no context is available, the predicate is assumed to return true (i.e. it is ignored for that decision).
The code generator does not evaluate the semantics of the target language, so it has no way to know that $ctx is semantically irrelevant when it appears inside the comment /*$ctx*/. In both cases (referencing $decl_specifier_seq::cnt, or merely mentioning $ctx, even in a comment) the predicate is treated as context-dependent.
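One way to keep the predicate context-independent (so that prediction can evaluate it) while still supporting nested decl_specifier_seqs, as asked above, is to replace the rule-local counter with a member stack that the sequence rule pushes and pops itself. A rough, untested sketch; the typeSpecCount name is made up for illustration:
@parser::members {
    // one counter per (possibly nested) decl_specifier_seq
    java.util.ArrayDeque<Integer> typeSpecCount = new java.util.ArrayDeque<Integer>();
}
decl_specifier_seq
@init  { typeSpecCount.push(0); }
@after { typeSpecCount.pop(); }
    : decl_specifier+ ;
decl_specifier
    : 'const'
    | { typeSpecCount.isEmpty() || typeSpecCount.peek() < 1 }? type_specifier
      { if (!typeSpecCount.isEmpty()) typeSpecCount.push(typeSpecCount.pop() + 1); }
    ;
Because this predicate references neither $attributes nor $ctx, it is not context-dependent and participates in prediction; the isEmpty() guard covers decisions that reach the predicate before the enclosing rule's @init has run.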

Related

How can I specify a parameter dependent on an index value?

I'm trying to port code from DML 1.2 to DML 1.4. Here is part of the code that I ported:
group rx_queue [i < NQUEUES] {
    <...>
    param desctype = i < 64 #? regs.SRRCTL12[i].DESCTYPE.val #: regs.SRRCTL2[i - 64].DESCTYPE.val; // error occurs here
    <...>
}
Error says:
error: non-constant expression: cast(i, int64 ) < 64
How can I specify a parameter dependent on an index value?
I tried to use if...else instead of the ternary operator, but it says that conditional parameters are not allowed in DML.
Index parameters in DML are slightly magical expressions; when used within parameters, they can evaluate to either a constant or a variable depending on where the parameter is used from. Consider the following example:
group g[i < 5] {
    param x = i * 4;
    method m() {
        log info: "%d", x;
        log info: "%d", g[4 - i].x;
        log info: "%d", g[2].x;
    }
}
i becomes an implicit local variable within the m method, and in params, indices are a bit like implicit macro parameters. When the compiler encounters x in the first log statement, the param expands to i * 4 right away. In the second log statement, the x param is taken from an object indexed with the expression 4 - i, so param expansion will instead insert (4 - i) * 4. In the third log statement, the x param is taken from an object indexed with a constant, so it expands to 2 * 4, which collapses into the constant 8.
Most uses of desctype will likely happen from contexts where indices are variable, and the #? expression requires a constant boolean as condition, so this will likely give an error as soon as anyone tries to use it.
I would normally advise you to switch from #? to ? in the definition of the desctype param, but that fails in this particular case: DMLC will report error: array index out of bounds on the i - 64 expression. This error is much more confusing, and happens because DMLC automatically evaluates every parameter once with all zero indices, to smoke out misspelled identifiers; this will include evaluation of SRRCTL2[i-64] which collapses into SRRCTL2[-64] which annoys DMLC.
This is arguably a compiler bug; DMLC should probably be more tolerant in this corner. (Note that even if we removed the zero-index validation step from the compiler, your parameter would still give the same error message if it were ever explicitly referenced with a constant index, like log info: "%d", rx_queue[0].desctype).
The reason why you didn't get an error in DML 1.2 is that DML 1.2 had a single ternary operator ? that unified 1.4's ? and #?; when evaluated with a constant condition the dead branch would be disregarded without checking for errors. This had some strange effects in other situations, but made your particular use case work.
My concrete advice would be to replace the param with a method; this makes all index variables unconditionally non-constant, which avoids the problem:
method desctype() -> (uint64) {
    return i < 64 ? regs.SRRCTL12[i].DESCTYPE.val : regs.SRRCTL2[i - 64].DESCTYPE.val;
}

element => command in Terraform

I see the code below in https://github.com/terraform-aws-modules/terraform-aws-efs/blob/master/examples/complete/main.tf#L58
# Mount targets / security group
mount_targets = { for k, v in toset(range(length(local.azs))) :
  element(local.azs, k) => { subnet_id = element(module.vpc.private_subnets, k) }
}
I am trying to understand what => means here, as well as the whole construct with the for loop, element, and =>.
Could anyone explain, please?
In this case the => symbol isn't an independent language feature but is instead just one part of the for expression syntax when the result will be a mapping.
A for expression which produces a sequence (a tuple, to be specific) has the following general shape:
[
  for KEY_SYMBOL, VALUE_SYMBOL in SOURCE_COLLECTION : RESULT
  if CONDITION
]
(The KEY_SYMBOL, portion and the if CONDITION portion are both optional.)
The result is a sequence of values that resulted from evaluating RESULT (an expression) for each element of SOURCE_COLLECTION for which CONDITION (another expression) evaluated to true.
When the result is a sequence we only need to specify one result expression, but when the result is a mapping (specifically an object) we need to specify both the keys and the values, and so the mapping form has that additional portion including the => symbol you're asking about:
{
  for KEY_SYMBOL, VALUE_SYMBOL in SOURCE_COLLECTION : KEY_RESULT => VALUE_RESULT
  if CONDITION
}
The principle is the same here except that for each source element Terraform will evaluate both KEY_RESULT and VALUE_RESULT in order to produce a key/value pair to insert into the resulting mapping.
The => marker here is just some punctuation so that Terraform can unambiguously recognize where the KEY_RESULT ends and where the VALUE_RESULT begins. It has no special meaning aside from being a delimiter inside a mapping-result for expression. You could think of it as serving a similar purpose as the comma between KEY_SYMBOL and VALUE_SYMBOL; it has no meaning of its own, and is only there to mark the boundary between two clauses of the overall expression.
When I read a for expression out loud, I typically pronounce => as "maps to". So with my example above, I might pronounce it as "for each key and value in source collection, key result maps to value result if the condition is true".
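As a self-contained illustration (the values below are invented, not taken from the linked module), the following turns a list of availability zone names into a map keyed by those names, similar in spirit to the mount_targets expression above:
locals {
  azs = ["us-east-1a", "us-east-1b"]
}
output "example" {
  value = { for i, az in local.azs : az => { index = i } }
  # result: { "us-east-1a" = { index = 0 }, "us-east-1b" = { index = 1 } }
}
Each source element contributes one key/value pair: az (left of =>) becomes the key, and the object on the right of => becomes the value.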
Lambda expressions use the operator symbol =>, which reads as "goes to." Input parameters are specified on the operator's left side, and statements/expressions are specified on the right. Generally, lambda expressions are not directly used in query syntax but are often used in method calls. Query expressions may contain method calls.
Lambda expression syntax features are as follows:
It is a function without a name.
There are no modifiers, such as overloads and overrides.
The body of the function should contain an expression, rather than a statement.
It may contain a call to a function procedure but cannot contain a call to a sub procedure.
The return statement does not exist.
The value returned by the function is only the value of the expression contained in the function body.
The End Function statement does not exist.
The parameters must have specified data types or be inferred.
Does not allow generic parameters.
Does not allow optional and ParamArray parameters.
Lambda expressions provide shorthand for the compiler, allowing it to emit methods assigned to delegates.
The compiler performs automatic type inference on the lambda arguments, which is a key advantage.

Groovy compareTo for CustomClass and numbers/strings

I am building a DSL and trying to define a custom class CustomClass that you can use in expressions like
def result = customInstance >= 100 ? 'a' : 'b'
if (customInstance == 'hello') {...}
Groovy doesn't call equals for == when your class defines equals and implements Comparable (defines compareTo) at the same time.
Instead, Groovy calls compareToWithEqualityCheck, which has branching logic, and unless your custom DSL class is assignable from String or Number, your custom compareTo won't be called for the example above.
You can't make CustomClass extend String.
I feel like I am missing something. Hope you can help me figure out how to implement a simple case like I showed above.
Here is a short answer first: You could extend GString for the CustomClass. Then its compareTo method will be called in both cases - when you check for equality and when you actually compare.
Edit: Considering the following cases, it will work for 1 and 2, but not for 3.
customInstance >= 100 // case 1
customInstance == 'hallo' // case 2
customInstance == 10 // case 3
Now I will explain what I understand from the implementation in Groovy's ScriptBytecodeAdapter and DefaultTypeTransformation.
For the == operator, in case Comparable is implemented (and there is no simple identity), it tries to use the interface method compareTo, hence the same logic that is used for the other comparison operators. Only if Comparable is not implemented does it try to determine equality based on some smart type adjustments, falling back to calling the equals method as an ultima ratio. This happens in DefaultTypeTransformation.compareEqual#L603-L608.
For all other comparison operators such as >=, Groovy delegates to the compareToWithEqualityCheck method. This method is called with the equalityCheckOnly flag set to false, while it is set to true when the invocation originates from the == operator. Again, there is some Groovy smartness based on the type of the left side if it is a Number, Character, or String. If none applies, it ends up calling the compareTo method in DefaultTypeTransformation.compareToWithEqualityCheck#L584-L586.
Now, this happens only if
!equalityCheckOnly || left.getClass().isAssignableFrom(right.getClass())
|| (right.getClass() != Object.class && right.getClass().isAssignableFrom(left.getClass())) //GROOVY-4046
|| (left instanceof GString && right instanceof String)
There are some restrictions for the equalityCheckOnly case, hence when we come from the == operator. While I cannot explain all of them, I believe they are there to prevent exceptions from being thrown under specific circumstances, such as the issue mentioned in the GROOVY-4046 comment.
For brevity I omitted above that there are also cases handled up front in the ScriptBytecodeAdapter and delegated to equals right away, if the left- and right-hand sides are of the same type and one of Integer, Double, or Long.
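To see the dispatch described above in isolation, here is a small self-contained Groovy snippet (the Scored class is illustrative, not the DSL class from the question). Because the class implements Comparable and both operands have the same class, == goes through compareTo rather than equals:
class Scored implements Comparable {
    int compareTo(Object other) { println 'compareTo called'; 0 }
    boolean equals(Object other) { println 'equals called'; true }
    int hashCode() { 0 }
}
assert new Scored() == new Scored()   // prints "compareTo called", not "equals called"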

Bison/Flex, reduce/reduce, identifier in different production

I am writing a parser in Bison/Flex.
This is part of my code (grammar excerpt not reproduced here):
I want to implement the assignment production so that an identifier can be either a boolean_expr or an expr; its type will be checked against a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any solution to this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably be something like:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the following two declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up a scanned identifier token in the symbol table and use that information to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work, but it should be obvious that it is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
  public:
    X(int n = 0) : data_is_available_(n) {}
    operator bool() const { return data_is_available_; }
    // ...
  private:
    bool data_is_available_;
    // ...
};
X my_x_object();
// ...
if (!my_x_object) {
    // This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: product | product '<' product ;
product: factor | product '*' factor ;
factor: term | factor '+' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
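For concreteness, a minimal sketch of what such an AST node and its constructor might look like in C; all names here (Node, new_node, the NodeKind constants) are illustrative, not from the question:
#include <stdlib.h>
/* the parser's semantic value type would be a Node* */
enum NodeKind { NODE_OR, NODE_AND, NODE_LT, NODE_MUL, NODE_ADD, NODE_IDENT, NODE_CONST };
typedef struct Node {
    enum NodeKind kind;
    struct Node *left;   /* NULL for leaf nodes */
    struct Node *right;
    char *name;          /* set for NODE_IDENT */
    long value;          /* set for NODE_CONST */
} Node;
static Node *new_node(enum NodeKind kind, Node *left, Node *right) {
    Node *n = calloc(1, sizeof *n);
    n->kind = kind;
    n->left = left;
    n->right = right;
    return n;
}
An action such as expr: expr "or" conjunction { $$ = new_node(NODE_OR, $1, $3); } then just builds the tree, and a later pass checks, for example, that both operands of a NODE_OR node are boolean.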
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Token empty when matching grammar although rule matched

So my rule is
/* Addition and subtraction have the lowest precedence. */
additionExp returns [double value]
    : m1=multiplyExp {$value = $m1.value;}
      ( op=AddOp m2=multiplyExp )* {
          if ($op != null) { // test if matched
              if ($op.text == "+") {
                  $value += $m2.value;
              } else {
                  $value -= $m2.value;
              }
          }
      }
    ;
AddOp : '+' | '-' ;
My test is 3 + 4 but op.text always returns NULL and never a char.
Does anyone know how I can test for the value of AddOp?
In the example from ANTLR4 Actions and Attributes it should work:
stat: ID '=' INT ';'
      {
          if ( !$block::symbols.contains($ID.text) ) {
              System.err.println("undefined variable: "+$ID.text);
          }
      }
    | block
    ;
Are you sure $op.text is always null? Your comparison appears to check for $op.text=="+" rather than checking for null.
I always start these answers with a suggestion that you migrate all of your action code to listeners and/or visitors when using ANTLR 4. It will clean up your grammar and greatly simplify long-term maintenance of your code.
This is probably the primary problem here: Comparing String objects in Java should be performed using equals: "+".equals($op.text). Notice that I used this ordering to guarantee that you never get a NullPointerException, even if $op.text is null.
I recommend removing the op= label and referencing $AddOp instead.
When you switch to using listeners and visitors, removing the explicit label will marginally reduce the size of the parse tree.
(Only relevant to advanced users) In some edge cases involving syntax errors, labels may not be assigned while the object still exists in the parse tree. In particular, this can happen when a label is assigned to a rule reference (your op label is assigned to a token reference), and an error appears within the labeled rule. If you reference the context object via the automatically generated methods in the listener/visitor, the instances will be available even when the labels weren't assigned, improving your ability to report details of some errors.
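For reference, here is roughly what the original rule looks like with the string comparison fixed (an untested sketch; moving the accumulation action inside the (...)* loop so it runs once per matched operator is my assumption about the intended semantics, and per the suggestion above you could also drop the op= label and reference $AddOp instead):
additionExp returns [double value]
    : m1=multiplyExp { $value = $m1.value; }
      ( op=AddOp m2=multiplyExp
        {
            if ( "+".equals($op.text) ) {
                $value += $m2.value;
            } else {
                $value -= $m2.value;
            }
        }
      )*
    ;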
