Is it possible to lookahead in ANTLR4 without actually matching a token? - antlr4

I am writing a compiler by translating JavaCC to ANTLR4 and one of the rules involve passing parameters and getting return values from it.
I have to do something like the following for a rule 'term':
Term term(ReadOptions options, int priority):
{
int p = options.operatorSet.getNextLevel(priority);
Term t;
}
{
(
LOOKAHEAD({p==0})
t = simpleTerm(options)
|
LOOKAHEAD(<NAME_TOKEN>,{priority==1201 && is1201Separator(2)})
t = name()
|
t = operatorTerm(options, p)
)
{return t;}
}
The problem is that how do I match sub-rules on the basis of the value of 'p'. In the previous versions of ANTLR I could have used => and my problem would have solved but what do I do in ANTLR4 ?

The => operator in previous versions of ANTLR is no longer necessary in ANTLR 4.
ANTLR 4 does not support syntactic predicates because its lookahead algorithm fully supports infinite lookahead. If you used the form (x) => y previously, in ANTLR 4 you can simply use y.
Semantic predicates are still supported, but in ANTLR 4 all semantic predicates work like gated semantic predicates in ANTLR 3. If you used the form {x}? => y previously, then in ANTLR 4 you can simply use {x}? y.

Related

Proving the inexpressibility of a function in a given language

I'm currently reading through John C. Mitchell's Foundations for Programming Languages. Exercise 2.2.3, in essence, asks the reader to show that the (natural-number) exponentiation function cannot be implicitly defined via an expression in a small language. The language consists of natural numbers and addition on said numbers (as well as boolean values, a natural-number equality predicate, & ternary conditionals). There are no loops, recursive constructs, or fixed-point combinators. Here is the precise syntax:
<bool_exp> ::= <bool_var> | true | false | Eq? <nat_exp> <nat_exp> |
if <bool_exp> then <bool_exp> else <bool_exp>
<nat_exp> ::= <nat_var> | 0 | 1 | 2 | … | <nat_exp> + <nat_exp> |
if <bool_exp> then <nat_exp> else <nat_exp>
Again, the object is to show that the exponentiation function n^m cannot be implicitly defined via an expression in this language.
Intuitively, I'm willing to accept this. If we think of exponentiation as repeated multiplication, it seems like we "just can't" express that with this language. But how does one formally prove this? More broadly, how do you prove that an expression from one language cannot be expressed in another?
Here's a simple way to think about it: the expression has a fixed, finite size, and the only arithmetic operation it can do to produce numbers not written as literals or provided as the values of variables is addition. So the largest number it can possibly produce is limited by the number of additions plus 1, multiplied by the largest number involved in the expression.
So, given a proposed expression, let k be the number of additions in it, let c be the largest literal (or 1 if there is none) and choose m and n such that n^m > (k+1)*max(m,n,c). Then the result of the expression for that input cannot be the correct one.
Note that this proof relies on the language allowing arbitrarily large numbers, as noted in the other answer.
No solution, only hints:
First, let me point out that if there are finitely many numbers in the language, then exponentiation is definable as an expression. (You'd have to define what it should produce when the true result is unrepresentable, eg wraparound.) Think about why.
Hint: Imagine that there are only two numbers, 0 and 1. Can you write an expression involving m and n whose result is n^m? What if there were three numbers: 0, 1, and 2? What if there were four? And so on...
Why don't any of those solutions work? Let's index them and call the solution for {0,1} partial_solution_1, the solution for {0,1,2} partial_solution_2, and so on. Why isn't partial_solution_n a solution for the set of all natural numbers?
Maybe you can generalize that somehow with some metric f : Expression -> Nat so that every expression expr with f(expr) < n is wrong somehow...
You may find some inspiration from the strategy of Euclid's proof that there are infinitely many primes.

Alloy pred declaration: is there a difference between square brackets and parentheses?

The question probably has a yes/no answer. Consider the snippet:
sig A { my : lone B }
sig B { }
pred single1 [x:A]{ // defined using []
#x.my = 0
}
pred single2 (x:A){ // defined using ()
#x.my = 0
}
// these two runs produce the exact same results
run single1 for 3 but exactly 1 A
run single2 for 3 but exactly 1 A
check oneOfTheMostTrivialQuestionsOnStackOverflow { all x: A |
single1[x] iff single2[x] // pred calls use [], so as expected, single2(x) would cause a syntax error
} for 3000 but exactly 1 A // assertion holds :)
Are single1 and single2 exactly the same?
They seem to be, but am I missing something?
When we extended the syntax in Alloy 4, we changed the predicate invocations to []. My recollection is that we did it to make parsing easier, so that if you had a predicate P with no args, you could call it as just "P", and there would be no problems if it were followed by a formula in parens "P (...)". As Peter notes, it also seemed reasonable since it's similar to the relational lookup operator, and this makes sense especially for functions. We added the ability to declare predicates and functions with [] for consistency, but saw no reason to prevent () in decls (since there's no possible ambiguity there).
I think the parentheses were originally used for predicates and functions. However, they were changed in favour of the square brackets because it made it look more relational. I vaguely recall that Daniel Jackson explains this in his book.
That said, why ask because you seem to have proven it yourself? :-)

Bison/Flex, reduce/reduce, identifier in different production

I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so the identifier can be both boolean_expr or expr, its type will be checked by a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, it is reduce/reduce if I include identifier in both term and boolean_expr, any solution to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the part of the grammar shown, that production is ambiguous, because:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look a scanned identifier token in the symbol table and use the information in the symbol table to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because the now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!x) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: product | product '<' product ;
product: factor | product '*' factor ;
factor: term | factor '+' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

ANTLR4 Semantic Predicates that is Context Dependent Does Not Work

I am parsing a C++ like declaration with this scaled down grammar (many details removed to make it a fully working example). It fails to work mysteriously (at least to me). Is it related to the use of context dependent predicate? If yes, what is the proper way to implement the "counting the number of child nodes logic"?
grammar CPPProcessor;
cppCompilationUnit : decl_specifier_seq? init_declarator* ';' EOF;
init_declarator: declarator initializer?;
declarator: identifier;
initializer: '=0';
decl_specifier_seq
locals [int cnt=0]
#init { $cnt=0; }
: decl_specifier+ ;
decl_specifier : #init { System.out.println($decl_specifier_seq::cnt); }
'const'
| {$decl_specifier_seq::cnt < 1}? type_specifier {$decl_specifier_seq::cnt += 1;} ;
type_specifier: identifier ;
identifier:IDENTIFIER;
CRLF: '\r'? '\n' -> channel(2);
WS: [ \t\f]+ -> channel(1);
IDENTIFIER:[_a-zA-Z] [0-9_a-zA-Z]* ;
I need to implement the standard C++ rule that no more than 1 type_specifier is allowed under an decl_specifier_seq.
Semantic predicate before type_specifier seems to be the solution. And the count is naturally declared as a local variable in decl_specifier_seq since nested decl_specifier_seq are possible.
But it seems that a context dependent semantic predicate like the one I used will produce incorrect parsing i.e. a semantic predicate that references $attributes. First an input file with correct result (to illustrate what a normal parse tree looks like):
int t=0;
and the parse tree:
But, an input without the '=0' to aid the parsing
int t;
0
1
line 1:4 no viable alternative at input 't'
1
the parsing failed with the 'no viable alternative' error (the numbers printed in the console is debug print of the $decl_specifier_cnt::cnt value as a verification of the test condition). i.e. the semantic predicate cannot prevent the t from being parsed as type_specifier and t is no longer considered a init_declarator. What is the problem here? Is it because a context dependent predicate having $decl_specifier_seq::cnt is used?
Does it mean context dependent predicate cannot be used to implement "counting the number of child nodes" logic?
EDIT
I tried new versions whose predicate uses member variable instead of the $decl_specifier_seq::cnt and surprisingly the grammar now works proving that the Context Dependent predicate did cause the previous grammar to fail:
....
#parser::members {
public int cnt=0;
}
decl_specifier
#init {System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 }? type_specifier {cnt++;} ;
A normal parse tree is resulted:
This gives rise to the question of how to support nested rule if we must use member variables to replace the local variables to avoid context sensitive predicates?
And a weird result is that if I add a /*$ctx*/ after the predicate, it fails again:
decl_specifier
#init {System.out.println("cnt:"+cnt); }
:
'const'
| {cnt<1 /*$ctx*/ }? type_specifier {cnt++;} ;
line 1:4 no viable alternative at input 't'
The parsing failed with no viable alternative. Why the /*$ctx*/ causes the parsing to fail like when $decl_specifier_seq::cnt is used although the actual logic uses a member variable only?
And, without the /*$ctx*/, another issue related to the predicate called before #init block appears(described here)
ANTLR 4 evaluates semantic predicates in two cases.
The generated code evaluates a semantic predicate during parsing, and throws an exception of the evaluation returns false. All predicates traversed during parsing are evaluated in this way, including context-dependent predicates and predicates which do not appear at the left side of a decision.
The prediction method evaluates predicates in order to make correct decisions during parsing. In this case, predicates which appear anywhere other than the left edge of the decision being evaluated are assumed to return true (i.e. they are ignored). In addition, context-dependent predicates are only evaluated if the context data is available. The prediction algorithm will not create context structures that were not already provided by the parsing code. If a context-dependent predicate is encountered during prediction and no context is available, the predicate is assumed to return true (i.e. it is ignored for that decision).
The code generator does not evaluate the semantics of the target language, so it has no way to know that $ctx is semantically irrelevant when it appears in /*$ctx*/. Both cases result in the predicate being treated as context-dependent.

Flex and Bison Associativity difficulty

Using Flex and Bison, I have a grammar specification for a boolean query language, which supports logical "and", "or", and "not" operations, as well as nested subexpressions using "()".
All was well until I noticed that queries like "A and B or C and D" which I'd like parsed as "(A & B) | (C & D)" was actually being interpreted as "A & ( B | ( C & D ) )". I'm nearly certain this is an associativity issue, but can't seem to find a proper explanation or example anywhere - that or I'm missing something important.
Pertinent information from boolpars.y:
%token TOKEN
%token OPEN_PAREN CLOSE_PAREN
%right NOT
%left AND
%left OR
%%
query: expression { ... }
;
expression: expression AND expression { ... }
| expression OR expression { ... }
| NOT expression { ... }
| OPEN_PAREN expression CLOSE_PAREN { ... }
| TOKEN { ... }
;
Can anyone find the flaw? I can't see why Bison isn't giving "or" appropriate precedence.
From bison docs:
Operator precedence is determined by
the line ordering of the declarations;
the higher the line number of the
declaration (lower on the page or
screen), the higher the precedence.
So in your case OR is lower on the screen and has higher precedence.
Change the order to
%left OR
%left AND
(I haven't tested it though)
Why not split up the productions, as in this snippet from
a C-ish language
logical_AND_expression:
inclusive_OR_expression
| logical_AND_expression ANDAND inclusive_OR_expression
{$$ = N2(__logand__, $1, $3);}
;
logical_OR_expression:
logical_AND_expression
| logical_OR_expression OROR logical_AND_expression
{$$ = N2(__logor__, $1, $3);}
;
I've performed tests on my own implementation, and from my tests, marcin's answer is correct. If I define the precedence as:
%left OR
%left AND
Then the expression A&B|C&D will be reduced to ((A&B)|(C&D))
If I define the precedence as:
%left AND
%left OR
Then the expression A&B|C&D will be reduced to ((A&(B|C))&D)
One differentiating expression would be:
true & true | true & false
The former precedence definition would render this as true, whereas the latter would render it as false. I've tested both scenarios and both work as explained.
Double check your tests to make sure. Also note that it is the order of the %left, %right, etc. definitions in the header portion that define the precedence, not the order that you define your rules themselves. If it's still not working, maybe it's some other area in your code that's messing it up, or maybe your version of bison is different (I'm just shooting in the dark at this point).

Resources