Elegant way to parse Calculator.g4 in grammars-v4 using antlr4 listener model

I learned the basic grammar of ANTLR4 and tried to build a simple calculator, but I have no idea how to handle PLUS | MINUS and TIMES | DIV.
expression: multiplyingExpression ((PLUS | MINUS) multiplyingExpression)*;
multiplyingExpression: signedAtom ((TIMES | DIV) signedAtom)*;
signedAtom: PLUS signedAtom | MINUS signedAtom | func_ | atom;
(code extracted from Antlr4 sample calculator grammar)
It seems that no listener API can handle PLUS | MINUS, because they are not defined as rules like signedAtom/expression, which can be handled in methods like exitXXX.
Does this mean that a grammar like this can only be processed with the visitor model?
Example:
Here is an extremely simple example in Go.
calculator.g4
grammar calculator;
expression: atom ((PLUS | MINUS) atom)*;
atom: '1';
PLUS: '+';
MINUS: '-';
WS: [ \r\n\t]+ -> skip;
Code in Golang listener model
func NewCalculatorListenrImpl() *CalculatorListenrImpl {
return &CalculatorListenrImpl{
stack: stack.New(),
}
}
type CalculatorListenrImpl struct {
BasecalculatorListener
stack *stack.Stack
}
func (s *CalculatorListenrImpl) ExitExpression(ctx *ExpressionContext) {
left, op, right := s.stack.Pop().(int), s.stack.Pop().(string), s.stack.Pop().(int)
switch op {
case "+":
s.stack.Push(left + right)
case "-":
s.stack.Push(left - right)
}
}
func (s *CalculatorListenrImpl) ExitAtom(ctx *AtomContext) {
v, _ := strconv.ParseInt(ctx.GetText(), 10, 32)
s.stack.Push(int(v))
}
I can push the atom element onto the stack in exitAtom and then handle the logic in the exitExpression method. However, there is no listener for (PLUS | MINUS). What I am looking for is a method like exitPLUS where I can simply push them onto the stack like atom. Maybe there are other ways to do this? I know there is some magic syntax like # and op=xxx, but these code fragments are copied from grammars-v4.
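One way to stay in the listener model is to read the operators from the expression context itself: the PLUS and MINUS terminals are children of the ExpressionContext, so exitExpression can walk its children in order instead of waiting for a separate callback. The sketch below is in Java rather than Go and deliberately does not use the ANTLR runtime. The list of strings stands in for the texts of the context's children (what ctx.getChild(i).getText() would return), the deque stands in for the stack filled by exitAtom, and the names ListenerSketch and evaluate are illustrative only.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class ListenerSketch {
    // Stand-in for the stack that exitAtom pushes operand values onto.
    final Deque<Integer> stack = new ArrayDeque<>();

    // Stand-in for exitAtom: push the value of one atom.
    void exitAtom(String text) {
        stack.push(Integer.parseInt(text));
    }

    // Stand-in for exitExpression. childTexts mimics the texts of the
    // expression's children in order, e.g. ["1", "+", "1", "-", "1"].
    void exitExpression(List<String> childTexts) {
        int atomCount = (childTexts.size() + 1) / 2;
        // The last atom is on top of the stack, so fill the array from the back.
        int[] operands = new int[atomCount];
        for (int i = atomCount - 1; i >= 0; i--) {
            operands[i] = stack.pop();
        }
        // Fold left to right; operators sit at the odd child positions.
        int result = operands[0];
        for (int i = 1; i < atomCount; i++) {
            String op = childTexts.get(2 * i - 1);
            switch (op) {
                case "+": result += operands[i]; break;
                case "-": result -= operands[i]; break;
                default: throw new IllegalStateException("unexpected operator " + op);
            }
        }
        stack.push(result);
    }

    // Simulates the order of callbacks a tree walker would produce.
    static int evaluate(List<String> childTexts) {
        ListenerSketch listener = new ListenerSketch();
        for (int i = 0; i < childTexts.size(); i += 2) {
            listener.exitAtom(childTexts.get(i));
        }
        listener.exitExpression(childTexts);
        return listener.stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(evaluate(List.of("1", "+", "1", "-", "1")));
    }
}
```

With the Go target the same idea would use the context's child accessors in ExitExpression; the grammar from grammars-v4 does not need to be changed for this.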

Related

Antlr4 Adjacent Token Precedence

I'm running into a problem while building a complex grammar. A pet grammar to illustrate it is below:
grammar test;
start: (r1 | r2 | .)*
r1: A B
r2: B C
// A B C are tokens
When the following input occurs:
ABC
The parse tree looks like this:
start
| \
r1 C
| \
A B
But what I actually want is for it to look like this:
start
| \
A r2
| \
B C
I've tried reordering the rules & adding <assoc=right>, but nothing seems to work except removing rule r1, which is incorrect because I expect AB and BC to be valid inputs. What am I missing?
EDIT
It seems the above problem description oversimplifies the actual issue, so I'll give more details:
r3: rA r4 // prefers rA(classB classC) over (rA classB)classC
r4: classB? classC // also used elsewhere other than r3
rA: // rules to build A subtree, ends with classB? in 'some' cases
classB: B1 | B2 | ... | Bm
classC: C1 | C2 | ... | Cn
I've found that the following 'kind of' works:
r3: rA Bx classC | ...
But the following doesn't:
r3: <assoc=right> rA r4 | ... // still builds (rA classB)classC
I'm wondering if there's a way I can build the tree correctly while being able to utilize r4 and its associated code (and avoid having to put another m lines for all instances of B)?
PS. rA is expensive, so expanding B tokens in r3 like above throws performance to the dogs.

The problem I see here is that you tell the parser to produce the parse tree you don't want. If you don't want it then don't specify the grammar that is supposed to produce it.
Similar to what Mike Cargal came up with I think the real solution is to more explicitly specify what you want to see at the end. Here's something that works pretty well (using your initial problem description and MikeC's test input):
parser grammar testparser;
options {
tokenVocab = testlexer;
}
start: (A r2 | r1 | .)*? EOF;
r1: A B;
r2: B C;
lexer grammar testlexer;
A: 'A';
B: 'B';
C: 'C';
WHITE_SPACE: [ \u000B\t\r\n] -> skip ;
OTHER: .;
With the input AB!C2 I get this parse tree:
Leaving out C this changes to:
The main change is that you specialise the rule to make BC match in its own sub parse tree, by adding A to the r2 alt and putting that alt first.
Note
Moving that single A down into the r2 rule will break this, because then you tell the parser to create a sub tree with ABC in it (which is what you don't want).

I suspect that you'll find that you're really fighting the way a recursive descent parser works.
When the parser tries to match rA it will try to match as many input tokens as possible, and is not remotely aware of the parent's subsequent rule (i.e. the current rule's "next sibling"). (<assoc=right> isn't going to make the rule look up to its parent and next sibling; it's designed for building the correct parse trees for things like exponentiation.)
As such, you'd need to use something like a semantic predicate to "block" the wrong alternative from matching, by looking far enough ahead to determine whether it should actually match r4 (for example).
For your example, the following grammar parses the input "AB!C2" as
grammar test
;
start: r3 EOF;
r3
: rA r4 // prefers rA(classB classC) over (rA classB)classC
;
r4
: classB? classC // also used elsewhere other than r3
;
rA // rules to build A subtree, ends with classB? in 'some' cases
: A C1
| A
| {!( //
_input.get(_input.index()).getType() == C1 || //
_input.get(_input.index()).getType() == C2) //
}? A classB?
;
classB: B1 | B2;
classC: C1 | C2;
A: 'A';
B1: 'B1';
B2: 'B2';
C1: 'C1';
C2: 'C2';
WS: [ \n\r] -> skip;
OTHER: .;
However, that's making an assumption that you can determine an r4 rule match by just looking ahead at the second token that rule would possibly examine. I would suspect that this is completely untenable from a maintenance and understanding standpoint.
You could slightly improve maintainability by including functions using the @parser::members ability. This allows you to have more complex logic for your predicate, while not littering the actual grammar too badly.
grammar test
;
@parser::members {
private int[] laValues = new int[] { C1, C2 };
private boolean r4Follows() {
int la = _input.get(_input.index()).getType();
for (int i = 0; i < laValues.length; i++) {
if (la == laValues[i])
return true;
}
return false;
}
}
start: r3 EOF;
r3
: rA r4 // prefers rA(classB classC) over (rA classB)classC
;
r4
: classB? classC // also used elsewhere other than r3
;
rA // rules to build A subtree, ends with classB? in 'some' cases
: A C1
| A
| A classB {!r4Follows()}?
;
classB: B1 | B2;
classC: C1 | C2;
A: 'A';
B1: 'B1';
B2: 'B2';
C1: 'C1';
C2: 'C2';
WS: [ \n\r] -> skip;
OTHER: .;
However, this is still likely to be a mess to maintain. I think the bottom line is that what you're trying to do creates an ambiguity as to which rule classB belongs to.
A recursive descent parser doesn't really see an ambiguity; it just matches the rule it's working on at that time (rA) the best it can, and that includes pulling in the classB match. You'll have to introduce a semantic predicate to prevent that, and that's going to be pretty much unmanageable.
Addendum:
Against all that is holy in the ANTLR realm, I did work out a way to use the actual r4() rule to test for the match of a rule at the current position of the token stream. I HIGHLY recommend you file this under "stupid ANTLR tricks".
Since this actually attempts the parse of r4, depending on the complexity involved, it could substantially impact performance.
Also, I can make no guarantee that this doesn't still leave some state violated, though it worked on simple examples.
NOTE: The location of the semantic predicate is important as the current index of the token stream is advancing as you progress through the rule.
grammar test
;
@parser::header {import java.util.function.Supplier;}
@parser::members {
private BailErrorStrategy bailStrategy = new BailErrorStrategy();
private boolean ruleMatches(Supplier<ParserRuleContext> rule) {
boolean result = false;
// save state
int idx = _input.index();
int savedState = this.getState();
List<ParseTree> savedChildren = _ctx.children;
_ctx.children = new ArrayList<>();
ANTLRErrorStrategy savedErrStrategy = this.getErrorHandler();
this.setErrorHandler(bailStrategy);
try {
ParserRuleContext attempt = rule.get();
result = true;
} catch (ParseCancellationException pce) {
result = false;
} finally {
// restore state
this.setErrorHandler(savedErrStrategy);
_ctx.children = savedChildren;
this.setState(savedState);
_input.seek(idx);
}
return result;
}
}
start: r3 EOF;
r3
: rA r4? // prefers rA(classB classC) over (rA classB)classC
;
r4
: classB? classC // also used elsewhere other than r3
;
rA // rules to build A subtree, ends with classB? in 'some' cases
: A {!ruleMatches(this::r4)}? classB?
| A D
;
classB: B1 | B2;
classC: C1 | C2;
A: 'A';
B1: 'B1';
B2: 'B2';
C1: 'C1';
C2: 'C2';
D: 'D';
WS: [ \n\r] -> skip;
OTHER: .;
input "AB1C1"
input "ADC1"
input "AC1"
input "AB1"
(I believe it is now required that I go sacrifice an innocent kitten to the ANTLR gods to save my soul for posting that)

Alloy Analyzer element comparision from set

Some background: my project is to make a compiler that compiles from a c-like language to Alloy. The input language, that has c-like syntax, must support contracts. For now, I am trying to implement if statements that support pre and post condition statements, similar to the following:
int x=2
if_preCondition(x>0)
if(x == 2){
x = x + 1
}
if_postCondtion(x>0)
The problem is that I am a bit confused with the results of Alloy.
sig Number{
arg1: Int,
}
fun addOneConditional (x : Int) : Number{
{ v : Number |
v.arg1 = ((x = 2 ) => x.add[1] else x)
}
}
assert conditionalSome {
all n: Number| (n.arg1 = 2 ) => (some field: addOneConditional[n.arg1] | { field.arg1 = n.arg1.add[1] })
}
assert conditionalAll {
all n: Number| (n.arg1 = 2 ) => (all field: addOneConditional[n.arg1] | { field.arg1 = n.arg1.add[1] })
}
check conditionalSome
check conditionalAll
In the above example, conditionalAll does not generate any counterexample. However, conditionalSome generates counterexamples. If I understand the all and some quantifiers correctly, then there is a mistake, because from mathematical logic we have ∀x expr(x) => ∃x expr(x) (i.e. if expr(x) is true for all values of x, then there exists an x for which expr(x) is true).
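The implication ∀x expr(x) => ∃x expr(x) only holds when the set being quantified over is non-empty: over an empty set, all is vacuously true while some is false. One possible reading of the result is that addOneConditional[n.arg1] is empty in some instance (for example, when no Number atom has the incremented value in arg1); that is an assumption about the model, not something the question shows. The Java sketch below only illustrates the quantifier behaviour, with allMatch and anyMatch standing in for all and some; the class and method names are made up for the illustration.

```java
import java.util.List;

class QuantifierSketch {
    // "all v : domain | v = expected"
    static boolean all(List<Integer> domain, int expected) {
        return domain.stream().allMatch(v -> v == expected);
    }

    // "some v : domain | v = expected"
    static boolean some(List<Integer> domain, int expected) {
        return domain.stream().anyMatch(v -> v == expected);
    }

    public static void main(String[] args) {
        List<Integer> empty = List.of();
        List<Integer> nonEmpty = List.of(3);
        // Empty domain: "all" holds vacuously, "some" fails.
        System.out.println(all(empty, 3) + " " + some(empty, 3));
        // Non-empty domain: "all" implies "some".
        System.out.println(all(nonEmpty, 3) + " " + some(nonEmpty, 3));
    }
}
```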

The first thing is that you need to model your pre-conditions, post-conditions and operations. Functions are terrible for that because they cannot return something that indicates failure. You therefore need to model the operation as a predicate. The value of a predicate indicates whether the pre/post conditions are satisfied; the arguments and return values can be modeled as parameters.
This is as far as I can understand your operation:
pred add[ x : Int, x' : Int ] {
x > 0 -- pre condition
x = 2 =>
x'=x.plus[1]
else
x'=x
x' > 0 -- post condition
}
Alloy has no writable variables (Electrum has) so you need to model the before and after state with a prime (') character.
We can now use this predicate to calculate the set of solutions to your problem:
fun solutions : set Int {
{ x' : Int | some x : Int | add[ x,x' ] }
}
We create a set with integers for which we have a result. The prime character is nothing special in Alloy, it is only a convention for the post-state. I am abusing it slightly here.
This is more than enough Alloy source to make mistakes so let's test this.
run { some solutions }
If you run this then you'll see in the Txt view:
skolem $solutions={1, 3, 4, 5, 6, 7}
This is as expected. The add operation only works for positive numbers. Check. If the input is 2, the result is 3. Ergo, 2 can never be a solution. Check.
I admit, I am slightly confused by what you're doing in your asserts. I've tried to replicate them faithfully, although I've removed things that I think were unnecessary. First, your some case. Your code was doing an all but then selecting on 2, so I removed the outer quantification and hardcoded 2.
check somen {
some x' : solutions | 2.plus[1] = x'
}
This indeed does not give us any counterexample. Since solutions was {1, 3, 4, 5, 6, 7}, 2+1=3 is in the set, i.e. the some condition is satisfied.
check alln {
all x' : solutions | 2.plus[1] = x'
}
However, not all solutions have 3 as the answer. If you check this, I get the following counter-example:
skolem $alln_x'={7}
skolem $solutions={1, 3, 4, 5, 6, 7}
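The same set can be reproduced outside Alloy by brute force. The Java sketch below assumes Alloy's default 4-bit Int scope (values -8 to 7), which is consistent with 7 being the largest value in the output above; the class and method names are illustrative and not part of the model.

```java
import java.util.Set;
import java.util.TreeSet;

class AddSolutions {
    // Mirrors pred add[x, x']: pre-condition, conditional increment, post-condition.
    static boolean add(int x, int xPrime) {
        if (!(x > 0)) return false;            // pre condition
        int expected = (x == 2) ? x + 1 : x;   // x = 2 => x' = x + 1 else x' = x
        if (xPrime != expected) return false;
        return xPrime > 0;                     // post condition
    }

    // Mirrors fun solutions: every x' for which some x satisfies add[x, x'].
    static Set<Integer> solutions() {
        Set<Integer> result = new TreeSet<>();
        for (int xPrime = -8; xPrime <= 7; xPrime++) {   // assumed 4-bit Int range
            for (int x = -8; x <= 7; x++) {
                if (add(x, xPrime)) result.add(xPrime);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> s = solutions();
        System.out.println(s);                                     // the solution set
        System.out.println(s.stream().anyMatch(v -> v == 3));      // the "some" check
        System.out.println(s.stream().allMatch(v -> v == 3));      // the "all" check
    }
}
```

Under that assumption the enumeration yields 1, 3, 4, 5, 6 and 7, so the some check holds and the all check fails on any element other than 3.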
Conclusion. Daniel Jackson advises not to learn Alloy with Ints. Looking at your Number class, you took him literally, yet you still base your problem on Ints. What he meant is: do not use Int at all, rather than hiding it under the carpet in a field. I understand where Daniel is coming from, but Ints are very attractive since we're so familiar with them. However, if you use Ints, at least let them appear in their full glory and don't hide them.
Hope this helps.
And the whole model:
pred add[ x : Int, x' : Int ] {
x > 0 -- pre condition
x = 2 =>
x'=x.plus[1]
else
x'=x
x' > 0 -- post condition
}
fun solutions : set Int { { x' : Int | some x : Int | add[ x,x' ] } }
run { some solutions }
check somen { some x' : solutions | x' = 3 }
check alln { all x' : solutions | x' = 3 }

non-fragment lexer rule x can match the empty string

What's wrong with the following antlr lexer?
I got an error
warning(146): MySQL.g4:5685:0: non-fragment lexer rule VERSION_COMMENT_TAIL can match the empty string
Attached source code
VERSION_COMMENT_TAIL:
{ VERSION_MATCHED == False }? // One level of block comment nesting is allowed for version comments.
((ML_COMMENT_HEAD MULTILINE_COMMENT) | . )*? ML_COMMENT_END { self.setType(MULTILINE_COMMENT); }
| { self.setType(VERSION_COMMENT); IN_VERSION_COMMENT = True; }
;

You are trying to convert my ANTLR3 grammar for MySQL to ANTLR4? Remove all the comment rules in the lexer and insert this instead:
// There are 3 types of block comments:
// /* ... */ - The standard multi line comment.
// /*! ... */ - A comment used to mask code for other clients. In MySQL the content is handled as normal code.
// /*!12345 ... */ - Same as the previous one except code is only used when the given number is a lower value
// than the current server version (specifying so the minimum server version the code can run with).
VERSION_COMMENT_START: ('/*!' DIGITS) (
{checkVersion(getText())}? // Will set inVersionComment if the number matches.
| .*? '*/'
) -> channel(HIDDEN)
;
// inVersionComment is a variable in the base lexer.
MYSQL_COMMENT_START: '/*!' { inVersionComment = true; setChannel(HIDDEN); };
VERSION_COMMENT_END: '*/' {inVersionComment}? { inVersionComment = false; setChannel(HIDDEN); };
BLOCK_COMMENT: '/*' ~[!] .*? '*/' -> channel(HIDDEN);
POUND_COMMENT: '#' ~([\n\r])* -> channel(HIDDEN);
DASHDASH_COMMENT: DOUBLE_DASH ([ \t] (~[\n\r])* | LINEBREAK | EOF) -> channel(HIDDEN);
You need a local inVersionComment member and a function checkVersion() in your lexer (I have it in the base lexer from which the generated lexer derives) which returns true or false, depending on whether the current server version is equal to or higher than the given version.
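A minimal sketch of such a helper, written as a plain Java class so that it runs without ANTLR. In a real grammar this logic would live in the base lexer class; the field serverVersion, its numeric encoding (for example 50707 for server 5.7.7) and the class name are assumptions made for the illustration, not part of the original grammar.

```java
class VersionCheckSketch {
    // Assumed encoding of the running server version, e.g. 50707 for 5.7.7.
    private final long serverVersion;
    // Set when a version comment is accepted; cleared again by the closing "*/" rule.
    boolean inVersionComment = false;

    VersionCheckSketch(long serverVersion) {
        this.serverVersion = serverVersion;
    }

    // text is the input matched so far, e.g. "/*!50100".
    boolean checkVersion(String text) {
        if (!text.startsWith("/*!") || text.length() < 4) {
            return false;
        }
        long version;
        try {
            version = Long.parseLong(text.substring(3));
        } catch (NumberFormatException e) {
            return false; // no usable version number after "/*!"
        }
        // Accept the comment body as code only if the server is new enough.
        if (version <= serverVersion) {
            inVersionComment = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        VersionCheckSketch lexer = new VersionCheckSketch(50707);
        System.out.println(lexer.checkVersion("/*!50100")); // older than the server
        System.out.println(lexer.checkVersion("/*!80000")); // newer than the server
    }
}
```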
And for your question: you cannot have actions in alternatives. Actions can only appear at the end of an entire rule. This differs from ANTLR3.

ANTLR4 lexer rule with @init block

I have this lexer rule defined in my ANTLR v3 grammar file - it matches text in double quotes.
I need to convert it to ANTLR v4. The ANTLR compiler throws the error 'syntax error: mismatched input '@' expecting COLON while matching a lexer rule' (on the @init line). Can a lexer rule contain an @init block? How should this be rewritten?
DOUBLE_QUOTED_CHARACTERS
@init
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
: ('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
;
Sample data:
" " -> throws error "Illegal empty quotes!";
"asd -> throws error "Missing closing double quote!"
"text" -> returns text (valid input, content of "...")

I think this is the right way to do this.
DOUBLE_QUOTED_CHARACTERS
:
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
(
('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
)
;
There are some other errors in the above as well, such as the ('"' WS* '"') => ... predicate, but I am not correcting them as part of this answer, just to keep things simple. I took the hint from here
Just to hedge against that link moving or becoming invalid after sometime, quoting the text as is:
Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a role that has multiple alternatives, you can enclose the alts in parentheses and put the action afterwards:
END : ('endif'|'end') {System.out.println("found an end");} ;
The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions.
Only actions within the outermost token rule are executed. In other words, if STRING calls ESC_CHAR and ESC_CHAR has an action, that action is not executed when the lexer starts matching in STRING.

I encountered this problem when my .g4 grammar imported a lexer file. Importing grammar files seems to trigger lots of undocumented shortcomings in ANTLR4, so ultimately I had to stop using import.
In my case, once I merged the lexer grammar into the parser grammar (one single .g4 file), my @init and @after parsing errors vanished. I should submit a test case + bug, at least to get this documented. I will update here once I do that.
I vaguely recall 2-3 issues with respect to importing lexer grammar into my parser that triggered undocumented behavior. Much is covered here on stackoverflow.

ANTLR4: Parser for a Boolean expression

I am trying to parse a boolean expression of the following type
B1=p & A4=p | A6=p &(~A5=c)
I want a tree that I can use to evaluate the above expression. So I tried this in Antlr3 with the example in Antlr parser for and/or logic - how to get expressions between logic operators?
It worked in Antlr3. Now I want to do the same thing for Antlr 4. I came up the grammar below and it compiles. But I am having trouble writing the Java code.
Start of Antlr4 grammar
grammar TestAntlr4;
options {
output = AST;
}
tokens { AND, OR, NOT }
AND : '&';
OR : '|';
NOT : '~';
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| ('+'|'-'|'*'|'/'|'_')
| '='
)+
;
I have written the Java Code (snippet below) for getting a tree for the expression "B1=p & A4=p | A6=p &(~A5=c)". I am expecting & with children B1=p and |. The child | operator will have children A4=p and A6=p &(~A5=c). And so on.
Here is that Java code but I am stuck trying to figure out how I will get the tree. I was able to do this in Antlr 3.
Java Code
String src = "B1=p & A4=p | A6=p &(~A5=c)";
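CharStream stream = new ANTLRInputStream(src);
TestAntlr4Lexer lexer = new TestAntlr4Lexer(stream);
// Feed the lexer's tokens to the generated parser.
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestAntlr4Parser parser = new TestAntlr4Parser(tokens);
parser.setBuildParseTree(true);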
ParserRuleContext tree = parser.parse();
tree.inspect(parser);
if ( tree.children.size() > 0) {
System.out.println(" **************");
test.getChildren(tree, parser);
}
The getChildren method is below, but it does not seem to extract any tokens.
public void getChildren(ParseTree tree, TestAntlr4Parser parser ) {
for (int i=0; i<tree.getChildCount(); i++){
System.out.println(" Child i= " + i);
System.out.println(" expression = <" + tree.toStringTree(parser) + ">");
if ( tree.getChild(i).getChildCount() != 0 ) {
this.getChildren(tree.getChild(i), parser);
}
}
}
Could someone help me figure out how to write the parser in Java?

The output=AST option was removed in ANTLR 4, as well as the ^ and ! operators you used in the grammar. ANTLR 4 produces parse trees instead of ASTs, so the root of the tree produced by a rule is the rule itself. For example, given the following rule:
and : not (AND not)*;
You will end up with an AndContext tree containing NotContext and TerminalNode children for the not and AND references, respectively. To make it easier to work with the trees, AndContext will contain a generated method not() which returns a list of context objects returned by the invocations of the not rule (return type List<? extends NotContext>). It also contains a generated method AND which returns a list of the TerminalNode instances created for each AND token that was matched.
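To see the shape of such a tree, a recursive walk can descend into each child and treat nodes without children as the matched tokens. The sketch below uses a tiny hypothetical Node class in place of ANTLR's ParseTree (it only mimics getChildCount() and getChild(int)), so it runs without the ANTLR runtime; the sample tree is hand-built to resemble what the and rule above might produce for B1=p & A4=p and is not actual parser output.

```java
import java.util.ArrayList;
import java.util.List;

class TreeWalkSketch {
    // Hypothetical stand-in for a parse tree node: rule contexts have
    // children, terminal nodes (tokens) have none.
    static class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... kids) {
            this.label = label;
            this.children = List.of(kids);
        }
        int getChildCount() { return children.size(); }
        Node getChild(int i) { return children.get(i); }
    }

    // Collects the token texts in input order.
    static List<String> leaves(Node node) {
        List<String> out = new ArrayList<>();
        if (node.getChildCount() == 0) {
            out.add(node.label);
        }
        for (int i = 0; i < node.getChildCount(); i++) {
            out.addAll(leaves(node.getChild(i)));
        }
        return out;
    }

    // LISP-style rendering of the subtree rooted at the given node.
    static String toStringTree(Node node) {
        if (node.getChildCount() == 0) {
            return node.label;
        }
        StringBuilder sb = new StringBuilder("(").append(node.label);
        for (int i = 0; i < node.getChildCount(); i++) {
            sb.append(' ').append(toStringTree(node.getChild(i)));
        }
        return sb.append(')').toString();
    }

    // Hand-built tree: an "and" context with two "not" children and one AND token.
    static Node sample() {
        Node left = new Node("not", new Node("atom", new Node("B1=p")));
        Node right = new Node("not", new Node("atom", new Node("A4=p")));
        return new Node("and", left, new Node("&"), right);
    }

    public static void main(String[] args) {
        System.out.println(toStringTree(sample()));
        System.out.println(leaves(sample()));
    }
}
```

Note that the walk renders the current child at each step; printing the root on every iteration repeats the same whole-tree string for each child.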
