N-ary operator parsing - antlr4

I'm trying to match an operator of variable arity (e.g. "1 < 3 < x < 10" yields true, given that 3 < x < 10) within a mathematical expression. (Note that this is unlike how most languages would parse the expression.)
The (simplified) production rule is:
expression: '(' expression ')' # parenthesisExpression
| expression ('*' | '/' | '%') expression # multiplicationExpression
| expression ('+' | '-') expression # additionExpression
| expression (SMALLER_THAN expression)+ # smallerThanExpression
| IDENTIFIER # variableExpression
;
How do we keep the precedence, but still parse the smallerThanExpression as greedily as possible?
For example, "1 < 1+1 < 3" should be parsed as a single parse node "smallerThanExpression" with three child nodes, each of which is an expression. At the moment, the smallerThanExpression is broken up into two smallerThanExpressions (1 < (1+1 < 3)).

To give an answer for "future generations": we fixed it by separating arithmetic expressions from the other expressions. We know that only arithmetic expressions can be used as operands for our variable-arity operators ('true < false' is not a valid expression).
expression:
'!' expression
| arithmetic (SMALLER_THAN arithmetic)+
| arithmetic (GREATER_THAN arithmetic)+
| ....
;
arithmetic:
'(' expression ')'
| arithmetic ('*' | '/' | '%') arithmetic
| arithmetic ('+' | '-') arithmetic
| IDENTIFIER
| ...
;
This forces an expression such as "x < y < z" to be parsed as a single 'expression' node with three 'arithmetic' nodes as children.
(Note that an identifier might refer to a non-integer object; this is checked in the context checker.)
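The same split can be sketched outside ANTLR. The following is a minimal hand-written recursive-descent parser in Python, not the grammar from the answer: the names `tokenize`, `Parser` and `parse` are made up for illustration, only `<` is handled as a variable-arity operator, and a bare arithmetic operand is accepted where the grammar above requires at least one comparison. It shows that once the operands of `<` are restricted to arithmetic, a chain collapses into a single node.

```python
import re

# Hypothetical token set: integers, identifiers, arithmetic operators, '<', parentheses.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/%<()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.i += 1
        return tok

    def expression(self):
        # expression: arithmetic ('<' arithmetic)*  -> one n-ary node per chain
        operands = [self.arithmetic()]
        while self.peek() == "<":
            self.take()
            operands.append(self.arithmetic())
        return operands[0] if len(operands) == 1 else ("<", *operands)

    def arithmetic(self):
        # additive level, lower precedence than '*', '/', '%'
        node = self.product()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.product())
        return node

    def product(self):
        node = self.atom()
        while self.peek() in ("*", "/", "%"):
            node = (self.take(), node, self.atom())
        return node

    def atom(self):
        tok = self.take()
        if tok == "(":
            node = self.expression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not (tok[0].isalnum() or tok[0] == "_"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("1 < 1+1 < 3"))
```

Because `arithmetic` never consumes a `<`, the loop in `expression` is the only place a comparison can be built, so the whole chain ends up as one tuple with three operands.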

Related

ANTLR4: what design pattern to follow?

I have an ANTLR4 rule "expression" that can be either "maths" or "comparison", but "comparison" can contain "maths". Here is the concrete code:
expression
: ID
| maths
| comparison
;
maths
: maths_atom ((PLUS | MINUS) maths_atom) ? // "?" because in fact there is first multiplication then pow and I don't want to force a multiplication to make an addition
;
maths_atom
: NUMBER
| ID
| OPEN_PAR expression CLOSE_PAR
;
comparison
: comp_atom ((EQUALS | NOT_EQUALS) comp_atom) ?
;
comp_atom
: ID
| maths // here is the expression of interest
| OPEN_PAR expression CLOSE_PAR
;
If I give, for instance, 6 as input, the parse tree is fine, because it detects maths. But the ANTLR4 plugin for IntelliJ IDEA marks my expression rule in red: ambiguity. Should I say goodbye to a short parse tree and allow maths only through comparison in expression, so that it is no longer ambiguous?
The problem is that when the parser sees 6, which is a NUMBER, it has two paths of reaching it through your grammar:
expression - maths - maths_atom - NUMBER
or
expression - comparison - comp_atom - maths - maths_atom - NUMBER
This ambiguity triggers the error that you see.
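To see where the alternative derivations come from, the rules of the question can be written down as a small table and searched. This is only an illustrative Python sketch: the `GRAMMAR` dictionary and the `paths` helper are hypothetical, and the table keeps just the alternatives that expand to a single symbol (the optional operator tails and the parenthesised alternatives are left out).

```python
# Each rule maps to the symbols it can expand to on its own.
# Simplification: optional "(op atom)?" tails and parenthesised alternatives are omitted.
GRAMMAR = {
    "expression": ["ID", "maths", "comparison"],
    "maths": ["maths_atom"],
    "maths_atom": ["NUMBER", "ID"],
    "comparison": ["comp_atom"],
    "comp_atom": ["ID", "maths"],
}

def paths(rule, token, prefix=()):
    """Yield every chain of rules leading from `rule` down to `token`."""
    prefix = prefix + (rule,)
    for symbol in GRAMMAR[rule]:
        if symbol == token:
            yield prefix + (token,)
        elif symbol in GRAMMAR:
            yield from paths(symbol, token, prefix)

for chain in paths("expression", "NUMBER"):
    print(" - ".join(chain))
```

Any token that is reachable through more than one chain is a candidate for the ambiguity report; in this table an `ID` is reachable in even more ways than a `NUMBER`.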
You can fix this by flattening your parser grammar as shown in this tutorial:
start
: expr? EOF
;
expr
: expr (PLUS | MINUS) expr # ADDGRP
| expr (EQUALS | NOT_EQUALS) expr # COMPGRP
| OPEN_PAR expr CLOSE_PAR # PARENGRP
| NUMBER # NUM
| ID # IDENT
;

How to write lexer rule that differentiates between -9 and - operator in arithmetic operation of 9 - 9?

I am writing a simple expression parser using ANTLR4 for a calculator application. I have no idea how to write a grammar that differentiates between the number -9 and the arithmetic expression 9 - 9. Any kind of help is much appreciated.
Here is my Grammar expression.g4:
grammar expression;
expression : expression ADDOPER expression
| expression SUBOPER expression
| NUMBER;
/* lexical rules */
ADDOPER :'+';
SUBOPER :'-';
NUMBER : '-'? [1-9]+ [0-9]* ('.' DIGIT+)? | '0'? ('.' DIGIT+) | '0' ;
The problem with the above grammar is that it matches -9 as a NUMBER in the arithmetic expression 9 - 9, which is supposed to be a full arithmetic operation.
But -9 + 9 works fine.
-9 is just an expression. So, simply do this:
expression
: SUBOPER expression
| expression ADDOPER expression
| expression SUBOPER expression
| NUMBER
;
and remove the - from your NUMBER:
NUMBER
: [1-9] [0-9]* ( '.' DIGIT+ )?
| '0'? '.' DIGIT+
| '0'
;
fragment DIGIT : [0-9] ; // assumed definition; DIGIT is not shown in the original grammar
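The effect of moving the minus sign out of the lexer can be reproduced with two regular expressions. This is a rough Python sketch rather than ANTLR: `SIGNED`, `UNSIGNED`, `tokens` and `evaluate` are hypothetical names, the patterns are simplified versions of the NUMBER rule, and the evaluator only knows `+`, `-` and unary minus.

```python
import re

# Lexer variant that folds an optional minus into the number (the problematic one).
SIGNED = re.compile(r"-?\d+(?:\.\d+)?|[-+]")
# Lexer variant with unsigned numbers; '-' is always its own token.
UNSIGNED = re.compile(r"\d+(?:\.\d+)?|[-+]")

def tokens(pattern, text):
    return pattern.findall(text)

def evaluate(toks):
    """Evaluate NUMBER (('+' | '-') NUMBER)* where each operand may carry unary minus."""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(toks):
            raise SyntaxError("expected an operand")
        if toks[pos] == "-":          # unary minus is handled by the parser
            pos += 1
            return -operand()
        value = float(toks[pos])
        pos += 1
        return value

    total = operand()
    while pos < len(toks):
        op = toks[pos]
        pos += 1
        if op not in ("+", "-"):
            raise SyntaxError(f"expected an operator, got {op!r}")
        rhs = operand()
        total = total + rhs if op == "+" else total - rhs
    return total

print(tokens(SIGNED, "9-9"))                 # the minus is swallowed by the number
print(tokens(UNSIGNED, "9-9"))               # the minus stays a separate token
print(evaluate(tokens(UNSIGNED, "9-9")))
print(evaluate(tokens(UNSIGNED, "-9+9")))
```

With the signed pattern the input `9-9` becomes two numbers with no operator between them, so the parser has nothing to work with; with the unsigned pattern the decision between subtraction and negation is made by position in the parser.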

How to do Priority of Operations (+ * - /) in my grammars?

I defined my own grammar using ANTLR 4 and I want to build the tree correctly according to the priority of operations (+ * - /).
I found a sample that handles the priority of operations (* +), and it works fine.
I tried to edit it to add the priority of operations (- /), but I failed :(
The grammar for the priority of operations (+ *) is:
println:PRINTLN expression SEMICOLON {System.out.println($expression.value);};
expression returns [Object value]:
t1=factor {$value=(int)$t1.value;}
(PLUS t2=factor{$value=(int)$value+(int)$t2.value;})*;
factor returns [Object value]: t1=term {$value=(int)$t1.value;}
(MULT t2=term{$value=(int)$value*(int)$t2.value;})*;
term returns [Object value]:
NUMBER {$value=Integer.parseInt($NUMBER.text);}
| ID {$value=symbolTable.get($value=$ID.text);}
| PAR_OPEN expression {$value=$expression.value;} PAR_CLOSE
;
MULT :'*';
PLUS :'+';
MINUS:'-';
DIV:'/' ;
How can I add the priority of operations (- /) to it?
In ANTLR3 (and ANTLR4) * and / can be given a higher precedence than + and - like this:
println
: PRINTLN expression SEMICOLON
;
expression
: factor ( PLUS factor
| MINUS factor
)*
;
factor
: term ( MULT term
| DIV term
)*
;
term
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
;
But in ANTLR4, this will also work:
println
: PRINTLN expression SEMICOLON
;
expression
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
| expression ( MULT | DIV ) expression // higher precedence
| expression ( PLUS | MINUS ) expression // lower precedence
;
You normally solve this by defining expression, term, and factor production rules. Here's a grammar (specified in EBNF) that implements unary + and unary -, along with the 4 binary arithmetic operators, plus parentheses:
start ::= expression
expression ::= term (('+' term) | ('-' term))*
term ::= factor (('*' factor) | ('/' factor))*
factor ::= (number | group | '-' factor | '+' factor)
group ::= '(' expression ')'
where number is a numeric literal.
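The EBNF above maps almost one-to-one onto a recursive-descent evaluator, with one method per production. The following Python sketch is an illustration under assumptions, not part of the answer: the `Evaluator` class, `tokenize` and `evaluate` are hypothetical names, and numbers are read as floats.

```python
import re

def tokenize(text):
    # Numbers, the four binary operators, and parentheses; whitespace is skipped.
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)

class Evaluator:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):
        # expression ::= term (('+' term) | ('-' term))*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term ::= factor (('*' factor) | ('/' factor))*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # factor ::= number | group | '-' factor | '+' factor
        tok = self.next()
        if tok == "-":
            return -self.factor()
        if tok == "+":
            return self.factor()
        if tok == "(":
            value = self.expression()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return float(tok)

def evaluate(text):
    ev = Evaluator(text)
    value = ev.expression()
    if ev.peek() is not None:
        raise SyntaxError(f"unexpected token {ev.peek()!r}")
    return value

print(evaluate("1+2*3"))
```

Precedence falls out of the call structure: `expression` can only see a `*` after `term` has already consumed it, and the `while` loops make every binary operator left-associative.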

How can non-associative operators like "<" be specified in ANTLR4 grammars?

In a rule expr : expr '<' expr | ...;
the ANTLR parser will accept expressions like 1 < 2 < 3 (and construct left-associative trees corresponding to the bracketing (1 < 2) < 3).
You can tell ANTLR to treat operators as right associative, e.g. (since ANTLR 4.2 the option is placed at the start of the alternative)
expr : <assoc=right> expr '<' expr | ...;
to yield parse trees 1 < (2 < 3).
However, in many languages, relational operators are non-associative, i.e., an expression 1 < 2 < 3 is forbidden.
This can be specified in YACC and its derivatives.
Can it also be specified in ANTLR?
E.g., as expr : <assoc=no> expr '<' expr | ...;
I was unable to find something in the ANTLR4-book so far.
How about the following approach. Basically the "result" of a < b has a type not compatible for another application of operator < or >:
expression
: boolExpression
| nonBoolExpression
;
boolExpression
: nonBoolExpression '<' nonBoolExpression
| nonBoolExpression '>' nonBoolExpression
| ...
;
nonBoolExpression
: nonBoolExpression '*' nonBoolExpression
| nonBoolExpression '+' nonBoolExpression
| ...
;
Although personally I'd go with Darien's suggestion and detect the error after parsing instead.
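Detecting the error after parsing can be as small as one tree walk. The sketch below is an assumption-laden illustration in Python: it presumes the parse tree has already been converted to nested tuples of the form `(operator, left, right)` with strings as leaves, and `RELATIONAL` and `check_non_associative` are hypothetical names.

```python
RELATIONAL = {"<", ">"}

def check_non_associative(node):
    """Raise SyntaxError if a relational node is a direct operand of another one."""
    if not isinstance(node, tuple):
        return  # leaf: identifier or literal
    op, *operands = node
    for child in operands:
        if (op in RELATIONAL
                and isinstance(child, tuple)
                and child[0] in RELATIONAL):
            raise SyntaxError(f"operator {op!r} is non-associative")
        check_non_associative(child)

# 1 < 2 < 3 as a left-associative parser would build it
chained = ("<", ("<", "1", "2"), "3")
try:
    check_non_associative(chained)
except SyntaxError as err:
    print("rejected:", err)
```

Since parentheses are not represented in this tuple form, an explicitly bracketed `(1 < 2) < 3` is rejected as well; whether that is desired depends on the language being implemented.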

What characters are permitted for Haskell operators?

Is there a complete list of allowed characters somewhere, or a rule that determines what can be used in an identifier vs an operator?
From the Haskell report, this is the syntax for allowed symbols:
a | b means a or b and
a<b> means a except b
special -> ( | ) | , | ; | [ | ] | ` | { | }
symbol -> ascSymbol | uniSymbol<special | _ | : | " | '>
ascSymbol -> ! | # | $ | % | & | * | + | . | / | < | = | > | ? | @
| \ | ^ | | | - | ~
uniSymbol -> any Unicode symbol or punctuation
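So, symbols are ASCII symbols or Unicode symbols, except those in special | _ | : | " | ', which are reserved.
This means the following characters can't be used as symbol characters: ( ) , ; [ ] ` { } _ : " ' (the | between them in the grammar is the alternation marker, not a reserved character; the colon is handled separately by the operator rules below).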
A few paragraphs below, the report gives the complete definition for Haskell operators:
varsym -> ( symbol {symbol | :})<reservedop | dashes>
consym -> (: {symbol | :})<reservedop>
reservedop -> .. | : | :: | = | \ | | | <- | -> | @ | ~ | =>
Operator symbols are formed from one or more symbol characters, as
defined above, and are lexically distinguished into two namespaces
(Section 1.4):
An operator symbol starting with a colon is a constructor.
An operator symbol starting with any other character is an ordinary identifier.
Notice that a colon by itself, ":", is reserved solely for use as the
Haskell list constructor; this makes its treatment uniform with other
parts of list syntax, such as "[]" and "[a,b]".
Other than the special syntax for prefix negation, all operators are
infix, although each infix operator can be used in a section to yield
partially applied operators (see Section 3.5). All of the standard
infix operators are just predefined symbols and may be rebound.
From the Haskell 2010 Report §2.4:
Operator symbols are formed from one or more symbol characters...
§2.2 defines symbol characters as being any of !#$%&*+./<=>?@\^|-~: or "any [non-ascii] Unicode symbol or punctuation".
NOTE: User-defined operators cannot begin with a : as, quoting the language report, "An operator symbol starting with a colon is a constructor."
What I was looking for was the complete list of characters. Based on the other answers, the full list is:
Unicode Punctuation:
http://www.fileformat.info/info/unicode/category/Pc/list.htm
http://www.fileformat.info/info/unicode/category/Pd/list.htm
http://www.fileformat.info/info/unicode/category/Pe/list.htm
http://www.fileformat.info/info/unicode/category/Pf/list.htm
http://www.fileformat.info/info/unicode/category/Pi/list.htm
http://www.fileformat.info/info/unicode/category/Po/list.htm
http://www.fileformat.info/info/unicode/category/Ps/list.htm
Unicode Symbols:
http://www.fileformat.info/info/unicode/category/Sc/list.htm
http://www.fileformat.info/info/unicode/category/Sk/list.htm
http://www.fileformat.info/info/unicode/category/Sm/list.htm
http://www.fileformat.info/info/unicode/category/So/list.htm
But excluding the following characters with special meaning in Haskell:
(),;[]`{}_:"'
A : may appear anywhere in an operator, but an operator that starts with : denotes a constructor, and a : on its own is reserved.
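The rules quoted from the report can be turned into a small checker. This is a Python sketch under assumptions, not an official tool: `is_symbol` and `classify` are hypothetical names, "Unicode symbol or punctuation" is approximated by the general categories starting with P or S as reported by Python's `unicodedata`, and compiler extensions that give some Unicode characters a special meaning are ignored.

```python
import unicodedata

SPECIAL = set("(),;[]`{}")
EXCLUDED = SPECIAL | set("_:\"'")
RESERVED_OPS = {"..", ":", "::", "=", "\\", "|", "<-", "->", "@", "~", "=>"}

def is_symbol(ch):
    # symbol -> ascSymbol | uniSymbol<special | _ | : | " | '>
    # Every ascSymbol character is itself in a P* or S* category.
    return unicodedata.category(ch)[0] in "PS" and ch not in EXCLUDED

def classify(op):
    """Return 'varsym', 'consym', or None if `op` is not a valid operator symbol."""
    if not op or op in RESERVED_OPS:
        return None
    if not all(is_symbol(c) or c == ":" for c in op):
        return None
    if op[0] == ":":
        return "consym"          # operators starting with a colon are constructors
    if len(op) >= 2 and set(op) == {"-"}:
        return None              # two or more dashes start a comment
    return "varsym"

for candidate in ["<$>", ":+:", "->", "--", "||", "a+"]:
    print(repr(candidate), classify(candidate))
```

The colon is tested separately from `is_symbol` because the report excludes it from `symbol` and then allows it back in through the `varsym` and `consym` productions.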

Resources