What characters are permitted for Haskell operators? - haskell

Is there a complete list of allowed characters somewhere, or a rule that determines what can be used in an identifier vs an operator?

From the Haskell report, this is the syntax for allowed symbols:
a | b means a or b and
a<b> means a except b
special -> ( | ) | , | ; | [ | ] | `| { | }
symbol -> ascSymbol | uniSymbol<special | _ | : | " | '>
ascSymbol -> ! | # | $ | % | & | * | + | . | / | < | = | > | ? | #
\ | ^ | | | - | ~
uniSymbol -> any Unicode symbol or punctuation
So, symbols are ASCII symbols or Unicode symbols except from those in special | _ | : | " | ', which are reserved.
Meaning the following characters can't be used: ( ) | , ; [ ] ` { } _ : " '
A few paragraphs below, the report gives the complete definition for Haskell operators:
varsym -> ( symbol {symbol | :})<reservedop | dashes>
consym -> (: {symbol | :})<reservedop>
reservedop -> .. | : | :: | = | \ | | | <- | -> | # | ~ | =>
Operator symbols are formed from one or more symbol characters, as
defined above, and are lexically distinguished into two namespaces
(Section 1.4):
An operator symbol starting with a colon is a constructor.
An operator symbol starting with any other character is an ordinary identifier.
Notice that a colon by itself, ":", is reserved solely for use as the
Haskell list constructor; this makes its treatment uniform with other
parts of list syntax, such as "[]" and "[a,b]".
Other than the special syntax for prefix negation, all operators are
infix, although each infix operator can be used in a section to yield
partially applied operators (see Section 3.5). All of the standard
infix operators are just predefined symbols and may be rebound.

From the Haskell 2010 Report §2.4:
Operator symbols are formed from one or more symbol characters...
§2.2 defines symbol characters as being any of !#$%&*+./<=>?#\^|-~: or "any [non-ascii] Unicode symbol or punctuation".
NOTE: User-defined operators cannot begin with a : as, quoting the language report, "An operator symbol starting with a colon is a constructor."

What I was looking for was the complete list of characters. Based on the other answers, the full list is;
Unicode Punctuation:
http://www.fileformat.info/info/unicode/category/Pc/list.htm
http://www.fileformat.info/info/unicode/category/Pd/list.htm
http://www.fileformat.info/info/unicode/category/Pe/list.htm
http://www.fileformat.info/info/unicode/category/Pf/list.htm
http://www.fileformat.info/info/unicode/category/Pi/list.htm
http://www.fileformat.info/info/unicode/category/Po/list.htm
http://www.fileformat.info/info/unicode/category/Ps/list.htm
Unicode Symbols:
http://www.fileformat.info/info/unicode/category/Sc/list.htm
http://www.fileformat.info/info/unicode/category/Sk/list.htm
http://www.fileformat.info/info/unicode/category/Sm/list.htm
http://www.fileformat.info/info/unicode/category/So/list.htm
But excluding the following characters with special meaning in Haskell:
(),;[]`{}_:"'
A : is only permitted as the first character of the operator, and denotes a constructor (see An operator symbol starting with a colon is a constructor).

Related

ANTLR4: what design pattern to follow?

I have a ANTR4 rule "expression" that can be either "maths" or "comparison", but "comparison" can contain "maths". Here a concrete code:
expression
: ID
| maths
| comparison
;
maths
: maths_atom ((PLUS | MINUS) maths_atom) ? // "?" because in fact there is first multiplication then pow and I don't want to force a multiplication to make an addition
;
maths_atom
: NUMBER
| ID
| OPEN_PAR expression CLOSE_PAR
;
comparison
: comp_atom ((EQUALS | NOT_EQUALS) comp_atom) ?
;
comp_atom
: ID
| maths // here is the expression of interest
| OPEN_PAR expression CLOSE_PAR
;
If I give, for instance, 6 as input, this is fine for the parse tree, because it detects maths. But in the ANTLR4 plugin for Intellij Idea, it mark my expression rule as red - ambiguity. Should I say goodbye to a short parse tree and allow only maths trough comparison in expression so it is not so ambiguous anymore ?
The problem is that when the parser sees 6, which is a NUMBER, it has two paths of reaching it through your grammar:
expression - maths - maths_atom - NUMBER
or
expression - comparison - comp_atom - NUMBER
This ambiguity triggers the error that you see.
You can fix this by flattening your parser grammar as shown in this tutorial:
start
: expr | <EOF>
;
expr
: expr (PLUS | MINUS) expr # ADDGRP
| expr (EQUALS | NOT_EQUALS) expr # COMPGRP
| OPEN_PAR expression CLOSE_PAR # PARENGRP
| NUMBER # NUM
| ID # IDENT
;

How to do Priority of Operations (+ * - /) in my grammars?

I define my own grammars using antlr 4 and I want to build tree true According to Priority of Operations (+ * - /) ....
I find sample on do Priority of Operations (* +) it work fine ...
I try to edit it to add the Priority of Operations (- /) but I failed :(
the grammars for Priority of Operations (+ *) is :
println:PRINTLN expression SEMICOLON {System.out.println($expression.value);};
expression returns [Object value]:
t1=factor {$value=(int)$t1.value;}
(PLUS t2=factor{$value=(int)$value+(int)$t2.value;})*;
factor returns [Object value]: t1=term {$value=(int)$t1.value;}
(MULT t2=term{$value=(int)$value*(int)$t2.value;})*;
term returns [Object value]:
NUMBER {$value=Integer.parseInt($NUMBER.text);}
| ID {$value=symbolTable.get($value=$ID.text);}
| PAR_OPEN expression {$value=$expression.value;} PAR_CLOSE
;
MULT :'*';
PLUS :'+';
MINUS:'-';
DIV:'/' ;
How I can add to them the Priority of Operations (- /) ?
In ANTLR3 (and ANTLR4) * and / can be given a higher precedence than + and - like this:
println
: PRINTLN expression SEMICOLON
;
expression
: factor ( PLUS factor
| MINUS factor
)*
;
factor
: term ( MULT term
| DIV term
)*
;
term
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
;
But in ANTLR4, this will also work:
println
: PRINTLN expression SEMICOLON
;
expression
: NUMBER
| ID
| PAR_OPEN expression PAR_CLOSE
| expression ( MULT | DIV ) expression // higher precedence
| expression ( PLUS | MINUS ) expression // lower precedence
;
You normally solve this by defining expression, term, and factor production rules. Here's a grammar (specified in EBNF) that implements unary + and unary -, along with the 4 binary arithmetic operators, plus parentheses:
start ::= expression
expression ::= term (('+' term) | ('-' term))*
term ::= factor (('*' factor) | ('/' factor))*
factor :: = (number | group | '-' factor | '+' factor)
group ::= '(' expression ')'
where number is a numeric literal.

ANTLR: VERY slow parsing

I have successfully split my expressions into arithmetic and boolean expressions like this:
/* entry point */
parse: formula EOF;
formula : (expr|boolExpr);
/* boolean expressions : take math expr and use boolean operators on them */
boolExpr
: bool
| l=expr operator=(GT|LT|GEQ|LEQ) r=expr
| l=boolExpr operator=(OR|AND) r=boolExpr
| l=expr (not='!')? EQUALS r=expr
| l=expr BETWEEN low=expr AND high=expr
| l=expr IS (NOT)? NULL
| l=atom LIKE regexp=string
| l=atom ('IN'|'in') '(' string (',' string)* ')'
| '(' boolExpr ')'
;
/* arithmetic expressions */
expr
: atom
| (PLUS|MINUS) expr
| l=expr operator=(MULT|DIV) r=expr
| l=expr operator=(PLUS|MINUS) r=expr
| function=IDENTIFIER '(' (expr ( ',' expr )* ) ? ')'
| '(' expr ')'
;
atom
: number
| variable
| string
;
But now I have HUGE performance problems. Some formulas I try to parse are utterly slow, to the point that it has become unbearable: more than an hour (I stopped at that point) to parse this:
-4.77+[V1]*-0.0071+[V1]*[V1]*0+[V2]*-0.0194+[V2]*[V2]*0+[V3]*-0.00447932+[V3]*[V3]*-0.0017+[V4]*-0.00003298+[V4]*[V4]*0.0017+[V5]*-0.0035+[V5]*[V5]*0+[V6]*-4.19793004+[V6]*[V6]*1.5962+[V7]*12.51966636+[V7]*[V7]*-5.7058+[V8]*-19.06596752+[V8]*[V8]*28.6281+[V9]*9.47136506+[V9]*[V9]*-33.0993+[V10]*0.001+[V10]*[V10]*0+[V11]*-0.15397774+[V11]*[V11]*-0.0021+[V12]*-0.027+[V12]*[V12]*0+[V13]*-2.02963068+[V13]*[V13]*0.1683+[V14]*24.6268688+[V14]*[V14]*-5.1685+[V15]*-6.17590512+[V15]*[V15]*1.2936+[V16]*2.03846688+[V16]*[V16]*-0.1427+[V17]*9.02302288+[V17]*[V17]*-1.8223+[V18]*1.7471106+[V18]*[V18]*-0.1255+[V19]*-30.00770912+[V19]*[V19]*6.7738
Do you have any idea on what the problem is?
The parsing stops when the parser enters the formula grammar rule.
edit Original problem here:
My grammar allows this:
// ( 1 LESS_EQUALS 2 )
1 <= 2
But the way I expressed it in my G4 file makes it also accept this:
// ( ( 1 LESS_EQUALS 2 ) LESS_EQUALS 3 )
1 <= 2 <= 3
Which I don't want.
My grammar contains this:
expr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
[...]
How can I tell Antlr that this is not acceptable?
Just split root into two. Either rename root 'expr' into 'rootexpr', or vice versa.
rootExpr
: atom # atomArithmeticExpr
| (PLUS|MINUS) expr # plusMinusExpr
| l=expr operator=('*'|'/') r=expr # multdivArithmeticExpr
| l=expr operator=('+'|'-') r=expr # addsubtArithmeticExpr
| l=expr operator=('>'|'<'|'>='|'<=') r=expr # comparisonExpr
EDIT: You cannot have cyclic reference => expr node in expr rule.

N-ary operator parsing

I'm trying to match an operator of variable arity (e.g. "1 < 3 < x < 10" yields true, given that 3 < x < 10) within a mathematical expression. Note that this is unlike most languages would parse the expression)
The (simplified) production rule is:
expression: '(' expression ')' # parenthesisExpression
| expression ('*' | '/' | '%') expression # multiplicationExpression
| expression ('+' | '-') expression # additionExpression
| expression (SMALLER_THAN expression)+ # smallerThanExpression
| IDENTIFIER # variableExpression
;
How do we keep the precedence, but still parse the smallerThanExpression as greedy as possible?
For example; "1 < 1+1 < 3" should be parsed as a single parse node "smallerThanExpression" with three child nodes, each of which is an expression. At this moment, the smallerThanExpression is broken up in two smallerThanExpressions (1 < (1+1 < 3)).
To give an answer for "future generations": we fixed it by separating arithmetic expressions from the other expressions. We know that only arithmetic expressions can be used as operands for our variable-arity operators ('true < false' is not a valid expression).
expression:
'!' expression
| arithmetic (SMALLER_THAN arithmetic)+
| arithmetic (GREATER_THAN arithmetic)+
| ....
;
arithmetic:
'(' expression ')'
| expression ('*' | '/' | '%') expression
| expression ('+' | '-') expression
| IDENTIFIER
| ...
;
This enforces an expression such as "x < y < z" to be parsed as a single 'expression' node with three 'arithmetic' nodes as children.
(Note that an identifier might refer to a non-integer object; this is checked in the context checker)

Struggling to parse array notation

I have a very simple grammar to parse statements.
Here are examples of the type of statements that need be parsed:
a.b.c
a.b.c == "88"
The issue I am having is that array notation is not matching. For example, things that are not working:
a.b[0].c
a[3][4]
I hope someone can point out what I am doing wrong here. (I am testing in ANTLRWorks)
Here is the grammar (generationUnit is my entry point):
grammar RatBinding;
generationUnit: testStatement | statement;
arrayAccesor : identifier arrayNotation+;
arrayNotation: '[' Number ']';
testStatement:
(statement | string | Number | Bool )
(greaterThanAndEqual
| lessThanOrEqual
| greaterThan
| lessThan | notEquals | equals)
(statement | string | Number | Bool )
;
part: identifier | arrayAccesor;
statement: part ('.' part )*;
string: ('"' identifier '"') | ('\'' identifier '\'');
greaterThanAndEqual: '>=';
lessThanOrEqual: '<=';
greaterThan: '>';
lessThan: '<';
notEquals : '!=';
equals: '==';
identifier: Letter (Letter|Digit)*;
Bool : 'true' | 'false';
ArrayLeft: '\u005B';
ArrayRight: '\u005D';
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f '|
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
Digit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \r\t\u000C\n]+ -> channel(HIDDEN)
;
You referenced the non-existent rule Number in the arrayNotation parser rule.
A Digit rule does exist in the lexer, but it will only match a single-digit number. For example, 1 is a Digit, but 10 is two separate Digit tokens so a[10] won't match the arrayAccesor rule. You probably want to resolve this in two parts:
Create a Number token consisting of one or more digits.
Number
: Digit+
;
Mark Digit as a fragment rule to indicate that it doesn't form tokens on its own, but is merely intended to be referenced from other lexer rules.
fragment // prevents a Digit token from being created on its own
Digit
: ...
You will not need to change arrayNotation because it already references the Number rule you created here.
Bah, waste of space. I Used Number instead of Digit in my array declaration.

Resources