Parse mathematical expressions with pyparsing - python-3.x

I'm trying to parse a mathematical expression using pyparsing. I know I could just copy the example calculator from the pyparsing site, but I want to understand it so I can add to it later. I tried to understand the example and couldn't, so I tried my best and got to this:
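import pyparsing as pp

# `number` was not shown in the post; an integer token converted to int is assumed here
number = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))
symbol = (
    pp.Literal("^") |
    pp.Literal("*") |
    pp.Literal("/") |
    pp.Literal("+") |
    pp.Literal("-")
)
# Forward placeholder so that `atom` can refer to `operation` before it is defined
operation = pp.Forward()
atom = pp.Group(
    pp.Literal("(").suppress() + operation + pp.Literal(")").suppress()
) | number
operation << (pp.Group(number + symbol + number + pp.ZeroOrMore(symbol + atom)) | atom)
expression = pp.OneOrMore(operation)
print(expression.parseString("9-1+27+(3-5)+9"))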
That prints:
[[9, '-', 1, '+', 27, '+', [[3, '-', 5]], '+', 9]]
It works, kinda. I want precedence and everything sorted into Groups, but after trying a lot, I couldn't find a way to do it. More or less like this:
[[[[9, '-', 1], '+', 27], '+', [3, '-', 5]], '+', 9]
I want to keep it AST-looking, because I would like to generate code from it.
I did see operatorPrecedence, which looks similar to Forward, but I don't think I understand how it works either.
EDIT:
I tried operatorPrecedence in more depth and got this:
expression = pp.operatorPrecedence(number, [
(pp.Literal("^"), 1, pp.opAssoc.RIGHT),
(pp.Literal("*"), 2, pp.opAssoc.LEFT),
(pp.Literal("/"), 2, pp.opAssoc.LEFT),
(pp.Literal("+"), 2, pp.opAssoc.LEFT),
(pp.Literal("-"), 2, pp.opAssoc.LEFT)
])
This doesn't handle parentheses (I don't know if I will have to postprocess the results), and I need to handle them.

The actual name for this parsing problem is "infix notation" (and in recent versions of pyparsing, I am renaming operatorPrecedence to infixNotation). To see the typical implementation of infix notation parsing, look at the fourFn.py example on the pyparsing wiki. There you will see an implementation of this simplified BNF to implement 4-function arithmetic, with precedence of operations:
operand :: integer or real number
factor :: operand | '(' expr ')'
term :: factor ( ('*' | '/') factor )*
expr :: term ( ('+' | '-') term )*
So an expression is one or more terms separated by addition or subtraction operations.
A term is one or more factors separated by multiplication or division operations.
A factor is either a lowest-level operand (in this case, just integers or reals), OR an expr enclosed in ()'s.
Note that this is a recursive parser, since factor is used indirectly in the definition of expr, but expr is also used to define factor.
In pyparsing, this looks roughly like this (assuming that integer and real have already been defined):
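from pyparsing import Forward, Group, Suppress, ZeroOrMore, oneOf

LPAR, RPAR = map(Suppress, '()')
expr = Forward()  # placeholder, since factor refers to expr before expr is defined
operand = real | integer  # real and integer are assumed to be defined already
factor = operand | Group(LPAR + expr + RPAR)
term = factor + ZeroOrMore(oneOf('* /') + factor)
expr <<= term + ZeroOrMore(oneOf('+ -') + term)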
Now using expr, you can parse any of these:
3
3+2
3+2*4
(3+2)*4
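To make the recursion in this BNF concrete, here is a minimal hand-written recursive-descent sketch in plain Python. It is not pyparsing code, and the names tokenize, parse, parse_expr, parse_term and parse_factor are illustrative only. It assumes unsigned integer operands, and it nests every binary operation into its own list, which gives the AST-like shape asked for in the question.

```python
import re

# One token is either a run of digits or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into integer and single-character tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens, pos):
    # expr :: term ( ('+' | '-') term )*
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        left = [left, op, right]  # fold to the left: ((a + b) + c)
    return left, pos


def parse_term(tokens, pos):
    # term :: factor ( ('*' | '/') factor )*
    left, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        left = [left, op, right]
    return left, pos


def parse_factor(tokens, pos):
    # factor :: operand | '(' expr ')'
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1)  # the recursive step
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return inner, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise ValueError(f"unexpected token {tok!r}")


def parse(text):
    """Parse a whole expression and reject trailing input."""
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("3+2*4"))           # [3, '+', [2, '*', 4]]
print(parse("9-1+27+(3-5)+9"))  # [[[[9, '-', 1], '+', 27], '+', [3, '-', 5]], '+', 9]
```

Each function corresponds to one BNF rule, so precedence falls out of which rule calls which: parse_expr only ever sees complete terms, and parse_term only ever sees complete factors.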
The infixNotation pyparsing helper method takes care of all the recursive definitions and groupings, and lets you define this as:
expr = infixNotation(operand,
[
(oneOf('* /'), 2, opAssoc.LEFT),
(oneOf('+ -'), 2, opAssoc.LEFT),
])
But this obscures all the underlying theory, so if you are trying to understand how this is implemented, look at the raw solution in fourFn.py.
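For comparison, a complete runnable version of the infixNotation approach might look like the sketch below. The integer definition is an assumption (operand is left undefined above), and nest_left is a hypothetical helper, not part of pyparsing. Two things are worth noticing: parentheses are handled by default, and operators at the same precedence level come back in one flat group rather than as nested pairs, so a small post-processing step is needed if strictly binary nodes are wanted.

```python
import pyparsing as pp

# Assumed operand definition: unsigned integers converted to int.
integer = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))

expr = pp.infixNotation(
    integer,
    [
        (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
        (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
    ],
)


def parse(text):
    """Parse the whole string and return plain nested lists."""
    return expr.parseString(text, parseAll=True).asList()


def nest_left(node):
    """Fold a flat [a, op, b, op, c] group into [[a, op, b], op, c].

    Hypothetical helper; only appropriate for left-associative operators.
    """
    if not isinstance(node, list):
        return node
    items = [nest_left(child) for child in node]
    result = items[0]
    for i in range(1, len(items), 2):
        result = [result, items[i], items[i + 1]]
    return result


flat = parse("9-1+27+(3-5)+9")
print(flat)             # [[9, '-', 1, '+', 27, '+', [3, '-', 5], '+', 9]]
print(nest_left(flat))  # [[[[9, '-', 1], '+', 27], '+', [3, '-', 5]], '+', 9]
```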
[EDIT - 18 Dec 2022] For those looking for a pre-defined solution, I've packaged infixNotation up into its own pip-installable package called plusminus. plusminus defines a BaseArithmeticParser class for creating a ready-to-run parser and evaluator that supports these operators:
** ÷ >= ∈ in ?:
* + == ∉ not |absolute-value|
// - != ∩ and
/ < ≠ ∪ ∧
mod > ≤ & or
× <= ≥ | ∨
And these functions:
abs ceil max
round floor str
trunc min bool
The BaseArithmeticParser class allows you to define additional operators and functions for your own domain-specific expressions, and the examples show how to define parsers with custom functions and operators for dice rolling and retail price discounts, among others.

Related

Parsing custom binary function in sympy

I have a string that represents a function, for example
1 + x - 2^x
I am aware I can use sympy to parse this expression with parse_expr.
Suppose now I have another expression with a binary operator not currently in the grammar, for example
1 + x my_custom_operator 2^x
Which should yield
1 + my_custom_operator(x, 2^x)
where my_custom_operator(a, b) = min(a, b)
Is there a way to parse this in sympy, or by some alternative means? The true expression is far more complicated, so regex is not an option because of things like nested parentheses.

Proving the inexpressibility of a function in a given language

I'm currently reading through John C. Mitchell's Foundations for Programming Languages. Exercise 2.2.3, in essence, asks the reader to show that the (natural-number) exponentiation function cannot be implicitly defined via an expression in a small language. The language consists of natural numbers and addition on said numbers (as well as boolean values, a natural-number equality predicate, & ternary conditionals). There are no loops, recursive constructs, or fixed-point combinators. Here is the precise syntax:
<bool_exp> ::= <bool_var> | true | false | Eq? <nat_exp> <nat_exp> |
if <bool_exp> then <bool_exp> else <bool_exp>
<nat_exp> ::= <nat_var> | 0 | 1 | 2 | … | <nat_exp> + <nat_exp> |
if <bool_exp> then <nat_exp> else <nat_exp>
Again, the object is to show that the exponentiation function n^m cannot be implicitly defined via an expression in this language.
Intuitively, I'm willing to accept this. If we think of exponentiation as repeated multiplication, it seems like we "just can't" express that with this language. But how does one formally prove this? More broadly, how do you prove that an expression from one language cannot be expressed in another?
Here's a simple way to think about it: the expression has a fixed, finite size, and the only arithmetic operation it can do to produce numbers not written as literals or provided as the values of variables is addition. So the largest number it can possibly produce is limited by the number of additions plus 1, multiplied by the largest number involved in the expression.
So, given a proposed expression, let k be the number of additions in it, let c be the largest literal (or 1 if there is none) and choose m and n such that n^m > (k+1)*max(m,n,c). Then the result of the expression for that input cannot be the correct one.
Note that this proof relies on the language allowing arbitrarily large numbers, as noted in the other answer.
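The bound used in this argument can be checked numerically on a concrete expression. The sketch below is only an illustration, not a proof: it encodes a fragment of the language as nested tuples (the tuple tags and helper names are made up here, and boolean variables and the true/false literals are left out for brevity), counts the additions k, finds the largest literal c, and confirms that the value never exceeds (k+1)*max(m,n,c) on a grid of inputs.

```python
def evaluate(e, env):
    """Evaluate a tuple-encoded expression under a variable environment."""
    tag = e[0]
    if tag == "lit":
        return e[1]
    if tag == "var":
        return env[e[1]]
    if tag == "add":
        return evaluate(e[1], env) + evaluate(e[2], env)
    if tag == "eq":  # Eq? <nat_exp> <nat_exp>
        return evaluate(e[1], env) == evaluate(e[2], env)
    if tag == "if":  # if <bool_exp> then <exp> else <exp>
        return evaluate(e[2], env) if evaluate(e[1], env) else evaluate(e[3], env)
    raise ValueError(f"unknown node {tag!r}")


def count_additions(e):
    """k in the argument: the number of '+' nodes anywhere in the expression."""
    own = 1 if e[0] == "add" else 0
    return own + sum(count_additions(c) for c in e[1:] if isinstance(c, tuple))


def largest_literal(e):
    """c in the argument: the largest literal, or 1 if there is none."""
    if e[0] == "lit":
        return e[1]
    return max((largest_literal(c) for c in e[1:] if isinstance(c, tuple)), default=1)


# if Eq? m 2 then (n + n) + 7 else n + m
candidate = (
    "if",
    ("eq", ("var", "m"), ("lit", 2)),
    ("add", ("add", ("var", "n"), ("var", "n")), ("lit", 7)),
    ("add", ("var", "n"), ("var", "m")),
)

k = count_additions(candidate)   # 3
c = largest_literal(candidate)   # 7

# The bound holds on every input in this grid.
bound_holds = all(
    evaluate(candidate, {"n": n, "m": m}) <= (k + 1) * max(m, n, c)
    for n in range(30)
    for m in range(30)
)
print(k, c, bound_holds)  # 3 7 True

# n = 2, m = 6 gives n^m = 64, which exceeds the bound (3 + 1) * 7 = 28,
# so this candidate cannot be computing exponentiation at that input.
print(2 ** 6 > (k + 1) * max(6, 2, c))  # True
```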
No solution, only hints:
First, let me point out that if there are finitely many numbers in the language, then exponentiation is definable as an expression. (You'd have to define what it should produce when the true result is unrepresentable, e.g. wraparound.) Think about why.
Hint: Imagine that there are only two numbers, 0 and 1. Can you write an expression involving m and n whose result is n^m? What if there were three numbers: 0, 1, and 2? What if there were four? And so on...
Why don't any of those solutions work? Let's index them and call the solution for {0,1} partial_solution_1, the solution for {0,1,2} partial_solution_2, and so on. Why isn't partial_solution_n a solution for the set of all natural numbers?
Maybe you can generalize that somehow with some metric f : Expression -> Nat so that every expression expr with f(expr) < n is wrong somehow...
You may find some inspiration from the strategy of Euclid's proof that there are infinitely many primes.

pyparsing: nested expression with simple arithmetics

I am using pyparsing to parse a nested expression which is formed by delimited lists but which includes some basic arithmetic (just multiplication, for instance).
A sample expression could look like this:
(A, B, 2 * C, 3 * ( D, E, 2 * F, 3 *(G, H)), I )
The output should unfold the arithmetic:
( A, B, C, C, D, E, F, F, G, H, G, H, G, H, D, E, F, F, G, H, G, H, G, H, D, E, F, F, G, H, G, H, G, H, I )
Could somebody give me hint how to approach the problem?
I started like follows: since there's just the operation multiplication, I decided to use the '*' character as a delimiter in a somewhat weird list:
import pyparsing as pp
oddDelim = pp.Or([',', '*'])
weirdList = pp.Optional(',').suppress() + \
pp.delimitedList(pp.Or([pp.alphas, pp.pyparsing_common.number]), delim = oddDelim, combine = False) + \
pp.Optional('*').suppress()
nestedTest = pp.nestedExpr(content = weirdList)
Using this nestedTest expression I get a reasonable result:
[['A', 'B', 2, 'C', 3, ['D', 'E', 2, 'F', 3, ['G', 'H']], 'I']]
but I don't know how I should process the tokens in order to properly unfold the arithmetic.
Instead of iterating over the tokens sequentially in a for loop, I would ideally like to start unfolding the arithmetic from the highest degree of nesting and progressively work down. But I don't know how...
Is nestedExpr the way to go? Or should I change the approach and use Forward or maybe infixNotation? I am very new to pyparsing, and I would be very grateful for any hints or ideas on this.
Thanks very much in advance for your help!
Cheers,
Pau
If you want to use Forward() to roll your own recursive grammar, it is best to start
by writing a BNF for your grammar. This will help you think straight about the
problem space first, and then worry about the coding later.
Here is a rough BNF for what you've posted:
list_expr ::= '(' list_item [',' list_item]* ')'
list_item ::= term | mult_term | list_expr
mult_term ::= integer '*' list_item
term ::= A-Z
That is, each list enclosed in parentheses has a comma-delimited list of
items, and each item can be a single character term, a multiplication expression
of an integer, a '*' and an item, or a nested list in another set of parentheses.
To translate this to pyparsing, work bottom-up to define each expression. For
instance, define a term using the new Char class (which is a single-character
from a string of allowed characters):
term = pp.Char("ABCDEFGHI... etc.")
You'll need to use Forward for list_item, since it will need expressions that
aren't defined yet, so Forward() gives you a placeholder. Then when you have
term, mult_term, and list_expr defined, use '<<=' to "insert" the definition
into the existing placeholder, like this:
list_item <<= term | mult_term | list_expr
Since you asked about infixNotation, I'll talk about that approach also.
When using infixNotation, look at your input and identify what constitutes
grouping, operators, and operands.
The grouping here is easy, it is done using ()'s,
which is pretty standard, and infixNotation will treat them as such by default.
Next identify what the lowest-level
operands are. You have two types of operands: integers and single alpha
characters.
The two operators are '*' for multiplication, and ',' for addition.
Since you only asked for suggestions, I'll stop there and let you tackle/struggle
with the next steps on your own.
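Whichever grammar is used, the unfolding step itself can be written as a short recursive function over nested token lists. The sketch below is plain Python and works on the list shape shown in the question (integers as multipliers, nested lists as groups); the function name unfold is illustrative only. Because each nested list is unfolded by a recursive call before it is repeated, the deepest level is resolved first.

```python
def unfold(tokens):
    """Expand integer multipliers in a nested token list into a flat list.

    An integer multiplies the item that follows it; chained multipliers
    such as 2, 3, 'X' are combined into a single factor.
    """
    out = []
    count = 1
    for tok in tokens:
        if isinstance(tok, int):
            count *= tok  # remember the multiplier for the next item
            continue
        # Unfold nested groups first, then repeat the result.
        items = unfold(tok) if isinstance(tok, list) else [tok]
        out.extend(items * count)
        count = 1
    return out


# Token list taken from the question (outer brackets from nestedExpr removed).
tokens = ['A', 'B', 2, 'C', 3, ['D', 'E', 2, 'F', 3, ['G', 'H']], 'I']
print(", ".join(unfold(tokens)))
```

The same logic could be attached as a parse action on mult_term in a Forward-based grammar, so that the expansion happens during parsing rather than afterwards.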

What's the priority for function composition in Haskell?

I saw this code in my textbook:
double :: (Num a) => a -> a
double x = x * 2
map (double.double) [1,2,3,4]
What I don't get is this: if the function composition operator has the highest priority, why are parentheses needed around double.double? If I remove those parentheses, I get an error message. So what exactly is function composition's priority?
All of the built-in operators' respective precedences and fixities can be found in the Haskell Report section 4.4.2. In particular, . and !! have precedence 9, which is the highest among operators. However, function application is not an operator. Function application is specifically designed to have precedence higher than any operator, so:
map (double.double) [1,2,3,4]
This applies the function double . double to each element of the list [1, 2, 3, 4].
map double.double [1,2,3,4]
This attempts to compose the functions map double and double [1, 2, 3, 4], which is unlikely to be successful (though it is not technically impossible).
Precedence (and associativity) are ways of resolving the ambiguity between multiple infix operators in an expression. If there are two operators next to an operand, precedence (and associativity) tells you which of them takes the operand as an argument and which of them takes the other applied-operator expression as an argument. For example, in the expression 1 + 2 * 3, the 2 is next to both + and *. The higher precedence of * means that * gets the 2 as its left argument, while + takes the whole 2 * 3 sub-expression as its right argument.
However that's not the case in map double.double [1, 2, 3, 4]. There's only one operator, with two operands on either side, so there's no question for precedence to answer for us. The two operands are map double and double [1, 2, 3, 4]; operands are sequences of one or more terms, not only the immediate left and right terms. Where there's more than one term the sequence is parsed as simple chained function application (i.e. a b c d is ((a b) c) d).
Another way to think of it is that there is an "adjacency operator" which has higher precedence than can be assigned to anything else, and is invisibly present between any two non-operator terms (but not anywhere else). In this way of thinking map double.double [1, 2, 3, 4] is really something like map # double . double # [1, 2, 3, 4] (where I've written this "adjacency operator" as #). Since # has higher precedence than ., this is (map # double) . (double # [1, 2, 3, 4]). [1]
Whichever way you choose to interpret it, there is a simple consequence. It is simply impossible for any applied operator expression to be passed as an argument in non-operator function application unless there are parentheses around the operator application. If there is an operator in an expression outside any parentheses, then the outermost layer of the expression is always going to be an operator application.
[1] This adjacency operator explanation seems to be pretty common. I personally think it is a poor explanation for how to parse expressions, since you need to partially parse an expression to know where to insert the adjacency operators.
It's often called the "whitespace operator", which is even more confusing since not every piece of whitespace represents this operator, and you don't always need whitespace for it to be there. e.g. length"four" + 1

How can I write a grammar that matches "x by y by z of a"?

I'm designing a low-punctuation language in which I want to support the declaration of arrays using the following syntax:
512 by 512 of 255 // a 512x512 array filled with 255
100 of 0 // a 100-element array filled with 0
expr1 by expr2 by expr3 ... by exprN of exprFill
These array declarations are just one kind of expression among many.
I'm having a hard time figuring out how to write the grammar rules. I've simplified my grammar down to the simplest thing that reproduces my trouble:
grammar Dimensions;
program
: expression EOF
;
expression
: expression (BY expression)* OF expression
| INT
;
BY : 'by';
OF : 'of';
INT : [0-9]+;
WHITESPACE : [ \t\n\r]+ -> skip;
When I feed in 10 of 1, I get the parse I want.
When I feed in 20 by 10 of 1, the middle expression non-terminal slurps up the 10 of 1, leaving nothing left to match the rule's OF expression.
And I get the following warning:
line 2:0 mismatched input '<EOF>' expecting 'of'
The parse I'd like to see is
(program (expression (expression 20) by (expression 10) of (expression 1)) <EOF>)
Is there a way I can reformulate my grammar to achieve this? I feel that what I need is right-association across both BY and OF, but I don't know how to express this across two operators.
After some non-intellectual experimentation, I came up with some productions that seem to generate my desired parse:
expression
:<assoc=right> expression (BY expression)+ OF expression
|<assoc=right> expression OF expression
| INT
;
I don't know if there's a way I can express it with just one production.
