reverse "inequality sign" presenting order in decision tree built in rpart or partykit - decision-tree

I have built a multi-layer decision tree using rpart and I am trying to replicate the tree structure using the partykit package, more specifically the partysplit/partynode combo.
I am currently having an issue with an ordering difference between rpart and partysplit.
The decision tree coming from rpart always takes the "greater than" sign (>) first, then the "less than" sign (<) underneath, while partykit does the opposite.
e.g., the rpart output
[6] value.a >= 33: FALSE. (n = 63, err = 33.3%)
[7] value.a < 33: FALSE. (n = 74, err = 8.1%)
vs. the partykit output
[6] value.a < 33: FALSE. (n = 74, err = 8.1%)
[7] value.a >= 33: FALSE. (n = 63, err = 33.3%)
As a result, I am having trouble reading the decision tree in the correct order and using partykit to recreate the tree from rpart.
Is there a way I can create a tree from rpart such that it takes the "less than" sign first, or is there an option on partysplit to make the split take the "greater than" sign first?

The partysplit() constructor function provides the arguments index and right, which can be used to create all combinations. The index argument controls whether the left part of the interval is presented first or second. The right argument controls where the equals sign goes. For a simple dummy data set:
d <- data.frame(x = 1:10)
we can create all four combinations:
sp1 <- partysplit(1L, 5.5, index = 1:2, right = TRUE)
sp2 <- partysplit(1L, 5.5, index = 2:1, right = TRUE)
sp3 <- partysplit(1L, 5.5, index = 1:2, right = FALSE)
sp4 <- partysplit(1L, 5.5, index = 2:1, right = FALSE)
Then the corresponding character labels can be computed as:
character_split(sp1, d)$levels
## [1] "<= 5.5" "> 5.5"
character_split(sp2, d)$levels
## [1] "> 5.5" "<= 5.5"
character_split(sp3, d)$levels
## [1] "< 5.5" ">= 5.5"
character_split(sp4, d)$levels
## [1] ">= 5.5" "< 5.5"
The as.party() method for rpart objects tries to preserve this. For example:
library("partykit")
library("rpart")
iris2 <- iris[1:100,]
rp <- rpart(Species ~ ., data = iris2)
rp
## n= 100
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 100 50 setosa (0.5000000 0.5000000)
## 2) Petal.Length< 2.45 50 0 setosa (1.0000000 0.0000000) *
## 3) Petal.Length>=2.45 50 0 versicolor (0.0000000 1.0000000) *
as.party(rp)
## Model formula:
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
##
## Fitted party:
## [1] root
## | [2] Petal.Length < 2.45: setosa (n = 50, err = 0.0%)
## | [3] Petal.Length >= 2.45: versicolor (n = 50, err = 0.0%)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2

Related

Finding matching 2D intersections among parameters in a 3D data matrix, based on numerical comparison?

I have a 3-dimensional matrix
import numpy as np
matrix = np.random.randn(100,10,500)
X - dimension : samples of data
Y - dimension : parameter/variable type
Z - dimension : location
In the above matrix, there are 100 samples for 10 variables at 500 locations. For example, if I index variable #1, then I have 100 samples of that variable (spanning 20 seconds) at each of the 500 locations.
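For instance, a minimal sketch of the indexing just described (names are illustrative):
import numpy as np

matrix = np.random.randn(100, 10, 500)
var1 = matrix[:, 0, :]    # variable #1: 100 samples at each of the 500 locations
print(var1.shape)         # (100, 500)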
I need to identify which samples of data and which locations match certain criteria. Matches are per matrix (indexing a particular variable/parameter) and would be in pairs, describing the sample and location of where the criteria matched. For example, a match for the above 3D matrix would be at sample = 50 and location = 31. There could be multiple pairs of matches. I generally return this as an array of tuples, where each tuple contains the sample and location number.
The criteria could specify one or more:
Ranges of values : between -1.0 and 5.5 for example
Individual values : value must == 1.39 for example
These ranges and individual values can be specified for one or more:
Variables/parameters
For example:
Parameter #1 (Y-index = 0) : (-1.0 to 5.5) or (10.3 to 12.1) or 20.32
Parameter #5 (Y-index = 4) : 10.0 or (1.0 to 800.0)
Parameter #8 (Y-index = 7) : (50.0 to 100.0)
Additionally, I would need the ability to invert the criteria, for example:
Parameter #1 (Y-index = 0) : NOT ( (-1.0 to 5.5) or (10.3 to 12.1) or 20.32 )
I would need to have a list of tuples indicating the sample index (X-axis) and location index (Z-axis), where Parameter #1 and Parameter #5 and Parameter #8 match their conditions in the 3D matrix.
I've been looking at np.intersect1d. I've also been using np.where in a loop, which is very inefficient, such as:
import numpy as np

matrix = np.random.randn(100, 10, 500)

# parameters, range_values and discrete_values come from the surrounding context
net_array = None
for parameter in parameters:
    total_result = None
    for lower_range_value, upper_range_value in range_values[parameter]:
        result = np.where((matrix[:, parameter, :] >= lower_range_value) &
                          (matrix[:, parameter, :] <= upper_range_value))
        if result[0].size > 0:
            if total_result is None:
                total_result = result
            else:
                concat_0 = np.concatenate((total_result[0], result[0]))
                concat_1 = np.concatenate((total_result[1], result[1]))
                total_result = (concat_0, concat_1)
    for discrete_value in discrete_values[parameter]:
        result = np.where(matrix[:, parameter, :] == discrete_value)
        if result[0].size > 0:
            if total_result is None:
                total_result = result
            else:
                concat_0 = np.concatenate((total_result[0], result[0]))
                concat_1 = np.concatenate((total_result[1], result[1]))
                total_result = (concat_0, concat_1)
    if total_result is not None:
        if net_array is None:
            net_array = np.stack([total_result[0], total_result[1]], axis=-1)
        else:
            stacked_total_result = np.stack([total_result[0], total_result[1]], axis=-1)
            match_indexes = (net_array[:, None] == stacked_total_result).all(-1).any(1)
            net_array = net_array[match_indexes]
            if not np.any(match_indexes):
                break
Is there an efficient way of finding the sample index (X-axis) and location index (Z-axis) where one or more parameters (Y-axis) each match their criteria?
I think you want something like this?
import numpy as np

(n_samples, n_vars, n_locs) = (100, 10, 500)
matrix = np.random.randn(n_samples, n_vars, n_locs)

param_idx2ranges = {
    0: [(-2.5, -2.0), (0.5, 0.75), (2.32, 2.32)],  # ranges listed as (min, max)
    4: [(1.0, 1.0), (2.0, 40.0)],
    7: [(1.5, 2.0)],
}

final_mask = np.ones((n_samples, n_locs), dtype="bool")
for (param_idx, ranges) in param_idx2ranges.items():
    param_mask = np.zeros((n_samples, n_locs), dtype="bool")
    for (min_val, max_val) in ranges:
        param_mask |= (min_val <= matrix[:, param_idx]) & (matrix[:, param_idx] <= max_val)
    final_mask &= param_mask

idxs = np.argwhere(final_mask)
print(matrix[idxs[:, 0], :, idxs[:, 1]])
Negating just involves applying the ~ operator where you need it.
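As a hedged sketch of that, reusing the names from the snippet above plus a hypothetical negated_params set listing the parameters whose criteria should be inverted:
import numpy as np

(n_samples, n_vars, n_locs) = (100, 10, 500)
matrix = np.random.randn(n_samples, n_vars, n_locs)

param_idx2ranges = {0: [(0.5, 0.75)], 4: [(2.0, 40.0)]}
negated_params = {0}   # hypothetical: invert parameter #1's criteria

final_mask = np.ones((n_samples, n_locs), dtype="bool")
for (param_idx, ranges) in param_idx2ranges.items():
    param_mask = np.zeros((n_samples, n_locs), dtype="bool")
    for (min_val, max_val) in ranges:
        param_mask |= (min_val <= matrix[:, param_idx]) & (matrix[:, param_idx] <= max_val)
    if param_idx in negated_params:
        param_mask = ~param_mask   # NOT (range1 or range2 or ...)
    final_mask &= param_mask

idxs = np.argwhere(final_mask)     # (sample, location) pairs, as before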

Python cvxpy - reuse some constraints

I'm currently using cvxpy to optimize a really big problem and am now facing the following issue.
I run multiple iterations of the solver (every iteration reduces the flexibility of some variables).
Every run has 50 constraints in total, of which only 2 are different on every run. The remaining 48 constraints are identical.
During every iteration I rebuild those 2 constraints, the problem, and the objective function from scratch.
If I don't rebuild the remaining (identical) 48 constraints, the final solution makes no sense.
I read this post CVXPY: how to efficiently solve a series of similar problems but here in my case, I don't need to change parameters and re-optimize.
I just managed to prepare an example that shows this issue:
import cvxpy as cvx
import numpy as np

x = cvx.Variable(3)
y = cvx.Variable(3)
tc = np.array([1.0, 1.0, 1.0])
constraints2 = [x >= 2]
constraints3 = [x <= 4]
constraints4 = [y >= 0]
for i in range(2):
    if i == 0:
        constraints1 = [x - y >= 0]
    else:
        x = cvx.Variable(3)
        y = cvx.Variable(3)
        constraints1 = [x + y == 1,
                        x - y >= 1,
                        x - y >= 0,
                        x >= 0]
    constraints = constraints1 + constraints2 + constraints3 + constraints4
    # Form objective.
    obj = cvx.Minimize((tc.T @ x) - (tc.T @ y))
    # Form and solve problem.
    prob = cvx.Problem(obj, constraints)
    prob.solve()
    solution_value = prob.value
    solution = str(prob.status).lower()
    print("\n\n** SOLUTION: {} Value: {} ".format(solution, solution_value))
    print("* optimal (x + y == 1) dual variable", constraints[0].dual_value)
    print("optimal (x - y >= 1) dual variable", constraints[1].dual_value)
    print("x - y value:", (x - y).value)
    print("x = {}".format(x.value))
    print("y = {}".format(y.value))
As you can see, constraints2 requires all the values in the x vector to be at least 2. constraints2 is added in both iterations to constraints, which is passed to the solver.
Yet the second solution gives values of vector x that are less than 2.
Why? How do I avoid this issue?
Thank you
You need to use parameters as described in the linked post. (The reason your constraints stop binding, by the way, is that x = cvx.Variable(3) creates a brand-new variable object; the constraints built earlier still reference the old one, so they no longer restrict the new x.) Suppose you have the constraint rhs >= lhs, which is sometimes used and other times not, where rhs and lhs have dimensions m x n. Write the following code:
param = cp.Parameter((m, n))
slack = cp.Variable((m, n))
param_constraint = [rhs >= lhs + cp.multiply(param, slack)]
Now to turn the constraint off, set param.value = np.ones((m, n)). To turn the constraint on, set param.value = np.zeros((m, n)). You can turn some entries of the constraint off/on by setting some entries of param to 1 and others to 0.
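To make the pattern concrete, here is a minimal hedged sketch with a hypothetical toy problem (not your model): the lower bound x >= 2 is toggled off and on via the parameter, and the problem object is reused across solves:
import cvxpy as cp
import numpy as np

x = cp.Variable(3)
slack = cp.Variable(3)
on_off = cp.Parameter(3)    # 0 -> constraint enforced, 1 -> constraint relaxed

# toggleable lower bound: x + on_off*slack >= 2
constraints = [x >= 0, 2 <= x + cp.multiply(on_off, slack)]
prob = cp.Problem(cp.Minimize(cp.sum(x)), constraints)

on_off.value = np.zeros(3)  # constraint on: x >= 2 binds
prob.solve()
print(x.value)              # approx [2, 2, 2]

on_off.value = np.ones(3)   # constraint off: the free slack absorbs the bound
prob.solve()                # same problem object, no rebuild
print(x.value)              # approx [0, 0, 0]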

Speed Up a for Loop - Python

I have a code that works perfectly well but I wish to speed up the time it takes to converge. A snippet of the code is shown below:
import time

import numpy as np
from numpy.linalg import norm

# data and target are defined elsewhere ('A' and 'b' in Ax = b, see below)
def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * data[i] / (norm(data[i])**2)
    return y

rows, columns = data.shape

start = time.time()
iterate = 0
iterate_count = []
norm_count = []
res = 5
x_not = np.ones(columns)
norm_count.append(norm(x_not))
iterate_count.append(0)
while res > 1e-8:
    for row in range(rows):
        y = myfunction(x_not, row)
        x_not = y
    iterate += 1
    iterate_count.append(iterate)
    norm_count.append(norm(x_not))
    res = abs(norm_count[-1] - norm_count[-2])
print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))
I am relatively new to Python. I will appreciate any hint/assistance.
Ax = b is the problem we wish to solve. Here, 'A' is the 'data' and 'b' is the 'target'.
Ugh! After spending a while on this I don't think it can be done the way you've set up your problem. In each iteration over the row, you modify x_not and then pass the updated result to get the solution for the next row. This kind of setup can't be vectorized easily. You can learn the thought process of vectorization from the failed attempt, so I'm including it in the answer. I'm also including a different iterative method to solve linear systems of equations. I've included a vectorized version -- where the solution is updated using matrix multiplication and vector addition, and a loopy version -- where the solution is updated using a for loop to demonstrate what you can expect to gain.
1. The failed attempt
Let's take a look at what you're doing here.
def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * (data[i] / (norm(data[i])**2))
    return y
You subtract
the dot product of (the ith row of data and x_not)
from the ith element of target,
capped at zero.
You multiply this result with the ith row of data divided by the norm of that row squared. Let's call this part2.
Then you add this to x_not.
Now let's look at the shapes of the matrices.
data is (M, N).
target is (M, ).
x_not is (N, )
Instead of doing these operations rowwise, you can operate on the entire matrix!
1.1. Simplifying the dot product.
Instead of doing data[i, :] @ x, you can do data @ x_not, and this gives an array whose ith element is the dot product of the ith row with x_not. So now we have data @ x_not with shape (M, ).
Then, you can subtract this from the entire target array, so target - (data @ x_not) has shape (M, ).
So far, we have
part1 = target - (data @ x_not)
Next, if anything is greater than zero, set it to zero.
part1[part1 > 0] = 0
1.2. Finding rowwise norms.
Finally, you want to multiply this by the row of data, and divide by the square of the L2-norm of that row. To get the norm of each row of a matrix, you do
rownorms = np.linalg.norm(data, axis=1)
This is a (M, ) array, so we need to convert it to a (M, 1) array so we can divide each row. rownorms[:, None] does this. Then divide data by this.
part2 = data / (rownorms[:, None]**2)
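If the [:, None] trick is unfamiliar, here is a tiny shape check (illustrative values, not from the question):
import numpy as np

data = np.random.random((4, 3))            # a small (M, N) example
rownorms = np.linalg.norm(data, axis=1)
print(rownorms.shape)                      # (4,)
print(rownorms[:, None].shape)             # (4, 1): broadcastable against (4, 3)
part2 = data / (rownorms[:, None]**2)
print(part2.shape)                         # (4, 3): each row scaled by 1/norm(row)**2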
1.3. Add to x_not
Finally, we're summing the rows of part1[:, None] * part2 (the extra axis lets the (M, ) part1 broadcast against the (M, N) part2), adding that to the original x_not, and returning the result
result = x_not + (part1[:, None] * part2).sum(axis=0)
Here's where we get stuck. In your approach, each call to myfunction() gives a value of part1 that depends on x_not, which was changed by the previous call to myfunction().
2. Why vectorize?
Using numpy's inbuilt methods instead of looping allows it to offload the calculation to its C backend, so it runs faster. If your numpy is linked to a BLAS backend, you can extract even more speed by using your processor's SIMD registers
The conjugate gradient method is a simple iterative method to solve certain systems of equations. There are other more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.
Given your system
data @ x_not = target
Let's define some variables:
A = data.T @ data
b = data.T @ target
And we'll solve the system A @ x = b
x = np.zeros((columns,)) # Initial guess. Can be anything
resid = b - A @ x
p = resid
while (np.abs(resid) > tolerance).any():
    Ap = A @ p
    alpha = (resid.T @ resid) / (p.T @ Ap)
    x = x + alpha * p
    resid_new = resid - alpha * Ap
    beta = (resid_new.T @ resid_new) / (resid.T @ resid)
    p = resid_new + beta * p
    resid = resid_new + 0
To contrast the fully vectorized approach with one that uses iterations to update the rows of x and resid_new, let's define another implementation of the CG solver that does this.
def solve_loopy(data, target, itermax = 100, tolerance = 1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape
    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        for i in range(len(x)):
            x[i] = x[i] + alpha * p[i]
            resid_new[i] = resid[i] - alpha * Ap[i]
        # resid_new = resid - alpha * A @ p
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
And our original vector method:
def solve_vect(data, target, itermax = 100, tolerance = 1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape
    x = np.zeros((columns,)) # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        x = x + alpha * p
        resid_new = resid - alpha * Ap
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
Let's solve a simple system to see if this works first:
2x1 + x2 = -5
−x1 + x2 = -2
should give a solution of [-1, -3]
data = np.array([[ 2, 1],
[-1, 1]])
target = np.array([-5, -2])
print(solve_loopy(data, target))
print(solve_vect(data, target))
Both give the correct solution [-1, -3], yay! Now on to bigger things:
data = np.random.random((100, 100))
target = np.random.random((100, ))
Let's ensure the solution is still correct:
sol1 = solve_loopy(data, target)
np.allclose(data # sol1, target)
# Output: False
sol2 = solve_vect(data, target)
np.allclose(data # sol2, target)
# Output: False
Hmm, looks like the CG method doesn't converge within our 100-iteration cap for the badly conditioned random matrices we created (the normal-equations matrix A = data.T @ data squares the condition number). Well, at least both give the same result.
np.allclose(sol1, sol2)
# Output: True
But let's not get discouraged! We don't really care if it works perfectly, the point of this is to demonstrate how amazing vectorization is. So let's time this:
import timeit
timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784
timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722
Nice! A ~2x speedup simply by avoiding a loop while updating our solution!
For larger systems, this will be even better.
for N in [10, 50, 100, 500, 1000]:
    data = np.random.random((N, N))
    target = np.random.random((N, ))
    t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
    t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
    print(N, t_loopy, t_vect, t_loopy/t_vect)
This gives us:
N t_loopy t_vect speedup
00010 0.002823 0.002099 1.345390
00050 0.051209 0.014486 3.535048
00100 0.260348 0.114601 2.271773
00500 0.980453 0.240151 4.082644
01000 1.769959 0.508197 3.482822

I am writing code based on some formulas, but my if statement does not seem to work (line 33)

W = int(input('Enter weight:'))
b = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
AR = [1, 2, 3, 4, 5]
T = [50.99238, 50.05062, 49.07943, 48.05919, 47.00952, 45.92061, 44.79246, 43.62507, 42.42825, 41.18238, 39.90708, 38.60235, 37.24857, 35.86536, 34.4331, 32.97141, 31.48029, 29.94012, 28.37052, 26.76168, 25.1136]
CL = [1.0]
CD = [0.5]
clmax = 2
n = 0
z = 0
while b[n] < 2.1 :
    while AR[z]< 5.1:
        cl = CL[n]
        cd = CD[z]
        s = ((b[n]*b[n])/AR[z])
        V = ((2*W*9.81)/(1.2*s*clmax) ** 0.5)*1.1
        vlof = V/1.41
        Vlof = round(vlof)
        D = 0.5*cd*1.2*Vlof*Vlof*s
        L = 0.5*cl*1.2*Vlof*Vlof*s
        a = (9.81/W)*(T[Vlof]-D-O.O5(W-L))
        Sg = (V*V)/(2*a)
        if Sg <= 30:
            print('IT WILL TAKEOFF')
        else:
            print('It will NOT takeoff')
        t/c = int(input('t/c ratio is:'))
        l = int(input('Taper ratio is:'))
        f = 0.005(1+1.5*((l-0.6)**2))
        e = (1/((1+0.12*V*V*0.003*0.003)(1+(0.142+(f*AR[z]*(10*t/c)**0.33)
            +(0.1/(4+AR[z])**0.8))
        if z <= 5: # line 33 #
            z +=1
        else :
            break
    n+=1
OUTPUT:
File "<ipython-input-49-d7d52927efb2>", line 33
if z <= 5:
^
SyntaxError: invalid syntax
This is code which should tell me whether my aircraft will take off or not (given an input weight).
I cannot seem to understand why this syntax is invalid?
There are several errors in the code you provided. I tried to clean it up as best I could. The following no longer raises an error on its own. That does not mean the code does what it is supposed to do; you should still check that.
Your lists like CL and CD do not contain enough elements for most cases, which will cause IndexError: list index out of range. You should extend those lists or use a different kind of logic to make sure this does not happen (one possible guard is sketched after the code below).
W = int(input('Enter weight:'))
b = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
AR = [1, 2, 3, 4, 5]
T = [
    50.99238, 50.05062, 49.07943, 48.05919, 47.00952, 45.92061, 44.79246,
    43.62507, 42.42825, 41.18238, 39.90708, 38.60235, 37.24857, 35.86536,
    34.4331, 32.97141, 31.48029, 29.94012, 28.37052, 26.76168, 25.1136
]
CL = [1.0]
CD = [0.5]
clmax = 2
n = 0
z = 0
while b[n] < 2.1:
    while AR[z] < 5.1:
        cl = CL[n]
        cd = CD[z]
        s = ((b[n]*b[n])/AR[z])
        V = ((2*W*9.81)/(1.2*s*clmax) ** 0.5)*1.1
        vlof = V/1.41
        Vlof = round(vlof)
        D = 0.5*cd*1.2*Vlof*Vlof*s
        L = 0.5*cl*1.2*Vlof*Vlof*s
        # O.O5 -> 0.05*: you used the letter 'O' instead of '0' (zero). also
        # you always have to use * if you want to multiply something. you
        # cannot leave it out.
        a = (9.81/W)*(T[Vlof]-D-0.05*(W-L))
        Sg = (V*V)/(2*a)
        if Sg <= 30:
            print('IT WILL TAKEOFF')
        else:
            print('It will NOT takeoff')
        # t/c -> t_c: you cannot use special characters when defining
        # variables. best stick to lower- and upper-case letters,
        # the underscore and numbers.
        t_c = int(input('t/c ratio is:'))
        l = int(input('Taper ratio is:'))
        # a whole lot of closing parentheses and '*' were missing here. also
        # cleaned up the formatting
        f = 0.005*(1+1.5*((l-0.6)**2))
        e = (
            1 / (
                (1+0.12*V*V*0.003*0.003)*(
                    1 + (
                        0.142+(f*AR[z]*(10*t_c)**0.33)
                        + (0.1/(4+AR[z])**0.8)
                    )
                )
            )
        )
        # in Python, space has meaning. it determines which part of the code
        # is executed at which level in the routine. you cannot choose it
        # arbitrarily.
        if z <= 5:
            z += 1
        else:
            break
    n += 1
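As for the IndexError warning above, one hedged sketch of a guard (not in the original code) is to clamp each index to the last available element:
# hypothetical guard: reuse the last coefficient once a list runs out
cl = CL[min(n, len(CL) - 1)]
cd = CD[min(z, len(CD) - 1)]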

How do I interpret this error of integration example in the SCIPY docs?

This SciPy doc shows how to integrate a function, with the full signature given as:
scipy.integrate.quad(func, a, b, args=(), full_output=0, epsabs=1.49e-08, epsrel=1.49e-08, limit=50, points=None, weight=None, wvar=None, wopts=None, maxp1=50, limlst=50)
Near the bottom of the page is an example in which the function e^(-x) is integrated from 0 to infinity with respect to x as:
>>> import numpy as np
>>> from scipy import integrate
>>> invexp = lambda x: np.exp(-x)
>>> integrate.quad(invexp, 0, np.inf)
(1.0, 5.842605999138044e-11)
This part makes sense. The example then continues as:
>>> f = lambda x,a : a*x
>>> y, err = integrate.quad(f, 0, 1, args=(1,))
>>> y
0.5
>>> y, err = integrate.quad(f, 0, 1, args=(3,))
>>> y
1.5
What do the numbers in args=(numbers,) represent? How are they used in the function, and how does the function know how to use them? How are the outputs 0.5 and 1.5 obtained from the numbers in args?
Is it correct to say that whatever the numbers in args=(numbers,) are, they can never be changed? If so, is it possible to pass two sets of args: a first set consisting of unchangeable inputs and a second set of mutable args (perhaps for mutable parameters a and b in an integrable Gaussian distribution function)?
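From experimenting, quad seems to simply forward the args tuple as extra positional arguments, calling f(x, *args) at each evaluation point. So with f = lambda x, a: a*x and args=(3,), quad integrates 3*x over [0, 1], giving 1.5. Nothing stops you from passing several values and building a fresh tuple whenever one changes. A hedged sketch with a hypothetical two-parameter integrand:
from scipy import integrate

f = lambda x, a, b: a * x + b            # two extra parameters after x

y, err = integrate.quad(f, 0, 1, args=(3.0, 2.0))   # integral of 3x + 2 over [0, 1]
print(y)   # 3.5 = 1.5 + 2.0

# "changing" an argument just means passing a different tuple on the next call
y, err = integrate.quad(f, 0, 1, args=(3.0, 5.0))
print(y)   # 6.5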
PS -- I can remove the code and extra thoughts below if it's too long and irrelevant.
I am thinking this method may help me generalize my code such that an unchangeable argument can be added to my functions to choose which type of distribution to integrate over, if multiple distribution functions are defined. As it stands, I have a way of passing unchangeable args to a distribution function as:
from math import exp, pi
from scipy.integrate import quad

def distribution(x, a, b):  ## GAUSSIAN
    ## add unchangeable input pick_distribution to specify which distribution
    ## if multiple distributions are definable
    a = abs(a)  ## mu
    b = abs(b)  ## sigma
    cnorm = 1 / (b * (2*pi)**(1/2))
    return cnorm * exp((-1) * (x - a)**2 / (2 * (b**2)))

def integratesubs(args):
    ## args[0] = a
    ## args[1] = b
    ## if the interval is 0 to 5, the subintervals could be 0 to 1, 1 to 2,
    ## 2 to 3, etc; these are generalizable
    ## integral over one subinterval is equal to area under Gaussian curve
    ## between the subinterval bounds
    ## numobs is a pre-defined sample size
    res = []
    for i in range(len(subintervalbounds)-1):  ## no ith element for rightmost boundary
        res.append(quad(distribution, subintervalbounds[i], subintervalbounds[i + 1],
                        args=(args[0], args[1]))[0] * numobs)
    return res
Another function higher up the chain (not shown) calls integratesubs and takes an input parameters = [a, b] as mutable inputs.
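For completeness, a hedged usage sketch, continuing from the snippet above with hypothetical values for the module-level subintervalbounds and numobs that integratesubs assumes:
subintervalbounds = [0, 1, 2, 3, 4, 5]   # hypothetical subinterval edges on [0, 5]
numobs = 1000                            # hypothetical sample size

expected = integratesubs([2.5, 1.0])     # a = 2.5 (mu), b = 1.0 (sigma)
print(expected)                          # expected observation counts per subinterval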
