Can I use a decision tree to compare values for pairs of attributes? - decision-tree

I would like to use a decision tree for binary classification. I would like to know if my approach is a valid approach for decision trees.
Each instance in my data set has pairs of attributes, and I have identified that for some pairs, I can compare the values to make a decision. For example, an instance may have the following attributes:
instance = {A1, A2, A3, A4, B1, B2, B3, B4}
A1 and B1 have different values, but refer to the same feature--this is what I meant when I referred to them as a pair. What I would like to do is have nodes in the tree that compare the values of a pair:
          (A1 > B1)
          /       \
  (A2 < B2)       (A3 > B3)
    /   \           /   \
             ...
Is this a valid approach using decision trees?
Is there a better learning approach for this type of problem?

This is a valid approach indeed. All you need is to create new binary features like
C[i] = 1 if A[i] > B[i] else 0
or just
C[i] = A[i] - B[i]
and feed them to an ordinary decision tree algorithm, like rpart in R or sklearn.tree.DecisionTreeClassifier in Python.
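As a minimal sketch of the feature construction described above, assuming each instance arrives as two equal-length lists of paired values (the helper name pair_features and its mode argument are made up for this illustration):

```python
def pair_features(a, b, mode="indicator"):
    """Build one derived feature per (A[i], B[i]) pair.

    mode="indicator"  -> C[i] = 1 if A[i] > B[i] else 0
    mode="difference" -> C[i] = A[i] - B[i]
    """
    if len(a) != len(b):
        raise ValueError("A and B must contain the same number of attributes")
    if mode == "indicator":
        return [1 if x > y else 0 for x, y in zip(a, b)]
    if mode == "difference":
        return [x - y for x, y in zip(a, b)]
    raise ValueError("mode must be 'indicator' or 'difference'")


# One instance with four attribute pairs (hypothetical values).
A = [3, 1, 5, 2]
B = [2, 4, 5, 0]
print(pair_features(A, B))                # indicator features
print(pair_features(A, B, "difference"))  # difference features
```

The resulting rows can then be stacked into a feature matrix and passed to any standard tree learner. With the difference features, a split such as C[i] > 0 corresponds to the comparison A[i] > B[i].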

Related

How to make a formula shorter

I have three pieces of information: quantity, weight per piece and a limit. What I need is to multiply the weight per piece by the largest quantity that keeps the total below the limit, without exceeding the given quantity.
I wrote a formula, but the quantity varies and the formula is very long.
=ROUND(IFS((F13*G13)<H13,F13*G13,((F13-1)*G13)<H13,((F13-1)*G13),
((F13-2)*G13)<H13,,((F13-3)*G13)<H13,,((F13-4)*G13)<H13,,((F13-5)*G13)<H13,,
((F13-6)*G13)<H13,(F13-6)*G13,((F13-7)*G13)<H13,(F13-7)*G13,
((F13-8)*G13)<H13,(F13-8)*G13,((F13-9)*G13)<H13,(F13-9)*G13,
((F13-10)*G13)<H13,(F13-10)*G13,((F13-11)*G13)<H13, (F13-11)*G13,
((F13-12)*G13)<H13,(F13-12)*G13,((F13-13)*G13)<H13,(F13-13)*G13,
((F13-14)*G13)<H13,(F13-14)*G13,((F13-15)*G13)<H13,(F13-15)*G13,
((F13-16)*G13)<H13, (F13-16)*G13,((F13-17)*G13)<H13,(F13-17)*G13,
((F13-18)*G13)<H13,(F13-18)*G13),-2)+200
Here is the input and the expected result. If no condition matches, the formula returns #N/A.
I don't know if the +200 at the end of the formula is supposed to be included in the limit or not, so just adjust the formula accordingly:
=ROUND(IF(G13*F13+200>=H13,H13,F13*G13+200),-2)
Multiply the quantity by the weight and add 200.
If the result is equal to or greater than the limit, set the result to the limit.
Otherwise, use quantity * weight + 200.
Then round it as in your initial formula.
This is what your IFS() looks like:
IFS(((F13-0)*G13)<H13,(F13-0)*G13,
((F13-1)*G13)<H13,(F13-1)*G13,
((F13-2)*G13)<H13,,
((F13-3)*G13)<H13,,
((F13-4)*G13)<H13,,
((F13-5)*G13)<H13,,
((F13-6)*G13)<H13,(F13-6)*G13,
((F13-7)*G13)<H13,(F13-7)*G13,
((F13-8)*G13)<H13,(F13-8)*G13,
((F13-9)*G13)<H13,(F13-9)*G13,
((F13-10)*G13)<H13,(F13-10)*G13,
((F13-11)*G13)<H13,(F13-11)*G13,
((F13-12)*G13)<H13,(F13-12)*G13,
((F13-13)*G13)<H13,(F13-13)*G13,
((F13-14)*G13)<H13,(F13-14)*G13,
((F13-15)*G13)<H13,(F13-15)*G13,
((F13-16)*G13)<H13,(F13-16)*G13,
((F13-17)*G13)<H13,(F13-17)*G13,
((F13-18)*G13)<H13,(F13-18)*G13)
I see three issues:
- The values for cases 2 to 5 are missing. Are you sure this is correct?
- For a general 'x', each case looks like:
IF ((F13-x)*G13)<H13 THEN (F13-x)*G13, so the resulting value is easy to calculate directly, isn't it?
- Why do you stop at 18?
You can simplify it as follows:
=LET(A, A2, B, B2, C, C2, seq, SEQUENCE(19,,0),
out, IF((seq > 1) * (seq < 6), 0, (A - seq)*B),
ROUND(@FILTER(out, (A - seq)*B < C, NA()), -2) + 200)
and extend it down, or use the array version as follows in cell D2:
=LET(A, A2:A3, B, B2:B3, C, C2:C3, seq, SEQUENCE(19,,0),
MAP(A,B,C, LAMBDA(x,y,z, LET(out, IF((seq > 1) * (seq < 6), 0, (x - seq)*y),
ROUND(@FILTER(out, (x - seq)*y < z, NA()), -2) + 200))))
Here is the output:
The implicit intersection operator (@) takes the first element of the FILTER result, which is equivalent to taking the first condition that matches. If no condition matches, FILTER returns NA(). The name out holds the results in the same order in which IFS would test them, and SEQUENCE generates the offsets 0 to 18 so they do not have to be written out by hand.
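For readers who find the logic easier to follow outside a spreadsheet, here is a rough Python sketch of what the formula above computes. The function names first_fit and excel_round_hundreds and the parameter max_drop are invented for this illustration, and the empty results for offsets 2 to 5 are reproduced as 0 only because the original IFS leaves them blank.

```python
import math


def excel_round_hundreds(value):
    """Mimic Excel's ROUND(value, -2), which rounds halves away from zero.

    Python's built-in round() uses banker's rounding, so it is not used here.
    """
    sign = -1 if value < 0 else 1
    return sign * math.floor(abs(value) / 100 + 0.5) * 100


def first_fit(quantity, weight, limit, max_drop=18):
    """Return the result for the first offset k whose product is below the limit."""
    for k in range(max_drop + 1):            # same offsets as SEQUENCE(19,,0)
        product = (quantity - k) * weight
        if product < limit:                   # first matching condition wins
            value = 0 if 1 < k < 6 else product
            return excel_round_hundreds(value) + 200
    return None                               # plays the role of #N/A


print(first_fit(5, 100, 600))
print(first_fit(5, 100, 450))
```

Returning None for the no-match case is a choice made for this sketch. The spreadsheet version propagates #N/A instead.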

Use dynamic programming to find a subset of numbers whose sum is closest to given number M

Given a set A of n positive integers a1, a2, ..., an and another positive integer M, I'm going to find a subset of numbers of A whose sum is closest to M. In other words, I'm trying to find a subset A′ of A such that the absolute value |M - Σ_{a∈A′} a| is minimized, where Σ_{a∈A′} a is the total sum of the numbers of A′. I only need to return the sum of the elements of the solution subset A′ without reporting the actual subset A′.
For example, if we have A as {1, 4, 7, 12} and M = 15. Then, the solution subset is A′ = {4, 12}, and thus the algorithm only needs to return 4 + 12 = 16 as the answer.
The dynamic programming algorithm for the problem should run in
O(nK) time in the worst case, where K is the sum of all numbers of A.
Construct a dynamic programming table of size n*K where
D[i][j] = can you get sum j using the first i elements?
The recurrence you can use is: D[i][j] = D[i-1][j-a[i]] OR D[i-1][j], where the first term applies only when j >= a[i] and the base case is D[0][0] = true. This relation follows from the fact that the ith element can either be added or left out.
Time complexity: O(nK), where K is the sum of all elements.
Lastly, iterate over every possible sum, i.e. D[n][j] for j=1..K. Whichever reachable j is closest to M is your answer.
For a dynamic programming algorithm, we:
1. Define the value we work on
The set of values here is actually a table.
For this problem, we define DP[i, j] as an indicator of whether we can obtain sum j using the first i elements (1 means yes, 0 means no).
Here 0<=i<=n and 0<=j<=K, where K is the sum of all elements in A.
2. Define the recursive relation
DP[i+1, j] = 1 if (DP[i, j] == 1 || (j >= A[i+1] && DP[i, j-A[i+1]] == 1))
Else, DP[i+1, j] = 0.
Don't forget to initialize the table to 0 first, and then set DP[0, 0] = 1, because the empty selection has sum 0. This handles the boundary and trivial cases.
3. Calculate the value you want
With a bottom-up implementation, you can fill the whole table.
Now things become easy. You just need to find the sum closest to M whose table entry is 1.
Here, just look at DP[n][j], since row n covers the whole set. Find the j closest to M whose value is 1.
Time complexity is O(nK), since you fill n*K entries in total.
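The steps above can be sketched in Python as follows. This is one possible implementation rather than a definitive one: it keeps only a single row of the table (a common space-saving variant of the same recurrence), it treats the empty subset with sum 0 as a valid candidate, and it breaks ties in favor of the smaller sum. The function name closest_subset_sum is made up for this example.

```python
def closest_subset_sum(a, m):
    """Return the subset sum of `a` that is closest to `m`.

    reachable[j] is True when some subset of the elements processed so far
    sums to j. Runs in O(n*K) time, where K = sum(a).
    """
    total = sum(a)
    reachable = [False] * (total + 1)
    reachable[0] = True                      # the empty subset has sum 0

    for x in a:
        # Go downwards so each element is used at most once.
        for j in range(total, x - 1, -1):
            if reachable[j - x]:
                reachable[j] = True

    candidates = (j for j in range(total + 1) if reachable[j])
    # Closest to m; ties go to the smaller sum (an arbitrary choice).
    return min(candidates, key=lambda j: (abs(m - j), j))


print(closest_subset_sum([1, 4, 7, 12], 15))  # 16, from the subset {4, 12}
```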

Excel Solver: Is there a way to iterate over 2 changing variables?

I have an issue with solver as follows (simplified version):
So I have a nested If statement that describes condition for 2 changing variables(x,y). For example:
In one cell: IF(AND((x<=2),(x>=0.5),(y<=10),(y>=5)),1,0)
The cell below it: IF(AND((x<=2.5),(x>=1.9),(y<=11),(y>=9)),1,0)
The objective function is the sum of these 2 cells.
Solver or Goal Seek (unless I give it the answer) can't seem to get an answer other than 0,0.
My actual problem is that I have 6 of these IF cells and I'm trying to find an (x,y) that maximizes my objective function. I want Excel to go through as many combinations as it can.
Any thoughts or other ways to do this? Thanks.
The reason Solver does not find the optimal solution in this toy problem is that the IF and AND statements make the problem non-convex. For non-convex problems, the GRG Nonlinear solution method (the default used by Solver) does not guarantee an optimal solution, as it can get trapped in locally best solutions which are not optimal.
Having said that, there is a way to formulate your problem as a mixed integer program, which, although still non-convex, can be solved with the "Simplex LP" method of Solver, and give a guaranteed maximum.
Model Setup
Here is a screenshot of the spreadsheet setup:
For convenience, I have used named ranges for the several quantities.
In particular:
- B2 --> x_var
- C2 --> x_UB1
- D2 --> x_LB1
- E2 --> x_UB2
- F2 --> x_LB2
and for row 3 I use the same convention, but instead of x_ we have y_.
The red cells (B4 and E4) have the conditions you described, and the blue cell (B5) has their sum.
For example, the condition for B4 reads
=IF(AND(x_var<=x_UB1,x_var>=x_LB1,y_var<=y_UB1,y_var>=y_LB1),1,0)
We are going to replace these expressions with two binary variables, which equal one if each expression is satisfied and zero otherwise.
The logic is that instead of an IF expression we can impose the constraints:
LB_x * z <= x <= UB_x * z
LB_y * z <= y <= UB_y * z
z is binary
then z = 1 ==> LB_x <= x <= UB_x
LB_y <= y <= UB_y
and because we maximize the sum of the two z variables, x and y will try to fit in the corresponding ranges so that as many z as possible equal 1.
The green cells H2, J2 have the two new binary variables, called cond1_true, cond2_true respectively. The other cells have the constraints described above:
For example, for the first expression:
J2: =x_var-cond1_true*x_UB1
J3: =y_var-cond1_true*y_UB1
K2: =x_LB1*cond1_true-x_var
K3: =y_LB1*cond1_true-y_var
All these cells need to be <= 0 in the solver model.
Solver Model:
In the model, the objective function cell is the sum of the binary variables. The decision variables are x_var, y_var, cond1_true, cond2_true. The constraints are all in expression <= 0 format. Here is the worksheet: https://www.dropbox.com/s/uek2k9gownhh3ni/excel-solver-is-there-a-way-to-iterate-over-2-changing-variables.xlsx?dl=0
Using this formulation, the solver goes through many combinations of variables and tries to pick up the best one. It can often guarantee an optimal solution (which is almost always the case for small problems)
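As a cross-check outside Excel, the same question can be answered by brute force for a handful of conditions. The sketch below is not the Solver model and not a mixed integer program: it simply tries every group of conditions, largest first, and tests whether their x and y ranges share a common point. The function name best_point and the tuple layout (x_LB, x_UB, y_LB, y_UB) are assumptions made for this example.

```python
from itertools import combinations


def best_point(boxes):
    """Find a point (x, y) that satisfies as many range conditions as possible.

    Each box is (x_lb, x_ub, y_lb, y_ub). Returns (count, point, indices).
    The search is exponential in the number of boxes, which is fine for ~6.
    """
    n = len(boxes)
    for size in range(n, 0, -1):                 # try the largest groups first
        for group in combinations(range(n), size):
            x_lo = max(boxes[i][0] for i in group)
            x_hi = min(boxes[i][1] for i in group)
            y_lo = max(boxes[i][2] for i in group)
            y_hi = min(boxes[i][3] for i in group)
            if x_lo <= x_hi and y_lo <= y_hi:    # the ranges overlap
                return size, (x_lo, y_lo), group
    return 0, None, ()


# The two conditions from the question.
conditions = [(0.5, 2, 5, 10), (1.9, 2.5, 9, 11)]
print(best_point(conditions))  # (2, (1.9, 9), (0, 1))
```

For the two conditions in the question, any x in [1.9, 2] with y in [9, 10] satisfies both, so the maximum objective value is 2.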
UPDATE
If the intervals are non-overlapping, we need to modify
LB_x * z <= x <= UB_x * z
to
min(LB_x) * (1-z) + LB_x * z <= x <= UB_x * z + max(UB_x) * (1-z)
Where min(LB_x) is the minimum lower bound across all intervals (likewise for UB and for y). This way, if an x does not fall into the interval (z=0) it is only forced to fall in some other interval.
I hope this helps!

transform point to another coordinate

Suppose there are 3 points: (x0,y0), (x1,y1), (x2,y2)
O = (x0,y0)
e1 = (x1-x0,y1-y0)
e2 = (x2-x0,y2-y0)
These three define a new coordinate system (O,e1,e2).
Given a point (x,y),
how do I calculate its location in (O,e1,e2)? Please write down the formula, thanks.
I once knew it, but I have forgotten.
Let's call new coordinates a and b.
In the old coordinate system the point will be O+a*e1+b*e2. Since it should be the same point (x,y), we have two linear equations:
x=Ox+a*e1x+b*e2x
y=Oy+a*e1y+b*e2y
Everything except a and b is known, two unknowns, two equations - solution exists if e1 and e2 are not parallel.
The system can be solved either by inversion of matrix ( (e1x,e2x) , (e1y,e2y) ), or by expressing a in terms of b from the first equation and substituting it into the second one.
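A small sketch of the matrix-inversion route, written out with Cramer's rule for the 2x2 system above. The function name to_basis is made up for this illustration, and the exact zero-determinant check assumes exact inputs (with floating-point data a small tolerance may be preferable).

```python
def to_basis(point, origin, e1, e2):
    """Return (a, b) such that point = origin + a*e1 + b*e2."""
    dx = point[0] - origin[0]
    dy = point[1] - origin[1]
    # Determinant of the matrix ((e1x, e2x), (e1y, e2y)).
    det = e1[0] * e2[1] - e2[0] * e1[1]
    if det == 0:
        raise ValueError("e1 and e2 are parallel, so the coordinates are undefined")
    a = (dx * e2[1] - e2[0] * dy) / det
    b = (e1[0] * dy - dx * e1[1]) / det
    return a, b


# Three points defining the frame: O = (1, 1), e1 = (2, 0), e2 = (0, 3).
p0, p1, p2 = (1, 1), (3, 1), (1, 4)
e1 = (p1[0] - p0[0], p1[1] - p0[1])
e2 = (p2[0] - p0[0], p2[1] - p0[1])
print(to_basis((5, 7), p0, e1, e2))  # (2.0, 2.0)
```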

Parenthesization that minimize values of an expression

I need to find a parenthesization of E = c1 O1 c2 O2 ... On-1 cn, where each c(i) is an integer and each O(i) is either + or *, that minimizes the value of the expression.
I know that's probably a very basic question, but I just started learning Dynamic Programming.
My main problem is how to distinguish whether O(i) is + or * (or is this useless?)
