Does RandomForestClassifier have a `decision_function`? - scikit-learn

SVC in sklearn has a decision_function method, but why doesn't RandomForestClassifier? Secondly, how can I emulate decision_function for a RandomForestClassifier?

Yes, it has a decision function.
The function is coded directly into the forest of randomised trees.
Simply put, RandomForest learners use this straightforward decision-function strategy:
Rule a): the Forest (the ensemble) votes on the result.
Rule b): each randomised tree contains a series of decisions, represented by its Nodes.
Rule c): each Node contains one or more simple conditions; based on these, the decision-making process moves from the root Node towards a terminal Leaf, according to the values of the example's input feature vector.
Rule d): each terminal Leaf contains the value that the particular randomised tree presents to the ensemble vote (per rule a)).
While this is not an exact RandomForestClassifier tree representation (a Node there can hold more than one condition - check the scikit-learn documentation on parameters), it illustrates the principles fairly well:
VOTE[12] = IN[0] < 6.85417 ? 1 : 2
| |
| 2:[IN[5]<183]
| |
| IN[5] < 183 ? 5 : 6
| | |
| | 6:[IN[10]<1.00118]
| | |
| | IN[10] < 1.00118 ? 13 : 14
| | | |
| | | 14:
| | | |
| | |
| | |
| |
| 5:[IN[6]<187]
| |
| IN[6] < 187 ? 11 : 12
| | |
1:
IN[10] < 1.00054 ? 3 : 4
| |
| 4:
| |
|
3:
voter[12]:
0:[inp0<6.85417] yes=1,no=2,missing=1
1:[inp10<1.00054] yes=3,no=4,missing=3
3:[inp21<0.974632] yes=7,no=8,missing=7
7:[inp22<1.01021] yes=15,no=16,missing=15
15:[inp15<0.994931] yes=31,no=32,missing=31
31:[inp12<0.999151] yes=63,no=64,missing=63
63:[inp23<0.957624] yes=111,no=112,missing=111
111:leaf=0.163636
112:leaf=-0.36
64:leaf=0.323077
32:[inp19<0.993949] yes=65,no=66,missing=65
65:[inp23<0.931146] yes=113,no=114,missing=113
113:leaf=-0
114:[inp23<0.972193] yes=161,no=162,missing=161
161:leaf=-0.421782
162:leaf=-0.133333
66:[inp2<61] yes=115,no=116,missing=115
115:leaf=0.381818
116:leaf=-0.388235
16:[inp17<0.985065] yes=33,no=34,missing=33
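Incidentally, scikit-learn can produce a similar per-tree rule dump from a fitted forest. A minimal sketch (toy data and variable names are ours, not from the question):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

# estimators_ holds the individual DecisionTreeClassifier objects;
# export_text renders one tree's if/else decision rules as plain text.
print(export_text(forest.estimators_[0]))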
How to emulate the decision function?
One can build a RandomForest interpreter that traverses all the trees and votes on the resulting "ensemble compromise".
That would mimic the decision function (there is no other one there) and provide the same results, since the RandomForest trees were constructed to work this very way.
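A minimal sketch of that interpreter idea (our own toy data and names; assumes a fitted binary classifier): traversing every tree, collecting its vote, and averaging reproduces predict_proba, and the class-probability margin can then serve as a decision_function-like score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each tree votes with its leaf's class-probability estimate; the mean
# over the ensemble is exactly the "ensemble compromise" described above.
votes = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
margin = votes[:, 1] - votes[:, 0]   # a decision_function-like score
assert np.allclose(votes, forest.predict_proba(X))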

Related

How to add two array columns of the same schema in a dataframe, summing the int/long values element-wise?

I have a dataframe as below.
Input dataframe -
+-------+-----+-------------------+-------------------+
| name  | Age | Marks_1           | Marks_2           |
+-------+-----+-------------------+-------------------+
| Harry |     | Physics - [50,30] | Physics - [40,40] |
|       |     | Math - [70,30]    | Math - [20,40]    |
+-------+-----+-------------------+-------------------+
Expected Output -
+-------+-----+-------------------+-------------------+-------------------+
| name  | Age | Marks_1           | Marks_2           | Marks_3           |
+-------+-----+-------------------+-------------------+-------------------+
| Harry | 25  | Physics - [50,30] | Physics - [40,40] | Physics - [90,70] |
|       |     | Math - [70,30]    | Math - [20,40]    | Math - [90,70]    |
+-------+-----+-------------------+-------------------+-------------------+
Basically, I wish to sum the two array columns (Marks_1 and Marks_2) of the same schema and create a new column (Marks_3) holding the element-wise sum of the two arrays, in Spark Scala.
I tried the approach below, but it fails with a type mismatch.
df.withColumn(col("marks_3"),col("marks_1.physics") + col("marks_2.physics"))
Can anyone help with the best approach for this?
Thanks in advance!
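One possible approach, sketched here in PySpark for brevity (the question asks for Scala, where the same zip_with higher-order function exists; the column names below are simplified, and Spark 3.1+ is assumed for the Python API):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Harry", [50, 30], [40, 40])],
    ["name", "physics_1", "physics_2"],
)

# zip_with pairs the two arrays element by element and applies the
# lambda to each pair, so [50,30] + [40,40] becomes [90,70].
out = df.withColumn(
    "physics_3", F.zip_with("physics_1", "physics_2", lambda a, b: a + b)
)
out.show()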

Beyond acceptable range of Nested IF & formula not working

Values to check with
| | Northing | Easting |
|-------|----------|---------|
| Inst1 | 41345 | 33467.8 |
| Inst2 | 41600.5  | 33607.2 |
| Inst3 | 41900.8 | 33740.2 |
| Inst4 | 41933.4 | 33780 |
| Inst5 | 41829.5 | 33694.6 |
| Inst6 | 41449.9 | 33539 |
Range of coordinates (start and end of each rectangle)
|    | Northing start | Northing end | Easting start | Easting end |
|----|----------------|--------------|---------------|-------------|
| T1 | 41158.68 | 41396.88 | 33357.6 | 33517.57 |
| T2 | 41307.9 | 41456.6 | 33384.2 | 33580.5 |
| T3 | 41372.1 | 41517.5 | 33411.3 | 33607.5 |
| T4 | 41431.6 | 41572.7 | 33435.8 | 33632.5 |
| T5 | 41482.9 | 41654.6 | 33472.3 | 33654.2 |
| S1 | 41564.9 | 41701.2 | 33493.1 | 33688.7 |
| S2 | 41611.5 | 41762.3 | 33520.2 | 33708.3 |
| S3 | 41672.7 | 41841.6 | 33555.5 | 33734.1 |
| S4 | 41752.2 | 41897.9 | 33580.6 | 33767.6 |
| S5 | 41809.3 | 41941.7 | 33600.1 | 33791.7 |
| S6 | 41854.6 | 41998.7 | 33625.4 | 33810.7 |
| T6 | 41914.8 | 42055.4 | 33650.7 | 33836.1 |
| T7 | 41971.5 | 42137.4 | 33687.2 | 33859.9 |
The nested IF is not displaying the right value, and it can't go beyond row 48.
How can I include the range M41:Q53?
The current formula in place is below:
=IF(N41<=$H$41<=O41 & P41<=$I$41<=Q41,M41,IF(N42<=$H$41<=O42 &
P42<=$I$41<=Q42,M42,IF(N43<=$H$41<=O43 &
P43<=$I$41<=Q43,M43,IF(N44<=$H$41<=O44 &
P44<=$I$41<=Q44,M44,IF(N45<=$H$41<=O45 &
P45<=$I$41<=Q45,M45,IF(N46<=$H$41<=O46 &
P46<=$I$41<=Q46,M46,IF(N47<=$H$41<=O47 &
P47<=$I$41<=Q47,M47,IF(N48<=$H$41<=O48 & P48<=$I$41<=Q48,M48,"Not
here"))))))))
When comparing coordinates, the choice of coordinate system doesn't change the logic very much. :-)
First, two problems with the formula itself: Excel has no chained comparison, so N41<=$H$41<=O41 is evaluated left to right (the TRUE/FALSE of the first comparison then gets compared with O41) rather than as a range check, and & is the text-concatenation operator, not a logical AND - that's what AND() is for.
It can be tricky (but not impossible) to consistently check whether a point is within a polygon, but a plain ol' rectangle like this is straightforward. If you intend to compare more than a few coordinates, nested IFs just won't work. (In fact they should be avoided at all times!)
For my quick-n-dirty example, I took your data and arranged it as columns vs. rows instead of side by side.
The formula in H6 is:
=IF(AND(H$3>=MIN($C6:$D6),H$3<=MAX($C6:$D6),H$4>=MIN($E6:$F6),H$4<=MAX($E6:$F6)),"Match","-")
Basically it's just checking:
Is Northing To Match greater than or equal to the MIN of NorthingStart & NorthingEnd?
Is Northing To Match less than or equal to the MAX of NorthingStart & NorthingEnd?
...and the same two checks for Easting. If yes to all four, then the point is within the specified rectangle.
There are a number of other ways this could be tackled. Which one is right depends mainly on how much data you'll be comparing, and whether it's an ongoing need (i.e. whether you must account for unforeseen circumstances)...
The same thing could also be accomplished with side-by-side datasets with the help of an array formula.
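For readers who would rather script the check, the same containment test is two comparisons per axis per rectangle. A throwaway Python sketch (the names are ours; the dict abbreviates the "Range of coordinates" table above):
# Each entry: (northing start, northing end, easting start, easting end).
ranges = {
    "T1": (41158.68, 41396.88, 33357.6, 33517.57),
    "T2": (41307.9, 41456.6, 33384.2, 33580.5),
}

def match(northing, easting):
    """Return the labels of every rectangle containing the point."""
    return [label
            for label, (n1, n2, e1, e2) in ranges.items()
            if min(n1, n2) <= northing <= max(n1, n2)
            and min(e1, e2) <= easting <= max(e1, e2)]

print(match(41345, 33467.8))   # Inst1 falls in both: ['T1', 'T2']
Note that a point can fall inside several overlapping rectangles; a first-match nested IF would only ever report one of them.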
Further Reading:
Introduction to Coordinate Geometry (many handy links)
SE: Analytic Geometry: How to check if a point is inside a rectangle?
Wikipedia: Intersection_theory
How to check if two given line segments intersect?
And a bit of a tangent (no pun intended), but since I mentioned it & just for fun, a short explanation of:
How to check if a given point lies inside or outside a polygon?
1. Draw a horizontal line to the right of each point and extend it to infinity
2. Count the number of times the line intersects with polygon edges.
3. A point is inside the polygon if either the count of intersections is odd
or the point lies on an edge of the polygon. If neither condition is true,
the point lies outside.
How to handle point g in the above figure?
Note that we should return true if the point lies on an edge or coincides
with one of the vertices of the given polygon. To handle this, after
checking whether the line from p to the extreme point intersects, we check
whether p is collinear with the vertices of the current side of the polygon.
If it is collinear, we then check whether the point p lies on that side;
if it does, we return true, else false. (Source)
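A compact sketch of that ray-casting rule (a hypothetical helper; it implements only the odd-crossings count, not the on-edge/vertex handling the quote goes on to discuss):
def point_in_polygon(x, y, polygon):
    """polygon is a list of (x, y) vertices in order; returns True if inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the horizontal ray from (x, y) to +infinity cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # odd number of crossings => inside
    return inside

print(point_in_polygon(1, 1, [(0, 0), (4, 0), (4, 4), (0, 4)]))   # True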

Divide two "Calculated Values" within Spofire Graphical Table

I have a Spotfire question: is it possible to divide two "calculated value" columns in a "graphical table"?
I have a Count([Type]) calculated value. I then limit the data within the second calculated value to arrive at a different Count([Type]).
I would like to divide the two in a third calculated value column.
i.e.
Calculated value column 1:
Count([Type]) = 100 (NOT LIMITED)
Calculated value column 2:
Count([Type]) = 50 (Limited to [Type]="Good")
Now I would like to say 50/100 = 0.5 in the third calculated value column.
If it is possible to do this all within one calculated value column, that is even better. Graphical tables do not let you have IF statements in the custom expression; the only way is to limit data. So I am struggling - any help is appreciated.
Graphical tables do allow IF() in custom expressions. To accomplish this, move your logic away from Limit Data Using Expressions and into the expressions directly. Your three axis expressions should be:
Count([Type])
Count(If([Type]="Good",[Type]))
Count(If([Type]="Good",[Type])) / Count([Type])
Data Set
+----+------+
| ID | Type |
+----+------+
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
+----+------+
Results
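As a sanity check on the expected numbers, here are the same three values recomputed over the sample data set in pandas (not Spotfire syntax; 9 Good rows and 8 Bad rows):
import pandas as pd

df = pd.DataFrame({
    "ID":   [1] * 9 + [2] * 8,
    "Type": ["Good"] * 5 + ["Bad"] * 4 + ["Good"] * 4 + ["Bad"] * 4,
})

total = df["Type"].count()            # Count([Type])                    -> 17
good = (df["Type"] == "Good").sum()   # Count(If([Type]="Good",[Type]))  -> 9
print(good / total)                   # the third column                 -> 0.529...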

Flipped switch statements?

Suppose you have 10 boolean variables of which only one can be true at a time, and each time any one is switched on, all the others must be turned off. One of the problems that immediately arises is:
How can you quickly test which variable is true without necessarily
having to linearly check all the variable states each time?
For this, I was wondering whether it is possible to have something like:
switch (true)
{
    case boolean1:
        // do stuff
        ...
    // other variables
}
This looks like a bad way of testing for 10 different states of an object, but I think there are cases where this kind of feature may prove useful, and I would like to know if there is any programming language that supports it.
Few languages offer this as a distinct feature (although JavaScript, for one, evaluates case labels as expressions, so the switch (true) construct above does work there). As an alternative, you could use the Command pattern in conjunction with a priority queue. This assumes that you would be able to prioritize which checks should be done.
Traditionally, when you have such radio-button boolean values, you use a single integer to represent them:
+------------+---------+--------------------+
| BINARY | DECIMAL | BINARY-LOGARITHMIC |
+------------+---------+--------------------+
| 0000000001 | 1 | 0 |
| 0000000010 | 2 | 1 |
| 0000000100 | 4 | 2 |
| 0000001000 | 8 | 3 |
| 0000010000 | 16 | 4 |
| 0000100000 | 32 | 5 |
| 0001000000 | 64 | 6 |
| 0010000000 | 128 | 7 |
| 0100000000 | 256 | 8 |
| 1000000000 | 512 | 9 |
+------------+---------+--------------------+
Let's call the variable holding this integer flag. We can quickly jump to some code based on the flag by indexing a random-access array of functions:
var functions = [ function0
, function1
, function2
, function3
, function4
, function5
, function6
, function7
, function8
, function9
];
functions[flag](); // quick jump
However, you will have to pay for the function call overhead.
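The same jump-table idea in Python, using the binary-logarithm column of the table above to recover the index (placeholder handlers; a sketch, not a library API):
# flag holds exactly one set bit (1, 2, 4, ..., 512), so
# flag.bit_length() - 1 is its binary logarithm, i.e. the array index.
def handler0(): print("state 0")
def handler1(): print("state 1")
# ... one handler per remaining state

handlers = [handler0, handler1]       # handlers[i] serves flag == 1 << i

flag = 2
handlers[flag.bit_length() - 1]()     # O(1) jump; prints "state 1"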

SPSS - sum of squares change radically with slight model changes in ANOVA?

I have noticed that the sum of squares in my models can change fairly radically with even the slightest adjustment. Is this normal? I'm using SPSS 16, and both models presented below used the same data and variables with only one small change - categorizing one of the variables as either a 2-level or a 3-level variable.
Details: using a 2 x 2 x 6 mixed-model ANOVA, with the 6-level factor being the repeated measure, I get the following in the between-groups analysis:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 4086.46 | 1 | 4086.46 | 104.93 | .000
X | 224.61 | 1 | 224.61 | 5.77 | .019
Y | 2.60 | 1 | 2.60 | .07 | .80
X by Y | 19.25 | 1 | 19.25 | .49 | .49
Error | 2570.40 | 66 | 38.95 |
Then, when I use the exact same data but a slightly different model, in which variable Y has 3 levels instead of 2, I get the following:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 3603.88 | 1 | 3603.88 | 90.89 | .000
X | 171.89 | 1 | 171.89 | 4.34 | .041
Y | 19.23 | 2 | 9.62 | .24 | .79
X by Y | 17.90 | 2 | 17.90 | .80 | .80
Error | 2537.76 | 64 | 39.65 |
I don't understand why variable X would have a different sum of squares simply because variable Y gets divided up into 3 levels instead of 2. This is also the case in the within-groups analysis.
Please help me understand :D
Thank you in advance
Pat
The Type III sum of squares for X tells you how much you gain when you add X to a model already including all the other terms. It appears that the 3-level Y variable is a much better predictor than the 2-level one: its SS went from 2.60 to 19.23. (This can happen, for example, if the effect of Y is quadratic: a cut at the vertex is not very predictive, but cutting into three groups would be.) Thus there is less left for X to explain, so its SS decreases.
Just adding to what Aniko has said: the reason variable X has a different sum of squares when variable Y is divided into 3 levels instead of 2 is that the SS formula for each factor depends on the number of samples in each treatment. When you change the number of levels in one factor, you change the number of samples per treatment, and this affects the SS value of all the other factors.
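The effect is easy to reproduce outside SPSS. A hedged sketch with statsmodels on made-up data (sum-to-zero contrasts, as Type III tests expect): recoding Y from 2 to 3 quantile bins shifts the Type III SS reported for X.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
d = pd.DataFrame({"X": rng.integers(0, 2, 72), "z": rng.normal(size=72)})
d["resp"] = d["X"] + d["z"] ** 2 + rng.normal(size=72)

for levels in (2, 3):
    d["Y"] = pd.qcut(d["z"], levels, labels=False)   # recode Y: 2 or 3 bins
    fit = ols("resp ~ C(X, Sum) * C(Y, Sum)", data=d).fit()
    print(levels, "levels of Y:")
    print(sm.stats.anova_lm(fit, typ=3))             # compare the C(X, Sum) row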
