I have noticed that the sums of squares in my models can change fairly radically with even the slightest adjustment to the model. Is this normal? I'm using SPSS 16, and both models presented below used the same data and variables with only one small change - categorizing one of the variables as either a 2-level or a 3-level variable.
Details - using a 2 x 2 x 6 mixed-model ANOVA, with the 6 being the repeated measure, I get the following in the between-groups analysis:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 4086.46 | 1 | 4086.46 | 104.93 | .000
X | 224.61 | 1 | 224.61 | 5.77 | .019
Y | 2.60 | 1 | 2.60 | .07 | .80
X by Y | 19.25 | 1 | 19.25 | .49 | .49
Error | 2570.40 | 66 | 38.95 |
Then, when I use the exact same data but a slightly different model in which variable Y has 3 levels instead of 2, I get the following:
------------------------------------------------------------
Source | Type III SS | df | MS | F | Sig
------------------------------------------------------------
intercept | 3603.88 | 1 | 3603.88 | 90.89 | .000
X | 171.89 | 1 | 171.89 | 4.34 | .041
Y | 19.23 | 2 | 9.62 | .24 | .79
X by Y | 17.90 | 2 | 17.90 | .80 | .80
Error | 2537.76 | 64 | 39.65 |
I don't understand why variable X would have a different sum of squares simply because variable Y gets divided up into 3 levels instead of 2. The same thing happens in the within-groups analysis.
Please help me understand :D
Thank you in advance
Pat
The Type III sum of squares for X tells you how much you gain when you add X to a model that already includes all the other terms. It appears that the 3-level Y variable is a much better predictor than the 2-level one: its SS went from 2.60 to 19.23. (This can happen, for example, if the effect of Y is quadratic: a cut at the vertex is not very predictive, but cutting into three groups would be better.) Thus there is less left for X to explain, and its SS decreases.
Just adding to what Aniko has said: variable X has a different sum of squares when variable Y is divided into 3 levels instead of 2 because the SS formula for each factor depends on the number of samples in each treatment. When you change the number of levels in one factor, you change the number of samples in each treatment, and this affects the SS value for all the other factors.
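As a sketch of the point about Type III sums of squares, the quantity can be written as a difference of error sums of squares between two fitted models. The notation below is an assumption made for illustration and is not SPSS output; it also assumes the usual sum-to-zero coding of the factors.

```latex
\mathrm{SS}_{\mathrm{III}}(X)
  = \mathrm{SSE}\bigl(Y,\; X \times Y\bigr)
  - \mathrm{SSE}\bigl(X,\; Y,\; X \times Y\bigr)
```

Both terms on the right involve Y, so recoding Y from 2 to 3 levels changes both fits, and the difference attributed to X can change with them.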
I have a question about Principal Component Analysis. Suppose the variables run along the rows:
           | delhi | kolkata | up  | mp  | bihar | assam |
population | 1.2   | 2.2     | 1.3 | 1.4 | 2     | 1.1   |
crop       | a     | b       | c   | a   | b     | c     |
avg temp   | 1     | 2       | 3   | 4   | 5     | 6     |
soil ph    | 1     | 2       | 1   | 3   | 2     | 1     |
If one wants to do PCA to obtain the most important uncorrelated variables, can one do that? The idea is to reduce the rows, not the columns.
If anyone could explain this concept to me it would be very helpful. My understanding is that variables exist only along columns, and there are many code examples in Python for column dimension reduction using PCA, but I am not sure whether row reduction is the same thing.
Thanks in advance.
I have a Spotfire question. Is it possible to divide two "calculated value" columns in a "graphical table"?
I have a Count([Type]) calculated value. I then limit the data within the second calculated value to arrive at a different Count([Type]).
I would like to divide the two in a third calculated value column.
i.e.
Calculated value column 1:
Count([Type]) = 100 (NOT LIMITED)
Calculated value column 2:
Count([Type]) = 50 (Limited to [Type]="Good")
Now I would like to say 50/100 = 0.5 in the third calculated value column.
If it is possible to do this all within one calculated value column, that is even better. Graphical tables do not let you have if statements in the custom expression; the only way is to limit data. So I am struggling, and any help is appreciated.
Graphical tables do allow If() in custom expressions. To accomplish this, you have to move your logic out of Limit Data Using Expressions and into the expression itself. Your three axis expressions should be:
Count([Type])
Count(If([Type]="Good",[Type]))
Count(If([Type]="Good",[Type])) / Count([Type])
Data Set
+----+------+
| ID | Type |
+----+------+
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Good |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 1 | Bad |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Good |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
| 2 | Bad |
+----+------+
Results
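The values the three expressions should produce for the data set above can be checked outside Spotfire. The following is a minimal Python sketch, not Spotfire syntax; it assumes the graphical table has one row per ID, and the variable names are made up for illustration.

```python
from collections import defaultdict

# The data set from the table above, as (ID, Type) pairs.
rows = (
    [(1, "Good")] * 5 + [(1, "Bad")] * 4
    + [(2, "Good")] * 4 + [(2, "Bad")] * 4
)

total = defaultdict(int)  # mirrors Count([Type])
good = defaultdict(int)   # mirrors Count(If([Type]="Good",[Type]))

for row_id, row_type in rows:
    total[row_id] += 1
    # Rows that fail the condition contribute nothing to the count.
    if row_type == "Good":
        good[row_id] += 1

# Per ID: (all rows, "Good" rows, ratio of the two).
results = {i: (total[i], good[i], good[i] / total[i]) for i in sorted(total)}

for row_id, values in results.items():
    print(row_id, values)
```

Under that grouping assumption, ID 1 has 5 "Good" rows out of 9 and ID 2 has 4 out of 8.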
SVC in sklearn has a decision_function method, but why don't random forests? Secondly, how can I emulate decision_function for a RandomForestClassifier?
Yes, it has a decision function.
The function is coded directly into the forest of randomised trees.
Simply put, RandomForest learners use this trivial decision-function strategy:
Rule a): The Forest (the ensemble) votes on the result.
Rule b): Each Randomised Tree contains a series of decisions (represented by Nodes).
Rule c): Each Node contains one or more simple conditions, based on which the decision-making process moves farther from the root Node towards a terminal Leaf, according to the values in the example's input feature vector.
Rule d): A terminal Leaf contains a value that the particular Randomised Tree presents to the ensemble vote (see Rule a).
This is not an exact RandomForestClassifier tree representation (a Node can have more than one condition there; check the scikit-learn documentation on parameters), but it illustrates the principles fairly well:
VOTE[12] = IN[0] < 6.85417 ? 1 : 2
| |
| 2:[IN[5]<183]
| |
| IN[5] < 183 ? 5 : 6
| | |
| | 6:[IN[10]<1.00118]
| | |
| | IN[10] < 1.00118 ? 13 : 14
| | | |
| | | 14:
| | | |
| | |
| | |
| |
| 5:[IN[6]<187]
| |
| IN[6] < 187 ? 11 : 12
| | |
1:
IN[10] < 1.00054 ? 3 : 4
| |
| 4:
| |
|
3:
voter[12]:
0:[inp0<6.85417] yes=1,no=2,missing=1
1:[inp10<1.00054] yes=3,no=4,missing=3
3:[inp21<0.974632] yes=7,no=8,missing=7
7:[inp22<1.01021] yes=15,no=16,missing=15
15:[inp15<0.994931] yes=31,no=32,missing=31
31:[inp12<0.999151] yes=63,no=64,missing=63
63:[inp23<0.957624] yes=111,no=112,missing=111
111:leaf=0.163636
112:leaf=-0.36
64:leaf=0.323077
32:[inp19<0.993949] yes=65,no=66,missing=65
65:[inp23<0.931146] yes=113,no=114,missing=113
113:leaf=-0
114:[inp23<0.972193] yes=161,no=162,missing=161
161:leaf=-0.421782
162:leaf=-0.133333
66:[inp2<61] yes=115,no=116,missing=115
115:leaf=0.381818
116:leaf=-0.388235
16:[inp17<0.985065] yes=33,no=34,missing=33
How to emulate the decision function?
One can build another RandomForest interpreter that traverses all the Trees and votes on the resulting "ensemble compromise".
That will mimic the decision function (there is no other one there) and will provide the same results, since the RandomForest Trees were constructed to work this very way.
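The interpreter idea can be sketched in plain Python. This is a toy illustration under stated assumptions, not scikit-learn's internal code: the forest layout, the function names `tree_vote` and `decision_function`, and the leaf values are all hypothetical, and the ensemble vote is assumed to be a simple average of the leaf values.

```python
# Hypothetical toy forest. Each tree maps node id -> node.
# A split node is (feature_index, threshold, yes_id, no_id);
# a leaf is a plain float holding the value that tree contributes.
forest = [
    {0: (0, 6.85417, 1, 2), 1: 0.25, 2: -0.5},
    {0: (1, 183, 1, 2), 1: (0, 5.0, 3, 4), 2: -0.25, 3: 0.75, 4: 0.25},
]


def tree_vote(tree, x):
    """Walk one tree from the root (node 0) to a leaf; return the leaf value."""
    node = tree[0]
    while isinstance(node, tuple):
        feature, threshold, yes_id, no_id = node
        node = tree[yes_id if x[feature] < threshold else no_id]
    return node


def decision_function(forest, x):
    """Average the per-tree leaf values (the assumed ensemble vote)."""
    votes = [tree_vote(tree, x) for tree in forest]
    return sum(votes) / len(votes)


print(decision_function(forest, [6.0, 150]))
print(decision_function(forest, [7.0, 200]))
```

For a fitted scikit-learn RandomForestClassifier, the closest ready-made counterpart is `predict_proba`, which returns class probabilities averaged over the trees.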
Program: Excel 2010
Requirements: Prefer no VBA (Macro free book)
I am creating a spreadsheet to calculate the items required for components (parts). I have a list of products and, under each, the number of specific parts it needs. I have a calculation which tells me the total parts needed, but is there a better way?
=($C$32*C34)+($D$32*D34)+($E$32*E34)+($F$32*F34)+($G$32*G34)+($H$32*H34)+($I$32*I34)+($J$32*J34)+($K$32*K34)
| A | B | C | D | E | F |
| Making: | | 2 | 2 | 2 | |
|---------------|-------|------------|-------------|-----------------|---------|
| Item -> | Total | Small raft | Rowing boat | Sm sailing boat | Corbita |
| | | | | | |
| Planks | 20 | 4 | 6 | | |
| Logs | 8 | 4 | | | |
| Nails - Large | 16 | 8 | | | |
| Oars | | | | | |
In the above, you can see that ($C$32*C34) = 8 & ($D$32*D34) = 12 => 12+8 = 20 (B34) (Planks Total)
Is there an easier way of doing this, or will my equation just keep getting bigger?
Thanks in advance.
As chris neilsen mentioned in his comment, you can use the SUMPRODUCT function in Excel. The formula in your cell B34 (total planks) should look like this:
=SUMPRODUCT(C32:K32,C34:K34)
This has the effect of multiplying the corresponding components in the given ranges (C32 * C34, D32 * D34, etc.) and then returning the sum of those products/multiplications.
As you add more columns, you can expand K to the last column in the range that you want to add up in both ranges.
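To make the arithmetic behind SUMPRODUCT concrete, here is a minimal Python sketch of the same calculation for the Planks row. The list names and the `sumproduct` helper are hypothetical, the values come from the table in the question, and blank cells are assumed to count as 0.

```python
def sumproduct(xs, ys):
    """Multiply corresponding entries and add up the products."""
    if len(xs) != len(ys):
        # Excel also reports an error when the ranges differ in size.
        raise ValueError("ranges must have the same length")
    return sum(x * y for x, y in zip(xs, ys))


making = [2, 2, 2]  # "Making" row: small raft, rowing boat, sm sailing boat
planks = [4, 6, 0]  # "Planks" row: the blank cell is treated as 0

total_planks = sumproduct(making, planks)
print(total_planks)  # 2*4 + 2*6 + 2*0 = 20
```

Adding a product only lengthens the two lists, just as it only widens the two ranges in the formula.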
I have a Rpy2 data frame as <class 'rpy2.robjects.vectors.DataFrame'>. How can I convert it to a Python list or tuple with every row as an element? Thanks!
I figured it out. I hope this helps if you are looking for an answer:
output = [tuple([df[j][i] for j in range(df.ncol)]) for i in range(df.nrow)]
I stumbled recently over one potential problem. Given a data frame from R:
| | a | c | b | d |
|---|-------|---|---|-----|
| 1 | info1 | 2 | 1 | op1 |
| 2 | info2 | 3 | 2 | 3 |
| 3 | info3 | 4 | 3 | 3 |
| 4 | info4 | 5 | 4 | 3 |
| 5 | info5 | 6 | 5 | 3 |
| 6 | info6 | 7 | 6 | 3 |
| 7 | 9 | 8 | 7 | 3 |
(Yes, I know: mixed data types in one column, i.e. str and float, may not be realistic, but the same holds true for factor-only columns.)
The conversion will show the integer index for columns a and d, not the real values usually intended. The issue is explained in the rpy2 manual:
R’s factors are somewhat peculiar: they aim at representing a memory-efficient vector of labels, and in order to achieve it are implemented as vectors of integers to which are associated a (presumably shorter) vector of labels. Each integer represents the position of the label in the associated vector of labels.
The following rough draft code is a step towards handling this case:
import rpy2.robjects as robjects

# `dataframe` is the rpy2.robjects.vectors.DataFrame to convert
colnames = list(dataframe.colnames)
rownames = list(dataframe.rownames)
col2data = []
for cn, col in dataframe.items():
    if isinstance(col, robjects.vectors.FactorVector):
        # Factor codes are 1-based positions into the levels vector
        colevel = tuple(col.levels)
        ncol = tuple(colevel[i - 1] for i in col)
    else:
        ncol = tuple(col)
    col2data.append((cn, ncol))
col2data.append(('rownames', rownames))
col2data = dict(col2data)
The output is a dict mapping column names to values. Looping over it and transposing the list of lists will generate the output as needed.
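The two steps, decoding factor codes and then transposing columns into rows, can be illustrated without rpy2. This is a minimal sketch using plain tuples as hypothetical stand-ins for R vectors; `decode_factor` is a made-up helper, not an rpy2 function.

```python
def decode_factor(codes, levels):
    """Map 1-based factor codes to their labels, as R stores them."""
    return tuple(levels[code - 1] for code in codes)


# Stand-in for the column-name -> values dict built from the data frame.
col2data = {
    "a": decode_factor((1, 2, 3), ("info1", "info2", "info3")),
    "c": (2, 3, 4),
    "b": (1, 2, 3),
}

# Transpose: one tuple per row, values in column order.
rows = list(zip(*col2data.values()))
print(rows)
```

The result is one tuple per row, which is the shape asked for in the original question.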