ANOVA is significant but post-hoc test is not. What next?

The data consists of weights of freshly harvested plants under four different treatments. The data is normally distributed and homogeneity of variances holds as well.
ANOVA shows significant differences:
anova_Ernte <- aov(Gewicht ~ Variante, data=Daten_Ernte)
Anova(anova_Ernte)
Anova Table (Type II tests)
Response: Gewicht
Sum Sq Df F value Pr(>F)
Variante 57213 3 2.9778 0.03226 *
Residuals 1511436 236
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the post-hoc test HSD.test() doesn't show any significant differences:
HSD.test(anova_Ernte, "Variante", group = TRUE, console = TRUE, main = "")
  Gewicht groups
4  434.70      a
1  426.90      a
3  400.95      a
2  398.20      a
  Gewicht      std  r Min Max
1  426.90 80.08929 80 234 596
2  398.20 79.90095 80 216 561
3  400.95 74.87869 40 228 568
4  434.70 84.98754 40 264 647
TukeyHSD() shows the following:
TukeyHSD(anova_Ernte)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Gewicht ~ Variante, data = Daten_Ernte)
$Variante
diff lwr upr p adj
2-1 -28.70 -61.439904 4.039904 0.1085058
3-1 -25.95 -66.048029 14.148029 0.3394912
4-1 7.80 -32.298029 47.898029 0.9582293
3-2 2.75 -37.348029 42.848029 0.9980106
4-2 36.50 -3.598029 76.598029 0.0887876
4-3 33.75 -12.551216 80.051216 0.2368136
And finally, Kruskal-Wallis does not show significant differences between the groups:
kruskal(y=Daten_Ernte$Gewicht, trt=Daten_Ernte$Variante, p.adj = "bonferroni", console = TRUE)
kruskal.test(Daten_Ernte$Gewicht ~ Daten_Ernte$Variante)
The mean ranks share a single grouping letter:
4 133.9000 a
1 131.2063 a
3 109.8875 a
2 108.4000 a
Am I now safe to say that there are no significant differences between the groups, or do I have options to find out which groups differ according to the ANOVA?
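For reference, the same omnibus-plus-pairwise workflow can be reproduced outside R; this is a minimal Python sketch with scipy and statsmodels, where weight and treatment are random placeholders rather than your Gewicht/Variante data:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
weight = rng.normal(420, 80, size=240)          # placeholder data, not the real harvest weights
treatment = np.repeat(["1", "2", "3", "4"], 60)

groups = [weight[treatment == t] for t in ["1", "2", "3", "4"]]
print(stats.f_oneway(*groups))                  # omnibus one-way ANOVA F test
print(pairwise_tukeyhsd(weight, treatment, alpha=0.05))  # all pairwise Tukey HSD tests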


Conditional Probability for fake reviews

I am working on a conditional probability question.
A = the event that the review is legit
B = the event that the guess about the review is correct
P(A) = 0.98 → P(A') = 0.02
P(B|A') = 0.95
P(B|A) = 0.90
The question should be this: P(A'|B) = ?
P(A'|B) = P(B|A') * P(A') / P(B)
P(B) = P(B and A') + P(B and A)
     = P(B|A') * P(A') + P(B|A) * P(A)
     = 0.95 * 0.02 + 0.90 * 0.98
     = 0.901
P(A'|B) = P(B|A') * P(A') / P(B)
        = 0.95 * 0.02 / 0.901
        = 0.021
However, my result is not listed among the answer choices. Can you please tell me if I am missing anything, or whether my logic is incorrect?
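For what it's worth, the arithmetic above can be checked numerically; a minimal Python sketch using only the probabilities stated in the question:
p_real, p_fake = 0.98, 0.02
p_correct_given_fake = 0.95
p_correct_given_real = 0.90

p_correct = p_correct_given_fake * p_fake + p_correct_given_real * p_real  # P(B) = 0.901
print(p_correct_given_fake * p_fake / p_correct)                           # P(A'|B) ≈ 0.021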
Example with numbers
This example with numbers is meant as an intuitive way to understand how Bayes' formula works:
Let's assume we have 10,000 typical reviews. We calculate what we would expect to happen with these 10,000 reviews:
9,800 are real
200 are fake
To predict how many reviews are classified as fake:
Of the 9,800 real ones, 10% are classified as fake → 9800 * 0.10 = 980
Of the 200 fake ones, 95% are classified as fake → 200 * 0.95 = 190
980 + 190 = 1,170 are classified as fake.
Now we have all the pieces we need to calculate the probability that a review is fake, given that it is classified as such:
All reviews that are classified as fake → 1,170
Of those, 190 are actually fake
190 / 1170 = 0.1623 or 16.23%
Using general Bayes' theorem
Let's set up the events. Note that my version of event B is slightly different from yours.
P(A): Real review
P(A'): Fake review
P(B): Predicted real
P(B'): Predicted fake
P(A'|B'): Probability that a review is actually fake, given that it is predicted to be fake
Now that we have our events defined, we can go ahead with Bayes:
P(A'|B') = P(A' and B') / P(B') # Bayes' formula
= P(A' and B') / (P(A and B') + P(A' and B')) # Law of total probability
We also know the following, by the multiplication rule for conditional probabilities:
P(A and B')  = P(A) * P(B'|A)
             = 0.98 * 0.10
             = 0.098
P(A' and B') = P(A') * P(B'|A')
             = 0.02 * 0.95
             = 0.019
Putting the pieces together yields:
P(A'|B') = 0.019 / (0.098 + 0.019) = 0.1623
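The same pieces in code, a small Python sketch mirroring the computation above:
p_real, p_fake = 0.98, 0.02
p_pred_fake_given_real = 1 - 0.90   # real review misclassified as fake
p_pred_fake_given_fake = 0.95       # fake review correctly classified as fake

joint_real = p_real * p_pred_fake_given_real   # P(A and B')  = 0.098
joint_fake = p_fake * p_pred_fake_given_fake   # P(A' and B') = 0.019
print(joint_fake / (joint_real + joint_fake))  # P(A'|B') ≈ 0.1623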

What does it mean when ECOS can't solve my SOCP with ~25,000 optimization variables?

tl;dr On a convex optimization problem with about 25,000 variables, ECOS runs to max_iters and terminates with the following error:
SolverError: Solver 'ECOS' failed. Try another solver, or solve with verbose=True for more information.
What does this mean?
I am trying to solve a convex optimization problem in cvxpy, where the setup is as follows:
# <table> is a contingency table with 3 columns where the first two columns are unique item ids, and the third column describes the frequency of co-occurrence
import numpy as np
import cvxpy as cp
theta = cp.Variable([196, 10], nonneg=True)
phi = cp.Variable([10], nonneg=True)
Q = cp.Parameter([2548, 10], nonneg=True)
Q.value = np.ones([2548, 10])/10  # must match the declared Parameter shape [2548, 10]
obj_func = 0
for m, row in enumerate(table):
    i, j, freq = row
    obj_func += freq * Q[m,:] * (cp.log(theta[i,:]) + cp.log(theta[j,:]) + cp.log(phi) - cp.log(Q[m,:]))
objective = cp.Maximize(obj_func)
constraints = [
    cp.sum(phi) == 1,
    cp.sum(theta, axis=0) == 1,
]
problem = cp.Problem(objective, constraints)
opt_val = problem.solve()
When run with verbose=True and max_iters=500, the output looks like:
ECOS 2.0.7 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
It pcost dcost gap pres dres k/t mu step sigma IR | BT
0 +0.000e+00 -1.108e+05 +1e+06 1e+00 1e+00 1e+00 1e+00 --- --- 0 0 - | - -
1 -1.571e+04 -1.265e+05 +1e+06 7e-01 1e+00 1e+00 9e-01 0.2387 5e-01 2 2 2 | 0 2
2 -7.070e+04 -1.814e+05 +8e+05 8e-01 1e+00 2e+00 7e-01 0.3791 3e-01 1 2 2 | 1 0
3 -1.869e+05 -2.975e+05 +5e+05 9e-01 1e+00 2e+00 4e-01 0.6988 5e-01 2 3 2 | 4 1
...
497 +4.782e+08 +4.782e+08 +4e-07 2e-03 5e-12 3e-04 3e-13 0.3208 9e-01 1 1 0 | 16 5
498 +4.782e+08 +4.782e+08 +4e-07 2e-03 5e-12 3e-04 3e-13 0.9791 1e+00 2 1 0 | 27 0
499 +4.782e+08 +4.782e+08 +4e-07 2e-03 5e-12 3e-04 3e-13 0.5013 1e+00 1 1 0 | 21 3
500 +4.782e+08 +4.782e+08 +4e-07 2e-03 5e-12 3e-04 3e-13 0.9791 1e+00 1 1 0 | 30 0
Maximum number of iterations reached, recovering best iterate (497) and stopping.
RAN OUT OF ITERATIONS (reached feastol=1.6e-03, reltol=8.3e-16, abstol=4.0e-07).
Runtime: 314.146930 seconds.
As far as I can tell, this is a perfectly standard convex optimization problem. However, when I run ECOS on it, it reaches max_iters without converging. Repeating with max_iters = 500 (compared with the default of 67) did not solve the issue.
My question is, why does this happen? What is ECOS trying to tell me? Is my problem infeasible? Is it just that there are too many variables to handle?
Hazarding a guess, I suspect this comes down to scaling. The primal and dual costs are very close together and the gap is small. Maybe the solver tolerances are set too tight for this instance?
Things to try:
Try rescaling your formulation, so that all involved constants have similar magnitudes.
Have a look at the solution computed by ECOS. It might very well be OK; in that case you'd just have to adjust the termination criteria of the solver, as in the sketch below.
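In cvxpy, ECOS's termination criteria can be passed straight through problem.solve(); a minimal sketch reusing the problem object from the question, where the specific tolerance values are assumptions to experiment with, not recommendations:
opt_val = problem.solve(
    solver=cp.ECOS,
    max_iters=1000,  # more iterations than the 500 tried above
    abstol=1e-6,     # absolute gap tolerance (assumed value; tune as needed)
    reltol=1e-6,     # relative gap tolerance
    feastol=1e-6,    # feasibility tolerance
    verbose=True,
)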

How can I interpret the P-value of a nonlinear term in rms::anova?

Here I demonstrate a survival model with an rcs() term. Is anova() from the rms package the way to test for linearity of the association? And how should I interpret the P-value of the Nonlinear term (0.094 here)? Does it support adding an rcs() term to the Cox model?
library(rms)
data(pbc)
d <- pbc
rm(pbc, pbcseq)
d$status <- ifelse(d$status != 0, 1, 0)
dd = datadist(d)
options(datadist='dd')
# rcs model
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
Wald Statistics Response: Surv(time, status)
Factor Chi-Square d.f. P
albumin 82.80 3 <.0001
Nonlinear 4.73 2 0.094
TOTAL 82.80 3 <.0001
The proper way to test is with a model comparison of the log-likelihood (a.k.a. deviance) across two models, a full and a reduced one:
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
m <- cph(Surv(time, status) ~ albumin, data=d)
p.val <- 1 - pchisq( 2*(m2$loglik[2] - m$loglik[2]), 2 )  # LR statistic is twice the log-likelihood difference
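For the record, the test reduces to simple chi-square arithmetic; here is a minimal Python sketch, with hypothetical log-likelihood values standing in for m2$loglik[2] and m$loglik[2]:
from scipy.stats import chi2

def lr_test(ll_full, ll_reduced, df):
    stat = 2.0 * (ll_full - ll_reduced)  # likelihood-ratio statistic
    return stat, chi2.sf(stat, df)       # chi2.sf is 1 - CDF, i.e. the p-value

# hypothetical log-likelihoods, for illustration only
stat, p = lr_test(ll_full=-1230.5, ll_reduced=-1233.1, df=2)
print(stat, p)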
You can see the difference in inference between the less accurate Wald statistic (which in your case was not significant anyway, since the p-value was > 0.05) and this more accurate method in the example Harrell uses on his ?cph help page. Using his example:
> anova(f)
Wald Statistics Response: S
Factor Chi-Square d.f. P
age 57.75 3 <.0001
Nonlinear 8.17 2 0.0168
sex 18.75 1 <.0001
TOTAL 75.63 4 <.0001
You would incorrectly conclude that the nonlinear term was "significant" at the conventional 0.05 level. This despite the fact that the code creating the model is entirely linear in age (on the log-hazard scale):
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
Create a reduced model and compare:
f0 <- cph(S ~ age + sex, x=TRUE, y=TRUE)
anova(f0)
#-------------
Wald Statistics Response: S
Factor Chi-Square d.f. P
age 56.64 1 <.0001
sex 16.26 1 1e-04
TOTAL 75.85 2 <.0001
The difference in deviance is not significant with 2 degrees of freedom difference:
1-pchisq((f$loglik[2]- f0$loglik[2]),2)
[1] 0.1243212
I don't know why Harrell leaves this example in, because I've taken his RMS course and know that he endorses the cross-model comparison of deviance as the more accurate approach.

Statsmodels GLM and OLS with formulas missing parameters

I am trying to run a general linear model using formulas on a data set that contains categorical variables. The results summary table appears to leave out one value of each categorical variable when I list the parameters.
I haven't been able to find docs specific to GLM that show the output with categorical variables, but I have for OLS, and it looks like it should list each categorical level separately. When I do it (with GLM or OLS) it leaves out one of the values for each category. For example:
import statsmodels.api as sm  # sm.GLM and sm.families live here
import statsmodels.formula.api as smf
import pandas as pd
Data = pd.read_csv(root+'/Illisarvik/TestData.csv')
formula = 'Response~Day+Class+Var'
gm = sm.GLM.from_formula(formula=formula, data=Data,
                         family=sm.families.Gaussian()).fit()
ls = smf.ols(formula=formula, data=Data).fit()
print (Data)
print(gm.params)
print(ls.params)
Day Class Var Response
0 D A 0.533088 0.582931
1 D B 0.839837 0.075011
2 D C 1.454716 0.505442
3 D A 1.455503 0.188945
4 D B 1.163155 0.144176
5 N A 1.072238 0.918962
6 N B 0.815384 0.249160
7 N C 1.182626 0.520460
8 N A 1.448843 0.870644
9 N B 0.653531 0.460177
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
Is there something wrong with my model? The same issue presents itself when I print the full summary table:
print(gm.summary())
print(ls.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Response No. Observations: 10
Model: GLM Df Residuals: 5
Model Family: Gaussian Df Model: 4
Link Function: identity Scale: 0.0360609978309
Method: IRLS Log-Likelihood: 5.8891
Date: Sun, 05 Mar 2017 Deviance: 0.18030
Time: 23:26:48 Pearson chi2: 0.180
No. Iterations: 2
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.025 0.077 1.173
Day[T.N] 0.2981 0.121 2.469 0.014 0.061 0.535
Class[T.B] -0.4390 0.146 -3.005 0.003 -0.725 -0.153
Class[T.C] -0.1047 0.170 -0.617 0.537 -0.438 0.228
Var -0.1187 0.222 -0.535 0.593 -0.553 0.316
==============================================================================
OLS Regression Results
==============================================================================
Dep. Variable: Response R-squared: 0.764
Model: OLS Adj. R-squared: 0.576
Method: Least Squares F-statistic: 4.055
Date: Sun, 05 Mar 2017 Prob (F-statistic): 0.0784
Time: 23:26:48 Log-Likelihood: 5.8891
No. Observations: 10 AIC: -1.778
Df Residuals: 5 BIC: -0.2652
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.076 -0.094 1.344
Day[T.N] 0.2981 0.121 2.469 0.057 -0.012 0.608
Class[T.B] -0.4390 0.146 -3.005 0.030 -0.815 -0.064
Class[T.C] -0.1047 0.170 -0.617 0.564 -0.541 0.332
Var -0.1187 0.222 -0.535 0.615 -0.689 0.451
==============================================================================
Omnibus: 1.493 Durbin-Watson: 2.699
Prob(Omnibus): 0.474 Jarque-Bera (JB): 1.068
Skew: -0.674 Prob(JB): 0.586
Kurtosis: 2.136 Cond. No. 9.75
==============================================================================
This is a consequence of the way the linear model works.
Take the categorical variable Day: as far as the linear model is concerned, it can be represented by a single 'dummy' variable, set to 0 (zero) for the value you mention first, namely D, and to 1 for the second value, namely N. Statistically speaking, you can recover only the difference between the effects of the two levels of this categorical variable.
If you now consider Class, which has three levels, you get two dummy variables, representing two differences among the three available levels of this categorical variable.
As a matter of fact, it's perfectly possible to expand on this idea using orthogonal polynomials on the treatment means, but that's something for another day.
The short answer is that there's nothing wrong, at least on this account, with your model.
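You can see this coding directly by building the design matrix yourself; a minimal sketch with patsy (the formula library statsmodels uses under the hood), on made-up Day/Class values:
import pandas as pd
import patsy

df = pd.DataFrame({"Day": ["D", "D", "N", "N", "D"],
                   "Class": ["A", "B", "C", "A", "B"]})
# Treatment coding: a categorical with k levels yields k-1 indicator columns,
# with the first level absorbed into the intercept.
X = patsy.dmatrix("Day + Class", df, return_type="dataframe")
print(X.columns.tolist())
# ['Intercept', 'Day[T.N]', 'Class[T.B]', 'Class[T.C]']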

VBA: Surprising least squares result

When running the code:
Dim x(2) As Double, y(2) As Double, v As Variant
x(0) = 1200
x(1) = 1800
x(2) = 2200
y(0) = 64
y(1) = 45
y(2) = 84
v = Application.LinEst(y, x, True, True)
I get v(1,1) = 1.59 (the k-value, i.e. the slope) and v(1,2) = 36.74 (the intercept m). How can this be a least-squares regression?
y(0) ≈ k * x(0) + m
64 ≈ 1.59 * 1200 + 36.74 ?????
The curve seems to differ a lot from the average relationship between x and y.
Because your three data points are almost random, having an R² of only 0.17. Your data doesn't really support a linear trend (and, delving deeper into stats, 3 points doesn't give you enough degrees of freedom for a valid trend).
As the other response shows, I think you omitted the E-02 exponent from the k-value:
1200*0.0159+36.74 = 55.82
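A quick cross-check of that fit outside Excel, as a minimal numpy sketch using the three points from the question:
import numpy as np

x = np.array([1200, 1800, 2200])
y = np.array([64, 45, 84])
slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
print(slope, intercept)  # ~0.0159 and ~36.74: the 1.59 was 1.59E-02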
