I am trying to run a general linear model using formulas on a data set that contains categorical variables. The results summary table appears to be leaving out one of the levels of each categorical variable when I list the parameters.
I haven't been able to find docs specific to GLM showing the output with categorical variables, but I have for OLS, and it looks like each categorical level should be listed separately. When I do it (with GLM or OLS) it leaves out one of the values for each category. For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

Data = pd.read_csv(root+'/Illisarvik/TestData.csv')
formula = 'Response~Day+Class+Var'
gm = sm.GLM.from_formula(formula=formula, data=Data,
                         family=sm.families.Gaussian()).fit()
ls = smf.ols(formula=formula, data=Data).fit()
print (Data)
print(gm.params)
print(ls.params)
Day Class Var Response
0 D A 0.533088 0.582931
1 D B 0.839837 0.075011
2 D C 1.454716 0.505442
3 D A 1.455503 0.188945
4 D B 1.163155 0.144176
5 N A 1.072238 0.918962
6 N B 0.815384 0.249160
7 N C 1.182626 0.520460
8 N A 1.448843 0.870644
9 N B 0.653531 0.460177
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
Intercept 0.625111
Day[T.N] 0.298084
Class[T.B] -0.439025
Class[T.C] -0.104725
Var -0.118662
dtype: float64
Is there something wrong with my model? The same issue presents itself when I print the full summary tables:
print(gm.summary())
print(ls.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Response No. Observations: 10
Model: GLM Df Residuals: 5
Model Family: Gaussian Df Model: 4
Link Function: identity Scale: 0.0360609978309
Method: IRLS Log-Likelihood: 5.8891
Date: Sun, 05 Mar 2017 Deviance: 0.18030
Time: 23:26:48 Pearson chi2: 0.180
No. Iterations: 2
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.025 0.077 1.173
Day[T.N] 0.2981 0.121 2.469 0.014 0.061 0.535
Class[T.B] -0.4390 0.146 -3.005 0.003 -0.725 -0.153
Class[T.C] -0.1047 0.170 -0.617 0.537 -0.438 0.228
Var -0.1187 0.222 -0.535 0.593 -0.553 0.316
==============================================================================
OLS Regression Results
==============================================================================
Dep. Variable: Response R-squared: 0.764
Model: OLS Adj. R-squared: 0.576
Method: Least Squares F-statistic: 4.055
Date: Sun, 05 Mar 2017 Prob (F-statistic): 0.0784
Time: 23:26:48 Log-Likelihood: 5.8891
No. Observations: 10 AIC: -1.778
Df Residuals: 5 BIC: -0.2652
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.6251 0.280 2.236 0.076 -0.094 1.344
Day[T.N] 0.2981 0.121 2.469 0.057 -0.012 0.608
Class[T.B] -0.4390 0.146 -3.005 0.030 -0.815 -0.064
Class[T.C] -0.1047 0.170 -0.617 0.564 -0.541 0.332
Var -0.1187 0.222 -0.535 0.615 -0.689 0.451
==============================================================================
Omnibus: 1.493 Durbin-Watson: 2.699
Prob(Omnibus): 0.474 Jarque-Bera (JB): 1.068
Skew: -0.674 Prob(JB): 0.586
Kurtosis: 2.136 Cond. No. 9.75
==============================================================================
This is a consequence of the way the linear model works.
For instance, take the categorical variable Day. As far as the linear model is concerned it can be represented by a single 'dummy' variable, set to 0 (zero) for the level you mention first, namely D, and to 1 for the second level, namely N. Statistically speaking, you can recover only the difference between the effects of the two levels of this categorical variable; the effect of the reference level is absorbed into the intercept.
If you now consider Class, which has three levels, you get two dummy variables, representing two differences among the three available levels of this categorical variable.
As a matter of fact, it's perfectly possible to expand on this idea using orthogonal polynomials on the treatment means but that's something for another day.
The short answer is that there's nothing wrong, at least on this account, with your model.
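If you want to see exactly how the formula interface encodes your categoricals, you can ask patsy (the library statsmodels uses to parse formulas) for the design matrix directly. A minimal sketch, using a small made-up frame in place of your TestData.csv:
import pandas as pd
from patsy import dmatrix

# Hypothetical stand-in for the data: Day has two levels, Class has three
demo = pd.DataFrame({'Day':   ['D', 'D', 'N', 'N', 'N'],
                     'Class': ['A', 'B', 'C', 'A', 'B'],
                     'Var':   [0.5, 0.8, 1.5, 1.1, 0.7]})

# Treatment (dummy) coding: the first level of each factor (D for Day,
# A for Class) is absorbed into the intercept, so only Day[T.N],
# Class[T.B] and Class[T.C] appear as columns.
X = dmatrix('Day + Class + Var', data=demo, return_type='dataframe')
print(X.columns.tolist())
# ['Intercept', 'Day[T.N]', 'Class[T.B]', 'Class[T.C]', 'Var']
There is simply no Day[T.D] or Class[T.A] column in the design matrix; their effect is the baseline captured by the intercept, which is why those levels never appear in params or in the summary tables.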
The data consists of weights of freshly harvested plants under four different treatments. The data are normally distributed and homogeneity of variances is given as well.
ANOVA shows significant differences:
library(car)   # Anova() with a capital A comes from the car package
anova_Ernte <- aov(Gewicht ~ Variante, data=Daten_Ernte)
Anova(anova_Ernte)
Anova Table (Type II tests)
Response: Gewicht
Sum Sq Df F value Pr(>F)
Variante 57213 3 2.9778 0.03226 *
Residuals 1511436 236
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the post-hoc test HSD.test() doesn't show any significant differences:
library(agricolae)   # HSD.test() and kruskal() come from the agricolae package
HSD.test(anova_Ernte, "Variante", group = TRUE, console = TRUE, main = "")
  Gewicht groups
4  434.70      a
1  426.90      a
3  400.95      a
2  398.20      a
Gewicht std r Min Max
1 426.90 80.08929 80 234 596
2 398.20 79.90095 80 216 561
3 400.95 74.87869 40 228 568
4 434.70 84.98754 40 264 647
TukeyHSD() shows the following:
TukeyHSD(anova_Ernte)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Gewicht ~ Variante, data = Daten_Ernte)
$Variante
diff lwr upr p adj
2-1 -28.70 -61.439904 4.039904 0.1085058
3-1 -25.95 -66.048029 14.148029 0.3394912
4-1 7.80 -32.298029 47.898029 0.9582293
3-2 2.75 -37.348029 42.848029 0.9980106
4-2 36.50 -3.598029 76.598029 0.0887876
4-3 33.75 -12.551216 80.051216 0.2368136
And finally, Kruskal-Wallis does not show significant differences between the groups:
kruskal(y=Daten_Ernte$Gewicht, trt=Daten_Ernte$Variante, p.adj = "bonferroni", console = TRUE)
kruskal.test(Daten_Ernte$Gewicht ~ Daten_Ernte$Variante)
4 133.9000 a
1 131.2063 a
3 109.8875 a
2 108.4000 a
Am I now safe to say that there are no significant differences between the groups, or do I have options to find out which groups differ according to the ANOVA?
I want to find some good predictors (genes). This is my data, log transformed RNA-seq:
TRG CDK6 EGFR KIF2C CDC20
Sample 1 TRG12 11.39 10.62 9.75 10.34
Sample 2 TRG12 10.16 8.63 8.68 9.08
Sample 3 TRG12 9.29 10.24 9.89 10.11
Sample 4 TRG45 11.53 9.22 9.35 9.13
Sample 5 TRG45 8.35 10.62 10.25 10.01
Sample 6 TRG45 11.71 10.43 8.87 9.44
I have calculated the confusion matrix for different models as below.
1- I tested each of the 23 genes individually with this code; each one that gave a p-value < 0.05 was kept as a good predictor. For example, for CDK6 I did:
glm=glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
Finally I obtained five genes and I put them in this model:
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
I want a plot like this showing the ROC curve of each model, but I don't know how to do that.
Any help please?
I will give you an answer using the pROC package. Disclaimer: I am the author and maintainer of the package. There are alternative ways to do it.
The plot you are seeing was probably generated by the ggroc function of pROC. In order to generate such a plot from glm models, you need to 1) use the predict function to generate the predictions, 2) generate the ROC curves and store them in a list, preferably named so that you get a legend automatically, and 3) call ggroc:
glm.cdk6 <- glm(TRG ~ CDK6, data = df, family = binomial(link = 'logit'))
final <- glm(TRG ~ CDK6 + CXCL8 + IL6 + ISG15 + PTGS2 , data = df, family = binomial(link = 'logit'))
rocs <- list()
library(pROC)
rocs[["CDK6"]] <- roc(df$TRG, predict(glm.cdk6))
rocs[["final"]] <- roc(df$TRG, predict(final))
ggroc(rocs)
I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
# imports assumed by the code below
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

#Fitting function:
def Poisson_fit(x,a):
    return (a*np.exp(-x))

#Code (x and x_data are defined earlier and not shown here)
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ",hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ",XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
plt.plot(x_data, Poisson_fit(x_data,*popt), linestyle='--', color='red',
         label='Fit')
plt.xlabel('s')
plt.ylabel('P(s)')

#Chisquare test:
f_obs = hist
#f_obs: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
f_exp = Poisson_fit(XX,*popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ",chi)
print("p_value: ",p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For this degree of freedom I can't find my p-value and chi value in the chi-square distribution table. Is there anything wrong with my code? Or are my input values too small, so that the test fails? If the p-value > 0.05, the distribution is accepted. Although the p-value is large (0.999), I can't find the chi-square value 0.4588 in the table. I think there is something wrong in my code. How do I fix this error?
Is the returned chi value the critical value of the tails? How do I check the proposed hypothesis?
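For what it's worth, the chi and p_value that were printed are consistent with each other: with 10 bins and no extra ddof correction, scipy.stats.chisquare uses 10 - 1 = 9 degrees of freedom, and the p-value is the upper tail of the chi-square distribution at the statistic. A minimal sketch reproducing that table lookup with scipy (the numbers are simply the ones quoted above):
from scipy.stats import chi2

stat = 0.4588956658201067   # statistic returned by chisquare() above
df = 9                      # 10 bins - 1 (scipy's default, ddof=0)

# Upper-tail probability; this should reproduce the printed p_value (~0.99998)
print(chi2.sf(stat, df))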
Here I demonstrate a survival model with an rcs() term. I was wondering whether anova() from the rms package is the way to test the linearity of the association? And how can I interpret the p-value of the Nonlinear term (0.094 here)? Does it support adding an rcs() term to the Cox model?
library(rms)
data(pbc)
d <- pbc
rm(pbc, pbcseq)
d$status <- ifelse(d$status != 0, 1, 0)
dd = datadist(d)
options(datadist='dd')
# rcs model
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
Wald Statistics Response: Surv(time, status)
Factor Chi-Square d.f. P
albumin 82.80 3 <.0001
Nonlinear 4.73 2 0.094
TOTAL 82.80 3 <.0001
The proper way to test is with a model comparison of the log-likelihood (aka deviance) across two models, a full and a reduced one:
m2 <- cph(Surv(time, status) ~ rcs(albumin, 4), data=d)
anova(m2)
m <- cph(Surv(time, status) ~ albumin, data=d)
p.val <- 1- pchisq( (m2$loglik[2]- m$loglik[2]), 2 )
You can see the difference in the inference using the less accurate Wald statistic (which in your case was not significant anyway since the p-value was > 0.05) versus this more accurate method in the example that Harrell used in his ?cph help page. Using his example:
> anova(f)
Wald Statistics Response: S
Factor Chi-Square d.f. P
age 57.75 3 <.0001
Nonlinear 8.17 2 0.0168
sex 18.75 1 <.0001
TOTAL 75.63 4 <.0001
You would incorrectly conclude that the nonlinear term was "significant" at the conventional 0.05 level. This is despite the fact that the code creating the model was constructed to be entirely linear in age (on the log-hazard scale):
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
Create a reduced model and compare:
f0 <- cph(S ~ age + sex, x=TRUE, y=TRUE)
anova(f0)
#-------------
Wald Statistics Response: S
Factor Chi-Square d.f. P
age 56.64 1 <.0001
sex 16.26 1 1e-04
TOTAL 75.85 2 <.0001
The difference in deviance is not significant with 2 degrees of freedom difference:
1-pchisq((f$loglik[2]- f0$loglik[2]),2)
[1] 0.1243212
I don't know why Harrell leaves this example in, because I've taken his RMS course and know that he endorses the cross-model comparison of deviance as the more accurate approach.
I'm currently trying to impute the missing data through a Gaussian mixture model.
My reference paper is from here:
http://mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf
I currently focus on a bivariate dataset with two Gaussian components.
This is the code to define the weight for each Gaussian component:
library(mvtnorm)   # provides dmvnorm()

myData = faithful[,1:2]; # the data matrix
# (N, no, pi1, pi2, m1, m2, Sigma1 and Sigma2 are initialised earlier, not shown)
for (i in (1:N)) {
  prob1 = pi1*dmvnorm(na.exclude(myData[,1:2]),m1,Sigma1); # probabilities of sample points under model 1
  prob2 = pi2*dmvnorm(na.exclude(myData[,1:2]),m2,Sigma2); # same for model 2
  Z <- rbinom(no,1,prob1/(prob1 + prob2)) # Z is the latent variable assigning each data point to a component
  pi1 <- rbeta(1,sum(Z)+1/2,no-sum(Z)+1/2)
  if (pi1 > 1/2) {
    pi1 <- 1-pi1
    Z <- 1-Z
  }
}
This is my code to identify the rows with missing values:
> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
eruptions waiting
11 1.833 NA
12 3.917 NA
13 4.200 NA
14 1.750 NA
15 4.700 NA
16 2.167 NA
17 1.750 NA
18 4.800 NA
19 1.600 NA
20 4.250 NA
My problem is how to impute the missing data in the "waiting" variable based on the particular component it belongs to.
This code is my first attempt, using conditional mean imputation. I know it is definitely the wrong way: the imputed values would not lie within the particular component and would produce outliers.
miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
myData[i, "waiting"] <- m1[2] + ((rho * sqrt(Sigma1[2,2]/Sigma1[1,1])) * (myData[i, "eruptions"] - m1[1] ) + rnorm(1,0,Sigma1[2,2]))
#print(miss.B[i,])
}
I would appreciate it if someone could give me advice on how to improve the imputation technique so that it works with the latent/hidden variable in a Gaussian mixture model.
Thank you in advance.
This is a solution for one type of covariance structure.
devtools::install_github("alexwhitworth/emclustr")
library(emclustr)
data(faithful)
set.seed(23414L)
ff <- apply(faithful, 2, function(j) {
na_idx <- sample.int(length(j), 50, replace=F)
j[na_idx] <- NA
return(j)
})
ff2 <- em_clust_mvn_miss(ff, nclust=2)
# hmm... seems I don't return the imputed values.
# note to self to update the code
plot(faithful, col= ff2$mix_est)
And the parameter outputs
$it
[1] 27
$clust_prop
[1] 0.3955708 0.6044292
$clust_params
$clust_params[[1]]
$clust_params[[1]]$mu
[1] 2.146797 54.833431
$clust_params[[1]]$sigma
[1] 13.41944
$clust_params[[2]]
$clust_params[[2]]$mu
[1] 4.317408 80.398192
$clust_params[[2]]$sigma
[1] 13.71741