A common one-sample test for variance is the chi-square test; see, e.g., http://www.itl.nist.gov/div898/handbook/eda/section3/eda358.htm.
What are some robust testing alternatives for variance when the population is not normal and/or is subject to outliers?
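For reference, the chi-square statistic mentioned above can be sketched as follows. This is a minimal illustration, not a robust alternative: the sample values and the hypothesized variance `sigma0_squared` are made up, and the helper name `chi_square_variance_statistic` is hypothetical. Under the null hypothesis and normality, the statistic follows a chi-square distribution with n - 1 degrees of freedom, which is exactly the assumption that fails for non-normal data.

```python
from statistics import variance


def chi_square_variance_statistic(sample, sigma0_squared):
    """Return T = (n - 1) * s^2 / sigma0^2 and its degrees of freedom."""
    n = len(sample)
    s_squared = variance(sample)  # unbiased sample variance (divides by n - 1)
    return (n - 1) * s_squared / sigma0_squared, n - 1


# Made-up data, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]
statistic, dof = chi_square_variance_statistic(sample, sigma0_squared=1.0)
print(statistic, dof)
```

The p-value would then come from the chi-square distribution with `dof` degrees of freedom (for example via a statistics library), which is omitted here to keep the sketch dependency-free.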
Related
I use ANOVA, including the Levene test, unidimensional significance tests, descriptive statistics and the Tukey test. I have some doubts about what happens when the assumption of homogeneity of variance is not met in the Levene test. In many of the available materials I found the information:
"Basic assumptions of ANOVA tests:
Independence of random variables in the populations (groups) under consideration.
Measurability of the analysed variables.
Normality of the distribution of the variables in each population (group).
Homogeneity of the variance in all populations (groups).
If one of the first three assumptions is not met in the analysis of variance, the non-parametric Kruskal-Wallis test should be used. If the assumption of homogeneity of variances is not met, the Welch test should be used to assess the means."
If we had heterogeneous variances and a non-normal distribution then we would apply the Kruskal-Wallis test, right? On the other hand, what if we have both heterogeneous variances and a normal distribution, do we use the Welch test? If we do the Welch test, how should it be interpreted and what tests are subsequently recommended to see statistically significant differences between groups?
I would be very grateful for an answer.
The goal of my research is to establish whether one model outperforms the other (for a single dataset!!!) and whether the result is statistically significant.
For each of the two models, the procedure is as follows: I use 10-fold CV and repeat the procedure 3 times with different seeds to obtain, let's say, 30 estimates of precision. Hence, I obtain two sets of 30 estimates based on a single dataset.
A test for normality showed that the 30 estimates are not normally distributed. Thus, I need to resort to a nonparametric test. I considered the Wilcoxon signed-rank test, but it is not suitable when the estimates are dependent (due to CV). How could I tackle this situation?
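The resampling scheme described above (10-fold CV repeated with 3 seeds) can be sketched as below. This is only an illustration of where the dependence comes from, not a solution: the helper `repeated_kfold_indices`, the seeds, and the sample size of 200 are assumptions made for the example.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, seeds=(0, 1, 2)):
    """Yield (seed, train_indices, test_indices) for each fold of each repeat."""
    for seed in seeds:
        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)  # a different shuffle per seed
        for fold in range(n_folds):
            test = indices[fold::n_folds]  # every n_folds-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield seed, train, test


splits = list(repeated_kfold_indices(n_samples=200))
print(len(splits))  # 3 repeats x 10 folds = 30 train/test splits

# Two folds from the same repeat train on largely the same rows,
# which is one reason the 30 estimates are not independent.
_, train_a, _ = splits[0]
_, train_b, _ = splits[1]
overlap = len(set(train_a) & set(train_b)) / len(train_a)
print(round(overlap, 3))
```

With 10 folds, any two training sets within a repeat share 8 of their 9 training folds, and the repeats reuse the same rows again, so the estimates are correlated both within and across repeats.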
Multiple Linear Regression: a Significant ANOVA but NO significant coefficient predictors?
I have run a multiple regression with 2 IVs to predict a dependent variable. All assumptions have been met and the ANOVA is significant, but the coefficient table suggests that none of the predictors is significant.
What does this mean, and how should I interpret this result?
(USED SPSS)
This almost certainly means the two predictors are substantially correlated with each other. The REGRESSION procedure in SPSS Statistics has a variety of collinearity diagnostics to aid in detection of more complicated situations involving collinearity, but in this case simply correlating the two predictors should establish the basic point.
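The check suggested above, correlating the two predictors, can be sketched outside SPSS as follows. The data are simulated purely for illustration (`x2` is built as `x1` plus a little noise), and `pearson_r` is a hand-written helper rather than any SPSS or library function. With exactly two predictors, the variance inflation factor for each of them is 1 / (1 - r^2).

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Simulated predictors (assumption): x2 is nearly a copy of x1.
rng = random.Random(42)
x1 = [rng.gauss(0, 1) for _ in range(100)]
x2 = [a + rng.gauss(0, 0.2) for a in x1]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)  # variance inflation factor in the two-predictor case
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A correlation close to 1 (and a correspondingly large VIF) means the standard errors of the individual coefficients are inflated, so the predictors can be jointly significant in the ANOVA while neither is significant on its own.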
I am studying the correlation between several human traits. One way is to use a chi-square test, but this cannot include covariates. I am also using logistic regression, which makes it possible to include age and race as covariates.
However, I noticed that some tests support stratified data for a chi-square-like test.
Therefore, I am wondering what are the differences between including covariates in logistic regression and a stratified chi-square test?
I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the points below, as I am more interested in predicting the 1's.
1. Should I give preference to sensitivity or positive predictive value?
Some ensemble techniques give a maximum sensitivity of 45% and a low positive predictive value.
And some give a positive predictive value of 62% and low sensitivity.
2. My dataset has around 450K observations and 250 features. After a power test I took 10K observations by simple random sampling. When I select features by variable importance using ensemble techniques, the selected features differ from those I get with 150K observations. From my intuition and domain knowledge, the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Last, can I use the variable importance generated by RF in other ensemble techniques to predict the accuracy?
Can you please help me out as am bit confused on which w
The preference between sensitivity and positive predictive value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
These two measures look at the results from different perspectives. Sensitivity is the probability that the test detects the "condition" among those who actually have it. Positive predictive value is the probability that those who test positive actually have the "condition".
Accuracy depends on the outcome of your classification, not on the variable importances generated by RF: it is defined as (true positives + true negatives)/(total).
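As a small sketch of the three quantities discussed above, the following computes them from confusion-matrix counts. The counts are made up (1000 observations with 25% positives, chosen only to resemble the imbalance in the question), and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, positive predictive value and accuracy from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual 1's that are detected
        "ppv": tp / (tp + fp),          # share of predicted 1's that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Made-up counts: 250 actual positives, 750 actual negatives.
metrics = classification_metrics(tp=112, fp=70, tn=680, fn=138)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the accuracy in this example is fairly high even though fewer than half of the 1's are found, which is why accuracy alone can be misleading on an imbalanced dataset.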
Also, it is possible to compensate for the imbalance in the dataset; see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test