I have an interview question which I don't know where to start learning.
Order the following situation based on the expected power of the statistical test intended to identify the difference between the two groups.
Datasets come from normal distributions with equal unknown variances
Datasets come from normal distributions with unequal known variances
Datasets come from two unknown distributions
Datasets come from normal distributions with unequal known variances and unequal known means
Datasets come from gamma distributions with unknown parameters
Any help is greatly appreciated. Thanks!
Related
I am rleatively new to statistics and am stuggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data is normally distributed, but there seems to be lots of papers and articles providing conflicting information.
Some articles say that independant variables need to be normally disrbiuted and this may require a transformation (log, SQRT etc.). Others says that in linear modelling there are no assumptions about any linear the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain scores (0-no pain -> 5 intense pain)(discrete- dependant variable).
IVs: age (continuous), weight (continuous), sex (nominal), depreviation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check the whether my independant variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of the SPSS outputs it provides plots of the standardised residuals against predicted values and also normal P-P plots of standardised residuals. Are these tests all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
Sklearn EllipticEnvelope calculates the covariance between two or more features and estimates the outliers. Instead of using two features, I created one new feature by dividing first with the second. When I apply EllipticEnvelope on just this one new feature. It works well. But my question is this a correct way to do it since the model relies on the covariance of two or more features?
I found the answer. It works for both univariate and multivariate. But still would love to see more answers about how it works with a single feature.
“EllipticEnvelope is a function that tries to figure out the key parameters of your data's general distribution by assuming that your entire data is an expression of an underlying multivariate Gaussian distribution. That's an assumption that cannot hold true for all datasets, yet when it does, it proves an effective method indeed for spotting outliers. Simplifying the complex estimations working behind the algorithm as much as possible, we can say that it checks the distance of each observation with respect to a grand mean that takes into account all the variables in your dataset. For this reason, it is able to spot both univariate and multivariate outliers.”
Source: Alberto Boschetti. “Python Data Science Essentials.”.
Multiple Linear Regression: a Significant ANOVA but NO significant coefficient predictors?
I have ran a multiple regression on 2 IVs to predict a dependant, all assumptions have been met, the ANOVA has a significant result but the coefficient table suggests that none of the predictors are significant?
what does this mean and how am I able to interpret the result of this?
(USED SPSS)
This almost certainly means the two predictors are substantially correlated with each other. The REGRESSION procedure in SPSS Statistics has a variety of collinearity diagnostics to aid in detection of more complicated situations involving collinearity, but in this case simply correlating the two predictors should establish the basic point.
I am currently working in anomaly detection algorithms. I read papers comparing unsupervised anomaly algorithms based on AUC values. For example i have anomaly scores and anomaly classes from Elliptic Envelope and Isolation Forest. How can i compare these two algorithms based on AUC values.
I am looking for a python code example.
Thanks
Problem solved. Steps i done so far;
1) Gathering class and score after anomaly function
2) Converting anomaly score to 0 - 100 scale for better compare with different algorihtms
3) Auc requires this variables to be arrays. My mistake was to use them like Data Frame column which returns "nan" all the time.
Python Script:
#outlier_class and outlier_score must be array
fpr,tpr,thresholds_sorted=metrics.roc_curve(outlier_class,outlier_score)
aucvalue_sorted=metrics.auc(fpr,tpr)
aucvalue_sorted
Regards,
Seçkin Dinç
Although you already solved your problem, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to say), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
I have a dataset which the instances are of about 200 features, about 11 of these features are numerical (integer) and the rest are binary (1/0) , these features may be correlated and they are of different probability distributions ,
It's been a while that I've been for a good similarity score which works for a mixed vector and takes into account the correlation between the features,
Do you know such similarity score?
Thanks,
Arian
In your case, the similarity function relies heavily on the input data patterns. You might benefit from learning a distance metric for the input space of data from a given collection
of pair of similar/dissimilar points that preserves the distance relation among the
training data.
Here is a nice survey paper.
The numerous types of distance measures, Euclidean, Manhattan, etc are going provide different levels of accuracy depending on the dataset. Best to read papers covering your method of data fitting and see what heuristics they use. Not to mention that some methods require only homogeneous data that scale accordingly. Here is a paper that talks about a whole host of measures that you might find attractive.
And as always, test and cross validate to see if there really is an impact from the mixing of feature types.