I'm working on an issue in my research where I would like to express the statistical significance of a correlation peak in terms of sigma of a normal distribution. For example, if my peak was at 95% significance it would be at 2 sigma. Essentially what I'm asking is: say I have an arbitrary peak significance (e.g. 92%), how would I express this in terms of sigma of a normal distribution? I realize this is a more general statistics question, so any reading/background is encouraged. Or if Python has a straightforward function to convert/compute this, that works too.
Thanks!
I'm not sure what you mean by "statistical significance of a correlation peak," so I can't comment on whether the statistics you're talking about make any sense. However, it sounds like you'd like to calculate the following: how many standard deviations from the mean (say 1.96 sigma) cover a given fraction (in this case, 0.95) of the normal distribution? If this is what you're asking, you can use the SciPy statistics library to solve it easily. If you don't have SciPy already, you'll need to install it first.
Once you have SciPy installed, you'll want to use the inverse survival function (ISF) of the normal distribution. The ISF is the inverse of the survival function, which is itself 1 - CDF. Here's how you do it in Python:
In [1]: import scipy.stats as st
In [2]: yourArea = 0.95
In [3]: st.norm.isf((1-yourArea)/2.)
Out[3]: 1.959963984540054
So that's how you calculate the number that I believe you want. The (1-A)/2 business just accounts for the fact the CDF integrates from -infinity, whereas you're interested in values calculated from the center of the distribution.
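To handle an arbitrary significance level such as the 92% from the question, you can wrap this in a small helper. As a side note, if SciPy isn't available, the standard library's statistics.NormalDist (Python 3.8+) provides the same inverse CDF; the sketch below uses it, but scipy.stats.norm.isf((1 - area) / 2) is equivalent:

```python
from statistics import NormalDist

def significance_to_sigma(area: float) -> float:
    """Convert a two-sided coverage fraction (e.g. 0.95) to a sigma level.

    inv_cdf((1 + area) / 2) is equivalent to scipy.stats.norm.isf((1 - area) / 2).
    """
    return NormalDist().inv_cdf((1 + area) / 2)

print(significance_to_sigma(0.95))  # ~1.96
print(significance_to_sigma(0.92))  # ~1.75
```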
In sklearn, the documentation for QuantileTransformer says
This method transforms the features to follow a uniform or a normal distribution
and the documentation for PowerTransformer says
Apply a power transform featurewise to make data more Gaussian-like
It seems both of them can transform features to a Gaussian/normal distribution. What are the differences in this respect, and when should I use which?
The terminology they use is confusing because "Gaussian" and "normal" refer to the SAME distribution.
QuantileTransformer and PowerTransformer are both non-linear.
To answer your question about what exactly the difference is, according to https://scikit-learn.org:
"QuantileTransformer provides non-linear transformations in which distances between marginal outliers and inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal distribution to stabilize variance and minimize skewness."
Source and more info here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#:~:text=QuantileTransformer%20provides%20non%2Dlinear%20transformations,stabilize%20variance%20and%20minimize%20skewness.
The main difference is PowerTransformer() being parametric and QuantileTransformer() being non-parametric. Box-Cox or Yeo-Johnson will make your data look more 'normal' (i.e. less skewed and more centered) but it's often still far from the perfect gaussian. QuantileTransformer(output_distribution='normal') results usually look much closer to gaussian, at the cost of distorting linear relationships somewhat more. I believe there's no rule of thumb to decide which one would work better in a certain case, but it's worth noting you can select an optimal scaler in a pipeline when doing e.g. GridSearchCV().
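A rough sketch of this difference on a synthetic right-skewed feature (the lognormal data, seed, and n_quantiles below are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(1000, 1))  # synthetic, heavily right-skewed feature

# Parametric: fits a single Yeo-Johnson lambda, then applies one smooth formula
pt_out = PowerTransformer(method="yeo-johnson").fit_transform(skewed).ravel()

# Non-parametric: maps empirical quantiles onto the normal CDF
qt_out = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=0
).fit_transform(skewed).ravel()

for name, x in [("raw", skewed.ravel()), ("power", pt_out), ("quantile", qt_out)]:
    print(f"{name:8s} skew = {stats.skew(x):+.3f}")
```

The quantile result is usually the closer to a perfect Gaussian, matching the trade-off described above.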
I'm quite new to biostatistics so I apologize if my question is too dumb.
I'm studying data transformation in biostatistics to fit my data to the normal distribution.
I started with the Poisson distribution (which is quite common in biostatistics: daily admissions, prevalence of rare diseases, etc.). It is recommended to use the square root to bring such data closer to a normal distribution.
I used stata and this free dataset ( https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017?resource=download ) with the results of a huge amount of football matches.
I have created a new variable for this dataset: the total number of goals scored by both teams in each match. This is the independent variable, distributed as follows:
We can see that the distribution approximates a Poisson quite well, as confirmed by the values of the mean and standard deviation.
Then, I created a new variable with the square root of this one; its distribution is the following (the blue line shows a normal distribution with the same mean and standard deviation):
As you can see, my data is still quite far from a normal distribution, as confirmed by normality tests but also easily visible in the Q-Q plot:
So, my question is: why didn't the square root work? What can I do to transform my dataset so it fits a normal distribution?
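The behaviour of the square-root transform can be checked on simulated Poisson counts (a sketch with synthetic data, not the football dataset; the means below are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for total goals per match (a mean of ~2.8 is typical)
goals = rng.poisson(lam=2.8, size=50_000)

print("raw skew: ", stats.skew(goals))           # right-skewed, ~1/sqrt(2.8)
print("sqrt skew:", stats.skew(np.sqrt(goals)))  # the spike at zero now skews left

# With a much larger mean, the transform comes far closer to symmetry
big = rng.poisson(lam=100, size=50_000)
print("sqrt skew, lam=100:", stats.skew(np.sqrt(big)))
```

At low means the transform overcorrects: the discrete spike at zero maps to zero, well below the transformed mean, so the result is still far from normal.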
I am learning statistics, and have some basic yet core questions on SD:
s = the sample (the list of observations)
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s]))**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s]))**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s]))**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*math.pi)))*e**((-1/2)*((xi-μ)/σ)**2)?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares'? Could it just as well have been linear, cubed, or quartic, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report the mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves; for example, the SD assigns more weight to outlying values. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies within ±2 SD of the mean. For some discussion of the mean absolute deviation (and other measures of average absolute deviation, such as the median absolute deviation) and their uses, see here.
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
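The least-squares point can be checked numerically: the sum of squared deviations is minimized at the mean, while the sum of absolute deviations is minimized at the median (a sketch on arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=1_001)  # a skewed sample; odd length gives a unique median

# Evaluate both loss functions over a grid of candidate centres
grid = np.linspace(x.min(), x.max(), 2_000)
sse = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum of squared deviations
sad = np.abs(x[:, None] - grid[None, :]).sum(axis=0)   # sum of absolute deviations

print(grid[sse.argmin()], x.mean())      # squared loss: minimum near the mean
print(grid[sad.argmin()], np.median(x))  # absolute loss: minimum near the median
```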
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (the first moment), the variance (the second central moment, whose square root is the standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
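These standardized moments can be computed directly from the raw differences; a sketch (on synthetic data) that matches SciPy's skew and kurtosis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)

def central_moment(x, k):
    # Average of the k-th power of deviations from the mean
    return ((x - x.mean()) ** k).mean()

var = central_moment(x, 2)                 # second central moment (variance)
skew = central_moment(x, 3) / var ** 1.5   # standardized third moment
kurt = central_moment(x, 4) / var ** 2     # standardized fourth moment (normal -> 3)

print(skew, kurt)
print(stats.skew(x), stats.kurtosis(x, fisher=False))  # same values
```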
For some other answers and a larger discussion of why SD is so popular, See here.
Regarding the relationship between sigma and the normal distribution: sigma is simply a parameter that stretches the standard normal distribution, just as the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean = 0 and SD = variance = 1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but you do need to pin down its scale, whether as the standard deviation, the variance, or the precision. In fact, even the mean absolute deviation would do: for a normal distribution it equals σ·sqrt(2/π), so it determines sigma. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
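The Central Limit Theorem is easy to see in a simulation: sums of decidedly non-normal draws quickly look normal (a sketch with uniform summands; the sample sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each summand is uniform (flat, Pearson kurtosis 1.8), far from normal,
# but sums of 50 of them are close to a normal distribution
sums = rng.uniform(size=(100_000, 50)).sum(axis=1)

print("skew:", stats.skew(sums))                        # near 0
print("kurtosis:", stats.kurtosis(sums, fisher=False))  # near 3
```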
I've got files with irradiance data measured every minute 24 hours a day.
So on a day without any clouds in the sky, the data shows a nice continuous bell curve.
When looking for a cloudless day in the data, I always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering if there's a Python way to check whether the irradiance measurements form a continuous bell curve.
Don't know if the question is too vague, but I'm simply looking for some ideas on that quest :-)
For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data and it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero.
More generally, you could compute a reference distribution and use a statistical divergence to assess the difference between the two distributions: bin your data, create a histogram, and start with the Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
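A sketch of both ideas in SciPy (the data here is synthetic; the loc/scale and bin count are arbitrary). Note that scipy.stats.kurtosis returns excess kurtosis by default, which is 0 for a normal distribution; pass fisher=False to get the value 3 used above:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)

# Hypothetical minute-resolution irradiance readings for one day (synthetic)
measured = rng.normal(loc=600, scale=120, size=1440)

# Moment checks against the normal-distribution reference values
print("skew:", stats.skew(measured))                        # normal -> 0
print("kurtosis:", stats.kurtosis(measured, fisher=False))  # normal -> 3

# Histogram vs. a fitted Gaussian reference, compared by Jensen-Shannon distance
counts, edges = np.histogram(measured, bins=40, density=True)
centers = (edges[:-1] + edges[1:]) / 2
reference = stats.norm.pdf(centers, loc=measured.mean(), scale=measured.std())
js = jensenshannon(counts, reference)  # 0 means identical distributions
print("JS distance:", js)
```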
Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure for the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a one-sample Kolmogorov-Smirnov test
In the above example, D denotes the distance between our data and a Gaussian normal (norm) distribution (smaller is better), and p denotes the corresponding p-value. Note that "norm" here means the standard normal distribution, so standardize your data (or pass fitted parameters via the args keyword) before testing. Other distributions can be similarly tested by substituting norm with those implemented in scipy.stats.
I have a list of numbers. Below are some basic statistics:
N > 1000
Max: 9.24
Min: 0.00955
Mean: 1.84932
Median: 0.97696
It seems that the data is right skewed, i.e. many small numbers and a few very large numbers.
I want to find a distribution to generalize these numbers. I think Normal distribution, Gamma distribution, and Laplace distribution all look possible. How do I determine which distribution is the best?
I have to say that I usually do it in the same way you did it: by plotting the data and looking at its shape.
To be more rigorous, and only for the normal distribution, I perform the Shapiro-Wilk test for normality. If the test fails to reject the null hypothesis, then it was not possible to show that the data does not follow a normal distribution. Usually, this is more than acceptable in scientific environments.
I know equivalent tests exist for the Laplace and Gamma distributions, although they are still the subject of recent research like this. There are also many sites that offer the Shapiro-Wilk test online, like this one.
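In SciPy this is scipy.stats.shapiro; a minimal sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_ish = rng.normal(loc=5, scale=2, size=500)  # drawn from a normal
skewed = rng.lognormal(size=500)                   # clearly non-normal

stat1, p1 = stats.shapiro(normal_ish)
stat2, p2 = stats.shapiro(skewed)

# A small p-value (e.g. < 0.05) rejects the null hypothesis of normality;
# a large one means normality could not be ruled out
print("normal-ish sample: p =", p1)
print("skewed sample:     p =", p2)
```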
With all positive values and the mean being about double the median, your data are definitely skewed right. You can rule out both normal and Laplace because both are symmetric and can go negative.
Scope out some of the many fine alternatives at the Wikipedia distributions page. Make a histogram of your data and check it for similarities in shape to those distributions. Exponentials, log normals, chi-squares, and the gamma family could all give numeric results such as the ones you described, but without knowing anything about the variance/std deviation, whether your data are unimodal or multimodal, or where the mode(s) are, we can only make guesses about a very large pool of possibilities.
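One way to make this comparison concrete is to fit each candidate family with scipy.stats and compare the resulting Kolmogorov-Smirnov distances (a sketch on a synthetic gamma sample; note that fitting parameters from the same data invalidates the KS p-value, so the D statistic is used only as a relative goodness-of-fit score):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.gamma(shape=1.2, scale=1.5, size=2_000)  # synthetic right-skewed sample

results = {}
for dist in (stats.norm, stats.expon, stats.lognorm, stats.gamma):
    params = dist.fit(data)  # maximum-likelihood fit of the family's parameters
    D = stats.kstest(data, dist.name, args=params).statistic
    results[dist.name] = D
    print(f"{dist.name:8s} D = {D:.4f}")  # smaller D = closer fit
```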