Statistical kurtosis in relation to SPSS and MS Excel

I am using SPSS as the statistical analysis tool for my data set.
I have a few questions about the concept of kurtosis and about the values generated by SPSS and Excel.
Please correct my understanding below and answer the follow-up questions:
Kurtosis is a measure of the flatness or peakedness (hump) of a distribution around its mean. In terms of the tails, it tells whether the dataset is heavy-tailed or light-tailed relative to a normal distribution.
A normal distribution has a kurtosis of exactly 3 (its excess kurtosis, which is kurtosis - 3, is exactly 0) and is called mesokurtic.
A distribution with high kurtosis has a higher peak than the mesokurtic one and is called leptokurtic.
A distribution with low kurtosis has a lower peak than the mesokurtic one and is called platykurtic.
Questions:
What is meant by excess kurtosis, and what is the significance of using it? I do not see a clear difference between kurtosis and excess kurtosis, except that excess kurtosis is kurtosis - 3, so that 0 becomes the baseline.
Does SPSS generate "excess kurtosis" values or plain "kurtosis" values? In other words, which baseline do we generally use in SPSS for kurtosis measurement and inference: 0 or 3? In SPSS I am getting a kurtosis of 1.16. If I take 3 as the baseline, then 1.16 is less than 3 and my distribution could be platykurtic. But if I take 0 as the baseline (excess kurtosis), then 1.16 is clearly greater than 0 and my distribution could be leptokurtic.
How does this work in Excel? Does the Excel formula internally compute kurtosis as (kurtosis - 3) or as plain kurtosis? That is, how should I interpret the result in MS Excel (baseline 3 or 0)?

Kurtosis does not measure "peakedness" or "height" of a distribution. It measures (potential) outliers (rare, extreme observations) only. For a clear explanation, please see here: https://en.wikipedia.org/wiki/Talk:Kurtosis#Why_kurtosis_should_not_be_interpreted_as_.22peakedness.22
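On the baseline question, a small numerical check may help. The sketch below assumes that SPSS and Excel's KURT both report the bias-corrected sample excess kurtosis (baseline 0, so normal data gives roughly 0), which is my understanding of their documented formulas. The function name sample_excess_kurtosis and the synthetic data are made up for illustration.

```python
import numpy as np
from scipy import stats


def sample_excess_kurtosis(x):
    """Bias-corrected sample excess kurtosis (assumed to match Excel KURT / SPSS)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)                        # sample standard deviation
    z4 = (((x - x.mean()) / s) ** 4).sum()   # sum of fourth powers of z-scores
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                  # synthetic normal data

g2 = sample_excess_kurtosis(x)               # baseline 0
kurt_raw = stats.kurtosis(x, fisher=False)   # plain kurtosis, baseline 3
print(f"excess kurtosis: {g2:.3f}, plain kurtosis: {kurt_raw:.3f}")
```

If that assumption holds, a reported value such as 1.16 would be compared with 0 rather than 3. The same estimator is available in SciPy as stats.kurtosis(x, fisher=True, bias=False).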

Related

Can a 2 sample statistical comparison have too large of a population size to be accurate?

I'm trying to do a simple comparison of two samples to determine if their means are different. Regardless of whether their standard deviations are equal/unequal, the formulas for a t-test or z-test are similar.
(I can't post images on a new account.)
t-value w/ unequal variances:
https://www.biologyforlife.com/uploads/2/2/3/9/22392738/949234_orig.jpg
t-value w/ equal/pooled variances:
https://vitalflux.com/wp-content/uploads/2022/01/pooled-t-statistics-300x126.jpg
The issue is that the sample sizes enter the denominator as 1/N under a square root, so large samples end up with massive t-values.
For instance, I have 2 samples w/
size: N1=168,000 and N2=705,000
avgs: X1=89 and X2=49
stddev: S1=96 and S2=66 .
At first glance, these standard deviations are larger than the means and suggest nonhomogeneous samples with a lot of internal variation. When comparing the two samples, however, the denominator of the t-test becomes approximately 0.25, suggesting that a 1-unit difference in means is equivalent to 4 standard errors. Thus my t-value here comes out to around 160(!!)
All this to say, I'm just plugging in numbers, since I didn't do many of these problems in advanced stats and haven't seen this formula since Stats110.
It makes some sense that two massive populations need their variance scaled downward before comparing, but this does not seem like the best test for the scale of what I'm doing.
What other tests could I try? What is the logic behind this seemingly over-shrunk variance?
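To make the arithmetic above concrete: the denominator of the t statistic is the standard error of the difference between the two sample means, not a standard deviation of the data, so it shrinks as the samples grow. The sketch below recomputes the unequal-variance (Welch) t-value from the summary numbers in the question. As one possible complement it also computes Cohen's d, a standardized effect size that does not grow with sample size; suggesting Cohen's d is my own addition, not something taken from the question.

```python
import math
from scipy import stats

# Summary statistics quoted in the question
n1, n2 = 168_000, 705_000
x1, x2 = 89.0, 49.0
s1, s2 = 96.0, 66.0

# Standard error of the difference in means (the t-test denominator)
se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)
t_manual = (x1 - x2) / se_diff

# Same Welch test computed by SciPy directly from summary statistics
res = stats.ttest_ind_from_stats(x1, s1, n1, x2, s2, n2, equal_var=False)

# Cohen's d: difference in means in units of the pooled standard deviation
pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
cohens_d = (x1 - x2) / pooled_sd

print(f"SE of difference: {se_diff:.3f}")
print(f"t (manual): {t_manual:.1f}, t (SciPy): {res.statistic:.1f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

The t-value answers "is the difference distinguishable from zero?", which becomes very easy with samples this large, while the effect size answers "how big is the difference relative to the spread of the data?".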

Actuarial vs. predicted survival comparison

I have a set of patients and their actuarial 1- and 5-year survival. I have also applied a commonly used medical score to their data, which calculates the survival probability at 1 and 5 years (for example 75% and 55% respectively). I'd like to compare both survival rates.
I calculated the mean predicted survival for all patients at 1 and 5 years as the mean of the predicted survival probabilities. I then calculated the mean actuarial survival by using 100% if alive at 1 year and 0% if dead at 5 years. I then compared the means of both groups with a t-test.
I have a feeling that what I am doing is grossly incorrect and goes against all rules of statistics; however, I have not found a solution to my problem anywhere. Maybe someone can help me? R packages and code are welcome.
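One hedged way to frame this is as a calibration check of the predicted probabilities against the observed outcomes at a single time point, rather than a t-test between two sets of means. The sketch below is only an illustration under strong assumptions: every patient's status at the time point is known (no censoring), and the arrays predicted and alive are hypothetical values, not data from the question. It computes the observed/expected ratio and the Brier score.

```python
import numpy as np

# Hypothetical example values (not real patient data):
# predicted 1-year survival probabilities from the score
predicted = np.array([0.9, 0.8, 0.7, 0.6])
# observed status at 1 year: 1 = alive, 0 = dead (assumes no censoring)
alive = np.array([1, 1, 0, 1])

observed_rate = alive.mean()         # observed survival proportion
expected_rate = predicted.mean()     # mean predicted survival
oe_ratio = observed_rate / expected_rate

# Brier score: mean squared difference between prediction and outcome
# (0 is perfect; lower is better)
brier = np.mean((predicted - alive) ** 2)

print(f"observed: {observed_rate:.2f}, expected: {expected_rate:.2f}")
print(f"O/E ratio: {oe_ratio:.2f}, Brier score: {brier:.3f}")
```

With censored follow-up, the observed survival would normally come from a Kaplan-Meier estimate instead of a simple proportion, which this sketch does not cover.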

Kurtosis interpretation

For a set of data points I have found that the mean is 2989.05, the skewness is 26.67 and the kurtosis is 1003.29.
The kurtosis seems very high, and I am not able to understand what that means. Can someone explain this?

Moors' interpretation of kurtosis: kurtosis is a measure of the dispersion of X around the two values μ ± σ (mean ± S.D.).
High values of kurtosis arise in two circumstances:
1) There is a lot of probability mass in the tails of the distribution.
2) Your data points are concentrated around the mean, but a few outliers make the kurtosis value high.

It means you have an outlier problem. Kurtosis tells you nothing about the peak - negative kurtosis does not mean "low and broad" because the peak can be infinite with negative kurtosis. Also, the peak can be flat with infinite kurtosis, so large kurtosis does not tell you that you have a "high and sharp" peak.
You have a serious outlier situation. Have a look at your maximum value(s).
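The outlier effect is easy to reproduce. The following is a minimal sketch on synthetic data (the sample size and the outlier value 50.0 are arbitrary choices of mine): adding a single extreme point to otherwise normal data is enough to push the excess kurtosis from about 0 into the hundreds.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
clean = rng.normal(loc=0.0, scale=1.0, size=999)   # synthetic normal data
with_outlier = np.append(clean, 50.0)              # same data plus one extreme value

k_clean = stats.kurtosis(clean)            # excess kurtosis, close to 0
k_outlier = stats.kurtosis(with_outlier)   # dominated by the single outlier

# Share of the fourth-moment sum contributed by the single largest deviation
dev4 = (with_outlier - with_outlier.mean()) ** 4
share_of_max = dev4.max() / dev4.sum()

print(f"excess kurtosis without outlier: {k_clean:.2f}")
print(f"excess kurtosis with outlier:    {k_outlier:.2f}")
print(f"share of fourth moment from the largest point: {share_of_max:.4f}")
```

Running the same share_of_max check on real data shows whether one or two observations account for most of a very large kurtosis value.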

statistical test for samples that follow normal distribution, with each sample having multiple measurements?

I have a set of samples (i = 1 : n), each of which was measured 10 times for a specific metric.
For each sample, the mean of its 10 measurements is mu(i).
I've done DBSCAN clustering on all the mu(i) to find the outlier samples. Now I want to test whether a given outlier is statistically different from the core samples.
The samples appear to follow a normal distribution. For each sample, the 10 measurements also appear to follow a normal distribution.
If I just use mu(i) as the metric for each sample, I can easily calculate a Z-score and p-value based on the normal distribution. My question is: how do I make use of the 10 measurements for each sample to add to my statistical power (is that possible)?
I'm not very good at statistics, so anything would help. Thanks in advance...

Check if numbers form bell curve (gauss distribution) Python 3

I've got files with irradiance data measured every minute, 24 hours a day.
If there is a day without any clouds in the sky, the data shows a nice continuous bell curve.
When looking for a cloudless day in the data, I have always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering if there's a Python way to check whether the irradiance measurements form a continuous bell curve.
I don't know if the question is too vague, but I'm simply looking for some ideas on that quest :-)

For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data; it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero.
More generally, you could compute a reference distribution and use a Bregman divergence to assess the difference (divergence) between the distributions. Bin your data, create a histogram, and start with the Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
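The checks above can be sketched as follows. This is only an illustration on synthetic data: the arrays reference, candidate and flat, the bin edges, and the helper hist_prob are assumptions of mine. Note that scipy.stats.kurtosis returns excess kurtosis by default, so fisher=False is passed to compare against 3, and that scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance (the square root of the divergence).

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
reference = rng.normal(size=5000)       # stands in for a known "good" day
candidate = rng.normal(size=5000)       # bell-shaped data to be checked
flat = rng.uniform(-3, 3, size=5000)    # clearly not bell-shaped

# Moment checks: skewness near 0 and kurtosis near 3 for normal data
skew = stats.skew(candidate)
kurt = stats.kurtosis(candidate, fisher=False)

# Histogram-based comparison on a common set of bins
bins = np.linspace(-5, 5, 41)


def hist_prob(x):
    """Normalized histogram (bin probabilities) on the shared bins."""
    counts, _ = np.histogram(x, bins=bins)
    return counts / counts.sum()


js_similar = jensenshannon(hist_prob(reference), hist_prob(candidate))
js_different = jensenshannon(hist_prob(reference), hist_prob(flat))

print(f"skewness: {skew:.3f}, kurtosis: {kurt:.3f}")
print(f"JS distance (similar): {js_similar:.3f}, (different): {js_different:.3f}")
```

A threshold on the distance would have to be chosen empirically, for example from the spread of distances among days already known to be cloudless.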

Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure for the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a one-sample Kolmogorov-Smirnov test against the standard normal
In the above example, D denotes the distance between the empirical distribution of our data and the standard normal (norm) distribution with mean 0 and standard deviation 1 (smaller means a closer match), and p denotes the corresponding p-value. Other distributions can be tested similarly by substituting norm with another distribution implemented in scipy.stats.
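Because "norm" without further arguments means the standard normal, data with a different mean or scale will be rejected even if it is perfectly bell-shaped. A minimal sketch of one way around this, assuming the location and scale are simply estimated from the data and passed through the args parameter of kstest (the synthetic data and variable names are mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # normal, but not standard normal

# Against the standard normal: large distance, tiny p-value
D_std, p_std = stats.kstest(data, "norm")

# Against a normal with location and scale estimated from the data
mu, sigma = data.mean(), data.std(ddof=1)
D_fit, p_fit = stats.kstest(data, "norm", args=(mu, sigma))

print(f"standard normal: D={D_std:.3f}, p={p_std:.3g}")
print(f"fitted normal:   D={D_fit:.3f}, p={p_fit:.3g}")
```

When the parameters are estimated from the same data, the p-value is only approximate and tends to be too large, so it is better read as a rough indication than as an exact test.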
