For a set of data points I found the mean is 2989.05, the skewness is 26.67, and the kurtosis is 1003.29.
The kurtosis seems very high, and I am not able to understand what it means. Can someone explain this?
Moors' interpretation of kurtosis: kurtosis is a measure of the dispersion of X around the two values μ ± σ (mean ± standard deviation).
High values of kurtosis arise in two circumstances:
1) The probability mass is heavily concentrated in the tails of the distribution.
2) The data points are concentrated around the mean, but a few outliers are present, which drives the kurtosis up.
It means you have an outlier problem. Kurtosis tells you nothing about the peak: negative kurtosis does not mean "low and broad", because the peak can be infinite with negative kurtosis. Likewise, the peak can be flat with infinite kurtosis, so large kurtosis does not tell you that you have a "high and sharp" peak.
You have a serious outlier situation. Have a look at your maximum value(s).
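To see how a single extreme value drives kurtosis up, here is a minimal R sketch (the data are simulated, purely for illustration):

    # Plain-moment kurtosis: m4 / m2^2, no -3 correction, no small-sample adjustment
    kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2

    set.seed(42)
    x <- rnorm(1000)     # well-behaved data: kurtosis close to the normal value of 3
    kurt(x)
    kurt(c(x, 100))      # one outlier is enough to push the kurtosis into the hundreds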
I have a set of patients and their actuarial 1- and 5-year survival. I have also run their data through a commonly used medical score that calculates survival probability at 1 and 5 years (for example 75% and 55%, respectively). I'd like to compare both survival rates.
I calculated the mean predicted survival probability across all patients at 1 and 5 years as the mean of the score's predicted probabilities. I then calculated the mean actuarial survival by coding each patient as 100% if alive and 0% if dead at each time point. I then compared the means of the two groups with a t-test.
I have a feeling that what I am doing is grossly incorrect and goes against all rules of statistics, but I have not found a solution to my problem anywhere. Maybe someone can help me? R packages and code are welcome.
I am using SPSS as the statistical analysis tool for my data set.
I have a few queries about the concept of kurtosis and the values generated by SPSS and Excel.
Please correct the understandings below, and then see my follow-up questions:
Kurtosis is a measure of the flatness or peakedness (hump) around the mean of the distribution. In terms of distribution tails, it tells whether the dataset is heavy-tailed or light-tailed relative to a normal distribution.
A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0, where excess kurtosis = kurtosis - 3) and is also called a mesokurtic distribution.
A distribution with high kurtosis has a peak bigger than the mesokurtic peak and is called leptokurtic.
A distribution with low kurtosis has a peak smaller than the mesokurtic peak and is called platykurtic.
Questions:
What is meant by excess kurtosis, and what is the significance of using it? I do not have a clear picture of kurtosis vs. excess kurtosis, except that excess kurtosis is kurtosis - 3, so that 0 becomes the baseline.
Does SPSS generate "excess kurtosis" values or plain "kurtosis" values? In other words, what baseline should I use for kurtosis measurement and inference in SPSS: 0 or 3? In SPSS I am getting a kurtosis of 1.16. If I take 3 as the baseline, then 1.16 is less than 3 and my distribution would be platykurtic; but if I take 0 as the baseline (excess kurtosis), then 1.16 is clearly greater than 0 and my distribution would be leptokurtic.
How does this work in Excel? Does the Excel formula internally compute kurtosis as (kurt - 3) or plain kurt? That is, how do I interpret the result in MS Excel (baseline 3 or 0)?
Kurtosis does not measure "peakedness" or "height" of a distribution. It measures (potential) outliers (rare, extreme observations) only. For a clear explanation, please see here: https://en.wikipedia.org/wiki/Talk:Kurtosis#Why_kurtosis_should_not_be_interpreted_as_.22peakedness.22
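On the baseline question: as far as I know, both SPSS's descriptive statistics and Excel's KURT() report excess kurtosis, so the baseline to compare against is 0, and your 1.16 would indicate heavier tails than a normal distribution. A quick R sketch of the two conventions, computed by hand on simulated data:

    x  <- rnorm(1e5)                 # large sample from a normal distribution
    m2 <- mean((x - mean(x))^2)      # second central moment
    m4 <- mean((x - mean(x))^4)      # fourth central moment
    m4 / m2^2                        # plain kurtosis: approximately 3
    m4 / m2^2 - 3                    # excess kurtosis: approximately 0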
Which model should be chosen if the mean and variance are not equal, as the Poisson distribution assumes — say, if the mean is greater than the variance, or the variance is greater than the mean?
If you only have the mean and variance, and they are not equal, you obviously have to try a two-parameter discrete distribution. Off the top of my head (a fitting sketch follows the list):
Binomial (variance smaller than the mean)
Negative binomial (variance larger than the mean)
Hypergeometric distribution
Negative hypergeometric
Compound distributions, such as a Gamma-Poisson mixture
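As a rough R sketch of the overdispersed case (variance greater than the mean), fitted with a negative binomial; the parameter values are made up for illustration:

    library(MASS)                          # for fitdistr()

    set.seed(1)
    x <- rnbinom(500, size = 2, mu = 5)    # simulated counts: var = mu + mu^2/size > mu
    c(mean = mean(x), var = var(x))        # variance clearly exceeds the mean
    fitdistr(x, "negative binomial")       # recovers size and mu from the data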
I have a requirement to verify that the transmit power out of a device, as measured at its connector, is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to measure the transmitted power. I only get the average power, min, max, and standard deviation of the measurements, not the individual power measurements.
Now, the question is how to verify the "95% requirement" using only the average power, min, max, and standard deviation. It seems that I could use the normal distribution to find the 95% level.
I would appreciate it if someone could help me with this.
Thanks in anticipation.
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
Do you get identical estimates of average power every time you measure, or are there slight random differences from reading to reading? My guess is the second. If you were to measure the power a whole bunch of times, and each time plotted your average power value on a histogram, that histogram of sample means would have the shape of a bell curve. This bell curve of sample means has its own mean and standard deviation, and if thousands or millions of data points go into each average power reading, it is not unreasonable to assume it is a normal distribution. The explanation for this phenomenon is the central limit theorem, and I recommend both Khan Academy's presentation of it and the Wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, say n = 5 or n = 30, then the assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975, n-1)*SD/sqrt(n) below the average to qt(0.975, n-1)*SD/sqrt(n) above the average, where qt(0.975, n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom and SD is your measured standard deviation.
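Here is a small R sketch of both calculations; the numbers are hypothetical stand-ins for whatever your analyzer reports:

    avg <- 20.1   # reported average power (dB) -- hypothetical
    SD  <- 0.7    # reported standard deviation -- hypothetical
    n   <- 50     # number of measurements behind those summaries -- hypothetical

    # Range expected to contain ~95% of individual measurements (assuming normality):
    avg + c(-1, 1) * 1.96 * SD

    # 95% confidence interval for the average power itself, via the t distribution:
    avg + c(-1, 1) * qt(0.975, n - 1) * SD / sqrt(n)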
I have derived and implemented an equation for an expected value. To show that my code is free of errors, I have run the Monte Carlo computation a number of times and shown that it converges to the same value as the equation I derived.
Now that I have the data, how can I visualize this?
Is this even the correct test to do?
Can I give a measure of how sure I am that the results are correct?
It's not clear what you mean by visualising the data, but here are some ideas.
If your Monte Carlo simulation is correct, then the Monte Carlo estimator for your quantity is just the mean of the samples. The variance of that estimator (how far the average will typically be from the 'correct' value) scales inversely with the number of samples you take: as long as you take enough, you'll get arbitrarily close to the correct answer. So use a moderate number of samples (1000 should suffice if the quantity is univariate) and look at the average. If it doesn't agree with your theoretical expectation, then you have an error somewhere in one of your calculations.
You can also use a histogram of your samples, again if they're one-dimensional. The distribution of samples in the histogram should match the theoretical distribution you're taking the expectation of.
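For the histogram check, a minimal R sketch (a standard normal target is assumed purely for illustration; substitute the distribution your expectation is taken over):

    set.seed(1)
    samples <- rnorm(1e4)                      # stand-in for your Monte Carlo draws
    hist(samples, freq = FALSE, breaks = 50)   # empirical distribution of the samples
    curve(dnorm(x), add = TRUE, col = "red")   # theoretical density overlaid on top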
If you know the variance in the same way as you know the expectation, you can also look at the sample variance (the mean squared difference between the sample and the expectation), and check that this matches as well.
EDIT: to put something more 'formal' in the answer!
If M_n is your Monte Carlo estimator for E[X] based on n samples, then as n -> infinity, |M_n - E[X]| -> 0. The variance of M_n is inversely proportional to n, but its exact value depends on what M_n is an estimator for. You could construct a specific test based on the mean and variance of your samples to check that what you've done makes sense: every 100 iterations, compute the mean of your samples so far and take the absolute difference between it and your theoretical E[X]. If this decreases, you're probably error free. If not, you have issues in either your theoretical estimate or your Monte Carlo estimator.
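A minimal R sketch of that check, with simulated draws standing in for your estimator's output and E[X] = 0 assumed for illustration:

    set.seed(1)
    EX      <- 0                          # theoretical value from your equation
    draws   <- rnorm(1e4)                 # stand-in for your Monte Carlo draws
    idx     <- seq(100, length(draws), by = 100)
    running <- cumsum(draws)[idx] / idx   # running mean after 100, 200, ... draws
    plot(idx, abs(running - EX), type = "l", log = "y",
         xlab = "iterations", ylab = "|running mean - E[X]|")

If the estimator is unbiased, the error should trend downward, roughly like 1/sqrt(n).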
Why not just do a simple t-test? From your theoretical equation you have the true mean mu_0, and your simulator has mean mu_1. Note that we can't calculate mu_1 exactly; we can only estimate it from the sample average. So our hypotheses are:
H_0: mu_0 = mu_1 and H_1: mu_0 does not equal mu_1
The test statistic is the usual one-sample test statistic, i.e.
T = (mu_0 - xbar) / (s / sqrt(n))
where
mu_0 is the value from your equation
xbar is the average from your simulator
s is the standard deviation
n is the number of values used to calculate the mean.
In your case n is going to be large, so this is equivalent to a Normal (z) test. We reject H_0 when T falls outside (-3, 3), which corresponds to a p-value below 0.01.
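In R this is a one-liner with t.test(); the draws below are simulated stand-ins for your Monte Carlo output, with mu_0 = 2.5 assumed for illustration:

    set.seed(1)
    mu0   <- 2.5                      # theoretical value from the derived equation
    sims  <- rnorm(1e5, mean = 2.5)   # stand-in for the Monte Carlo samples
    tstat <- (mu0 - mean(sims)) / (sd(sims) / sqrt(length(sims)))
    tstat                             # reject H_0 if this falls outside (-3, 3)
    t.test(sims, mu = mu0)            # equivalent built-in one-sample test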
A couple of comments:
You can't "prove" that the means are equal.
You mentioned that you want to test a number of values. One possible solution is to apply a Bonferroni-type correction: you shrink your significance threshold to alpha/N, where N is the number of tests you are running (see the sketch at the end).
Make your sample size as large as possible. Since we don't have any idea about the variability in your Monte Carlo simulation it's impossible to say use n=....
The rule that the p-value is below 0.01 when T falls outside (-3, 3) just comes from the Normal distribution.
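For completeness, a sketch of the Bonferroni adjustment in base R (the p-values are hypothetical):

    pvals <- c(0.003, 0.04, 0.0007)           # hypothetical raw p-values from N tests
    p.adjust(pvals, method = "bonferroni")    # multiplies each by N, capped at 1
    # Equivalently, keep the raw p-values and compare each against alpha / N.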