I came across the following sentence referred to the usual Extended Kalman Filter and I'm trying to make sense of it:
States before the current state are approximated with a normal distribution
What does it mean?
the modeled quantity has uncertainty because it is derived from measurements. you can't be sure it's exactly value X. that's why the quantity is represented by a probability density function (or a cumulative distribution function, which is the integral of that).
a probability distribution can look very arbitrary but there are many "simple" distributions that approximate the real world. you've heard of the normal distribution (gaussian), the uniform distribution (rectangle), ...
the normal distribution (parameters mu and sigma) occurs everywhere in nature so it's likely that your measurements already fit a normal distribution very well.
"a gaussian" implies that your distribution isn't a mixture (sum) of gaussians but a single gaussian.
Related
I am learning statistics, and have some basic yet core questions on SD:
s = sample size
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s])**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s])**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s])**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ))?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves, for example the SD assigns more weight to outlying variables. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies between ±2SD from the mean. For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median average deviation) and their uses see here.
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (which is related to average absolute mean), the variance (related to standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
For some other answers and a larger discussion of why SD is so popular, See here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma, whether using the SD or precisions. I don't think you can even parametrize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
I study a problem of a random walk with drift and an absorbing boundary. The system is well theoretically understood. My task is to simulate it numerically, in particular to generate random numbers from this distribution, see the formula. It is the distribution of the coordinate x at time t given the starting point x_0, the noise intensity \sigma and the drift \mu. The question is how to generate random numbers from this distribution? I can of course use the inverse transform sampling, but it is slow. May be I can make use of the fact that the probability density function is the difference of two Gaussian functions? Can I relate somehow my distribution with the normal distribution?
For example, in generative adversarial network, we often hear that inference is easy because the conditional distribution of x given latent variable z is 'tractable'.
Also, I read somewhere that Boltzmann machine and variational autoencoder is used where the posterior distribution is not tractable so some sort of approximation need to be applied.
Could anyone tell me what 'tractable' means, in a rigorous definition? Or could anyone explain in any of the examples I gave above, what tractable exactly means in that context?
First of all, let's define what tractable and intractable problems are (Reference: http://www.cs.ucc.ie/~dgb/courses/toc/handout29.pdf).
Tractable Problem: a problem that is solvable by a polynomial-time algorithm. The upper bound is polynomial.
Intractable Problem: a problem that cannot be solved by a polynomial-time algorithm. The lower bound is exponential.
From this perspective, a definition for tractable distribution is that it takes polynomial-time to calculate the probability of this distribution at any given point.
If a distribution is in a closed-form expression, the probability of this distribution can definitely be calculated in polynomial-time, which, in the world of academia, means the distribution is tractable. Intractable distributions take equal to or more than exponential-time, which usually means that with existing computational resources, we can never calculate the probability at a given point with relatively "short" time (any time longer than polynomial-time is long...).
I have a list of numbers. Below are some basic statistics:
N > 1000
Max: 9.24
Min: 0.00955
Mean: 1.84932
Median: 0.97696
It seems that the data is right skewed, i.e. many small numbers and a few very large numbers.
I want to find a distribution to generalize these numbers. I think Normal distribution, Gamma distribution, and Laplace distribution all look possible. How do I determine which distribution is the best?
I have to say that I usually do it in the same way you did it, by plotting the data I seeing its shape.
When being more accurate, and only for the normal distribution, I perform the Shapiro Wilk test for normality, which at least will tell me that the null hypotesis was not proven, which means that it was not possible to prove that the date does not follow a normal distribution. Usually, this is more than acceptable in scientific environments.
I know there exists equivalent tests for Laplace and Gamma distributions, although still in newly research like this. Instead, there are many sites that offer the Shapiro Wilk test online, like this one.
With all positive values and the mean being about double the median, your data are definitely skewed right. You can rule out both normal and Laplace because both are symmetric and can go negative.
Scope out some of the many fine alternatives at the Wikipedia distributions page. Make a histogram of your data and check it for similarities in shape to those distributions. Exponentials, log normals, chi-squares, and the gamma family could all give numeric results such as the ones you described, but without knowing anything about the variance/std deviation, whether your data are unimodal or multimodal, or where the mode(s) are, we can only make guesses about a very large pool of possibilities.
I am using Octave and I would like to use the anderson_darling_test from the Octave forge Statistics package to test if two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution and taken from the help for the above function " 'If you are selecting from a known distribution, convert your values into CDF values for the distribution and use "uniform'. "
My question therefore is: how would I convert my data values into CDF values for the reference distribution?
Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution); I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the the null hypothesis that the two are the same can be rejected I will then know that most of the movement in the raw data is not due to cyclic influences but is due to either trend or just noise.
If your data has a specific distribution, for instance beta(3,3) then
p = betacdf(x, 3, 3)
will be uniform by the definition of a CDF. If you want to transform it to a normal, you can just call the inverse CDF function
x=norminv(p,0,1)
on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
Your approach is misguided in multiple ways. Several points:
The Anderson-Darling test implemented in Octave forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known - not come from data. While you quote the help-file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:
Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.
So, don't do it.
Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:
Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.
Given your description, I assume there is some sort of time predictor involved. So even if the distributions would coincide, that does not mean they coincide at the same time-points, because comparing distributions collapses over the time.
The distribution of cyclic trend + error would not expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t). Then it never will go above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory - separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals: (observed value - predicted from cyclic component). You will still have to be careful about auto-correlation and other complexities, but at least it will be a move in the right direction.