I am interested in extrapolating the full curve for a population that I know is normally distributed. However, in my process I can only access a section of the curve (from -3 standard deviations to -2 standard deviations). What is the best way to fit a curve to a section of a bell curve?
A normal distribution is defined by an equation f(x) (its PDF, which is awkward to write without LaTeX; see the Wikipedia page) with two parameters: the mean and the variance (or standard deviation).
Therefore, if you want to know which mean and variance define it, you just need to solve for the mean and the variance given two known points (x, f(x)) on the curve (of which you have infinitely many, even on a short interval).
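As a rough sketch of that idea: the logarithm of a normal PDF is a quadratic in x, so fitting a parabola to log f(x) over the visible section gives back the mean and standard deviation. This assumes the observations are heights of the density curve (or values proportional to it), not raw samples drawn from the truncated range; the data and the function name `fit_normal_from_pdf_section` below are made up for illustration.

```python
import numpy as np

def fit_normal_from_pdf_section(x, f):
    """Recover (mu, sigma) from PDF heights f observed at points x.

    Relies on log f(x) = a*x**2 + b*x + c with
    a = -1/(2*sigma**2) and b = mu/sigma**2.
    """
    a, b, _ = np.polyfit(x, np.log(f), 2)  # least-squares quadratic in x
    sigma = np.sqrt(-1.0 / (2.0 * a))
    mu = -b / (2.0 * a)
    return mu, sigma

# Hypothetical data: a normal PDF with mu=10, sigma=2, visible only
# between -3 and -2 standard deviations (x from 4 to 6).
true_mu, true_sigma = 10.0, 2.0
x = np.linspace(4.0, 6.0, 25)
f = np.exp(-0.5 * ((x - true_mu) / true_sigma) ** 2) / (true_sigma * np.sqrt(2 * np.pi))

mu_hat, sigma_hat = fit_normal_from_pdf_section(x, f)
print(mu_hat, sigma_hat)
```

With noisy measurements the estimates get unstable quickly, because a short section far out in the tail says little about the curvature of the whole bell.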
I was attempting some questions based on residplot() in seaborn. There were two residual plots for which I had to tell whether the relationship is linear. Can anyone explain how this is determined just by looking at the plot? Apparently:
1. This plot shows the linear relationship
2. This plot shows a non-linear relationship
Roughly speaking, these residual plots enable you to visually check whether the residuals still contain some nonlinear behaviour with respect to your explanatory variables. Two remarks for further explanation:
The residuals of a correctly specified model (e.g. the baseline linear model) should look like random noise. In the absence of remaining patterns in the residuals, we have no indication that important features have been omitted.
If the residuals suggest a pattern, then this means that we failed to take some (nonlinear) effects into account. You should reconsider the model specification. If the baseline model was linear, then including some nonlinear terms might "clean" the residuals.
This kind of visual inspection is often subjective. However, you can argue that 1. is just a random cloud of points whereas 2. shows some remaining curvature. There is also a statistical test that does this kind of assessment for you: the Ramsey Regression Equation Specification Error Test (RESET).
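To make the idea behind such a test concrete, here is a minimal NumPy sketch in the spirit of RESET: fit the linear model, then check whether adding the squared fitted values reduces the residual sum of squares by much. The data are simulated and the helper name `reset_f_statistic` is made up; this is an illustration of the principle, not a full implementation of the test (it stops at the F statistic and computes no p-value).

```python
import numpy as np

def reset_f_statistic(x, y):
    """F statistic for adding squared fitted values to a linear fit of y on x."""
    n = len(y)
    X_r = np.column_stack([np.ones(n), x])        # restricted model: intercept + x
    beta_r, *_ = np.linalg.lstsq(X_r, y, rcond=None)
    fitted = X_r @ beta_r
    rss_r = np.sum((y - fitted) ** 2)

    X_u = np.column_stack([X_r, fitted ** 2])     # unrestricted model: add fitted**2
    beta_u, *_ = np.linalg.lstsq(X_u, y, rcond=None)
    rss_u = np.sum((y - X_u @ beta_u) ** 2)

    q, k = 1, X_u.shape[1]                        # one added term, k parameters in total
    return ((rss_r - rss_u) / q) / (rss_u / (n - k))

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
noise = rng.normal(0, 1, 200)

f_linear = reset_f_statistic(x, 1 + 2 * x + noise)                # truly linear data
f_curved = reset_f_statistic(x, 1 + 2 * x + 0.5 * x**2 + noise)   # curvature left in residuals
print(f_linear, f_curved)
```

A large F statistic corresponds to the second kind of plot: the residuals of the linear fit still carry structure that a nonlinear term can explain.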
I am learning statistics, and have some basic yet core questions on SD:
s = sample (the collection of observations)
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s]))**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at a power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s]))**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s]))**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*math.pi)))*math.e**((-1/2)*((xi-μ)/σ)**2)?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report the mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves; for example, the SD assigns more weight to outlying values. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies within ±2 SD of the mean. For some discussion of the mean absolute deviation (and other measures of average absolute deviation, such as the median absolute deviation) and their uses see here.
Beyond its use as a measure of spread, though, the SD is the square root of the variance, and much of its popularity comes from the variance's nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but the squared difference is a smooth function that is differentiable everywhere, allowing one to identify a minimum analytically, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (the first moment), the variance (the second central moment, whose square root is the standard deviation), skewness (based on the third power) and kurtosis (based on the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
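To see how the choice of power changes the measure, here is a small sketch that computes the deviation for powers 1 to 4 on made-up data, once without and once with an outlier. Note the assumptions: it divides by n rather than n-1 to keep the comparison simple, and `power_deviation` is a name invented for this example.

```python
import numpy as np

def power_deviation(values, p):
    """(mean(|x - mean|**p))**(1/p): p=1 is the mean absolute deviation,
    p=2 the population SD (divisor n, not n-1)."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()) ** p) ** (1.0 / p)

clean = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
with_outlier = np.array([8.0, 9.0, 10.0, 11.0, 32.0])

for p in (1, 2, 3, 4):
    print(p, power_deviation(clean, p), power_deviation(with_outlier, p))
```

The higher the power, the more a single outlying value dominates the result, which is the "amplification" the question describes.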
For some other answers and a larger discussion of why SD is so popular, see here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but you do need some scale parameter, whether expressed as the SD, the variance, or the precision. Even the mean absolute deviation would do in principle, since for a normal distribution it equals σ*sqrt(2/π), but that is just sigma rescaled. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
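A quick numerical check of the stretching and shifting: draw standard normal values, rescale them with an arbitrarily chosen mean and sigma, and compare the sample SD with sigma. The sketch also checks a known property of the normal distribution, namely that its mean absolute deviation is σ*sqrt(2/π). The values of `mu` and `sigma` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 5.0, 3.0                      # arbitrary illustration values

z = rng.standard_normal(200_000)          # standard normal: mean 0, SD 1
x = mu + sigma * z                        # shift by mu, stretch by sigma

sample_sd = x.std(ddof=1)
sample_mad = np.mean(np.abs(x - x.mean()))
expected_mad = sigma * np.sqrt(2 / np.pi)

print(sample_sd, sample_mad, expected_mad)
```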
I am calculating a variance-covariance matrix and I see two different ways of calculating the standard errors:
sqrt(diagonal values/number of observations)
e.g. standard deviation / sqrt(number of observations)
(as given in the description of how to calculate the standard error at https://en.wikipedia.org/wiki/Standard_error)
or some people say it is simply
sqrt(diagonal values)
I had previously thought that the diagonal values in the variance-covariance matrix were the variances and hence that their square roots would be the standard deviations (not the SEs). However, the more I read the more I think I may be wrong and that they are the SEs, but I am unsure why this is the case.
Can anyone help? Many thanks!!
Yes, the diagonal elements of the covariance matrix are the variances. The square roots of these variances are the standard deviations. If you need the standard error, you have to clarify the question "the standard error of what?" (see also the Wikipedia entry linked in your post). If you mean the standard error of the mean then yes, "standard deviation / sqrt(number of observations)" is what you are looking for.
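A minimal sketch of both quantities, assuming the matrix is the covariance matrix of the raw data (made-up data, 500 observations of 3 variables). If instead the matrix is the covariance matrix of estimated model coefficients, as reported by regression routines, then the square roots of its diagonal are already the standard errors of those coefficients, which is probably where the "simply sqrt(diagonal values)" advice comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 3))           # hypothetical: 500 observations, 3 variables

cov = np.cov(data, rowvar=False)           # 3x3 variance-covariance matrix of the data
variances = np.diag(cov)                   # diagonal = variances
std_devs = np.sqrt(variances)              # standard deviations
std_err_of_mean = std_devs / np.sqrt(data.shape[0])   # SE of each variable's mean

print(std_devs)
print(std_err_of_mean)
```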
I am calculating the dynamic resistance of a diode. I have a lot of measurements and I've created a graph from them. The question is: how do I read the exact value of the function from this graph for a given argument? For example, I want to obtain the value f(x) for x=5 where I only have measurements at other values, e.g. x=10 -> y=213 and x=1 -> y=110, and I have the graph curve. How do I find f(5) = ?
This is not trivial: it will depend on your interpolation scheme and Excel does not expose the scheme it uses when drawing a graph.
Unless you tell it otherwise, Excel (I think) uses a Bezier Curve with 2 control points to perform its graphing.
This interpolation scheme transforms, via some linear algebra, to a cubic spline interpolation.
But to use cubic spline interpolation, you need more than two data points.
Since you've only given us two points, the best thing you can do is to interpolate linearly but that will not be what Excel does.
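For the two points given in the question, linear interpolation can be sketched as follows (Python with NumPy is used here only for illustration; the same arithmetic can be typed into a spreadsheet cell):

```python
import numpy as np

# The two measurements quoted in the question
x_known = np.array([1.0, 10.0])
y_known = np.array([110.0, 213.0])

# Straight-line interpolation between the neighbouring measurements
y_at_5 = np.interp(5.0, x_known, y_known)
print(round(float(y_at_5), 2))  # 155.78
```

By hand this is 110 + (5 - 1) / (10 - 1) * (213 - 110). With more measurements, `np.interp` interpolates between the two points that bracket the requested x.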
A more detailed answer than this would only underline how broad your question is. Do Google any of the terms that I've used: armed with a bit of time and a good internet connection, you ought to be able to solve this problem adequately.
See https://en.wikipedia.org/wiki/Spline_interpolation, https://en.wikipedia.org/wiki/B%C3%A9zier_curve
I think that you can use a preinstalled add-on named Solver. You have to activate it as shown here.
Then you have to follow one of the tutorials you can find on the Internet (like this one), but instead of finding a min or max, you search for the exact value you want.
Here is what I want to do (preferably with Matlab):
Basically I have several traces of cars driving through an intersection. Each one is noisy, so I want to take the mean over all measurements to get a better approximation of the real route. In other words, I am looking for a way to approximate the curve that has the smallest distance to all of the measured traces (in a least-squares sense).
At first glance, this is quite similar to what can be achieved with spap2 of the CurveFitting Toolbox (good example in section Least-Squares Approximation here).
But this algorithm has a major drawback: it assumes a function (with exactly one y(x) for every x), but what I want is a curve in 2D (which may have several y(x) for one x). This leads to problems when cars turn right or left by more than 90 degrees.
Furthermore, it takes the vertical offsets and not the perpendicular offsets (according to the definition on Wolfram).
Does anybody have an idea how to solve this problem? I thought of using a B-spline and changing the number of knots and the degree until I reach a certain fitting quality, but I can't find a way to solve this problem analytically or with the functions provided by the CurveFitting Toolbox. Is there a way to solve this without numerical optimization?
mbeckish is right. In order to get sufficient flexibility in the curve shape, you must use a parametric curve representation (x(t), y(t)) instead of an explicit representation y(x). See Parametric equation.
Given n successive points on the curve, assign them their true time if you know it or just integers 0..n-1 if you don't. Then call spap2 twice with vectors T, X and T, Y instead of X, Y. Now for arbitrary t you get a point (x, y) on the curve.
This won't give you a true least squares solution, but should be good enough for your needs.
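The parametric idea can be sketched in Python with NumPy. Note the swap: `np.polyfit` (an ordinary polynomial least-squares fit) stands in for Matlab's spap2 here, so this illustrates fitting x(t) and y(t) separately rather than reproducing the spline fit. The trace is simulated, and `fit_parametric_curve` and the polynomial degree are choices made for this example.

```python
import numpy as np

def fit_parametric_curve(t, x, y, degree=6):
    """Least-squares fit of x(t) and y(t) separately; returns a function t -> (x, y)."""
    cx = np.polyfit(t, x, degree)
    cy = np.polyfit(t, y, degree)
    return lambda tt: (np.polyval(cx, tt), np.polyval(cy, tt))

# Hypothetical noisy trace of a 270-degree turn (several y values for one x)
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.5 * np.pi, 300)
x_noisy = np.cos(t) + rng.normal(0, 0.02, t.size)
y_noisy = np.sin(t) + rng.normal(0, 0.02, t.size)

curve = fit_parametric_curve(t, x_noisy, y_noisy)
x_fit, y_fit = curve(t)

# Distance between the fitted curve and the noise-free route
max_error = np.max(np.hypot(x_fit - np.cos(t), y_fit - np.sin(t)))
print(max_error)
```

To average several traces, one option is to give each trace a parameter t scaled to a common range and pass all points to the fit together. As noted above, this minimizes offsets in x and y at equal t, not true perpendicular distances.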