conditional normal distribution proportional to a joint density? - statistics

First of all, I am sorry if this is not the proper place to ask questions about statistics. I study Machine Learning so I wonder if there is a more appropriate website for this kind of questions.
I only wonder why a conditional gaussian p(x1|x2) is said to be proportional to the joint density where it comes from, p(x1, x2).
Thanks

a conditional gaussian p(x1|x2) is said to be proportional to the joint density where it comes from, p(x1, x2)
In a sense, every conditional density is proportional to the corresponding joint density, since p(x1 | x2) = p(x1, x2) / p(x2), so maybe that's what they mean; it's not a property which is special to the Gaussian distribution. However, that's not "proportional" in the usual sense, because the factor 1/p(x2) is not a constant.

Related

Why is Standard Deviation the square of difference of an obsevation from the mean?

I am learning statistics, and have some basic yet core questions on SD:
s = sample size
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s])**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s])**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s])**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ))?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves, for example the SD assigns more weight to outlying variables. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies between ±2SD from the mean. For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median average deviation) and their uses see here.
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (which is related to average absolute mean), the variance (related to standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
For some other answers and a larger discussion of why SD is so popular, See here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma, whether using the SD or precisions. I don't think you can even parametrize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.

Why additive noise needs to be calibrated with sensitivity in differential privacy?

As a beginner to differential privacy, I would like to why the variance for noise mechanisms needs to be calibrated with sensitivity? What is the purpose of that? What happens if we don't calibrate it and add a random variance?
Example scenario here In Laplacian noise, why scale parameter is calibrated?
One way you can understand this intuitively is by imagining a function that returns either of two values, say 0 and a for some real a.
Suppose further that we have an additive noise mechanism, so that we end up with two probability distributions on the real line, as in the image from your attached link (this is an example of the setup above, with a=1):
In pure DP, we are interested in computing the maximum of the ratio of these distributions over the entire real line. As the calculation in your link shows, this ratio is bounded everywhere by e to the power of epsilon.
Now, imagine moving the centers of these distributions further apart, say by shifting the red distribution further to the right (IE, increasing a). Clearly this will place less probability mass from the red distribution on the value 0, which is where the maximum of this ratio will be achieved. Therefore the ratio between these distributions at 0 will be increased--a constant (the mass the blue distribution places on 0) is divided by a smaller number.
One way we could move the ratio back down would be to "fatten" the distributions out. This would correspond pictorally to moving the peaks of the distributions lower, and spreading the mass out over a wider area (since they have to integrate to 1, these two things are necessarily coupled for a distribution like the Laplace). Mathematically we would accomplish this by increasing the variance in the Laplace distribution (increasing b in the parameterization here), which has the effect of lowering the peak of the blue distribution at 0 and raising the mass the red distribution places at 0, thereby reducing the ratio between them back down (a smaller numerator and a larger denominator).
If you perform the calculations, you will find that the relationship between the variance parameter b and the sensitivity of the function f is in fact linear; that is, setting b to be
fixes the maximum of this ratio, to
which is precisely the definition of pure differential privacy.
If you add arbitrary amounts of random noise, you simply end up with random data. Sure, it preserves privacy, but at the same time as destroying any real value in the data. The noise you add needs to match your existing distribution so that it preserves privacy without destroying the value of the data. That’s what the calibration step does.

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
Without more info about your geometry setup is hard to answer. Anyway for rotations I would:
create 3 unit vectors
x=(1,0,0),y=(0,1,0),z=(0,0,1)
and apply the rotation on them and call the output
x(i),y(i),z(i)
it is just applying the matrix(i) with position at (0,0,0)
do this for all measurements you have
now average all vectors
X=avg(x(1),x(2),...x(n))
Y=avg(y(1),y(2),...y(n))
Z=avg(z(1),z(2),...z(n))
correct the vector values
so make each of the X,Y,Z unit vectors again and take the axis which is more closest to the rotation axis as main axis. It will stay as is and recompute the remaining two axises as cross product of main axis and the other vector to ensure orthogonality. Beware of the multiplication order (wrong order of operands will negate the output)
construct averaged transform matrix
see transform matrix anatomy as origin you can use averaged origin of the measurement matrices
Moakher wrote a paper that explains there are basically two ways to take an average of Rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass. That one is a closer notion to the geometric mean, and its more complicated to compute.

Two Dimensional Curve Approximation

here is what I want to do (preferably with Matlab):
Basically I have several traces of cars driving on an intersection. Each one is noisy, so I want to take the mean over all measurements to get a better approximation of the real route. In other words, I am looking for a way to approximate the Curve, which has the smallest distence to all of the meassured traces (in a least-square sense).
At the first glance, this is quite similar what can be achieved with spap2 of the CurveFitting Toolbox (good example in section Least-Squares Approximation here).
But this algorithm has some major drawback: It assumes a function (with exactly one y(x) for every x), but what I want is a curve in 2d (which may have several y(x) for one x). This leads to problems when cars turn right or left with more then 90 degrees.
Futhermore it takes the vertical offsets and not the perpendicular offsets (according to the definition on wolfram).
Has anybody an idea how to solve this problem? I thought of using a B-Spline and change the number of knots and the degree until I reached a certain fitting quality, but I can't find a way to solve this problem analytically or with the functions provided by the CurveFitting Toolbox. Is there a way to solve this without numerical optimization?
mbeckish is right. In order to get sufficient flexibility in the curve shape, you must use a parametric curve representation (x(t), y(t)) instead of an explicit representation y(x). See Parametric equation.
Given n successive points on the curve, assign them their true time if you know it or just integers 0..n-1 if you don't. Then call spap2 twice with vectors T, X and T, Y instead of X, Y. Now for arbitrary t you get a point (x, y) on the curve.
This won't give you a true least squares solution, but should be good enough for your needs.

RayTracing: When to Normalize a vector?

I am rewriting my ray tracer and just trying to better understand certain aspects of it.
I seem to have down pat the issue regarding normals and how you should multiply them by the inverse of the transpose of a transformation matrix.
What I'm confused about is when I should be normalizing my direction vectors?
I'm following a certain book and sometimes it'll explicitly state to Normalize my vector and other cases it doesn't and I find out that I needed to.
Normalized vector is in the same direction with just unit length 1? So I'm unclear when it is necessary?
Thanks
You never need to normalize a vector unless you are working with the angles between vectors, or unless you are rotating a vector.
That's it.
In the former case, all of your trig functions require your vectors to land on a unit circle, which means the vectors are normalized. In the latter case, you are dividing out the magnitude, rotating the vector, making sure it stays a unit, and then multiplying the magnitude back in. Normalization just goes with the territory.
If someone tells you that coordinate system are defined by n unit vectors, know that i-hat, j-hat, k-hat, and so on can be any arbitrary vector(s) of any length and direction, so long as none of them are parallel. This is the heart of affine transformations.
If someone tries to tell you that the dot product requires normalized vectors, shake your head and smile. The dot product only needs normalized vectors when you are using it to get the angle between two vectors.
But doesn't normalization make the math "simpler"?
Not really -- It adds a magnitude computation and a division. Numbers between 0..1 are no different than numbers between 0..x.
Having said that, you sometimes normalize in order to play well with others. But if you find yourself normalizing vectors as a matter of principle before calling methods, consider using a flag attached to the vector to save yourself a step. Mathematically, it is unimportant, but practically, it can make a huge difference in performance.
So again... it's all about rotating a vector or measuring its angle against another vector. If you aren't doing that, don't waste cycles.
tl;dr: Normalized vectors simplify your math. They also reduce the number of very hard to diagnose visual artifacts in your images.
Normalized vector is in the same direction with just unit length 1? So
I'm unclear when it is necessary?
You almost always want all vectors in a ray tracer to be normalized.
The simplest example is that of the intersection test: where does a bouncing ray hit another object.
Consider a ray where:
p(t) = p_0 + v * t
In this case, a point anywhere along that ray p(t) is defined as an offset from the original point p_0 and an offset along a particular direction v. For every increment of parameter t, the resulting p(t) will move another increment of length equal to the length of the vector v.
Remember, you know p_0 and v. When you are trying to find the point where this ray next hits another object, you have to solve for that t. It is obviously more convenient, if not always obviously necessary, to use normalized vector vs in that representation.
However, that same vector v is used in lighting calculations. Imagine that we have another direction vector u that points towards a lighting source. For the purpose of a very simple shading model, we can define the light at a particular point to be the dot product between those two vectors:
L(p) = v * u
Admittedly, this is a very uninteresting reflection model but it captures the high points of the discussion. A spot on a surface is bright if reflection points towards the light and dim if not.
Now, remember that another way of writing this dot product is the product of the magnitudes of the vectors times the cosine of the angle between them:
L(p) = ||v|| ||u|| cos(theta)
If u and v are of unit length (normalized), then the equation will evaluate to be proportional to the angle between the two vectors. However, if v is not of unit length, say because you didn't bother to normalize after reflecting the vector in the ray model above, now your lighting model has a problem. Spots on the surface using a larger v will be much brighter than spots that do not.
It is necessary to normalize a direction vector whenever you use it in some math that is influenced by its length.
The prime example is the dot product, which is used in most lighting equations. You also sometimes need to normalize vectors that you use in lighting calculations, even if you believe that they are normal.
For example, when using an interpolated normal on a triangle. Common sense tells you that since the normals at the vertices are normal, the vectors you get by interpolating are too. So much for common sense... the truth is that they will be shorter unless they incidentially all point into the same direction. Which means that you will shade the triangle too dark (to make matters worse, the effect is more pronounced the closer the light source gets to the surface, which is a... very funny result).
Another example where a vector might or might not be normalized is the cross product, depending on what you are doing. For example, when using the two cross products to build an orthonormal base, then you must at least normalize once (though if you do it naively, you end up doing it more often).
If you only care about the direction of the resulting "up vector", or about the sign, you don't need to normalize.
I'll answer the opposite question. When do you NOT need to normalize? Almost all calculations related to lighting require unit vectors - the dot product then gives you the cosine of the angle between vectors which is really useful. Some equations can still cope but become more complex (essentially doing the normalization in the equation) That leaves mostly intersection tests.
Equations for many intersection tests can be simplified if you have unit vectors. Some do not require it - for example if you have a plane equation (with a unit normal) you can find the ray-plane intersection without normalizing the ray direction vector. The distance will be in terms of the ray direction vectors length. This might be OK if all you want is to intersect a bunch of those planes (the relative distances will all be correct). But as soon as you want to compare with a different distance - calculated using the normalized ray direction - the distance values will not compare properly.
You might think about normalizing a direction vector AFTER doing some work that does not require it - maybe you have an acceleration structure that can be traversed without a normalized vector. But that isn't relevant either because eventually the ray will hit something and you're going to want to do a lighting/shading calculation with it. So you may as well normalize them from the start...
In other words, any specific calculation may not require a normalized direction vector, but a given direction vector will almost certainly need to be normalized at some point in the process.
Vectors are used to store two conceptually different elements: points in space and directions:
If you are storing a point in space (for example the position of the camera, the origin of the ray, the vertices of triangles) you don't want to normalize, because you would be modifying the value of the vector, and losing the specific position.
If you are storing a direction (for example the camera up, the ray direction, the object normals) you want to normalize, because in this case you are interested not in the specific value of the point, but on the direction it represents, so you don't need the magnitude. Normalization is useful in this case because it simplifies some operations, such as calculating the cosine of two vectors, something that can be done with a dot product if both are normalized.

Resources