Can anyone explain to me what does Normal Approximation do over and above what is done by MAP in some easy words?
I have read enough on http://pymc-devs.github.io/pymc/modelfitting.html#normal-approximations
but it is too complicated for me.
An example showing the difference will be very helpful.
MAP simply returns a posterior mode, while the NormalApproximation uses a quadratic Taylor series approximation to the posterior, and so can return both the expected value and the covariance matrix. Of course, it uses a normal distribution to approximate the posterior, which may not be appropriate.
Related
As I understand it, normal maps simply describe the direction that the surface "faces". As such, it does not make much sense to me for these vectors to have a magnitude other than 1. However, I have come across multiple learning resources suggesting that scaling a normal map is a good way to increase "bumpiness" of a surface.
I would expect the correct approach to be scaling up the bump map, and then generating a new normal map (with normalized vectors). Multiplying normals should not give the same result.
Additionally, I was looking at Unreal's "FlattenNormal" material function. This function lerps between a normal and (0, 0, 1). This clearly does not always result in a normalized vector. This suggests to me that either:
the magnitude of a normal vector does in fact matter;
or normal vectors get normalized before being used.
If 2. is the case, then scaling up normal vectors by a scalar should not achieve anything, right?
I am learning statistics, and have some basic yet core questions on SD:
s = sample size
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s])**(1/2) in Python lingo
f = frequency of an observation value
I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s])**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s])**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean while the simple absolute difference is somewhat a linear scale notionally. Would it not amplify it even more if I cubed it (and made absolute value of course) or quad it?
I do agree computationally cubes and quads would have been more expensive. But with the same argument, the absolute values would have been less expensive... So why squares?
Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ))?
What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
Is it only a matter of our 'getting used to the squares', it could well have been linear, cubed or quad, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves, for example the SD assigns more weight to outlying variables. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies between ±2SD from the mean. For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median average deviation) and their uses see here.
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean (which is related to average absolute mean), the variance (related to standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
For some other answers and a larger discussion of why SD is so popular, See here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but I believe you do need to provide sigma, whether using the SD or precisions. I don't think you can even parametrize a normal distribution using just the mean and the mean absolute difference. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.
I'm currently working to diagonalize a 5000x5000 Hermitian matrix, and I find that when I use Julia's eigen function in the LinearAlgebra module, which produces both the eigenvalues and eigenvectors, I get different results for the eigenvectors compared to when I solve the problem using numpy's np.linalg.eigh function. I believe both of them use BLAS, but I'm not sure what else they may be using that is different.
Has anyone else experienced this/knows what is going on?
numpy.linalg.eigh(a, UPLO='L') is a different algorithm. It assumes the matrix is symmetric and takes the lower triangular matrix (as a default) to more efficiently compute the decomposition.
The equivalent to Julia's LinearAlgebra.eigen() is numpy.linalg.eig. You should get the same result if you turn your matrix in Julia into a Symmetric(A, uplo=:L) matrix before feeding it into LinearAlgebra.eigen().
Check out numpy's docs on eig and eigh. Whilst Julia's standard LinearAlgebra capabilities are here. If you go down to the special matrices sections, it details what special methods it uses depending on the type of special matrix thanks to multiple dispatch.
I was wondering if there was a direct way of computing the iteration matrix for nth Linear Block Gauss Seidel iteration within OpenMDAO?
thank you
If I understand you correctly, you are referring to the matrix-form of the Gauss Seidel algorithm where you take Ax=b, and break A up into the Diagonal (D), Lower (L) and Upper (U) parts, then use those parts to compute the next iterate.
Specifically you compute [D-L]^-1. This, I believe is what you are referring to as the "iteration matrix" (I am not familiar with this terminology, but based on the algorithm I'm comfortable making an educated guess).
This formulation of the algorithm is useful to think about and a simple way to implement it, but OpenMDAO takes a different approach. The LBGS algorithm implemented in OpenMDAO is set up to work in a matrix-free manner. That means it only interacts with the linear operator methods solve_linear and apply_linear and never explicitly assembles the A matrix at all. Hence there isn't an opportunity to split A up into D, L, U.
Depending on the way you constructed the model, the A matrix you would need might or might not be there at all because OpenMDAO is capable of working in a completely matrix free context. However, if all of your components use the compute_partials or linearize methods to provide partial derivatives then the data you would need for the A matrix does exist in memory.
You'll have to dig for it a bit, and ironically the best place to see how to do that is in the direct solver which does actually require the matrix be formed to compute a factorization.
Also, in that code you'll see a function can iteratively call the linear operator to construct a dense matrix even if the underlying components don't provide their partials directly. Please note that this approach for assembling the matrix is extremely slow and is not recommended for normal operations.
For example, in generative adversarial network, we often hear that inference is easy because the conditional distribution of x given latent variable z is 'tractable'.
Also, I read somewhere that Boltzmann machine and variational autoencoder is used where the posterior distribution is not tractable so some sort of approximation need to be applied.
Could anyone tell me what 'tractable' means, in a rigorous definition? Or could anyone explain in any of the examples I gave above, what tractable exactly means in that context?
First of all, let's define what tractable and intractable problems are (Reference: http://www.cs.ucc.ie/~dgb/courses/toc/handout29.pdf).
Tractable Problem: a problem that is solvable by a polynomial-time algorithm. The upper bound is polynomial.
Intractable Problem: a problem that cannot be solved by a polynomial-time algorithm. The lower bound is exponential.
From this perspective, a definition for tractable distribution is that it takes polynomial-time to calculate the probability of this distribution at any given point.
If a distribution is in a closed-form expression, the probability of this distribution can definitely be calculated in polynomial-time, which, in the world of academia, means the distribution is tractable. Intractable distributions take equal to or more than exponential-time, which usually means that with existing computational resources, we can never calculate the probability at a given point with relatively "short" time (any time longer than polynomial-time is long...).