How does the Gaussian Mixture covariance hypothesis affect the density? - scikit-learn

I'm wondering how and why changing the covariance_type from 'full' to 'diag' changes the output distribution that much.
Let's take this image as an example:
[image: Gaussian mixture fits with 'full' vs. 'diag' covariance]
The 'full' likelihood is much larger than the 'diag' one, and the green Gaussian even includes the blue one. I get pretty much the same result in my own experiments, or even worse: sometimes I get one (or two) Gaussian densities with high energy overlapping my whole training set.
According to the sklearn documentation it's due to overfitting, but I don't understand why and how (mathematically) the diag covariance hypothesis prevents this phenomenon.
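For context, here is a minimal sketch (on synthetic data I made up, not the poster's setup) that contrasts the two settings. With covariance_type='full' every component estimates a complete covariance matrix, correlations included; with 'diag' it only estimates one variance per feature, so the fitted ellipses stay axis-aligned and each component has far fewer free parameters available to wrap itself around the training points:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    # Two correlated 2-D blobs; the numbers are arbitrary, chosen only for illustration.
    X = np.vstack([
        rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200),
        rng.multivariate_normal([4, 4], [[1.0, -0.6], [-0.6, 1.0]], size=200),
    ])

    for cov_type in ("full", "diag"):
        gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                              random_state=0).fit(X)
        # 'full' stores (n_components, n_features, n_features) covariances,
        # 'diag' only (n_components, n_features) per-feature variances.
        print(cov_type, gmm.covariances_.shape, gmm.score(X))

Roughly speaking, in d dimensions 'full' fits d(d+1)/2 covariance parameters per component versus d for 'diag', which is why the full model can chase the training data (overfit) much more aggressively.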

Related

Why do we choose SSE (sum of squared errors) to decide the best fit line in linear regression?

Why do we choose SSE (sum of squared errors) for deciding the best fit line instead of the sum of residuals or the sum of absolute residuals?
The purpose is to allow linear algebra to directly solve for equation coefficients in regression. The other fitting targets you mention cannot be used in this way. Using derivative calculus, it was found that a fitting target of lowest sum of squared error allowed a direct, non-iterative solution to the problem of fitting experimental data to equations that are linear in their coefficients - such as standard polynomial equations.
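As a concrete illustration of that direct solution (the data below is invented for the example): the SSE minimizer is given in closed form by the normal equations, beta = (X^T X)^(-1) X^T y, which is a single linear solve rather than an iterative search:

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)   # true line: 2 + 3x plus noise

    X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
    beta = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations: one linear solve, no iteration
    print(beta)                                  # close to [2, 3]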
James is right that the ability to formulate the estimates of regression coefficients as a form of linear algebra is one large advantage of the least squares estimate (minimizing SSE), but using the least squares estimate provides a few other useful properties.
With the least squares estimate you're minimizing the variance of the errors - which is often desired. This gives us the best linear unbiased estimator (BLUE) of the coefficients (given the Gauss–Markov assumptions are met). (Gauss-Markov assumptions and a proof showing why this formulation gives us the best linear unbiased estimates can be found here.)
With the least squares, you also end up with a unique solution (assuming you have more observations than estimated coefficients and no perfect multicollinearity).
As for using the sum of residuals, this wouldn't work well, since it can be driven arbitrarily low simply by making all the residuals negative.
But the sum of absolute residuals is used in some linear models where you may want the estimates to be more robust to outliers and aren't necessarily concerned with the variance of the residuals.
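A tiny numerical illustration of those last two points (made-up data, with scipy used only as a convenient optimizer): the plain sum of signed residuals can be pushed arbitrarily low by placing the line above every point, while minimizing the sum of absolute residuals (least absolute deviations) still gives a sensible, outlier-resistant fit:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.RandomState(1)
    x = np.linspace(0, 10, 30)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=30)
    y[-1] += 30.0                                   # one gross outlier

    def line(params):
        a, b = params
        return a + b * x

    # Sum of signed residuals: a line far above the data makes every residual
    # negative, so the "objective" goes to minus infinity -- no sensible minimum.
    print(np.sum(y - line((1000.0, 2.0))))

    # Least absolute deviations: minimize the sum of absolute residuals instead.
    lad = minimize(lambda p: np.sum(np.abs(y - line(p))), x0=[0.0, 0.0],
                   method="Nelder-Mead")
    print(lad.x)                                    # slope stays near 2 despite the outlier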

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
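For what it's worth, here is a rough sketch of the approach described in the question (quaternion averaging for tightly clustered rotations), with scipy used for the conversions and the angle to the mean rotation used as a spread measure; it is one plausible implementation, not a canonical one:

    import numpy as np
    from scipy.spatial.transform import Rotation

    def mean_and_spread(matrices, weights=None):
        """matrices: iterable of 4x4 poses with the rotation in the top-left 3x3."""
        rots = Rotation.from_matrix([np.asarray(m)[:3, :3] for m in matrices])
        q = rots.as_quat()                          # (n, 4), scalar-last convention
        q[q @ q[0] < 0] *= -1.0                     # q and -q are the same rotation: align signs
        w = np.ones(len(q)) if weights is None else np.asarray(weights, dtype=float)
        q_mean = np.average(q, axis=0, weights=w)
        q_mean /= np.linalg.norm(q_mean)            # renormalize back onto the unit sphere
        mean_rot = Rotation.from_quat(q_mean)
        # Spread: rotation angle (degrees) of each measurement relative to the mean.
        angles = np.degrees((mean_rot.inv() * rots).magnitude())
        return mean_rot.as_matrix(), angles.std(ddof=1)

For rotations this tightly clustered, the renormalized quaternion average is generally very close to the proper geodesic (Riemannian) mean.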
Without more info about your geometry setup it is hard to answer. Anyway, for rotations I would:
create 3 unit vectors
x=(1,0,0),y=(0,1,0),z=(0,0,1)
and apply the rotation on them and call the output
x(i),y(i),z(i)
it is just applying the matrix(i) with position at (0,0,0)
do this for all measurements you have
now average all vectors
X=avg(x(1),x(2),...x(n))
Y=avg(y(1),y(2),...y(n))
Z=avg(z(1),z(2),...z(n))
correct the vector values
so make each of X, Y, Z a unit vector again, and take the axis which is closest to the rotation axis as the main axis. It stays as is; recompute the remaining two axes as cross products of the main axis and the other vector to ensure orthogonality. Beware of the multiplication order (the wrong order of operands will negate the output).
construct averaged transform matrix
see transform matrix anatomy; as the origin you can use the averaged origin (translation) of the measurement matrices
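A hedged numpy sketch of the procedure above (here the z axis is arbitrarily taken as the main axis; pick whichever axis is closest to your rotation axis):

    import numpy as np

    def average_rotation_basis(matrices):
        """matrices: iterable of 4x4 (or 3x3) transforms; returns an averaged 3x3 rotation."""
        R = np.asarray([np.asarray(m)[:3, :3] for m in matrices])
        # Rotating x=(1,0,0), y=(0,1,0), z=(0,0,1) by matrix(i) just extracts its
        # columns, so averaging the rotated vectors is a column-wise mean.
        X = R[:, :, 0].mean(axis=0)
        Y = R[:, :, 1].mean(axis=0)
        Z = R[:, :, 2].mean(axis=0)
        Z /= np.linalg.norm(Z)          # main axis: keep it, just renormalize
        X = np.cross(Y, Z)              # rebuild the other two with cross products
        X /= np.linalg.norm(X)          # (mind the operand order, or the sign flips)
        Y = np.cross(Z, X)              # already unit length and orthogonal
        return np.column_stack([X, Y, Z])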
Moakher wrote a paper explaining that there are basically two ways to take an average of rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass; that one is closer in spirit to the geometric mean, and it's more complicated to compute.
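For the first of those, a short sketch of what the weighted-average-then-SVD projection looks like (the names are mine, not from the paper):

    import numpy as np

    def average_rotations_svd(rotations, weights=None):
        """rotations: array-like of shape (n, 3, 3); returns the projected mean rotation."""
        R = np.asarray(rotations, dtype=float)
        w = np.ones(len(R)) if weights is None else np.asarray(weights, dtype=float)
        M = np.einsum('i,ijk->jk', w / w.sum(), R)     # weighted entrywise average
        U, _, Vt = np.linalg.svd(M)
        # Keep det = +1 so the result is a proper rotation, not a reflection.
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        return U @ D @ Vt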

Log transforming predictor variables in survival analysis

I am running shared gamma frailty models (i.e., Coxph survival analysis models with a random effect) and want to know if it is "acceptable" to log transform one of your continuous predictor variables. I found a website (http://www.medcalc.org/manual/cox_proportional_hazards.php) that said "The Cox proportional regression model assumes ... there should be a linear relationship between the endpoint and predictor variables. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable var can be obtained by entering LOG(var) as predictor variable".
I would really appreciate a second opinion from someone with more statistical knowledge on this topic. In a nutshell: Is it OK/commonplace/etc to transform (specifically log transform) predictor variables in a survival analysis model (e.g., Coxph model).
Thanks.
You can log transform any predictor in Cox regression. This is frequently necessary but has some drawbacks.
Why log transform? There are a number of good reasons: you decrease the extent and effect of outliers, the data becomes more normally distributed, and so on.
When possible? I doubt that there are circumstances when you can not do it. I find it hard to believe that it would compromise the precision of your estimates.
Why not do it always? Well, it becomes difficult to interpret the results for a predictor that has been log transformed. If you don't log transform and your predictor is, for example, blood pressure, a hazard ratio of 1.05 means a 5% increase in the risk of the event for a 1 unit increase in blood pressure. If you log transform blood pressure, the hazard ratio of 1.05 (it would most likely not land on 1.05 again after the log transform, but we'll stick with 1.05 for simplicity) means a 5% increase for each log unit increase in blood pressure. Now that's more difficult to grasp.
But if you are not interested in the particular variable that you are thinking about log transforming (i.e. you just need to adjust for it as a covariate), go ahead and do it.
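As a small illustration (using lifelines and its bundled example data purely for demonstration, and leaving out the frailty/random-effect term the poster mentions), log transforming a predictor is just a column transformation before the fit; the price is that exp(coef) then refers to a one log-unit change:

    import numpy as np
    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()                       # example survival data shipped with lifelines
    df["log_age"] = np.log(df["age"])       # log-transform the continuous predictor

    cph = CoxPHFitter()
    cph.fit(df[["week", "arrest", "log_age", "prio"]],
            duration_col="week", event_col="arrest")
    # exp(coef) for log_age is the hazard ratio per one log-unit increase in age,
    # i.e. per multiplication of age by e -- harder to read than "per year".
    cph.print_summary()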

Why does scikit learn return log-density?

The function score_samples from sklearn.neighbors.kde.KernelDensity returns the log of the density. What is the advantage of that over returning the density itself?
I know that the logarithm makes sense for probabilities, which are between 0 and 1 (see this question: Why use log-probability estimates in GaussianNB [scikit-learn]?). But why do the same for densities, which are between 0 and infinity?
Is there a way to estimate log-density directly, or is it just the logarithm taken from the estimated density?
Much of what applies to probabilities also applies to densities, so the answers in Why use log-probability estimates in GaussianNB [scikit-learn]? apply:
As long as the density is everywhere positive, the logarithm is well defined. It has much better numerical resolution and stability as density tends toward 0. Imagine a gaussian kernel of a certain width to model your points and imagine them in a cluster somewhere. As you move away from this dense area, the log density amounts to the negative squared distance to the cluster. The exponential of that will quickly yield very small quantities in which you may rightfully not trust anymore.
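A small sketch of that numerical point (the cluster and the query points are invented for the example): far from the data the log-density is still a perfectly usable number, while its exponential underflows to exactly zero:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    X = rng.normal(0.0, 1.0, size=(500, 1))          # a cluster of points around 0

    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
    queries = np.array([[0.0], [10.0], [60.0]])
    log_dens = kde.score_samples(queries)            # log-density, well behaved everywhere
    print(log_dens)                                  # roughly [-1, -2e2, -7e3]
    print(np.exp(log_dens))                          # the last value underflows to 0.0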

using skewness to segment volume

My knowledge of statistics is minuscule, sorry. I have a large volume of measured amplitudes. In the absence of a signal, the noise is assumed to have a normal distribution. When a signal is present with higher amplitude than the surrounding noise, the shape of the distribution is more tailed on the positive side. I was thinking of using skewness for detection of the signal. But the area of higher amplitude (cells in the volume) is rather small compared to the volume itself; we are talking on the order of hundreds of cells out of a total of some thousands. If the skewness is zero for a normal distribution, how can I extract those cells in my volume which contribute to the non-zero skewness? If, say, my skewness value is 0.5, is there a way to drop all cells and keep only those which raised the skewness value? Perhaps I sound unclear, but that just shows how little I understand of the topic.
Thanks in advance.
It seems to me that the problem might best be modeled as a mixture model: we have a Gaussian background
B ~ N(0, sigma)
and a signal, about which the poster has not specified a particular model.
If we can assume that the signal also takes the form of one (or possibly a mixture of several) Gaussian(s), then Gaussian mixture modelling with the EM algorithm may be a good way to solve it (see Wikipedia).
A good paper in the context of segmentation is this here:
http://www.fil.ion.ucl.ac.uk/~karl/Unified%20segmentation.pdf
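A minimal sketch of that mixture idea on made-up amplitudes (the counts and means below are invented): fit a two-component Gaussian mixture and keep the cells assigned to the higher-mean component:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    noise = rng.normal(0.0, 1.0, size=5000)           # background ~ N(0, sigma)
    signal = rng.normal(4.0, 1.0, size=300)           # a few hundred higher-amplitude cells
    amplitudes = np.concatenate([noise, signal])

    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(amplitudes.reshape(-1, 1))
    signal_component = int(np.argmax(gmm.means_.ravel()))   # component with the larger mean
    signal_cells = np.flatnonzero(labels == signal_component)
    print(len(signal_cells))                           # roughly the ~300 injected cells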
If we cannot make such an assumption, I would use a robust regression method to estimate the parameters of the Gaussian noise, where the signal is treated as an outlier, e.g. Least trimmed squares (again see Wikipedia).
The outlier cells can then be found via (Bonferroni-corrected) hypothesis testing, as described e.g. in this paper:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2900857/
