Why does scikit-learn return log-density?

The function score_samples from sklearn.neighbors.kde.KernelDensity returns the log of the density. What is the advantage of that over returning the density itself?
I know that the logarithm makes sense for probabilities, which are between 0 and 1 (see this question: Why use log-probability estimates in GaussianNB [scikit-learn]?). But why do the same for densities, which are between 0 and infinity?
Is there a way to estimate log-density directly, or is it just the logarithm taken from the estimated density?

Much of what applies to probabilities also applies to densities, so the answers in Why use log-probability estimates in GaussianNB [scikit-learn]? apply:
As long as the density is positive everywhere, the logarithm is well defined. It has much better numerical resolution and stability as the density tends toward 0. Imagine modelling your points with a Gaussian kernel of a certain width, and imagine the points forming a cluster somewhere. As you move away from this dense area, the log-density falls off roughly like the negative squared distance to the cluster. The exponential of that quickly yields very small quantities that you can rightfully no longer trust.
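To make the numerical argument concrete, here is a minimal sketch in plain Python. It is not scikit-learn's actual implementation; the helper names, the 1-D sample points, and the bandwidth are assumptions chosen for illustration. It evaluates a Gaussian kernel density estimate once naively and once directly in log space with the log-sum-exp trick, which is one way to obtain a log-density without forming the density first.

```python
import math

def kde_density_naive(x, points, bandwidth):
    """Gaussian KDE at x, computed directly as an average of kernel values."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def kde_log_density(x, points, bandwidth):
    """Log of the same KDE, computed in log space (log-sum-exp trick)."""
    log_norm = -math.log(len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    exponents = [-0.5 * ((x - p) / bandwidth) ** 2 for p in points]
    largest = max(exponents)  # factor out the dominant term before exponentiating
    return log_norm + largest + math.log(sum(math.exp(e - largest) for e in exponents))

points = [-0.1, 0.0, 0.1]   # hypothetical cluster of samples
bandwidth = 0.1             # hypothetical kernel width

near, far = 0.0, 10.0
density_near = kde_density_naive(near, points, bandwidth)
density_far = kde_density_naive(far, points, bandwidth)   # underflows to exactly 0.0
log_near = kde_log_density(near, points, bandwidth)
log_far = kde_log_density(far, points, bandwidth)         # still finite

print(density_near, math.exp(log_near))
print(density_far, log_far)
```

Near the cluster both routes agree. Far away the naive density collapses to zero, so every distant point looks equally unlikely, while the log-density still ranks them.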

Related

How does the Gaussian Mixture covariance hypothesis affect density?

I'm wondering how and why changing the covariance_type from 'full' to 'diag' changes the output distribution so much.
Let's take this image as an example.
The 'full' likelihood is much larger than the 'diag' one, and the green Gaussian even includes the blue one. I get pretty much the same result in my own experiments, or even worse. Sometimes I even get one (or two) Gaussian densities covering my whole training set with high energy.
According to the sklearn documentation this is due to overfitting, but I don't understand why and how (mathematically) the 'diag' covariance hypothesis prevents this phenomenon.
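One way to read the overfitting remark is to count the free covariance parameters each hypothesis has to estimate from the same data. The sketch below does only that counting. The function name is hypothetical; the four option names are the values accepted by sklearn's covariance_type parameter. It is not a full explanation of the behaviour in the image.

```python
def covariance_param_count(covariance_type, n_components, n_features):
    """Number of free covariance parameters for a given covariance_type."""
    d, k = n_features, n_components
    per_full_matrix = d * (d + 1) // 2   # entries of a symmetric d x d matrix
    counts = {
        "full": k * per_full_matrix,   # one full matrix per component
        "tied": per_full_matrix,       # one full matrix shared by all components
        "diag": k * d,                 # one variance per feature per component
        "spherical": k,                # one variance per component
    }
    return counts[covariance_type]

for cov_type in ("full", "tied", "diag", "spherical"):
    print(cov_type, covariance_param_count(cov_type, n_components=3, n_features=10))
```

With 3 components and 10 features, 'full' estimates 165 covariance parameters against 30 for 'diag'. A full covariance can also rotate and stretch along arbitrary directions, which gives it more freedom to follow a few training points.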

How can r-squared be negative when the correlation between prediction and truth is positive?

I am trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existent forecasting power) when at the same time the correlation between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R Squared can be negative in a rare scenario.
R Squared = 1 - (SSR/SST)
Here, SST stands for the total sum of squares, which measures how much the observed data points vary around the mean of the target variable. The mean plays the role of the baseline regression line here.
SST = Sum (Square (Each data point - Mean of the target variable))
For example,
If we want to build a regression model to predict height of a student with weight as the independent variable then a possible prediction without much effort is to calculate the mean height of all current students and consider it as the prediction.
In the above diagram, the red line is the regression line, which is nothing but the mean of all heights. This mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now come to SSR.
SSR stands for Sum of Squared Residuals. The residuals are calculated from the model we build with our mathematical approach (linear regression, Bayesian regression, polynomial regression, or any other approach). If we use a sophisticated approach rather than a naive one like the mean, our accuracy will usually increase.
SSR = Sum (Square (Each data point - Corresponding predicted point on the regression line))
In the above diagram, let's say the blue line indicates a sophisticated model built with extensive mathematical analysis. We can see that it has clearly higher accuracy than the red line.
Now come to the formula:
R Squared = 1 - (SSR/SST)
Here,
SST will be a large number because the mean is a very poor model (red line).
SSR will be a small number because it comes from the best model we developed after much mathematical analysis (blue line).
So SSR/SST will be a very small number (it gets smaller whenever SSR decreases).
So 1 - (SSR/SST) will be a large number, close to 1.
So we can infer that the higher R Squared is, the better the model fits.
This is the generic case, but it does not carry over to many cases where multiple independent variables are present. In the example we had only one independent variable and one target variable, but in a real case we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So R Squared is calculated on the assumption that the average line of the target, a horizontal line perpendicular to the y axis, is the worst fit a model can have in the riskiest case. SST is the sum of squared differences between this average line and the original data points. Similarly, SSR is the sum of squared differences between the data points predicted by the model plane and the original data points.
SSR/SST is a ratio that tells how bad SSR is relative to SST. If your model can build a plane that is even somewhat better than this worst case, then SSR < SST (roughly 99% of cases), and R Squared comes out positive when you substitute into the equation.
But what if SSR > SST? This means that your regression plane is worse than the mean line (the SST baseline). In this case, R Squared will be negative. But this happens in only about 1% of cases or fewer.
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr
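The situation in the question, a negative R squared together with a positive correlation, is easy to reproduce. The following minimal sketch uses made-up numbers and hypothetical helper names: the predictions follow the truth perfectly in shape but carry a constant offset, so SSR is much larger than SST.

```python
import math

def r_squared(y_true, y_pred):
    """R squared = 1 - SSR/SST, with SST measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ssr / sst

def pearson_correlation(a, b):
    """Plain Pearson correlation coefficient between two sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [t + 10.0 for t in y_true]   # right shape, constant bias of +10

print(r_squared(y_true, y_pred))            # -49.0
print(pearson_correlation(y_true, y_pred))  # 1.0
```

Correlation ignores offset and scale, whereas R squared penalizes every squared error against the mean baseline, so the two can disagree in sign.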

Why do we choose SSE (sum of squared errors) to decide the best-fit line in linear regression?

We choose SSE (sum of squared errors) to decide the best-fit line, instead of the sum of residuals or the sum of absolute residuals. Why?
The purpose is to allow linear algebra to directly solve for equation coefficients in regression. The other fitting targets you mention cannot be used in this way. Using derivative calculus, it was found that a fitting target of lowest sum of squared error allowed a direct, non-iterative solution to the problem of fitting experimental data to equations that are linear in their coefficients - such as standard polynomial equations.
James is right that the ability to formulate the estimates of regression coefficients as a form of linear algebra is one large advantage of the least squares estimate (minimizing SSE), but using the least squares estimate provides a few other useful properties.
With the least squares estimate you're minimizing the variance of the errors - which is often desired. This gives us the best linear unbiased estimator (BLUE) of the coefficients (given the Gauss–Markov assumptions are met). (Gauss-Markov assumptions and a proof showing why this formulation gives us the best linear unbiased estimates can be found here.)
With the least squares, you also end up with a unique solution (assuming you have more observations than estimated coefficients and no perfect multicollinearity).
As for using the sum of residuals, this wouldn’t work well, since it would be minimized by making all residuals negative; it can be pushed arbitrarily low, so no minimum exists.
But the sum of absolute residuals is used in some linear models where you want the estimates to be more robust to outliers and aren’t necessarily concerned with the variance of the residuals.
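Both points made in the answers above, the direct linear-algebra solution and the failure of the plain sum of residuals, can be checked in a short NumPy sketch. The data values and the helper name are made up for illustration.

```python
import numpy as np

# Made-up data, roughly following y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Squared error: setting the gradient to zero gives the normal equations
# (X^T X) beta = X^T y, which one linear solve answers without iteration.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit through NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

def sum_of_residuals(intercept, slope):
    """Plain (signed) sum of residuals for the line intercept + slope * x."""
    return float(np.sum(y - (intercept + slope * x)))

print(beta_normal, beta_lstsq)
# Raising the intercept lowers the signed sum without bound,
# so it cannot serve as a fitting target.
print(sum_of_residuals(1.0, 2.0), sum_of_residuals(101.0, 2.0))
```

The sum of absolute residuals does have a minimum, but it has no closed-form solution of this kind and is usually minimized iteratively.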

Why does scikit-learn TruncatedSVD use the 'randomized' algorithm as default?

I used TruncatedSVD on a 30000 by 40000 term-document matrix to reduce it to 3000 dimensions.
When using 'randomized', the variance ratio is about 0.5 (n_iter=10).
When using 'arpack', the variance ratio is about 0.9.
So the variance ratio of the 'randomized' algorithm is lower than that of 'arpack'.
Why, then, does scikit-learn TruncatedSVD use the 'randomized' algorithm as default?
Speed!
According to the docs, sklearn.decomposition.TruncatedSVD can use a randomized algorithm due to Halko, Martinsson, and Tropp (2009). The paper claims that this algorithm is considerably faster.
For a dense matrix, it runs in O(m*n*log(k)) time, whereas the classical algorithm takes O(m*n*k) time, where m and n are the dimensions of the matrix from which you want the k largest components. The randomized algorithm is also easier to parallelize efficiently and makes fewer passes over the data.
Table 7.1 of the paper (on page 45) shows the performance of a few algorithms as a function of matrix size and # of components, and the randomized algorithm is often an order of magnitude faster.
The accuracy of the output is also claimed to be pretty good (Figure 7.5), though there are some modifications and constants that might affect it and I haven't gone through the sklearn code to see what they did/did not do.
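The speed-versus-accuracy trade-off can be explored with a small NumPy sketch of a randomized range finder with power iterations, in the spirit of the paper cited above. This is an illustration only, not scikit-learn's code: the function name, the oversampling value, and the test matrix are assumptions, and the printed quantity (fraction of the squared Frobenius norm captured) is a stand-in for the variance ratio in the question.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_iter=0, oversample=5, seed=0):
    """Approximate the top-k singular triplets of A with a random projection."""
    rng = np.random.default_rng(seed)
    # Project A onto k + oversample random directions to sample its range.
    omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # Power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix, then map back to the full space.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(42)
A = rng.standard_normal((200, 150))   # made-up matrix with a slowly decaying spectrum
k = 10

s_exact = np.linalg.svd(A, compute_uv=False)[:k]
_, s_plain, _ = randomized_svd_sketch(A, k, n_iter=0)
_, s_power, _ = randomized_svd_sketch(A, k, n_iter=5)

total = np.sum(A ** 2)
for name, s in [("exact", s_exact), ("n_iter=0", s_plain), ("n_iter=5", s_power)]:
    print(name, np.sum(s ** 2) / total)
```

The approximate singular values can never exceed the exact ones, and on a matrix whose spectrum decays slowly the gap shrinks as the number of power iterations grows. A slowly decaying spectrum is typical of term-document matrices, which may be part of why the 'randomized' result in the question falls short of 'arpack'.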

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
Without more information about your geometry setup, this is hard to answer. Anyway, for rotations I would:
Create 3 unit vectors, x=(1,0,0), y=(0,1,0), z=(0,0,1), apply rotation i to them, and call the outputs x(i), y(i), z(i). This just means applying matrix(i) with its position set to (0,0,0). Do this for all the measurements you have.
Now average all the vectors:
X=avg(x(1),x(2),...x(n))
Y=avg(y(1),y(2),...y(n))
Z=avg(z(1),z(2),...z(n))
Correct the vector values: make each of X, Y, Z a unit vector again, and take the axis that is closest to the rotation axis as the main axis. It stays as is; recompute the remaining two axes as cross products of the main axis and the other vector to ensure orthogonality. Beware of the multiplication order (the wrong order of operands will negate the output).
Construct the averaged transform matrix: see transform matrix anatomy. As the origin you can use the averaged origin of the measurement matrices.
Moakher wrote a paper explaining that there are basically two ways to take an average of rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass. That one is closer to the notion of a geometric mean, and it is more complicated to compute.
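The first approach (weighted arithmetic mean, then projection back to SO(3) with the SVD) can be sketched in NumPy as follows. The function names and the sample captures are made up for illustration. The spread measure used here, the RMS of the angles between each capture and the mean, is one possible choice and is an assumption of this sketch, not something taken from the paper.

```python
import numpy as np

def rotation_about_z(angle_deg):
    """Helper for the demo: rotation matrix about the z axis."""
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def average_rotation(rotations, weights=None):
    """Weighted arithmetic mean of 3x3 rotations, projected back onto SO(3)."""
    mean_matrix = np.average(np.asarray(rotations), axis=0, weights=weights)
    U, _, Vt = np.linalg.svd(mean_matrix)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt

def angular_deviations_deg(rotations, mean_rotation):
    """Angle (degrees) of the relative rotation between the mean and each sample."""
    traces = np.array([np.trace(mean_rotation.T @ R) for R in rotations])
    return np.degrees(np.arccos(np.clip((traces - 1.0) / 2.0, -1.0, 1.0)))

# Made-up captures: small rotations about z, in degrees.
captures = [rotation_about_z(a) for a in (-2.0, 0.0, 2.0)]
mean_rotation = average_rotation(captures)
deviations = angular_deviations_deg(captures, mean_rotation)
spread_rms = np.sqrt(np.mean(deviations ** 2))

print(mean_rotation)
print(deviations, spread_rms)
```

Passing weights=[...] gives more trusted captures more influence. The rotation part would come from the upper-left 3x3 block of each 4x4 pose matrix, while the translation column can be averaged separately as an ordinary vector.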
