Scipy gaussian_kde Normalisation - statistics

I've been using scipy.stats.gaussian_kde but have a few questions about its output. I've plotted the normalised histogram and the gaussian_kde curve on the same graph. Why are the y-values so vastly different? My understanding is that the gaussian_kde curve should roughly touch the tips of the histogram bars. Using the scipy.integrate.quad function I determined the area under the curve to be 0.7, rather than the 1.0 I expected.
What I actually want is for the gaussian_kde to represent the non-normalised histogram; does anyone know how I can do that?

Your expectations are a little off. The area under each of the KDE's peaks should roughly equal the area in their corresponding bars. That appears to hold, to my eye. Nonadaptive KDEs with a global bandwidth estimate (like scipy.stats.gaussian_kde) tend to broaden multimodal distributions with sharp peaks.
As for the underestimate of the total area under the KDE, I cannot say without the data and the code that you used to do the integration.
In order to make a KDE approximate an unnormalised histogram, you need to multiply it by bin_width * N, where N is the total number of data points.
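For example, a minimal sketch of that scaling (the bimodal sample data here are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Made-up bimodal sample standing in for the real data
data = np.concatenate([np.random.normal(0, 1, 800),
                       np.random.normal(5, 0.5, 200)])

# Unnormalised histogram (raw counts)
counts, edges, _ = plt.hist(data, bins=30, alpha=0.5, label='histogram')
bin_width = edges[1] - edges[0]

# gaussian_kde integrates to 1, so scale it by N * bin_width
# to put it on the same scale as the raw counts
kde = gaussian_kde(data)
xs = np.linspace(edges[0], edges[-1], 500)
plt.plot(xs, kde(xs) * len(data) * bin_width, label='scaled KDE')
plt.legend()
plt.show()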

Related

Looking for a detailed method to plot contours of confidence level

I am trying to find a method or a tutorial that explains how the contours of different confidence levels (68%, 95%, 99.7%, etc.) are plotted.
Below is an example of these contours on a plot like the one I would like to generate:
It represents the constraints on cosmological parameters (\Omega_Lambda represents dark energy and \Omega_m the total matter density).
Once I have data sets for \Omega_Lambda and \Omega_m, how can I produce these contours? I know what a confidence level is, but I only know the standard deviation.
If I plot the standard deviation of both parameters around the expected values, I get a cross symbol (horizontal for \Omega_m and vertical for \Omega_Lambda): but from this cross, how do I draw contours at different confidence levels?
On the figure above, these contours look like a 2D parametric curve where I have points (\Omega_Lambda(t), \Omega_m(t)) with parameter t, but I don't think they are drawn like this.
You might want to check out Matplotlib's contour plot: the levels parameter seems to be what you need.
The plots in your example are not obtained from raw data, but from a statistical model of raw data. So you could first fit multivariate normal distributions to your data using numpy.mean and numpy.cov, then generate the multivariate normal pdf values with scipy.stats.multivariate_normal. You can also find a code snippet doing confidence ellipses here (which seems to be exactly the kind of thing you were looking for).
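As a rough sketch of that recipe (the sample array here is made up, and the mapping from confidence level to pdf level assumes the fitted distribution is a 2D Gaussian):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal, chi2

# Made-up samples standing in for the (Omega_m, Omega_Lambda) data sets
rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0.3, 0.7],
                                  [[0.010, -0.006],
                                   [-0.006, 0.020]], size=5000)

mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)
rv = multivariate_normal(mu, cov)

# Evaluate the fitted pdf on a grid around the mean
sx, sy = np.sqrt(cov[0, 0]), np.sqrt(cov[1, 1])
X, Y = np.meshgrid(np.linspace(mu[0] - 4*sx, mu[0] + 4*sx, 200),
                   np.linspace(mu[1] - 4*sy, mu[1] + 4*sy, 200))
Z = rv.pdf(np.dstack((X, Y)))

# For a 2D Gaussian, the pdf value on the boundary of a p-confidence
# region is pdf(mu) * exp(-chi2.ppf(p, df=2) / 2)
levels = [rv.pdf(mu) * np.exp(-chi2.ppf(p, 2) / 2) for p in (0.997, 0.95, 0.68)]

plt.contour(X, Y, Z, levels=levels)
plt.xlabel('Omega_m')
plt.ylabel('Omega_Lambda')
plt.show()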

Getting distance to the hyperplane from sklearn's svm.SVC

I'm currently using SVC to separate two classes of data (the features below are named data and the labels condition). After fitting the data using GridSearchCV I get a classification score of about .7 and I'm fairly happy with that number. After that, though, I went to get the relative distances from the hyperplane for data from each class using grid.best_estimator_.decision_function(), and plotted them in a boxplot and a histogram to get a better idea of how much overlap there is. My problem is that in the histogram and the boxplot these look perfectly separable, which I know is not the case. I'm sure I'm calling decision_function() incorrectly, but I'm not sure how to do this properly.
from sklearn.svm import SVC
from sklearn.model_selection import KFold, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sb

svc = SVC(kernel='linear', probability=True, decision_function_shape='ovr')
cv = KFold(n_splits=4, shuffle=True)
C_range = [.001, .005, .01, .05, .1, .5, 1, 5, 10, 50, 100]
param_grid = dict(C=C_range)
# iid= was removed from scikit-learn, so it is dropped here
grid = GridSearchCV(svc, param_grid=param_grid, cv=cv, n_jobs=4, refit=True)
grid.fit(data, condition)
print(grid.best_params_)
print(grid.best_score_)
x = grid.best_estimator_.decision_function(data)
plt.hist(x)
sb.boxplot(x=condition, y=x)
sb.swarmplot(x=condition, y=x)
In the histogram and box plots it looks like almost all of the points have a distance of exactly +1 or -1, with nothing in between.

Average and Measure of Spread of 3D Rotations

I've seen several similar questions, and have some ideas of what I might try, but I don't remember seeing anything about spread.
So: I am working on a measurement system, ultimately computer vision based.
I take N captures, and process them using a library which outputs pose estimations in the form of 4x4 affine transformation matrices of translation and rotation.
There's some noise in these pose estimations. The standard deviation in Euler angles for each axis of rotation is less than 2.5 degrees, so all orientations are pretty close to each other (for a case where all Euler angles are close to 0 or 180). Standard errors of less than 0.25 degrees are important to me. But I have already run into the problems endemic to Euler angles.
I want to average all these pretty-close-together pose estimates to get a single final pose estimate. And I also want to find some measure of spread so that I can estimate accuracy.
I'm aware that "average" isn't actually well defined for rotations.
(For the record, my code is in Numpy-heavy Python.)
I also may want to weight this average, since some captures (and some axes) are known to be more accurate than others.
My impression is that I can just take the mean and standard deviation of the translation vector, and that for the rotation I can convert to quaternions, take the mean, and re-normalize with OK accuracy since these quaternions are pretty close together.
I've also heard mentions of least-squares across all the quaternions, but most of my research into how this would be implemented has been a dismal failure.
Is this workable? Is there a reasonably well-defined measure of spread in this context?
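For what it's worth, a minimal sketch of the quaternion-averaging idea described above (the sign-alignment step is needed because q and -q encode the same rotation; all names here are illustrative, not a standard API):
import numpy as np

def mean_quaternion(quats, weights=None):
    # quats: (N, 4) array of unit quaternions that are all close together
    quats = np.asarray(quats, dtype=float)
    # Align signs with the first quaternion so antipodal pairs don't cancel
    quats = quats * np.sign(quats @ quats[0])[:, None]
    q = np.average(quats, axis=0, weights=weights)
    return q / np.linalg.norm(q)  # re-normalise the component-wise mean

def angular_distances_deg(quats, q_mean):
    # Rotation angle from each sample to the mean: theta = 2*arccos(|q . q_mean|)
    dots = np.abs(np.asarray(quats) @ q_mean)
    angles = 2 * np.arccos(np.clip(dots, -1.0, 1.0))
    return np.degrees(angles)  # their std is one plausible measure of spread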
Without more info about your geometry setup it is hard to answer. Anyway, for rotations I would:
create 3 unit vectors
x = (1,0,0), y = (0,1,0), z = (0,0,1)
apply the rotation to them and call the outputs
x(i), y(i), z(i)
This is just applying matrix(i) with the position at (0,0,0). Do this for all the measurements you have.
average all the vectors
X = avg(x(1), x(2), ... x(n))
Y = avg(y(1), y(2), ... y(n))
Z = avg(z(1), z(2), ... z(n))
correct the vector values
Make each of X, Y, Z a unit vector again, and take the axis closest to the rotation axis as the main axis. It stays as is; recompute the remaining two axes as cross products of the main axis and the other vector to ensure orthogonality. Beware of the multiplication order (the wrong order of operands will negate the output).
construct the averaged transform matrix
See transform matrix anatomy; as the origin you can use the averaged origin of the measurement matrices.
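A rough numpy sketch of those steps for the rotation part (function and variable names are mine, and X is arbitrarily taken as the main axis):
import numpy as np

def average_rotations_axes(mats):
    # mats: list of 4x4 (or 3x3) pose matrices; only the rotation part is used
    X = np.mean([M[:3, :3] @ [1.0, 0.0, 0.0] for M in mats], axis=0)
    Y = np.mean([M[:3, :3] @ [0.0, 1.0, 0.0] for M in mats], axis=0)
    Z = np.mean([M[:3, :3] @ [0.0, 0.0, 1.0] for M in mats], axis=0)
    # Re-orthogonalise, minding the cross-product order so the
    # frame keeps its right-handedness
    X = X / np.linalg.norm(X)
    Z = np.cross(X, Y)
    Z = Z / np.linalg.norm(Z)
    Y = np.cross(Z, X)  # already unit length, since Z and X are orthonormal
    return np.column_stack((X, Y, Z))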
Moakher wrote a paper that explains there are basically two ways to take an average of rotation matrices. The first is a weighted average followed by a projection back to SO(3) using the SVD. The second is the Riemannian center of mass. That one is a closer notion to the geometric mean, and it's more complicated to compute.
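The first (projection) method is short enough to sketch in numpy; this is my reading of it, not code from the paper:
import numpy as np

def average_rotations_svd(rot_mats, weights=None):
    # Chordal mean: arithmetic (optionally weighted) average of the
    # rotation matrices, projected back onto SO(3) with an SVD
    M = np.average(np.asarray(rot_mats), axis=0, weights=weights)
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        # Flip the smallest singular direction to avoid a reflection
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R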

Fit log-log data with gnuplot

I am trying to fit this plot; as you can see, the fit is not good for the data.
My code is:
clear
reset
set terminal pngcairo size 1000,600 enhanced font 'Verdana,10'
set output 'LocalEnergyStepZoom.png'
set ylabel '{/Symbol D}H/H_0'
set xlabel 'n_{step}'
set format y '%.2e'
set xrange [*:*]
set yrange [1e-16:*]
f(x) = a*x**b
fit f(x) "revErrEnergyGfortCaotic.txt" via a,b
set logscale
plot 'revErrEnergyGfortCaotic.txt' w p,\
'revErrEnergyGfortRegular.txt' w p,\
f(x) w l lc rgb "black" lw 3
exit
So the question is: what mistake am I making here? I assumed that in a log-log plane a fit of the form used in the code should represent the data very well.
Thanks a lot
In the end I was able to solve the problem by using the suggestion in Christop's answer and modifying it just a bit.
I found the approximate slope of the function (somewhere near -4), then, holding that parameter fixed, fitted the curve with only a free; once found, I fixed a and fitted only b. After that, using the output as the starting solution, I found the best fit.
You must find appropriate starting values to get a correct fit, because that kind of fitting doesn't have one global solution.
If you don't define a and b, both are set to 1 which might be too far away. Try using
a = 100
b = -3
for a better start. Maybe you need to tweak those values a bit more; I couldn't check, because I don't have the data file.
Also, you might want to restrict the region of the fitting to the part above 10:
fit [10:] f(x) "revErrEnergyGfortCaotic.txt" via a,b
Of course only, if it is appropriate.
This is a common issue in data analysis, and I'm not certain if there's a nice Gnuplot way to solve it.
The issue is that the penalty functions in standard fitting routines are typically the sum of squares of errors, and try as you might, if your data have a lot of dynamic range, the errors for the smallest y-values come out to essentially zero from the point of view of the algorithm.
I recently taught a course to students where they needed to fit such data. Lots of them beat their (matlab) fitting routines into submission by choosing very stringent convergence criteria, but even this did not help too much.
What you really need to do, if you want to fit this power-law tail well, is to convert the data into log-log form and run a linear regression on that log-log representation.
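In numpy terms, that log-log regression might look like this (the data here are synthetic stand-ins for the real file):
import numpy as np

# Synthetic power-law data spanning several decades
x = np.logspace(0, 4, 50)
y = 100.0 * x**-4 * np.exp(np.random.normal(0, 0.1, x.size))

# Fit log(y) = log(a) + b*log(x) with an ordinary linear regression
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
print(np.exp(log_a), b)  # should recover roughly a = 100, b = -4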
The main problem here is that the residual errors of the function values of the higher x are very small compared to the residuals at lower x values. After all, you almost span 20 orders of magnitude on the y axis.
Just weight the y values with 1/y**2, or even better: if you have the standard deviations of your data points weight the values with 1/std**2. Then the fit should converge much much better.
In gnuplot weighting is done using a third data column:
fit f(x) 'data' using 1:2:(1/$2**2) via ...
Or you can use Raman Shah's advice and linearize the y axis and do a linear regression.
You need to use weights for your fit (currently low values are not considered as important) and a better starting guess (via "pars_file.pars").
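If you would rather do the weighted fit in Python, scipy's curve_fit accepts per-point sigmas; passing sigma proportional to y reproduces the 1/y**2 weighting suggested above (the data here are synthetic):
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x**b

# Synthetic power-law data spanning many decades
x = np.logspace(0, 4, 50)
y = 100.0 * x**-4 * (1 + np.random.normal(0, 0.05, x.size))

# curve_fit minimises sum(((y - f(x)) / sigma)**2), so sigma = y
# weights each point by 1/y**2, i.e. by relative error
popt, pcov = curve_fit(f, x, y, p0=[100, -3], sigma=y)
print(popt)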

D3 - Difference between basis and linear interpolation in SVG line

I implemented a multi-series line chart like the one given here by M. Bostock and ran into a curious issue which I cannot explain. When I choose linear interpolation and set my scales and axes, everything is correct and the values are well-aligned.
But when I change the interpolation to basis, without any modification of the axes and scales, the values no longer line up between the lines and the axes.
What is happening here? With the monotone setting I can achieve pretty much the same effect as the basis interpolation, but without the syncing problem between lines and axes. Still, I would like to understand what is happening.
The basis interpolation implements a B-spline, which people like to use as an interpolation function precisely because it smooths out extreme peaks. This is useful when you are modelling something you expect to vary smoothly but only have sharp, infrequently sampled data. A consequence of this is that the resulting line does not pass through all the data points, changing the appearance of extreme values.
In your case, the sharp peaks are the interesting features, the exception to the typically 0 baseline value. When you use a spline interpolation, you are smoothing over these peaks.
Here is a fun demo for playing with the different types of line interpolations:
http://bl.ocks.org/mbostock/4342190
You can drag the data around so they resemble a sharp peak like yours, even click to add new points. Then, switch to a basis interpolation and watch the peak get averaged out.
