Getting distance to the hyperplane from sklearn's svm.SVC

I'm currently using SVC to separate two classes of data (the features below are named data and the labels condition). After fitting with GridSearchCV I get a classification score of about 0.7, and I'm fairly happy with that number. After that, though, I wanted to get the relative distances from the hyperplane for data from each class using grid.best_estimator_.decision_function(), and to plot them in a boxplot and a histogram to get a better idea of how much overlap there is. My problem is that in the histogram and the boxplot the classes look perfectly separable, which I know is not the case. I'm sure I'm calling decision_function() incorrectly, but I'm not sure how to do this properly.
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

svc = SVC(kernel='linear', probability=True, decision_function_shape='ovr')
cv = KFold(n_splits=4, shuffle=True)
C_range = [.001, .005, .01, .05, .1, .5, 1, 5, 10, 50, 100]
param_grid = dict(C=C_range)
grid = GridSearchCV(svc, param_grid=param_grid, cv=cv, n_jobs=4, refit=True)
grid.fit(data, condition)
print(grid.best_params_)
print(grid.best_score_)
x = grid.best_estimator_.decision_function(data)
plt.hist(x)
sb.boxplot(x=condition, y=x)
sb.swarmplot(x=condition, y=x)
In the histogram and box plot it looks like almost all of the points have a distance of exactly +1 or -1, with nothing in between.
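One thing worth checking: for a linear kernel, decision_function() returns the signed functional margin w·x + b, not the geometric distance, and margin support vectors sit at exactly ±1 by construction; dividing by ||w|| converts the values to actual distances, and plotting each class separately makes the overlap visible. A minimal sketch, with synthetic stand-ins for the question's data and condition:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Synthetic stand-ins for the question's `data` and `condition`
rng = np.random.RandomState(0)
data = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
condition = np.array([0] * 50 + [1] * 50)

svc = SVC(kernel='linear', C=1).fit(data, condition)

# decision_function gives w.x + b; divide by ||w|| to get the
# geometric distance to the separating hyperplane
distances = svc.decision_function(data) / np.linalg.norm(svc.coef_)

plt.hist([distances[condition == 0], distances[condition == 1]],
         bins=20, label=['class 0', 'class 1'])
plt.axvline(0, color='k')
plt.legend()
plt.show()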

Related

How do I visualise orthogonal parameter steps in gradient descent, using Matplotlib?

I have implemented multivariate linear regression, where the parameters theta0 (intercept), theta1, and theta2 are optimized by minimizing the MSE loss, with step sizes chosen by line search in gradient descent. How do I visually illustrate the mathematical property that the directions of steepest descent (negative gradients) of successive steps are orthogonal? I'm trying to generate a contour map similar to the linked image, but with respect to 2 parameters instead of 1 (if that's not possible, 2 separate plots would also be great).
Also, I originally wanted to perform multivariate linear regression with 4 features, but ultimately decided to use only the 2 most strongly correlated ones (after comparing their PCC) in order to be able to plot a graph. Although I'm not aware of any way to plot 4-dimensional data, does anyone know if this is possible and how?
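One way to illustrate the orthogonality (a sketch under illustrative assumptions, not the asker's data): run gradient descent with exact line search on a two-parameter least-squares loss and overlay the parameter path on a contour map of the loss. With exact line search on a quadratic, consecutive step directions are orthogonal, which shows up as right-angle zigzags in the path.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative 2-feature data (intercept omitted for a clean 2-D
# parameter space); correlated features elongate the contours
rng = np.random.RandomState(0)
X = rng.randn(100, 2) @ np.array([[2.0, 0.5], [0.5, 1.0]])
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.randn(100)

def mse(theta):
    r = X @ theta - y
    return (r @ r) / len(y)

# Gradient descent with exact line search (closed form for quadratics)
theta = np.array([4.0, 4.0])
path = [theta]
A = 2 * X.T @ X / len(y)                     # Hessian of the MSE
for _ in range(15):
    g = 2 * X.T @ (X @ theta - y) / len(y)   # gradient
    alpha = (g @ g) / (g @ A @ g)            # exact line-search step size
    theta = theta - alpha * g
    path.append(theta)
path = np.array(path)

# Contour map of the loss with the descent path overlaid
t1, t2 = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-5, 5, 200))
Z = np.array([[mse(np.array([a, b])) for a, b in zip(r1, r2)]
              for r1, r2 in zip(t1, t2)])
plt.contour(t1, t2, Z, levels=30)
plt.plot(path[:, 0], path[:, 1], 'o-', color='red')
plt.xlabel('theta1')
plt.ylabel('theta2')
plt.show()

For 4 features there is no direct 4-D contour plot; a common workaround is a grid of pairwise 2-D contour slices with the remaining parameters held fixed.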

How to classify arrays using svm?

I want to make a model that can differentiate between general functions, e.g. whether a given set of points falls on a line, a parabola, etc.
I am not able to train an SVC directly on the arrays, because it expects input of 2-D shape.
Any suggestions?
Note: eventually I want to extend this to classifying periodic functions given a set of data points.
Okay, so your input is an array of points, each point has coordinates (x, y), and your label is the type of function.
In math, recovering a function from a set of points is called interpolation: you are given a set of points and return a function that passes through them.
What you are describing sounds more like non-linear regression (curve fitting) than classification: you would have far too many classes to cover, and it doesn't really make sense to frame it that way anyway.
Here is a Python tutorial on robust non-linear regression that should be more useful: https://scipy-cookbook.readthedocs.io/items/robust_regression.html
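That said, if the goal really is a small fixed set of candidate families (line, parabola, ...), one simple baseline, sketched here with hypothetical names, is to fit each candidate with scipy.optimize.curve_fit and return the simplest family whose residual error is small:

import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

def parabola(x, a, b, c):
    return a * x**2 + b * x + c

def classify_points(x, y, tol=1e-6):
    # Try candidates from simplest to most complex and return the
    # first whose mean squared residual is below tol; with noisy
    # data the tolerance would need to reflect the noise level.
    for label, f in [('line', line), ('parabola', parabola)]:
        params, _ = curve_fit(f, x, y)
        if np.mean((f(x, *params) - y) ** 2) < tol:
            return label
    return 'unknown'

x = np.linspace(-5, 5, 50)
print(classify_points(x, 2 * x + 1))       # line
print(classify_points(x, x**2 - 3 * x))    # parabola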

Looking for a detailed method to plot contours of confidence level

I'm trying to find a method or a tutorial that explains how to plot contours of different confidence levels (68%, 95%, 99.7%, etc.).
Below is an example of the kind of contours I would like to generate:
It represents the constraints on cosmological parameters (\Omega_\Lambda represents dark energy and \Omega_m the total matter density).
Once I have data sets for \Omega_\Lambda and \Omega_m, how can I produce these contours? I know what a confidence level is, but I only know the standard deviations.
If I plot the standard deviation of each parameter around its expected value, I get a cross (horizontal for \Omega_m and vertical for \Omega_\Lambda): but starting from this cross, how do I draw the contours at different confidence levels?
In the figure above, the contours look like a 2-D parametric curve with points (\Omega_\Lambda(t), \Omega_m(t)) for some parameter t, but I don't think they are drawn that way.
You might want to check out Matplotlib's contour plot: the levels parameter seems to be what you need.
The plots in your example are not obtained from raw data, but from a statistical model of raw data. So you could first fit multivariate normal distributions to your data using numpy.mean and numpy.cov, then generate the multivariate normal pdf values with scipy.stats.multivariate_normal. You can also find a code snippet doing confidence ellipses here (which seems to be exactly the kind of thing you were looking for).
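As a rough sketch of that recipe (illustrative sample data; for a 2-D Gaussian, the density level enclosing probability p follows from the chi-squared distribution with 2 degrees of freedom):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2, multivariate_normal

# Illustrative correlated samples standing in for (Omega_m, Omega_Lambda)
rng = np.random.RandomState(0)
samples = rng.multivariate_normal([0.3, 0.7],
                                  [[0.010, -0.006], [-0.006, 0.010]], 5000)

mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)
dist = multivariate_normal(mu, cov)

# The contour enclosing probability p lies at Mahalanobis radius r with
# r^2 = chi2.ppf(p, df=2), i.e. at pdf level peak * exp(-r^2 / 2)
peak = dist.pdf(mu)
levels = sorted(peak * np.exp(-chi2.ppf(p, df=2) / 2)
                for p in (0.68, 0.95, 0.997))

x, y = np.meshgrid(np.linspace(0.0, 0.6, 200), np.linspace(0.4, 1.0, 200))
z = dist.pdf(np.dstack((x, y)))
plt.contour(x, y, z, levels=levels)
plt.scatter(samples[:, 0], samples[:, 1], s=2, alpha=0.2)
plt.xlabel(r'$\Omega_m$')
plt.ylabel(r'$\Omega_\Lambda$')
plt.show()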

Fit log-log data with gnuplot

I'm trying to fit this data; as you can see, the fit is not very good.
My code is:
clear
reset
set terminal pngcairo size 1000,600 enhanced font 'Verdana,10'
set output 'LocalEnergyStepZoom.png'
set ylabel '{/Symbol D}H/H_0'
set xlabel 'n_{step}'
set format y '%.2e'
set xrange [*:*]
set yrange [1e-16:*]
f(x) = a*x**b
fit f(x) "revErrEnergyGfortCaotic.txt" via a,b
set logscale
plot 'revErrEnergyGfortCaotic.txt' w p,\
'revErrEnergyGfortRegular.txt' w p,\
f(x) w l lc rgb "black" lw 3
exit
So the question is: what mistake am I making here? I assumed that a fit of this form should represent the data very well in a log-log plane.
Thanks a lot
I was finally able to solve the problem using the suggestion in Christop's answer, modified just a bit.
I found the approximate slope of the function (somewhere near -4), fixed that parameter, and fit the curve with only a free; having found a, I fixed it and fit only b. Then, using that output as the starting solution for the full fit, I found the best fit.
You must find appropriate starting values to get a correct fit, because that kind of fitting doesn't have one global solution.
If you don't define a and b, both are set to 1 which might be too far away. Try using
a = 100
b = -3
for a better start. Maybe you need to tweak those values a bit more; I couldn't check, because I don't have the data file.
Also, you might want to restrict the region of the fitting to the part above 10:
fit [10:] f(x) "revErrEnergyGfortCaotic.txt" via a,b
Of course, only if that is appropriate.
This is a common issue in data analysis, and I'm not certain if there's a nice Gnuplot way to solve it.
The issue is that the penalty functions in standard fitting routines are typically the sum of squares of errors, and try as you might, if your data have a lot of dynamic range, the errors for the smallest y-values come out to essentially zero from the point of view of the algorithm.
I recently taught a course to students where they needed to fit such data. Lots of them beat their (matlab) fitting routines into submission by choosing very stringent convergence criteria, but even this did not help too much.
What you really need to do, if you want to fit this power-law tail well, is to convert the data into log-log form and run a linear regression on that log-log representation.
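For what it's worth, that linearization is a couple of lines outside gnuplot as well; a sketch in Python with made-up power-law data (log y = log a + b log x, so a straight-line fit in log-log space recovers both parameters, and every decade contributes equally to the squared error):

import numpy as np

# Made-up power-law data spanning several decades, y ~ a * x**b
x = np.logspace(0, 4, 50)
y = 100.0 * x**-4 * np.exp(0.05 * np.random.randn(50))

# Straight-line fit in log-log space: slope is b, intercept is log a
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
print('a =', np.exp(log_a), 'b =', b)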
The main problem here is that the residual errors of the function values of the higher x are very small compared to the residuals at lower x values. After all, you almost span 20 orders of magnitude on the y axis.
Just weight the y values with 1/y**2, or even better: if you have the standard deviations of your data points weight the values with 1/std**2. Then the fit should converge much much better.
In gnuplot, weighting is done with a third data column holding each point's standard deviation s (the weight is 1/s**2), so passing the y value itself gives the 1/y**2 weights:
fit f(x) 'data' using 1:2:($2) via ...
Or you can use Raman Shah's advice and linearize the y axis and do a linear regression.
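The same weighting can be cross-checked outside gnuplot; in scipy's curve_fit, for instance, the sigma argument holds the standard deviations, so passing the y values reproduces the 1/y**2 weights (a sketch with made-up data):

import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x**b

x = np.logspace(0, 4, 50)
y = 100.0 * x**-4 * (1 + 0.05 * np.random.randn(50))

# sigma=y minimizes sum(((y - f(x)) / y)**2), i.e. 1/y**2 weights
params, _ = curve_fit(f, x, y, p0=(100, -3), sigma=y)
print(params)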
You need to use weights for your fit (currently the low values are not treated as important) and a better starting guess (loaded via "pars_file.pars").

Scipy gaussian_kde Normalisation

I've been using scipy.stats.gaussian_kde but have a few questions about its output. I've plotted the normalised histogram and the gaussian_kde curve on the same graph. Why are the y-values so vastly different? My understanding is that the gaussian_kde curve should roughly touch the tips of the histogram bars. Using scipy.integrate.quad I determined the area under the curve to be 0.7 rather than the 1.0 I expected.
Actually, what I really want is for the gaussian_kde to represent the non-normalised histogram; does anyone know how I can do that?
Your expectations are a little off. The area under each of the KDE's peaks should roughly equal the area in their corresponding bars. That appears to hold, to my eye. Nonadaptive KDEs with a global bandwidth estimate (like scipy.stats.gaussian_kde) tend to broaden multimodal distributions with sharp peaks.
As for the underestimate of the total area under the KDE, I cannot say without the data and the code that you used to do the integration.
In order to make a KDE approximate an unnormalized histogram, you need to multiply by (bin_width*N) where N is the total number of data points.
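A quick sketch of that rescaling with made-up data (bin_width * N as described above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)
sample = np.concatenate([rng.normal(0, 1, 800), rng.normal(5, 0.5, 200)])

counts, edges, _ = plt.hist(sample, bins=30, alpha=0.5)  # unnormalised
bin_width = edges[1] - edges[0]
N = len(sample)

kde = gaussian_kde(sample)
grid = np.linspace(edges[0], edges[-1], 400)
# The KDE integrates to 1; multiplying by bin_width * N matches its
# height to the raw counts of the unnormalised histogram
plt.plot(grid, kde(grid) * bin_width * N, lw=2)
plt.show()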
