I'm using scikit-learn's LinearSVC SVM implementation, and I'm trying to understand the multi-class prediction. Looking at coef_ and intercept_ I can get the hyperplane weights. For example, on my learning problem with two features and four labels I get
f0 = 1.99861379*x1 - 0.09489263*x2 + 0.89433196
f1 = -2.04309715*x1 - 3.51285420*x2 - 3.1206355
f2 = 0.73536996*x1 + 2.52111207*x2 - 3.04176149
f3 = -0.56607817*x1 - 0.16981337*x2 - 0.92804815
When I use the decision_function method I get the values that correspond to the above functions. But the documentation says
The confidence score for a sample is the signed distance of that
sample to the hyperplane.
But decision_function does not return the signed distance; it just returns the raw value of f().
To be more specific, I'm assuming that LinearSVC uses the standard trick of adding a constant-1 feature to represent the threshold. (This might be wrong.) For my example problem this gives a three-dimensional feature space where instances are always of the form (1, x1, x2). Assuming no other threshold term, the algorithm learns a hyperplane w = (w0, w1, w2) that goes through the origin in this three-dimensional space. Now I get a point to predict, call it z = (1, a, b). What is the signed distance (margin) of this point to the hyperplane? It's dot(w, z) / norm2(w), i.e. the dot product divided by the 2-norm of w, yet the LinearSVC code returns just dot(w, z).
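To illustrate what I mean, here is the conversion I would have expected to do myself, sketched on synthetic data (since I can't share mine); the norms of the rows of coef_ are what decision_function does not seem to divide by:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 2-feature, 4-label problem described above.
X, y = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)
clf = LinearSVC(max_iter=10000).fit(X, y)

raw = clf.decision_function(X)              # w . x + b, one column per class
norms = np.linalg.norm(clf.coef_, axis=1)   # ||w|| for each class hyperplane
signed_dist = raw / norms                   # geometric signed distance to each hyperplane

print(raw[0])
print(signed_dist[0])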
Thanks,
Chris
Related
I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give different AUC scores, which suggests that one is not a monotonic transformation of the other, as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X), and the weights of each attribute are determined by maximizing the class separation. Subsequently, a threshold value in this 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space, x' = transform(X). decision_function(X) gives you the log-likelihood of an observation belonging to the positive class, log(P(y=1|x')).
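A minimal sketch contrasting the two calls on a synthetic binary problem (the data and the exact values are illustrative only; your features will of course give different numbers):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Binary toy problem standing in for the high-dimensional feature set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)

proj = lda.transform(X)            # 1D projection chosen to maximize class separation
score = lda.decision_function(X)   # classification score that predict/predict_proba are based on

print(proj[:3].ravel())
print(score[:3])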
I'm exploring the scikit-learn logistic regression algorithm. I understand that as part of the training, the algorithm builds a regression curve where the y-variable ranges from 0 to 1 (a sigmoid S-curve). The y-variable is a continuous variable here (although in reality it is a discrete variable).
How is the algorithm able to learn the S-curve when the training dataset reflects reality and includes the y-variable only as a discrete variable? There is no probability estimate in the training, so I'm wondering how the algorithm is able to learn the S-curve.
There is no probability estimate in the training
Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.
Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point. We have to balance concerns between these. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for all of the data points, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
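To make the "predict the observed response with probability 1" idea concrete, here is a small sketch of the quantity being considered for a single data point; the weights and bias are made up purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One data point with two features; the weights and bias are hypothetical.
x = np.array([0.5, -1.2])
w = np.array([2.0, 0.7])
b = -0.3

p = sigmoid(np.dot(w, x) + b)   # the model's predicted probability that y = 1

# The likelihood of the observed (discrete) response is p if y == 1, else 1 - p.
for y in (0, 1):
    likelihood = p if y == 1 else 1.0 - p
    print(f"observed y={y}: likelihood={likelihood:.3f}, negative log-likelihood={-np.log(likelihood):.3f}")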
Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients minimizing the discrepancy between the probabilities predicted by the model (continuous values) and the observed data (discrete values).
In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):
XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
The algorithm tries to find the beta coefficients that maximize the likelihood of the observed labels (Maximum Likelihood estimation). Equivalently, it minimizes a cost function; for logistic regression the cost corresponding to maximum likelihood is the negative log-likelihood (log loss):
sum -( y_i * log(P(X_i)) + (1 - y_i) * log(1 - P(X_i)) )
An initial set of betas is picked at random, the cost is calculated, and the algorithm picks a new set of betas that results in a lower cost. The algorithm stops searching for new betas when the decrease in cost is smaller than a given threshold (set by the tol parameter in sklearn).
The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
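As a small, hedged illustration of the whole process: the labels fed to the solver are strictly 0/1, yet the fitted model still produces a smooth S-curve of probabilities over the feature axis (synthetic single-feature data, for illustration only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Discrete 0/1 labels generated from a single synthetic feature.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (x.ravel() + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(x, y)   # trained only on the discrete labels

# The learned model is nevertheless a smooth S-curve over the feature axis.
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
for xi, pi in zip(grid.ravel(), clf.predict_proba(grid)[:, 1]):
    print(f"x = {xi:+.1f}   P(y=1) = {pi:.3f}")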
I'm conducting multiple linear regression in Python with scikit-learn. To the best of my knowledge, r2_score is supposed to be in the range of -1 to 1, but I obtained -18.709.
What is causing this result, and how can I correct it? My code and results are as follows:
# calculate R^2
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print(score)
The output:
-18.7097
The prediction result is as follows:
y_pred = model.predict(X_test)
print(y_pred)
Result:
[ 25000. 123000. 73000. 103000.]
The coefficient of determination R^2 is defined as
R^2 = 1 - SS_res / SS_tot = 1 - sum_i (y_i - yhat_i)^2 / sum_i (y_i - ybar)^2
and is also known as the Nash–Sutcliffe model efficiency coefficient (explanation below).
There are cases where the computational definition of R2 can yield
negative values, depending on the definition used. This can arise when
the predictions that are being compared to the corresponding outcomes
have not been derived from a model-fitting procedure using those data.
Even if a model-fitting procedure has been used, R2 may still be
negative, for example when linear regression is conducted without
including an intercept, or when a non-linear function is used to fit
the data. In cases where negative values arise, the mean of the data
provides a better fit to the outcomes than do the fitted function
values, according to this particular criterion. Since the most general
definition of the coefficient of determination is also known as the
Nash–Sutcliffe model efficiency coefficient, this last notation is
preferred in many fields, because denoting a goodness-of-fit indicator
that can vary from −∞ to 1 (i.e., it can yield negative values) with a
squared letter is confusing.
SOURCE: wikipedia
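A tiny illustration (with made-up numbers) of why r2_score can drop below zero: predicting the mean of the targets gives R^2 = 0, and anything that fits worse than the mean is negative, with no lower bound:

import numpy as np
from sklearn.metrics import r2_score

# Made-up values purely to illustrate the sign behaviour of R^2.
y_true = np.array([3.0, -0.5, 2.0, 7.0])

# Predicting the mean of y_true gives R^2 = 0 by definition.
y_mean = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, y_mean))    # 0.0

# Predictions worse than the mean give a negative R^2, with no lower bound.
y_bad = np.array([10.0, -8.0, 15.0, -20.0])
print(r2_score(y_true, y_bad))     # strongly negative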
I was planning to use sklearn.linear_model to plot a graph of the linear regression result, and statsmodels.api to get a detailed summary of the fitted model. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's to x for the constant term when using both methods.) My code for both methods is succinct:
import statsmodels.api as sm
from sklearn import linear_model

# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
By making a 3-d plot using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it will be very helpful to fix the issues in the sklearn linear regression implementation.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
i.e. it fits an intercept by default. This could explain the difference you see in the constant term.
Now for the other coefficients, differences can occur when two of the variables are highly correlated. Consider the most extreme case where two of your columns are identical: reducing the coefficient in front of either one can be compensated by increasing the other. This is the first thing I'd check.
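A hedged sketch of the intercept effect on synthetic data with a large constant term, loosely mimicking the numbers in the question; either dropping the ones column or passing fit_intercept=False makes the two libraries agree:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Synthetic data with a large constant term, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 48.6 + 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.1, size=100)

X1 = sm.add_constant(x)                 # explicit column of ones, as in the question
print(sm.OLS(y, X1).fit().params)       # approx. [48.6, 3.0, -2.0]

# With fit_intercept=True (the default) AND an explicit ones column, sklearn centres
# the data, so the ones column gets a coefficient near zero and the constant ends up
# in intercept_ instead.
lr = LinearRegression().fit(X1, y)
print(lr.coef_, lr.intercept_)

# Turning the built-in intercept off (or dropping the ones column) restores agreement.
lr_no_icpt = LinearRegression(fit_intercept=False).fit(X1, y)
print(lr_no_icpt.coef_)                 # now matches the statsmodels params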
I'm studying a scikit-learn example (Classifier comparison) and got confused with predict_proba and decision_function.
They plot the classification results by drawing the contours using either Z = clf.decision_function(), or Z = clf.predict_proba().
What's the difference between these two? Does each classification method have one of the two as its score?
Which one is more appropriate for interpreting the classification result, and how should I choose between the two?
The latter, predict_proba, is a method of a (soft) classifier outputting the probability of the instance being in each of the classes.
The former, decision_function, finds the distance to the separating hyperplane. For example, an SVM classifier finds hyperplanes separating the space into areas associated with classification outcomes. This function, given a point, finds the distance to those separators.
I'd guess that predict_proba is more useful in your case, in general; the other method is more specific to the algorithm.
Your example is
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
so the code uses decision_function if it exists. In the SVM case, predict_proba is computed (in the binary case) using Platt scaling, which is both "expensive" and has "theoretical issues". That's why decision_function is used here: as @Ami said, it is the margin, the distance to the hyperplane, which is accessible without much further computation. In the SVM case, it is advised to use decision_function instead of predict_proba.
There are other decision_functions, e.g. SGDClassifier's. There, predict_proba is only available for certain loss functions, while decision_function is always available.
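A quick hedged sketch of the two on the same estimator, using SGDClassifier with a logistic loss so that both methods exist (loss="log_loss" in recent scikit-learn versions, loss="log" in older ones):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# With a logistic loss both methods exist; with the default hinge loss only
# decision_function would be available.
clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

scores = clf.decision_function(X[:3])   # signed scores, any real value
probs = clf.predict_proba(X[:3])        # probabilities in [0, 1], each row sums to 1

print(scores)
print(probs)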