What is the difference between LinearSVC and SVC(kernel="linear")? - scikit-learn

I found sklearn.svm.LinearSVC and sklearn.svm.SVC(kernel='linear') and they seem very similar to me, but I get very different results on Reuters.
sklearn.svm.LinearSVC: 81.05% in 28.87s train / 9.71s test
sklearn.svm.SVC : 33.55% in 6536.53s train / 2418.62s test
Both have a linear kernel. Note that the default tolerance of LinearSVC (0.0001) is lower than that of SVC (0.001):
LinearSVC(C=1.0, tol=0.0001, max_iter=1000, penalty='l2', loss='squared_hinge', dual=True, multi_class='ovr', fit_intercept=True, intercept_scaling=1)
SVC (C=1.0, tol=0.001, max_iter=-1, shrinking=True, probability=False, cache_size=200, decision_function_shape=None)
How do both functions differ otherwise? Even if I set kernel='linear', tol=0.0001, max_iter=1000 and decision_function_shape='ovr', the SVC takes much longer than LinearSVC. Why?
I use sklearn 0.18 and both are wrapped in OneVsRestClassifier. I'm not sure whether this is equivalent to setting multi_class='ovr' / decision_function_shape='ovr'.

Indeed, LinearSVC and SVC(kernel='linear') yield different results, i.e. different metric scores and decision boundaries, because they use different underlying approaches. The toy example below demonstrates it:
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC, SVC
X, y = load_iris(return_X_y=True)
clf_1 = LinearSVC().fit(X, y)  # it is also possible to set loss='hinge' here
clf_2 = SVC(kernel='linear').fit(X, y)
score_1 = clf_1.score(X, y)
score_2 = clf_2.score(X, y)
print('LinearSVC score %s' % score_1)
print('SVC score %s' % score_2)
--------------------------
>>> LinearSVC score 0.96666666666666667
>>> SVC score 0.98666666666666669
The key principles of that difference are the following:
By default, LinearSVC minimizes the squared hinge loss, while SVC minimizes the regular hinge loss. It is possible to specify loss='hinge' manually in LinearSVC.
LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction, while SVC uses the One-vs-One multiclass reduction; this is also noted in the scikit-learn documentation. Consequently, for a multiclass problem with N classes, SVC fits N * (N - 1) / 2 models, whereas LinearSVC simply fits N models. If the classification problem is binary, only one model is fit in either case. The multi_class and decision_function_shape parameters have nothing in common: the latter is merely an aggregator that reshapes the decision function output into the convenient shape (n_samples, n_classes), while multi_class determines the algorithmic approach used to find a solution. The sketch below illustrates the model counts.
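As a minimal illustration on an assumed synthetic 4-class dataset, the number of columns in SVC's decision function output reveals how many pairwise models were fit versus how the 'ovr' aggregator reshapes them:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class problem: OvO fits N*(N-1)/2 = 6 pairwise models,
# while the 'ovr' aggregator reshapes the output to N = 4 columns.
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
ovr = SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)
print(ovo.decision_function(X).shape)  # (200, 6)
print(ovr.decision_function(X).shape)  # (200, 4)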
LinearSVC is based on liblinear, which does in fact penalize the intercept; SVC is based on libsvm, which does not. liblinear is optimized for the linear (special) case and thus converges faster on large amounts of data than libsvm. That is why LinearSVC takes much less time to solve the problem, as the rough timing sketch below shows.
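A rough timing sketch on synthetic data (absolute numbers will vary with your machine; this only shows the qualitative gap between the two solvers):

import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# A moderately large synthetic problem to expose the scaling difference.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

for clf in (LinearSVC(), SVC(kernel='linear')):
    start = time.perf_counter()
    clf.fit(X, y)
    print(type(clf).__name__, f'{time.perf_counter() - start:.1f}s')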
In fact, as was pointed out in the comments on the original answer, LinearSVC is not strictly a standard linear SVM, because the intercept is penalized and only mitigated by intercept scaling.

The main difference between them is that LinearSVC only lets you choose a linear classifier, whereas SVC lets you choose among a variety of non-linear kernels. However, SVC is not recommended for large non-linear problems because it is very slow; consider other libraries for large-scale non-linear classification.
The reason you don't get the same output even after setting kernel='linear' is that LinearSVC and SVC take different approaches in the underlying mathematics. Also, LinearSVC works on the one-vs-rest principle, while SVC works one-vs-one.
I hope this answers your question.

Related

Why does LogisticRegression give the same result every time, even with different random state?

I am not an expert on logistic regression, but I thought that when solving it with lbfgs it was doing optimization, finding local minima of the objective function. But every time I run it using scikit-learn, it returns the same results, even when I feed it a different random state.
Below is code that reproduces my issue.
First, set up the problem by generating data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import datasets

# generate data
X, y = datasets.make_classification(n_samples=1000,
                                    n_features=10,
                                    n_redundant=4,
                                    n_clusters_per_class=1,
                                    random_state=42)

# set up the test/training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Second, train the model and inspect results
# set up a different random state each time
rand_state = np.random.randint(1000)
print(rand_state)

model = LogisticRegression(max_iter=1000,
                           solver='lbfgs',
                           random_state=rand_state)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
conf_mat = metrics.confusion_matrix(y_test, y_pred)
print(y_pred[:20], "\n", conf_mat)
I get the same y_pred (and obviously the same confusion matrix) every time I run this, even though I'm using the lbfgs solver with a different random state each run. I'm confused, as I thought this was a stochastic solver that was traveling down a gradient into a local minimum.
Maybe I'm not properly randomizing the initial state? I haven't been able to figure it out from the documentation.
Discussion of Related Question
There is a related question, which I didn't find during my research:
Does logistic regression always find global optimum, assuming that the optimisation converges?
The answer there is that the cost function is convex, so if the numerical solution is well-behaved, it will find a global minimum. That is, there aren't a bunch of local minima that your optimization algorithm will get stuck in: it will reach the same (global) minimum each time (perhaps depending on the solver you choose?).
However, someone pointed out in the comments that, depending on which solver you choose, there are cases where you will not reach the same solution; it depends on the random_state parameter. At the very least, I think it would be helpful to resolve this.
First, let me put in the answer what got this closed as duplicate earlier: a logistic regression problem (without perfect separation) has a global optimum, and so there are no local optima to get stuck in with different random seeds. If the solver converges satisfactorily, it will do so on the global optimum. So the only time random_state can have any effect is when the solver fails to converge.
Now, the documentation for LogisticRegression's parameter random_state states:
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. [...]
So for your code, with solver='lbfgs', indeed there is no expected effect.
It's not too hard to make sag and saga fail to converge and, with different random_states, end at different solutions; to make it easier, set max_iter=1. liblinear apparently does not use the random_state unless solving the dual problem, so also setting dual=True admits different solutions. I found that thanks to a comment on a GitHub issue (the rest of the issue may be worth reading for more background). A minimal demonstration follows.
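Here is a minimal sketch of that behavior, assuming the same kind of synthetic data as above. With max_iter=1 the saga solver cannot converge, so the data shuffling driven by random_state leaves a visible mark on the coefficients:

import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

for seed in (0, 1, 2):
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # suppress the ConvergenceWarning
        model = LogisticRegression(solver='saga', max_iter=1,
                                   random_state=seed).fit(X, y)
    print(seed, model.coef_[0][:3])  # different seeds, different coefficients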

Why SVC, NuSVC and LinearSVC are producing very different results?

I am working on a classification task — geolocation of Twitter users based on their tweets.
I did many experiments using sklearn's SVC, NuSVC and LinearSVC with a bag-of-words model. The accuracies are 35%, 60% and 80% respectively. The difference between SVC and LinearSVC is more than double, which is shocking.
I am not quite sure why this is happening exactly. It might be because of overfitting or underfitting? Why is there so much difference between the classifiers?
In general, non-linear kernels are more suitable for modeling complex functions than linear kernels, but it depends on the data, the chosen hyperparameters (e.g. penalty and kernel) and how you evaluate your results.
LinearSVC
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Source: scikit-learn documentation, sklearn.svm.LinearSVC
SVC
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
Source: scikit-learn documentation, sklearn.svm.SVC
First you should test a LinearSVC model, because it has just a few hyperparameters and will give you a first result. After that you can try to train a bunch of SVC models and pick the best one. For that I recommend a grid search over C, kernel, degree, gamma, coef0 and tol, along the lines of the sketch below.
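A minimal grid-search sketch; the parameter grids here are illustrative assumptions rather than tuned recommendations, and X_train/y_train stand in for your bag-of-words features and labels:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# degree and coef0 only matter for the 'poly' kernel but are harmless
# to include; GridSearchCV simply tries every combination.
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'degree': [2, 3],
    'gamma': ['scale', 0.01],
    'coef0': [0.0, 1.0],
    'tol': [1e-3],
}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)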

In scikit-learn Stochastic Gradient Descent classifier, how to find the most influential independent variables?

I do this:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, accuracy_score

sgclass = SGDClassifier(random_state=10)
sgclass.fit(X_train, y_train)
pred = sgclass.predict(X_test)

print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))
These are useful reports on the recall and precision of the model.
However, how do I find the most influential independent variables for predicting the dependent variable? I started with about 12 candidates and want to see their rank order of influence in the model.
As the documentation specifies, you can use the coef_ attribute to get the feature weights. The greater the absolute value of a feature's coefficient, the more influential that feature is.
You can see this in action in scikit-learn's feature selection class SelectFromModel, which selects the best features from any classifier that exposes a feature_importances_ or coef_ attribute. A small ranking sketch follows.
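For example, a minimal sketch of ranking your ~12 candidates by absolute weight; feature_names is an assumed list of your column names, and the single-row indexing assumes a binary target:

import numpy as np

# One weight per feature in the binary case; for multiclass, coef_ has
# one row per class and you could rank by np.abs(coef_).max(axis=0).
weights = sgclass.coef_[0]
ranking = np.argsort(np.abs(weights))[::-1]
for i in ranking:
    print(f'{feature_names[i]:<20} {weights[i]: .4f}')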

Balanced Random Forest in scikit-learn (python)

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the classes by undersampling the majority class. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information please refer to this paper.
RandomForestClassifier() does have a class_weight= parameter, which can be set to 'balanced', but I'm not sure that this amounts to downsampling the bootstrapped training samples.
What you're looking for is the BalancedBaggingClassifier from imblearn.
imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
                                            n_estimators=10, max_samples=1.0,
                                            max_features=1.0, bootstrap=True,
                                            bootstrap_features=False,
                                            oob_score=False, warm_start=False,
                                            ratio='auto', replacement=False,
                                            n_jobs=1, random_state=None,
                                            verbose=0)
Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any base estimator from scikit-learn. Here is an example.
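A minimal usage sketch; X_train/y_train stand in for your imbalanced data, and note that newer imblearn versions rename base_estimator to estimator and replace ratio with sampling_strategy:

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each bagged tree is trained on a bootstrap sample that has been
# rebalanced by undersampling the majority class.
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                n_estimators=10,
                                random_state=0)
bbc.fit(X_train, y_train)
print(bbc.score(X_test, y_test))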
There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier, but is specifically tailored to random forests:
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

Different Linear Regression Coefficients with statsmodels and sklearn

I was planning to use sklearn's linear_model to plot a graph of the linear regression result, and statsmodels.api to get a detailed summary of the learning result. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's to x for the constant term when using both methods.) My code for both methods is succinct:
import statsmodels.api as sm
from sklearn import linear_model

# Use statsmodels linear regression to get a detailed summary of the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results

# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
A 3-d plot made using PCA suggests that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it would be very helpful to fix the issue with my sklearn linear regression usage.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
it fits an intercept by default. So when you also add a column of 1's, the design matrix contains a redundant constant column: sklearn absorbs the constant into its own intercept_ and leaves a near-zero coefficient (your 7.8e-14) on the 1's column, while statsmodels puts the full constant (48.6) on it. This could explain the difference you see in the constant term. A sketch of consistent intercept handling follows.
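A minimal sketch on synthetic data showing that the two packages agree once the intercept is handled consistently, i.e. only one of the two mechanisms (the explicit constant column or fit_intercept) is used:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = 2.0 * x[:, 0] - 1.0 * x[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# statsmodels: add the constant column explicitly.
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# sklearn: do NOT add a 1's column; let fit_intercept handle the constant.
lr = LinearRegression().fit(x, y)
print(lr.intercept_, lr.coef_)  # matches the statsmodels parameters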
Now for the other coefficients: differences can occur when two of the variables are highly correlated. Consider the most extreme case, where two of your columns are identical; then reducing the coefficient on either one can be compensated by increasing the other, so the individual values are not identified. This is the first thing I'd check, e.g. with the sketch below.
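A minimal illustration of that extreme case, on synthetic data with a duplicated column:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 1))
X = np.hstack([a, a])                 # two identical columns
y = 3.0 * a[:, 0] + rng.normal(scale=0.01, size=200)

lr = LinearRegression().fit(X, y)
# Only the sum of the two coefficients is identified; how it is split
# between them is arbitrary (lstsq returns the minimum-norm solution).
print(lr.coef_, lr.coef_.sum())       # e.g. [1.5 1.5], sum ~ 3.0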
