Correct syntax to "fit" a classifier in sklearn? - scikit-learn

I am trying following simple example:
from sklearn import datasets, svm
iris = datasets.load_iris()
clf = svm.SVC(random_state=0)
For fitting, should I use following statement:
clf = clf.fit(iris.data, iris.target)
Or just:
clf.fit(iris.data, iris.target)
Both above formats have been used in different places, so I am confused.
The first method clf = clf(X,y).fit() seems to be the official version (see here).
I have tried and found that both seem to work, but I may be missing some point here.

They both perform the same. The fit() function modifies the states of the estimator and returns it. It is used so that you can write a one-liner such as:
clf = svm.SVC(random_state=0).fit(iris.data, iris.target)

Related

Scikit-learn output in a pleasant way [duplicate]

As an R user, I wanted to also get up to speed on scikit.
Creating a linear regression model(s) is fine, but can't seem to find a reasonable way to get a standard summary of regression output.
Code example:
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Load the diabetes datasets
dataset = datasets.load_diabetes()
# Fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)
# Make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# Summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print model.intercept_, model.coef_, mse,
print(model.score(dataset.data, dataset.target))
Issues:
seems like the intercept and coef are built into the model, and I just type print (second to last line) to see them.
What about all the other standard regression output like R^2, adjusted R^2, p values, etc. If I read the examples correctly, seems like you have to write a function/equation for each of these and then print it.
So, is there no standard summary output for lin. reg. models?
Also, in my printed array of outputs of coefficients, there are no variable names associated with each of these? I just get the numeric array. Is there a way to print these where I get an output of the coefficients and the variable they go with?
My printed output:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
152.133484163 [ -10.01219782 -239.81908937 519.83978679 324.39042769 -792.18416163
476.74583782 101.04457032 177.06417623 751.27932109 67.62538639] 2859.69039877
0.517749425413
Notes: Started off with Linear, Ridge and Lasso. I have gone through the examples. Below is for the basic OLS.
There exists no R type regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning and the evaluation criteria are based on performance on previously unseen data (such as predictive r^2 for regression).
There does exist a summary function for classification called sklearn.metrics.classification_report which calculates several types of (predictive) scores on a classification model.
For a more classic statistical approach, take a look at statsmodels.
I use:
import sklearn.metrics as metrics
def regression_results(y_true, y_pred):
# Regression metrics
explained_variance=metrics.explained_variance_score(y_true, y_pred)
mean_absolute_error=metrics.mean_absolute_error(y_true, y_pred)
mse=metrics.mean_squared_error(y_true, y_pred)
mean_squared_log_error=metrics.mean_squared_log_error(y_true, y_pred)
median_absolute_error=metrics.median_absolute_error(y_true, y_pred)
r2=metrics.r2_score(y_true, y_pred)
print('explained_variance: ', round(explained_variance,4))
print('mean_squared_log_error: ', round(mean_squared_log_error,4))
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))
statsmodels package gives a quiet decent summary
from statsmodels.api import OLS
OLS(dataset.target,dataset.data).fit().summary()
You can do using statsmodels
import statsmodels.api as sm
X = sm.add_constant(X.ravel())
results = sm.OLS(y,x).fit()
results.summary()
results.summary() will organize the results into three tabels
You can use the following option to have a summary table:
import statsmodels.api as sm
#log_clf = LogisticRegression()
log_clf =sm.Logit(y_train,X_train)
classifier = log_clf.fit()
y_pred = classifier.predict(X_test)
print(classifier.summary2())
Use model.summary() after predict
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# load the diabetes datasets
dataset = datasets.load_diabetes()
# fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# >>>>>>>Print out the statistics<<<<<<<<<<<<<
model.summary()
# summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print model.intercept_, model.coef_, mse,
print(model.score(dataset.data, dataset.target))

scikit-learn linear regression K fold cross validation

I want to run Linear Regression along with K fold cross validation using sklearn library on my training data to obtain the best regression model. I then plan to use the predictor with the lowest mean error returned on my test set.
For example the below piece of code gives me an array of 20 results with different neg mean absolute errors, I am interested in finding the predictor which gives me this (least) error and then use that predictor on my test set.
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
There is no such thing as "predictor which gives me this (least) error" in cross_val_score, all estimators in :
sklearn.model_selection.cross_val_score(LinearRegression(), trainx, trainy, scoring='neg_mean_absolute_error', cv=20)
are the same.
You may wish to check GridSearchCV that will indeed search through different sets of hyperparams and return the best estimator:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
X,y = datasets.make_regression()
lr_model = LinearRegression()
parameters = {'normalize':[True,False]}
clf = GridSearchCV(lr_model, parameters, refit=True, cv=5)
best_model = clf.fit(X,y)
Note the refit=True param that ensures the best model is refit on the whole dataset and returned.

sklearn auc ValueError: Only one class present in y_true

I searched Google, and saw a couple of StackOverflow posts about this error. They are not my cases.
I use keras to train a simple neural network and make some predictions on the splitted test dataset. But when use roc_auc_score to calculate AUC, I got the following error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.".
I inspect the target label distribution, and they are highly imbalanced. Some labels(in the total 29 labels) have only 1 instance. So it's likely they will have no positive label instance in the test label. So the sklearn's roc_auc_score function reported the only one class problem. That's reasonable.
But I'm curious, as when I use sklearn's cross_val_score function, it can handle the AUC calculation without error.
my_metric = 'roc_auc'
scores = cross_validation.cross_val_score(myestimator, data,
labels, cv=5,scoring=my_metric)
I wonder what happened in the cross_val_score, is it because the cross_val_score use a stratified cross-validation data split?
UPDATE
I continued to make some digging, but still can't find the difference behind.I see that cross_val_score call check_scoring(estimator, scoring=None, allow_none=False) to return a scorer, and the check_scoring will call get_scorer(scoring) which will return scorer=SCORERS[scoring]
And the SCORERS['roc_auc'] is roc_auc_scorer;
the roc_auc_scorer is made by
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
needs_threshold=True)
So, it's still using the roc_auc_score function. I don't get why cross_val_score behave differently with directly calling roc_auc_score.
I think your hunch is correct. The AUC (area under ROC curve) needs a sufficient number of either classes in order to make sense.
By default, cross_val_score calculates the performance metric one each fold separately. Another option could be to do cross_val_predict and compute the AUC over all folds combined.
You could do something like:
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
class ProbaEstimator(LogisticRegression):
"""
This little hack needed, because `cross_val_predict`
uses `estimator.predict(X)` internally.
Replace `LogisticRegression` with whatever classifier you like.
"""
def predict(self, X):
return super(self.__class__, self).predict_proba(X)[:, 1]
# some example data
X, y = make_classification()
# define your estimator
estimator = ProbaEstimator()
# get predictions
pred = cross_val_predict(estimator, X, y, cv=5)
# compute AUC score
roc_auc_score(y, pred)

Difference in SGD classifier results and statsmodels results for logistic with l1

As a check on my work, I've been comparing the output of scikit learn's SGDClassifier logistic implementation with statsmodels logistic. Once I add some l1 in combination with categorical variables, I'm getting very different results. Is this a result of different solution techniques or am I not using the correct parameter?
Much bigger differences on my own dataset, but still pretty large using mtcars:
df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
n_iter=1000, fit_intercept=False)
clf.fit(X, y)
gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but is (much more) performant with larger data. I'd be interested in hearing from sklearn devs. Compare, for example, using LogisticRegression
clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, y)
gives very similar to l1 penalized Logit.
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])

How to get best_estimator parameters from GridSearch using cross_val_score?

I want to know the result of the GridSearch when I'm using nested cross validation with cross_val_score for convenience.
When using cross_val_score, you get an array of scores. It would be useful to receive the fitted estimator back or a summary of the chosen parameters for that estimator.
I know you can do this yourself but just implementing cross-validation manually but it is much more convenient if it can be done in conjunction with cross_val_score.
Any way to do it or is this a feature to suggest?
The GridSearchCV class in scikit-learn already does cross validation internally. You can pass any CV iterator as the cv argument of the constructor of GridSearchCV.
The answer to your question is that it is a feature to suggest. Unfortunately, you can't get the best parameters of the models fitted with nested cross-validation using cross_val_score (as of now, scikit 0.14).
See this example:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
digits = datasets.load_digits()
X = digits.data
y = digits.target
hyperparams = [{'fit_intercept':[True, False]}]
algo = LinearRegression()
grid = GridSearchCV(algo, hyperparams, cv=5, scoring='mean_squared_error')
# Nested cross validation
cross_val_score(grid, X, y)
grid.best_score_
[Out]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-4c4ac83c58fb> in <module>()
15 # Nested cross validation
16 cross_val_score(grid, X, y)
---> 17 grid.best_score_
AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
(Note also that the scores you get from cross_val_score are not the ones defined in scoring, here the mean squared error. What you see is the score function of the best estimator. The bug of v0.14 is described here.)
In sklearn v0.20.0 (which will be released in late 2018), the trained estimators are exposed by the function cross_validate if requested.
See here the corresponding pull-request for the new feature. Something like this will work:
from sklearn.metrics.scorer import check_scoring
from sklearn.model_selection import cross_validate
scorer = check_scoring(estimator=gridSearch, scoring=scoring)
cvRet = cross_validate(estimator=gridSearch, X=X, y=y,
scoring={'score': scorer}, cv=cvOuter,
return_train_score=False,
return_estimator=True,
n_jobs=nJobs)
scores = cvRet['test_score'] # Equivalent to output of cross_val_score()
estimators = cvRet['estimator']
If return_estimator=True, the estimators can be retrieved from the returned dictionary as cvRet['estimator']. The list stored in cvRet['test_score'] is equivalent to the output of cross_val_score. See here how cross_val_score() is implemented by means of cross_validate().

Resources