I want to evaluate three models, i.e. LogisticRegression, SVM, and RandomForest, on an imbalanced dataset, so I decided to use a stratified method.
The first option is to use train_test_split and set stratify=y.
However, I used the StratifiedKFold method with 10 splits.
In this case, how do I evaluate my three models using the same splits?
If you use the same dataset, you can fix the random_state parameter of StratifiedKFold (together with shuffle=True; random_state has no effect otherwise) and reuse that CV object for all three models. If you do that, you will be evaluating the three models on the same 10 splits.
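A minimal sketch of this, assuming your feature matrix X and label vector y already exist (the scoring metric and hyperparameters below are purely illustrative):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# One CV object with shuffle=True and a fixed random_state -> identical folds for every model
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    # cross_val_score regenerates the same 10 folds from cv for each model
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    print(name, scores.mean())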
Related
If you run cross_val_score() or cross_validate() on a dataset, is the estimator trained on all the folds at the end of the run?
I read somewhere that cross_val_score takes a copy of the estimator, whereas I thought this was how you train a model using k-fold.
Or, at the end of cross_validate() or cross_val_score(), do you end up with a single estimator that you then use for predict()?
Is my thinking correct?
You can refer to the sklearn documentation here.
If you do 3-fold cross-validation,
sklearn will split your dataset into 3 parts (for example, the 1st part contains rows 1-3, the 2nd part rows 4-6, and so on).
sklearn then iterates, training a new model 3 times, each time with a different training set and validation set:
In the first round, it combines the 1st and 2nd parts as the training set and tests the model on the 3rd part.
In the second round, it combines the 1st and 3rd parts as the training set and tests the model on the 2nd part.
And so on.
So, after using cross_validate, you will get three models. If you want the model object from each round, add the parameter return_estimator=True. The result, which is a dictionary, will then have an extra key named estimator containing the list of fitted estimators from each training round.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate

# Small toy regression dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = linear_model.Lasso()

# return_estimator=True keeps the fitted model from each of the 3 folds
cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)
print(sorted(cv_results.keys()))
# Output: ['estimator', 'fit_time', 'score_time', 'test_score']
cv_results['estimator']
# Output: [Lasso(), Lasso(), Lasso()]
However, in practice, cross-validation is used only to evaluate the model. Once you have found a good model and parameter setting that gives a high cross-validation score, it is better to refit the model on the whole training set and then test it on a held-out test set.
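A rough sketch of that workflow on a toy dataset (the split sizes, model, and CV settings are arbitrary, just for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that cross-validation never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000)

# Cross-validation on the training set only, to estimate performance
print("CV estimate:", cross_val_score(model, X_train, y_train, cv=5).mean())

# Refit on the whole training set, then evaluate once on the held-out test set
model.fit(X_train, y_train)
print("Test score:", model.score(X_test, y_test))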
I'm trying to use BaggingClassifier from sklearn to define multiple base estimators. From my understanding, it would be something similar to this:
clf = BaggingClassifier(base_estimator=[SVC(), DecisionTreeClassifier()], n_estimators=3, random_state=0)
But BaggingClassifier doesn't take a list as its base_estimator.
I assume I could switch to StackingRegressor(estimators=) to define multiple estimators manually, but it would be a pain to list out, say, 100 estimators, not to mention the many permutations and combinations of base estimators.
Can you help me understand how to define multiple base_estimator in sklearn.BaggingClassifier?
You can only pass one estimator to base_estimator. The whole idea behind BaggingClassifier is to train many copies of a single base estimator, each on a random sample of the training data, in an attempt to reduce its variance.
If you need two or more different estimators, each of them trained on random subsets of the data, I suggest two options:
Create your own voting process from two separate bagging classifiers;
Train two different BaggingClassifiers and pass them to sklearn.ensemble.StackingClassifier (a sketch of this option follows).
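For the second option, here is a minimal sketch on a synthetic dataset (the choice of base estimators, n_estimators, and final_estimator is purely illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each BaggingClassifier bags one kind of base estimator on random subsets of the data
bag_svc = BaggingClassifier(SVC(), n_estimators=10, random_state=0)
bag_tree = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)

# The stacker then combines the two bagged ensembles with a meta-learner
stack = StackingClassifier(
    estimators=[("bag_svc", bag_svc), ("bag_tree", bag_tree)],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))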
I want to make a fair comparison between different machine learning models. However, I find that the ridge regression model automatically uses multiple processors, and there is no parameter (such as n_jobs) with which I can restrict the number of processors used. Is there any way to solve this problem?
A minimal example:
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))
If you set the environment variable OMP_NUM_THREADS to n, you will get the expected behaviour. E.g. on Linux, run export OMP_NUM_THREADS=1 in the terminal to restrict the computation to one CPU.
Depending on your system, you can also set it directly in Python; see e.g. How to set environment variables in Python?
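For example, a sketch of setting it from Python; note that the variable has to be set before numpy/scipy/sklearn are first imported, otherwise the thread pool is already initialized:

import os
# Must happen before the first import of numpy / scipy / sklearn
os.environ["OMP_NUM_THREADS"] = "1"

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))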
Trying to expand further on PV8's answer: what happens whenever you instantiate RidgeCV() without explicitly setting the cv parameter (as in your case) is that an efficient Leave-One-Out cross-validation is run (according to the algorithms referenced here, implementation here).
On the other hand, when you explicitly pass a cv parameter to RidgeCV(), this is what happens under the hood:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# With an explicit cv, RidgeCV is roughly equivalent to a grid search over alphas
model = Ridge()
parameters = {'alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(model, param_grid=parameters)
gs.fit(features, target)
print(gs.best_score_)
(as you can see here), namely that GridSearchCV is used with the default n_jobs=None.
Most importantly, as pointed out by one of the sklearn core devs here, the issue you are experiencing might not depend on sklearn, but rather on
[...] your numpy setup performing vectorized operations with parallelism.
(where vectorized operations are performed within the computationally efficient LOO cross-validation procedure that you are implicitly calling by not passing cv to RidgeCV()).
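As an illustration, explicitly passing cv (the alphas and number of folds below are arbitrary) makes RidgeCV take the GridSearchCV code path instead of the vectorized LOO one:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

features, target = make_regression(n_samples=10000, n_features=1000)

# With an explicit cv, RidgeCV falls back to a grid search over alphas
# (with the default n_jobs=None) rather than the efficient LOO routine
r = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
r.fit(features, target)
print(r.score(features, target))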
Based on the docs for RidgeCV:
Ridge regression with built-in cross-validation.
By default, it performs Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.
And by default, cv=None, which means the efficient Leave-One-Out cross-validation is used.
An alternative approach, using ridge regression with an explicit cross-validation loop:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)
# n_jobs=1 restricts the cross-validation loop itself to a single process
scores = cross_val_score(clf, features, target, cv=5, n_jobs=1)
print(scores)
See also the docs of Ridge and cross_val_score.
Also, take a look at sklearn.utils.parallel_backend; I think you can set the number of cores used for the computation via its n_jobs parameter.
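A sketch of that suggestion (note that parallel_backend controls joblib-managed parallelism such as n_jobs-based loops; it does not necessarily limit the BLAS/OpenMP threads used by the vectorized LOO routine, which is what OMP_NUM_THREADS addresses):

from sklearn.utils import parallel_backend
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

features, target = make_regression(n_samples=10000, n_features=1000)

with parallel_backend("loky", n_jobs=1):
    # Any joblib-parallelized work inside this block is limited to a single worker
    r = RidgeCV()
    r.fit(features, target)
    print(r.score(features, target))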
What is the difference between using the stratify argument in train_test_split function of sklearn, and the StratifiedShuffleSplit function? Don't they do the same thing?
These two modules perform different operations.
train_test_split, as its name clearly implies, is used for splitting the data into a single training and a single test subset, and the stratify argument permits doing this in a stratified way.
StratifiedShuffleSplit, on the other hand, provides splits for cross-validation; from the docs:
Stratified ShuffleSplit cross-validator
Provides train/test indices to split data in train/test sets.
Notice the plural sets (emphasis mine).
So, StratifiedShuffleSplit is there to be used instead of KFold when we want to ensure the CV splits are stratified, and not to replace train_test_split.
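A small sketch to make the contrast concrete (toy data, arbitrary sizes):

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# One stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Several stratified train/test splits, e.g. for cross-validation
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    print("train:", train_idx, "test:", test_idx)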
I'm using RandomForestClassifier in sklearn and GridSearchCV to get the best estimator.
I'm wondering: when there are many estimators (from simple to complex) with the same score in GridSearchCV, which estimator will GridSearchCV return? The simplest one? Or a random one?
GridSearchCV does not assess the model complexity (though that would be a neat feature). Neither does it choose among the best models randomly.
Instead, GridSearchCV simply performs an np.argmin() on the stored errors. See the corresponding line in the source code.
Now, according to the NumPy docs,
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
That is, GridSearchCV will always select the first among the best models.
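A tiny illustration of that tie-breaking behaviour, on a hypothetical array of cross-validation errors (not taken from GridSearchCV itself):

import numpy as np

# Four candidates; candidates 1 and 2 are tied for the lowest error
errors = np.array([0.30, 0.10, 0.10, 0.25])
print(np.argmin(errors))  # prints 1: the first of the tied best candidates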