stratify python-weka-wrapper3 for train_test_split function - python-3.x

Trying to do a train_test_split but the database is so unbalanced so i want to stratify it, how can i do it?
On the documentation I saw that train_test_split function just receive 2 arguments:
train_test_split(percentage, rnd=None)
so don't know if it's possible to do that with stratification.
My code is that:
train_model_1_ORG, test_model_1_ORG = data_modelos_1_2.train_test_split(70.0, Random(1))

No, Weka doesn't offer that functionality.
However, I just committed a change to the weka.core.Instances class that allows you to generate cross-validation splits (method: cv_splits).
Since you want 70% in your training set, you could generate splits for a 10-fold CV and then combine 7 of the test splits into a training set and the others into a test set (using the Instances.append_instances(Inst1, Inst2) class method).
NB: You need to install pww3 directly from Github or use a release newer than 0.2.9.

Related

How to use GroupKFold with CalibratedClassifierCV?

Unlike GridSearchCV, CalibratedClassifierCV doesn't seem to support passing the groups parameter to the fit method. I found this very old github issue that reports this problem but it doesn't seem to be fixed yet. The documentation makes it clear that not properly stratifying your cv folds will result in an incorrectly calibrated model. My dataset has multiple observations from the same users so I would need to use GroupKFold to ensure proper calibration.
scikit-learn can take an iterable of (train, test) splits as the cv object, so just create them manually. For example:
my_cv = (
(train, test) for train, test in GroupKFold(n_splits=5).split(X, groups=my_groups)
)
cal_clf = CalibratedClassifierCV(clf, cv=my_cv)
I've created a modified version of CalibratedClassifierCV that addresses this issue for now. Until this is fixed in sklearn master, you can similarly modify the fit method of CalibratedClassifierCV to use GroupKFold. My solution can be found in this gist. This is based of sklearn version 0.24.1 but you can easily adapt it to your version of sklearn as needed.

Stratify and model evaluation

I want evaluate three models i.e. LogisticRegression , SVM and Random Forest using an imbalanced dataset. I decided to use a stritified method.
The first option is to use train_test_split and set the stratyfy=y
Howerever I used the StratifyKfold method with 10 splits.
In this case how do i evaluate my three models using the same splits?
If you use the same dataset, you can fix the random_state parameter of the StratifyKfold. If you do that, you would be evaluating the three models with same 10 splits.

How do I restrict the number of processors used by the ridge regression model in sklearn?

I want to make a fair comparison between different machine learning models. However, I find that the ridge regression model will automatically use multiple processors and there is no parameter that I can restrict the number of used processors (such as n_jobs). Is there any possible way to solve this problem?
A minimal example:
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))
If you set the environmental variable OMP_NUM_THREADS to n, you will get the expected behaviour. E.g. on linux, do export OMP_NUM_THREADS=1 in the terminal to restrict the use to 1 cpu.
Depending on your system, you can also set it directly in python. See e.g. How to set environment variables in Python?
Trying to expand further on #PV8 answer, what happens whenever you instantiate an instance of RidgeCV() without explicitly setting cv parameter (as in your case) is that an Efficient Leave One Out cross-validation is run (according to the algorithms referenced here, implementation here).
On the other side, when explicitly passing cv parameter to RidgeCV() this happens:
model = Ridge()
parameters = {'alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(model, param_grid=parameters)
gs.fit(features, target)
print(gs.best_score_)
(as you can see here), namely that you'll use GridSearchCV with default n_jobs=None.
Most importantly, as pointed out by one of sklearn core-dev here, the issue you are experimenting might be not dependent on sklearn, but rather on
[...] your numpy setup performing vectorized operations with parallelism.
(where vectorized operations are performed within the computationally efficient LOO cross-validation procedure that you are implicitly calling by not passing cv to RidgeCV()).
Based on the docs for RidgeCV:
Ridge regression with built-in cross-validation.
By default, it performs Leave-One-Out Cross-Validation, which is a form of efficient Leave-One-Out cross-validation.
And by default you use None - to use the efficient Leave-One-Out cross-validation.
An alternate approach with ridge regression and cross validation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
clf = Ridge(a)
scores = cross_val_score(clf, features, target, cv=1, n_jobs=1)
print(scores)
See also the docs of Ridge and cross_val_score.
Here it is try to take a look here sklearn.utils.parallel_backend i think you can set up the number of cores for calculation using the njobs parameter.

XGboost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern ? Remove "noisy" features who drive down your results, obtain a sparse model ? Backward selection is one way to do of course. That being said, not sure if you are aware of this but XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define for instance a threshold below which you do not keep the variables. However do not forget that :
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.

sklearn SGDClassifier can't stop

I am using sklearn to train a model. The train dataset is about 3000k, so i use SGDClassifier. The feature is not very good, so i know it may not converge. But i want SGDClassifier to stop early according to my setting just like max_iter = 1000. As far as I am concerned, the function SGDClassifier has no parameter like max_iter. How can i do it?
This is the code.
This is the print information.
Any help will be appreciated...
This is weird, by default in scikit-learn 0.18.2, n_iter is set to 5 epochs. Can you please update your question with a script that makes it possible to reproduce the behavior using a toy dataset (for instance generated with numpy.random.randn or similar).
Note that in scikit-learn master and 0.19 once released, n_iter will be deprecated and replaced by max_iter and a tol (for instance set to 1e-3) to automatically stop when the objective function is no longer making progress.
The 20hours running could be not so strange since you have a dataset of 3000k and you use SGDClassifier that is slow. What processor do you have?
Try stopping it by using CTRL+C if you are in Windows. Then, use n_iter to control the number of iterations that you want. The default is 5 however.
Finally, if you want to save a model see here:
Save and Load Machine Learning Models in Python with scikit-learn

Resources