How to use GroupKFold with CalibratedClassifierCV? - scikit-learn

Unlike GridSearchCV, CalibratedClassifierCV doesn't seem to support passing the groups parameter to the fit method. I found this very old github issue that reports this problem but it doesn't seem to be fixed yet. The documentation makes it clear that not properly stratifying your cv folds will result in an incorrectly calibrated model. My dataset has multiple observations from the same users so I would need to use GroupKFold to ensure proper calibration.

scikit-learn can take an iterable of (train, test) splits as the cv object, so just create them manually. For example:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GroupKFold

# Materialize the splits as a list (a bare generator is exhausted after one use)
my_cv = list(GroupKFold(n_splits=5).split(X, y, groups=my_groups))
cal_clf = CalibratedClassifierCV(clf, cv=my_cv)
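A fully self-contained sketch of this pattern (the synthetic data, the simulated user groups, and the LogisticRegression base estimator are all assumptions for illustration):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

X, y = make_classification(n_samples=200, random_state=0)
# Simulate repeated observations per user: 20 users, 10 rows each
my_groups = np.repeat(np.arange(20), 10)

# Materialize the splits as a list so they can be reused or inspected
my_cv = list(GroupKFold(n_splits=5).split(X, y, groups=my_groups))

cal_clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=my_cv)
cal_clf.fit(X, y)
probs = cal_clf.predict_proba(X)
print(probs.shape)  # (200, 2)
```

Building the splits as a list rather than a generator also lets you reuse the exact same folds elsewhere, e.g. to evaluate the uncalibrated model on matching splits.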

I've created a modified version of CalibratedClassifierCV that addresses this issue for now. Until this is fixed in sklearn master, you can similarly modify the fit method of CalibratedClassifierCV to use GroupKFold. My solution can be found in this gist. It is based on sklearn version 0.24.1, but you can easily adapt it to your version of sklearn as needed.

Related

stratify python-weka-wrapper3 for train_test_split function

I am trying to do a train_test_split, but the database is very unbalanced, so I want to stratify it. How can I do that?
In the documentation I saw that the train_test_split function receives just two arguments:
train_test_split(percentage, rnd=None)
so I don't know if it's possible to do that with stratification.
My code is this:
train_model_1_ORG, test_model_1_ORG = data_modelos_1_2.train_test_split(70.0, Random(1))
No, Weka doesn't offer that functionality.
However, I just committed a change to the weka.core.Instances class that allows you to generate cross-validation splits (method: cv_splits).
Since you want 70% in your training set, you could generate splits for a 10-fold CV and then combine 7 of the test splits into a training set and the others into a test set (using the Instances.append_instances(Inst1, Inst2) class method).
NB: You need to install pww3 directly from Github or use a release newer than 0.2.9.

How do I restrict the number of processors used by the ridge regression model in sklearn?

I want to make a fair comparison between different machine learning models. However, I find that the ridge regression model automatically uses multiple processors, and there is no parameter (such as n_jobs) with which I can restrict the number of processors used. Is there any way to solve this problem?
A minimal example:
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
features, target = make_regression(n_samples=10000, n_features=1000)
r = RidgeCV()
r.fit(features, target)
print(r.score(features, target))
If you set the environment variable OMP_NUM_THREADS to n, you will get the expected behaviour. E.g. on Linux, run export OMP_NUM_THREADS=1 in the terminal to restrict usage to 1 CPU.
Depending on your system, you can also set it directly in python. See e.g. How to set environment variables in Python?
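If it is hard to guarantee the variable is exported before Python starts, a minimal sketch of setting it from inside the script instead (it must run before NumPy and its BLAS/OpenMP backends are first imported; the synthetic data is an assumption for illustration):

```python
import os

# Must be set before numpy/scipy are first imported, or it has no effect
os.environ["OMP_NUM_THREADS"] = "1"

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

features, target = make_regression(n_samples=1000, n_features=50, random_state=0)
r = RidgeCV()
r.fit(features, target)
print(round(r.score(features, target), 3))
```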
Trying to expand further on @PV8's answer: whenever you instantiate RidgeCV() without explicitly setting the cv parameter (as in your case), an efficient Leave-One-Out cross-validation is run (according to the algorithms referenced here; implementation here).
On the other hand, when explicitly passing the cv parameter to RidgeCV(), this happens:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

model = Ridge()
parameters = {'alpha': [0.1, 1.0, 10.0]}
gs = GridSearchCV(model, param_grid=parameters)
gs.fit(features, target)
print(gs.best_score_)
(as you can see here), namely that you'll use GridSearchCV with default n_jobs=None.
Most importantly, as pointed out by one of the sklearn core devs here, the issue you are experiencing might not depend on sklearn, but rather on
[...] your numpy setup performing vectorized operations with parallelism.
(where vectorized operations are performed within the computationally efficient LOO cross-validation procedure that you are implicitly calling by not passing cv to RidgeCV()).
Based on the docs for RidgeCV:
Ridge regression with built-in cross-validation.
By default, it performs efficient Leave-One-Out Cross-Validation.
And by default you pass cv=None, which selects that efficient Leave-One-Out cross-validation.
An alternate approach with ridge regression and cross validation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)
scores = cross_val_score(clf, features, target, cv=5, n_jobs=1)
print(scores)
See also the docs of Ridge and cross_val_score.
Also, take a look at sklearn.utils.parallel_backend; I think you can set the number of cores for the computation using its n_jobs parameter.
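For what it's worth, sklearn.utils.parallel_backend is a thin wrapper around joblib's context manager, and it only caps joblib-based parallelism (cross_val_score, GridSearchCV, estimators with n_jobs), not the BLAS/OpenMP threads used inside a solver. A minimal sketch using joblib directly (the synthetic data is an assumption for illustration):

```python
from joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

features, target = make_regression(n_samples=1000, n_features=50, random_state=0)

# Everything inside the context runs with at most one joblib worker
with parallel_backend('loky', n_jobs=1):
    scores = cross_val_score(Ridge(alpha=1.0), features, target, cv=5)
print(len(scores))  # 5
```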

XGboost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation importance and it has yielded good results! I'm looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern? Removing "noisy" features that drive down your results, or obtaining a sparse model? Backward selection is one way to do it, of course. That being said, in case you are not aware of it, XGBoost computes its own "variable importance" values.
# plot feature importance using the built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

model = XGBClassifier()
model.fit(X, y)  # X, y: your training features and labels
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define, for instance, a threshold below which you do not keep the variables. However, do not forget that:
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.
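If you do want automated backward elimination, one option is scikit-learn's RFE, which repeatedly refits the model and drops the least important feature; it works with any estimator exposing feature_importances_, so XGBClassifier should plug in directly. The sketch below uses sklearn's GradientBoostingClassifier as a stand-in (plus synthetic data) purely so it is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Backward elimination: drop the least important feature, refit, repeat
selector = RFE(GradientBoostingClassifier(random_state=0),
               n_features_to_select=4, step=1)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the 4 retained features
```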

RobustScaler partial_fit() similar to MinMaxScaler or StandardScaler

I have been using RobustScaler to scale data and recently we added additional data that is pushing the memory limits of fit_transform. I was hoping to do partial_fit in subset data but looks like RobustScaler does not provide that functionality. Most of the other scalers (MinMax, Standard, Abs) seem to have partial_fit.
Since I have outliers in the data, I need to use RobustScaler. I tried using MinMax and Standard scalers but outliers influence the data too much.
I was hoping to find an alternative to doing fit_transform for large dataset, similar to partial_fit in other scalers.
If it is not a hard requirement for you to use scikit-learn, you can perhaps check out another library for Biomolecular Dynamics called msmbuilder.
It claims to have a RobustScaler similar to scikit-learn's, with the option of using partial_fit, as per their documentation.
Link: http://msmbuilder.org/3.7.0/_preprocessing/msmbuilder.preprocessing.RobustScaler.html#msmbuilder.preprocessing.RobustScaler
PS: I have not tested it.
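If switching libraries is not an option, one workaround (an approximation, not built-in sklearn functionality) is to fit RobustScaler on a random subsample, since the median and IQR it needs are usually well estimated from a fraction of the rows, and then transform the full data in chunks:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
X[::1000] += 100  # inject some outliers

# Fit on a random subsample; the median and IQR that RobustScaler uses
# are robust to outliers and well approximated from a fraction of the rows
idx = rng.choice(X.shape[0], size=10_000, replace=False)
scaler = RobustScaler().fit(X[idx])

# Transform the full data in chunks to bound peak memory
chunks = [scaler.transform(X[i:i + 20_000]) for i in range(0, X.shape[0], 20_000)]
X_scaled = np.vstack(chunks)
print(X_scaled.shape)  # (100000, 5)
```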

memory saving gradients or memory check pointing in keras

I recently found a github repo: https://github.com/openai/gradient-checkpointing
The main purpose is to reduce GPU memory consumption, and the usage seems pretty straightforward:
from tensorflow.python.keras._impl.keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
How can I do the same thing but with keras installed separately, not as a part of tensorflow? Since this didn't work:
from keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
Thank you in advance
I know I am a bit late, but I recently ran into the same problem, and I was able to solve it.
The problem (I think) is that memory_saving_gradients.gradients_memory uses a heuristic approach which does not work well for many scenarios. Fortunately, there is an alternative function: memory_saving_gradients.gradients_collection, which works perfectly fine, but it requires you to specify at which points in the network the gradient must be checkpointed.
As an example of how this can be accomplished, suppose that we want to checkpoint all the Keras layers whose name contains the word 'add' (for instance, to make a ResNet memory efficient). Then you could include something like this after building your model, but before training it:
layer_names = [layer.name for layer in self.model.layers]
for name in layer_names:
    if 'add' in name:
        tf.add_to_collection("checkpoints", self.model.get_layer(name).get_output_at(0))
K.__dict__["gradients"] = memory_saving_gradients.gradients_collection
I hope it helps!
