XGBoost get classifier object form booster object? - scikit-learn

I usually get to feature importance using
regr = XGBClassifier()
regr.fit(X, y)
regr.feature_importances_
where type(regr) is .
However, I have a pickled mXGBoost model, which when unpacked returns an object of type . This is the same object as if I would have ran regr.get_booster().
I have found a few solutions for getting variable importance from a booster object, but is there a way to get to the classifier object from the booster object so I can just apply the same feature_importances_ command? This seems like the most straightforward solution, or it seems like I have to write a function that mimics the output of feature_importances_ in order for it to fit my logged feature importances...
So ideally I'd have something like
xbg_booster = pickle.load(open("xgboost-model", "rb"))
assert str(type(xgb_booster)) == "<class 'xgboost.core.Booster'>", 'wrong class'
xgb_classifier = xgb_booster.get_classifier()
xgb_classifier.feature_importances_
Are there any limitations to what can be done with a booster object in terms finding the classifier? I figure there's some combination of save/load/dump that will get me what I need but I'm stuck for now...
Also for context, the pickled model is the output from AWS sagemaker, so I'm just unpacking it to do some further evaluation

Based on my own experience trying to recreate a classifier from a booster object generated by SageMaker I learned the following:
It doesn't appear to be possible to recreate the classifier from the booster. :(
https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster has the details on the booster class so you can review what it can do.
Crazy things you can do however:
You can create a classifier object and then over-ride the booster within it:
xgb_classifier = xgb.XGBClassifier(**xgboost_params)
[..]
xgb_classifier._Boster = booster
This is nearly useless unless you fit it otherwise it doesn't have any feature data. (I didn't go all the way through this scenario to validate if fitting would provide the feature data required to be functional.)
You can remove the booster object from the classifier and then pickle the classifier using xgboost directly. Then later restore the SageMaker booster back into it. This abomination is closer and appears to work, but is not truly a rehydrated classifier object from the SageMaker output alone.
Recommendation
If you’re not stuck using the SageMaker training solution you can certainly use XGBoost directly to train with. At that point you have access to everything you need to dump/save the data for use in a different context.
I know you're after feature importance so I hope this gets you closer, I had a different use case and was ultimately able to leverage the booster for what I needed.

I was able to get xgboost.XGBClassifier model virtually identical to a xgboost.Booster version model by
(1) extracting all tuning parameters from the booster model using this:
import json
json.loads(your_booster_model.save_config())
(2) implementing these same tuning parameters and then training a XGBClassifier model using the same training dataset used to train the Booster model before that.
Note: one mistake I made was that I forgot to explicitly assign the same seed /random_state in both Booster and Classifier versions.

Related

How to save model architecture in PyTorch?

I know I can save a model by torch.save(model.state_dict(), FILE) or torch.save(model, FILE). But both of them don't save the architecture of model.
So how can we save the architecture of a model in PyTorch like creating a .pb file in Tensorflow ? I want to apply different tweaks to my model. Do I have any better way than copying the whole class definition every time and creating a new class if I can't save the architecture of a model?
You can refer to this article to understand how to save the classifier. To make a tweaks to a model, what you can do is create a new model which is a child of the existing model.
class newModel( oldModelClass):
def __init__(self):
super(newModel, self).__init__()
With this setup, newModel has all the layers as well as the forward function of oldModelClass. If you need to make tweaks, you can define new layers in the __init__ function and then write a new forward function to define it.
Saving all the parameters (state_dict) and all the Modules is not enough, since there are operations that manipulates the tensors, but are only reflected in the actual code of the specific implementation (e.g., reshapeing in ResNet).
Furthermore, the network might not have a fixed and pre-determined compute graph: You can think of a network that has branching or a loop (recurrence).
Therefore, you must save the actual code.
Alternatively, if there are no branches/loops in the net, you may save the computation graph, see, e.g., this post.
You should also consider exporting your model using onnx and have a representation that captures both the trained weights as well as the computation graph.
Regarding the actual question:
So how can we save the architecture of a model in PyTorch like creating a .pb file in Tensorflow ?
The answer is: You cannot
Is there any way to load a trained model without declaring the class definition before ?
I want the model architecture as well as parameters to be loaded.
no, you have to load the class definition before, this is a python pickling limitation.
https://discuss.pytorch.org/t/how-to-save-load-torch-models/718/11
Though, there are other options (probably you have already seen most of those) that are listed at this PyTorch post:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
PyTorch's way of serializing a model for inference is to use torch.jit to compile the model to TorchScript.
PyTorch's TorchScript supports more advanced control flows than TensorFlow, and thus the serialization can happen either through tracing (torch.jit.trace) or compiling the Python model code (torch.jit.script).
Great references:
Video which explains this: https://www.youtube.com/watch?app=desktop&v=2awmrMRf0dA
Documentation: https://pytorch.org/docs/stable/jit.html

XGboost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern ? Remove "noisy" features who drive down your results, obtain a sparse model ? Backward selection is one way to do of course. That being said, not sure if you are aware of this but XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define for instance a threshold below which you do not keep the variables. However do not forget that :
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.

Using a pytorch model for inference

I am using the fastai library (fast.ai) to train an image classifier. The model created by fastai is actually a pytorch model.
type(model)
<class 'torch.nn.modules.container.Sequential'>
Now, I want to use this model from pytorch for inference. Here is my code so far:
torch.save(model,"./torch_model_v1")
the_model = torch.load("./torch_model_v1")
the_model.eval() # shows the entire network architecture
Based on the example shown here: http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#sphx-glr-beginner-data-loading-tutorial-py, I understand that I need to write my own data loading class which will override some of the functions in the Dataset class. But what is not clear to me is the transformations that I need to apply at test time? In particular, how do I normalize the images at test time?
Another question: is my approach of saving and loading the model in pytorch fine? I read in the tutorial here: http://pytorch.org/docs/master/notes/serialization.html that the approach that I have used is not recommended. The reason is not clear though.
Just to clarify: the_model.eval() not only prints the architecture, but sets the model to evaluation mode.
In particular, how do I normalize the images at test time?
It depends on the model you have. For instance, for torchvision modules, you have to normalize the inputs this way.
Regarding on how to save / load models, torch.save/torch.load "saves/loads an object to a disk file."
So, if you save the_model, it will save the entire model object, including its architecture definition and some other internal aspects. If you save the_model.state_dict(), it will save a dictionary containing the model state (i.e. parameters and buffers) only. Saving the model can break the code in various ways, so the preferred method is to save and load only the model state. However, I'm not sure if fast.ai "model file" is actually a full model or the state of a model. You have to check this so you can correctly load it.

sklearn pickled model "Attribute Error: model has no attribute classes_"

I built a sklearn model in one PC and pickled it. When, I tried to use the same model in another PC, I get below error:
Attribute Error: model has no attribute classes_
When I built the model , I did check
Model.classes_
It printed the classes. What could be the reason for this?
it would help knowing the version details of scikit-learn in both the PCs but the information on the website tells it is possible to happen due to different versions. Hope it helps.
Think of pickle as a way to dump and load a snapshot of an object and its environment at a given moment.
Sometimes the object you deal with does not mean anything by itself . You must provide extra data with it.
That's particularly the case with trained classifiers. In your case, model_classes works perfectly in your script when you have you data and fitted your classifier. Suppose now that you have dumped the classifier and loaded it later in another script : what are the classes you are talking about ? About what data are we talking about ? ... Got it ?
What you have to do then is to provide additional meta data when pickling.
This section of sklearn's documentation describes what needs to be pickled alonside the classifier (training data, source code ...).
NB:
Check first that both versions of sklearn are the same. Sometimes it can be just that.

How am I supposed to use RandomizedLogisticRegression in Scikit-learn?

I simply have failed to understand the documentation for this class.
I can fit data using it, and get the scores for features, but it this all this class is supposed to do?
I can't see how I can use it to actually perform regression using the model that was fit. The example in the documentation above is simply creating an instance of the class, so I can't see how that is supposed to help.
There are methods that perform 'transform' operation, but no mention of what kind of transform that is.
so is it possible to use this class to get actual predictions on new test data, and is it possible to use it in cross fold validation to compare performance with other methods I'm using?
I've used the highest ranking features in other classifiers, but I'm not sure if more than that is possible with this classifier.
Update: I've found the use for fit_transform under feature selection part of the documentation:
When the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to select the non-zero coefficient
Unless I get an answer that says I'm wrong, I'll assume that this classifier indeed does not do prediction. I'll wait before I answer my own question.
Randomized LR is supposed to be a feature selection method, not a classifier in and of itself. Its API matches that of a standard scikit-learn transformer:
randomlr = RandomizedLogisticRegression()
X_train = randomlr.fit_transform(X_train)
X_test = randomlr.transform(X_test)
Then fit a model to X_train and do classification on X_test as usual.

Resources