H2O: GLM MOJO models doesn't hold variable importance? - python-3.x

A question on h2o mojo models.
Is my understanding correct that GLM MOJO models don't hold variable importance for the model?
Or is there something that am missing?
I get the below message on screenshot sometimes when I query varimp/varimp_plot from the GLM model.
"Warning: This model doesn't have variable importance."
Is this usual? while we get varimp from the same model within the kernel which generated them. Just curious to understand this.
Any leads would be much appreciated.

MOJO models are H2O's primary way of taking models into production. These self-contained zip files are primarily meant to be run via genmodel and not inspected. MOJO model does not equal binary model, which is tied to a certain H2O version. The reason for that is simple - algorithm parameters and the algorithm itself may change between versions.
Anyway, H2O provides a way to import MOJOs back into H2O and primarily use them for scoring. Some attributes of MOJOs are still extracted from the MOJO and provided to the user. But, as the documentation says, it is not guaranteed which model parameters are exposed and some might be missing. MOJO model import is implemented as a part of H2O's Generic model functionality - the ability of H2O to "embrace" any model, even the ones trained outside H2O, provided the "Generic model driver" is available.
With that said, there is definitely a way to provide variable importances to MOJO import functionality users. This is a known problem already and is tracked in H2O JIRA.
More resources on MOJO model on my blog.

Related

what does this command do for a bert transformers?

!pip install transformers
from transformers import InputExample, InputFeatures
What are InputExample and InputFeatures here?
thanks.
Check out the documentation.
Processors
This library includes processors for several traditional tasks. These
processors can be used to process a dataset into examples that can be
fed to a model.
And
class transformers.InputExample
A single training/test example for simple sequence classification.
As well as
class transformers.InputFeatures
A single set of features of data. Property names are the same names as
the corresponding inputs to a model.
So basically InputExample is just a raw input and InputFeatures is the (numerical) feature representation of that Input that the model uses.
I couldn't find any tutorial explicitly explaining this but you can check out Chapter 4 (From text to features) in this tutorial where it is nicely explained on an example.
From my experience the transformers library has an absolute ton of classes and structures so going too deep into the technical implementation can make it easy to get lost in. For starters I would recommend trying to get an idea of the broader picture by just getting some example projects to work as well as checking out their 🤗 Course.

How to save model architecture in PyTorch?

I know I can save a model by torch.save(model.state_dict(), FILE) or torch.save(model, FILE). But both of them don't save the architecture of model.
So how can we save the architecture of a model in PyTorch like creating a .pb file in Tensorflow ? I want to apply different tweaks to my model. Do I have any better way than copying the whole class definition every time and creating a new class if I can't save the architecture of a model?
You can refer to this article to understand how to save the classifier. To make a tweaks to a model, what you can do is create a new model which is a child of the existing model.
class newModel( oldModelClass):
def __init__(self):
super(newModel, self).__init__()
With this setup, newModel has all the layers as well as the forward function of oldModelClass. If you need to make tweaks, you can define new layers in the __init__ function and then write a new forward function to define it.
Saving all the parameters (state_dict) and all the Modules is not enough, since there are operations that manipulates the tensors, but are only reflected in the actual code of the specific implementation (e.g., reshapeing in ResNet).
Furthermore, the network might not have a fixed and pre-determined compute graph: You can think of a network that has branching or a loop (recurrence).
Therefore, you must save the actual code.
Alternatively, if there are no branches/loops in the net, you may save the computation graph, see, e.g., this post.
You should also consider exporting your model using onnx and have a representation that captures both the trained weights as well as the computation graph.
Regarding the actual question:
So how can we save the architecture of a model in PyTorch like creating a .pb file in Tensorflow ?
The answer is: You cannot
Is there any way to load a trained model without declaring the class definition before ?
I want the model architecture as well as parameters to be loaded.
no, you have to load the class definition before, this is a python pickling limitation.
https://discuss.pytorch.org/t/how-to-save-load-torch-models/718/11
Though, there are other options (probably you have already seen most of those) that are listed at this PyTorch post:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
PyTorch's way of serializing a model for inference is to use torch.jit to compile the model to TorchScript.
PyTorch's TorchScript supports more advanced control flows than TensorFlow, and thus the serialization can happen either through tracing (torch.jit.trace) or compiling the Python model code (torch.jit.script).
Great references:
Video which explains this: https://www.youtube.com/watch?app=desktop&v=2awmrMRf0dA
Documentation: https://pytorch.org/docs/stable/jit.html

Using a pytorch model for inference

I am using the fastai library (fast.ai) to train an image classifier. The model created by fastai is actually a pytorch model.
type(model)
<class 'torch.nn.modules.container.Sequential'>
Now, I want to use this model from pytorch for inference. Here is my code so far:
torch.save(model,"./torch_model_v1")
the_model = torch.load("./torch_model_v1")
the_model.eval() # shows the entire network architecture
Based on the example shown here: http://pytorch.org/tutorials/beginner/data_loading_tutorial.html#sphx-glr-beginner-data-loading-tutorial-py, I understand that I need to write my own data loading class which will override some of the functions in the Dataset class. But what is not clear to me is the transformations that I need to apply at test time? In particular, how do I normalize the images at test time?
Another question: is my approach of saving and loading the model in pytorch fine? I read in the tutorial here: http://pytorch.org/docs/master/notes/serialization.html that the approach that I have used is not recommended. The reason is not clear though.
Just to clarify: the_model.eval() not only prints the architecture, but sets the model to evaluation mode.
In particular, how do I normalize the images at test time?
It depends on the model you have. For instance, for torchvision modules, you have to normalize the inputs this way.
Regarding on how to save / load models, torch.save/torch.load "saves/loads an object to a disk file."
So, if you save the_model, it will save the entire model object, including its architecture definition and some other internal aspects. If you save the_model.state_dict(), it will save a dictionary containing the model state (i.e. parameters and buffers) only. Saving the model can break the code in various ways, so the preferred method is to save and load only the model state. However, I'm not sure if fast.ai "model file" is actually a full model or the state of a model. You have to check this so you can correctly load it.

sklearn pickled model "Attribute Error: model has no attribute classes_"

I built a sklearn model in one PC and pickled it. When, I tried to use the same model in another PC, I get below error:
Attribute Error: model has no attribute classes_
When I built the model , I did check
Model.classes_
It printed the classes. What could be the reason for this?
it would help knowing the version details of scikit-learn in both the PCs but the information on the website tells it is possible to happen due to different versions. Hope it helps.
Think of pickle as a way to dump and load a snapshot of an object and its environment at a given moment.
Sometimes the object you deal with does not mean anything by itself . You must provide extra data with it.
That's particularly the case with trained classifiers. In your case, model_classes works perfectly in your script when you have you data and fitted your classifier. Suppose now that you have dumped the classifier and loaded it later in another script : what are the classes you are talking about ? About what data are we talking about ? ... Got it ?
What you have to do then is to provide additional meta data when pickling.
This section of sklearn's documentation describes what needs to be pickled alonside the classifier (training data, source code ...).
NB:
Check first that both versions of sklearn are the same. Sometimes it can be just that.

How to get the probability per instance in classifications models in spark.mllib

I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. Using these packages I produce classification models. Only these models predict a specific class per instance. In Weka, we can get the exact probability for each instance to be of each class. How can we do it using these packages?
In LogisticRegressionModel we can set the threshold. So I've created a function that check the results for each point on a different threshold. But this cannot be done for RandomForest (see How to set cutoff while training the data in Random Forest in Spark)
Unfortunately, with MLLIb you can't get the probabilities per instance for classification models till version 1.4.1.
There is JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic which is IN PROGRESS as I'm writing the answer now. Nevertheless, the issue seems to be on hold since November 2014
There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label.
And here is a note from #sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
Reference : source.
MAJOR EDIT: This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.

Resources