I am currently working on the MLPClassifier of the neural_network package in sklearn.
I have fit the model; I want to access the weights given by the classifier to the input features. How do I access them?
Thanks in advance!
Check out the documentation.
See the field coefs_.
Try:
print model.coefs_
Generally, I recommend:
checking the documentation
if that fails, then
print dir(model)
or
help(model)
will tell you what's available for in most cases.
Related
I followed this scikit learn guidance to find feature importance for a classification problem. Here's the code from the link:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
The problem is that, it's not actually what I really want. What I'd like to do is to discover feature importance per class.
One idea that comes to my mind is to turn the data into a binary classification, per class and to train a DecisionTree per class.
Is that a good approach? What are common ideas to deal with this problem?
Thanks!
Yes, one-vs-all classification is a common way of dealing with that issue. You could take that approach. While I don't think there is a principled way of obtaining class-specific feature importance for random forests, you could use the SHAP package to get Shapley values empirically.
Unlike GridSearchCV, CalibratedClassifierCV doesn't seem to support passing the groups parameter to the fit method. I found this very old github issue that reports this problem but it doesn't seem to be fixed yet. The documentation makes it clear that not properly stratifying your cv folds will result in an incorrectly calibrated model. My dataset has multiple observations from the same users so I would need to use GroupKFold to ensure proper calibration.
scikit-learn can take an iterable of (train, test) splits as the cv object, so just create them manually. For example:
my_cv = (
(train, test) for train, test in GroupKFold(n_splits=5).split(X, groups=my_groups)
)
cal_clf = CalibratedClassifierCV(clf, cv=my_cv)
I've created a modified version of CalibratedClassifierCV that addresses this issue for now. Until this is fixed in sklearn master, you can similarly modify the fit method of CalibratedClassifierCV to use GroupKFold. My solution can be found in this gist. This is based of sklearn version 0.24.1 but you can easily adapt it to your version of sklearn as needed.
I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern ? Remove "noisy" features who drive down your results, obtain a sparse model ? Backward selection is one way to do of course. That being said, not sure if you are aware of this but XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define for instance a threshold below which you do not keep the variables. However do not forget that :
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.
My goal is to make binary classification, using neural network.
The problem is that dataset is unbalanced, I have 90% of class 1 and 10 of class 0.
To deal with it I want to use Stratified cross-validation.
The problem that is I am working with Pytorch, I can't find any example and documentation doesn't provide it, and I'm student, quite new for neural networks.
Can anybody help?
Thank you!
The easiest way I've found is to do you stratified splits before passing your data to Pytorch Dataset and DataLoader. That lets you avoid having to port all your code to skorch, which can break compatibility with some cluster computing frameworks.
Have a look at skorch. It's a scikit-learn compatible neural network library that wraps PyTorch. It has a function CVSplit for cross validation or you can use sklearn.
From the docs:
net = NeuralNetClassifier(
module=MyModule,
train_split=None,
)
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(net, X, y, cv=5)
Trying to use SVC from sklearn to do a classification problem. Given a bunch of data, and information telling me whether some subject is in a certain class or not, I want to be able to give a probability that a new, unknown subject is in a class.
I only have 2 classes, so the problem is binary. Here is my code and some of my errors
from sklearn.svm import SVC
clf=SVC()
clf=clf.fit(X,Y)
SVC(probability=True)
print clf.predict_proba(W) #Error is here
But it returns the following error:
NotImplementedError: probability estimates must be enabled to use this method
How can I fix this?
You have to construct the SVC object with probability=True
from sklearn.svm import SVC
clf=SVC(probability=True)
clf.fit(X,Y)
print clf.predict_proba(W) #No error
Your code creates a SVC with probability estimates and discards it (as you do not store it in any variable) and use some previous SVC stored in clf (without probability)
Always set the parameters before fit.
from sklearn.svm import SVC
clf=SVC(probability=True)
clf=clf.fit(X,Y)
print clf.predict_proba(W)