Feature importance/selection per class. How? - python-3.x

I followed this scikit learn guidance to find feature importance for a classification problem. Here's the code from the link:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
The problem is that this is not what I actually want. What I'd like to do is discover feature importance per class.
One idea that comes to mind is to turn the data into a binary classification problem per class and train a DecisionTree for each one.
Is that a good approach? What are common ways of dealing with this problem?
Thanks!

Yes, one-vs-all classification is a common way of dealing with that issue. You could take that approach. While I don't think there is a principled way of obtaining class-specific feature importance for random forests, you could use the SHAP package to get Shapley values empirically.
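A minimal sketch of the one-vs-all idea, reusing the iris data and ExtraTreesClassifier from the question. The names y_binary and per_class_importances are made up for illustration, and random_state is fixed only to make the run repeatable. These are impurity-based importances of each binary model, so treat them as a rough per-class ranking rather than a principled attribution.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# Fit one binary "this class vs. the rest" forest per class and keep
# its impurity-based importances as that class's feature ranking.
per_class_importances = {}
for cls in np.unique(y):
    y_binary = (y == cls).astype(int)  # 1 for the current class, 0 otherwise
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    clf.fit(X, y_binary)
    per_class_importances[int(cls)] = clf.feature_importances_

for cls, importances in per_class_importances.items():
    print(cls, np.round(importances, 3))
```

Each row sums to 1, so the values are comparable within a class but not directly across classes.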

Related

Is it possible to set the splitting strategy for GridSearchCV?

I'm optimizing a model's hyperparameters with GridSearchCV. Because the data I'm working with is very imbalanced, I need to control how the algorithm splits the train/test sets, so that the underrepresented points end up in both sets.
From reading scikit-learn's documentation, I have the impression that it's possible to set the splitting strategy for GridSearchCV, but I'm not sure how, or whether this is actually the case.
I would be very grateful if someone could help me with this.
Yes, pass a StratifiedKFold object to GridSearchCV as the cv argument.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
If you are training a classification model with GridSearchCV and leave cv at its default (or pass an integer), the splitter used is StratifiedKFold, which preserves the class proportions of the target variable in each fold.
If your dataset is imbalanced with respect to something other than the target variable, you can choose another criterion for the split. Read the GridSearchCV documentation carefully and select an appropriate CV splitter.
The model selection section of the scikit-learn documentation lists many splitter classes that you could use. You can also define your own splitter according to your own criteria, but that is more work.
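One way to split on a criterion other than the target, without writing a splitter class, is to precompute the (train, test) index pairs and pass them as cv, which GridSearchCV accepts as an iterable. This is only a sketch: the strata variable below is a hypothetical secondary label invented for the example (whether sepal length is above its median), not something from the question.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hypothetical secondary label to balance across folds (not the target):
# here, whether the first feature is above its median.
strata = (X[:, 0] > np.median(X[:, 0])).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# split() stratifies on its second argument; materialize the index pairs
# so GridSearchCV can reuse them for every parameter combination.
splits = list(skf.split(X, strata))

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters, cv=splits)
clf.fit(X, y)
print(clf.best_params_)
```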

Is it necessary to normalize data before using MLPRegressor?

I want to use MLPRegressor from scikit-learn in Python, and my input features are on different scales.
Here is my code:
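from sklearn.neural_network import MLPRegressor

# `committee` (the hidden-layer size) is defined elsewhere in the asker's code
smlp = MLPRegressor(hidden_layer_sizes=(committee,),
                    activation='relu',
                    solver='adam',
                    learning_rate='adaptive',
                    max_iter=3000,
                    learning_rate_init=0.01,
                    alpha=0.01)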
It is better to standardize the input data, as this improves convergence. You can use StandardScaler for that:
from sklearn.preprocessing import StandardScaler
Regarding the output values: you might want to standardize them too, as it might help convergence. However, the results will be harder to interpret afterwards.
Nevertheless, if you are aiming to work with neural networks, it might be worth looking into the keras library, which offers more up-to-date functionality, GPU training, etc.
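A sketch of how the standardization above could be wired up, assuming synthetic data from make_regression and a hidden-layer size of 20 (both chosen only for illustration). The scaler for the inputs sits in a pipeline, and TransformedTargetRegressor scales the target for training and maps predictions back to the original units, which sidesteps the interpretation problem.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the multiplication puts the features on very different scales.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X = X * [1, 10, 100, 1000, 10000]

mlp = MLPRegressor(hidden_layer_sizes=(20,), max_iter=3000,
                   alpha=0.01, random_state=0)

model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), mlp),  # scales the inputs
    transformer=StandardScaler(),                    # scales the target
)
model.fit(X, y)

# Predictions come back in the original units of y.
pred = model.predict(X[:3])
print(pred)
```

Because the scalers are fitted inside the estimator, the same object can be passed to cross-validation without leaking test-set statistics into training.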

How to know which features have more impact in predicting the target class?

I have a business problem: I ran a regression model in Python to predict my target value. When validating it on my test set, I found that the predicted values are very far from the actual values. What I want to extract from this model is which features were responsible for the prediction deviating from the actual value (say, when the difference exceeds some threshold).
I want to rank the features by impact so that I can report this to my client.
Thanks
It depends on the estimator you chose. Linear models usually expose a coef_ attribute holding the coefficient fitted for each feature; provided the features are normalized, the magnitudes of these coefficients tell you what you want to know.
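For the linear case, a small sketch of ranking features by coefficient magnitude. The data comes from make_regression purely for illustration, and the step name 'linearregression' is the one make_pipeline generates automatically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False the two informative features are columns 0 and 1.
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       random_state=0, shuffle=False)

# Standardize first so the coefficients are on a comparable scale.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

coefs = model.named_steps['linearregression'].coef_
ranking = np.argsort(-np.abs(coefs))  # feature indices, largest |coef| first
for i in ranking:
    print(f"feature {i}: coef = {coefs[i]:.3f}")
```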
As mentioned above, tree models expose feature importances. You can also use libraries like treeinterpreter, described here: Interpreting Random Forest (examples).
You can also have a look at this: Feature selection
Check the RandomForestRegressor for performing regression.
# Example
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
                             n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
Check regr.feature_importances_ to find the most important features (higher values mean more important). Further information on FeatureImportance
Edit-1:
As pointed out in the comment by user #blacksite, feature_importances_ alone does not give a complete interpretation of a random forest. For further analysis of the results and the features responsible, please refer to the following blogs:
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 (preferred as it provides multiple techniques )
https://blog.datadive.net/interpreting-random-forests/ (focuses on 1 technique but also provides python library - treeinterpreter)
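As one more option that stays inside scikit-learn, here is a sketch of permutation importance on held-out data. This is a different technique from treeinterpreter and is not necessarily among those covered in the blogs above; the dataset and the train/test split are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, y_train)

# Shuffle one feature at a time on the test set and measure how much
# the model's score drops; larger drops mean the model relies on it more.
result = permutation_importance(regr, X_test, y_test,
                                n_repeats=10, random_state=0)

for i in np.argsort(-result.importances_mean):
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```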
More on feature_importance:
If you simply use the feature_importances_ attribute to select the features with the highest importance score: Feature selection using feature importances
Feature importance also depends on the criteria used for splitting and calculating importance: Interpreting Decision Tree in context of feature importances

Stratified cross validation with Pytorch

My goal is to do binary classification using a neural network.
The problem is that the dataset is unbalanced: I have 90% of class 1 and 10% of class 0.
To deal with it I want to use stratified cross-validation.
The problem is that I am working with Pytorch and can't find any example; the documentation doesn't provide one, and I'm a student, quite new to neural networks.
Can anybody help?
Thank you!
The easiest way I've found is to do your stratified splits before passing your data to the Pytorch Dataset and DataLoader. That lets you avoid having to port all your code to skorch, which can break compatibility with some cluster computing frameworks.
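A rough sketch of that approach, using only numpy and scikit-learn so it runs without a GPU setup. The random data mirrors the 90/10 imbalance from the question, and the comment marks where the PyTorch objects would be built; the training loop itself is left out.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data with the 90% / 10% class imbalance from the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
y = np.array([1] * 900 + [0] * 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    # Here the arrays would be wrapped for PyTorch, e.g. in a
    # torch.utils.data.TensorDataset plus DataLoader, and a fresh
    # model trained and evaluated for this fold.
    fold_ratios.append(float(y_val.mean()))
    print(f"fold {fold}: {len(train_idx)} train, {len(val_idx)} val, "
          f"class-1 share in val = {y_val.mean():.2f}")
```

Every validation fold keeps the same class proportions as the full dataset, which is the point of stratifying.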
Have a look at skorch. It's a scikit-learn compatible neural network library that wraps PyTorch. It provides CVSplit for cross validation, or you can use sklearn's cross-validation utilities.
From the docs:
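from sklearn.model_selection import cross_val_predict
from skorch import NeuralNetClassifier

# MyModule is your torch.nn.Module subclass; X and y are your data
net = NeuralNetClassifier(
    module=MyModule,
    train_split=None,  # disable skorch's internal validation split
)
y_pred = cross_val_predict(net, X, y, cv=5)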

Use gensim Random Projection in sklearn SVM

Is it possible to use a gensim Random Projection to train a SVM in sklearn?
I need to use gensim's tfidf implementation because it's better at dealing with large inputs and then want to put that into a random projection on which I will train my SVM. I'd also be happy to just pass the tfidf model generated by gensim to sklearn and use their random projection, if that makes things easier.
But so far I haven't found a way to get either model out of gensim into sklearn.
I have tried using gensim.matutils.corpus2csc, but of course that doesn't work: neither TfidfModel nor RpModel is a corpus, so now I'm clueless about what to try next.
This is now very easy thanks to an awesome gensim contribution from Chinmaya Pancholi (see post here).
Simply import the sklearn wrapper from gensim (the sklearn_api module ships with gensim releases before 4.0):
from gensim.sklearn_api import RpTransformer
Then, you can use the model as you would any other sklearn transformer, for example as a step in a Pipeline:
from sklearn import svm
from sklearn.pipeline import Pipeline

model = RpTransformer(num_topics=2)
clf = svm.SVC()
pipe = Pipeline([('features', model), ('classifier', clf)])
pipe.fit(X_train, y_train)
One thing to be aware of, when using the gensim models, is that you still need to perform the dictionary and corpus steps. So instead of fitting your model on X_train, you'll have to do something along the following lines:
from gensim.corpora import Dictionary

dictionary = Dictionary(X_train)  # X_train: a list of tokenized documents
corpus_train = [dictionary.doc2bow(text) for text in X_train]
corpus_test = [dictionary.doc2bow(text) for text in X_test]
Then fit/predict your model on corpus_train or corpus_test.
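If moving the whole chain to scikit-learn is acceptable (the question mentions being open to sklearn's random projection), a sketch along these lines avoids the gensim bridge entirely. It swaps gensim's tf-idf for TfidfVectorizer, so it does not address the large-input concern from the question; the toy documents and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.random_projection import SparseRandomProjection
from sklearn.svm import SVC

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "my cat chased the dog",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
]
labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),                      # raw text -> sparse tf-idf
    ('rp', SparseRandomProjection(n_components=2,      # tf-idf -> 2 dimensions
                                  random_state=0)),
    ('classifier', SVC()),
])
pipe.fit(docs, labels)
pred = pipe.predict(docs)
print(pred)
```

On real data n_components would be far larger than 2; the tiny value here just mirrors num_topics=2 from the answer above.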
