scikit learn: learning curve without information leak? - scikit-learn

I would like to generate a learning curve for an LinearSVC estimator that is using countVectorizer to extract the features. The countVectorizer is also applying some feature selection step.
I could do the following:
fit the vectorizer on all data, including selection of top N features
use these features in fitting the linearSVC
use the linearSVC as the estimator in sklearn.model_selection.learning_curve()
But I think that it will result in information leak: information based on all data will be used to select features for the smaller sets used in the learning curve.
Is this correct?
Is there a way to use the built-in sklearn.model_selection.learning_curve() with countVectorizer without information leak?
Thank you!

You need to use a pipeline in conjunction with the learning_curve.
The pipeline will call fit_transform of the transformer when training and only transform when testing. The learning_curve will also apply cross-validation which can be controlled by the parameter cv.
With this pipeline, there is no leak of information. Here, is an example using an integrated toy library in scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve
categories = [
'alt.atheism',
'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None
data = fetch_20newsgroups(subset='train', categories=categories)
pipeline = make_pipeline(
CountVectorizer(), TfidfTransformer(), LinearSVC()
)
learning_curve(pipeline, data.data, data.target, cv=5)

Related

Feature importance/ selection per class. How?

I followed this scikit learn guidance to find feature importance for a classification problem. Here's the code from the link:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
X.shape
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)
clf.feature_importances_
The problem is that, it's not actually what I really want. What I'd like to do is to discover feature importance per class.
One idea that comes to my mind is to turn the data into a binary classification, per class and to train a DecisionTree per class.
Is that a good approach? What are common ideas to deal with this problem?
Thanks!
Yes, one-vs-all classification is a common way of dealing with that issue. You could take that approach. While I don't think there is a principled way of obtaining class-specific feature importance for random forests, you could use the SHAP package to get Shapley values empirically.

is it possible to set the splitting strategy for GridSearchCv?

I'm optimizing model's hyperparameters by GridSearchCv. And because the data I'm working with is very imbalanced, I need to "choose" the manner that the algortihm splits the train/test sets in order to ensure that the underrepresented points are in both sets.
By reading scikit-learn's documentation, I have the idea that it's possible to set the splitting strategy for GridSearch but I'm not sure how or if this is the case.
I would be very grateful if someone could help me with this.
Yes, pass in the GridSearchCV as cv a StratifiedKFold object.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, if you are training a classification model with GridSearchCV, the default method for splitting the dataset is StratifiedKFold, that takes care of balancing the dataset according to the target variable.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criteria to perform the split. Carefully read the documentation of GridSearchCV, and select an appropriate CV splitter.
In the scikit-learn documentation of model selection, there are many Splitter Classes that you could use. Or you can define your own splitter class according to your criteria, but it would be more difficult.

Single prediction using a model pre-trained with scaled features

I trained a SVM scikit-learn model with scaled features and persist it to be used later. In another file I loaded the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values and I am getting weird outcomes and I cannot do the predictions. Despite of this, the prediction with a large test set generated by StratifiedShuffleSplit is working fine and I am getting a 97% of accuracy.
The problem is with the single predictions using a persisted SVM model trained with scaled features. Some idea of what am I doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why instead of training and saving your SVM, you should train and save your scaler with your SVM together. In the machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler',StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you will load it and apply it to new data, it will do the scaling as desired before the predictions.

ValueError: could not convert string to float in panda

My code is :
import pandas as pd
data = pd.read_table('train.tsv')
X=data.Phrase
Y=data.Sentiment
from sklearn import cross_validation
X_train,X_test,Y_train,Y_test=cross_validation.train_test_split(X,Y,test_size=0.2,random_state=0)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,Y)
I get the error :ValueError: could not convert string to float:
What changes can I make that my code works?
You can't pass in text data into MultinomialNB of scikit-learn as stated in its documentation.
None of the algorithms in scikit-learn works directly with text data. You need to do some preprocessing to get desired output. You'll need to first extract the features from text data using techniques like bagging or tokenizing. Have a look at this link for better understanding.
You also might want to look at using NLTK for such use cases as yours.
ValueError when using Multinomial Naive Bayes classifier
You probably should preprocess your data as shown in the answer above.

sklearn: vectorizing in cross validation for text classification

I have a question about using cross validation in text classification in sklearn. It is problematic to vectorize all data before cross validation, because the classifier would have "seen" the vocabulary occurred in the test data. Weka has filtered classifier to solve this problem. What is the sklearn equivalent for this function? I mean for each fold, the feature set would be different because the training data are different.
The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])
clf is now a composite estimator that does feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of string) documents and their labels y, calling
>>> cross_val_score(clf, documents, y)
will do feature extraction in each fold separately so that each of the SVMs knows only the vocabulary of its (k-1) folds training set.

Resources