Using sklearn RFE with an estimator from another package - scikit-learn

Is it possible to use sklearn Recursive Feature Elimination(RFE) with an estimator from another package?
Specifically, I want to use GLM from statsmodels package and wrap it in sklearn RFE?
If yes, could you please give some examples?

Yes, it is possible. You just need to create a class that inherit sklearn.base.BaseEstimator, make sure it has fit & predict methods, and make sure its fit method expose feature importance through either coef_ or feature_importances_ attribute. Here is a simplified example of a class:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
class MyEstimator(BaseEstimator):
def __init__(self):
self.model = LogisticRegression()
def fit(self, X, y, **kwargs):
self.model.fit(X, y)
self.coef_ = self.model.coef_
def predict(self, X):
result = self.model.predict(X)
return np.array(result)
if __name__ == '__main__':
X, y = make_classification(n_features=10, n_redundant=0, n_informative=7, n_clusters_per_class=1)
estimator = MyEstimator()
selector = RFE(estimator, 5, step=1)
selector = selector.fit(X, y)
print(selector.support_)
print(selector.ranking_)

Related

What can be the cause of the validation loss increasing and the accuracy remaining constant to zero while the train loss decreases?

I am trying to solve a multiclass text classification problem. Due to specific requirements from my project I am trying to use skorch (https://skorch.readthedocs.io/en/stable/index.html) to wrap pytorch for the sklearn pipeline. What I am trying to do is fine-tune a pretrained version of BERT from Huggingface (https://huggingface.co) with my dataset. I have tried, in the best of my knowledge, to follow the instructions from skorch on how I should input my data, structure the model etc. Still during the training the train loss decreases until the 8th epoch where it starts fluctuating, all while the validation loss increases from the beginning and the validation accuracy remains constant to zero. My pipeline setup is
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
[
("tokenizer", Tokenizer()),
("classifier", _get_new_transformer())
]
in which I am using a tokenizer class to preprocess my dataset, tokenizing it for BERT and creating the attention masks. It looks like this
import torch
from transformers import AutoTokenizer, AutoModel
from torch import nn
import torch.nn.functional as F
from sklearn.base import BaseEstimator, TransformerMixin
from tqdm import tqdm
import numpy as np
class Tokenizer(BaseEstimator, TransformerMixin):
def __init__(self):
super(Tokenizer, self).__init__()
self.tokenizer = AutoTokenizer.from_pretrained(/path/to/model)
def _tokenize(self, X, y=None):
tokenized = self.tokenizer.encode_plus(X, max_length=20, add_special_tokens=True, pad_to_max_length=True)
tokenized_text = tokenized['input_ids']
attention_mask = tokenized['attention_mask']
return np.array(tokenized_text), np.array(attention_mask)
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
word_tokens, attention_tokens = np.array([self._tokenize(string)[0] for string in tqdm(X)]), \
np.array([self._tokenize(string)[1] for string in tqdm(X)])
X = word_tokens, attention_tokens
return X
def fit_transform(self, X, y=None, **fit_params):
self = self.fit(X, y)
return self.transform(X, y)
then I initialize the model I want to fine-tune as
class Transformer(nn.Module):
def __init__(self, num_labels=213, dropout_proba=.1):
super(Transformer, self).__init__()
self.num_labels = num_labels
self.model = AutoModel.from_pretrained(/path/to/model)
self.dropout = torch.nn.Dropout(dropout_proba)
self.classifier = torch.nn.Linear(768, num_labels)
def forward(self, X, **kwargs):
X_tokenized, attention_mask = torch.stack([x.unsqueeze(0) for x in X[0]]),\
torch.stack([x.unsqueeze(0) for x in X[1]])
_, X = self.model(X_tokenized.squeeze(), attention_mask.squeeze())
X = F.relu(X)
X = self.dropout(X)
X = self.classifier(X)
return X
I initialize the model and create the classifier with skorch as follows
from skorch import NeuralNetClassifier
from skorch.dataset import CVSplit
from skorch.callbacks import ProgressBar
import torch
from transformers import AdamW
def _get_new_transformer() -> NeuralNetClassifier:
transformer = Transformer()
net = NeuralNetClassifier(
transformer,
lr=2e-5,
max_epochs=10,
criterion=torch.nn.CrossEntropyLoss,
optimizer=AdamW,
callbacks=[ProgressBar(postfix_keys=['train_loss', 'valid_loss'])],
train_split=CVSplit(cv=2, random_state=0)
)
return net
and I use fit like that
pipeline.fit(X=dataset.training_samples, y=dataset.training_labels)
in which my training samples are lists of strings and my labels are the an array containing the indexes of each class, as pytorch requires.
This is a sample of what happens
training history
I have tried to keep train only the fully connected layer and not BERT but I have the same issue again. I also tested the train accuracy after the training process and it was only 0,16%. I would be grateful for any advice or insight on how to solve my problem! I am pretty new with skorch and not so comfortable with pytorch yet and I believe that I am missing something really simple. Thank you very much in advance!

Custom k-means clustering GridSearchCV

I am trying to find the 'best' value of k for k-means clustering by using a pipeline where I use a standard scaler followed by custom k-means which is finally followed by a Decision Tree classifier. I am then trying to use this pipeline for a Grid Search to get the best value of k. Python 3.7 and sklearn are being used.
The code I have is as follows:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchC
class KMeansTransformer(BaseEstimator, TransformerMixin):
def __init__(self, **kwargs):
# The purpose of 'self.model' is to contain the
# underlying cluster model-
self.model = KMeans(**kwargs)
def fit(self, X):
self.X = X
self.model.fit(X)
def transform(self, X):
pred = self.model.predict(X)
return np.hstack([self.X, pred.reshape(-1, 1)])
def fit_transform(self, X, y=None):
self.fit(X)
return self.transform(X)
# Create features and target-
X, y = make_blobs(n_samples=100, n_features=2, centers=3)
# Get shape/dimension-
X.shape, y.shape
# ((100, 2), (100,))
# Create another pipeline using Decision Tree as classifier-
pipe_dt = Pipeline(
[
('sc', StandardScaler()),
('kmt', KMeansTransformer()),
('dt_clf', DecisionTreeClassifier())
]
)
# Train defined pipline-
pipe_dt.fit(X, y)
# Get accuracy score of pipeline-
pipe_dt.score(X, y)
# 1.0
# Make predictions using pipeline defined above-
y_pred_dt = pipe_dt.predict(X)
# Perform hyperparameter search/optimization using 'GridSearchCV'-
# Specify parameters to be hyper-tuned-
params = {
'n_clusters': [2, 3, 5, 7]
}
# Initialize GridSearchCV() object using 3-fold CV-
grid_kmt = GridSearchCV(param_grid=params, estimator=pipe_dt, cv = 3)
# Perform GridSearchCV on training data-
grid_kmt.fit(X, y)
When I use 'grid_kmt.fit(X, y)' it gives me the following error:
ValueError: Invalid parameter n_clusters for estimator
Pipeline(memory=None,
steps=[('sc',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('kmt', KMeansTransformer()),
('dt_clf',
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated', random_state=None,
splitter='best'))],
verbose=False). Check the list of available parameters with
estimator.get_params().keys().
However, when I initialize an object for custom kmeans-
# Initialize a new clustering object-
km = KMeansTransformer(n_clusters=3, init = 'k-means++')
# Get the list of available parameters-
km.get_params().keys()
# dict_keys([])
Then why am I getting a 'ValueError'? n_clusters happens to be in the list of available parameters for custom clustering object.
Looking closely at the error message:
ValueError: Invalid parameter n_clusters for estimator Pipeline [...]
it's clear that your GridSearchCV looks for a parameter n_clusters in the pipeline itself (not in its components, that is), can't find any, and returns an error. To correctly access the n_clusters parameter of your ('kmt', KMeansTransformer()) component, you should use
params = {
'kmt__n_clusters': [2, 3, 5, 7] # two underscores
}
provided of course that your own KMeansTransformer does accept a parameter n_clusters.

Pipeline and GridSearch for Doc2Vec

I currently have following script that helps to find the best model for a doc2vec model. It works like this: First train a few models based on given parameters and then test against a classifier. Finally, it outputs the best model and classifier (I hope).
Data
Example data (data.csv) can be downloaded here: https://pastebin.com/takYp6T8
Note that the data has a structure that should make an ideal classifier with 1.0 accuracy.
Script
import sys
import os
from time import time
from operator import itemgetter
import pickle
import pandas as pd
import numpy as np
from argparse import ArgumentParser
from gensim.models.doc2vec import Doc2Vec
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from gensim.models import KeyedVectors
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from sklearn.base import BaseEstimator
from gensim import corpora
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
dataset = pd.read_csv("data.csv")
class Doc2VecModel(BaseEstimator):
def __init__(self, dm=1, size=1, window=1):
self.d2v_model = None
self.size = size
self.window = window
self.dm = dm
def fit(self, raw_documents, y=None):
# Initialize model
self.d2v_model = Doc2Vec(size=self.size, window=self.window, dm=self.dm, iter=5, alpha=0.025, min_alpha=0.001)
# Tag docs
tagged_documents = []
for index, row in raw_documents.iteritems():
tag = '{}_{}'.format("type", index)
tokens = row.split()
tagged_documents.append(TaggedDocument(words=tokens, tags=[tag]))
# Build vocabulary
self.d2v_model.build_vocab(tagged_documents)
# Train model
self.d2v_model.train(tagged_documents, total_examples=len(tagged_documents), epochs=self.d2v_model.iter)
return self
def transform(self, raw_documents):
X = []
for index, row in raw_documents.iteritems():
X.append(self.d2v_model.infer_vector(row))
X = pd.DataFrame(X, index=raw_documents.index)
return X
def fit_transform(self, raw_documents, y=None):
self.fit(raw_documents)
return self.transform(raw_documents)
param_grid = {'doc2vec__window': [2, 3],
'doc2vec__dm': [0,1],
'doc2vec__size': [100,200],
'logreg__C': [0.1, 1],
}
pipe_log = Pipeline([('doc2vec', Doc2VecModel()), ('log', LogisticRegression())])
log_grid = GridSearchCV(pipe_log,
param_grid=param_grid,
scoring="accuracy",
verbose=3,
n_jobs=1)
fitted = log_grid.fit(dataset["posts"], dataset["type"])
# Best parameters
print("Best Parameters: {}\n".format(log_grid.best_params_))
print("Best accuracy: {}\n".format(log_grid.best_score_))
print("Finished.")
I do have following questions regarding my script (I combine them here to avoid three posts with the same code snippet):
What's the purpose of def __init__(self, dm=1, size=1, window=1):? Can I possibly remove this part, somehow (tried unsuccessfully)?
How can I add a RandomForest classifier (or others) to the GridSearch workflow/pipeline?
How could a train/test data split added to the code above, as the current script only trains on the full dataset?
1) init() lets you define the parameters you would like your class to take at initialization (equivalent to contructor in java).
Please look at these questions for more details:
Python __init__ and self what do they do?
Python constructors and __init__
2) Why do you want to add the RandomForestClassifier and what will be its input?
Looking at your other two questions, do you want to compare the output of RandomForestClassifier with LogisticRegression here? If so, you are doing good in this question of yours.
3) You have imported the train_test_split, just use it.
X_train, X_test, y_train, y_test = train_test_split(dataset["posts"], dataset["type"])
fitted = log_grid.fit(X_train, y_train)

Combining w2vec and feature selection in pipeline

Based on this article: http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/ I am trying to implement a gensim word2vec model with the pretrained vectors of GloVe in a text classification task. However, I would like to do FeatureSelection also in my text data. I tried multiple sequences in the pipeline but i get fast a memory error which points to the transform part of TfidfEmbeddingVectorizer.
return np.array([
np.mean([self.word2vec[w] * self.word2weight[w]
for w in words if w in self.word2vec] or
[np.zeros(self.dim)], axis=0)
for words in X
If I replace the TfidfEmbeddingVectorizer class with a regular TfIdfVectorizer it works properly. Is there a way I could combine SelectFromModel and W2vec in the pipeline?
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support as score, f1_score
import pickle
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
import gensim
import collections
class ItemSelector(BaseEstimator, TransformerMixin):
def __init__(self, column):
self.column = column
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X):
return (X[self.column])
class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer"""
def fit(self, x, y=None):
return self
def transform(self, posts):
return [{'REPORT_M': text}
for text in posts]
class TfidfEmbeddingVectorizer(object):
def __init__(self, word2vec):
self.word2vec = word2vec
self.word2weight = None
self.dim = len(word2vec.values())
def fit(self, X, y):
tfidf = TfidfVectorizer(analyzer=lambda x: x)
tfidf.fit(X)
# if a word was never seen - it must be at least as infrequent
# as any of the known words - so the default idf is the max of
# known idf's
max_idf = max(tfidf.idf_)
self.word2weight = collections.defaultdict(
lambda: max_idf,
[(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
return self
def transform(self, X):
return np.array([
np.mean([self.word2vec[w] * self.word2weight[w]
for w in words if w in self.word2vec] or
[np.zeros(self.dim)], axis=0)
for words in X
])
# training model
def train(data_train, data_val):
with open("glove.6B/glove.6B.50d.txt", "rb") as lines:
w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
for line in lines}
classifier = Pipeline([
('union', FeatureUnion([
('text', Pipeline([
('selector', ItemSelector(column='TEXT')),
("word2vec vectorizer", TfidfEmbeddingVectorizer(w2v)),
('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False),threshold=0.01))
])),
('category', Pipeline([
('selector', ItemSelector(column='category')),
('stats', TextStats()),
('vect', DictVectorizer())
]))
])),
('clf',ExtraTreesClassifier(n_estimators=200, max_depth=500, min_samples_split=6, class_weight= 'balanced'))])
classifier.fit(data_train,data_train.CLASSES)
predicted = classifier.predict(data_val)
I think in here self.dim = len(word2vec.values()) you should specify the dimension of the model. If you are using glove.6B.50d.txt, then the dimension should be 50.
len(word2vec.values()) is the total number of words, thus will create a huge matrix, i.e., memory error.

Scikit-learn pipeline for same data and steps fails to classifiy

I have vectors of floats that I created from doc2vec algorithm, and their labels. When i use them with a simple classifier, it works normally and gives an expected accuracy. Working code is below:
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
train_vecs #ndarray (20418,100)
#train_vecs = [[0.3244, 0.3232, -0.5454, 1.4543, ...],...]
y_train #labels
test_vecs #ndarray (6885,100)
y_test #labels
classifier = LinearSVC()
classifier.fit(train_vecs, y_train )
print('Test Accuracy: %.2f'%classifier.score(test_vecs, y_test))
However now I want to move it into a pipeline, because in the future I plan to do a feature union with different features. What I do is move the vectors into a dataframe, then use 2 custom transformers to i)select the column, ii) change the array type. Strangely the exact same data, with exact same shape, dtype and type.. gives 0.0005 accuracy. Which it does not make sense to me at all, it should give almost equal accuracy. After the ArrayCaster transformer the shapes and types of the inputs are exactly the same as before. The whole thing has been really frustrating.
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
# transformer that picks a column from the dataframe
class ItemSelector(BaseEstimator, TransformerMixin):
def __init__(self, column):
self.column = column
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X):
print('item selector type',type(X[self.column]))
print('item selector shape',len(X[self.column]))
print('item selector dtype',X[self.column].dtype)
return (X[self.column])
# transformer that converts the series into an ndarray
class ArrayCaster(BaseEstimator, TransformerMixin):
def fit(self, x, y=None):
return self
def transform(self, data):
print('array caster type',type(np.array(data.tolist())))
print('array caster shape',np.array(data.tolist()).shape)
print('array caster dtype',np.array(data.tolist()).dtype)
return np.array(data.tolist())
train_vecs #ndarray (20418,100)
y_train #labels
test_vecs #ndarray (6885,100)
y_test #labels
train['vecs'] = pd.Series(train_vecs.tolist())
val['vecs'] = pd.Series(test_vecs.tolist())
classifier = Pipeline([
('selector', ItemSelector(column='vecs')),
('array', ArrayCaster()),
('clf',LinearSVC())])
classifier.fit(train, y_train)
print('Test Accuracy: %.2f'%classifier.score(test, y_test))
Ok sorry about that.. I figured it out. The error is pretty annoying to notice. All I had to do is cast them as list and place them into the dataframe, instead of converting them to series.
Change this
train['vecs'] = pd.Series(train_vecs.tolist())
val['vecs'] = pd.Series(test_vecs.tolist())
into:
train['vecs'] = list(train_vecs)
val['vecs'] = list(test_vecs)

Resources