I am using a FeatureUnion to join features found from the title and description of events:
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),
        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],
    transformer_weights={
        'title': 1.0,
        'description': 0.2
    },
)
However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different Vectorizers. How do I do this?
It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?
You will have to implement this method in your custom transformer if you want this to work.
Here is a concrete example for you:
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd
dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']
# define first custom transformer
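class first_transform(TransformerMixin):
    def fit(self, df, y=None):
        # remember the column names so get_feature_names can return them
        self.columns_ = df.columns.tolist()
        return self
    def transform(self, df):
        return df
    def get_feature_names(self):
        return self.columns_
# define second custom transformer (same behaviour, to show the name prefixes)
class second_transform(TransformerMixin):
    def fit(self, df, y=None):
        self.columns_ = df.columns.tolist()
        return self
    def transform(self, df):
        return df
    def get_feature_names(self):
        return self.columns_
pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform())
    ])
)])
# fit first so that each transformer has seen the column names
>>> pipe.fit(X, y)
>>> pipe.named_steps['features'].get_feature_names()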
['custom_transform_first__CRIM',
'custom_transform_first__ZN',
'custom_transform_first__INDUS',
'custom_transform_first__CHAS',
'custom_transform_first__NOX',
'custom_transform_first__RM',
'custom_transform_first__AGE',
'custom_transform_first__DIS',
'custom_transform_first__RAD',
'custom_transform_first__TAX',
'custom_transform_first__PTRATIO',
'custom_transform_first__B',
'custom_transform_first__LSTAT',
'custom_transform_second__CRIM',
'custom_transform_second__ZN',
'custom_transform_second__INDUS',
'custom_transform_second__CHAS',
'custom_transform_second__NOX',
'custom_transform_second__RM',
'custom_transform_second__AGE',
'custom_transform_second__DIS',
'custom_transform_second__RAD',
'custom_transform_second__TAX',
'custom_transform_second__PTRATIO',
'custom_transform_second__B',
'custom_transform_second__LSTAT']
Keep in mind that FeatureUnion concatenates the lists returned by get_feature_names from each of your transformers. This is why you get an error when one or more of your transformers do not have this method.
However, this alone will not fix your problem, because Pipeline objects don't have a get_feature_names method and you have nested pipelines (pipelines within a FeatureUnion). So you have two options:
Subclass Pipeline and add a get_feature_names method yourself that gets the feature names from the last transformer in the chain.
Extract the feature names yourself from each of the transformers, which requires you to pull those transformers out of the pipeline and call get_feature_names on them.
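The first option could look roughly like the sketch below. FeatureNamePipeline and the sample titles are made up for illustration, and the method checks for get_feature_names_out first because newer scikit-learn releases expose that name on the vectorizers instead of get_feature_names. Treat it as a starting point under those assumptions, not a drop-in fix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class FeatureNamePipeline(Pipeline):
    """Hypothetical Pipeline subclass that reports the last step's feature names."""

    def get_feature_names(self):
        last_step = self.steps[-1][1]
        # Newer scikit-learn releases expose get_feature_names_out,
        # older ones expose get_feature_names; try both.
        if hasattr(last_step, "get_feature_names_out"):
            return list(last_step.get_feature_names_out())
        return list(last_step.get_feature_names())


# Illustrative data, not taken from the question
titles = ["Jazz night downtown", "Downtown food market"]
title_pipe = FeatureNamePipeline([
    ("count", CountVectorizer(stop_words="english")),
])
title_pipe.fit(titles)
names = title_pipe.get_feature_names()
print(names)
```

Whether a FeatureUnion will pick up this method depends on the scikit-learn version, so you may still need to collect the names per pipeline yourself.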
Also, keep in mind that many sklearn built in transformers don't operate on DataFrame but pass numpy arrays around, so just watch out for it if you are going to be chaining lots of transformers together. But I think this gives you enough information to give you an idea of what is happening.
One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.
You can access the individual vectorizers inside the nested pipelines like this (thanks edesz):
pipevect= dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']
You then have the vectorizer instance (here the CountVectorizer of the 'title' pipeline) to pass to another function:
Show_most_informative_features(pipevect,
pipeline.named_steps['classifier'], n=MostIF)
# or direct
print(pipevect.get_feature_names())
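Here is a self-contained sketch of that lookup. The TextSelector class and the two event records are stand-ins I made up, since the original custom transformer and data were not posted. It reads the fitted vocabulary_ attribute of each vectorizer, which avoids depending on whether your scikit-learn version calls the method get_feature_names or get_feature_names_out.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Stand-in for the question's custom selector: picks one field per record."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record[self.key] for record in X]


# Illustrative records, not taken from the question
events = [
    {"title": "Jazz night downtown", "description_snippet": "Live jazz trio and drinks"},
    {"title": "Downtown food market", "description_snippet": "Street food from local vendors"},
]

union = FeatureUnion([
    ("title", Pipeline([
        ("selector", TextSelector(key="title")),
        ("count", CountVectorizer(stop_words="english")),
    ])),
    ("description", Pipeline([
        ("selector", TextSelector(key="description_snippet")),
        ("count", TfidfVectorizer(stop_words="english")),
    ])),
])
features = union.fit_transform(events)


def vocabulary_terms(vectorizer):
    # vocabulary_ maps each term to its column index; order terms by that index
    return [term for term, _ in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])]


feature_names = []
for name, pipe in union.transformer_list:
    vectorizer = pipe.named_steps["count"]
    feature_names.extend(f"{name}__{term}" for term in vocabulary_terms(vectorizer))

print(len(feature_names), features.shape[1])
```

The name prefix mirrors the "transformer__feature" convention shown in the earlier output, so the list lines up column by column with the matrix returned by the union.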
I have designed the following pipelines to train my models:
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
num_imputer = SimpleImputer(strategy='constant', fill_value=0, add_indicator=True)
categorical_pipeline = Pipeline([
    ('imputer', cat_imputer),
    ('encoder', OneHotEncoder())
])
numerical_pipeline = Pipeline([
    ('imputer', num_imputer)
])
# the selectors were not shown in the original post; dtype-based selection is assumed here
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
def get_column_types(X):
    numerical_columns = numerical_columns_selector(X)
    categorical_columns = categorical_columns_selector(X)
    return numerical_columns, categorical_columns
def get_transformer(X, y):
    numerical_columns, categorical_columns = get_column_types(X)
    transformer = ColumnTransformer([
        ('cat_pipe', categorical_pipeline, categorical_columns),
        ('num_pipe', numerical_pipeline, numerical_columns)
    ])
    return transformer
When I fit the transformer on my data, I get an inconsistency in the number of features when I extract the names. The code is as follows:
transformer = models_and_pipelines.get_transformer(X,y)
X = transformer.fit_transform(X)
# this extracts the feature names. I also used an alternative function listed below, which yields the same results
starting_features = list(transformer.transformers_[0][1]['encoder'].get_feature_names()) + list(transformer.transformers_[1][2])
print(X.shape[1])
print(len(starting_features))
With the following output:
1094
1090
Where does this inconsistency in the number of feature names come from?
other links: function to extract feature names
If you're using v1.1, get_feature_names_out is fully fleshed out, so you won't need the manual approach you're trying or the one from your link.
It's possible some of your columns were all-missing? From the docs for SimpleImputer (which I guess is what your num_imputer is?):
Notes
Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".
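Another thing that may be worth checking (this is my guess, not something confirmed in the thread): the question builds num_imputer with add_indicator=True. That option appends one extra indicator column for every feature that had missing values at fit time, and those columns would not appear in a name list built from the original numeric columns. A small sketch with made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Three numeric features; the second and third each contain one missing value
X_num = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
X_out = imputer.fit_transform(X_num)

# indicator_.features_ lists the input columns that received an indicator column
n_indicator_columns = len(imputer.indicator_.features_)
print(X_num.shape[1], X_out.shape[1], n_indicator_columns)
```

If the difference between X.shape[1] and the number of names equals the number of numeric columns with missing values, this is probably the source of the mismatch.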
I have the following pipeline in sklearn:
pipe = sklearn.pipeline.Pipeline(steps=[
('scalar', StandardScaler()),
('pca', utils.PCA(n_components=n_pca_components)),
('reduce',umap.UMAP(n_neighbors=umap_n_neighbors, min_dist=umap_min_dist, metric=umap_metric)),
('model',utils.DBSCAN(eps=dbscan_eps,min_samples=dbscan_min_samples)),
])
Is there a simple way to eliminate one of the dimensions in the step between the pca and the umap?
For example, if my pca output is (0:100,0:10), I want to remove the first channel before passing the data to the umap step, so that it receives (0:100,1:10).
You could add an intermediate step between the two stages using FunctionTransformer:
from sklearn.preprocessing import FunctionTransformer
def custom_function(x):
    # Drop the first PCA component; adjust the slice to remove a different one
    return x[:, 1:]
pipe = sklearn.pipeline.Pipeline(steps=[
    ('scalar', StandardScaler()),
    ('pca', utils.PCA(n_components=n_pca_components)),
    ('remove_dimension', FunctionTransformer(custom_function)),
    ('reduce', umap.UMAP(n_neighbors=umap_n_neighbors, min_dist=umap_min_dist, metric=umap_metric)),
    ('model', utils.DBSCAN(eps=dbscan_eps, min_samples=dbscan_min_samples)),
])
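To check the idea in isolation, here is a minimal runnable sketch that uses scikit-learn's own PCA on random data and leaves out the umap and DBSCAN steps. The function name drop_first_component and the data are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def drop_first_component(x):
    # Keep every row, drop column 0
    return x[:, 1:]


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 20))

pipe = Pipeline(steps=[
    ("scalar", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("remove_dimension", FunctionTransformer(drop_first_component)),
])
reduced = pipe.fit_transform(data)
print(reduced.shape)
```

Any later step added after remove_dimension then sees nine columns instead of ten.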
This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that share the same API, such as OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep it generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether this captures all possible types of ColumnTransformer, but at least it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23,12, 12, np.nan],
'Gender': ['M','F', np.nan, 'F'],
'income': ['high','low','low','medium'],
'sales': [10000, 100020, 110000, 100],
'foo' : [1,0,0,1],
'text': ['I will test this',
'need to write more sentence',
'want to keep it simple',
'hope you got that these sentences are junk'],
'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('text', text_pipeline, 'text'),
('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
    train.drop(columns='y'), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in
def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransformer
    # doesn't work when remainder =='passthrough'
    # which requires the input column names.
    output_features = []
    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])
    return output_features
pd.DataFrame(transformed_data,
columns=get_ct_feature_names(combined_pipe))
Does anyone know if sklearn supports different parameters for the various classifiers inside a OneVsRestClassifier? For instance, in this example I would like to have different values of C for the different classes.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
text_clf = OneVsRestClassifier(LinearSVC(C=1.0, class_weight="balanced"))
No, OneVsRestClassifier does not currently support different parameters, or different estimators, for different classes.
Some estimators, such as LogisticRegressionCV, automatically tune different parameter values per class, but this has not been extended to OneVsRestClassifier.
But if you want that, we can change the source to implement it.
Current source of fit() in the master branch is this:
...
...
self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary)(
self.estimator, X, column, classes=[
"not %s" % self.label_binarizer_.classes_[i],
self.label_binarizer_.classes_[i]])
for i, column in enumerate(columns))
As you can see, the same estimator (self.estimator) is passed to all classes to be trained. So we will make a new version of OneVsRestClassifier to change this:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
from joblib import Parallel, delayed
from sklearn.multiclass import _fit_binary
class CustomOneVsRestClassifier(OneVsRestClassifier):
    # Changed the estimator to estimators which can take a list now
    def __init__(self, estimators, n_jobs=1):
        self.estimators = estimators
        self.n_jobs = n_jobs
    def fit(self, X, y):
        self.label_binarizer_ = LabelBinarizer(sparse_output=True)
        Y = self.label_binarizer_.fit_transform(y)
        Y = Y.tocsc()
        self.classes_ = self.label_binarizer_.classes_
        columns = (col.toarray().ravel() for col in Y.T)
        # This is where we change the training method
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary)(
            estimator, X, column, classes=[
                "not %s" % self.label_binarizer_.classes_[i],
                self.label_binarizer_.classes_[i]])
            for i, (column, estimator) in enumerate(zip(columns, self.estimators)))
        return self
And now you can use it.
# Make sure you add those many estimators as there are classes
# In binary case, only a single estimator should be used
estimators = []
# I am considering 3 classes as of now
estimators.append(LinearSVC(C=1.0, class_weight="balanced"))
estimators.append(LinearSVC(C=0.1, class_weight="balanced"))
estimators.append(LinearSVC(C=10, class_weight="balanced"))
clf = CustomOneVsRestClassifier(estimators)
clf.fit(X, y)
Note: I haven't implemented partial_fit() in it yet. If you intend to use that, we can work on it.
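If you would rather not depend on the private _fit_binary helper, the same idea can be sketched with public pieces only: binarize the labels, fit one LinearSVC per class with its own C, and pick the class with the highest decision score. The iris data and the C values below are placeholders I chose for illustration, and this is a rough equivalent of one-vs-rest rather than the library's implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelBinarizer
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

binarizer = LabelBinarizer()
Y = binarizer.fit_transform(y)  # one 0/1 column per class

# Hypothetical per-class settings, in the order of binarizer.classes_
c_per_class = [1.0, 0.1, 10.0]

fitted = []
for column, c_value in zip(Y.T, c_per_class):
    clf = LinearSVC(C=c_value, class_weight="balanced", max_iter=10000)
    fitted.append(clf.fit(X, column))

# Predict the class whose binary classifier gives the highest decision score
scores = np.column_stack([clf.decision_function(X) for clf in fitted])
predictions = binarizer.classes_[scores.argmax(axis=1)]
print(predictions.shape)
```

Note that LabelBinarizer returns a single column in the binary case, so this loop assumes three or more classes.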
The documentation and a few related posts lead me to believe I should be able to get the feature names from the labels that scikit-learn's LabelBinarizer is one-hot encoding.
I have a pipeline defined as follows:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer()),
])
This works just fine (note that DataFrameSelector is a custom class). However, it seems like I should be able to extract the feature names like this:
feature_names = cat_pipeline.named_steps['label_binarizer'].get_feature_names()
I also tried substituting get_feature_names() with get_support(), to no avail.
This is possible when using LabelEncoder on its own outside of a pipeline before one hot encoding as follows:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
print(encoder.classes_)
For further context please see the notebook I am working through:
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb
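For what it's worth, a fitted LabelBinarizer exposes a classes_ attribute just like LabelEncoder does, and its entries line up with the columns of the one-hot output. A minimal sketch with a few made-up category values (not the actual housing data):

```python
from sklearn.preprocessing import LabelBinarizer

# Illustrative values only
ocean_proximity = ["NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND", "ISLAND"]

binarizer = LabelBinarizer()
one_hot = binarizer.fit_transform(ocean_proximity)

# classes_ is sorted, and column i of one_hot corresponds to classes_[i]
feature_names = list(binarizer.classes_)
print(feature_names)
print(one_hot.shape)
```

Inside a pipeline, the same attribute should be reachable through named_steps['label_binarizer'].classes_ once the pipeline has been fitted.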