get feature names from Sklearn LabelBinarizer - python-3.x

The documentation and a few related post lead me to believe I should be able to get the feature names from the labels scikit-learn's LabelBinarizer is one-hot encoding.
I have a pipeline defined as following:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer()),
])
This works just fine, (note that DataFrameSelector is a custom class), however, it seems like I could extract the feature names like this:
feature_names = cat_pipeline.named_steps['label_binarizer'].get_feature_names()
I also tried substituting get_feature_names() with get_support() to not avail.
This is possible when using LabelEncoder on its own outside of a pipeline before one hot encoding as follows:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
print(encoder.classes_)
For further context please see the notebook I am working through:
https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

Related

Length of feature names mismatches the actual size of input X when using sklearn ColumnTransformer

I have designed the following pipelines to train my models:
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='constant',fill_value='missing')
num_imputer = SimpleImputer(strategy='constant',fill_value=0,add_indicator=True)
categorical_pipeline = Pipeline([
('imputer',cat_imputer),
('encoder',OneHotEncoder())
])
numerical_pipeline = Pipeline([
('imputer',num_imputer)
])
def get_column_types(X):
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)
return numerical_columns, categorical_columns
def get_transformer(X,y):
numerical_columns, categorical_columns = get_column_types(X)
pre_transformer = ColumnTransformer([
('cat_pipe', pre_categorical_pipeline, categorical_columns),
('num_pipe', pre_numerical_pipeline, numerical_columns)
])
return transformer
When I fit the transformer on my data I get an inconsistency in the nubmer of features when I extract the names, this code is as follows:
transformer = models_and_pipelines.get_transformer(X,y)
X = transformer.fit_transform(X)
# this extracts the feature names. I also used an alternive function listed below which yields the same results
starting_features = list(transformer.transformers_[0][1]['encoder'].get_feature_names()) + list(transformer.transformers_[1][2])
print(X.shape[1])
print(len(starting_features)
With the following output:
1094
1090
Where does this inconsistency in the number of feature names come from?
other links: function to extract feature names
If you're using v1.1, get_feature_names_out is fully fleshed out, so you won't need the manual approach you're trying or the one from your link.
It's possible some of your columns were all-missing? From the docs for SimpleImputer (which I guess is what your num_imputer is?):
Notes
Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".

Recover feature names from pipeline [duplicate]

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily have different arrangements if I used different modules that use the same API. OrdinalEncoer, select_k_best, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't a complete support for tracking the feature_names in sklearn as of now. Initially, it was decide to keep it as generic at the level of numpy array. Latest progress on the feature names addition to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether it can capture all the possible types of ColumnTransformers. But at-least, it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23,12, 12, np.nan],
'Gender': ['M','F', np.nan, 'F'],
'income': ['high','low','low','medium'],
'sales': [10000, 100020, 110000, 100],
'foo' : [1,0,0,1],
'text': ['I will test this',
'need to write more sentence',
'want to keep it simple',
'hope you got that these sentences are junk'],
'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
('num', numeric_pipeline, numeric_columns),
('cat', cat_pipeline, cat_columns),
('text', text_pipeline, 'text'),
('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
train.drop('y',1), train['y'])
def get_feature_out(estimator, feature_in):
if hasattr(estimator,'get_feature_names'):
if isinstance(estimator, _VectorizerMixin):
# handling all vectorizers
return [f'vec_{f}' \
for f in estimator.get_feature_names()]
else:
return estimator.get_feature_names(feature_in)
elif isinstance(estimator, SelectorMixin):
return np.array(feature_in)[estimator.get_support()]
else:
return feature_in
def get_ct_feature_names(ct):
# handles all estimators, pipelines inside ColumnTransfomer
# doesn't work when remainder =='passthrough'
# which requires the input column names.
output_features = []
for name, estimator, features in ct.transformers_:
if name!='remainder':
if isinstance(estimator, Pipeline):
current_features = features
for step in estimator:
current_features = get_feature_out(step, current_features)
features_out = current_features
else:
features_out = get_feature_out(estimator, features)
output_features.extend(features_out)
elif estimator=='passthrough':
output_features.extend(ct._feature_names_in[features])
return output_features
pd.DataFrame(transformed_data,
columns=get_ct_feature_names(combined_pipe))

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets. One for training my classifier and the other one is for testing. Both the datasets are text files with two columns separated by a ",". FIrst column (numbers) is for the independent variable (group) and the second column is for the dependent variable.
Training data set
(just few lines for example. there are no empty lines between each row):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data setup
(as you can see, i have picked data from the training data to be sure that the score surely can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
header=None,
names=['numbers','group'],
sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x,testing_x, training_y, testing_y = train_test_split(x, y,
random_state=0,
test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data=pandas.read_table(testing_final_path,
sep=',',
header=None,
names=['numbers','group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the above data samples dont show how many classes are there in it. You need to describe more about it.
Secondly, you are calling le.fit_transform again on test data which will forget all the training samples mappings from strings to numbers. The LabelEncoder le will start encoding the test data again from scratch, which will not be equal to how it mapped training data. So the input to GaussianNB is now incorrect and hence incorrect results.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry I overlooked the fact that you had two columns in your data. LabelEncoder only works on a single column of data. For making it work on multiple pandas columns at once, look at the answers of following question:
Label encoding across multiple columns in scikit-learn
If you are using the latest version of scikit (0.20) or can update to it, then you would not need any such hacks and directly use the OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
training_file_data = enc.transform(training_file_data)

Using Keras/sklearn with CalibratedClassifierCV from sklearn.calibration

Is it possible to use Keras model objects with CalibratedClassifierCV from sklearn.calibration? Or is there another way to performa isotonic regression in sklearn/other python packages without having to pass it a model object.
I tried using the sklearn wrapper for Keras, but it didn't work. Here is the doc for the CalibratedClassifierCV class.
You can train an isotonic regression a posteriori, after prediction. Let 'file1' be a csv containing your predictions pred and real observed events obs on a subset of data. Ideally, this subset has never been used before (not even in Keras training). Let file2 contain the predictions you want to calibrate (Keras predictions for the test set).
import pandas as pd
from sklearn.isotonic import IsotonicRegression
never_seen=pd.read_csv('file1')
uncalibrated=pd.read_csv('file2')
ir = IsotonicRegression( out_of_bounds = 'clip' )
ir.fit( never_seen.pred,never_seen.obs )
p_calibrated = ir.transform( uncalibrated.pred )

Getting feature names from within a FeatureUnion + Pipeline

I am using a FeatureUnion to join features found from the title and description of events:
union = FeatureUnion(
transformer_list=[
# Pipeline for pulling features from the event's title
('title', Pipeline([
('selector', TextSelector(key='title')),
('count', CountVectorizer(stop_words='english')),
])),
# Pipeline for standard bag-of-words model for description
('description', Pipeline([
('selector', TextSelector(key='description_snippet')),
('count', TfidfVectorizer(stop_words='english')),
])),
],
transformer_weights ={
'title': 1.0,
'description': 0.2
},
)
However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different Vectorizers. How do I do this?
Its because you are using a custom transfomer called TextSelector. Did you implement get_feature_names in TextSelector?
You are going to have to implement this method within your custom transform if you want this to work.
Here is a concrete example for you:
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd
dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']
# define first custom transformer
class first_transform(TransformerMixin):
def transform(self, df):
return df
def get_feature_names(self):
return df.columns.tolist()
class second_transform(TransformerMixin):
def transform(self, df):
return df
def get_feature_names(self):
return df.columns.tolist()
pipe = Pipeline([
('features', FeatureUnion([
('custom_transform_first', first_transform()),
('custom_transform_second', second_transform())
])
)])
>>> pipe.named_steps['features']_.get_feature_names()
['custom_transform_first__CRIM',
'custom_transform_first__ZN',
'custom_transform_first__INDUS',
'custom_transform_first__CHAS',
'custom_transform_first__NOX',
'custom_transform_first__RM',
'custom_transform_first__AGE',
'custom_transform_first__DIS',
'custom_transform_first__RAD',
'custom_transform_first__TAX',
'custom_transform_first__PTRATIO',
'custom_transform_first__B',
'custom_transform_first__LSTAT',
'custom_transform_second__CRIM',
'custom_transform_second__ZN',
'custom_transform_second__INDUS',
'custom_transform_second__CHAS',
'custom_transform_second__NOX',
'custom_transform_second__RM',
'custom_transform_second__AGE',
'custom_transform_second__DIS',
'custom_transform_second__RAD',
'custom_transform_second__TAX',
'custom_transform_second__PTRATIO',
'custom_transform_second__B',
'custom_transform_second__LSTAT']
Keep in mind that Feature Union is going to concatenate the two lists emitted from the respective get_feature_names from each of your transformers. this is why you are getting an error when one or more of your transformers do not have this method.
However, I can see that this alone will not fix your problem, as Pipeline objects don't have a get_feature_names method in them, and you have nested pipelines (pipelines within Feature Unions.). So you have two options:
Subclass Pipeline and add it get_feature_names method yourself, which gets the feature names from the last transformer in the chain.
Extract the feature names yourself from each of the transformers, which will require you to grab those transformers out of the pipeline yourself and call get_feature_names on them.
Also, keep in mind that many sklearn built in transformers don't operate on DataFrame but pass numpy arrays around, so just watch out for it if you are going to be chaining lots of transformers together. But I think this gives you enough information to give you an idea of what is happening.
One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.
You can call your different Vectorizers as a nested feature by this (thanks edesz):
pipevect= dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']
And then you got the TfidfVectorizer() instance to pass in another function:
Show_most_informative_features(pipevect,
pipeline.named_steps['classifier'], n=MostIF)
# or direct
print(pipevect.get_feature_names())

Resources