Recover feature names from pipeline [duplicate] - scikit-learn

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer API in sklearn returns a numpy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, the lack of a clean way to track how they relate to the original column labels makes it difficult to use this part of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include='object').columns.tolist()  # np.object is deprecated in newer numpy
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with a different arrangement if I used other modules that share the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.

Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep the API generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can write a wrapper to get the feature names out of a ColumnTransformer. I am not sure whether it covers every possible ColumnTransformer configuration, but at least it solves your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')
transformed_data = combined_pipe.fit_transform(
    train.drop('y', axis=1), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}' for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in
def get_ct_feature_names(ct):
    # handles all estimators and pipelines inside a ColumnTransformer;
    # doesn't work when remainder=='passthrough',
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])

    return output_features
pd.DataFrame(transformed_data,
             columns=get_ct_feature_names(combined_pipe))
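For completeness: newer scikit-learn releases make a wrapper like this unnecessary. get_feature_names_out was introduced in 1.0 and, as of 1.1, is implemented by essentially all built-in transformers, so the ColumnTransformer can report the names directly. A minimal sketch, assuming scikit-learn >= 1.1 and the already-fitted pipeline above:
# requires scikit-learn >= 1.1 so that every step used above (SimpleImputer,
# OneHotEncoder, CountVectorizer, SelectKBest, MinMaxScaler) implements
# get_feature_names_out
feature_names = combined_pipe.get_feature_names_out()
pd.DataFrame(transformed_data, columns=feature_names)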

Related

Length of feature names mismatches the actual size of input X when using sklearn ColumnTransformer

I have designed the following pipelines to train my models:
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='constant',fill_value='missing')
num_imputer = SimpleImputer(strategy='constant',fill_value=0,add_indicator=True)
categorical_pipeline = Pipeline([
    ('imputer', cat_imputer),
    ('encoder', OneHotEncoder())
])
numerical_pipeline = Pipeline([
    ('imputer', num_imputer)
])
def get_column_types(X):
    numerical_columns = numerical_columns_selector(X)
    categorical_columns = categorical_columns_selector(X)
    return numerical_columns, categorical_columns

def get_transformer(X, y):
    numerical_columns, categorical_columns = get_column_types(X)
    pre_transformer = ColumnTransformer([
        ('cat_pipe', pre_categorical_pipeline, categorical_columns),
        ('num_pipe', pre_numerical_pipeline, numerical_columns)
    ])
    return transformer
When I fit the transformer on my data I get an inconsistency in the number of features when I extract the names; the code is as follows:
transformer = models_and_pipelines.get_transformer(X,y)
X = transformer.fit_transform(X)
# this extracts the feature names. I also used an alternative function listed below which yields the same results
starting_features = list(transformer.transformers_[0][1]['encoder'].get_feature_names()) + list(transformer.transformers_[1][2])
print(X.shape[1])
print(len(starting_features))
With the following output:
1094
1090
Where does this inconsistency in the number of feature names come from?
other links: function to extract feature names
If you're using v1.1, get_feature_names_out is fully fleshed out, so you won't need the manual approach you're trying or the one from your link.
It's possible some of your columns were all-missing? From the docs for SimpleImputer (which I guess is what your num_imputer is?):
Notes
Columns which only contained missing values at fit are discarded upon transform if strategy is not "constant".
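If you want to check that hypothesis, a quick pandas sketch (X_raw here stands for the raw DataFrame you passed to fit_transform; adjust the name to your code):
# columns that contain only missing values in the training data
all_missing = X_raw.columns[X_raw.isna().all()]
print(len(all_missing), list(all_missing))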

Issues with One Hot Encoding for model with values not in training data

I would like to use one-hot encoding for my simple model, yet it seems to trigger an error no matter how I set it up. First, one-hot encoding was not converting strings to floats even though I have version 1.0.2 of sklearn. Now the issue is that the values in my training data are not the same as in the test data: training only has 2 values, testing has all three. How do I fix that? The exact error is that the truth value of a Series is ambiguous; with my other attempt the error asks me to reshape the data.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
    transformers=[('ohc', ohc, [0])],
    remainder='passthrough')
model = lgbm.LGBMClassifier()
mymodel = Pipeline(steps=[('preprocessor', pp),
                          ('model', model)])
params = {'model__learning_rate': [0.1],
          'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator=mymodel, param_grid=params, n_jobs=-1,
    cv=2, scoring='accuracy',
    verbose=-1)
lgbm_gs.fit(X,y)
The issue should be related to the fact that you're passing categories as a plain list rather than as a list of array-likes (e.g. a list of lists), as the docs state. Therefore, the following adjustment should fix it.
import lightgbm as lgbm
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [['apple',5],['banana',1],['apple',6],['banana',2]]
X = pd.DataFrame(X).to_numpy()
test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
y = [1,0,1,0]
y = pd.DataFrame(y).to_numpy()
labels = [['apple', 'banana', 'pineapple']]  # observe you were also misspelling the categories ('apples' --> 'apple'; 'bananas' --> 'banana')
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps=[('preprocessor', pp),
                          ('model', model)])
params = {'model__learning_rate': [0.1], 'model__n_estimators': [2]}
lgbm_gs = GridSearchCV(
    estimator=mymodel, param_grid=params, n_jobs=-1,
    cv=2, scoring='accuracy', verbose=-1)
lgbm_gs.fit(X, y.ravel())
As a further remark, observe what the guide suggests when dealing with cases where test data has categories that cannot be found in the training set.
If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding):
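A tiny sketch of that behaviour on toy data (separate from the grid-search code above):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(np.array([['apple'], ['banana']]))           # training data knows only two fruits
ohc.transform(np.array([['pineapple']])).toarray()   # unknown category -> row of all zeros
# array([[0., 0.]])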
Eventually, you can observe that the attribute categories_ (which specifies the categories of each feature determined during fitting) is a list of array(s) (single array here as you're one-hot-encoding one column only), too. Example with categories='auto':
ohc = OneHotEncoder(handle_unknown='ignore')
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana'], dtype=object)]
Example with your custom categories:
ohc = OneHotEncoder(categories=labels)
ohc.fit(X[:, 0].reshape(-1, 1)).categories_
# Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]

(Python 3) Classification of dataset, using a user input Elo, to suggest opening move based on Chess dataset?

I'm just looking for a little bit of help. I'm struggling to work out whether what I'm doing is right, and whether Naive Bayes is even the right way to do this.
I want the user to be able to input their Elo, and the 'app' to suggest an opening move set based on the win rate at that Elo. For this I am using the following dataset: https://www.kaggle.com/datasnaek/chess
The important fields are the opening name (what I'm trying to suggest), the average rating (what the user can input), and the winner (we need to see if white wins).
This is my code so far:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from matplotlib.colors import ListedColormap
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Read in dataset
data = pd.read_csv(f"games.csv")
# set new column that is true/false depending on if white wins
data['white_wins'] = (data['winner'] == "white")
# Create new columns, average rating (based on white rating and black rating) and category (categorization of rating for Naive Bayes)
data['average_rating'] = data.apply(lambda row: (row['white_rating'] + row['black_rating']) / 2, axis=1)
data['category'] = data['average_rating'] // 100 + 1
# Drop unnecessary columns
data = data.drop(['turns', 'moves', 'victory_status', 'id', 'winner', 'rated', 'created_at', 'last_move_at', 'opening_ply', 'white_id', 'black_id', 'increment_code', 'opening_eco', 'white_rating', 'black_rating'], axis=1)
#Label Encoder Initialisation
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
opening_name_encoded=le.fit_transform(data.opening_name)
category_encoded=le.fit_transform(data.category)
label=le.fit_transform(data.white_wins)
#Package features together
features=zip(opening_name_encoded, category_encoded)
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
And I currently get an error when fitting the model.
Also, I'm not even convinced this is correct: if I continue down this path, I am only going to be predicting whether white wins based on the opening move set and Elo. I'm really unsure where to take this to get it to the point I need.
Thanks for any help!
zip returns an iterator, so your code is not doing what you expect. My guess is that you intended features to be a list of 2-tuples. If that is the case, adjust your code to features = list(zip(opening_name_encoded, category_encoded)).
In [31]: zip([1, 2, 3], ['a', 'b', 'c'])
Out[31]: <zip at 0x25d61abfd80>
In [32]: list(zip([1, 2, 3], ['a', 'b', 'c']))
Out[32]: [(1, 'a'), (2, 'b'), (3, 'c')]
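As an alternative sketch (reusing the variables from your code), you can build the 2-D array that GaussianNB.fit expects with numpy instead of a list of tuples:
import numpy as np

# stack the two encoded columns into an (n_samples, 2) feature matrix
features = np.column_stack([opening_name_encoded, category_encoded])
model = GaussianNB()
model.fit(features, label)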

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets, one for training my classifier and the other for testing. Both data sets are text files with two columns separated by a ",". The first column (numbers) is the independent variable and the second column (group) is the dependent variable.
Training data set
(just a few lines as an example; there are no empty lines between rows):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data set
(as you can see, I have picked data from the training set to be sure that the score surely can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
                                       header=None,
                                       names=['numbers', 'group'],
                                       sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x, testing_x, training_y, testing_y = train_test_split(x, y,
                                                                random_state=0,
                                                                test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data = pandas.read_table(testing_final_path,
                                        sep=',',
                                        header=None,
                                        names=['numbers', 'group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the data samples above don't show how many classes there are; you need to describe more about that.
Secondly, you are calling le.fit_transform again on the test data, which throws away all the string-to-number mappings learned from the training samples. The LabelEncoder le will start encoding the test data again from scratch, and the result will not match how it mapped the training data. So the input to GaussianNB is incorrect, and hence the results are incorrect.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
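To see the difference concretely, here is a toy illustration with made-up labels (not your data):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(['b', 'c', 'd'])   # training: array([0, 1, 2])
le.transform(['c', 'd'])            # reuses the training mapping: array([1, 2])
le.fit_transform(['c', 'd'])        # refits from scratch: array([0, 1]) -- codes no longer match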
UPDATE:
I'm sorry, I overlooked the fact that you have two columns in your data. LabelEncoder only works on a single column of data. For making it work on multiple pandas columns at once, look at the answers to the following question:
Label encoding across multiple columns in scikit-learn
If you are using the latest version of scikit-learn (0.20) or can update to it, then you don't need any such hacks and can use OrdinalEncoder directly:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
testing_sample_data = enc.transform(testing_sample_data)
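Note that a plain OrdinalEncoder will still raise an error if the test data contains categories never seen during fit. In more recent scikit-learn releases (0.24 and later) you can map unseen categories to a sentinel value instead; a hedged sketch:
from sklearn.preprocessing import OrdinalEncoder

# unseen test categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
training_file_data = enc.fit_transform(training_file_data)
testing_sample_data = enc.transform(testing_sample_data)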

Getting feature names from within a FeatureUnion + Pipeline

I am using a FeatureUnion to join features found from the title and description of events:
union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),
        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],
    transformer_weights={
        'title': 1.0,
        'description': 0.2
    },
)
However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different Vectorizers. How do I do this?
It's because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?
You are going to have to implement this method within your custom transformer if you want this to work.
Here is a concrete example for you:
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# define first custom transformer
class first_transform(TransformerMixin):
    def fit(self, df, y=None):
        # remember the incoming column names so get_feature_names can report them
        self.feature_names_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.feature_names_

# define second custom transformer
class second_transform(TransformerMixin):
    def fit(self, df, y=None):
        self.feature_names_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.feature_names_

pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform())
    ]))
])

>>> _ = pipe.fit(X, y)  # fit first so the custom transformers see the column names
>>> pipe.named_steps['features'].get_feature_names()
['custom_transform_first__CRIM',
'custom_transform_first__ZN',
'custom_transform_first__INDUS',
'custom_transform_first__CHAS',
'custom_transform_first__NOX',
'custom_transform_first__RM',
'custom_transform_first__AGE',
'custom_transform_first__DIS',
'custom_transform_first__RAD',
'custom_transform_first__TAX',
'custom_transform_first__PTRATIO',
'custom_transform_first__B',
'custom_transform_first__LSTAT',
'custom_transform_second__CRIM',
'custom_transform_second__ZN',
'custom_transform_second__INDUS',
'custom_transform_second__CHAS',
'custom_transform_second__NOX',
'custom_transform_second__RM',
'custom_transform_second__AGE',
'custom_transform_second__DIS',
'custom_transform_second__RAD',
'custom_transform_second__TAX',
'custom_transform_second__PTRATIO',
'custom_transform_second__B',
'custom_transform_second__LSTAT']
Keep in mind that FeatureUnion is going to concatenate the two lists emitted by the respective get_feature_names of each of your transformers. This is why you are getting an error when one or more of your transformers do not have this method.
However, I can see that this alone will not fix your problem, since Pipeline objects don't have a get_feature_names method, and you have nested pipelines (pipelines within the FeatureUnion). So you have two options:
Subclass Pipeline and add a get_feature_names method yourself, which gets the feature names from the last transformer in the chain (a sketch follows this list).
Extract the feature names yourself from each of the transformers, which will require you to grab those transformers out of the pipeline yourself and call get_feature_names on them.
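A minimal sketch of the first option (the class name PipelineWithNames is made up here, and it assumes the last step of each inner pipeline, e.g. CountVectorizer or TfidfVectorizer, provides get_feature_names):
from sklearn.pipeline import Pipeline

class PipelineWithNames(Pipeline):
    """Pipeline that reports the feature names of its final transformer."""
    def get_feature_names(self):
        # delegate to the last step in the chain
        return self.steps[-1][1].get_feature_names()
Using PipelineWithNames in place of Pipeline inside the FeatureUnion would then let union.get_feature_names() concatenate the prefixed names from both branches.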
Also, keep in mind that many of sklearn's built-in transformers don't operate on DataFrames but pass numpy arrays around, so watch out for this if you are going to be chaining lots of transformers together. But I think this gives you enough information to get an idea of what is happening.
One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.
You can reach your different vectorizers as nested steps like this (thanks edesz):
pipevect= dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']
And then you have the vectorizer instance to pass into another function:
Show_most_informative_features(pipevect,
                               pipeline.named_steps['classifier'], n=MostIF)
# or directly
print(pipevect.get_feature_names())
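For what it's worth, in newer scikit-learn releases (1.0 and later) Pipeline and FeatureUnion expose get_feature_names_out, so if your custom TextSelector also implements get_feature_names_out, the nested names can be pulled out in one call. A sketch under that assumption:
# assumes scikit-learn >= 1.0, a fitted pipeline, and that TextSelector
# implements get_feature_names_out(input_features=None)
feature_names = pipeline.named_steps['union'].get_feature_names_out()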
