Get feature names of ColumnTransformer using StandardScaler and One-Hot-Encoding - python-3.x

I am using a simple ColumnTransformer with StandardScaler and OneHotEncoder like:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
num_features = ['num_feat_1',
                'num_feat_2',
                'num_feat_3']
cat_features = ['cat_feat_1',
                'cat_feat_2',
                'cat_feat_3']
ct = ColumnTransformer([
    ("scaler", StandardScaler(), num_features),
    ("onehot", OneHotEncoder(sparse=False,
                             handle_unknown='ignore'), cat_features)],
    remainder='passthrough')
ct.fit(X_train)
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)
To map the coefficients of a LinearRegression, I need ct.get_feature_names(), but I get the error Transformer scaler (type StandardScaler) does not provide get_feature_names. Why is that and how can I solve this?

In your case, get_feature_names() will only work on the onehot transformer; StandardScaler() does not change the names of the transformed variables. So we iterate over the transformers and, where get_feature_names() is unavailable, we retain the original feature names.
Using an example dataset:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
num_features = ['num_feat_1',
                'num_feat_2',
                'num_feat_3']
cat_features = ['cat_feat_1',
                'cat_feat_2',
                'cat_feat_3']
X = pd.concat([
    pd.DataFrame(np.random.uniform(0, 1, (100, 3)), columns=num_features),
    pd.DataFrame(np.random.choice(['a', 'b'], (100, 3)), columns=cat_features)
], axis=1)
X_train = X.iloc[:50, :]
X_test = X.iloc[50:, :]
ct = ColumnTransformer([
    ("scaler", StandardScaler(), num_features),
    ("onehot", OneHotEncoder(sparse=False,
                             handle_unknown='ignore'), cat_features)],
    remainder='passthrough')
ct.fit(X_train)
We try this:
tx = ct.get_params()['transformers']
feature_names = []
for name, transformer, features in tx:
    try:
        Var = ct.named_transformers_[name].get_feature_names().tolist()
    except AttributeError:
        Var = features
    feature_names = feature_names + Var
feature_names
['num_feat_1',
'num_feat_2',
'num_feat_3',
'x0_a',
'x0_b',
'x1_a',
'x1_b',
'x2_a',
'x2_b']
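Note that in newer scikit-learn releases (1.0 and later), ColumnTransformer provides get_feature_names_out() directly, which also covers StandardScaler by passing the input names through, so the manual loop above is no longer necessary:
# scikit-learn >= 1.0; names are prefixed with the transformer name,
# e.g. 'scaler__num_feat_1', 'onehot__cat_feat_1_a'
feature_names = ct.get_feature_names_out().tolist()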

Related

About Sklearn double cross validation with wrapper feature_selection

This question is about double-CV (nested-CV).
The simplest example would be:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
gcv = GridSearchCV(RandomForestRegressor(), param_grid={"n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)
There is no question about this part.
When using wrapper-type feature selection, there is a method that evaluates with CV (RFECV) and a method that evaluates using all the data (RFE). Is RFE the correct choice when using a pipeline? This is my first question.
from sklearn.feature_selection import RFE, RFECV
rfr = RandomForestRegressor()
pipe = Pipeline([("selector", RFE(estimator=rfr)), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)
I feel that the code below, with RFECV, will result in triple-CV and increase the amount of computation.
from sklearn.feature_selection import RFE, RFECV
pipe = Pipeline([("selector", RFECV(rfr, cv=5)), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]})
score_ = cross_val_score(gcv, X, y, cv=5)
Next, in the case of SequentialFeatureSelector, which only has a CV-based evaluation method, what kind of code is correct as double-CV?
from sklearn.feature_selection import SequentialFeatureSelector
estimator_in_selector = RandomForestRegressor()
sfs = SequentialFeatureSelector(estimator_in_selector, cv=5)
pipe = Pipeline([("selector", sfs), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [5, 10]}, cv=5)
score_ = cross_val_score(gcv, X, y, cv=5)
If we consider a more complicated case:
from sklearn.feature_selection import SequentialFeatureSelector
estimator_in_selector = RandomForestRegressor()
sfs = SequentialFeatureSelector(estimator_in_selector, cv=5)
pipe = Pipeline([("selector", sfs), ("estimator", rfr)])
param_grid = {"selector__n_features_to_select": [3, 5],
              "selector__estimator__n_estimators": [10, 50],
              "estimator__n_estimators": [10, 50]}
gcv = GridSearchCV(pipe, param_grid=param_grid)
score_ = cross_val_score(gcv, X, y, cv=5)
And also: what about when using a genetic algorithm?
from sklearn_genetic import GAFeatureSelectionCV
selector = GAFeatureSelectionCV(rfr, cv=5)
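Presumably the selector would then slot into the same pattern as above. A minimal sketch, assuming GAFeatureSelectionCV can act as a transformer step inside the pipeline:
pipe = Pipeline([("selector", selector), ("estimator", rfr)])
gcv = GridSearchCV(pipe, param_grid={"estimator__n_estimators": [10, 50]})
score_ = cross_val_score(gcv, X, y, cv=5)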

Linear Regression using sklearn: issues with reshape code

I've got my data cleaned and prepped. I've done a train/test split and am now trying to run a linear regression. When I first tried it, it said that I needed to create an array and reshape the data. I have done that, but now I get the error "_reshape_dispatcher() missing 1 required positional argument: 'newshape'". None of the methods I've looked up for declaring a newshape have worked.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
df = pd.read_csv('googleplaystore.csv') # 1
df = df.dropna() # 3
df['Size'] = df['Size'].str.extract(r'(\d+\.?\d)', expand=False).astype(float) * df['Size'].str[-1].replace({'M': 1024, 'k': 1}) # 4
df = df.dropna() # remove nan from "Varies with device"
df['Price'] = df['Price'].str.strip('$').astype(float) # 5
df['Installs'] = df['Installs'].str.strip('+')
df['Installs'] = df['Installs'].str.replace(',',"").astype(int)
df['Reviews'] = df['Reviews'].astype(float)
df['Size'] = df['Size'].astype(float)
df = df.loc[df['Rating'].between(1, 5)] # 6
df = df.loc[df['Type'] != 'Free'] # 7
df.drop(df[df['Price'] >= 200].index, inplace = True)
df.drop(df[df['Reviews'] >2000000].index, inplace = True)
df.drop(df[df['Installs'] >10000].index, inplace = True)
inp1 = df.copy()
df_reviewslog=np.log10(df['Reviews'])
df_installslog=np.log10(df['Installs'])
del df['App']
del df['Last Updated']
del df['Current Ver']
del df['Android Ver']
pd.get_dummies(df, columns=['Category', 'Genres', 'Content Rating'], drop_first=True)
inp2 = df.copy()
df_train = X_train,X_test,y_train,y_test=train_test_split(df['Reviews'],df['Installs'], test_size=0.7, random_state=0)
df_test = X_train,X_Test,y_train,y_test=train_test_split(df['Reviews'],df['Installs'], test_size=0.3, random_state=0)
df_train = np.array(df_train)
df_test = np.array(df_test)
df_train = np.reshape(df_train.shape)
df_test = np.reshape(df_test.shape)
lr = LinearRegression()
lr.fit(X_train,y_train)
print(lr.score(X_Test,y_test))
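For reference, np.reshape requires both the array and the target shape (np.reshape(a, newshape)), which is why np.reshape(df_train.shape) fails; note also that train_test_split already returns the four splits, so df_train above is a tuple rather than a single array. A minimal sketch of the usual pattern, reusing the imports from the question and reshaping only the 1-D Series into the 2-D (n_samples, 1) array that LinearRegression expects:
X_train, X_test, y_train, y_test = train_test_split(
    df['Reviews'], df['Installs'], test_size=0.3, random_state=0)
lr = LinearRegression()
lr.fit(X_train.values.reshape(-1, 1), y_train)   # reshape 1-D feature to a column
print(lr.score(X_test.values.reshape(-1, 1), y_test))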

Recovering feature names of StandardScaler().fit_transform() with sklearn

Adapted from a Kaggle tutorial, I am trying to run the code below on the data (available to download from here):
Code:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting facilities
from datetime import datetime, date
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("./data/Aquifer_Petrignano.csv")
df['Date'] = pd.to_datetime(df.Date, format = '%d/%m/%Y')
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)
df = df.interpolate(method ='ffill')
df = df[['Date', 'Rainfall_Bastia_Umbra', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25', 'Temperature_Bastia_Umbra', 'Temperature_Petrignano', 'Volume_C10_Petrignano', 'Hydrometry_Fiume_Chiascio_Petrignano']].resample('7D', on='Date').mean().reset_index(drop=False)
X = df.drop(['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Date'], axis=1)
y1 = df.Depth_to_Groundwater_P24
y2 = df.Depth_to_Groundwater_P25
scaler = StandardScaler()
X = scaler.fit_transform(X)
model = xgb.XGBRegressor()
param_search = {'max_depth': range(1, 2, 2),
                'min_child_weight': range(1, 2, 2),
                'n_estimators': [1000],
                'learning_rate': [0.1]}
tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                       param_grid=param_search)
gsearch.fit(X, y1)
xgb_grid = xgb.XGBRegressor(**gsearch.best_params_)
xgb_grid.fit(X, y1)
ax = xgb.plot_importance(xgb_grid)
ax.figure.tight_layout()
ax.figure.savefig('test.png')
y_val = y1[-80:]
X_val = X[-80:]
y_pred = xgb_grid.predict(X_val)
print(mean_absolute_error(y_val, y_pred))
print(math.sqrt(mean_squared_error(y_val, y_pred)))
I plotted a feature importance figure, but the original feature names are hidden.
If I comment out these two lines:
scaler = StandardScaler()
X = scaler.fit_transform(X)
the plot shows the original feature names again. How can I use scaler.fit_transform() on X and still get a feature importance plot with the original feature names?
The reason behind this is that StandardScaler returns a numpy.ndarray of your scaled feature values (the same shape as pandas.DataFrame.values, but without the column names), so you need to convert it back to a pandas.DataFrame with the same column names.
Here's the part of your code that needs changing:
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
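Alternatively, on scikit-learn 1.2 and later, the scaler can be asked to keep pandas output directly, so no manual conversion is needed:
scaler = StandardScaler().set_output(transform="pandas")
X = scaler.fit_transform(X)  # DataFrame with the original column names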

LabelEncoder in sklearn_pandas mapper with pipeline after cross_val_score returns error

I have a strange error that I cannot understand.
I have this data:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper
test = pd.DataFrame({"a": ['a','c','-','9','c','a','a','c','b','i','c','r'],
                     "b": [0,0,1,0,0,1,0,0,1,0,0,1]})
Then I create a DataFrameMapper():
Mapper = DataFrameMapper([ ('a', LabelEncoder()) ])
Then a Pipeline():
pipeline = Pipeline([('featurize', Mapper),('forest',RandomForestClassifier())])
X = test[test.columns.drop('b')]
y = test['b']
model = pipeline.fit(X = X, y = y)
Everything works fine, and I can predict with this model.
But when I run cross_val_score
cross_val_score(pipeline, X, y, 'accuracy', cv=2)
it returns the error:
a: y contains new labels: ['-' '9']
How can I avoid this, and why does it work this way? I thought that LabelEncoder fits the data first and then the cross-validation runs. I have tried fitting the encoder first
enc = LabelEncoder()
enc.fit(test['a'])
on the entire column and then inserting it into the Mapper, but it doesn't work.
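For context: cross_val_score clones and refits the whole pipeline on each training fold, so a category that appears only in the held-out fold is unseen at transform time; pre-fitting the encoder outside the pipeline does not help, because the clone discards that fitted state. A minimal sketch of one common workaround, assuming scikit-learn >= 0.20 where OneHotEncoder accepts string columns and can ignore unknown categories:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
# passing ['a'] (a list) gives the encoder the 2-D input it expects
Mapper = DataFrameMapper([(['a'], OneHotEncoder(handle_unknown='ignore', sparse=False))])
pipeline = Pipeline([('featurize', Mapper), ('forest', RandomForestClassifier())])
cross_val_score(pipeline, X, y, scoring='accuracy', cv=2)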

Unorderable types: str() > float() error in KNN model

I have read quite a bit about this particular error and haven't been able to find an answer that addresses my issue. I have a data set that I have split into train and test sets and am looking to run a KNeighborsClassifier. My code is below. My problem is that when I look at the dtypes of my X_train, I don't see any string-formatted columns at all. My y_train is a single categorical variable. This is my first Stack Overflow post, so my apologies if I've overlooked any formalities, and thanks for the help! :)
Error:
TypeError: unorderable types: str() > float()
Dtypes:
X_train.dtypes.value_counts()
Out[54]:
int64 2035
float64 178
dtype: int64
Code:
# Import Packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.cross_validation import train_test_split, KFold
from matplotlib.ticker import FormatStrFormatter
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pdb
# Set Directory Path
path = "file_path"
os.chdir(path)
#Select Import File
data = 'RawData2.csv'
delim = ','
#Import Data File
df = pd.read_csv(data, sep = delim)
print (df.head())
df.columns.get_loc('Categories')
#Model
#Select/Update Features
X = df[df.columns[14:2215]]
#Get Column Index for Target Variable
df.columns.get_loc('Categories')
#Select Target and fill na's with "Small" label
y = y[y.columns[21]]
print(y.values)
y.fillna('Small')
#Training/Test Set
X_sample = X.loc[X.Var1 <1279]
X_valid = X.loc[X.Var1 > 1278]
y_sample = y.head(len(X_sample))
y_test = y.head(len(y)-len(X_sample))
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
print(X_train.shape, y_train.shape)
X_train.dtypes.value_counts()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)  # <-- this is where the error is flagged
accuracy_score(knn.predict(X_test))
Everything in sklearn is based on numpy, which only works with numbers, so categorical X and y need to be encoded as numbers. For X you can use get_dummies; for y you can use LabelEncoder.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
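Note also that y.fillna('Small') in the question's code does not assign the result back, so float NaNs stay mixed in with the string labels, which is the likely direct source of the str() > float() comparison. A minimal sketch of both fixes, reusing X and y from the question:
y = y.fillna('Small')                # assign the result back to drop float NaNs
X = pd.get_dummies(X)                # encode categorical features as numbers
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)  # encode string labels as integers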
