lgb.plot_importance(model['classifier']) not plotting coherent column names - scikit-learn

I have this model architecture and would like to get the feature importances labeled with the column names as they are declared in my dataframe:
import lightgbm as lgb
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
X_train, y_train = load_iris(return_X_y=True)
lgbmc_classifier = lgb.LGBMClassifier(
    boosting_type='goss',
    learning_rate=0.01,
    min_child_samples=10,
    n_estimators=100,
    num_leaves=10,
    random_state=1,
    importance_type='gain',
)
lgbmc_classifier.fit(X_train, y_train)

model = Pipeline([
    ("classifier", lgbmc_classifier)
])

lgb.plot_importance(model['classifier'], figsize=(4, 2))
The thing is that I have over 145 columns, and after running model.named_steps['classifier'].feature_name_ I get over 999 columns!
How can I understand their meaning? In the plot they appear only under generic names.
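For what it's worth, a likely explanation and a sketch of a fix (my assumption, not confirmed by the post): load_iris(return_X_y=True) returns plain numpy arrays, so LightGBM falls back to generic Column_N names; fitting on a pandas DataFrame lets it keep the real column names.

# Sketch (assumes scikit-learn >= 0.23 for as_frame=True): fit on a DataFrame
# so LightGBM can pick up the real column names for plotting.
import lightgbm as lgb
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
X_train, y_train = data.data, data.target  # DataFrame with named columns, Series

clf = lgb.LGBMClassifier(random_state=1, importance_type='gain')
clf.fit(X_train, y_train)

print(clf.feature_name_)  # real column names instead of Column_0, Column_1, ...
lgb.plot_importance(clf, figsize=(4, 2))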

Related

Difference between shap.TreeExplainer and shap.Explainer bar charts

For the code given below, I am getting different bar plots for the shap values.
In this example, I have a dataset of 1000 train samples with 9 classes and 500 test samples. I then use a random forest as the classifier and generate a model. When I generate the SHAP bar plots, I get different results in these two scenarios:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
and then:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
Can you explain what the difference between the two plots is, and which one should be used for feature importance?
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))

# Generate noisy data
X_train, y_train = make_classification(n_samples=1000,
                                       n_features=50,
                                       n_informative=9,
                                       n_redundant=0,
                                       n_repeated=0,
                                       n_classes=10,
                                       n_clusters_per_class=1,
                                       class_sep=9,
                                       flip_y=0.2,
                                       # weights=[0.5, 0.5],
                                       random_state=17)
X_test, y_test = make_classification(n_samples=500,
                                     n_features=50,
                                     n_informative=9,
                                     n_redundant=0,
                                     n_repeated=0,
                                     n_classes=10,
                                     n_clusters_per_class=1,
                                     class_sep=9,
                                     flip_y=0.2,
                                     # weights=[0.5, 0.5],
                                     random_state=17)

model = RandomForestClassifier()
parameter_space = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11),
}
clf = GridSearchCV(model, parameter_space, cv=5, scoring="accuracy", verbose=True)  # model
my_model = clf.fit(X_train, y_train)
print(f'Best Parameters: {clf.best_params_}')

# save the model to disk
filename = 'Testt-RF.sav'
pickle.dump(clf, open(filename, 'wb'))

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
shap.plots.bar(shap_values)
Thanks for your help and time!
There are two problems with your code:
1. It's not reproducible.
2. You seem to be missing some important concepts in the SHAP package, namely what data is used to "train" the explainer ("true to model" or "true to data" explanation) and what data is used to predict SHAP values.
As far as the first one is concerned, you can find many tutorials and even books online.
Concerning the second:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
is different to:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
because the first uses the trained trees to predict, whereas the second uses the supplied X_test dataset to calculate SHAP values.
Moreover, when you say
shap.Explainer(clf.best_estimator_.predict, X_test)
I'm pretty sure it's not the whole X_test dataset that is used for training your explainer, but rather a 100-datapoint subset of it.
Finally,
shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
is different to
explainer2(X_test)
in that in the first case you're predicting (and averaging) over X_train, whereas in the second you're predicting (and averaging) over X_test. It's easy to confirm this when you compare the shapes.
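For instance (a sketch reusing the question's variables; the exact shapes depend on the shap version):

import numpy as np

# Each explainer returns one row of SHAP values per explained sample,
# so the leading dimensions differ.
print(np.array(shap_values_Tree_tr).shape)  # roughly (n_classes, len(X_train), n_features)
print(shap_values.values.shape)             # (len(X_test), n_features)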
So, how to reconcile the two? See below for a reproducible example:
1. Imports, model, and data to train explainers on:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shap import maskers
from shap import TreeExplainer, Explainer

X, y = make_classification(1500, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

background = maskers.Independent(X_train, 10)  # data to train both explainers on
2. Compare explainers:
exp = TreeExplainer(clf, background)
sv = exp.shap_values(X_test)

exp2 = Explainer(clf, background)
sv2 = exp2(X_test)

np.allclose(sv[0], sv2.values[:, :, 0])
True
I perhaps should have stated this from the very beginning: the two are guaranteed to show the same results (if used correctly), as the Explainer class is a superset of TreeExplainer (it uses the latter when it sees a tree model).
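A quick way to see that dispatch (a sketch on top of the example above; the exact reported class name may vary across shap versions):

print(type(exp))   # a TreeExplainer
print(type(exp2))  # expected to be the same tree explainer class,
                   # despite being constructed via the generic Explainer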
Please ask questions if something is not clear.

Tensorflow classifier.evaluate running indefinitely?

I've started trying out some of the TensorFlow APIs. I am using the iris data set to experiment with TensorFlow's Estimators. I'm loosely following this tutorial, except that I load my data in a little differently: https://www.tensorflow.org/guide/premade_estimators#top_of_page
My problem is that when the code below executes and reaches the section with:
# Evaluate the model.
eval_result = classifier.evaluate(
my computer just runs seemingly without end. I've been waiting for my Jupyter notebook to complete this step for an hour and a half now, with no end in sight. The last output of the notebook is:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Problem statement: how can I adjust my code to make this evaluation more efficient? I'm obviously making it do much more work than I anticipated.
So far I have tried adjusting the batch size and the number of neurons in the layers, but with no luck.
# First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

# Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])

# Extract both into features and labels.
# features should be a dictionary.
# label can just be an array
def extract_features_and_labels(dataframe):
    # features and label for training
    x = dataframe.copy()
    y = dataframe.pop('species')
    return dict(x), y

# Break the data up into train and test.
# Split the overall df into training and testing data
train, test = train_test_split(df, test_size=0.2)
train_x, train_y = extract_features_and_labels(train)
test_x, test_y = extract_features_and_labels(test)
print(len(train_x), 'training examples')
print(len(train_y), 'testing examples')

my_feature_columns = []
for key in train_x.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)

# Build the classifier!!!!
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 4 nodes each.
    hidden_units=[4, 4],
    # The model must choose between 3 classes.
    n_classes=3)

# Train the Model.
classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 10), steps=1000)

# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: train_input_fn(test_x, test_y, 100))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Had to switch a lot of things up, but I finally got the estimator working on the IRIS data set. Here is the code for anyone who may find it useful in the future. Cheers.
# First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
%matplotlib inline

# Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')

# Grab only our categorical data
# categories = df.select_dtypes(include=[object])
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
# use df.apply() to apply le.fit_transform to all columns
# X_2 = X.apply(le.fit_transform)
# Reshape as the one-hot encoder doesn't like one row/column
# X_2 = X.reshape(-1, 1)

# features is everything but our label, so it makes sense to simply...
features = df.drop('species', axis=1)
print(features.head())
target = df['species']
print(target.head())

# train and test
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42)

# Introduce TensorFlow feature columns (numeric columns)
numeric_column = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
numeric_features = [tf.feature_column.numeric_column(key=column)
                    for column in numeric_column]
print(numeric_features[0])

# Build the input function for training
training_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                        y=y_train,
                                                        batch_size=10,
                                                        shuffle=True,
                                                        num_epochs=None)

# Build the input function for testing input
eval_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                                    y=y_test,
                                                    batch_size=10,
                                                    shuffle=False,
                                                    num_epochs=1)

# Instantiate the model
dnn_classifier = tf.estimator.DNNClassifier(feature_columns=numeric_features,
                                            hidden_units=[3, 3],
                                            optimizer=tf.train.AdamOptimizer(1e-4),
                                            n_classes=3,
                                            dropout=0.1,
                                            model_dir="dnn_classifier")
dnn_classifier.train(input_fn=training_input_fn, steps=2000)

# Evaluate the trained model
dnn_classifier.evaluate(input_fn=eval_input_fn)
# print("Loss is " + str(loss))

pred = list(dnn_classifier.predict(input_fn=eval_input_fn))
for e in pred:
    print(e)
    print("\n")
# pred = [p['species'][0] for p in pred]
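For anyone hitting the same hang: the likely culprit in the original code is that train_input_fn ends with .repeat(), which with no count repeats the dataset indefinitely, while classifier.evaluate called without a steps argument runs until the input function is exhausted, which therefore never happens. A minimal sketch of a non-repeating input function for evaluation, reusing the question's variable names:

def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation: a single pass, no shuffle, no repeat."""
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.batch(batch_size)

# evaluate() now stops after one pass over the test set.
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, 100))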

How to Insert new data to make a prediction? Sklearn

I'm doing the "Hello world" of machine learning, using the Iris dataset. I already get acceptable results from this model; I am using 80% of the data to train it and the remaining 20% for validation. I am using 6 prediction algorithms, which work well.
But I have a problem: how can I feed in new information to be analyzed? How do I enter the characteristics of a flower and have the model tell me which type of iris it is, i.e. Iris-setosa, Iris-versicolor, or Iris-virginica?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
#######Evaluate Some Algorithms########
#Create a Validation Dataset
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
########Build Models########
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
scoring = 'accuracy'  # metric used by cross_val_score below
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
########Make Predictions########
print('######## Make Predictions ########')
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
I think you can follow this other post to save your model; afterwards you can load it, pass in new data, and make predictions.
Remember to shape the new data to the same input shape as used during training.
import cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)

# make predictions
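And to answer the question itself, a minimal sketch reusing the knn model fitted in the question (the measurements below are made-up values, ordered as the training columns: sepal-length, sepal-width, petal-length, petal-width):

# Made-up measurements for one new flower, in the training column order.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
print(knn.predict(new_flower))  # e.g. ['Iris-setosa']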

FastText: Can't get cross_validation

I am struggling to implement FastText (FTTransformer) in a Pipeline that iterates over different vectorizers. In particular, I can't get cross-validation scores. The following code is used:
%%time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
w2v_texts = [simple_preprocess(doc) for doc in X_train]
models = [FTTransformer(size=10, min_count=0, seed=42)]
classifiers = [LogisticRegression(random_state=0)]
for model in models:
    for classifier in classifiers:
        model.fit(w2v_texts)
        classifier.fit(model.transform(X_train), y_train)
        pipeline = Pipeline([
            ('vec', model),
            ('clf', classifier)
        ])
        print(pipeline.score(X_train, y_train))
        # print(model.gensim_model.wv.most_similar('kirk'))

cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
KeyError: 'all ngrams for word "Machine learning can be useful branding sometimes" absent from model'
How can the problem be solved?
Sidenote: my other pipelines with D2VTransformer or TfIdfVectorizer work just fine. There, I can simply apply pipeline.fit(X_train, y_train) after defining the pipeline, instead of the two fits shown above. It seems like FTTransformer doesn't integrate as well with the other given vectorizers?
Yes. cross_val_score clones the pipeline and calls its fit directly on the raw documents for each fold, so to be used in a pipeline, FTTransformer needs to be modified to split documents into words inside its own fit method. One can do it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer

np.random.seed(0)

class FTTransformer2(FTTransformer):
    def fit(self, x, y):
        super().fit([simple_preprocess(doc) for doc in x])
        return self

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)

classifiers = [LogisticRegression(random_state=0)]
for classifier in classifiers:
    pipeline = Pipeline([
        ('ftt', FTTransformer2(size=10, min_count=0, seed=0)),
        ('clf', classifier)
    ])
    score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
    print(score)

Cross-validation for paragraph-vector model

I just came across an error when trying to apply cross-validation to a paragraph-vector model:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api import D2VTransformer

data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)

X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label

model = D2VTransformer(size=10, min_count=1, iter=5, seed=1)
clf = LogisticRegression(random_state=0)
pipeline = Pipeline([
    ('vec', model),
    ('clf', clf)
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_train, y_train)
print("Score:", score)  # This works

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=3)
print("Cross-Validation:", cval)  # This doesn't work
KeyError: 0
I experimented with replacing X_train in cross_val_score with model.transform(X_train) or model.fit_transform(X_train). I also tried the same with the raw input data (data.text) instead of the pre-processed text. I suspect that something must be wrong with the format of X_train for cross-validation, as compared to the .score function of the Pipeline, which works just fine. I also noted that cross_val_score worked with CountVectorizer().
Does anyone spot the mistake?
No, this has nothing to do with the transformation from the model. It's related to cross_val_score.
cross_val_score will split the supplied data according to the cv param. For this, it will do something like this:
for train, test in splitter.split(X_train, y_train):
    new_X_train, new_y_train = X_train[train], y_train[train]
But your X_train is a pandas.Series object, in which index-based selection does not work like this. See https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position
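A tiny illustration of the difference (hypothetical values):

import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[7, 8, 9])
print(s.values[0])    # 'a' -- positional access on the underlying numpy array
print(s.tolist()[0])  # 'a' -- positional access on a plain list
# print(s[0])         # KeyError: 0 -- Series indexing selects by label, not position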
Change this line:
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
to:
# Access the internal numpy array
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).values
OR
# Convert series to list
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).tolist()
