Difference between shap.TreeExplainer and shap.Explainer bar charts - python-3.x

For the code given below, I am getting different bar plots for the shap values.
In this example, I have a dataset of 1000 training samples with 9 classes and 500 test samples. I then use a random forest as the classifier and generate a model. When I go about generating the SHAP bar plots, I get different results in these two scenarios:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
and then:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
Can you explain the difference between the two plots, and which one should be used for feature importance?
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
# Generate noisy data
X_train, y_train = make_classification(n_samples=1000,
                                        n_features=50,
                                        n_informative=9,
                                        n_redundant=0,
                                        n_repeated=0,
                                        n_classes=10,
                                        n_clusters_per_class=1,
                                        class_sep=9,
                                        flip_y=0.2,
                                        # weights=[0.5, 0.5],
                                        random_state=17)
X_test, y_test = make_classification(n_samples=500,
                                      n_features=50,
                                      n_informative=9,
                                      n_redundant=0,
                                      n_repeated=0,
                                      n_classes=10,
                                      n_clusters_per_class=1,
                                      class_sep=9,
                                      flip_y=0.2,
                                      # weights=[0.5, 0.5],
                                      random_state=17)
model = RandomForestClassifier()
parameter_space = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11),
}
clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train,y_train)
print(f'Best Parameters: {clf.best_params_}')
# save the model to disk
filename = f'Testt-RF.sav'
pickle.dump(clf, open(filename, 'wb'))
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
shap.plots.bar(shap_values)
Thanks for your help and time!

There are 2 problems with your code:
It's not reproducible
You seem to be missing some important concepts of the SHAP package, namely what data is used to "train" the explainer ("true to model" or "true to data" explanation) and what data is used to predict SHAP values.
As far as the first one is concerned, you may find many tutorials and even books online.
Concerning the second:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
is different to:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
because:
the first uses the trained trees themselves (no data is passed), whereas the second uses the supplied X_test dataset as background data to calculate SHAP values.
Moreover, when you say
shap.Explainer(clf.best_estimator_.predict, X_test)
I'm pretty sure it's not the whole X_test dataset that is used to train your explainer, but rather a 100-datapoint subset of it.
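If you want to make that background subsampling explicit instead of relying on the default, you can wrap the data in a masker yourself. A small sketch of mine (not from the original answer; max_samples is the shap masker parameter):
from shap import maskers

background = maskers.Independent(X_test, max_samples=100)  # explicit 100-point background
explainer2 = shap.Explainer(clf.best_estimator_.predict, background)
shap_values = explainer2(X_test)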
Finally,
shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
is different to
explainer2(X_test)
in that in the first case you're predicting (and averaging) over X_train, whereas in the second you're predicting (and averaging) over X_test. It's easy to confirm this by comparing the shapes.
So, how to reconcile the two? See below for a reproducible example:
1. Imports, model, and data to train explainers on:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shap import maskers
from shap import TreeExplainer, Explainer
import numpy as np
X, y = make_classification(1500, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=42)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
background = maskers.Independent(X_train, 10) # data to train both explainers on
2. Compare explainers:
exp = TreeExplainer(clf, background)
sv = exp.shap_values(X_test)
exp2 = Explainer(clf, background)
sv2 = exp2(X_test)
np.allclose(sv[0], sv2.values[:,:,0])
True
I perhaps should have stated this from the very beginning: the two are guaranteed to show the same results (if used correctly), as the Explainer class is a superset of TreeExplainer (it dispatches to the latter when it sees a tree model).
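A quick way to see this dispatch for yourself (my addition, using the exp and exp2 objects from the example above; the exact class name printed depends on your shap version):
print(type(exp).__name__, type(exp2).__name__)
print(isinstance(exp2, TreeExplainer))  # expected True, since Explainer picked the tree algorithm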
Please ask questions if something is not clear.

Related

How to get predictions for new data from MultinomialNB?

I'm venturing into a new topic and experimenting with categorising product names. Even without deeper knowledge, a superficial use of MultinomialNB already yielded quite good results for my use case.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe', 'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe']
})
count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['title'])
bow = np.array(bow.todense())
X = bow
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = MultinomialNB().fit(X_train, y_train)
model.predict(X_test)
Based on the training in the simplified example above, I would like to categorise completely new titles and output them with their predicted labels:
new = pd.DataFrame({
    'title': ['long top', 'super shirt', 'white shoe', 'super cool sneaker', 'perfect fit ballerinas'],
    'label': np.nan
})
Unfortunately, I am not sure of the next steps and would hope for some support.
...
count_vec = CountVectorizer()
bow = count_vec.fit_transform(new['title'])
bow = np.array(bow.todense())
model.predict(bow)
It's a mistake to fit CountVectorizer on the whole dataset, because the test set should not be used at all during training. This discipline not only follows proper ML principles (to prevent data leakage), it also avoids this practical problem: when the test set is prepared together with the training set, it gets confusing to apply the model to another test set.
The clean way to proceed is to always split the data first between training and test set, this way one is forced to correctly transform the test set independently from the training set. Then it's easy to apply the model on another test set.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe', 'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe']
})
X = df['title']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
# 1) Training: use only training set!
# labels should be encoded
le = preprocessing.LabelEncoder()
y_train_enc = le.fit_transform(y_train)
count_vec = CountVectorizer()
X_train_bow = count_vec.fit_transform(X_train)
X_train_bow = np.array(X_train_bow.todense())
model = MultinomialNB().fit(X_train_bow, y_train_enc)
# 2) Testing: apply previous transformation to test set before applying model
X_test_bow = count_vec.transform(X_test)
X_test_bow = np.array(X_test_bow.todense())
y_test_enc = model.predict(X_test_bow)
print("Predicted labels test set 1:",le.inverse_transform(y_test_enc))
# 3) apply to another dataset = just another step of testing, same as above
new = pd.DataFrame({
    'title': ['long top', 'super shirt', 'white shoe', 'super cool sneaker', 'perfect fit ballerinas'],
    'label': np.nan
})
X_new = new['title']
X_new_bow = count_vec.transform(X_new)
X_new_bow = np.array(X_new_bow.todense())
y_new_enc = model.predict(X_new_bow)
print("Predicted labels test set 2:", le.inverse_transform(y_new_enc))
Notes:
This point is not specific to MultinomialNB; this is the correct method for any classifier.
With real data it's often a good idea to use the min_df argument of CountVectorizer, because rare words increase the number of features, don't help predict the label, and cause overfitting.
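For example (min_df=2 here is just an illustrative threshold, not a value from the answer above):
count_vec = CountVectorizer(min_df=2)  # ignore words that appear in fewer than 2 titles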

SHAP values for Gaussian Processes Regressor are zero

I am trying to get SHAP values for a Gaussian Process Regression (GPR) model using the SHAP library. However, all SHAP values are zero. I am using the example from the official documentation; I only changed the model to GPR.
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import shap
import time
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
shap.initjs()
X,y = shap.datasets.diabetes()
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# rather than use the whole training set to estimate expected values, we summarize with
# a set of weighted kmeans, each weighted by the number of points they represent.
X_train_summary = shap.kmeans(X_train, 10)
kernel = Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)
gp = GaussianProcessRegressor(kernel)
gp.fit(X_train, y_train)
# explain all the predictions in the test set
explainer = shap.KernelExplainer(gp.predict, X_train_summary)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Running the above code produces a summary plot in which all SHAP values are zero.
When I use a neural network or linear regression instead, the above code works fine without any problem.
If you have any idea how to solve this issue, please let me know.
Your model doesn't predict anything, which you can see by plotting the predictions against the true targets:
plt.scatter(y_test, gp.predict(X_test));
Train your model properly (with a kernel that actually fits the data, as below), re-run the same scatter plot to confirm the predictions now track the targets, and you're fine to go:
explainer = shap.KernelExplainer(gp.predict, X_train_summary)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Full reproducible example:
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import shap
import time
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, DotProduct
X,y = shap.datasets.diabetes()
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train_summary = shap.kmeans(X_train, 10)
kernel = DotProduct() + WhiteKernel()
gp = GaussianProcessRegressor(kernel)
gp.fit(X_train, y_train)
explainer = shap.KernelExplainer(gp.predict, X_train_summary)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
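As a quick sanity check of my own (not part of the original answer), you can confirm the refitted model actually predicts something before computing SHAP values:
print(gp.score(X_test, y_test))  # R-squared on the test set; it should now be clearly above 0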
Try this code instead (note that the model must be fitted before it can be explained; imports added here so the snippet runs on its own):
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
import shap

kernel = 1.0 * Matern(length_scale=1.0, nu=2.5) + \
         WhiteKernel(noise_level=10**-1, noise_level_bounds=(10**-1, 10**1))
model = GaussianProcessRegressor(kernel=kernel,
                                 optimizer='fmin_l_bfgs_b', random_state=123)
model.fit(X_train, y_train)  # fit on the training data before explaining
explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer(X_train)  # returns an Explanation object usable by both plots
shap.plots.bar(shap_values)  ## bar plot
shap.summary_plot(shap_values, X_train, show=False)  ## summary

Different R-squared scores for different times

I just learned about cross-validation, and when I pass in different arguments, I get different results.
This is the code for building the regression model; its R-squared output was about 0.5:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()
X = boston.data
y = boston['target']
X_rooms = X[:,5]
X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)
reg = LinearRegression()
reg.fit(X_train.reshape(-1,1), y_train)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color = 'black')
reg.score(X_test.reshape(-1,1), y_test)
Now, when I run cross-validation on X_test, X_train, and X (respectively), it shows different R-squared values.
Here it is with the X_test and y_test arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1,1), y_test, cv = 8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now, when I use the X_train and y_train arguments, different results are output.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I input different arguments again, this time X (which in my case is X_rooms) and y, I yet again get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1,1), y, cv = 8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question so Thanks!!
The train set should be used strictly for training your model, while the test set is for the final evaluation. Unfortunately, you often need to check your model's score on some data before the final check on the test set, for example when tuning hyper-parameters. That is just one of several reasons to use cross-validation.
Usually the process is:
Split into train and test sets
Train the model, using CV on the train set to check stability and to tune hyper-parameters (which is irrelevant in your case)
Assess the model's score on the test set.
scikit-learn's cross_val_score receives an estimator (before training!) and data. Each time, it trains the model on a different section of the data and then returns the score on the held-out part. It's like running many "train-test" checks.
Therefore, you should:
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
solely on the train set. The test set should be reserved for the final evaluation.
What you get is a list of scores (R-squared values here, since the estimator is a regressor). You can check whether your model is stable (are the scores in a similar range across all folds?) and its general performance (the average score).
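For example, a quick way to summarize the fold scores (my addition, purely for illustration):
print(cv.mean(), cv.std())  # average R-squared across the 8 folds and its spread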

Tensorflow classifier.evaluate running indefinitely?

I've started trying out some of the TensorFlow APIs. I am using the iris data set to experiment with TensorFlow's Estimators. I'm loosely following this tutorial, except that I load my data in a little differently: https://www.tensorflow.org/guide/premade_estimators#top_of_page
My problem is that when the code below executes and I get to the section with:
# Evaluate the model.
eval_result = classifier.evaluate(
My computer just runs seemingly without end. I've been waiting for my Jupyter notebook to complete this step for an hour and a half now, with no end in sight. The last output of the notebook is:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Problem statement: How can I adjust my code to make this evaluation more efficient? I'm obviously making it do much more work than I anticipated.
So far I have tried adjusting the batch size and the number of neurons in the layers, but with no luck.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
#Extract both into features and labels.
#features should be a dictionary.
#label can just be an array
def extract_features_and_labels(dataframe):
    # features and label for training
    x = dataframe.copy()
    y = dataframe.pop('species')
    return dict(x), y
#break the data up into train and test.
#split the overall df into training and testing data
train, test = train_test_split(df, test_size=0.2)
train_x, train_y = extract_features_and_labels(train)
test_x, test_y = extract_features_and_labels(test)
print(len(train_x), 'training examples')
print(len(train_y), 'testing examples')
my_feature_columns = []
for key in train_x.keys():
my_feature_columns.append(tf.feature_column.numeric_column(key=key))
def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)
#Build the classifier!!!!
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 10 nodes each.
    hidden_units=[4, 4],
    # The model must choose between 3 classes.
    n_classes=3)
# Train the Model.
classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 10), steps=1000)
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: train_input_fn(test_x, test_y, 100))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Had to switch a lot of things up, but I finally got the estimator working on the iris data set. Here is the code for anyone who may find it useful in the future. Cheers.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
#Grab only our categorial data
#categories = df.select_dtypes(include=[object])
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
# use df.apply() to apply le.fit_transform to all columns
#X_2 = X.apply(le.fit_transform)
#Reshape as the one hot encoder doesnt like one row/column
#X_2 = X.reshape(-1, 1)
#features is everything but our label so makes sense to simply....
features = df.drop('species', axis=1)
print(features.head())
target = df['species']
print(target.head())
#trains and test
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42)
#Introduce Tensorflow feature column (numeric column)
numeric_column = ['sepal_length','sepal_width','petal_length','petal_width']
numeric_features = [tf.feature_column.numeric_column(key=column)
                    for column in numeric_column]
print(numeric_features[0])
#Build the input function for training
training_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                        y=y_train,
                                                        batch_size=10,
                                                        shuffle=True,
                                                        num_epochs=None)
#Build the input function for testing input
eval_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                                    y=y_test,
                                                    batch_size=10,
                                                    shuffle=False,
                                                    num_epochs=1)
#Instansiate the model
dnn_classifier = tf.estimator.DNNClassifier(feature_columns=numeric_features,
                                            hidden_units=[3, 3],
                                            optimizer=tf.train.AdamOptimizer(1e-4),
                                            n_classes=3,
                                            dropout=0.1,
                                            model_dir="dnn_classifier")
dnn_classifier.train(input_fn = training_input_fn,steps=2000)
#Evaluate the trained model
dnn_classifier.evaluate(input_fn = eval_input_fn)
# print("Loss is " + str(loss))
pred = list(dnn_classifier.predict(input_fn = eval_input_fn))
for e in pred:
    print(e)
    print("\n")
#pred = [p['species'][0] for p in pred]
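If you prefer to keep the question's original Dataset-based input functions instead of switching to pandas_input_fn: the endless evaluation came from reusing train_input_fn, whose unbounded .repeat() means evaluate never reaches the end of the data. A minimal sketch of that alternative fix, reusing the question's variable names:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation: batch only, no shuffle, no repeat."""
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.batch(batch_size)

eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, 100))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))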

Why do my ML models have horrible accuracy?

I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars; it contains features for price, safety, and luxury, and classifies each car as good, very good, acceptable, or unacceptable. I converted all the non-numeric columns into numeric ones, trained the data, and predicted with a test set. However, my predictions are awful; I used LinearRegression and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all have given me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials and read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data, and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        if df[col].astype(int):
            pass
    except ValueError:
        enc = preprocessing.LabelEncoder()
        enc.fit(df[col])
        df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use Linear Regression, which is meant for predicting continuous values, not categorical ones. In your case, what you are trying to predict is categorical: technically, each class value is its own class.
I would suggest trying Logistic Regression or other classification methods such as Naive Bayes, SVM, or decision tree classifiers instead.
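For instance, a minimal switch to a classifier on the same split could look like this (a sketch reusing the variable names from the question; accuracy_score replaces r2_score as the metric):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(x_train, y_train)
y_pred = log_clf.predict(x_test)
print(accuracy_score(y_test, y_pred))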
