Visualization of a decision tree inside a pipeline - python-3.x

I would like to visualize my decision tree with export_graphviz, however I keep on getting the following error:
File "C:\Users\User\AppData\Local\Continuum\anaconda3\envs\data_science\lib\site-packages\sklearn\utils\validation.py", line 951, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
I am pretty sure my Pipeline is fitted because I call predict in my code which works just fine. Here is the code in question:
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
#Parameters for model building and reproducibility
state = 13
data_age.dropna(inplace=True)
X_age = data_age.iloc[:,0:77]
y_age = data_age.iloc[:,77]
X = X_age
y = y_age
#split between testing and training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= state)
# Pipeline with the regressor
regressors = [DecisionTreeRegressor(random_state = state)]
for reg in regressors:
    steps = [('regressor', reg)]
    pipeline = Pipeline(steps) #seed that controls the random grid search
#Train the model
pipeline.set_params(regressor__max_depth = 5, regressor__min_samples_split =5, regressor__min_samples_leaf = 5).fit(X_train, y_train)
pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
export_graphviz(pipeline, out_file='tree.dot')
I know I don't really need the Pipeline here, but I would still like to understand what the problem is for future reference, and to be able to plot a decision tree within a pipeline that has been fitted.

So, based on Farseer's answer, the last line has to be:
#Train the model
pipeline.set_params(regressor__max_depth = 5, regressor__min_samples_split =5, regressor__min_samples_leaf = 5).fit(X_train, y_train)
pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
#export as a .dot file
export_graphviz(regressors[0], out_file='tree.dot')
And now it works.

The signature of export_graphviz is export_graphviz(decision_tree, ...), as can be seen in the documentation.
So you should pass your decision tree as the argument to export_graphviz, not your Pipeline.
You can also see in the source code that export_graphviz calls check_is_fitted(decision_tree, 'tree_').
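Alternatively, instead of keeping a separate reference to the regressor, you can pull the fitted estimator out of the fitted pipeline itself. A minimal sketch, assuming the step name 'regressor' used above:
from sklearn.tree import export_graphviz

# The fitted tree lives inside the pipeline; named_steps gives access to it by step name.
fitted_tree = pipeline.named_steps['regressor']
export_graphviz(fitted_tree, out_file='tree.dot')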

Related

Difference between shap.TreeExplainer and shap.Explainer bar charts

For the code given below, I am getting different bar plots for the shap values.
In this example, I have a dataset of 1000 train samples with 9 classes and 500 test samples. I then use a random forest as the classifier and generate a model. When I go about generating the SHAP bar plots, I get different results in these two scenarios:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
and then:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
Can you explain what is the difference between the two plots and which one to use for feature importance?
Here is my code:
from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2,figsize=(20,8))
# Generate noisy Data
X_train, y_train = make_classification(n_samples=1000,
                                        n_features=50,
                                        n_informative=9,
                                        n_redundant=0,
                                        n_repeated=0,
                                        n_classes=10,
                                        n_clusters_per_class=1,
                                        class_sep=9,
                                        flip_y=0.2,
                                        #weights=[0.5,0.5],
                                        random_state=17)
X_test, y_test = make_classification(n_samples=500,
                                     n_features=50,
                                     n_informative=9,
                                     n_redundant=0,
                                     n_repeated=0,
                                     n_classes=10,
                                     n_clusters_per_class=1,
                                     class_sep=9,
                                     flip_y=0.2,
                                     #weights=[0.5,0.5],
                                     random_state=17)
model = RandomForestClassifier()
parameter_space = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11),
}
clf = GridSearchCV(model, parameter_space, cv = 5, scoring = "accuracy", verbose = True) # model
my_model = clf.fit(X_train,y_train)
print(f'Best Parameters: {clf.best_params_}')
# save the model to disk
filename = f'Testt-RF.sav'
pickle.dump(clf, open(filename, 'wb'))
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
shap.plots.bar(shap_values)
Thanks for your help and time!
There are 2 problems with your code:
It's not reproducible
You seem to be missing some important concepts in the SHAP package, namely what data is used to "train" the explainer ("true to model" or "true to data" explanations) and what data is used to predict SHAP values.
As far as the first one is concerned, you may find many tutorials and even books online.
Concerning the second:
shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)
is different to:
explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)
because the first uses the trained trees themselves to calculate SHAP values, whereas the second uses the supplied X_test dataset as background data to calculate SHAP values.
Moreover, when you say
shap.Explainer(clf.best_estimator_.predict, X_test)
I'm pretty sure it's not the whole X_test dataset that is used to train your explainer, but rather a subset of 100 datapoints from it.
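A minimal sketch (assuming shap's Independent masker and its max_samples parameter, which defaults to 100) to make that background subsampling explicit:
from shap import Explainer, maskers

# Control how many background rows the explainer uses; by default at most 100 are kept.
background = maskers.Independent(X_test, max_samples=100)
explainer2 = Explainer(clf.best_estimator_.predict, background)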
Finally,
shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
is different to
explainer2(X_test)
in that in the first case you're predicting (and averaging) over X_train, whereas in the second you're predicting (and averaging) over X_test. It's easy to confirm this by comparing the shapes.
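A quick way to check, sketched with the variable names from your code:
import numpy as np

# The number of explained rows follows the data each call was asked to explain:
# X_train (1000 rows) for the TreeExplainer call, X_test (500 rows) for explainer2.
print(np.shape(shap_values_Tree_tr))
print(np.shape(shap_values.values))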
So, how do you reconcile the two? See below for a reproducible example:
1. Imports, model, and data to train explainers on:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shap import maskers
from shap import TreeExplainer, Explainer
X, y = make_classification(1500, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=42)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
background = maskers.Independent(X_train, 10) # data to train both explainers on
2. Compare explainers:
exp = TreeExplainer(clf, background)
sv = exp.shap_values(X_test)
exp2 = Explainer(clf, background)
sv2 = exp2(X_test)
np.allclose(sv[0], sv2.values[:,:,0])
True
I perhaps should have stated this from the very beginning: the two are guaranteed to show the same results (if used correctly), as the Explainer class is a superset of TreeExplainer (it dispatches to the latter when it sees a tree model).
Please ask questions if something is not clear.

Different R-squared scores for different times

I just learned about cross-validation, and when I pass in different arguments, I get different results.
This is the code for building the regression model; the R-squared output was about 0.5:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()
X = boston.data
y = boston['target']
X_rooms = X[:,5]
X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)
reg = LinearRegression()
reg.fit(X_train.reshape(-1,1), y_train)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color = 'black')
reg.score(X_test.reshape(-1,1), y_test)
Now when I run cross-validation on X_train, X_test, and X (respectively), it shows different R-squared values.
Here it is with the X_test and y_test arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1,1), y_test, cv = 8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now when I use X_train and y_train as the arguments, different results are output.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I input different arguments again, this time X (which in my case is X_rooms) and y, I yet again get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1,1), y, cv = 8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question so Thanks!!
The training set should be used strictly for training your model, while the test set is for final evaluation. But you usually need to check your model's score on some set before the final evaluation on the test set, for example when you tune hyper-parameters. There are other reasons to use CV as well; this is just one of them.
Usually the process is:
Split into train and test sets.
Train the model, using CV to check stability and to tune hyper-parameters (which is irrelevant in your case).
Assess the model's score on the test set.
scikit-learn's cross_val_score receives an estimator object (before training!) and data. Each time, it trains the model on a different section of the data and then returns the score. It's like having a lot of "train-test" checks.
Therefore, you should:
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
solely on the training set. The test set should be reserved for other purposes.
What you get is a list of scores. You can check whether your model is stable (are the scores in the same range across all folds?) and its general performance (the average score).
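A minimal sketch of that check, reusing the cv array of fold scores from above:
import numpy as np

# Average performance and spread across folds; a large spread suggests instability.
print("mean R^2: %.3f" % np.mean(cv))
print("std R^2: %.3f" % np.std(cv))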

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn, and I have performed the following steps. However, when it comes to predicting y_pred using the trained model, I am getting a perfect r^2 = 1.0. Does anyone know why this is the case / what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
                           'Pop_density_MOE',
                           'Age_median_MOE',
                           'Sex_ratio_MOE',
                           'Income_median_household_MOE',
                           'Pop_total_pov_status_determ_MOE',
                           'Pop_total_50percent_pov_MOE',
                           'Pop_total_125percent_pov_MOE',
                           'Poverty_percent_below_MOE',
                           'Total_labourforceMOE',
                           'Unemployed_total_MOE',
                           'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
                    how='any',
                    inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So it turns out it was a silly mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns, which is why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
With regression, it's often a good idea to run a correlation matrix against all of your variables (the IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and leaving just one of them in the model) is better for the R^2 value (i.e., model fit). Also, if something is correlated at 0.97 or higher with the DV, it is basically a substitute for the DV, and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."
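A minimal sketch of that correlation check, assuming the cleaned DataFrame ml_data_1218 and the 'Burglaries ' target column from the question:
import pandas as pd

# Correlation of every numeric column with the dependent variable; values near 1.0
# flag predictors that are effectively substitutes for the DV (or the DV itself).
numeric = ml_data_1218.select_dtypes(include='number')
corr_with_dv = numeric.corr()['Burglaries '].sort_values(ascending=False)
print(corr_with_dv.head(10))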

scikit-learn pipelines: Normalising after PCA produces undesired random results

I am running a pipeline that normalises the inputs, runs PCA, normalises PCA factors before finally running a logistic regression.
However, I am getting variable results on the confusion matrix I produce.
I am finding that if I remove the third step ("normalise_pca"), my results are constant.
I have set random_state=0 for all the pipeline steps I can. Any idea why I am getting variable results?
def exp2_classifier(X_train, y_train):
    estimators = [('robust_scaler', RobustScaler()),
                  ('reduce_dim', PCA(random_state=0)),
                  ('normalise_pca', PowerTransformer()),  # applied because the distribution of the PCA factors was skewed
                  ('clf', LogisticRegression(random_state=0, solver="liblinear"))]
    # solver specified here to suppress warnings; it doesn't seem to affect GridSearch
    pipe = Pipeline(estimators)
    return pipe
exp2_eval = Evaluation().print_confusion_matrix
logit_grid = Experiment().run_experiment(asdp.data, "heavy_drinker", exp2_classifier, exp2_eval);
I am not able to reproduce your error. I tried another sample dataset from sklearn and got consistent results across multiple runs, so the variance may not be due to normalise_pca.
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler,PowerTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
from sklearn.model_selection import train_test_split
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('robust_scaler', RobustScaler()),
              ('reduce_dim', PCA(random_state=0)),
              ('normalise_pca', PowerTransformer()),  # applied because the distribution of the PCA factors was skewed
              ('clf', LogisticRegression(random_state=0, solver="liblinear"))]
# solver specified here to suppress warnings; it doesn't seem to affect GridSearch
pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)
print('train data :')
print(confusion_matrix(y_train,pipe.predict(X_train)))
print('test data :')
print(confusion_matrix(y_eval,pipe.predict(X_eval)))
output:
train data :
[[166 3]
[ 4 282]]
test data :
[[40 3]
[ 3 68]]
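A hedged diagnostic sketch: refit a fresh copy of the same pipeline a few times on the same split and compare predictions. If they are identical, the remaining variance most likely comes from outside the pipeline, e.g. the data split inside the custom Experiment code.
import numpy as np
from sklearn.base import clone

# Refit independent copies of the pipeline; identical predictions point away from the pipeline steps.
preds = [clone(pipe).fit(X_train, y_train).predict(X_eval) for _ in range(3)]
print(all(np.array_equal(preds[0], p) for p in preds[1:]))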

trying a custom computation of grid.best_score_ (obtained with GridSearchCV)

I'm trying to recompute the grid.best_score_ I obtained on my own data, without success...
So I tried it using a conventional dataset, but with no more success. Here is the code:
from sklearn import datasets
from sklearn import linear_model
from sklearn.cross_validation import ShuffleSplit
from sklearn import grid_search
from sklearn.metrics import r2_score
import numpy as np
lr = linear_model.LinearRegression()
boston = datasets.load_boston()
target = boston.target
param_grid = {'fit_intercept':[False]}
cv = ShuffleSplit(target.size, n_iter=5, test_size=0.30, random_state=0)
grid = grid_search.GridSearchCV(lr, param_grid, cv=cv)
grid.fit(boston.data, target)
# got cv score computed by gridSearchCV :
print grid.best_score_
0.677708680059
# now try a custom computation of cv score
cv_scores = []
for (train, test) in cv:
y_true = target[test]
y_pred = grid.best_estimator_.predict(boston.data[test,:])
cv_scores.append(r2_score(y_true, y_pred))
print np.mean(cv_scores)
0.703865991851
I can't see why they differ; GridSearchCV is supposed to use the scorer from LinearRegression, which is the r2 score. Maybe the way I compute the CV score is not the way best_score_ is computed... I'm asking here before going through the GridSearchCV code.
Unless refit=False is passed to the GridSearchCV constructor, the winning estimator is refit on the entire dataset at the end of fit. best_score_ is the average cross-validated score of the winning configuration, while best_estimator_ is that configuration refit on all the data. That is why scoring best_estimator_ on the CV test folds gives a different (optimistic) number: the refit model has already seen those rows. To reproduce the per-fold behaviour, fit a fresh estimator on each training fold:
lr2 = linear_model.LinearRegression(fit_intercept=False)
scores2 = [lr2.fit(boston.data[train,:], target[train]).score(boston.data[test,:], target[test])
for train, test in cv]
print np.mean(scores2)
Will print 0.67770868005943297.
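For recent scikit-learn versions (where cross_validation and grid_search moved into sklearn.model_selection, and load_boston was removed), an equivalent check might look like this sketch, assuming some regression data X, y is available:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, GridSearchCV, cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.30, random_state=0)
grid = GridSearchCV(LinearRegression(), {'fit_intercept': [False]}, cv=cv)
grid.fit(X, y)

# Refitting the winning configuration inside each fold reproduces best_score_.
scores = cross_val_score(LinearRegression(fit_intercept=False), X, y, cv=cv)
print(np.isclose(grid.best_score_, scores.mean()))  # True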
