How to visualize an sklearn GradientBoostingClassifier? - scikit-learn

I've trained a gradient boosting classifier, and I would like to visualize it using export_graphviz.
When I try it, I get:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'tree_'
This is because export_graphviz is meant for decision trees, but I assume there is still a way to visualize the model, since the gradient boosting classifier must be built from underlying decision trees.
Does anybody know how to do that?

The attribute estimators_ contains the underlying decision trees. The following code displays one of the trees of a trained GradientBoostingClassifier. Notice that although the ensemble as a whole is a classifier, each individual tree is a regression tree that outputs floating point values.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_graphviz
import numpy as np
# Fictitious data
np.random.seed(0)
X = np.random.normal(0,1,(1000, 3))
y = X[:,0]+X[:,1]*X[:,2] > 0
# Classifier
clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])
# Get the tree number 42
sub_tree_42 = clf.estimators_[42, 0]
# Visualization
# Install graphviz: https://www.graphviz.org/download/
from pydotplus import graph_from_dot_data
from IPython.display import Image
dot_data = export_graphviz(
sub_tree_42,
out_file=None, filled=True, rounded=True,
special_characters=True,
proportion=False, impurity=False, # enable them if you want
)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Tree number 42:
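If Graphviz is not installed, a text dump of the same tree may be enough. The following is a minimal sketch, assuming the same fictitious data as above; the feature names x0, x1, x2 are made up for illustration. It uses sklearn.tree.export_text instead of export_graphviz.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

# Same fictitious data as in the answer above
rng = np.random.RandomState(0)
X = rng.normal(0, 1, (1000, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] > 0

clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])

# For a binary problem, estimators_ has one regression tree per boosting stage
print(clf.estimators_.shape)

# Pick stage 42 and print its structure as indented text
sub_tree_42 = clf.estimators_[42, 0]
tree_text = export_text(sub_tree_42, feature_names=["x0", "x1", "x2"])  # hypothetical names
print(tree_text)
```

Each entry of estimators_ is a DecisionTreeRegressor, which is why the error in the question disappears once a single entry is passed to the export function rather than the whole ensemble.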

Related

Why is sklearn RandomForestClassifier root node different from the most important feature?

How is feature importance calculated in RandomForestClassifier in scikit-learn?
Here's a reproducible example. I run the classifier once with criterion set to gini and once with entropy. For each of them, I print the feature importances and plot the tree.
In neither case is the feature at the root node the same as the most important feature. Why is that?
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image, display
from subprocess import call
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.datasets import load_iris
wines = load_wine()
iris = load_iris()
def create_and_fit(clf, model_name):
    print(clf)
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=5, random_state=seed)
    # X, y = iris.data, iris.target
    # X, y = wines.data, wines.target
    # fit the model
    clf.fit(X, y)
    # get importance, sorted from most to least important
    importance = clf.feature_importances_
    indices = np.argsort(importance)[::-1]
    for f in range(X.shape[1]):
        print("feature {}: ({})".format(indices[f], importance[indices[f]]))
    # use the clf argument here, not the global loop variable
    filename = model_name + clf.criterion
    if model_name == 'forest_':
        print('forest')
        # only the first tree of the forest is exported
        export_graphviz(clf.estimators_[0], out_file=filename+'.dot')
    else:
        export_graphviz(clf, out_file=filename+'.dot')
    call(['dot', '-Tpng', filename+'.dot', '-o', filename+'.png', '-Gdpi=600'])
seed=0
models = [
RandomForestClassifier(criterion='gini',max_depth=5, random_state=seed),
RandomForestClassifier(criterion='entropy',max_depth=5, random_state=seed),
]
names =['forest_', 'forest_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'forest_gini'+'.png')
and for the entropy
Image(filename = 'forest_entropy'+'.png')
This behaviour seems to only happen with ensembles not trees (I'm generalizing as I only tried on Random forest and Decision Tree).
Here's the snippet for decision trees
models = [
DecisionTreeClassifier(criterion='gini',max_depth=5, random_state=seed),
DecisionTreeClassifier(criterion='entropy',max_depth=5, random_state=seed)
]
names =['tree_', 'tree_']
for name, model in zip(names, models):
    create_and_fit(model, name)
Here's the snippet to load the image:
Image(filename = 'tree_gini'+'.png')
and for the entropy
Image(filename = 'tree_entropy'+'.png')
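I think I found the answer: it is related to the max_features parameter of RandomForestClassifier. Here's the scikit-learn documentation:
max_features : {“sqrt”, “log2”, None}, int or float, default=”sqrt”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Since the default is “sqrt”, every split in a forest tree only looks at a random subset of the features, so the root of a single tree does not have to be the feature with the highest overall importance. In addition, feature_importances_ is averaged over all trees of the forest, while the plot shows only estimators_[0].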
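One way to check this explanation is to count which feature ends up at the root of each tree in the forest. This is a rough sketch, assuming the same make_classification data as in the question; it reads the root split from the fitted tree_ attribute, where node 0 is the root.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=5, random_state=0)

forest = RandomForestClassifier(max_depth=5, random_state=0)
forest.fit(X, y)

# tree_.feature[0] is the index of the feature used at the root of each tree
root_features = [int(est.tree_.feature[0]) for est in forest.estimators_]
root_counts = Counter(root_features)

# Feature with the highest importance averaged over the whole forest
top_feature = int(np.argmax(forest.feature_importances_))

print("most important feature overall:", top_feature)
print("root feature counts across trees:", root_counts.most_common())
```

With the default max_features the roots are spread over several features. Setting max_features=None makes every split consider all features, although bootstrap sampling can still cause individual trees to differ.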

scikit-learn: most important feature due to SelectKBest() is not the same feature of top node in DecisionTreeClassifier() with unedited data?

I am applying the breast cancer dataset to a decision tree as simple as possible:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
#print(cancer.feature_names)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
print(f"\nscore train: {tree.score(X_train, y_train)}")
print(f"score test : {tree.score(X_test, y_test)}")
>>>
score train: 0.9413145539906104
score test : 0.9370629370629371
export_graphviz(tree, out_file=f"./src/dot/testing/breast_cancer.dot", class_names=['malignant', 'benign'], feature_names=cancer.feature_names, impurity=False, filled=True)
with open(f"./src/dot/testing/breast_cancer.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Which leads to this graph:
Playing with feature selection, I want to get only the most important feature. In my understanding, it should be the feature at the root node, no? Unfortunately it is not: SelectKBest returns "worst concave points" instead. Here is what I did to get the most important feature:
select = SelectKBest(k=1)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print("X_train.shape : {}".format(X_train.shape))
print("X_train_selected.shape: {}\n".format(X_train_selected.shape))
>>>
X_train.shape : (426, 30)
X_train_selected.shape: (426, 1)
mask = select.get_support()
# plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# plt.xlabel("Sample index")
print("most important features:")
for selected, feature in zip(mask, cancer.feature_names):
    if selected:
        print(feature)
>>>
most important features:
worst concave points
I guess I am getting something wrong here. Could somebody clarify this? Any hint? Thanks
The most important feature according to one method is not necessarily the one used for the first split. SelectKBest scores each feature on its own with a univariate test (by default f_classif, the ANOVA F-value), whereas sklearn.tree.DecisionTreeClassifier picks the split that most reduces an impurity criterion (Gini impurity by default, entropy if you set criterion='entropy'). The two methods measure different things, so there is no reason for them to reach the same conclusions in the same order. Even the same feature will reduce impurity differently at different stages of a tree classifier.
As a side note, trees do not always consider all features when choosing a split; take a look at the max_features parameter. With the default max_features=None a DecisionTreeClassifier considers every feature, but if you set it lower, then depending on random_state your tree may or may not have considered "worst concave points" when making the first split.
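To see the two rankings side by side, something like the following sketch can be used. It assumes the same train/test split as in the question and spells out f_classif, which is the default score_func of SelectKBest.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Univariate ranking: each feature is scored independently of the others
select = SelectKBest(score_func=f_classif, k=1).fit(X_train, y_train)
kbest_feature = cancer.feature_names[select.get_support()][0]

# Tree ranking: the root is the split with the largest impurity decrease
tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X_train, y_train)
root_index = int(tree.tree_.feature[0])  # node 0 is the root
root_feature = cancer.feature_names[root_index]

print("SelectKBest (f_classif) picks:", kbest_feature)
print("Tree root splits on:", root_feature)
```

If the two names differ, that only reflects the different criteria; neither result is wrong.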

Features for Support Vector Machine (SVM)

I have to classify some texts with a support vector machine. My training file has 5 different categories. I have to classify first with "Bag of Words" features, and then with SVD features that keep 90% of the total variance.
I'm using Python and sklearn, but I don't know how to create the SVD features.
My training set is tab-separated (\t); the texts are in the 'Content' column and the categories are in the 'Category' column.
The high-level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
pca = PCA(.9)  # keep 90% of the total variance, as asked in the question
X = pca.fit_transform(X_tfidf.toarray())  # PCA needs a dense array
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, do the same preprocessing steps for test dataset and make predictions.
PS
If you wish, you may combine preprocessing steps into Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('todense', FunctionTransformer(lambda x: x.toarray(), validate=False)),
    ('pca', PCA(.9)),
])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.
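As a rough end-to-end sketch of that last step, the classifier can be added to the same pipeline so that the test texts go through exactly the same preprocessing. The tiny corpus, labels and the to_dense helper below are invented for illustration; in the question's setting the texts and labels would come from the 'Content' and 'Category' columns.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(X):
    """Convert the sparse tf-idf matrix to a dense array for PCA."""
    return X.toarray()


# Hypothetical toy data standing in for the real training file
train_text = [
    "the team won the football match",
    "a late goal decided the match",
    "the striker scored a goal",
    "the coach praised the team",
    "the parliament passed a new law",
    "the minister proposed a tax law",
    "voters elected a new parliament",
    "the minister met the voters",
]
train_labels = ["sport"] * 4 + ["politics"] * 4
test_text = ["the team scored a late goal", "parliament debated the tax law"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("todense", FunctionTransformer(to_dense)),
    ("pca", PCA(n_components=0.9)),  # keep at least 90% of the variance
    ("svm", SVC(kernel="linear")),
])
model.fit(train_text, train_labels)

# predict() applies the already fitted tf-idf and PCA steps to the test texts
predicted = model.predict(test_text)
explained = model.named_steps["pca"].explained_variance_ratio_.sum()
print(predicted, round(explained, 3))
```

For a large vocabulary the dense conversion can use a lot of memory; TruncatedSVD works directly on sparse input, but it takes a fixed number of components rather than a variance fraction.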

Scikit-learn with GraphViz exports empty outputs

I would like to export a decision tree using sklearn.
First I trained a decision tree classifier:
self._selected_classifier = tree.DecisionTreeClassifier()
self._selected_classifier.fit(train_dataframe, train_class)
self._column_names = list(train_dataframe.columns.values)
After that I used the following method in order to export the decision tree:
def _create_graph_visualization(self):
    decision_tree_classifier = self._selected_classifier
    from sklearn.externals.six import StringIO
    dot_data = StringIO()
    tree.export_graphviz(decision_tree_classifier,
                         out_file=dot_data,
                         feature_names=self._column_names)
    import pydotplus
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("decision_tree_output.pdf")
After many errors regarding missing executables, the program now finishes successfully.
The file is created, but it is empty.
What am I doing wrong?
Here is an example with output which works for me, using pydotplus:
from io import StringIO  # Python 3 replacement for the old StringIO module
from sklearn import tree
import pydotplus
# Define training and target set for the classifier
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [10,20,30]
# Initialize Classifier. Random values are initialized with always the same random seed of value 0
# (allows reproducible results)
dectree = tree.DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)
# Test classifier with other, unknown feature vector
# (predict expects a 2D array: one row per sample)
test = [[2,2,3]]
predicted = dectree.predict(test)
# Write the DOT description into an in-memory buffer and render it
dotfile = StringIO()
tree.export_graphviz(dectree, out_file=dotfile)
graph = pydotplus.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
graph.write_pdf("dtree.pdf")
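To narrow down where an empty file comes from, it may help to look at the DOT text before handing it to pydotplus. A small sketch, assuming the same toy data as above: with out_file=None, export_graphviz returns the DOT source as a string, and export_text gives a plain-text view that does not need Graphviz at all.

```python
from sklearn import tree

train = [[1, 2, 3], [2, 5, 1], [2, 1, 7]]
target = [10, 20, 30]

dectree = tree.DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)

# With out_file=None the DOT source is returned instead of written to a file
dot_text = tree.export_graphviz(dectree, out_file=None)
print(dot_text[:60])

# Plain-text rendering of the same tree, no external executables required
text_view = tree.export_text(dectree)
print(text_view)
```

If dot_text is non-empty but the rendered PDF is still empty, the problem is more likely in the Graphviz installation than in scikit-learn.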

Sklearn.mixture.dpgmm not functioning correctly

I'm having trouble with sklearn.mixture.dpgmm. The main issue is that it is not returning correct covariances for synthetic data (2 separated 2D gaussians), where it really should have no issue. In particular, when I do dpgmm._get_covars(), the covariance matrices have diagonal elements that are always exactly 1.0 too large, regardless of the input data distributions. This seems like a bug, as gmm works perfectly (when limiting to known exact number of groups)
Another issue is that dpgmm.weights_ makes no sense, they sum to one but the values appear meaningless.
Does anyone have a solution to this or see something clearly wrong with my example?
Here is the exact script I'm running:
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
import pdb
from sklearn import mixture
# Generate 2D random sample, two gaussians each with 10000 points
rsamp1 = np.random.multivariate_normal(np.array([5.0,5.0]),np.array([[1.0,-0.2],[-0.2,1.0]]),10000)
rsamp2 = np.random.multivariate_normal(np.array([0.0,0.0]),np.array([[0.2,-0.0],[-0.0,3.0]]),10000)
X = np.concatenate((rsamp1,rsamp2),axis=0)
# Fit a mixture of Gaussians with EM using 2
gmm = mixture.GMM(n_components=2, covariance_type='full',n_iter=10000)
gmm.fit(X)
# Fit a Dirichlet process mixture of Gaussians using 10 components
dpgmm = mixture.DPGMM(n_components=10, covariance_type='full',min_covar=0.5,tol=0.00001,n_iter = 1000000)
dpgmm.fit(X)
print("Groups With data in them")
print(np.unique(dpgmm.predict(X)))
##print the input and output covars as example, should be very similar
correct_c0 = np.array([[1.0,-0.2],[-0.2,1.0]])
print("Input covar")
print(correct_c0)
covars = dpgmm._get_covars()
c0 = np.round(covars[0],decimals=1)
print("Output Covar")
print(c0)
print("Output Variances Too Big by 1.0")
According to the DPGMM docs, this class was deprecated in version 0.18 and scheduled for removal in version 0.20.
You should use the BayesianGaussianMixture class instead, with the parameter weight_concentration_prior_type set to "dirichlet_process".
Hope it helps
Instead of writing
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
you should write:
from sklearn.mixture import BayesianGaussianMixture
gmm = BayesianGaussianMixture(n_components=2, covariance_type='full', random_state=0)
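Applied to the synthetic data from the question, the replacement could look roughly like this. It is only a sketch: the sample size is reduced to keep it fast, max_iter and the random seeds are arbitrary choices, and whether the fitted covariances match the inputs is something to check in the printed output rather than take for granted.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Two separated 2D Gaussians, as in the question (fewer points for speed)
rng = np.random.default_rng(0)
rsamp1 = rng.multivariate_normal([5.0, 5.0], [[1.0, -0.2], [-0.2, 1.0]], 2000)
rsamp2 = rng.multivariate_normal([0.0, 0.0], [[0.2, 0.0], [0.0, 3.0]], 2000)
X = np.concatenate((rsamp1, rsamp2), axis=0)

# Dirichlet process mixture with an upper bound of 10 components
bgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
)
bgmm.fit(X)

# Report only the components that actually receive points
used = np.unique(bgmm.predict(X))
for k in used:
    print("component", k, "weight", np.round(bgmm.weights_[k], 3))
    print(np.round(bgmm.covariances_[k], 1))
```

The fitted covariances are available through the public covariances_ attribute, so there is no need for a private accessor such as _get_covars().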
