Scikit-learn with GraphViz exports empty outputs - scikit-learn

I would like to export a decision tree using sklearn.
First I trained a decision tree classifier:
self._selected_classifier = tree.DecisionTreeClassifier()
self._selected_classifier.fit(train_dataframe, train_class)
self._column_names = list(train_dataframe.columns.values)
After that I used the following method in order to export the decision tree:
def _create_graph_visualization(self):
    decision_tree_classifier = self._selected_classifier
    from sklearn.externals.six import StringIO
    dot_data = StringIO()
    tree.export_graphviz(decision_tree_classifier,
                         out_file=dot_data,
                         feature_names=self._column_names)
    import pydotplus
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("decision_tree_output.pdf")
After resolving many errors about missing executables, the program now finishes successfully.
The file is created, but it is empty.
What am I doing wrong?

Here is an example with output which works for me, using pydotplus:
from io import StringIO  # Python 3; the Python 2 original used the StringIO module
from sklearn import tree
import pydotplus
# Define training and target set for the classifier
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [10,20,30]
# Initialize the classifier. A fixed random seed of 0 makes the results reproducible
dectree = tree.DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)
# Test the classifier with another, unknown feature vector
# (predict expects a 2-D input: one row per sample)
test = [[2,2,3]]
predicted = dectree.predict(test)
# Export the tree as DOT text into an in-memory buffer
dotfile = StringIO()
tree.export_graphviz(dectree, out_file=dotfile)
graph = pydotplus.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
graph.write_pdf("dtree.pdf")
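If the output file still comes out empty, one way to narrow the problem down is to look at the DOT text before handing it to Graphviz. The sketch below assumes a reasonably recent scikit-learn, where export_graphviz returns the DOT source as a string when out_file=None. The toy data and variable names are illustrative only.

```python
from sklearn import tree

# Toy data (illustrative only)
train = [[1, 2, 3], [2, 5, 1], [2, 1, 7]]
target = [10, 20, 30]

clf = tree.DecisionTreeClassifier(random_state=0).fit(train, target)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = tree.export_graphviz(clf, out_file=None)

# A non-empty DOT string means the export step worked; an empty PDF would
# then point at the rendering step (pydotplus / the Graphviz executables)
print(len(dot_source), "characters of DOT text")
print(dot_source[:60])
```

If the DOT text looks fine, the rendering side (pydotplus and the installed Graphviz binaries) is the more likely place to look.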

Related

'Subset' object is not an iterator when updating torchtext's legacy IMDB dataset

I'm updating a PyTorch network from legacy code to the current API, following documentation such as that here.
I used to have:
import torch
from torchtext import data
from torchtext import datasets
# setting the seed so our random output is actually deterministic
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# defining our input fields (text) and labels.
# We use the Spacy function because it provides strong support for tokenization in languages other than English
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
import random
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
example = next(iter(test_data))
example.text
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_) # how to initialize unseen words not in glove
LABEL.build_vocab(train_data)
Now in the new code I am struggling to add the validation set. All goes well until here:
from torchtext.datasets import IMDB
train_data, test_data = IMDB(split=('train', 'test'))
I can print the outputs; although they look different (problems later on?), they have all the info. I can print test_data fine with next(train_data).
Then after I do:
test_size = int(len(train_dataset)/2)
train_data, valid_data = torch.utils.data.random_split(train_dataset, [test_size,test_size])
It tells me:
next(train_data)
TypeError: 'Subset' object is not an iterator
This makes me think I am not applying random_split correctly. How do I correctly create the validation set for this dataset without causing issues?
Try next(iter(train_data)). It seems one has to create an iterator over the dataset explicitly, and use a DataLoader when efficiency is required.
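The error comes from the difference between an iterable and an iterator: next() only works on objects that implement __next__, while a map-style dataset only provides indexing and a length. A minimal pure-Python sketch of that distinction follows. ToySubset is a hypothetical stand-in written for illustration, not the real torch.utils.data.Subset, and the sample data is made up.

```python
class ToySubset:
    """Hypothetical stand-in for a map-style dataset subset (not the torch class)."""

    def __init__(self, data, indices):
        self.data = data
        self.indices = indices

    def __getitem__(self, i):
        return self.data[self.indices[i]]

    def __len__(self):
        return len(self.indices)


reviews = [("pos", "great film"), ("neg", "dull plot"), ("pos", "loved it")]
subset = ToySubset(reviews, [2, 0])

try:
    next(subset)  # fails: the object has no __next__ method
except TypeError as err:
    error_message = str(err)

# iter() builds an iterator from __getitem__, so next() works on the result
first_item = next(iter(subset))
print(error_message)
print(first_item)
```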

Explanation of mathematics in dot file saved from Decision Tree Regression model

I am trying to solve a regression problem using the decision tree algorithm, and I want to understand the mathematics behind the dot file that is generated after saving the trained model. Note that none of the variables here is categorical.
I have gone through this and this, but their values are categorical.
The code I have written for this purpose is given below:
from sklearn import *
import numpy as np
import sklearn
data = [[2,5,1,10],[3,7,2,12],[5,9,4,14],[6,3,3,16],[2,5,8,7],[1,1,1,1]]
data = np.array(data)
type(data)
data
feature = data[:,:-1]
target = data[:,-1]
target = np.reshape(target,(-1,1))
model_tree = sklearn.tree.DecisionTreeRegressor()
model_tree.fit(feature, target)
import graphviz
dot_data = tree.export_graphviz(model_tree, out_file='manual_1.dot')
Here is the graph I got from the saved dot file.
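For a regression tree with numeric features, each node in the exported graph shows a threshold test on one feature, the number of training samples reaching the node, their mean target (the node's value), and, under the default squared-error criterion, their target variance (the node's impurity). The following sketch checks the leaf values against NumPy on the same data. Setting max_depth=1 is an assumption added here so that each leaf holds several samples. It is not part of the question's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

data = np.array([[2, 5, 1, 10], [3, 7, 2, 12], [5, 9, 4, 14],
                 [6, 3, 3, 16], [2, 5, 8, 7], [1, 1, 1, 1]], dtype=float)
feature, target = data[:, :-1], data[:, -1]

# max_depth=1 (an assumption for this sketch) keeps several samples per leaf
model_tree = DecisionTreeRegressor(max_depth=1, random_state=0)
model_tree.fit(feature, target)

# apply() returns, for every training row, the index of the leaf it ends up in
leaf_of_row = model_tree.apply(feature)

leaf_stats = {}
for leaf in np.unique(leaf_of_row):
    members = target[leaf_of_row == leaf]
    leaf_stats[int(leaf)] = {
        "tree_value": float(model_tree.tree_.value[leaf, 0, 0]),
        "numpy_mean": float(members.mean()),
        "tree_impurity": float(model_tree.tree_.impurity[leaf]),
        "numpy_variance": float(members.var()),
    }
    print(leaf, leaf_stats[int(leaf)])
```

In a fully grown tree on this data every leaf would contain a single sample, so each leaf value would simply be that sample's target and the impurity would be zero.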

Randomly sample a vector-ARMA model

I intend to randomly sample a VARMA model, but I cannot find a function in statsmodels for this. I studied the ARMA example and can replicate it successfully for a single variable.
# for the ARMA
import numpy as np
from statsmodels.tsa.arima_model import ARMA
import statsmodels.api as sm
arparams=np.array([.9,-.7])
maparams=np.array([.5,.8])
ar=np.r_[1,-arparams]
ma=np.r_[1,maparams]
obs=10000
sigma=1
# for the VARMA
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX
# generate a 2-D correlated normal series
mean = [0,0]
cov = [[1,0.9],[0.9,1]]
data = np.random.multivariate_normal(mean,cov,100)
# fit the data into a VARMA model
model = VARMAX(data, order=(1,1)).fit()
# I can't seem to find a way to randomly sample the VARMA
Results objects from fitting a VARMAX model have a simulate method which can be used to generate a random sample. For example:
mod = VARMAX(data, order=(1,1))
res = mod.fit()
# to generate a time series of length 100 following the VARMAX process described by `res`:
sample = res.simulate(100)
This is true of any state space model, including SARIMAX, UnobservedComponents, VARMAX, and DynamicFactor.
(Also, the model class has a simulate method. The main difference is that since model objects don't have associated parameter values, you need to pass a particular parameter vector in that case).

How to visualize an sklearn GradientBoostingClassifier?

I've trained a gradient boost classifier, and I would like to visualize it using the graphviz_exporter tool shown here.
When I try it I get:
AttributeError: 'GradientBoostingClassifier' object has no attribute 'tree_'
This is because the graphviz_exporter is meant for decision trees, but I guess there's still a way to visualize it, since the gradient boost classifier must have an underlying decision tree.
Does anybody know how to do that?
The attribute estimators_ contains the underlying decision trees. The following code displays one of the trees of a trained GradientBoostingClassifier. Notice that although the ensemble is a classifier as a whole, each individual tree computes floating point values.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_graphviz
import numpy as np
# Fictitious data
np.random.seed(0)
X = np.random.normal(0,1,(1000, 3))
y = X[:,0]+X[:,1]*X[:,2] > 0
# Classifier
clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])
# Get the tree number 42
sub_tree_42 = clf.estimators_[42, 0]
# Visualization
# Install graphviz: https://www.graphviz.org/download/
from pydotplus import graph_from_dot_data
from IPython.display import Image
dot_data = export_graphviz(
    sub_tree_42,
    out_file=None, filled=True, rounded=True,
    special_characters=True,
    proportion=False, impurity=False, # enable them if you want
)
graph = graph_from_dot_data(dot_data)
Image(graph.create_png())
Tree number 42:
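To see how the trees are laid out inside the ensemble, it can help to inspect estimators_ directly. The sketch below reuses the answer's synthetic data and assumes a scikit-learn version that provides sklearn.tree.export_text, used here as a text-only alternative that does not need Graphviz. The feature names x0, x1 and x2 are made up for readability.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

np.random.seed(0)
X = np.random.normal(0, 1, (1000, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] > 0

clf = GradientBoostingClassifier(max_depth=3, random_state=0)
clf.fit(X[:600], y[:600])

# One row per boosting stage; binary problems use a single tree per stage
print(clf.estimators_.shape)

# Each entry is a regression tree, even though the ensemble is a classifier
sub_tree_42 = clf.estimators_[42, 0]
print(type(sub_tree_42).__name__)

# Plain-text rendering of the same tree, no Graphviz required
tree_text = export_text(sub_tree_42, feature_names=["x0", "x1", "x2"])
print(tree_text)
```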

Tensorflow Scikit Flow get GraphDef for Android (save *.pb file)

I want to use my Tensorflow algorithm in an Android app. The Tensorflow Android example starts by downloading a GraphDef that contains the model definition and weights (in a *.pb file). Now this should be from my Scikit Flow algorithm (part of Tensorflow).
At first glance it seems easy: you just call classifier.save('model/'). But the files saved to that folder are not *.ckpt, *.def, and certainly not *.pb. Instead you have to deal with a *.pbtxt and a checkpoint file (without an extension).
I have been stuck on this for quite a while. Here is a code example that exports something:
#imports
import tensorflow as tf
import tensorflow.contrib.learn as skflow
import tensorflow.contrib.learn.python.learn as learn
from sklearn import datasets, metrics
#skflow example
iris = datasets.load_iris()
feature_columns = learn.infer_real_valued_columns_from_input(iris.data)
classifier = learn.LinearClassifier(n_classes=3, feature_columns=feature_columns,model_dir="modeltest")
classifier.fit(iris.data, iris.target, steps=200, batch_size=32)
iris_predictions = list(classifier.predict(iris.data, as_iterable=True))
score = metrics.accuracy_score(iris.target, iris_predictions)
print("Accuracy: %f" % score)
The files you get are:
checkpoint
graph.pbtxt
model.ckpt-1.meta
model.ckpt-1-00000-of-00001
model.ckpt-200.meta
model.ckpt-200-00000-of-00001
Many of the possible workarounds I found would require having the GraphDef in a variable (I don't know how to get that with Scikit Flow) or a Tensorflow session, which doesn't seem to be required when using Scikit Flow.
To save a *.pb file, you need to extract the graph_def from the constructed graph. You can do that as follows:
from tensorflow.python.framework import tensor_shape, graph_util
from tensorflow.python.platform import gfile
sess = tf.Session()
final_tensor_name = 'results:0' #Replace final_tensor_name with name of the final tensor in your graph
#########Build your graph and train########
## Your tensorflow code to build the graph
###########################################
outpt_filename = 'output_graph.pb'
output_graph_def = sess.graph.as_graph_def()
with gfile.FastGFile(outpt_filename, 'wb') as f:
    f.write(output_graph_def.SerializeToString())
If you want to convert your trained variables to constants (to avoid using ckpt files to load the weights), you can use:
output_graph_def = graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [final_tensor_name])
Hope this helps!
