How to make sense of the output of DecisionTreeClassifier in scikit-learn?

I'm learning ML and using scikit-learn to do basic decision tree classification.
The feature values are categorical, so I used DictVectorizer to convert the original feature values. Here's my code:
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree

training_set  # list of dicts representing the training set
labels  # corresponding labels of the training set
vec = DictVectorizer()
vectorized = vec.fit_transform(training_set)
clf = tree.DecisionTreeClassifier()
clf.fit(vectorized.toarray(), labels)
with open("output.dot", "w") as output_file:
    tree.export_graphviz(clf, out_file=output_file)
But I don't understand the output graph. It contains a tree where each node is labelled with something like X[1] <= 0.5000. What I expected was nodes labelled with FEATURE_1 == VALUE_1, i.e. the un-vectorized information shown on the tree.
Is it possible?
UPDATE:
For example, FEATURE_1 has three possible values A, B, C, which are in turn vectorized into 0,0, 0,1, 1,0 respectively. What I want on the graph is FEATURE_1 == A instead of X[1] <= 0.5.

You can pass the feature names to the tree exporter method:
with open("output.dot", "w") as output_file:
tree.export_graphviz(clf, feature_names=vec.get_feature_names(),
out_file=output_file)
The classifier itself is unaware of the "meaning" of the data; it just deals with continuous numerical values. Hence the need to use a vectorizer to one-hot-encode the categorical variables as binary variables, which can safely be treated as continuous variables in the range [0, 1] whose actual values are always either 0 or 1 and never anything in between.
To understand how the DictVectorizer does the one-hot-encoding, have a look at the example snippet in the documentation.
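For illustration, here is a minimal, self-contained sketch (using made-up data, not the asker's) of how DictVectorizer produces one binary column per (feature, value) pair and how those column names end up in the exported graph. Note that in recent scikit-learn versions the method is get_feature_names_out() rather than get_feature_names().
import seaborn  # not needed here; see the regression example further below
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree

training_set = [{"FEATURE_1": "A", "FEATURE_2": "X"},
                {"FEATURE_1": "B", "FEATURE_2": "Y"},
                {"FEATURE_1": "C", "FEATURE_2": "X"}]
labels = [0, 1, 0]

vec = DictVectorizer()
vectorized = vec.fit_transform(training_set)
# One binary column per (feature, value) pair, e.g. 'FEATURE_1=A', 'FEATURE_2=X', ...
print(vec.get_feature_names())

clf = tree.DecisionTreeClassifier()
clf.fit(vectorized.toarray(), labels)
with open("output.dot", "w") as output_file:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(),
                         out_file=output_file)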

X[1] <= 0.5000 means X[1] = 0 when your variables are binary. If the inequality holds, the left branch is chosen; otherwise, the right branch. You can certainly parse the dot file and overwrite it (it is merely a text file, and this is easy to do with regular expressions), but the way it is constructed initially is fixed, because by default each node of the tree is an inequality.
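For example, here is a rough sketch of such a regex rewrite. The exact label format in the dot file varies across scikit-learn versions, and the index-to-label mapping below is hypothetical; adapt both to your own output.
import re

# Hypothetical mapping from one-hot column index to the label you want shown.
labels_by_index = {0: "FEATURE_1 == A", 1: "FEATURE_1 == B"}

with open("output.dot") as f:
    dot = f.read()

def relabel(match):
    idx = int(match.group(1))
    return labels_by_index.get(idx, match.group(0))

# Replace tests like "X[1] <= 0.5000" with the desired labels.
dot = re.sub(r"X\[(\d+)\] <= 0\.5\d*", relabel, dot)

with open("output_relabelled.dot", "w") as f:
    f.write(dot)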

When the values lie in a continuous interval, the learner sorts them and considers the intermediate points to find the threshold with the best Gini score.
This is reasonable because in the continuous domain the chance of finding a test instance with a value of exactly, say, 3.1415 is zero, and in such cases the classifier wouldn't know what to do.
I don't know about scikit-learn, but in WEKA, for instance, one can specify whether the values are continuous or discrete.

When you call export_graphviz, specify feature_names, which in this case are the column names of the DataFrame of independent variables.
This will put the column names in your output file, as below.
model = clf.fit(X, y)
dot_data = tree.export_graphviz(model, out_file=None,
                                feature_names=X.columns.values.tolist(),
                                class_names=None, filled=True,
                                rounded=True, special_characters=True)
with open("output.dot", "w") as output_file:
    output_file.write(dot_data)

Related

What is the meaning of "value" in a node in sklearn decisiontree plot_tree

I plotted my sklearn decision tree using the plot_tree function. The nodes have a structure like value = [2417, 1059], but I don't understand what this value means; other nodes show other values. Thanks for explaining.
DecisionTreeClassifier:
value in a DecisionTreeClassifier is the class split among the samples in each node.
Keep in mind that it may also be weighted if you weighted your classes in the call to fit().
For example:
cw={0: 0.6495288248337029, 1: 2.1719184430027805}
Taking the true node, your true class split is calculated as:
>>> [3819.229 / cw[0], 1216.274 / cw[1]]
[5880, 560]
And if it's not clear, your criterion (entropy here) is calculated on the weighted split:
>>> import math
>>> a, b = 3819.229, 1216.274
>>> ab = a + b
>>> (-(a / ab)*math.log2(a / ab)) - ((b / ab)*math.log2(b / ab))
0.7975914228753467
DecisionTreeRegressor:
value in a DecisionTreeRegressor is the value that the tree would predict for a new example falling in that node. If your criterion is MSE (squared error), you'll find that value is the mean of the target values of the samples in that node.
For example:
(Data: Seaborn's "dots" example dataset.)
A depth-1 regressor tree fitted on coherence to predict firing_rate. It's not a very useful tree, but it illustrates the idea.
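A minimal sketch of how such a tree could be fitted, assuming Seaborn's "dots" dataset and a recent scikit-learn (where the MSE criterion is named squared_error):
import seaborn as sns
from sklearn import tree

data = sns.load_dataset("dots")

# Depth-1 tree predicting firing_rate from coherence, as described above.
reg = tree.DecisionTreeRegressor(max_depth=1, criterion="squared_error")
reg.fit(data[["coherence"]], data["firing_rate"])
tree.plot_tree(reg, feature_names=["coherence"])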
Taking the true node, value is calculated as:
>>> value = data[data.coherence <= 19.2].firing_rate.mean()
>>> value
40.48326118418657
squared_error for that node is:
>>> ((data[data.coherence <= 19.2].firing_rate - value)**2).mean()
134.6504380931471
They indicate the number of samples per class that you have at that step.
For example, your picture shows that before splitting on "hops<=5" you have 2417 samples of class 0 and 1059 samples of class 1.
Note that if you sum these two values, you obtain the same number (3476) as the "samples" field.
If the tree works, you will observe the data splitting better at every step. In a final leaf you will see clear values like [300, 2]; then you can say that these samples are almost all class 0.
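If you want to read these numbers programmatically rather than off the plot, they are stored on the fitted tree object. A small sketch on a toy dataset (not the asker's data):
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(clf.tree_.n_node_samples)  # the "samples" figure for each node
print(clf.tree_.value)           # the array that plot_tree displays as "value" for each node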

Retrieve elements from a 3D tensor with a 2D index tensor

I am playing around with GPT2 and I have 2 tensors:
O: an output tensor of shape (B, S-1, V), where B is the batch size, S is the number of timesteps and V is the vocabulary size. This is the output of a generative model and is softmaxed along the last dimension.
L: a 2D tensor of shape (B, S-1) where each element is the index of the correct token for each timestep of each sample. These are basically the labels.
I want to extract the predicted probability of the corresponding correct token from tensor O based on tensor L, so that I end up with a 2D tensor of shape (B, S-1). Is there an efficient way of doing this apart from using loops?
For reference, I based my answer on this Medium article.
Essentially, your answer lies in torch.gather, assuming that both of your tensors are just regular torch.Tensors (or can be converted to one).
import torch
# Specify some arbitrary dimensions for now
B = 3
V = 6
S = 4
# Make example reproducible
torch.manual_seed(42)
# L necessarily has to be a torch.LongTensor, otherwise indexing will fail.
L = torch.randint(0, V, size=[B, S])
O = torch.rand([B, S, V])
# Now collect the results. L needs to have similar dimension,
# except in the axis you want to collect along.
X = torch.gather(O, dim=2, index=L.unsqueeze(dim=2))
# Make sure X has no "unnecessary" dimension
X = X.squeeze(dim=2)
It is a bit difficult to see whether this produces exactly the correct results, which is why I included a random seed that makes the example deterministic, so you can easily verify that it gets you the desired result. For clarification, one could also use a lower-dimensional tensor, for which it becomes clearer what exactly torch.gather does.
Note that torch.gather also lets you gather multiple indices along the same row. So if you instead have a multi-label example in which several values are correct per position, you can similarly use a tensor L of shape [B, S, number_of_correct_samples].
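A small self-contained sketch of that multi-index case (the shapes B, S, V, k are arbitrary and chosen just for illustration):
import torch

torch.manual_seed(42)
B, S, V, k = 3, 4, 6, 2          # batch size, timesteps, vocab size, correct tokens per step
O = torch.rand([B, S, V])
L_multi = torch.randint(0, V, size=[B, S, k])

# Gathers k probabilities per position instead of one; result has shape [B, S, k].
X_multi = torch.gather(O, dim=2, index=L_multi)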

Hyperopt list of values per hyperparameter

I'm trying to use Hyperopt on a regression model such that one of its hyperparameters is defined per variable and needs to be passed as a list. For example, if I have a regression with 3 independent variables (excluding constant), I would pass hyperparameter = [x, y, z] (where x, y, z are floats).
The values of this hyperparameter have the same bounds regardless of which variable they are applied to. If this hyperparameter was applied to all variables, I could simply use hp.uniform('hyperparameter', a, b). What I want the search space to be instead is a cartesian product of hp.uniform('hyperparameter', a, b) of length n, where n is the number of variables in a regression (so, basically, itertools.product(hp.uniform('hyperparameter', a, b), repeat = n))
I'd like to know whether this is possible within Hyperopt. If not, any suggestions for an optimizer where this is possible are welcome.
As noted in my comment, I am not 100% sure what you are looking for, but here is an example of using hyperopt to optimize a combination of 3 variables:
import random

# define an objective function
def objective(args):
    v1 = args['v1']
    v2 = args['v2']
    v3 = args['v3']
    result = random.uniform(v2, v3) / v1
    return result

# define a search space
from hyperopt import hp
space = {
    'v1': hp.uniform('v1', 0.5, 1.5),
    'v2': hp.uniform('v2', 0.5, 1.5),
    'v3': hp.uniform('v3', 0.5, 1.5),
}

# minimize the objective over the space
from hyperopt import fmin, tpe, space_eval
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
print(best)
They all have the same search space in this case (as I understand it, this was your problem definition). Hyperopt aims to minimize the objective function, so running this will end up with v2 and v3 near their minimum value and v1 near its maximum value, since that is what generally minimizes the result of this objective function.
You could use this function to create the space:
def get_spaces(a, b, num_spaces=9):
    return_set = {}
    for set_num in range(num_spaces):
        name = str(set_num)
        return_set = {
            **return_set,
            **{name: hp.uniform(name, a, b)}
        }
    return return_set
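A hedged usage sketch (not part of the original answer): the dict returned by get_spaces becomes the hyperopt search space, and the objective receives one sampled value per key.
from hyperopt import fmin, tpe, hp

def objective(params):
    # Recover the per-variable hyperparameter list [x, y, z, ...]
    values = [params[str(i)] for i in range(len(params))]
    return sum(v ** 2 for v in values)  # placeholder objective

space = get_spaces(a=0.0, b=1.0, num_spaces=3)
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)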
I would first define my pre-combinatorial space as a dict. The keys are names; the values are tuples.
from hyperopt import hp
space = {'foo': (hp.choice, (False, True)), 'bar': (hp.quniform, 1, 10, 1)}
Next, produce the required combinatorial variants using loops or itertools. Each name is kept unique using a suffix or prefix.
types = (1, 2)
space = {f'{name}_{type_}': args for type_ in types for name, args in space.items()}
>>> space
{'foo_1': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_1': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1),
'foo_2': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_2': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1)}
Finally, initialize and create the actual hyperopt space:
space = {name: fn(name, *args) for name, (fn, *args) in space.items()}
values = tuple(space.values())
>>> space
{'foo_1': <hyperopt.pyll.base.Apply at 0x7f291f45d4b0>,
'bar_1': <hyperopt.pyll.base.Apply at 0x7f291f45d150>,
'foo_2': <hyperopt.pyll.base.Apply at 0x7f291f45d420>,
'bar_2': <hyperopt.pyll.base.Apply at 0x7f291f45d660>}
This was done with hyperopt 0.2.7. As a disclaimer, I strongly advise against using hyperopt because, in my experience, it performs significantly worse than other optimizers.
I implemented this solution with Optuna. The advantage of Optuna is that it creates a hyperspace for all individual values but optimizes them in a more intelligent way, using a single hyperparameter optimization. For example, I optimized a neural network over batch size, learning rate and dropout rate.
The search space is much larger than the set of values actually tried, which saves a lot of time compared to a grid search.
The pseudo-code of the implementation (written against Optuna's trial.suggest_float API) is:
import optuna

def objective(trial):  # trial is Optuna's object, which suggests the next hyperparameters
    a = trial.suggest_float("a", 0, 1)  # uniformly distributed in [0, 1]
    b = trial.suggest_float("b", 0, 1)
    return (a * b) - b  # this is the function which Optuna tries to optimize/minimize

study = optuna.create_study()  # minimizes by default
study.optimize(objective, n_trials=100)
For more detailed source code, see the Optuna documentation. It saved me a lot of time and gave really good results.

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have built a model that outputs pretty decent segmented images as the loss value decreases. However, I cannot evaluate the model's performance with metrics such as mean IoU or the Dice coefficient.
In the case of binary semantic segmentation it was easy to just set a threshold of 0.5 to classify the outputs as object or background, but that does not work for multiclass semantic segmentation. Could you please tell me how to obtain the model's performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using the PyTorch framework and the CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in the literature). So, consider that before using this implementation for any formal publication.
Basically, the other method suggested by the issue poster is to separately accumulate the intersections and unions over the entire dataset and divide them in a final step. The method in the original answer below computes the intersection and union for a batch of images, divides them to get the IoU for the current batch, and then takes the mean of the IoUs over the entire dataset.
The original method below is problematic because the final mean IoU varies with the batch size. In contrast, the mIoU does not vary with the batch size for the method mentioned in the issue, since the separate accumulation ensures that the batch size is irrelevant (though a higher batch size can certainly help speed up the evaluation).
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).squeeze(1)
    iou_list = list()
    present_iou_list = list()

    pred = pred.view(-1)
    label = label.view(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
Your model's prediction will have one channel per class, so first take the softmax (if your model doesn't already), followed by argmax, to get the index with the highest probability at each pixel. Then, we calculate the IoU for each class (and take the mean over them at the end).
We can reshape both the prediction and the label as 1-D vectors (I read that it makes the computation faster). For each class, we first identify the indices of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class while 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This would make that class's IoU calculation invalid, as the class is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and do not involve them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
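For completeness, here is a rough sketch (my own, not taken from the issue) of the dataset-level accumulation mentioned at the top of this answer: keep running per-class intersection and union counts over all batches and divide only once at the end, so the result does not depend on the batch size. The (label, pred) shapes are assumed to match the function above.
import numpy as np
import torch

def mIOU_dataset(batches, num_classes=19):
    # batches: iterable of (label, pred) pairs, with pred being raw logits
    # of shape (N, num_classes, H, W) and label of shape (N, H, W).
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for label, pred in batches:
        pred = torch.argmax(pred, dim=1).view(-1)
        label = label.view(-1)
        for sem_class in range(num_classes):
            pred_inds = (pred == sem_class)
            target_inds = (label == sem_class)
            intersection_now = (pred_inds & target_inds).long().sum().item()
            inter[sem_class] += intersection_now
            union[sem_class] += (pred_inds.long().sum().item()
                                 + target_inds.long().sum().item()
                                 - intersection_now)
    valid = union > 0  # skip classes never seen in either prediction or target
    return float(np.mean(inter[valid] / union[valid]))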

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor outputs, where T is the maximum number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs are all zeros.
Here are my questions.
For a single-output task where one needs the last output of every sequence, a simple outputs[-1] will give a wrong result since this tensor contains lots of zeros for short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of every sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O and computes the cross-entropy loss against the true targets TB (usually integers in a language model). In this situation, do the zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution; if there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths):
        embs = self.embedding(x)

        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)

        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)

        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero-padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, via the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: in older versions of PyTorch masking was not supported, so you had to implement your own workaround. The solution that I used was masked_cross_entropy.py by jihunchoi. You may also be interested in this discussion.
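To make the ignore_index option above concrete, here is a minimal sketch of the reshaping described in the question combined with the masked loss. The shapes are arbitrary, and padding index 0 is an assumption.
import torch
import torch.nn as nn

T, B, O = 5, 3, 10                       # timesteps, batch size, number of classes
logits = torch.randn(T, B, O)
targets = torch.randint(1, O, (T, B))
targets[3:, 1] = 0                       # pretend the second sequence is padded after step 3

# Flatten T x B x O -> TB x O and T x B -> TB, and let ignore_index
# drop the zero-padded targets from the loss.
loss_function = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_function(logits.view(T * B, O), targets.view(T * B))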
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
