Adding correct labels to decision trees - scikit-learn

I am using a Random Forests regressor in a Machine Learning Project. In order to better understand the logic of the predictions, I'd like to visualize some decision trees and check which features are used when.
In order to do so, I wrote the following code:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image
# Select one estimator from the Random Forests
estimator = best_estimators_regr['RandomForestRegressor'][0].estimators_[0]
export_graphviz(estimator, out_file=path+'tree.dot',
rounded=True, proportion=False,
precision=2, filled=True)
call(['dot', '-Tpng', path+'tree.dot', '-o', path+'tree.png', '-Gdpi=600'])
Image(filename=path+'tree.png')
The problem is that I use the max_features parameter when training the model, so I do not know which features are used in each tree. Thus, when plotting a tree, I simply get X[some_number]. Does this number correspond to the column in the original dataset? If not, how can I tell it to use the name of the columns rather than the number?

The 'max_features' parameter in RandomForestClassifier is used to get the number of features at a time to find the best split. That parameter is passed to all the individual estimators (DecisionTreeClassifier). The base DecisionTreeClassifier objects all accept the whole data (where the samples are sampled from the training data but all column features are passed to each tree). The feature ordering is decided into that single DecisionTreeClassifier object. So no need to worry about that.
You can just use the feature_names parameter in export_graphviz to pass the names of each features for all your features.

Related

Tree based algorithm different behavior with duplicated features

I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
This is the code in order to go deeply in the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X,which=1,times=1):
return np.hstack([X,np.hstack([X[:, :which]]*times)])
def do_the_job(X,y,clf):
return np.mean(cross_validate(clf, X, y,cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1,50):
my_x=repeat_feature(X,times=i)
clf1_result.append(do_the_job(my_x,y,clf1))
clf2_result.append(do_the_job(my_x,y,clf2))
clf3_result.append(do_the_job(my_x,y,clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is why does this not happen with the other two classifiers instead?
Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Ty all
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.

Save Scikit-learn RandomForestClassifier in Weka J48 format

I'm looking to save a fitted RandomForestClassifier in the format specified Page 27 - Figure 26 here. Alternatively, if such a thing does not exist, how can I extract the internals of the decision trees to build the J48 formatted file myself? DecisionTreeClassifier makes this available through tree.export_text, but RandomForestClassifier does not have a tree_ attribute.
Toward the last sentence of your question, RandomForestClassifier has its fitted trees accessible through the attribute estimators_; these are instances of DecisionTreeClassifier, so each of them has a tree_ attribute.
If you use the sklearn-weka-plugin (examples), and train Weka's RandomForest on your data, then you can create that output by simply turning the model into a string using str(...).
Of course, this will require a Java JDK installation to run.

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

What will GridsearchCV choose if there are multiple estimators having the same score?

I'm using RandomForestClassifier in sklearn, and using GridsearchCV for getting best estimator.
I'm wondering when there are many estimators (from simple one to complex one) having the same scores in GridsearchCV, what will be the resulted estimator out of GridsearchCV? The simplest one? or random one?
GridSearchCV does not assess the model complexity (though that would be a neat feature). Neither does it choose among the best models randomly.
Instead, GridSearchCV simply performs an np.argmin() on the stored errors. See the corresponding line in the source code.
Now, according to the NumPy docs,
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
That is, GridSearchCV will always select the first among the best models.

How to know which features have more impact in predicting the target class?

I have a business problem, I have run the regression model in python to predict my target value. When validating it with my test set I came to know that my predicted variable is very far from my actual value. Now the thing I want to extract from this model is that, which feature played the role to deviate my predicted value from actual value (let say difference is in some threshold value)?
I want to rank the features impact wise so that I could address to my client.
Thanks
It depends on the estimator you chose, linear models often have a coef_ method you can call to get the coef used for each feature, given they are normalized this tells you what you want to know.
As told above for tree model you have the feature importance. You can also use libraries like treeinterpreter described here:
Interpreting Random Forest
examples
You can have a look at this -
Feature selection
Check the Random Forest Regressor - for performing Regression.
# Example
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
Check regr.feature_importances_ for getting the higher, more important features. Further information on FeatureImportance
Edit-1:
As pointed out in user (#blacksite) comment, only feature_importance does not provide complete interpretation of Random forest. For further analysis of results and responsible Features. Please refer to following blogs
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 (preferred as it provides multiple techniques )
https://blog.datadive.net/interpreting-random-forests/ (focuses on 1 technique but also provides python library - treeinterpreter)
More on feature_importance:
If you simply use the feature_importances_ attribute to select the
features with the highest importance score. Feature selection using
feature
importances
Feature importance also depends on the criteria used for splitting
and calculating importance Interpreting Decision Tree in context of
feature
importances

Resources