Tree based algorithm different behavior with duplicated features - scikit-learn

I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
This is the code in order to go deeply in the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X,which=1,times=1):
return np.hstack([X,np.hstack([X[:, :which]]*times)])
def do_the_job(X,y,clf):
return np.mean(cross_validate(clf, X, y,cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1,50):
my_x=repeat_feature(X,times=i)
clf1_result.append(do_the_job(my_x,y,clf1))
clf2_result.append(do_the_job(my_x,y,clf2))
clf3_result.append(do_the_job(my_x,y,clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is why does this not happen with the other two classifiers instead?
Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Ty all

When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier)
or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.

Related

Does K-Fold iteratively train a model

If you run cross-val_score() or cross_validate() on a dataset, is the estimator trained using all the folds at the end of the run?
I read somewhere that cross-val_score takes a copy of the estimator. Whereas I thought this was how you train a model using k-fold.
Or, at the end of the cross_validate() or cross_val_score() you have a single estimator and then use that for predict()
Is my thinking correct?
You can refer to sklearn-document here.
If you do 3-Fold cross validation,
the sklearn will split your dataset to 3 parts. (For example, the 1st part contains 1st-3rd rows, 2nd part contains 4th-6th rows, and so on)
sklearn iterate to train new model 3 times with different training set and validation set
In the first round, it combine 1st and 2nd part together and use it as training set and test the model with 3rd part.
In the second round, it combine 1st and 3rd part together and use it as training set and test the model with 2nd part.
and so on.
So, after using cross-validate, you will get three models. If you want the model objects of each round, you can add parameter return_estimato=True. The result which is the dictionary will have another key named estimator containing the list of estimator of each training.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)
print(sorted(cv_results.keys()))
#Output: ['estimator', 'fit_time', 'score_time', 'test_score']
cv_results['estimator']
#Output: [Lasso(), Lasso(), Lasso()]
However, in practice, the cross validation method is used only for testing the model. After you found the good model and parameter setting that give you the high cross-validation score. It will be better if you fit the model with the whole training set again and test the model with the testing set.

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

Adding correct labels to decision trees

I am using a Random Forests regressor in a Machine Learning Project. In order to better understand the logic of the predictions, I'd like to visualize some decision trees and check which features are used when.
In order to do so, I wrote the following code:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image
# Select one estimator from the Random Forests
estimator = best_estimators_regr['RandomForestRegressor'][0].estimators_[0]
export_graphviz(estimator, out_file=path+'tree.dot',
rounded=True, proportion=False,
precision=2, filled=True)
call(['dot', '-Tpng', path+'tree.dot', '-o', path+'tree.png', '-Gdpi=600'])
Image(filename=path+'tree.png')
The problem is that I use the max_features parameter when training the model, so I do not know which features are used in each tree. Thus, when plotting a tree, I simply get X[some_number]. Does this number correspond to the column in the original dataset? If not, how can I tell it to use the name of the columns rather than the number?
The 'max_features' parameter in RandomForestClassifier is used to get the number of features at a time to find the best split. That parameter is passed to all the individual estimators (DecisionTreeClassifier). The base DecisionTreeClassifier objects all accept the whole data (where the samples are sampled from the training data but all column features are passed to each tree). The feature ordering is decided into that single DecisionTreeClassifier object. So no need to worry about that.
You can just use the feature_names parameter in export_graphviz to pass the names of each features for all your features.

Neural network regression with skewed data

I have been trying to build a machine learning model using Keras which predicts the radiation dose based on pre-treatment parameters. My dataset has approximately 2200 samples of which 20% goes into validation and testing.
The problem with the target variable is that it is very skewed since large radiation doses are much more rare than the small ones. Hence, I suspect that my regression model fails to predict the large values at all, and predicts everything around the mean, which is apparent from the figure. I have tried to log-normalise the target variable to make it more normally distributed, but it has had no effect.
Any suggestion how to fix this?
Target variable
Regression predictions
Computing individual sample weights based on 10 histogram bins helped in my case. See the code below:
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
hist, bin_edges = np.histogram(training_targets, bins = 10)
classes = training_targets.apply(lambda x: pd.cut(x, bin_edges, labels = False,
include_lowest = True)).values
sample_weights = compute_sample_weight('balanced', classes)

Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.
Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.
The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.
I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).
1) Is this the right way to think about cross-validation?
2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.
There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?
Any guidance here would be great!
1) Yes, you should impute the 20% test data using the 80% training data.
2) I wrote a blog post that answers your second question, but I'll include the core parts here.
With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute the median of numeric features before scaling them. The categorical and boolean features are imputed with the mode -- the categorical features are one-hot encoded.
You can include an estimator at the end of the pipeline for regression, classification, etc.
import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, Imputer, StandardScaler
preprocess_pipeline = make_pipeline(
FeatureUnion(transformer_list=[
("numeric_features", make_pipeline(
TypeSelector(np.number),
Imputer(strategy="median"),
StandardScaler()
)),
("categorical_features", make_pipeline(
TypeSelector("category"),
Imputer(strategy="most_frequent"),
OneHotEncoder()
)),
("boolean_features", make_pipeline(
TypeSelector("bool"),
Imputer(strategy="most_frequent")
))
])
)
The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame. The subset of columns with the given data type are selected with TypeSelector.transform.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class TypeSelector(BaseEstimator, TransformerMixin):
def __init__(self, dtype):
self.dtype = dtype
def fit(self, X, y=None):
return self
def transform(self, X):
assert isinstance(X, pd.DataFrame)
return X.select_dtypes(include=[self.dtype])
I recommend thinking of 5-fold cross validation as simply splitting up the data into 5 parts (or folds). You hold out one fold for testing and using the other 4 together for your training set. We repeat this process another 4 times until each fold has had the chance to be tested.
For your imputation to work correctly and not be subject to contamination, you would need to determine the mean from the 4 folds used for testing, and use it to impute that value in both the training set and test set.
I like to implement the CV split with StratifiedKFold. This will ensure you have the same number of samples for each class in the folds.
To answer your question about using Pipelines, I would say you should probably subclass the BaseEstimator with your custom Imputation transformer. Inside of your loop for the CV-split, you should compute the mean from your training set then set this mean as a parameter in your transformer. Then you can call fit or transform.

Resources