How to do training and testing for decision tree classifier - scikit-learn

I'm trying to do training and testing for my decision tree classifier. I'm still new to decision trees. I have 150 rows with two columns in my CSV file, and I tried to split them into 100 for training and 50 for testing. I've tried using scikit-learn, but I still don't understand how.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(train_x, train_Y)
pred_y = classifier.predict(test_x)
print(classification_report(test_Y, pred_y))
print(accuracy_score(test_Y, pred_y))
Can anyone help me with how to do it? I appreciate any help.

You need to perform a train-test split.
Since you have 150 samples in total and 50 of them should go into your test set, you can pass test_size as the integer 50.
You might want to set random_state for reproducibility. Generally, it's also good advice to leave shuffle=True enabled; if your data is time-correlated, disable it to prevent data leakage. You can find detailed examples in this book.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=42)
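Putting the split together with the classifier from the question, a minimal end-to-end sketch could look like the following. The file name data.csv and the assumption that the last column holds the label are illustrative, since the original post does not show how the CSV is loaded.
# minimal sketch; "data.csv" and the column layout are assumptions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

data = pd.read_csv('data.csv')           # assumed: 150 rows
X = data.iloc[:, :-1]                    # every column except the last as features
y = data.iloc[:, -1]                     # last column as the label

# 100 rows for training, 50 for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=42)

classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))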

Related

Cross validation in cross_val_score

When fitting my data in Python, I usually do:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This splits my data into two chunks: one for training, the other for testing.
After that I fit my data with:
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
And I can get the accuracy with:
accuracy_score(y_test,y_pred)
I understand these steps.
But what is happening in sklearn.model_selection.cross_val_score? For example:
cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)
Is it doing everything that I did before, but 10 times?
Do I have to split the data into train and test sets? From my understanding, it splits the data, fits the model, predicts on the held-out fold, and gets the accuracy score. Ten times. In one line.
But I don't see how large the train and test sets are. Can I set that manually? Also, are they the same size on each run?
The function "train_test_split" splits the train and test set randomly with a split ratio.
While the following "cross_val_score" function does 10-Fold cross-validation.
cross_val_score(estimator= model, X= X_train,y=y_train,cv=10)
In this case, the main difference is that the 10-Fold CV does not shuffle the data, and the folds are looped in the same sequence as the original data. You should think critically if the sequence of the data matters for cross-validation, this depends on your specific application.
Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best
You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, in each iteration 900 rows will be the training set and the remaining 100 rows will be the test set. Hence, you are not required to set a test_size like you did in train_test_split.
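As a rough sketch of what cross_val_score does with an integer cv (for classifiers it actually uses stratified folds by default, but the idea is the same), the loop below with KFold shows how the fold sizes come about; X and y are assumed to be NumPy arrays.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def manual_cross_val_score(model, X, y, n_splits=10):
    # rough equivalent of cross_val_score(model, X, y, cv=n_splits)
    kf = KFold(n_splits=n_splits)          # no shuffling: folds follow the original row order
    scores = []
    for train_idx, val_idx in kf.split(X):
        fold_model = clone(model)          # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        # each validation fold holds ~len(X) / n_splits rows; the rest is used for training
        scores.append(fold_model.score(X[val_idx], y[val_idx]))
    return np.array(scores)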

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-fold validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) involves creating a pipeline that includes the scaler and GridSearchCV, for when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to create a hold-out test set at the very beginning for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
# train-test split the unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# instantiate a pipeline with scaler and model, so that in each fold the scaler
# is fit on that fold's training data only and then used to transform both the
# training and validation data, preventing data leaks between them
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# define hyperparameters to search (note the double underscore after the step name)
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create the grid search
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process are correct, my question is: what's next?
My guess is that we:
fit our scaler to X_train
transform X_train and X_test with the fitted scaler
train a new model on X_train using the best parameters found by the grid search
test the new model on our original hold-out test set.
Presumably, because the grid search evaluated models with scaling based on various slices of the data, the difference introduced by scaling our final, whole train and test data should be fine.
Finally, when it is time to run completely new data points through our production model, do those data points need to be transformed with the scaler fitted to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like the above from a number of sources. How does the pipeline know to fit the scaler to each cross-validation fold's training data and then transform both the training and validation data? Usually we have to define that process ourselves:
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (splitting your training set into 5 equal subsets, in your case). This means that your best_estimator_ will have been trained on only 80% of the training set.
As you know, the more data a model sees, the better its results tend to be. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on your entire training set and assess its performance with the test set.
You can retrain the best estimator on the whole training set by specifying refit=True in GridSearchCV and then scoring the best_estimator_ on the test set as follows:
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True,
                      refit=True)
search.fit(X_train, y_train)
tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)
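Regarding the production question: because the scaler lives inside the pipeline, the refit best_estimator_ already contains a StandardScaler fitted on the training data, so new data points only need to be passed to predict. A minimal sketch, where new_points is a hypothetical array with the same feature columns as X_train and joblib is an assumed choice of persistence tool:
import joblib

# the pipeline applies the already-fitted scaler before the KNN prediction
predictions = tuned_pipe.predict(new_points)   # new_points: hypothetical unseen samples

# the same holds after persisting and reloading the whole pipeline
joblib.dump(tuned_pipe, 'tuned_pipe.joblib')
loaded_pipe = joblib.load('tuned_pipe.joblib')
predictions = loaded_pipe.predict(new_points)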

How to use .fit() with cross validation

I'm pretty new to data science and a bit confused, and I just want to make sure my approach makes sense.
I create models like:
lr7 = GaussianNB().fit(X_train,y_train)
and then use cross_val_predict() right after:
y_pred8 = cross_val_predict(lr8, X_test, y_test, cv=5, n_jobs=-1, verbose=5)
Wouldn't it make much more sense to cross validate the train set first?
There is also a cross_validate() function in scikit-learn.
Is it correct to use this one with the train dataset? In the documentation they use X and y, not data that was split into train/test.
A simple way to implement cross-validation is to use the cross_val_score function from sklearn. This may be appropriate for your problem.
# build model
lr7 = GaussianNB()
scores = cross_val_score(lr7, X, y, cv=5)
Note that in cross-validation you use either the whole dataset or the training part (X_train, y_train), but never the test part as you do in your code.
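A sketch of the workflow implied by this answer, assuming you have already made a train/test split: cross-validate on the training portion to estimate performance, then fit once on the full training set and keep the test set for a single final check.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

model = GaussianNB()

# estimate generalisation performance using only the training data
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# fit once on the full training set and evaluate a single time on the held-out test set
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))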

Single prediction using a model pre-trained with scaled features

I trained an SVM scikit-learn model with scaled features and persisted it to be used later. In another file I load the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values, I am getting weird outcomes, and I cannot do the predictions. Despite this, prediction on a large test set generated by StratifiedShuffleSplit works fine and I am getting 97% accuracy.
The problem is with single predictions using a persisted SVM model trained with scaled features. Any idea what I am doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why, instead of training and saving only your SVM, you should train and save your scaler together with your SVM. In machine-learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you load it and apply it to new data, it will do the scaling as desired before making predictions.
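As a sketch of the persistence step (assuming joblib for serialization, which the original post does not name), saving the pipeline and later predicting on a single new sample could look like this; note that a single sample must be reshaped to a 2D array of shape (1, n_features):
import joblib
import numpy as np

# save the fitted pipeline (scaler + SVM together)
joblib.dump(pipe, 'svc_pipeline.joblib')

# ... later, in another file ...
loaded_pipe = joblib.load('svc_pipeline.joblib')

# a single new set of features must be 2D: shape (1, n_features)
new_sample = np.asarray(X_test[0]).reshape(1, -1)
print(loaded_pipe.predict(new_sample))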

Data Underfitting Or Not?

Is the regression line underfitting, and if so, what can I do to get accurate results? I have not been able to identify whether the regression line is overfitting, underfitting, or accurate, so suggestions on that will also be appreciated. The file "Advertising.csv": https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))
To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split

# This will split the data into a train set and a test set, leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1), data['sales'].values.reshape(-1,1), test_size=0.2)

# Then fit your model ...
# e.g. reg.fit(X, Y)

# Finally evaluate how well it does on the training and test data.
print("Train score " + str(reg.score(X, Y)))
print("Test score " + str(reg.score(X_test, Y_test)))
Instead of training and testing on the same data, split your dataset into two or three sets (train, validation, test).
You may only need to split it into two (train, test); use the sklearn function train_test_split, as in the sketch below.
Train your model on the training data, then test it on the test data and see if you get a good result.
If the model's training accuracy is very high but its test accuracy is very low, you may say it has overfit. If the model doesn't even reach high accuracy on the training data, it is underfitting.
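A minimal sketch of that split, using the X and Y arrays from the question and two calls to train_test_split to get train, validation, and test sets (the 60/20/20 proportions and the random_state are illustrative assumptions, not from the original answer):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# first carve off 20% as the final test set, then 25% of the remainder (= 20% overall) as validation
X_temp, X_test, y_temp, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
print("Train score:     ", reg.score(X_train, y_train))
print("Validation score:", reg.score(X_val, y_val))
print("Test score:      ", reg.score(X_test, y_test))
# high train score but much lower validation/test scores -> overfitting
# low train score as well -> underfitting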
Hope this helps. :)

Resources