Single prediction using a model pre-trained with scaled features - scikit-learn

I trained a SVM scikit-learn model with scaled features and persist it to be used later. In another file I loaded the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values and I am getting weird outcomes and I cannot do the predictions. Despite of this, the prediction with a large test set generated by StratifiedShuffleSplit is working fine and I am getting a 97% of accuracy.
The problem is with the single predictions using a persisted SVM model trained with scaled features. Some idea of what am I doing wrong?

Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why instead of training and saving your SVM, you should train and save your scaler with your SVM together. In the machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler',StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you will load it and apply it to new data, it will do the scaling as desired before the predictions.

Related

How to do training and testing for decision tree classifier

im try to do training and testing for my decision tree classifier. im still new in decision tree. i have 150 data with two columns in my csv file and im tried to split it into 100 training and 50 for testing. i've tried using scikit but i still don't understand.
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(train_x, train_Y)
pred_y = classifier.predict(test_x)
print(classification_report(test_Y,pred_y))
accuracy_score(test_Y,pred_y)
can anyone help me how to do it ? i appreciate every help
You need to perform a train-test-split.
As you got 150 samples in total and 50 should be part of your test set, you can set the test size as an integer equal to 50.
You might want to set the random_state for reproducability. Generally, it's also good advice to leave shuffle=True activated. If your data is time-correlated, deactivate it to prevent data leakage. You can find detailled examples in this book.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=50, random_state=42)

is it possible to set the splitting strategy for GridSearchCv?

I'm optimizing model's hyperparameters by GridSearchCv. And because the data I'm working with is very imbalanced, I need to "choose" the manner that the algortihm splits the train/test sets in order to ensure that the underrepresented points are in both sets.
By reading scikit-learn's documentation, I have the idea that it's possible to set the splitting strategy for GridSearch but I'm not sure how or if this is the case.
I would be very grateful if someone could help me with this.
Yes, pass in the GridSearchCV as cv a StratifiedKFold object.
from sklearn.model_selection import StratifiedKFold
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
skf = StratifiedKFold(n_splits=5)
clf = GridSearchCV(svc, parameters, cv = skf)
clf.fit(iris.data, iris.target)
By default, if you are training a classification model with GridSearchCV, the default method for splitting the dataset is StratifiedKFold, that takes care of balancing the dataset according to the target variable.
If your dataset is imbalanced for some other reason (not the target variable), you can choose another criteria to perform the split. Carefully read the documentation of GridSearchCV, and select an appropriate CV splitter.
In the scikit-learn documentation of model selection, there are many Splitter Classes that you could use. Or you can define your own splitter class according to your criteria, but it would be more difficult.

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-fold validation and hyperparameter tuning without risking data leaks. The most sensical solution I've found (according to my knowledge) involves creating a pipeline that includes the scalar and GridSeachCV, for when you want to grid search and cross-fold validate. I've also read that, even when using cross-fold validation, it is useful to, at the very beginning, create a hold-out test set for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
# train, test, split, unscaled data to create a final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# instantiate pipeline with scaler and model, so that each training set
# in each fold is fit to the scalar and each training/test set in each fold
# is respectively transformed by fit scalar, preventing data leaks between each test/train
pipe = Pipeline([('sc', StandardScaler()),
('knn', KNeighborsClassifier())
])
# define hypterparameters to search
params = {'knn_n_neighbors': [3, 5, 7, 11]}
# create grid
search = GridSearchCV(estimator=pipe,
param_grid=params,
cv=5,
return_train_Score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process is correct, my question is what's next?
My guess is that we:
fit X_train to our scaler
transform X_train and X_test with our scaler
train a new model using X_train and our newly discovered best parameters from the grid search process
test the new model with our very first holdout-test set.
Presumably, because the Gridsearch evaluated models with scaling based on various slices of the data, the difference in values from scaling our final, whole train and test data should be fine.
Finally, when it is time to process completely new data points through our production model, do those datapoints need to be transformed according to the scalar fit to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like above from a number of sources. How does pipeline know to fit the scalar to the crossfold's training data, then transform the training and test data? Usually we have to define that process:
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameter according to your pipeline and dataset. In order to do that it will use cross validation (split the your train dataset into 5 equal subset in you case). This means that your best_estimator will be trained on 80% of the train set.
As you know the more data a model see, the better its result is. Therefore once you have the optimal hyperparameters, it is wise to retrain the best estimator on all your training set and assess its performance with the test set.
You can retrain the best estimator using the whole train set by specifying the parameter refit=True of the Gridsearch and then score your model on the best_estimator as follows:
search = GridSearchCV(estimator=pipe,
param_grid=params,
cv=5,
return_train_Score=True,
refit=True)
search.fit(X_train, y_train)
tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)

writing training and testing data sets into separate files

i am training an autoencoder neural network for my work purpose.However i am taking
the image numpy array dataset as input(total samples 16110) and want to split dataset into training and test set using the below autoencoder.fit command. Additionally while training the network it is writing like Train on 12856 samples, validate on 3254 samples.
However, i need to save both the training and testing data into separate files. How can i do it?
es=EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=5)
mc=ModelCheckpoint('best_model.h5',monitor='val_loss',mode='min',save_best_only=True)
history = autoencoder.fit(dataNoise,dataNoise, epochs=30, batch_size=256, shuffle=256,callbacks=[es,mc], validation_split = 0.2)
you can use the train_test_split function from sklearn. See code below
from sklearn.model_selection import train_test_split
train_split=.9 # set this as the % you want for training
train_noise, valid_noise=train_test_split(dataNoise, train_size=train_split, shuffle=True,
random_state=123)
now use train noise as x,y and valid noise for validation data in model.fit

Data Underfitting Or Not?

Is the line of regression underfitting and if yes what can I do for accurate results? I have not been able to identify such things like if the line of regression is overfitting or underfitting or accurate so suggestions regarding those will also be appreciated. The File "Advertising.csv":-https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))
To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split
# This will split the data into a train set and a test set, leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1), data['sales'].values.reshape(-1,1), test_size=0.2)
# Then fit your model ...
# e.g. reg.fit(X,Y)
# Finally evaluate how well it does on the training and test data.
print("Test score " + str(reg.score(X_test, Y_test)))
print("Train score " + str(reg.score(X_test, Y_test)))
Instead of training and testing on same data.
Split your data set into 2,3 sets (train,validation,test)
You may only need to split it in 2 (train,test) use sklearn library function train_test_split
Train your model on training data. Then test on testing data and see if you get good result.
If model's training accuracy is very high but testing is very low then you may say it have overfit. Or if model don't even get high accuracy on train then it is underfitting.
Hope it will you. :)

Resources