Data Underfitting Or Not? - python-3.x

Is the line of regression underfitting and if yes what can I do for accurate results? I have not been able to identify such things like if the line of regression is overfitting or underfitting or accurate so suggestions regarding those will also be appreciated. The File "Advertising.csv":-https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))

To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split
# This will split the data into a train set and a test set, leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1), data['sales'].values.reshape(-1,1), test_size=0.2)
# Then fit your model ...
# e.g. reg.fit(X,Y)
# Finally evaluate how well it does on the training and test data.
print("Test score " + str(reg.score(X_test, Y_test)))
print("Train score " + str(reg.score(X_test, Y_test)))

Instead of training and testing on same data.
Split your data set into 2,3 sets (train,validation,test)
You may only need to split it in 2 (train,test) use sklearn library function train_test_split
Train your model on training data. Then test on testing data and see if you get good result.
If model's training accuracy is very high but testing is very low then you may say it have overfit. Or if model don't even get high accuracy on train then it is underfitting.
Hope it will you. :)

Related

Cross validation in cross_val_score

When fitting my data in python I'm usually doing:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
I splits my data into two chunks: one for training, other with testing.
After that I fit my data with:
model.fit(X_train,y_train)
y_pred = model.predict(X_test,y_test)
And I can get the accuracy with:
accuracy_score(y_test,y_pred)
I understand these steps.
But what is happening in sklearn.model_selection.cross_val_score? For example:
cross_val_score(estimator= model, X= X_train,y=y_train,cv=10).
Is it doing everything that I did before, but 10 times?
Do I have to split the data to train,test sets? From my understanding it splits the data, fits it, predicts the test data and gets the accuracy score. 10 times. In one line.
But I don't see how large is the train and test sets. Can I set it manually? Also are they same size with each run?
The function "train_test_split" splits the train and test set randomly with a split ratio.
While the following "cross_val_score" function does 10-Fold cross-validation.
cross_val_score(estimator= model, X= X_train,y=y_train,cv=10)
In this case, the main difference is that the 10-Fold CV does not shuffle the data, and the folds are looped in the same sequence as the original data. You should think critically if the sequence of the data matters for cross-validation, this depends on your specific application.
Choosing which validation method to use: https://stats.stackexchange.com/questions/103459/how-do-i-know-which-method-of-cross-validation-is-best
You can read the docs about K-Fold here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
Based on my understanding, if you set cv=10, it will divide your dataset into 10 folds. So if you have 1000 rows of data, that's mean 900 will be training dataset and the rest of the 100 will be your testing dataset. Hence, you are not required to set any test_size like what you did in train_test_split.

writing training and testing data sets into separate files

i am training an autoencoder neural network for my work purpose.However i am taking
the image numpy array dataset as input(total samples 16110) and want to split dataset into training and test set using the below autoencoder.fit command. Additionally while training the network it is writing like Train on 12856 samples, validate on 3254 samples.
However, i need to save both the training and testing data into separate files. How can i do it?
es=EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=5)
mc=ModelCheckpoint('best_model.h5',monitor='val_loss',mode='min',save_best_only=True)
history = autoencoder.fit(dataNoise,dataNoise, epochs=30, batch_size=256, shuffle=256,callbacks=[es,mc], validation_split = 0.2)
you can use the train_test_split function from sklearn. See code below
from sklearn.model_selection import train_test_split
train_split=.9 # set this as the % you want for training
train_noise, valid_noise=train_test_split(dataNoise, train_size=train_split, shuffle=True,
random_state=123)
now use train noise as x,y and valid noise for validation data in model.fit

how to correctly shape input of a multiclass classification using keras stacked LSTM model

I am working on a multiple classification problem and after dabbling with multiple neural network architectures, I settled for a stacked LSTM structure as it yields the best accuracy for my use-case. Unfortunately the network takes a long time (almost 48 hours) to reach a good accuracy (~1000 epochs) even when I use GPU acceleration. The resulting accuracy and loss functions are:
At this point, giving the good performance but the very slow training I suspect a bug in my code. I tested it using the golden tests mentioned here, which consist of running tests with 2 points only either in the testing set or the training set along with eliminating the dropouts. Unfortunately, the outputs of these runs result in testing accuracy better than the training accuracy, which should not be the case as far as I know. I suspect that I am shaping my data in the wrong way. Any hints, suggestions and advises are appreciated.
My code is the following:
# -*- coding: utf-8 -*-
import keras
import numpy as np
from time import time
from utils import dmanip, vis
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
###############################################################################
####################### Extract the data from .csv file #######################
###############################################################################
# get data
data, column_names = dmanip.get_data(file_path='../data_one_outcome.csv')
# split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1:].astype('category')
###############################################################################
########################## init global config vars ############################
###############################################################################
# check if GPU is used
print(device_lib.list_local_devices())
# init
n_epochs = 1500
n_comps = X.shape[1]
###############################################################################
################################## Keras RNN ##################################
###############################################################################
# encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y))
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.35,
random_state=True,
shuffle=True)
# expand dimensions
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
# define model
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
input_shape=(x_train.shape[1], 1),
dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4 ,activation='softmax'))
# print model architecture summary
print(model.summary())
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Create a TensorBoard instance with the path to the logs directory
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()))
# fit the model
history = model.fit(x_train, y_train, epochs=n_epochs, batch_size=100,
validation_data=(x_test, y_test), callbacks=[tensorboard])
# plot results
vis.plot_nn_stats(history=history, stat_type="accuracy", fname="RNN-accuracy")
vis.plot_nn_stats(history=history, stat_type="loss", fname="RNN-loss")
My data is a large 2d matrix (38607, 150), where 149 is the number of features and 38607 is the number of samples, with a target vector including 4 classes.
feat1 feat2 ... feat148 feat149 target
1 2.250 0.926 ... 16.0 0.0 class1
2 2.791 1.235 ... 1.0 0.0 class2
. . . . . .
. . . . . .
. . . . . .
38406 2.873 1.262 ... 281.0 0.0 class3
38407 3.222 1.470 ... 467.0 1.0 class4
Regarding the Slowness of Training: You can think of using tf.data instead of Data Frames and Numpy Arrays because, Achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The tf.data API helps to build flexible and efficient input pipelines.
For more information regarding tf.data, please refer this Tensorflow Documentation 1, Documentation 2.
This Tensorflow Tutorial guides you to convert your Data Frame to tf.data format.
One more important feature of use to you can be tf.profiler. Using Tensorflow Profiler, you can not only Visualize the Time and Memory Consumed in each phase of Data Science Project but it also provides us a Suggestion/Recommendation to reduce the Time/Memory Consumption and hence to Optimize our Project.
For more information on Tensorflow Profiler, refer this Documentation, this Tutorial and this Tensorflow DevSummit Youtube Video.
Regarding Testing Accuracy more than Training Accuracy: This is not a big problem and happens sometimes.
Probable Reason 1: Dropout ==> What is the reason for you to use Dropout and recurrent_dropout in your Model? Was the Model Overfitting? If the Model is not Overfitting, without Dropout and recurrent_dropout, then you can think of removing them because, If you set Dropout (0.2) and recurrent_dropout (0.2) it means 20% of features will be 0 and 20% of Time Steps will be 0, during training. However, during testing all the Features and Timesteps are used, so the model is more robust and have better testing accuracy.
Probable Reason 2: 35% of Testing Data is bit more than usual. You can make it either 20% or 25%.
Probable Reason 3: Your training data might have several arduous cases to learn and Your Testing data may contain easier cases to predict. To mitigate this, you can Split the Data Once again with different Random Seed.
For more information, please refer this Research Gate Link and this Stack Overflow Link.
Hope this helps. Happy Learning!

What do the data analytics data set train and test variables represent?

Within the below code there are a few variables I'm confused about:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm,metrics,datasets
train_data=np.zeros((280,10304))
train_target=np.zeros((280))
test_data=np.zeros((120,10304))
test_target=np.zeros((120))
Can someone please explain what test_data, train_data, test_target and train_target represent and their purpose?
That's a quite weird way of naming what's commonly named:
- X_train (here train_data): inputs of your model used to train
- Y_train (here train_target): labels of the lines used to train, i.e. what your model learns to predict
- X_test (here test_data): inputs of your model used to test
- Y_test (here test_target): what you want your model to predict while testing your model
To "test" a model signify mostly to compute some metrics (accuracy/recall/...) to determine how much you are satisfied of your model once that it's trained.
Note: lines of input must have same length, and you must have the same number of lines in input and in labels when training or testing.

Single prediction using a model pre-trained with scaled features

I trained a SVM scikit-learn model with scaled features and persist it to be used later. In another file I loaded the saved model and I want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values and I am getting weird outcomes and I cannot do the predictions. Despite of this, the prediction with a large test set generated by StratifiedShuffleSplit is working fine and I am getting a 97% of accuracy.
The problem is with the single predictions using a persisted SVM model trained with scaled features. Some idea of what am I doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why instead of training and saving your SVM, you should train and save your scaler with your SVM together. In the machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler',StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you will load it and apply it to new data, it will do the scaling as desired before the predictions.

Resources