writing training and testing data sets into separate files - python-3.x

I am training an autoencoder neural network for work. I am taking an image numpy array dataset as input (16110 samples in total) and splitting it into training and test sets via the autoencoder.fit command below. While training, the network reports "Train on 12856 samples, validate on 3254 samples".
However, I need to save both the training and the testing data into separate files. How can I do that?
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
history = autoencoder.fit(dataNoise, dataNoise, epochs=30, batch_size=256, shuffle=True,
                          callbacks=[es, mc], validation_split=0.2)

You can use the train_test_split function from sklearn. See the code below:
from sklearn.model_selection import train_test_split

train_split = 0.9  # set this to the fraction you want for training
train_noise, valid_noise = train_test_split(dataNoise, train_size=train_split,
                                            shuffle=True, random_state=123)
Now use train_noise as both x and y, and pass valid_noise as the validation data in model.fit.
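If you also need to persist the two splits to separate files, a minimal sketch (assuming plain .npy files are acceptable; the file names are illustrative) could look like this, reusing the es and mc callbacks from the question:
import numpy as np
from sklearn.model_selection import train_test_split

# split once, outside of fit, so the exact arrays can be written to disk
train_noise, valid_noise = train_test_split(dataNoise, train_size=0.8,
                                            shuffle=True, random_state=123)

# write each split to its own file (reload later with np.load)
np.save('train_data.npy', train_noise)
np.save('valid_data.npy', valid_noise)

# train with an explicit validation set instead of validation_split
history = autoencoder.fit(train_noise, train_noise, epochs=30, batch_size=256,
                          shuffle=True, callbacks=[es, mc],
                          validation_data=(valid_noise, valid_noise))
That way the files on disk contain exactly the samples the network trained and validated on.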

Related

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-fold validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) involves creating a pipeline that includes the scaler and wrapping it in GridSearchCV, for when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to create a hold-out test set at the very beginning for an additional, final evaluation of the model after hyperparameter tuning. Putting that all together looks like this:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# train/test split on unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# instantiate a pipeline with scaler and model, so that the scaler is fit
# only on each fold's training split and then used to transform that fold's
# training and validation splits, preventing data leaks between them
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# define hyperparameters to search (pipeline step name + double underscore)
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create the grid search
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process is correct, my question is what's next?
My guess is that we:
fit our scaler to X_train
transform X_train and X_test with our scaler
train a new model using X_train and our newly discovered best parameters from the grid search process
test the new model with our very first hold-out test set.
Presumably, because the grid search evaluated models with scaling based on various slices of the data, the difference in values from scaling our final, whole train and test sets should be fine.
Finally, when it is time to process completely new data points through our production model, do those data points need to be transformed according to the scaler fit to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like the above from a number of sources. How does the pipeline know to fit the scaler to each fold's training data and then transform both the training and test data? Usually we have to define that process explicitly:
from sklearn.preprocessing import MinMaxScaler

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (splitting your train dataset into 5 equal subsets in your case). This means that each candidate estimator is trained on 80% of the train set.
As you know, the more data a model sees, the better its results tend to be. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on the whole training set and assess its performance with the test set.
You can retrain the best estimator on the whole train set by specifying the refit=True parameter of GridSearchCV (it is the default) and then score the best_estimator_ as follows:
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True,
                      refit=True)
search.fit(X_train, y_train)
tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)
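Regarding the production question: because best_estimator_ is itself a pipeline containing the fitted scaler, you can pass raw, unscaled data to it; the pipeline applies the stored scaling before the classifier sees the data. A minimal, illustrative sketch (X_new is a hypothetical array of new samples with the same feature layout as X_train; the values are for illustration only):
import numpy as np

# hypothetical new, unscaled samples (values for illustration only)
X_new = np.array([[5.1, 3.5, 1.4, 0.2]])

# the pipeline scales X_new with the scaler fitted during search.fit,
# then predicts with the tuned KNN model
predictions = tuned_pipe.predict(X_new)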

how to correctly shape input of a multiclass classification using keras stacked LSTM model

I am working on a multiclass classification problem and, after dabbling with multiple neural network architectures, I settled on a stacked LSTM structure as it yields the best accuracy for my use case. Unfortunately the network takes a long time (almost 48 hours) to reach good accuracy (~1000 epochs), even with GPU acceleration. The resulting accuracy and loss curves are shown in the plots (not reproduced here).
At this point, given the good performance but the very slow training, I suspect a bug in my code. I tested it using the golden tests mentioned here, which consist of running tests with only 2 points in either the testing set or the training set, along with eliminating the dropouts. Unfortunately, these runs yield testing accuracy better than the training accuracy, which should not be the case as far as I know. I suspect that I am shaping my data in the wrong way. Any hints, suggestions and advice are appreciated.
My code is the following:
# -*- coding: utf-8 -*-
import keras
import numpy as np
from time import time
from utils import dmanip, vis
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
###############################################################################
####################### Extract the data from .csv file #######################
###############################################################################
# get data
data, column_names = dmanip.get_data(file_path='../data_one_outcome.csv')
# split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1:].astype('category')
###############################################################################
########################## init global config vars ############################
###############################################################################
# check if GPU is used
print(device_lib.list_local_devices())
# init
n_epochs = 1500
n_comps = X.shape[1]
###############################################################################
################################## Keras RNN ##################################
###############################################################################
# encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y))
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.35,
                                                     random_state=True,
                                                     shuffle=True)
# expand dimensions
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
# define model
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
               input_shape=(x_train.shape[1], 1),
               dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4, activation='softmax'))
# print model architecture summary
print(model.summary())
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Create a TensorBoard instance with the path to the logs directory
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()))
# fit the model
history = model.fit(x_train, y_train, epochs=n_epochs, batch_size=100,
                    validation_data=(x_test, y_test), callbacks=[tensorboard])
# plot results
vis.plot_nn_stats(history=history, stat_type="accuracy", fname="RNN-accuracy")
vis.plot_nn_stats(history=history, stat_type="loss", fname="RNN-loss")
My data is a large 2D matrix of shape (38607, 150), where 149 is the number of features and 38607 is the number of samples, with a target vector containing 4 classes:
feat1 feat2 ... feat148 feat149 target
1 2.250 0.926 ... 16.0 0.0 class1
2 2.791 1.235 ... 1.0 0.0 class2
. . . . . .
. . . . . .
. . . . . .
38406 2.873 1.262 ... 281.0 0.0 class3
38407 3.222 1.470 ... 467.0 1.0 class4
Regarding the slowness of training: you can consider using tf.data instead of DataFrames and NumPy arrays, because achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The tf.data API helps to build flexible and efficient input pipelines.
For more information regarding tf.data, please refer to the TensorFlow documentation (Documentation 1, Documentation 2).
This TensorFlow tutorial guides you through converting your DataFrame to the tf.data format, roughly as sketched below.
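As a rough, non-authoritative sketch of such a pipeline, reusing x_train, y_train, x_test and y_test from the question's code (the shuffle buffer size is illustrative, and passing a tf.data.Dataset to model.fit assumes the tf.keras API):
import tensorflow as tf

# build tf.data pipelines from the existing numpy arrays
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(buffer_size=1024)                      # reshuffle samples each epoch
            .batch(100)                                     # same batch size as in model.fit above
            .prefetch(tf.data.experimental.AUTOTUNE))       # overlap data preparation and training
val_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(100)

# with tf.keras, the datasets can then be passed to fit directly, e.g.:
# history = model.fit(train_ds, epochs=n_epochs, validation_data=val_ds, callbacks=[tensorboard])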
One more tool that may be of use is the TensorFlow Profiler. With it, you can not only visualize the time and memory consumed in each phase of the project, but also get suggestions on how to reduce time/memory consumption and thus optimize the pipeline.
For more information on the TensorFlow Profiler, refer to the documentation, the tutorial and the TensorFlow DevSummit YouTube video.
Regarding testing accuracy being higher than training accuracy: this is not a big problem and happens sometimes.
Probable reason 1: dropout. What is your reason for using dropout and recurrent_dropout in your model? Was the model overfitting? If the model does not overfit without dropout and recurrent_dropout, consider removing them. With dropout (0.2) and recurrent_dropout (0.2), 20% of the features and 20% of the time steps are zeroed out during training, whereas during testing all features and time steps are used, so the model appears more robust and can show better testing accuracy.
Probable reason 2: a 35% test split is a bit larger than usual. You can make it 20% or 25% instead.
Probable reason 3: your training data might contain several hard cases to learn while your testing data may contain easier cases to predict. To check for this, you can split the data again with a different random seed.
For more information, please refer to this ResearchGate link and this Stack Overflow link.
Hope this helps. Happy Learning!

Single prediction using a model pre-trained with scaled features

I trained an SVM scikit-learn model with scaled features and persisted it to be used later. In another file I loaded the saved model and want to submit a new set of features to perform a prediction. Do I have to scale this new set of features? How can I do this with only one set of features?
I am not scaling the new values, and I am getting weird outcomes and cannot make the predictions. Despite this, prediction on a large test set generated by StratifiedShuffleSplit works fine, and I am getting 97% accuracy.
The problem is with single predictions using a persisted SVM model trained with scaled features. Any idea what I am doing wrong?
Yes, you should absolutely perform the same scaling on the new data. However, this might be impossible if you haven't saved the scaler you trained before.
This is why instead of training and saving your SVM, you should train and save your scaler with your SVM together. In the machine learning jargon, this is called a Pipeline.
This is how you would use it on a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
pipe = Pipeline([('scaler',StandardScaler()), ('svc', SVC())])
This pipeline then supports the same operations as a regular scikit-learn model:
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
When fitting the pipe, it first scales and then feeds the scaled features into the classifier.
Once it is trained, you can save the pipe object just like you saved the SVM before. When you load it and apply it to new data, it will do the scaling as desired before making predictions, for example as sketched below.
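A minimal sketch of that persist/reload/predict cycle (assuming the joblib package is used for persistence; the file name is illustrative):
import joblib

# persist the fitted pipeline (scaler + SVM together)
joblib.dump(pipe, 'svm_pipeline.joblib')

# ... later, possibly in another file ...
loaded_pipe = joblib.load('svm_pipeline.joblib')

# a single, unscaled sample with the same feature layout as X_train;
# the pipeline's stored scaler transforms it before the SVM predicts
single_sample = X_test[0].reshape(1, -1)
print(loaded_pipe.predict(single_sample))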

Data Underfitting Or Not?

Is the regression line underfitting, and if so, what can I do to get accurate results? I have not been able to tell whether the regression line is overfitting, underfitting or accurate, so suggestions on that would also be appreciated. The file "Advertising.csv": https://github.com/marcopeix/ISL-linear-regression/tree/master/data
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error
#reading and knowing the data
data=pd.read_csv('Advertising.csv')
#print(data.head())
#print(data.columns)
#print(data.shape)
#plotting the data
plt.figure(figsize=(10,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
#storing data into variable and shaping data
X=data['TV'].values.reshape(-1,1)
Y=data['sales'].values.reshape(-1,1)
#calling the model and fitting the model
reg=LinearRegression()
reg.fit(X,Y)
#making predictions
predictions=reg.predict(X)
#plotting the predicted data
plt.figure(figsize=(16,8))
plt.scatter(data['TV'],data['sales'], c='black')
plt.plot(data['TV'],predictions, c='blue',linewidth=2)
plt.xlabel('Money Spent on TV ads')
plt.ylabel('Sales')
plt.show()
r2= r2_score(Y,predictions)
print("R2 score is: ",r2)
print("Accuracy: {:.2f}".format(reg.score(X,Y)))
To work out if your model is underfitting (or overfitting) you need to look at the bias of the model (the distance between the output predicted by your model and the expected output). You can't (to the best of my knowledge) do it just by looking at your code, you need to evaluate your model as well (run it).
As it's a linear regression it's likely that you're underfitting.
I'd suggest splitting your data into a training set and a testing set. You can fit your model on the training set, and see how well it performs on unseen data using the testing set. A model is underfitting if it performs miserably on both the training data as well as the testing data. It's overfitting if it performs brilliantly on the training data but less well on the testing data.
Try something along the lines of:
from sklearn.model_selection import train_test_split

# This will split the data into a train set and a test set,
# leaving 20% (the test_size parameter) for testing
X, X_test, Y, Y_test = train_test_split(data['TV'].values.reshape(-1,1),
                                        data['sales'].values.reshape(-1,1),
                                        test_size=0.2)

# Then fit your model ...
# e.g. reg.fit(X, Y)

# Finally evaluate how well it does on the training and test data.
print("Test score " + str(reg.score(X_test, Y_test)))
print("Train score " + str(reg.score(X, Y)))
Instead of training and testing on the same data, split your dataset into 2 or 3 sets (train, validation, test).
You may only need to split it in 2 (train, test); use the sklearn function train_test_split, as in the sketch below.
Train your model on the training data, then test it on the testing data and see if you get a good result.
If the model's training accuracy is very high but its testing accuracy is very low, you may say it has overfit. If the model doesn't even get high accuracy on the training data, it is underfitting.
Hope this helps. :)
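As a small, illustrative sketch of that three-way split (the 60/20/20 proportions are an assumption), reusing X and Y from the question's code:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# first carve out 20% of the data as the final test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# then split the remainder into train and validation (0.25 * 0.8 = 0.2 of the total)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.25, random_state=42)

reg = LinearRegression()
reg.fit(X_train, Y_train)
print("Train score:", reg.score(X_train, Y_train))
print("Validation score:", reg.score(X_val, Y_val))
print("Test score:", reg.score(X_test, Y_test))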

How do I show both Training loss and validation loss on the same graph in tensorboard through keras?

I'm using Keras with the TensorFlow backend to train a CNN, and I'm using TensorBoard to visualize the loss functions and accuracy. I would like to see the loss of both the training data and validation data on the same graph, but I've only found ways to do so when using TensorFlow directly and not through Keras.
Is there a way to do so?
Edit 1:
I tried writing loss/acc in the regex field, but instead of putting both graphs together it shows them side by side, like so:
http://imgur.com/a/oLIcL
I've added what I use to log to TensorBoard:
tbCallBack = keras.callbacks.TensorBoard(log_dir='C:\\logs', histogram_freq=0,
                                         write_graph=False, write_images=True,
                                         embeddings_freq=0, embeddings_layer_names=None,
                                         embeddings_metadata=None)
model.fit_generator(train_generator,
                    steps_per_epoch=x_train.shape[0] // batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=[tbCallBack])  # the callback must be passed to fit_generator
You can add a regex in the text box in the upper left corner of the TensorBoard window.
Add acc for the accuracy of both training and validation data, and add loss for the loss values. This works for me for Keras as well as TensorFlow.
Got this from this nice tutorial on TB: https://www.youtube.com/watch?v=eBbEDRsCmv4
As a code snippet I use this:
logdir = "_tf_logs/" + now.strftime("%Y%m%d-%H%M%S") + "/"
tb = TensorBoard(log_dir=logdir)
callbacks=[tb]
...
model.fit(X_train, Y_train, validation_data=val_data, epochs=10, verbose=2, callbacks=callbacks)
I found this on GitHub for this exact purpose, but without using TensorBoard. Hope this helps!
live loss plot for keras
