Can validation in keras.fit run with multiprocessing to speed things up? - multithreading

I am working through the code in Chapter 6 of "Deep Learning with Python" (by François Chollet). Things are running on Colab, but the validation takes a very long time.
Here I only show the code that is relevant to the question. The complete code can be found in the author's GitHub.
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop
lookback = 1440
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss="mae")
train_gen = {some generator function}
val_gen = {some generator function}
val_steps = (300000 - 200001 - lookback)
history = model.fit(train_gen, steps_per_epoch=500, epochs=20, validation_data=val_gen, validation_steps=val_steps)
The training took several seconds, but the validation took 20 minutes per epoch, because there are 98559 validation steps.
I tried "use_multiprocessing=True" as in:
history = model.fit(..., use_multiprocessing=True)
or "workers=10" as in:
history = model.fit(..., workers=10)
But they didn't help, and took even longer.
So I wonder whether there are ways to speed the validation up?
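One thing worth checking (an observation on my part, not from the post above): if the generators yield a full batch per step, validation_steps should count batches rather than individual samples. The book's listing divides by the batch size (128 there); a minimal sketch of that calculation, assuming a batch size of 128:
batch_size = 128
# number of batches needed to cover the validation range, not the number of samples
val_steps = (300000 - 200001 - lookback) // batch_size  # about 770 steps instead of 98559

history = model.fit(train_gen,
                    steps_per_epoch=500,
                    epochs=20,
                    validation_data=val_gen,
                    validation_steps=val_steps)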

Related

k-NN GridSearchCV taking extremely long time to execute

I am attempting to use sklearn to train a KNN model on the MNIST classification task. When I try to tune my parameters using either sklearn's GridSearchCV or RandomizedSearchCV classes, my code takes an extremely long time to execute.
As an experiment, I created a KNN model using KNeighborsClassifier() with the default parameters and passed these same parameters to GridSearchCV. Afaik, this should mean GridSearchCV only has a single set of parameters and so should effectively not perform a "search". I then called the .fit() methods of both on the training data and timed their execution (see code below). The KNN model's .fit() method took about 11 seconds to run, whereas the GridSearchCV model took over 20 minutes.
I understand that GridSearchCV should take slightly longer as it is performing 5-fold cross validation, but the difference in execution time seems too large for it to be explained by that.
Am I doing something in my GridSearchCV call that is causing it to take such a long time to execute? And is there anything I can do to accelerate it?
import sklearn
import time

# importing models
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# importing data
from sklearn.datasets import fetch_openml
mnist = fetch_openml(name='mnist_784', as_frame=False)  # as_frame=False keeps NumPy arrays so the integer indexing below works
print("data loaded")

# splitting the data into stratified train & test sets
X, y = mnist.data, mnist.target  # mnist.data.shape is (n_samples, n_features)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
print("data split")

# Data has no missing values and is preprocessed, so no cleaning needed.
# using a KNN model, as recommended
knn = KNeighborsClassifier()
print("model created")

print("training model")
start = time.time()
knn.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn fit was: {end-start}")

# Parameter tuning.
# starting by performing a broad-range search on n_neighbors to work out the
# rough scale the parameter should be on
print("beginning param tuning")
params = {'n_neighbors': [5],
          'weights': ['uniform'],
          'leaf_size': [30]}
paramSearch = GridSearchCV(estimator=knn,
                           param_grid=params,
                           cv=5,
                           n_jobs=-1)
start = time.time()
paramSearch.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
With vanilla KNN, the costly procedure is predicting, not fitting: fitting just saves a copy of the data, and then predicting has to do the work of finding nearest neighbors. So since your search involves scoring on each test fold, that's going to take a lot more time than just fitting. A better comparison would have you predict on the training set in the no-search section.
However, sklearn does have different options for the algorithm parameter, which aim to trade away some of the prediction complexity for added training time, by building a search structure so that fewer comparisons are needed at prediction time. With the default algorithm='auto', you're probably building a ball tree, so the effect described in the first paragraph won't be as pronounced. I suspect this is still the issue though: now the training time will be non-negligible, but the scoring portion of the search is what takes most of the time.
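A rough way to see this for yourself (not part of the original answer) is to time fitting and predicting separately; a minimal sketch, reusing the knn, X_train and y_train defined in the question, and predicting on a slice to keep the runtime manageable:
import time

start = time.time()
knn.fit(X_train, y_train)            # essentially just stores the training data
print(f"fit time: {time.time() - start:.2f}s")

start = time.time()
knn.predict(X_train[:5000])          # the neighbor search is where the cost is
print(f"predict time on 5000 samples: {time.time() - start:.2f}s")
It is this prediction cost that the cross-validated search pays on every fold, which is why the gap to a bare fit() looks so large.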

sklearn & pytorch: Train test split for neural net training in pipeline for a grid search

I am working on a pretty big dataset, which we decided to split with GroupKFold (we have groups of measurements in the dataset which shouldn't be split apart, but kept together when folding into k folds).
We are then grid searching sklearn models on the group-k-folded dataset with either a randomized or Bayesian search. To use neural nets in this pipeline we decided to wrap PyTorch in the sklearn interface. For that we used from sklearn.base import BaseEstimator, ClassifierMixin.
Then we are setting up a pipeline:
class Neural_Net_Interface(ClassifierMixin, BaseEstimator):
    def __init__(self, X_test, y_test, Max_num_epochs, Early_Stopping, and so on...):
        self.....

    def fit(self, X_train, y_train):
        ...

    def predict(self, X):
        ...

pipeline_nn = Pipeline([('std', StandardScaler()),
                        ('splitter', train_test_split(X, y, test_size=0.2, random_state=69)),
                        ('nn', Neural_Net_Interface(X_test=X,
                                                    y_test=y,
                                                    Max_num_epochs=3,
                                                    Early_Stopping=True,
                                                    ... (20 more parameters))])

cv_object = GroupKFold(n_splits=np.max(group_vector) + 1)

model_grid_cv = BayesSearchCV(estimator=pipeline_nn,
                              search_spaces=search_space,
                              scoring=my_scorer,
                              optimizer_kwargs={'base_estimator': 'NN', 'n_initial_points': 20},
                              cv=cv_object,
                              n_jobs=N_JOBS,
                              verbose=100,
                              n_iter=N_ITER,
                              n_points=N_POINTS,
                              iid=False,
                              random_state=69)
model_grid_cv.fit(X, y, groups=groups)
And here comes the problem:
As you can see above, the NeuralNetInterface (sklearn classifier) expects a test X & y as input. This is because after each training epoch we need to evaluate the NN accuracy. I can't train-test split the dataset once at the beginning, as this would defeat the purpose of a k-fold. So what I am trying to do is define the pipeline in a way that the output of the train-test split is passed to the neural net interface. This is not working.
Besides, my real questions are:
- The GroupKFold folds 4 groups 4 times, taking 3 parts for training and one part for the score estimation. How can I adjust the pipeline so that the held-out part of each fold is passed to the NeuralNetInterface and used for the NN evaluation? Do I need to adjust the NeuralNetInterface so it doesn't take a test set?
- Or is that not possible, and do I need to train-test split the data within the grid search, always passing one part to the NeuralNetInterface? How do I get that working?
I hope I described my question well enough to be understood.
Thanks for your help in advance!
Best regards
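One common workaround (an assumption on my part, not something stated in the post above) is to drop X_test/y_test from __init__ and let the estimator carve out its own early-stopping validation split inside fit(), so the fold held out by GroupKFold is never touched during training and stays available for scoring. A minimal sketch, with the PyTorch training loop left as placeholders:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split

class NeuralNetInterface(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: the per-epoch validation split happens inside fit()."""

    def __init__(self, max_num_epochs=3, early_stopping=True, val_fraction=0.2):
        self.max_num_epochs = max_num_epochs
        self.early_stopping = early_stopping
        self.val_fraction = val_fraction

    def fit(self, X, y):
        # Internal split used only for per-epoch evaluation / early stopping;
        # the outer GroupKFold test fold is never seen here.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.val_fraction, random_state=69)
        for epoch in range(self.max_num_epochs):
            # ... train the PyTorch model on X_tr, y_tr ...
            # ... evaluate on X_val, y_val and stop early if needed ...
            pass
        return self

    def predict(self, X):
        # ... return predictions from the trained PyTorch model ...
        raise NotImplementedError
With this shape, the estimator takes no test data at construction time, so it fits cleanly into a Pipeline (without the 'splitter' step) and into BayesSearchCV with the GroupKFold cv_object.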

Polynomial Regression using keras

Hi, I am new to Keras and I just wanted to know: are ANNs good for polynomial regression tasks, or should we just use sklearn? For example, I wrote this script:
import numpy as np
import keras
from keras.layers import Dense
from keras.models import Sequential
x=np.arange(1, 100)
y=x**2
model = Sequential()
model.add(Dense(units=200, activation = 'relu',input_dim=1))
model.add(Dense(units=200, activation= 'relu'))
model.add(Dense(units=1))
model.compile(loss='mean_squared_error',optimizer=keras.optimizers.SGD(learning_rate=0.001))
model.fit(x, y,epochs=2000)
but after testing it on some numbers I didn't get good results, for example:
model.predict([300])
array([[3360.9023]], dtype=float32)
Is there any problem in my code, or should I just not use ANNs for polynomial regression?
Thank you.
I'm not 100 percent sure, but I think the reason you are getting such bad predictions is that you did not scale your data. Neural networks are very sensitive to the scale of their inputs and targets, so scaling is a must. Scale your data as shown below:
import numpy as np
import keras
from keras.layers import Dense
from keras.models import Sequential
from sklearn.preprocessing import StandardScaler

# StandardScaler expects 2-D input, so reshape the 1-D arrays into column vectors
x = np.arange(1, 100).reshape(-1, 1)
y = x ** 2

sc_x = StandardScaler()
x = sc_x.fit_transform(x)
sc_y = StandardScaler()
y = sc_y.fit_transform(y)

model = Sequential()
model.add(Dense(units=5, activation='relu', input_dim=1))
model.add(Dense(units=5, activation='relu'))
model.add(Dense(units=1))
model.compile(loss='mean_squared_error', optimizer=keras.optimizers.SGD(learning_rate=0.001))
model.fit(x, y, epochs=75, batch_size=10)

# scale the new input the same way, predict, then map the output back to the original scale
prediction = sc_y.inverse_transform(model.predict(sc_x.transform([[300]])))
print(prediction)
Note that I changed the number of epochs from 2000 to 75. This is because 2000 epochs is way too high for a neural network, and it requires lots of time to train. Your X dataset contains only 100 values, so the maximum number of epochs I would suggest is 75.
Furthermore, I also changed the number of neurons in each hidden layer from 200 to 5. This is because 200 neurons is far too many for most datasets, let alone a small dataset of length 100.
These changes should ensure that your neural network produces more accurate predictions.
Hope that helped.

Keras batch normalization stops convergence

I'm new to Keras and have been experimenting with various things such as BatchNormalization, but it is not working at all. When the BatchNormalization line is commented out, the model converges to around 0.04 loss or better, but with it in place it converges to 0.71 and gets stuck around there; I'm not sure what's wrong.
from sklearn import preprocessing
from sklearn.datasets import load_boston
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.normalization import BatchNormalization
import keras.optimizers

boston = load_boston()
x = boston.data
y = boston.target
normx = preprocessing.scale(x)
normy = preprocessing.scale(y)

# builds the stack of hidden layers; doesn't construct the output layer
def layer_looper(inputs, number_of_loops, neurons):
    inputs_copy = inputs
    for i in range(number_of_loops):
        inputs_copy = Dense(neurons, activation='relu')(inputs_copy)
        inputs_copy = BatchNormalization()(inputs_copy)
    return inputs_copy

inputs = Input(shape=(13,))
x = layer_looper(inputs, 40, 20)
predictions = Dense(1, activation='linear')(x)

model = Model(inputs=inputs, outputs=predictions)
opti = keras.optimizers.Adam(lr=0.0001)
model.compile(loss='mean_absolute_error', optimizer=opti, metrics=['acc'])
print(model.summary())
model.fit(normx, normy, epochs=5000, verbose=2, batch_size=128)
I have tried experimenting with batch sizes and the optimizer but it doesn't seem very effective. Am I doing something wrong?
I've increased the learning rate to 0.01 and it seems like the network is able to learn something (I get Epoch 1000/5000 - 0s - loss: 0.2330).
I think it's worth noting the following from the abstract of the original Batch Normalization paper:
Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer (...)
That hints at using an increased learning rate (something you might want to experiment with).
Be aware that since it works like regularization, BatchNorm should make your training loss worse - it's supposed to prevent overfitting and thus close the gap between the train and test/validation errors.
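As a concrete illustration of the point above (the value 0.01 is just the one mentioned here, not a canonical setting), only the optimizer line of the original script needs to change:
# higher learning rate, which BatchNorm is designed to tolerate
opti = keras.optimizers.Adam(lr=0.01)
model.compile(loss='mean_absolute_error', optimizer=opti, metrics=['acc'])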

Neural Network In Scikit-Learn not producing meaningful results

I'm currently trying to use the scikit learn package for its neural network functionality. I have a complex problem to solve with it, but to start out I am just trying a couple of basic tests to familiarize myself with it. I have gotten it to do something, but it isn't producing meaningful results. My code:
import sklearn.neural_network.multilayer_perceptron as nnet
import numpy

def generateTargetDataset(expression="%s", generateRange=(-100, 100), s=1000):
    expression = expression.replace("x", "%s")
    x = numpy.random.rand(s,)
    y = numpy.zeros((s,), dtype="float")
    numpy.multiply(x, abs(generateRange[1] - generateRange[0]), x)
    numpy.subtract(x, min(generateRange), x)
    for z in range(0, numpy.size(x)):
        y[z] = eval(expression % (x[z]))
    x = x.reshape(-1, 1)
    outTuple = (x, y)
    return outTuple

print("New Net + Training")
QuadRegressor = nnet.MLPRegressor(hidden_layer_sizes=(10), warm_start=True, verbose=True,
                                  learning_rate_init=0.00001, max_iter=10000, algorithm="sgd", tol=0.000001)
data = generateTargetDataset(expression="x**2", s=10000, generateRange=(-1, 1))
QuadRegressor.fit(data[0], data[1])
print("Net Trained")

xt = numpy.random.rand(10000, 1)
yr = QuadRegressor.predict(xt)
yr = yr.reshape(-1, 1)
xt = xt.reshape(-1, 1)
numpy.multiply(xt, 100, xt)
numpy.multiply(yr, 10000, yr)
numpy.around(yr, 2, out=yr)
numpy.around(xt, 2, out=xt)
out = numpy.concatenate((xt, yr), axis=1)
numpy.set_printoptions(precision=4)
numpy.savetxt(fname="C:\\SCRATCHDIR\\numpydump.csv", X=out, delimiter=",")
I don't understand how to post the data it gives me, but it spits out between 7000 and 10000 for all inputs between 0 and 100. It seems to be correctly mapped very close to the top of the range, but for inputs close to 0, it just returns something near 7000.
EDIT: I forgot to add this. The network has the same behavior if I remove the dummy training to y=x, but I read somewhere that sometimes you can help a network along by training it to a different but closer function and then using that already weighted network as a starting ground. It didn't work but I just hadn't taken that bit out yet.
My recommendation is to reduce the number of neurons per layer, and increase the training dataset size. Right now, you have a lot of parameters to train in your network, and a small training set (~10K). However, the main point of my answer is that sklearn probably isn't a great choice for your end application.
So you have a complex problem you want to solve with neural networks?
I have a complex problem to solve with it, but to start out I am just trying a couple of basic tests to familiarize myself with it.
According to the official user guide, sklearn's implementation of neural networks isn't designed for large applications and is a lot less flexible than other options for deep learning.
One Python deep learning library I've had good experiences with is keras, a modular, easy-to-use library with GPU support.
Here's a sample I coded up that trains a single perceptron to do quadratic regression.
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
import numpy as np
import matplotlib.pyplot as plt
model = Sequential()
model.add(Dense(1, init = 'uniform', input_dim=1))
model.add(Activation('sigmoid'))
model.compile(optimizer = SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True), loss = 'mse')
data = np.random.random(1000)
labels = data**2
model.fit(data.reshape((len(data),1)), labels, nb_epoch = 1000, batch_size = 128, verbose = 1)
tdata = np.sort(np.random.random(100))
tlabels = tdata**2
preds = model.predict(tdata.reshape((len(tdata), 1)))
plt.plot(tdata, tlabels)
plt.scatter(tdata, preds)
plt.show()
This outputs a scatter plot of the test data points, along with a plot of the true curve.
As you can see, the results are reasonable. In general, neural networks are hard to train, and I had to do some parameter tuning before I got this example working.
It looks like you're using Windows. This question may be helpful for installing Keras on Windows.