Is it logical to do as below in Keras in order not to run out of memory?
for path in ['xaa', 'xab', 'xac', 'xad']:
x_train, y_train = prepare_data(path)
model.fit(x_train, y_train, batch_size=50, epochs=20, shuffle=True)
model.save('model')
It is, but prefer model.train_on_batch if each iteration is generating a single batch. This eliminates some overhead that comes with fit.
You can also try to create a generator and use model.fit_generator():
def dataGenerator(pathes, batch_size):
while True: #generators for keras must be infinite
for path in pathes:
x_train, y_train = prepare_data(path)
totalSamps = x_train.shape[0]
batches = totalSamps // batch_size
if totalSamps % batch_size > 0:
batches+=1
for batch in range(batches):
section = slice(batch*batch_size,(batch+1)*batch_size)
yield (x_train[section], y_train[section])
Create and use:
gen = dataGenerator(['xaa', 'xab', 'xac', 'xad'], 50)
model.fit_generator(gen,
steps_per_epoch = expectedTotalNumberOfYieldsForOneEpoch
epochs = epochs)
I would suggest having a look at this thread on Github.
You could indeed consider using model.fit(), but it would make the training more stable to do it in such a way:
for epoch in range(20):
for path in ['xaa', 'xab', 'xac', 'xad']:
x_train, y_train = prepare_data(path)
model.fit(x_train, y_train, batch_size=50, epochs=epoch+1, initial_epoch=epoch, shuffle=True)
This way you are iterating over all your data once per epoch, and not iterating 20 epochs over part of your data before switching.
As discussed in the thread, another solution would be to develop your own data generator and use it with model.fit_generator().
Related
I'm using a pre trained InceptionV3 on Keras to retrain the model to make a binary image classification (data labeled with 0's and 1's).
I'm reaching about 65% of accuracy on my k-fold validation with never seen data, but the problem is the model is overfitting to soon. I need to improve this average accuracy, and I guess there is something related to this overfitting problem.
Here are the loss values on epochs:
Here is the code. The dataset and label variables are Numpy Arrays.
dataset = joblib.load(path_to_dataset)
labels = joblib.load(path_to_labels)
le = LabelEncoder()
labels = le.fit_transform(labels)
labels = to_categorical(labels, 2)
X_train, X_test, y_train, y_test = sk.train_test_split(dataset, labels, test_size=0.2)
X_train, X_val, y_train, y_val = sk.train_test_split(X_train, y_train, test_size=0.25) # 0.25 x 0.8 = 0.2
X_train = np.array(X_train)
y_train = np.array(y_train)
X_val = np.array(X_val)
y_val = np.array(y_val)
X_test = np.array(X_test)
y_test = np.array(y_test)
aug = ImageDataGenerator(
rotation_range=20,
zoom_range=0.15,
horizontal_flip=True,
fill_mode="nearest")
pre_trained_model = InceptionV3(input_shape = (299, 299, 3),
include_top = False,
weights = 'imagenet')
for layer in pre_trained_model.layers:
layer.trainable = False
x = layers.Flatten()(pre_trained_model.output)
x = layers.Dense(1024, activation = 'relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(2, activation = 'softmax')(x) #already tried with sigmoid activation, same behavior
model = Model(pre_trained_model.input, x)
model.compile(optimizer = RMSprop(lr = 0.0001),
loss = 'binary_crossentropy',
metrics = ['accuracy']) #Already tried with Adam optimizer, same behavior
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=100)
mc = ModelCheckpoint('best_model_inception_rmsprop.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
history = model.fit(x=aug.flow(X_train, y_train, batch_size=32),
validation_data = (X_val, y_val),
epochs = 100,
callbacks=[es, mc])
The training dataset has 2181 images and validation has 727 images.
Something is wrong, but I can't tell what...
Any thoughts of what can be done to improve it?
One way to avoid overfitting is to use a lot of data. The main reason overfitting happens is because you have a small dataset and you try to learn from it. The algorithm will have greater control over this small dataset and it will make sure it satisfies all the datapoints exactly. But if you have a large number of datapoints, then the algorithm is forced to generalize and come up with a good model that suits most of the points.
Suggestions:
Use a lot of data.
Use less deep network if you have a small number of data samples.
If 2nd satisfies then don't use huge number of epochs - Using many epochs leads is kinda forcing your model to learn that and your model will learn it well but can not generalize.
From your loss graph , i see that the model is generalized at early epoch ( where there is intersection of both the train & val score) so plz try to use the model saved at that epoch ( and not the later epochs which seems to overfit)
Second option what you have is use lot of training samples..
If you have less no. of training samples then use data augmentations
Have you tried following?
Using a higher dropout value
Lower Learning Rate (lr=0.00001 or lr=0.000001 ...)
More data augmentation you can use.
It seems to me your data amount is low. You may use a lower ratio for test and validation (10%, 10%).
I try to compare three models SVM RandomForest and LogisticRegression.
I have an imbalance dataset. First i split it to with a 80% - 20% ratio to train and test set. I set the stratify=y.
Next, i used StratifiedKfold only on train set. What i try to do now is fit the models and choose the best one. Also i want to use grid search for each one of the models to find the best parameters.
My code until now is the next
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
for train_index, test_index in skf.split(X_train, y_train):
X_train_folds, X_test_folds = X_train[train_index], X_train[test_index]
y_train_folds, y_test_folds = y_train[train_index], y_train[test_index]
X_train_2, X_test_2, y_train_2, y_test_2 = X[train_index], X[test_index], y[train_index], y[test_index]
How can i fit a model usin all the folds? How can i gridsearch? Should i have a doulbe loop? can you help?
You can use scikit-learn's GridSearchCV.
You will find an example here of how to evaluate the performance of the various models and assess the statistical significance of the results.
I am training my CNN using the following code:
history = model.fit_generator(
train_data_gen,
steps_per_epoch=int(np.ceil(train_data_gen.n / float(batch_size))),
epochs=num_epochs,
validation_data=val_data_gen,
validation_steps=int(np.ceil(val_data_gen.n / float(batch_size))),
verbose=1,
)
and these are the output:
The accuracy is okay but why is epoch 1/20 repeating itself multiple times?
Your steps per epoch is not managed well. You are trying to validate twice too..
Try without the float and int:
history = model.fit_generator(
train_data_gen,
steps_per_epoch=np.ceil(train_data_gen.n / batch_size),
epochs=num_epochs,
validation_data=val_data_gen,
validation_steps=np.ceil(val_data_gen.n / batch_size),
verbose=1,
)
print out the values of generator size, the steps_per_epoch should reflect number of training steps (usually determined by generator function)
And make sure train_data_gen.n = total_training_samples and same for val_data_gen.n = total_validation_samples
OR, just comment out steps_per_epoch and validation_steps and the model will take on the generator length as training steps.
Consult: https://keras.io/models/model/#fit_generator
I am working on a basic Classification task with Keras and I seem to have stumbled upon a problem where I need some assistance.
I have 200 samples for training and a 100 for validation, I intend to use a ImageDataGenerator to increase the number of training samples for my task. I want to make sure of the total number of training images that are passed to the fit_generator().
I know that the steps_per_epoch defines the total number of batches we get from a generator and ideally it should be number of samples divided by the batch size.
However, this is where things do not add up for me. Here is a snippet of my code:
num_samples = 200
batch_size = 10
gen = ImageDataGenerator(horizontal_flip = True,
vertical_flip = True,
width_shift_range = 0.1,
height_shift_range = 0.1,
zoom_range = 0.1,
rotation_range = 10
)
x,y = shuffle(img_data,img_label, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.333, random_state=2)
generator = gen.flow(x_train, y_train, save_to_dir='check_images/sample_run')
new_network.fit_generator(generator, steps_per_epoch=len(x_train)/batch_size, validation_data=(x_test, y_test), epochs=1, verbose=2)
I am saving the augmented images to see how the images turn out from the ImageDataGenerator and also to ascertain the number of images that are generated from it.
After running this code for a single epoch, I get 600 images in my directory, a number which I cannot arrive at, or maybe I am making a mistake.
Any assistance in making me understand the calculation in this code would be deeply appreciated. Has anyone come across similar problems ?
TIA
gen.flow() creates a NumpyArrayIterator internally and that in turn uses Iterator to calculate the steps_per_epoch. Ideally if steps_per_epoch is None, then the calculation is done as steps_per_epoch = (x.shape[0] + batch_size - 1) // batch_size which is approximately same as your calculation.
Not sure why you see more number of samples. Could you compute the x.shape[0] and double check if your code is as per what you explained?
I have he following code to run a 10-fold cross validation in SkLearn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation, instead of all 10. Is there an easy way to keep this code but just say to run the first fold and then exit?
I would still like that data is split into 10 parts, but that only one combination of that 10 parts is fitted and scored, instead of 10 combinations.
No, not with cross_val_score I suppose. You can set n_splits to minimum value of 2, but still that will be 50:50 split of train, test which you may not want.
If you want maintain a 90:10 ration and test other parts of code like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after first iteration.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = x_data[train_index], x_data[test_index]
y_train, y_test = y_data[train_index], y_data[test_index]
break
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of :
scores = model_selection.cross_val_score(MyEstimator(),
x_data, y_data,
cv=cv,
scoring='mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# I am appending to a scores list object, because that will be output of cross_val_score.
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score will be doing this only internally, just some enhancements for parallel processing.