Training a large dataset using the Keras flow_from_dataframe generator

What is the best way to train a large dataset on Google Colaboratory using Keras?
Size of the data: 3 GB of images stored on Google Drive.
After searching, I found that the problem is that the data doesn't fit in memory. The solution suggested in all of the articles I read was to use Keras generators (as far as I understand, a generator fetches one batch, trains on it, then moves to the next batch, and so on, so there is no need to load the whole dataset into memory at once).
I tried the Keras flow_from_dataframe generator, but it didn't solve the problem and I'm still getting "Runtime died".
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator

train_paths = pd.read_csv(path)
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True,
                             samplewise_std_normalization=True,
                             rotation_range=30,
                             validation_split=0.25)
train_generator = datagen.flow_from_dataframe(
    dataframe=train_paths, directory=None, x_col='path', y_col='label',
    subset="training", has_ext=True,
    batch_size=32, target_size=(224, 224),
    color_mode="rgb", seed=0,
    shuffle=False, class_mode="binary", drop_duplicates=False)
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

def compile_and_train(model, num_epochs):
    adam = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=10, amsgrad=False)
    model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['acc'])
    filepath = 'tmp/weights/' + model.name + '.{epoch:02d}-{loss:.2f}.hdf5'
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='auto', period=1)
    # valid_generator and tensor_board are defined elsewhere (the subset="validation"
    # analogue of train_generator, and a TensorBoard callback).
    STEP_SIZE_TRAIN = (train_generator.n // train_generator.batch_size) + 1
    STEP_SIZE_VALID = (valid_generator.n // valid_generator.batch_size) + 1
    Model_history = model.fit_generator(generator=train_generator,
                                        steps_per_epoch=STEP_SIZE_TRAIN,
                                        validation_data=valid_generator,
                                        validation_steps=STEP_SIZE_VALID,
                                        epochs=num_epochs, verbose=1,
                                        callbacks=[checkpoint, tensor_board],
                                        class_weight=[1])
    return Model_history

MobileNet_Model = MobileNet_model(input_shape)
MobileNet_model_his = compile_and_train(MobileNet_Model, num_epochs=1)
One suggested solution is to divide the data manually (or with a for loop), save the weights after each major batch, and continue training from them on the next batch (see the sketch below).
A question here: should I save the whole model (architecture) or the weights only? Is there any better solution than a for loop? And why doesn't using Keras generators solve this problem at all?
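A minimal sketch of that idea (the chunk_dataframes list and the weights file name are hypothetical; datagen and MobileNet_model are taken from the code above). Since the architecture never changes between chunks, saving and reloading the weights alone is enough:

# Hypothetical sketch: train one model across several dataframe chunks,
# carrying the weights over from chunk to chunk.
weights_path = 'tmp/weights/chunked.h5'

model = MobileNet_model(input_shape)  # same architecture every time
model.compile(optimizer=Adam(lr=0.0001), loss='binary_crossentropy', metrics=['acc'])
for i, chunk_df in enumerate(chunk_dataframes):  # hypothetical list of dataframe chunks
    if i > 0:
        model.load_weights(weights_path)  # continue from the previous chunk
    gen = datagen.flow_from_dataframe(
        dataframe=chunk_df, x_col='path', y_col='label',
        batch_size=32, target_size=(224, 224), class_mode='binary')
    model.fit_generator(gen, steps_per_epoch=gen.n // gen.batch_size + 1, epochs=1)
    model.save_weights(weights_path)  # checkpoint for the next chunk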

Related

Keras: training on big datasets separately

I am working on a Keras denoising neural network that denoises high-dimensional X-ray images. The idea is to train on some datasets, e.g. 1, 2, 3, and after obtaining the weights, to start another training on further datasets, e.g. 4, 5, 6, with the weights initialized from the previous training. Implementation-wise it works; however, the weights resulting from the last rotation perform well only on the datasets used for training in that rotation. The same goes for the other rotations.
In other words, the weights resulting from training on datasets 4, 5, 6 don't give as good results on an image from dataset 1 as the weights trained on datasets 1, 2, 3 do, which is not what I intend.
The idea is that the weights should be tuned to work with all datasets effectively, because training on the whole dataset doesn't fit into memory.
I tried other solutions, such as creating a custom generator that reads images from disk and trains in batches, but that is very slow, since it depends on factors like the I/O operations happening on disk and the time complexity of the processing functions inside the custom Keras generator.
Below is the code that shows what I am doing. I have 12 datasets, separated into 4 checkpoints. The data is loaded, training runs and saves the final model path to a list, and the next training takes the weights from the previous rotation and continues.
import os
import h5py
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model

EPOCHES = 150
NUM_CHKPTS = 4
weights = []
for chk in range(1, NUM_CHKPTS + 1):
    log_dir = os.path.join(os.getcwd(), 'resnet_checkpts_' + str(EPOCHES) + "_tl2_chkpt" + str(chk))
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    else:
        print('Training log directory already exists at {}.'.format(log_dir))
    tb_output = TensorBoard(log_dir=log_dir, histogram_freq=1)

    print("Loading Data From CHKPT #" + str(chk))
    h5f = h5py.File('C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5', 'r')
    org_patch = h5f['train_data'][:]
    noisy_patch = h5f['train_noisy'][:]
    h5f.close()

    input_patch, test_patch, noisy_patch, test_noisy_patch = train_test_split(org_patch, noisy_patch, train_size=0.8, shuffle=True)

    print("Reshaping")
    train_data = np.array([np.reshape(input_patch[i], (52, 52, 1)) for i in range(input_patch.shape[0])], dtype=np.float32)
    train_noisy_data = np.array([np.reshape(noisy_patch[i], (52, 52, 1)) for i in range(noisy_patch.shape[0])], dtype=np.float32)
    test_data = np.array([np.reshape(test_patch[i], (52, 52, 1)) for i in range(test_patch.shape[0])], dtype=np.float32)
    test_noisy_data = np.array([np.reshape(test_noisy_patch[i], (52, 52, 1)) for i in range(test_noisy_patch.shape[0])], dtype=np.float32)

    print('Number of training samples are:', train_data.shape[0])
    print('Number of test samples are:', test_data.shape[0])
    # IN = np.ones((len(XTRAINFILES), 52, 52, 1 ))

    if chk == 1:
        print("Generating the Model For The First Time..")
        autoencoder_model = model_autoencoder(train_noisy_data)  # model-building function defined elsewhere
        print("Done!")
    else:
        # Continue from the model saved at the end of the previous rotation.
        autoencoder_model = load_model(weights[chk - 2])

    checkpt_path = log_dir + r"\\cp-{epoch:04d}.ckpt"
    checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpt_path, verbose=0, save_weights_only=True, save_freq='epoch')

    optimizer = tf.keras.optimizers.Adam(lr=0.0001)
    autoencoder_model.compile(loss='mse', optimizer=optimizer)
    autoencoder_model.fit(train_noisy_data, train_data,
                          batch_size=128,
                          epochs=EPOCHES, shuffle=True, verbose=1,
                          validation_data=(test_noisy_data, test_data),
                          callbacks=[tb_output, checkpoint_callback])

    weight_dir = log_dir + '\\model_resnet_new_OL' + str(EPOCHES) + 'epochs.h5'
    weights.append(weight_dir)
    autoencoder_model.save(weight_dir)  # Saved model name includes the number of epochs.
TensorBoard graphs (rotations 1, 2, 3, 4 from top to bottom): [screenshots omitted]
Your model will forget the previous dataset as you train on a new dataset.
I have read that in reinforcement learning, when games are used to train deep reinforcement learning (DRL) agents, you have to create a memory replay, which collects data from different rounds of the game, because each round has different data; some of that data is then chosen at random to train the model. That way a DRL model can learn to play different rounds of the game without forgetting previous rounds.
You can try to create a single dataset by taking some random samples from each dataset, as in the sketch below.
When you train the model on a new dataset, make sure data from all previous rotations is present in the current rotation.
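A minimal sketch of that mixing idea, assuming the same HDF5 chunk layout as in your code (the replay_fraction value is an arbitrary assumption):

# Hypothetical sketch: build each rotation's training set from the current
# chunk plus a random replay sample from every previously seen chunk.
import h5py
import numpy as np

def load_chunk(chk):
    # Same HDF5 layout as in the question.
    with h5py.File('C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5', 'r') as f:
        return f['train_data'][:], f['train_noisy'][:]

def replay_mix(current_chk, replay_fraction=0.3):
    clean, noisy = load_chunk(current_chk)
    parts_clean, parts_noisy = [clean], [noisy]
    for old in range(1, current_chk):
        old_clean, old_noisy = load_chunk(old)
        n = int(replay_fraction * len(old_clean))
        idx = np.random.choice(len(old_clean), n, replace=False)
        parts_clean.append(old_clean[idx])
        parts_noisy.append(old_noisy[idx])
    return np.concatenate(parts_clean), np.concatenate(parts_noisy)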
Also, in transfer learning, when you train a model on a new dataset you freeze the earlier layers so that the model doesn't forget its previous training. You are not using transfer learning, but still, when you start training on the 2nd dataset, the 1st dataset will slowly be erased from the memory of the weights.
You can try freezing the initial layers of the autoencoder (the part that extracts the features) so that they are not updated, assuming all of the datasets contain similar images; that way your model will not forget previous training, as in transfer learning. But even then, when you train on a new dataset, the previous ones will still be partially forgotten.
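A minimal sketch of the freezing idea (N_FROZEN is an assumed cut-off; pick indices that cover the feature-extracting layers of your model):

# Hypothetical sketch: freeze the first N layers before continuing
# training on a new chunk, so the learned feature extractor is preserved.
N_FROZEN = 5  # assumed cut-off
for layer in autoencoder_model.layers[:N_FROZEN]:
    layer.trainable = False
# Recompile so the new trainable flags take effect.
autoencoder_model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(lr=0.0001))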

Speed problem of model.fit() in TF2 when loading data using DataGenerator

I ran a simple classification problem with a small dataset on TF2, with two different ways of loading the data.
In the first way, I loaded the data by reading the images into arrays (train_x, train_y) and (test_x, test_y).
The training was quite fast and fine.
Then, I wanted to try using a DataGenerator, as such:
training_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=15,
    fill_mode='nearest')

validation_datagen = ImageDataGenerator(rescale=1./255)

train_generator = training_datagen.flow_from_directory(
    TRAINING_DIR,
    target_size=(224, 224),
    class_mode='categorical'
)

validation_generator = validation_datagen.flow_from_directory(
    VALIDATION_DIR,
    target_size=(224, 224),
    class_mode='categorical'
)
and then I ran the training with the command
H = model.fit(
    train_generator,
    batch_size=2,
    validation_data=validation_generator,
    verbose=1,
    epochs=EPOCHS)
Then the training becomes extremely slow: one epoch takes several minutes, while in the previous case the whole training took less than 15 seconds.
I don't understand what the problem is. This problem seems to be shared among several developers, but it is not clear why training becomes so slow when using a data generator.
Thanks
The issue was also addressed here:
https://github.com/keras-team/keras/issues/12683#issuecomment-614963118
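Two mitigations that are commonly suggested for slow generator pipelines (a sketch, not necessarily the exact fix from that thread; the parameter values are illustrative): give the generator a larger, explicit batch size, and let fit() prepare batches with worker processes.

# Sketch of common mitigations for a slow ImageDataGenerator pipeline.
train_generator = training_datagen.flow_from_directory(
    TRAINING_DIR,
    target_size=(224, 224),
    class_mode='categorical',
    batch_size=64)  # explicit, larger batch size (default is 32)

H = model.fit(
    train_generator,
    validation_data=validation_generator,
    epochs=EPOCHS,
    workers=4,                 # prepare batches in parallel
    use_multiprocessing=True,  # worker processes instead of threads
    max_queue_size=16)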

ImageDataGenerator performs worse

I built a neural network with and without ImageDataGenerator. Without it, it works fine. With the IDG, both the accuracy and validation accuracy scores are really bad, so I think I am doing something wrong.
I wanted to use the IDG to see what augmentation could do for my neural network. But even when I get rid of all the augmentation, it still performs badly.
Here is my code for the IDG:
image_size = 224
train_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = train_datagen.flow_from_directory('images',
                                                    target_size=(image_size, image_size),
                                                    batch_size=10,
                                                    class_mode='categorical',
                                                    subset='training')

validation_generator = train_datagen.flow_from_directory('images',
                                                         target_size=(image_size, image_size),
                                                         batch_size=10,
                                                         class_mode='categorical',
                                                         subset='training')
When I fit it I use this code:
chat = model.fit_generator(train_generator,
                           steps_per_epoch=train_generator.samples // 10,
                           validation_data=validation_generator,
                           validation_steps=validation_generator.samples // 10,
                           epochs=10)
Am I doing something wrong? Does the IDG perform some operation on the images that I don't see, but that changes them in a way that influences training?
When I plot my images, I don't see anything strange.
I hope someone can give me some tips!
When you say that the performance is worse with data augmentation, are you comparing both models on the same dataset?
A common mistake is to compare the accuracy of a model trained with data augmentation, evaluated on the augmented dataset, against a model trained without augmentation, evaluated on the regular dataset.
It is important to keep in mind that augmented datasets can be harder for the model to deal with. Therefore, even if the accuracy isn't as high as before, it may actually be higher when both models are evaluated on the regular dataset, as in the sketch below.
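A minimal sketch of a like-for-like comparison, assuming the directory layout from the question, that both models were compiled with an accuracy metric, and that model_plain / model_aug are hypothetical names for the models trained without and with augmentation. Both are evaluated on the same rescale-only (non-augmented) generator:

# Hypothetical sketch: one plain generator used to evaluate both models.
plain_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
eval_generator = plain_datagen.flow_from_directory('images',
                                                   target_size=(image_size, image_size),
                                                   batch_size=10,
                                                   class_mode='categorical',
                                                   subset='validation',
                                                   shuffle=False)

for name, m in [('no augmentation', model_plain), ('with augmentation', model_aug)]:
    loss, acc = m.evaluate_generator(eval_generator, steps=eval_generator.samples // 10)
    print(name, acc)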

Emotion detection on text

I am a newbie in ML and was experimenting with emotion detection on text.
I have the ISEAR dataset, which contains tweets labeled with their emotion.
My current accuracy is 63% and I want to increase it to at least 70%, or even more if possible.
Here's the code:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical

inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            64,
                            input_length=MAX_LENGTH)(inputs)
# x = Flatten()(embedding_layer)
x = LSTM(32, input_shape=(32, 32))(embedding_layer)
x = Dense(10, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)

model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()

filepath = "weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], y=to_categorical(y_train), batch_size=64, verbose=1,
                    validation_split=0.1, shuffle=True, epochs=10, callbacks=[checkpointer])
That's a pretty general question; optimizing the performance of a neural network may require tuning many factors. For instance:
- The choice of optimizer: in NLP tasks, rmsprop is also a popular optimizer
- Tweaking the learning rate
- Regularization, e.g. dropout, recurrent_dropout, batch norm; this may help the model generalize better
- More units in the LSTM
- More dimensions in the embedding
You can try a grid search, e.g. using different optimizers, and evaluate on a validation set.
The data may also need some tweaking, such as:
- Text normalization: a better representation of the tweets; remove unnecessary tokens (e.g. @ and # symbols)
- Shuffling the data before the fit: keras' validation_split takes the validation set from the last data records (see the sketch below)
There is no simple answer to your question.
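For the shuffling point, a minimal sketch (assuming X_train and y_train are NumPy arrays, and reusing the names from the question): permute the samples once before calling fit, so that validation_split no longer carves off the same unshuffled tail of the data.

import numpy as np

# Shuffle features and labels with the same permutation.
perm = np.random.permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]

history = model.fit([X_train], y=to_categorical(y_train), batch_size=64,
                    validation_split=0.1, epochs=10, callbacks=[checkpointer])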

Keras Model Accuracy differs after loading the same saved model

I trained a Keras Sequential model and loaded it again later. The two models give different accuracy.
I came across a similar question but was not able to solve the problem.
Sample code:
Loading and training the model:
model = gensim.models.FastText.load('abc.simple')
X, y = load_data()
Vectors = np.array(vectors(X))
X_train, X_test, y_train, y_test = train_test_split(Vectors, np.array(y),
                                                    test_size=0.3, random_state=0)
# Reshape the data for input to our model.
X_train = X_train.reshape(X_train.shape[0], 100, max_tokens, 1)
X_test = X_test.reshape(X_test.shape[0], 100, max_tokens, 1)
print(X_train.shape)

model2 = train()
score = model2.evaluate(X_test, y_test, verbose=0)
print(score)
Training accuracy is 90%.
Saving the model:
# Saving the model
model_json = model2.to_json()
with open("model_architecture.json", "w") as json_file:
    json_file.write(model_json)
model2.save_weights("model_weights.h5")
print("Saved model to disk")
But after I restarted the kernel, loaded the saved model, and ran it on the same set of data, the accuracy dropped.
# Load the JSON and create the model
json_file = open('model_architecture.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# Load the weights into the new model
loaded_model.load_weights("model_weights.h5")
print("Loaded model from disk")

# Evaluate the loaded model on the test data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                     metrics=['accuracy'])
score = loaded_model.evaluate(X_test, y_test, verbose=0)
print(score)
The accuracy dropped to 75% on the same set of data. How can I make it consistent?
I have tried the following, but it did not help:
from keras.backend import manual_variable_initialization
manual_variable_initialization(True)
I even saved the whole model at once (weights and architecture), but that did not solve the issue either.
Not sure if your issue has been solved, but for future comers:
I had exactly the same problem with saving and loading the weights. On loading the model, the accuracy and loss changed greatly, from 68% accuracy to 2%. In my experiment, I am using TensorFlow as the backend with the Keras model layers Embedding, LSTM and Dense. My issue got solved by fixing the seed for Keras, which uses the NumPy random generator, and, since I am using TensorFlow as the backend, I also fixed the seed for it.
These are the lines I added at the top of my file, where the model is also defined:
from numpy.random import seed
seed(42)  # fix the Keras (NumPy) seed
import tensorflow as tf
tf.random.set_seed(42)  # fix the TensorFlow seed
I hope this helps.
For more information, have a look at this: https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
I had the same problem due to a silly mistake of mine: after loading the model, my data generator still had the shuffle option (useful for training) set to True instead of False. After changing it to False, the model predicted as expected. It would be nice if Keras could take care of this automatically. This is the critical part of my code:
pred_generator = pred_datagen.flow_from_directory(
    directory='./ims_dir',
    target_size=(100, 100),
    color_mode="rgb",
    batch_size=1,
    class_mode="categorical",
    shuffle=False,  # must be False at prediction time
)

model = load_model(logpath_ms)
pred = model.predict_generator(pred_generator, steps=N, verbose=1)
My code worked once I scaled my dataset before re-evaluating the model. I had applied this preprocessing before saving the model and had forgotten to repeat it when I opened the model and wanted to evaluate it again. After I did that, the accuracy value appeared as it should \o/
model_saved = keras.models.load_model('tuned_cnn_1D_HAR_example.h5')
trainX, trainy, testX, testy = load_dataset()
trainX, testX = scale_data(trainX, testX, True)
score = model_saved.evaluate(testX, testy, verbose=0)
print("%s: %.2f%%" % (model_saved.metrics_names[1], score[1]*100))
Inside my scale_data function I used StandardScaler() (a possible shape of that helper is sketched below).
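A minimal sketch of what such a scale_data helper could look like; this is an assumption, since the original helper is not shown. It fits the scaler on the training data only and reuses the same statistics for the test data:

import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_data(trainX, testX, standardize):
    # Hypothetical reconstruction: flatten, fit on the training set only,
    # then transform both sets with the same statistics.
    if not standardize:
        return trainX, testX
    scaler = StandardScaler()
    train_flat = scaler.fit_transform(trainX.reshape(trainX.shape[0], -1))
    test_flat = scaler.transform(testX.reshape(testX.shape[0], -1))
    return train_flat.reshape(trainX.shape), test_flat.reshape(testX.shape)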
