Keras fit_generator(), is this the correct usage? - python-3.x

So far I have come up with this hacky code; it runs and outputs:
Epoch 10/10
1/3000 [..............................] - ETA: 27s - loss: 0.3075 - acc: 0.7270
6/3000 [..............................] - ETA: 54s - loss: 0.3075 - acc: 0.7355
.....
2996/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
2998/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
3000/3000 [==============================] - 59s - loss: 0.3076 - acc: 0.7337
Traceback (most recent call last):
File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 140, in <module>
(loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData),
TypeError: 'History' object is not iterable
Process finished with exit code 1
I'm confused by "'History' object is not iterable". What does this mean?
This is the first time I've tried to do batch training and testing, and I'm not sure I've implemented it correctly, as most of the examples I've seen online are for images. Here is the code:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import re
"""
Number of samples (out of the ~1 million available) to use; my 960M 2GB can only
handle about 30,000 or so at the moment, depending on the number of neurons in the
deep layer and the number of layers.
"""
maxSamples = 3000
#Load the CSV and get the correct columns
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv")
dataX = pd.DataFrame()
dataY = pd.DataFrame()
dataY[['Sentiment']] = data[['Sentiment']]
dataX[['SentimentText']] = data[['SentimentText']]
dataY = dataY.iloc[0:maxSamples]
dataX = dataX.iloc[0:maxSamples]
testY = dataY.iloc[-1: -maxSamples]
testX = dataX.iloc[-1: -maxSamples]
"""
Here I filter the data and clean it up by removing # tags, hyperlinks and any
characters that are not alphanumeric, then add the result to the vec list.
"""
def removeTagsAndLinks(dataframe):
    vec = []
    for x in dataframe.iterrows():
        #Removes hyperlinks
        zero = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?", "", x[1].values[0])
        #Removes # tags
        one = re.sub("#\\w+", '', zero)
        #Keeps only alphanumeric chars
        two = re.sub("\W+", ' ', one)
        vec.append(two)
    return vec
vec = removeTagsAndLinks(dataX)
xTest = removeTagsAndLinks(testX)
yTest = removeTagsAndLinks(testY)
"""
This loop looks for any tweets shorter than 2 characters and, once found, writes the
index of that tweet to a list so I can later remove it from the sentiment DataFrame
and from the list of tweets.
"""
indexOfBlankStrings = []
for index, string in enumerate(vec):
    if len(string) < 2:
        del vec[index]
        indexOfBlankStrings.append(index)
for row in indexOfBlankStrings:
    dataY.drop(row, axis=0, inplace=True)
"""
This makes a BOW model out of all the tweets, then creates a
vector for each tweet containing all the words from the BOW
model; each vector is the same size because the network
expects it.
"""
def vectorise(tokenizer, list):
    tokenizer.fit_on_texts(list)
    return tokenizer.texts_to_matrix(list)
#Make BOW model and vectorise it
t = Tokenizer(lower=False, num_words=1000)
dim = vectorise(t, vec)
xTest = vectorise(t, xTest)
"""
Here I'm experimenting with multiple layers sized as the total
number of words in the vocabulary divided by successive powers of 2.
This has given me quite accurate results compared to random guesses
of the number of neurons and the number of layers.
"""
l1 = int(len(dim[0]) / 4) #Too big for my GPU
l2 = int(len(dim[0]) / 8) #Too big for my GPU
l3 = int(len(dim[0]) / 16)
l4 = int(len(dim[0]) / 32)
l5 = int(len(dim[0]) / 64)
l6 = int(len(dim[0]) / 128)
#Make the model
model = Sequential()
model.add(Dense(l1, input_dim=dim.shape[1]))
model.add(Dropout(0.15))
model.add(Dense(l2))
model.add(Dense(l1))
model.add(Dense(l3))
model.add(Dropout(0.2))
model.add(Dense(l4))
model.add(Dense(1, activation='relu'))
#Compile the model
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])
"""
This will use multiple batches to train the model.
startIndex:
    The starting index of the array from which you want to start
    training the network.
dataRange:
    The number of elements used to train the network in each batch;
    since dataRange = 1000 this means it goes from
    startIndex...dataRange, i.e. 0...1000.
amountOfEpochs:
    This is fairly self-explanatory: the more epochs, the more the
    network is supposed to learn, i.e. the more updates to the
    optimiser's parameters.
"""
amountOfEpochs = 10
dataRange = 1000
startIndex = 0
def generator(tokenizer, batchSize, totalSize=maxSamples, startIndex=0):
    f = tokenizer.texts_to_sequences(vec[startIndex:totalSize])
    l = np.asarray(dataY.iloc[startIndex:totalSize])
    while True:
        for i in range(1000, totalSize, batchSize):
            batch_features = tokenizer.sequences_to_matrix(f[startIndex: batchSize])
            batch_labels = l[startIndex: batchSize]
            yield batch_features, batch_labels
##This trains the model in batches, i.e. load a little, process it, then load a little more
for amountOfData in range(1000, maxSamples, 1000):
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData]))
    (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex, batchSize=amountOfData),
                                      steps_per_epoch=maxSamples, epochs=amountOfEpochs,
                                      validation_data=(np.array(xTest), np.array(yTest)))
    startIndex += 1000
The part towards the bottom is where I've tried to implement fit_generator() and make my own generator. I wanted to load, say, 75,000 maxSamples, then train the network 1,000 samples at a time until it reaches the maxSamples var, which is why I've set up range to do (0, maxSamples, 1000), which I use in generator(). Was this the correct use?
I ask because my network is not using the validation data, and it seems to fit the data extremely quickly, which suggests overfitting or just using a very small dataset. Am I iterating over all the maxSamples in the correct way? Or am I just looping over the first iterations several times?
Thanks

The problem lies in this line:
(loss, acc) = model.fit_generator(...)
as fit_generator returns a single object of the keras.callbacks.History class. That's why you get this error: a single object is not iterable. To get the loss values you need to retrieve them from the history field of this callback, which is a dictionary of the recorded losses and metrics:
history = model.fit_generator(...)
loss = history.history["loss"]
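For completeness, a hedged sketch of pulling the other recorded values out of the same dictionary (assuming the metric names this Keras version logs, i.e. "acc" and "val_acc"):
acc = history.history["acc"]            # training accuracy, one entry per epoch
val_loss = history.history["val_loss"]  # present because validation_data was passed
val_acc = history.history["val_acc"]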

Related

tf.dataset got empty before it calls model.fit

I am building a Keras model. The features come from a pandas.DataFrame. I build the tf.data.Dataset through the from_generator API. I followed this page to process the categorical string features.
output_sig = ...
features = [...]

def iter_to_gen(it):
    def f():
        for x in it:
            # x is a list, with the last element being the label
            key_to_feature = {key: x[i] for i, key in enumerate(features)}
            yield key_to_feature, x[-1]
    return f

train_ds = tf.data.Dataset.from_generator(iter_to_gen(map(tuple, train_data.values)), output_signature=output_sig, name='train').batch(batch_size)
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a layer that turns strings into integer indices.
    if dtype == 'string':
        index = layers.StringLookup(max_tokens=max_tokens)
    # Otherwise, create a layer that turns integer values into integer indices.
    else:
        index = layers.IntegerLookup(max_tokens=max_tokens)
    # Prepare a `tf.data.Dataset` that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Encode the integer indices.
    encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())
    # Apply multi-hot encoding to the indices. The lambda function captures the
    # layer, so you can use them, or include them in the Keras Functional model later.
    return lambda feature: encoder(index(feature))
all_inputs = []
encoded_features = []
categorical_cols = ['feature_A']

for header in categorical_cols:
    if header == 'feature_A':
        categorical_col = tf.keras.Input(shape=(None,), name=header, dtype='string')
    else:
        categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
    encoding_layer = get_category_encoding_layer(name=header,
                                                 dataset=train_ds,
                                                 dtype='string',
                                                 max_tokens=50)  # tune the max tokens
    encoded_categorical_col = encoding_layer(categorical_col)
    all_inputs.append(categorical_col)
    encoded_features.append(encoded_categorical_col)

all_features = tf.keras.layers.concatenate(encoded_features)
print(all_features.shape)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
# x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(num_class)(x)
model = tf.keras.Model(all_inputs, output)
model.compile(optimizer='SGD',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"],
              run_eagerly=True)
model.fit(train_ds, epochs=10, verbose=2)  # <------ ValueError: Unexpected result of `train_function` (Empty logs). Please use `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more information of where went wrong, or file a issue/bug to `tf.keras`.
And then if I rebuild train_ds and skip directly to running model.fit, it runs only 2 epochs and ends. I am wondering why that is.
Epoch 1/10
4984/4984 - 71s - loss: 2.5564 - accuracy: 0.4191 - 71s/epoch - 14ms/step
Epoch 2/10
4984/4984 - 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00 - 12ms/epoch - 2us/step
<keras.callbacks.History at 0x....>
I found the first error was raised because model.fit got an empty dataset. I also verified the size of the dataset by dataset.as_numpy_array() and it is empty. I am wondering why.
Thanks.
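An aside not from the original thread, sketched as a likely explanation: tf.data.Dataset.from_generator re-invokes the generator callable for every fresh pass over the dataset, but f above closes over a single map iterator, so once the index.adapt() calls (and the first epoch) have consumed it, later passes see an empty dataset. A minimal sketch of a generator that rebuilds its iterator on every call, assuming train_data, features, output_sig and batch_size as defined in the question:
def iter_to_gen(dataframe):
    def f():
        # Build a fresh row iterator each time the dataset is iterated,
        # instead of reusing one that may already be exhausted.
        for x in map(tuple, dataframe.values):
            key_to_feature = {key: x[i] for i, key in enumerate(features)}
            yield key_to_feature, x[-1]
    return f

train_ds = tf.data.Dataset.from_generator(iter_to_gen(train_data), output_signature=output_sig, name='train').batch(batch_size)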

Expected performance of training tf.keras.Sequential model with model.fit, model.fit_generator and model.train_on_batch

I am using TensorFlow to train a 1D CNN to detect specific events from sensor data. While the data, with tens of millions of samples, easily fits in RAM in the form of a 1D float array, it obviously takes a huge amount of memory to store the data as an N x inputDim array that can be passed to model.fit for training. While I can use model.fit_generator or model.train_on_batch to generate the required mini batches on the fly, for some reason I am observing a huge performance gap between model.fit and model.fit_generator & model.train_on_batch, even though everything is stored in memory and mini batch generation is fast, as it basically only consists of reshaping the data. Therefore, I'm wondering whether I am doing something terribly wrong or if this kind of performance gap is to be expected. I am using the CPU version of TensorFlow 2.0 with a 3.2 GHz Intel Core i7 processor (4 cores with multithreading support) and Python 3.6.3 on Mac OS X Mojave.
In short, I created a dummy Python script to recreate the issue, and it reveals that with a batch size of 64, it takes 407 seconds to run 10 epochs with model.fit, 1852 seconds with model.fit_generator, and 1985 seconds with model.train_on_batch. CPU loads are ~220%, ~130%, and ~120% respectively, and it seems especially odd that model.fit_generator and model.train_on_batch are practically on par, since model.fit_generator should be able to parallelise mini batch creation and model.train_on_batch definitely does not. That is, model.fit (with huge memory requirements) beats the other candidates, which have easily manageable memory requirements, by a factor of four. Obviously, CPU loads increase and total training times decrease with increasing batch size, but model.fit is always fastest by a margin of at least two up to a batch size of 8096.
Is this kind of behaviour normal (when there is no GPU involved), or what could be done to increase the computation speed of the less memory-intensive options? There seems to be no option to divide all the data into manageable pieces and then run model.fit iteratively.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tqdm import tqdm
import numpy as np
import tensorflow as tf
import time
import sys
import argparse
class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, inputData, outputData, batchIndices, batchSize, shuffle):
        'Initialization'
        self.inputData = inputData
        self.outputData = outputData
        self.batchIndices = batchIndices
        self.batchSize = batchSize
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int( np.floor( self.inputData.size / self.batchSize ) )

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate data
        X, y = self.__data_generation(self.indexes[index*self.batchSize:(index+1)*self.batchSize])
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(self.inputData.size)
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, INDX):
        'Generates data containing batch_size samples'
        # Generate data
        X = np.expand_dims( self.inputData[ np.mod( self.batchIndices + np.reshape(INDX,(INDX.size,1)) , inputData.size ) ], axis=2)
        y = self.outputData[INDX,:]
        return X, y
FLAGS = None
parser = argparse.ArgumentParser()
parser.add_argument('--batchSize', type=int,
                    default=128,
                    help='Batch size')
parser.add_argument('--epochCount', type=int,
                    default=5,
                    help='Epoch count')
FLAGS, unparsed = parser.parse_known_args()
batchSize = FLAGS.batchSize
epochCount = FLAGS.epochCount
# Data generation
print(' ')
print('Generating data...')
np.random.seed(0) # For reproducible results
inputDim = int(104) # Input dimension
outputDim = int( 2) # Output dimension
N = int(1049344) # Total number of samples
M = int(5e4) # Number of anomalies
trainINDX = np.arange(N, dtype=np.uint32)
inputData = np.sin(trainINDX) + np.random.normal(loc=0.0, scale=0.20, size=N) # Source data stored in a single array
anomalyLocations = np.random.choice(N, M, replace=False)
inputData[anomalyLocations] += 0.5
outputData = np.zeros((N,outputDim)) # One-hot encoded target array without ones
for i in range(N):
    if np.any( np.logical_and( anomalyLocations >= i, anomalyLocations < np.mod(i+inputDim,N) ) ):
        outputData[i,1] = 1 # set class #2 to one if there is at least a single anomaly within range [i,i+inputDim)
    else:
        outputData[i,0] = 1 # set class #1 to one if there are no anomalies within range [i,i+inputDim)
print('...completed')
print(' ')
# Create a model for anomaly detection
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=24, kernel_size=9, strides=1, padding='valid', dilation_rate=1, activation='relu', use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', input_shape=(inputDim,1)),
    tf.keras.layers.MaxPooling1D(pool_size=4, strides=None, padding='valid'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(20, activation='relu', use_bias=True),
    tf.keras.layers.Dense(outputDim, activation='softmax')
])
model.compile( tf.keras.optimizers.Adam(),
               loss=tf.keras.losses.CategoricalCrossentropy(),
               metrics=[tf.keras.metrics.CategoricalAccuracy()])
print(' ')
relativeIndices = np.arange(inputDim) # Indices belonging to a single sample relative to current position
batchIndices = np.tile( relativeIndices, (batchSize,1) ) # Relative indices tiled into an array of size ( batchSize , inputDim )
stepsPerEpoch = int( np.floor( N / batchSize ) ) # Steps per epoch
# Create an intance of dataGenerator class
generator = DataGenerator(inputData, outputData, batchIndices, batchSize=batchSize, shuffle=True)
# Solve by gathering data into a large float32 array of size ( N , inputDim ) and feeding it to model.fit
startTime = time.time()
X = np.expand_dims( inputData[ np.mod( np.tile(relativeIndices,(N,1)) + np.reshape(trainINDX,(N,1)) , N ) ], axis=2)
y = outputData[trainINDX, :]
history = model.fit(x=X, y=y, sample_weight=None, batch_size=batchSize, verbose=1, callbacks=None, validation_split=None, shuffle=True, epochs=epochCount)
referenceTime = time.time() - startTime
print(' ')
print('Total solution time with model.fit: %6.3f seconds' % referenceTime)
# Solve with model.fit_generator
startTime = time.time()
history = model.fit_generator(generator=generator, steps_per_epoch=stepsPerEpoch, verbose=1, callbacks=None, epochs=epochCount, max_queue_size=1024, use_multiprocessing=True, workers=4)
generatorTime = time.time() - startTime
print(' ')
print('Total solution time with model.fit_generator: %6.3f seconds (%6.2f %% more)' % (generatorTime, 100.0 * generatorTime/referenceTime))
print(' ')
# Solve by gathering data into batches of size ( batchSize , inputDim ) and feeding it to model.train_on_batch
startTime = time.time()
for epoch in range(epochCount):
    print(' ')
    print('Training epoch # %2d ...' % (epoch+1))
    print(' ')
    np.random.shuffle(trainINDX)
    epochStartTime = time.time()
    for step in tqdm( range( stepsPerEpoch ) ):
        INDX = trainINDX[ step*batchSize : (step+1)*batchSize ]
        X = np.expand_dims( inputData[ np.mod( batchIndices + np.reshape(INDX,(batchSize,1)) , N ) ], axis=2)
        y = outputData[INDX,:]
        history = model.train_on_batch(x=X, y=y, sample_weight=None, class_weight=None, reset_metrics=False)
    print(' ')
    print('...completed with loss = %9.6e, accuracy = %6.2f %%, %6.2f ms/step' % (history[0], 100.0*history[1], (1000*(time.time() - epochStartTime)/np.floor(trainINDX.size / batchSize))))
    print(' ')
batchTime = time.time() - startTime
print(' ')
print('Total solution time with model.train_on_batch: %6.3f seconds (%6.2f %% more)' % (batchTime, 100.0 * batchTime/referenceTime))
print(' ')
model.fit - suitable if you load the data as a NumPy array and train without augmentation.
model.fit_generator - if your dataset is too big to fit in memory and/or you want to apply augmentation on the fly.
model.train_on_batch - less common; usually used when training more than one model at a time (a GAN, for example).
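Beyond that answer, a hedged sketch of one way to keep the memory footprint small while narrowing the speed gap: build the windows inside a tf.data pipeline and pass the dataset straight to model.fit. The names (inputData, outputData, relativeIndices, N, batchSize, epochCount) are taken from the question's script; the pipeline itself is an assumption, not something from the thread.
import tensorflow as tf

def make_dataset(inputData, outputData, relativeIndices, N, batchSize):
    # Keep the raw 1D signal and the one-hot targets as constant tensors.
    signal = tf.constant(inputData, dtype=tf.float32)
    targets = tf.constant(outputData, dtype=tf.float32)
    rel = tf.constant(relativeIndices, dtype=tf.int64)

    def to_window(i):
        # Same windowing as DataGenerator: gather inputDim consecutive samples
        # (wrapping around at N) and append the trailing channel dimension.
        idx = tf.math.floormod(rel + i, N)
        x = tf.expand_dims(tf.gather(signal, idx), axis=-1)
        y = tf.gather(targets, i)
        return x, y

    return (tf.data.Dataset.range(N)
            .shuffle(N)
            .map(to_window, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .batch(batchSize)
            .prefetch(tf.data.experimental.AUTOTUNE))

# history = model.fit(make_dataset(inputData, outputData, relativeIndices, N, batchSize),
#                     epochs=epochCount, verbose=1)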

Why is my TensorBoard showing a discontinuous output?

I am running a neural network with logging of training accuracy, validation accuracy and validation loss. Here is my code snippet.
def show_progress(epoch, feed_dict_train, feed_dict_validate, val_loss):
    acc = session.run(accuracy, feed_dict=feed_dict_train)
    val_acc = session.run(accuracy, feed_dict=feed_dict_validate)
    msg = "Training Epoch {0} --- Training Accuracy: {1:>6.1%}, Validation Accuracy: {2:>6.1%}, Validation Loss: {3:.3f}"
    print(msg.format(epoch + 1, acc, val_acc, val_loss))
    return acc, val_acc

total_iterations = 0
#writer=tf.summary.FileWriter(options.tensorboard,session)
saver = tf.train.Saver()

def train(num_iteration):
    global total_iterations
    writer = tf.summary.FileWriter(options.tensorboard, session.graph)
    #global writer
    for i in range(total_iterations,
                   total_iterations + num_iteration):
        x_batch, y_true_batch, _, cls_batch = data.train.next_batch(batch_size)
        x_valid_batch, y_valid_batch, _, valid_cls_batch = data.valid.next_batch(batch_size)
        feed_dict_tr = {x: x_batch,
                        y_true: y_true_batch}
        feed_dict_val = {x: x_valid_batch,
                         y_true: y_valid_batch}
        session.run(optimizer, feed_dict=feed_dict_tr)
        if i % 10 == 0:
            val_loss = session.run(cost, feed_dict=feed_dict_val)
            epoch = int(i / 10)
            accu, valid_accu = show_progress(epoch, feed_dict_tr, feed_dict_val, val_loss)
            #getting values for visualising inside the tensorboard
            tf.summary.scalar("training_accuracy", accu)
            tf.summary.scalar("Validation_accuracy", valid_accu)
            tf.summary.scalar("Validation_loss", val_loss)
            #tf.summary.scalar("epoch",epoch)
            #merging all the values (serializing)
            merged = tf.summary.merge_all()
            summary = session.run(merged)
            #adding them to the events directory
            writer.add_summary(summary, epoch)
            saver.save(session, options.save)
    total_iterations += num_iteration

train(num_iteration=10)
Now I am getting TensorBoard output where, for each epoch, the training accuracy, validation accuracy and validation loss appear as separate plots with a single point each.
For each epoch I am getting these three plots again with another point.
I want continuous points for these three plots so that they form a line graph.
Each of your calls to tf.summary.scalar() will create a new node in the computation graph. Specifically, in your code the calls are inside the training loop, and therefore metrics at different epochs get written to different plots.
tf.summary.scalar("training_accuracy", accu)
tf.summary.scalar("Validation_accuracy", valid_accu)
tf.summary.scalar("Validation_loss", val_loss)
What you can do is define the summary ops before the loop using placeholders. Then, in the eval loop, you can feed these tensors with real values.
# Define a placeholder and wire it to the summary op.
accu_tensor = tf.placeholder(tf.float32)
tf.summary.scalar("training_accuracy", accu_tensor)
summary_op = tf.summary.merge_all()
# Create a session after defining ops.
sess = tf.Session()
writer = tf.summary.FileWriter(<some-directory>, sess.graph)
for i in range(total_iterations,
               total_iterations + num_iteration):
    # run training ops to get values for accu
    # ...
    # run the summary op with a feed_dict to feed the value.
    summaries = sess.run(summary_op, feed_dict={accu_tensor: accu})
    writer.add_summary(summaries, epoch)
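As a sketch extending the snippet above (my addition, not part of the original answer): the validation metrics can be wired up the same way, one placeholder per scalar, all merged into a single summary op; accu, valid_accu, val_loss and epoch are the values computed in the question's training loop.
accu_tensor = tf.placeholder(tf.float32)
val_accu_tensor = tf.placeholder(tf.float32)
val_loss_tensor = tf.placeholder(tf.float32)
tf.summary.scalar("training_accuracy", accu_tensor)
tf.summary.scalar("Validation_accuracy", val_accu_tensor)
tf.summary.scalar("Validation_loss", val_loss_tensor)
summary_op = tf.summary.merge_all()

# Inside the loop: feed all three values at once and write a single summary per epoch.
summaries = sess.run(summary_op, feed_dict={accu_tensor: accu,
                                            val_accu_tensor: valid_accu,
                                            val_loss_tensor: val_loss})
writer.add_summary(summaries, epoch)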

Use a generator for Keras model.fit_generator

I originally tried to use generator syntax when writing a custom generator for training a Keras model, so I yielded from __next__. However, when I tried to train my model with model.fit_generator I got an error that my generator was not an iterator. The fix was to change yield to return, which also necessitated rejiggering the logic of __next__ to track state. It's quite cumbersome compared to letting yield do the work for me.
Is there a way I can make this work with yield? I will need to write several more iterators that will have to have very clunky logic if I have to use a return statement.
I can't help debug your code since you didn't post it, but I abbreviated a custom data generator I wrote for a semantic segmentation project for you to use as a template:
import os
import random

import cv2
import numpy as np

def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)

        yield np.array(image_batch)
Usage:
model.fit_generator(
    generate_data('~/my_data', batch_size),
    steps_per_epoch=len(os.listdir('~/my_data')) // batch_size)
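One caveat worth adding (my note, not the answerer's): when the generator feeds supervised training, fit_generator expects each yield to be an (inputs, targets) tuple. A minimal variant of the template above, where load_label() is a hypothetical helper standing in for however the target of each file is looked up:
def generate_labeled_data(directory, batch_size):
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch, label_batch = [], []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)
            label_batch.append(load_label(sample))  # hypothetical: fetch the target for this file
        yield np.array(image_batch), np.array(label_batch)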
I have recently played with the generators for Keras and I finally managed to prepare an example. It uses random data, so trying to train an NN on it makes no sense, but it's a good illustration of using a Python generator for Keras.
Generate some data
import numpy as np
import pandas as pd
data = np.random.rand(200,2)
expected = np.random.randint(2, size=200).reshape(-1,1)
dataFrame = pd.DataFrame(data, columns = ['a','b'])
expectedFrame = pd.DataFrame(expected, columns = ['expected'])
dataFrameTrain, dataFrameTest = dataFrame[:100],dataFrame[-100:]
expectedFrameTrain, expectedFrameTest = expectedFrame[:100],expectedFrame[-100:]
Generator
def generator(X_data, y_data, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    while 1:
        X_batch = np.array(X_data[batch_size*counter:batch_size*(counter+1)]).astype('float32')
        y_batch = np.array(y_data[batch_size*counter:batch_size*(counter+1)]).astype('float32')
        counter += 1
        yield X_batch, y_batch

        #restart counter to yield data in the next epoch as well
        if counter >= number_of_batches:
            counter = 0
Keras model
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution1D, Convolution2D, MaxPooling2D
from keras.utils import np_utils
model = Sequential()
model.add(Dense(12, activation='relu', input_dim=dataFrame.shape[1]))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adadelta', metrics=['accuracy'])
#Train the model using generator vs using the full batch
batch_size = 8
model.fit_generator(
    generator(dataFrameTrain, expectedFrameTrain, batch_size),
    epochs=3,
    steps_per_epoch=dataFrame.shape[0] / batch_size,
    validation_data=generator(dataFrameTest, expectedFrameTest, batch_size*2),
    validation_steps=dataFrame.shape[0] / batch_size*2
)

#without generator
#model.fit(
#    x = np.array(dataFrame),
#    y = np.array(expected),
#    batch_size = batch_size,
#    epochs = 3
#)
Output
Epoch 1/3
25/25 [==============================] - 3s - loss: 0.7297 - acc: 0.4750 -
val_loss: 0.7183 - val_acc: 0.5000
Epoch 2/3
25/25 [==============================] - 0s - loss: 0.7213 - acc: 0.3750 -
val_loss: 0.7117 - val_acc: 0.5000
Epoch 3/3
25/25 [==============================] - 0s - loss: 0.7132 - acc: 0.3750 -
val_loss: 0.7065 - val_acc: 0.5000
This is the way I implemented it for reading files of any size, and it works like a charm.
import pandas as pd
hdr = []
for i in range(num_labels + num_features):
    hdr.append("Col-" + str(i))  # the data file does not have a header, so I need to
                                 # provide one for pd.read_csv by chunks to work

def tgen(filename):
    csvfile = open(filename)
    reader = pd.read_csv(csvfile, chunksize=batch_size, names=hdr, header=None)
    while True:
        for chunk in reader:
            W = chunk.values        # labels and features
            Y = W[:, :num_labels]   # labels
            X = W[:, num_labels:]   # features
            X = X / 255             # any required transformation
            yield X, Y
        csvfile = open(filename)
        reader = pd.read_csv(csvfile, chunksize=batchz, names=hdr, header=None)
Then back in the main program I have:
nval=number_of_validation_samples//batchz
ntrain=number_of_training_samples//batchz
ftgen=tgen("training.csv")
fvgen=tgen("validation.csv")
history = model.fit_generator(ftgen,
                              steps_per_epoch=ntrain,
                              validation_data=fvgen,
                              validation_steps=nval,
                              epochs=number_of_epochs,
                              callbacks=[checkpointer, stopper],
                              verbose=2)
I would like to upgrade Vaasha's code to TensorFlow 2.x to achieve training efficiency as well as ease of data processing. This is particularly useful for image processing.
Process the data using a generator function as Vaasha did in the example above, or using the tf.data.Dataset API. The latter approach is very useful when processing any dataset with metadata. For example, MNIST data can be loaded and processed with a few statements.
import tensorflow as tf # Ensure that TensorFlow 2.x is used
tf.compat.v1.enable_eager_execution()
import tensorflow_datasets as tfds # Needed if you are using any of the tf datasets such as MNIST, CIFAR10
mnist_train = tfds.load(name="mnist", split="train")
Use tfds.load to load the datasets. Once the data is loaded and processed (for example, converting categorical variables, resizing, etc.), upgrade the Keras model to TensorFlow 2.x:
model = tf.keras.Sequential() # Tensorflow 2.0 upgrade
model.add(tf.keras.layers.Dense(12, activation='relu', input_dim=dataFrame.shape[1]))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

#Train the model using generator vs using the full batch
batch_size = 8

model.fit_generator(generator(dataFrameTrain, expectedFrameTrain, batch_size),
                    epochs=3,
                    steps_per_epoch=dataFrame.shape[0]/batch_size,
                    validation_data=generator(dataFrameTest, expectedFrameTest, batch_size*2),
                    validation_steps=dataFrame.shape[0]/batch_size*2)
This will upgrade the model to run in TensorFlow 2.x
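A hedged illustration of that point (not part of the original answer): in TensorFlow 2.x a tf.data pipeline such as the one returned by tfds.load can be passed straight to model.fit, so a hand-written generator is no longer required. The model below is an assumed stand-in sized for flattened MNIST images, not the model from the answer above.
import tensorflow as tf
import tensorflow_datasets as tfds

mnist_train = tfds.load(name="mnist", split="train")

def preprocess(example):
    # Flatten the 28x28 image and scale it to [0, 1]; keep the integer label.
    image = tf.cast(example["image"], tf.float32) / 255.0
    return tf.reshape(image, [28 * 28]), example["label"]

train_ds = mnist_train.map(preprocess).shuffle(1024).batch(32).prefetch(1)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_dim=28 * 28),
    tf.keras.layers.Dense(10)
])
mnist_model.compile(optimizer='adam',
                    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    metrics=['accuracy'])
mnist_model.fit(train_ds, epochs=3)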

How to batch train a CNN with Keras fit_generator?

Apologies if this is the wrong place to raise my issue (please help me out with where best to raise it if that's the case). I'm a novice with Keras and Python, so I hope responses keep that in mind.
I'm trying to train a CNN steering model that takes images as input. It's a fairly large dataset, so I created a data generator to work with fit_generator(). It's not clear to me how to make this method train on batches, so I assumed that the generator has to return batches to fit_generator(). The generator looks like this:
def gen(file_name, batchsz = 64):
    csvfile = open(file_name)
    reader = csv.reader(csvfile)
    batchCount = 0
    while True:
        for line in reader:
            inputs = []
            targets = []
            temp_image = cv2.imread(line[1]) # line[1] is path to image
            measurement = line[3] # steering angle
            inputs.append(temp_image)
            targets.append(measurement)
            batchCount += 1
            if batchCount >= batchsz:
                batchCount = 0
                X = np.array(inputs)
                y = np.array(targets)
                yield X, y
        csvfile.seek(0)
It reads a CSV file containing telemetry data (steering angle etc.) and paths to image samples, and yields arrays of size batchsz.
The call to fit_generator() looks like this:
tgen = gen('h:/Datasets/dataset14-no.zero.speed.trn.csv', batchsz = 128) # Train data generator
vgen = gen('h:/Datasets/dataset14-no.zero.speed.val.csv', batchsz = 128) # Validation data generator
try:
    model.fit_generator(
        tgen,
        samples_per_epoch=113526,
        nb_epoch=6,
        validation_data=vgen,
        nb_val_samples=20001
    )
The dataset contains 113526 sample points yet the model training update output reads like this (for example):
1020/113526 [..............................] - ETA: 27737s - loss: 0.0080
1021/113526 [..............................] - ETA: 27723s - loss: 0.0080
1022/113526 [..............................] - ETA: 27709s - loss: 0.0080
1023/113526 [..............................] - ETA: 27696s - loss: 0.0080
Which appears to be training sample by sample (stochastically?).
The resultant model is useless. I previously trained on a much smaller dataset using .fit() with the whole dataset loaded into memory, and that produced a model that at least works, even if poorly. Clearly something is wrong with my fit_generator() approach. I will be very grateful for some help with this.
This:
for line in reader:
    inputs = []
    targets = []
... is resetting your batch for every line in the CSV file. You're not training with your entire dataset, but with just a single sample out of every 128.
Suggestion:
for line in reader:
    if batchCount == 0:
        inputs = []
        targets = []
    ....
    ....
As someone commented, in fit_generator, samples_per_epoch should be equal to total_samples / batchsz.
Even so, I think your loss should be going down anyway. If it isn't, there might still be another problem in the code, perhaps in the way you load the data, or in the model's initialization or structure.
Try to plot your images and print the data in the generator:
for X, y in tgen:  # careful, this is an infinite loop, make it stop
    print(X.shape[0])  # is this really the number of batches you expect?
    for image in X:
        ...  # some method to plot X so you can see it, or just print
    print(y)
Check if the yielded values are ok with what you expect them to be.
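Putting the answer's pieces together, a hedged sketch of the full corrected generator (my assembly of the suggestions above, not code from the thread), with the batch reset moved inside the batchCount == 0 check:
import csv
import cv2
import numpy as np

def gen(file_name, batchsz=64):
    csvfile = open(file_name)
    reader = csv.reader(csvfile)
    batchCount = 0
    inputs = []
    targets = []
    while True:
        for line in reader:
            if batchCount == 0:
                # Start a fresh batch only after the previous one has been yielded.
                inputs = []
                targets = []
            inputs.append(cv2.imread(line[1]))  # line[1] is the path to the image
            targets.append(float(line[3]))      # line[3] is the steering angle
            batchCount += 1
            if batchCount >= batchsz:
                batchCount = 0
                yield np.array(inputs), np.array(targets)
        csvfile.seek(0)

# Per the comment above, the steps value should be total_samples / batchsz, e.g.
# model.fit_generator(gen('h:/Datasets/dataset14-no.zero.speed.trn.csv', batchsz=128),
#                     samples_per_epoch=113526 // 128, nb_epoch=6, ...)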
