Apologies if this is the wrong place to raise my issue (if so, please point me to a better one). I'm a novice with Keras and Python, so I hope responses will keep that in mind.
I'm trying to train a CNN steering model that takes images as input. It's a fairly large dataset, so I created a data generator to work with fit_generator(). It's not clear to me how to make this method train on batches, so I assumed that the generator has to return batches to fit_generator(). The generator looks like this:
def gen(file_name, batchsz = 64):
    csvfile = open(file_name)
    reader = csv.reader(csvfile)
    batchCount = 0
    while True:
        for line in reader:
            inputs = []
            targets = []
            temp_image = cv2.imread(line[1]) # line[1] is path to image
            measurement = line[3] # steering angle
            inputs.append(temp_image)
            targets.append(measurement)
            batchCount += 1
            if batchCount >= batchsz:
                batchCount = 0
                X = np.array(inputs)
                y = np.array(targets)
                yield X, y
        csvfile.seek(0)
It reads a CSV file containing telemetry data (steering angle etc.) and paths to image samples, and returns arrays of size batchsz.
The call to fit_generator() looks like this:
tgen = gen('h:/Datasets/dataset14-no.zero.speed.trn.csv', batchsz = 128) # Train data generator
vgen = gen('h:/Datasets/dataset14-no.zero.speed.val.csv', batchsz = 128) # Validation data generator
try:
    model.fit_generator(
        tgen,
        samples_per_epoch=113526,
        nb_epoch=6,
        validation_data=vgen,
        nb_val_samples=20001
    )
The dataset contains 113526 sample points, yet the training progress output reads like this (for example):
1020/113526 [..............................] - ETA: 27737s - loss: 0.0080
1021/113526 [..............................] - ETA: 27723s - loss: 0.0080
1022/113526 [..............................] - ETA: 27709s - loss: 0.0080
1023/113526 [..............................] - ETA: 27696s - loss: 0.0080
This appears to be training sample by sample (stochastically?).
The resulting model is useless. I previously trained on a much smaller dataset using .fit() with the whole dataset loaded into memory, and that produced a model that at least works, even if poorly. Clearly something is wrong with my fit_generator() approach. I would be very grateful for some help with this.
This:
    for line in reader:
        inputs = []
        targets = []
... resets your batch for every line in the CSV file. Each yielded batch therefore contains only the last sample read, so you are training on just one sample in every 128 rather than on your entire dataset.
Suggestion:
    for line in reader:
        if batchCount == 0:
            inputs = []
            targets = []
        ....
        ....
As someone commented, in fit_generator, samples_per_epoch should be equal to total_samples / batchsz. Your log appears to advance by one unit per yielded batch, so the argument is being treated as a number of batches rather than a number of samples.
Even so, I think your loss should be going down anyway. If it isn't, there might still be another problem in the code, perhaps in the way you load the data, or in the model's initialization or structure.
Try to plot your images and print the data in the generator:
for X,y in tgen: #careful, this is an infinite loop, make it stop
    print(X.shape[0]) # is this really the batch size you expect?
    for image in X:
        ...some method to plot image so you can see it, or just print
    print(y)
Check if the yielded values are ok with what you expect them to be.
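A minimal sketch of the batching fix suggested above, assuming the CSV layout from the question (image path in column 1, steering angle in column 3). To keep it runnable without OpenCV or NumPy it yields plain lists; load_input is a hypothetical hook where cv2.imread would go, and in a real generator each batch would be wrapped in np.array(...) before being yielded, as in the original code. Unlike the original, it also yields the final partial batch and converts the angle to float; both are assumptions on my part.

```python
import csv
import io
from itertools import islice


def batch_generator(csv_source, batch_size, load_input=lambda path: path):
    """Yield (inputs, targets) batches forever from a seekable CSV file object."""
    while True:
        inputs, targets = [], []                  # start a fresh batch
        for line in csv.reader(csv_source):
            inputs.append(load_input(line[1]))    # line[1]: path to image
            targets.append(float(line[3]))        # line[3]: steering angle
            if len(inputs) == batch_size:
                yield inputs, targets
                inputs, targets = [], []          # reset only after a full batch
        if inputs:                                # leftover samples at end of file
            yield inputs, targets
        csv_source.seek(0)                        # rewind for the next epoch


# Tiny in-memory stand-in for the real CSV file (5 made-up rows).
demo_csv = io.StringIO(
    "0,img0.jpg,x,0.10\n"
    "1,img1.jpg,x,0.20\n"
    "2,img2.jpg,x,-0.30\n"
    "3,img3.jpg,x,0.00\n"
    "4,img4.jpg,x,0.50\n"
)
# The generator is infinite, so take only the first four batches.
batches = list(islice(batch_generator(demo_csv, batch_size=2), 4))
batch_sizes = [len(inputs) for inputs, _ in batches]
```

With five rows and a batch size of 2, the third batch is the partial one and the fourth comes from the second pass over the file.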
Related
I am building a Keras model. The features come from a pandas.DataFrame, and I build the tf.data.Dataset through the from_generator API. I followed this page to process the categorical string features.
output_sig= ...
features = [...]
def iter_to_gen(it):
    def f():
        for x in it:
            # x is a list, with the last element being the label
            key_to_feature = {key: x[i] for i, key in enumerate(features)}
            yield key_to_feature, x[-1]
    return f
train_ds = tf.data.Dataset.from_generator(iter_to_gen(map(tuple, train_data.values)), output_signature=output_sig, name='train').batch(batch_size)
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a layer that turns strings into integer indices.
    if dtype == 'string':
        index = layers.StringLookup(max_tokens=max_tokens)
    # Otherwise, create a layer that turns integer values into integer indices.
    else:
        index = layers.IntegerLookup(max_tokens=max_tokens)
    # Prepare a `tf.data.Dataset` that only yields the feature.
    feature_ds = dataset.map(lambda x, y : x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Encode the integer indices.
    encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())
    # Apply multi-hot encoding to the indices. The lambda function captures the
    # layer, so you can use them, or include them in the Keras Functional model later.
    return lambda feature: encoder(index(feature))
all_inputs = []
encoded_features = []
categorical_cols = ['feature_A']
for header in categorical_cols:
    if header == 'feature_A':
        categorical_col = tf.keras.Input(shape=(None,), name=header, dtype='string')
    else:
        categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
    encoding_layer = get_category_encoding_layer(name=header,
                                                 dataset=train_ds,
                                                 dtype='string',
                                                 max_tokens=50) # tune the max tokens
    encoded_categorical_col = encoding_layer(categorical_col)
    all_inputs.append(categorical_col)
    encoded_features.append(encoded_categorical_col)
all_features = tf.keras.layers.concatenate(encoded_features)
print(all_features.shape)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
# x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(num_class)(x)
model = tf.keras.Model(all_inputs, output)
model.compile(optimizer='SGD',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"],
              run_eagerly=True)
model.fit(train_ds, epochs=10, verbose=2)
# <------ ValueError: Unexpected result of `train_function` (Empty logs). Please use
# `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more
# information of where went wrong, or file a issue/bug to `tf.keras`.
If I then recreate train_ds and skip straight to model.fit, it runs only 2 epochs and then stops. I am wondering why that is.
Epoch 1/10
4984/4984 - 71s - loss: 2.5564 - accuracy: 0.4191 - 71s/epoch - 14ms/step
Epoch 2/10
4984/4984 - 0s - loss: 0.0000e+00 - accuracy: 0.0000e+00 - 12ms/epoch - 2us/step
<keras.callbacks.History at 0x....>
I found that the first error was raised because model.fit got an empty dataset. I also checked the contents of the dataset with dataset.as_numpy_iterator(), and it is empty. I am wondering why.
Thanks.
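One plausible cause, offered as a guess rather than a confirmed diagnosis: map(tuple, train_data.values) is a one-shot iterator, so the generator function returned by iter_to_gen can only produce data once. If index.adapt(feature_ds) consumes it, a later model.fit would see nothing; and when fit is run directly, the first epoch would consume it and the second would be empty. The plain-Python sketch below (no TensorFlow; rows and rows_to_gen are hypothetical names) shows the exhaustion and one way around it, which is to build a fresh iterator on every call.

```python
def iter_to_gen(it):
    """Same shape as the question's helper: wraps an existing iterator."""
    def f():
        for x in it:
            yield x
    return f


def rows_to_gen(rows):
    """Hypothetical alternative: re-creates the iterator on each call."""
    def f():
        for x in map(tuple, rows):
            yield x
    return f


rows = [["a", 0], ["b", 1]]

one_shot = iter_to_gen(map(tuple, rows))
first_pass = list(one_shot())    # consumes the underlying map object
second_pass = list(one_shot())   # nothing left to yield

reusable = rows_to_gen(rows)
reusable_first = list(reusable())
reusable_second = list(reusable())
```

If this is indeed what happens, passing the underlying data (rather than an already-created iterator) into the generator factory should let the dataset be iterated more than once.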
I am trying to fit a BERT text classifier. My training and test data look as follows.
x_train = data["TEXT"].head(4500).tolist()
y_train= [label2id[label] for label in data["EMOTION"].head(4500).values.tolist()]
x_test = data["TEXT"].tail(500).tolist()
y_test = [label2id[label] for label in data["EMOTION"].tail(500).values.tolist()]
Then, I download the pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                                                       x_test=x_test, y_test=y_test,
                                                                       class_names=data['EMOTION'].unique().tolist(),
                                                                       preprocess_mode='bert',
                                                                       ngram_range=1,
                                                                       maxlen=350,
                                                                       max_features=35000)
For classification, we set up the BERT model as
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
Finally, I try to fit the model using the 1cycle policy:
hist = learner.fit_onecycle(2e-5, 1)
I get a result showing 750 samples rather than 4500. I also tested this with various datasets, and the number shown always differs from the number of data items. Can you give me an idea of what is behind it?
begin training using onecycle policy with max lr of 2e-05...
750/750 [==============================] - 875s 1s/step - loss: 0.3740 - accuracy: 0.8544
Thank you in advance for your response.
My guess is that it comes from the batch size: when you instantiate the learner with ktrain.get_learner, you pass batch_size=6.
So when you train with learner.fit_onecycle(2e-5, 1), the progress bar most likely counts batches (steps) rather than samples: 4500 training samples / batch size 6 = 750 steps per epoch, which would still cover all 4500 samples.
If you want a different number of steps, change the batch size; if you want to control the batching yourself, you can write a loop like this:
for epoch in range(X):
    ....
    for batch in chunker(train, batch_size):
        ....
where chunker() could be something like:
def chunker(sequence, size):
    """useful for splitting a sequence into minibatches"""
    for i in range(0, len(sequence), size):
        chunk = sequence[i:i+size]
        # sort sentences in batch by length in descending order
        chunk.sort(key=lambda x: len(x), reverse=True)
        yield chunk
In a nutshell, the idea is to loop over the data, each time selecting the set of samples (batch) that you want to use to train your model.
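The arithmetic behind the 750 can be checked directly. This sketch assumes the progress bar counts batches and that a final partial batch still counts as one step; steps_per_epoch is a hypothetical helper, and chunker is repeated from above (without the sorting) so the block runs on its own.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def chunker(sequence, size):
    """Split a sequence into consecutive minibatches."""
    for i in range(0, len(sequence), size):
        yield sequence[i:i + size]


train = list(range(4500))                 # stand-in for 4500 training samples
batches = list(chunker(train, 6))         # batch size 6, as in the question
samples_seen = sum(len(batch) for batch in batches)
```

Here len(batches) matches steps_per_epoch(4500, 6), and samples_seen confirms that every sample is visited exactly once per epoch.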
I have training data and validation data stacked up in two tensors. At first, I ran a NN using the keras.model.fit() function. For my purposes, I wish to move to keras.model.fit_generator(). I built a generator and noticed that the number of samples is not a multiple of the batch size.
My implementation to overcome this:
indices = np.arange(len(dataset))  # generate indices of len(dataset)
num_of_steps = int(np.ceil(len(dataset) / batch_size))  # number of steps per epoch
extra = num_of_steps * batch_size - len(dataset)  # number of extra samples needed to reach the next multiple of batch_size
additional = np.random.randint(len(dataset), size=extra)  # complete with random samples
indices = np.append(indices, additional)
After randomizing the indices at each epoch, I simply iterate over them in batch-sized steps and pull the corresponding data and labels.
I am observing a degradation in the performance of the model. When training with fit() I get 0.99 training accuracy and 0.93 validation accuracy, while with fit_generator() I get 0.95 and 0.9 respectively. Note that this is consistent and not a single experiment. I thought it might be due to fit() handling the required extra samples differently. Is my implementation reasonable? How does fit() handle datasets whose size is not a multiple of batch_size?
Sharing the full generator code:
def generator(self,batch_size,train):
    """
    Generates batches of samples
    :return:
    """
    while 1:
        nb_of_steps=0
        if(train):
            nb_of_steps = self._num_of_steps_train
            indices = np.arange(len(self._x_train))
            additional = np.random.randint(len(self._x_train), size=self._num_of_steps_train*batch_size-len(self._x_train))
        else:
            nb_of_steps = self._num_of_steps_test
            indices = np.arange(len(self._x_test))
            additional = np.random.randint(len(self._x_test), size=self._num_of_steps_test*batch_size-len(self._x_test))
        indices = np.append(indices,additional)
        np.random.shuffle(indices)
        # print(indices.shape)
        # print(nb_of_steps)
        for i in range(nb_of_steps):
            batch_indices=indices[i:i+batch_size]
            if(train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat,axis=1)
            # print(feat.shape)
            # print(label.shape)
            yield feat, label
It looks like you can simplify the generator significantly!
The number of steps etc. can be set outside the loop, as they do not change. Moreover, it looks like batch_indices does not go through the entire dataset: indices[i:i+batch_size] advances by one sample per step rather than by one batch. Finally, if your data fits in memory you might not need a generator at all, but I will leave this to your judgement.
def generator(self, batch_size, train):
    nb_of_steps = 0
    if (train):
        nb_of_steps = self._num_of_steps_train
        indices = np.arange(len(self._x_train)) #len of entire dataset
    else:
        nb_of_steps = self._num_of_steps_test
        indices = np.arange(len(self._x_test))
    while 1:
        np.random.shuffle(indices)
        for i in range(nb_of_steps):
            start_idx = i*batch_size
            end_idx = min(i*batch_size+batch_size, len(indices))
            batch_indices=indices[start_idx : end_idx]
            if(train):
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat,axis=1)
            yield feat, label
For a more robust generator, consider creating a class for your dataset that subclasses keras.utils.Sequence. It adds a few extra lines of code, but it is designed to work with Keras.
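To make the indexing issue concrete, this sketch compares which positions of the shuffled index array each slicing scheme touches in one epoch. It uses plain Python lists in place of NumPy arrays, and covered_positions is a hypothetical helper, not part of Keras. With stride-one slices most of the data is never visited in an epoch, which may explain part of the accuracy gap; slicing by whole batches covers everything and simply lets the last batch be smaller, which as far as I know is also how fit() handles a sample count that is not a multiple of the batch size.

```python
import math


def covered_positions(num_samples, batch_size, stride):
    """Return the set of index positions visited in one epoch of slicing."""
    steps = math.ceil(num_samples / batch_size)
    positions = list(range(num_samples))
    seen = set()
    for i in range(steps):
        start = i * stride
        # Slicing past the end just returns a shorter (partial) batch.
        seen.update(positions[start:start + batch_size])
    return seen


# stride=1 mimics indices[i:i+batch_size] from the question;
# stride=batch_size mimics indices[i*batch_size:(i+1)*batch_size].
original = covered_positions(10, 4, stride=1)
fixed = covered_positions(10, 4, stride=4)
```

For 10 samples and a batch size of 4, the stride-one version only ever reads the first 6 positions, while the batch-stride version reads all 10.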
I am trying to train a CNN using my own dataset. I've been using TFRecord files and the tf.data.TFRecordDataset API to handle my dataset. It works fine for my training dataset, but when I tried to batch my validation dataset, an 'OutOfRangeError: End of sequence' error was raised. After browsing the Internet, I thought the problem was caused by the batch size of the validation set, which I had initially set to 32. But after I changed it to 2, the code ran for about 9 epochs and the error was raised again.
I used an input function to handle the dataset; the code is below:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
    dataset = tf.data.TFRecordDataset(filenames,num_parallel_reads=num_parallel_reads)
    if is_training:
        dataset = dataset.shuffle(buffer_size=1500)
    dataset = dataset.map(parse_record)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
For the training set, "batch_size" is set to 128 and "num_epochs" is set to None, which means it keeps repeating indefinitely. For the validation set, "batch_size" is set to 32 (later set to 2, which still didn't work) and "num_epochs" is set to 1, since I only want to go through the validation set once.
I can confirm that the validation set contains enough data for the epochs, because I tried the code below and it didn't raise any errors:
with tf.Session() as sess:
    features, labels = input_fn(False, valid_list, 32, 1, 1)
    for i in range(450):
        sess.run([features, labels])
        print(labels.shape)
In the code above, when I changed the number 450 to 500 or anything larger, it raised the 'OutOfRangeError'. That confirms that my validation dataset contains enough data for 450 iterations with a batch size of 32.
I've tried using a smaller batch size (i.e., 2) for the validation set, but I still get the same error.
I can get the code running with "num_epochs" set to None in the input_fn for the validation set, but that does not seem to be how validation should work. Any help, please?
This behaviour is normal. From the TensorFlow documentation:
If the iterator reaches the end of the dataset, executing the Iterator.get_next() operation will raise a tf.errors.OutOfRangeError. After this point the iterator will be in an unusable state, and you must initialize it again if you want to use it further.
The error is not raised when you set dataset.repeat(None) because the dataset is repeated indefinitely and is therefore never exhausted.
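The same exhaust-then-reinitialize pattern can be seen with ordinary Python iterators. This is only an analogy and not TensorFlow code: StopIteration plays the role of tf.errors.OutOfRangeError, and calling iter() again plays the role of re-running the iterator's initializer. count_batches is a hypothetical helper.

```python
def count_batches(iterator):
    """Drain an iterator and return how many items it produced."""
    n = 0
    while True:
        try:
            next(iterator)
        except StopIteration:   # analogous to tf.errors.OutOfRangeError
            return n
        n += 1


validation_batches = [[0, 1], [2, 3], [4]]   # made-up data: 3 batches

it = iter(validation_batches)
first_pass = count_batches(it)     # consumes every batch
second_pass = count_batches(it)    # the iterator is already exhausted
third_pass = count_batches(iter(validation_batches))  # "re-initialized"
```

A one-shot iterator behaves like the second pass once it has been drained, which is why a validation pass needs either a fresh iterator or an explicit re-initialization each time.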
To solve your issue, you should change your code to this:
n_steps = 450
...
with tf.Session() as sess:
    # Training
    features, labels = input_fn(True, training_list, 32, 1, 1)
    for step in range(n_steps):
        sess.run([features, labels])
        ...
    ...
    # Validation
    features, labels = input_fn(False, valid_list, 32, 1, 1)
    try:
        sess.run([features, labels])
        ...
    except tf.errors.OutOfRangeError:
        print("End of dataset") # ==> "End of dataset"
You can also make a few changes to your input_fn to run the evaluation at every epoch:
def input_fn(is_training, filenames, batch_size, num_epochs=1, num_parallel_reads=1):
    dataset = tf.data.TFRecordDataset(filenames,num_parallel_reads=num_parallel_reads)
    if is_training:
        dataset = dataset.shuffle(buffer_size=1500)
    dataset = dataset.map(parse_record)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_initializable_iterator()
    return iterator
n_epochs = 10
freq_eval = 1
training_iterator = input_fn(True, training_list, 32, 1, 1)
training_features, training_labels = training_iterator.get_next()
val_iterator = input_fn(False, valid_list, 32, 1, 1)
val_features, val_labels = val_iterator.get_next()
with tf.Session() as sess:
    for epoch in range(n_epochs):
        # Training: re-initialize, then run until the training set is exhausted
        sess.run(training_iterator.initializer)
        try:
            while True:
                sess.run([training_features, training_labels])
        except tf.errors.OutOfRangeError:
            pass
        # Validation: same pattern, every freq_eval epochs
        if (epoch+1) % freq_eval == 0:
            sess.run(val_iterator.initializer)
            try:
                while True:
                    sess.run([val_features, val_labels])
            except tf.errors.OutOfRangeError:
                pass
I advise you to have a close look at this official guide if you want a better understanding of what is happening under the hood.
So far I have come up with this hacky code. It runs and outputs:
Epoch 10/10
1/3000 [..............................] - ETA: 27s - loss: 0.3075 - acc: 0.7270
6/3000 [..............................] - ETA: 54s - loss: 0.3075 - acc: 0.7355
.....
2996/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
2998/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
3000/3000 [==============================] - 59s - loss: 0.3076 - acc: 0.7337
Traceback (most recent call last):
File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 140, in <module>
(loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData),
TypeError: 'History' object is not iterable
Process finished with exit code 1
I'm confused by "'History' object is not iterable". What does this mean?
This is the first time I've tried to do batch training and testing, and I'm not sure I've implemented it correctly, as most of the examples I've seen online are for images. Here is the code:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import re
"""
Number of samples out of the 1 million to use; my 960M 2GB can only handle
about 30,000ish at the moment, depending on the number of neurons in the
deep layer and the number of layers.
"""
maxSamples = 3000
#Load the CSV and get the correct columns
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv")
dataX = pd.DataFrame()
dataY = pd.DataFrame()
dataY[['Sentiment']] = data[['Sentiment']]
dataX[['SentimentText']] = data[['SentimentText']]
dataY = dataY.iloc[0:maxSamples]
dataX = dataX.iloc[0:maxSamples]
testY = dataY.iloc[-1: -maxSamples]
testX = dataX.iloc[-1: -maxSamples]
"""
Here I filter the data and clean it up by removing # tags and hyperlinks and
also any characters that are not alphanumeric; I then add it to the vec list
"""
def removeTagsAndLinks(dataframe):
    vec = []
    for x in dataframe.iterrows():
        #Removes Hyperlinks
        zero = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?", "", x[1].values[0])
        #Removes # tags
        one = re.sub("#\\w+", '', zero)
        #keeps only alpha-numeric chars
        two = re.sub("\W+", ' ', one)
        vec.append(two)
    return vec
vec = removeTagsAndLinks(dataX)
xTest = removeTagsAndLinks(testX)
yTest = removeTagsAndLinks(testY)
"""
This loop looks for any Tweets shorter than 2 characters and, once found, writes the
index of that Tweet to an array so I can remove it from the Dataframe of sentiment and the
list of Tweets later
"""
indexOfBlankStrings = []
for index, string in enumerate(vec):
    if len(string) < 2:
        del vec[index]
        indexOfBlankStrings.append(index)
for row in indexOfBlankStrings:
    dataY.drop(row, axis=0, inplace=True)
"""
This makes a BOW model out of all the tweets, then creates a
vector for each of the tweets containing all the words from
the BOW model; each vector is the same size because the
network expects it
"""
def vectorise(tokenizer, list):
    tokenizer.fit_on_texts(list)
    return tokenizer.texts_to_matrix(list)
#Make BOW model and vectorise it
t = Tokenizer(lower=False, num_words=1000)
dim = vectorise(t, vec)
xTest = vectorise(t, xTest)
"""
Here I'm experimenting with multiple layers sized as the total
number of words in the vocabulary divided by powers of 2. This
has given me quite accurate results compared to random guesses
at the number of neurons and the number of layers.
"""
l1 = int(len(dim[0]) / 4) #Too big for my GPU
l2 = int(len(dim[0]) / 8) #Too big for my GPU
l3 = int(len(dim[0]) / 16)
l4 = int(len(dim[0]) / 32)
l5 = int(len(dim[0]) / 64)
l6 = int(len(dim[0]) / 128)
#Make the model
model = Sequential()
model.add(Dense(l1, input_dim=dim.shape[1]))
model.add(Dropout(0.15))
model.add(Dense(l2))
model.add(Dense(l1))
model.add(Dense(l3))
model.add(Dropout(0.2))
model.add(Dense(l4))
model.add(Dense(1, activation='relu'))
#Compile the model
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])
"""
This here will use multiple batches to train the model.
    startIndex:
        This is the starting index of the array from which you want to
        start training the network.
    dataRange:
        The number of elements used to train the network in each batch, so
        since dataRange = 1000 this means it goes from
        startIndex...dataRange OR 0...1000
    amountOfEpochs:
        This is kind of self-explanatory: the more epochs, the more it
        is supposed to learn, AKA updates the optimisation algo numbers
"""
amountOfEpochs = 10
dataRange = 1000
startIndex = 0
def generator(tokenizer, batchSize, totalSize=maxSamples, startIndex=0):
    f = tokenizer.texts_to_sequences(vec[startIndex:totalSize])
    l = np.asarray(dataY.iloc[startIndex:totalSize])
    while True:
        for i in range(1000, totalSize, batchSize):
            batch_features = tokenizer.sequences_to_matrix(f[startIndex: batchSize])
            batch_labels = l[startIndex: batchSize]
            yield batch_features, batch_labels
##This runs the model per batch, AKA load a little, then process, then load a little more
for amountOfData in range(1000, maxSamples, 1000):
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData]))
    (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData),
                                      steps_per_epoch=maxSamples, epochs=amountOfEpochs,
                                      validation_data=(np.array(xTest), np.array(yTest)))
    startIndex += 1000
The part towards the bottom is where I've tried to implement fit_generator() and write my own generator. I wanted to load, say, 75,000 maxSamples and then train the network 1000 samples at a time until it reaches the maxSamples value, which is why I've set up the range as (0, maxSamples, 1000), which I use in generator(). Was this the correct use?
I ask because my network is not using the validation data, and it seems to fit the data extremely quickly, which suggests overfitting or just using a very small dataset. Am I iterating over all the maxSamples in the correct way, or am I just looping over the first iterations several times?
Thanks
The problem lies in this line:
(loss, acc) = model.fit_generator(...)
fit_generator returns a single object of the keras.callbacks.History class, and a single object like this is not iterable, which is why you get this error. To get the loss values, retrieve them from the history attribute of that object, which is a dictionary of recorded losses and metrics:
history = model.fit_generator(...)
loss = history.history["loss"]
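The sketch below reproduces the failure mode without Keras. The History class here is a hypothetical stand-in that only mimics the history attribute of keras.callbacks.History, and the numbers in it are made up for illustration. It shows that tuple-unpacking such an object raises TypeError, while reading the per-epoch lists from the dictionary works.

```python
class History:
    """Hypothetical stand-in for keras.callbacks.History (not the real class)."""

    def __init__(self, history):
        # Maps each metric name to a list with one value per epoch.
        self.history = history


# Made-up values for two epochs, purely illustrative.
result = History({"loss": [0.35, 0.31], "acc": [0.70, 0.73]})

try:
    loss, acc = result           # same mistake as in the question
    unpack_failed = False
except TypeError:
    unpack_failed = True         # a plain object cannot be unpacked

loss_per_epoch = result.history["loss"]
final_loss = loss_per_epoch[-1]            # last epoch's loss
final_acc = result.history["acc"][-1]      # last epoch's accuracy
```

The key names in the dictionary follow the metrics passed to compile, so with metrics=['acc'] the accuracy would be expected under "acc".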