Avoid feed_dict mechanism in static graph in tensorflow

I am trying to implement a model for generating/reconstructing samples (Variational autoencoder). During test time, I would like to be able to make the model generate new samples by feeding it a latent variable, but that requires changing the inputs to a part of the computational graph.
I could use a feed_dict to "dynamically" do that, since I cannot directly change a static graph, but I want to avoid the overhead of exchanging data between the GPU and the system RAM.
As it stands, I feed the data using Iterators:
def make_mnist_dataset(batch_size, shuffle=True, include_labels=True):
    """Loads the MNIST data set and returns the relevant
    iterator along with its initialization operations.
    """
    # load the data
    train, test = tf.keras.datasets.mnist.load_data()
    # binarize and reshape the data sets
    temp_train = train[0]
    temp_train = (temp_train > 0.5).astype(np.float32).reshape(temp_train.shape[0], 784)
    train = (temp_train, train[1])
    temp_test = test[0]
    temp_test = (temp_test > 0.5).astype(np.float32).reshape(temp_test.shape[0], 784)
    test = (temp_test, test[1])
    # prepare Dataset objects
    if include_labels:
        train_set = tf.data.Dataset.from_tensor_slices(train).repeat().batch(batch_size)
        test_set = tf.data.Dataset.from_tensor_slices(test).repeat(1).batch(batch_size)
    else:
        train_set = tf.data.Dataset.from_tensor_slices(train[0]).repeat().batch(batch_size)
        test_set = tf.data.Dataset.from_tensor_slices(test[0]).repeat(1).batch(batch_size)
    if shuffle:
        train_set = train_set.shuffle(buffer_size=int(0.5*train[0].shape[0]),
                                      seed=123)
    # make the iterator
    iter = tf.data.Iterator.from_structure(train_set.output_types,
                                           train_set.output_shapes)
    data = iter.get_next()
    # create initialization ops
    train_init = iter.make_initializer(train_set)
    test_init = iter.make_initializer(test_set)
    return train_init, test_init, data
And here's the code snippet where the data being iterated over is being fed to the graph:
train_init, test_init, next_batch = make_mnist_dataset(batch_size, include_labels=True)
ops = build_graph(next_batch[0], next_batch[1], learning_rate, is_training,
                  latent_dim, tau, batch_size, inf_layers, gen_layers)
Is there any way to "switch" from an Iterator object to a different input source during test time, without resorting to feed_dict?
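One pattern worth sketching here (illustrative only; z_encoder and build_decoder are hypothetical names): wrap the latent tensor in tf.placeholder_with_default. During training nothing is fed and the encoder's output flows through unchanged; at generation time only the small batch of latent codes is fed, so the host-to-GPU transfer stays negligible and the image input pipeline is untouched.

import numpy as np
import tensorflow as tf

# z_encoder: the tensor produced by the encoder's reparameterization step
# (hypothetical name); build_decoder builds the generative half of the VAE.
z = tf.placeholder_with_default(z_encoder, shape=[None, latent_dim], name='z')
reconstruction = build_decoder(z)

# Training: nothing is fed, z evaluates to the encoder's output.
#   sess.run(train_op)
# Generation: override z with sampled codes; only this small array crosses
# to the GPU, not the whole data set.
#   samples = sess.run(reconstruction,
#                      feed_dict={z: np.random.randn(16, latent_dim)})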

Related

Keras fit vs. fit_generator extra samples

I have training data and validation data stacked up in two tensors. At first, I trained a NN using the keras.model.fit() function. For my purposes, I now wish to move to keras.model.fit_generator(). I built a generator and noticed that the number of samples is not a multiple of the batch size.
My implementation to overcome this:
indices = np.arange(len(dataset))  # generate indices of len(dataset)
num_of_steps = int(np.ceil(len(dataset)/batch_size))  # number of steps per epoch
extra = num_of_steps*batch_size - len(dataset)  # size of the extra samples needed to complete the next multiple of batch_size
additional = np.random.randint(len(dataset), size=extra)  # complete with random samples
indices = np.append(indices, additional)
After shuffling the indices at each epoch, I simply iterate over them in batch-sized steps and pull the corresponding data and labels.
I am observing a degradation in the performance of the model. When training with fit() I get 0.99 training accuracy and 0.93 validation accuracy, while with fit_generator() I get 0.95 and 0.9 respectively. Note that this is consistent and not a single experiment. I thought it might be due to fit() handling the extra samples differently. Is my implementation reasonable? How does fit() handle datasets whose size is not a multiple of batch_size?
Sharing the full generator code:
def generator(self, batch_size, train):
    """
    Generates batches of samples
    :return:
    """
    while 1:
        nb_of_steps = 0
        if train:
            nb_of_steps = self._num_of_steps_train
            indices = np.arange(len(self._x_train))
            additional = np.random.randint(len(self._x_train), size=self._num_of_steps_train*batch_size - len(self._x_train))
        else:
            nb_of_steps = self._num_of_steps_test
            indices = np.arange(len(self._x_test))
            additional = np.random.randint(len(self._x_test), size=self._num_of_steps_test*batch_size - len(self._x_test))
        indices = np.append(indices, additional)
        np.random.shuffle(indices)
        # print(indices.shape)
        # print(nb_of_steps)
        for i in range(nb_of_steps):
            batch_indices = indices[i:i+batch_size]
            if train:
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat, axis=1)
            # print(feat.shape)
            # print(label.shape)
            yield feat, label
It looks like you can simplify the generator significantly!
The number of steps etc. can be set outside the loop, as they do not really change. Moreover, it looks like batch_indices is not going through the entire dataset (the start index advances by 1 instead of by batch_size). Finally, if your data fits in memory you might not need a generator at all, but I will leave this to your judgement.
def generator(self, batch_size, train):
    nb_of_steps = 0
    if train:
        nb_of_steps = self._num_of_steps_train
        indices = np.arange(len(self._x_train))  # len of entire dataset
    else:
        nb_of_steps = self._num_of_steps_test
        indices = np.arange(len(self._x_test))
    while 1:
        np.random.shuffle(indices)
        for i in range(nb_of_steps):
            start_idx = i*batch_size
            end_idx = min(i*batch_size + batch_size, len(indices))
            batch_indices = indices[start_idx:end_idx]
            if train:
                feat = self._x_train[batch_indices]
                label = self._y_train[batch_indices]
            else:
                feat = self._x_test[batch_indices]
                label = self._y_test[batch_indices]
            feat = np.expand_dims(feat, axis=1)
            yield feat, label
For a more robust generator, consider creating a class for your dataset based on keras.utils.Sequence. It adds a few extra lines of code, but it is guaranteed to work correctly with Keras (for example when using multiprocessing).
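A rough sketch of such a Sequence (the attribute handling mirrors the generator above; a separate instance would be created for the test data):

import numpy as np
from keras.utils import Sequence

class BatchSequence(Sequence):
    """Yields (features, labels) batches; safe to use with multiprocessing."""
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size
        self.indices = np.arange(len(x))

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, i):
        idx = self.indices[i*self.batch_size:(i + 1)*self.batch_size]
        return np.expand_dims(self.x[idx], axis=1), self.y[idx]

    def on_epoch_end(self):
        # Reshuffle between epochs, like the generator does.
        np.random.shuffle(self.indices)

An instance can then be passed to fit_generator directly in place of the generator.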

Tensorflow -- Iterating over training and validation sequentially

I have been going through the Dataset API of tensorflow to feed different datasets with ease to an RNN model.
I got everything working by following the not-so-many blogs together with the docs on the tensorflow website. My working example did the following:
--- Train for X epochs on a training dataset -> validate on a validation dataset after all the training has concluded.
However, I'm unable to develop the following example:
--- Train for X epochs on a training dataset -> validate the training model with a validation dataset in each epoch (a bit like what Keras does)
The problematic issue comes because of the following piece of code:
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)
When I create the iterator with from_structure, I need to specify an output_shape. Obviously, the output shapes of the train dataset and the validation dataset are not the same, as they have different batch_sizes. However, validation_init_op is throwing the following error, which seems counterintuitive because validation sets almost always have a different batch_size:
TypeError: Expected output shapes compatible with (TensorShape([Dimension(256), Dimension(11), Dimension(74)]), TensorShape([Dimension(256), Dimension(3)])) but got dataset with output shapes (TensorShape([Dimension(28), Dimension(11), Dimension(74)]), TensorShape([Dimension(28), Dimension(3)])).
I want to take this second approach to evaluate my model and watch the usual train and validation plots develop at the same time, to see how I can improve it (stopping the learning early, etc.). With the first, simple approach I don't get any of this.
So, the question is: Am I doing something wrong? Does my second approach have to be tackled differently? I can think of creating two iterators, but I don't know if that is the right approach. Also, this answer by @MatthewScarpino points to a feedable iterator, because switching between reinitializable ones makes them start all over again; however, the above error is not related to that part of the code -- maybe the reinitializable iterator is not intended to take a different batch size for the validation set, but only to iterate over it once after training, whatever its size, without setting it in the .batch() method?
Any help is very much appreciated.
Full code for reference:
N_TIMESTEPS_X = xt.shape[0] ## The stack number
BATCH_SIZE = 256
#N_OBSERVATIONS = xt.shape[1]
N_FEATURES = xt.shape[2]
N_OUTPUTS = yt.shape[1]
N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell
N_EPOCHS = 350
LEARNING_RATE = 0.001
### Define the placeholders and gather the data.
xt = xt.transpose([1,0,2])
xval = xval.transpose([1,0,2])
train_data = (xt, yt)
validation_data = (xval, yval)
N_BATCHES = train_data[0].shape[0] // BATCH_SIZE
print('The number of batches is: {}'.format(N_BATCHES))
BATCH_SIZE_VAL = validation_data[0].shape[0] // N_BATCHES
print('The validation batch size is: {}'.format(BATCH_SIZE_VAL))
## We define the placeholders as a trick so that we do not run into memory problems, which are associated with feeding the data directly.
'''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
batch_size = tf.placeholder(tf.int64)
x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')
# Creating the two different dataset objects.
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()
# Creating the Iterator type that permits to switch between datasets.
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)
next_features, next_labels = itr.get_next()
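As a side note on the error itself: with drop_remainder=True the batch size becomes part of the static output shape (hence the 256 vs. 28 mismatch above). A sketch of one way to keep the two datasets structurally compatible is to omit it, so the batch dimension stays None:

train_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(BATCH_SIZE).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(BATCH_SIZE_VAL).repeat()

# from_structure now accepts initializers for both datasets, since their
# output_shapes agree with the batch dimension left unknown.
itr = tf.data.Iterator.from_structure(train_dataset.output_types,
                                      train_dataset.output_shapes)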
After investigating the best way to do this, I came across this final implementation that works well on my end. It is surely not the best. To maintain the state, I used a feedable iterator.
AIM: This code is intended to be used when you want to train and validate at the same time, preserving the state of each iterator (i.e. validating with the newest model parameters). Along with that, the code saves the model and other things, like some information about the hyperparameters and the summaries to visualize the training and validation in Tensorboard.
Also, don't get confused: you don't need a different batch size for the training set and for the validation set. This is a misconception that I had. The batch sizes must be the same AND you have to deal with the different number of batches, just passing when no more batches are left. This is a requirement so that you can create the iterator, given that both datasets have the same data type and shape.
Hope it helps others. Just ignore the code that does not relate to your objectives. Many thanks to @kvish for all the help and time.
Code:
def RNNmodelTF(xt, yt, xval, yval, xtest, ytest):
    N_TIMESTEPS_X = xt.shape[0]  ## The stack number
    BATCH_SIZE = 256
    #N_OBSERVATIONS = xt.shape[1]
    N_FEATURES = xt.shape[2]
    N_OUTPUTS = yt.shape[1]
    N_NEURONS_LSTM = 128  ## Number of units in the LSTMCell
    N_EPOCHS = 350
    LEARNING_RATE = 0.001

    ### Define the placeholders and gather the data.
    xt = xt.transpose([1,0,2])
    xval = xval.transpose([1,0,2])
    train_data = (xt, yt)
    validation_data = (xval, yval)
    N_BATCHES = train_data[0].shape[0] // BATCH_SIZE

    ## We define the placeholders as a trick so that we do not run into memory problems, which are associated with feeding the data directly.
    '''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
    batch_size = tf.placeholder(tf.int64)
    x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
    y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')

    # Creating the two different dataset objects.
    train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
    val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()

    #################### Creating the Iterator type that permits to switch between datasets.
    handle = tf.placeholder(tf.string, shape=[])
    iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
    next_features, next_labels = iterator.get_next()

    train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
    train_iterator = train_val_iterator.make_initializer(train_dataset)
    val_iterator = train_val_iterator.make_initializer(val_dataset)
    ###########################

    ### Create the graph
    cellType = tf.nn.rnn_cell.LSTMCell(num_units=N_NEURONS_LSTM, name='LSTMCell')
    inputs = tf.unstack(next_features, axis=1)
    '''inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size]'''
    RNNOutputs, _ = tf.nn.static_rnn(cell=cellType, inputs=inputs, dtype=tf.float32)
    out_weights = tf.get_variable("out_weights", shape=[N_NEURONS_LSTM, N_OUTPUTS], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer())
    out_bias = tf.get_variable("out_bias", shape=[N_OUTPUTS], dtype=tf.float32, initializer=tf.zeros_initializer())
    predictionsLayer = tf.matmul(RNNOutputs[-1], out_weights) + out_bias

    ### Define the cost function, that will be optimized by the optimizer.
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=predictionsLayer, labels=next_labels, name='Softmax_plus_Cross_Entropy'))
    optimizer_type = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, name='AdamOptimizer')
    optimizer = optimizer_type.minimize(cost)

    ### Model evaluation
    correctPrediction = tf.equal(tf.argmax(predictionsLayer, 1), tf.argmax(next_labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correctPrediction, tf.float32))
    confusionMatrix1 = tf.confusion_matrix(tf.argmax(next_labels, 1), tf.argmax(predictionsLayer, 1), num_classes=3, name='ConfMatrix')

    ## Saving variables so that we can restore them afterwards.
    saver = tf.train.Saver()
    save_dir = '/media/SecondDiskHDD/8classModels/DLmodels/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    #save_dir = '/home/Desktop/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    os.mkdir(save_dir)
    varDict = {'nTimeSteps': N_TIMESTEPS_X, 'BatchSize': BATCH_SIZE, 'nFeatures': N_FEATURES,
               'nNeuronsLSTM': N_NEURONS_LSTM, 'nEpochs': N_EPOCHS,
               'learningRate': LEARNING_RATE, 'optimizerType': optimizer_type.__class__.__name__}
    varDicSavingTxt = save_dir + '/varDict.txt'
    modelFilesDir = save_dir + '/modelFiles'
    os.mkdir(modelFilesDir)
    logDir = save_dir + '/TBoardLogs'
    os.mkdir(logDir)
    acc_summary = tf.summary.scalar('Accuracy', accuracy)
    loss_summary = tf.summary.scalar('Cost_CrossEntropy', cost)
    summary_merged = tf.summary.merge_all()
    with open(varDicSavingTxt, 'w') as outfile:
        outfile.write(repr(varDict))

    with tf.Session() as sess:
        tf.set_random_seed(2)
        sess.run(tf.global_variables_initializer())
        train_writer = tf.summary.FileWriter(logDir + '/train', sess.graph)
        validation_writer = tf.summary.FileWriter(logDir + '/validation')
        # initialise iterator with data
        train_val_string = sess.run(train_val_iterator.string_handle())
        cm1Total = None
        cm2Total = None
        print('Training starts!')
        for epoch in range(N_EPOCHS):
            batchAccList = []
            batchAccListVal = []
            tot_loss_train = 0
            tot_loss_validation = 0
            for batch in range(N_BATCHES):
                sess.run(train_iterator, feed_dict={x: train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
                optimizer_output, loss_value, summary, accBatch, cm1 = sess.run([optimizer, cost, summary_merged, accuracy, confusionMatrix1], feed_dict={handle: train_val_string})
                npArrayPred = predictionsLayer.eval(feed_dict={handle: train_val_string})
                predLabEnc = np.apply_along_axis(thresholdSet, 1, npArrayPred, value=0.5)
                npArrayLab = next_labels.eval(feed_dict={handle: train_val_string})
                labLabEnc = np.argmax(npArrayLab, 1)
                cm2 = confusion_matrix(labLabEnc, predLabEnc)
                tot_loss_train += loss_value
                batchAccList.append(accBatch)
                try:
                    sess.run(val_iterator, feed_dict={x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
                    valLoss, valAcc, summary_val = sess.run([cost, accuracy, summary_merged], feed_dict={handle: train_val_string})
                    tot_loss_validation += valLoss
                    batchAccListVal.append(valAcc)
                except tf.errors.OutOfRangeError:
                    pass
                if cm1Total is None and cm2Total is None:
                    cm1Total = cm1
                    cm2Total = cm2
                else:
                    cm1Total += cm1
                    cm2Total += cm2
                if batch % 10 == 0:
                    train_writer.add_summary(summary, batch)
                    validation_writer.add_summary(summary_val, batch)
            epochAcc = tf.reduce_mean(batchAccList)
            sess.run(train_iterator, feed_dict={x: train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
            epochAcc_num = sess.run(epochAcc, feed_dict={handle: train_val_string})
            epochAccVal = tf.reduce_mean(batchAccListVal)
            sess.run(val_iterator, feed_dict={x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
            epochAcc_num_Val = sess.run(epochAccVal, feed_dict={handle: train_val_string})
            if epoch % 10 == 0:
                print("Epoch: {}, Loss: {:.4f}, Accuracy: {:.3f}".format(epoch, tot_loss_train / N_BATCHES, epochAcc_num))
                print('Validation Loss: {:.4f}, Validation Accuracy: {:.3f}'.format(tot_loss_validation / N_BATCHES, epochAcc_num_Val))
        cmLogFile1 = save_dir + '/cm1File.txt'
        with open(cmLogFile1, 'w') as outfile:
            outfile.write(repr(cm1Total))
        cmLogFile2 = save_dir + '/cm2File.txt'
        with open(cmLogFile2, 'w') as outfile:
            outfile.write(repr(cm2Total))
        saver.save(sess, modelFilesDir + '/model.ckpt')

how to create iterator.get_next() for validation set

I am working on a project to classify medical images using a CNN model. For my project I use tensorflow. After some searching, I was finally able to use the new tensorflow input pipeline to prepare the train, validation and test sets. Here is the code:
train_data = tf.data.Dataset.from_tensor_slices(train_images)
train_labels = tf.data.Dataset.from_tensor_slices(train_labels)
train_set = tf.data.Dataset.zip((train_data,train_labels)).shuffle(500).batch(30)
valid_data = tf.data.Dataset.from_tensor_slices(valid_images)
valid_labels = tf.data.Dataset.from_tensor_slices(valid_labels)
valid_set = tf.data.Dataset.zip((valid_data,valid_labels)).shuffle(200).batch(20)
test_data = tf.data.Dataset.from_tensor_slices(test_images)
test_labels = tf.data.Dataset.from_tensor_slices(test_labels)
test_set = tf.data.Dataset.zip((test_data, test_labels)).shuffle(200).batch(20)
# create general iterator
iterator = tf.data.Iterator.from_structure(train_set.output_types, train_set.output_shapes)
next_element = iterator.get_next()
train_init_op = iterator.make_initializer(train_set)
valid_init_op = iterator.make_initializer(valid_set)
test_init_op = iterator.make_initializer(test_set)
I can use next_element to iterate over the train set (next_element[0] for images and next_element[1] for labels). Now I want to do the same thing for the validation set (creating an iterator for the validation set). Can anyone give me an idea how to do it?
You should be able to use the same next_element to get the validation and test sets.
For example, initialize the iterator with sess.run(valid_init_op), and then next_element generates data from the validation set.
with tf.Session() as sess:
    sess.run(train_init_op)
    image_train, label_train = sess.run(next_element)  # fetches one training batch
    sess.run(valid_init_op)
    image_val, label_val = sess.run(next_element)  # the same op now yields validation data
    sess.run(test_init_op)
    image_test, label_test = sess.run(next_element)
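Built on the same pattern, a full epoch loop might look like this sketch (the training and metric computations are placeholders, and num_epochs is illustrative):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        sess.run(train_init_op)
        while True:  # one pass over the finite training set
            try:
                image_train, label_train = sess.run(next_element)
                # ... run the training step on this batch ...
            except tf.errors.OutOfRangeError:
                break
        sess.run(valid_init_op)  # switch the same iterator to validation data
        while True:
            try:
                image_val, label_val = sess.run(next_element)
                # ... accumulate validation metrics ...
            except tf.errors.OutOfRangeError:
                break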

batching huge data in tensorflow

I am trying to perform binary classification using the code/tutorial from
https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py
print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")
print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable,
                                 maxlen=sentence_size,
                                 padding='post',
                                 value=0)
x_test = sequence.pad_sequences(x_test_variable,
                                maxlen=sentence_size,
                                padding='post',
                                value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])
    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)
    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)
    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)
    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)
    logits = tf.layers.dense(inputs=dropout_hidden, units=1)
    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])
    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)

cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                        model_dir=os.path.join(model_dir, 'cnn'),
                                        params=params)
train_and_evaluate(cnn_classifier)
The example here loads data from IMDB movie reviews. I have my own dataset in the form of text, which is approx. 2GB. Now, in this example the line
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size) tries to load the whole dataset into memory. If I try to do the same I run out of memory. How can I restructure this logic to read data in batches from my disk?
You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset - from_tensor_slices is the easiest, but won't work on its own if you can't load the entire dataset to memory.
The best way depends on how you have the data stored, or how you want to store/manipulate it. The simplest in my opinion, with very little downside (unless running on multiple GPUs), is to have the original dataset just give indices into the data, and to write a normal numpy function for loading the ith example.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))

def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)

    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.

    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2

dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # start loading data as GPU trains on previous batch
inp1, inp2 = dataset.make_one_shot_iterator().get_next()
Here I assume your outputs are float32 tensors (Tout=...). The set_shape calls aren't strictly necessary, but if you know the shapes, TensorFlow can do better error checking.
So long as your preprocessing doesn't take longer than your network to run, this should run just as fast as any other method on a single GPU machine.
The other obvious way is to convert your data to tfrecords, but that'll take up more space on disk and is more of a pain to manage if you ask me.
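The load_ith_example function above is whatever suits your storage; as one hypothetical version, assuming each example was written out once to its own pair of .npy files by a preprocessing pass:

import numpy as np

def load_ith_example(i):
    # Hypothetical layout: features/000042.npy, labels/000042.npy, etc.,
    # produced once from the large text corpus.
    feat = np.load('features/%06d.npy' % i).astype(np.float32)
    label = np.load('labels/%06d.npy' % i).astype(np.float32)
    return feat, label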

keras fit_generator reading chunks from hdfstore

I am trying to build a generator for a Keras model which will be trained on a large hdf store.
To speed up the training, I pre-calculated all the features, incl. one-hot encoding, already in the hdfstore, so reading from it should be straightforward.
To feed chunks of my data into the network, I am trying to use fit_generator, but struggle to get it up and running.
The generator:
def myGenerator(myStore, generateFrom, generateTo):
    # Create empty arrays to contain batch of features and labels
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        yield X, y
Network and fitting:
def get_model(shape):
    '''Create a keras model.'''
    inputlayer = Input(shape=shape)
    model = BatchNormalization()(inputlayer)
    model = Dense(1024, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(512, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(256, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(128, activation='relu')(model)
    model = Dropout(0.25)(model)
    # 11 because background noise has been taken out
    model = Dense(2, activation='tanh')(model)
    model = Model(inputs=inputlayer, outputs=model)
    return model

shape = (6603, 10000)
model = get_model(shape)
model.compile(loss='mean_squared_error', optimizer=Adam(), metrics=['accuracy'])

#X = generator(myStore)
#Xt = generator(myStore)

labelbinarizer = LabelBinarizer()
y = labelbinarizer.fit_transform(y)
#yt = labelbinarizer.fit_transform(yt)

generateFrom = 0
for i in range(10):
    generateTo = generateFrom + 10000
    model.fit_generator(
        generator=myGenerator(myStore, generateFrom, generateTo),
        epochs=1,
        steps_per_epoch=X[0].shape[0] // 1000)
    generateFrom = generateTo
I have tried both to have the fit_generator within a loop and plug in the range (as shown above), and to handle the range inside the generator. Neither works. Currently I am running into
TypeError: 'generator' object is not subscriptable
Likely I have some misunderstanding of how fit_generator() is supposed to be used in this context. Most examples out there are about generating tensors from pictures.
Any hint is appreciated.
Thanks
The function read_hdf returns a pandas object; you need to convert it to a numpy array.
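Applied to the generator above, that could look like this sketch (.values turns the DataFrame/Series into plain numpy arrays, which is what Keras expects):

def myGenerator(myStore, generateFrom, generateTo):
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        # Hand Keras numpy arrays rather than pandas objects.
        yield X.values, y.values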
