I have a graph and a set of custom functions that define multilayer RNNs according to an input list which will specify the number of units in each layer. For instance:
def BuildLayers(....):
# takes inputs, list of layer sizes, mask information, etc
#
# invokes BuildLayer(...) several times
#
# returns RNN output and states of last layer
BuildLayers loops through a more detailed function, BuildLayer, which builds and returns individual layers:
def BuildLayer(....):
# Takes individual layer size, output of previous layer, etc
#
# handles bookkeeping of RNNCells, wrappers, reshaping, etc
# **Important! Defines scope for each layer**
#
# returns RNN output and states of last layer
And ultimately this would be called in a function that defines a graph and runs it in a session:
def Experiment(parameters):
tf.reset_default_graph()
graph = tf.Graph()
with graph.as_default():
#
# Placeholders
# BuildLayers(...)
# Loss function definitions
# optimizer definitions
with tf.Session(graph=graph) as session:
#
# Loop through epochs:
# etc
For example, if the layer size parameter is [16, 32, 16], we end up with an RNN that has a cell of 16 units in layer 1, scoped as layer1, 32 units in layer 2, scoped appropriately, and 16 units in layer 3, scoped, etc.
This seems to work fine: a casual inspection of the graph in tensorboard looks correct, the nodes look correct, the model trains, etc.
Problem: How can I add histogram summaries, e.g., of kernel weights and biases, to that function definition? I've done so naively, as such:
def buildLayer(numUnits, numLayer, input, lengths):
    name = 'layer' "{0:0=2d}".format(numLayer)
    with tf.variable_scope(name):
        cellfw = tf.contrib.rnn.GRUCell(numUnits, activation = tf.nn.tanh)
        cellbw = tf.contrib.rnn.GRUCell(numUnits, activation = tf.nn.tanh)
        outputs, state = tf.nn.bidirectional_dynamic_rnn(cell_fw = cellfw, cell_bw = cellbw, inputs = input, dtype=tf.float32, sequence_length = lengths)
        outputs = tf.concat([outputs[0], outputs[1]], axis=2)
        FwKernel = tf.get_default_graph().get_tensor_by_name(name + '/bidirectional_rnn/fw/gru_cell/gates/kernel:0')
        FwKernel_sum = tf.summary.histogram("FwKernel", FwKernel, 'rnn')
    return outputs, state
Then, at the end of the graph definition, I assumed these summaries would be picked up by the
merged = tf.summary.merge_all()
statement. They aren't. I'm confused by this behavior. I can see the histogram summary definitions on a visual inspection of the graph in tensorboard; they're there. But they don't seem to reach the merge, and so they never show up in tensorboard as histograms.
How do I get summaries, which are defined in a function, to show up in tensorboard, preferably through a merge and without passing them around through function calls like excess baggage?
The least painful workaround I have found is to pass a single list (e.g., "summaries") through each function and, within the BuildLayers function, to append or extend that list with all desired histogram summaries.
Then, in the main graph definition, rather than a merge_all
merged = tf.summary.merge_all()
instead use a merge and pass the list in as the argument
merged = tf.summary.merge(summaries)
This has the disadvantage of not actually being a merge_all, meaning that if you had defined other summaries (typically scalar summaries for loss functions, at least) you're going to have to manually append them to the summaries list or carry around two merge objects or something similar, which misses the self-advertised point of the merge_all.
I leave this here as an answer to my own question because it might help someone, but will pointedly not accept it because I am hoping to be shown a better way.
Most likely the problem is that the summaries are created inside the with graph.as_default(): context, so the summary operations are added to that graph's _collections["SUMMARIES"] list. When you call merge_all(), however, you are no longer in that context (the one that made graph the default). merge_all() therefore looks for summaries in the default graph that was created when you imported tensorflow, which is probably empty.
To fix the issue, simply call merge_all() within the same with graph.as_default(): context.
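A minimal sketch of that fix, assuming TensorFlow's 1.x-style graph API (imported here through tensorflow.compat.v1 so that it also runs on TF 2.x). build_layer is a hypothetical stand-in for the question's BuildLayer, and the layer is a plain matmul rather than a GRU. One more thing that may be worth checking, independent of the graph context: in TF 1.x the third positional parameter of tf.summary.histogram is collections, so passing 'rnn' there replaces the default summaries collection that merge_all() reads.

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x-style graph API via the compat module

tf.disable_eager_execution()


def build_layer(x, num_units, name):
    """Hypothetical stand-in for BuildLayer: one matmul layer plus a histogram summary."""
    with tf.variable_scope(name):
        kernel = tf.get_variable("kernel", shape=[int(x.shape[-1]), num_units])
        # No `collections` argument, so the op goes to the default GraphKeys.SUMMARIES
        tf.summary.histogram("kernel", kernel)
        return tf.matmul(x, kernel)


graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 4])
    out = build_layer(x, 8, "layer01")
    loss = tf.reduce_mean(out)
    tf.summary.scalar("loss", loss)
    # Called while `graph` is still the default graph: finds both summaries
    merged = tf.summary.merge_all()

# Called outside the context: looks at the (empty) global default graph
merged_outside = tf.summary.merge_all()
```

Here merged is a single op that can be passed to session.run, while merged_outside is None because the global default graph holds no summaries.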
Here are some relevant code links:
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/summary/summary.py#L293
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/framework/ops.py#L5649
Using OpenAI's gym environment, I've created my own environment in which the observation space is of Box type and has shape (21,21,1).
The intention is to use a keras Conv2D layer as the model's input. Ideally, the shape going into this model would be (None,21,21,1), with None representing the batch size. Keras' documentation is here: https://keras.io/api/layers/convolution_layers/convolution2d/
The issue I'm having is that an extra dimension is required when the input shape is checked, so the shape it expects is (None,1,21,21,1). This prevents me from using MaxPooling layers in the model. After investigating the keras RL library, I found that two functions add this dimension.
The first function is found in memory.py, where a current observation is put into a list and returned as such. Here:
def get_recent_state(self, current_observation):
    """Return list of last observations
    # Argument
        current_observation (object): Last observation
    # Returns
        A list of the last observations
    """
    # This code is slightly complicated by the fact that subsequent observations might be
    # from different episodes. We ensure that an experience never spans multiple episodes.
    # This is probably not that important in practice but it seems cleaner.
    state = [current_observation]
    idx = len(self.recent_observations) - 1
    for offset in range(0, self.window_length - 1):
        current_idx = idx - offset
        current_terminal = self.recent_terminals[current_idx - 1] if current_idx - 1 >= 0 else False
        if current_idx < 0 or (not self.ignore_episode_boundaries and current_terminal):
            # The previously handled observation was terminal, don't add the current one.
            # Otherwise we would leak into a different episode.
            break
        state.insert(0, self.recent_observations[current_idx])
    while len(state) < self.window_length:
        state.insert(0, zeroed_observation(state[0]))
    return state
The second function is called just after and computes the Q values based on the recent observations. It wraps the state in a list before passing it on to "compute_batch_q_values".
def compute_q_values(self, state):
    q_values = self.compute_batch_q_values([state]).flatten()
    assert q_values.shape == (self.nb_actions,)
    return q_values
I understand that one extra dimension should be added to represent the batch size, but why is a dimension added twice? Can anyone explain why this is, or how to use Conv2D layers with OpenAI gym?
Thanks.
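The two wrapping steps quoted above can be traced with plain NumPy. This is only a sketch of the shapes involved: it assumes window_length = 1 and that compute_batch_q_values converts its argument to an array before calling the model, and it does not import keras-rl.

```python
import numpy as np

window_length = 1                       # assumed memory setting
observation = np.zeros((21, 21, 1))     # one observation from the environment

# get_recent_state: returns a list holding the last `window_length` observations
state = [observation] * window_length
state_shape = np.array(state).shape     # (window_length, 21, 21, 1)

# compute_q_values: wraps the state in another list to form a batch of size 1
batch = np.array([state])
batch_shape = batch.shape               # (1, window_length, 21, 21, 1)

# Folding the window axis away recovers the 4-D shape a Conv2D model expects
conv_input = batch.reshape((-1,) + observation.shape)
```

Read this way, the first added axis is the observation window (window_length) and only the second one is the batch. One possible workaround is to declare the model input as (window_length, 21, 21, 1) and make a Reshape layer the first layer of the model; treat that as a suggestion to verify against your keras-rl version.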
I was wondering what is the proper way of logging metrics when using DDP. I noticed that if I want to print something inside validation_epoch_end it will be printed twice when using 2 GPUs. I was expecting validation_epoch_end to be called only on rank 0 and to receive the outputs from all GPUs, but I am not sure this is correct anymore. Therefore I have several questions:
validation_epoch_end(self, outputs) - When using DDP, does every subprocess receive the data processed from the current GPU or data processed from all GPUs, i.e. does the input parameter outputs contain the outputs of the entire validation set, from all GPUs?
If outputs is GPU/process specific, what is the proper way to calculate any metric on the entire validation set in validation_epoch_end when using DDP?
I understand that I can solve the printing by checking self.global_rank == 0 and printing/logging only in that case, however I am trying to get a deeper understanding of what I am printing/logging in this case.
Here is a code snippet from my use case. I would like to be able to report f1, precision and recall on the entire validation dataset and I am wondering what is the correct way of doing it when using DDP.
def _process_epoch_outputs(self,
                           outputs: List[Dict[str, Any]]
                           ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Creates and returns tensors containing all labels and predictions

    Goes over the outputs accumulated from every batch, detaches the
    necessary tensors and stacks them together.

    Args:
        outputs (List[Dict])
    """
    all_labels = []
    all_predictions = []
    for output in outputs:
        for labels in output['labels'].detach():
            all_labels.append(labels)
        for predictions in output['predictions'].detach():
            all_predictions.append(predictions)
    all_labels = torch.stack(all_labels).long().cpu()
    all_predictions = torch.stack(all_predictions).cpu()
    return all_predictions, all_labels

def validation_epoch_end(self, outputs: List[Dict[str, Any]]) -> None:
    """Logs f1, precision and recall on the validation set."""
    if self.global_rank == 0:
        print(f'Validation Epoch: {self.current_epoch}')
    predictions, labels = self._process_epoch_outputs(outputs)
    for i, name in enumerate(self.label_columns):
        f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
                                                         labels[:, i],
                                                         threshold=None)
        self.logger.experiment.add_scalar(f'{name}_f1/Val',
                                          f1,
                                          self.current_epoch)
        self.logger.experiment.add_scalar(f'{name}_Precision/Val',
                                          prec,
                                          self.current_epoch)
        self.logger.experiment.add_scalar(f'{name}_Recall/Val',
                                          recall,
                                          self.current_epoch)
        if self.global_rank == 0:
            print((f'F1: {f1}, Precision: {prec}, '
                   f'Recall: {recall}, Threshold {t}'))
Questions
validation_epoch_end(self, outputs) - When using DDP does every
subprocess receive the data processed from the current GPU or data
processed from all GPUs, i.e. does the input parameter outputs
contain the outputs of the entire validation set, from all GPUs?
Each process receives only the data processed on its own GPU. Outputs are not synchronized; only the backward pass is (gradients are synchronized during training and distributed to the model replicas residing on each GPU).
Imagine if all of the outputs were passed from 1000 GPUs to this poor master: it could very easily run out of memory (OOM).
If outputs is GPU/process specific what is the proper way to calculate
any metric on the entire validation set in validation_epoch_end when
using DDP?
According to documentation (emphasis mine):
When validating using a accelerator that splits data from each batch
across GPUs, sometimes you might need to aggregate them on the master
GPU for processing (dp, or ddp2).
And here is the accompanying code (in this case validation_epoch_end would receive data accumulated across multiple GPUs from a single step; also see the comments):
import torch.nn.functional as F

# Done per-process (GPU)
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {'loss': loss, 'pred': pred}

# Gathered data from all processes (per single step)
# Allows for accumulation so the whole data at the end of epoch
# takes less memory
def validation_step_end(self, batch_parts):
    gpu_0_prediction = batch_parts[0]['pred']
    gpu_1_prediction = batch_parts[1]['pred']
    # do something with both outputs
    return (batch_parts[0]['loss'] + batch_parts[1]['loss']) / 2

def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        # do something with preds
        pass
Tips
Focus on per-device calculations and on as few GPU-to-GPU transfers as possible
Inside validation_step (or training_step if that's what you want, this is general) calculate f1, precision, recall and whatever else on a per-batch basis
Return those values (say, as a dict). Now you will return 3 numbers from each device instead of (batch, outputs) (which could be significantly larger)
Inside validation_step_end get those 3 values (actually (2, 3) if you have 2 GPUs), sum them or take their mean, and return 3 values
Now validation_epoch_end will get (steps, 3) values that you can use to accumulate
It would be even better if instead of operating on list of values during validation_epoch_end you could accumulate them in another 3 values (say you have a lot of validation steps, the list could grow too large), but this should be enough.
AFAIK PyTorch-Lightning doesn't do this (e.g. instead of adding to list, apply some accumulator directly), but I might be mistaken, so any correction would be great.
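As a sketch of the accumulator idea, the class below keeps running true-positive, false-positive and false-negative counts instead of a growing list. F1Accumulator and its method names are hypothetical, it uses NumPy on binary labels for brevity, and it assumes the predictions were already thresholded. Summing counts gives the exact dataset-level F1, whereas averaging per-batch F1 values is in general only an approximation. Under DDP the three counts held by each process would still have to be summed across processes (for example with torch.distributed.all_reduce) before calling compute.

```python
import numpy as np


class F1Accumulator:
    """Hypothetical helper: running TP/FP/FN counts for binary predictions."""

    def __init__(self):
        self.tp = 0
        self.fp = 0
        self.fn = 0

    def update(self, predictions, labels):
        # Add the counts of one batch to the running totals
        predictions = np.asarray(predictions).astype(bool)
        labels = np.asarray(labels).astype(bool)
        self.tp += int(np.sum(predictions & labels))
        self.fp += int(np.sum(predictions & ~labels))
        self.fn += int(np.sum(~predictions & labels))

    def compute(self):
        # Derive the metrics from the totals; guard against empty denominators
        precision = self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0
        recall = self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return f1, precision, recall


acc = F1Accumulator()
acc.update([1, 1, 0, 0], [1, 0, 0, 1])   # batch 1: TP=1, FP=1, FN=1
acc.update([1, 0, 1, 1], [1, 0, 1, 0])   # batch 2: TP=2, FP=1, FN=0
f1, precision, recall = acc.compute()
```

Memory use stays constant no matter how many validation steps there are, since only three integers are stored per label.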
I have a large input vector (1000 features) to a Sequential model. The model is mainly a dense network.
I know that features 1-50 are coordinate-wise highly correlated to features 51-100 (1 with 51, 2 with 52 etc.) and I want to take advantage of that.
Is there a way to add a layer to my existing model that reflects that (joining inputs 1 and 51 to one neuron, 2 and 52 to another, etc.)?
Or maybe the only option is to change the input structure to 50 tensors (of 1x2) and one large vector of 900 features? (I would like to avoid that since it means re-writing my feature preparation code)
I think the first dense layer would find this relationship, provided you define and train the model properly. However, if you would like to process the first 100 features separately, one alternative is to use the Keras functional API and define two Input layers, one for the first 100 features and another for the remaining 900 features:
from keras.layers import Input, Dense, concatenate
from keras.models import Model
input_100 = Input(shape=(100,))
input_900 = Input(shape=(900,))
Now you can process each one separately. For example, you can define two separate Dense layers connected to each one and then merge their outputs:
dense_100 = Dense(50, activation='relu')(input_100)
dense_900 = Dense(200, activation='relu')(input_900)
concat = concatenate([dense_100, dense_900])
# define the rest of your model ...
model = Model(inputs=[input_100, input_900], outputs=[the_outputs_of_model])
Of course, you need to feed the input layers separately. For that you can easily slice the training data:
model.fit([X_train[:,:100], X_train[:,100:]], y_train, ...)
Update: If you specifically want features 1 and 51, 2 and 52, etc. to each have a separate neuron (I can't comment on how effective that is without experimenting on the data), you can use a LocallyConnected1D layer with a kernel size and number of filters of 1 (i.e. it behaves like applying a separate Dense layer to each pair of related features):
input_50_2 = Input(shape=(50,2))
local_out = LocallyConnected1D(1, 1, activation='relu')(input_50_2)
local_reshaped = Reshape((50,))(local_out) # need this for merging since local_out has shape of (None, 50, 1)
# or use the following:
# local_reshaped = Flatten()(local_out)
concat = concatenate([local_reshaped, dense_900])
# define the rest of your model...
# reshape the first 100 features to (num_samples, 50, 2) so that features i and i+50 are paired
X_train_50_2 = X_train[:,:100].reshape((-1, 2, 50)).transpose((0, 2, 1))
model.fit([X_train_50_2, X_train[:,100:]], y_train, ...)
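To check the pairing before feeding the model, here is a small NumPy sketch (the toy X_train is made up for illustration). It turns the first 100 columns into an array of shape (N, 50, 2) in which position i holds features i and i+50 (0-based) of the same sample.

```python
import numpy as np

n_samples = 4
# Toy data: column j simply holds the value j, so the pairs are easy to inspect
X_train = np.tile(np.arange(1000, dtype=np.float32), (n_samples, 1))

first_100 = X_train[:, :100]                  # shape (N, 100)
# (N, 100) -> (N, 2, 50): axis 1 separates the two halves of 50 features
# (N, 2, 50) -> (N, 50, 2): row i now holds [feature i, feature i + 50]
X_train_50_2 = first_100.reshape((-1, 2, 50)).transpose((0, 2, 1))
rest = X_train[:, 100:]                       # shape (N, 900)
```

Reshaping straight to (N, 50, 2) without the transpose would instead pair neighbouring columns (0 with 1, 2 with 3, and so on), which is not the correlation structure described in the question.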
I just read an interesting paper: A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks.
I'd like to try to implement this activation function in Keras. I've implemented custom activations before, e.g. a sinusoidal activation:
def sin(x):
return K.sin(x)
get_custom_objects().update({'sin': Activation(sin)})
However, the activation function in this paper has 3 unique properties:
It doubles the size of the input (the output is 2x the input)
It's parameterized
Its parameters should be regularized
I think once I have a skeleton for dealing with the above 3 issues, I can work out the math myself, but I'll take any help I can get!
Here, we will need one of these two:
A Lambda layer - If your parameters are not trainable (you don't want them to change with backpropagation)
A custom layer - If you need custom trainable parameters.
The Lambda layer:
If your parameters are not trainable, you can define your function for a lambda layer. The function takes one input tensor, and it can return anything you want:
import keras.backend as K

def customFunction(x):
    #x can be either a single tensor or a list of tensors
    #if a list, use the elements x[0], x[1], etc.

    #Perform your calculations here using the keras backend
    #If you could share which formula exactly you're trying to implement,
    #it's possible to make this answer better and more to the point

    #dummy example
    alphaReal = K.variable([someValue])
    alphaImag = K.variable([anotherValue]) #or even an array of values
    realPart = alphaReal * K.someFunction(x) + ...
    imagPart = alphaImag * K.someFunction(x) + ...

    #You can return them as two outputs in a list (requires the functional API model)
    #Or you can find backend functions that join them together, such as K.stack
    #I think the separate approach will give you better control of what to do next.
    return [realPart,imagPart]
For what you can do, explore the backend functions.
For the parameters, you can define them as keras constants or variables (K.constant or K.variable), either inside or outside the function above, or even turn them into model inputs. See details in this answer
In your model, you just add a lambda layer that uses that function.
In a Sequential model: model.add(Lambda(customFunction, output_shape=someShape))
In a functional API model: output = Lambda(customFunction, ...)(inputOrListOfInputs)
If you're going to pass more inputs to the function, you'll need the functional model API.
If you're using Tensorflow, the output_shape will be computed automatically; I believe only Theano requires it. (Not sure about CNTK.)
The custom layer:
A custom layer is a new class you create. This approach will only be necessary if you're going to have trainable parameters in your function. (Such as: optimize alpha with backpropagation)
Keras teaches it here.
Basically, you have an __init__ method where you pass the constant parameters, a build method where you create the trainable parameters (weights), a call method that will do the calculations (exactly what would go in the lambda layer if you didn't have trainable parameters), and a compute_output_shape method so you can tell the model what the output shape is.
from keras.layers import Layer
import keras.backend as K

class CustomLayer(Layer):
    def __init__(self, alphaReal, alphaImag, **kwargs):
        super(CustomLayer, self).__init__(**kwargs) # the base class must be initialized first
        self.alphaReal = alphaReal
        self.alphaImag = alphaImag

    def build(self,input_shape):
        #weights may or may not depend on the input shape
        #you may use it or not...
        #suppose we want just two trainable values:
        weightShape = (2,)
        #create the weights:
        self.kernel = self.add_weight(name='kernel',
                                      shape=weightShape,
                                      initializer='uniform',
                                      trainable=True)
        super(CustomLayer, self).build(input_shape) # Be sure to call this somewhere!

    def call(self,x):
        #all the calculations go here:
        #dummy example using the constant inputs
        realPart = self.alphaReal * K.someFunction(x) + ...
        imagPart = self.alphaImag * K.someFunction(x) + ...
        #dummy example taking elements of the trainable weights
        realPart = self.kernel[0] * realPart
        imagPart = self.kernel[1] * imagPart
        #all the comments for the lambda layer above are valid here
        #example returning a list
        return [realPart,imagPart]

    def compute_output_shape(self,input_shape):
        #if you decide to return a list of tensors in the call method,
        #return a list of shapes here, twice the input shape:
        return [input_shape,input_shape]
        #if you stacked your results somehow in a single tensor, return a single tuple instead,
        #maybe with an additional dimension equal to 2:
        #return input_shape + (2,)
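Putting the three requirements from the question together (doubled output, trainable parameters, regularized parameters), a minimal sketch with tf.keras could look like the following. DoublingActivation, its l2 argument and the math inside call are placeholders chosen for illustration; they are not the formula from the paper.

```python
import tensorflow as tf


class DoublingActivation(tf.keras.layers.Layer):
    """Hypothetical layer: one trainable, L2-regularized alpha per feature; output is 2x as wide."""

    def __init__(self, l2=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.l2 = l2

    def build(self, input_shape):
        # One parameter per input feature, penalized through the `regularizer` argument
        self.alpha = self.add_weight(
            name="alpha",
            shape=(int(input_shape[-1]),),
            initializer="ones",
            regularizer=tf.keras.regularizers.l2(self.l2),
            trainable=True,
        )
        super().build(input_shape)

    def call(self, x):
        # Placeholder math, NOT the activation from the paper
        first = self.alpha * x
        second = tf.tanh(self.alpha * x)
        # Concatenating along the feature axis doubles the output width
        return tf.concat([first, second], axis=-1)

    def compute_output_shape(self, input_shape):
        last = input_shape[-1]
        return tuple(input_shape[:-1]) + (None if last is None else 2 * last,)


layer = DoublingActivation()
y = layer(tf.ones((3, 4)))   # 4 input features -> 8 output features
```

The regularization penalty shows up in layer.losses, which Keras adds to the training loss automatically when the layer is part of a compiled model.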
You need to implement a "Layer", not a common activation function.
I think the implementation of pReLU in Keras would be a good example for your task. See pReLU
A lambda function in the activation worked for me. Maybe not what you want but it's one step more complicated than the simple use of a built-in activation function.
encoder_outputs = Dense(units=latent_vector_len, activation=k.layers.Lambda(lambda z: k.backend.round(k.layers.activations.sigmoid(x=z))), kernel_initializer="lecun_normal")(x)
This code changes the output of a Dense from Reals to 0,1 (ie, binary).
Keras throws a warning but the code still proves to work.
Assume I have a model like this. M1 and M2 are two layers linking left and right sides of the model.
The example model: Red lines indicate backprop directions
During training, I hope M1 can learn a mapping from L2_left activation to L2_right activation. Similarly, M2 can learn a mapping from L3_right activation to L3_left activation.
The model also needs to learn the relationship between two inputs and the output.
Therefore, I should have three loss functions for M1, M2, and L3_left respectively.
I probably can use:
model.compile(optimizer='rmsprop',
              loss={'M1': 'mean_squared_error',
                    'M2': 'mean_squared_error',
                    'L3_left': 'mean_squared_error'})
But during training, we need to provide y_true, for example:
model.fit([input_1,input_2], y_true)
In this case, the y_true is the hidden layer activations and not from a dataset.
Is it possible to build this model and train it using its hidden layer activations?
If you have only one output, you must have only one loss function.
If you want three loss functions, you must have three outputs, and, of course, three Y vectors for training.
If you want loss functions in the middle of the model, you must take outputs from those layers.
Creating the graph of your model: (if the model is already defined, see the end of this answer)
#Here, all "SomeLayer(blabla)" could be replaced by a "SomeModel" if necessary
#Example of using a layer or a model:
#M1 = SomeLayer(blablabla)(L12)
#M1 = SomeModel(L12)
from keras.models import Model
from keras.layers import *
inLef = Input((shape1))
inRig = Input((shape2))
L1Lef = SomeLayer(blabla)(inLef)
L2Lef = SomeLayer(blabla)(L1Lef)
M1 = SomeLayer(blablaa)(L2Lef) #this is an output
L1Rig = SomeLayer(balbla)(inRig)
conc2Rig = Concatenate(axis=?)([L1Rig,M1]) #Or Add, or Multiply, however you're joining the models
L2Rig = SomeLayer(nlanlab)(conc2Rig)
L3Rig = SomeLayer(najaljd)(L2Rig)
M2 = SomeLayer(babkaa)(L3Rig) #this is an output
conc3Lef = Concatenate(axis=?)([L2Lef,M2])
L3Lef = SomeLayer(blabla)(conc3Lef) #this is an output
Creating your model with three outputs:
Now that you've got your graph ready and you know what the outputs are, you create the model:
model = Model([inLef,inRig], [M1,M2,L3Lef])
model.compile(loss='mse', optimizer='rmsprop')
If you want different losses for each output, then you create a list:
#example of custom loss function, if necessary
def lossM1(yTrue,yPred):
    return keras.backend.sum(keras.backend.abs(yTrue-yPred))
#compiling with three different loss functions
model.compile(loss = [lossM1, 'mse','binary_crossentropy'], optimizer =??)
But you've got to have three different yTraining too, for training with:
model.fit([input_1,input_2], [yTrainM1,yTrainM2,y_true], ....)
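For concreteness, here is a small runnable sketch of such a three-output model using tf.keras. The layer sizes, the Dense layers standing in for SomeLayer, and the random training arrays are all assumptions made for illustration; in particular, the random targets for M1 and M2 only show the mechanics, not a sensible training signal.

```python
import numpy as np
from tensorflow.keras import layers, Model

in_left = layers.Input((8,))
in_right = layers.Input((6,))

# Left branch up to M1
l1_left = layers.Dense(16, activation="relu")(in_left)
l2_left = layers.Dense(16, activation="relu")(l1_left)
m1 = layers.Dense(16, name="M1")(l2_left)                       # output 1

# Right branch, fed by M1, up to M2
l1_right = layers.Dense(16, activation="relu")(in_right)
l2_right = layers.Dense(16, activation="relu")(layers.Concatenate()([l1_right, m1]))
l3_right = layers.Dense(16, activation="relu")(l2_right)
m2 = layers.Dense(16, name="M2")(l3_right)                      # output 2

# Final left layer, fed by M2
l3_left = layers.Dense(1, name="L3_left")(layers.Concatenate()([l2_left, m2]))  # output 3

model = Model([in_left, in_right], [m1, m2, l3_left])
model.compile(optimizer="rmsprop", loss=["mse", "mse", "mse"])  # one loss per output

n = 32
rng = np.random.default_rng(0)
x_left = rng.normal(size=(n, 8)).astype("float32")
x_right = rng.normal(size=(n, 6)).astype("float32")
targets = [rng.normal(size=(n, 16)).astype("float32"),   # stand-in for yTrainM1
           rng.normal(size=(n, 16)).astype("float32"),   # stand-in for yTrainM2
           rng.normal(size=(n, 1)).astype("float32")]    # stand-in for y_true

history = model.fit([x_left, x_right], targets, epochs=1, batch_size=8, verbose=0)
predictions = model.predict([x_left, x_right], verbose=0)
```

predictions is a list with one array per output, in the same order as the outputs passed to Model.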
If your model is already defined and you didn't create its graph like I did:
Then, you have to find in yourModel.layers[i] which ones are M1 and M2, so you create a new model like this:
M1 = yourModel.layers[indexForM1].output
M2 = yourModel.layers[indexForM2].output
newModel = Model([inLef,inRig], [M1,M2,yourModel.output])
If you want two outputs to be equal:
In this case, just subtract the two outputs in a lambda layer, and make that lambda layer an output of your model, with expected values = 0.
Using the exact same vars as before, we'll just create two additional layers to subtract outputs:
diffM1L1Rig = Lambda(lambda x: x[0] - x[1])([L1Rig,M1])
diffM2L2Lef = Lambda(lambda x: x[0] - x[1])([L2Lef,M2])
Now your model should be:
newModel = Model([inLef,inRig],[diffM1L1Rig,diffM2L2Lef,L3Lef])
And training will expect those two differences to be zero:
yM1 = np.zeros((shapeOfM1Output))
yM2 = np.zeros((shapeOfM2Output))
newModel.fit([input_1,input_2], [yM1,yM2,y_true], ...)
Trying to answer to the last part: how to make gradients only affect one side of the model.
...well.... at first that sounds unfeasible to me. But, if that is similar to "train only a part of the model", then it's totally ok by defining models that only go to a certain point and making part of the layers untrainable.
By doing that, nothing will affect those layers. If that's what you want, then you can do it:
#using the previous vars to define other models
modelM1 = Model([inLef,inRig],diffM1L1Rig)
This model above ends in diffM1L1Rig. Before compiling, you must set L2Right untrainable:
modelM1.layers[??].trainable = False
#to find which layer is the right one, you may define them using the "name" parameter, or see in the modelM1.summary() the shapes, types etc.
modelM1.compile(.....)
modelM1.fit([input_1, input_2], yM1)
This suggestion makes you train only a single part of the model. You can repeat the procedure for M2, locking the layers you need before compiling.
You can also define a full model taking all layers, and lock only the ones you want. But you won't be able (I think) to make half gradients pass by one side and half the gradients pass by the other side.
So I suggest you keep three models, the fullModel, the modelM1, and the modelM2, and you cycle them in training. One epoch each, maybe....
That should be tested....