Merge specific input coordinates in Keras - python-3.x

I have a large input vector (1000 features) to a Sequential model. The model is mainly a dense network.
I know that features 1-50 are coordinate-wise highly correlated to features 51-100 (1 with 51, 2 with 52 etc.) and I want to take advantage of that.
Is there a way to add a layer to my existing model that reflects this (joining inputs 1 and 51 into one neuron, 2 and 52, etc.)?
Or is the only option to change the input structure to 50 tensors (of 1x2) plus one large vector of 900 features? (I would like to avoid that, since it means rewriting my feature preparation code.)

I think the first dense layer would discover this relationship on its own, provided you define and train the model properly. However, if you would like to process the first 100 features separately, one alternative is to use the Keras functional API and define two Input layers, one for the first 100 features and another for the remaining 900:
from keras.layers import Input, Dense, concatenate
from keras.models import Model
input_100 = Input(shape=(100,))
input_900 = Input(shape=(900,))
Now you can process each one separately. For example, you can define two separate Dense layers connected to each one and then merge their outputs:
dense_100 = Dense(50, activation='relu')(input_100)
dense_900 = Dense(200, activation='relu')(input_900)
concat = concatenate([dense_100, dense_900])
# define the rest of your model ...
model = Model(inputs=[input_100, input_900], outputs=[the_outputs_of_model])
Of course, you need to feed the input layers separately. For that you can easily slice the training data:
model.fit([X_train[:,:100], X_train[:,100:]], y_train, ...)
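Note that the model also needs to be compiled before fitting; a minimal sketch (the loss and optimizer here are placeholders, pick whatever suits your task):

model.compile(optimizer='adam', loss='mse')
model.fit([X_train[:, :100], X_train[:, 100:]], y_train, epochs=10)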
Update: If you specifically want features 1 and 51, 2 and 52, etc. to share a separate neuron (at least, I can't comment on how effective that is without experimenting on the data), you can use a LocallyConnected1D layer with a kernel size of 1 and a single filter (i.e. it behaves the same as applying a separate Dense layer to each pair of related features):
from keras.layers import LocallyConnected1D, Reshape, Flatten

input_50_2 = Input(shape=(50, 2))
local_out = LocallyConnected1D(1, 1, activation='relu')(input_50_2)
# local_out has shape (None, 50, 1), so reshape it before merging:
local_reshaped = Reshape((50,))(local_out)
# or use the following:
# local_reshaped = Flatten()(local_out)
concat = concatenate([local_reshaped, dense_900])
# define the rest of your model...
# pair feature i with feature i+50 along a new last axis -> shape (num_samples, 50, 2)
X_train_50_2 = np.stack([X_train[:, :50], X_train[:, 50:100]], axis=-1)
model.fit([X_train_50_2, X_train[:,100:]], y_train, ...)
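As a quick sanity check on that slicing (hypothetical toy data, not from the original question), you can verify that feature i ends up paired with feature i+50:

import numpy as np
X_train = np.arange(10 * 1000, dtype='float32').reshape(10, 1000)
X_train_50_2 = np.stack([X_train[:, :50], X_train[:, 50:100]], axis=-1)
print(X_train_50_2.shape)  # (10, 50, 2)
print(X_train_50_2[0, 0])  # [ 0. 50.] -> features 0 and 50 are paired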

Related

How to add output dimensions to `nn.Linear` while freezing the original dimensions?

I am implementing an incremental learning task with PyTorch. Let's say, in a simple scenario, the number of base classes is 5 and the number of incremental classes is 2; that is, I want the model to incrementally learn 2 new classes each time.
Suppose the model is composed of a feature extractor (resnet18) and a classifier, a one-layer MLP mlp = nn.Linear(126,5). To classify the novel classes, I must add 2 extra output neurons responsible for the 2 incremental classes. That is to say, I want a new classifier mlp_inc = nn.Linear(126,7). But, importantly, I want to freeze the trained weights (126 * 5) for the base classes and only update the parameters (126 * 2) for the incremental classes.
A straightforward way is to concatenate the outputs of the base classifier and the incremental classifier:
self.mlp_base = nn.Linear(126, 5)
self.mlp_inc = nn.Linear(126, 2)

def forward(self, x):
    x_base = self.mlp_base(x)
    x_inc = self.mlp_inc(x)
    output = torch.cat((x_base, x_inc), 1)
    return output
But this approach adds a new module self.mlp_inc to the original model. Denote mlp_base1 and mlp_inc1 as the trained classifiers for incremental task 1.
When adapting to a later incremental task (another 2 novel classes, task 2), I cannot directly merge mlp_base1 and mlp_inc1 into a single mlp_base and load their state_dicts into it. That means I would have to keep adding mlp_inc2, mlp_inc3, ... for further tasks, which is not easily maintainable.
So something simple like the code below would be a better choice:
# for task 1
self.mlp = nn.Linear(126, 5)
# for task 2
self.mlp = nn.Linear(126, 5 + 2)
# self.mlp.load_state_dict(...)   # load the trained parameters for the 5 base classes
# self.mlp.requires_grad_(False)  # freeze the parameters for the 5 base classes
But this does not seem directly achievable.
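The question is left unanswered here, but one common workaround (my own suggestion, not from the post) is to widen the layer, copy the trained rows in, and keep them frozen by zeroing their gradients with a hook; a minimal sketch:

import torch
import torch.nn as nn

n_feat, n_base, n_inc = 126, 5, 2
mlp_old = nn.Linear(n_feat, n_base)      # trained task-1 classifier
mlp = nn.Linear(n_feat, n_base + n_inc)  # widened classifier for task 2

with torch.no_grad():  # copy the trained base-class rows
    mlp.weight[:n_base] = mlp_old.weight
    mlp.bias[:n_base] = mlp_old.bias

def zero_base_grad(grad):  # zero the gradient of the base-class rows
    grad = grad.clone()
    grad[:n_base] = 0
    return grad

mlp.weight.register_hook(zero_base_grad)
mlp.bias.register_hook(zero_base_grad)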

Why, before embedding, do the items have to be sequential, starting at zero?

I am learning collaborative filtering from this blog, Deep Learning With Keras: Recommender Systems.
The tutorial is good and the code works well. Here is my code.
One thing confuses me. The author said:
The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later).
from sklearn.preprocessing import LabelEncoder

user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()
But he didn't seem to mention the reason, and I don't know why this is needed. Can someone explain it to me?
Embedding indices are assumed to be sequential.
The first argument of Embedding is the input dimension.
Embedding assumes that the maximum value in the input is input dimension - 1 (indices start from 0), so any input index at or above the input dimension is out of range for the lookup table.
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding?hl=ja
As an example, the following code will generate valid embeddings only for the input [4, 3] and not for the input [7, 8], since the input dimension is 5. I think it is clearer to explain this with TensorFlow:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
model.add(Embedding(5, 1, input_length=2))
input_array = np.array([[4, 3], [7, 8]])
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
You can increase the input dimension to 9 and then you will get valid embeddings for both inputs.
You could always set the input dimension to the maximum ID + 1 in the original data set, but this is not efficient.
It is actually similar to one-hot encoding, where making the IDs sequential saves a great amount of memory.
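To make the memory point concrete, here is a small illustration (hypothetical IDs, not from the blog): without re-encoding, the embedding table needs one row per possible ID up to the maximum, even for IDs that never occur.

from sklearn.preprocessing import LabelEncoder
import numpy as np

user_ids = np.array([17, 423, 9000017, 423, 17])  # sparse, non-sequential IDs
enc = LabelEncoder()
dense_ids = enc.fit_transform(user_ids)           # -> [0, 1, 2, 1, 0]

# Without encoding: Embedding(9000018, dim) has millions of unused rows.
# After encoding:   Embedding(len(enc.classes_), dim) needs only 3 rows.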

Tensorflow: Summaries defined in function not accessible in tensorboard

I have a graph and a set of custom functions that define multilayer RNNs according to an input list which specifies the number of units in each layer. For instance:
def BuildLayers(....):
    # takes inputs, list of layer sizes, mask information, etc.
    #
    # invokes BuildLayer(...) several times
    #
    # returns RNN output and states of last layer
BuildLayers loops through a more detailed function which builds and returns individual layers:
def BuildLayer(....):
    # takes individual layer size, output of previous layer, etc.
    #
    # handles bookkeeping of RNNCells, wrappers, reshaping, etc.
    # **Important! Defines scope for each layer**
    #
    # returns RNN output and states of last layer
And ultimately this would be called in a function that defines a graph and runs it in a session:
def Experiment(parameters):
    tf.reset_default_graph()
    graph = tf.Graph()
    with graph.as_default():
        #
        # Placeholders
        # BuildLayers(...)
        # Loss function definitions
        # optimizer definitions
    with tf.Session(graph=graph) as session:
        #
        # Loop through epochs:
        # etc
I.e., if the layer-size parameter is [16, 32, 16], we end up with an RNN that has a cell of 16 units in layer 1 (scoped as layer1), 32 units in layer 2 (scoped appropriately), 16 units in layer 3 (scoped), etc.
This seems to work fine: a casual inspection of the graph in TensorBoard looks correct, the nodes look correct, the thing trains, etc.
Problem: how can I add histogram summaries, e.g. of kernel weights and biases, to that function definition? I've done so naively, as such:
def buildLayer(numUnits, numLayer, input, lengths):
    name = 'layer' + "{0:0=2d}".format(numLayer)
    with tf.variable_scope(name):
        cellfw = tf.contrib.rnn.GRUCell(numUnits, activation=tf.nn.tanh)
        cellbw = tf.contrib.rnn.GRUCell(numUnits, activation=tf.nn.tanh)
        outputs, state = tf.nn.bidirectional_dynamic_rnn(cell_fw=cellfw,
            cell_bw=cellbw, inputs=input, dtype=tf.float32,
            sequence_length=lengths)
        outputs = tf.concat([outputs[0], outputs[1]], axis=2)
        FwKernel = tf.get_default_graph().get_tensor_by_name(
            name + '/bidirectional_rnn/fw/gru_cell/gates/kernel:0')
        FwKernel_sum = tf.summary.histogram("FwKernel", FwKernel, 'rnn')
    return outputs, state
And then, at the end of the graph definition, I assumed these summaries would be caught up in the
merged = tf.summary.merge_all()
statement. They aren't. I'm confused by this behavior. I can see the histogram summary definitions on a visual inspection of the graph in TensorBoard: they're there. But they don't seem to be reaching the merge, and so they are never accessible in TensorBoard as histograms per se.
How do I get summaries which are defined in a function to show up in TensorBoard, preferably through a merge, and without passing them around through function calls like excess baggage?
The least painful way I have found to avoid this is to pass a single list (e.g. summaries) through each function, and within the buildLayer function, to append or extend that list with all desired histogram summaries.
Then, in the main graph definition, rather than a merge_all
merged = tf.summary.merge_all()
instead use a merge and pass the list in as the argument
merged = tf.summary.merge(summaries)
This has the disadvantage of not actually being a merge_all: if you have defined other summaries (typically scalar summaries for loss functions, at least), you have to manually append them to the summaries list, or carry around two merge objects, or something similar, which misses the self-advertised point of merge_all.
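As a sketch of that workaround (the extra summaries argument is my own naming, and the RNN construction is elided since it's identical to the code above):

def buildLayer(numUnits, numLayer, input, lengths, summaries):
    name = 'layer' + "{0:0=2d}".format(numLayer)
    with tf.variable_scope(name):
        # ... build cellfw, cellbw, and the bidirectional RNN exactly as before ...
        FwKernel = tf.get_default_graph().get_tensor_by_name(
            name + '/bidirectional_rnn/fw/gru_cell/gates/kernel:0')
        summaries.append(tf.summary.histogram("FwKernel", FwKernel))
    return outputs, state

# in the main graph definition:
summaries = []
# ... call buildLayer(..., summaries) for each layer ...
merged = tf.summary.merge(summaries)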
I leave this here as an answer to my own question because it might help someone, but will pointedly not accept it because I am hoping to be shown a better way.
Most likely the problem is that the summaries are created inside the with graph.as_default(): context. The summary operations are then added to that graph's _collections["SUMMARIES"] list. But when you call merge_all() you are no longer in that context (which had set graph as the default), so merge_all() looks for summaries in the default graph that was created when you imported TensorFlow, which is probably empty.
To fix the issue, simply call merge_all() within the same with graph.as_default(): context.
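In terms of the Experiment skeleton above, the fix is just to move the merge inside the context:

graph = tf.Graph()
with graph.as_default():
    # Placeholders, BuildLayers(...), losses, optimizer, summaries ...
    merged = tf.summary.merge_all()  # called while `graph` is still the default

with tf.Session(graph=graph) as session:
    # session.run fetches can now include `merged`
    pass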
Here are some relevant code links:
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/summary/summary.py#L293
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/framework/ops.py#L5649

Confused about how to implement time-distributed LSTM + LSTM

After much reading and diagramming, I think I've come up with a model that I can use as the foundation for more testing on which parameters and features I need to tweak. However, I am confused about how to implement the following test case (all numbers are orders of magnitude smaller than the final model, but I want to start small):
Input data: 5000x1 time series vector, split into 5 epochs of 1000x1
For each time step, 3 epochs' worth of data will be put through 3 time-distributed copies of a bidirectional LSTM layer, and each of those will output a vector of 10x1 (10 features extracted), which will then be taken as the input for a second bidirectional LSTM layer.
For each time step, the first and last labels are ignored; the center one is what is desired.
Here's what I've come up with, which does compile. However, looking at model.summary(), I think I'm missing the fact that I want the first LSTM to be run on 3 of the input sequences for each output time step. What am I doing wrong?
model = Sequential()
model.add(TimeDistributed(
    Bidirectional(LSTM(11, return_sequences=True, recurrent_dropout=0.1,
                       unit_forget_bias=True),
                  input_shape=(3, 3, epoch_len), merge_mode='sum'),
    input_shape=(n_epochs, 3, epoch_len)))
model.add(TimeDistributed(Dense(7)))
model.add(TimeDistributed(Flatten()))
model.add(Bidirectional(LSTM(12, return_sequences=True, recurrent_dropout=0.1,
                             unit_forget_bias=True), merge_mode='sum'))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Since your question is a bit confusing, I'll make the following assumptions.
You have one time series of 5000 time steps, each step with one feature. Shape (1, 5000, 1).
The main part of the answer to your question: you want to run a "sliding window" case, with the window size equal to 3000 and the window stride equal to 1000.
You want the window to be divided into 3 internal time series, each with 1000 steps and one feature per step. Each of these series enters the same LSTM as an independent series (which is equivalent to having 3 copies of the LSTM). Shape (slidingWindowSteps, 3, 1000, 1).
Important: from these 3 series, you want 3 outputs without length and with 10 features. Shape (1, 3, 10). (Your image says 1x10, but your text says 10x1; I'm assuming the image is correct.)
You want these 3 outputs to be merged into a single sequence of 3 steps, shape (1, 3, 10).
You want the LSTM that processes this 3-step sequence to also return a 3-step sequence.
Preparing for the sliding window case:
In a sliding window case it's unavoidable to duplicate data, so you first need to rework your input.
Taking the initial time series (1, 5000, 1), we need to split it properly into a batch containing samples with 3 groups of 1000. Here I do this for X only; you will have to do a similar thing for Y.
import numpy as np

numberOfOriginalSequences = 1
totalSteps = 5000
features = 1

# example of original input with 5000 steps
originalSeries = np.array(
    range(numberOfOriginalSequences * totalSteps * features)
).reshape((numberOfOriginalSequences, totalSteps, features))

windowSize = 3000
windowStride = 1000
totalWindowSteps = ((totalSteps - windowSize) // windowStride) + 1

# at first, let's keep these dimensions for better understanding
processedSequences = np.empty((numberOfOriginalSequences,
                               totalWindowSteps,
                               windowSize,
                               features))

for seq in range(numberOfOriginalSequences):
    for winStep in range(totalWindowSteps):
        start = winStep * windowStride
        end = start + windowSize
        processedSequences[seq, winStep, :, :] = originalSeries[seq, start:end, :]

# now we reshape the array to turn each window step into an independent sequence:
totalSamples = numberOfOriginalSequences * totalWindowSteps
groupsInWindow = windowSize // windowStride
processedSequences = processedSequences.reshape((totalSamples,
                                                 groupsInWindow,
                                                 windowStride,
                                                 features))

print(originalSeries)
print(processedSequences)
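With these toy numbers, the shapes work out as follows (a quick check, computed from the constants above):

print(processedSequences.shape)  # (3, 3, 1000, 1): 3 window steps became 3 samples,
                                 # each with 3 groups of 1000 steps and 1 feature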
Creating the model:
A few comments about your first added layer:
The model only takes into account one input_shape, and this shape is (groupsInWindow, windowStride, features). It should go on the outermost wrapper: the TimeDistributed.
You don't want to keep 1000 time steps; you want only 10 resulting features: return_sequences=False. (You can use several LSTMs in this first stage if you want more layers; in that case the earlier ones can keep the steps, and only the last one needs return_sequences=False.)
You want 10 features, so units=10
I'll use the functional API just to see the input shape in the summary, which helps with understanding things.
from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional, TimeDistributed, Dense, Lambda

intermediateFeatures = 10

inputTensor = Input((groupsInWindow, windowStride, features))
out = TimeDistributed(
    Bidirectional(
        LSTM(intermediateFeatures,
             return_sequences=False,
             recurrent_dropout=0.1,
             unit_forget_bias=True),
        merge_mode='sum'))(inputTensor)
At this point, you have eliminated the 1000 time steps. Since we used return_sequences=False, there is no need to flatten or anything like that: the data is already shaped as (samples, groupsInWindow, intermediateFeatures). The Dense layer is also not necessary, but it wouldn't be "wrong" to do it the way you did, as long as the final shape is the same.
arbitraryLSTMUnits = 12
n_classes = 17

out = Bidirectional(
    LSTM(arbitraryLSTMUnits,
         return_sequences=True,
         recurrent_dropout=0.1,
         unit_forget_bias=True),
    merge_mode='sum')(out)
out = TimeDistributed(Dense(n_classes, activation='softmax'))(out)
And if you're going to discard the borders, you can add this layer:
out = Lambda(lambda x: x[:, 1, :])(out)  # or with Sequential: model.add(Lambda(lambda x: x[:, 1, :]))
Completing the model:
model = Model(inputTensor,out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Here is how dimensions are flowing through this model.
The first dimension I list here (totalSamples) is shown as None in model.summary().
Input: (totalSamples,groupsInWindow,windowStride,features)
The Time Distributed LSTM works like this:
TimeDistributed allows a 4th dimension, which is groupsInWindow.
This dimension will be kept.
The LSTM with return_sequences=False eliminates windowStride and changes the features (windowStride, the second-to-last dimension, sits in the time-step position for this LSTM).
Result: (totalSamples, groupsInWindow, intermediateFeatures)
The other LSTM, without TimeDistributed, does not have the 4th dimension, so groupsInWindow (now second-to-last) acts as the "time steps". But return_sequences=True keeps the time steps, unlike the first LSTM. Result: (totalSamples, groupsInWindow, arbitraryLSTMUnits)
The final Dense layer, because it receives a 3D input, interprets the second dimension as if it were time-distributed and leaves it unchanged, applying itself only to the features dimension. Result: (totalSamples, groupsInWindow, n_classes)
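Summarizing that flow as comments (totalSamples appears as None in the summary; the last line only applies if you added the border-discarding Lambda):

# Input:                        (None, groupsInWindow, windowStride, features)
# TimeDistributed(Bidir LSTM):  (None, groupsInWindow, intermediateFeatures)
# Bidirectional LSTM:           (None, groupsInWindow, arbitraryLSTMUnits)
# TimeDistributed(Dense):       (None, groupsInWindow, n_classes)
# Lambda (center step only):    (None, n_classes)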

How to use hidden layer activations to construct loss function and provide y_true during fitting in Keras?

Assume I have a model like this, where M1 and M2 are two layers linking the left and right sides of the model.
The example model: Red lines indicate backprop directions
During training, I hope M1 can learn a mapping from L2_left activation to L2_right activation. Similarly, M2 can learn a mapping from L3_right activation to L3_left activation.
The model also needs to learn the relationship between two inputs and the output.
Therefore, I should have three loss functions for M1, M2, and L3_left respectively.
I probably can use:
model.compile(optimizer='rmsprop',
              loss={'M1': 'mean_squared_error',
                    'M2': 'mean_squared_error',
                    'L3_left': 'mean_squared_error'})
But during training, we need to provide y_true, for example:
model.fit([input_1,input_2], y_true)
In this case, the y_true would be hidden layer activations, not values from a dataset.
Is it possible to build this model and train it using its hidden layer activations?
If you have only one output, you must have only one loss function.
If you want three loss functions, you must have three outputs, and, of course, three Y vectors for training.
If you want loss functions in the middle of the model, you must take outputs from those layers.
Creating the graph of your model: (if the model is already defined, see the end of this answer)
#Here, all "SomeLayer(blabla)" could be replaced by a "SomeModel" if necessary
#Example of using a layer or a model:
#M1 = SomeLayer(blablabla)(L12)
#M1 = SomeModel(L12)
from keras.models import Model
from keras.layers import *
inLef = Input((shape1))
inRig = Input((shape2))
L1Lef = SomeLayer(blabla)(inLef)
L2Lef = SomeLayer(blabla)(L1Lef)
M1 = SomeLayer(blablaa)(L2Lef) #this is an output
L1Rig = SomeLayer(balbla)(inRig)
conc2Rig = Concatenate(axis=?)([L1Rig,M1]) #Or Add, or Multiply, however you're joining the models
L2Rig = SomeLayer(nlanlab)(conc2Rig)
L3Rig = SomeLayer(najaljd)(L2Rig)
M2 = SomeLayer(babkaa)(L3Rig) #this is an output
conc3Lef = Concatenate(axis=?)([L2Lef,M2])
L3Lef = SomeLayer(blabla)(conc3Lef) #this is an output
Creating your model with three outputs:
Now you've got your graph ready and you know what the outputs are, you create the model:
model = Model([inLef,inRig], [M1,M2,L3Lef])
model.compile(loss='mse', optimizer='rmsprop')
If you want different losses for each output, then you create a list:
# example of a custom loss function, if necessary
def lossM1(yTrue, yPred):
    return keras.backend.sum(keras.backend.abs(yTrue - yPred))

# compiling with three different loss functions
model.compile(loss=[lossM1, 'mse', 'binary_crossentropy'], optimizer=??)
But then you've got to have three different training targets as well:
model.fit([input_1,input_2], [yTrainM1,yTrainM2,y_true], ....)
If your model is already defined and you didn't create its graph like I did:
Then you have to find, among yourModel.layers[i], which ones are M1 and M2, and create a new model like this:
M1 = yourModel.layers[indexForM1].output
M2 = yourModel.layers[indexForM2].output
newModel = Model([inLef,inRig], [M1,M2,yourModel.output])
If you want two outputs to be equal:
In this case, just subtract the two outputs in a lambda layer, and make that lambda layer be an output of your model, with expected values = 0.
Using the exact same vars as before, we'll just create two additional layers to subtract outputs:
diffM1L1Rig = Lambda(lambda x: x[0] - x[1])([L1Rig,M1])
diffM2L2Lef = Lambda(lambda x: x[0] - x[1])([L2Lef,M2])
Now your model should be:
newModel = Model([inLef,inRig],[diffM1L1Rig,diffM2L2Lef,L3Lef])
And training will expect those two differences to be zero:
yM1 = np.zeros((shapeOfM1Output))
yM2 = np.zeros((shapeOfM2Output))
newModel.fit([input_1,input_2], [yM1,yM2,y_true], ...)
Trying to answer the last part: how to make gradients affect only one side of the model.
...well... at first that sounds unfeasible to me. But if that amounts to "train only a part of the model", then it's totally fine: define models that only go up to a certain point and make part of the layers untrainable.
By doing that, nothing will affect those layers. If that's what you want, you can do it:
#using the previous vars to define other models
modelM1 = Model([inLef,inRig],diffM1L1Rig)
This model above ends in diffM1L1Rig. Before compiling, you must set the right-side L2 layer untrainable:
modelM1.layers[??].trainable = False
# to find which layer is the right one, define your layers with the "name" parameter,
# or check the shapes, types, etc. in modelM1.summary()
modelM1.compile(.....)
modelM1.fit([input_1, input_2], yM1)
This suggestion has you train only a single part of the model. You can repeat the procedure for M2, locking the layers you need before compiling.
You can also define a full model containing all the layers and lock only the ones you want. But you won't be able (I think) to make half the gradients pass through one side and half through the other.
So I suggest you keep three models, the fullModel, modelM1, and modelM2, and cycle through them in training. One epoch each, maybe...
That should be tested...
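For concreteness, that cycling schedule might look like this (a sketch reusing the model and target names above; n_cycles and one epoch per phase are arbitrary choices of mine):

for cycle in range(n_cycles):
    modelM1.fit([input_1, input_2], yM1, epochs=1)  # train the M1 side only
    modelM2.fit([input_1, input_2], yM2, epochs=1)  # train the M2 side only
    fullModel.fit([input_1, input_2], [yM1, yM2, y_true], epochs=1)  # joint pass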
