Train two model iteratively with PyTorch - pytorch

I hope to train two cascaded networks, e.g. X->Z->Y, Z=net1(X), Y=net2(Z).
I hope to optimize the parameters of these two networks iteratively, i.e., for a fixed parameter of net1, firstly train parameters of net2 using MSE(predY,Y) loss util convergence; then, use the converged MSE loss to train a iteration of net1, etc.
So, I define two optimizers for each networks respectively. My training code is below:
net1 = SimpleLinearF()
opt1 = torch.optim.Adam(net1.parameters(), lr=0.01)
loss_func = nn.MSELoss()
for itera1 in range(num_iters1 + 1):
predZ = net1(X)
net2 = SimpleLinearF()
opt2 = torch.optim.Adam(net2.parameters(), lr=0.01)
for itera2 in range(num_iters2 + 1):
predY = net2(predZ)
loss = loss_func(predY,Y)
if itera2 % (num_iters2 // 2) == 0:
print('iteration: {:d}, loss: {:.7f}'.format(int(itera2), float(loss)))
loss.backward(retain_graph=True)
opt2.step()
opt2.zero_grad()
loss.backward()
opt1.step()
opt1.zero_grad()
However, I encounter the following mistake:
RuntimeError: one of the variables needed for gradient computation has been modified by an
inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at
version 502; expected version 501 instead. Hint: enable anomaly detection to find the
operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Does anyone know why this error occurs? How should I solve this problem. Many Thanks.

I found the answer to my question after some searching on PyTorch computation graph.
Just remove the retain_graph=True and add a .detach() in net2(predZ) will solve this error.
This detach operation can cut net1 away from the computation graph of net2/optimizor2.

Related

How to get intermediate output grad in Pytorch model

we can get loss of last layer by loss = loss_fn(y_pred, y_true), and results in a loss: Tensor
then we call loss.backward() to do back propagation.
after optimizer.step() we could see updated model.parameters()
taking below example
y = Model1(x) # with optimizer1
z = Model2(y) # with optimizer2
loss = loss_fn(z, z_true)
loss.backward()
optimizer2.optimize() # update Model2 parameters
# in order to update Model1 parameters I think we should do
y.backward(grad_tensor=the_output_gradient_from_Model2)
optimizer1.optimize()
How to get the intermediate back propagation result? e.g. the gradient of output grad, which will be taken by y_pred.backward(grad_tensor=grad).
Update: The solution is setting required_grad=True and take Tensor x.grad. Thanks for the answers.
PS: The scenario is I am doing a federated learning, the model is split into 2 parts. The first part takes input and forward to second part. And it need the second part to calculate the loss and back propagate the loss to first part, so that the first part takes the loss and do its own back propagation.
I will assume you're referring to intermediate gradients when you say "loss of a specific layer".
You can access the gradient of the layer with respect to the output loss by accessing the grad attribute on the parameters of your model which require gradient computation.
Here is a simplistic setup:
>>> f = nn.Sequential(
nn.Linear(10,5),
nn.Linear(5,2),
nn.Linear(2, 2, bias=False),
nn.Sigmoid())
>>> x = torch.rand(3, 10).requires_grad_(True)
>>> f(x).mean().backward()
Navigate through all the parameters per layer:
>>> for n, c in f.named_children():
... for p in c.parameters():
... print(f'<{n}>:{p.grad}')
<0>:tensor([[-0.0054, -0.0034, -0.0028, -0.0058, -0.0073, -0.0066, -0.0037, -0.0044,
-0.0035, -0.0051],
[ 0.0037, 0.0023, 0.0019, 0.0040, 0.0050, 0.0045, 0.0025, 0.0030,
0.0024, 0.0035],
[-0.0016, -0.0010, -0.0008, -0.0017, -0.0022, -0.0020, -0.0011, -0.0013,
-0.0010, -0.0015],
[ 0.0095, 0.0060, 0.0049, 0.0102, 0.0129, 0.0116, 0.0066, 0.0077,
0.0063, 0.0091],
[ 0.0005, 0.0003, 0.0002, 0.0005, 0.0006, 0.0006, 0.0003, 0.0004,
0.0003, 0.0004]])
<0>:tensor([-0.0090, 0.0062, -0.0027, 0.0160, 0.0008])
<1>:tensor([[-0.0035, 0.0035, -0.0026, -0.0106, -0.0002],
[-0.0020, 0.0020, -0.0015, -0.0061, -0.0001]])
<1>:tensor([-0.0289, -0.0166])
<2>:tensor([[0.0355, 0.0420],
[0.0354, 0.0418]])
To supplement gradient related answer(s), it should to say that you can't get the loss of the layer, loss is model level concept, generally, you can't say, which layer is responsible for error. See, if model deep enough one can freeze any model layer, and it can still train to high accuracy.

torch.nn.CrossEntropyLoss over Multiple Batches

I am currently working with torch.nn.CrossEntropyLoss. As far as I know, it is common to compute the loss batch-wise. However, is there a possibility to compute the loss over multiple batches?
More concretely, assume we are given the data
import torch
features = torch.randn(no_of_batches, batch_size, feature_dim)
targets = torch.randint(low=0, high=10, size=(no_of_batches, batch_size))
loss_function = torch.nn.CrossEntropyLoss()
Is there a way to compute in one line
loss = loss_function(features, targets) # raises RuntimeError: Expected target size [no_of_batches, feature_dim], got [no_of_batches, batch_size]
?
Thank you in advance!
You can compute multiple cross-entropy losses but you'll need to do your own reduction. Since cross-entropy loss assumes the feature dim is always the second dimension of the features tensor you will also need to permute it first.
loss_function = torch.nn.CrossEntropyLoss(reduction='none')
loss = loss_function(features.permute(0,2,1), targets).mean(dim=1)
which will result in a loss tensor with no_of_batches entries.

Trying to accumulate gradients in Pytorch, but getting RuntimeError when calling loss.backward

I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
https://stackoverflow.com/a/62076913/1227353
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different size, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# CHANGE: we're gonna accumulate losses manually
batch_loss = 0
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
batch_loss += loss # CHANGE: accumulate!
# CHANGE: do backprop outside for loop
batch_loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train') but I had forgotten to wrap the call to batch_loss.backward() with an if phase == 'train'... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feel equivalent to setting BATCH_SIZE as 1. Furthermore, I'm creating a new loss object each time, so shouldn't the calls to backward() operate on different graphs entirely?
It seems you have two issues here, you said you couldn't have batch_size=8 because of memory limitations but later state that your labels are not of the same size. The latter seems much more important than the former. Anyway, I will try to answer your questions best I can.
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
You want to call .backward() on every loop cycle otherwise the batch will have no effect on the training. You can then call step() and zero_grad() only when batch_idx % 2 is True (i.e. for every other batch).
Here's an example which accumulates the gradient, not the loss:
model = nn.Linear(10, 3)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
ds = TensorDataset(torch.rand(100, 10), torch.rand(100, 3))
dl = DataLoader(ds, batch_size=4)
for i, (x, y) in enumerate(dl):
y_hat = model(x)
loss = F.l1_loss(y_hat, y) / 2
loss.backward()
if i % 2:
optim.step()
optim.zero_grad()
Note this approach is different to accumulating the loss, and back-propagating only all batches (or part of the batches) have gone through the network. In the example above we backpropagate every 4 datapoints and updating the model every 8 datapoints.
Oh and as a side question, does the location of where I put loss.backward() affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2?
Usually torch's built-in losses have reduction='mean' set as default. This means the loss gets averaged over all batch elements that contributed to calculating the loss. So this will depend on your loss implementation.
However if you are using gradient accumalation, then yes you will need to average your loss by the number of accumulation steps (here loss = F.l1_loss(y_hat, y) / 2). Since your gradients will be accumulated twice.
To read more about this, I recommend taking a look at this other SO post.

Multi-label classification with class weights in Keras

I have a 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive output is same(i.e 10) but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and rest 990 have output 0.
For the multi-label classification, I am using 'binary-cross entropy' as cost function and 'sigmoid' as the activation function. When I tried this rule of 0.5 as the cut-off for 1 or 0. All of them were 0. I understand this is a class imbalance problem. From this link, I understand that, I might have to create extra output labels.Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
X_train, Y_train, batch_size=32, epochs=50,verbose=0,shuffle=False)
Could anyone help me with the code here and I would also highly appreciate if you could suggest a good 'accuracy' metric for this problem?
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metric there are several possibilities: In my case I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same amount of labels=1 you could take your top 10 results and see how many percent of them are right and average this result over your batch size. I didn't find a possibility to include this algorithm as a keras metric. Instead, I wrote a callback, which calculates the metric on epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter Class is really convenient for this purpose.
Edit: If found a method to include class weights without splitting the output.
You need a numpy 2d array containing weights with shape [number classes to predict, 2 (background and signal)].
Such an array could be calculated with this function:
def calculating_class_weights(y_true):
from sklearn.utils.class_weight import compute_class_weight
number_dim = np.shape(y_true)[1]
weights = np.empty([number_dim, 2])
for i in range(number_dim):
weights[i] = compute_class_weight('balanced', [0.,1.], y_true[:, i])
return weights
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
def get_weighted_loss(weights):
def weighted_loss(y_true, y_pred):
return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))

Tensorflow- How to display accuracy rate for a linear regression model

I have a linear regression model that seems to work. I first load the data into X and the target column into Y, after that I implement the following...
X_train, X_test, Y_train, Y_test = train_test_split(
X_data,
Y_data,
test_size=0.2
)
rng = np.random
n_rows = X_train.shape[0]
X = tf.placeholder("float")
Y = tf.placeholder("float")
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
pred = tf.add(tf.multiply(X, W), b)
cost = tf.reduce_sum(tf.pow(pred-Y, 2)/(2*n_rows))
optimizer = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(cost)
init = tf.global_variables_initializer()
init_local = tf.local_variables_initializer()
with tf.Session() as sess:
sess.run([init, init_local])
for epoch in range(FLAGS.training_epochs):
avg_cost = 0
for (x, y) in zip(X_train, Y_train):
sess.run(optimizer, feed_dict={X:x, Y:y})
# display logs per epoch step
if (epoch + 1) % FLAGS.display_step == 0:
c = sess.run(
cost,
feed_dict={X:X_train, Y:Y_train}
)
print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(c))
print("Optimization Finished!")
accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0), predictions=tf.argmax(pred, 0))
print(sess.run(accuracy))
I cannot figure out how to print out the model's accuracy. For example, in sklearn, it is simple, if you have a model you just print model.score(X_test, Y_test). But I do not know how to do this in tensorflow or if it is even possible.
I think I'd be able to calculate the Mean Squared Error. Does this help in any way?
EDIT
I tried implementing tf.metrics.accuracy as suggested in the comments but I'm having an issue implementing it. The documentation says it takes 2 arguments, labels and predictions, so I tried the following...
accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0), predictions=tf.argmax(pred, 0))
print(sess.run(accuracy))
But this gives me an error...
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value accuracy/count
[[Node: accuracy/count/read = IdentityT=DT_FLOAT, _class=["loc:#accuracy/count"], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
How exactly does one implement this?
Turns out, since this is a multi-class Linear Regression problem, and not a classification problem, that tf.metrics.accuracy is not the right approach.
Instead of displaying the accuracy of my model in terms of percentage, I instead focused on reducing the Mean Square Error (MSE) instead.
From looking at other examples, tf.metrics.accuracy is never used for Linear Regression, and only classification. Normally tf.metric.mean_squared_error is the right approach.
I implemented two ways of calculating the total MSE of my predictions to my testing data...
pred = tf.add(tf.matmul(X, W), b)
...
...
Y_pred = sess.run(pred, feed_dict={X:X_test})
mse = tf.reduce_mean(tf.square(Y_pred - Y_test))
OR
mse = tf.metrics.mean_squared_error(labels=Y_test, predictions=Y_pred)
They both do the same but obviously the second approach is more concise.
There's a good explanation of how to measure the accuracy of a Linear Regression model here.
I didn't think this was clear at all from the Tensorflow documentation, but you have to declare the accuracy operation, and then initialize all global and local variables, before you run the accuracy calculation:
accuracy, accuracy_op = tf.metrics.accuracy(labels=tf.argmax(Y_test, 0), predictions=tf.argmax(pred, 0))
# ...
init_global = tf.global_variables_initializer
init_local = tf.local_variables_initializer
sess.run([init_global, init_local])
# ...
# run accuracy calculation
I read something on Stack Overflow about the accuracy calculation using local variables, which is why the local variable initializer is necessary.
After reading the complete code you posted, I noticed a couple other things:
In your calculation of pred, you use
pred = tf.add(tf.multiply(X, W), b). tf.multiply performs element-wise multiplication, and will not give you the fully connected layers you need for a neural network (which I am assuming is what you are ultimately working toward, since you're using TensorFlow). To implement fully connected layers, where each layer i (including input and output layers) has ni nodes, you need separate weight and bias matrices for each pair of successive layers. The dimensions of the i-th weight matrix (the weights between the i-th layer and the i+1-th layer) should be (ni, ni + 1), and the i-th bias matrix should have dimensions (ni + 1, 1). Then, going back to the multiplication operation - replace tf.multiply with tf.matmul, and you're good to go. I assume that what you have is probably fine for a single-class linear regression problem, but this is definitely the way you want to go if you plan to solve a multiclass regression problem or implement a deeper network.
Your weight and bias tensors have a shape of (1, 1). You give the variables the initial value of np.random.randn(), which according to the documentation, generates a single floating point number when no arguments are given. The dimensions of your weight and bias tensors need to be supplied as arguments to np.random.randn(). Better yet, you can actually initialize these to random values in Tensorflow: W = tf.Variable(tf.random_normal([dim0, dim1], seed = seed) (I always initialize random variables with a seed value for reproducibility)
Just a note in case you don't know this already, but non-linear activation functions are required for neural networks to be effective. If all your activations are linear, then no matter how many layers you have, it will reduce to a simple linear regression in the end. Many people use relu activation for hidden layers. For the output layer, use softmax activation for multiclass classification problems where the output classes are exclusive (i.e., where only one class can be correct for any given input), and sigmoid activation for multiclass classification problems where the output classes are not exlclusive.

Resources