I'm training a DCGAN model in PyTorch on the CelebA dataset (faces). Here is the architecture of the generator:
(main): Sequential(
(0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
(9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU(inplace=True)
(12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
(13): Tanh()
After training, I want to check what the generator outputs if I feed it an occluded image like this:
(size: 64x64)
As you might have guessed, the image has 3 channels, while my generator accepts a latent vector with 100 channels as its input. What is the correct way to feed this image to the generator and check the output? (I'm expecting the generator to generate only the occluded part of the image.) For reference code, see this demo file of PyTorch. I have modified it for my own needs, but it is close enough to refer to.
You can't do that directly. As you said, your network expects a 100-dimensional input, which is normally sampled from a standard normal distribution.
So the generator's job is to take this random vector and generate a 3x64x64 image that is indistinguishable from real images. I don't see any way to feed your image into the current network without modifying the architecture and retraining the new model. If you want to try a new model, you can change the input to occluded images, apply some conv / linear layers to reduce the dimensions to 100, and keep the rest of the network the same. This way the network will try to learn to generate images not from a latent vector but from a feature vector extracted from the occluded images. It may or may not work.
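To make the shape constraint concrete, here is a minimal sketch that rebuilds the generator from the printout in the question and pushes a batch of latent vectors through it. The batch size of 16 is an arbitrary choice for illustration, and the weights are untrained.

```python
import torch
import torch.nn as nn

# Generator layers copied from the printout in the question (untrained weights).
netG = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False),   # 1x1   -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),   # 4x4   -> 8x8
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),   # 8x8   -> 16x16
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),    # 16x16 -> 32x32
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),      # 32x32 -> 64x64
    nn.Tanh(),
)

# The generator expects a latent batch of shape (N, 100, 1, 1),
# sampled from a standard normal distribution.
z = torch.randn(16, 100, 1, 1)
with torch.no_grad():
    fake = netG(z)

fake_shape = tuple(fake.shape)
print(fake_shape)
```

A 3x64x64 image does not fit this input shape, which is why some extra network is needed to map an image down to 100 values first.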
EDIT: I've decided to give it a go and see whether the network can learn with this type of conditioned input vector instead of latent vectors. I've used the tutorial example you linked and made a couple of changes. First, a new network that receives the image and reduces it to 100 dimensions:
class ImageTransformer(nn.Module):
    def __init__(self):
        super(ImageTransformer, self).__init__()
        # 3x64x64 image -> 1x32x32 feature map
        self.main = nn.Sequential(
            nn.Conv2d(3, 1, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True)
        )
        # Flattened 32*32 feature map -> 100-dimensional vector
        self.linear = nn.Linear(32*32, 100)

    def forward(self, input):
        out = self.main(input).view(input.shape[0], -1)
        # Reshape to (N, 100, 1, 1), the input shape the generator expects
        return self.linear(out).view(-1, 100, 1, 1)
Just a simple convolution layer + relu + linear layer to map to 100 dimensions at the output. Note that you can try a much better network here as a better feature extractor, I just wanted to make a simple test.
fixed_input = next(iter(dataloader))[0][0:64, :, :, :]
fixed_input[:, :, 20:44, 20:44] = torch.tensor(np.zeros((24, 24), dtype=np.float32))
fixed_input = fixed_input.to(device)
This is how I modify the tensor to add a black patch over the input. Just sampled a batch to create a fixed input to track the process as it was done in the tutorial with a random vector.
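The patching step can be written as a small helper. This is only a sketch: the function name `occlude` and its parameters are made up for illustration, and the random batch stands in for real images. One thing to keep in mind: if the images are normalized to [-1, 1], as the linked tutorial does, a fill value of 0 corresponds to mid-gray rather than black, and -1 would be black.

```python
import torch

def occlude(images, top=20, left=20, size=24, fill=0.0):
    """Return a copy of `images` (N, C, H, W) with a square patch set to `fill`."""
    out = images.clone()  # leave the original batch untouched
    out[:, :, top:top + size, left:left + size] = fill
    return out

# Stand-in for a batch of images normalized to [-1, 1]
batch = torch.rand(8, 3, 64, 64) * 2 - 1
masked = occlude(batch)

# Pixels outside the patch are unchanged
untouched = torch.equal(masked[:, :, :20, :], batch[:, :, :20, :])
print(untouched)
```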
# Create the generator
netG = Generator().to(device)
netD = Discriminator().to(device)
netT = ImageTransformer().to(device)
# Apply the weights_init function to randomly initialize all weights
# to mean=0, stdev=0.2.
# Print the model
Most of the steps are the same; I just created an instance of the new transformer network. Finally, the training loop is slightly modified: the generator is not fed random vectors but the outputs of the new transformer network.
img_list = []
G_losses = []
D_losses = []
iters = 0

# The zero_grad / backward / step / label.fill_ calls follow the linked DCGAN tutorial
# (optimizerD, optimizerG and fake_label are defined there).
# The update of netT is not shown in this snippet: netT needs its own zero_grad()/step(),
# or its parameters must be handed to optimizerG, for it to be trained.
for epoch in range(num_epochs):
    for i, data in enumerate(dataloader, 0):
        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        ## Train with all-real batch
        netD.zero_grad()
        # Occluded copy of the batch; this replaces the random noise vector
        transformed = data[0].detach().clone()
        transformed[:, :, 20:44, 20:44] = torch.tensor(np.zeros((24, 24), dtype=np.float32))
        transformed = transformed.to(device)
        real_cpu = data[0].to(device)
        b_size = real_cpu.size(0)
        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
        output = netD(real_cpu).view(-1)
        errD_real = criterion(output, label)
        errD_real.backward()
        D_x = output.mean().item()

        ## Train with all-fake batch
        fake = netT(transformed)
        fake = netG(fake)
        label.fill_(fake_label)
        output = netD(fake.detach()).view(-1)
        errD_fake = criterion(output, label)
        errD_fake.backward()
        D_G_z1 = output.mean().item()
        errD = errD_real + errD_fake
        optimizerD.step()

        # (2) Update G network: maximize log(D(G(z)))
        netG.zero_grad()
        label.fill_(real_label)  # fake images count as real for the generator loss
        output = netD(fake).view(-1)
        errG = criterion(output, label)
        errG.backward()
        D_G_z2 = output.mean().item()
        optimizerG.step()

        # Output training stats
        if i % 50 == 0:
            print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
                  % (epoch, num_epochs, i, len(dataloader),
                     errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

        # Save losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())

        # Check how the generator is doing by saving G's output on fixed_input
        if (iters % 500 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = netT(fixed_input)
                fake = netG(fake).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

        iters += 1
Training was somewhat okay in terms of loss reduction. This is what I got after 5 epochs of training:
So what does this result tell us? Since the generator's inputs were not sampled randomly from a normal distribution, the generator wasn't able to learn the distribution of faces well enough to create a varied range of output faces. And since the input is a conditioned feature vector, the range of output images is limited. So, in summary, random inputs are required for the generator, even though it did learn to remove the patches :)
Bare Problem Statement:
I have trained a model A that consists of a feature extractor FE and a classification head ACH.
I want to train a model B that uses A's feature extractor FE and trains its own classification head BCH.
So far it's easy. Now, I don't want to save the entire model B, since its FE part is already saved in model A. I only want to dump BCH and, during inference:
Load model A and run its prediction.
Load B's classification head BCH.
Swap the classification head ACH with BCH.
Run prediction using this swapped state.
The PyTorch documentation only seems to talk about saving entire models. How can I achieve this?
End of problem statement
More details on the motivation of the problem:
I have a dataset of images that I want to classify, and each image can have several classes assigned to it. For example, the same image can have the class "Land Vehicle" (supercategory) and the class "Car" or "Truck" (category). Another image might have the class "Aerial Vehicle" and be a "Helicopter" or a "Plane".
Since the images, and therefore most of the features, should be the same, I wish to train one classifier for the supercategories, then freeze its feature extractor and transfer-learn the same model for the categories using the pretrained feature extractor.
Since the weights of the feature-extracting backbone are the same, I only want to save the weights of the classification head of the categories model, and thus save some precious storage and computational resources.
In general, it is quite usual to want access to only the backbone of a model in order to reuse it for other purposes, and there are several ways to do this. Keep in mind that saving a model checkpoint and loading it later means saving weights and biases and being able to load them correctly into the corresponding layers, so you first need to know which part of your model you want to save.
When you get the state of a model, you obtain a dictionary. The keys are the layer names and the values are the weights and biases. Let's see an example with an EfficientNet classifier on how to save only the backbone of a model. Basically, an EfficientNet, as in your example, is a backbone plus a fully connected layer as a head. If you only want the backbone, you want every single layer except the head, which you'll fine-tune later.
import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet
model = EfficientNet.from_name("efficientnet-b0")
print(model)
It will print the model layers and some features, basic stuff.
(_conv_stem): Conv2dStaticSamePadding(
3, 32, kernel_size=(3, 3), stride=(2, 2), bias=False
(static_padding): ZeroPad2d(padding=(0, 1, 0, 1), value=0.0)
(_bn0): BatchNorm2d(32, eps=0.001, momentum=0.010000000000000009, affine=True, track_running_stats=True)
(_blocks): ModuleList(
(0): MBConvBlock(
(_depthwise_conv): Conv2dStaticSamePadding(
32, 32, kernel_size=(3, 3), stride=[1, 1], groups=32, bias=False
(static_padding): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0)
(_bn1): BatchNorm2d(32, eps=0.001, momentum=0.010000000000000009, affine=True, track_running_stats=True)
(_se_reduce): Conv2dStaticSamePadding(
32, 8, kernel_size=(1, 1), stride=(1, 1)
(static_padding): Identity()
(_se_expand): Conv2dStaticSamePadding(
8, 32, kernel_size=(1, 1), stride=(1, 1)
(static_padding): Identity()
Now what is interesting is the final layers of this model :
(_bn1): BatchNorm2d(1280, eps=0.001, momentum=0.010000000000000009, affine=True, track_running_stats=True)
(_avg_pooling): AdaptiveAvgPool2d(output_size=1)
(_dropout): Dropout(p=0.2, inplace=False)
(_fc): Linear(in_features=1280, out_features=1000, bias=True)
(_swish): MemoryEfficientSwish()
Let's say we want to reuse this model's backbone, everything except _fc, since we would like to use the weights in another model that has the same backbone but a different head, which is not pre-trained. In this example I'll take the same backbone and add 3 heads:
class ThreeHeadEfficientNet(torch.nn.Module):
    def __init__(self, nbClasses1, nbClasses2, nbClasses3, model="efficientnet-b0", dropout_p=0.2):
        super(ThreeHeadEfficientNet, self).__init__()
        self.NBC1 = nbClasses1
        self.NBC2 = nbClasses2
        self.NBC3 = nbClasses3
        self.dropout_p = dropout_p
        self._dropout_layer = torch.nn.Dropout(p=self.dropout_p)
        self._head1 = torch.nn.Linear(1280, self.NBC1)
        self._head2 = torch.nn.Linear(1280, self.NBC2)
        self._head3 = torch.nn.Linear(1280, self.NBC3)
        # Note: include_top=False means only the backbone is created, not the head
        self.model = EfficientNet.from_name(model, include_top=False)

    def forward(self, x):
        features = self.model(x)
        res = features.flatten(start_dim=1)
        res = self._dropout_layer(res)
        res1 = self._head1(res)
        res2 = self._head2(res)
        res3 = self._head3(res)
        return res1, res2, res3
If you print the layers of this ThreeHeadEfficientNet, you'll notice that the layer names have changed slightly, from _conv_stem.weight to model._conv_stem.weight, since the backbone is now stored in an attribute named model. We therefore have to process the keys, otherwise they will mismatch: we create a new state dictionary that matches the expected keys of the new model and contains the pretrained weights and biases:
pretrained_dict = model.state_dict()  # pretrained model keys
model_dict = new_model.state_dict()  # new model keys
processed_dict = {}
for k in model_dict.keys():
    decomposed_key = k.split(".")
    if "model" in decomposed_key:
        pretrained_key = ".".join(decomposed_key[1:])
        # Build the new state dict so the new model can load the pretrained parameters without the head.
        processed_dict[k] = pretrained_dict[pretrained_key]
# strict=False is important here: the head layers are missing from the state dict,
# and we want to load the keys that are present rather than raise an error.
new_model.load_state_dict(processed_dict, strict=False)
And finally, in new_model you should have your new model with a pretrained backbone and heads to fine tune.
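The same state-dict idea also covers the "save only the head" part of the question. Below is a minimal sketch using a toy model instead of EfficientNet; the `Classifier` class, its layer sizes, and the in-memory buffer (standing in for a checkpoint file on disk) are all assumptions made for illustration.

```python
import io
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Toy stand-in: a shared feature extractor plus a classification head."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

model_a = Classifier(num_classes=2)   # supercategory model (FE + ACH)
model_b = Classifier(num_classes=5)   # category model (same FE + BCH)
model_b.backbone.load_state_dict(model_a.backbone.state_dict())

# Save ONLY B's head; a file path would normally be used instead of a buffer.
buffer = io.BytesIO()
torch.save(model_b.head.state_dict(), buffer)
buffer.seek(0)

# Inference: load A, then swap its head for B's saved head.
served = Classifier(num_classes=2)
served.load_state_dict(model_a.state_dict())
new_head = nn.Linear(16, 5)
new_head.load_state_dict(torch.load(buffer))
served.head = new_head

x = torch.randn(4, 8)
served.eval()
model_b.eval()
with torch.no_grad():
    same = torch.allclose(served(x), model_b(x))
print(same)
```

Because a submodule has its own `state_dict()`, no key renaming is needed when the head is saved and loaded as a standalone module.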
Now you should be able to fix your issues :)
For more pytorch information, please also check the forum.
Hi, I'm trying to build this model using PyTorch.
Each input consists of 20 images of size 28 x 28, which are C1 ~ Cp in the image.
Each image goes to a CNN of the same structure, and their outputs are eventually concatenated.
I'm currently struggling with feeding the multiple inputs to their respective CNN models.
Each model in the first box, with three convolutional layers, will look like this in code, but I'm not quite sure how I can feed 20 different inputs to separate models of the same structure and eventually concatenate the results.
self.features = nn.Sequential(
nn.Conv2d(1,10, kernel_size = 3, padding = 1),
nn.Conv2d(10, 14, kernel_size=3, padding=1),
nn.Conv2d(14, 18, kernel_size=3, padding=1),
nn.Linear(28*28*18, 256)
I've tried out giving a list of inputs as an input to forward function, but it ended up with an error and won't go through.
I'll be more than happy to explain further if anything is unclear.
Simply define forward as taking a list of tensors as input, then process each input with the corresponding CNN. In the example snippet, the CNNs share the same structure but don't share parameters, which is what I assume you need. You'll need to fill in the dots ... according to your specifications.
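class MyModel(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        # One independent CNN per input; ModuleList registers all of their parameters
        self.cnns = torch.nn.ModuleList([torch.nn.Sequential(...) for _ in range(20)])

    def forward(self, xs: list[Tensor]):
        # Run each input through its own CNN, then concatenate the outputs
        return torch.cat([cnn(x) for x, cnn in zip(xs, self.cnns)], dim=...)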
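A filled-in version could look like the sketch below. The layer sizes come from the question; the ReLU activations, the `nn.Flatten()` before the linear layer, the reduced output width of 32 (to keep the example light), and the class name `MultiBranchCNN` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiBranchCNN(nn.Module):
    def __init__(self, num_branches=20, out_features=32):
        super().__init__()
        # One CNN per input image: same structure, separate parameters.
        self.cnns = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 10, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(10, 14, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(14, 18, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten(),                          # (N, 18, 28, 28) -> (N, 18*28*28)
                nn.Linear(28 * 28 * 18, out_features),
            )
            for _ in range(num_branches)
        ])

    def forward(self, xs):
        # xs: list of tensors, each of shape (N, 1, 28, 28)
        outs = [cnn(x) for x, cnn in zip(xs, self.cnns)]
        return torch.cat(outs, dim=1)                  # concatenate along the feature axis

model = MultiBranchCNN()
xs = [torch.randn(2, 1, 28, 28) for _ in range(20)]
out = model(xs)
print(out.shape)
```

If the 20 images arrive as a single tensor of shape (N, 20, 28, 28), `x.split(1, dim=1)` yields the per-branch list without copying.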
Assuming each path has its own weights, maybe this could be done with grouped convolutions, although the linear layer before the fusion can cause some trouble.
P = 20
self.features = nn.Sequential(
    nn.Conv2d(1*P, 10*P, kernel_size=3, padding=1, groups=P),
    nn.Conv2d(10*P, 14*P, kernel_size=3, padding=1, groups=P),
    nn.Conv2d(14*P, 18*P, kernel_size=3, padding=1, groups=P),
    nn.Conv2d(18*P, 256*P, kernel_size=28, groups=P),  # not sure about this one
    nn.Flatten(),  # (N, 256*P, 1, 1) -> (N, 256*P), needed before the linear layer
    nn.Linear(256*P, 1024)
)
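To see why a grouped convolution behaves like P independent convolutions, the sketch below copies the weights of one grouped layer into P separate layers and compares the outputs. P is reduced to 3 here only to keep the example small.

```python
import torch
import torch.nn as nn

P = 3  # number of parallel paths (20 in the question)

grouped = nn.Conv2d(1 * P, 10 * P, kernel_size=3, padding=1, groups=P)
separate = nn.ModuleList(
    [nn.Conv2d(1, 10, kernel_size=3, padding=1) for _ in range(P)]
)

# Group g of the grouped layer owns output channels [10*g, 10*(g+1)).
with torch.no_grad():
    for g, conv in enumerate(separate):
        conv.weight.copy_(grouped.weight[10 * g:10 * (g + 1)])
        conv.bias.copy_(grouped.bias[10 * g:10 * (g + 1)])

x = torch.randn(2, P, 28, 28)  # the P input images stacked along the channel axis
with torch.no_grad():
    y_grouped = grouped(x)
    y_separate = torch.cat(
        [conv(x[:, g:g + 1]) for g, conv in enumerate(separate)], dim=1
    )

matches = torch.allclose(y_grouped, y_separate, atol=1e-5)
print(matches)
```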
def __init__(self):
    self.conv = nn.Sequential(
        nn.Conv2d(32, 64, kernel_size=5, stride=2),
        nn.Conv2d(64, 64, kernel_size=3, stride=2),
        nn.Conv2d(64, 64, kernel_size=3, stride=2),
        nn.Conv2d(64, 64, kernel_size=3, stride=2),
        nn.Conv2d(64, 64, kernel_size=3, stride=2),
    )
    conv_out_size = self._get_conv_out((32, 110, 110))
    self.fc = nn.Sequential(
        nn.Linear(conv_out_size, 1),
    )
I have this model, where everything looks fine to my eyes. However, I am told that I have to remove the bias from a convolution if the convolution is followed by a normalization layer, because the normalization layer already contains a parameter for the bias. Can you explain why, and how I can do that?
Batch normalization = gamma * normalize(x) + bias, where normalize(x) subtracts the batch mean and divides by the batch standard deviation.
So if you use a bias in the convolution layer and then again in batch normalization, the convolution's bias shifts the mean by the same constant and is cancelled out in the process of mean subtraction. The bias of the batch normalization layer already does its job.
You can just pass bias=False to your convolution layer to avoid this redundancy; the default value for bias is True in PyTorch.
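The cancellation can be checked numerically. In this sketch two convolutions share the same weights, one of them has a deliberately large bias, and both are followed by the same batch normalization layer in training mode. The layer sizes and the bias value of 5.0 are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)  # identical kernels
    conv_with_bias.bias.fill_(5.0)                    # exaggerated bias

bn = nn.BatchNorm2d(8)  # training mode: normalizes with the batch statistics
x = torch.randn(4, 3, 16, 16)

with torch.no_grad():
    out_with = bn(conv_with_bias(x))
    out_without = bn(conv_no_bias(x))

# The per-channel bias is removed by the mean subtraction inside batch norm.
same = torch.allclose(out_with, out_without, atol=1e-4)
print(same)
```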
The answer is already accepted, but I would still like to add a point here. One of the advantages of batch normalization is that it can be folded into a convolution layer. This means that, for inference, we can replace the convolution followed by the batch normalization with just one convolution that has different weights. Folding batch normalization is good practice; you can refer to the link here: Folding Batch Norm.
I have also written a Python script to illustrate this. Kindly check it.
import numpy as np

def fold_batch_norm(conv_layer, bn_layer):
    """Fold the batch normalization parameters into the weights for
    the previous layer."""
    conv_weights = conv_layer.get_weights()[0]
    # Keras stores the learnable weights for a BatchNormalization layer
    # as four separate arrays:
    # 0 = gamma (if scale == True)
    # 1 = beta (if center == True)
    # 2 = moving mean
    # 3 = moving variance
    bn_weights = bn_layer.get_weights()
    gamma = bn_weights[0]
    beta = bn_weights[1]
    mean = bn_weights[2]
    variance = bn_weights[3]
    epsilon = 1e-7
    new_weights = conv_weights * gamma / np.sqrt(variance + epsilon)
    param = conv_layer.get_config()
    # Handle both cases: convolution with and without its own bias
    if param['use_bias']:
        bias = conv_layer.get_weights()[1]
        new_bias = beta + (bias - mean) * gamma / np.sqrt(variance + epsilon)
    else:
        new_bias = beta - mean * gamma / np.sqrt(variance + epsilon)
    return new_weights, new_bias
You can consider this idea in your future projects as well. Cheers :)
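For comparison, the same folding can be sketched in PyTorch. The helper name `fold_conv_bn` is made up for illustration, and the sketch only applies to inference (eval mode), where batch norm uses its running statistics. It builds a new convolution with bias=True, so it also handles an original convolution that was created without a bias.

```python
import torch
import torch.nn as nn

def fold_conv_bn(conv, bn):
    """Return a single Conv2d equivalent to bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
    return fused

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)
with torch.no_grad():
    # Give batch norm non-trivial statistics and parameters
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1, 1)
bn.eval()

fused = fold_conv_bn(conv, bn).eval()
x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    close = torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
print(close)
```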
If the pre-trained network doesn't have bias in conv2d layer [use_bias = false], folding batchnorm would require it to use bias.
Is there an easy way to change the use_bias config in pre-trained keras network ?
layer.set_weights(fold_batch_norm(..)) won't work since original weights didn't have bias.
I have two PyTorch models that are equivalent (I think), the only difference between them is the padding:
import torch
import torch.nn as nn
i = torch.arange(9, dtype=torch.float).reshape(1,1,3,3)
# First model:
model1 = nn.Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode='reflection')
# tensor([[[[-0.6095, -0.0321, 2.2022],
# [ 0.1018, 1.7650, 5.5392],
# [ 1.7988, 3.9165, 5.6506]]]], grad_fn=<MkldnnConvolutionBackward>)
# Second model:
model2 = nn.Sequential(nn.ReflectionPad2d((1, 1, 1, 1)),
nn.Conv2d(1, 1, kernel_size=3))
# tensor([[[[1.4751, 1.5513, 2.6566],
# [4.0281, 4.1043, 5.2096],
# [2.6149, 2.6911, 3.7964]]]], grad_fn=<MkldnnConvolutionBackward>)
I was wondering why and when you use both approaches, the output of both is different but as I see it they should be the same, because the padding is of type reflection.
Would appreciate some help in understanding it.
After what @Ash said, I wanted to check whether or not the weights had an influence, so I pinned all of them to the same value. There is still a difference between the two methods:
import torch
import torch.nn as nn
i = torch.arange(9, dtype=torch.float).reshape(1,1,3,3)
# First model:
model1 = nn.Conv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode='reflection')
model1.weight.data = torch.full(model1.weight.data.shape, 0.4)
# tensor([[[[ 3.4411, 6.2411, 5.0412],
# [ 8.6411, 14.6411, 11.0412],
# [ 8.2411, 13.4411, 9.8412]]]], grad_fn=<MkldnnConvolutionBackward>)
# Parameter containing:
# tensor([[[[0.4000, 0.4000, 0.4000],
# [0.4000, 0.4000, 0.4000],
# [0.4000, 0.4000, 0.4000]]]], requires_grad=True)
# Second model:
model2 = [nn.ReflectionPad2d((1, 1, 1, 1)),
          nn.Conv2d(1, 1, kernel_size=3)]
model2[1].weight.data = torch.full(model2[1].weight.data.shape, 0.4)
model2 = nn.Sequential(*model2)
# tensor([[[[ 9.8926, 11.0926, 12.2926],
# [13.4926, 14.6926, 15.8926],
# [17.0926, 18.2926, 19.4926]]]], grad_fn=<MkldnnConvolutionBackward>)
# Parameter containing:
# tensor([[[[0.4000, 0.4000, 0.4000],
# [0.4000, 0.4000, 0.4000],
# [0.4000, 0.4000, 0.4000]]]], requires_grad=True)
the output of both is different but as I see it they should be the same
I don't think that the different outputs that you get are only related to how the reflective padding is implemented. In the code snippet that you provide, the values of the weights and biases of the convolutions from model1 and model2 differ, since they are initialized randomly and you don't seem to fix their values in the code.
Following your new edit, it seems that for versions prior to 1.5, looking at the implementation of the forward pass in <your_torch_install>/nn/modules/conv.py shows that "reflection" is not supported. It won't complain about arbitrary strings instead of "reflection" either, but will default to zero-padding.
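As a point of comparison, in recent PyTorch releases the supported spelling is padding_mode='reflect', and unknown padding modes are rejected when the layer is constructed. A minimal sketch, assuming such a release, shows that the built-in mode and an explicit ReflectionPad2d agree once both convolutions share the same weights and bias:

```python
import torch
import torch.nn as nn

x = torch.arange(9, dtype=torch.float).reshape(1, 1, 3, 3)

# Built-in reflection padding (note the spelling: 'reflect')
conv_builtin = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='reflect')

# Explicit padding layer followed by an unpadded convolution
conv_explicit = nn.Sequential(nn.ReflectionPad2d(1), nn.Conv2d(1, 1, kernel_size=3))

# Copy weight AND bias so that only the padding mechanism differs
conv_explicit[1].load_state_dict(conv_builtin.state_dict())

with torch.no_grad():
    same = torch.allclose(conv_builtin(x), conv_explicit(x))
print(same)
```

Pinning only the weights, as in the question's edit, still leaves the randomly initialized biases different, so copying the full state dict is the safer comparison.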
I am trying to make a simple GAN to generate digits from the MNIST dataset. However, when I get to training (which is custom), I get this annoying warning, which I suspect is the reason the model is not training the way I'm used to.
Keep in mind this is all in TensorFlow 2.0, using its default eager execution.
GET THE DATA(not that important)
(train_images,train_labels),(test_images,test_labels) = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1).astype('float32')
train_images = (train_images - 127.5) / 127.5 # Normalize the images to [-1, 1]
train_dataset = tf.data.Dataset.from_tensor_slices((train_images,train_labels)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
GENERATOR MODEL(This is where the Batch Normalization is at)
def make_generator_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(7*7*256, use_bias=False, input_shape=(100,)))
    model.add(tf.keras.layers.Reshape((7, 7, 256)))
    assert model.output_shape == (None, 7, 7, 256) # Note: None is the batch size
    model.add(tf.keras.layers.Conv2DTranspose(128, (5, 5), strides=(1, 1), padding='same', use_bias=False))
    assert model.output_shape == (None, 7, 7, 128)
    model.add(tf.keras.layers.Conv2DTranspose(64, (5, 5), strides=(2, 2), padding='same', use_bias=False))
    assert model.output_shape == (None, 14, 14, 64)
    model.add(tf.keras.layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh'))
    assert model.output_shape == (None, 28, 28, 1)
    return model
DISCRIMINATOR MODEL (likely not that important)
def make_discriminator_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same'))
    model.add(tf.keras.layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'))
    return model
INSTANTIATE THE MODELS(likely not that important)
generator = make_generator_model()
discriminator = make_discriminator_model()
DEFINE THE LOSSES(maybe the generator loss is important since that is where the gradient comes from)
def generator_loss(generated_output):
    return tf.nn.sigmoid_cross_entropy_with_logits(labels = tf.ones_like(generated_output), logits = generated_output)

def discriminator_loss(real_output, generated_output):
    # [1,1,...,1] with real output since it is true and we want our generated examples to look like it
    real_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(real_output), logits=real_output)
    # [0,0,...,0] with generated images since they are fake
    generated_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(generated_output), logits=generated_output)
    total_loss = real_loss + generated_loss
    return total_loss
MAKE THE OPTIMIZERS(likely not important)
generator_optimizer = tf.optimizers.Adam(1e-4)
discriminator_optimizer = tf.optimizers.Adam(1e-4)
RANDOM NOISE FOR THE GENERATOR(likely not important)
noise_dim = 100
num_examples_to_generate = 16
# We'll re-use this random vector used to seed the generator so
# it will be easier to see the improvement over time.
random_vector_for_generation = tf.random.normal([num_examples_to_generate,
                                                 noise_dim])
A SINGLE TRAIN STEP (this is where I get the error)
def train_step(images):
    # generating noise from a normal distribution
    noise = tf.random.normal([BATCH_SIZE, noise_dim])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)
        real_output = discriminator(images[0], training=True)
        generated_output = discriminator(generated_images, training=True)
        gen_loss = generator_loss(generated_output)
        disc_loss = discriminator_loss(real_output, generated_output)
    # This line >>>>>
    gradients_of_generator = gen_tape.gradient(gen_loss, generator.variables)
    # <<<<< This line
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.variables)
    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.variables))
THE FULL TRAIN(not important except that it calls train_step)
def train(dataset, epochs):
for epoch in range(epochs):
start = time.time()
for images in dataset:
epoch + 1,
# saving (checkpoint) the model every 15 epochs
if (epoch + 1) % 15 == 0: = checkpoint_prefix)
print ('Time taken for epoch {} is {} sec'.format(epoch + 1,
# generating after the final epoch
train(train_dataset, EPOCHS)
The error I get is as follows,
W0330 19:42:57.366302 4738405824] Gradients does
not exist for variables ['batch_normalization_v2_54/moving_mean:0',
'batch_normalization_v2_56/moving_variance:0'] when minimizing the
And I get an image from the generator which looks like this:
which is kinda what I would expect without the normalization. Everything would clump to one corner because there are extreme values.
The problem is here:
gradients_of_generator = gen_tape.gradient(gen_loss, generator.variables)
You should only be getting gradients for the trainable variables. So you should change it to
gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
The same goes for the three lines that follow. The variables field includes things like the running averages that batch norm uses during inference. Because the loss does not depend on them during training, there are no sensible gradients defined for them, and asking for those gradients is what triggers the "Gradients does not exist" warning.
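The difference between the two fields can be seen on a single batch normalization layer. This is a minimal sketch assuming a TensorFlow 2.x install; the input shape and the toy loss are arbitrary.

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal([8, 4])

with tf.GradientTape() as tape:
    y = bn(x, training=True)            # building the layer creates its variables
    loss = tf.reduce_mean(tf.square(y))

all_vars = bn.variables                 # gamma, beta, moving_mean, moving_variance
train_vars = bn.trainable_variables     # gamma, beta only

# Asking for gradients of the moving statistics yields None for those entries.
grads_all = tape.gradient(loss, all_vars)
none_count = sum(g is None for g in grads_all)

print(len(all_vars), len(train_vars), none_count)
```

Passing `trainable_variables` to both `tape.gradient` and `apply_gradients` avoids the `None` entries altogether.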