class Net(nn.Module):
def __init__(self):
super().__init__()
#(input channel, output channel, kenel size)
#channel is a dimension of a tensor which is a container that can house data in N dimensions (matrices)
self.conv1 = nn.Conv2d(3, 6, 5)
#shrink the image stack by pooling(kernel size, stride(shift)) and take max value per window
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
#TODO: add conv3
self.conv3 = nn.Conv2d(16, 32, 5)
#drop layer deletes 20% of the feautures to help prevent overfitting
self.drop = nn.Dropout2d(p=0.2)
#linear predicts the output as a linear function of inputs
#(output channels, height, width, batch size
#TODO:
self.fc1 = nn.Linear(16 * 16 * 5, 120)
#TODO:
self.fc1_5 = nn.Linear()
#layer(size of input, size of output)
#Linear layer=Fully connected layer
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
#F.ReLUs change negative values to 0. Apply to all stack of images.
#they are activation functions. We apply it after each liner layer.
#only used in hidden layers.
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
#Select some feautures to drop after 3rd conv to prevent overfitting
x = self.drop(F.relu(self.conv3(x)))
x = torch.flatten(x, 1) # flatten all dimensions except batch into 1-D
x = F.relu(self.fc1(x))
#TODO: add fc1_5
x = F.relu(self.fc1_5(x))
x = F.relu(self.fc2(x))
#Feed to Fully connected layer to predict class
x = self.fc3(x) # no relu b/c it's a last layer.
return x
I am using images from CIFAR10 which are of size 3x32x32.
When I ran the code before, it stopped because self.fc1 linear layer size did not work with self.conv3 I've added.
I'm also not sure what to write for self.fc1_5.
Can someone explain me how this is actually working and the solution as well?
Thank you!
I have added an extra convolutional layer and you can see it is
self.conv3 = nn.Conv2d(16, 32, 5).
Lines under the TODO are where I'm stuck at.
I updated the line to:
self.fc1 = nn.Linear(16 * 16 * 5, 120)
before, it was:
self.fc1 = nn.Linear(16 * 5 * 5, 120).
When you create a CNN for classification with a fixed input size, it's easy to figure out the size of your image by the time it has progressed through your CNN layers. Since we start with images of size [32,32] (channels are unimportant for now):
def __init__(self):
super().__init__()
#(input channel, output channel, kenel size)
#channel is a dimension of a tensor which is a container that can house data in N dimensions (matrices)
self.conv1 = nn.Conv2d(3, 6, 5) # size 28x28 - lose 2 px from each side with a kernel of size 5
#shrink the image stack by pooling(kernel size, stride(shift)) and take max value per window
self.pool = nn.MaxPool2d(2, 2) # size 14x14 - max pooling with K=2 halves the image size
self.conv2 = nn.Conv2d(6, 16, 5) # size 10x10 -> 5x5 after pooling
#TODO: add conv3
self.conv3 = nn.Conv2d(16, 32, 5) # size 1x1
#drop layer deletes 20% of the feautures to help prevent overfitting
self.drop = nn.Dropout2d(p=0.2)
#linear predicts the output as a linear function of inputs
#(output channels, height, width, batch size
self.fc1 = nn.Linear(1 * 1 * 32, 120)
self.fc1_5 = nn.Linear(120,120) # matches the output size of fc1 and input size of fc2
The CNN size losses can be negated by using padding of (K-1)//2, where K=kernel_size.
Related
Hi I am trying to understand how the following PyTorch AutoEncoder code works. The code below uses the MNIST dataset which is 28X28. My question is how the nn.Linear(128,3) parameters where chosen?
I have a dataset which is 512X512 and I would like to modify the code for this AutoEncoder to support.
class LitAutoEncoder(pl.LightningModule):
def __init__(self):
super().__init__()
self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))
def forward(self, x):
# in lightning, forward defines the prediction/inference actions
embedding = self.encoder(x)
return embedding
def training_step(self, batch, batch_idx):
# training_step defined the train loop. It is independent of forward
x, y = batch
x = x.view(x.size(0), -1)
z = self.encoder(x)
x_hat = self.decoder(z)
loss = F.mse_loss(x_hat, x)
return loss
def configure_optimizers(self):
optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
return optimizer
I am assuming input image data are in this shape: x.shape == [bs, 1, h, w], where bs is batch size. Then, x is first viewed as [bs, h*w], i.e. [bs, 28*28]. This means all pixels in an image are flattened into a 1D vector.
Then in the encoder:
nn.Linear(28*28, 128) takes flattened input of size [bs, 28*28] and outputs intermediate result of size [bs, 128]
nn.Linear(128, 3): [bs, 128] -> [bs, 3]
Then in the decoder:
nn.Linear(3, 128): [bs, 3] -> [bs, 128]
nn.Linear(128, 28*28): [bs, 128] -> [bs, 28*28]
The final output is then matched against the input.
If you want to use the exact architecture for your 512x512 images, simply change every occurrence of 28*28 in the code to 512*512. However, this is a quite infeasible choice, for these reasons:
For MNIST images, nn.Linear(28*28, 128) contains 28x28x128+128=100480 parameters, while for your images nn.Linear(512*512, 128) contains 512x512x128+128=33554560 parameters. The size is too large, and it may lead to overfitting
The intermediate data [bs, 3] uses only 3 floats to encode a 512x512 image. I don't think you can recover anything with such compression
I'd suggest looking up convolutional architectures for you purpose
Is RNN for image classification available only for gray image?
The following program works for gray image classification.
If RGB images are used, I have this error:
Expected input batch_size (18) to match target batch_size (6)
at this line loss = criterion(outputs, labels).
My data loading for train, valid and test are as follows.
input_size = 300
inputH = 300
inputW = 300
#Data transform (normalization & data augmentation)
stats = ((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
train_resize_tfms = tt.Compose([tt.Resize((inputH, inputW), interpolation=2),
tt.ToTensor(),
tt.Normalize(*stats)])
train_tfms = tt.Compose([tt.Resize((inputH, inputW), interpolation=2),
tt.RandomHorizontalFlip(),
tt.ToTensor(),
tt.Normalize(*stats)])
valid_tfms = tt.Compose([tt.Resize((inputH, inputW), interpolation=2),
tt.ToTensor(),
tt.Normalize(*stats)])
test_tfms = tt.Compose([tt.Resize((inputH, inputW), interpolation=2),
tt.ToTensor(),
tt.Normalize(*stats)])
#Create dataset
train_ds = ImageFolder('./data/train', train_tfms)
valid_ds = ImageFolder('./data/valid', valid_tfms)
test_ds = ImageFolder('./data/test', test_tfms)
from torch.utils.data.dataloader import DataLoader
batch_size = 6
#Training data loader
train_dl = DataLoader(train_ds, batch_size, shuffle = True, num_workers = 8, pin_memory=True)
#Validation data loader
valid_dl = DataLoader(valid_ds, batch_size, shuffle = True, num_workers = 8, pin_memory=True)
#Test data loader
test_dl = DataLoader(test_ds, 1, shuffle = False, num_workers = 1, pin_memory=True)
My model is as follows.
num_steps = 300
hidden_size = 256 #size of hidden layers
num_classes = 5
num_epochs = 20
learning_rate = 0.001
# Fully connected neural network with one hidden layer
num_layers = 2 # 2 RNN layers are stacked
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, num_classes):
super(RNN, self).__init__()
self.num_layers = num_layers
self.hidden_size = hidden_size
self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True, dropout=0.2)#batch must have first dimension
#our inpyt needs to have shape
#x -> (batch_size, seq, input_size)
self.fc = nn.Linear(hidden_size, num_classes)#this fc is after RNN. So needs the last hidden size of RNN
def forward(self, x):
#according to ducumentation of RNN in pytorch
#rnn needs input, h_0 for inputs at RNN (h_0 is initial hidden state)
#the following one is initial hidden layer
h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)#first one is number of layers and second one is batch size
#output has two outputs. The first tensor contains the output features of the hidden last layer for all time steps
#the second one is hidden state f
out, _ = self.rnn(x, h0)
#output has batch_size, num_steps, hidden size
#we need to decode hidden state only the last time step
#out (N, 30, 128)
#Since we need only the last time step
#Out (N, 128)
out = out[:, -1, :] #-1 for last time step, take all for N and 128
out = self.fc(out)
return out
stacked_rnn_model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)
# Loss and optimizer
criterion = nn.CrossEntropyLoss()#cross entropy has softmax at output
#optimizer = torch.optim.Adam(stacked_rnn_model.parameters(), lr=learning_rate) #optimizer used gradient optimization using Adam
optimizer = torch.optim.SGD(stacked_rnn_model.parameters(), lr=learning_rate)
# Train the model
n_total_steps = len(train_dl)
for epoch in range(num_epochs):
t_losses=[]
for i, (images, labels) in enumerate(train_dl):
# origin shape: [6, 3, 300, 300]
# resized: [6, 300, 300]
images = images.reshape(-1, num_steps, input_size).to(device)
print('images shape')
print(images.shape)
labels = labels.to(device)
# Forward pass
outputs = stacked_rnn_model(images)
print('outputs shape')
print(outputs.shape)
loss = criterion(outputs, labels)
t_losses.append(loss)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
Printing images and outputs shapes are
images shape
torch.Size([18, 300, 300])
outputs shape
torch.Size([18, 5])
Where is the mistake?
Tl;dr: You are flattening the first two axes, namely batch and channels.
I am not sure you are taking the right approach but I will write about that layer.
In any case, let's look at the issue you are facing. You have a data loader that produces (6, 3, 300, 300), i.e. batches of 6 three-channel 300x300 images. By the look of it you are looking to reshape each batch element (3, 300, 300) into (step_size=300, -1).
However instead of that you are affecting the first axis - which you shouldn't - with images.reshape(-1, num_steps, input_size). This will have the desired effect when working with a single-channel images since dim=1 wouldn't be the "channel axis". In your case your have 3 channels, therefore, the resulting shape is: (6*3*300*300//300//300, 300, 300) which is (18, 300, 300) since num_steps=300 and input_size=300. As a result you are left with 18 batch elements instead of 6.
Instead what you want is to reshape with (batch_size, num_steps, -1). Leaving the last axis (a.k.a. seq_length) of variable size. This will result in a shape (6, 300, 900).
Here is a corrected and reduced snippet:
batch_size = 6
channels = 3
inputH, inputW = 300, 300
train_ds = TensorDataset(torch.rand(100, 3, inputH, inputW), torch.rand(100, 5))
train_dl = DataLoader(train_ds, batch_size)
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, num_classes):
super(RNN, self).__init__()
# (batch_size, seq, input_size)
self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
# (batch_size, hidden_size)
self.fc = nn.Linear(hidden_size, num_classes)
# (batch_size, num_classes)
def forward(self, x):
out, _ = self.rnn(x)
out = out[:, -1, :]
out = self.fc(out)
return out
num_steps = 300
input_size = inputH*inputW*channels//num_steps
hidden_size = 256
num_classes = 5
num_layers = 2
rnn = RNN(input_size, hidden_size, num_layers, num_classes)
for x, y in train_dl:
print(x.shape, y.shape)
images = images.reshape(batch_size, num_steps, -1)
print(images.shape)
outputs = rnn(images)
print(outputs.shape)
break
As I said in the beginning I am a bit wary about this approach because you are essentially feeding your RNN a RGB 300x300 image in the form of a sequence of 300 flattened vectors... I can't say if that makes sense and terms of training and if the model will be able to learn from that. I could be wrong!
Here is example pytorch code from the website:
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
# 1 input image channel, 6 output channels, 3x3 square convolution
# kernel
self.conv1 = nn.Conv2d(1, 6, 3)
self.conv2 = nn.Conv2d(6, 16, 3)
# an affine operation: y = Wx + b
self.fc1 = nn.Linear(16 * 6 * 6, 120) # 6*6 from image dimension
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
# Max pooling over a (2, 2) window
x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
# If the size is a square you can only specify a single number
x = F.max_pool2d(F.relu(self.conv2(x)), 2)
x = x.view(-1, self.num_flat_features(x))
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
In the forward function, we simply apply a series of transformations to x, but never explicitly define which objects are part of that transformation. Yet when computing the gradient and updating the weights, Pytorch 'magically' knows which weights to update and how the gradient should be calculated.
How does this process work? Is there code analysis going on, or something else that I am missing?
Yes, there is implicit analysis on forward pass. Examine the result tensor, there is thingie like grad_fn= <CatBackward>, that's a link, allowing you to unroll the whole computation graph. And it is built during real forward computation process, no matter how you defined your network module, object oriented with 'nn' or 'functional' way.
You can exploit this graph for net analysis, as torchviz do here: https://github.com/szagoruyko/pytorchviz/blob/master/torchviz/dot.py
I got an error:
shape '[-1, 270000]' is invalid for the input of size 1440000
while running my code for a CNN structure input tensor size is 64.
Class MyNet(nn.Module):
def __init__(self):
super(MyNet, self).__init__()
self.conv1 = nn.Conv2d(3, 48, 2)
self.conv2 = nn.Conv2d(48, 108, 2)
self.conv3 = nn.Conv2d(108, 192, 2)
self.conv4 = nn.Conv2d(192, 300, 2)
self.pool = nn.MaxPool2d(2, 2)
self.fc1 = nn.Linear(300* 30* 30, 864)
self.fc2 = nn.Linear(864, 288)
self.fc3 = nn.Linear(288, 2)
def forward(self, x):
x = F.relu(self.conv1(x))
x = F.relu(self.conv2(x))
x = F.relu(self.conv3(x))
x = F.relu(self.conv4(x))
#x = self.pool(F.relu(self.conv4(x)))
x = self.pool(x)
x = x.view(-1, 300 * 30* 30)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return F.log_softmax(x)
Any idea why I am getting above error?
Because after your max pooling layer, the shape of feature map is (300, width, height), and 300*width*height != 300*30*30. If you want to reshape the tensor, you must keep the same number of elements.
The view operation which should flatten x is throwing this error, since the size of 300*30*30 is not matching your activation size. Most likely your custom dataset has a different spatial size, such that the view is failing.
Based on the shape given in the error message, it looks like your activation should have the shape [batch_size=3, channels=300, height=40, width=40], which results in 1440000 values. Try to change the input size in your linear layer to 300*40*40 like this:
self.fc1 = nn.Linear(300*40*40, 864)
and the flattening to:
x = x.view(x.size(0), 300*40*40)
Please, notify me if this doesn't work.
I'm very new to pytorch and I want to figure out how to input a matrix rather than image into CNN.
I have try it in the following way, but some errors occur.
I define my dataset as following:
class FrameDataSet(tud.Dataset):
def __init__(self, data):
targets = data['class'].values.tolist()
features = data.drop('class', axis=1).astype(np.int64).values
self.datalist = features.reshape((-1, feature_num, frame_size))
self.labellist = targets
def __getitem__(self, index):
return torch.Tensor(self.datalist[index].astype(float)), self.labellist[index]
def __len__(self):
return self.datalist.shape[0]
And my CNN is:
self.conv = nn.Sequential(
nn.Conv2d(1, 12, 3),
nn.ReLU(True),
nn.MaxPool2d(3, 3))
self.fc1 = nn.Linear(80, 100)
self.fc2 = nn.Linear(100, 30)
self.fc3 = nn.Linear(30, 5)
But when the data was inputted to CNN, the error brings:
File "/home/sparks/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 48, in conv2d
raise ValueError("Expected 4D tensor as input, got {}D tensor instead.".format(input.dim()))
Expected 4D tensor as input, got 3D tensor instead.
Your input probably missing one dimension. It should be:
(batch_size, channels, width, height)
If you only have one element in the batch, the tensor have to be in your case
e.g (1, 1, 28, 28)
because your first conv2d-layer expected a 1-channel input.