How to do a weighted pooling in Mxnet? - python-3.x

I want to do a 2d convolutional operation that uses same 1x2x4 weight on every channel.
(Note: the input height & width are bigger than our kernel, so I can't just use a dot product.)
How can I do this is mxnet?
I tried to use the same instance of a signle 2d conv layer by concatenating it on every channel, but it is incredibly slow.
def Concat(*args, axis=1, **kwargs):
net = nn.HybridConcatenate(axis=axis,**kwargs)
net.add(*args)
return net
def Seq(*args):
net = nn.HybridSequential()
net.add(*args)
return net
class Trim_D1(nn.HybridBlock):
def __init__(self, from_, to, **kwargs):
super(Trim_D1, self).__init__(**kwargs)
self.from_ = from_
self.to = to
def forward(self, x):
return x[:,self.from_:self.to]
PooPool = nn.Conv2D(kernel_size=(2,4), strides=(2, 4), channels=1, activation=None, use_bias=False, weight_initializer=mx.init.Constant(1/8))
conc = ()
for i in range(40):
conc += Seq(
Trim_D1(i,i+1),
PooPool
),
WeightedPool= Concat(*conc)
Ideally I would also want my kernel weights to sum up to 1 in order to resemble the weighted average pooling.
Edit: I think I know how to do this. I'm going to edit Conv2D and _Conv source codes so that instead of creating weights of CxHxW dimension it creates a weight of 1xHxW dimension and uses a broadcasting during the convolutional operation. In order for weights to sum up to 1, additionally a softmax operation has to be applied.

Ok, apparently the weights are of in_channels x out_channels x H x W dimensions and broadcasting is not allowed during the convolutional operation. We could fix out_channels to 1 by using the num_groups same as the output channels, as for input channels, we can simply broadcast the same weight n number of times.
In _Conv.__init__ during initialization I discarded the first two dimensions so our kernel is only H x W now:
self.weight = Parameter('weight', shape=wshapes[1][2:],
init=weight_initializer,
allow_deferred_init=True)
In _Conv.hybrid_forward I am flattening our weight to 1D in order to perform softmax and then restore to the original 2D shape. Then I expand first two dimensions and repeat the first dimension as mentioned above:
orig_shape = weight.shape
act = getattr(F, self._op_name)(x, mx.nd.softmax(weight.reshape(-1)).reshape(orig_shape)[None,None,:].repeat(self._kwargs['num_group'],axis=0), name='fwd', **self._kwargs)

Related

Understanding the architecture of an LSTM for sequence classification

I have this model in pytorch that I have been using for sequence classification.
class RoBERT_Model(nn.Module):
def __init__(self, hidden_size = 100):
self.hidden_size = hidden_size
super(RoBERT_Model, self).__init__()
self.lstm = nn.LSTM(768, hidden_size, num_layers=1, bidirectional=False)
self.out = nn.Linear(hidden_size, 2)
def forward(self, grouped_pooled_outs):
# chunks_emb = pooled_out.split_with_sizes(lengt) # splits the input tensor into a list of tensors where the length of each sublist is determined by length
seq_lengths = torch.LongTensor([x for x in map(len, grouped_pooled_outs)]) # gets the length of each sublist in chunks_emb and returns it as an array
batch_emb_pad = nn.utils.rnn.pad_sequence(grouped_pooled_outs, padding_value=-91, batch_first=True) # pads each sublist in chunks_emb to the largest sublist with value -91
batch_emb = batch_emb_pad.transpose(0, 1) # (B,L,D) -> (L,B,D)
lstm_input = nn.utils.rnn.pack_padded_sequence(batch_emb, seq_lengths, batch_first=False, enforce_sorted=False) # seq_lengths.cpu().numpy()
packed_output, (h_t, h_c) = self.lstm(lstm_input, ) # (h_t, h_c))
# output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, padding_value=-91)
h_t = h_t.view(-1, self.hidden_size) # (-1, 100)
return self.out(h_t) # logits
The issue that I am having is that I am not entirely convinced of what data is being passed to the final classification layer. I believe what is being done is that only the final LSTM cell in the last layer is being used for classification. That is there are hidden_size features that are passed to the feedforward layer.
I have depicted what I believe is going on in this figure here:
Is this understanding correct? Am I missing anything?
Thanks.
Your code is a basic LSTM for classification, working with a single rnn layer.
In your picture you have multiple LSTM layers, while, in reality, there is only one, H_n^0 in the picture.
Your input to LSTM is of shape (B, L, D) as correctly pointed out in the comment.
packed_output and h_c is not used at all, hence you can change this line to: _, (h_t, _) = self.lstm(lstm_input) in order no to clutter the picture further
h_t is output of last step for each batch element, in general (B, D * L, hidden_size). As this neural network is not bidirectional D=1, as you have a single layer L=1 as well, hence the output is of shape (B, 1, hidden_size).
This output is reshaped into nn.Linear compatible (this line: h_t = h_t.view(-1, self.hidden_size)) and will give you output of shape (B, hidden_size)
This input is fed to a single nn.Linear layer.
In general, the output of the last time step from RNN is used for each element in the batch, in your picture H_n^0 and simply fed to the classifier.
By the way, having self.out = nn.Linear(hidden_size, 2) in classification is probably counter-productive; most likely your are performing binary classification and self.out = nn.Linear(hidden_size, 1) with torch.nn.BCEWithLogitsLoss might be used. Single logit contains information whether the label should be 0 or 1; everything smaller than 0 is more likely to be 0 according to nn, everything above 0 is considered as a 1 label.

Questions about programming a cnn with PyTorch

I'm pretty new at programming cnn so I'm a little bit lost. I'm trying to do this part of the code, where they ask me to implement a fully-connected network to classify the digits. It should contain 1 hidden layer with 20 units. I should use ReLU activation function on the hidden layer.
class Network(nn.Module):
def __init__(self):
super(Network, self).__init__()
self.fc1 = ...
self.fc2 = nn.Sequential(
nn.Linear(500,10),
nn.Softmax(dim = 1)
)
def forward(self, x):
x = x.view(x.size(0),-1)
x = self.fc1(x)
x = self.fc2(x)
return x
The dots are the part to fill, I think about this line:
self.fc1 = nn.Linear(20, 500)
But I don't know if it's correct. Could someone help me please? And I don't understand at all what the function Softmax do... so if someone knows it please.
Thank you so much!!
Pd. This is the code to load the data:
batch_size = 64
trainset = datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=1)
testset = datasets.MNIST('./data', train=False, download=True, transform=transforms.ToTensor())
test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=1)
From the code given for the model, it can be seen that the hidden layer has 500 units. So I am assuming you meant 20 units for input. With this assumption, the code must be:
self.fc1 = nn.Sequential(
nn.Linear(20, 500),
nn.ReLU()
)
Coming to the next part of your question, given that you are working with MNIST dataset and you have the softmax function, I am assuming you are trying to predict the number present in the images.
Your neural network performs various multiplication and addition operations in each layer and finally, you end up with 10 numbers in the output layer. Now, you have to make sense of these 10 numbers to decide which of the 10 digits is given in the image.
One way to do this would be to select the unit which has the maximum value. For example if the 10th unit has the maximum value among all units, then we conclude that the digit is '9'. If the 2nd unit has the maximum value, then we conclude that the digit is '1'.
This is fine but a better way would be to convert the values of each of the units to probability that the corresponding digit is contained in the image and then we choose the digit having highest probability. This has certain mathematical advantages which helps us in defining a better loss function.
Softmax is what helps us to convert the values to probabilities. On applying softmax, all the values lie in the range (0, 1) and they sum up to 1.
If you are interested in deeplearning and the math behind it, I would suggest you to checkout Andrew NG's course on deeplearning.
You did not mention the shape of your data so I'll be assuming the expected shape returned by datasets.MNIST.
Data shape: torch.Size([64, 1, 28, 28])
class Network(nn.Module):
def __init__(self):
super(Network, self).__init__()
self.fc1 = nn.Sequential(
nn.Linear(1*28*28, 20),
nn.ReLU())
self.fc2 = nn.Sequential(
nn.Linear(500,10),
nn.Softmax(dim = 1))
def forward(self, x):
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
return x
The first argument of nn.Linear is the size of input feature while the second is the number of units.
For self.fc1, the size of the input feature is the multiplication of your data shape except the batch size, which is 1 * 28 * 28. And as per your post the second argument should be 20 (20 units).
The shape of the output from self.fc1 (which is also the input to self.fc2) will then be (batch size, 20).
For self.fc2, the size of the input feature will be 20 while the number of units (which is also the number of digits) will be 10.

RuntimeError: Given groups=3, weight of size 12 64 3 768, expected input[32, 12, 30, 768] to have 192 channels, but got 12 channels instead

I started working with Pytorch recently so my understanding of it isn't quite strong. I previously had a 1 layer CNN but wanted to extend it to 2 layers, but the input and output channels have been throwing errors I can seem to decipher. Why does it expect 192 channels? Can someone give me a pointer to help me understand this better? I have seen several related problems on here, but I don't understand those solutions either.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel, BertTokenizer
import math
from transformers import AdamW, get_linear_schedule_with_warmup
def pad_sents(sents, pad_token): # Pad list of sentences according to the longest sentence in the batch.
sents_padded = []
max_len = max(len(s) for s in sents)
for s in sents:
padded = [pad_token] * max_len
padded[:len(s)] = s
sents_padded.append(padded)
return sents_padded
def sents_to_tensor(tokenizer, sents, device):
tokens_list = [tokenizer.tokenize(str(sent)) for sent in sents]
sents_lengths = [len(tokens) for tokens in tokens_list]
tokens_list_padded = pad_sents(tokens_list, '[PAD]')
sents_lengths = torch.tensor(sents_lengths, device=device)
masks = []
for tokens in tokens_list_padded:
mask = [0 if token == '[PAD]' else 1 for token in tokens]
masks.append(mask)
masks_tensor = torch.tensor(masks, dtype=torch.long, device=device)
tokens_id_list = [tokenizer.convert_tokens_to_ids(tokens) for tokens in tokens_list_padded]
sents_tensor = torch.tensor(tokens_id_list, dtype=torch.long, device=device)
return sents_tensor, masks_tensor, sents_lengths
class ConvModel(nn.Module):
def __init__(self, device, dropout_rate, n_class, out_channel=16):
super(ConvModel, self).__init__()
self.bert_config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
self.dropout_rate = dropout_rate
self.n_class = n_class
self.out_channel = out_channel
self.bert = BertModel.from_pretrained('bert-base-uncased', config=self.bert_config)
self.out_channels = self.bert.config.num_hidden_layers * self.out_channel
self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', config=self.bert_config)
self.conv = nn.Conv2d(in_channels=self.bert.config.num_hidden_layers,
out_channels=self.out_channels,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.conv1 = nn.Conv2d(in_channels=self.out_channels,
out_channels=48,
kernel_size=(3, self.bert.config.hidden_size),
groups=self.bert.config.num_hidden_layers)
self.hidden_to_softmax = nn.Linear(self.out_channels, self.n_class, bias=True)
self.dropout = nn.Dropout(p=self.dropout_rate)
self.device = device
def forward(self, sents):
sents_tensor, masks_tensor, sents_lengths = sents_to_tensor(self.tokenizer, sents, self.device)
encoded_layers = self.bert(input_ids=sents_tensor, attention_mask=masks_tensor)
hidden_encoded_layer = encoded_layers[2]
hidden_encoded_layer = hidden_encoded_layer[0]
hidden_encoded_layer = torch.unsqueeze(hidden_encoded_layer, dim=1)
hidden_encoded_layer = hidden_encoded_layer.repeat(1, 12, 1, 1)
conv_out = self.conv(hidden_encoded_layer) # (batch_size, channel_out, some_length, 1)
conv_out = self.conv1(conv_out)
conv_out = torch.squeeze(conv_out, dim=3) # (batch_size, channel_out, some_length)
conv_out, _ = torch.max(conv_out, dim=2) # (batch_size, channel_out)
pre_softmax = self.hidden_to_softmax(conv_out)
return pre_softmax
def batch_iter(data, batch_size, shuffle=False, bert=None):
batch_num = math.ceil(data.shape[0] / batch_size)
index_array = list(range(data.shape[0]))
if shuffle:
data = data.sample(frac=1)
for i in range(batch_num):
indices = index_array[i * batch_size: (i + 1) * batch_size]
examples = data.iloc[indices]
sents = list(examples.train_BERT_tweet)
targets = list(examples.train_label.values)
yield sents, targets # list[list[str]] if not bert else list[str], list[int]
def train():
label_name = ['Yes', 'Maybe', 'No']
device = torch.device("cpu")
df_train = pd.read_csv('trainn.csv') # , index_col=0)
train_label = dict(df_train.train_label.value_counts())
label_max = float(max(train_label.values()))
train_label_weight = torch.tensor([label_max / train_label[i] for i in range(len(train_label))], device=device)
model = ConvModel(device=device, dropout_rate=0.2, n_class=len(label_name))
optimizer = AdamW(model.parameters(), lr=1e-3, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=1000) # changed the last 2 arguments to old ones
model = model.to(device)
model.train()
cn_loss = torch.nn.CrossEntropyLoss(weight=train_label_weight, reduction='mean')
train_batch_size = 16
for epoch in range(1):
for sents, targets in batch_iter(df_train, batch_size=train_batch_size, shuffle=True): # for each epoch
optimizer.zero_grad()
pre_softmax = model(sents)
loss = cn_loss(pre_softmax, torch.tensor(targets, dtype=torch.long, device=device))
loss.backward()
optimizer.step()
scheduler.step()
TrainingModel = train()
Here's a snippet of data https://github.com/Kosisochi/DataSnippet
It seems that the original version of the code you had in this question behaved differently. The final version of the code you have here gives me a different error from what you posted, more specifically - this:
RuntimeError: Calculated padded input size per channel: (20 x 1). Kernel size: (3 x 768). Kernel size can't be greater than actual input size
I apologize if I misunderstood the situation, but it seems to me that your understanding of what exactly nn.Conv2d layer does is not 100% clear and that is the main source of your struggle. I interpret the part "detailed explanation on 2 layer CNN in Pytorch" you requested as an ask to explain in detail on how that layer works and I hope that after this is done there will be no problem applying it 1 time, 2 times or more.
You can find all the documentation about the layer here, but let me give you a recap which hopefully will help to understand more the errors you're getting.
First of all nn.Conv2d inputs are 4-d tensors of the shape (BatchSize, ChannelsIn, Height, Width) and outputs are 4-d tensors of the shape (BatchSize, ChannelsOut, HeightOut, WidthOut). The simplest way to think about nn.Conv2d is of something applied to 2d images with pixel grid of size Height x Width and having ChannelsIn different colors or features per pixel. Even if your inputs have nothing to do with actual images the behavior of the layer is still the same. Simplest situation is when the nn.Conv2d is not using padding (as in your code). In that case the kernel_size=(kernel_height, kernel_width) argument specifies the rectangle which you can imagine sweeping through Height x Width rectangle of your inputs and producing one pixel for each valid position. Without padding the coordinate of the rectangle's point can be any pair of indicies (x, y) with x between 0 and Height - kernel_height and y between 0 and Width - kernel_width. Thus the output will look like a 2d image of size (Height - kernel_height + 1) x (Width - kernel_width + 1) and will have as many output channels as specified to nn.Conv2d constructor, so the output tensor will be of shape (BatchSize, ChannelsOut, Height - kernel_height + 1, Width - kernel_width + 1).
The parameter groups is not affecting how shapes are changed by the layer - it is only controlling which input channels are used as inputs for the output channels (groups=1 means that every input channel is used as input for every output channel, otherwise input and output channels are divided into corresponding number of groups and only input channels from group i are used as inputs for the output channels from group i).
Now in your current version of the code you have BatchSize = 16 and the output of pre-trained model is (BatchSize, DynamicSize, 768) with DynamicSize depending on the input, e.g. 22. You then introduce additional dimension as axis 1 with unsqueeze and repeat the values along that dimension transforming the tensor of shape (16, 22, 768) into (16, 12, 22, 768). Effectively you are using the output of the pre-trained model as 12-channel (with each channel having same values as others) 2-d images here of size (22, 768), where 22 is not fixed (depends on the batch). Then you apply a nn.Conv2d with kernel size (3, 768) - which means that there is no "wiggle room" for width and output 2-d images will be of size (20, 1) and since your layer has 192 channels final size of the output of first convolution layer has shape (16, 192, 20, 1). Then you try to apply second layer of convolution on top of that with kernel size (3, 768) again, but since your 2-d "image" is now just (20 x 1) there is no valid position to fit (3, 768) kernel rectangle inside a rectangle (20 x 1) which leads to the error message Kernel size can't be greater than actual input size.
Hope this explanation helps. Now to the choices you have to avoid the issue:
(a) is to add padding in such a way that the size of the output is not changing comparing to input (I won't go into details here,
because I don't think this is what you need)
(b) Use smaller kernel on both first and/or second convolutions (e.g. if you don't change first convolution the only valid width for
the second kernel would be 1).
(c) Looking at what you're trying to do my guess is that you actually don't want to use 2d convolution, you want 1d convolution (on the sequence) with every position described by 768 values. When you're using one convolution layer with 768 width kernel (and same 768 width input) you're effectively doing exactly same thing as 1d convolution with 768 input channels, but then if you try to apply second one you have a problem. You can specify kernel width as 1 for the next layer(s) and that will work for you, but a more correct way would be to transpose pre-trained model's output tensor by switching the last dimensions - getting shape (16, 768, DynamicSize) from (16, DynamicSize, 768) and then apply nn.Conv1d layer with 768 input channels and arbitrary ChannelsOut as output channels and 1d kernel_size=3 (meaning you look at 3 consecutive elements of the sequence for convolution). If you do that than without padding input shape of (16, 768, DynamicSize) will become (16, ChannelsOut, DynamicSize-2), and after you apply second Conv1d with e.g. the same settings as first one you'll get a tensor of shape (16, ChannelsOut, DynamicSize-4), etc. (each time the 1d length will shrink by kernel_size-1). You can always change number of channels/kernel_size for each subsequent convolution layer too.

Local fully connected layer - Pytorch

Assume we have a feature representation with kN neurons before the classification layer. Now, the classification layer produces an output layer of size N with only local connections.
That is, the kth neuron at the output is computed using input neurons at locations from kN to kN+N. Hence, every N locations in the input layer (with stride N) give single neuron value at the output.
This is done using conv1dlocal in Keras, however, the PyTorch does not seem to have this.
Weight matrix in standard linear layer: kNxN = kN^2 variables
Weight matrix with local linear layer: (kx1)#N times = NK variables
This is currently triaged on the PyTorch issue tracker, in the mean time you can get a similar behavious using fold and unfold. See this answer:
https://github.com/pytorch/pytorch/issues/499#issuecomment-503962218
class LocalLinear(nn.Module):
def __init__(self,in_features,local_features,kernel_size,padding=0,stride=1,bias=True):
super(LocalLinear, self).__init__()
self.kernel_size = kernel_size
self.stride = stride
self.padding = padding
fold_num = (in_features+2*padding-self.kernel_size)//self.stride+1
self.weight = nn.Parameter(torch.randn(fold_num,kernel_size,local_features))
self.bias = nn.Parameter(torch.randn(fold_num,local_features)) if bias else None
def forward(self, x:torch.Tensor):
x = F.pad(x,[self.padding]*2,value=0)
x = x.unfold(-1,size=self.kernel_size,step=self.stride)
x = torch.matmul(x.unsqueeze(2),self.weight).squeeze(2)+self.bias
return x

How to correctly implement backpropagation for machine learning the MNIST dataset?

So, I'm using Michael Nielson's machine learning book as a reference for my code (it is basically identical): http://neuralnetworksanddeeplearning.com/chap1.html
The code in question:
def backpropagate(self, image, image_value) :
# declare two new numpy arrays for the updated weights & biases
new_biases = [np.zeros(bias.shape) for bias in self.biases]
new_weights = [np.zeros(weight_matrix.shape) for weight_matrix in self.weights]
# -------- feed forward --------
# store all the activations in a list
activations = [image]
# declare empty list that will contain all the z vectors
zs = []
for bias, weight in zip(self.biases, self.weights) :
print(bias.shape)
print(weight.shape)
print(image.shape)
z = np.dot(weight, image) + bias
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# -------- backward pass --------
# transpose() returns the numpy array with the rows as columns and columns as rows
delta = self.cost_derivative(activations[-1], image_value) * sigmoid_prime(zs[-1])
new_biases[-1] = delta
new_weights[-1] = np.dot(delta, activations[-2].transpose())
# l = 1 means the last layer of neurons, l = 2 is the second-last, etc.
# this takes advantage of Python's ability to use negative indices in lists
for l in range(2, self.num_layers) :
z = zs[-1]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
new_biases[-l] = delta
new_weights[-l] = np.dot(delta, activations[-l-1].transpose())
return (new_biases, new_weights)
My algorithm can only get to the first round backpropagation before this error occurs:
File "D:/Programming/Python/DPUDS/DPUDS_Projects/Fall_2017/MNIST/network.py", line 97, in stochastic_gradient_descent
self.update_mini_batch(mini_batch, learning_rate)
File "D:/Programming/Python/DPUDS/DPUDS_Projects/Fall_2017/MNIST/network.py", line 117, in update_mini_batch
delta_biases, delta_weights = self.backpropagate(image, image_value)
File "D:/Programming/Python/DPUDS/DPUDS_Projects/Fall_2017/MNIST/network.py", line 160, in backpropagate
z = np.dot(weight, activation) + bias
ValueError: shapes (30,50000) and (784,1) not aligned: 50000 (dim 1) != 784 (dim 0)
I get why it's an error. The number of columns in weights doesn't match the number of rows in the pixel image, so I can't do matrix multiplication. Here's where I'm confused -- there are 30 neurons used in the backpropagation, each with 50,000 images being evaluated. My understanding is that each of the 50,000 should have 784 weights attached, one for each pixel. But when I modify the code accordingly:
count = 0
for bias, weight in zip(self.biases, self.weights) :
print(bias.shape)
print(weight[count].shape)
print(image.shape)
z = np.dot(weight[count], image) + bias
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
count += 1
I still get a similar error:
ValueError: shapes (50000,) and (784,1) not aligned: 50000 (dim 0) != 784 (dim 0)
I'm just really confuzzled by all the linear algebra involved and I think I'm just missing something about the structure of the weight matrix. Any help at all would be greatly appreciated.
It looks like the issue is in your changes to the original code.
I’be downloaded example from the link you provided and it works without any errors:
Here is full source code I used:
import cPickle
import gzip
import numpy as np
import random
def load_data():
"""Return the MNIST data as a tuple containing the training data,
the validation data, and the test data.
The ``training_data`` is returned as a tuple with two entries.
The first entry contains the actual training images. This is a
numpy ndarray with 50,000 entries. Each entry is, in turn, a
numpy ndarray with 784 values, representing the 28 * 28 = 784
pixels in a single MNIST image.
The second entry in the ``training_data`` tuple is a numpy ndarray
containing 50,000 entries. Those entries are just the digit
values (0...9) for the corresponding images contained in the first
entry of the tuple.
The ``validation_data`` and ``test_data`` are similar, except
each contains only 10,000 images.
This is a nice data format, but for use in neural networks it's
helpful to modify the format of the ``training_data`` a little.
That's done in the wrapper function ``load_data_wrapper()``, see
below.
"""
f = gzip.open('../data/mnist.pkl.gz', 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
return (training_data, validation_data, test_data)
def load_data_wrapper():
"""Return a tuple containing ``(training_data, validation_data,
test_data)``. Based on ``load_data``, but the format is more
convenient for use in our implementation of neural networks.
In particular, ``training_data`` is a list containing 50,000
2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray
containing the input image. ``y`` is a 10-dimensional
numpy.ndarray representing the unit vector corresponding to the
correct digit for ``x``.
``validation_data`` and ``test_data`` are lists containing 10,000
2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional
numpy.ndarry containing the input image, and ``y`` is the
corresponding classification, i.e., the digit values (integers)
corresponding to ``x``.
Obviously, this means we're using slightly different formats for
the training data and the validation / test data. These formats
turn out to be the most convenient for use in our neural network
code."""
tr_d, va_d, te_d = load_data()
training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
training_results = [vectorized_result(y) for y in tr_d[1]]
training_data = zip(training_inputs, training_results)
validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
validation_data = zip(validation_inputs, va_d[1])
test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
test_data = zip(test_inputs, te_d[1])
return (training_data, validation_data, test_data)
def vectorized_result(j):
"""Return a 10-dimensional unit vector with a 1.0 in the jth
position and zeroes elsewhere. This is used to convert a digit
(0...9) into a corresponding desired output from the neural
network."""
e = np.zeros((10, 1))
e[j] = 1.0
return e
class Network(object):
def __init__(self, sizes):
"""The list ``sizes`` contains the number of neurons in the
respective layers of the network. For example, if the list
was [2, 3, 1] then it would be a three-layer network, with the
first layer containing 2 neurons, the second layer 3 neurons,
and the third layer 1 neuron. The biases and weights for the
network are initialized randomly, using a Gaussian
distribution with mean 0, and variance 1. Note that the first
layer is assumed to be an input layer, and by convention we
won't set any biases for those neurons, since biases are only
ever used in computing the outputs from later layers."""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]
def feedforward(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a)+b)
return a
def SGD(self, training_data, epochs, mini_batch_size, eta,
test_data=None):
"""Train the neural network using mini-batch stochastic
gradient descent. The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs. The other non-optional parameters are
self-explanatory. If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out. This is useful for
tracking progress, but slows things down substantially."""
if test_data: n_test = len(test_data)
n = len(training_data)
for j in xrange(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k+mini_batch_size]
for k in xrange(0, n, mini_batch_size)]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print "Epoch {0}: {1} / {2}".format(
j, self.evaluate(test_data), n_test)
else:
print "Epoch {0} complete".format(j)
def update_mini_batch(self, mini_batch, eta):
"""Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
is the learning rate."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [w-(eta/len(mini_batch))*nw
for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/len(mini_batch))*nb
for b, nb in zip(self.biases, nabla_b)]
def backprop(self, x, y):
"""Return a tuple ``(nabla_b, nabla_w)`` representing the
gradient for the cost function C_x. ``nabla_b`` and
``nabla_w`` are layer-by-layer lists of numpy arrays, similar
to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * \
sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on. It's a renumbering of the
# scheme in the book, used here to take advantage of the fact
# that Python can use negative indices in lists.
for l in xrange(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
return (nabla_b, nabla_w)
def evaluate(self, test_data):
"""Return the number of test inputs for which the neural
network outputs the correct result. Note that the neural
network's output is assumed to be the index of whichever
neuron in the final layer has the highest activation."""
test_results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)
def cost_derivative(self, output_activations, y):
"""Return the vector of partial derivatives \partial C_x /
\partial a for the output activations."""
return (output_activations-y)
#### Miscellaneous functions
def sigmoid(z):
"""The sigmoid function."""
return 1.0/(1.0+np.exp(-z))
def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z)*(1-sigmoid(z))
training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Additional info:
However, I would recommend using one of existing frameworks, for example - Keras to don't reinvent the wheel
Also, it was checked with python 3.6:
Kudos on digging into Nielsen's code. It's a great resource to develop thorough understanding of NN principles. Too many people leap ahead to Keras without knowing what goes on under the hood.
Each training example doesn't get its own weights. Each of the 784 features does. If each example got its own weights then each weight set would overfit to its corresponding training example. Also, if you later used your trained network to run inference on a single test example, what would it do with 50,000 sets of weights when presented with just one handwritten digit? Instead, each of the 30 neurons in your hidden layer learns a set of 784 weights, one for each pixel, that offers high predictive accuracy when generalized to any handwritten digit.
Import network.py and instantiate a Network class like this without modifying any code:
net = network.Network([784, 30, 10])
..which gives you a network with 784 input neurons, 30 hidden neurons and 10 output neurons. Your weight matrices will have dimensions [30, 784] and [10, 30], respectively. When you feed the network an input array of dimensions [784, 1] the matrix multiplication that gave you an error is valid because dim 1 of the weight matrix equals dim 0 of the input array (both 784).
Your problem is not implementation of backprop but rather setting up a network architecture appropriate for the shape of your input data. If memory serves Nielsen leaves backprop as a black box in chapter 1 and doesn't dive into it until chapter 2. Keep at it, and good luck!

Resources