Unable to get the correct SVM gradient using vectorization - svm

I was attempting the CS231n Assignment 1 for vectorizing the computation of the SVM gradients. dW is the gradient matrix. The following is my attempt:
def svm_loss_vectorized(W, X, y, reg):
    num_train = X.shape[0]
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    scores = X.dot(W)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    # indices gets the position in the margins matrix where entry is > 0
    indices = np.argwhere(margins > 0)
    i = indices[:, 0]
    j = indices[:, 1]
    dW[:, j] += np.transpose(X[i])
    dW[:, y[i]] -= np.transpose(X[i])
    loss = np.sum(margins)
    # average it
    loss /= num_train
    dW /= num_train
    # regularization
    loss += reg * np.sum(W * W)
    dW += reg * W
    return loss, dW
My loss calculation is correct; however, the computed gradient is off by a huge margin. Could someone please help me understand what I'm missing here?
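For comparison, here is one common fully vectorized formulation, sketched under the usual CS231n shape assumptions (X: (N, D), W: (D, C), y: (N,)). A likely culprit in the code above is that dW[:, j] += ... does not accumulate contributions when the same column index appears several times in j (and y[i] certainly repeats), so building a coefficient matrix and taking a single matrix product avoids the issue:

import numpy as np

def svm_loss_vectorized_sketch(W, X, y, reg):
    # sketch under assumed shapes: X (N, D), W (D, C), y (N,)
    num_train = X.shape[0]
    scores = X.dot(W)                                          # (N, C)
    correct = scores[np.arange(num_train), y][:, np.newaxis]   # (N, 1)
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(num_train), y] = 0
    loss = margins.sum() / num_train + reg * np.sum(W * W)

    # coeff[i, j] = 1 where class j violates the margin for example i;
    # the correct-class column gets minus the number of violations in that row
    coeff = (margins > 0).astype(float)
    coeff[np.arange(num_train), y] = -coeff.sum(axis=1)

    # all per-example contributions in one product, no duplicate-index problem
    dW = X.T.dot(coeff) / num_train + reg * W  # reg * W kept to mirror the question
    return loss, dW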

Related

Which among best stochastic optimizers gives better visualization?

I am trying to make a visual comparison of predictions among the best neural network optimization algorithms [1] implemented from scratch.
The loss for SGD with momentum is: 0.2235
The loss for RMSprop is: 0.2075
The loss for Adam is: 0.6931
Are the results for Adam correct or not?
Here is what I have got as graphs:
Code for SGD with momentum:
np.random.seed(42)
w = np.array([0, 0, 0, 0, 0, 1])
eta = 0.05   # learning rate
alpha = 0.9  # momentum
nu = np.zeros_like(w)
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

plt.figure(figsize=(12, 5))
for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    if i % 10 == 0:
        visualize(X_expanded[ind, :], y[ind], w, loss)
    grad = compute_grad(X_expanded, y, w)
    nu = alpha * nu + eta * grad
    w = w - nu
visualize(X, y, w, loss)
plt.clf()
Code for RMSprop:
np.random.seed(42)
w = np.array([0, 0, 0, 0, 0, 1.])
eta = 0.1    # learning rate
alpha = 0.9  # moving average of gradient norm squared
g2 = np.zeros_like(w)
eps = 1e-8
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

plt.figure(figsize=(12, 5))
for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    if i % 10 == 0:
        visualize(X_expanded[ind, :], y[ind], w, loss)
    grad = compute_grad(X_expanded, y, w)
    grad2 = grad ** 2
    g2 = alpha * g2 + (1 - alpha) * grad2
    w = w - eta * grad / np.sqrt(g2 + eps)
visualize(X, y, w, loss)
plt.clf()
Code for Adam:
np.random.seed(42)
w = np.array([0, 0, 0, 0, 0, 1.])
eta = 0.01     # learning rate
beta1 = 0.9    # moving average of gradient norm
beta2 = 0.999  # moving average of gradient norm squared
m = np.zeros_like(w)   # Initial 1st moment estimates
nu = np.zeros_like(w)  # Initial 2nd moment estimates
eps = 1e-8             # A small constant for numerical stability
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

plt.figure(figsize=(12, 5))
for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    if i % 10 == 0:
        visualize(X_expanded[ind, :], y[ind], w, loss)
    grad = compute_grad(X_expanded, y, w)
    grad2 = grad ** 2
    m = ((beta1 * m) + ((1 - beta1) * grad)) / (1 - beta1)
    nu = ((beta2 * nu) + ((1 - beta2) * grad2)) / (1 - beta2)
    w = (w - eta * m) / (np.sqrt(nu) + eps)
visualize(X, y, w, loss)
plt.clf()
I was expecting Adam to reach a lower cost than RMSprop does (0.2075).
[1] https://stackoverflow.com/a/37723962/10543310
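For reference, here is a minimal sketch of the standard Adam update, reusing compute_loss, compute_grad, X_expanded and y from the code above (they are not defined here). The two usual differences from the Adam code in the question are that the moment estimates are bias-corrected by 1 - beta**t rather than divided by 1 - beta on every step, and that only the step eta * m_hat is divided by sqrt(v_hat) + eps, not the whole (w - eta * m) expression:

np.random.seed(42)
w = np.array([0, 0, 0, 0, 0, 1.])
eta = 0.01
beta1, beta2 = 0.9, 0.999
eps = 1e-8
m = np.zeros_like(w)   # 1st moment estimate
v = np.zeros_like(w)   # 2nd moment estimate
n_iter = 100
loss = np.zeros(n_iter)
for i in range(n_iter):
    loss[i] = compute_loss(X_expanded, y, w)
    grad = compute_grad(X_expanded, y, w)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** (i + 1))   # bias correction
    v_hat = v / (1 - beta2 ** (i + 1))
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)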

Derivative of ReLU [closed]

I'm learning PyTorch. Here is the first example in the official tutorial. I have two questions about the block below:
a) I understand that the derivative of a ReLU function is 0 when x < 0 and 1 when x > 0. Is that right? But the code seems to keep the x > 0 part unchanged and set the x < 0 part to 0. Why is that?
b) Why the transpose, i.e. x.t().mm(grad_h)? A transpose doesn't seem needed to me. I'm just confused. Thanks.
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
1- It is true that the derivative of a ReLU is 0 when x < 0 and 1 when x > 0. But notice that the gradient flows from the output of the function all the way back to h. By the time you get back to grad_h, it is calculated as:
grad_h = derivative of ReLU(h) * incoming gradient
Where h > 0 the derivative is 1, so grad_h is just the incoming gradient there; where h < 0 the derivative is 0, so those entries are set to 0. That is exactly what grad_h[h < 0] = 0 does: the positive part is kept unchanged and the negative part is zeroed.
2- The size of the x matrix is 64x1000 and the grad_h matrix is 64x100. You cannot multiply x by grad_h directly; you need the transpose of x so the dimensions line up: (1000x64) times (64x100) gives the 1000x100 shape of w1's gradient.
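A short sketch that makes both points concrete, using the N, D_in, H values from the tutorial (the tensors here are random stand-ins):

import torch

N, D_in, H = 64, 1000, 100
x = torch.randn(N, D_in)
h = torch.randn(N, H)            # stand-in for the pre-activation x.mm(w1)
grad_h_relu = torch.randn(N, H)  # incoming gradient, shape (64, 100)

# the ReLU derivative is 1 where h > 0 and 0 where h < 0, so masking the
# incoming gradient is equivalent to the clone-then-zero pattern above
grad_h = grad_h_relu * (h > 0).float()   # (64, 100)

# x is (64, 1000); the only shape-consistent product that yields a gradient
# with w1's shape (1000, 100) is x.t() @ grad_h
grad_w1 = x.t().mm(grad_h)
print(grad_w1.shape)  # torch.Size([1000, 100])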

Visualization of the filters of VGG16

I am learning CNNs and am currently working on deconvolution of the layers. I have begun learning upsampling and observing how convolution layers see the world by generating feature maps from the filters, following Visualization of the filters of VGG16 and its source code. I have changed the input, and the code is as follows:
import imageio
import numpy as np
import time
from keras.applications import vgg16
from keras import backend as K
import cv2
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# dimensions of the generated pictures for each filter.
img_width = 128
img_height = 128

# the name of the layer we want to visualize
# (see model definition at keras/applications/vgg16.py)
layer_name = 'block5_conv1'

# util function to convert a tensor into a valid image
def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + K.epsilon())
    x *= 0.1
    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)
    # convert to RGB array
    x *= 255
    if K.image_data_format() == 'channels_first':
        x = x.transpose((1, 2, 0))
    x = np.clip(x, 0, 255).astype('uint8')
    return x

# build the VGG16 network with ImageNet weights
model = vgg16.VGG16(weights='imagenet', include_top=False)
print('Model loaded.')
model.summary()

# this is the placeholder for the input images
input_img = model.input

# get the symbolic outputs of each "key" layer (we gave them unique names).
layer_dict = dict([(layer.name, layer) for layer in model.layers[1:]])

def normalize(x):
    # utility function to normalize a tensor by its L2 norm
    return x / (K.sqrt(K.mean(K.square(x))) + K.epsilon())

kept_filters = []
for filter_index in range(200):
    # we only scan through the first 200 filters,
    # but there are actually 512 of them
    print('Processing filter %d' % filter_index)
    start_time = time.time()
    # we build a loss function that maximizes the activation
    # of the nth filter of the layer considered
    layer_output = layer_dict[layer_name].output
    if K.image_data_format() == 'channels_first':
        loss = K.mean(layer_output[:, filter_index, :, :])
    else:
        loss = K.mean(layer_output[:, :, :, filter_index])
    # we compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, input_img)[0]
    # normalization trick: we normalize the gradient
    grads = normalize(grads)
    # this function returns the loss and grads given the input picture
    iterate = K.function([input_img], [loss, grads])
    # step size for gradient ascent
    step = 1.
    inpImgg = '/home/sanaalamgeer/Downloads/cat.jpeg'
    inpImg = mpimg.imread(inpImgg)
    inpImg = cv2.resize(inpImg, (img_width, img_height))
    # we start from a gray image with some random noise
    if K.image_data_format() == 'channels_first':
        input_img_data = inpImg.reshape((1, 3, img_width, img_height))
    else:
        input_img_data = inpImg.reshape((1, img_width, img_height, 3))
    input_img_data = (input_img_data - 0.5) * 20 + 128
    # we run gradient ascent for 20 steps
    for i in range(20):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step
        print('Current loss value:', loss_value)
        if loss_value <= 0.:
            # some filters get stuck to 0, we can skip them
            break
    # decode the resulting input image
    if loss_value > 0:
        img = deprocess_image(input_img_data[0])
        kept_filters.append((img, loss_value))
    end_time = time.time()
    print('Filter %d processed in %ds' % (filter_index, end_time - start_time))

# we will stitch the best 64 filters on an 8 x 8 grid.
n = 8
# the filters that have the highest loss are assumed to be better-looking.
# we will only keep the top 64 filters.
kept_filters.sort(key=lambda x: x[1], reverse=True)
kept_filters = kept_filters[:n * n]

# build a black picture with enough space for
# our 8 x 8 filters of size 128 x 128, with a 5px margin in between
margin = 5
width = n * img_width + (n - 1) * margin
height = n * img_height + (n - 1) * margin
stitched_filters = np.zeros((width, height, 3))

# fill the picture with our saved filters
for i in range(n):
    for j in range(n):
        img, loss = kept_filters[i * n + j]
        stitched_filters[(img_width + margin) * i: (img_width + margin) * i + img_width,
                         (img_height + margin) * j: (img_height + margin) * j + img_height, :] = img

# save the result to disk
imageio.imwrite('stitched_filters_%dx%d.png' % (n, n), stitched_filters)
The input image I am using is the cat photo referenced in the code above. The code is supposed to generate an output with 64 feature maps stitched into one image, as shown in Visualization of the filters of VGG16, but instead it reproduces the input image for every filter. I am confused about what is wrong or where I should make changes. Please help.
What complex code....
I'd do this:
from keras.applications import vgg16
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
import numpy as np

layer_name = 'block5_conv1'

# create a section of the model to output the layer we want
model = vgg16.VGG16(weights='imagenet', include_top=False)
model = Model(model.input, model.get_layer(layer_name).output)

# open and preprocess the cat image
catImage = openTheCatImage(catFile)
catImage = np.expand_dims(catImage, axis=0)
catImage = preprocess_input(catImage)

# get the layer outputs
features = model.predict(catImage)

# plot
for channel in range(features.shape[-1]):  # or .shape[1], or up to a limit you like
    featureMap = features[:, :, :, channel]  # or features[:, channel]
    featureMap = deprocess_image(featureMap)[0]
    saveOrPlot(featureMap)
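openTheCatImage and saveOrPlot above are placeholders. One possible way to fill them in, as a sketch only (assuming the cat image path from the question and the standard Keras image utilities):

from keras.preprocessing import image
import matplotlib.pyplot as plt

def openTheCatImage(catFile, size=(128, 128)):
    # load the image, resize it, and return a float array of shape (H, W, 3)
    img = image.load_img(catFile, target_size=size)
    return image.img_to_array(img)

def saveOrPlot(featureMap, name='feature_map.png'):
    # featureMap is a single 2D channel; show it as a grayscale image and save it
    # (pass a per-channel name when calling in a loop, or the file gets overwritten)
    plt.imshow(featureMap, cmap='gray')
    plt.axis('off')
    plt.savefig(name, bbox_inches='tight')
    plt.close()

catFile = '/home/sanaalamgeer/Downloads/cat.jpeg'  # path taken from the question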

Custom loss function in Keras (asymmetric MAE)

I am trying to implement a custom loss function for a Keras LSTM, which would represent an asymmetric MAE (penalizing right shift and rewarding left shift of a prediction relative to the actuals). What is the correct syntax, given that the input parameters are tensors, not numpy arrays?
def amae(a, p):
    product = a.copy()  # copy so the actuals are not modified in place
    product[1:] = a[1:] - a[:-1]
    product[0] = 0
    product = -product * 10
    delta = p - a
    delta = abs(delta) + (delta * product)
    return sum(delta) / len(delta)
You could try:
def amae(a, p):
    delta = p - a
    # rows where the prediction is higher than the actual (True) vs not (False)
    penalty = tf.greater(delta, 0)
    # cast the booleans to floats so we can do math with them
    penalty = tf.cast(penalty, tf.float32)
    # penalize those rows by a factor of 10x
    penalty = penalty * 9 + 1  # add 1 even to the 0s, since this is used as a multiplier
    # squared difference
    sq_delta = tf.square(delta)
    return tf.reduce_mean(penalty * sq_delta, axis=-1)
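Note that the snippet above ends up penalizing a squared error rather than an absolute one. If the goal really is an asymmetric MAE, one possible variant is sketched below (the 10x factor for predictions above the actuals is just carried over from the snippet above):

import tensorflow as tf

def asymmetric_mae(a, p, factor=10.0):
    delta = p - a
    # weight absolute errors by `factor` where the prediction exceeds the actual,
    # and by 1 otherwise
    weights = tf.where(delta > 0, factor * tf.ones_like(delta), tf.ones_like(delta))
    return tf.reduce_mean(weights * tf.abs(delta), axis=-1)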

Correct way to compute AUC in tensorflow

I'm calculating the area under the curve (AUC) in TensorFlow.
Here is part of my code:
with tf.name_scope("output"):
W = tf.Variable(tf.random_normal([num_filters_total, num_classes], stddev=0.1), name="W")
b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
l2_loss += tf.nn.l2_loss(W)
l2_loss += tf.nn.l2_loss(b)
self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.softmax_scores = tf.nn.softmax(self.scores)
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# CalculateMean cross-entropy loss
with tf.name_scope("loss"):
self.losses = tf.nn.softmax_cross_entropy_with_logits(labels=self.input_y,logits=self.scores)
self.loss = tf.reduce_mean(self.losses) + l2_reg_lambda * l2_loss
# Accuracy
with tf.name_scope("accuracy"):
correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
# AUC
with tf.name_scope("auc"):
self.auc = tf.metrics.auc(labels = tf.argmax(self.input_y, 1), predictions = self.predictions)`
`
In the above piece of code, input_y is a tensor with shape (batch_size,2) and predictions has the shape (batch_size,).
Therefore the real values for labels and predictions variables in tf.metrics.auc are [0,1,1,1,0,0,...].
I wonder whether this is a correct way to compute AUC.
I've tried with the following command:
self.auc = tf.metrics.auc(labels = tf.argmax(self.input_y, 1), predictions = tf.reduce_max(self.softmax_scores,axis=1))
But this only gives me zeros.
Another thing I notice is that while the accuracy is quite stable at the end of the training process, the AUC computed by the first method keeps increasing. Is that correct?
Thanks.
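One thing worth checking: tf.metrics.auc expects scores (for example the softmax probability of the positive class) as predictions rather than hard 0/1 class labels, and it is a streaming metric that returns an (auc, update_op) pair backed by local variables, so the value stays at zero until those variables are initialized and the update op is run. A minimal TF 1.x sketch, reusing input_y and softmax_scores from the question (eval_batches is a hypothetical iterable of evaluation feed_dicts):

import tensorflow as tf

labels = tf.argmax(input_y, 1)
scores = softmax_scores[:, 1]   # probability of the positive class
auc_value, auc_update_op = tf.metrics.auc(labels=labels, predictions=scores)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # AUC accumulators live in local variables
    for batch_feed in eval_batches:             # hypothetical evaluation feed_dicts
        sess.run(auc_update_op, feed_dict=batch_feed)
    print(sess.run(auc_value))                  # accumulated AUC over all batches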
