Derivative of ReLU [closed] - pytorch

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 4 years ago.
I'm learning PyTorch. Here is the first example from the official tutorial. I have two questions about the code block below:
a) I understand that the derivative of the ReLU function is 0 when x < 0 and 1 when x > 0. Is that right? But the code seems to keep the x > 0 part of the gradient unchanged and only set the x < 0 part to 0. Why is that?
b) Why the transpose, i.e. x.T.mm(grad_h)? A transpose doesn't seem needed to me. I'm just confused. Thanks,
# -*- coding: utf-8 -*-
import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

1- It is true that the derivative of the ReLU function is 0 when x < 0 and 1 when x > 0. But notice that the gradient flows from the output of the function all the way back to h. When you get back to calculating grad_h, it is computed as:
grad_h = derivative of ReLU(h) * incoming gradient
Where the derivative is 1 (i.e. where h > 0), grad_h is just the incoming gradient, which is why that part is left unchanged. Where the derivative is 0 (i.e. where h < 0), the incoming gradient is zeroed out, which is exactly what grad_h[h < 0] = 0 does.
2- The size of the x matrix is 64x1000 and the grad_h matrix is 64x100, so you cannot multiply x by grad_h directly; you need the transpose of x to get matching dimensions: x.t() is 1000x64, and (1000x64) times (64x100) gives 1000x100, which is exactly the shape of w1 and therefore of grad_w1.
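If you want to check this numerically, here is a minimal sketch (with made-up small sizes rather than the tutorial's) that compares the manual backward pass against autograd; note that multiplying by (h > 0) is the same operation as cloning and then zeroing the entries where h < 0:

import torch

# Made-up small sizes, just for illustration.
N, D_in, H, D_out = 4, 6, 5, 3
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# Same forward pass as the tutorial.
h = x.mm(w1)
h_relu = h.clamp(min=0)
y_pred = h_relu.mm(w2)
loss = (y_pred - y).pow(2).sum()

# Manual backward pass.
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.t().mm(grad_y_pred)
grad_h = grad_y_pred.mm(w2.t()) * (h > 0).float()  # ReLU derivative: 1 where h > 0, 0 where h < 0
grad_w1 = x.t().mm(grad_h)                          # (D_in, N) @ (N, H) -> (D_in, H), the shape of w1

# Autograd agrees.
loss.backward()
print(torch.allclose(grad_w1, w1.grad), torch.allclose(grad_w2, w2.grad))  # True True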

Related

Efficiently training a network simultaneously on labels and partial derivatives

I'm trying to train a network in PyTorch along the lines of this idea.
The author creates a simple MLP (4 hidden layers) and then explicitly works out what the partial derivatives of the output are wrt the inputs. He then trains the network on the training labels as well as on the gradients of the output wrt the input data (which are also part of the training data).
To replicate the idea in PyTorch, my training loop looks like this:
import torch
import torch.nn.functional as F
class vanilla_net(torch.nn.Module):
    def __init__(self,
                 input_dim,      # dimension of inputs, e.g. 10
                 hidden_units,   # units in hidden layers, assumed constant, e.g. 20
                 hidden_layers): # number of hidden layers, e.g. 4
        super(vanilla_net, self).__init__()
        self.input = torch.nn.Linear(input_dim, hidden_units)
        self.hidden = torch.nn.ModuleList()
        for hl in range(hidden_layers):
            layer = torch.nn.Linear(hidden_units, hidden_units)
            self.hidden.append(layer)
        self.output = torch.nn.Linear(hidden_units, 1)

    def forward(self, x):
        x = self.input(x)
        x = F.softplus(x)
        for h in self.hidden:
            x = h(x)
            x = F.softplus(x)
        x = self.output(x)
        return x
....
def lossfn(x, y, dx, dy):
    # some loss function involving both sets of training data (y and dy)
    # the network outputs x and what's needed is an efficient way of calculating dx - the partial
    # derivatives of x wrt the batch inputs.
    pass
def train(net, x_train, y_train, dydx_train, batch_size=256):
    m, n = x_train.shape
    first = 0
    last = min(batch_size, m)
    while first < m:
        xi = x_train[first:last]
        yi = y_train[first:last]
        zi = dydx_train[first:last]
        xi.requires_grad_()

        # Perform forward pass
        outputs = net(xi)
        minimizer.zero_grad()
        outputs.backward(torch.ones_like(outputs), create_graph=True)
        xi_grad = xi.grad

        # Compute loss
        loss = lossfn(outputs, yi, xi_grad, zi)
        minimizer.zero_grad()

        # Perform backward pass
        loss.backward()

        # Perform optimization
        minimizer.step()

        first = last
        last = min(first + batch_size, m)
net = vanilla_net(4, 10, 4)
minimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
...
This seems to work, but is there a more elegant/efficient way to achieve the same thing? Also, I'm not sure where the best place to put the minimizer.zero_grad() call is.
Thanks
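Not a full answer, but for comparison: a common alternative to going through xi.grad is torch.autograd.grad, which returns the input gradients directly and saves one of the zero_grad calls. A minimal sketch of the inner step, under the same assumptions as the question (net, lossfn, minimizer, and the xi/yi/zi batch slices defined as above):

xi.requires_grad_()
outputs = net(xi)

# Gradients of the summed outputs wrt the batch inputs; create_graph=True keeps them
# differentiable so they can appear inside the loss. This matches the
# outputs.backward(torch.ones_like(outputs), create_graph=True) call above.
xi_grad, = torch.autograd.grad(outputs.sum(), xi, create_graph=True)

loss = lossfn(outputs, yi, xi_grad, zi)

minimizer.zero_grad()   # one zero_grad, right before the backward pass, is enough
loss.backward()
minimizer.step()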

Gradients of loss with respect to random parameters are the same in pytorch

In the simple code below, I perform a simple linear operation on an input tensor of ones and compute its binary cross-entropy loss, using a vector of zeros as the expected output.
When I compute the gradient of the loss with respect to w, all of its rows turn out to be the same and equal to the gradient with respect to b. This is counter-intuitive, since w and b have random values. What is the reason?
import torch

n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output) # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x,w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
print(w.grad)
print(b.grad)
Output:
tensor([[0.2179, 0.4337, 0.1959],
[0.2179, 0.4337, 0.1959],
[0.2179, 0.4337, 0.1959],
[0.2179, 0.4337, 0.1959],
[0.2179, 0.4337, 0.1959]])
tensor([0.2179, 0.4337, 0.1959])
It's because your input is symmetric.
Imagine the issue from the point of view of a single perceptron (you have 3 of them in your setup): each input is 1.0, so it does not matter which input a given weight of that neuron is attached to, since every input contributes the same value; therefore all the weights feeding one neuron receive the same gradient.
If you diversify the input, everything works just fine:
n_input, n_output = 5, 3
x = torch.randn(n_input)
y = torch.ones(n_output)/2. # expected output
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)
z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()
print(w.grad)
print(b.grad)
tensor([[-0.1939, 0.1657, -0.2501],
[ 0.0561, -0.0480, 0.0724],
[-0.3162, 0.2703, -0.4079],
[ 0.0947, -0.0809, 0.1221],
[-0.0140, 0.0120, -0.0181]])
tensor([-0.1263, 0.1080, -0.1630])
You have a single data point with an input feature size of 5. If you look at the operation performed, you have z = x @ w + b, followed by a binary cross-entropy from logits against a null label. Binary cross-entropy is defined as:
bce = -[y_true*log(σ(y_pred)) + (1 - y_true)*log(1 - σ(y_pred))]
The gradient of z, i.e. the partial derivative dL/dz, consists of three elements (the same size as z); call them [dz1, dz2, dz3].
To compute the gradients of the weight parameter w and the bias parameter b, we have:
dL/dw = x.T @ dL/dz
dL/db = dL/dz (with a shape change)
Therefore b.grad is simply
[dz1, dz2, dz3]
And, since x is made up of ones, x.T @ dL/dz ends up being a matrix whose rows are all equal to dL/dz as well, i.e. five identical rows:
[[dz1, dz2, dz3],
[dz1, dz2, dz3],
[dz1, dz2, dz3],
[dz1, dz2, dz3],
[dz1, dz2, dz3]]
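A quick sketch that verifies this in code (same setup as the question; the only added claim is the outer-product identity dL/dw = outer(x, dL/dz) for a single data point):

import torch

n_input, n_output = 5, 3
x = torch.ones(n_input)
y = torch.zeros(n_output)
w = torch.randn(n_input, n_output, requires_grad=True)
b = torch.randn(n_output, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

# dL/db equals dL/dz, and dL/dw equals the outer product of x and dL/dz,
# so with x made of ones every row of w.grad is a copy of b.grad.
print(torch.allclose(w.grad, torch.outer(x, b.grad)))            # True
print(torch.allclose(w.grad, b.grad.expand(n_input, n_output)))  # True, because x is all ones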

Why is my DQN training with tensorflow becoming slower each iteration?

I'm trying to implement a DQN with experience replay in tensorflow. It seems to be working, i.e. my loss is decreasing. However, as the training loop runs, I've noticed that each training iteration becomes slower and slower. It is as if my tensorflow graph is becoming bigger and bigger and slowing down the training. I can't see what the problem with my code is. Can any tensorflow guru out there point it out? I have made a scaled-down version of my code here, which operates on random data but produces the same issue.
import numpy as np
import tensorflow as tf
# Function which initializes tensorflow weights for feed-forward NN.
def InitWeights(LayerSizes):
    # Make tensorflow input/output placeholders
    X = tf.placeholder(shape=(None, LayerSizes[0]), dtype=tf.float32, name='InputData')
    y = tf.placeholder(shape=(None, LayerSizes[-1]), dtype=tf.float32, name='OutputData')
    # Initialize dictionaries for weights and biases.
    W = {}
    b = {}
    for ii in range(len(LayerSizes) - 1):
        layername = 'layer%s' % ii
        with tf.variable_scope(layername):
            ny = LayerSizes[ii]
            nx = LayerSizes[ii + 1]
            # Weights (initialized with xavier initialization).
            W['Weights_' + layername] = tf.get_variable(
                name='Weights_' + layername,
                shape=(ny, nx),
                initializer=tf.contrib.layers.xavier_initializer(),
                dtype=tf.float32
            )
            # Bias (initialized with xavier initialization).
            b['Bias_' + layername] = tf.get_variable(
                name='Bias_' + layername,
                shape=(nx),
                initializer=tf.contrib.layers.xavier_initializer(),
                dtype=tf.float32
            )
    return W, b, X, y
# Function which defines feed-forward neural network operation.
def FeedForward(X, W, b):
    a = X
    # Loop over all layers of the network.
    for ii in range(len(W)):
        # Use name of each layer as index.
        layername = 'layer%s' % ii
        # Weighted sum: z = input*W + b
        z = tf.add(tf.matmul(a, W['Weights_' + layername], name='WeightedSum_z_' + layername), b['Bias_' + layername])
        # Pass through activation fcn: a = h(z), except for the output layer.
        if ii == len(W) - 1:
            a = z
        else:
            a = tf.nn.relu(z, name='activation_a_' + layername)
    return a
# Function used for experience replay
def ExperienceReplay(s, a, r, s_prime, gamma, TermState, X, y, yhat, yhatNN2, train_op, loss, sess):
    # Inputs:
    # s         - state(s)
    # a         - action(s)
    # r         - reward(s)
    # s_prime   - new state(s)
    # gamma     - discount factor
    # TermState - scalar of which action is terminating
    # X         - tensorflow placeholder for network inputs
    # y         - tensorflow placeholder for network outputs
    # yhat      - tensorflow operation for feed forward with NN 1
    # yhatNN2   - tensorflow operation for feed forward with NN 2
    # train_op  - tensorflow training operation
    # loss      - tensorflow fcn for calculating loss
    # sess      - tensorflow session

    # Forward pass through NN2 using s_prime to find max(Q(s',a',theta')).
    Q = sess.run(yhatNN2, feed_dict={X: s_prime})
    # Actions that NN1 thinks are best at state s_prime
    a_argmax = np.argmax(sess.run(yhat, feed_dict={X: s_prime}), axis=1)
    # Values from NN2's opinion about the actions NN1 picked.
    Qm = np.zeros(len(r))
    for obs in range(len(r)):
        Qm[obs] = Q[obs, a_argmax[obs]]
    # First make all targets equal to NN1's approximation of Q (so the error is 0 in all unobserved cases)
    Targets = sess.run(yhat, feed_dict={X: s})
    # If the action was experienced, change the target to either the real reward or the discounted future reward.
    for obs in range(len(r)):
        # If the action was episode-terminating, use only the reward as target.
        if int(a[obs]) == TermState:
            Targets[obs, int(a[obs])] = r[obs]
        # Otherwise use the discounted future reward.
        else:
            Targets[obs, int(a[obs])] = r[obs] + gamma * Qm[obs]
    # Gradient descent one step on NN1 weights.
    sess.run(train_op, feed_dict={X: s, y: Targets})
    # Calculate the losses.
    loss_val = sess.run(loss, feed_dict={X: s, y: Targets})
    meanloss = np.mean(loss_val)

    return loss_val, meanloss
if __name__ == "__main__":
    #### Hyperparameter settings
    N = 64           # Minibatch size during training
    gamma = 0.99     # Discount rate
    C = 100          # How many iterations between NN sync NN2 = NN1
    lr = 1e-7        # Learning rate of NN during training
    nstates = 256    # Number of possible states
    nactions = 256   # Number of possible actions
    TermState = 255  # Which state ends the episode
    """
    Initialize tensorflow session and create one NN with two sets of weights
    """
    # Initialize & configure action-value function Q with random weights theta.
    LayerSizes = [nstates, 1024, 1024, nactions]
    W, b, X, y = InitWeights(LayerSizes)
    # Define loss function to optimize. Here: quadratic loss fcn. (Outputdata-a)^2
    yhat = FeedForward(X, W, b)
    loss = tf.reduce_sum(tf.square(y - yhat), reduction_indices=[0])
    # Define optimizer to use when minimizing loss function.
    all_variables = tf.trainable_variables()
    optimizer = tf.train.AdamOptimizer(learning_rate=lr)
    train_op = optimizer.minimize(loss, var_list=all_variables)
    # Initialize target action-value function Qhat with random weights theta_ = theta.
    with tf.device('/gpu:0'):
        W2 = {}
        b2 = {}
        # Make a hard copy of the tensorflow weights and biases
        for key in W:
            W2[key] = tf.Variable(W[key].initialized_value())
        for key in b:
            b2[key] = tf.Variable(b[key].initialized_value())
    yhatNN2 = FeedForward(X, W2, b2)
    # Start tf session and initialize variables.
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    ## Generate random data representing state transitions <s,a,r,s'>.
    # Random states
    Ds = np.random.rand(100000, nstates) > 0.5
    Ds = Ds.astype(np.float32)
    # Random actions
    Da = np.random.randint(0, nstates, (100000, 1)).astype(np.float32)
    # Random rewards
    Dr = np.random.rand(100000, 1).astype(np.float32)
    # Random new states
    Ds_prime = np.random.rand(100000, nstates) > 0.5
    Ds_prime = Ds_prime.astype(np.float32)
    """
    Pretrain network and report time every C iterations
    """
    import time
    t0 = time.time()
    for i in range(100000):
        # Randomly pick minibatch to use
        MemsToUse = np.random.choice(len(Dr), N)
        s = Ds[MemsToUse, :]
        a = Da[MemsToUse, 0]
        r = Dr[MemsToUse, 0]
        sprime = Ds_prime[MemsToUse, :]
        # Experience replay.
        loss_val, meanloss = ExperienceReplay(s, a, r, sprime, gamma, TermState, X, y, yhat, yhatNN2, train_op, loss, sess)
        # Every C iterations copy NN2 = NN1
        if (i % C) == 0:
            t1 = time.time()
            print('iter: %i meanloss: %0.5f iteration took %0.2f s' % (i, meanloss, t1 - t0))
            t0 = time.time()
            with tf.device('/gpu:0'):
                for key in W:
                    W2[key] = tf.Variable(W[key].initialized_value())
                for key in b:
                    b2[key] = tf.Variable(b[key].initialized_value())
Update: After timing individual segments of the code, it appears that it's the first line of my ExperienceReplay function:
Q = sess.run(yhatNN2, feed_dict={X : s_prime})
that is causing most if not all of the slowdown. I don't understand the logic behind why that is; there are several feedforward passes occurring in the program, but only this one seems to cause a problem.
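Not an answer, but one quick way to test the "graph is getting bigger" suspicion is to count the operations in the default graph inside the training loop; if the count keeps climbing, new nodes are being added on every iteration (each tf.Variable(...) call, for instance, adds nodes), and every subsequent sess.run has a larger graph to deal with. A small diagnostic sketch that could sit next to the existing timing print:

# Diagnostic sketch: print the graph size every C iterations.
num_ops = len(tf.get_default_graph().get_operations())
print('ops in graph: %i' % num_ops)

# Alternatively, calling tf.get_default_graph().finalize() once, before the loop,
# makes any later attempt to add nodes to the graph raise an error immediately.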

Passing multiple values through the same feedforward network in tensorflow

I am trying to pass 3 values through the same network at one time, since I need the values of all 3 vectors for calculating the triplet loss. But it gives an error when I pass the second value.
The code snippet is:
# runs the siamese network
def forward_prop(x):
    w1 = tf.get_variable("w1", [n1, 2048], initializer=tf.contrib.layers.xavier_initializer()) * 0.01
    b1 = tf.get_variable("b1", [n1, 1], initializer=tf.zeros_initializer()) * 0.01
    z1 = tf.add(tf.matmul(w1, x), b1)  # n1*2048 x 2048*batch_size = n1*batch_size
    a1 = tf.nn.relu(z1)  # n1*batch_size

    w2 = tf.get_variable("w2", [n2, n1], initializer=tf.contrib.layers.xavier_initializer()) * 0.01
    b2 = tf.get_variable("b2", [n2, 1], initializer=tf.zeros_initializer()) * 0.01
    z2 = tf.add(tf.matmul(w2, a1), b2)  # n2*n1 x n1*batch_size = n2*batch_size
    a2 = tf.nn.relu(z2)  # n2*batch_size

    w3 = tf.get_variable("w3", [n3, n2], initializer=tf.contrib.layers.xavier_initializer()) * 0.01
    b3 = tf.get_variable("b3", [n3, 1], initializer=tf.zeros_initializer()) * 0.01
    z3 = tf.add(tf.matmul(w3, a2), b3)  # n3*n2 x n2*batch_size = n3*batch_size
    a3 = tf.nn.relu(z3)  # n3*batch_size

    w4 = tf.get_variable("w4", [n4, n3], initializer=tf.contrib.layers.xavier_initializer()) * 0.01
    b4 = tf.get_variable("b4", [n4, 1], initializer=tf.zeros_initializer()) * 0.01
    z4 = tf.add(tf.matmul(w4, a3), b4)  # n4*n3 x n3*batch_size = n4*batch_size
    a4 = tf.nn.relu(z4)  # n4*batch_size = 128*batch_size (128 feature vectors for all training examples)
    return a4
def back_prop():
    anchor_embeddings = forward_prop(x1)
    positive_embeddings = forward_prop(x2)
    negative_embeddings = forward_prop(x3)

    # finding sum of squares of distances
    distance_positive = tf.reduce_sum(tf.square(anchor_embeddings - positive_embeddings), 0)
    distance_negative = tf.reduce_sum(tf.square(anchor_embeddings - negative_embeddings), 0)

    # applying the triplet loss equation
    triplet_loss = tf.maximum(0., distance_positive - distance_negative + margin)
    triplet_loss = tf.reduce_mean(triplet_loss)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(triplet_loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        feed_dict = {
            x1: anchors,
            x2: positives,
            x3: negatives
        }
        print("Starting the Siamese network...")
        for epoch in range(total_epochs_net_1):
            for _ in range(len(anchors)):
                # fetch into a separate name so the triplet_loss tensor is not overwritten
                _, loss_val = sess.run([optimizer, triplet_loss], feed_dict=feed_dict)
            print("Epoch", epoch, "completed out of", total_epochs_net_1)
        saver = tf.train.Saver()
        saver.save(sess, 'face_recognition_model')
I am getting an error in the following line:
positive_embeddings = forward_prop(x2)
The tf.get_variable call in the forward_prop() function throws the error.
The error says:
ValueError: Variable w1 already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?
I think it's because the variable w1 gets defined in the first call to the forward_prop() function, in the following line:
anchor_embeddings = forward_prop(x1)
How do I resolve this? I cannot pass the three values separately, since I will need all three values for computing the triplet loss. Any help will be appreciated. Thanks!
You're misconfiguring your network here:
def back_prop():
    anchor_embeddings = forward_prop(x1)
    positive_embeddings = forward_prop(x2)
    negative_embeddings = forward_prop(x3)
You should only define 1 network; here you're erroneously trying to define 3 sets of variables, one for each of the 3 inputs, which amounts to defining 3 neural networks.
For triplet loss what you want to do is feed the 3 inputs in as a batch to a single network (all 3 get processed by the same network), not as individual variables. For this discussion, I'll assume your inputs are images and you're training on a single set of 3 inputs on each training step.
If your images are 256x256x1 (grayscale) in size, then a single triplet batch would be of shape [3 x 256 x 256 x 1]. Now your output will be of shape [3 x size_of_your_output_layer]. Your loss function should now be written with the understanding that the first axis there represents your 3 values: anchor, positive, negative. Compute the loss appropriately.
You can, of course, pass in multiple anchors, positives, and negatives; you'll just have to deal with that in more detail in the loss function, which is perfectly doable. My triplet loss functions have gotten pretty complex, though, so I suggest keeping it simple to start.
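A minimal sketch of the batched version, under stated assumptions: embed below is a stand-in for a single embedding network that maps a [batch, input_dim] tensor to [batch, 128] (the question's forward_prop uses a [features, batch] layout and would need adapting), and input_dim and margin are made-up values:

import tensorflow as tf

input_dim = 2048   # made-up input size
margin = 0.2       # made-up margin

def embed(x):
    # Stand-in for the question's network: one dense layer to a 128-dim embedding.
    with tf.variable_scope("embed", reuse=tf.AUTO_REUSE):
        return tf.layers.dense(x, 128)

triplet_in = tf.placeholder(tf.float32, shape=[3, input_dim])  # rows: anchor, positive, negative
out = embed(triplet_in)                       # the network (and its variables) is built once
anchor, positive, negative = tf.unstack(out)  # split the batch axis back into the 3 embeddings

d_pos = tf.reduce_sum(tf.square(anchor - positive))
d_neg = tf.reduce_sum(tf.square(anchor - negative))
triplet_loss = tf.maximum(0., d_pos - d_neg + margin)
train_op = tf.train.AdamOptimizer(1e-4).minimize(triplet_loss)

If you do want to keep three separate forward_prop calls instead, wrapping the variable definitions in a tf.variable_scope with reuse=tf.AUTO_REUSE, as the error message suggests, is the other standard way to share one set of weights across the calls.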

What does compute_gradients return in tensorflow

mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
gradients, variables = zip(*optimizer.compute_gradients(mean_sqr))
opt = optimizer.apply_gradients(list(zip(gradients, variables)))
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for j in range(TRAINING_EPOCHS):
    sess.run(opt, feed_dict={x: batch_xs, y_: batch_xs})
I don't clearly understand what compute_gradients returns. Does it return sum(dy/dx) for the given x values assigned by batch_xs, with apply_gradients then performing an update such as:
theta <- theta - LEARNING_RATE * 1/m * gradients?
Or does it already return the average of the gradients summed over the x values in the batch, i.e. sum(dy/dx) * 1/m, where m is the batch size?
compute_gradients(a, b) returns d[sum a]/db. So in your case this returns d mean_sqr / d theta, where theta is the set of all variables. There is no "dx" in this equation; you are not computing gradients wrt the inputs. So what happens to the batch dimension? You remove it yourself in the definition of mean_sqr:
mean_sqr = tf.reduce_mean(tf.pow(y_ - y, 2))
thus (assuming y is 1D for simplicity)
d[mean_sqr] / d theta = d[ 1/M SUM_i=1^M (pred(x_i) - y_i)^2 ] / d theta
                      = 1/M SUM_i=1^M d[ (pred(x_i) - y_i)^2 ] / d theta
so you are in control of whether it sums over the batch or takes the mean: if you had defined mean_sqr with reduce_sum instead of reduce_mean, the gradients would be the sum over the batch, and so on.
On the other hand, apply_gradients simply "applies the gradients"; the exact update rule is optimizer dependent. For GradientDescentOptimizer it would be
theta <- theta - learning_rate * gradients(theta)
For Adam, which you are using, the equation is more complex, of course.
Note, however, that tf.gradients is more like "backprop" than a true gradient in the mathematical sense, meaning that it depends on the graph dependencies and does not recognize dependencies which run in the "opposite" direction.
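To see the reduce_mean vs. reduce_sum point concretely, here is a small sketch (a made-up linear model, not the question's) comparing the gradients that compute_gradients returns for the two loss definitions:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 3))
y_ = tf.placeholder(tf.float32, shape=(None, 1))
w = tf.Variable(tf.ones((3, 1)))
y = tf.matmul(x, w)

mean_loss = tf.reduce_mean(tf.pow(y_ - y, 2))
sum_loss = tf.reduce_sum(tf.pow(y_ - y, 2))

optimizer = tf.train.AdamOptimizer(0.01)
grad_mean = optimizer.compute_gradients(mean_loss, var_list=[w])[0][0]  # list of (grad, var) pairs; take the grad
grad_sum = optimizer.compute_gradients(sum_loss, var_list=[w])[0][0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.rand(8, 3), y_: np.random.rand(8, 1)}
    g_mean, g_sum = sess.run([grad_mean, grad_sum], feed_dict=feed)
    print(np.allclose(g_sum, 8 * g_mean))  # True: the mean loss just divides the summed gradient by the batch size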
