Can anybody explain to me what is the meaning behind these two lines of code from here: https://github.com/Newmu/Theano-Tutorials/blob/master/4_modern_net.py
acc = theano.shared(p.get_value() * 0.)
acc_new = rho * acc + (1 - rho) * g ** 2
Is it a mistake? Why do we instantiate acc to zero and then multiply it by rho in the next line? It looks like it will not achieve anything this way and will remain zero. Will there be any difference if we replace "rho * acc" with just "acc"?
The full function is given below:
def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates
This is just a way to tell Theano "create a shared variable and initialize its value to zero, with the same shape as p."
This RMSprop method is a symbolic method. It does not actually compute the RMSprop parameter updates; it only tells Theano how the parameter updates should be computed when the eventual Theano function is executed.
If you look further down the tutorial code you linked to, you'll see that the symbolic execution graph for the parameter updates is constructed by RMSprop via a call on line 67. These updates are then compiled into a Theano function called train on line 69, and the train function is executed many times on line 74, within the for loops of lines 72 and 73. The Python function RMSprop is called only once, irrespective of how many times the train function is called within those loops.
Within RMSprop, we are telling Theano that, for each parameter p, we need a new Theano variable whose initial value has the same shape as p and is 0. throughout. We then go on to tell Theano how it should update both this new variable (unnamed as far as Theano is concerned, but named acc in Python) and how to update the parameter p itself. These commands do not alter either p or acc; they just tell Theano how p and acc should be updated later, once the function has been compiled (line 69), each time it is executed (line 74).
The function executions on line 74 do not call the RMSprop Python function; they execute a compiled version of RMSprop. There is no initialization inside the compiled version because that already happened in the Python version of RMSprop. Each train execution of the line acc_new = rho * acc + (1 - rho) * g ** 2 will use the current value of acc, not its initial value.
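Here is a minimal sketch of that behavior, with made-up values rather than the tutorial's actual model, showing that rho * acc stops being zero after the first execution of the compiled function:
import numpy as np
import theano
import theano.tensor as T

p = theano.shared(np.array([1.0, 2.0]))
acc = theano.shared(p.get_value() * 0.)  # zeros, same shape as p
g = T.vector('g')
rho = 0.9
acc_new = rho * acc + (1 - rho) * g ** 2
step = theano.function([g], acc_new, updates=[(acc, acc_new)])

print(step([1.0, 1.0]))  # [0.1, 0.1]; acc is then updated to [0.1, 0.1]
print(step([1.0, 1.0]))  # [0.19, 0.19]; rho * acc is no longer zero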
In PyTorch, I want to do the following calculation:
l1 = f(x.detach(), y)
l1.backward(retain_graph=True)
l2 = -1*f(x, y.detach())
l2.backward()
where f is some function, and x and y are tensors that require gradient. Notice that x and y may both be the results of previous calculations which utilize shared parameters (for example, maybe x=g(z) and y=g(w) where g is an nn.Module).
The issue is that l1 and l2 are both numerically identical, up to the minus sign, and it seems wasteful to repeat the calculation f(x,y) twice. It would be nicer to be able to calculate it once, and apply backward twice on the result. Is there any way of doing this?
One possibility is to manually call autograd.grad and update the w.grad field of each nn.Parameter w. But I'm wondering if there is a more direct and clean way to do this, using the backward function.
I took this answer from here.
We can calculate f(x,y) once, without detaching either x or y, if we ensure that the gradient flowing through x is multiplied by -1. This can be done using register_hook:
x.register_hook(lambda t: -t)
l = f(x,y)
l.backward()
Here is code demonstrating that this works:
import torch
lin = torch.nn.Linear(1, 1, bias=False)
lin.weight.data[:] = 1.0
a = torch.tensor([1.0])
b = torch.tensor([2.0])
loss_func = lambda x, y: (x - y).abs()
# option 1: this is the inefficient option, presented in the original question
lin.zero_grad()
x = lin(a)
y = lin(b)
loss1 = loss_func(x.detach(), y)
loss1.backward(retain_graph=True)
loss2 = -1 * loss_func(x, y.detach()) # second invocation of `loss_func` - not efficient!
loss2.backward()
print(lin.weight.grad)
# option 2: this is the efficient method, suggested in this answer.
lin.zero_grad()
x = lin(a)
y = lin(b)
x.register_hook(lambda t: -t)
loss = loss_func(x, y) # only one invocation of `loss_func` - more efficient!
loss.backward()
print(lin.weight.grad) # the output of this is identical to the previous print, which confirms the method
# option 3 - this should not be equivalent to the previous options, used just for comparison
lin.zero_grad()
x = lin(a)
y = lin(b)
loss = loss_func(x, y)
loss.backward()
print(lin.weight.grad)
I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocities in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of x, C is a 2x1 vector that is a function of x and x', and B is a 2x2 matrix of constants. m(t) is a 2x1001 array containing the torques applied to each of the two joints at the 1001 time steps, and I would like to calculate the evolution of the angles as a function of those 1001 time steps.
I've transformed it to standard form such that :
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first order linear system of equations:
y_1' = y_2
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
    phi_1 = np.array([theta_array[0], theta_array[1]])
    phi_2 = np.array([theta_array[2], theta_array[3]])

    def M_func(phi):
        M = np.array([[a_1 + 2. * a_2 * np.cos(phi[1]), a_3 + a_2 * np.cos(phi[1])],
                      [a_3 + a_2 * np.cos(phi[1]), a_3]])
        return np.linalg.inv(M)

    def C_func(phi, phi_dot):
        return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2. * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])

    dphi_2dt = M_func(phi_1) @ (torques[:, t] - C_func(phi_1, phi_2) - B @ phi_2)
    return dphi_2dt, phi_2
t = np.linspace(0,1,1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args = (torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense, however I am not sure how to have it use the current value of the torques at each time step.
I also tried putting odeint command in a for loop and only evaluating it at one time step at a time, using the solution of the function as the initial conditions for the next loop, however the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form but I can't work out what it is. It would be preferable however to not have to call the odeint solver in a for loop every time, and rather do it all as one.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses an internal stepping that is problem-adapted. The given time list is a list of points where the internal solution gets interpolated for output samples. The internal and external time lists are in no way related; the internal list depends only on the given tolerances.
There is no actual natural relation between array indices and sample times.
The translation of a given time into an index and construction of a sample value from the surrounding table entries is called interpolation (by a piecewise polynomial function).
Torque as a physical phenomenon is at least continuous, so a piecewise linear interpolation is the easiest way to transform the given function value table into an actual continuous function. Of course one also needs the time array.
So use numpy.interp or the more advanced routines of scipy.interpolate (for instance interp1d) to define a torque function that can be evaluated at the arbitrary times demanded by the solver and its integration method.
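A minimal sketch of how that could look with the question's variable names (t, torque_array, the constants, and initial as defined in the question; the helper name torque_of_t is made up here). Note that odeint needs the derivative returned as one flat array in the same order as the state vector, with y_1' = y_2 first, which may also be what went wrong in the original return statement:
import numpy as np
from scipy.integrate import odeint
from scipy.interpolate import interp1d

# build a continuous torque function from the (2, 1001) sample table
torque_of_t = interp1d(t, torque_array, axis=1, fill_value="extrapolate")

def joint_angles(theta_array, t_now, torque, B):
    phi_1 = theta_array[:2]   # angles
    phi_2 = theta_array[2:]   # angular velocities
    M = np.array([[a_1 + 2. * a_2 * np.cos(phi_1[1]), a_3 + a_2 * np.cos(phi_1[1])],
                  [a_3 + a_2 * np.cos(phi_1[1]), a_3]])
    C = a_2 * np.sin(phi_1[1]) * np.array([-phi_2[1] * (2. * phi_2[0] + phi_2[1]), phi_2[0]**2])
    dphi_2dt = np.linalg.inv(M) @ (torque(t_now) - C - B @ phi_2)
    # flat derivative vector ordered like the state: y_1' = y_2 comes first
    return np.concatenate([phi_2, dphi_2dt])

x = odeint(joint_angles, initial, t, args=(torque_of_t, B))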
I am trying to understand the following piece of theano code.
self.sgd_step = theano.function(
    [x, y, learning_rate, theano.Param(decay, default=0.9)],
    [],
    updates=[(E, E - learning_rate * dE / T.sqrt(mE + 1e-6)),
             (U, U - learning_rate * dU / T.sqrt(mU + 1e-6)),
             (W, W - learning_rate * dW / T.sqrt(mW + 1e-6)),
             (V, V - learning_rate * dV / T.sqrt(mV + 1e-6)),
             (b, b - learning_rate * db / T.sqrt(mb + 1e-6)),
             (c, c - learning_rate * dc / T.sqrt(mc + 1e-6)),
             (self.mE, mE),
             (self.mU, mU),
             (self.mW, mW),
             (self.mV, mV),
             (self.mb, mb),
             (self.mc, mc)
             ])
Can someone please tell me what the author of the above code is trying to do there? Is the value [x, y, learning_rate, theano.Param(decay, default=0.9)] the thing being updated, and is it being updated by []? And what is the function of updates here?
I would be grateful for any idea of what is going on in the above code.
The documentation of the updates is as follows (taken from here).
updates must be supplied with a list of pairs of the form (shared-variable, new expression). It can also be a dictionary whose keys are shared-variables and values are the new expressions. Either way, it means "whenever this function runs, it will replace the .value of each shared variable with the result of the corresponding expression". Above, our accumulator replaces the state's value with the sum of the state and the increment amount.
So when you call the above Theano function with the required inputs, it will update the values of the shared variables E, U, W, V, b, c, ..., self.mc. The new value of each is given by the second element of its pair: basically, E = E - learning_rate * dE / T.sqrt(mE + 1e-6), and so on. Note that the first list, [x, y, learning_rate, theano.Param(decay, default=0.9)], is nothing but the function's inputs, and the empty list [] is its outputs: the function returns nothing and does all its work through the updates.
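Here is a minimal, self-contained sketch of the same inputs/outputs/updates pattern, essentially the accumulator example the quoted documentation refers to:
import theano
import theano.tensor as T

state = theano.shared(0)   # shared variable, initial value 0
inc = T.iscalar('inc')     # symbolic input
accumulator = theano.function([inc], [], updates=[(state, state + inc)])

accumulator(1)             # returns nothing; state becomes 1
accumulator(10)            # state becomes 11
print(state.get_value())   # 11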
I am following RNN tutorial of Tensorflow.
I am having trouble understanding the function ptb_producer in reader.py in following script :
with tf.control_dependencies([assertion]):
    epoch_size = tf.identity(epoch_size, name="epoch_size")

i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps])
x.set_shape([batch_size, num_steps])
y = tf.strided_slice(data, [0, i * num_steps + 1], [batch_size, (i + 1) * num_steps + 1])
y.set_shape([batch_size, num_steps])
return x, y
Can anyone explain what tf.train.range_input_producer is doing ?
I have been trying to understand the same tutorial for weeks now. In my opinion, what makes it so difficult is the fact that all the functions one calls from TensorFlow are not executed immediately, but rather add their corresponding operation nodes to the graph.
According to the official documentation, a range input producer 'generates integers from 0 to limit - 1 in a queue'. So, the way I see it, the code in question, i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue(), creates a node which acts as a counter, producing the next number in the sequence 0, 1, ..., epoch_size - 1 each time it is executed.
This is used to get the next batch from the input data. The raw data is split into batch_size rows, so that on every run batch_size batches are given to the training function. Within every batch (row), a sliding window of size num_steps moves forward; the counter i advances the window by num_steps on each call.
Both x and y are of shape [batch_size, num_steps], since they contain batch_size batches of num_steps steps each. Variable x is the input and y is the expected output for the given input; y is produced by moving the window one item to the right, so that if x = data[i:(i + num_steps)] then y = data[(i + 1):(i + num_steps + 1)].
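As a small numpy sketch of that slicing relationship (toy data, not from the tutorial):
import numpy as np

data = np.arange(20).reshape(2, 10)  # 2 batch rows of a toy sequence
num_steps = 3
i = 1  # the counter value on some particular run
x = data[:, i * num_steps:(i + 1) * num_steps]           # [[3 4 5], [13 14 15]]
y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]   # [[4 5 6], [14 15 16]]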
It has been a nightmare for me, but I hope this post helps people in the future.
Problem Synopsis:
When attempting to use the scipy.optimize.fmin_bfgs minimization (optimization) function, the function throws a
derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
error. According to my error checking this occurs at the very end of the first iteration through fmin_bfgs--just before any values are returned or any calls to callback.
Configuration:
Windows Vista
Python 3.2.2
SciPy 0.10
IDE = Eclipse with PyDev
Detailed Description:
I am using scipy.optimize.fmin_bfgs to minimize the cost of a simple logistic regression implementation (converting from Octave to Python/SciPy). Basically, the cost function is named cost_arr and the gradient function is gradient_descent_arr.
I have manually tested and fully verified that cost_arr and gradient_descent_arr work properly and return all values properly. I also tested to verify that the proper parameters are passed to the fmin_bfgs function. Nevertheless, when run, I get the ValueError: matrices are not aligned error. According to my review of the source, the exact error occurs in the line_search_wolfe1 function (in the section commented "Minpack's Wolfe line and scalar searches") supplied by the scipy package.
Notably, if I use scipy.optimize.fmin instead, the fmin function runs to completion.
Exact Error:
File "D:\Users\Shannon\Programming\Eclipse\workspace\SBML\sbml\LogisticRegression.py", line 395, in fminunc_opt
    optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
File "C:\Python32x32\lib\site-packages\scipy\optimize\optimize.py", line 533, in fmin_bfgs
    old_fval, old_old_fval)
File "C:\Python32x32\lib\site-packages\scipy\optimize\linesearch.py", line 76, in line_search_wolfe1
    derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
I call the optimization function with:
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
I have spent a few days trying to fix this and cannot seem to determine what is causing the matrices are not aligned error.
ADDENDUM: 2012-01-08
I worked with this a lot more and seem to have narrowed down the issues (but am baffled on how to fix them). First, fmin (using just fmin) works using these functions--cost, gradient. Second, the cost and the gradient functions both accurately return expected values when tested in a single iteration in a manual implementation (NOT using fmin_bfgs). Third, I added error code to optimize.linesearch and the error seems to be thrown at def line_search_wolfe1 in line: derphi0 = np.dot(gfk, pk).
Here, according to my tests, inside scipy.optimize.optimize:
pk = [[ 12.00921659]
 [ 11.26284221]]
gfk = [[-12.00921659]
 [-11.26284221]]
Both are (2, 1) ndarrays.
Note: according to my tests, the error is thrown on the very first iteration through fmin_bfgs (i.e., fmin_bfgs never even completes a single iteration or update).
I appreciate ANY guidance or insights.
My Code Below (logging, documentation removed):
Assume theta = 2x1 ndarray (Actual: theta Info Size=(2, 1))
Assume X = 100x2 ndarray (Actual: X Info Size=(2, 100))
Assume y = 100x1 ndarray (Actual: y Info Size=(100, 1))
def cost_arr(self, theta, X, y):
    theta = scipy.resize(theta, (2, 1))
    m = scipy.shape(X)
    m = 1 / m[1]  # Use m[1] because this is the length of X
    logging.info(__name__ + "cost_arr reports m = " + str(m))
    z = scipy.dot(theta.T, X)  # Must transpose the vector theta
    hypthetax = self.sigmoid(z)
    yones = scipy.ones(scipy.shape(y))
    hypthetaxones = scipy.ones(scipy.shape(hypthetax))
    costright = scipy.dot((yones - y).T, ((scipy.log(hypthetaxones - hypthetax)).T))
    costleft = scipy.dot((-1 * y).T, ((scipy.log(hypthetax)).T))
def gradient_descent_arr(self, theta, X, y):
    theta = scipy.resize(theta, (2, 1))
    m = scipy.shape(X)
    m = 1 / m[1]  # Use m[1] because this is the length of X
    x = scipy.dot(theta.T, X)  # Must transpose the vector theta
    sig = self.sigmoid(x)
    sig = sig.T - y
    grad = scipy.dot(X, sig)
    grad = m * grad
    return grad
def fminunc_opt_bfgs(self, initialtheta, X, y, maxnumit):
    myargs = (X, y)
    optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, retall=True, full_output=True)
    return optcost
In case anyone else encounters this problem ....
1) ERROR 1: As noted in the comments, I incorrectly returned the value from my gradient as a multidimensional array (m,n) or (m,1). fmin_bfgs seems to require a 1d array output from the gradient (that is, you must return a (m,) array and NOT a (m,1) array). Use scipy.shape(myarray) to check the dimensions if you are unsure of the return value.
The fix involved adding:
grad = numpy.ndarray.flatten(grad)
just before returning the gradient from your gradient function. This "flattens" the array from (m,1) to (m,). fmin_bfgs can take this as input.
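A quick numpy illustration, with made-up values, of why the (m,1) shape triggers the alignment error:
import numpy as np

gfk = np.array([[-12.0], [-11.2]])  # shape (2, 1), like the question's debug output
pk = np.array([[12.0], [11.2]])     # shape (2, 1)
# np.dot(gfk, pk) would raise ValueError here: a (2,1) by (2,1) product is not aligned
print(np.dot(gfk.flatten(), pk.flatten()))  # -269.44, the scalar the line search expects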
2) ERROR 2: Remember, fmin_bfgs seems to work with NONlinear functions. In my case, the sample I was initially working with was a LINEAR function. This appears to explain some of the anomalous results even after the flatten fix mentioned above. For LINEAR functions, fmin, rather than fmin_bfgs, may work better.
QED
As of current SciPy versions, you need not pass the fprime argument; the gradient will be computed for you numerically without any issues. You can also use the minimize function and pass method='BFGS' instead, again without providing the gradient as an argument.
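A minimal sketch of that suggestion, using a toy quadratic in place of the question's cost function:
import numpy as np
from scipy.optimize import minimize

# toy quadratic standing in for the question's cost function
def cost(theta):
    return np.sum((theta - np.array([1.0, 2.0])) ** 2)

def grad(theta):
    # gradients handed to BFGS must be flat (m,) arrays, not (m,1)
    return 2 * (theta - np.array([1.0, 2.0]))

res = minimize(cost, x0=np.zeros(2), method='BFGS', jac=grad)
print(res.x)  # approximately [1., 2.]

# jac can also be omitted; minimize then approximates the gradient numerically
res2 = minimize(cost, x0=np.zeros(2), method='BFGS')
print(res2.x)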