I'm training a model in PyTorch. Every 10 epochs, I'm evaluating the train and test error on the entire train and test dataset. For some reason the evaluation function is causing out-of-memory on my GPU. This is strange because I use the same batch size for training and evaluation. I believe it's due to the net.forward() method being called repeatedly and all the hidden activations being kept in memory, but I'm not sure how to get around this.
def evaluate(self, data):
    correct = 0
    total = 0
    loader = self.train_loader if data == "train" else self.test_loader
    for step, (story, question, answer) in enumerate(loader):
        story = Variable(story)
        question = Variable(question)
        answer = Variable(answer)
        _, answer = torch.max(answer, 1)
        if self.config.cuda:
            story = story.cuda()
            question = question.cuda()
            answer = answer.cuda()
        pred_prob = self.mem_n2n(story, question)[0]
        _, output_max_index = torch.max(pred_prob, 1)
        toadd = (answer == output_max_index).float().sum().data[0]
        correct = correct + toadd
        total = total + answer.size(0)
    acc = correct / total
    return acc
I think it fails during validation because you don't use optimizer.zero_grad(). zero_grad() executes a detach, making the tensor a leaf; it is commonly called on every iteration in the training part.
The volatile flag on Variable has been removed as of PyTorch 0.4.0.
Ref - migration_guide_to_0.4.0
Starting from 0.4.0, to avoid the gradient being computed during validation, use torch.no_grad()
Code example from the migration guide.
# evaluate
with torch.no_grad():   # operations inside don't track history
    for input, target in test_loader:
        ...
For 0.3.x, using volatile should work. I would suggest setting the volatile flag to True for all variables used during evaluation:
story = Variable(story, volatile=True)
question = Variable(question, volatile=True)
answer = Variable(answer, volatile=True)
That way, the gradients and operation history are not stored, and you will save a lot of memory.
Also, you could delete references to those variables at the end of the batch processing:
del story, question, answer, pred_prob
Don't forget to set the model to evaluation mode (and back to train mode after you finish evaluating), for instance like this:
model.eval()
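Putting the pieces together, a no_grad-based version of your evaluate method might look like the sketch below (assuming PyTorch >= 0.4, where Variable is no longer needed, and reusing the attribute names from your snippet):

def evaluate(self, data):
    self.mem_n2n.eval()                      # switch to evaluation mode
    correct, total = 0, 0
    loader = self.train_loader if data == "train" else self.test_loader
    with torch.no_grad():                    # no graph is built, so activations are freed per batch
        for story, question, answer in loader:
            if self.config.cuda:
                story, question, answer = story.cuda(), question.cuda(), answer.cuda()
            _, answer = torch.max(answer, 1)
            pred_prob = self.mem_n2n(story, question)[0]
            _, output_max_index = torch.max(pred_prob, 1)
            correct += (answer == output_max_index).sum().item()
            total += answer.size(0)
    self.mem_n2n.train()                     # back to training mode
    return correct / total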
I'm writing a custom optimizer that I want to be JIT-able with JAX. It features 1) breaking when the maximum number of steps is reached, 2) breaking when a tolerance is reached, and 3) saving the history of the steps taken. I'm relatively new to some of this stuff in JAX, but reading the docs I have this solution:
import jax, jax.numpy as jnp

@jax.jit
def optimizer(x, tol=1, max_steps=5):
    def cond(arg):
        step, x, history = arg
        return (step < max_steps) & (x > tol)
    def body(arg):
        step, x, history = arg
        x = x / 2                          # simulate taking an optimizer step
        history = history.at[step].set(x)  # simulate saving the current step
        return (step + 1, x, history)
    return jax.lax.while_loop(
        cond,
        body,
        (0, x, jnp.full(max_steps, jnp.nan))
    )
optimizer(10.) # works
My question is whether this can be improved in some way. In particular, is there a way to avoid pre-allocating the history? This isn't ideal, since the real thing is a lot more complicated than a single array, and there's obviously the potential for wasted memory if the tolerance is reached well before the maximum number of steps.
is there a way to avoid pre-allocating the history?
No, not as I understand JAX.
In JAX, a value's "type" includes its shape, so the carry going into and coming out of the body function must have exactly the same shapes. Growing the history dynamically, say with jnp.vstack((history, x)), would change the shape on every iteration, and JAX will not allow it.
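A quick illustration of that constraint (a minimal sketch, not your optimizer): a body function that grows its carry is rejected by jax.lax.while_loop.

import jax, jax.numpy as jnp

def cond(carry):
    step, history = carry
    return step < 5

def body(carry):
    step, history = carry
    # growing the array changes its shape on every iteration ...
    history = jnp.concatenate([history, jnp.zeros(1)])
    return (step + 1, history)

# ... so this raises an error complaining that the input and output
# carries of body do not have matching shapes/structures.
jax.lax.while_loop(cond, body, (0, jnp.zeros(1)))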
There is a way, if you expect the tolerance to often be reached well before the maximum number of steps.
JAX implements sparse matrices (and pytrees of them) in jax.experimental.sparse. They have the same logical shape as the maximum history size, and therefore satisfy the "fixed size" requirement of XLA, but only store the nonzero elements in memory.
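A minimal sketch of that property (just showing that a BCOO array keeps the full logical shape while only materializing the stored entries; not a complete optimizer):

import jax.numpy as jnp
from jax.experimental import sparse

max_steps = 1000
dense_history = jnp.zeros(max_steps).at[3].set(2.5)            # mostly-empty "history"
sparse_history = sparse.BCOO.fromdense(dense_history, nse=1)   # nse = number of stored elements

print(sparse_history.shape)   # (1000,) -- full logical shape
print(sparse_history.data)    # only the stored value(s): [2.5]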
I have two losses: the usual L1 loss, and a second one involving torch.rfft():
def dft_amp(img):
    fft_im = torch.rfft(img, signal_ndim=2, onesided=False)
    fft_amp = fft_im[:, :, :, :, 0]**2 + fft_im[:, :, :, :, 1]**2
    return torch.sqrt(fft_amp)

l1_loss = torch.nn.L1Loss()

loss = l1_loss(pred, gt) + l1_loss(dft_amp(pred), dft_amp(gt))
loss.backward()
This runs for the 1st iteration, with both losses not being NaN, but from the 2nd iteration onwards the loss becomes NaN.
If, however, only the simple L1 loss is kept and l1_loss(dft_amp(pred), dft_amp(gt)) is omitted, training proceeds normally.
Does torch.rfft() support backpropagation? I am using PyTorch 1.4.0.
Any suggestions would be appreciated
The issue seems to be with torch.sqrt(), not with torch.rfft(): torch.sqrt() cannot handle very small values during backpropagation. Thus, in the above code, replace
return torch.sqrt(fft_amp)
with
return torch.sqrt(fft_amp + 1e-10)
and the NaN values will vanish away.
Though this solution works, it would be nice if someone dug into why square roots are problematic here.
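One likely explanation: the derivative of sqrt(x) is 1/(2*sqrt(x)), which diverges as x goes to 0, so backpropagating through a square root of an exact zero produces an infinite gradient that then poisons the weights with NaNs on the next update. A small stand-alone reproduction (not the original training code):

import torch

x = torch.zeros(1, requires_grad=True)
torch.sqrt(x).sum().backward()
print(x.grad)                           # tensor([inf]) -- gradient blows up at 0

x = torch.zeros(1, requires_grad=True)
torch.sqrt(x + 1e-10).sum().backward()  # with the epsilon the gradient stays finite
print(x.grad)                           # tensor([50000.])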
I want to use Python 3 to build a zero-inflated Poisson model. In the statsmodels library I found the class statsmodels.discrete.count_model.ZeroInflatedPoisson.
I just wonder how to use it. It seems I should do:
ZIFP(Y_train,X_train).fit().
But when I tried to predict using X_test,
it told me the length of X_test doesn't match X_train.
Or is there another package to fit this model?
Here is the code I used:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.discrete.count_model import ZeroInflatedPoisson

x1 = [random.randint(0, 1) for i in range(200)]
x2 = [random.randint(1, 2) for i in range(200)]
y = np.random.poisson(lam=2, size=100).tolist()
for i in range(100):
    y.append(0)

df = pd.DataFrame()
df['x1'] = x1
df['x2'] = x2
df['y'] = y
df_x = df.iloc[:, :-1]
x_train, x_test, y_train, y_test = train_test_split(df_x, df['y'], test_size=0.3)
clf = ZeroInflatedPoisson(endog=y_train, exog=x_train).fit()
clf.predict(x_test)
ValueError: operands could not be broadcast together with shapes (140,) (60,)
I also tried:
clf.predict(x_test, exog=np.ones(len(x_test)))
ValueError: shapes (60,) and (1,) not aligned: 60 (dim 0) != 1 (dim 0)
This looks like a bug to me.
As far as I can see:
If there are no explanatory variables, exog_infl, specified for the inflation model, then an array of ones is used to model a constant inflation probability.
However, if exog_infl in predict is None, then it uses the model.exog_infl which is an array of ones with the length equal to the training sample.
As a workaround, specifying a 1-D array of ones of the correct length in predict should work.
Try:
clf.predict(x_test, exog_infl=np.ones(len(x_test)))
I guess the same problem will occur if exposure was used in the model, but is not explicitly specified in predict.
I ran into the same problem, which landed me on this thread. As noted by Josef, it seems you need to pass exog_infl an array of ones of the correct length for predict to work.
However, in my case the ones had to be passed as a 2-D column, which Josef's snippet misses, so the full line required to generate the prediction is actually
clf.predict(x_test, exog_infl=np.ones((len(x_test), 1)))
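For completeness, with the simulated data from the question, the fit/predict round trip with this workaround would look roughly like the sketch below (the exact shape statsmodels expects for exog_infl may vary between versions, so try the 1-D form first and fall back to the column shape):

clf = ZeroInflatedPoisson(endog=y_train, exog=x_train).fit()

# supply the inflation regressors explicitly, sized to the test set
preds = clf.predict(x_test, exog_infl=np.ones((len(x_test), 1)))
print(preds.shape)   # expected: (len(x_test),)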
For diagnostic purposes, I am grabbing the gradients of the network periodically. One way to do this is to return the gradients as outputs of the Theano function. However, copying the gradients from GPU to CPU memory every time may be costly, so I would prefer to do it only periodically. At the moment, I am achieving this by creating two function objects, one which returns the gradients and one which doesn't.
However, I do not know whether this is optimal and am looking for a more elegant way to achieve the same thing.
Your first function obviously executes a training step and updates all your parameters.
The second function must return the gradients of your parameters.
The fastest way to do what you are asking is to add the updates for the training step to the second function as well; then, when logging the gradients, don't call the first function, only the second.
gradients = [ ... ]

train_f = theano.function([x, y], [], updates=updates)
train_grad_f = theano.function([x, y], gradients, updates=updates)

num_iters = 1000
grad_array = []
for i in range(num_iters):
    # every 10 training steps, keep a log of the gradients
    if i % 10 == 0:
        grad_array.append(train_grad_f(...))
    else:
        train_f(...)
Update
If you wish to have a single function to do this, you can do the following:
from theano.ifelse import ifelse

no_grad = T.iscalar('no_grad')
example_gradient = T.grad(example_cost, example_variable)

# if no_grad is > 0 then return the gradient, otherwise return a zeros array
out_grad = ifelse(T.gt(no_grad, 0), example_gradient, T.zeros_like(example_variable))

train_f = theano.function([x, y, no_grad], [out_grad], updates=updates)
So when you want to retrieve the gradients you call
train_f(x_data, y_data, 1)
otherwise
train_f(x_data, y_data, 0)
I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done to see how deep learning stacks up. I recognize this is probably a fairly simple task if I were more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
    # check if y has the same dimension as y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in Python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
    x = T.vector('x')
    classes = T.scalar('n_classes')
    onehot = T.eq(x.dimshuffle(0, 'x'), T.arange(classes).dimshuffle('x', 0))
    oneHot = theano.function([x, classes], onehot)

    yMat = T.matrix('y')
    yPredMat = T.matrix('y_pred')
    confMat = T.dot(yMat.T, yPredMat)
    confusionMatrix = theano.function(inputs=[yMat, yPredMat], outputs=confMat)

    def confusion_matrix(x, y, n_class):
        return confusionMatrix(oneHot(x, n_class), oneHot(y, n_class))

    t = np.asarray(confusion_matrix(y, self.y_pred, self.n_out))
    print(t)
But I'm not completely clear on how to get this to interface with the function in question and give me a numpy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifier as my output layer in a number of configurations, so I could use the confusion matrix with other architectures.
I suggest a brute-force sort of approach. First you need an output for the predictions; create a function for it:
prediction = theano.function(
    inputs=[index],
    outputs=MLPlayers.predicts,
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
labels = []
wrong = 0
predictions = []

labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
    wrong = wrong + int(test_model(mini_batch))
    predictions = predictions + prediction(mini_batch).tolist()
Now create the confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs, outs), dtype=int)   # outs = number of classes
for index in xrange(len(predictions)):
    if labels[index] == predictions[index]:
        correct = correct + 1
    confusion[int(predictions[index]), int(labels[index])] = confusion[int(predictions[index]), int(labels[index])] + 1
You can find this kind of an implementation in this repository.
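If you only need the matrix on the Python side (outside the Theano graph), a vectorized NumPy version of the same counting loop could look like this sketch (assuming labels and predictions are plain lists of integer class indices):

import numpy as np

def confusion_from_lists(labels, predictions, n_classes):
    labels = np.asarray(labels, dtype=int)
    predictions = np.asarray(predictions, dtype=int)
    conf = np.zeros((n_classes, n_classes), dtype=int)
    # rows = predicted class, columns = true class, matching the loop above
    np.add.at(conf, (predictions, labels), 1)
    return conf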