CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` - nlp

I got the following error when I ran my pytorch deep learning model in colab
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
1370 ret = torch.addmm(bias, input, weight.t())
1371 else:
-> 1372 output = input.matmul(weight.t())
1373 if bias is not None:
1374 output += bias
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
I even reduced the batch size from 128 to 64, i.e. halved it, but I still got this error. Earlier I ran the same code with a batch size of 128 and didn't get any error like this.

No, the batch size does not matter in this case.
The most likely reason is an inconsistency between the number of labels and the number of output units.
Try printing the size of the final output in the forward pass and check it:
print(model.fc1(x).size())
Here fc1 should be replaced by the name of your model's last linear layer before the return.
Make sure that label.size() is equal to prediction.size() before calculating the loss.
Even after fixing that problem, you will have to restart the GPU runtime (I needed to do this in my case when using a Colab GPU).
This answer might also be helpful
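As a concrete illustration of that check, here is a minimal sketch with made-up sizes (not the asker's model):
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                       # 5 output units, so labels must lie in 0..4
inputs = torch.randn(8, 10)
labels = torch.randint(0, 5, (8,))

logits = model(inputs)
print(logits.size(), labels.size())            # torch.Size([8, 5]) torch.Size([8])
assert labels.max().item() < logits.size(1)    # every label must index a valid output unit
loss = nn.CrossEntropyLoss()(logits, labels)
Note that with CrossEntropyLoss the label tensor has one fewer dimension than the logits, so "equal size" in practice means a matching batch dimension and labels that stay within the number of output units.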

This error can actually be due to different reasons. It is recommended to debug CUDA errors by running the code on the CPU, if possible. If that’s not possible, try to execute the script via:
CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM]
This will help you get the right line of code which raised the error in the stack trace so that you can resolve it.
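If you are in a notebook rather than launching a script from the shell, the same flag can be set from Python, as long as it is set before anything touches the GPU (a small sketch; restart the runtime first so no CUDA context exists yet):
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before the first CUDA call

import torch                               # import torch only after setting the flag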

Reducing the batch size worked for me, and training proceeded as planned.

First, try running the same code on your CPU to check whether everything is fine with your tensors' shapes.
In my case everything was fine. Since this error means "resource allocation failed inside the cuBLAS library", I tried decreasing the batch size and it solved the issue. You said you reduced it to 64 and it didn't help. Can you try 32, 8, or 1?

I encountered this problem when the number of labels did not match the number of the network's output channels, i.e. the number of predicted classes.

I had the same problem. I don't know the exact reason, but I know the cause:
the last line of my nn.Module was
self.fc3 = nn.Linear(84, num_classes)
I changed my real number of classes to twice as many, but did not change the value of the variable num_classes, which presumably caused a mismatch somewhere when I was outputting the results.
After I fixed the value of num_classes, it just worked.
I recommend going over the numbers in your model again.
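One way to avoid that kind of drift is to derive num_classes from the labels instead of hard-coding it in two places; a minimal sketch with illustrative names:
import torch
import torch.nn as nn

labels = torch.tensor([0, 3, 1, 2, 3])            # example training labels
num_classes = int(labels.max().item()) + 1        # 4 classes: 0..3

fc3 = nn.Linear(84, num_classes)                  # final layer sized from the data
print(fc3)                                        # Linear(in_features=84, out_features=4, bias=True)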

I was facing CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)` on Colab.
Updating PyTorch to 1.8.1 fixed the issue.

I ran into this issue because I was passing parameters in the wrong order to the BCELoss function. This became apparent only after switching to CPU.

Good chance that there is a layer mismatch. Double check to make sure all the dimensions are consistent at each layer.

The accurate error message can be obtained by switching to the CPU. In my case I had 8 class placeholders at the input of torch.nn.CrossEntropyLoss, but there were 9 different labels (0-8).

My model classifies two classes with only one neuron in the last layer. I had this problem when the last layer was nn.Linear(512, 1) in PyTorch while my labels were just [0] or [1]. I solved it by adding a nn.Sigmoid() layer after that final linear layer.
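A minimal sketch of that setup, assuming a single-output binary classifier trained with BCELoss (the 512 comes from the answer above; everything else is illustrative):
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1),
    nn.Sigmoid(),                                  # squash the single logit into [0, 1]
)
criterion = nn.BCELoss()

x = torch.randn(4, 512)
labels = torch.tensor([[0.], [1.], [1.], [0.]])    # float targets with shape [batch, 1]
loss = criterion(model(x), labels)
Alternatively, nn.BCEWithLogitsLoss on the raw linear output (without the Sigmoid) is the numerically more stable variant of the same idea.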

For a large-scale dataset, just delete the temporary variables at the end of each iteration:
for batch_idx, (x, target) in enumerate(train_dataloader):
    ...  # forward pass, loss, backward pass, optimizer step
    del x, target, loss, outputs
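If memory still isn't released quickly enough, you can additionally ask PyTorch's caching allocator to give unused blocks back to the driver (this only clears the cache; it does not free tensors that are still referenced):
import torch
torch.cuda.empty_cache()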

Reducing the batch size worked for me.

Reducing the batch size didn't work for me. I had defined num_classes in main.py and also in the model structure, and I forgot to change num_classes in the model structure; that is why I got the error. After changing it, training started.

This is probably a mismatch of dimensions or indices. You can get clearer feedback about the error by running your model on the CPU. You can reduce the dataset size if needed; in my case, since it was a simple prediction, I just switched to the CPU and found out it was a token outside my model's vocabulary range.
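A quick sanity check for that out-of-vocabulary case, sketched with a plain embedding layer (a real model's table is usually reachable via something like model.get_input_embeddings(); treat the sizes and names below as illustrative):
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30522, embedding_dim=768)   # illustrative vocab size
input_ids = torch.tensor([[101, 2023, 2003, 102]])

# Any id >= num_embeddings can surface on the GPU as an opaque CUDA error
assert int(input_ids.max()) < embedding.num_embeddings, "token id outside the vocabulary"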

Reducing the maximum sequence length for a model that has a limit (e.g. BERT) solves the error for me.
Also, I faced the same issue when I resized the embedding layer of a model: model.resize_token_embeddings(NEW_SIZE), trained, and saved it.
At prediction time, when I loaded the model, I needed to resize the embedding layer again!
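A minimal sketch of that reload path, assuming a Hugging Face transformers model and a tokenizer that had tokens added before training (the directory name is illustrative):
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("saved_model_dir")   # tokenizer with the extra tokens
model = AutoModel.from_pretrained("saved_model_dir")

# Resize again after loading so the embedding matrix matches the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))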

I got the same issue in Google Colab with the GPU runtime, so I changed from GPU to TPU.
Then it got resolved.
I recommend trying the same.

Related

Tensorflow object detection API - validation loss behaviour

I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.
As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.
I'm training the model using "ssd_mobilenet_v1_coco" as a starting point, and the "model_main.py" script, so that I can have training and validation results.
When I visualize the training progress in TensorBoard, I get plots of the training loss and of the validation loss, respectively (plots not reproduced here).
I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots.
The training loss goes as expected, decreasing over time.
In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.
But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).
Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?
Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work since there seems to be already a learning rate scheduler in the config file.
However, immediately below (in the config file) there's a num_steps option, and it's stated that
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
Honestly, I don't remember if I commented out the num_steps option...if I didn't, it seems my learning rate was kept to the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning rate schedule was active), I guess that, even with the decay, it still started from too high a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.
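For reference, the learning rate lives in a block of the pipeline config roughly like the one below; the decay values differ between released configs, so treat the numbers as illustrative rather than exact:
train_config: {
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0001   # lowered from the default 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
    }
  }
  # The num_steps line discussed above sits in this same train_config block
  num_steps: 200000
}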

Memory error while running LSTM Autoencoders

I am setting up an LSTM autoencoder with multivariate time sequences. Each of my sequences has a different number of time steps (approximately 30 million steps in one sequence) and 6 features. I know that to give one sequence as input to the LSTM autoencoder, I have to reshape it as (1, 30 million, 6). I reshaped all 9 of my sequences in a similar manner. I want the autoencoder to reconstruct my sequences, but my program is crashing due to the large number of time steps in each sequence. How can I solve this memory error? Even if I feed the data in batches, my program runs out of memory. I am new to machine learning and sequence learning, so please help me. My network is attached below:
from keras.layers import Input, LSTM, RepeatVector, Lambda
from keras.models import Model
from keras import backend as K

def repeat_vector(args):
    # Repeat the encoder output once per time step of the original sequence
    [layer_to_repeat, sequence_layer] = args
    return RepeatVector(K.shape(sequence_layer)[1])(layer_to_repeat)

# self._input_features, self._latent_space and self._input_cells are attributes of my class
encoder_input = Input(shape=(None, self._input_features))
encoder_output = LSTM(self._latent_space)(encoder_input)
decoder_input = Lambda(repeat_vector, output_shape=(None, self._latent_space))([encoder_output, encoder_input])
decoder_output = LSTM(self._input_cells, return_sequences=True)(decoder_input)
self._autoencoder = Model(encoder_input, decoder_output)
I have already tried to take input via hdf files.
I am not sure what system configuration you are using. OOMs can be tackled from both the software and hardware ends. If you are using a system with, say, 4 GB RAM and some i5 processor (assuming it's Intel), it might not work. If you are working on a GPU (which is not very likely), it should not be a hardware issue.
If your system has a graphics card, then you can optimize the code a bit:
Try a batch size of 1.
If you have a pre-processing queue etc., try to tweak the queue size.
I would suggest trying this on a smaller series first, before going for the complete thing, and checking whether it works.
If you take the time step to be large, you lose precision, and if it's too small, it's heavy to compute. Check whether the time step can be increased without compromising much on precision.
You can use PCA to find the important features and reduce dimensionality (a sketch follows below). You can also use a random forest as a pre-processing step to estimate feature importance and drop the less important features.
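A minimal sketch of that PCA step with scikit-learn, assuming the sequence is available as a (time_steps, features) array (shapes here are illustrative):
import numpy as np
from sklearn.decomposition import PCA

sequence = np.random.rand(100000, 6)        # stand-in for one long (time_steps, 6) sequence

pca = PCA(n_components=3)                   # keep 3 components out of the 6 features
reduced = pca.fit_transform(sequence)       # shape (100000, 3)
print(pca.explained_variance_ratio_.sum())  # how much variance the 3 components retain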

memory saving gradients or memory check pointing in keras

I recently found a github repo: https://github.com/openai/gradient-checkpointing
The main purpose is to reduce GPU memory consumption, and the usage seems pretty straightforward:
from tensorflow.python.keras._impl.keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
How can I do the same thing but with keras installed separately, not as a part of tensorflow? Since this didn't work:
from keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
Thank you in advance
I know I am a bit late, but I recently ran into the same problem, and I was able to solve it.
The problem (I think) is that memory_saving_gradients.gradients_memory uses a heuristic approach which does not work well for many scenarios. Fortunately, there is an alternative function: memory_saving_gradients.gradients_collection, which works perfectly fine, but it requires you to specify at which points in the network the gradient must be checkpointed.
As an example of how this can be accomplished, suppose that we want to checkpoint all the Keras layers whose name contains the word 'add' (for instance, to make a ResNet memory efficient). Then, you could include something like this after building your model, but before training it:
layer_names = [layer.name for layer in self.model.layers]
for name in layer_names:
    if 'add' in name:
        tf.add_to_collection("checkpoints", self.model.get_layer(name).get_output_at(0))
K.__dict__["gradients"] = memory_saving_gradients.gradients_collection
I hope it helps!

MemoryError using MLPClassifier from sklearn.neural_network

I'm running python 3.5 on a windows 10 64-bit operating system.
When I try to implement MLPClassifier the code runs for a while and then gives me a MemoryError.
I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data. How can I circumvent this error?
Code
gamma = [1, 10, 100, 1000, 10000, 100000]  # create array for range of gamma values
score_train = []
score_test = []
for j in gamma:
    mlp = MLPClassifier(solver='lbfgs', random_state=0,
                        hidden_layer_sizes=[j, j], activation='tanh').fit(data_train, classes_train)
    score_train.append(mlp.score(data_train, classes_train))
    score_test.append(mlp.score(data_test, classes_test))
print(score_train)
print(score_test)
Error
MemoryError traceback (omitted)
the code runs for a while and then gives me a MemoryError. I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data.
Yes, it's the size of the hidden layers! And the remaining part of that sentence does not make much sense (continue reading)!
Please make sure to read the tutorial and the API docs.
Now some more specific remarks:
The sizes of the hidden layers do not have anything to do with the collection of your data!
The input and output layers will be built based on the sizes of your X and y!
hidden_layer_sizes=[j,j] is actually creating 2 hidden-layers!
In the MLP, all layers are fully connected!
a call with hidden_layer_sizes=[100000, 100000], as you are trying to do, will use roughly 75 gigabytes of memory (assuming 64-bit doubles) just for the weights connecting these two layers alone (the arithmetic is sketched after these remarks)!
and this is just one connection layer: input-h0 and h1-output are still missing
lbfgs is a completely different solver than all the others. Don't use it without some understanding of the implications! It's not default!
It's a full-batch method and therefore uses a lot more memory when sample-size is big!
Additionally, there are more internal reasons to use more memory compared to the other (first-order-) methods
Not that precise, but the docs already gave some hints: Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
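The back-of-the-envelope arithmetic behind that ~75 GB figure, as a quick sketch assuming one 64-bit float per weight:
n_units = 100_000                              # units in each of the two hidden layers
bytes_per_weight = 8                           # 64-bit float
weights = n_units * n_units                    # fully connected: 10 billion weights
print(weights * bytes_per_weight / 1024**3)    # ~74.5 GiB for this one weight matrix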

Neural Network Always Produces Same/Similar Outputs for Any Input [closed]

I have a problem where I am trying to create a neural network for Tic-Tac-Toe. However, for some reason, training the neural network causes it to produce nearly the same output for any given input.
I did take a look at Artificial neural networks benchmark, but my network implementation is built for neurons with the same activation function for each neuron, i.e. no constant neurons.
To make sure the problem wasn't just due to my choice of training set (1218 board states and moves generated by a genetic algorithm), I tried to train the network to reproduce XOR. The logistic activation function was used. Instead of using the derivative, I multiplied the error by output*(1-output) as some sources suggested that this was equivalent to using the derivative. I can put the Haskell source on HPaste, but it's a little embarrassing to look at. The network has 3 layers: the first layer has 2 inputs and 4 outputs, the second has 4 inputs and 1 output, and the third has 1 output. Increasing to 4 neurons in the second layer didn't help, and neither did increasing to 8 outputs in the first layer.
I then calculated errors, network output, bias updates, and the weight updates by hand based on http://hebb.mit.edu/courses/9.641/2002/lectures/lecture04.pdf to make sure there wasn't an error in those parts of the code (there wasn't, but I will probably do it again just to make sure). Because I am using batch training, I did not multiply by x in equation (4) there. I am adding the weight change, though http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2.html suggests subtracting it instead.
The problem persisted, even in this simplified network. For example, these are the results after 500 epochs of batch training and of incremental training.
Input |Target|Output (Batch) |Output(Incremental)
[1.0,1.0]|[0.0] |[0.5003781562785173]|[0.5009731800870864]
[1.0,0.0]|[1.0] |[0.5003740346965251]|[0.5006347214672715]
[0.0,1.0]|[1.0] |[0.5003734471544522]|[0.500589332376345]
[0.0,0.0]|[0.0] |[0.5003674110937019]|[0.500095157458231]
Subtracting instead of adding produces the same problem, except everything is 0.99 something instead of 0.50 something. 5000 epochs produces the same result, except the batch-trained network returns exactly 0.5 for each case. (Heck, even 10,000 epochs didn't work for batch training.)
Is there anything in general that could produce this behavior?
Also, I looked at the intermediate errors for incremental training, and although the inputs of the hidden/input layers varied, the error for the output neuron was always +/-0.12. For batch training, the errors were increasing, but extremely slowly, and they were all extremely small (~10^-7). Different initial random weights and biases made no difference, either.
Note that this is a school project, so hints/guides would be more helpful. Although reinventing the wheel and making my own network (in a language I don't know well!) was a horrible idea, I felt it would be more appropriate for a school project (so I know what's going on...in theory, at least. There doesn't seem to be a computer science teacher at my school).
EDIT: Two layers, an input layer of 2 inputs to 8 outputs, and an output layer of 8 inputs to 1 output, produces much the same results: 0.5+/-0.2 (or so) for each training case. I'm also playing around with pyBrain, seeing if any network structure there will work.
Edit 2: I am using a learning rate of 0.1. Sorry for forgetting about that.
Edit 3: Pybrain's "trainUntilConvergence" doesn't get me a fully trained network, either, but 20000 epochs does, with 16 neurons in the hidden layer. 10000 epochs and 4 neurons, not so much, but close. So, in Haskell, with the input layer having 2 inputs & 2 outputs, hidden layer with 2 inputs and 8 outputs, and output layer with 8 inputs and 1 output...I get the same problem with 10000 epochs. And with 20000 epochs.
Edit 4: I ran the network by hand again based on the MIT PDF above, and the values match, so the code should be correct unless I am misunderstanding those equations.
Some of my source code is at http://hpaste.org/42453/neural_network__not_working; I'm working on cleaning my code somewhat and putting it in a Github (rather than a private Bitbucket) repository.
All of the relevant source code is now at https://github.com/l33tnerd/hsann.
I've had similar problems, but was able to solve by changing these:
Scale down the problem to manageable size. I first tried too many inputs, with too many hidden layer units. Once I scaled down the problem, I could see if the solution to the smaller problem was working. This also works because when it's scaled down, the times to compute the weights drop down significantly, so I can try many different things without waiting.
Make sure you have enough hidden units. This was a major problem for me. I had about 900 inputs connecting to ~10 units in the hidden layer. This was way too small to converge quickly. But it also became very slow if I added additional units. Scaling down the number of inputs helped a lot.
Change the activation function and its parameters. I was using tanh at first. I tried other functions: sigmoid, normalized sigmoid, Gaussian, etc.. I also found that changing the function parameters to make the functions steeper or shallower affected how quickly the network converged.
Change learning algorithm parameters. Try different learning rates (0.01 to 0.9). Also try different momentum parameters, if your algo supports it (0.1 to 0.9).
Hope this helps those who find this thread on Google!
So I realise this is extremely late for the original post, but I came across this because I was having a similar problem and none of the reasons posted here cover what was wrong in my case.
I was working on a simple regression problem, but every time I trained the network it would converge to a point where it was giving me the same output (or sometimes a few different outputs) for each input. I played with the learning rate, the number of hidden layers/nodes, the optimization algorithm etc but it made no difference. Even when I looked at a ridiculously simple example, trying to predict the output (1d) of two different inputs (1d):
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class net(nn.Module):
    def __init__(self, obs_size, hidden_size):
        super(net, self).__init__()
        self.fc = nn.Linear(obs_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, obs):
        h = F.relu(self.fc(obs))
        return self.out(h)

inputs = np.array([[0.5], [0.9]])
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)

network = net(1, 5)
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)

for i in range(10000):
    out = network(torch.tensor(inputs, dtype=torch.float32))
    loss = F.mse_loss(out, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Loss: %f outputs: %f, %f" % (loss.data.numpy(), out.data.numpy()[0], out.data.numpy()[1]))
but STILL it was always outputting the average value of the outputs for both inputs. It turns out the reason is that the dimensions of my outputs and targets were not the same: the targets were Size[2], and the outputs were Size[2,1], and for some reason PyTorch was broadcasting the outputs to be Size[2,2] in the MSE loss, which completely messes everything up. Once I changed:
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
to
targets = torch.tensor([[3.0], [2.0]], dtype=torch.float32)
It worked as it should. This was obviously done with PyTorch, but I suspect maybe other libraries broadcast variables in the same way.
For me it was happening exactly like in your case, the output of neural network was always the same no matter the training & number of layers etc.
Turns out my back-propagation algorithm had a problem. At one place I was multiplying by -1 where it wasn't required.
There could be another problem like this. The question is how to debug it?
Steps to debug:
Step 1: Write the algorithm so that it can take a variable number of input layers and a variable number of input and output nodes.
Step 2: Reduce the hidden layers to 0. Reduce the input to 2 nodes and the output to 1 node.
Step 3: Now train for the binary OR operation.
Step 4: If it converges correctly, go to Step 8.
Step 5: If it doesn't converge, train it on only 1 training sample.
Step 6: Print all the forward-pass and backpropagation variables (weights, node outputs, deltas, etc.).
Step 7: Take pen and paper and calculate all the variables manually.
Step 8: Cross-verify those values against the algorithm.
Step 9: If you don't find any problem with 0 hidden layers, increase the hidden layer size to 1 and repeat steps 5-8.
It sounds like a lot of work, but it works very well IMHO; a minimal sketch of Steps 2 and 3 follows.
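A sketch of Steps 2 and 3 under those assumptions: a single sigmoid neuron with no hidden layer, trained on OR with plain NumPy, which you can use as a reference while hand-checking your own implementation (the hyperparameters are illustrative):
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=2)
b = 0.0
lr = 1.0

for epoch in range(20000):
    out = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # sigmoid activation
    delta = (out - y) * out * (1.0 - out)        # error times sigmoid derivative
    w -= lr * X.T @ delta / len(X)               # batch weight update
    b -= lr * delta.mean()                       # bias trained just like a weight

print(np.round(out, 2))                          # outputs should approach [0, 1, 1, 1]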
I know that for the original post this is far too late, but maybe I can help someone with this, as I faced the same problem.
For me the problem was that my input data had missing values in important columns, values which were not missing in the training/test data. I replaced these values with zeros and voilà, suddenly the results were plausible. So check your data; maybe it is misrepresented.
It's hard to tell without seeing a code sample, but this can happen because of the number of hidden neurons. As the number of neurons and hidden layers increases, it becomes impossible to train the network with a small training set. As long as a network with smaller layers and fewer neurons can do the job, it is a mistake to use a larger one, so perhaps your problem can be solved by paying attention to this.
I haven't tested it with the XOR problem in the question, but for my original dataset based on Tic-Tac-Toe, I believe that I have gotten the network to train somewhat (I only ran 1000 epochs, which wasn't enough): the quickpropagation network can win/tie over half of its games; backpropagation can get about 41%. The problems came down to implementation errors (small ones) and not understanding the difference between the error derivative (which is per-weight) and the error for each neuron, which I did not pick up on in my research. #darkcanuck's answer about training the bias similarly to a weight would probably have helped, though I didn't implement it. I also rewrote my code in Python so that I could more easily hack with it. Therefore, although I haven't gotten the network to match the minimax algorithm's efficiency, I believe that I have managed to solve the problem.
I faced a similar issue earlier when my data was not properly normalized. Once I normalized the data everything ran correctly.
Recently, I faced this issue again and after debugging, I found that there can be another reason for neural networks giving the same output. If you have a neural network that has a weight decay term such as that in the RSNNS package, make sure that your decay term is not so large that all weights go to essentially 0.
I was using the caret package in R. Initially, I was using a decay hyperparameter of 0.01. When I looked at the diagnostics, I saw that the RMSE was being calculated for each fold (of cross-validation), but the R-squared was always NA. In this case all predictions were coming out to the same value.
Once I reduced the decay to a much lower value (1E-5 and lower), I got the expected results.
I hope this helps.
I was running into the same problem with my model when the number of layers was large. I was using a learning rate of 0.0001. When I lowered the learning rate to 0.0000001, the problem seemed to be solved. I think the algorithm gets stuck in local minima when the learning rate is too low.
It's hard to tell without seeing a code sample, but a bias bug can have that effect (e.g. forgetting to add the bias to the input), so I would take a closer look at that part of the code.
Based on your comments, I'd agree with #finnw that you have a bias problem. You should treat the bias as a constant "1" (or -1 if you prefer) input to each neuron. Each neuron will also have its own weight for the bias, so a neuron's output should be the sum of the weighted inputs, plus the bias times its weight, passed through the activation function. Bias weights are updated during training just like the other weights.
Fausett's "Fundamentals of Neural Networks" (p.300) has an XOR example using binary inputs and a network with 2 inputs, 1 hidden layer of 4 neurons and one output neuron. Weights are randomly initialized between +0.5 and -0.5. With a learning rate of 0.02 the example network converges after about 3000 epochs. You should be able to get a result in the same ballpark if you get the bias problems (and any other bugs) ironed out.
Also note that you cannot solve the XOR problem without a hidden layer in your network.
I encountered a similar issue, I found out that it was a problem with how my weights were being generated.
I was using:
w = numpy.random.rand(layers[i], layers[i+1])
This generated a random weight between 0 and 1.
The problem was solved when I used randn() instead:
w = numpy.random.randn(layers[i], layers[i+1])
This generates negative weights, which helped my outputs become more varied.
I ran into this exact issue. I was predicting 6 rows of data with 1200+ columns using nnet.
Each column would return a different prediction but all of the rows in that column would be the same value.
I got around this by increasing the size parameter significantly. I increased it from 1-5 to 11+.
I have also heard that decreasing your decay rate can help.
I've had similar problems with machine learning algorithms, and when I looked at the code I found random generators that were not really random. If you do not use a new random seed (such as Unix time, for example; see http://en.wikipedia.org/wiki/Unix_time), then it is possible to get the exact same results over and over again.
