CUDA out of memory and BERT - pytorch

Is there a way that I can manage how much memory to reserve for PyTorch? I've been trying to fine-tune a BERT model from Hugging Face, and even on an A100 NVIDIA GPU with 80 GiB of GPU memory from Paperspace I still manage to hit "CUDA out of memory". I'm not training from scratch or anything, and my dataset has fewer than 250,000 examples. I don't know if I'm missing something or if there is a way to reduce the memory PyTorch allocates.
(Seriously, I just wanna cry now.)
Error specs
I've tried reducing my batch size for both the training and eval datasets, as well as turning on fp16=True. I even tried gradient accumulation, but nothing seems to work. I've been cleaning up with gc.collect() and torch.cuda.empty_cache() and restarting my kernel and the machine. I even tried a smaller model.
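For context, the kind of Trainer configuration I've been ending up with looks roughly like this (the checkpoint name and the exact numbers are placeholders, not my real values):

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Placeholder checkpoint; the real one is a standard BERT checkpoint from the Hub.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,    # already reduced from my original batch size
    per_device_eval_batch_size=8,     # reduced for the eval dataset as well
    gradient_accumulation_steps=4,    # keeps the effective batch size up despite the small per-device size
    fp16=True,                        # half precision
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```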

Related

How do I prevent a lack of VRAM halfway through training a Huggingface Transformers (Pegasus) model?

I'm taking a pre-trained Pegasus model through Hugging Face transformers (specifically google/pegasus-cnn_dailymail, using the PyTorch backend) and I want to fine-tune it on my own data. This is, however, quite a large dataset, and I've run into the problem of running out of VRAM halfway through training. Because of the size of the dataset this can happen days after training started, which makes a trial-and-error approach very inefficient.
I'm wondering how I can make sure ahead of time that it doesn't run out of memory. I would think that the memory usage of the model is in some way proportional to the size of the input, so I've passed truncation=True, padding=True, max_length=1024 to my tokenizer, which, if my understanding is correct, should make every output of the tokenizer the same size per line. Considering that the batch size is also constant, I would think that the amount of VRAM in use should be stable. So I should just be able to cut the dataset into manageable parts, look at the RAM/VRAM use of the first run, and infer that it will run smoothly from start to finish.
However, the opposite seems to be true. I've been observing the amount of VRAM used at any time and it can vary wildly, from ~12 GB at one point to suddenly requiring more than 24 GB and crashing (because I don't have more than 24 GB).
So, how do I make sure that the amount of VRAM in use stays within reasonable bounds for the full duration of training, and avoid a crash due to lack of VRAM when I'm already days into the training process?
padding=True actually doesn't pad to max_length, but to the longest sample in the list you pass to the tokenizer. To pad to max_length you need to set padding='max_length'.
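A minimal sketch of the difference (the checkpoint name is the one from the question; the input sentences are just dummies):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

# padding=True only pads to the longest sample in this particular call,
# so tensor shapes (and per-step memory) vary from batch to batch.
# padding='max_length' pads every sample to max_length, giving constant shapes.
batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    truncation=True,
    padding="max_length",
    max_length=1024,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 1024]), independent of sample length
```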

What is the difference between the result of using a GPU or not?

I have a CNN with 2 hidden layers. When I use Keras on CPU with 8 GB of RAM, I sometimes get a "Memory Error", or sometimes the precision for one class is 0 while other classes are at 1.00 at the same time. If I use Keras on a GPU, will that solve my problem?
You probably don't have enough memory to fit all the images in CPU RAM during training. Using a GPU will only help if it has more memory. If this is happening because you have too many images or their resolution is too high, you can try using Keras' ImageDataGenerator and any of the flow methods to feed your data in batches, as sketched below.
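Something along these lines, for example (the directory layout, target size, and batch size below are placeholders):

```python
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical layout: one sub-folder per class under "data/train".
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_flow = datagen.flow_from_directory(
    "data/train",
    target_size=(128, 128),   # images are loaded and resized batch by batch, never all at once
    batch_size=32,
    class_mode="categorical",
)

# model.fit_generator(train_flow, steps_per_epoch=len(train_flow), epochs=10)
```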

How to know when to use fit_generator() in keras when training data gets too big for fit()?

When using Keras for machine learning, model.fit() is used when the training data is small. When the training data is too big, model.fit_generator() is recommended instead of model.fit(). How does one know when the data size has become too large?
The moment you run into memory errors when trying to load the training data into memory, you'll have to switch to fit_generator(). There is extra overhead associated with generating data on the fly (and reading from disk to do so), so training a model on a dataset that lives in memory will always be faster.
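As a rough sketch, a generator over .npy files on disk could look like this (file paths, shapes, and batch size are placeholders):

```python
import numpy as np

def batch_generator(x_path, y_path, batch_size=32):
    # Memory-map the arrays so only the slices actually read get loaded into RAM.
    x = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    n = len(x)
    while True:  # Keras expects the generator to loop indefinitely
        for start in range(0, n, batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]

# steps_per_epoch must be given explicitly, since the generator has no length:
# model.fit_generator(batch_generator("x_train.npy", "y_train.npy", 32),
#                     steps_per_epoch=int(np.ceil(n_samples / 32)), epochs=10)
```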

Difference between Keras 2.0.8 and 2.1.5?

I am training a GAN and I see that the performance is very different on my CPU and GPU installations. The GPU installation has Keras 2.0.8 and the CPU one has 2.1.5. On a separate machine with Keras + TensorFlow on GPU I get the same performance as the CPU one from before; the Keras version there is 2.1.6.
Is this expected? In the Keras release notes I did not find anything that would change the way my training works.
The performance with the newer version is better in many senses: much faster convergence (10x fewer epochs required), but the images are sometimes less smooth.

Training methodology of a CNN in Theano with large-scale data

I am training a CNN on 1M images with Theano. Now I am puzzled about how to prepare the training data.
My questions are:
When the images are resized to 64*64*3, the whole dataset is about 100 GB. Should I save the data into a single npy file or several smaller files? Which is more efficient?
How do I decide the number of parameters of the CNN? How about 1M/10 = 100K?
Should I limit the memory cost of a training block plus the CNN parameters to less than the GPU memory?
My computer has 16 GB of memory and a Titan GPU.
Thank you very much.
If you're using an NN framework like pylearn2, Lasagne, Keras, etc., check the docs to see if there are guidelines for iterating batches off disk from an HDF5 store or similar (there's a rough sketch of this at the end of this answer).
If there's nothing and you don't want to roll your own, the fuel package provides lots of helpful data iteration schemes that can be adapted to models in Theano (and probably most of the other frameworks; there's a good tutorial in the fuel repository).
As for the parameters, you'll have to cross-validate to figure out the best values for your data.
And yes, the model size + minibatch size + the dropout mask for the batch has to fit within the available VRAM.
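As a rough sketch, iterating minibatches off an HDF5 store with plain h5py could look like this (the file path and dataset names are placeholders; fuel offers ready-made schemes for the same job):

```python
import h5py
import numpy as np

def iterate_minibatches(h5_path, batch_size=128):
    # Assumes the resized images were written once to an HDF5 file with
    # datasets "images" of shape (N, 64, 64, 3) and "labels" of shape (N,).
    with h5py.File(h5_path, "r") as f:
        images, labels = f["images"], f["labels"]
        n = images.shape[0]
        for start in range(0, n, batch_size):
            stop = start + batch_size
            # Only this slice is read from disk, so the ~100 GB never has to fit in RAM.
            yield (np.asarray(images[start:stop], dtype=np.float32),
                   np.asarray(labels[start:stop]))

# for x_batch, y_batch in iterate_minibatches("train_64x64.h5"):
#     train_fn(x_batch, y_batch)   # your compiled Theano training function
```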
