Does using FP16 help accelerate generation? (HuggingFace BART) - pytorch

I follow the guide below to use FP16 in PyTorch.
https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
Basically, I'm using BART from HuggingFace for generation.
During the training phase, I'm able to get a 2x speedup and lower GPU memory consumption.
However, I found that there is no speedup when I call model.generate under torch.cuda.amp.autocast():
with torch.cuda.amp.autocast():
    model.generate(...)
When I save the model with:
model.save_pretrained("model_folder")
the size on disk does not decrease to half. I have to call model.half() before saving in order to get a half-size model.
Thus, my questions:
1. Is the missing speedup during generation expected, or did I do something wrong?
2. Is calling model.half() before save_pretrained the proper way to save a half-precision model?
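For reference, here is a minimal sketch of the whole workflow described above (the model name, tokenizer, and input text are placeholders, not my actual setup):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda()
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
inputs = tokenizer("some input text", return_tensors="pt").to("cuda")

# 1. generation under autocast (the case where I see no speedup)
with torch.cuda.amp.autocast():
    output_ids = model.generate(**inputs, max_length=64)

# 2. saving: save_pretrained alone keeps the checkpoint at full size;
#    only after model.half() is the file roughly half as large
model.half()
model.save_pretrained("model_folder")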

Related

Does calling forward() on a model in pytorch require extra gpu memory after already having loaded the model and data in gpu memory?

I can load the model and a data sample in gpu memory, but when I call forward on the model with the sample, it gives a CUDA out of memory error.
I'm sure the model and data have been loaded, as my code is structured as follows (pseudocode):
from time import sleep

model = Model().cuda()           # model weights now on the GPU
sample = load_sample().cuda()    # a single data sample on the GPU
sleep(5)  # pause to check memory usage with nvidia-smi
print('before forward')
output = model(sample)           # the CUDA out of memory error is raised here
print('after forward')
"before forward" gets printed, but "after forward" does not.
I assumed all the memory needed for a forward pass gets allocated when the model is constructed, but I don't see how else this error can happen, and I cannot find anything about it on Google.
Python: 3.6.9
PyTorch: 1.2.0
It is not possible to determine the amount of space required to store the activations before runtime, and hence GPU memory increases. PyTorch maintains a dynamic computation graph, so the order of computations is not known before runtime. When you declare/initialize the model, only __init__ is called and the model parameters are initialized. To figure out the graph, one would need to look at the forward call and maybe also the loss function (if it is not within the forward call).
Even if we could inspect the forward call before running the model, the batch size would still be unknown, so memory for the activations can't be pre-allocated.
And even if the batch size were known, there could be other unknowns, like sequence length (for RNNs) or episode length in RL, that make it hard to pre-allocate memory for activations. Even if we accounted for all of this at declaration time, PyTorch naturally allows arbitrary Python control flow such as for-loops, which makes it almost impossible to pre-allocate space for activations; hence GPU memory can grow during runtime depending on the use case.
As Umang Gupta pointed out in the comments, GPU memory will increase during a forward() call on a PyTorch model, because (possibly among other things) the batch size is not known before runtime. Therefore the required memory cannot be reserved beforehand, and GPU memory can increase even after the model and data have already been loaded.
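One way to observe this directly (a minimal sketch, using a torchvision model and random data as stand-ins for the actual model and sample) is to print torch.cuda.memory_allocated() before and after the forward call; the allocated memory jumps when the activations are created:

import torch
import torchvision

model = torchvision.models.resnet50().cuda()           # stand-in model
sample = torch.randn(8, 3, 224, 224, device='cuda')    # stand-in batch

print('after load:    %.1f MB' % (torch.cuda.memory_allocated() / 2**20))
output = model(sample)
print('after forward: %.1f MB' % (torch.cuda.memory_allocated() / 2**20))
# The second number is noticeably larger: the activation buffers are only
# allocated during the forward pass, not when the model is constructed.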

VGG16 model freezes computer

I am currently trying to use the VGG16 model from the Keras library, but whenever I create a VGG16 object by doing
from keras.applications.vgg16 import VGG16
model = VGG16()
I get the following message 3 times.
tensorflow/core/framework/allocator.cc:124] allocation of 449576960 exceeds 10% of system memory
Following this, my computer freezes. I am on a 64-bit machine with 4 GB of RAM running Linux Mint 18, and I have no access to a GPU.
Does this problem have something to do with my RAM?
As a temporary solution, I am running my Python scripts from the command line, because my computer freezes less there than in any IDE. Also, this does not happen when I use an alternative model such as InceptionV3.
I have tried the solution provided here, but it didn't work.
Any help is appreciated.
You are most likely running out of memory (RAM).
Try running top (or htop) in parallel and see your memory utilization.
In general, VGG models are rather big and require a decent amount of RAM. That said, the actual requirement depends on batch size. Smaller batch means smaller activation layer.
For example, a 6-image batch would consume about a gigabyte of RAM (reference). As a test, you could lower your batch size to 1 and see if that fits in your memory.
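As a rough sketch of that test (assuming the stock Keras VGG16 preprocessing; the image path is a placeholder):

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

model = VGG16()

img = image.load_img('some_image.jpg', target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# batch_size=1 keeps the activation memory as small as possible
preds = model.predict(x, batch_size=1)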

AWS, Cuda, Tensorflow

When I run my Python code on the most powerful AWS GPU instances (with 1 or 8 x Tesla V100 16 GB, i.e. p3.2xlarge or p3.16xlarge), they are both only 2-3 times faster than my Dell XPS laptop with a GeForce 1050 Ti?
I'm using Windows, Keras, Cuda 9, Tensorflow 1.12 and the newest Nvidia drivers.
When I check the GPU load via GPU-Z, the GPU runs at a maximum of 43% load, and only for a very short period each time. The controller runs at up to 100%...
The dataset I use consists of matrices in JSON format, and the files are located on a 10 TB Nitro drive with a maximum of 64,000 IOPS. Whether the folder contains 10 TB, 1 TB or 100 MB, training is still very, very slow per iteration.
All advice is more than welcome!
UPDATE 1:
From the Tensorflow docs:
"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."
Previously I had the matrices stored in JSON format (generated by Node). My TF code runs in Python.
I will now only save the coordinates in Node, still in JSON format.
The question now is: in Python, what is the best way to load the data? Can TF use the coordinates directly, or do I have to turn the coordinates back into matrices again?
The performance of any machine learning model depends on many things, including but not limited to: how much pre-processing you do, how much data you copy from CPU to GPU, op bottlenecks, and many more. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance. How to properly use tf.data, and how to debug performance, are two that I recommend.
The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which uses protocol buffers (better than JSON).
Unfortunately performance and optimisation of any system takes a lot of effort and time, and can be a rabbit hole that just keeps going down.
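As a minimal sketch of that conversion (the feature names and matrix shapes here are made up for illustration, not taken from the original setup):

import numpy as np
import tensorflow as tf

def write_tfrecord(matrices, path):
    # serialize a list of float32 numpy matrices into one TFRecord file
    with tf.python_io.TFRecordWriter(path) as writer:
        for m in matrices:
            example = tf.train.Example(features=tf.train.Features(feature={
                'shape': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=list(m.shape))),
                'values': tf.train.Feature(
                    float_list=tf.train.FloatList(value=m.ravel().tolist())),
            }))
            writer.write(example.SerializeToString())

write_tfrecord([np.random.rand(64, 64).astype(np.float32)], 'train.tfrecord')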
First off, you should have a really good reason to accept the increased computational overhead of a Windows-based AMI.
If your CPU is at ~100% while the GPU is at <100%, then your CPU is likely the bottleneck. If you are on the cloud, consider moving to instances with a larger CPU count (CPU is cheap, GPU is scarce). If you can't increase the CPU count, moving some parts of your graph to the GPU is an option. However, a tf.data-based input pipeline runs entirely on the CPU (but is highly scalable thanks to its C++ implementation). Prefetching to the GPU might also help here, but the cost of spawning another background thread to populate the buffer for downstream ops might dampen this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
A word of caution on using Keras as the input pipeline: Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may lack both performance (when doing heavy I/O or on-the-fly augmentation) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading the input data, or using alternative input pipelines (such as the aforementioned native tf.data, or third-party ones like Tensorpack).
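On the reading side, a minimal tf.data sketch along those lines (TF 1.12-style; the parsing function matches the made-up record layout in the writing sketch above):

import tensorflow as tf

def parse_fn(serialized):
    features = tf.parse_single_example(serialized, {
        'shape': tf.FixedLenFeature([2], tf.int64),
        'values': tf.VarLenFeature(tf.float32),
    })
    values = tf.sparse_tensor_to_dense(features['values'])
    return tf.reshape(values, tf.cast(features['shape'], tf.int32))

dataset = (tf.data.TFRecordDataset(['train.tfrecord'])
           .map(parse_fn, num_parallel_calls=4)  # parse on several CPU threads
           .batch(32)
           .prefetch(1))                          # overlap input with GPU compute

The iterator built from this dataset can then feed the model directly, instead of going through feed_dict.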

Outer product based Conv filters consume disproportionately high memory

The minimal reproducible example is in this GitHub gist. The issue is disproportionately large memory usage.
I modified the CIFAR-10 example that ships with TensorFlow to use the outer product of 3 vectors as the weights of the convolutional layers. This change can be seen in this part of the code.
For simplicity, I have removed all parameter training operations and even the loss computation. The current model only computes logits (the forward pass) over and over.
The unmodified code (which can be executed by setting the use_outerp flag to False) uses approximately 0.4 GB of RAM,
whereas the modified code (with the outer product of vectors used as the convolutional weight tensor) uses a disproportionately high 5.6 GB of RAM.
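Roughly, the weight construction is of the following form (a sketch only; the variable names and shapes here are illustrative, not the exact code from the gist):

import tensorflow as tf

kh, kw, in_ch, out_ch = 5, 5, 3, 64  # illustrative sizes

# three small factor vectors instead of a full 4-D kernel
u = tf.Variable(tf.truncated_normal([kh], stddev=0.05))
v = tf.Variable(tf.truncated_normal([kw], stddev=0.05))
w = tf.Variable(tf.truncated_normal([in_ch * out_ch], stddev=0.05))

# outer product via broadcasting, reshaped to the conv2d kernel layout
kernel = (tf.reshape(u, [kh, 1, 1]) *
          tf.reshape(v, [1, kw, 1]) *
          tf.reshape(w, [1, 1, in_ch * out_ch]))
kernel = tf.reshape(kernel, [kh, kw, in_ch, out_ch])

images = tf.placeholder(tf.float32, [None, 32, 32, in_ch])
conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')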
Any idea why this is the case?
My intuition as to why this might happen is that maybe the outer product operations are being executed every single time that the conv filter is needed instead of being executed exactly once in every forward pass. Is this really the case? Is there a way to fix this?
Steps to reproduce:
To run the default version of the code (low memory footprint):
python train.py --use_outerp='False'
To run the modified version of the code (high memory footprint):
python train.py --use_outerp='True'
Operating System:
Ubuntu 14.04
Installed version of CUDA and cuDNN:
None, I'm using the CPU version of TF.
The output from python -c "import tensorflow; print(tensorflow.__version__)":
0.10.0rc0
Using tcmalloc as suggested here didn't help.

Pre-processing on multiple CPUs in parallel and feeding output to tensorflow which trains on Multi-GPUs

I am trying to use TensorFlow for my work on a classification problem.
Before feeding in the input images, I want to do some pre-processing on them. I would like to carry out this pre-processing on multiple CPU cores in parallel and feed the results to the TensorFlow graph, which I want to run in a multi-GPU setting (I have 2 Titan X GPUs).
The reason I want this setup is so that while the GPUs are training, the CPUs keep doing their pre-processing work, and the GPUs do not sit idle between iterations. I have been looking through the TensorFlow API for this, but could not find anything that specifically addresses such a scenario.
So, multiple CPU cores should keep pre-processing a list of files and filling a queue from which TensorFlow extracts its batches of data. Whenever this queue is full, the CPU cores should wait and resume processing once the queue (or part of it) is emptied by examples being fed to the TensorFlow graph.
I have two questions specifically:
How can I achieve this setup?
Is it a good idea to have this setup?
A clear example would be a great help.
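Not an authoritative answer, but one queue-based sketch of the kind of setup described above (classic TF 1.x input pipeline; the preprocessing body, file pattern, and build_model function are placeholders):

import tensorflow as tf

def preprocess(image):
    # stand-in for the real CPU-side pre-processing
    image = tf.image.resize_images(image, [224, 224])
    return tf.image.per_image_standardization(image)

with tf.device('/cpu:0'):
    filename_queue = tf.train.string_input_producer(
        tf.train.match_filenames_once('data/*.jpg'))  # placeholder file pattern
    _, raw = tf.WholeFileReader().read(filename_queue)
    image = preprocess(tf.image.decode_jpeg(raw, channels=3))

    # num_threads CPU threads keep this queue filled while the GPUs train;
    # they block automatically once the queue reaches its capacity.
    batch = tf.train.shuffle_batch([image], batch_size=64, num_threads=8,
                                   capacity=4096, min_after_dequeue=1024)

# split each batch across the two GPUs (tower-style data parallelism)
towers = []
for i, sub_batch in enumerate(tf.split(batch, 2, axis=0)):
    with tf.device('/gpu:%d' % i):
        towers.append(build_model(sub_batch))  # build_model is hypothetical

At session time you would still need tf.train.start_queue_runners (or a tf.train.MonitoredSession) to launch the threads that fill the queue.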

Resources