Tensorflow. How to distribute ops between GPUs - multithreading

I am running a very large TensorFlow model on Google Cloud ML Engine.
When using the scale tier basic_gpu (with batch_size=1) I get errors like:
Resource exhausted: OOM when allocating tensor with shape[1,155,240,240,16]
because the model is too large to fit in one GPU.
Using the tier complex_model_m_gpu, which provides 4 GPUs, I can spread the operations across the 4 GPUs.
However, I remember reading that communication between GPUs is slow and can create a bottleneck in training. Is this true?
If so, is there a recommended way of spreading operations between the GPUs that prevents this problem?
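To put the OOM message above in perspective, the size of a single tensor can be worked out from its shape. The sketch below assumes 4-byte float32 elements, which the error message does not state, and tensor_bytes is a hypothetical helper rather than a TensorFlow function.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory needed for one dense tensor of the given shape.

    bytes_per_element=4 assumes float32; the dtype is not given in the error.
    """
    return prod(shape) * bytes_per_element

# Shape taken from the OOM message in the question.
size = tensor_bytes([1, 155, 240, 240, 16])
print(f"{size} bytes = {size / 2**20:.1f} MiB")  # 571392000 bytes = 544.9 MiB
```

At roughly 545 MiB per tensor of this shape, some 20 to 30 such activations kept around for backpropagation would already fill a 12 GB or 16 GB card, even with batch_size=1.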

I recommend the following guide:
Optimizing for GPU
From the guide:
The best approach to handling variable updates depends on the model,
hardware, and even how the hardware has been configured.
A few suggestions based on the guide:
Try using P100s, which have 16 GB of RAM (compared to 12 GB on the K80s). They are also significantly faster, although they cost more.
Place the variables on CPU: tf.train.replica_device_setter(worker_device=worker, ps_device='/cpu:0', ps_tasks=1)

Using Tesla P100 GPUs instead of Tesla K80 GPUs fixes this issue because P100s have something called Page Migration Engine.
Page Migration Engine frees developers to focus more on tuning for
computing performance and less on managing data movement. Applications
can now scale beyond the GPU's physical memory size to virtually
limitless amount of memory.

Related

Multiprocessing with large ml models

I have a large transformer model from Hugging Face, about 2 GB on disk. When I try to run it with multiprocessing (separate processes or a pool), the program just freezes, even with only 2 workers.
From what I understand, it freezes because it is trying to pickle the transformer model and copy the environment for both workers.
I've tried loading the model after the multiprocessing starts, but that leads to the same problem.
My question is: do I need to increase my RAM? If so, what is the general rule of thumb for how much RAM I need per worker, and how would I calculate it?
How can I get this right? I've tried making the model use a shared memory block, but I haven't managed to get it to work. Has anyone done something like this?
You probably have to account 2 GB (or more) for each worker, since they likely have different copies of your model.
Using shared memory is the only option if you can't increase your memory amount.
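As a minimal sketch of the shared-memory idea, the standard library's multiprocessing.shared_memory (Python 3.8+) lets one process publish a block of bytes that others attach to by name instead of receiving a pickled copy. The example below stands in a short list of floats for the model weights, which is an assumption for illustration; publish_weights and read_weights are hypothetical helper names, and a real transformer would need its tensors mapped onto the buffer by the framework in use.

```python
import struct
from multiprocessing import shared_memory

def publish_weights(values):
    """Copy a list of floats into a new shared memory block and return it."""
    fmt = f"{len(values)}d"
    shm = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    struct.pack_into(fmt, shm.buf, 0, *values)
    return shm

def read_weights(name, count):
    """Attach to an existing block by name and read `count` floats from it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return list(struct.unpack_from(f"{count}d", shm.buf, 0))
    finally:
        shm.close()  # detach this handle; the block itself stays alive

if __name__ == "__main__":
    block = publish_weights([0.5, -1.25, 3.0])
    try:
        # A worker process would receive only block.name and the count.
        print(read_weights(block.name, 3))
    finally:
        block.close()
        block.unlink()  # free the block once every worker is done
```

Only the block name and element count cross the process boundary, so the data itself is never pickled.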
I believe an easy rule of thumb for how much RAM you need is something like n_workers * per_worker_mem * 1.1. You can measure per_worker_mem with the free or ps command; the factor 1.1 accounts for roughly 10% overhead for synchronization and data exchange between workers.
Your overhead may vary according to the amount of data shared and exchanged between the workers.
On a physical system you may also want to account for an additional 1/2 GB for the OS and (in general) a fair amount of free RAM to be used as cache to speed up your file system (e.g. if your model needs 6 GB of RAM, I wouldn't go below 16 or 32 GB to keep the system snappy).
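The rule of thumb above is simple enough to write down as a helper. This is only a sketch of that formula; estimate_total_ram_gb is a hypothetical name, and the example figures (4 workers, 2 GB each) are illustrative assumptions rather than measurements.

```python
def estimate_total_ram_gb(n_workers, per_worker_gb, overhead=0.10):
    """Rule of thumb from the answer: n_workers * per_worker_mem * 1.1.

    overhead=0.10 is the 10% allowance for synchronization and data exchange.
    """
    return n_workers * per_worker_gb * (1.0 + overhead)

# Example: 4 workers, each holding its own copy of a 2 GB model.
print(round(estimate_total_ram_gb(4, 2.0), 2))  # 8.8
```

The OS reserve and file-system cache mentioned above come on top of this figure.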

In PyTorch, can I load a tensor from file directly to the GPU, without using CPU memory?

I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on a CPU, while another feature ("Feature B") must be calculated from Feature A on a GPU (some linear algebra stuff). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have CPU memory limits of 1 TB each, while jobs which do use GPUs have CPU memory limits of 4 GB with GPU memory limits of 48 GB. Feature A and Feature B are each approximately 10 GB.
Naturally, I want to first calculate Feature A using CPUs only then save Feature A to disk. In another job (this one with GPU access and thus the 4GB CPU memory limitation), I want to load Feature A directly to GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max-out my CPU memory. I've confirmed cuda is available.
In the PyTorch documentation I see that in loading tensors they "are first deserialized on the CPU..."
I wonder if there is any way to avoid a CPU memory implication when I want to load only onto the GPU? If the tensor must first be copied to the CPU, could I use some sort of 4GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
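On the idea of a fixed-size buffer: if the feature were stored as raw bytes rather than through torch.save (an assumption; the question uses torch.load), the file could be streamed through one reusable host buffer so that CPU memory stays bounded by the chunk size. The sketch below shows only the host-side loop in plain Python. stream_in_chunks and sink are hypothetical names, and the copy of each chunk into a preallocated GPU tensor would happen inside sink.

```python
def stream_in_chunks(path, chunk_bytes, sink):
    """Read `path` through one reusable buffer and hand each chunk to `sink`.

    `sink` receives a memoryview into the shared buffer, so it must consume
    or copy the data before returning. Returns the total bytes read.
    """
    buffer = bytearray(chunk_bytes)      # the only large host allocation
    view = memoryview(buffer)
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view)         # fills at most chunk_bytes
            if not n:                    # 0 (or None) means end of file
                break
            sink(view[:n])
            total += n
    return total
```

With a chunk size comfortably below the 4 GB limit, peak CPU memory is roughly one buffer regardless of the file size.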

Running two different independent PyTorch programs on a single GPU

I have a single NVIDIA GPU which has a memory of 16GB. I have to run two different (and independent; meaning, two different problems: one is a vision type task, another is NLP task) Python programs. The codes are written using PyTorch and both the codes can use GPU.
I have tested that program 1 takes roughly 5GB of GPU memory, and the rest is free. If I run the two programs, will it hamper the model performance or will it cause any process conflicts?
Linked question, but it is not necessarily about PyTorch code
I do not know the details of how this works, but I can tell from experience that both programs will run well (as long as they do not need more than 16 GB of GPU memory combined), and execution times should stay roughly the same.
However, computer vision usually requires a lot of IO (mostly reading images), if the other task needs to read files too, this part may become slower than when running both programs individually.
It should work fine.
In one of my projects, I ran out of GPU memory while working with multiple models. Once loaded, the models took up most of the GPU memory, so very little remained for the data during inference. If your models are on the GPU, your data has to be loaded onto the GPU as well, so when you do batch inference (e.g. feeding 16 images at a time to the model), the complete batch is loaded onto the GPU. This takes up even more GPU memory, and your program crashes if it cannot get enough.
If you think GPU memory is not the issue in your case then everything should work fine. You also do not need to worry about conflicts because both processes will allocate their own GPU memory and will work independently. There would be no performance issues.

AWS, Cuda, Tensorflow

When I run my Python code on the most powerful AWS GPU instances (with 1 or 8 Tesla V100 16 GB GPUs, i.e. p3.2xlarge or p3.16xlarge), why are they both only 2-3 times faster than my Dell XPS laptop with a GeForce 1050 Ti?
I'm using Windows, Keras, CUDA 9, TensorFlow 1.12 and the newest Nvidia drivers.
When I check the GPU load via GPU-Z, the GPU peaks at 43% load for a very short period each time. The controller runs at max. 100%...
The dataset I use consists of matrices in JSON format, and the files are located on a 10 TB Nitro drive with a maximum of 64,000 IOPS. No matter whether the folder contains 10 TB, 1 TB or 100 MB, training is still very slow per iteration.
Any advice is more than welcome!
UPDATE 1:
From the Tensorflow docs:
"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."
Previously I had matrices stored in JSON format (generated by Node). My TF code runs in Python.
I will now save only the coordinates in Node, in JSON format.
The question now is: what is the best way to load the data in Python? Can TF use the coordinates directly, or do I have to convert the coordinates back into matrices first?
The performance of any machine learning model depends on many things, including but not limited to how much pre-processing you do, how much data you copy from CPU to GPU, and op bottlenecks. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance. How to properly use tf.data, and how to debug performance are two that I recommend.
The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which stores records as binary protobuf messages and avoids parsing text for every number.
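To give a rough feel for why a text format is costly for numeric matrices, the sketch below compares the size of one matrix encoded as JSON with the same values packed as 32-bit floats. It uses the standard library's struct module as a stand-in for a binary format, not TFRecord itself, and the 64x64 matrix is made-up illustration data.

```python
import json
import struct

def json_size(matrix):
    """Bytes needed to store the matrix as JSON text."""
    return len(json.dumps(matrix).encode("utf-8"))

def float32_size(matrix):
    """Bytes needed to store the same values as packed little-endian float32."""
    flat = [value for row in matrix for value in row]
    return len(struct.pack(f"<{len(flat)}f", *flat))

# Illustrative data only: a 64x64 matrix of arbitrary floats.
matrix = [[0.123456789 * (r + c) for c in range(64)] for r in range(64)]
print(json_size(matrix), float32_size(matrix))
```

The packed form is always 4 bytes per value, while the JSON form spends a character per digit plus separators and must be parsed back into numbers on every read.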
Unfortunately performance and optimisation of any system takes a lot of effort and time, and can be a rabbit hole that just keeps going down.
First off, you should have a really good reason to accept the extra computational overhead of a Windows-based AMI.
If your CPU is at ~100%, while GPU is <100%, then your CPU is likely the bottleneck. If you are on cloud, consider moving to instances with larger CPU-count (CPU is cheap, GPU is scarce). If you can't increase CPU count, moving some parts of your graph to GPU is an option. However, tf.data-based input pipeline is run entirely on CPU (but highly scalable due to C++ implementation). Prefetching to GPUs might also help here, but the cost of spawning another background thread to populate the buffer for downstream might damp this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
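The prefetching idea mentioned above, a background thread filling a bounded buffer for the consumer downstream, can be sketched in plain Python. This is not how tf.data implements it; prefetch is a hypothetical helper built on the standard library's threading and queue modules, shown only to illustrate the mechanism and its cost of one extra thread.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded, so the producer cannot run away

    def producer():
        try:
            for item in iterable:
                buf.put(item)               # blocks while the buffer is full
        finally:
            buf.put(_DONE)                  # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()                    # blocks while the buffer is empty
        if item is _DONE:
            return
        yield item

for batch in prefetch(range(3)):
    print(batch)
```

While the consumer processes one item, the producer is already preparing the next, which hides input latency only if the producer is not itself starved of CPU.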
A word of caution on using Keras as the input pipeline. Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may both lack performance (when doing heavy I/O or augmentations on-the-fly) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading input data, or using alternative input pipelines (as the aforementioned TF native tf.data, or 3rd party ones, like Tensorpack).

Memory Estimation for Convolution Neural Network in Tensorflow

Hello Everyone,
I am working on an image classification problem using TensorFlow and a convolutional neural network.
My model has the following layers:
Input image of size 2456x2058
3 convolution Layer {Con1-shape(10,10,1,32); Con2-shape(5,5,32,64); Con3-shape(5,5,64,64)}
3 max pool 2x2 layer
1 fully connected layer.
I have tried using the NVIDIA-SMI tool, but it only shows me the GPU memory consumption while the model runs.
I would like to know if there is a way to estimate the memory before running the model on the GPU, so that I can design models with the available memory in mind.
I have tried using this method for estimation, but my calculated memory and the observed memory utilisation are nowhere near each other.
Thank you all for your time.
As far as I understand, when you open a session with tensorflow-gpu, it allocates all the memory on the GPUs that are available. So when you look at the nvidia-smi output, you will always see the same amount of used memory, even if only part of it is actually in use. There are options when opening a session to force TensorFlow to allocate only part of the available memory (see How to prevent tensorflow from allocating the totality of a GPU memory? for instance).
You can control how much GPU memory TensorFlow allocates. Once you have calculated the memory requirements of your deep learning model, you can use tf.GPUOptions to cap the allocation.
For example, to allocate approximately 4 GB of GPU memory out of 8 GB, set the fraction to 0.5 and pass the config to tf.Session through the config parameter:
import tensorflow as tf  # TensorFlow 1.x API
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # 0.5 * 8 GB = 4 GB
session = tf.Session(config=config, ...)
per_process_gpu_memory_fraction sets an upper bound on the fraction of each GPU's memory that the process may use.
Here's the link to the documentation:
https://www.tensorflow.org/tutorials/using_gpu
NVIDIA-SMI ... shows me the GPU memory consumption as the model run
TF preallocates all available memory when you use it, so NVIDIA-SMI would show nearly 100% memory usage ...
but my calculated memory and observed memory utilisation are no where near to each other.
.. so this is unsurprising.
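For the estimate itself, a back-of-the-envelope calculation from the layer shapes in the question can be sketched as follows. It rests on several assumptions the question does not state: stride-1 convolutions with SAME padding, each followed by a 2x2 max pool with stride 2, float32 values, and a batch of one image. It counts forward activations and convolution parameters only; the fully connected layer (whose size is not given), gradients, optimizer state and cuDNN workspace are left out, so the real footprint will be higher.

```python
import math

# Kernel shapes from the question: (height, width, in_channels, out_channels).
CONV_SHAPES = [(10, 10, 1, 32), (5, 5, 32, 64), (5, 5, 64, 64)]

def conv_param_count(kh, kw, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kh * kw * c_in * c_out + c_out

def activation_bytes(height, width, conv_shapes, bytes_per_value=4):
    """Bytes held by every conv and pool output for a single image."""
    total = 0
    h, w = height, width
    for _kh, _kw, _c_in, c_out in conv_shapes:
        total += h * w * c_out * bytes_per_value   # conv output (SAME, stride 1)
        h, w = math.ceil(h / 2), math.ceil(w / 2)  # 2x2 max pool, stride 2
        total += h * w * c_out * bytes_per_value   # pool output
    return total

params = sum(conv_param_count(*shape) for shape in CONV_SHAPES)
acts = activation_bytes(2456, 2058, CONV_SHAPES)
print(f"conv parameters: {params}")
print(f"activations per image: {acts / 2**20:.0f} MiB")
```

Under these assumptions the first convolution's output alone is 2456 x 2058 x 32 x 4 bytes, roughly 647 MB per image, which dwarfs the memory taken by the convolution weights. Multiplying the activation figure by the batch size gives a lower bound to compare against the available GPU memory.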
