I'm using Hugging Face Transformers and I'm putting my model on the GPU with the following code:
from transformers import GPTJForCausalLM
import torch
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=False,
    gradient_checkpointing=True
)
model.to("cuda")
I would like to train on 2 GPUs, but the following command fails:
python -m torch.distributed.launch --nproc_per_node 2 --nnodes=1 train.py
This causes an error, I think because it tries to put both process instances on the same GPU. When I remove the model.to("cuda") line it works fine (but then it is presumably not running on the GPU at all).
How can I put the model on the GPU(s) when using multiple GPUs?
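For context, the usual pattern with torch.distributed.launch is to move each process's copy of the model to the GPU given by its local rank rather than to a fixed "cuda" device. A minimal sketch of that placement, not taken from the original post, assuming the launcher exposes LOCAL_RANK (newer launchers set this environment variable; older ones pass a --local_rank argument instead), with process-group setup and DDP wrapping handled elsewhere (e.g. by the Hugging Face Trainer):

import os
import torch
from transformers import GPTJForCausalLM

# One process per GPU; the launcher tells each process which GPU is "its own".
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model.to(f"cuda:{local_rank}")  # each replica lands on a different device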
I observed strange behavior in the final accuracy when I run exactly the same experiment (the same code, training a neural net for image classification) with the same random seed on different GPUs (machines). I use only one GPU. Specifically, when I run the experiment on machine_1 the accuracy is 86.37, and when I run it on machine_2 the accuracy is 88.0.
There is no variability when I run the experiment multiple times on the same machine, and the PyTorch and CUDA versions are the same. Could you help me figure out the reason and fix it?
Machine_1:
NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2
Machine_2:
NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2
To fix the random seed I use the following code:
random.seed(args.seed)
os.environ['PYTHONHASHSEED'] = str(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
This is what I use:
import torch
import os
import numpy as np
import random

def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(13)
Make sure you have a single function that sets all the seeds in one place. If you are using Jupyter notebooks, cell execution order/timing may cause this. The order of the calls inside the function may also matter. I never had problems with this code. You can call set_seed() as often as you like in your code.
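As an aside, and not part of the original answer: newer PyTorch releases also offer a global switch that raises an error whenever a non-deterministic operation is used, which can help narrow down where run-to-run differences come from:

import torch

# Available since PyTorch 1.8; errors out on ops that have no deterministic implementation.
torch.use_deterministic_algorithms(True)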
I'm surprised to be facing an out-of-memory error using the tf.keras.applications.ResNet50 implementation on an Nvidia RTX 2080 Ti (with 11 GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory gets filled up from 700 MB to 10,850 MB in a fraction of a second.
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
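For reference, the note above mentions setting memory_growth to True but the call is not shown in the question; in TF 2.x this is typically done before building the model, along these lines (an assumption about how it was set, not code from the original post):

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing (almost) all of it up front.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)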
ResNet being a complex model, the dimensions of the input might be the reason for the OOM error. Try reducing the input dimensions and the corresponding batch size (to as much as the memory can hold) and try again.
As mentioned in the comments, it worked with batch size 1 and with dimensions 700*512.
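As a rough illustration of that suggestion, the only change to the original snippet is a smaller spatial size and batch size 1; the 512x704 values below are hypothetical, chosen to be close to the roughly 700*512 resolution mentioned above, and the exact size that fits will depend on the GPU:

import tensorflow as tf
import tensorflow.keras as ke
import numpy as np

# Same architecture as in the question, only with a smaller spatial size.
h, w = 512, 704
inputs = ke.layers.Input(shape=(h, w, 3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(h, w)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())

# Batch size 1, as reported to work in the comments.
model.train_on_batch(np.zeros((1, h, w, 3), np.float32), np.zeros((1, h, w, 2), np.float32))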
I'm running into a memory leak when performing inference on an mxnet model (i.e. converting an image buffer to tensor and running one forward pass through the model).
A minimal reproducible example is below:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd
model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()
for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc to obtain
    imgbuf =
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
The result is a linear increase in RSS memory (from 700 MB up to 10 GB+).
The problem persists with other pretrained models and with a custom model that I am trying to use, and inspecting the garbage collector does not show any growth in the number of tracked objects.
This gist has the full code snippet including an example imgbuf.
Environment info:
python 2.7.15
gcc 4.2.1
mxnet-mkl 1.3.1
gluoncv 0.3.0
MXNet runs an asynchronous engine to maximize parallelism and parallel execution of operators. That means every call that enqueues an operation / copies data returns eagerly, and the operation is enqueued on the MXNet backend. Effectively, by running the loop as you have written it, you are enqueueing operations faster than you are processing them.
You can add an explicit synchronization point, for example .asnumpy(), mx.nd.waitall() or .wait_to_read(); that way MXNet will wait for the enqueued operations to be completed before continuing the Python execution.
This will solve your issue:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd

model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()

for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc to obtain
    imgbuf =
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
    mxnet.nd.waitall()  # synchronization point: block until the enqueued work has finished
Read more about MXNet asynchronous execution here: http://d2l.ai/chapter_computational-performance/async-computation.html
Hi, is there any possibility to run TensorFlow Lite on a Linux platform? If yes, then how can we write code in Java/C++/Python to load and run models on Linux? I am familiar with Bazel and have successfully made Android and iOS applications using TensorFlow Lite.
I think the other answers are quite wrong.
Look, I'll tell you my experience... I've been working with Django for many years, and I've been using normal tensorflow, but there was a problem with having 4 or 5 or more models in the same project.
I don't know if you know Gunicorn + Nginx. This spawns workers, so if you have 4 machine learning models, every worker multiplies them: with 3 workers you will have 12 models preloaded in RAM. This is not efficient at all, because if the RAM overflows your project will fall over, or in fact the service responses will be slower.
So this is where TensorFlow Lite comes in. Switching from a TensorFlow model to TensorFlow Lite makes things much more efficient; inference times are reduced dramatically.
Also, Django and Gunicorn can be configured so that the model is pre-loaded and compiled at startup. Then every time the API is hit, it only runs the prediction, which helps keep each API call down to a fraction of a second.
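A minimal sketch of that pattern (module and file names here are hypothetical, not taken from the linked example project): the interpreter is created once at import time, so each Gunicorn worker loads it once and every request only pays for invoke().

# predictor.py -- hypothetical module, imported once per worker
import numpy as np
import tflite_runtime.interpreter as tflite  # or tf.lite.Interpreter with full TensorFlow

_interpreter = tflite.Interpreter(model_path="model.tflite")  # placeholder model path
_interpreter.allocate_tensors()
_input = _interpreter.get_input_details()[0]
_output = _interpreter.get_output_details()[0]

def predict(batch):
    # Called from the Django view on each request; runs one forward pass only.
    _interpreter.set_tensor(_input["index"], batch.astype(_input["dtype"]))
    _interpreter.invoke()
    return _interpreter.get_tensor(_output["index"])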
Currently I have a project in production with 14 models and 9 workers, you can understand the magnitude of that in terms of RAM.
And even though the API does thousands of extra calculations outside of machine learning, each call does not take more than 2 seconds.
Now, if I used normal TensorFlow, it would take at least 4 or 5 seconds.
In summary, you can use TensorFlow Lite; I use it daily on Windows, macOS, and Linux, and it is not necessary to use Docker at all. Just a Python file and that's it. If you have any doubts you can ask me without any problem.
Here is an example project:
Django + Tensorflow Lite
It's possible to run it (but it will work slower than the original TF).
Example
import numpy as np
import tensorflow as tf

# Load TFLite model and allocate tensors.
# graph_file: path to a .tflite model file
interpreter = tf.lite.Interpreter(model_path=graph_file)
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Get quantization info to know the input type
quantization = None
using_type = input_details[0]['dtype']
if using_type is np.uint8:
    quantization = input_details[0]['quantization']

# Get input shape
input_shape = input_details[0]['shape']

# Input tensor
input_data = np.zeros(dtype=using_type, shape=input_shape)

# Set input tensor, run and get output tensor
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
I agree with Nouvellie. It is possible and worth the time to implement. I developed a model on my Ubuntu 18.04, 32-processor server and exported the model to tflite. The model ran in 178 secs on my Ubuntu server. On my Raspberry Pi 4 with 4 GB memory, the tflite implementation ran in 85 secs, less than half the time of my server. When I installed tflite on my server, the run time went down to 22 secs, an 8-fold increase in performance and now almost 4 times faster than the RPi4.
To install for python, I did not have to build the package but was able to use one of the prebuilt interpreters here:
https://www.tensorflow.org/lite/guide/python
I have Ubuntu 18.04 with Python 3.7.7, so I ran pip install with the Linux Python 3.7 wheel:
pip3 install https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-linux_x86_64.whl
Then import the package with:
from tflite_runtime.interpreter import Interpreter
Previous posts show how to use tflite.
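For completeness, a minimal usage sketch with the tflite_runtime interpreter, mirroring the tf.lite example above (the model path is a placeholder):

import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a zero tensor of the right shape/type and run one inference.
input_data = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])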
From TensorFlow Lite:
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices.
TensorFlow Lite is a fork of TensorFlow for embedded devices. For PC, just use the original TensorFlow.
From the TensorFlow GitHub:
TensorFlow is an open source software library
TensorFlow provides stable Python and C APIs, as well as APIs without a backwards compatibility guarantee for C++, Go, Java, JavaScript and Swift.
We support CPU and GPU packages on Linux, Mac, and Windows.
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'
Yes, you can compile Tensorflow Lite to run on Linux platforms even with a Docker container. See the demo: https://sconedocs.github.io/tensorflowlite/
I have been training an NN model using the Keras framework with 4 NVIDIA GPUs (data row count: ~160,000, column count: 5). Now I want to optimize its parameters using GridSearchCV.
However, I encountered several different errors whenever I tried to change n_jobs to a value other than one, such as:
CUDA OUT OF MEMORY
Can not get device properties error code : 3
Then I read this web page,
"# if you're not using a GPU, you can set n_jobs to something other than 1"
http://queirozf.com/entries/scikit-learn-pipeline-examples
So is it not possible to use multiple GPUs with GridSearchCV?
[Environment]
Ubuntu 16.04
Python 3.6.0
Keras / Scikit-Learn
Thanks!
According to the scikit-learn FAQ, GPU is NOT supported: Link
You can use n_jobs to use your CPU cores. If you want to run at maximum speed you might want to use almost all your cores:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
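For illustration, a sketch of plugging that core count into GridSearchCV with a Keras model wrapped for scikit-learn; build_model, the parameter grid, and the input_dim of 5 are hypothetical placeholders, not taken from the question:

import multiprocessing
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model():
    # Hypothetical tiny model; the question's 5-column data suggests input_dim=5.
    model = Sequential([Dense(16, activation="relu", input_dim=5),
                        Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

n_jobs = multiprocessing.cpu_count() - 1  # parallel CV runs on CPU cores, not GPUs

clf = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {"batch_size": [32, 64], "epochs": [10, 20]}  # hypothetical grid

grid = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=n_jobs, cv=3)
# grid.fit(X, y)  # X, y: your training data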