Memory leak (CPU's RAM) when using onnxruntime on GPU

I'm using the InsightFace library from PyPI (https://pypi.org/project/insightface/); the source code is here: https://github.com/deepinsight/insightface/blob/master/python-package/insightface/model_zoo/scrfd.py.
When I run it on my GPU there is a severe leak of CPU RAM (not GPU memory): it had grown past 40 GB by the time I stopped the process.
Here is my script:
import insightface
import cv2
import time

model = insightface.app.FaceAnalysis()
# It happens only when using GPU !!!
ctx_id = 0
image_path = "my-face-image.jpg"
image = cv2.imread(image_path)
model.prepare(ctx_id=ctx_id, det_thresh=0.3, det_size=[416, 416])
detector = model.models["detection"]

for i in range(100000):
    start_t = time.time()
    bboxes, landmarks = detector.detect(image)
    end_t = time.time()
    print('Detection time: {}'.format(end_t - start_t))

print('DONE')
My setup is (inside docker):
Docker Base Image - nvidia/cuda:11.0.3-cudnn8-devel-ubuntu18.04
Nvidia Driver - 465.27
python - 3.6.9
insightface==0.3.8
mxnet==1.8.0.post0
mxnet-cu110==2.0.0a0
numpy==1.18.5
onnx==1.9.0
onnxruntime-gpu==1.8.1

I managed to solve it with the following setup (a quick runtime check is sketched after the list):
Ubuntu-20.04
Python-3.8
Nvidia-470
Cuda-11.3
Cudnn-8
mxnet==1.8.0.post0
onnx==1.9.0
onnxruntime-gpu==1.8.1
insightface==0.4
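As a quick runtime sanity check (a sketch I'm adding for reference, not part of the original fix), the installed versions and the available onnxruntime execution providers can be printed before running the detection loop; CUDAExecutionProvider should appear when the GPU build is active:
import insightface
import onnx
import onnxruntime

# Confirm the environment matches the working setup listed above.
print("insightface:", insightface.__version__)
print("onnx:", onnx.__version__)
print("onnxruntime:", onnxruntime.__version__)
# The GPU build should list CUDAExecutionProvider here.
print("providers:", onnxruntime.get_available_providers())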

Related

AutoGluon TextPredictor.fit gives BrokenPipeError: [Errno 32] Broken pipe - possible solution

I'm posting this as another potential solution to the AutoGluon TextPredictor.fit() "BrokenPipeError: [Errno 32] Broken pipe" error, and to get feedback. I searched broadly, looked at related responses on SO, GitHub, the PyTorch discussion forums, etc., and didn't see option 2 below proposed for AutoGluon, which is why I'd like to propose it.
For the AutoGluon Text Prediction quick start tutorial I used an AWS EC2 p3.2xlarge instance with 8 vCPUs, 61 GB RAM, and one NVIDIA V100, running Windows Server 2016.
Running the tutorial, TextPredictor.fit() gave a "BrokenPipeError: [Errno 32] Broken pipe" error.
Option 1 - One way to fix it is to wrap your code in a run() function guarded by if __name__ == '__main__', as mentioned here:
def run():
    # TextPredictor.fit() code goes here
    ...

if __name__ == '__main__':
    run()
Option 2 - Earlier in that same SO post there is another resolution: set PyTorch's 'num_workers' to 0. However, in my case PyTorch is running via AutoGluon. I looked through the AutoGluon docs here and noticed I could set both env.num_workers and env.num_workers_evaluation. In TextPredictor.fit() I set both to zero and it worked great on my EC2 instance.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
Option 2 only takes ~3 minutes longer to run vs option 1 (17 mins vs 14 mins).
My full code is below. As mentioned earlier most of it was borrowed from the AutoGluon site.
# -*- coding: utf-8 -*-
# -----
# Libs
from datetime import datetime
import torch
from autogluon.text import TextPredictor
import autogluon as ag
from autogluon.core.utils.loaders import load_pd
import pandas as pd
# -----
# Constants.
# This is in seconds. Ex: 5 mins * 60 secs per minute.
TRAIN_TIME_LIMIT = 1*60
# Data source
TRAIN_DATA = '.\\data\\train.parquet'
TEST_DATA = '.\\data\\train.parquet'
TRAIN_SAMPLE_SIZE = 67300
PRED_LABEL_COL = 'label'
# -----
# Main
print("\nStart time: ", datetime.now())
# Sanity test pytorch:
print("Is Cuda available? ", torch.cuda.is_available()) # Should be True
print("Is Cuda device count > 0? ", torch.cuda.device_count()) # Should be > 0
# My debug stuff
from autogluon.text.version import __version__
print("autogluon.text.version: ", __version__)
train_data = load_pd.load(TRAIN_DATA)
test_data = load_pd.load(TEST_DATA)
train_data = train_data.sample(n = TRAIN_SAMPLE_SIZE, random_state = 42)
predictor = TextPredictor(label = PRED_LABEL_COL, eval_metric = 'acc', path = '.\\ag_sst')
# Set num workers to Zero.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
test_score = predictor.evaluate(test_data)
print("\n\nTest score:", test_score)
test_score = predictor.evaluate(test_data, metrics=['acc'])
print(test_score)
print("\nEnd time: ", datetime.now())
Question - I'm not familiar with writing multiprocessing code. Is there anything wrong or inefficient about setting the fit() num_workers hyperparameters as above, in the context of the full code?

Using the GPU with Simple Transformers mt5 training

mt5 fine-tuning does not use the GPU (volatile GPU util 0%).
Hi, I'm trying to fine-tune the mt5-base model for Korean-to-English translation.
I think the CUDA setup was done correctly (cuda.is_available() is True),
but during training the GPU isn't used, except briefly when the dataset is first fetched.
I want to use the GPU efficiently and would like advice on fine-tuning a translation model.
Here are my code and training environment.
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
import torch
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train_df = pd.read_csv("data/enko_train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/enko_eval.tsv", sep="\t").astype(str)
train_df["prefix"] = ""
eval_df["prefix"] = ""
model_args = T5Args()
model_args.max_seq_length = 96
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.num_train_epochs = 10
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 1000
model_args.use_multiprocessing = False
model_args.fp16 = True
model_args.save_steps = 1000
model_args.save_eval_checkpoints = True
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.preprocess_inputs = False
model_args.num_return_sequences = 1
model_args.wandb_project = "MT5 Korean-English Translation"
print("Is cuda available?", torch.cuda.is_available())
model = T5Model("mt5", "google/mt5-base", cuda_device=0, args=model_args)
# Train the model
model.train_model(train_df, eval_data=eval_df)
# Optional: Evaluate the model. We'll test it properly anyway.
results = model.eval_model(eval_df, verbose=True)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
gpu 0 = Quadro RTX 6000
It was just an out-of-memory case. The parameters and dataset weren't being loaded into my GPU memory, so I changed the model from mt5-base to mt5-small, removed the save checkpoints, and reduced the dataset size.
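A minimal sketch of what that adjustment could look like (only the switch to mt5-small and dropping the checkpoints come from the description above; the reduced batch size is an assumed value):
from simpletransformers.t5 import T5Model, T5Args

model_args = T5Args()
model_args.max_seq_length = 96
model_args.train_batch_size = 16          # reduced from 64 (assumed value)
model_args.save_steps = -1                # disable periodic checkpoints
model_args.save_eval_checkpoints = False  # skip saving eval checkpoints
model_args.fp16 = True

# The smaller checkpoint fits more comfortably in GPU memory.
model = T5Model("mt5", "google/mt5-small", cuda_device=0, args=model_args)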

How long does load_dataset take in Hugging Face?

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still sitting on the load_dataset line.
What is it doing? Is it normal for load_dataset to take this much time? Should I interrupt it and run it again?
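One option worth considering (a sketch I'm adding as an assumption, not something from the original post) is the datasets streaming mode, which iterates over the corpus without first downloading and generating the full ~47 GiB on disk; a batch iterator can then pull text batches directly from the stream:
import datasets

# Sketch (assumption): stream the corpus instead of materializing it on disk first.
streamed = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train", streaming=True
)

def streamed_batch_iterator(batch_length=100):
    batch = []
    for example in streamed:
        batch.append(example["text"])
        if len(batch) == batch_length:
            yield batch
            batch = []
    if batch:
        yield batch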

Why does creating a single tensor on the GPU take 2.5 seconds in PyTorch?

I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long time compared to the other methods:
import time
import torch

if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device)  # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))  # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see, this one operation ("y = ...") takes much longer (2.5 seconds) than the rest combined (0.003 seconds). I'm confused by this, since I expect all these methods to do basically the same thing. I've tried making sure the types in this line are 32-bit, and using different shapes, but that didn't change anything.
When I reorder the commands, whichever command comes first takes the 2.5 seconds. This leads me to believe there is a delayed one-time setup of the device happening here, and that future on-GPU allocations will be faster.
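A small sketch of how to separate that one-time setup from the allocation itself (my addition, not from the original post): force CUDA initialization first, then synchronize around the timed operation so only the op is measured:
import time
import torch

# Trigger the one-time CUDA context setup up front.
torch.cuda.init()
torch.cuda.synchronize()

x = torch.randn(4, 4)
start = time.time()
y = torch.ones_like(x, device="cuda")  # only the allocation is timed now
torch.cuda.synchronize()               # wait for the GPU work to finish
print('time =', time.time() - start)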

Gaussian Process eating up my memory

I have scikit-learn 0.13.1 installed on Ubuntu 12.04. Running the following code eats up my memory, i.e. I can watch with top how memory grows on each iteration, and I get a segmentation fault after approx. 160 iterations (having limited available memory to approx. 4 GB with 'ulimit -Sv 4000000').
from sklearn import gaussian_process
import numpy as np

x = np.random.normal(size=(600, 60))
y = np.random.normal(size=600)

for s in range(100000):
    print 'step %s' % s
    test = gaussian_process.GaussianProcess(
        theta0=1e-2,
        thetaL=1e-4,
        thetaU=1e-1,
        nugget=0.01,
        storage_mode='light').fit(x, y)
So am I missing something here?
This looks like a serious memory leak. Please report it on https://github.com/scikit-learn/scikit-learn/issues.
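For anyone reproducing this, here is a small sketch (my addition, assuming the same scikit-learn 0.13 API as above) that logs the process's resident memory each iteration instead of watching top; on Linux, ru_maxrss is reported in kilobytes:
import resource
from sklearn import gaussian_process
import numpy as np

x = np.random.normal(size=(600, 60))
y = np.random.normal(size=600)

for s in range(200):
    gaussian_process.GaussianProcess(theta0=1e-2, thetaL=1e-4, thetaU=1e-1,
                                     nugget=0.01, storage_mode='light').fit(x, y)
    # Max resident set size so far; it keeps climbing if the leak is present.
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('step %s, max RSS: %s kB' % (s, rss_kb))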
