Why does creating a single tensor on the GPU take 2.5 seconds in PyTorch?

I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long amount of time compared to the other methods:
import time
import torch
if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device)                        # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))        # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see, this one operation ("y = ...") takes much longer (2.5 seconds) than all the rest combined (0.003 seconds). This confuses me, since I expected all of these methods to do basically the same thing. I've tried making sure the types in this line are 32-bit and tried different shapes, but that didn't change anything.

When I re-order the commands, whichever GPU command comes first takes the 2.5 seconds. This leads me to believe there is a delayed, one-time setup of the device happening here, and that subsequent on-GPU allocations will be faster.
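A minimal sketch of how this suspicion could be checked, assuming the delay is a lazy, one-time CUDA context initialization that can be triggered up front; the explicit torch.cuda.init() call and dummy allocation are my assumption about the warm-up, not something stated in the tutorial:

import time
import torch

if torch.cuda.is_available():
    t0 = time.time()
    torch.cuda.init()                    # force the lazy CUDA initialization
    _ = torch.zeros(1, device="cuda")    # dummy allocation to finish the warm-up
    torch.cuda.synchronize()
    print('one-time setup:', time.time() - t0, 's')

    t0 = time.time()
    y = torch.ones(4, 4, device="cuda")  # should now take milliseconds, not seconds
    torch.cuda.synchronize()
    print('subsequent allocation:', time.time() - t0, 's')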

Related

AutoGluon TextPredictor.fit gives BrokenPipeError: [Errno 32] Broken pipe - possible solution

I'm posting this as another potential solution to the AutoGluon TextPredictor.fit() "BrokenPipeError: [Errno 32] Broken pipe" issue, and to get feedback. I searched broadly, looked at related responses on SO, GitHub, the PyTorch discussion forums, etc., and don't see option 2 below mentioned for AutoGluon, which is why I'd like to propose it.
For the AutoGluon Text Prediction quick start tutorial I used an AWS EC2 p3.2xlarge instance with 8 vCPUs, 61 GB RAM and 1 NVIDIA V100, running Windows Server 2016.
Running the tutorial, TextPredictor.fit() gave a "BrokenPipeError: [Errno 32] Broken pipe" error.
Option 1 - One way to fix it is to wrap your code in an if __name__ == '__main__': block, as mentioned here:
def run():
    # TextPredictor.fit() code goes here
    ...

if __name__ == '__main__':
    run()
Option 2 - Earlier in that same SO post there is another resolution: set PyTorch's num_workers to 0. However, in my case PyTorch is running via AutoGluon. I looked through the AutoGluon docs here and noticed I could set both env.num_workers and env.num_workers_evaluation. In TextPredictor.fit() I set both to zero and it worked great on my EC2 instance.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
Option 2 only takes ~3 minutes longer to run vs option 1 (17 mins vs 14 mins).
My full code is below. As mentioned earlier most of it was borrowed from the AutoGluon site.
# -*- coding: utf-8 -*-
# -----
# Libs
from datetime import datetime
import torch
from autogluon.text import TextPredictor
import autogluon as ag
from autogluon.core.utils.loaders import load_pd
import pandas as pd
# -----
# Constants.
# This is in seconds. Ex: 5 mins * 60 secs per minute.
TRAIN_TIME_LIMIT = 1*60
# Data source
TRAIN_DATA = '.\\data\\train.parquet'
TEST_DATA = '.\\data\\train.parquet'
TRAIN_SAMPLE_SIZE = 67300
PRED_LABEL_COL = 'label'
# -----
# Main
print("\nStart time: ", datetime.now())
# Sanity test pytorch:
print("Is Cuda available? ", torch.cuda.is_available()) # Should be True
print("Is Cuda device count > 0? ", torch.cuda.device_count()) # Should be > 0
# My debug stuff
from autogluon.text.version import __version__
print("autogluon.text.version: ", __version__)
train_data = load_pd.load(TRAIN_DATA)
test_data = load_pd.load(TEST_DATA)
train_data = train_data.sample(n = TRAIN_SAMPLE_SIZE, random_state = 42)
predictor = TextPredictor(label = PRED_LABEL_COL, eval_metric = 'acc', path = '.\\ag_sst')
# Set num workers to Zero.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
test_score = predictor.evaluate(test_data)
print("\n\nTest score:", test_score)
test_score = predictor.evaluate(test_data, metrics=['acc'])
print(test_score)
print("\nEnd time: ", datetime.now())
Question - I'm not familiar with writing multiprocessing code. Is there anything wrong or inefficient about setting the fit() num_workers hyperparameters as above, in the context of the full code?

Cuda.synchronize()/ .cuda() is extremely slow

I am using Torch 1.7.1 and CUDA 10.1 on a Titan XP, but when I use the .cuda() command it always takes more than 10 minutes.
According to an answer to the same problem posted before, I tried calling torch.cuda.synchronize() before the .cuda() command, but the synchronize call itself still needs more than 10 minutes.
Is there any way to accelerate this?
Here’s my code and result:
import torch
from datetime import datetime
torch.cuda.set_device(2)
t1 = datetime.now()
torch.cuda.synchronize()
print(datetime.now() - t1)
for i in range(10):
    x = torch.randn(10, 10, 10, 10)  # similar timings regardless of the tensor size
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)

Darknet Yolov4 Python memory leak

I ran into a memory leak when running detect_image(...), which is provided by darknet.py, to detect objects in an endless while loop. I'm using Ubuntu 20.04, Python 3.8.10, OpenCV 4.5.2 and CUDA 10.2.
darknet.py already provides a function to take care of this, namely free_image(image). For some reason, it isn't called inside detect_image(...). I added the call right after free_detections(detections, num) and the memory leak is gone. Here's the exact code:
def detect_image(network, class_names, image_path, thresh=.5, hier_thresh=.5, nms=.45):
    """
    Returns a list with highest confidence class and their bbox
    """
    pnum = pointer(c_int(0))
    image = load_image(image_path, 0, 0)
    predict_image(network, image)
    detections = get_network_boxes(network, image.w, image.h,
                                   thresh, hier_thresh, None, 0, pnum, 0)
    num = pnum[0]
    if nms:
        do_nms_sort(detections, num, len(class_names), nms)
    predictions = remove_negatives(detections, class_names, num)
    predictions = decode_detection(predictions)
    free_detections(detections, num)
    free_image(image)  # this was missing...
    return sorted(predictions, key=lambda x: x[1])
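For context, here is a minimal sketch of how the patched function might be used in an endless detection loop; the darknet.load_network call and the config/weights paths are assumptions based on the stock darknet.py, not part of the original report:

import darknet
from darknet import detect_image  # the patched version shown above

# Assumed model setup; adjust the config, data and weights paths to your model.
network, class_names, class_colors = darknet.load_network(
    "cfg/yolov4.cfg", "cfg/coco.data", "yolov4.weights", batch_size=1)

while True:
    # With free_image(image) added inside detect_image, memory usage stays flat
    # across iterations instead of growing on every call.
    detections = detect_image(network, class_names, "frame.jpg")
    print(detections)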

TensorFlow vs PyTorch: Memory usage

I have PyTorch 1.9.0 and TensorFlow 2.6.0 in the same environment, and both recognize all the GPUs.
I was comparing the performance of both, so I ran this small, simple test: multiplying two large matrices (A and B, both 2000x2000) 10000 times:
import numpy as np
import os
import time

def mul_torch(A, B):
    # PyTorch matrix multiplication
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    import torch

    A, B = torch.Tensor(A.copy()), torch.Tensor(B.copy())
    A = A.cuda()
    B = B.cuda()

    start = time.time()
    for i in range(10000):
        C = torch.matmul(A, B)
        torch.cuda.empty_cache()
    print('PyTorch:', time.time() - start, 's')
    return C

def mul_tf(A, B):
    # TensorFlow Matrix Multiplication
    import tensorflow as tf
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

    with tf.device('GPU:0'):
        A = tf.constant(A.copy())
        B = tf.constant(B.copy())

        start = time.time()
        for i in range(10000):
            C = tf.math.multiply(A, B)
        print('TensorFlow:', time.time() - start, 's')
    return C

if __name__ == '__main__':
    A = np.load('A.npy')
    B = np.load('B.npy')

    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    PT = mul_torch(A, B)
    time.sleep(5)
    TF = mul_tf(A, B)
As a result:
PyTorch: 19.86856198310852 s
TensorFlow: 2.8338065147399902 s
I was not expecting these results; I thought they should be similar.
Investigating the GPU performance, I noticed that both use the GPU at full capacity, but PyTorch uses only a small fraction of the memory that TensorFlow uses. This explains the processing time difference, but I cannot explain the difference in memory usage. Is it something intrinsic to the methods, or is it my computer configuration? Regardless of the matrix size (at least for matrices larger than 1000x1000), these plateaus are the same.
Thank you for your help.
It is because you are doing matrix multiplication in PyTorch but element-wise multiplication in TensorFlow. To do matrix multiplication in TF, use tf.matmul or simply:
for i in range(10000):
    C = A @ B
The @ operator does matrix multiplication for both TF and torch. For a fair comparison you also have to call torch.cuda.synchronize() inside the time measurement and move torch.cuda.empty_cache() outside of the measurement.
The expected result is that TensorFlow's eager execution is slower than PyTorch.
Regarding the memory usage: TF by default claims all GPU memory, so looking at nvidia-smi on Linux (or, similarly, Task Manager on Windows) does not reflect the actual memory usage of the operations.
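Here is a minimal sketch of what a fairer PyTorch timing loop could look like, following the points above (matrix multiplication via @, torch.cuda.synchronize() inside the measurement, torch.cuda.empty_cache() moved out of it). The function name mul_torch_fair is just for illustration, not code from the original answer:

import time
import torch

def mul_torch_fair(A, B, iters=10000):
    # Move inputs to the GPU once, outside the timed region.
    A = torch.Tensor(A.copy()).cuda()
    B = torch.Tensor(B.copy()).cuda()

    start = time.time()
    for _ in range(iters):
        C = A @ B                 # matrix multiplication, matching the TF fix
    torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
    print('PyTorch (fair):', time.time() - start, 's')

    torch.cuda.empty_cache()      # cache cleanup kept outside the measurement
    return C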

Can I disable CUDA temporarily in PyTorch? [duplicate]

I want to do some timing comparisons between CPU & GPU as well as some profiling, and would like to know if there's a way to tell PyTorch not to use the GPU and instead use the CPU only. I realize I could install a separate CPU-only PyTorch, but I'm hoping there's an easier way.
Before running your code, run this shell command to tell torch that there are no GPUs:
export CUDA_VISIBLE_DEVICES=""
This one, in contrast, will tell it to use only one GPU (the one with id 0), and so on:
export CUDA_VISIBLE_DEVICES="0"
I just wanted to add that it is also possible to do so within the PyTorch code:
Here is a small example taken from the PyTorch Migration Guide for 0.4.0:
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
...
# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
I think the example is pretty self-explanatory. But if there are any questions, just ask! One big advantage of using this syntax, as in the example above, is that you can write code which runs on the CPU if no GPU is available, but also on the GPU, without changing a single line.
Instead of using the if-statement with torch.cuda.is_available() you can also just set the device to CPU like this:
device = torch.device("cpu")
Further you can create tensors on the desired device using the device flag:
mytensor = torch.rand(5, 5, device=device)
This will create a tensor directly on the device you specified previously.
I want to point out that with this syntax you can not only switch between CPU and GPU, but also between different GPUs.
I hope this is helpful!
The simplest way using Python is:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before CUDA is initialized
There are multiple ways to force CPU use:
Set default tensor type:
torch.set_default_tensor_type(torch.FloatTensor)
Set the device and reference it consistently when creating tensors (with this you can easily switch between GPU and CPU):
device = 'cpu'
# ...
x = torch.rand(2, 10, device=device)
Hide GPU from view:
import os
os.environ["CUDA_VISIBLE_DEVICES"]=""
General
As previous answers showed, you can make PyTorch run on the CPU using:
device = torch.device("cpu")
Comparing Trained Models
I would like to add how you can load a previously trained model on the CPU (examples taken from the PyTorch docs).
Note: make sure that all the data fed into the model is also on the CPU.
Recommended loading
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=torch.device("cpu")))
Loading entire model
model = torch.load(PATH, map_location=torch.device("cpu"))
This is a real-world example: the original function using the GPU, versus the new function using the CPU.
Source: https://github.com/zllrunning/face-parsing.PyTorch/blob/master/test.py
In my case, I edited these 4 lines of code:
#totally new line of code
device=torch.device("cpu")
#net.cuda()
net.to(device)
#net.load_state_dict(torch.load(cp))
net.load_state_dict(torch.load(cp, map_location=torch.device('cpu')))
#img = img.cuda()
img = img.to(device)
#new_function_with_cpu
def evaluate(image_path='./imgs/116.jpg', cp='cp/79999_iter.pth'):
    device = torch.device("cpu")
    n_classes = 19
    net = BiSeNet(n_classes=n_classes)
    #net.cuda()
    net.to(device)
    #net.load_state_dict(torch.load(cp))
    net.load_state_dict(torch.load(cp, map_location=torch.device('cpu')))
    net.eval()
    to_tensor = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    with torch.no_grad():
        img = Image.open(image_path)
        image = img.resize((512, 512), Image.BILINEAR)
        img = to_tensor(image)
        img = torch.unsqueeze(img, 0)
        #img = img.cuda()
        img = img.to(device)
        out = net(img)[0]
        parsing = out.squeeze(0).cpu().numpy().argmax(0)
    return parsing
#original_function_with_gpu
def evaluate(image_path='./imgs/116.jpg', cp='cp/79999_iter.pth'):
    n_classes = 19
    net = BiSeNet(n_classes=n_classes)
    net.cuda()
    net.load_state_dict(torch.load(cp))
    net.eval()
    to_tensor = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    with torch.no_grad():
        img = Image.open(image_path)
        image = img.resize((512, 512), Image.BILINEAR)
        img = to_tensor(image)
        img = torch.unsqueeze(img, 0)
        img = img.cuda()
        out = net(img)[0]
        parsing = out.squeeze(0).cpu().numpy().argmax(0)
    return parsing
