I am using PyTorch 1.7.1 and CUDA 10.1 on a Titan XP.
But when I use the .cuda() command, it always takes more than 10 minutes.
Following the same problem answered before, I tried calling torch.cuda.synchronize() before the .cuda() command, but the synchronize call itself also takes more than 10 minutes.
Is there any way to accelerate this?
Here’s my code and result:
import torch
from datetime import datetime

torch.cuda.set_device(2)

t1 = datetime.now()
torch.cuda.synchronize()
print(datetime.now() - t1)

for i in range(10):
    x = torch.randn(10, 10, 10, 10)  # similar timings regardless of the tensor size
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)
I have PyTorch 1.9.0 and TensorFlow 2.6.0 in the same environment, and both recognize all GPUs.
I was comparing the performance of both, so I ran this small, simple test: multiplying large matrices (A and B, both 2000x2000) several times (10000x):
import numpy as np
import os
import time

def mul_torch(A, B):
    # PyTorch matrix multiplication
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    import torch
    A, B = torch.Tensor(A.copy()), torch.Tensor(B.copy())
    A = A.cuda()
    B = B.cuda()
    start = time.time()
    for i in range(10000):
        C = torch.matmul(A, B)
        torch.cuda.empty_cache()
    print('PyTorch:', time.time() - start, 's')
    return C

def mul_tf(A, B):
    # TensorFlow matrix multiplication
    import tensorflow as tf
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    with tf.device('GPU:0'):
        A = tf.constant(A.copy())
        B = tf.constant(B.copy())
        start = time.time()
        for i in range(10000):
            C = tf.math.multiply(A, B)
        print('TensorFlow:', time.time() - start, 's')
    return C

if __name__ == '__main__':
    A = np.load('A.npy')
    B = np.load('B.npy')
    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    PT = mul_torch(A, B)
    time.sleep(5)
    TF = mul_tf(A, B)
As a result:
PyTorch: 19.86856198310852 s
TensorFlow: 2.8338065147399902 s
I was not expecting these results; I thought they would be similar.
Investigating the GPU performance, I noticed that both use the GPU at full capacity, but PyTorch uses only a small fraction of the memory that TensorFlow uses. It explains the processing time difference, but I cannot explain the difference in memory usage. Is it something intrinsic to the methods, or is it my computer configuration? Regardless of the matrix size (at least for matrices larger than 1000x1000), these plateaus are the same.
Thank you for your help.
It is because you are doing matrix multiplication in pytorch but element-wise multiplication in tensorflow. To do matrix multiplication in TF, use tf.matmul or simply:
for i in range(10000):
    C = A @ B
That does the same thing in both TF and Torch. For a fair comparison, you also have to call torch.cuda.synchronize() inside the time measurement and move torch.cuda.empty_cache() outside of it.
The expected result is that TensorFlow's eager execution is slower than PyTorch.
Regarding the memory usage: TF by default claims all of the GPU memory, so nvidia-smi on Linux (or, similarly, Task Manager on Windows) does not reflect the actual memory usage of the operations.
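As a rough sketch of what a fairer measurement could look like (not the original code from the question; sizes and iteration counts are just illustrative), with matmul in both frameworks and synchronization inside the timed region:

import time
import numpy as np
import torch
import tensorflow as tf

A_np = np.random.rand(2000, 2000).astype(np.float32)
B_np = np.random.rand(2000, 2000).astype(np.float32)

# PyTorch: CUDA calls are asynchronous, so synchronize before reading the clock
A, B = torch.tensor(A_np).cuda(), torch.tensor(B_np).cuda()
start = time.time()
for _ in range(10000):
    C = torch.matmul(A, B)
torch.cuda.synchronize()
print('PyTorch:', time.time() - start, 's')

# TensorFlow: use matmul, not element-wise multiply
with tf.device('GPU:0'):
    A_tf, B_tf = tf.constant(A_np), tf.constant(B_np)
    start = time.time()
    for _ in range(10000):
        C_tf = tf.matmul(A_tf, B_tf)
    _ = C_tf.numpy()  # pull the result back to host so the GPU work has actually finished
    print('TensorFlow:', time.time() - start, 's')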
I need to pass dates into a numba function.
Passing them in as .astype('datetime64[D]') works well, but I also need to create an epoch date inside the function.
import numpy as np
import pandas as pd
import numba
from numba import jit
from datetime import datetime, timedelta

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

@jit(nopython=True)
def myfunc(dts):
    epoch = np.datetime64('1970-01-01').astype('datetime64[D]')
    if epoch == dts[0]:
        n = 1
    return epoch

dts = [dt for dt in
       datetime_range(datetime(2016, 9, 1, 7), datetime(2016, 9, 2, 7),
                      timedelta(minutes=15))]
pandas_df = pd.DataFrame(index=dts)
res = myfunc(pandas_df.index.values.astype('datetime64[D]'))
print(res)
I get this error:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'astype' of type datetime64[]
File "test5.py", line 17:
def myfunc(dts):
epoch = np.datetime64('1970-01-01').astype('datetime64[D]')
^
During: typing of get attribute at C:/Users/PUser/PycharmProjects/pythonProjectTEST/test5.py (17)
File "test5.py", line 17:
def myfunc(dts):
epoch = np.datetime64('1970-01-01').astype('datetime64[D]')
^
How can I make this work?
Your problem is likely related to this documented issue with numba.
A first workaround would be to define epoch outside of your jit function:
def myfunc(dts):
    @jit(nopython=True)
    def wrapper(dts, epoch):
        if epoch == dts[0]:
            n = 1
        return epoch
    epoch = np.datetime64('1970-01-01').astype('datetime64[D]')
    return wrapper(dts, epoch)
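With that wrapper in place, the call site from the question should work unchanged:

res = myfunc(pandas_df.index.values.astype('datetime64[D]'))
print(res)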
An alternative, hacky solution that also comes to mind would be to render your dates as strings before feeding them to myfunc:
res = myfunc(np.datetime_as_string(pandas_df.index.values, unit='D'))
and define epoch = '1970-01-01' inside of myfunc.
You can finally add a post-processing step after that to convert your strings back to datetime64 or whatever they need to be.
Here is the code that I'm using to plot many figures and save them, but it is eating up all of the available RAM and causes the notebook to crash. I tried adding fig.clf(), del fig, gc.collect(), and yet nothing seems to work.
I'm able to save only around 38 figures before the session crashes on Google Colab, since the RAM gets full.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

print(np.__version__)   # 1.19.5
print(mpl.__version__)  # 3.2.2, also tried with latest 3.4.1

x = np.arange(0, 280, 0.1)
y = np.sin(x)

for k in range(100):
    fig, ax = plt.subplots(6, 2, sharex=True)
    fig.set_size_inches(37.33, 21)
    for i in range(2):
        for j in range(6):
            ax[j][i].plot(x, y)
    fig.savefig(f'figure{k}.png', dpi=300)
    plt.close(fig)
This is related to the inline backend. The memory leak can be avoided by explicitly switching to the agg backend.
cross ref: matplotlib/issues/20067
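For instance, a minimal sketch of that switch (assuming the loop from the question; the backend has to be selected before any figures are created):

import matplotlib
matplotlib.use('agg')  # non-interactive backend, so the inline machinery never holds on to the figures
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 280, 0.1)
y = np.sin(x)

for k in range(100):
    fig, ax = plt.subplots(6, 2, sharex=True, figsize=(37.33, 21))
    for i in range(2):
        for j in range(6):
            ax[j][i].plot(x, y)
    fig.savefig(f'figure{k}.png', dpi=300)
    plt.close(fig)  # with agg this actually releases the figure's memory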
Maybe try to save each figure right after it is generated, I mean try putting fig.savefig inside the for loop.
Edit: after looking for the issue on Google, I found that you might need to buy Colab Pro.
I have a matrix A and want to calculate the distance matrix D from it iteratively. The reason for wanting to calculate it step by step is to later include some if-statements in the iteration process.
My code right now looks like this:
import numpy as np
from scipy.spatial import distance

def create_data_matrix(n, m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean, cov, n)
    return data_matrix

def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A, "euclidean")), 0)
    return distance_matrix

matrix_a = create_data_matrix(1000, 2)
distance_from_numpy = create_full_distance(matrix_a)

matrix_b = np.empty((1000, 1000))
for idx, line in enumerate(matrix_a):
    for j, line2 in enumerate(matrix_a):
        matrix_b[idx][j] = distance.euclidean(matrix_a[idx], matrix_a[j])
Now the matrices "distance_from_numpy" and "matrix_b" are the same, though matrix_b takes far longer to calculate, although matrix_a is only a (1000x2) matrix. I know that the "distance.pdist()" method is very fast, but I am not sure if I can implement it in an iterative process.
My question is: why is the double for loop so slow, and how can I increase the speed while still preserving the iteration process (since I want to include if-statements there)?
Edit, for context: I want to preserve the iteration because I'd like to stop it if one of the distances is smaller than a specific number.
Python is a high-level language and therefore loops are inherently slow; the interpreter has to deal with a lot of overhead, and this gets progressively worse as the number of nested loops increases. NumPy, on the other hand, uses fast compiled C and Fortran code.
To speed up the Python implementation, you can for example implement the loop part with Cython, which will translate your code to C and then compile it for faster execution. Other options are Numba, or writing the loops in Fortran.
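For example, a rough Numba sketch along those lines (not from the original answer; the threshold value is made up for illustration) that keeps the explicit loops so an if-statement can stop the iteration early:

import numpy as np
from numba import njit

@njit
def pairwise_until_threshold(points, threshold):
    # Fill the distance matrix entry by entry and stop as soon as a
    # non-zero distance falls below the threshold.
    n = points.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = np.linalg.norm(points[i] - points[j])
            dist[i, j] = d
            if i != j and d < threshold:
                return dist, i, j
    return dist, -1, -1

matrix_a = np.random.multivariate_normal(np.zeros(2), np.eye(2), 1000)
dist, i, j = pairwise_until_threshold(matrix_a, 0.001)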
As Ehsan mentioned in a comment, I used numba to increase the computational speed.
import time
from numba import jit
import numpy as np
from scipy.spatial import distance

def create_data_matrix(n, m):
    mean = np.zeros(m)
    cov = np.eye(m, dtype=float)
    data_matrix = np.random.multivariate_normal(mean, cov, n)
    return data_matrix

def create_full_distance(A):
    distance_matrix = np.triu(distance.squareform(distance.pdist(A, "euclidean")), 0)
    return distance_matrix

@jit(nopython=True)  # set "nopython" mode for best performance, equivalent to @njit
def slow_loop(matrix_a):
    matrix_b = np.empty((1000, 1000))
    for i in range(len(matrix_a)):
        for j in range(len(matrix_a)):
            # matrix_b[i][j] = distance.euclidean(matrix_a[i], matrix_a[j])  # scipy is not available in nopython mode
            matrix_b[i][j] = np.linalg.norm(matrix_a[i] - matrix_a[j])
    print("matrix_b: ", matrix_b)
    return

def slow_loop_without_numba(matrix_a):
    matrix_b = np.empty((1000, 1000))
    for i in range(len(matrix_a)):
        for j in range(len(matrix_a)):
            matrix_b[i][j] = np.linalg.norm(matrix_a[i] - matrix_a[j])
    return

matrix_a = create_data_matrix(1000, 2)

start = time.time()
ergebnis = create_full_distance(matrix_a)
# print("ergebnis: ", ergebnis)
end = time.time()
print("with scipy.distance.pdist = %s" % (end - start))

start2 = time.time()
slow_loop(matrix_a)
end2 = time.time()
print("with @jit onto np.linalg.norm = %s" % (end2 - start2))

start3 = time.time()
slow_loop_without_numba(matrix_a)
end3 = time.time()
print("slow_loop without numba = %s" % (end3 - start3))
I executed the code and it yielded these results:
with scipy.distance.pdist = 0.021986722946166992
with @jit onto np.linalg.norm = 0.8565070629119873
slow_loop without numba = 6.818004846572876
So numba increased the computational speed by a lot, although scipy is still much faster. This will be more interesting the bigger the distance matrices get. I couldn't use numba on a function with scipy methods.
I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long amount of time compared to the other methods:
import time
import torch

if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device)  # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))  # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see, this one operation ("y = ...") takes much longer (2.5 seconds) than the rest combined (0.003 seconds). I'm confused about this, as I expect all these methods to basically do the same thing. I've tried making sure the types in this line are 32-bit and using different shapes, but that didn't change anything.
When I re-order the commands, whatever command comes first takes the 2.5 seconds. This leads me to believe there is a delayed one-time setup of the device happening here, and future on-GPU allocations will be faster.
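One way to confirm this (a rough sketch, not part of the tutorial) is to issue a throwaway CUDA operation first, so the one-time initialization is paid outside the timed region, and to synchronize before reading the clock:

import time
import torch

device = torch.device("cuda")

# Warm-up: the first CUDA call triggers context creation and allocator setup
_ = torch.ones(1, device=device)
torch.cuda.synchronize()

x = torch.randn(4, 4)
start = time.time()
y = torch.ones_like(x, device=device)
torch.cuda.synchronize()  # wait for the GPU work to finish before stopping the timer
print('time =', time.time() - start)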