I have scikit-learn 0.13.1 installed on Ubuntu 12.04. Running the following code eats up my memory, i.e. I can watch in top how memory grows with each iteration, and I get a segmentation fault after approx. 160 iterations (with available memory limited to approx. 4 GB via 'ulimit -Sv 4000000').
from sklearn import gaussian_process
import numpy as np

x = np.random.normal(size=(600, 60))
y = np.random.normal(size=600)

for s in range(100000):
    print 'step %s' % s
    test = gaussian_process.GaussianProcess(
        theta0=1e-2,
        thetaL=1e-4,
        thetaU=1e-1,
        nugget=0.01,
        storage_mode='light').fit(x, y)
So, am I missing something here?
This looks like a serious memory leak. Please report it on https://github.com/scikit-learn/scikit-learn/issues .
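In the meantime, a common way to contain a leak like this is to run each fit in a short-lived child process so that all the memory it allocates is returned to the OS when the process exits. A minimal sketch using the standard multiprocessing module, reusing the same data and parameters as above:

from multiprocessing import Process
from sklearn import gaussian_process
import numpy as np

def fit_once(x, y):
    # Fit inside a child process; whatever memory the fit leaks is
    # reclaimed by the OS when the process terminates.
    gaussian_process.GaussianProcess(
        theta0=1e-2, thetaL=1e-4, thetaU=1e-1,
        nugget=0.01, storage_mode='light').fit(x, y)

x = np.random.normal(size=(600, 60))
y = np.random.normal(size=600)
for s in range(100000):
    p = Process(target=fit_once, args=(x, y))
    p.start()
    p.join()

This trades some process-startup overhead per iteration for a flat memory profile in the parent.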
I have PyTorch 1.9.0 and TensorFlow 2.6.0 in the same environment, and both recognize all the GPUs.
I was comparing the performance of both, so I did this small simple test, multiplying large matrices (A and B, both 2000x2000) several times (10000x):
import numpy as np
import os
import time

def mul_torch(A, B):
    # PyTorch matrix multiplication
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    import torch
    A, B = torch.Tensor(A.copy()), torch.Tensor(B.copy())
    A = A.cuda()
    B = B.cuda()
    start = time.time()
    for i in range(10000):
        C = torch.matmul(A, B)
        torch.cuda.empty_cache()
    print('PyTorch:', time.time() - start, 's')
    return C

def mul_tf(A, B):
    # TensorFlow matrix multiplication
    import tensorflow as tf
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    with tf.device('GPU:0'):
        A = tf.constant(A.copy())
        B = tf.constant(B.copy())
        start = time.time()
        for i in range(10000):
            C = tf.math.multiply(A, B)
    print('TensorFlow:', time.time() - start, 's')
    return C

if __name__ == '__main__':
    A = np.load('A.npy')
    B = np.load('B.npy')
    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    PT = mul_torch(A, B)
    time.sleep(5)
    TF = mul_tf(A, B)
As a result:
PyTorch: 19.86856198310852 s
TensorFlow: 2.8338065147399902 s
I was not expecting these results; I thought they would be similar.
Investigating the GPU performance, I noticed that both use the GPU at full capacity, but PyTorch uses only a small fraction of the memory that TensorFlow uses. That explains the processing-time difference, but I cannot explain the difference in memory usage. Is it something intrinsic to the methods, or is it my computer configuration? Regardless of the matrix size (at least for matrices larger than 1000x1000), these plateaus are the same.
Thank you for your help.
It is because you are doing matrix multiplication in pytorch but element-wise multiplication in tensorflow. To do matrix multiplication in TF, use tf.matmul or simply:
for i in range(10000):
    C = A @ B
That does the same thing in both TF and PyTorch. For a fair measurement you also have to call torch.cuda.synchronize() inside the timed region and move torch.cuda.empty_cache() outside of it.
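For reference, a fairer version of the two timed loops might look like the following minimal sketch (same 2000x2000 float32 matrices as in the question; the variable names are mine):

import time
import numpy as np
import torch
import tensorflow as tf

A_np = np.random.rand(2000, 2000).astype(np.float32)
B_np = np.random.rand(2000, 2000).astype(np.float32)

# PyTorch: matrix multiplication on the GPU
A_t, B_t = torch.tensor(A_np).cuda(), torch.tensor(B_np).cuda()
torch.cuda.synchronize()
start = time.time()
for _ in range(10000):
    C_t = A_t @ B_t
torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
print('PyTorch:', time.time() - start, 's')

# TensorFlow: matrix multiplication on the GPU
with tf.device('GPU:0'):
    A_tf, B_tf = tf.constant(A_np), tf.constant(B_np)
    start = time.time()
    for _ in range(10000):
        C_tf = tf.matmul(A_tf, B_tf)
    _ = C_tf.numpy()  # forces the result back to host, so execution has finished
print('TensorFlow:', time.time() - start, 's')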
The expected result is that TensorFlow's eager execution will be slower than PyTorch.
Regarding memory usage: TF claims all GPU memory by default, so nvidia-smi on Linux (or, similarly, Task Manager on Windows) does not reflect the actual memory usage of the operations.
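If you want nvidia-smi to come closer to the real working set, you can ask TensorFlow to grow its GPU allocation on demand instead of reserving everything up front. A minimal sketch (must run before any GPU op is executed):

import tensorflow as tf

# Enable on-demand GPU memory allocation instead of the default behaviour
# of claiming (almost) all GPU memory at startup.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)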
In PyTorch, I found that stack-ing or cat-ing multiple tensors increases memory usage by the sum of the sizes of all the input arrays. An example is as follows:
import torch as tc
import torch.autograd as tag
import sys
import psutil
import os
import resource

def get_ru_maxrss():
    """ Return max RSS usage (in megabytes) """
    size = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        # on Mac OS X ru_maxrss is in bytes, on Linux it is in KB
        size //= 1024
    return size / 1024

def cpuStats():
    print(sys.version)
    print(psutil.cpu_percent())
    print(psutil.virtual_memory())  # physical memory usage
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0] / 2. ** 30  # memory use in GB...I think
    print('memory GB:', memoryUse)

m0 = get_ru_maxrss()

x1 = tc.ones([8192, 8192], requires_grad=True)
print(x1.dtype)
print(get_ru_maxrss() - m0)
print('=======')

y = x1 * 1.1
print(y.dtype)
print(get_ru_maxrss() - m0)
print('=======')

for i in range(10):
    y = tc.cat([y, x1])
    print(y.dtype)
    print(get_ru_maxrss() - m0)
    print('=======')

loss = tc.mean(y)
print(get_ru_maxrss() - m0)
print('=======')

loss.backward()
print(get_ru_maxrss() - m0)
print('=======')
And we can see that each occurrence of cat in the for loop increases memory usage by 512 MB, which is the sum of the sizes of y and x1 (256 MB each). This isn't a major issue for me right now, but I'm just curious about it.

If I understand it correctly, the vector-Jacobian product of stack or cat just performs the reverse operation on the gradient vector, splitting it back into multiple arrays whose shapes match the original inputs of the stack or cat call. This process doesn't need the intermediate value of the stacked tensor computed in the forward pass, which is similar to a linear operation, yet the latter doesn't incur additional memory usage (e.g., if I change the cat in the for loop to a linear op such as y = y * 1.1, memory consumption doesn't increase).

So I'm wondering whether the increased memory usage is essentially just empty space allocated to hold the "split" gradient arrays in the backward pass, which has to be contiguous so that the splitting operation doesn't have to stride through memory. It would follow that a linear operation doesn't need this additional memory because the memory of the incoming gradient can be overwritten directly without worrying about contiguity. Is that right?
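For what it's worth, the vector-Jacobian product described above can be checked directly with a tiny sketch (toy shapes, names are mine): the gradient flowing into cat is simply split back along the concatenation dimension to match the input shapes.

import torch

a = torch.ones(3, 2, requires_grad=True)
b = torch.ones(5, 2, requires_grad=True)
y = torch.cat([a, b])                    # shape (8, 2)
g = torch.arange(16.).reshape(8, 2)      # upstream gradient
y.backward(g)
# The vector-Jacobian product of cat is a split of g along dim 0:
print(torch.equal(a.grad, g[:3]))        # True
print(torch.equal(b.grad, g[3:]))        # True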
I defined a function in Python 3.5 called 'evaluate' and the code is shown below ('REC_Y', 'REC_U', 'REC_V' represent the 3 channels of a YCbCr image respectively):
import numpy as np

def evaluate(REC_Y, REC_U, REC_V):
    height = 832
    width = 480
    bufY = np.reshape(np.asarray(REC_Y), (height, width))
    bufU = np.reshape(np.asarray(REC_U), (int(height / 2), int(width / 2)))
    bufV = np.reshape(np.asarray(REC_V), (int(height / 2), int(width / 2)))
    return np.stack((bufY, bufU, bufV), axis=2)
In order to release some GPU memory (since I already had a GPU MemoryError), I'd like to remove 'REC_Y', 'REC_U' and 'REC_V' from memory after the last line of the code (after 'bufV = np.reshape(np.asarray(REC_V), (int(height / 2), int(width / 2)))'). I have tried 'del REC_Y', but it said 'REC_Y' was referenced before assignment. I have tried del globals()["REC_Y"], but it said that "REC_Y" is not defined as a global variable.
Could you please help me with this issue? How can I delete the 3 parameters of the 'evaluate' function to release GPU memory?
Many thanks!
NumPy does not work on the GPU, so there is no GPU memory to free here: NumPy arrays live in CPU memory. Only if you were using CuPy or CUDA operations could you try to free memory on the GPU.
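For comparison, if the arrays really lived on the GPU (for example as CuPy arrays, which is an assumption here; your code uses NumPy only), releasing the memory would look roughly like this minimal sketch:

import cupy as cp

a = cp.zeros((832, 480), dtype=cp.float32)  # hypothetical array in GPU memory
del a                                        # drop the last reference to it
# Return the now-unused blocks from CuPy's memory pool to the GPU driver.
cp.get_default_memory_pool().free_all_blocks()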
I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long time compared to the other methods:
import time
import torch

if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device)                        # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))        # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see, this one operation ("y = ...") takes much longer (2.5 seconds) than the rest combined (0.003 seconds). I'm confused by this, as I expected all these methods to do basically the same thing. I've tried making sure the types in this line are 32-bit and tried different shapes, but that didn't change anything.
When I re-order the commands, whichever command comes first takes the 2.5 seconds. So this leads me to believe there is a delayed one-time setup of the device happening here, and future on-GPU allocations will be faster.
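A minimal sketch of how you might separate that one-time CUDA setup cost from the allocation you actually want to time (same tensors as in the question):

import time
import torch

device = torch.device("cuda")

# Warm-up: the first CUDA call pays the one-time context/driver initialization cost.
torch.ones(1, device=device)
torch.cuda.synchronize()

x = torch.randn(4, 4)
start = time.time()
y = torch.ones_like(x, device=device)
torch.cuda.synchronize()  # make sure the GPU work has actually finished
print('time for ones_like on GPU =', time.time() - start)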
In my application, I'm re-using the existing MobileNet trained on ImageNet and re-training the output layers on the flowers dataset with only 5 classes. The re-trained model is saved to disk. Afterwards, the model is loaded and evaluated over several iterations, which eventually exhausts memory and crashes the whole application. After doing some diagnostics, I realized that the leak is coming from the model.evaluate() keras method. The issue can be reproduced with the standalone sample code below:
import os
import resource
import keras
import numpy as np

if __name__ == '__main__':
    init_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    for it in range(4):
        x_valid = np.random.uniform(0, 1, (64, 224, 224, 3)).astype(np.float32)
        y_valid = keras.utils.to_categorical(np.random.uniform(0, 5, (64, )).astype(np.int32), 5)

        start_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        model = keras.models.load_model(os.path.abspath(os.path.join('.', 'mobilenet_flowers.h5')),
                                        custom_objects={'relu6': keras.applications.mobilenet.relu6,
                                                        'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D})
        loss, _ = model.evaluate(x_valid, y_valid, batch_size=64, verbose=False)
        keras.backend.clear_session()
        del model
        end_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

        print('Iteration %d:' % it)
        print(' Memory alloc before evaluate() is %7d kilobytes' % start_alloc)
        print(' Memory alloc after evaluate() is %7d kilobytes' % end_alloc)
        print(' Memory alloc loss for evaluate is %7d kilobytes\n' % (end_alloc - start_alloc))

    exit_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('Memory alloc before loop is %7d kilobytes' % init_alloc)
    print('Memory alloc after loop is %7d kilobytes' % exit_alloc)
    print('Memory alloc difference is %7d kilobytes' % (exit_alloc - init_alloc))
When I execute the script, the following is printed out:
Iteration 0:
Memory alloc before evaluate() is 251864 kilobytes
Memory alloc after evaluate() is 901696 kilobytes
Memory alloc loss for evaluate is 649832 kilobytes
Iteration 1:
Memory alloc before evaluate() is 901696 kilobytes
Memory alloc after evaluate() is 1036780 kilobytes
Memory alloc loss for evaluate is 135084 kilobytes
Iteration 2:
Memory alloc before evaluate() is 1036780 kilobytes
Memory alloc after evaluate() is 1148692 kilobytes
Memory alloc loss for evaluate is 111912 kilobytes
Iteration 3:
Memory alloc before evaluate() is 1148692 kilobytes
Memory alloc after evaluate() is 1190804 kilobytes
Memory alloc loss for evaluate is 42112 kilobytes
Memory alloc before loop is 138792 kilobytes
Memory alloc after loop is 1190804 kilobytes
Memory alloc difference is 1052012 kilobytes
Any suggestions as to what may be wrong here? After going through the forums I tried adding K.clear_session(), but, as you can see in the code, that didn't help. The model is temporarily stored at https://ufile.io/rgaxs.
Some additional info about my environment:
== cat /etc/issue ===============================================
Linux 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== check pips ===================================================
numpy (1.12.1)
numpydoc (0.7.0)
protobuf (3.5.0)
tensorflow (1.4.0)
tensorflow-tensorboard (0.4.0rc3)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.4.0
tf.GIT_VERSION = v1.4.0-rc1-11-g130a514
tf.COMPILER_VERSION = v1.4.0-rc1-11-g130a514
keras.VERSION = 2.0.9
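If the growth cannot be avoided inside a single process, one workaround is to run each load/evaluate cycle in a short-lived child process, so that all of its memory is returned to the OS when it exits. A minimal sketch reusing the same model file and data shapes as above (the helper name is mine, and the random validation data stands in for your real set):

import os
from multiprocessing import Process, Queue
import numpy as np

def evaluate_once(queue):
    # Import keras inside the child so the whole TF/Keras state
    # lives and dies with this process.
    import keras
    model = keras.models.load_model(
        os.path.abspath(os.path.join('.', 'mobilenet_flowers.h5')),
        custom_objects={'relu6': keras.applications.mobilenet.relu6,
                        'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D})
    x_valid = np.random.uniform(0, 1, (64, 224, 224, 3)).astype(np.float32)
    y_valid = keras.utils.to_categorical(
        np.random.uniform(0, 5, (64,)).astype(np.int32), 5)
    loss, _ = model.evaluate(x_valid, y_valid, batch_size=64, verbose=False)
    queue.put(loss)

if __name__ == '__main__':
    for it in range(4):
        q = Queue()
        p = Process(target=evaluate_once, args=(q,))
        p.start()
        loss = q.get()   # fetch the result before joining the child
        p.join()
        print('Iteration %d: loss %.4f' % (it, loss))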