I'm running into a memory leak when performing inference on an mxnet model (i.e. converting an image buffer to tensor and running one forward pass through the model).
A minimal reproducable example is below:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd
model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()
for _ in range(100000):
# note: an example imgbuf string is too long to post
# see gist or use requests etc to obtain
imgbuf =
ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
tensor, orig = ssd.transform_test(ndarray, 512)
labels, confidences, bboxs = model.forward(tensor)
The result is a linear increase of RSS memory (from 700MB up to 10GB+).
The problem persists with other pretrained models and with a custom model that I am trying to use. And using garbage collectors does not show any increase in objects.
This gist has the full code snippet including an example imgbuf.
Environment info:
python 2.7.15
gcc 4.2.1
mxnet-mkl 1.3.1
gluoncv 0.3.0
MXNet is running a asynchronous engine to maximize parallelism and parallel executions of operators, that means that every call to enqueue operation / copy data returns eagerly and the operation is enqueued on the MXNet backend. Effectively by running the loop as you have written it, you are enqueueing operations faster than you are processing them.
You can add an explicit synchronization point, for example .asnumpy() or .mx.nd.waitall() or .wait_to_read(), that way MXNet will wait for the enqueued operations to be completed before continuing the python execution.
This will solve your issue:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd
model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()
for _ in range(100000):
# note: an example imgbuf string is too long to post
# see gist or use requests etc to obtain
imgbuf =
ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
tensor, orig = ssd.transform_test(ndarray, 512)
labels, confidences, bboxs = model.forward(tensor)
mx.nd.waitall()
Read more about MXNet asynchronous execution here: http://d2l.ai/chapter_computational-performance/async-computation.html
Related
I have a function that uses values stored in one array to operate on another array. This behaves similar to the numpy.hist function. For example:
import numpy as np
from numba import jit
#jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
for row in range(x.size):
col = int(x[row] * 10)
output_counts[col] += 1
output_weights[col] += y[row]
return (output_counts, output_weights)
# in the current code these arrays exists ad pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break of x so each value gets sent to different threads (similar to how apply_async does within multiprocessing)?
I'm open to other options.
In it's current form the function is fast, but the GPU-to-CPU data transfer is killing me.
Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU regarding the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consist in building local histograms in each separate kernel blocks and then perform a reduction. However, this solution is not well suited in your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solutions only scale well if there is no atomic conflicts which is dependent of the actual input data. Indeed, if all value of x are equal, then all updates will be serialized which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent on small histograms but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targetting Nvidia GPUs). Here is an example of kernel solving your problem:
#cuda.jit
def array_func(x, y, output_counts, output_weights):
tx = cuda.threadIdx.x # Thread id in a 1D block
ty = cuda.blockIdx.x # Block id in a 1D grid
bw = cuda.blockDim.x # Block width, i.e. number of threads per block
pos = tx + ty * bw # Compute flattened index inside the array
if pos < x.size:
col = int(x[pos] * 10)
cuda.atomic.add(output_counts, col, 1)
cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can possibly be directly created on the GPU so to avoid transfers. x and y should be on the GPU for better performance (otherwise a CPU reduction will be certainly faster). Also note that the kernel should be pretty fast so the overhead to run/wait it and allocate/free temporary array may be significant and even possibly slower than the kernel itself (but certainly faster than doing a double transfer from/to the CPU so to compute things on the CPU assuming data was on the GPU). Note also that such atomic accesses are only fast on quite recent Nvidia GPU that benefit from specific computing units for atomic operations.
First, I want to mention, that this is our first project in a bigger scale and therefore we don't know everything but we learn fast.
We developed a code for image recognition. We tried it with a raspberry pi 4b but quickly faced that this is way to slow overall. Currently we are using a NVIDIA Jetson Nano. The first recognition was ok (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model will be loaded for the first time. Via an API the image recognition can be triggered and the meta data from the AI model will be the response. We use fast-API for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the method above before starting my thread. Indeed this works but another challenge occurs at the same time. After setting the start method to 'spawn' the ERROR disappears but the Jetson starts to allocate way to much memory.
Because of the overhead and preloaded CNN model, the RAM is around 2.5Gig before the thread starts. After the start it doesn’t stop allocating RAM, it consumes all 4Gig of the RAM and also the whole 6Gig Swap. Right after this, the whole API process kill with this error: "cannot allocate memory" which is obvious.
I managed to fix that as well just by loading the CNN Model in the classification function. (Not preloading it on the GPU as in the two cases before). However, here I got problem as well. The process of loading the model to the GPU takes around 15s - 20s and this every time the recognition starts. This is not suitable for us and we are wondering why we cannot pre-load the model without killing the whole thing after two image-recognitions. Our goal is to be under 5 sec with this.
#clasify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function than no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
#do some classification with the net
pass
if __name__ == '__main__':
mp.set_start_method('spawn') #if commented out the first error ocours
manager = mp.Manager()
return_dict = manager.dict()
p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
p.start()
p.join()
print(return_dict.values())
Any help here will be much appreciated. Thank you.
I'm surprised to face an Out-of-Memory error using tf.keras.applications.ResNet50 implementation on an Nvidia RTX2080Ti (with 11Gb of memory !).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory get filled-up from 700Mb to 10850Mb in a fraction of a second.
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
Resnet being the complex complex model, the dimensions of the input might be the reason for OOM error. Try reducing the dimensions and the corresponding batch size(as much as the memory can hold) and try.
As mentioned in comments it worked with batch size 1 and with dimensions 700*512.
I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it uses all CPU cores at maximum. I could find neither n_jobs parameter nor anything related to multiprocessing. My question is: if this is default behavior of these functions how can I set the number of CPU's and where can I find information about it? If not, obviously I am doing something wrong in previous sections of my code.
PS: I need to limit the number of CPU cores because using all cores in a server causing a lot of trouble with other people.
Additional information and debug code:
So, it has been a while and I still couldn't figure out the reason for this behavior or how to limit the number of CPU cores used at a time. I've decided to provide a sample code to test it. Note that, this code snippet is taken from the sklearn's website. The only difference is made to increase the size of the dataset, so one can easily see the behavior.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
import numpy as np
X, _ = load_digits(return_X_y=True)
#Copy-paste and increase the size of the dataset to see the behavior at htop.
for _ in range(8):
X = np.vstack((X, X))
print(X.shape)
transformer = IncrementalPCA(n_components=7, batch_size=200)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
print(X_transformed.shape)
And the output is:
(460032, 64)
(460032, 7)
Process finished with exit code 0
And the htop shows:
TL:DR Solved the issue by setting BLAS environmental variables before importing numpy or any library that imports numpy with the code below. Detailed information can be found here.
Long story:
I was looking for a workaround to this problem in another post of mine and I figured out this is not because of scikit-learn implementation fault but rather due to BLAS library (specifically OpenBLAS) used by numpy library, which is used in sklearn's IncrementalPCA function. OpenBLAS is set to use all available threads by default. Detailed information can be found here.
import os
os.environ["OMP_NUM_THREADS"] = 1 # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = 1 # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = 1 # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = 1 # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = 1 # export NUMEXPR_NUM_THREADS=1
I am currently doing proof of concept for one of our business process that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per below code). Our data set consists of ~10M rows and around 80 features (where almost 70+ are dummies e.g. "1" or "0" based on the defined categorical variables). Using smaller data set, glm works fine, however if i run it against the full data set, python is throwing an error "cannot allocate memory".
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be due to statsmodels is designed to make use of all the available cpu cores and each subprocess underneath creates a copy of the data set into RAM (please correct me if I am mistaken). Question now would be if there is a way for glm to just make use of minimal number of cores? I am not into performance but just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25GB as there are other processes running on the
same machine)
swap: 16 GB
dataset size: 1.4 GB (based on Panda's DataFrame.info(memory_usage='deep')
GLM uses multiprocessing only through the linear algbra libraries
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying Blas/Lapack libraries in numpy/scipy also use mutliple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations especially when we want to use parallel processing on a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used. see mailing list thread
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)
This is because Statsmodels use IRLS in estimating GLM and the IRLS process utilize its WLS regression routine which again uses QR decomposition. The QR decomposition is directly done on the X and your X has 10million rows, 80 columns which turns out putting a lot of stress on the memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
pinv_wexog = np.linalg.pinv(self.wexog)
params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
Q, R = np.linalg.qr(self.wexog)
params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,