Memory leak using GridSearchCV

Problem: I appear to have a memory leak when running GridSearchCV. It happens whether I run with 1 or with 32 concurrent workers (n_jobs=-1). Previously I ran this many times with no trouble on Ubuntu 16.04, but I recently upgraded to 18.04 and did a RAM upgrade.
import os
import pickle
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import make_scorer, log_loss
from horsebet import performance

scorer = make_scorer(log_loss, greater_is_better=True)
kfold = StratifiedKFold(n_splits=3)

# import and split data
input_vectors = pickle.load(open(os.path.join('horsebet', 'data', 'x_normalized'), 'rb'))
output_vector = pickle.load(open(os.path.join('horsebet', 'data', 'y'), 'rb')).ravel()
x_train, x_test, y_train, y_test = train_test_split(input_vectors, output_vector, test_size=0.2)

# XGB
model = XGBClassifier()
param = {
    'booster': ['gbtree'],
    'tree_method': ['hist'],
    'objective': ['binary:logistic'],
    'n_estimators': [100, 500],
    'min_child_weight': [.8, 1],
    'gamma': [1, 3],
    'subsample': [0.1, .4, 1.0],
    'colsample_bytree': [1.0],
    'max_depth': [10, 20],
}
jobs = 8
model = GridSearchCV(model, param_grid=param, cv=kfold, scoring=scorer,
                     pre_dispatch=jobs * 2, n_jobs=jobs, verbose=5).fit(x_train, y_train)
Returns:
UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
OR
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

The cause of my issue was that I put n_jobs=-1 in GridSearchCV when it should be placed in the classifier. This solved the issue.

model = XGBClassifier(n_jobs=-1)
and remove the n_jobs argument from the GridSearchCV call.
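For clarity, a minimal sketch of what the corrected setup might look like, reusing the objects defined above (this is an illustration of the fix, not the asker's exact code):
# n_jobs moves onto the classifier; GridSearchCV itself stays single-process
model = XGBClassifier(n_jobs=-1)
model = GridSearchCV(model, param_grid=param, cv=kfold, scoring=scorer,
                     verbose=5).fit(x_train, y_train)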

Though it's not entirely the same issue, I have run into the same error with skopt's gp_minimize() method. Even though the documentation says gp_minimize() supports n_jobs, it started failing on my Mac. When I moved n_jobs to the underlying XGBClassifier it worked fine.
This did not work:
gp_minimize(_minimize, param_space, n_calls=20, n_random_starts=3, random_state=2405)
This worked:
xgb = xgboost.XGBClassifier(
    n_estimators=1000,  # deliberately large n_estimators to make use of early stopping
    objective='binary:logistic',
    n_jobs=-1
)

Related

Non-deterministic behavior when training a neural network on GPU in PyTorch with a fixed random seed

I observed strange behavior of the final accuracy when I run exactly the same experiment (the same code for training a neural net for image classification) with the same random seed on different GPUs (machines). I use only one GPU. Specifically, when I run the experiment on machine_1 the accuracy is 86.37, and when I run it on machine_2 the accuracy is 88.0.
There is no variability when I run the experiment multiple times on the same machine. PyTorch and CUDA versions are the same. Could you help me figure out the reason and fix it?
Machine_1:
NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2
Machine_2:
NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2
To fix random seed I use the following code:
random.seed(args.seed)
os.environ['PYTHONHASHSEED'] = str(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
This is what I use:
import torch
import os
import numpy as np
import random

def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(13)
Make sure you have a single function that sets all the seeds in one place. If you are using Jupyter notebooks, the order and timing of cell execution may cause this. The order of the calls inside the function may also matter. I never had problems with this code. You may call set_seed() as often as needed in your code.

Optimizing SVR() parameters using GridSearchCV

I want to tune the parameters of the SVR() regression function. It starts processing and doesn't stop, and I am unable to figure out the problem. I am predicting a parameter using the SVM regression function SVR(). The results are not good with the default values in Python, so I want to try tuning it with GridSearchCV. The last part, grids.fit(Xtrain, ytrain), starts running without giving any error and doesn't stop.
SVR() tuning using GridSearch
Code:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param = {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10],
         'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
modelsvr = SVR()
grids = GridSearchCV(modelsvr, param, cv=5)
grids.fit(Xtrain, ytrain)
It continues to process without stopping.
Yes, you are right. I have come across the same scenario when trying to run GridSearchCV for SVR(). The likely reasons are: 1) your machine has limited RAM, and 2) the training sample size is large; with limited memory the grid search simply takes a very long time to run, without raising any error.
For your info: I have run GridSearchCV with a training sample size of 30K using 16GB of RAM, and it took 210 minutes to finish. So patience is a must here.
Happy Analyzing!!
Maybe you should add two more options to your GridSearchCV call (n_jobs and verbose):
grid_search = GridSearchCV(estimator=svr_gs, param_grid=param,
                           cv=3, n_jobs=-1, verbose=2)
verbose means you see some output about the progress of the search.
n_jobs is the number of cores used (-1 means all cores/threads you have available).
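For a self-contained illustration, here is a hedged sketch of that suggestion with a trimmed grid (the grid values are illustrative, and Xtrain/ytrain stand for the asker's data):
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Illustrative, smaller grid: fewer candidates keeps the search tractable
param = {
    'kernel': ['rbf', 'linear'],
    'C': [1, 5, 10],
    'gamma': ['scale', 'auto'],
}
grids = GridSearchCV(SVR(), param_grid=param, cv=3, n_jobs=-1, verbose=2)
grids.fit(Xtrain, ytrain)  # Xtrain/ytrain as defined by the asker
print(grids.best_params_)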

OOM with a "simple" ResNet50 using TensorFlow 2.0 on an Nvidia RTX 2080 Ti

I'm surprised to face an out-of-memory error using the tf.keras.applications.ResNet50 implementation on an Nvidia RTX 2080 Ti (with 11GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory gets filled up from 700MB to 10850MB in a fraction of a second.
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
ResNet being a complex model, the dimensions of the input might be the reason for the OOM error. Try reducing the input dimensions and the corresponding batch size (to whatever the memory can hold).
As mentioned in the comments, it worked with batch size 1 and dimensions 700*512.
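As a rough sketch of that suggestion applied to the snippet above (the 704x512 size is an assumption approximating the 700*512 mentioned; everything else mirrors the original code):
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np

H, W = 512, 704  # assumed reduced resolution, close to the 700*512 reported to work

inputs = ke.layers.Input(shape=(H, W, 3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(H, W)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())

# batch size 1, as reported to work
images = np.zeros((1, H, W, 3), dtype=np.float32)
targets = np.zeros((1, H, W, 2), dtype=np.float32)
model.train_on_batch(images, targets)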

Does partial_fit run in parallel in sklearn.decomposition.IncrementalPCA?

I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it uses all CPU cores at maximum. I could find neither an n_jobs parameter nor anything else related to multiprocessing. My question is: if this is the default behavior of these functions, how can I set the number of CPUs, and where can I find information about it? If not, I am obviously doing something wrong in earlier sections of my code.
PS: I need to limit the number of CPU cores because using all cores on a shared server causes a lot of trouble for other people.
Additional information and debug code:
It has been a while and I still couldn't figure out the reason for this behavior or how to limit the number of CPU cores used at a time, so I've decided to provide sample code to test it. Note that this snippet is taken from sklearn's website; the only change is increasing the size of the dataset, so one can easily see the behavior.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
import numpy as np
X, _ = load_digits(return_X_y=True)
# Copy-paste and increase the size of the dataset to see the behavior in htop.
for _ in range(8):
    X = np.vstack((X, X))
print(X.shape)
transformer = IncrementalPCA(n_components=7, batch_size=200)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
print(X_transformed.shape)
And the output is:
(460032, 64)
(460032, 7)
Process finished with exit code 0
And htop shows all CPU cores in use (screenshot omitted).
TL;DR: I solved the issue by setting the BLAS environment variables before importing numpy (or any library that imports numpy) with the code below. Detailed information can be found here.
Long story:
I was looking for a workaround to this problem in another post of mine and figured out that this is not a fault of the scikit-learn implementation, but rather of the BLAS library (specifically OpenBLAS) used by numpy, which sklearn's IncrementalPCA relies on. OpenBLAS is set to use all available threads by default. Detailed information can be found here.
import os
# Must be set before numpy (or anything that imports numpy) is imported,
# and the values must be strings.
os.environ["OMP_NUM_THREADS"] = "1"         # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1"         # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1"     # export NUMEXPR_NUM_THREADS=1
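As an aside, if the environment variables cannot be set early enough, a possible alternative (assuming the threadpoolctl package is installed) is to limit the BLAS thread pool at runtime. A minimal sketch:
from threadpoolctl import threadpool_limits
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)

# Limit BLAS (OpenBLAS/MKL) to one thread only inside this block;
# code outside the context manager keeps the default thread count.
with threadpool_limits(limits=1, user_api='blas'):
    transformer = IncrementalPCA(n_components=7, batch_size=200)
    X_transformed = transformer.fit_transform(X)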

MXNet CPU memory leak when running inference on a model

I'm running into a memory leak when performing inference on an MXNet model (i.e. converting an image buffer to a tensor and running one forward pass through the model).
A minimal reproducible example is below:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd

model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()

for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc. to obtain
    imgbuf =
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
The result is a linear increase of RSS memory (from 700MB up to 10GB+).
The problem persists with other pretrained models and with a custom model that I am trying to use. Inspecting with the garbage collector does not show any increase in the number of objects.
This gist has the full code snippet including an example imgbuf.
Environment info:
python 2.7.15
gcc 4.2.1
mxnet-mkl 1.3.1
gluoncv 0.3.0
MXNet runs an asynchronous engine to maximize parallelism and parallel execution of operators. That means every call that enqueues an operation or copies data returns eagerly, and the operation is queued on the MXNet backend. Effectively, by running the loop as you have written it, you are enqueueing operations faster than you are processing them.
You can add an explicit synchronization point, for example .asnumpy(), mx.nd.waitall() or .wait_to_read(); that way MXNet will wait for the enqueued operations to be completed before continuing the Python execution.
This will solve your issue:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd

model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()

for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc. to obtain
    imgbuf =
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
    # synchronization point: block until all queued work has finished
    mxnet.nd.waitall()
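For reference, a sketch of the alternative synchronization point mentioned above: calling .asnumpy() on the outputs inside the loop blocks until that iteration's results are actually computed (this fragment assumes the same loop body as the snippet above).
    labels, confidences, bboxs = model.forward(tensor)
    # .asnumpy() forces the async engine to finish this iteration's work
    # before the next image is enqueued
    labels_np = labels.asnumpy()
    confidences_np = confidences.asnumpy()
    bboxs_np = bboxs.asnumpy()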
Read more about MXNet asynchronous execution here: http://d2l.ai/chapter_computational-performance/async-computation.html
