Does partial_fit run in parallel in sklearn.decomposition.IncrementalPCA?

I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it uses all CPU cores at maximum. I could find neither an n_jobs parameter nor anything related to multiprocessing. My question is: if this is the default behavior of these functions, how can I set the number of CPUs, and where can I find information about it? If not, I am obviously doing something wrong in an earlier section of my code.
PS: I need to limit the number of CPU cores because using all cores on a shared server causes a lot of trouble for other people.
Additional information and debug code:
So, it has been a while and I still couldn't figure out the reason for this behavior or how to limit the number of CPU cores used at a time, so I've decided to provide sample code to test it. Note that this code snippet is taken from scikit-learn's website; the only change is to increase the size of the dataset, so the behavior can easily be seen in htop.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
import numpy as np
X, _ = load_digits(return_X_y=True)
#Copy-paste and increase the size of the dataset to see the behavior at htop.
for _ in range(8):
    X = np.vstack((X, X))
print(X.shape)
transformer = IncrementalPCA(n_components=7, batch_size=200)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
print(X_transformed.shape)
And the output is:
(460032, 64)
(460032, 7)
Process finished with exit code 0
And the htop screenshot (omitted here) shows all CPU cores at maximum utilization.
TL;DR: Solved the issue by setting BLAS environment variables before importing numpy or any library that imports numpy, with the code below. Detailed information can be found here.
Long story:
I was looking for a workaround to this problem in another post of mine, and I figured out that this is not caused by a fault in the scikit-learn implementation but rather by the BLAS library (specifically OpenBLAS) used by numpy, which sklearn's IncrementalPCA relies on. OpenBLAS is set to use all available threads by default. Detailed information can be found here.
import os
# These must be set before numpy (or anything that imports numpy) is imported.
os.environ["OMP_NUM_THREADS"] = "1"          # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1"     # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1"          # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"   # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1"      # export NUMEXPR_NUM_THREADS=1

Related

Matplotlib savefig() slow ...just the way things are or ideas to speed up?

I've got some code where I'm using MPL (not pyplot) via imshow() to show some arrays and then am using savefig() to save them as PNG files.
The arrays are approx 3,000 x 4,000 in size.
My problem is that saving is taking a long time - on the order of 4 seconds or so per image.
Minor Details
The arrays are floats
I'm using cmap of gray
I'm making sure the figure resolution is the same as the images, and the axes fill the entire figure (so fig size * dpi matches the shape of the arrays exactly)
I'm using imshow() with interpolation='none'.
Running on a MacBook Pro, but running on anything else is about the same (assuming SSD)
The slowness seems to be a CPU bottleneck. Wrapping my code with time shows real and user time to be about the same, so it doesn't seem to be an IO bottleneck.
However (very curiously!), if I run the code via multiprocessing in multiple processes, it doesn't seem to help much with overall real time (even with 4 cores).
Questions
Is saving to PNG taking around 4 seconds 'normal'?
Any tips or ideas on how to speed things up?
I've never tried it, but I think you could try running the code on the GPU (which may be better suited to this kind of processing) if you have an Nvidia graphics card.
https://documen.tician.de/pycuda/
Other than that, I don't think you can speed up the process much more.
From the details of what you are doing, it sounds like you just want to save the array as a (false-)color image. You are very carefully setting up Matplotlib to do that for you, but Matplotlib does not have the logic to notice that it can take any shortcuts, so it still goes through all of the resampling logic.
You can generate an equivalent output more simply via:
import matplotlib.colors as mcolors
import matplotlib.cm as mcm
import numpy as np
import PIL.Image
import time
# "data"
my_data = np.random.randn(3000, 4000) * 50
start_time = time.monotonic()
# to scale the data to [0, 1]
my_norm = mcolors.Normalize(-50, 50)
# to map the scaled data to gray scale RGB
my_cmap = mcm.get_cmap('gray')
setup_time = time.monotonic()
# apply the above transforms
color_mapped = my_cmap(my_norm(my_data), bytes=True)
mapping_time = time.monotonic()
# use pillow to save the png
PIL.Image.fromarray(color_mapped).save('/tmp/so.png', compress_level=1)
end_time = time.monotonic()
print(f"saving took {end_time - start_time}")
print(f" setup took {setup_time - start_time}")
print(f" mapping took {mapping_time - setup_time}")
print(f" saving took {end_time - mapping_time}")
but the majority of the time is still spent in .save(...). By playing with compress_level you can make that line more or less expensive, which suggests the cost is inside libpng while compressing the data (see https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#png).
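To see that trade-off directly, here is a small, self-contained sketch (the array values and the chosen levels are arbitrary) that times Pillow's PNG save at a few compression levels:

import time
import numpy as np
import PIL.Image

# Arbitrary RGBA image roughly the size discussed in the question.
rgba = (np.random.rand(3000, 4000, 4) * 255).astype(np.uint8)
img = PIL.Image.fromarray(rgba)

for level in (0, 1, 6, 9):
    t0 = time.monotonic()
    img.save('/tmp/so_compress_test.png', compress_level=level)
    print(f"compress_level={level}: {time.monotonic() - t0:.2f}s")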

OOM with a "simple" ResNet50 using TensorFlow 2.0 on an Nvidia RTX 2080 Ti

I'm surprised to face an out-of-memory error using the tf.keras.applications.ResNet50 implementation on an Nvidia RTX 2080 Ti (with 11 GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory gets filled up from 700 MB to 10850 MB in a fraction of a second.
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
ResNet being a complex model, the input dimensions might be the reason for the OOM error. Try reducing the dimensions and the corresponding batch size (as much as the memory can hold) and try again.
As mentioned in the comments, it worked with batch size 1 and with dimensions 700x512.
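For reference, since the notes mention memory_growth: in TF 2.x it is typically enabled as in the sketch below. Note that this only makes TensorFlow allocate GPU memory on demand instead of reserving it all up front; it does not reduce the memory the model actually needs, so it may not prevent this particular OOM.

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole card up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)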

MXNet CPU memory leak when running inference on a model

I'm running into a memory leak when performing inference on an mxnet model (i.e. converting an image buffer to tensor and running one forward pass through the model).
A minimal reproducible example is below:
import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd
model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()
for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc to obtain
    imgbuf =
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
The result is a linear increase of RSS memory (from 700MB up to 10GB+).
The problem persists with other pretrained models and with a custom model that I am trying to use, and the garbage collector does not show any increase in objects.
This gist has the full code snippet including an example imgbuf.
Environment info:
python 2.7.15
gcc 4.2.1
mxnet-mkl 1.3.1
gluoncv 0.3.0
MXNet runs an asynchronous engine to maximize parallelism and parallel execution of operators. That means every call that enqueues an operation or copies data returns eagerly, and the operation is enqueued on the MXNet backend. Effectively, by running the loop as you have written it, you are enqueueing operations faster than you are processing them.
You can add an explicit synchronization point, for example .asnumpy(), mx.nd.waitall() or .wait_to_read(); that way MXNet will wait for the enqueued operations to be completed before continuing the Python execution.
This will solve your issue:
import mxnet as mx
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd
model = model_zoo.get_model('ssd_512_resnet50_v1_coco')
model.initialize()
for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc to obtain
    imgbuf =
    ndarray = mx.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model.forward(tensor)
    mx.nd.waitall()
Read more about MXNet asynchronous execution here: http://d2l.ai/chapter_computational-performance/async-computation.html
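As a separate, self-contained illustration of why a synchronization point matters (a toy example, not the model code above): every operation below is merely enqueued, and only the explicit wait forces the backend to finish the work.

import mxnet as mx

a = mx.nd.ones((1000, 1000))
for _ in range(10):
    b = mx.nd.dot(a, a)   # enqueued on the async engine; the call returns immediately
mx.nd.waitall()           # blocks until every enqueued operation has completed
print(b.shape)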

Reduce multiprocessing for statsmodels glm

I am currently doing a proof of concept for one of our business processes that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per the code below). Our data set consists of ~10M rows and around 80 features (where 70+ are dummies, e.g. "1" or "0", based on the defined categorical variables). Using a smaller data set, glm works fine; however, if I run it against the full data set, Python throws a "cannot allocate memory" error.
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be because statsmodels is designed to make use of all the available CPU cores, and each subprocess underneath creates a copy of the data set in RAM (please correct me if I am mistaken). The question now is whether there is a way for glm to use only a minimal number of cores. I am not after performance; I just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25 GB, as there are other processes running on the same machine)
swap: 16 GB
dataset size: 1.4 GB (based on Pandas' DataFrame.info(memory_usage='deep'))
GLM uses multiprocessing only through the linear algebra libraries
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying BLAS/LAPACK libraries in numpy/scipy also use multiple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations, especially when we want to use parallel processing on a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used; see the mailing list thread:
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)
This is because statsmodels uses IRLS to estimate the GLM, and the IRLS process relies on its WLS regression routine, which in turn uses QR decomposition. The QR decomposition is done directly on X, and your X has 10 million rows and 80 columns, which puts a lot of stress on memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
    pinv_wexog = np.linalg.pinv(self.wexog)
    params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
    Q, R = np.linalg.qr(self.wexog)
    params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
    params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,
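As a rough back-of-the-envelope illustration of that memory pressure (assuming float64 and the ~10M x 80 design matrix described in the question), the design matrix alone is several gigabytes, and the reduced QR produces a Q of the same shape before counting any intermediate copies:

rows, cols, bytes_per_float64 = 10_000_000, 80, 8

wexog_gb = rows * cols * bytes_per_float64 / 1e9  # the design matrix itself
q_gb = wexog_gb                                   # reduced QR's Q has the same shape as X

print(f"wexog is about {wexog_gb:.1f} GB")        # about 6.4 GB
print(f"Q is another   {q_gb:.1f} GB")            # before any intermediates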

matplotlib.pyplot.hist() hangs if the number of bins is too large?

I am plotting histograms and I found this on stack exchange which works great:
histogram for discrete values
Here is the code posted there:
import matplotlib.pyplot as plt
import numpy as np
data = range(11)
data = np.array(data)
d = np.diff(np.unique(data)).min()
left_of_first_bin = data.min() - float(d)/2
right_of_last_bin = data.max() + float(d)/2
plt.hist(data, np.arange(left_of_first_bin, right_of_last_bin + d, d))
plt.show()
I am using it in a case where d = 2.84e-5; the output of np.arange() above is then 68704 elements long. If I run this from the Python interpreter (Python 3.5) on Ubuntu 14.04 from an anaconda environment, the system hangs and I cannot recover without Ctrl-C, which kills the interpreter. I am wondering if there is a limit on the number of bins in plt.hist(), or if there is something inherently wrong with this approach. If it were a limitation, I would expect an error rather than a hang. The code works fine if d is not too small. The length of my data (22289) might be affecting this as well. I guess it could just be churning and I am not waiting long enough?
I searched for matplotlib.pyplot.hist limitations and other variations and could not find anything. The documentation, from what I can tell, does not mention a limit. Thank you.
It looks like there is not a real hang; it just takes a very long time because the data is large and the bin widths are so small. I noted that with d=.001, it took about 30 seconds on my machine to render the plot. Sorry for the trouble; I thought I had found a potential bug and, as a newbie, got excited.
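If the many tiny bins are really needed, one possible workaround (a sketch assuming matplotlib >= 3.4 for plt.stairs, and using random stand-in data) is to bin with numpy first, so matplotlib only has to draw a single step artist instead of tens of thousands of rectangle patches:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.rand(22289)   # stand-in for the real data from the question
d = 2.84e-5                    # the small bin width mentioned above
edges = np.arange(data.min() - d / 2, data.max() + d / 2 + d, d)

counts, edges = np.histogram(data, bins=edges)  # fast binning in numpy
plt.stairs(counts, edges)                       # one artist instead of ~68k bars
plt.show()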
