The meaning of "n_jobs == 1" in GridSearchCV when using multiple GPUs - scikit-learn

I have been training a NN model using the Keras framework with 4 NVIDIA GPUs (data row count: ~160,000, column count: 5). Now I want to optimize its parameters using GridSearchCV.
However, I encountered several different errors whenever I tried to change n_jobs to any value other than 1, such as:
CUDA OUT OF MEMORY
Can not get device properties error code : 3
Then I read this web page,
"# if you're not using a GPU, you can set n_jobs to something other than 1"
http://queirozf.com/entries/scikit-learn-pipeline-examples
So is it not possible to use multiple GPUs with GridSearchCV?
[Environment]
Ubuntu 16.04
Python 3.6.0
Keras / Scikit-Learn
Thanks!

According to the scikit-learn FAQ, GPU is NOT supported. Link
You can use n_jobs to use your CPU cores. If you want to run at maximum speed, you might want to use almost all of your cores:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
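For example, that value can be passed straight to GridSearchCV; a minimal sketch with a placeholder CPU-only estimator and grid (with a GPU-backed Keras model you would keep n_jobs=1, as discussed above):
import multiprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier  # placeholder CPU-only estimator

n_jobs = multiprocessing.cpu_count() - 1
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}  # placeholder grid

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=n_jobs)
# grid.fit(X, y)  # X, y: your training data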

Related

Putting Huggingface model on GPU with torch.distributed

I'm using Huggingface and I'm putting my model on GPU using the following code:
from transformers import GPTJForCausalLM
import torch
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=False,
    gradient_checkpointing=True
)
model.to("cuda")
I would like to train on 2 GPUs, but running the following command fails:
python -m torch.distributed.launch --nproc_per_node 2 --nnodes=1 train.py
This causes an error because I think it tries to put both instances on the same GPU. When I remove the model.to("cuda") line it works fine (but it is then not running on GPU I guess).
How can I put the model on GPU when using multiple GPUs?
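For reference, the usual pattern with torch.distributed.launch is to move each process's copy of the model onto the GPU matching its local rank instead of the default "cuda" device; a minimal sketch, assuming the launcher exposes LOCAL_RANK as an environment variable (older launcher versions pass --local_rank as a script argument instead), with the model construction left as in the question:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU
torch.cuda.set_device(local_rank)

# model = GPTJForCausalLM.from_pretrained(...)  # as in the question
# model.to(f"cuda:{local_rank}")                # each process gets its own GPU
# model = DDP(model, device_ids=[local_rank])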

Non-deterministic behavior for training a neural network on GPU implemented in PyTorch and with a fixed random seed

I observed strange behavior of the final accuracy when I run exactly the same experiment (the same code for training a neural net for image classification) with the same random seed on different GPUs (machines). I use only one GPU. Precisely: when I run the experiment on machine_1 the accuracy is 86.37, and when I run it on machine_2 the accuracy is 88.0.
There is no variability when I run the experiment multiple times on the same machine. PyTorch and CUDA versions are the same. Could you help me to figure out the reason and fix it?
Machine_1:
NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2
Machine_2:
NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2
To fix the random seed I use the following code:
random.seed(args.seed)
os.environ['PYTHONHASHSEED'] = str(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
This is what I use:
import torch
import os
import numpy as np
import random
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(13)
Make sure you have a single function that sets all the seeds in one place. If you are using Jupyter notebooks, cell execution timing may cause this. Also, the order of the calls inside the function may be important. I never had problems with this code. You may call set_seed() often in your code.
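Also note that even with all seeds fixed, different GPU models can legitimately produce slightly different results because their floating-point kernels differ. If you want PyTorch to at least refuse to run known non-deterministic ops, newer releases (1.8+) offer a stricter switch; a minimal sketch to combine with the set_seed() above (the cuBLAS workspace variable is needed on CUDA 10.2+):
import os
import torch

# Required by cuBLAS for deterministic matmuls on CUDA >= 10.2;
# must be set before the CUDA context is created.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Error out whenever an operation has no deterministic implementation.
torch.use_deterministic_algorithms(True)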

Optimizing SVR() parameters using GridSearchCV

I want to tune the parameters of the "SVR()" regression function. It starts processing and doesn't stop, and I am unable to figure out the problem. I am predicting a parameter using the SVM regression function SVR(). The results are not good with the default values in Python, so I want to try tuning it with "GridSearchCV". The last part, "grids.fit(Xtrain,ytrain)", starts running without giving any error and doesn't stop.
SVR() tuning using GridSearch
Code:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param = {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10],
         'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
modelsvr = SVR()
grids = GridSearchCV(modelsvr, param, cv=5)
grids.fit(Xtrain, ytrain)
It continues to process without stopping.
Yes, you are right. I have come across the same scenario when I tried to run GridSearchCV for SVR(). The possible reasons are: 1) your processor memory (RAM) may be low, 2) the training data sample size is large. With a low-memory processor, GridSearch can take a long time to run, so the job simply keeps running without raising any error.
For your info: I have run GridSearch with a training sample size of 30K using 16 GB of RAM, and it took 210 minutes to finish the run. So patience is a must here.
Happy Analyzing !!
Maybe you should add two more options to your GridSearch (n_jobs and verbose):
grid_search = GridSearchCV(estimator = svr_gs, param_grid = param,
cv = 3, n_jobs = -1, verbose = 2)
verbose means that you see some output about the progress of your process.
n_jobs is the number of cores used (-1 means all cores/threads you have available).

Can we run TensorFlow Lite on Linux? Or is it for Android and iOS only?

Hi, is there any possibility of running TensorFlow Lite on a Linux platform? If yes, then how can we write code in Java/C++/Python to load and run models on Linux? I am familiar with Bazel and have successfully made Android and iOS applications using TensorFlow Lite.
I think the other answers are quite wrong.
Look, I'll tell you my experience... I've been working with Django for many years, and I've been using normal tensorflow, but there was a problem with having 4 or 5 or more models in the same project.
I don't know if you know Gunicorn + Nginx. This setup generates workers, so if you have 4 machine learning models they get multiplied per worker: with 3 workers you will have 12 models preloaded in RAM. This is not efficient at all, because if the RAM overflows your project will crash, or the service responses will simply be slower.
So this is where TensorFlow Lite comes in. Switching from a TensorFlow model to TensorFlow Lite makes things much more efficient. Times are reduced dramatically.
Also, Django and Gunicorn can be configured so that the model is preloaded and compiled ahead of time. Then every time the API is hit, it only generates the prediction, which helps you keep each API call down to a fraction of a second.
Currently I have a project in production with 14 models and 9 workers; you can understand the magnitude of that in terms of RAM.
And even with thousands of extra calculations outside of machine learning, the API call does not take more than 2 seconds.
Now, if I used normal TensorFlow, it would take at least 4 or 5 seconds.
In summary: if you can, use TensorFlow Lite. I use it daily on Windows, macOS, and Linux, and it is not necessary to use Docker at all. Just a Python file and that's it. If you have any doubts you can ask me without any problem.
Here is an example project:
Django + Tensorflow Lite
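To make the preloading idea above concrete, here is a minimal sketch (the module layout, model path, and input handling are placeholder assumptions, using the standard tf.lite interpreter API): the interpreter is built once when the module is imported, so each Gunicorn worker loads it a single time and each request only pays for invoke().
# predictor.py -- imported once per worker (e.g. from a Django view)
import numpy as np
import tensorflow as tf

# Loaded and allocated once at import time, not on every request.
_interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
_interpreter.allocate_tensors()
_input_detail = _interpreter.get_input_details()[0]
_output_detail = _interpreter.get_output_details()[0]

def predict(batch):
    # Per-request work: copy input in, run the graph, copy output out.
    _interpreter.set_tensor(_input_detail['index'], batch.astype(_input_detail['dtype']))
    _interpreter.invoke()
    return _interpreter.get_tensor(_output_detail['index'])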
It's possible to run it (but it will work slower than the original TF).
Example
import numpy as np
import tensorflow as tf

graph_file = "model.tflite"  # placeholder: path to your .tflite model file

# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path=graph_file)
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Get quantization info to know input type
quantization = None
using_type = input_details[0]['dtype']
if using_type is np.uint8:
    quantization = input_details[0]['quantization']
# Get input shape
input_shape = input_details[0]['shape']
# Input tensor
input_data = np.zeros(dtype=using_type, shape=input_shape)
# Set input tensor, run and get output tensor
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
I agree with Nouvellie. It is possible and worth the time to implement. I developed a model on my Ubuntu 18.04, 32-processor server and exported the model to tflite. The model ran in 178 seconds on my Ubuntu server. On my Raspberry Pi 4 with 4 GB of memory, the tflite implementation ran in 85 seconds, less than half the time of my server. When I installed tflite on my server, the run time went down to 22 seconds, an 8-fold increase in performance and now almost 4 times faster than the rpi4.
To install for python, I did not have to build the package but was able to use one of the prebuilt interpreters here:
https://www.tensorflow.org/lite/guide/python
I have Ubuntu 18.04 with python 3.7.7. So I ran pip install with the Linux python 3.7 package:
pip3 install https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-linux_x86_64.whl
Then import the package with:
from tflite_runtime.interpreter import Interpreter
Previous posts show how to use tflite.
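With the pip-installed tflite_runtime, the only change compared to the example above is where Interpreter comes from; a small sketch (the model path is a placeholder):
from tflite_runtime.interpreter import Interpreter

# Same API as tf.lite.Interpreter, without installing full TensorFlow.
interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()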
From Tensorflow lite
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices.
Tensorflow lite is a fork of tensorflow for embedded devices. For PC just use the original tensorflow.
From github tensorflow:
TensorFlow is an open source software library
TensorFlow provides stable Python and C APIs, as well as APIs without backwards compatibility guarantees for C++, Go, Java, JavaScript, and Swift.
We support CPU and GPU packages on Linux, Mac, and Windows.
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2)
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
'Hello, TensorFlow!'
Yes, you can compile Tensorflow Lite to run on Linux platforms even with a Docker container. See the demo: https://sconedocs.github.io/tensorflowlite/

Reduce multiprocessing for statsmodels glm

I am currently doing a proof of concept for one of our business processes that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per the code below). Our data set consists of ~10M rows and around 80 features (where 70+ are dummies, e.g. "1" or "0", based on the defined categorical variables). Using a smaller data set, glm works fine; however, if I run it against the full data set, Python throws a "cannot allocate memory" error.
import statsmodels.api as sm
import statsmodels.formula.api as smf

glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be because statsmodels is designed to make use of all the available CPU cores, and each subprocess underneath creates a copy of the data set in RAM (please correct me if I am mistaken). The question now is whether there is a way for glm to use only a minimal number of cores? I am not after performance; I just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25 GB, as there are other processes running on the same machine)
swap: 16 GB
dataset size: 1.4 GB (based on pandas' DataFrame.info(memory_usage='deep'))
GLM uses multiprocessing only through the linear algebra libraries.
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying BLAS/LAPACK libraries in numpy/scipy also use multiple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations, especially when we want to use parallel processing at a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used. See the mailing list thread:
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX: set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)
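If you are not sure which BLAS backend your numpy/scipy build uses, a blunt approach is to set all the relevant environment variables before numpy (and therefore statsmodels) is imported; a minimal sketch, where only the variable matching your build actually takes effect:
import os

# Must be set before numpy/scipy/statsmodels are imported.
os.environ["OMP_NUM_THREADS"] = "1"         # OpenMP-based BLAS (e.g. OpenBLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"         # Intel MKL
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # Accelerate on macOS

import numpy as np
import statsmodels.api as sm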
This is because statsmodels uses IRLS to estimate the GLM, and the IRLS process uses its WLS regression routine, which in turn uses QR decomposition. The QR decomposition is done directly on X, and your X has 10 million rows and 80 columns, which puts a lot of stress on memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
    pinv_wexog = np.linalg.pinv(self.wexog)
    params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
    Q, R = np.linalg.qr(self.wexog)
    params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
    params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog)
