GridSearchCV : n_jobs in parallel (internals) - scikit-learn

How does GridSearchCV with n_jobs set to a value > 1 actually work? Does it create multiple instances of the classifier, one per computation node, or does it create a single classifier that is shared by all the nodes? The reason I am asking is that I am using vowpal_wabbit's Python wrapper: https://github.com/josephreisinger/vowpal_porpoise/blob/master/vowpal_porpoise/vw.py and see that it opens a subprocess (with stdin, stdout, stderr etc.). However, when I use this from GridSearchCV with n_jobs > 1, I get a broken pipe error after some time and am trying to understand why.

n_jobs > 1 will make GridSearchCV use Python's multiprocessing module under the hood. That means the original estimator instance will be copied (pickled) to be sent over to the worker Python processes. All scikit-learn models must be picklable. If the vowpal_porpoise wrapper opens pipes to a vw subprocess in the constructor, it has to close and reopen them around the pickling / unpickling steps by defining custom __getstate__ and __setstate__ methods. Have a look at the Python documentation on pickling for more details.
The subprocess should probably also be closed and reopened on calls to the set_params method, which updates the model with new parameter values.
It would be easier not to open the subprocess in the constructor at all, but to open it on demand in the fit and predict methods and close it each time.
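As a rough illustration of that pattern (a minimal sketch with made-up class and attribute names, not the actual vowpal_porpoise code), a wrapper that owns a subprocess can drop the handle before pickling and reopen it lazily afterwards:

import subprocess

class VWWrapper:
    """Toy wrapper that owns a vw subprocess and stays picklable."""

    def __init__(self, vw_args=("vw", "--quiet")):
        self.vw_args = list(vw_args)
        self._proc = None  # opened lazily, never pickled

    def _open(self):
        # Open the subprocess on demand (e.g. at the start of fit/predict).
        if self._proc is None:
            self._proc = subprocess.Popen(
                self.vw_args,
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
            )

    def _close(self):
        # Close the pipe and wait for the subprocess to exit.
        if self._proc is not None:
            self._proc.stdin.close()
            self._proc.wait()
            self._proc = None

    def __getstate__(self):
        # Drop the unpicklable subprocess handle before pickling.
        state = self.__dict__.copy()
        state["_proc"] = None
        return state

    def __setstate__(self, state):
        # After unpickling in a worker process, the subprocess is
        # reopened by the next call to _open().
        self.__dict__.update(state)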

One of the questions in the comments was:
Which is better: n_jobs=-1, or n_jobs set to a big number like 32?
That depends on what you mean by better. I would say it depends on the hardware you currently have available and on how much of it you want to give to the algorithm.
The documentation says that n_jobs=-1 uses all processors. So if your hardware actually supports 32 threads, GridSearchCV() will use all 32. If you decrease the value further (n_jobs=-2, n_jobs=-3 and so on), you get all processors minus one, minus two, and so on: for example, when 8 jobs would be possible, n_jobs=-2 instantiates 7 jobs.
But it is also a little more complicated than that: the number of jobs specified with n_jobs in GridSearchCV() does not have to be identical to the number of threads Python actually uses, because other parts of the system may occupy processors as well.
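For reference, here is a small sketch of the n_jobs parameter in use (the estimator, grid and dataset are arbitrary choices, not taken from the question):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}

# n_jobs=-1 uses all processors; n_jobs=-2 uses all but one, and so on.
# Each worker process receives a pickled copy of the SVC estimator.
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)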

Related

Using multiprocessing with AllenNLP decoding is sluggish compared to non-multiprocessing case

I'm using the AllenNLP (version 2.6) semantic role labeling model to process a large pile of sentences. My Python version is 3.7.9. I'm on MacOS 11.6.1. My goal is to use multiprocessing.Pool to parallelize the work, but the calls via the pool are taking longer than they do in the parent process, sometimes substantially so.
In the parent process, I have explicitly placed the model in shared memory as follows:
from allennlp.predictors import Predictor
from allennlp.models.archival import load_archive
import allennlp_models.structured_prediction.predictors.srl
PREDICTOR_PATH = "...<srl model path>..."
archive = load_archive(PREDICTOR_PATH)
archive.model.share_memory()
PREDICTOR = Predictor.from_archive(archive)
I know the model is only being loaded once, in the parent process. And I place the model in shared memory whether or not I'm going to make use of the pool. I'm using torch.multiprocessing, as many recommend, and I'm using the spawn start method.
I'm calling the predictor in the pool using Pool.apply_async, and I'm timing the calls within the child processes. I know that the pool is using the available CPUs (I have six cores), and I'm nowhere near running out of physical memory, so there's no reason for the child processes to be swapped to disk.
Here's what happens, for a batch of 395 sentences:
Without multiprocessing: 638 total processing seconds (and elapsed time).
With a 4-process pool: 293 seconds elapsed time, 915 total processing seconds.
With a 12-process pool: 263 seconds elapsed time, 2024 total processing seconds.
The more processes, the worse the total AllenNLP processing time - even though the model is explicitly in shared memory, and the only thing that crosses the process boundary during the invocation is the input text and the output JSON.
I've done some profiling, and the first thing that leaps out at me is that the function torch._C._nn.linear is taking significantly longer in the multiprocessing cases. This function takes two tensors as arguments - but there are no tensors being passed across the process boundary, and I'm decoding, not training, so the model should be entirely read-only. It seems like it has to be a problem with locking or competition for the shared model resource, but I don't understand at all why that would be the case. And I'm not a torch programmer, so my understanding of what's happening is limited.
Any pointers or suggestions would be appreciated.
Turns out that I wasn't comparing exactly the right things. This thread goes into all the detail: https://github.com/allenai/allennlp/discussions/5471. Briefly, because PyTorch can use additional resources under the hood, my baseline test without multiprocessing wasn't taxing my computer enough; running two instances of it in parallel showed no penalty, and I had to run 4 instances to see one. In that case, the total processing time was essentially the same for 4 parallel non-multiprocessing invocations as for one multiprocessing run with 4 subprocesses.
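On a related note, and as an assumption on my part rather than something taken from the linked thread: a common way to keep PyTorch from oversubscribing the CPU inside a process pool is to cap the intra-op thread count in each worker. A minimal sketch:

import torch
import torch.multiprocessing as mp

def init_worker():
    # Each worker would otherwise spawn its own intra-op thread pool,
    # so the processes end up competing for the same physical cores.
    torch.set_num_threads(1)

def predict(sentence):
    # Placeholder for the real PREDICTOR.predict(sentence=...) call.
    return len(sentence)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(predict, ["a first sentence", "a second sentence"]))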

Pytorch Lightning duplicates main script in ddp mode

When I launch my main script on the cluster with ddp mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training logic which I would like to handle myself, e.g. do something (once!) after Trainer.fit(). But with the duplication of the main script, this doesn't work as I intend. I also tried to wrap it in if __name__ == "__main__", but that doesn't change the behavior. How could one solve this problem? Or, how can I use some logic around my Trainer object without the duplicates?
I have since moved on to using native DDP with multiprocessing in PyTorch. As far as I understand, PyTorch Lightning (PTL) is just running your main script multiple times on multiple GPUs. This is fine if you only want to fit your model in one call of your script. However, a huge drawback in my opinion is the lost flexibility during the training process. The only way of interacting with your experiment is through these (poorly documented) callbacks. Honestly, it is much more flexible and convenient to use native multiprocessing in PyTorch. In the end it was much faster and easier to implement, and you don't have to search for ages through the PTL documentation to achieve simple things.
I think PTL is going in a good direction by removing much of the boilerplate; however, in my opinion, the Trainer concept needs some serious rework. It is too closed and violates PTL's own principle of "reorganize PyTorch code, keep native PyTorch code".
If you want easy multi-GPU training, I would personally suggest refraining from PTL; for me it was a waste of time, and it was better to learn native PyTorch multiprocessing.
Asked this at the GitHub repo: https://github.com/PyTorchLightning/pytorch-lightning/issues/8563
There are different accelerators for training, and while DDP (DistributedDataParallel) runs the script once per GPU, ddp_spawn and dp don't.
However, certain plugins like DeepSpeedPlugin are built on DDP, so changing the accelerator doesn't stop the main script from running multiple times.
You could quit the duplicated sub-processes by putting the following code after Trainer.fit:
import sys

if model.global_rank != 0:
    sys.exit(0)
where model inherits from LightningModule, which has a property global_rank specifying the rank of the machine. It can roughly be understood as the GPU id or the process id. Everything after this code will only be executed in the main process, i.e. the process with global_rank = 0.
For more information, please refer to the documentation: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#global_rank
Use an environment variable as a flag:
import os

if __name__ == "__main__":
    is_primary = os.environ.get("IS_PTL_PRIMARY") is None
    os.environ["IS_PTL_PRIMARY"] = "yes"
    ## code to run on each GPU
    if is_primary:
        ## code to run only once
        pass
From the PyTorch Lightning official documentation on DDP, we know that PL intentionally calls the main script multiple times to spin off the child processes that take charge of the GPUs.
It uses the environment variables "LOCAL_RANK" and "NODE_RANK" to denote the GPUs, so we can add conditions to bypass the code blocks that we don't want executed repeatedly. For example:
import os

if __name__ == "__main__":
    if 'LOCAL_RANK' not in os.environ and 'NODE_RANK' not in os.environ:
        # code you only want to run once
        pass

Confusion about multiprocessing and workers in Keras fit_generator() with windows 10 in spyder

In the documentation for fit_generator() (docs: https://keras.io/models/sequential/#fit_generator) it says that the parameter use_multiprocessing accepts a bool that if set to True allows process-based threading.
It also says that the parameter workers is an integer that designates how many processes to spin up if using process-based threading. Apparently it defaults to 1 (a single process-based thread), and if set to 0 it will execute the generator on the main thread.
What I thought this meant was that if use_multiprocessing=True and workers > 0 (let's use 6 for an example) that it would spin up 6 processes running the generator independently. However, when I test this I think I must be misunderstanding something (see below).
My confusion arises from the fact that if I set use_multiprocessing to False and workers = 1 then in my task manager I can see that all 12 of my virtual cores are being utilized somewhat evenly and I am at about 50% CPU usage while training my model (for reference, I have an i7-8750H CPU with 6 cores that support virtualization and I have virtualization enabled in BIOS). If I increase the number of workers, the CPU usage goes to 100% and training is much faster. If I decrease the number of workers to 0 so that it runs on the main thread, I can see that all of my virtual cores are still being used, but it seems somewhat uneven and CPU usage is at about 36%.
Unfortunately, if I set use_multiprocessing = True, then I get a broken pipe error. I have yet to fix this, but I'd like to better understand what I am trying to fix here.
If someone could please explain the difference between training with use_multiprocessing = True and use_multiprocessing = False, as well as when workers are = 0, 1, and >1 I would be very grateful. If it matters, I am using tensorflow (gpu version) as the backend for keras with python 3.6 in Spyder with the IPython Console.
My suspicion is that use_multiprocessing is actually enabling multiprocessing when True whereas workers>1 when use_multiprocessing=False is setting the number of threads, but that's just a guess.
The only thing I know is that when use_multiprocessing=False and workers > 1, there are many parallel data-loading threads (I'm not really good with these names: threads, processes, etc.). In my case there were five parallel fronts loading data into the queue, so loading data is faster, although it doesn't affect the model's speed; this can be good when data loading takes too long.
Whenever I tried use_multiprocessing=True, everything froze.
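To make the parameters concrete, here is a minimal sketch against the old multi-backend Keras API the question refers to (toy model and generator, not the asker's code):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import Sequence

class ToySequence(Sequence):
    """Dummy generator yielding random batches."""

    def __len__(self):
        return 32  # batches per epoch

    def __getitem__(self, idx):
        x = np.random.rand(16, 8)
        y = np.random.randint(0, 2, size=(16, 1))
        return x, y

model = Sequential([
    Dense(4, activation="relu", input_shape=(8,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# workers=6 with use_multiprocessing=False: six loader threads fill the
# batch queue; use_multiprocessing=True spawns processes instead (the
# mode that triggers the asker's broken pipe error on Windows).
model.fit_generator(ToySequence(), epochs=1,
                    workers=6, use_multiprocessing=False)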

Multiprocessing backed parallel loops cannot be nested below threads

What is the reason of such issue in joblib?
'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1'
What should I do to avoid such issue?
Actually I need to implement an XMLRPC server which runs heavy computation in a background thread and reports current progress through polling from a UI client. It uses scikit-learn, which is based on joblib.
P.S.:
I've simply changed the name of the thread to "MainThread" to avoid the warning, and everything seems to work fine (it runs in parallel as expected without issues). What might be a problem in the future with such a workaround?
I had the same warning while doing predictions with sklearn within a thread, using a model I had loaded that was fitted with n_jobs > 1. It appears that when you pickle a model it is saved with its parameters, including n_jobs.
To avoid the warning (and the potential serialization cost), set n_jobs to 1 when loading your models:
clf = joblib.load(model_filename).set_params(n_jobs=1)
This seems to be due to this issue in the joblib library. At the time of writing it appears to be fixed but not yet released. As mentioned in the question, a dirty fix is to rename the current thread back to "MainThread":
threading.current_thread().name = 'MainThread'

in a pickle: how to serialise legacy objects for submission to a Python multiprocessing pool

I have written a nice parallel job processor that accepts jobs (functions, their arguments, timeout information etc.) and submits them to a Python multiprocessing pool. I can provide the full (long) code if requested, but the key step (as I see it) is the asynchronous application to the pool:
job.resultGetter = self.pool.apply_async(
    func=job.workFunction,
    kwds=job.workFunctionKeywordArguments
)
I am trying to use this parallel job processor with a large body of legacy code and, perhaps naturally, have run into pickling problems:
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
This type of problem is observable when I try to submit a problematic object as an argument for a work function. The real problem is that this is legacy code and I am advised that I can make only very minor changes to it. So... is there some clever trick or simple modification I can make somewhere that could allow my parallel job processor code to cope with these traditionally unpicklable objects? I have total control over the parallel job processor code, so I am open to, say, wrapping every submitted function in another function. For the legacy code, I should be able to add the occasional small method to objects, but that's about it. Is there some clever approach to this type of problem?
Use dill and pathos.multiprocessing instead of pickle and multiprocessing.
See here:
What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
How to pickle functions/classes defined in __main__ (python)
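For illustration, a minimal sketch of that suggestion (the LegacyWorker class is an invented stand-in for the legacy objects): pathos pools serialize with dill, so bound methods that break the standard pickle can be submitted directly.

from pathos.multiprocessing import ProcessingPool

class LegacyWorker:
    """Stand-in for a legacy object whose bound methods are not
    picklable with the standard library pickle on older Pythons."""

    def compute(self, x):
        return x * x

if __name__ == "__main__":
    worker = LegacyWorker()
    pool = ProcessingPool(nodes=4)
    # dill serializes the bound method worker.compute for the workers.
    print(pool.map(worker.compute, [1, 2, 3, 4]))  # [1, 4, 9, 16]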
