When I launch my main script on the cluster in DDP mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training logic that I would like to handle myself, e.g. do something (once!) after Trainer.fit(). But with the duplication of the main script, this doesn't work as intended. I also tried wrapping it in if __name__ == "__main__", but that doesn't change the behavior. How can I solve this problem? Or: how can I run logic around my Trainer object without the duplicates?
I have since moved on to using native DDP with multiprocessing in PyTorch. As far as I understand, PyTorch Lightning (PTL) simply runs your main script multiple times, once per GPU. This is fine if you only want to fit your model in one call of your script. However, a huge drawback in my opinion is the lost flexibility during the training process. The only way of interacting with your experiment is through these (badly documented) callbacks. Honestly, it is much more flexible and convenient to use native multiprocessing in PyTorch. In the end it was much faster and easier to implement, and you don't have to search through PTL documentation for ages to achieve simple things.
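For reference, here is a minimal sketch of the kind of native setup I mean (the model, port choice, and training loop are placeholders, not code from my actual experiment):
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run_training(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    # ... your training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_training, args=(world_size,), nprocs=world_size)
    # everything below runs exactly once, in the parent process
    print("do post-training logic here")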
I think PTL is going in a good direction by removing much of the boilerplate; however, in my opinion, the Trainer concept needs some serious rework. It is too closed and violates PTL's own principle of "reorganizing PyTorch code, keep native PyTorch code".
If you are only after easy multi-GPU training, I would personally suggest refraining from PTL; for me it was a waste of time, and learning native PyTorch multiprocessing was the better investment.
Asked this at the GitHub repo: https://github.com/PyTorchLightning/pytorch-lightning/issues/8563
There are different accelerators for training, and while ddp (DistributedDataParallel) runs the script once per GPU, ddp_spawn and dp don't.
However, certain plugins like DeepSpeedPlugin are built on DDP, so changing the accelerator doesn't stop the main script from running multiple times.
You could quit the duplicated subprocesses by putting the following code after Trainer.fit:
import sys

if model.global_rank != 0:
    sys.exit(0)
where model is an instance of your LightningModule subclass, which has a global_rank property specifying the rank of the process. You can roughly think of it as the GPU ID or the process ID. Everything after this code will only be executed in the main process, i.e., the process with global_rank == 0.
For more information, please refer to the documentation: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#global_rank
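For context, here is a minimal sketch of where that exit goes in practice (MyLitModel is a placeholder for your own LightningModule subclass):
import sys
from pytorch_lightning import Trainer

model = MyLitModel()  # placeholder LightningModule
trainer = Trainer(gpus=2, accelerator="ddp")
trainer.fit(model)

if model.global_rank != 0:
    sys.exit(0)  # the duplicated ranks stop here

# Only the global_rank == 0 process reaches this point,
# so any post-training logic below runs exactly once.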
Use an environment variable as a sentinel:
import os

if __name__ == "__main__":
    # The first invocation of the script finds the variable unset;
    # the duplicated invocations inherit it already set.
    is_primary = os.environ.get("IS_PTL_PRIMARY") is None
    os.environ["IS_PTL_PRIMARY"] = "yes"
    ## code to run on each GPU
    if is_primary:
        ## code to run only once
        pass
From the PyTorch Lightning official documentation on DDP, we know that PL intentionally calls the main script multiple times to spin off the child processes that take charge of the GPUs:
It uses the environment variables LOCAL_RANK and NODE_RANK to denote the GPUs. So we can add conditions to bypass the code blocks that we don't want to be executed repeatedly. For example:
import os

if __name__ == "__main__":
    if "LOCAL_RANK" not in os.environ and "NODE_RANK" not in os.environ:
        # code you only want to run once
        pass
I have quite a 'general' question. I am developing with the Revit API (in Python), and I sometimes observe that the Revit session gets slower during my tests and trials (the longer Revit stays open, the more it seems to happen). It's not getting to the point where it would be really problematic, but it made me think about it anyway.
So, since I have no programming background, I am pretty sure that my code is filled with really 'unorthodox' things that could be far better.
Are there some basic 'tips and tricks' I could follow (related to the Revit API, I mean) to help the speed of code execution? Or maybe I should say: to help reduce memory use?
For instance, I've read about the 'Dispose' method, notably when using Transactions (for instance here: http://thebuildingcoder.typepad.com/blog/2012/09/disposal-of-revit-api-objects.html), but in the end it's not very clear to me whether that's actually important to do or not (and furthermore, since I'm using Python, I don't know where that leaves me in the discussion about using "using" or not).
Should I just 'Dispose' everything? ;)
Besides the 'Dispose' method, is there something else?
Thanks a lot,
Arnaud.
Basics:
Okay let's talk about a few important points here:
You're running scripts under IronPython, which is an implementation of Python written in C#.
C# is a garbage-collected language: the .NET runtime reclaims unused memory for you.
The garbage collector (GC) is a piece of the runtime that executes at intervals to collect unused objects. It uses a series of techniques to group and categorize the target memory areas for later collection.
Your main program is paused to allow the GC to collect memory. This means that if the GC needs more time to do its job at each interval, your program gets slow and you'll experience lag.
Issue:
Now to the heart of this issue:
Python is an object-oriented programming language at heart, and IronPython creates objects (similar in concept to Elements in Revit) for everything, from your variables to the methods of a class to functions and everything else. This means all of these objects need to be collected once they're no longer used.
When using Python as a scripting language for a program, there is generally one single Python engine that executes all user inputs.
However, Revit does not have a command prompt and an associated Python engine. So every time you run a script in Revit, a new engine is created; it executes the program and dies at the end.
This dramatically increases the amount of unused memory for the GC to collect.
Solution:
I'm the creator and maintainer of pyRevit, and this issue was resolved in pyRevit v4.2.
The solution was to set LightweightScopes = true when creating the IronPython engine, which forces the engine to create smaller objects. This dramatically decreased the memory used by IronPython and increased the amount of time until the user experiences Revit performance degradation.
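For the curious, here is a rough sketch of what that looks like through the IronPython hosting API (illustrative only, not pyRevit's actual code; it runs only under IronPython/.NET):
import clr
clr.AddReference("IronPython")
from System.Collections.Generic import Dictionary
from IronPython.Hosting import Python

# Engine options are passed as a string-keyed dictionary;
# LightweightScopes makes the engine allocate smaller scope objects.
options = Dictionary[str, object]()
options["LightweightScopes"] = True
engine = Python.CreateEngine(options)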
Sorry, I can't comment due to low reputation. I use another way to reduce memory; it's less pretty than the LightweightScopes trick, but it works for one-time cleanup after expensive operations:
import gc

my_object = some_huge_object  # placeholder for whatever large object you create
# [operation]
del my_object  # or my_object = [] does the job for a list or dict
gc.collect()   # force an immediate collection instead of waiting for the GC
I have an experiment in which I present stimuli using PsychoPy / PyGaze and track eye movements with an EyeTribe eye tracker. In this experiment I update the size of two visual stimuli on each frame (at 60 Hz). I prepare each frame beforehand and afterwards loop through all of the screen objects and present them. Meanwhile, a continuous sound is playing. When I run this experiment in dummy mode (mouse movement is used as a simulation for gaze position), there are no timing issues for the visual presentation. However, when I run the experiment while performing eye tracking, the timing of the visual presentation is no longer accurate (higher variability in duration of frames).
I tried looking into the multithreading more, but in the pytribe script of PyGaze I can't find any evidence that one thread is waiting for an event coming from the eye-tracking thread. So I have no idea how to figure out what is causing the timing issues, or how to solve them. (I hope I explained the problem specifically enough.)
It's worse than just needing a separate thread for eye tracking versus stimulus rendering. What you really need is a separate process, which avoids Python's Global Interpreter Lock (GIL). The GIL prevents different threads from executing Python bytecode at the same time, even on different processors.
For improved temporal precision I would really recommend you switch from pygaze to iohub (which also has support for eyetribe I believe). iohub does run genuinely on a different core of the machine where possible so that your stimuli and eye data can be processed independently in time, and it handles all the sync stuff for you.
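To illustrate the idea in general terms, here is a minimal sketch of polling a tracker from a separate process (this is not iohub's or PyGaze's actual API; connect_to_tracker and sample are hypothetical):
import multiprocessing as mp

def sample_loop(queue, stop_event):
    tracker = connect_to_tracker()  # hypothetical connection function
    while not stop_event.is_set():
        queue.put(tracker.sample())  # hypothetical sampling call

if __name__ == "__main__":
    queue = mp.Queue()
    stop = mp.Event()
    worker = mp.Process(target=sample_loop, args=(queue, stop))
    worker.start()
    # ... render stimuli here; drain the queue for gaze samples ...
    stop.set()
    worker.join()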
Adding to Jon's answer: Hanne also emailed about the problem, and it turns out she was running her experiments from Spyder. When run from the command prompt, there shouldn't be any timing issues. (Obviously, the GIL is still around, but in practice this doesn't seem to affect screen timing.)
To prevent any issues in the future, I've added a class that allows for running the EyeTribe in a parallel Process. See: https://github.com/esdalmaijer/PyTribe/blob/master/pytribe.py#L365
Example use:
if __name__ == "__main__":
    import time

    from pygaze.display import Display
    from pygaze.screen import Screen
    from pytribe import ParallelEyeTribe

    disp = Display()
    scr = Screen()
    scr.draw_fixation(fixtype='cross')
    tracker = ParallelEyeTribe()
    tracker.start_recording()
    disp.fill(scr)
    disp.show()
    tracker.log("Stimulus onset")
    time.sleep(10)
    disp.show()
    tracker.log("Stimulus offset")
    tracker.stop_recording()
    tracker.close()
    disp.close()
What is the reason for this issue in joblib?
'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1'
What should I do to avoid this issue?
Actually, I need to implement an XML-RPC server that runs heavy computation in a background thread and reports the current progress through polling from a UI client. It uses scikit-learn, which is based on joblib.
P.S.:
I've simply changed the name of the thread to "MainThread" to avoid the warning, and everything seems to work fine (it runs in parallel as expected, without issues). What might be a problem with such a workaround in the future?
I had the same warning while making predictions with sklearn within a thread, using a model I had loaded and which had been fitted with n_jobs > 1. It appears that when you pickle a model, it is saved with its parameters, including n_jobs.
To avoid the warning (and the potential serialization cost), set n_jobs to 1 when loading the pickled model:
clf = joblib.load(model_filename).set_params(n_jobs=1)
This seems to be due to this issue in the joblib library. At the time of writing it appears to be fixed but not yet released. As written in the question, a dirty fix would be to rename the main thread back to MainThread:
import threading
threading.current_thread().name = 'MainThread'
I have written a nice parallel job processor that accepts jobs (functions, their arguments, timeout information, etc.) and submits them to a Python multiprocessing pool. I can provide the full (long) code if requested, but the key step (as I see it) is the asynchronous application to the pool:
job.resultGetter = self.pool.apply_async(
func = job.workFunction,
kwds = job.workFunctionKeywordArguments
)
I am trying to use this parallel job processor with a large body of legacy code and, perhaps naturally, have run into pickling problems:
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
This type of problem is observable when I try to submit a problematic object as an argument for a work function. The real problem is that this is legacy code and I am advised that I can make only very minor changes to it. So... is there some clever trick or simple modification I can make somewhere that could allow my parallel job processor code to cope with these traditionally unpicklable objects? I have total control over the parallel job processor code, so I am open to, say, wrapping every submitted function in another function. For the legacy code, I should be able to add the occasional small method to objects, but that's about it. Is there some clever approach to this type of problem?
Use dill and pathos.multiprocessing instead of pickle and multiprocessing.
see here:
What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
How to pickle functions/classes defined in __main__ (python)
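For illustration, here is a minimal sketch of the pathos route (LegacyThing is a made-up stand-in for a legacy class whose bound methods the standard pickle rejects):
from pathos.multiprocessing import ProcessingPool

class LegacyThing(object):
    def __init__(self, factor):
        self.factor = factor

    def work(self, x):
        return self.factor * x

if __name__ == "__main__":
    obj = LegacyThing(3)
    pool = ProcessingPool(nodes=4)
    # dill serializes the bound method where pickle would raise PicklingError
    print(pool.map(obj.work, range(5)))  # [0, 3, 6, 9, 12]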
How does GridSearchCV with n_jobs set to a value > 1 actually work? Does it create multiple instances of the classifier, one for each computation node, or does it create one single classifier shared by all the nodes? The reason I am asking is because I am using vowpal_wabbit's Python wrapper: https://github.com/josephreisinger/vowpal_porpoise/blob/master/vowpal_porpoise/vw.py and see that it opens a subprocess (with stdin, stdout, stderr, etc.). However, when I use this from GridSearchCV with n_jobs > 1, I get a broken pipe error after some time and am trying to understand why.
n_jobs > 1 will make GridSearchCV use Python's multiprocessing module under the hood. That means that the original estimator instance will be copied (pickled) to be sent over to the worker Python processes. All scikit-learn models MUST be picklable. If vowpal_porpoise opens pipes to a vw subprocess in the constructor, it has to close and reopen them around the pickling/unpickling steps by defining custom __getstate__ and __setstate__ methods. Have a look at the Python documentation on pickling for more details.
The subprocess should probably also be closed and reopened upon a call to the set_params method, to update the parameters of the model with new values.
It would be easier to not open the subprocess in the constructor and just open it on demand in the fit and predict methods and close the subprocess each time.
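For concreteness, here is a hedged sketch combining both suggestions (lazy opening plus pickling-safe state); the class name and vw command line are illustrative, not vowpal_porpoise's actual code:
import subprocess

class VWEstimator(object):
    def __init__(self, **params):
        self.params = params
        self._proc = None  # opened lazily, never kept across pickling

    def _ensure_open(self):
        if self._proc is None:
            self._proc = subprocess.Popen(
                ["vw", "--quiet"],  # illustrative arguments
                stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def fit(self, X, y):
        self._ensure_open()
        # ... stream training examples to self._proc.stdin ...
        return self

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_proc"] = None  # drop the live pipe before pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # the pipe reopens on the next fit/predict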
One of the questions in the comments was:
Which one is better: n_jobs=-1, or n_jobs set to a big number like 32?
That depends on what you mean by better. I would say it depends on the hardware you currently have available, as well as how much of it you want to devote to the algorithm.
The documentation says that n_jobs=-1 uses all processors. So if your hardware actually supports 32 threads, GridSearchCV() will use all 32. If you decrease the number further (n_jobs=-2, n_jobs=-3, and so forth), joblib uses n_cpus + 1 + n_jobs processors: for example, when 8 processors are available, n_jobs=-2 instantiates 7 jobs.
But it is also a little more complicated than that: the number of jobs specified with n_jobs in GridSearchCV() does not have to be identical to the number of threads Python actually uses, because other processes may be using the processors as well.
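Since the negative-value rule trips people up, here is a tiny sketch of the counting convention described above (effective_jobs is purely illustrative, not a real joblib function):
import os

def effective_jobs(n_jobs, n_cpus=os.cpu_count()):
    # joblib's documented convention: for negative values,
    # n_cpus + 1 + n_jobs processors are used.
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs

print(effective_jobs(-1, n_cpus=8))  # 8 -> all processors
print(effective_jobs(-2, n_cpus=8))  # 7 -> all but one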