I'd like to get the actual version of an MLflow model before loading / updating it in the running environment.
The environment accesses the model by a path such as: models:/AICam/Production
From time to time I need to reload the model. Unfortunately, models are released asynchronously, and loading takes some minutes, which interrupts the process.
Does anybody have a hint on how to do this (in Python)?
You can assign the following to a variable to check the version, and then test it with if/else or assert statements:
print(mlflow.__version__)
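Note that mlflow.__version__ is the version of the MLflow library itself. If what you need is the version number currently registered behind models:/AICam/Production, a minimal sketch using the model registry client could look like this (assuming the tracking/registry URI is already configured and an MLflow release where the stage-based API is available):

from mlflow.tracking import MlflowClient

client = MlflowClient()
# get_latest_versions returns the newest registered version per requested stage
latest = client.get_latest_versions("AICam", stages=["Production"])[0]
print(latest.version)  # compare against the version you loaded last time

Comparing this number with the one you last loaded lets you skip the minutes-long reload when nothing has changed.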
When I launch my main script on the cluster with ddp mode (2 GPUs), PyTorch Lightning duplicates whatever is executed in the main script, e.g. prints or other logic. I need some extended training logic, which I would like to handle myself, e.g. do something (once!) after Trainer.fit(). But with the duplication of the main script, this doesn't work as I intend. I also tried to wrap it in if __name__ == "__main__", but it doesn't change the behavior. How could one solve this problem? Or, how can I use some logic around my Trainer object without the duplicates?
I have since moved on to use the native "ddp" with multiprocessing in PyTorch. As far as I understand, PyTorch Lightning (PTL) is just running your main script multiple times on multiple GPUs. This is fine if you only want to fit your model in one call of your script. However, a huge drawback in my opinion is the lost flexibility during the training process. The only way of interacting with your experiment is through these (badly documented) callbacks. Honestly, it is much more flexible and convenient to use native multiprocessing in PyTorch. In the end it was much faster and easier to implement, plus you don't have to search for ages through the PTL documentation to achieve simple things.
I think PTL is going in a good direction by removing much of the boilerplate; however, in my opinion the Trainer concept needs some serious rework. It is too closed and violates PTL's own principle of "reorganizing PyTorch code, keep native PyTorch code".
If you only want PTL for easy multi-GPU training, I personally would strongly suggest refraining from it; for me it was a waste of time, and learning native PyTorch multiprocessing was the better investment.
I asked this at the GitHub repo: https://github.com/PyTorchLightning/pytorch-lightning/issues/8563
There are different accelerators for training, and while DDP (DistributedDataParallel) runs the script once per GPU, ddp_spawn and dp don't.
However, certain plugins like DeepSpeedPlugin are built on DDP, so when using them, changing the accelerator doesn't stop the main script from running multiple times.
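For illustration, a hedged sketch of selecting a spawn-based strategy (the exact argument names vary across Lightning versions; recent releases use strategy, while older ones used accelerator or distributed_backend):

from pytorch_lightning import Trainer

# ddp_spawn keeps the main script running only once and spawns the worker
# processes internally; `model` is assumed to be a LightningModule defined elsewhere.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
trainer.fit(model)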
You could quit the duplicated sub-processes by putting the following code after Trainer.fit:
import sys

if model.global_rank != 0:
    sys.exit(0)
where model is an instance of a LightningModule subclass, which has a global_rank property specifying the rank of the process. You can roughly think of it as the GPU ID or the process ID. Everything after this code will only be executed in the main process, i.e. the process with global_rank = 0.
For more information, please refer to the documentation: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#global_rank
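A minimal sketch of where this early exit fits in a training script (the Trainer arguments and the post-fit step are illustrative assumptions; the Trainer object exposes global_rank as well, so either object can be checked):

import sys
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(model)            # model is your LightningModule; fit runs in every DDP process

if trainer.global_rank != 0:  # duplicated worker processes stop here
    sys.exit(0)

# anything below runs exactly once, in the rank-0 process
evaluate_and_export(model)    # hypothetical post-fit step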
Use an environment variable as a run-once flag:

import os

if __name__ == "__main__":
    # Only the first launch of the script sees the variable unset;
    # the re-launched DDP copies inherit it from the parent process.
    is_primary = os.environ.get("IS_PTL_PRIMARY") is None
    os.environ["IS_PTL_PRIMARY"] = "yes"

    ## code to run on each GPU

    if is_primary:
        ## code to run only once
        pass
From the PyTorch Lightning official documentation on DDP, we know that PL intentionally calls the main script multiple times to spin off the child processes that take charge of the GPUs.
It uses the environment variables "LOCAL_RANK" and "NODE_RANK" to denote the GPUs, so we can add conditions to bypass the code blocks that we don't want executed repeatedly. For example:
import os

if __name__ == "__main__":
    if "LOCAL_RANK" not in os.environ and "NODE_RANK" not in os.environ:
        # code you only want to run once
        pass
This seems like a trivial question, but in fact it isn't. I'm using Puppet 3.7 to deploy and configure artifacts from my project onto a variety of environments. A Puppet 5.5 upgrade is on the roadmap, but without an ETA so far.
One of the things I'm trying to automate is the incremental changes to the underlying DB. It's not SQL, so standard tools are out of the question. These changes will come in the form of shell scripts contained in a special module, which will also be deployed as an artifact. For each release we want to have a file whose content lists the shell scripts to execute in the scope of that release. For instance, if in version 1.2 we had implemented JIRA-123, JIRA-124 and JIRA-125, I'd like to execute the scripts JIRA-123.sh, JIRA-124.sh and JIRA-125.sh, but none of the other ones that are still in that module from previous releases.
So my release "control" file would be called something like jiras.1.2.csv and have one line looking like this:
JIRA-123,JIRA-124,JIRA-125
The task for Puppet here seems trivial - read the content of this file, split on the "," character, and go on to build exec tasks for each of the JIRAs. The problem is that the Puppet function that should help me do it
file("/somewhere/in/the/filesystem/jiras.1.2.csv")
gets executed at the time the Puppet catalog is built, not at the time the catalog is applied. However, since this file is part of the payload of the release, it's not there yet. It will be downloaded from Nexus in a tar.gz package of the release and extracted later. I have an anchor I can hold on to, which I use to synchronize the exec tasks, but can I attach the reading of the file content to that anchor?
Maybe I'm approaching the problem incorrectly? I was thinking the module with the pre-implementation and post-implementation tasks that constitute the incremental DB upgrades could be structured so that for each release there's a subdirectory matching the version name, but then I need to list the contents of that subdirectory to build my exec tasks, and that too - at least to my limited Puppet knowledge - can't be deferred until a particular anchor.
--- EDITED after one of the comments ---
The problem is that the upgrade to Puppet 5.x is beyond my control - it's another team handling this stuff in a huge organisation, so I have no influence over that, and I'm stuck on 3.7 for the foreseeable future.
As for what I'm trying to do: for a bunch of different software packages that we develop and release, I want to create three new ones: pre-implementation, post-implementation and checks. The first will hold any tasks that are performed prior to releasing new code in our actual packages; this is typically things like backing up the DB. Post-implementation will deal with issues that need to be addressed after we've deployed the new code; an example operation would be to go and modify older data because, for instance, we've changed the type of a column in a table. Checks are just validations performed to make sure the release is implemented 100% correctly, for instance running a select query and asserting on the type of data in the column whose type we've just changed. Today all of these are daunting manual operations performed by whoever is unlucky enough to be doing a release. Above all else, being manual, they are by definition error-prone.
The approach taken is that for every JIRA ticket that is part of the release, the responsible developer will have to decide what steps (if any) are needed to release their work, and script them. Puppet is supposed to orchestrate the execution of all of this.
What is the reason for this issue in joblib?
'Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1'
What should I do to avoid this issue?
Actually I need to implement an XMLRPC server which runs heavy computation in a background thread and reports the current progress through polling from a UI client. It uses scikit-learn, which is based on joblib.
P.S.:
I've simply changed the name of the thread to "MainThread" to avoid the warning, and everything seems to work well (it runs in parallel as expected, without issues). What might become a problem in the future with such a workaround?
I had the same warning while doing predictions with sklearn within a thread, using a model I had loaded and which was fitted with n_jobs > 1. It appears that when you pickle a model, it is saved with its parameters, including n_jobs.
To avoid the warning (and the potential serialization cost), set n_jobs back to 1 on the model, either before pickling it or when loading it:
clf = joblib.load(model_filename).set_params(n_jobs=1)
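For context, a hedged sketch of both ends of the round trip (clf, model_filename and X are assumed to exist; set_params returns the estimator, so it can be chained in either place):

import joblib

# Option A: reset n_jobs before pickling, so the saved model never asks
# for multiprocessing workers when it is later used inside a thread.
joblib.dump(clf.set_params(n_jobs=1), model_filename)

# Option B (as above): reset it on the loaded model instead.
clf = joblib.load(model_filename).set_params(n_jobs=1)
predictions = clf.predict(X)  # safe to call from the background thread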
This seems to be due to this issue in the joblib library. At the time of writing it appears to be fixed but not yet released. As mentioned in the question, a dirty fix is to rename the current thread to "MainThread":
threading.current_thread().name = 'MainThread'
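For completeness, a sketch of how that workaround sits inside the background worker described in the question (all names here are illustrative):

import threading

def heavy_computation():
    # Dirty workaround from the question: make joblib believe it is running
    # in the main thread so it keeps n_jobs > 1. Use at your own risk.
    threading.current_thread().name = 'MainThread'
    ...  # scikit-learn calls that use joblib internally go here

worker = threading.Thread(target=heavy_computation)
worker.start()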
I am writing a Rails 3.1 app, and I have a set of three cucumber feature files. When run individually, as with:
cucumber features/quota.feature
-- or --
cucumber features/quota.feature:67 # specifying the specific individual test
...each feature file runs fine. However, when all run together, as with:
cucumber
...one of the tests fails. It's odd because only one test fails; all the other tests in the feature pass (and many of them do similar things). It doesn't seem to matter where in the feature file I place this test; it fails if it's the first test or way down there somewhere.
I don't think it can be the test itself, because it passes when run individually or even when the whole feature file is run individually. It seems like it must be some effect related to running the different feature files together. Any ideas what might be going on?
It looks like there is coupling between your scenarios. Your failing scenario assumes that the system is in some state. When the scenario runs individually, the system is in this state, so the scenario passes. But when you run all the scenarios, the scenarios that ran previously change this state, and so it fails.
You should solve it by making your scenarios completely independent. The work of any scenario shouldn't influence the results of other scenarios. This is highly encouraged in The Cucumber Book and Specification by Example.
I had a similar problem and it took me a long time to figure out the root cause.
I was using #selenium tags to test jQuery scripts on a Selenium client.
My page had an Ajax call that was sending a POST request. I had a bug in the JavaScript and the POST request was failing. (The feature wasn't complete and I hadn't yet written steps to verify the result of the Ajax call.)
This error was recorded in Capybara.current_session.server.error.
When the following non-Selenium feature was executed, a Before hook within Capybara called Capybara.reset_sessions!
This then called
def reset!
  driver.reset! if @touched
  @touched = false
  raise @server.error if @server and @server.error
ensure
  @server.reset_error! if @server
end
@server.error was not nil for each scenario in the following feature(s), so Cucumber reported each step as skipped.
The solution in my case was to fix the ajax call.
So Andrey Botalov and Doug Noel were right. I had carry-over from an earlier feature.
I had to keep debugging until I found the exception that was being raised and investigated what was generating it.
I hope this helps someone else who didn't realise they had carry-over from an earlier feature.
I am creating an OpenGL game on a Windows 7 machine using VS2010. In addition, SDL, QtCore, QtXML and FbxSdk are also used to assist in development. I am experiencing a very peculiar problem with glGenTextures when running outside debug mode. Let me explain.
When I compile and run the application in Debug mode, the models are textured and displayed properly. As soon as I debug the application, or compile and run it in Release mode, textures are no longer being applied to the models.
I have tracked down the problem to glGenTextures not giving me a valid name. It does not give me any errors either. The way I am loading everything is as follows:
Models are loaded as FBX through FbxSdk, and the required textures are loaded as each model is loaded. All models are loaded in another thread, and I made sure that no OpenGL functions get called anywhere while this thread is loading models. If I don't load the models in another thread, everything works. I tried everything I can think of, including halting the main thread while the models are loaded to guarantee nothing else is happening in the meantime. None of it works.
Again, this wouldn't be as weird except that compiling as Debug works, while Release builds and debugging the application don't. Any thoughts?
I can only guess, but did you maybe not make the GL context current on the model-loading thread? Remember that the OpenGL context is thread-specific!
On the other hand, it's usually advised against using the same GL context in different threads. Either use a different context and share resources, or defer all the GL calls to the main GL thread.
The difference when using the debugger is that VS will use the debug heap when you debug, but won't use it when you just run a debug build without actually debugging.
However, if you get the threads thing wrong, all kinds of crazy side-effects can happen which the debug heap might "hide".