MLflow saves models to a relative path instead of the tracking_uri

Sorry if my question is too basic, but I cannot solve it.
I am currently experimenting with MLflow and am facing the following issue:
Even though I have set the tracking_uri, the MLflow artifacts are saved to the ./mlruns/... folder relative to the directory from which I run mlflow run path/to/train.py (on the command line). The MLflow server, however, looks for the artifacts under the tracking_uri (mlflow server --default-artifact-root here/comes/the/same/tracking_uri).
The following example should make clear what I mean.
I set the following in the training script before the with mlflow.start_run() as run: line:
mlflow.set_tracking_uri("file:///home/#myUser/#SomeFolders/mlflow_artifact_store/mlruns/")
My expectation was that MLflow would save all the artifacts to the location I set in that URI. Instead, it saves them relative to the directory from which I run mlflow run path/to/train.py, i.e. running the following
/home/#myUser/ mlflow run path/to/train.py
creates the structure:
/home/#myUser/mlruns/#experimentID/#runID/artifacts
/home/#myUser/mlruns/#experimentID/#runID/metrics
/home/#myUser/mlruns/#experimentID/#runID/params
/home/#myUser/mlruns/#experimentID/#runID/tags
and therefore it doesn't find the run artifacts in the tracking_uri, giving the error message:
Traceback (most recent call last):
File "train.py", line 59, in <module>
with mlflow.start_run() as run:
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 204, in start_run
active_run_obj = client.get_run(existing_run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/client.py", line 151, in get_run
return self._tracking_client.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/_tracking_service/client.py", line 57, in get_run
return self.store.get_run(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 524, in get_run
run_info = self._get_run_info(run_id)
File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 544, in _get_run_info
"Run '%s' not found" % run_uuid, databricks_pb2.RESOURCE_DOES_NOT_EXIST
mlflow.exceptions.MlflowException: Run '788563758ece40f283bfbf8ba80ceca8' not found
2021/07/23 16:54:16 ERROR mlflow.cli: === Run (ID '788563758ece40f283bfbf8ba80ceca8') failed ===
Why is that? How can I change the place where the artifacts are stored and this directory structure is created? I have tried mlflow run --storage-dir here/comes/the/path, as well as setting the tracking_uri and the registry_uri. If I run mlflow run path/to/train.py from within /home/path/to/tracking/uri it works, but I need to run the scripts remotely.
My end goal is to point the artifact URI to an NFS drive, but I cannot even make it work on my local computer.
Thanks for reading it, even more thanks if you suggest a solution! :)
Have a great day!

This issue was solved by the following:
I had mixed up the tracking_uri with the backend_store_uri.
The tracking_uri is where the MLflow-related data (e.g. tags, parameters, metrics) is saved, which can be a database. The artifact_location, on the other hand, is where the artifacts are saved (the other, non-MLflow data produced by the preprocessing/training/evaluation scripts).
What led me astray is that when running mlflow server from the command line, --backend-store-uri should be set to the tracking_uri (the same value passed to mlflow.set_tracking_uri() in the script), and --default-artifact-root to the location of the artifacts. Somehow I didn't get that tracking_uri = backend_store_uri.

Here's my solution
Launch the server
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME
Set the tracking URI in the script to an HTTP URI like
mlflow.set_tracking_uri("http://my-tracking-server:5000/")

Related

PySpark local env path handling in a cloned PySpark codebase

I'm new to PySpark and have a question about handling environment-specific configs in PySpark.
I read a file from HDFS in PySpark like this:
airportDf = spark.read.format('csv')\
.option('sep', ',')\
.option('header', 'false')\
.schema(airportSchema)\
.load(configs['airport_dat'])
here,
configs['airport_dat'] = "/mnt/g/pythonProject/data/airports.DAT"
which is configured in a config.json and would be the HDFS path.
However, when I clone this repo locally and want to load the file from a local Windows path, I have to manually edit this file with the Windows path.
I wanted to know whether this is the correct approach, or whether there is a guideline for handling such environment-specific configurations so that manual edits to config files can be avoided when running on a local system.
Any link, article, or sample repo illustrating the approach would be really helpful.
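One common pattern (a sketch, not from the original post; the APP_ENV variable name and the per-environment file names are assumptions) is to keep one config file per environment and select it with an environment variable, so the code itself never needs editing:
import json
import os

# Choose the config file from a hypothetical APP_ENV variable;
# default to "local" when running in the IDE.
env = os.environ.get("APP_ENV", "local")
config_path = f"config/config.{env}.json"  # e.g. config.local.json, config.prod.json

with open(config_path, "r") as config_file:
    configs = json.load(config_file)

# config.prod.json would hold the HDFS path,
# config.local.json the local Windows path.
airport_path = configs["airport_dat"]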
One more issue I am facing: the code works fine locally in the IDE, however with spark-submit on the cluster it is not able to find the config file.
My code to read the config file is
with open("config/config.json", "r") as config_file:
configs = json.load(config_file)
and my directory structure is
config/
-- config.json
main.py
jobs/
-- __init__.py
-- load_country.py
I have packaged all the files in a packages.zip and pass it as the --py-files parameter.
My spark-submit command is
$SPARK_HOME/bin/spark-submit --py-files /mnt/g/pythonProject/sparkproject/packages.zip /mnt/g/pythonProject/sparkproject/main.py
The error I am getting is
Traceback (most recent call last):
File "/mnt/g/pythonProject/sparkproject/main.py", line 20, in <module>
with open("config/config.json", "r") as config_file:
FileNotFoundError: [Errno 2] No such file or directory: 'config/config.json'
I don't understand why the file is not discoverable to spark-submit even though it is in the zip file. Do I need to add anything else to make the JSON file discoverable?
P.S. The other functions in the jobs folder are accessible, though.
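One possible workaround (a sketch, not from the original post; it assumes the JSON can be shipped separately with --files instead of being bundled into packages.zip) is to distribute the config through Spark's file mechanism and resolve its local path with SparkFiles, which behaves the same locally and on the cluster:
import json

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# spark-submit --py-files packages.zip --files /mnt/g/pythonProject/sparkproject/config/config.json main.py
spark = SparkSession.builder.appName("load_country").getOrCreate()

# SparkFiles.get() returns the absolute local path Spark copied the file to.
with open(SparkFiles.get("config.json"), "r") as config_file:
    configs = json.load(config_file)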

Cannot load tokenizer from local

I just started to use AWS Lambda and Docker, so I would appreciate any advice.
I am trying to deploy an ML model to AWS Lambda for inference. The image created from the Dockerfile successfully loads the XLNet model from a local dir; however, it gets stuck when doing the same thing for the tokenizer.
In the pretrained_tokenizer folder, I have 4 files saved from tokenizer.save_pretrained(...) and config.save_pretrained(...).
In the Dockerfile, I have tried multiple things, including:
copying the folder: COPY app/pretrained_tokenizer/ opt/ml/pretrained_tokenizer/
copying each file from the folder with separate COPY commands
compressing the folder to .tar.gz and using ADD pretrained_tokenizer.tar.gz /opt/ml/ (which is supposed to extract the tar file in the process)
In my Python script, I load the tokenizer using tokenizer = XLNetTokenizer.from_pretrained(tokenizer_file, do_lower_case=True), which works on Colab, but not when I invoke the image with sam local invoke -e events/event.json; the error is
[ERROR] OSError: Can't load tokenizer for 'opt/ml/pretrained_tokenizer/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'opt/ml/pretrained_tokenizer/' is the correct path to a directory containing all relevant files [...] raise EnvironmentError [...] transformers/tokenization_utils_base.py", line 1768, in from_pretrained
END RequestId: bf011045-bed8-41eb-ac21-f98bfcee475a
I have tried to look through past questions but couldn't really fix anything. I would appreciate any help!
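One thing worth checking (an assumption on my part, not something confirmed in the post): the path in the error message, 'opt/ml/pretrained_tokenizer/', has no leading slash, so it is resolved relative to the Lambda working directory rather than as /opt/ml/pretrained_tokenizer. A minimal sketch of loading with an absolute path:
from transformers import XLNetTokenizer

# Absolute path matching the COPY destination in the Dockerfile (note the leading slash).
TOKENIZER_DIR = "/opt/ml/pretrained_tokenizer"

tokenizer = XLNetTokenizer.from_pretrained(TOKENIZER_DIR, do_lower_case=True)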

MLflow - empty artifact folder

All,
I started the mlflow server as below. I do see the backend store containing the expected metadata. However, the artifact folder is empty despite many runs.
mlflow server --backend-store-uri mlflow_db --default-artifact-root ./mlflowruns --host 0.0.0.0 --port 5000
The mlflow ui has the below message for the artifacts section:
No Artifacts Recorded
Use the log artifact APIs to store file outputs from MLflow runs.
What am I doing wrong?
Thanks,
grajee
Turns out that
"--backend-store-uri mlflow_db" was pointing to D:\python\Pythonv395\Scripts\mlflow_db
and
"--default-artifact-root ./mlflowruns" was pointing to D:\DataEngineering\MlFlow\Wine Regression\mlflowruns which is the project folder.
I was able to point both the output to one folder with the following syntax
file:/D:/DataEngineering/MlFlow/Wine Regression
In case you want to log artifacts to your server with the local file system as the object store, you should specify --serve-artifacts --artifacts-destination file:/path/to/your/desired/location instead of just a vanilla path.
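To verify where a run will actually put its files, a quick client-side check (a sketch; the localhost URI is an assumption about where the server above is reachable) is to print the run's resolved artifact location:
import mlflow

# Point at the server started above (adjust host/port to your setup).
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    # Prints e.g. an mlflow-artifacts:/... URI when --serve-artifacts is enabled,
    # or the raw --default-artifact-root path otherwise.
    print(mlflow.get_artifact_uri())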

gcloud error when deploying to google app engine flexible environment

Recently I have needed to add web sockets to my backend application currently hosted on Google App Engine (GAE) standard environment. Because web sockets are a feature only available in GAE's flexible environment, I have been attempting a redeployment but with little success.
To make the change to a flexible environment I have updated the app.yaml file from
runtime: nodejs10
env: standard
to
runtime: nodejs
env: flex
While this previously worked in the standard environment, now with env: flex, when I run the command gcloud app deploy --app-yaml=app-staging.yaml --verbosity=debug I get the following stack trace:
Do you want to continue (Y/n)? Y
DEBUG: No bucket specified, retrieving default bucket.
DEBUG: Using bucket [gs://staging.finnsalud.appspot.com].
DEBUG: Service [appengineflex.googleapis.com] is already enabled for project [finnsalud]
Beginning deployment of service [finnsalud-staging]...
INFO: Using ignore file at [~/checkouts/twilio/backend/.gcloudignore].
DEBUG: not expecting type '<class 'NoneType'>'
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 982, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 809, in Run
resources = command_instance.Run(args)
File "/google-cloud-sdk/lib/surface/app/deploy.py", line 115, in Run
return deploy_util.RunDeploy(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 669, in RunDeploy
deployer.Deploy(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 428, in Deploy
source_files = source_files_util.GetSourceFiles(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/source_files_util.py", line 184, in GetSourceFiles
return list(it)
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/util/gcloudignore.py", line 233, in GetIncludedFiles
six.ensure_str(upload_directory), followlinks=True):
File "//google-cloud-sdk/lib/third_party/six/__init__.py", line 884, in ensure_str
raise TypeError("not expecting type '%s'" % type(s))
TypeError: not expecting type '<class 'NoneType'>'
ERROR: gcloud crashed (TypeError): not expecting type '<class 'NoneType'>'
The stack trace mentions an error in google-cloud-sdk/lib/googlecloudsdk/command_lib/util/gcloudignore.py, so I also reviewed my .gcloudignore file but was unable to find anything out of place:
.gcloudignore
.git
.gitignore
node_modules/
In an attempt to work around this bug I tried removing my .gcloudignore file, which resulted in a different error but still failed:
Do you want to continue (Y/n)? Y
DEBUG: No bucket specified, retrieving default bucket.
DEBUG: Using bucket [gs://staging.finnsalud.appspot.com].
DEBUG: Service [appengineflex.googleapis.com] is already enabled for project [finnsalud]
Beginning deployment of service [finnsalud-staging]...
DEBUG: expected str, bytes or os.PathLike object, not NoneType
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 982, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 809, in Run
resources = command_instance.Run(args)
File "/google-cloud-sdk/lib/surface/app/deploy.py", line 115, in Run
return deploy_util.RunDeploy(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 669, in RunDeploy
deployer.Deploy(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/deploy_util.py", line 428, in Deploy
source_files = source_files_util.GetSourceFiles(
File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/app/source_files_util.py", line 184, in GetSourceFiles
return list(it)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/app/util.py", line 165, in FileIterator
entries = set(os.listdir(os.path.join(base, current_dir)))
File "/usr/local/Cellar/python#3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/posixpath.py", line 76, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
ERROR: gcloud crashed (TypeError): expected str, bytes or os.PathLike object, not NoneType
Thinking this might be an error related to the version of my CLI, I also ran the following commands to try to update:
gcloud app update
gcloud components update
Unfortunately, this made no change to the output.
I have noticed that when I run this command with the app.yaml env value set to flex, there are no updates to the logging section on Google Cloud and no changes to the files uploaded to the project's storage bucket. To me, this indicates that the crash occurs in the CLI before any communication with the Google Cloud services is made. If this is correct, then it seems unlikely that the cause of the error is a bad configuration on Google Cloud; it must be related to something (software or configuration) on my local machine.
I have also tried using the 'Hello World' app.yaml configuration from the flexible environment's 'Getting Started' page to rule out a configuration error in my own application's app.yaml, but this also had no effect on the output.
Finally, if at any point I change env: flex back to env: standard, the issue disappears. Unfortunately, as stated above, this won't work for deploying my web sockets feature.
This has gotten me thinking that the error is possibly due to a bug in the gcloud CLI application. However, if this were the case, I would have expected to see many more bug reports for this issue from others who are also using GAE's flexible environment.
Regardless, given this stack trace points to code within the gcloud cli, I have opened a bug ticket with google which can be found here: https://issuetracker.google.com/issues/176839574
I have also seen this similar SO post, but it is not the exact error I am experiencing and remains unresolved: gcloud app deploy fails with flexible environment
If anyone has any ideas on other steps to try or methods to overcome this issue, I would be immensely grateful if you drop a note on this post. Thanks!
I deployed a Node.js application using the Quickstart for Node.js in the standard environment.
Then I changed the app.yaml file from:
runtime: nodejs10
to
runtime: nodejs
env: flex
Everything worked as expected.
It might be related to your specific use case.
Surprisingly, this issue does seem to be related to a bug in the gcloud CLI. However, there does seem to be a workaround.
When an --appyaml flag is specified for a deployment to the flex environment, the CLI crashes with the messages outlined in my question above. However, if you copy your .yaml file, renaming it to app.yaml (the default), and delete the --appyaml flag when deploying, the build proceeds without errors.
If you have also experienced this error, please follow the Google issue, as I am working with the Google engineers to make sure they reproduce and eventually fix this bug.
Broken app.yaml:
runtime:nodejs14
Fixed app.yaml:
runtime: nodejs14
I am dead serious. And:
gcloud info --run-diagnostics
was ZERO HELP.
Once I did this, the "ERROR: gcloud crashed (TypeError): expected string or bytes-like object" went away.
I guess "colon + space" is part of the spec:
Why does the YAML spec mandate a space after the colon?
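You can see the difference directly with PyYAML (a small illustrative check, not part of the original answer): without the space, the whole line parses as a single scalar string rather than a key/value mapping.
import yaml  # PyYAML

print(yaml.safe_load("runtime:nodejs14"))   # -> 'runtime:nodejs14' (a plain string)
print(yaml.safe_load("runtime: nodejs14"))  # -> {'runtime': 'nodejs14'} (a mapping)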

Docker firewall issue with cBioportal

We are sitting behind a firewall and are trying to run a Docker image (cBioPortal). Docker itself could be installed via a proxy, but now we encounter the following issue:
Starting validation...
INFO: -: Unable to read xml containing cBioPortal version.
DEBUG: -: Requesting cancertypes from portal at 'http://cbioportal-container:8081'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error occurred during validation step:
Traceback (most recent call last):
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4491, in request_from_portal_api
response.raise_for_status()
File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://cbioportal-container:8081/api-legacy/cancertypes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/metaImport.py", line 127, in <module>
exitcode = validateData.main_validate(args)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4969, in main_validate
portal_instance = load_portal_info(server_url, logger)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4622, in load_portal_info
parsed_json = request_from_portal_api(path, api_name, logger)
File "/cbioportal/core/src/main/scripts/importer/validateData.py", line 4495, in request_from_portal_api
) from e
ConnectionError: Failed to fetch metadata from the portal at [http://cbioportal-container:8081/api-legacy/cancertypes]
Now we know that it is a firewall issue, because it works when we install it outside the firewall. But we do not know how to change the firewall yet. Our idea was to look up the files and lines which throw the errors, but we do not know how to look into those files since they are inside the Docker container.
So we cannot just do something like
vim /cbioportal/core/src/main/scripts/importer/validateData.py
...because there is nothing there. Of course we know this file is inside the Docker image, but as I said, we don't know how to look into it. At the moment we do not know how to solve this riddle - any help appreciated.
Maybe you still need this.
You can access this Python file within the container by using docker-compose exec cbioportal sh or docker-compose exec cbioportal bash.
Then you can use cd, cat, vi, vim, or similar to access the path given in your post.
I'm not sure which command you're actually running, but when I did the import call like
docker-compose run --rm cbioportal metaImport.py -u http://cbioportal:8080 -s study/lgg_ucsf_2014/lgg_ucsf_2014/ -o
I had to replace http://cbioportal:8080 with the server's IP address.
Also note that the study path is one level deeper than in the official documentation.
With cBioPortal behind a proxy, the study import is only available in offline mode:
First, get inside the container:
docker exec -it cbioportal-container bash
Then generate the portal info folder:
cd $PORTAL_HOME/core/src/main/scripts
./dumpPortalInfo.pl $PORTAL_HOME/my_portal_info_folder
Then import the study offline; the -o flag is important to overwrite despite warnings:
cd $PORTAL_HOME/core/src/main/scripts
./importer/metaImport.py -p $PORTAL_HOME/my_portal_info_folder -s /study/lgg_ucsf_2014 -v -o
Hope this helps.
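As a diagnostic (a sketch, not part of the original answers; it assumes the requests library is available inside the container), you can check whether the configured proxy is what breaks the internal call by hitting the failing endpoint with and without the proxy environment variables:
import requests

URL = "http://cbioportal-container:8081/api-legacy/cancertypes"

# Honors HTTP_PROXY/HTTPS_PROXY from the environment (the failing case).
print(requests.get(URL, timeout=30).status_code)

# Ignores proxy-related environment variables for a direct connection.
session = requests.Session()
session.trust_env = False
print(session.get(URL, timeout=30).status_code)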
