MLflow remote execution on Databricks from Windows creates an invalid DBFS path

I'm researching the use of MLflow as part of our data science initiatives, and I want to set up a minimum working example of remote execution on Databricks from Windows.
However, when I perform the remote execution, a path is created locally on Windows inside the MLflow package and sent to Databricks. This path specifies the upload location of the '.tar.gz' file corresponding to the GitHub repo containing the MLflow Project. In cmd this path contains a mix of '\' and '/', but on Databricks there are no separators at all, which raises the 'rsync: No such file or directory (2)' error.
To make the problem reproducible, I recreated the error using a standard MLflow example while following this guide from Databricks. The MLflow example is sklearn_elasticnet_wine, but I had to add a default value to a parameter, so I forked it; the MLproject that can be executed remotely can be found at (forked repo).
The project can be executed remotely with the following command (assuming a Databricks instance has been set up):
mlflow run https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine -b databricks -c db-clusterconfig.json --experiment-id <insert-id-here>
where "db-clusterconfig.json" correspond to the cluster to set up in databricks and is in this example set to
{
    "autoscale": {
        "min_workers": 1,
        "max_workers": 2
    },
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
}
When running the project remotely, this is the output in cmd:
2019/10/04 10:09:50 INFO mlflow.projects: === Fetching project from https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine into C:\Users\ARNTS\AppData\Local\Temp\tmp2qzdyq9_ ===
2019/10/04 10:10:04 INFO mlflow.projects.databricks: === Uploading project to DBFS path /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Finished uploading project to /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2019/10/04 10:10:06 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 8. Getting run status page URL... ===
2019/10/04 10:10:18 INFO mlflow.projects.databricks: === Check the run's status at https://<region>.azuredatabricks.net/?o=<databricks-id>#job/8/run/1 ===
Note that the DBFS path starts with a '/' while all the remaining separators are '\'.
The command spins up a cluster in Databricks that is ready to execute the job, but the run ends with the following error message on the Databricks side:
rsync: link_stat "/dbfsmlflow-experiments3947403843428882projects-codeaa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.1]
Here we can see the same path, but with the '\' characters stripped out. I narrowed the creation of this path down to this file in the MLflow GitHub repo, where the following code builds the path (line 133):
dbfs_path = os.path.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                         "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = os.path.join("/dbfs", dbfs_path)
My current hypothesis is that os.path.join() in the first line joins the strings together in a "Windows fashion", so they are separated by backslashes. The following call to os.path.join() then prepends the literal "/dbfs", leaving a path that mixes a leading '/' with '\' separators. The Databricks file system is then unable to handle this path, so the '.tar.gz' file is either not uploaded properly or is accessed at the wrong path.
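To illustrate the hypothesis, here is a minimal sketch comparing the two join functions. The values below are made up for illustration (in particular, the actual DBFS_EXPERIMENT_DIR_BASE constant in MLflow may differ):

import os.path
import posixpath

# Hypothetical values, for illustration only.
DBFS_EXPERIMENT_DIR_BASE = "mlflow-experiments"  # assumed value of the MLflow constant
experiment_id = 3947403843428882
tarfile_hash = "aa5fbb4769e2"

# On Windows, os.path is ntpath, so backslashes are used as separators.
dbfs_path = os.path.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                         "projects-code", "%s.tar.gz" % tarfile_hash)
print(os.path.join("/dbfs", dbfs_path))
# On Windows: /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e2.tar.gz

# posixpath always uses '/', regardless of the host OS.
dbfs_path = posixpath.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                           "projects-code", "%s.tar.gz" % tarfile_hash)
print(posixpath.join("/dbfs", dbfs_path))
# On any OS: /dbfs/mlflow-experiments/3947403843428882/projects-code/aa5fbb4769e2.tar.gz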
It should also be mentioned that the project runs fine locally.
I'm running the following versions:
Windows 10
Python 3.6.8
MLflow 1.3.0 (also replicated the fault with 1.2.0)
Any feedback or suggestions are greatly appreciated!

Thanks for the catch, you're right that using os.path.join when working with DBFS paths is incorrect, resulting in a malformed path that breaks project execution. I've filed https://github.com/mlflow/mlflow/issues/1926 to track this. If you're interested in making a bugfix PR (see the MLflow contributor guide for info on how to do this) to replace os.path.join here with posixpath.join, I'd be happy to review :)

Thanks for filing this issue.
I also encountered the same problem on Windows 10.
I resolved it by replacing all 'os.path' calls with 'posixpath' in the 'databricks.py' file.
It worked perfectly for me.
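For reference, here is a minimal sketch of what that replacement looks like for the snippet quoted above (the actual fix merged upstream may differ in its details):

import posixpath

# DBFS paths are always POSIX-style, so build them with posixpath
# even when the client runs on Windows.
dbfs_path = posixpath.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                           "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = posixpath.join("/dbfs", dbfs_path)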

Related

Azure Pipeline returns "undefined" in filepath

I am running my WebdriverIO e2e tests on a GitLab pipeline. Now I am trying to integrate my e2e tests after the deployment on Azure. The tests run after deployment as expected, except that I have a strange error on Azure.
I have a test case to upload a file. Here is how I get the file:
const filePath = process.env.PWD + '/test/resources/files/logo.jpg'
console.log('file = ' + filePath);
When my tests run on the GitLab pipeline, the file can be located as follows:
file = /builds/hw2yvbjx/0/xxx/xxx/xxx/xxx/e2e/test/resources/files/logo.jpg
But when my tests run on the Azure pipeline, the file path contains undefined, as follows:
file = undefined/test/resources/files/logo.jpg
and the full log is as follows:
Error: ENOENT: no such file or directory, open 'C:\azagent\xxx\xxx\xxx\xxx\_xxx\e2e\undefined\test\resources\files\logo.jpg'
The path is correct except that an extra undefined is inserted between e2e and test. Does anyone know why this undefined is appended to the path, and how to fix it?
Thanks
process.env.PWD will log as undefined on Windows. The PWD environment variable is a 'Linux thing'.
That's why you're seeing this in your log statement in Azure (running on Windows, deduced based on the path starting with C:\...) but not in your GitLab job, which runs on a Linux host.
To fix it, you can use process.cwd(), which is platform agnostic, instead of process.env.PWD.

MLflow saves models to relative place instead of tracking_uri

Sorry if my question is too basic, but I cannot solve it.
I am currently experimenting with MLflow and facing the following issue:
Even though I have set the tracking_uri, the MLflow artifacts are saved to the ./mlruns/... folder relative to the directory from which I run mlflow run path/to/train.py (on the command line). The MLflow server then searches for the artifacts following the tracking_uri (mlflow server --default-artifact-root here/comes/the/same/tracking_uri).
The following example should make clear what I mean:
I set the following in the training script before the with mlflow.start_run() as run:
mlflow.set_tracking_uri("file:///home/#myUser/#SomeFolders/mlflow_artifact_store/mlruns/")
My expectation would be that MLflow saves all the artifacts to the location I set in the URI above. Instead, it saves the artifacts relative to the place from which I run mlflow run path/to/train.py, i.e. running the following
/home/#myUser/ mlflow run path/to/train.py
creates the structure:
/home/#myUser/mlruns/#experimentID/#runID/artifacts
/home/#myUser/mlruns/#experimentID/#runID/metrics
/home/#myUser/mlruns/#experimentID/#runID/params
/home/#myUser/mlruns/#experimentID/#runID/tags
and therefore it doesn't find the run artifacts in the tracking_uri, giving the error message:
Traceback (most recent call last):
  File "train.py", line 59, in <module>
    with mlflow.start_run() as run:
  File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 204, in start_run
    active_run_obj = client.get_run(existing_run_id)
  File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/client.py", line 151, in get_run
    return self._tracking_client.get_run(run_id)
  File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/tracking/_tracking_service/client.py", line 57, in get_run
    return self.store.get_run(run_id)
  File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 524, in get_run
    run_info = self._get_run_info(run_id)
  File "/home/#myUser/miniconda3/envs/mlflow-ff56d6062d031d43990effc19450800e72b9830b/lib/python3.6/site-packages/mlflow/store/tracking/file_store.py", line 544, in _get_run_info
    "Run '%s' not found" % run_uuid, databricks_pb2.RESOURCE_DOES_NOT_EXIST
mlflow.exceptions.MlflowException: Run '788563758ece40f283bfbf8ba80ceca8' not found
2021/07/23 16:54:16 ERROR mlflow.cli: === Run (ID '788563758ece40f283bfbf8ba80ceca8') failed ===
Why is that so? How can I change the place where the artifacts are stored and where this directory structure is created? I have tried mlflow run --storage-dir here/comes/the/path, as well as setting the tracking_uri and registry_uri. If I run mlflow run path/to/train.py from /home/path/to/tracking/uri, it works, but I need to run the scripts remotely.
My end goal is to change the artifact URI to an NFS drive, but even on my local computer I cannot make this work.
Thanks for reading it, even more thanks if you suggest a solution! :)
Have a great day!
This issue was solved by the following:
I had mixed up the tracking_uri with the backend_store_uri.
The tracking_uri is where the MLflow-related data (e.g. tags, parameters, metrics, etc.) is saved, which can be a database. On the other hand, the artifact_location is where the artifacts are stored (other data, not MLflow-related, belonging to the preprocessing/training/evaluation/etc. scripts).
What led me to mistakes is that, when running mlflow server from the command line, one should pass the tracking_uri as --backend-store-uri (and also set it in the script via mlflow.set_tracking_uri()) and the location of the artifacts as --default-artifact-root. Somehow I hadn't understood that tracking_uri = backend_store_uri.
Here's my solution:
Launch the server:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME
Set the tracking URI to an HTTP URI like:
mlflow.set_tracking_uri("http://my-tracking-server:5000/")

User program failed with ValueError: ZIP does not support timestamps before 1980

Running the pipeline failed with the following error:
User program failed with ValueError: ZIP does not support timestamps before 1980
I created an Azure ML pipeline that calls several child runs. See the attached code:
from azureml.core import Run, ScriptRunConfig, Workspace, Environment

# start parent Run
run = Run.get_context()
workspace = run.experiment.workspace

runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"

# Submit the child runs
for i in range(10):
    print("child run ...")
    run.submit_child(runconfig)
It seems that the timestamp of the Python script (simple-for-bug-check.py) is invalid.
My Python SDK version is 1.0.83.
Any workaround on this ?
Regards,
Keita
One workaround to the issue is setting the source_directory_data_store to a datastore pointing to a file share. Every workspace comes with a datastore pointing to a file share by default, so you can change the parent run submission code to:
# workspacefilestore is the datastore that is created with every workspace that points to a file share
run_config.source_directory_data_store = 'workspacefilestore'
if you are using RunConfiguration. If you are using an estimator instead, you can do the following:
datastore = Datastore(workspace, 'workspacefilestore')
est = Estimator(..., source_directory_data_store=datastore, ...)
The cause of the issue is that the current working directory in a run is a blobfuse-mounted directory, and in the current (1.2.4) as well as prior versions of blobfuse, the last-modified date of every directory is set to the Unix epoch (1970/01/01). Changing the source_directory_data_store to a file share changes the current working directory to a CIFS-mounted file share, which has the correct last-modified time for directories and thus does not have this issue.
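Applied to the submission code from the question, the workaround looks roughly like the sketch below (the names are taken from the snippets above; 'workspacefilestore' is the default file-share datastore name and may differ in your workspace):

from azureml.core import Run, ScriptRunConfig

run = Run.get_context()
workspace = run.experiment.workspace

runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"
# Snapshot the source directory through the workspace file share instead of the
# blobfuse mount, so directory timestamps are not reset to 1970/01/01.
runconfig.run_config.source_directory_data_store = 'workspacefilestore'

for i in range(10):
    print("child run ...")
    run.submit_child(runconfig)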

How to troubleshoot package loading error in spark

I'm using Spark on HDInsight with a Jupyter notebook. I'm using the %%configure "magic" to import packages. Every time there is a problem with the package, Spark crashes with the error:
The code failed because of a fatal error: Status 'shutting_down' not
supported by session..
or
The code failed because of a fatal error: Session 28 unexpectedly
reached final status 'dead'. See logs:
Usually the problem was with me mistyping the name of the package, so after a few attempts I could solve it. Now I'm trying to import spark-streaming-eventhubs_2.11 and I think I got the name right, but I still receive the error. I looked at all kinds of logs but still couldn't find the one which shows any relevant info. Any idea how to troubleshoot similar errors?
%%configure -f
{ "conf": {"spark.jars.packages": "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.5" }}
Additional info: when I run
spark-shell --conf spark.jars.packages=com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.5
The shell starts fine, and downloads the package
I was finally able to find the log files that contain the error. There are two log files that could be interesting:
Livy log: livy-livy-server.out
Yarn log
On my HDInsight cluster, I found the Livy log by connecting to one of the head nodes with SSH and downloading a file at this path (this log didn't contain useful info):
/var/log/livy/livy-livy-server.out
The actual error was in the yarn log file accessible from YarnUI. In HDInsight Azure Portal, go to "Cluster dashboard" -> "Yarn", find your session (KILLED status), click on "Logs" in the table, find "Log Type: stderr", click "click here for full log".
The problem in my case was a Scala version incompatibility between one of the dependencies of spark-streaming_2.11 and Livy. This is supposed to be fixed in Livy 0.4. More info here.

How to use external package in Jupyter of Azure Spark

I am trying to add an external package in Jupyter of Azure Spark.
%%configure -f
{ "packages" : [ "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" ] }
Its output:
Current session configs: {u'kind': 'spark', u'packages': [u'com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4']}
But when I tried to import:
import org.apache.spark.streaming.eventhubs.EventHubsUtils
I got an error:
The code failed because of a fatal error: Invalid status code '400'
from
http://an0-o365au.zdziktedd3sexguo45qd4z4qhg.xx.internal.cloudapp.net:8998/sessions
with error payload: "Unrecognized field \"packages\" (class
com.cloudera.livy.server.interactive.CreateInteractiveRequest), not
marked as ignorable (15 known properties: \"executorCores\", \"conf\",
\"driverMemory\", \"name\", \"driverCores\", \"pyFiles\",
\"archives\", \"queue\", \"kind\", \"executorMemory\", \"files\",
\"jars\", \"proxyUser\", \"numExecutors\",
\"heartbeatTimeoutInSecond\" [truncated]])\n at [Source:
HttpInputOverHTTP#5bea54d; line: 1, column: 32] (through reference
chain:
com.cloudera.livy.server.interactive.CreateInteractiveRequest[\"packages\"])".
Some things to try: a) Make sure Spark has enough available resources
for Jupyter to create a Spark context. For instructions on how to
assign resources see http://go.microsoft.com/fwlink/?LinkId=717038 b)
Contact your cluster administrator to make sure the Spark magics
library is configured correctly.
I also tried:
%%configure
{ "conf": {"spark.jars.packages": "com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4" }}
Got the same error.
Could someone point me to the correct way to use an external package in Jupyter on Azure Spark?
If you're using HDInsight 3.6, then use the following. Also, be sure to restart your kernel before executing this:
%%configure -f
{"conf":{"spark.jars.packages":"com.microsoft.azure:spark-streaming-eventhubs_2.11:2.0.4"}}
Also, ensure that your package name, version and scala version are correct. Specifically, the JAR that you're trying to use has changed names since the posting of this question. More information on what it is called now can be found here: https://github.com/Azure/azure-event-hubs-spark.
