User program failed with ValueError: ZIP does not support timestamps before 1980 - azure-machine-learning-service

Running pipeline failed with the following error.
User program failed with ValueError: ZIP does not support timestamps before 1980
I created an Azure ML Pipeline that calls several child runs. See the attached code.
# start parent run
from azureml.core import Run, ScriptRunConfig, Workspace, Environment

run = Run.get_context()
workspace = run.experiment.workspace

runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"

# submit the child runs
for i in range(10):
    print("child run ...")
    run.submit_child(runconfig)
It seems that the timestamp of the Python script (simple-for-bug-check.py) is invalid.
My Python SDK version is 1.0.83.
Any workaround on this ?
Regards,
Keita

One workaround to the issue is setting the source_directory_data_store to a datastore pointing to a file share. Every workspace comes with a datastore pointing to a file share by default, so you can change the parent run submission code to:
# workspacefilestore is the datastore that is created with every workspace that points to a file share
run_config.source_directory_data_store = 'workspacefilestore'
The above applies if you are using a RunConfiguration; if you are using an Estimator, you can do the following:
from azureml.core import Datastore
from azureml.train.estimator import Estimator
datastore = Datastore(workspace, 'workspacefilestore')
est = Estimator(..., source_directory_data_store=datastore, ...)
The cause of the issue is that the current working directory in a run is a blobfuse-mounted directory, and in the current (1.2.4) as well as prior versions of blobfuse, the last modified date of every directory is set to the Unix epoch (1970/01/01). By changing the source_directory_data_store to a file share, the current working directory becomes a CIFS-mounted file share, which has the correct last modified time for directories and thus does not have this issue.
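Putting the question and the workaround together, a minimal sketch of the child-run submission (assuming the SDK 1.0.x attribute names quoted above and the default 'workspacefilestore' datastore):
# sketch: point the source directory at the workspace file share instead of the blobfuse-mounted default
from azureml.core import Run, ScriptRunConfig

run = Run.get_context()
runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"
runconfig.run_config.source_directory_data_store = 'workspacefilestore'
run.submit_child(runconfig)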

Related

Cannot load tokenizer from local

I just started using AWS Lambda and Docker, so I would appreciate any advice.
I am trying to deploy an ML model to AWS Lambda for inference. The image created from the Dockerfile successfully loads the XLNet model from a local dir; however, it gets stuck when doing the same thing for the tokenizer.
In the pretrained_tokenizer folder, I have 4 files saved from tokenizer.save_pretrained(...) and config.save_pretrained(...)
In Dockerfile, I have tried multiple things, including:
copy the folder: COPY app/pretrained_tokenizer/ opt/ml/pretrained_tokenizer/
copy each file from the folder with separate COPY commands
compress the folder to .tar.gz and use ADD pretrained_tokenizer.tar.gz /opt/ml/ (which is supposed to extract the tar file in the process)
In my python script, I tried to load the tokenizer using tokenizer = XLNetTokenizer.from_pretrained(tokenizer_file, do_lower_case=True), which works on Colab, but not when I do an invocation of the image through sam local invoke -e events/event.json; the error is
[ERROR] OSError: Can't load tokenizer for 'opt/ml/pretrained_tokenizer/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'opt/ml/pretrained_tokenizer/' is the correct path to a directory containing all relevant ...
  File "...ers/tokenization_utils_base.py", line 1768, in from_pretrained
    raise EnvironmentError(
END RequestId: bf011045-bed8-41eb-ac21-f98bfcee475a
I have tried looking through past questions but couldn't really fix anything. I would appreciate any help!
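For reference, a minimal sketch of loading a saved tokenizer from a local directory inside the container, assuming the files were copied to the absolute path /opt/ml/pretrained_tokenizer/ (note the error above shows the relative path 'opt/ml/pretrained_tokenizer/' without a leading slash, which gets resolved against the Lambda working directory):
from transformers import XLNetTokenizer

# assumption: tokenizer.save_pretrained(...) wrote its files to /opt/ml/pretrained_tokenizer/ in the image
tokenizer = XLNetTokenizer.from_pretrained("/opt/ml/pretrained_tokenizer/", do_lower_case=True)
print(tokenizer.tokenize("hello world"))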

Get local workspace in azureml

I am trying to run a machine learning experiment in azureml.
I can't figure out how to get the workspace context from the control script. Examples like this one in the microsoft docs use Workspace.from_config(). When I use this in the control script I get the following error:
"message": "We could not find config.json in: [path] or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories."
I've also tried including my subscription id and the resource specs like so:
subscription_id = 'id'
resource_group = 'name'
workspace_name = 'name'
workspace = Workspace(subscription_id, resource_group, workspace_name)
In this case I have to monitor the log and authenticate on each run as I would locally.
How do you get the local workspace from a control script for azureml?
Using Workspace.from_config() method:
The workspace configuration file is a JSON file that tells the SDK how to communicate with your Azure Machine Learning workspace. The file is named config.json, and it has the following format:
{"subscription_id": "<subscription-id>",
"resource_group": "<resource-group>",
"workspace_name": "<workspace-name>"}
IMPORTANT: This JSON file must be in the directory structure that contains your
Python scripts or Jupyter Notebooks. It can be in the same directory,
a subdirectory named .azureml, or in a parent directory.
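For example, with config.json placed as described above, a minimal sketch of loading the workspace looks like this:
from azureml.core import Workspace

# from_config() searches the current directory, a .azureml subdirectory and parent directories for config.json
ws = Workspace.from_config()
print(ws.name)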
Alternatively, use the get method to load an existing workspace without a configuration file (in your case, your code is missing the .get()):
ws = Workspace.get(name="myworkspace", subscription_id='<azure-subscription-id>', resource_group='myresourcegroup')
What is the development system that you are using? A DSVM in the AML workspace or your local dev system?
If it is your local machine, then use the following to write a config file to your project root directory under the path /.azureml/config.json:
from azureml.core import Workspace
subscription_id = 'xxxx-xxxx-xxxx-xxxx-xxxx'
resource_group = 'your_resource_group'
workspace_name = 'your_workspace_name'
try:
    ws = Workspace(subscription_id=subscription_id,
                   resource_group=resource_group,
                   workspace_name=workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
except:
    print('Workspace not found')
Or else, if it is a DSVM, then you are all set; Workspace.from_config() should work.
Note: you should see the .config directory under your user name in AML studio.
This had no answers for 10 months, and now they are coming in :). I figured this out quite a while ago but haven't gotten around to posting the answer. Here it is.
From the training script, you can get the workspace from the run context as follows:
from azureml.core import Run

run = Run.get_context()
ws = run.experiment.workspace

MLflow remote execution on databricks from windows creates an invalid dbfs path

I'm researching the use of MLflow as part of our data science initiatives and I wish to set up a minimum working example of remote execution on databricks from windows.
However, when I perform the remote execution, a path is created locally on Windows in the MLflow package and sent to Databricks. This path specifies the upload location of the '.tar.gz' file corresponding to the GitHub repo containing the MLflow project. In cmd this path has a combination of '\' and '/', but on Databricks there are no separators at all in the path, which raises the 'rsync: No such file or directory (2)' error.
To be more general, I reproduced the error using a standard MLflow example and following this guide from Databricks. The MLflow example is sklearn_elasticnet_wine, but I had to add a default value to a parameter, so I forked it; the MLproject that can be executed remotely can be found at (forked repo).
The Project can be executed remotely by the following command (assuming a databricks instance has been set up)
mlflow run https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine -b databricks -c db-clusterconfig.json --experiment-id <insert-id-here>
where "db-clusterconfig.json" correspond to the cluster to set up in databricks and is in this example set to
{
    "autoscale": {
        "min_workers": 1,
        "max_workers": 2
    },
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "driver_node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
}
When running the project remotely, this is the output in cmd:
2019/10/04 10:09:50 INFO mlflow.projects: === Fetching project from https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine into C:\Users\ARNTS\AppData\Local\Temp\tmp2qzdyq9_ ===
2019/10/04 10:10:04 INFO mlflow.projects.databricks: === Uploading project to DBFS path /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Finished uploading project to /dbfs\mlflow-experiments\3947403843428882\projects-code\aa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz ===
2019/10/04 10:10:05 INFO mlflow.projects.databricks: === Running entry point main of project https://github.com/aestene/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
2019/10/04 10:10:06 INFO mlflow.projects.databricks: === Launched MLflow run as Databricks job run with ID 8. Getting run status page URL... ===
2019/10/04 10:10:18 INFO mlflow.projects.databricks: === Check the run's status at https://<region>.azuredatabricks.net/?o=<databricks-id>#job/8/run/1 ===
Note that the DBFS path has a leading '/' while the remaining separators are '\'.
The command spins up a cluster in databricks and is ready to execute the job, but ends up with the following error message on the databricks side:
rsync: link_stat "/dbfsmlflow-experiments3947403843428882projects-codeaa5fbb4769e27e1be5a983751eb1428fe998c3e65d0e66eb9b4c77355076f524.tar.gz" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.1]
Here we can see the same path, but with the '\' separators stripped out entirely. I narrowed down the creation of this path to this file in the MLflow GitHub repo, where the following code creates the path (line 133):
dbfs_path = os.path.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                         "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = os.path.join("/dbfs", dbfs_path)
My current hypothesis is that os.path.join() in the first line joins the strings together in a "Windows fashion", so they are separated by backslashes. The following call to os.path.join() then adds a '/'. The Databricks file system is unable to handle this path, so the '.tar.gz' file is either not uploaded properly or accessed at the wrong path.
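A small illustration of that hypothesis (ntpath is the implementation os.path resolves to on Windows and posixpath the one DBFS paths need, so this runs the same on any platform; the experiment id and file name are made up for the example):
import ntpath
import posixpath

# what the quoted code effectively does on Windows: backslash separators after the leading '/dbfs'
print(ntpath.join("/dbfs", ntpath.join("mlflow-experiments", "1234", "projects-code", "proj.tar.gz")))
# /dbfs\mlflow-experiments\1234\projects-code\proj.tar.gz

# what a DBFS path needs: forward slashes, regardless of the client OS
print(posixpath.join("/dbfs", posixpath.join("mlflow-experiments", "1234", "projects-code", "proj.tar.gz")))
# /dbfs/mlflow-experiments/1234/projects-code/proj.tar.gz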
It should also be mentioned that the project runs fine locally.
I'm running the following versions:
Windows 10
Python 3.6.8
MLflow 1.3.0 (also replicated the fault with 1.2.0)
Any feedback or suggestions are greatly appreciated!
Thanks for the catch, you're right that using os.path.join when working with DBFS paths is incorrect and results in a malformed path that breaks project execution. I've filed https://github.com/mlflow/mlflow/issues/1926 to track this; if you're interested in making a bugfix PR (see the MLflow contributor guide for info on how to do this) to replace os.path.join here with posixpath.join, I'd be happy to review :)
Thanks for raising this issue.
I also encountered the same thing on Windows 10.
I resolved the issue by replacing all 'os.path' calls with 'posixpath' in the 'databricks.py' file.
It worked perfectly fine for me.
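For reference, a minimal sketch of that fix applied to the snippet quoted above from databricks.py (DBFS_EXPERIMENT_DIR_BASE, experiment_id and tarfile_hash are the names used in that snippet):
import posixpath

# posixpath.join always uses '/', independent of the client OS, which is what DBFS expects
dbfs_path = posixpath.join(DBFS_EXPERIMENT_DIR_BASE, str(experiment_id),
                           "projects-code", "%s.tar.gz" % tarfile_hash)
dbfs_fuse_uri = posixpath.join("/dbfs", dbfs_path)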

exec sh from PySpark

I'm trying to run a .sh file from a .py file in a PySpark job, but I always receive a message saying the .sh file is not found.
This is my code:
test.py:
import os,sys
os.system("sh ./check.sh")
and my gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster mserver file:///home/myuser/test.py
The test.py file is loaded fine, but the system can't find the check.sh file.
I figured out that it's something related to the file's path, but I'm not sure.
I also tried os.system("sh home/myuser/check.sh") with the same result.
I think this should be easy to do, so... any ideas?
The "current working directory" used by Dataproc jobs submitted through the API is a temporary directory with a unique name for each job; if the file wasn't uploaded with the job itself, you'll have to access it using your absolute path.
If you indeed added the check.sh file manually to /home/myuser/check.sh, then you should be able to call it using the fully qualified path, os.system("sh /home/myuser/check.sh"); make sure to start your absolute path with a /.
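A minimal sketch of that suggestion for test.py, assuming check.sh was placed at /home/myuser/check.sh on the driver node:
import os
import subprocess

script = "/home/myuser/check.sh"  # absolute path, since the job's working directory is a per-job temp dir

# fail early with a clear message if the script is not on this node
if not os.path.exists(script):
    raise FileNotFoundError(script)

subprocess.run(["sh", script], check=True)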

Runtime.exec() in Hadoop on Azure environment

This question is related to Hadoop on Azure environment.
I am trying to use Runtime.exec() to execute a batch script in the reduce function. I could not get this running in the Hadoop on Azure environment, while it runs fine in Hadoop on Linux. I tested the Runtime.exec() code snippet in my desktop (Windows 7) environment and it runs fine there. I have made sure that I consume the output and error streams of the sub-process after Runtime.exec().
The batch script contains the below (a single command):
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
-f c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_000001.out
-i c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
I distribute the tool.exe and input.txt files using the distributed cache, and it creates a symlink from the working directory. tool.exe and input.txt point to the actual files in the jobcache directory.
2012-07-16 04:31:51,613 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-978619214658189372_-1497645545_209290723/10.73.50.78tool.exe <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\tool.exe
2012-07-16 04:31:51,644 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /hdfs/mapred/local/taskTracker/distcache/-4944695173898834237_1545037473_2085004342/10.73.50.78input.txt <- \hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\attempt_201207121317_0024_r_000001_0\work\input.txt
The reducer gives the below error when it runs.
Command Execution Error: Cannot run program
"cmd /q /c c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat":
CreateProcess error=2, The system cannot find the file specified
In another case, I tried running the same command, but without using the absolute paths. The output stream from the sub-process is shown below:
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0022\attempt_201207121317_0022_r_000000_0\work>tool.exe -f /hdfs/mapred/local/taskTracker/nabeel/jobcache/job_201207121317_0022/work/1_task_201207121317_0022_r_000000.out
-i input.txt
I do not know how the job working directory paths and the distributed cache work in the Hadoop on Azure environment. Could you please let me know if I am missing something here, or if there is something I need to take care of when using Runtime.exec() in the Hadoop on Azure environment?
Thanks,
I am not familiar with Hadoop, but the error message seems obvious. It would be better if you check whether the file exists:
c:\hdfs\mapred\local\taskTracker\nabeel\jobcache\job_201207121317_0024\work\11_task_201207121317_0024_r_0000011513543720767963399.bat
Best Regards,
Ming Xu
