I'm running a test SageMaker PyTorch training job.
It creates the estimator and runs the training successfully.
However, it dies during the "Uploading generated training model" step.
The error is "Error for Training job pytorch-training-2022-12-05-19-45-41-370: Failed. Reason: ClientError: Artifact upload failed:Too many files are written"
# Create the estimator
estimator = PyTorch(
    entry_point="CloudSeg.py",
    input_mode="FastFile",
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    hyperparameters={"epochs": 1, "backend": "nccl"},
)

# Fit with the training data
estimator.fit({"training": "s3://bucket/DATA/"})
The result of the fit is:
2022-12-05 19:54:10 Training - Training image download completed. Training in progress.
2022-12-05 19:54:10 Uploading - Uploading generated training model
2022-12-05 19:54:10 Failed - Training job failed
ProfilerReport-1670269542: Stopping
UnexpectedStatusException                 Traceback (most recent call last)
/tmp/ipykernel_19821/1489485288.py in <cell line: 1>()
----> 1 estimator.fit({"training": 's3://picard-prov/38-cloud-simple-unet_DATA/'})
...
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3891
   3892     if wait:
-> 3893         self._check_job_status(job_name, description, "TrainingJobStatus")
   3894         if dot:
   3895             print()
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3429         actual_status=status,
   3430     )
-> 3431     raise exceptions.UnexpectedStatusException(
   3432         message=message,
   3433         allowed_statuses=["Completed", "Stopped"],
UnexpectedStatusException: Error for Training job pytorch-training-2022-12-05-19-45-41-370: **Failed. Reason: ClientError: Artifact upload failed:Too many files are written**
Any idea on how to solve this?
Thank you!
I tried getting rid of FastFile mode. It didn't help.
When training completes, SageMaker processes the training outputs; among other things, it uploads the files that CloudSeg.py placed in /opt/ml/model. Check how many files you end up placing in these output folders that SageMaker uploads to S3 on your behalf (according to the error message it's too many):
/opt/ml/model
/opt/ml/output
You could write code to print out the files stored in there as the last step of your algorithm, or use something like SageMaker SSH Helper to interactively inspect what is happening.
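For example, here is a minimal sketch of such a check (my own illustrative helper, not part of the SageMaker SDK) that you could drop in at the end of CloudSeg.py:

import os

def report_output_files(paths=("/opt/ml/model", "/opt/ml/output")):
    # Illustrative helper: count and list the files SageMaker will try to upload.
    for root_dir in paths:
        files = []
        for dirpath, _, filenames in os.walk(root_dir):
            files.extend(os.path.join(dirpath, name) for name in filenames)
        print(f"{root_dir}: {len(files)} files")
        for path in files[:50]:  # only the first 50, to keep the training log readable
            print("  ", path)

report_output_files()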
I think I fixed the problem. I changed my "checkpoint_s3_bucket" to the session default bucket. Haven't gotten an error since.
bucket = sagemaker.Session().default_bucket()
base_job_name = "sagemaker-checkpoint-test"
checkpoint_in_bucket = "checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
Related
I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})

run_id = response.json()["Id"]
Here is the error in azureml-logs/70_driver_log.txt:
[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
File "driver/amlbi_main.py", line 48, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
Here are the errors in logs/sys/warning.txt:
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry
[...]
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:
with the same URL.
Next...
When I wait a few minutes and rerun the following code/cell.
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})

run_id = response.json()["Id"]
It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference). Also, there is no user error here in the logs.
Any ideas as to what could be going on here?
Thanks in advance for any insights you might have, and happy coding! :)
========== UPDATE #1: ==========
Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings, all result in a failure from time to time. Changed sklearn models to use n_jobs=1. We're scoring text data for NLP work.
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig

default_ds = ws.get_default_datastore()

# output dataset
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of scoring script
experiment_folder = 'model_pipeline'

# run invocation timeout: 24 hours
rit = 60 * 60 * 24

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
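For reference, the config and the output get wired into the ParallelRunStep roughly like the sketch below; the step name and the input dataset variable are placeholders, not our exact code:

from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunStep

# Sketch only: "score-step" and `input_ds` are placeholder names.
parallel_step = ParallelRunStep(
    name="score-step",
    parallel_run_config=parallel_run_config,          # the config defined above
    inputs=[input_ds.as_named_input("input_data")],   # the dataset being scored
    output=output_dir,                                # the OutputFileDatasetConfig above
    allow_reuse=False,
)

pipeline = Pipeline(workspace=ws, steps=[parallel_step])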
Our next test was going to be to chuck each row of data into its own file. I tried this with just 30 rows, i.e. 30 files each with 1 record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.
2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1.
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
master.wait()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
file_helper.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
self.analyze_source()
File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
And on the rounds where it does complete, only some of the records are returned. One time the # of records returned I think was 25 or 23, and another time it was 15.
========== UPDATE #2: 12/17/2020 ==========
I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record, and the job completes sometimes, but it doesn't return 30 records. Other times it returns an error, and I'm still getting the "No temp file found" error.
I think I might have answered my own question. I think the issue was with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
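For reference, a minimal sketch of the switch, with the same ParallelRunStep wiring as before (names are placeholders):

from azureml.pipeline.core import PipelineData

default_ds = ws.get_default_datastore()

# PipelineData writes the step output under a run-specific folder in the datastore
output_dir = PipelineData(name="inferences", datastore=default_ds)

# ...and output=output_dir is then passed to the ParallelRunStep exactly as before.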
The thing I still don't understand is how we're supposed to pick up the results of an ML Studio Pipeline from a Data Factory Pipeline without OutputFileDatasetConfig? PipelineData outputs the results in a folder based on the child step run id, so how is Data Factory supposed to know where to get the results? Would love to hear any feedback anyone might have. Thanks :)
== Update ==
For picking up results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline
== Update #2 ==
https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789
Hi #yeamusic21, thank you for your feedback, in current version, OutputDatasetConfig can't work with ParallelRunStep, we are working on fixing it.
I have been trying to use this GitHub repo (https://github.com/AntixK/PyTorch-VAE) and load the CelebA dataset using the config file provided. Specifically, in vae.yaml I have set the path to the unzipped folder where I downloaded the CelebA dataset (https://www.kaggle.com/jessicali9530/celeba-dataset) on my computer. And every time I run the program, I keep getting these errors:
File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py", line 67, in init
' You can use download=True to download it')
RuntimeError: Dataset not found or corrupted. You can use download=True to download it
AttributeError: 'VAEXperiment' object has no attribute '_lazy_train_dataloader'
I have tried to download the dataset, but nothing changes. So I have no idea why the program is not running.
run.py calls experiment.py, which uses this dataloader to retrieve the data:
def train_dataloader(self):
    transform = self.data_transforms()

    if self.params['dataset'] == 'celeba':
        dataset = CelebA(root=self.params['data_path'],
                         split="train",
                         transform=transform,
                         download=False)
    else:
        raise ValueError('Undefined dataset type')
    self.num_train_imgs = len(dataset)
    return DataLoader(dataset,
                      batch_size=self.params['batch_size'],
                      shuffle=True,
                      drop_last=True)
The config file picks up the path given as the root. So what I did was upload a few files to Google Colab (some .jpg files), and when I run the command stated in the GitHub repo, python run.py -c config/vae.yaml, it states that the dataset is not found or is corrupt. I have tried this on my Linux machine and the same error occurs, even when I used the downloaded and unzipped archive. I have gone further and attempted to change self.params['data_path'] to the actual path, and that still does not work. Any ideas what I can do?
My PyTorch version is 1.6.0.
There are two issues I have faced. Below is my solution; it is not official, but it works for me. Hopefully the next torchvision release will fix it.
Issue: Dataset not found or corrupted.
When I checked the file celeba.py in the torchvision library, I found this line:
if ext not in [".zip", ".7z"] and not check_integrity(fpath, md5):
    return False
This check makes self._check_integrity() return False, and the program then raises the error message we got.
Solution: you can skip this check by adding an "if False:" immediately in front of it:
if False:
    if ext not in [".zip", ".7z"] and not check_integrity(fpath, md5):
        return False
celeba.py downloads the dataset if you pass download=True, but two of the downloaded files are broken: "list_landmarks_align_celeba.txt" and "list_attr_celeba.txt".
You need to find them somewhere else, download them, and replace the broken copies.
Hope these solutions help you!
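As a related note (my own sketch, not from the repo): torchvision's CelebA class with download=False expects the files to sit under <root>/celeba/, so the Kaggle archive has to be arranged like this before the CelebA(...) call will find it. The file names below are the ones torchvision looks for; the Kaggle files may need renaming or converting to match.

# Expected layout for torchvision.datasets.CelebA with download=False
# (<data_path> is whatever you set in vae.yaml):
#
#   <data_path>/celeba/img_align_celeba/*.jpg
#   <data_path>/celeba/list_attr_celeba.txt
#   <data_path>/celeba/identity_CelebA.txt
#   <data_path>/celeba/list_bbox_celeba.txt
#   <data_path>/celeba/list_landmarks_align_celeba.txt
#   <data_path>/celeba/list_eval_partition.txt

from torchvision import transforms
from torchvision.datasets import CelebA

dataset = CelebA(root="path/to/data_path",   # parent of the celeba/ folder (illustrative)
                 split="train",
                 transform=transforms.ToTensor(),
                 download=False)
print(len(dataset))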
I am using the FaceNet model. When I run classifier training it shows this message, but the image alignment process with the same model works fine.
import os
import tensorflow as tf
from tensorflow.python.platform import gfile

def load_model(model):
    # Check if the model is a model directory (containing a metagraph and a checkpoint file)
    # or if it is a protobuf file with a frozen graph
    model_exp = os.path.expanduser(model)
    if os.path.isfile(model_exp):
        print('Model filename: %s' % model_exp)
        with gfile.FastGFile(model_exp, 'rb') as f:
            graph_def = tf.GraphDef()
            print("Graph def value: ", graph_def)  # printed before parsing, so it is empty here
            print(type(graph_def))
            graph_def.ParseFromString(f.read())
            tf.import_graph_def(graph_def, name='')
Can anyone help me clear this issue?
Also, the above code works well locally; the issue only occurs on the Heroku server.
In the above code the print statements show the output as:
Graph def value:
<class 'tensorflow.core.framework.graph_pb2.GraphDef'>
Below is a screenshot of the issue:
The error is because model-serving support is not working on Heroku. You would need a paid Heroku account with the machine-learning dependencies, or go with some other online deployment option that supports serving TensorFlow models.
azureml-sdk version: 1.0.85
Calling the code below (as given in the Dataset UI), I get this error:
from azureml.core import Dataset

ds_split = Dataset.get_by_name(workspace, name='ret- holdout-split')
ds_split.download(target_path=dir_outputs, overwrite=True)
UnexpectedError:
{'errorCode': 'Microsoft.DataPrep.ErrorCodes.Unknown', 'message':
'The client could not finish the operation within specified timeout.',
'errorData': {}}
The FileDataset is a 1 GB pickled file stored in blob storage.
Here's a gist with the full traceback
I also faced this same issue (timeout error) while loading a SQL pool dataset. After spending some time on it, I figured out the issue was in the SQL query, and optimizing the SQL query solved the timeout. (It may be useful for someone.)
Tried again this AM and it worked. Let's file this under "transient error".
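If anyone else hits the same transient timeout, a simple retry loop around the download call is a minimal workaround sketch (the retry count and backoff are arbitrary):

import time

# Sketch: retry the FileDataset download a few times before giving up.
# `ds_split` and `dir_outputs` are the same objects as above.
for attempt in range(3):
    try:
        ds_split.download(target_path=dir_outputs, overwrite=True)
        break
    except Exception as err:  # the SDK surfaces the timeout as a generic error
        print(f"Download attempt {attempt + 1} failed: {err}")
        time.sleep(60 * (attempt + 1))  # simple linear backoff
else:
    raise RuntimeError("Dataset download kept timing out after 3 attempts")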
I am using a simple (not necessarily efficient) method for Pytorch model saving.
import torch
from google.colab import files
torch.save(model, filename) # save a trained model on the VM
files.download(filename) # download the model to local
best_model = files.upload() # select the model just downloaded
best_model[filename] # access the model
Colab disconnects during execution of the last line, and hitting RECONNECT tab always shows ALLOCATING -> CONNECTING (fails, with "unable to connect to the runtime" message in the left bottom corner) -> RECONNECT. At the same time, executing any one of the cells gives Error message "Failed to execute cell, Could not send execute message to runtime: [object CloseEvent]"
I know it is related to the last line, because I can successfully connect with my other Google accounts which didn't execute that line.
Why does it happen? It seems the google accounts which have executed the last line can no longer connect to the runtime.
Edit:
One night later, I can reconnect with the Google account after the session expires. I just attempted the approach in the comment, and found that just calling files.upload() on the PyTorch model leads to the problem. Once the upload completes, Colab disconnects.
Try disabling your ad-blocker. Worked for me
(I wrote this answer before reading your update. I think it may still help.)
files.upload() is just for uploading files. We have no reason to expect it to return some PyTorch type/model.
When you call a = files.upload(), a is a dictionary mapping each filename to a big bytes array:
{'my_image.png': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'}
type(a['my_image.png'])  # bytes
Just like what you get when you do open('my_image.png', 'rb').read().
So I think the next line, best_model[filename], tries to print the whole huge bytes array, which breaks Colab.
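If the goal is to get the uploaded weights back into memory rather than display them, a minimal sketch (assuming the file was written with torch.save as in your code) would be:

import io
import torch
from google.colab import files

uploaded = files.upload()                # pick the model file in the dialog
filename = next(iter(uploaded))          # name of the uploaded file
buffer = io.BytesIO(uploaded[filename])  # wrap the raw bytes in a file-like object

# torch.load accepts a file-like object, so there is no need to print the bytes;
# map_location guards against the model having been saved on a GPU
best_model = torch.load(buffer, map_location="cpu")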