In the RunDetails Jupyter module, what does the table (see screenshot below) represent?
The RunDetails(run_instance).show() method from the azureml-widgets package shows the progress of your job and streams the log files. The widget is asynchronous and provides updates until the training run finishes.
Since the output shown is specific to your pipeline run, you can troubleshoot further using the logs from pipeline runs, which can be found in either the Pipelines or Experiments section of the studio.
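A minimal usage sketch, assuming azureml-widgets is installed and run is a submitted experiment or pipeline run:
from azureml.widgets import RunDetails

# Renders the asynchronous progress table and log stream in the notebook.
RunDetails(run).show()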
I have an ADF pipeline which has around 30 activities that call Databricks Notebooks. The activities are arranged sequentially, that is, one gets executed only after the successful completion of the other.
However, at times, even when there is a runtime error in a particular notebook, the activity that calls the notebook does not fail, and the next activity is triggered. Ideally, this should not happen.
So, I want to add a check on the dependency condition between the activities. I plan to put a condition on the status of the commands running in the notebook (imagine a notebook has 10 Python commands; I want to capture the status of the 10th command).
Is there a way to configure this? Appreciate any ideas. Thank you.
I did try this at my end: when there was an exception in the code, I did see the error in the activity output. But in my case the activity failed, as @Alex mentioned.
In your case you could check the output of the activity and see whether there is any run error. If there is no runError, then proceed with the next activity.
@activity('Notebook2').output.runError
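For example, a hedged sketch of such a check: an If Condition activity between the two notebook activities could use an expression along these lines (assuming the activity name Notebook2 from above, and that the notebook activity surfaces a runError field) to proceed only when no run error was reported:
@empty(activity('Notebook2').output.runError)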
I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but doing so makes the run fail with an MPI error that reads as if the cluster were configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because then logging and checkpoint writes become increasingly messy and it would require a large refactor of the train pipeline. Also, this functionality seems so basic that I hope there is a way to do this gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are "local" to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used to run a hyperparameter tuning run.
So there would be one execution per node.
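A minimal sketch of the create_children approach, assuming it runs inside the submitted train.py, with a hypothetical train_one_model helper that trains one candidate on the shared GPU:
from azureml.core import Run

# Parent run context inside the submitted training script.
run = Run.get_context()

# One lightweight child run per hyperparameter candidate; they all share
# the parent's node/GPU, so several small models can fit on one device.
candidates = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]
children = run.create_children(count=len(candidates))

for child, params in zip(children, candidates):
    child.log("lr", params["lr"])
    val_loss = train_one_model(**params)  # hypothetical training helper
    child.log("val_loss", val_loss)
    child.complete()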
A single service is deployed, but you can load multiple model versions in init(); the score function then, depending on the request's parameters, uses a particular model version to score (see the sketch below).
Or with the new ML Endpoints (Preview).
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
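A hedged sketch of the multi-model scoring script, assuming two registered versions of a scikit-learn model named "my_model" were included in the deployment and that requests carry a "version" field (all of these names are assumptions):
import json
import joblib
from azureml.core.model import Model

models = {}

def init():
    # Load each version once at container start-up.
    for version in (1, 2):
        path = Model.get_model_path("my_model", version=version)
        models[version] = joblib.load(path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = payload.get("version", 2)  # default to the newest version
    prediction = models[version].predict(payload["data"])
    return prediction.tolist()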
I'm new to azure-ml, and have been tasked with making some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )
data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )
from azureml.pipeline.steps import PythonScriptStep

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='./',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref
               ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path of my data_ref, and
then be able to access this stored data afterwards, outside of the pipeline.
But, I can't find a useful function in the documentation. Any guidance would be much appreciated.
Two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example), instead of DataReference.
To your example above, I would just change your last two definitions.
from azureml.data import OutputFileDatasetConfig

data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='./',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    # data_ref is now an output config, so it belongs in outputs only
    inputs=[main_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
some notes:
be sure to check out how DataPaths work. Can be tricky at first glance.
set overwrite=False in the `.as_upload()` method if you don't want future runs to overwrite the first run's data.
more context
PipelineData used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you had no control over where the pipeline data was saved. If you wanted the data for more than just a baton that gets passed between steps, you could have a DataTransferStep land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig.
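As a hedged sketch of the "access it afterwards" part, assuming the OutputFileDatasetConfig above uploaded to (ds, '/data'), you could pull the landed files back down outside of the pipeline like this:
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
ds = Datastore.get(ws, datastore_name='test_datastore')

# Point a FileDataset at the fixed destination and download it locally.
output_ds = Dataset.File.from_files(path=(ds, '/data'))
local_paths = output_ds.download(target_path='./local_data', overwrite=True)
print(local_paths)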
auxiliary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (does the code in the step work?)
integration testing (does the code work when submitted to the Azure ML service?)
data expectation testing (does the data coming out of the step meet my expectations?)
For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions.
For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:
within the PythonScriptStep itself, i.e.
run whatever code you have
test the outputs with GE before writing them out; or,
for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
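A minimal sketch of the first option, assuming the classic (pre-1.0) great_expectations API and a step that produces a pandas DataFrame df with an "id" column (the specific expectations are just examples):
import great_expectations as ge

def validate_outputs(df):
    # Wrap the DataFrame so expectation methods are available on it.
    gdf = ge.from_pandas(df)
    checks = [
        gdf.expect_column_values_to_be_not_null("id"),
        gdf.expect_column_values_to_be_unique("id"),
    ]
    failed = [c for c in checks if not c.success]
    if failed:
        # Failing the step fails the pipeline run, surfacing the bad data.
        raise ValueError(f"Expectation tests failed: {failed}")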
I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in Keras and want to do 10-fold cross validation to check the training stability of the model. Usually I create snakemake workflows and execute them on a Slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job, and resubmit it once in a while so as not to block the GPUs. The goal is to train the model iteratively with short-running jobs.
Now to my questions:
Is there a way to resubmit a job a certain number of times/until a condition is met?
Is there another clever way to train a model iteratively without having to manually submit the job?
For this, you need to submit the job with the command
llsubmit job.sh
The shell script or batch job file can be submitted as many times as needed. Once a job finishes and resources become available, the scheduler restarts the same script (already submitted and waiting in the queue) automatically.
Here are a few suggestions:
Just train your network. It's up to the scheduler to try not to block the GPUs and running 10 short jobs vs 1 long job will probably lead to the same priority.
You can specify --restart-times to run a job which has failed multiple times. The trick is that snakemake will also remove outputs from failed jobs. The workaround is to checkpoint your model to a temp file (not in the output directive of the rule) and exit your training with an error to signal to snakemake that it needs to run again. The inelegant part is that you have to set your restart to a large value, or make sure your training code knows that it is running the final attempt and needs to save the actual output. You can acquire the attempt as a resource. I'm not sure the parameter is available in other directives. Also any job that fails will be resubmitted; not a great option for development.
You can make your checkpoint files outputs, as sketched below. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint, and so on. With a fancy enough input function this can be implemented in one rule, where 1.checkpoint depends on nothing (or your training data, perhaps).
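A minimal Snakefile sketch of that last suggestion, assuming a hypothetical train.py that can resume from and save checkpoints (its flags are illustrative):
N_ITER = 10

# Keep the numeric wildcard from also matching "final".
wildcard_constraints:
    i=r"\d+"

rule all:
    input:
        "final.checkpoint"

# Each iteration resumes from the previous checkpoint; iteration 1
# depends on nothing (or on your training data).
rule train_iteration:
    input:
        lambda wc: [] if int(wc.i) == 1 else f"{int(wc.i) - 1}.checkpoint"
    output:
        "{i}.checkpoint"
    shell:
        "python train.py --resume-from '{input}' --save-to '{output}'"

rule finalize:
    input:
        f"{N_ITER}.checkpoint"
    output:
        "final.checkpoint"
    shell:
        "cp {input} {output}"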
An ADF pipeline needs to be executed on a daily basis, let's say at 03:00 AM.
But prior to execution we also need to check whether the data sources are available.
Data is provided by an external agent; it periodically loads the corresponding data into each source table and lets us know when this process is completed using a flag table: if data source 1 is ready, it sets its flag to 1.
I can't find a way to implement this logic with ADF.
We would need something that, for instance, at 03:00 would trigger an 'element' that checks the flags; if the flags are not up, don't launch the pipeline. After, let's say, 10 minutes, check the flags again, and keep going like this for at most X times OR until the flags are up.
If the flags are up, launch the pipeline execution and stop trying to launch the pipeline any further.
How would you do it?
The logic per se is not complicated in any way, but I wouldn't know where to implement it. Should I develop an Azure Function that launches the pipeline, or is there a way to achieve it with an out-of-the-box ADF activity?
There is an Until iteration activity where you can check your clause.
Example:
Your Azure Function (AF) checks the flag and returns 0 or 1.
Build an ADF pipeline with an Until activity where you check the output of the AF (if it's 1, do something). Inside the Until activity you can have your processing step. For example, you have a flag variable that is 0 before the Until activity; inside the Until you check whether it is 1. If it is, run your processing step; if it's not, put a Wait activity of 10 minutes or so.
So you have the ability in ADF to iterate until a condition is satisfied, for example with an expression like the one below.
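A hedged sketch of the Until activity's condition expression, assuming the Azure Function activity is named CheckFlag and returns a JSON body with a numeric flag field (both names are assumptions):
@equals(activity('CheckFlag').output.flag, 1)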
Hope that this will help you :)