Submitting multiple runs to the same node on AzureML - azure

I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU/node, and additionally providing an MpiConfiguration leads to the same error shown above.
I guess I could always rewrite my training script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?

Use the Run.create_children method, which starts child runs that are "local" to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
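Here is a minimal sketch of what the create_children approach could look like inside train.py, so several small models share one node/GPU; the hyperparameter values and the train_one_model helper are hypothetical:

from azureml.core import Run

# Sketch only: spawn child runs inside the parent run so that several small
# models are trained on the same node; the "lr" values and train_one_model()
# are assumptions for illustration.
parent_run = Run.get_context()
configs = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]

children = parent_run.create_children(count=len(configs))
for child, cfg in zip(children, configs):
    child.log("lr", cfg["lr"])
    # train_one_model(cfg, run=child)   # hypothetical training call that logs to this child run
    child.complete()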
A single service can be deployed, but you can load multiple model versions in init(); the score function then uses a particular model version to score, depending on the request's parameters.
Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
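For completeness, a rough sketch of that multi-model scoring idea; the model name, versions, and request schema below are assumptions, not part of the original question:

import json
import joblib
from azureml.core.model import Model

# Hypothetical score.py: load several registered model versions in init() and
# pick one per request based on a "model_version" field in the payload.
models = {}

def init():
    global models
    for version in (1, 2):                                    # assumed registered versions
        path = Model.get_model_path("my_model", version=version)
        models[str(version)] = joblib.load(path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = str(payload.get("model_version", 1))
    prediction = models[version].predict(payload["data"])
    return prediction.tolist()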

Related

How to access the output folder from a PythonScriptStep?

I'm new to azure-ml and have been tasked with writing some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref')

data_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref',
                         path_on_datastore='/data')

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path of my data_ref, and
then access this stored data afterwards, outside of the pipeline.
But I can't find a useful function in the documentation. Any guidance would be much appreciated.
two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example), instead of DataReference.
In your example above, I would just change the last two definitions.
data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
some notes:
be sure to check out how DataPaths work. Can be tricky at first glance.
set overwrite=False in the `.as_upload()` method if you don't want future runs to overwrite the first run's data.
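On the original ask of accessing the stored data outside of the pipeline, here is one minimal sketch of reading the uploaded files back after the run, assuming the same workspace config and 'test_datastore' from the question:

from azureml.core import Workspace, Datastore, Dataset

# Sketch: read back whatever the step uploaded to (test_datastore, '/data')
# once the pipeline run has finished.
ws = Workspace.from_config('blabla/config.json')        # config path from the question
ds = Datastore.get(ws, datastore_name='test_datastore')

output_dataset = Dataset.File.from_files(path=(ds, 'data'))
local_paths = output_dataset.download(target_path='./pipeline_output', overwrite=True)
print(local_paths)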
more context
PipelineData used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you have no control over where the PipelineData is saved. If you wanted the data for more than just a baton that gets passed between steps, you could have a DataTransferStep land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig.
auxiliary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (does the code in the step work?)
integration testing (does the code work when submitted to the Azure ML service?)
data expectation testing (does the data coming out of the step meet my expectations?)
For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions.
For #2, why not just check whether the whole pipeline completes? I think you get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". As I see it, you have two options for including expectation tests in your Azure ML pipeline:
within the PythonScriptStep itself, i.e.
run whatever code you have
test the outputs with GE before writing them out; or,
for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
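As a rough illustration of option #1, here is a minimal sketch of an expectation check inside a step script using the classic Great Expectations pandas API; the DataFrame, column names, and bounds are made up, and the exact shape of the validate() result depends on your GE version:

import great_expectations as ge
import pandas as pd

# Hypothetical step output: validate it with GE before writing it anywhere.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.7, 0.9]})
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("id")
gdf.expect_column_values_to_be_between("score", min_value=0.0, max_value=1.0)

results = gdf.validate()
if not results.success:                      # fail the step if any expectation is violated
    raise ValueError(f"Expectation tests failed: {results}")
# df.to_parquet(output_path)                 # only write the output once the checks pass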

Snakemake trigger automatic job re-submission on slurm cluster

I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in keras and want to do 10 fold cross validation to check training stability of the model. Usually I create snakemake workflows and execute them on a slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job and resubmit once in a while to not block the GPUs. The goal of this would be to train the model iteratively with short running jobs.
Now to my questions:
Is there a way to resubmit a job a certain number of times/until a condition is met?
Is there another clever way to train a model iteratively without having to manually submit the job?
For this, you need to submit the job with the command
llsubmit job.sh
The shell script or batch job file can be submitted as many times as needed. Once a job finishes and resources become available, the same script (already submitted and waiting in the queue) is restarted automatically.
Here are a few suggestions:
Just train your network. It's up to the scheduler to try not to block the GPUs and running 10 short jobs vs 1 long job will probably lead to the same priority.
You can specify --restart-times to run a job which has failed multiple times. The trick is that snakemake will also remove outputs from failed jobs. The workaround is to checkpoint your model to a temp file (not in the output directive of the rule) and exit your training with an error to signal to snakemake that it needs to run again. The inelegant part is that you have to set your restart to a large value, or make sure your training code knows that it is running the final attempt and needs to save the actual output. You can acquire the attempt as a resource. I'm not sure the parameter is available in other directives. Also any job that fails will be resubmitted; not a great option for development.
You can make your checkpoint files outputs. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint and so on. With a fancy enough input function this can be implemented in one rule where 1.checkpoint depends on nothing (or your training data perhaps).
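A minimal sketch of such an input function; the {iter}.checkpoint naming and the rule outline in the comments are assumptions, not a tested workflow:

# In a Snakefile, a rule along the lines of
#   rule train_iteration:
#       input: previous_checkpoint
#       output: "{iter}.checkpoint"
# would chain short jobs so each iteration resumes from the previous checkpoint.
def previous_checkpoint(wildcards):
    i = int(wildcards.iter)
    if i == 1:
        return []                        # the first iteration depends on nothing (or the training data)
    return [f"{i - 1}.checkpoint"]       # later iterations depend on the prior checkpoint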

Azure ML: How to train a model on multiple instances

I have an AML compute cluster with the min and max nodes set to 2. When I execute a pipeline, I expect the cluster to run the training on both instances in parallel. But the cluster status reports that only one node is busy and the other is idle.
Here's my code to submit the pipeline. As you can see, I'm resolving the cluster name and passing it to my Step1, which trains a model with Keras.
aml_compute = AmlCompute(ws, "cluster-name")
step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)
pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)
Each PythonScriptStep runs on a single node even if you allocate multiple nodes in your cluster. I'm not sure whether training across different instances is possible off-the-shelf in AML, but there's definitely the possibility to use that single node more effectively (look into using all your cores, etc.).
Really great question. The TL;DR is that there isn't an easy way to do that right now. IMHO there are a few questions within your question -- here's a stab at all of them.
Distributed Training with keras
I'm no keras expert, but from their distributed training guide, I'm interested to know which kind of parallelism you are after: model parallelism or data parallelism?
For data parallelism, it looks like the tf.distribute API is the way to go. I would strongly recommend getting that working on a single, multi-GPU machine (local or Azure VM) without Azure ML before starting to use Pipelines.
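For reference, a minimal single-machine data-parallel sketch with tf.distribute (assumes TensorFlow 2.x; the toy model and random data are placeholders):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # uses all GPUs visible on this machine
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():                               # build the model and optimizer inside the scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(256, 10).astype("float32")        # placeholder training data
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=2)             # batches are sharded across the replicas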
Distributed training with Azure ML
This Azure ML notebook shows how to use PyTorch with Horovod on AzureML. Seems not too tricky to change this to work with keras.
As for how to get distributed training to work inside of an Azure ML Pipeline, one spitball workaround would be to have the PythonScriptStep be a controller that would create a new compute cluster and submit the training script to it. I'm not too confident but I'll do some digging.
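To make that spitball a bit more concrete, here is one hypothetical shape the controller step's script could take; the cluster name "gpu-cluster", the training folder, and the script name are inventions for illustration, and the cluster is assumed to already exist:

from azureml.core import Run, Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# Sketch of a controller: from inside the PythonScriptStep, submit a separate
# distributed training run to a multi-node cluster.
run = Run.get_context()
ws = run.experiment.workspace

distributed_config = MpiConfiguration(process_count_per_node=1, node_count=2)
training_config = ScriptRunConfig(source_directory="./train_src",        # hypothetical folder
                                  script="train_keras.py",               # hypothetical script
                                  compute_target="gpu-cluster",          # assumed existing cluster
                                  environment=run.get_environment(),
                                  distributed_job_config=distributed_config)

training_run = Experiment(ws, "distributed-training").submit(training_config)
training_run.wait_for_completion(show_output=True)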
Multi-Node PythonScriptSteps
This is possible (at least with pyspark). Below is a PythonScriptStep from a production pipeline of ours that can run on more than one node. It uses a Docker image with Spark pre-installed and a pyspark RunConfiguration. When it runs, one of the nodes acts as the primary orchestrator and the other as a secondary worker.
from azureml.core import Environment, RunConfiguration

spark_env = Environment.from_pip_requirements(
    'spark_env',
    os.path.join(os.getcwd(), 'compute', 'spark-requirements.txt'))
spark_env.docker.enabled = True
spark_env.docker.base_image = 'microsoft/mmlspark:0.16'

spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env
spark_run_config.node_count = 2          # this is what makes the step multi-node

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)

Performance analysis of U-SQL script

When I run a U-SQL script from the portal/Visual Studio, it goes through stages like preparing, queued, running, and finalizing. What exactly happens behind the scenes in all these stages? Will there be any execution time difference when the job is run from Visual Studio/the portal in a dev versus a production environment? We need to clock the speeds and record the time the script would take in production. Ultimately, the goal is to run these scripts as Data Factory activities in production.
I assume that there would be differences, since your dev environment would probably run at lower resource usage (a lower degree of parallelism both between jobs and inside a job) than your production environment. Otherwise there should be no difference.
Note that we are still working on performance so if you are running into particular issues, please let us know.
The phases roughly do the following (I am probably missing some parts):
preparing: includes compilation, optimization, code generation, preparing the execution graph and required resources, and putting the job into the queue.
queueing: The job sits in the queue and gets executed once it is at the top of the queue and resources are available to start it. This can be impacted by the maximum number of jobs that can run in parallel (a setting you can change by "calling" support/us).
running: Actual job execution. This will be affected by resources: the maximum degree of parallelism specified for the job, network bandwidth, and store access (throttling, bandwidth).
finalizing: Cleanup and stitching results into files, "sealing" table files. This can be more expensive depending on where you write the data (ADL is faster than WASB, for example).

Running multiple Apache Nutch fetch map tasks on a Hadoop Cluster

I am unable to run multiple fetch Map taks for Nutch 1.7 on Hadoop YARN.
I am using the bin/crawl script and made the following tweaks to trigger a fetch with multiple map tasks; however, I am unable to do so.
Added maxNumSegments and numFetchers parameters to the generate phase.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
Removed the topN parameter and removed the noParsing parameter because I want the parsing to happen at the time of fetch.
$bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
The generate phase is not generating more than one segment.
As a result, the fetch phase is not creating multiple map tasks. Also, I believe the way the script is written does not allow the fetch to fetch multiple segments, even if the generate phase were to generate multiple segments.
Can someone please let me know how they got the script to run on a distributed Hadoop cluster? Or is there a different version of the script that should be used?
Thanks.
Are you using Nutch 1.x for this? In that case, the Generator class looks for a flag called "mapred.job.tracker" and checks whether it is set to local. This property has been deprecated in Hadoop 2, and its default value is local. You will have to override the property to something other than local, and the Generator will then generate multiple partitions for the segments.
I've recently faced this problem and thought it'd be a good idea to build upon Keith's answer to provide a more thorough explanation about how to solve this issue.
I've tested this with Nutch 1.10 and Hadoop 2.4.0.
As Keith said, the if block on line 542 in Generator.java reads the mapred.job.tracker property and sets the variable numLists to 1 if the property is local. This variable seems to control the number of reduce tasks and influences the number of map tasks.
Overwriting the value of said property in mapred-site.xml fixes this:
<property>
    <name>mapred.job.tracker</name>
    <value>distributed</value>
</property>
(Or any other value you like except local).
The problem is that this wasn't enough in my case to generate more than one fetch map task. I also had to update the value of the numSlaves parameter in the runtime/deploy/bin/crawl script. I didn't find any mention of this parameter in the Nutch 1.x docs, so I stumbled upon it after a bit of trial and error.
#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################
# set the number of slaves nodes
numSlaves=3
# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`
...
