Azure ML: How to train a model on multiple instances - azure-machine-learning-service

I have an AML compute cluster with the min and max nodes set to 2. When I execute a pipeline, I expect the cluster to run the training on both instances in parallel, but the cluster status reports that only one node is busy and the other is idle.
Here's my code to submit the pipeline. As you can see, I'm resolving the cluster name and passing it to my Step1, which trains a Keras model.
import os

from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

aml_compute = AmlCompute(ws, "cluster-name")
step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)
pipeline1 = Pipeline(workspace=ws, steps=[step1])
pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)

Each PythonScriptStep runs on a single node, even if you allocate multiple nodes in your cluster. I'm not sure whether training across multiple instances is possible off the shelf in AML, but there's definitely the possibility to use that single node more effectively (for example, by using all of its cores).

Really great question. The TL;DR is that there isn't an easy way to do that right now. IMHO there are a few questions within your question; here's a stab at all of them.
Distributed training with Keras
I'm no Keras expert, but from their distributed training guide, I'd like to know what kind of parallelism you're after: model parallelism or data parallelism?
For data parallelism, it looks like the tf.distribute API is the way to go. I would strongly recommend getting that working on a single, multi-GPU machine (local or Azure VM) without Azure ML before starting to use Pipelines.
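As a minimal sketch of what that single-machine, data-parallel setup could look like (the model and data here are placeholders, not your training code):
import tensorflow as tf

# one replica per local GPU; variables built inside the scope are mirrored across replicas
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# placeholder data; scale the global batch size with the number of replicas
x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64 * strategy.num_replicas_in_sync, epochs=2)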
Distributed training with Azure ML
This Azure ML notebook shows how to use PyTorch with Horovod on Azure ML. It doesn't seem too tricky to change it to work with Keras.
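If you go the Horovod route on AML compute, the submission side looks roughly like this (a hedged sketch; the environment name, cluster name, and script name are assumptions, not taken from the notebook):
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()
# assumed curated environment name; use whatever TF/Keras GPU environment you have
env = Environment.get(ws, "AzureML-TensorFlow-2.4-GPU")

src = ScriptRunConfig(
    source_directory="./train",
    script="train_horovod.py",           # assumed script name
    compute_target="gpu-cluster",         # assumed cluster name
    environment=env,
    # one MPI process per node, across two nodes
    distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=2),
)
run = Experiment(ws, "keras-horovod").submit(src)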
As for how to get distributed training to work inside an Azure ML Pipeline, one spitball workaround would be to have the PythonScriptStep act as a controller that creates a new compute cluster and submits the training script to it. I'm not too confident, but I'll do some digging.
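To be clear, this is just a sketch of that controller idea, not something I've run; the cluster name, VM size, and script paths are made up:
from azureml.core import Run, Experiment, ScriptRunConfig
from azureml.core.compute import AmlCompute, ComputeTarget

# inside the controller PythonScriptStep: reuse the step's own workspace handle
ws = Run.get_context().experiment.workspace

# provision (or attach to) a training cluster
provisioning = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_NC6", min_nodes=0, max_nodes=2)
cluster = ComputeTarget.create(ws, "train-cluster", provisioning)
cluster.wait_for_completion(show_output=True)

# submit the real training script to that cluster and wait for it
src = ScriptRunConfig(source_directory="./train", script="Train.py",
                      compute_target=cluster)
child = Experiment(ws, "controller-submitted").submit(src)
child.wait_for_completion(show_output=True)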
Multi-node PythonScriptSteps
This is possible (at least with pyspark). Below is a PythonScriptStep from a production pipeline of ours that can run on more than one node. It uses a Docker image with Spark pre-installed and a pyspark RunConfiguration. In the screenshots below you can see that one of the nodes is the primary orchestrator and the other is a secondary worker.
import os

from azureml.core import Environment, RunConfiguration
from azureml.pipeline.steps import PythonScriptStep

spark_env = Environment.from_pip_requirements(
    'spark_env',
    os.path.join(os.getcwd(), 'compute', 'spark-requirements.txt'))
spark_env.docker.enabled = True
spark_env.docker.base_image = 'microsoft/mmlspark:0.16'

spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env
spark_run_config.node_count = 2

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)

Related

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are "local" to the parent run and don't need authentication.
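A minimal sketch of that pattern inside your train.py (the hyperparameter sets and the training loop are placeholders):
from azureml.core import Run

parent_run = Run.get_context()               # the run executing on the node
hyperparameter_sets = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]   # placeholder

# one child run per small model, all sharing the node's GPU
children = parent_run.create_children(count=len(hyperparameter_sets))
for child, params in zip(children, hyperparameter_sets):
    child.log("lr", params["lr"])
    # ... train one small model here with `params` ...
    child.complete()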
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
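For reference, max_concurrent_runs is set on the HyperDriveConfig; a sketch reusing the ScriptRunConfig from the question (the sampling space and metric names are placeholders):
from azureml.train.hyperdrive import (HyperDriveConfig, RandomParameterSampling,
                                      PrimaryMetricGoal, uniform)

sampling = RandomParameterSampling({"--lr": uniform(1e-4, 1e-1)})
hd_config = HyperDriveConfig(
    run_config=config,                  # the ScriptRunConfig from the question
    hyperparameter_sampling=sampling,
    primary_metric_name="val_loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,              # on AmlCompute this is effectively one run per node
)
hd_run = experiment.submit(hd_config)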
A single service is deployed, but you can load multiple model versions in init(); the score function then, depending on the request's params, uses a particular model version to score.
Or with the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
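A rough sketch of that multi-model init()/run() scoring pattern (the model name, versions, and request schema are assumptions):
# score.py: load several registered model versions in init(), pick one per request
import json
import joblib                      # assuming sklearn-style pickled models
from azureml.core.model import Model

models = {}

def init():
    global models
    for version in (1, 2):
        path = Model.get_model_path("my-model", version=version)   # assumed model name
        models[version] = joblib.load(path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = int(payload.get("model_version", 1))
    preds = models[version].predict(payload["data"])
    return preds.tolist()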

Parallel Python with joblibspark: how to evenly distribute jobs?

I have a project in which joblib works well on one computer; it sends functions to different cores effectively.
Now I have an assignment to do the same thing on a Databricks cluster. I've tried this many ways today, but the problem is that the jobs do not spread out one per compute node. I've got 4 executors and I set n_jobs=6, but when I send 4 jobs through, some of them pile up on the same node, leaving nodes unused (see the Databricks Spark UI screenshot). Sometimes when I try this, I get 1 job running on a node by itself and all of the rest piled up on one node.
In the joblib and joblibspark docs, I see the parameter batch_size, which specifies how many tasks are sent to a given node. Even when I set that to 1, I get the same problem: nodes left unused.
from joblib import Parallel, delayed
from joblibspark import register_spark

register_spark()

output = Parallel(backend="spark", n_jobs=6,
                  verbose=config.JOBLIB_VERBOSE, batch_size=1)(
    delayed(fit_one)(x, model_data=model_data, dlmodel=dlmodel,
                     outdir=outdir, frac=sample_p,
                     score_type=score_type,
                     save=save,
                     verbose=verbose) for x in ZZ)
I've hacked at this all day, trying various backends and combinations of settings. What am I missing?

Snakemake trigger automatic job re-submission on slurm cluster

I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in Keras and want to do 10-fold cross-validation to check the training stability of the model. Usually I create Snakemake workflows and execute them on a Slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job, and resubmit it once in a while so as not to block the GPUs. The goal is to train the model iteratively with short-running jobs.
Now to my questions:
Is there a way to resubmit a job a certain number of times/until a condition is met?
Is there another clever way to train a model iteratively without having to manually submit the job?
For this, you need to submit the job with a command like:
llsubmit job.sh
The shell script or batch job file can be submitted as many times as needed. Once the job finishes and resources become available, the scheduler restarts the same script (the one already submitted and waiting in the queue) automatically.
Here are a few suggestions:
Just train your network. It's up to the scheduler to try not to block the GPUs, and running 10 short jobs vs. 1 long job will probably lead to the same priority.
You can specify --restart-times to re-run a job that has failed, multiple times. The trick is that Snakemake will also remove outputs from failed jobs. The workaround is to checkpoint your model to a temp file (not in the output directive of the rule) and exit your training with an error to signal to Snakemake that it needs to run again. The inelegant part is that you have to set the restart count to a large value, or make sure your training code knows it is running the final attempt and needs to save the actual output. You can access the attempt number as a resource; I'm not sure the parameter is available in other directives. Also, any job that fails will be resubmitted, so this isn't a great option for development.
You can make your checkpoint files outputs. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint, and so on. With a fancy enough input function this can be implemented in one rule where 1.checkpoint depends on nothing (or your training data, perhaps); see the sketch below.
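Here is a rough Snakefile sketch of that chained-checkpoint idea (the training script name, data path, and number of rounds are placeholders):
N_ROUNDS = 10

wildcard_constraints:
    i = r"\d+"

rule all:
    input: "final.checkpoint"

def previous_checkpoint(wildcards):
    i = int(wildcards.i)
    # round 1 depends only on the training data; later rounds resume from the previous checkpoint
    return "data/train.csv" if i == 1 else f"{i - 1}.checkpoint"

rule train_round:
    input: previous_checkpoint
    output: "{i}.checkpoint"
    shell: "python train.py --resume-from {input} --save-to {output}"

rule finalize:
    input: f"{N_ROUNDS}.checkpoint"
    output: "final.checkpoint"
    shell: "cp {input} {output}"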

Understanding Dask's Task Stream

I'm running Dask locally, using the distributed scheduler on my machine with 8 cores. On initialization I see the following (client/cluster summary screenshot):
This looks correct, but I'm confused by the task stream in the diagnostics (shown below):
I was expecting 8 rows corresponding to the 8 workers/cores; is that incorrect?
Thanks
AJ
I've added the code I'm running:
import dask.dataframe as dd
from dask.distributed import Client, progress

client = Client()
progress(client)

# load datasets
trd = (dd.read_csv('trade_201811*.csv', compression='gzip',
                   blocksize=None, dtype={'Notional': 'float64'})
       .assign(timestamp=lambda x: dd.to_datetime(x.timestamp.str.replace('D', 'T')))
       .set_index('timestamp', sorted=True))
Each row in the task stream corresponds to a single thread. Some more sophisticated Dask operations will start up additional threads; this happens particularly when tasks launch other tasks, which is common especially in machine learning workloads.
My guess is that you're using one of the following approaches:
dask.distributed.get_client or dask.distributed.worker_client
Scikit-Learn's Joblib
Dask-ML
If so, the behavior that you're seeing is normal. The task stream plot will look a little odd, yes, but hopefully it is still interpretable.
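For example, tasks that launch other tasks via worker_client will show up as extra rows in the task stream; a toy sketch (the functions here are made up for illustration):
from dask.distributed import Client, worker_client

def inner(x):
    return x * 2

def outer(n):
    # inside a running task, connect back to the same scheduler and submit more work
    with worker_client() as client:
        futures = client.map(inner, range(n))
        return sum(client.gather(futures))

if __name__ == "__main__":
    client = Client()                          # local cluster, one worker thread per core
    print(client.submit(outer, 5).result())    # 0 + 2 + 4 + 6 + 8 = 20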

How do I get predict in Spark ALS Implicit to use multiple cores?

I am using the Spark ALS Implicit recommender code. Training the model works fine using multiple cores/parallelism:
model = ALS.trainImplicit(ratings, rank, numIterations, lambda_, blocks=-1, alpha=alpha)
I have also configured the Spark context parameters to use multiple cores and I can see that the model training part is using multiple cores.
But when I use predict functions such as:
Item = model.recommendUsers(item_id, 1000000)
this is quite slow (it takes ~70 seconds for 1 million users)
I can't find any documentation on how to make the predict and recommend functions use multiple cores, or how to multi-thread this. Is this possible?
