So, I have a dbt project, which contains a number of models in the models folder, divided into mart, intermediate, and staging subfolders. I use GitLab CI/CD to deploy these models to Snowflake.
Now, I'm currently testing these models, and would like to deploy them one by one.
Is it possible to somehow set up GitLab CI/CD (or maybe the dbt project) to just run/deploy one model at a time?
The dbt run command can be supplemented with the --select argument.
By default, dbt run will execute all of the models in the dependency graph. During development (and deployment), it is useful to specify only a subset of models to run. Use the --select flag with dbt run to select a subset of models to run. Note that the following arguments (--select, --exclude, and --selector) also apply to other dbt tasks, such as test and build.
Examples:
$ dbt run --select my_dbt_project_name # runs all models in your project
$ dbt run --select my_dbt_model # runs a specific model
$ dbt run --select path.to.my.models # runs all models in a specific directory
$ dbt run --select my_package.some_model # run a specific model in a specific package
$ dbt run --select tag:nightly # run models with the "nightly" tag
$ dbt run --select path/to/models # run models contained in path/to/models
$ dbt run --select path/to/my_model.sql # run a specific model by its path
dbt supports a shorthand language for defining subsets of nodes. This language uses the characters +, @, *, and ,.
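For example, a few of these shorthand selectors (the model, directory, and tag names below are placeholders):
$ dbt run --select my_model+                  # run my_model and everything downstream of it
$ dbt run --select +my_model                  # run my_model and everything upstream of it
$ dbt run --select staging.*                  # run all models under the staging directory/package
$ dbt run --select tag:nightly,tag:finance    # run models that have both tags (intersection)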
Related
We split our regression tests into several parallel jobs that each produce a JSON artifact with some data about the test results. I would like to create a new job that produces a single artifact with all regression results concatenated. The number of parallel jobs is not necessarily static. The most similar post I've seen is Get artifacts from previous GIT jobs, but my jobs are parallel jobs that produce the same file name (although I suppose output names could include the node number for simplicity).
I've read that a job will automatically download artifacts from previous jobs, but how can I actually load/use this data in a script?
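One approach: once the collector job declares the parallel jobs as dependencies (via dependencies: or needs:), GitLab extracts their artifacts into the collector job's working directory before the script runs, so the script can simply glob and merge them. A minimal sketch, assuming each parallel job wrote a file matching results_*.json (hypothetical names):

import glob
import json

merged = []
for path in sorted(glob.glob("results_*.json")):
    # each artifact file is one parallel job's regression results
    with open(path) as f:
        merged.append(json.load(f))

# write a single combined file for the collector job to expose as its artifact
with open("all_results.json", "w") as f:
    json.dump(merged, f, indent=2)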
I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried using MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster were configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are “local” to the parent run and don't need authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used to run a hyperparameter tuning run.
So there would be one execution per node.
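A minimal sketch of that approach, assuming the submitted train.py loops over a small hyperparameter grid itself (build_model and param_grid are placeholders for your own code):

from azureml.core import Run

parent_run = Run.get_context()   # the run created by experiment.submit(config)
param_grid = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]

# one child run per hyperparameter setting; all of them share the parent's node/GPU
children = parent_run.create_children(count=len(param_grid))
for child, params in zip(children, param_grid):
    child.log("lr", params["lr"])
    # model = build_model(**params)        # placeholder: your actual training code
    # child.log("val_loss", evaluate(model))
    child.complete()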
A single service is deployed, but you can load multiple model versions in init(); the score function then uses a particular model version, depending on the request's parameters (sketched below).
Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
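For the single-service option above, a rough sketch of a scoring script that loads several registered versions of a model in init() and picks one per request (the model name, version numbers, and joblib files are assumptions):

import json
import joblib
from azureml.core.model import Model

models = {}

def init():
    global models
    # load two registered versions of the same model into memory once, at service start
    for version in (1, 2):
        model_path = Model.get_model_path("my_model", version=version)
        models[version] = joblib.load(model_path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = payload.get("model_version", 1)   # the request chooses which version to use
    prediction = models[version].predict(payload["data"])
    return prediction.tolist()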
I'm new to azure-ml, and have been tasked to make some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:
from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )

data_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref',
                         path_on_datastore='/data'
                         )

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--main_path', main_ref,
               '--data_ref_folder', data_ref
               ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
I would like:
my data_prep_step to run,
have it store some data on the path of my data_ref, and
to then access this stored data afterwards, outside of the pipeline.
But, I can't find a useful function in the documentation. Any guidance would be much appreciated.
two big ideas here -- let's start with the main one.
main ask
With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?
short answer
Consider using OutputFileDatasetConfig (docs example), instead of DataReference.
To your example above, I would just change your last two definitions.
from azureml.data import OutputFileDatasetConfig

data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref, data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
some notes:
be sure to check out how DataPaths work. Can be tricky at first glance.
set overwrite=False in the .as_upload() method if you don't want future runs to overwrite the first run's data.
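As an additional note, because destination pins the output to a known datastore path, one hedged way to get at the data outside of the pipeline (after the run completes) is to point a FileDataset at that path, e.g.:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

# point a dataset at the folder the step wrote to, then pull it locally
output_files = Dataset.File.from_files(path=(ds, 'data'))
local_paths = output_files.download(target_path='./local_data', overwrite=True)
print(local_paths)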
more context
PipelineData used to be the de facto object to pass data ephemerally between pipeline steps. The idea was to make it easy to:
stitch steps together
get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)
The downside was that you had no control over where the data was saved. If you wanted the data for more than just a baton that gets passed between steps, you could add a DataTransferStep to land the PipelineData wherever you please after the PythonScriptStep finishes.
This downside is what motivated OutputFileDatasetConfig.
auxiliary ask
how might I programmatically test the functionality of my Azure ML pipeline?
there are not enough people talking about data pipeline testing, IMHO.
There are three areas of data pipeline testing:
unit testing (does the code in the step work?)
integration testing (does the code work when submitted to the Azure ML service?)
data expectation testing (does the data coming out of the step meet my expectations?)
For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions.
For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:
within the PythonScriptStep itself, i.e.
run whatever code you have
test the outputs with GE before writing them out; or,
for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.
Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
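To make option #1 concrete, here's a rough sketch of running a couple of expectations inside the step before persisting the output (the column names, thresholds, and output file are illustrative only):

import great_expectations as ge
import pandas as pd

# stand-in for the dataframe your step actually produces
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.2, 0.8, 0.5]})
ge_df = ge.from_pandas(df)

# fail the step if the expectations don't hold
assert ge_df.expect_column_values_to_not_be_null("id").success
assert ge_df.expect_column_values_to_be_between("score", 0.0, 1.0).success

# only write the output once the expectations pass
df.to_csv("scores.csv", index=False)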
I have an AML compute cluster with the min & max nodes set to 2. When I execute a pipeline, I expect the cluster to run the training on both instances in parallel. But the cluster status reports that only one node is busy and the other is idle.
Here's my code to submit the pipeline. As you can see, I'm resolving the cluster name and passing that to my Step1, which trains a model with Keras.
import os

from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.pipeline.steps import PythonScriptStep

aml_compute = AmlCompute(ws, "cluster-name")

step1 = PythonScriptStep(name="train_step",
                         script_name="Train.py",
                         arguments=["--sourceDir", os.path.realpath(source_directory)],
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config,
                         allow_reuse=False)

pipeline_run = Experiment(ws, 'MyExperiment').submit(pipeline1, regenerate_outputs=False)
Each Python script step runs on a single node, even if you allocate multiple nodes in your cluster. I'm not sure whether training on different instances is possible off-the-shelf in AML, but there's definitely the possibility of using that single node more effectively (look into using all your cores, etc.).
Really great question. The TL;DR is that there isn't an easy way to do that right now. IMHO there are a few questions within your question -- here's a stab at all of them.
Distributed Training with keras
I'm no keras expert, but from their distributed training guide, I'd want to know which kind of parallelism you're after: model parallelism or data parallelism?
For data parallelism, it looks like the tf.distribute API is the way to go. I would strongly recommend getting that working on a single, multi-GPU machine (local or Azure VM) without Azure ML before starting to use Pipelines.
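For data parallelism on one machine, a minimal sketch with tf.distribute (the model and the random data are placeholders):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()      # replicates the model across all visible GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():                           # variables created here are mirrored across replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# dummy data just to make the sketch runnable; each batch is split across replicas
x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, epochs=2, batch_size=64)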
Distributed training with Azure ML
This Azure ML notebook shows how to use PyTorch with Horovod on AzureML. Seems not too tricky to change this to work with keras.
As for how to get distributed training to work inside of an Azure ML Pipeline, one spitball workaround would be to have the PythonScriptStep be a controller that would create a new compute cluster and submit the training script to it. I'm not too confident but I'll do some digging.
Multi-Node PythonScriptSteps
This is possible (at least with pyspark). Below is a PythonScriptStep from a production pipeline of ours that can run on more than one node. It uses a Docker image with Spark pre-installed and a pyspark RunConfiguration. In the screenshots below you can see that one of the nodes is the primary orchestrator and the other is a secondary worker.
import os

from azureml.core import Environment, RunConfiguration
from azureml.pipeline.steps import PythonScriptStep

spark_env = Environment.from_pip_requirements(
    'spark_env',
    os.path.join(os.getcwd(), 'compute', 'spark-requirements.txt'))
spark_env.docker.enabled = True
spark_env.docker.base_image = 'microsoft/mmlspark:0.16'

spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.environment = spark_env
spark_run_config.node_count = 2

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined_data,
               '--output_dir', rolled_data],
    compute_target=compute_target_spark,
    inputs=[joined_data],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)
If I go to the Test results screen after a run of my pipeline, it shows each test case from my Java/Maven/TestNG automated test project duplicated. One instance of each test case shows a blank machine name, and its duplicate shows a machine name.
Run 1000122 - JUnit_TestResults_3662
There are several possibilities. First, did you add multiple configurations to a test plan? If so, the test cases will be repeated in the plan with each of the configurations you have assigned.
Another possibility is that when you passed parameters to the test method, you used multiple parameter sets, so the test method was executed two times.
The information you provided is not sufficient. Can you share the code or screenshots of your test samples?