AzureMl pipeline: How to access data of step1 into step2 - azure-machine-learning-service

I am following this article from Microsoft to create an Azure ML pipeline with two steps, and I want step2 to use the data written by step1. According to the article, the code below should pass the path of the data written by step1 to the script used for step2 as an argument:
datastore = workspace.datastores['my_adlsgen2']
step1_output_data = OutputFileDatasetConfig(name="processed_data", destination=(datastore, "mypath/{run-id}/{output-name}")).as_upload()
step1 = PythonScriptStep(
    name="generate_data",
    script_name="step1.py",
    runconfig=aml_run_config,
    arguments=["--output_path", step1_output_data]
)
step2 = PythonScriptStep(
    name="read_pipeline_data",
    script_name="step2.py",
    compute_target=compute,
    runconfig=aml_run_config,
    arguments=["--pd", step1_output_data.as_input]
)
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
But when I access the pd argument in step2.py, it contains
"<bound method OutputFileDatasetConfig.as_mount of
<azureml.data.output_dataset_config.OutputFileDatasetConfig object at
0x7f8ae7f478d0>>"
Any idea how to pass the blob storage location that step1 writes its data to into step2?

You will probably find what you need here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines. Particularly, note the section Read OutputFileDatasetConfig as inputs to non-initial steps:
# get adls gen 2 datastore already registered with the workspace
datastore = workspace.datastores['my_adlsgen2']
step1_output_data = OutputFileDatasetConfig(name="processed_data",
    destination=(datastore, "mypath/{run-id}/{output-name}")).as_upload()
step1 = PythonScriptStep(
    name="generate_data",
    script_name="step1.py",
    runconfig=aml_run_config,
    arguments=["--output_path", step1_output_data]
)
step2 = PythonScriptStep(
    name="read_pipeline_data",
    script_name="step2.py",
    compute_target=compute,
    runconfig=aml_run_config,
    arguments=["--pd", step1_output_data.as_input()]
)
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
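Your mistake is probably that as_input is a method of OutputFileDatasetConfig, not a property, so it has to be called with parentheses: step1_output_data.as_input(). Without the parentheses, the bound method object itself is passed as the argument, which is why step2.py receives the "<bound method ...>" string.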
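For the receiving side, here is a minimal sketch of how step2.py might read the argument, assuming the corrected pipeline hands the script a usable directory path as the value of --pd. The helper names parse_args and list_input_files are hypothetical and not part of the Azure ML SDK; the sketch only uses the standard library.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the --pd argument that the pipeline passes to step2.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pd", dest="pd", required=True,
                        help="Directory containing the data written by step1")
    return parser.parse_args(argv)


def list_input_files(input_dir):
    """Return the files below input_dir as sorted paths relative to it."""
    found = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, input_dir))
    return sorted(found)
```

In the real script you would call parse_args() with no arguments, so that it reads sys.argv, and then open the files returned by list_input_files(args.pd).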

Can an Azure .py script (run from a pipeline) create a file in the same notebook portal directory, i.e. /mnt/batch/tasks/shared/LS_root/mounts/clusters/USERxxx?

Do you know how I can have the .py script below, which lives in a notebook directory in the AML portal and is run from a pipeline, create a file in that same notebook portal directory, i.e. /mnt/batch/tasks/shared/LS_root/mounts/clusters/USERxxx?
The pipeline seems to create the file in its own temporary directory instead:
ws = Workspace.from_config()
source_directory = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/USERxxxx.'
print('Source directory for the step is {}.'.format(os.path.realpath(source_directory)))
aml_run_config = RunConfiguration()
aml_run_config.target = "XXXXX"
aml_run_config.environment.python.user_managed_dependencies = False
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas', 'openpyxl', 'pyodbc', 'sqlalchemy'],
    pip_packages=['pandas==1.4.4', 'openpyxl', 'pyodbc', 'sqlalchemy', 'azureml-sdk', 'azureml-dataprep[fuse,pandas]'],
    pin_sdk_version=False)
aml_run_config = RunConfiguration(framework="python",
    conda_dependencies=aml_run_config.environment.python.conda_dependencies)
step1 = PythonScriptStep(name="hello-step1",
    script_name="hello_world.py",
    compute_target="XXXXX",
    runconfig=aml_run_config,
    source_directory=source_directory,
    allow_reuse=True)
pipeline1 = Pipeline(workspace=ws, steps=step1)
pipeline1.validate()
pipeline_run1 = Experiment(ws, 'trigger-hello-experiment').submit(pipeline1, regenerate_outputs=False)
print("Pipeline is submitted for execution")
many thanks

Get auto-generated OutputFileDatasetConfig destination

The OutputFileDatasetConfig documentation says this about the destination parameter:
If set to None, we will copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}
Given that I have a handle to such an OutputFileDatasetConfig with destination set to None, how can I get the generated destination without recomputing the default myself, since the default may change?
If you do not want to pass a name and path, the run details provide the run ID, and the path can be constructed from it. Ideally you would pass these details explicitly. If you do not, the recommended approach is to use the output in an intermediate step so that the SDK handles the path for you, as in this PythonScriptStep():
from azureml.data import OutputFileDatasetConfig
dataprep_output = OutputFileDatasetConfig()
input_dataset = Dataset.get_by_name(workspace, 'raw_data')
dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=cluster,
    arguments=[input_dataset.as_named_input('raw_data').as_mount(), dataprep_output]
)
output = OutputFileDatasetConfig()
src = ScriptRunConfig(source_directory=path,
                      script='script.py',
                      compute_target=ct,
                      environment=env,
                      arguments=["--output_path", output])
run = exp.submit(src, tags=tags)
# Inside script.py
mount_point = os.path.dirname(args.output_path)
os.makedirs(mount_point, exist_ok=True)
print("mount_point : " + mount_point)
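If you do end up reconstructing the location yourself from the run ID, a minimal sketch could look like the following. It assumes the default template quoted from the documentation above (/dataset/{run-id}/{output-name}); as the question points out, that default may change, so treat it as an assumption rather than a guarantee. The function name default_output_path is hypothetical, and in a real run the run ID would come from the run details.

```python
# Default template as quoted from the OutputFileDatasetConfig documentation.
# Assumption: this default may change in future SDK versions.
DEFAULT_TEMPLATE = "/dataset/{run-id}/{output-name}"


def default_output_path(run_id, output_name, template=DEFAULT_TEMPLATE):
    """Fill in the {run-id} and {output-name} placeholders of the template."""
    return (template
            .replace("{run-id}", run_id)
            .replace("{output-name}", output_name))


print(default_output_path("my-run-id", "output_1"))  # /dataset/my-run-id/output_1
```

The same helper works for a custom destination template such as "mypath/{run-id}/{output-name}" by passing it as the template argument.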

How do we do Batch Inferencing on Azure ML Service with Parameterized Dataset/DataPath input?

The ParallelRunStep Documentation suggests the following:
A named input Dataset (DatasetConsumptionConfig class)
path_on_datastore = iris_data.path('iris/')
input_iris_ds = Dataset.Tabular.from_delimited_files(path=path_on_datastore, validate=False)
named_iris_ds = input_iris_ds.as_named_input(iris_ds_name)
Which is just passed as an Input:
distributed_csv_iris_step = ParallelRunStep(
    name='example-iris',
    inputs=[named_iris_ds],
    output=output_folder,
    parallel_run_config=parallel_run_config,
    arguments=['--model_name', 'iris-prs'],
    allow_reuse=False
)
The Documentation to submit Dataset Inputs as Parameters suggests the following:
The Input is a DatasetConsumptionConfig class element
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
Which is passed in arguments as well as in inputs:
train_step = PythonScriptStep(
    name="train_step",
    script_name="train_with_dataset.py",
    arguments=["--param2", tabular_ds_consumption],
    inputs=[tabular_ds_consumption],
    compute_target=compute_target,
    source_directory=source_directory)
When submitting with a new parameter value, we create a new Dataset instance:
iris_tabular_ds = Dataset.Tabular.from_delimited_files('some_link')
And submit it like this:
pipeline_run_with_params = experiment.submit(pipeline, pipeline_parameters={'tabular_ds_param': iris_tabular_ds})
However, how do we combine this: How do we pass a Dataset Input as a Parameter to the ParallelRunStep?
If we create a DatasetConsumptionConfig class element like so:
tabular_dataset = Dataset.Tabular.from_delimited_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_dataset)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
And pass it as an argument in the ParallelRunStep, it will throw an error.
References:
Notebook with Dataset Input Parameter
ParallelRunStep Notebook
ParallelRunStep in Azure ML (now GA) is a managed solution for scaling large ML workloads up and out, including batch inference, training and large-scale data processing. See the documents below for details.
• Overview doc: run batch inference using ParallelRunStep
• Sample notebooks
• AI Show: How to do Batch Inference using AML ParallelRunStep
• Blog: Batch Inference in Azure Machine Learning
For the inputs we create Dataset class instances:
tabular_ds1 = Dataset.Tabular.from_delimited_files('some_link')
tabular_ds2 = Dataset.Tabular.from_delimited_files('some_link')
ParallelRunStep produces an output file. We use the PipelineData class to create a folder that will store this output:
from azureml.pipeline.core import Pipeline, PipelineData
output_dir = PipelineData(name="inferences", datastore=def_data_store)
ParallelRunStep depends on the ParallelRunConfig class, which holds details about the environment, entry script, output file name and other necessary definitions:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig
parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script=script_file,
    mini_batch_size=PipelineParameter(name="batch_size_param", default_value="5"),
    error_threshold=10,
    output_action="append_row",
    append_row_file_name="mnist_outputs.txt",
    environment=batch_env,
    compute_target=compute_target,
    process_count_per_node=PipelineParameter(name="process_count_param", default_value=2),
    node_count=2
)
The input to ParallelRunStep is created using the following code:
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
tabular_pipeline_param = PipelineParameter(name="tabular_ds_param", default_value=tabular_ds1)
tabular_ds_consumption = DatasetConsumptionConfig("tabular_dataset", tabular_pipeline_param)
The PipelineParameter lets us run the same pipeline against different datasets.
ParallelRunStep consumes this as an input:
parallelrun_step = ParallelRunStep(
    name="some-name",
    parallel_run_config=parallel_run_config,
    inputs=[tabular_ds_consumption],
    output=output_dir,
    allow_reuse=False
)
To run the pipeline with another dataset:
pipeline_run_2 = experiment.submit(pipeline,
    pipeline_parameters={"tabular_ds_param": tabular_ds2}
)
Note that there is currently an error: DatasetConsumptionConfig and PipelineParameter objects cannot be reused.

I want to list all resources in an organization using a Python function in Google Cloud

I went through the official Google Cloud docs, but I don't understand how to use the following to list the resources of a specific organization by providing the organization ID:
organizations = CloudResourceManager.Organizations.Search()
projects = emptyList()
parentsToList = queueOf(organizations)
while (parent = parentsToList.pop()) {
    // NOTE: Don't forget to iterate over paginated results.
    // TODO: handle PERMISSION_DENIED appropriately.
    projects.addAll(CloudResourceManager.Projects.List(
        "parent.type:" + parent.type + " parent.id:" + parent.id))
    parentsToList.addAll(CloudResourceManager.Folders.List(parent))
}
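The traversal in the pseudocode above can be sketched in plain Python as a breadth-first walk over the resource hierarchy. This is only an illustration of the algorithm: list_folders and list_projects are hypothetical callables standing in for the Cloud Resource Manager calls, and pagination and PERMISSION_DENIED handling are assumed to live inside them. The in-memory dictionaries are made-up sample data.

```python
from collections import deque


def collect_projects(organization, list_folders, list_projects):
    """Walk the hierarchy below `organization` breadth-first.

    list_folders(parent) and list_projects(parent) are hypothetical
    callables returning the direct child folders and projects of a parent.
    """
    projects = []
    parents = deque([organization])
    while parents:
        parent = parents.popleft()
        projects.extend(list_projects(parent))
        # Child folders are queued so their projects are collected later.
        parents.extend(list_folders(parent))
    return projects


# Made-up hierarchy used only to exercise the traversal.
FOLDERS = {
    "organizations/123": ["folders/1", "folders/2"],
    "folders/1": ["folders/3"],
}
PROJECTS = {
    "organizations/123": ["p-root"],
    "folders/1": ["p-a"],
    "folders/3": ["p-b", "p-c"],
}

result = collect_projects(
    "organizations/123",
    lambda parent: FOLDERS.get(parent, []),
    lambda parent: PROJECTS.get(parent, []),
)
print(result)  # ['p-root', 'p-a', 'p-b', 'p-c']
```

Starting the queue from a single organization, rather than from a search over all organizations, is what restricts the result to one organization ID.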
You can use Cloud Asset Inventory for this. I wrote this code to export the assets to BigQuery:
import os
from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2
def asset_to_bq(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))
    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(
        os.getenv('PROJECT_ID'), os.getenv('DATASET'))
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True
    response = client.export_assets(parent, output_config)
    # To wait for the export to finish:
    # response.result()
    # Do stuff after export
    return "done", 200
if __name__ == "__main__":
    asset_to_bq('')
Be careful if you use it: the export must target an empty or non-existent table, or force must be set to True.
In my case, a few minutes after the Cloud Scheduler job that triggers my function and exports the data to BigQuery, a scheduled query in BigQuery copies the data to another table to keep the history.
Note: It's also possible to configure an export to Cloud Storage if you prefer.
I hope this gives you a starting point for what you want to achieve.
I am able to list the projects, but I also want to list the folders, the resources under each folder, and the folder names and tags. I also want to specify the organization ID so that I only get resource information for a specific organization.
import os
from google.cloud import resource_manager
def export_resource(organizations):
    client = resource_manager.Client()
    for project in client.list_projects():
        print("%s, %s" % (project.project_id, project.status))

How to avoid the header while exporting a BigQuery table into Google Storage

I have developed the code below, which exports a BigQuery table into a Google Storage bucket. I want to merge the exported files into a single file without a header, so that the downstream processes can use the file without any issue.
def export_bq_table_to_gcs(self, table_name):
    client = bigquery.Client(project=project_name)
    print("Exporting table {}".format(table_name))
    dataset_ref = client.dataset(dataset_name,
                                 project=project_name)
    dataset = bigquery.Dataset(dataset_ref)
    table_ref = dataset.table(table_name)
    size_bytes = client.get_table(table_ref).num_bytes
    # Tables bigger than 1 GB use Google's automatic split (wildcard URI); otherwise the export is forced into a single file.
    if size_bytes > 10 ** 9:
        destination_uris = [
            'gs://{}/{}{}*.csv'.format(bucket_name,
                                       f'{table_name}_temp', uid)]
    else:
        destination_uris = [
            'gs://{}/{}{}.csv'.format(bucket_name,
                                      f'{table_name}_temp', uid)]
    extract_job = client.extract_table(table_ref, destination_uris)  # API request
    result = extract_job.result()  # Waits for job to complete.
    if result.state != 'DONE' or result.errors:
        raise Exception('Failed extract job {} for table {}'.format(result.job_id, table_name))
    else:
        print('BQ table(s) export completed successfully')
    storage_client = storage.Client(project=gs_project_name)
    bucket = storage_client.get_bucket(gs_bucket_name)
    blob_list = bucket.list_blobs(prefix=f'{table_name}_temp')
    print('Merging shard files into single file')
    bucket.blob(f'{table_name}.csv').compose(blob_list)
Can you please help me find a way to skip the header?
Thanks,
Raghunath.
We can avoid the header by passing a job config with the print_header parameter set to False. Sample code:
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(table_ref, destination_uris,
                                   job_config=job_config)
Thanks
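To see why the header has to be suppressed at export time rather than after merging, here is a small local simulation. It assumes that composing objects simply concatenates their bytes in order, and that with the default setting every exported shard starts with its own header row. The compose and count_header_lines functions and the sample shard contents are hypothetical stand-ins; nothing here calls Google Cloud.

```python
def compose(shards):
    """Stand-in for object composition: concatenate shard contents in order."""
    return b"".join(shards)


def count_header_lines(data, header):
    """Count how many lines in data are exactly the header line."""
    return sum(1 for line in data.splitlines() if line == header)


HEADER = b"id,name"

# Shards as they might look with the default print_header=True.
with_header = [b"id,name\n1,a\n2,b\n", b"id,name\n3,c\n"]
# Shards as they might look with print_header=False.
without_header = [b"1,a\n2,b\n", b"3,c\n"]

print(count_header_lines(compose(with_header), HEADER))     # 2
print(count_header_lines(compose(without_header), HEADER))  # 0
```

With headers enabled, the merged file contains one header per shard, including copies in the middle of the data, which is what breaks downstream readers.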
You can use skipLeadingRows (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.googleSheetsOptions.skipLeadingRows)
