Size of the input / output parameters in the pipeline step in Azure - azure-machine-learning-service

While running the pipeline creation Python script, I'm facing the following error:
"AzureMLCompute job failed. JobConfigurationMaxSizeExceeded: The specified job configuration exceeds the max allowed size of 32768 characters. Please reduce the size of the job's command line arguments and environment settings"

I haven't seen that error before! My guess is that you're passing data as a string argument to a downstream pipeline step when you should be using PipelineData or OutputFileDatasetConfig.
I strongly suggest you read more about moving data between steps of an AML pipeline.
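As a rough illustration of that suggestion (a minimal sketch assuming azureml-core, a compute target named "cpu-cluster", and placeholder scripts prepare.py/train.py in ./src, none of which come from the question), an intermediate OutputFileDatasetConfig lets one step hand arbitrarily large data to the next without putting it on the command line:

from azureml.core import Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Step 1 writes its (possibly large) output here; step 2 reads it back,
# so only a short mount path ever appears on the command line.
prepared = OutputFileDatasetConfig(destination=(datastore, "prepared/{run-id}"))

prepare_step = PythonScriptStep(
    name="prepare",
    script_name="prepare.py",        # placeholder script
    source_directory="./src",
    compute_target="cpu-cluster",    # assumed compute target name
    arguments=["--output-dir", prepared],
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",          # placeholder script
    source_directory="./src",
    compute_target="cpu-cluster",
    arguments=["--input-dir", prepared.as_input()],
)

pipeline = Pipeline(workspace=ws, steps=[prepare_step, train_step])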

We hit this when we tried to pass quite lengthy content as an argument value to a pipeline. You can instead upload the file to blob storage, optionally create a dataset, and then pass only the dataset name or file path to the AML pipeline as a parameter. The pipeline step then reads the content of the file from the blob.
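A minimal sketch of that workaround, assuming azureml-core and placeholder names (payload.json, payload-config) rather than your actual files:

from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Upload the lengthy content as a file instead of passing it on the command line.
datastore.upload_files(
    files=["./payload.json"],        # placeholder file name
    target_path="pipeline-inputs/",
    overwrite=True,
)

# Optionally register it as a dataset so the step can resolve it by name...
dataset = Dataset.File.from_files(path=(datastore, "pipeline-inputs/payload.json"))
dataset.register(workspace=ws, name="payload-config", create_new_version=True)

# ...then pass only the short name or path ("pipeline-inputs/payload.json") to the
# pipeline step, which reads the full content from blob storage at run time.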

Related

Azure Data Factory, how to pass parameters from trigger/pipeline in to data source

I need help. I've created a pipeline for data processing, which imports a CSV and copies the data to a DB. I've also configured a Blob storage trigger, which triggers the pipeline with the dataflow when a specific file is uploaded into a container. For the moment, this trigger is set to monitor one container; however, I would like to make it more universal: monitor all containers in the desired Storage Account, so that if someone sends some files, the pipeline is triggered. But for that I need to pass the container name to the pipeline so it can be used in the data source file path. For now I've created something like this:
In the pipeline, I've added this parameter, @pipeline().parameters.sourceFolder:
Next, in the trigger, I've set this:
Now what should I set here to pass this folder path?
You need to use dataset parameters for this.
Like the folderPath parameter in the pipeline, create another pipeline parameter for the file name as well, and assign @triggerBody().folderPath and @triggerBody().fileName to them when creating the trigger.
Pipeline parameters:
Make sure you give all containers in the storage event trigger while creating the trigger.
Assigning trigger parameters to pipeline parameters:
Now, create two dataset parameters for the folder and file name like below.
Source dataset parameters:
Use these in the file path of the dataset as dynamic content.
If you use a copy activity for this dataset, then assign the pipeline parameter values (which we get from the trigger parameters) to the dataset parameters like below.
If you use dataflows for the dataset, you can assign these in the dataflow activity itself, like below, after giving the dataset as the source in the dataflow.
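To recap the wiring described above in one place (the parameter names folderPath and fileName are just the illustrative ones used in this answer):

Trigger → pipeline parameters:
folderPath = @triggerBody().folderPath
fileName = @triggerBody().fileName

Dataset file path (dynamic content), using the dataset parameters:
Directory: @dataset().folderPath
File: @dataset().fileName

Copy/dataflow activity → dataset parameters:
folderPath = @pipeline().parameters.folderPath
fileName = @pipeline().parameters.fileName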
Thank you, Rakesh. I need to process a few specific files from a package that will be sent to the container. Each time, the user/application will send the same set of files, so in the trigger I'm checking whether a new drive.xml file was sent to any container. This file defines the type of data that was sent, so when it arrives, I know that new data files have been sent as well and will be present in a lower folder.
For example, drive.xml was found at /container/data/somefolder/2022-01-22/drive.xml, and then I know that /container/data/somefolder/2022-01-22/datafiles/ contains the 3 files that I need to process.
Therefore, in the parameters I need to pass only the file path; the file names will always be the same.
The dataset configuration looks like this:
and the event trigger like this:

How do I pull the last modified file with data flow in azure data factory?

I have files that are uploaded into an on-prem folder daily. From there, I have a pipeline pulling them to a blob storage container (input), and from there I have another pipeline from blob (input) to blob (output); this is where the dataflow sits, between those two blobs. Finally, I have the output linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the dataflow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below.
[Blob to Blob Pipeline screenshot: https://i.stack.imgur.com/24Uky.png]
Please let me know if there is anything else that would make this clearer.
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can use Filter by last modified in the Get Metadata activity, passing dynamic content as below:
@startOfDay(utcnow()): takes the start of the day for the current timestamp.
@utcnow(): takes the current timestamp.
Input and output of the Get Metadata activity (it filters files for that day only):
If there are multiple files for a particular day, then you have to use a ForEach activity and pass the output of the Get Metadata activity to it as
@activity('Get Metadata1').output.childItems
Then add a Dataflow activity inside the ForEach and create a source dataset with a filename parameter.
Use that filename parameter as the dynamic value for the file name in the dataset.
Then pass the source parameter filename as @item().name.
It will run the dataflow for each file that Get Metadata returns.
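Putting those pieces together (using the activity and parameter names from this answer):

Get Metadata → Filter by last modified:
Start time: @startOfDay(utcnow())
End time: @utcnow()

ForEach → Items:
@activity('Get Metadata1').output.childItems

Dataflow source dataset parameter (filename):
@item().name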
I was able to solve this by selecting "Delete source files" in the dataflow. This way the first pipeline pulls the new daily report into the input, and when the second pipeline (with the dataflow) pulls the file from input to output, it deletes the file in input, hence not allowing it to duplicate.

How to log errors in an ADF dataflow with parallel sources

I have to do some data engineering by reading manifest.cdm.json files from the data lake,
add a pipeline run ID column, and push to a SQL database.
I have one JSON list file which has the required parameters to read the CDM JSON files in the source of the dataflow.
Previous approach: I used a ForEach and passed the parameters to a dataflow with a single activity, then captured errors. But using a dataflow inside a ForEach costs too much.
Current approach: I manually created a dataflow with all the CDM files. But here I'm not able to capture errors: if any source gets an error, the whole dataflow activity fails, and if I select "skip error" in the dataflow activity, I don't get any error at all.
So what should the approach be to get the errors from the current approach?
You can capture the error using a Set Variable activity in Azure Data Factory.
Use the below expression to capture the error message in the Set Variable activity:
@activity('Data Flow1').Error.message
Later, you can store the error message in blob storage for future reference using a copy activity. In the below example, we save the error message in a .csv file using a DelimitedText dataset.
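For illustration (the variable name errorMessage is just an example; the activity name must match your data flow activity), the Set Variable activity connected to the data flow's failure output would be configured as:

Name: errorMessage
Value: @activity('Data Flow1').Error.message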

Pass value from operator to DAG

I have a BashOperator in my DAG which runs a Python script. At the end of its run, the Python script creates a JSON report. In case of failure, I want to report this JSON to Slack. I have an on_failure_callback function which can push a message to Slack. However, I have no elegant way to pass the JSON value to the DAG. Currently, I am saving the JSON to a file and then reading it from the file in the DAG and reporting it. I also tried storing it in an environment variable and getting it in the DAG. But is there a more direct way to pass this value to the DAG? Preferably without saving it to a file or an environment variable.
Rather than using on_failure_callback, you should use another task to send your Slack message.
And for that you should simply use XComs to push/pull the JSON: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#
Then you can pull the XCom in the task that sends it on to Slack.
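A minimal sketch of that pattern, assuming Airflow 2.x, a placeholder script path, and a hypothetical send_slack_message helper (swap in your own Slack notification):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def report_to_slack(**context):
    # BashOperator pushes the last line of stdout to XCom by default
    # (do_xcom_push=True), so have the script print the JSON report last.
    report_json = context["ti"].xcom_pull(task_ids="run_script")
    send_slack_message(report_json)  # hypothetical helper, not part of Airflow

with DAG(
    dag_id="script_with_slack_report",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_script = BashOperator(
        task_id="run_script",
        bash_command="python /path/to/script.py",  # placeholder path
    )

    notify = PythonOperator(
        task_id="report_to_slack",
        python_callable=report_to_slack,
        trigger_rule="all_done",  # run even if run_script failed
    )

    run_script >> notify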

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
I've got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all the .json files and use all of them to train with. I could possibly use just the new .json files as well (an interesting option, honestly).
Currently I just have the store mounted and available as a data lake, and the code just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
From an audit perspective, I have version control, execution time, and time-stamped read-only data. Doing a replay on this would require additional coding, but it is doable.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. When you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step:
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

# ws, compute and run_config are assumed to be defined earlier in your pipeline script.
ds = ws.get_default_datastore()

# Point a DataReference at the blob folder that holds the data files.
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as:

import argparse

# Parse the --data-folder argument that the pipeline step passes in.
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
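Since the goal is to train on every .json file in that folder, a small standard-library sketch of iterating the mounted input could look like this:

import glob
import os

# Recursively collect every .json telemetry file under the mounted folder.
json_files = glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True)
print('Found', len(json_files), 'telemetry files')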
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, it's pretty easy to schedule the pipeline to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

# Publish the pipeline from an existing run, then schedule it to run hourly.
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1",
                                              description="Some desc",
                                              version="1",
                                              continue_on_step_failure=True)

recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws,
                           name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder and iterate over the JSON files.
