Azure Data Factory: how to pass parameters from trigger/pipeline into the data source - azure

I need help. I've created a pipeline for data processing, which imports a CSV and copies the data to a DB. I've also configured a Blob storage trigger, which triggers the pipeline with a dataflow when a specific file is uploaded into a container. For the moment, this trigger is set to monitor one container; however, I would like to make it more universal: monitor all containers in the desired Storage Account, so that if someone sends some files, the pipeline is triggered. But for that I need to pass the container name to the pipeline to be used in the datasource file path. For now I've created something like this:
In the pipeline, I've added this parameter, @pipeline().parameters.sourceFolder:
Next, in the trigger, I've set this:
Now, what should I set here to pass this folder path?

You need to use dataset parameters for this.
Like the folderpath parameter in the pipeline, create another pipeline parameter for the file name as well, and assign @triggerBody().folderPath and @triggerBody().fileName to them when creating the trigger.
Pipeline parameters:
Make sure you select all containers in the storage event trigger while creating the trigger.
Assigning trigger parameters to pipeline parameters:
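As a sketch of what that assignment looks like in the trigger definition (the pipeline name MyPipeline is a placeholder, and the JSON fragment is illustrative, not the exact screenshot that was here):

    "pipelines": [
        {
            "pipelineReference": { "referenceName": "MyPipeline", "type": "PipelineReference" },
            "parameters": {
                "sourceFolder": "@triggerBody().folderPath",
                "sourceFile": "@triggerBody().fileName"
            }
        }
    ]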
Now, create two dataset parameters for the folder and file name like below.
Source dataset parameters:
Use these in the file path of the dataset dynamic content.
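For example, assuming the dataset parameters are named folderName and fileName, the dataset's file path fields would use:

    Directory: @dataset().folderName
    File:      @dataset().fileName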
If you use a copy activity with this dataset, then assign the pipeline parameter values (which we get from the trigger parameters) to the dataset parameters like below.
If you use dataflows with the dataset, you can assign these in the dataflow activity itself, like below, after setting the dataset as the source in the dataflow.

Thank you Rakesh.
I need to process a few specific files from a package that will be sent to the container. Each time, the user/application will send the same set of files, so in the trigger I'm checking whether a new drive.xml file was sent to any container. This file defines the type of the data that was sent, so if it arrives, I know that new datafiles have been sent as well and they will be present in a lower folder.
E.g. drive.xml was found in /container/data/somefolder/2022-01-22/drive.xml, and then I know that in /container/data/somefolder/2022-01-22/datafiles/ there are 3 files that I need to process.
Therefore, in the parameters I need to pass only the file path; the file names will always be the same.
The dataset configuration looks like this:
and the event trigger like this:
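Since the file names are fixed and only the folder varies, one way to express this (a sketch, using the sourceFolder pipeline parameter from above) is to derive the datafiles folder from the trigger output when assigning the parameter:

    sourceFolder: @concat(triggerBody().folderPath, '/datafiles')

The dataset then appends the fixed file names to this path.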

Related

How do I pull the last modified file with data flow in azure data factory?

I have files that are uploaded into an on-prem folder daily. From there, I have a pipeline pulling them to a blob storage container (input), and from there another pipeline from blob (input) to blob (output); this is where the dataflow is, between those two blobs. Finally, I have the output linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the dataflow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below.
Blob to Blob Pipeline (screenshot): https://i.stack.imgur.com/24Uky.png
Please let me know if there is anything else that would make this more clear.
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can use Filter by last modified date by passing the dynamic content as below:
@startOfDay(utcnow()): it will take the start of the day for the current timestamp.
@utcnow(): it will take the current timestamp.
Input and output of the Get Metadata activity (it filters files for that day only):
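As a rough sketch, the filter shows up in the Get Metadata activity's store settings like this (property names as in the Blob read settings; the fragment is illustrative):

    "typeProperties": {
        "fieldList": ["childItems"],
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "modifiedDatetimeStart": "@startOfDay(utcnow())",
            "modifiedDatetimeEnd": "@utcnow()"
        }
    }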
If there are multiple files for a particular day, then you have to use a ForEach activity and pass the output of the Get Metadata activity to the ForEach activity as
@activity('Get Metadata1').output.childItems
Then add a Dataflow activity in the ForEach and create a source dataset with a filename parameter.
In the dataset's file name field, use the filename parameter as dynamic content.
Then pass the source parameter filename as @item().name.
It will run the dataflow for each file the Get Metadata activity returns.
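A sketch of those ForEach settings (activity name as above; treat the JSON fragment as illustrative):

    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        }
    }

Inside the loop, the dataflow's filename parameter is then set to @item().name.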
I was able to solve this by selecting "Delete source files" in the dataflow. This way, the first pipeline pulls the new daily report into the input, and when the second pipeline (with the dataflow) pulls the file from input to output, it deletes the file in input, hence not allowing it to duplicate.

I would like to add the date and time string in addition to the source file name (Azure Synapse Analytics)

When copying a file from S3 to Azure Blob Storage, I would like to add the date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.
Using the Get Metadata activity, ForEach activity, and Execute Pipeline activity, get the nested folder structure from the source dataset. Pass the extracted folder structure to the sink dataset dynamically by adding the required string value to the folder structure.
Create a source dataset with the dataset parameter for the directory.
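For example, assuming the dataset parameter is named dir (as used later in this answer), the directory field of the dataset would take the dynamic content:

    @dataset().dir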
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop each folder.
@activity('get sub folder list_yyyy').output.childItems
Inside ForEach activity, add the execute pipeline activity. Create a new pipeline (pipeline2) with 2 parameters in it to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: @item().name
sink_dir1: @concat('year=', item().name)
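As a sketch, the Execute Pipeline activity's settings in pipeline1 would look like this (parameter names as above; the JSON fragment is illustrative):

    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "pipeline2", "type": "PipelineReference" },
        "parameters": {
            "SubFolder1": "@item().name",
            "sink_dir1": "@concat('year=', item().name)"
        }
    }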
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: @pipeline().parameters.SubFolder1
Inside ForEach activity, add execute pipeline to pass the current item to nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold source and sink structures.
SubFolder2: @concat(pipeline().parameters.SubFolder1, '/', item().name)
sink_dir2: @concat(pipeline().parameters.sink_dir1, '/month=', item().name)
Pipeline3:
Using the Get Metadata activity, get the child items under the source structure.
Dataset property - dir: @pipeline().parameters.SubFolder2
Pass the child items to ForEach activity. Inside ForEach activity add copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and current child item.
dir: @concat(pipeline().parameters.SubFolder2, '/', item().name, '/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: @concat(pipeline().parameters.sink_dir2, '/day=', item().name, '/')
Output structure: It creates the folder structure automatically if not available in the sink.
You will first need the file name (use Get Metadata). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable 'NewFileName' and pass it as a parameter to the copy activity. Then the copy source will have the original file name and the sink will have the new file name. The copy activity will be parameterized, as you will be passing the file name dynamically; for example, with an expression like the sketch below.
Hope this helps.
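As an illustration, a hypothetical expression for such a 'NewFileName' variable (the extension and timestamp format are assumptions, not from the original answer):

    @concat(replace(item().name, '.json', ''), '_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.json')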

how to segregate files in a blob storage using ADF copy activity

I have a copy data activity in ADF and I want to segregate the files into a different container based on the file type.
ex.
Container A - .jpeg, .png
Container B - .csv, .xml and .doc
My initial idea was to use an 'if condition' and an 'or' statement, but it looks like my approach won't work.
I'd appreciate it if you could give some inputs.
First, get the list of files from the source container and loop over each file in the ForEach activity to check the extension using an If Condition, then copy files based on the condition to their respective containers.
I have the below files in my source container.
In ADF:
Using the Get metadata activity, get the list of all files from the source container.
Output of Get Metadata activity:
Pass the output list to Foreach activity.
@activity('Get Metadata1').output.childItems
Inside the Foreach activity, add If Condition activity to separate the files based on extension.
@or(contains(item().name, '.xml'), contains(item().name, '.csv'))
If the condition is true, copy the current file to container1.
If the condition returns false, copy the current file to container2 in False activity.
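As an illustrative fragment, the If Condition activity carrying the expression above might look like this (the activity name is a placeholder):

    "name": "If Condition1",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@or(contains(item().name, '.xml'), contains(item().name, '.csv'))",
            "type": "Expression"
        }
    }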
Files in the container after running the pipeline.
Container1:
Container2:
You could use the Get Metadata activity to first get all the file types, which are then passed to a Copy Data activity that copies to container A, and then add the next Get Metadata activity to get file types for the next Copy Data activity that copies to container B.
So your ADF pipeline may look like GetMetaData1 -> CopyData1 -> GetMetaData2 -> CopyData2. Refer to how to use the Get Metadata activity in this article, and the documentation.
The Copy activity itself allows more than one wildcard path, so you could use that in the data source; see this.
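For instance, a sketch of wildcard settings in a copy source (property names as in the Blob read settings; the folder path and file type are assumptions):

    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "wildcardFolderPath": "input",
        "wildcardFileName": "*.csv"
    }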

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say, activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLS Gen2), using a REST API as the source and ADLS Gen2 as the sink (destination).
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and an activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say, myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please NOTE that the JSON file name (MyJsonFile.json) is a constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the Lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy Data activity.
Output of lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the Lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
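A minimal sketch of that Set Variable activity (the lookup name 'Lookup1' is a placeholder; the variable name comes from the question):

    "name": "Set variable1",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "myURLVar",
        "value": {
            "value": "@activity('Lookup1').output.firstRow.file_url",
            "type": "Expression"
        }
    }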

Azure ML SDK DataReference - File Pattern - MANY files

I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline
Anyways, I've got a stream analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all .json files and use all of those files to train with. I could possibly use just new .json files as well (an interesting option, honestly).
Currently I just have the store mounted to a data lake and available; and it just iterates the mount for the data files and loads them up.
How can I use data references for this instead?
What do data references do for me that mounting time-stamped data does not?
a. From an audit perspective, I have version control, execution time and time stamped read only data. Albeit, doing a replay on this would require additional coding, but is do-able.
As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. Then, when you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

# Folder on the default datastore that holds the data files
# (assumes ws, compute and run_config are defined as in the linked tutorial)
ds = ws.get_default_datastore()
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

# The step receives the folder as an input and as a command-line argument
step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)
Then inside your train.py you access the path as
import argparse

# Read the mounted data folder path passed by the pipeline step
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, then it's pretty easy to schedule it to run regularly. For example:
from azureml.pipeline.core import Schedule, ScheduleRecurrence

# Publish the pipeline so it can be triggered on a schedule
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1", description="Some desc",
                                              version="1", continue_on_step_failure=True)

# Run every hour
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder to iterate over the JSON files; see the sketch below.
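A sketch of that approach with DataPath and PipelineParameter (the folder path and parameter name are illustrative):

    from azureml.data.datapath import DataPath, DataPathComputeBinding
    from azureml.pipeline.core import PipelineParameter

    # Default folder on the datastore; can be overridden per submitted run
    data_path = DataPath(datastore=ds, path_on_datastore='folder1/')
    path_param = PipelineParameter(name='input_path', default_value=data_path)

    # Bind the parameter so the folder is mounted on the compute at runtime
    input_binding = (path_param, DataPathComputeBinding(mode='mount'))

Pass input_binding in the step's inputs list, and the mounted path arrives in your script just like a DataReference.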
