How to set and get variable value in Azure Synapse or Data Factory pipeline - azure

I have created a pipeline with Copy Activity, say, activity1in Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2) using source as a REST Api and Sink (destination) as ADLSGen2. Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or, are there any better suggestions/solutions to achieve this task.
Remarks: I have tried Set Variable Activity (shown below) by first declaring a variable in the pipeline and the using that variable, say, myURLVar in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please NOTE the Json file name (MyJsonFile.json) is a constant, but zip file name in the URL is dynamic (based on timestamp), hence we cannot just hard code the above url.

As #Steve Zhao mentioned in the comments, use lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using set variable activity.
Connect the lookup activity to the sink dataset of previous copy data activity.
Output of lookup activity:
I have used the substring function in set activity to extract the URL from the lookup output.
#replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:

There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First setup a Lookup activity that loads the JSON file in the same way as #NiharikaMoola-MT's answer shows.
Then for the Set Variable activity's Value setting, use the following dynamic expression: #activity('<YourLookupActivityNameHere>').output.firstRow.file_url

Related

How do I pull the last modified file with data flow in azure data factory?

I have files that are uploaded into an onprem folder daily, from there I have a pipeline pulling it to a blob storage container (input), from there I have another pipeline from blob (input) to blob (output), here is were the dataflow is, between those two blobs. Finally, I have output linked to sql. However, I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow. The way I have it setup, every time the pipeline runs, it doubles my files. I've attached images below
[![Blob to Blob Pipeline][1]][1]
Please let me know if there is anything else that would make this more clear
[1]: https://i.stack.imgur.com/24Uky.png
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve above scenario, you can use Filter by last Modified date by passing the dynamic content as below:
#startOfDay(utcnow()) : It will take start of the day for the current timestamp.
#utcnow() : It will take current timestamp.
Input and Output of Get metadata activity: (Its filtering file for that day only)
If the files are multiple for particular day, then you have to use for each activity and pass the output of Get metadata activity to foreach activity as
#activity('Get Metadata1').output.childItems
Then add Dataflow activity in Foreach and create source dataset with filename parameter
Give filename parameter which is created as dynamic value in filename
And then pass source parameter filename as #item().name
It will run dataflow for each file get metadata is returning.
I was able to solve this by selecting "Delete source files" in dataflow. This way the the first pipeline pulls the new daily report into the input, and when the second pipeline (with the dataflow) pulls the file from input to output, it deletes the file in input, hence not allowing it to duplicate

Azure Data Factory: Cannot save the output of Set Variable into file/Database

I'm trying to store a list of file names within an Azure Blob container into a SQL db. The pipeline runs successfully, but after running the pipeline, it cannot output the values (file names) into the sink database, and the sink table doesn't get updated even after the pipeline completed. Followings are the steps I went through to implement the pipeline. I wonder which steps I made mistake.
I have followed the solutions given in the following links as well:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
Transfer the output of 'Set Variable' activity into a json file [Azure Data Factory]
Steps:
1- Validating File Exists, Get Files metadata and child items, Iterate the files through a foreach.
2- Variable defined at the pipeline level to hold the filenames
Variable Name: Files, Type: string
3- parameter defined to dynamically specify the dataset directory name. Parameter name: dimName, parameter type: string
4- Get Metadata configurations
5- Foreach settings
#activity('MetaGetFileNames').output.childItems
6 - Foreach Activity overview. A set Variable to set the each filename into the defined variable 'files'. Copy Activity to store the set value into db.
7- set variable configuration
8- Copy Activity source configuration. Excel Dataset refers to an empty excel file in azure blob container.
9- Copy Activity sink configuration
10-Copy Activity: mapping configuration
Instead of selecting an empty excel file, refer to a dummy excel file with dummy data.
Source: dummy excel file
You can skip using Set variable activity as you can use the Foreach current item directly in the Additional column dynamic expression.
Add additional columns in the Mapping.
Sink results in SQL database.

Azure Data Factory v2 - How to get Copy Data tool output

I'm trying to create an archiving pipeline which essentially does the following:
Call stored procedure (GET) from a SQL Azure DB that returns a resultset
Archive the result from #1 onto storage account (e.g. json files)
Extract ID column from #1 into an array of int
Use result from #4 as a parameter to call a stored procedure (DELETE) in the same SQL Azure DB
So far, I've tried the Copy Data activity/tool which satisfies steps 1 & 2. However, I'm not sure how to get the outputs from that step and can't find any documentation at Microsoft.
Is it not the correct usage? Do I have to manually do it instead?
Also, I'd like to do some validation in between steps (i.e. no result? don't proceed).
I've managed to try the bare/general stored procedure activity but also can't find where to retrieve its output for use in the next step. I'm pretty new to Data Factory and don't really work with data engineering/piplines so please bear with me.
Using Copy data activity, you can copy stored procedure data to storage. Connect the source to SQL database and use stored procedure as query option, connect the sink to sink folder in a storage account.
Once the data is copied to a storage account, use lookup activity to read the data from the file which is generated from #1.
Extract the output of lookup activity into an array variable using set variable activity.
You can use If condition activity validates the result and run the other activities if it’s true.

how to download blobs using azure data factory

I'm using data factory to create my pipeline, and I'm facing some challenges.
the pipeline consists of a lookup which has a json array and a foreach to loop this json array and finally a set variable inside the foreach loop:
pipeline :
lookup :
variable :
now what I'm looking for is to pass the result of the set variable value(which is a like to an image) to a copy activity or something like that in order to doawnload the image in our datalake container.
and the name of doownloaded image should be like this :
id +'_'+guid()+'.png'
thanks for your help
Using copy data activity and connecting source dataset to HTTP connector to execute the URL and sink to data lake storage, you can copy the images to data lake as shown below.
Using the lookup activity, get the array value from the JSON file.
Output of lookup:
Passing the output of the lookup activity to Foreach activity.
#activity('Lookup1').output.value
Inside Foreach activity, use copy data activity to copy the image from URL to data lake.
Using a set variable is optional to store the URL from lookup output as you can directly use the current item in copy activity.
Source dataset:
Create HttpServer linked service by parameterizing the URL and with the binary dataset.
In the source dataset, pass the dataset parameter to the linked service parameter value.
In source settings, pass the current item URLs column to the dataset property item().urls.
Connect the sink to data lake storage with binary dataset and create a dataset parameter to provide the name of the file in the sink dataset.
In sink settings, provide the dataset file name as required.
#concat(item().id,'_',guid(),'.png')
Sink file in data lake:
Image1:
Image2:

How to include blob metadata in copy data mapping

I'm working on a ADF v2 pipeline, which copies data from csv blob to Azure SQL database table. For each load I would like to collect source metadata, like source blob name, and save it to a target table as a part of data lineage framework.
My blob source run the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table run the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know, how to properly include name of the source in the mapping section of Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of an expression #activity('Get Metadata1').output.childitems
inside the [ForEach1] activity, I have placed [Copy Data1] activity, where I have source and sink sections defined.
What I'm looking for is a way to add extra line to the mapping section, which will samehow bind #item().name to destination column [DwhSourceName]
Thanks for all suggestion on how to achieve this.
Actually,based on my test,you can specify the dymatic content of column key,but you can't set blob metadata as value of columns in copy data mapping at the pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column in your source data before the copy activity.Maybe you could use Azure Blob Trigger Function to get the blob file name so hat you could add the FileName column when any data stream into the blob.(Please refer to this case:How Do I get the Name of The inputBlob That Triggered My Azure Function With Python)

Resources