Azure Data Factory: Cannot save the output of Set Variable into file/Database

I'm trying to store a list of file names from an Azure Blob container in a SQL database. The pipeline runs successfully, but it does not write the values (file names) to the sink database, and the sink table is not updated even after the pipeline completes. The following are the steps I went through to implement the pipeline. I wonder in which step I made a mistake.
I have followed the solutions given in the following links as well:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
Transfer the output of 'Set Variable' activity into a json file [Azure Data Factory]
Steps:
1- Validating File Exists, Get Files metadata and child items, Iterate the files through a foreach.
2- Variable defined at the pipeline level to hold the filenames
Variable Name: Files, Type: string
3- parameter defined to dynamically specify the dataset directory name. Parameter name: dimName, parameter type: string
4- Get Metadata configurations
5- Foreach settings
@activity('MetaGetFileNames').output.childItems
6- ForEach activity overview: a Set Variable activity assigns each file name to the variable 'Files', and a Copy activity stores that value in the database.
7- set variable configuration
8- Copy Activity source configuration. Excel Dataset refers to an empty excel file in azure blob container.
9- Copy Activity sink configuration
10- Copy Activity: mapping configuration

Instead of selecting an empty Excel file, refer to a dummy Excel file that contains dummy data. An empty source yields no rows, so the copy has nothing to write to the sink.
Source: dummy excel file
You can skip the Set Variable activity, because the ForEach current item (@item().name) can be used directly in the Additional columns dynamic expression.
Add additional columns in the Mapping.
Sink results in SQL database.

Related

I would like to add the date and time string in addition to the source file name (Azure Synapse Analytics)

When copying a file from S3 to AzureBlobStorage, I would like to add the date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.
Using the Get Metadata, ForEach, and Execute Pipeline activities, get the nested folder structure from the source dataset. Then pass the extracted folder structure to the sink dataset dynamically, adding the required string value to each folder level.
Create a source dataset with the dataset parameter for the directory.
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop each folder.
@activity('get sub folder list_yyyy').output.childItems
Inside ForEach activity, add the execute pipeline activity. Create a new pipeline (pipeline2) with 2 parameters in it to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: @item().name
sink_dir1: @concat('year=',item().name)
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: @pipeline().parameters.SubFolder1
Inside ForEach activity, add execute pipeline to pass the current item to nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold source and sink structures.
SubFolder2: @concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2: @concat(pipeline().parameters.sink_dir1,'/month=',item().name)
Pipeline3:
Using the Get Metadata activity get the child items under the source structure.
Dataset property - dir: @pipeline().parameters.SubFolder2
Pass the child items to ForEach activity. Inside ForEach activity add copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and current child item.
dir: @concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: @concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: It creates the folder structure automatically if not available in the sink.
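The net effect of the three nested pipelines on a single path can be sketched in Python. This is only an illustration of the folder renaming, not ADF code; the function name to_partitioned_path and the sample file name are hypothetical.

```python
def to_partitioned_path(source_path: str) -> str:
    """Map 'data/yyyy/mm/dd/<file>' to 'data/year=yyyy/month=mm/day=dd/<file>'."""
    parts = source_path.strip("/").split("/")
    if len(parts) < 4:
        raise ValueError(f"expected at least data/yyyy/mm/dd, got: {source_path!r}")
    # Pipeline1 adds 'year=', pipeline2 adds 'month=', pipeline3 adds 'day='.
    root, year, month, day, *rest = parts
    return "/".join([root, f"year={year}", f"month={month}", f"day={day}", *rest])


# Hypothetical sample path, for illustration only.
print(to_partitioned_path("data/2021/03/15/photo.jpg"))
```

Each prefix corresponds to one concat() call above: 'year=' in sink_dir1, '/month=' in sink_dir2, and '/day=' in Sink_dir.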
You will first need the file name (use the Get Metadata activity). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable 'NewFileName' and pass it as a parameter to the copy activity. The copy source then has the original file name and the sink has the new file name. The copy activity will be parameterized because you are passing the file name dynamically.
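As a rough illustration of the renaming step, the Python sketch below builds a new file name from the original one. The function name with_timestamp, the timestamp format, and placing the timestamp before the extension are all assumptions; in a pipeline the same idea would be written with concat() together with date functions such as utcnow() and formatDateTime().

```python
from datetime import datetime
from pathlib import PurePosixPath


def with_timestamp(file_name: str, when: datetime) -> str:
    """Insert a yyyyMMddHHmmss timestamp between the base name and the extension."""
    path = PurePosixPath(file_name)
    stamp = when.strftime("%Y%m%d%H%M%S")  # assumed format, adjust as needed
    return f"{path.stem}_{stamp}{path.suffix}"


# Hypothetical file name and a fixed time so the result is reproducible.
print(with_timestamp("report.csv", datetime(2022, 1, 2, 3, 4, 5)))
```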
Hope this helps.

how to segregate files in a blob storage using ADF copy activity

I have a copy data activity in ADF and I want to segregate the files into a different container based on the file type.
ex.
Container A - .jpeg, .png
Container B - .csv, .xml and .doc
My initial idea was to use 'if condition' and 'or' statement but looks like my approach won't work.
I'd appreciate it if you could give some inputs.
First, get the list of files from the source container. Then loop over each file in a ForEach activity, check its extension with an If Condition, and copy the file to the matching container.
I have the below files in my source container.
In ADF:
Using the Get metadata activity, get the list of all files from the source container.
Output of Get Metadata activity:
Pass the output list to Foreach activity.
@activity('Get Metadata1').output.childitems
Inside the Foreach activity, add If Condition activity to separate the files based on extension.
@or(contains(item().name, '.xml'),contains(item().name, '.csv'))
If the condition is true, copy the current file to container1.
If the condition returns false, copy the current file to container2 in False activity.
Files in the container after running the pipeline.
Container1:
Container2:
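The If Condition logic above can be sketched in Python as follows. The function name choose_container is hypothetical; the container names are the ones used in this answer.

```python
def choose_container(file_name: str) -> str:
    """Mirror @or(contains(name, '.xml'), contains(name, '.csv'))."""
    # contains() is a substring test, so '.csv' anywhere in the name matches.
    if ".xml" in file_name or ".csv" in file_name:
        return "container1"  # True branch of the If Condition
    return "container2"      # False branch of the If Condition


# Hypothetical file names, for illustration only.
for name in ["stores.csv", "logo.png", "feed.xml"]:
    print(name, "->", choose_container(name))
```

Because this is a substring match, a name such as archive.csv.bak would also go to container1. If that matters, a suffix check (endswith in ADF expressions) is the stricter choice.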
You should use the Get Metadata activity to first get the file types that need to be passed to a Copy Data activity that copies to Container A, and then add a second Get Metadata activity to get the file types for the next Copy Data activity, which copies to Container B.
So your ADF pipeline may look like GetMetaData1 -> CopyData1 -> GetMetaData2 -> CopyData2. Refer to this article and the documentation for how to use the Get Metadata activity.
Copy activity itself allows more than one wildcard path, so you could use that in the Data source, see this.

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace. It loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination). Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or, are there any better suggestions/solutions to achieve this task.
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the URL from the above JSON. Please NOTE that the JSON file name (MyJsonFile.json) is constant, but the zip file name in the URL is dynamic (based on a timestamp), so we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file, then extract the required URL from the lookup output value using a Set Variable activity.
Connect the lookup activity to the sink dataset of previous copy data activity.
Output of lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
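To show what this property access amounts to, here is a small Python sketch that parses the one-line JSON from the question and reads the same key. It is an analogy only, not ADF code, and the function name extract_file_url is hypothetical.

```python
import json


def extract_file_url(json_line: str) -> str:
    """Parse a single-line JSON record and return its 'file_url' field."""
    first_row = json.loads(json_line)  # plays the role of output.firstRow
    return first_row["file_url"]       # plays the role of .file_url


# The JSON content shown in the question.
raw = '{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}'
print(extract_file_url(raw))
```

Reading the field by name avoids the nested replace/substring calls, because the JSON is parsed as a record instead of being handled as a string.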

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date from the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders, filtering them, and saving them is easy. The problems start when I try to set the output dataset file name dynamically. I can get the folder names using the Get Metadata activity, and then I can use the ForEach activity to set them into variables. However, I can't figure out how to use this variable in the sink dataset of the filtering data flow.
Update:
My Get Metadata1 activity, set the container input as:
Set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files for example.
My input looks something like this
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
Use the Get Metadata1 activity to get the folder list, then use the ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add the dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add the dynamic content @item().name to the parameter FolderName.
In the source options tab, key in File_Name as the column to store the file name. It will store the file name as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the Settings of the sink, we can select "Name file as column data" and choose File_Name.
That's all.
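To check what the derived column expression produces, here is a Python sketch that mirrors it. The function name derived_file_name is hypothetical. It assumes the data flow expression language's 1-based indexing for substring() and for array elements, which is why the Python offsets are shifted by one.

```python
def derived_file_name(file_name: str) -> str:
    """Mirror concat('data_', substring(File_Name,2,10), '_', split(File_Name,'-')[5])."""
    date_part = file_name[1:11]            # substring(File_Name, 2, 10): 10 chars from position 2
    last_piece = file_name.split("-")[4]   # split(File_Name, '-')[5]: fifth element
    return "data_" + date_part + "_" + last_piece


# The sample value from the answer above.
print(derived_file_name("/2021-01-01/data-file-001.csv"))
```

For the sample value this gives data_2021-01-01_001.csv, so the sequence number keeps its leading zeros. Producing data_2021-01-01_1.csv as in the question would need an extra step to strip them.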

How to include blob metadata in copy data mapping

I'm working on an ADF v2 pipeline that copies data from a CSV blob to an Azure SQL database table. For each load, I would like to collect source metadata, such as the source blob name, and save it to a target table as part of a data lineage framework.
My blob source has the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table has the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know how to properly include the name of the source in the mapping section of the Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of the expression @activity('Get Metadata1').output.childitems
inside the [ForEach1] activity, I have placed a [Copy Data1] activity, where I have the source and sink sections defined.
What I'm looking for is a way to add an extra line to the mapping section that will somehow bind @item().name to the destination column [DwhSourceName].
Thanks for any suggestions on how to achieve this.
Actually, based on my test, you can specify dynamic content for the column key, but you can't set blob metadata as the value of a column in the copy data mapping at pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column to your source data before the copy activity. Maybe you could use an Azure Function with a Blob trigger to get the blob file name, so that you could add the FileName column whenever data streams into the blob. (Please refer to this case: How Do I get the Name of The inputBlob That Triggered My Azure Function With Python)
