How to pattern match part of file names using ADF - Azure

I have about 10 files in blob storage where I need to pattern match part of each file name; if it matches, a variable should be set to true. I will be getting the child items and file names from the "Get metadata stage".
How to achieve this using Azure data factory?
Is it possible to match the pattern using Databricks Notebook by getting metadata using "Get metadata stage"?

You can do it using a ForEach activity after Get Metadata activity in ADF.
Please follow the demonstration below:
My files in blob storage have the word "pattern" in their names as the pattern to match.
Use the Get Metadata activity to pass this file list to the ForEach. Create an array variable in the pipeline.
Give the ForEach Items dynamic content as below.
@activity('Get Metadata1').output.childItems
Inside the ForEach, use an Append variable activity to append the file name plus true or false, based on the pattern, to the array we created earlier (newfiles in my case).
childItems gives the file names and file types, so take only the file name from every item in the ForEach and check it against the pattern.
@concat(item().name,'-',if(contains(string(item().name),'pattern'),'true','false'))
Finally, use a Set variable activity for the result (optional, only to show the output).
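For intuition, the logic inside that append expression is equivalent to the following Python snippet (the sample file names are illustrative, not from the original post):

# Rough Python equivalent of concat(item().name,'-',if(contains(...,'pattern'),'true','false'))
child_item_names = ["sales_pattern_01.csv", "report_2022.csv", "pattern_data.json"]

newfiles = [f"{name}-{'true' if 'pattern' in name else 'false'}" for name in child_item_names]
print(newfiles)
# ['sales_pattern_01.csv-true', 'report_2022.csv-false', 'pattern_data.json-true']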
Output:
Is it possible to match the pattern using Databricks Notebook by getting metadata using "Get metadata stage"?
Yes, it is possible. If you want to leave out the file type, use an Append variable activity inside the ForEach to pass just the file name. If you want the file type as well, you can pass the childItems from Get Metadata directly to the notebook.
To pass just the file name:
Pass this newfiles variable to the Databricks notebook and apply the pattern matching condition in the notebook.
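A minimal sketch of that notebook logic, assuming the file name array arrives as a JSON-encoded string parameter named newfiles (the parameter name and the 'pattern' substring are assumptions for illustration):

import json
import re

# Inside Databricks the value would typically be read from a widget,
# e.g. newfiles_raw = dbutils.widgets.get("newfiles"); a sample is hard-coded here.
newfiles_raw = '["sales_pattern_01.csv", "report_2022.csv", "pattern_data.json"]'

file_names = json.loads(newfiles_raw)

# Flag each file name: True if it contains the substring "pattern".
matches = {name: bool(re.search(r"pattern", name)) for name in file_names}
print(matches)
# {'sales_pattern_01.csv': True, 'report_2022.csv': False, 'pattern_data.json': True}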

Related

ADF - Pipeline expression builder to construct a folder path

I'm using ADF to copy files from several folders in a container on a storage account.
My container name is cont01 and the folder structure is as follows:
cont01:
--projA
  --Sub01
    --Sub02
      --2022-10-01
        -file01_A.gz
        -file02_A.gz
        -file03_A.gz
        -file04_A.gz
      --2022-10-02
        -file01_B.gz
        -file02_B.gz
        -file03_B.gz
        -file04_B.gz
The aim is copying all the files starting with file01 into a destination container.
To do so, I created a pipeline with a Get Metadata activity filtered on folders, and then I want to use a ForEach to iterate through the folders. To get the list of files inside each folder I need another Get Metadata activity inside the ForEach, whose dataset needs a File Path that has to be dynamic: something like proj01/Sub01/Sub02/ + the outcome of the ForEach, like item().name.
How can I dynamically point to my ForEach outcomes?
I reproduced the above and got the below result.
As you said, the levels of all files are the same, so you can copy the files that start with file01 with the approach below.
These are my sample files in the source container. For this sample, I have used csv files.
First, use the Get Metadata activity to get the list of all files. Use a dataset parameter as a wildcard placeholder.
This will give you the list of all files inside the source container.
Then use a Filter activity to keep only the files that start with file01.
Items - @activity('Get Metadata1').output.childItems
Condition - @startswith(item().name,'file01')
You will get the required files list.
Give this Value array to the ForEach activity as @activity('Filter1').output.Value.
Inside the ForEach, use a Copy activity and give @item().name in the wildcard path of the source as follows.
In the sink dataset, give the same @item().name by using a dataset parameter.
Execute this pipeline and you will get the files in the target container.
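For intuition, a small Python sketch of what the Filter activity does to the Get Metadata childItems (the sample items are illustrative):

# Sample shape of the Get Metadata childItems output.
child_items = [
    {"name": "file01_A.gz", "type": "File"},
    {"name": "file02_A.gz", "type": "File"},
    {"name": "file01_B.gz", "type": "File"},
    {"name": "file03_B.gz", "type": "File"},
]

# Equivalent of the Filter condition @startswith(item().name,'file01').
filtered = [item for item in child_items if item["name"].startswith("file01")]
print([item["name"] for item in filtered])  # ['file01_A.gz', 'file01_B.gz']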

Azure Data Factory: Cannot save the output of Set Variable into file/Database

I'm trying to store a list of file names within an Azure Blob container into a SQL db. The pipeline runs successfully, but it does not output the values (file names) into the sink database, and the sink table doesn't get updated even after the pipeline has completed. The following are the steps I went through to implement the pipeline; I wonder at which step I made a mistake.
I have followed the solutions given in the following links as well:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
Transfer the output of 'Set Variable' activity into a json file [Azure Data Factory]
Steps:
1- Validate the file exists, get the file metadata and child items, and iterate over the files through a ForEach.
2- Variable defined at the pipeline level to hold the filenames
Variable Name: Files, Type: string
3- parameter defined to dynamically specify the dataset directory name. Parameter name: dimName, parameter type: string
4- Get Metadata configurations
5- Foreach settings
@activity('MetaGetFileNames').output.childItems
6 - Foreach Activity overview. A Set Variable activity to set each file name into the defined variable 'Files'. A Copy activity to store the set value into the db.
7- set variable configuration
8- Copy Activity source configuration. Excel Dataset refers to an empty excel file in azure blob container.
9- Copy Activity sink configuration
10-Copy Activity: mapping configuration
Instead of selecting an empty excel file, refer to a dummy excel file with dummy data.
Source: dummy excel file
You can skip using Set variable activity as you can use the Foreach current item directly in the Additional column dynamic expression.
Add additional columns in the Mapping.
Sink results in SQL database.
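Conceptually, each ForEach iteration then writes one row whose additional column carries the current file name, roughly like this Python sketch (the column name FileName and the sample items are assumptions for illustration):

# Sample shape of the Get Metadata childItems.
child_items = [
    {"name": "sales_2023.xlsx", "type": "File"},
    {"name": "inventory_2023.xlsx", "type": "File"},
]

# Each iteration copies the dummy source plus one additional column whose value is
# @item().name; assuming the dummy source has a single row, one row per file lands in the sink.
rows_for_sink = [{"FileName": item["name"]} for item in child_items]
print(rows_for_sink)
# [{'FileName': 'sales_2023.xlsx'}, {'FileName': 'inventory_2023.xlsx'}]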

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination). Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please note the JSON file name (MyJsonFile.json) is constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy data activity.
Output of lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
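To see what that expression does, here is a rough Python equivalent of the same string manipulation (the sample lookup value mirrors the JSON from the question; ADF's string() serialization may differ slightly in whitespace):

import json

# Illustrative shape of activity('Lookup1').output.value
lookup_value = [{"file_url": "https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}]

# string(...) then strip quotes and braces, like the nested replace() calls.
cleaned = json.dumps(lookup_value).replace('"', '').replace('}', '').replace('{', '')

# substring from the first occurrence of 'http' to the end, then drop the trailing ']'.
start = cleaned.index('http')
url = cleaned[start:].replace(']', '')
print(url)  # https://files.testwebsite.com/Downloads/TimeStampFileName.zip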
Check the output of set variable:
Set variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
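In Python terms, the simpler approach boils down to a property lookup once the file content is parsed as JSON, which is what firstRow.file_url gives you (the sample content is the JSON from the question):

import json

# Content of MyJsonFile.json as described in the question.
raw = '{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}'

# Equivalent of @activity('Lookup1').output.firstRow.file_url
file_url = json.loads(raw)["file_url"]
print(file_url)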

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date indicated in the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
  data-file-001.parquet
  data-file-002.parquet
  data-file-003.parquet
DATE=2021-01-02/
  data-file-001.parquet
  data-file-002.parquet
...my output should look something like this:
output-data/
  data_2021-01-01_1.csv
  data_2021-01-01_2.csv
  data_2021-01-01_3.csv
  data_2021-01-02_1.csv
  data_2021-01-02_2.csv
Reading files from subfolders, filtering them and saving them is easy. Problems start when I try to set the output dataset file name dynamically. I can get the folder names using the Get Metadata activity and then I can use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow's sink dataset.
Update:
In my Get Metadata1 activity, I set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files as an example.
My input looks something like this:
container: input
  2021-01-01/
    data-file-001.csv
    data-file-002.csv
    data-file-003.csv
  2021-01-02/
    data-file-001.csv
    data-file-002.csv
My debug result is as follows:
Use the Get Metadata1 activity to get the folder list and then use the ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we now use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add the dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add the dynamic content @item().name to the parameter FolderName.
Key in File_Name as the column to store the file name (in the source options). It will store the file path in a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the sink Settings, we can select 'Name file as column data' and choose the File_Name column.
That's all.
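As a quick sanity check, this Python snippet reproduces what that DerivedColumn expression yields for one sample path (data flow substring and split indices are 1-based, Python's are 0-based):

file_name = "/2021-01-01/data-file-001.csv"

# substring(File_Name,2,10) -> 10 characters starting at position 2 (1-based) -> the date part.
date_part = file_name[1:11]          # '2021-01-01'

# split(File_Name,'-')[5] -> fifth element (1-based) -> '001.csv'.
suffix = file_name.split("-")[4]     # '001.csv'

new_name = f"data_{date_part}_{suffix}"
print(new_name)  # data_2021-01-01_001.csv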

Get all file names in subfolders - Azure Data Factory

I have the below folder structure in Data Lake. I want to get all .csv file names from all subfolders of my ParentFolder directory. All my files are .csv files; is there a simple approach to do this using the Get Metadata activity?
ParentFolder > Year=2020Folder
  2020-10-20Folder > 2020-10-20.csv
  2020-10-21Folder > 2020-10-21.csv
  2020-10-22Folder > 2020-10-22.csv
I've made a test to get the FileNames successfully.
I created the same file structure as yours.
In ADF, we can define an Array type variable to store the file names later.
This is the summary of the pipeline.
At the GetMetadata1 activity, let's define a dataset of the root folder 2020Folder. Then we use Child Items to get all the subfolders.
At the ForEach1 activity, we can use the expression @activity('Get Metadata1').output.childItems to iterate over the folder list.
In the ForEach1 activity, we can add the dynamic content @item().name to pass the subfolder name to the GetMetadata2 activity. Then we can use the GetMetadata2 activity to get the Child Items from the subfolder.
At the Append variable activity, we can use the array variable FileNames we defined previously to store all the file names. Here we use the expression @activity('Get Metadata2').output.childItems[0] to get the file name.
In the end, we can define another Array type variable to store and review the result.
In the output, we can see the array.
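Conceptually, the pipeline collects the same list as this Python sketch (the folder layout is the sample from the question; in ADF the nesting is handled by the two Get Metadata activities plus the Append variable):

# Sample layout: subfolder name -> child items returned by the inner Get Metadata.
parent_folder = {
    "2020-10-20Folder": [{"name": "2020-10-20.csv", "type": "File"}],
    "2020-10-21Folder": [{"name": "2020-10-21.csv", "type": "File"}],
    "2020-10-22Folder": [{"name": "2020-10-22.csv", "type": "File"}],
}

file_names = []  # the Array variable FileNames
for subfolder, child_items in parent_folder.items():
    # @activity('Get Metadata2').output.childItems[0] -> first (only) file in each subfolder
    file_names.append(child_items[0])

print([item["name"] for item in file_names])
# ['2020-10-20.csv', '2020-10-21.csv', '2020-10-22.csv']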
