Change itemName "*" to the parent folder of the file [Azure Data Factory]

How can I get the folder name of a file in the Get Metadata activity?
As itemName I get "*" as the result, since I have a wildcard in the connection settings of the source dataset. The folder structure is project/location/file, and I would like to have the location passed as well.
Is it possible to pass something like '/location/file.tdms', or to pass the location to the next step, which is an iteration (ForEach)?

You can provide the project folder name under the container as shown below and use the Get Metadata activity to get the list of folders under the Project folder.
Select Child items under Field list in the Get Metadata activity to get the folders under the project.
The output of Get Metadata:
Connect the Get Metadata activity output to the ForEach activity to loop over each folder item and get the files under the location folder.
Add an activity in ForEach to process the files of the current ForEach item. Here as an example, I am using another Get Metadata activity to show the list of files under location (output folder of Get Metadata1).
Create another source dataset pointing to the same container/project, and parameterize the directory & filename.
In Get Metadata2, pass the current item name (the folder name from the Get Metadata1 output) in the directory parameter and specify (*) in the file_name parameter, with the Field list set to Child items, to get the list of files.
The output of Get Metadata2:
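As a rough sketch of the parameterized dataset described above (a sketch only: the dataset name, linked service name, and container are placeholders, and the directory/file_name parameter names follow this answer), it could look like this:

{
    "name": "DS_Project_Files",
    "properties": {
        "linkedServiceName": { "referenceName": "LS_AzureBlobStorage", "type": "LinkedServiceReference" },
        "type": "Binary",
        "parameters": {
            "directory": { "type": "string" },
            "file_name": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",
                "folderPath": { "value": "@concat('project/', dataset().directory)", "type": "Expression" },
                "fileName": { "value": "@dataset().file_name", "type": "Expression" }
            }
        }
    }
}

Inside the ForEach, the Get Metadata2 dataset parameters would then be directory: @item().name and file_name: *, with the Field list set to Child items.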

Related

ADF - Pipeline expression builder to construct a folder path

I'm using ADF to copy files from several folders in a container on a storage account.
My container name is cont01 and the folder structure is as follows:
cont01:
--projA
  --Sub01
    --Sub02
      --2022-10-01
        -file01_A.gz
        -file02_A.gz
        -file03_A.gz
        -file04_A.gz
      --2022-10-02
        -file01_B.gz
        -file02_B.gz
        -file03_B.gz
        -file04_B.gz
The aim is copying all the files starting with file01 into a destination container.
To do so, I created a pipeline with a Get Metadata activity filtered on folders, and then I want to use a ForEach to iterate through the folders. To get the list of files inside each folder I need another Get Metadata activity inside the ForEach, whose dataset needs a dynamic file path: something like proj01/Sub01/Sub02/ plus the outcome of the ForEach, like item().name.
How can I dynamically point to my ForEach outcomes?
I reproduced the above and got the below result.
Since the levels of all the files are the same, you can copy the files that start with file01 with the approach below.
These are my sample files in the source container. For this sample, I have used csv files.
First use a Get Metadata activity to get the full list of files. Use a dataset parameter as a wildcard placeholder.
This will give you the list of all files inside the source container.
Then use a Filter activity to filter the files that start with file01.
Items - @activity('Get Metadata1').output.childItems
Condition - @startswith(item().name,'file01')
You will get the required list of files.
Give this Value array to the ForEach activity as @activity('Filter1').output.Value.
Inside the ForEach, use a copy activity and give @item().name in the wildcard path of the source as follows.
In the sink dataset, give the same @item().name by using a dataset parameter.
Execute this pipeline and you will get the files in the target container.
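A minimal sketch of how the pieces could be wired up (the wildcard folder path below is an assumption based on the structure in the question, and the sink fileName parameter name is illustrative):

Filter1
  Items:     @activity('Get Metadata1').output.childItems
  Condition: @startswith(item().name,'file01')

ForEach1
  Items: @activity('Filter1').output.Value

Copy activity source (inside ForEach1)
  Wildcard folder path: projA/Sub01/Sub02/*
  Wildcard file name:   @item().name

Sink dataset (with a hypothetical fileName parameter)
  fileName: @item().name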

I would like to add the date and time string in addition to the source file name (Azure Synapse Analytics)

When copying a file from S3 to Azure Blob Storage, I would like to add the date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this:
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.
Using the Get Metadata activity, ForEach activity, and Execute Pipeline activity, get the nested folder structure from the source dataset. Pass the extracted folder structure to the sink dataset dynamically by adding the required string value to the folder structure.
Create a source dataset with the dataset parameter for the directory.
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop over each folder.
@activity('get sub folder list_yyyy').output.childItems
Inside the ForEach activity, add the Execute Pipeline activity. Create a new pipeline (pipeline2) with 2 parameters in it to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: @item().name
Sink_dir1: @concat('year=',item().name)
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: @pipeline().parameters.SubFolder1
Inside ForEach activity, add execute pipeline to pass the current item to nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold source and sink structures.
SubFolder2: @concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2: @concat(pipeline().parameters.sink_dir1,'/month=',item().name)
Pipeline3:
Using the Get Metadata activity get the child items under the source structure.
Dataset property - dir: @pipeline().parameters.SubFolder2
Pass the child items to ForEach activity. Inside ForEach activity add copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and current child item.
dir: @concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: @concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: It creates the folder structure automatically if not available in the sink.
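To make the parameter hand-off easier to follow, here is a worked trace of the values each pipeline level ends up with, using the expressions above (the folder names 2021, 05, and 14 are just examples):

pipeline1  item: 2021  ->  SubFolder1 = 2021        sink_dir1 = year=2021
pipeline2  item: 05    ->  SubFolder2 = 2021/05     sink_dir2 = year=2021/month=05
pipeline3  item: 14    ->  source dir = 2021/05/14/ sink dir  = year=2021/month=05/day=14/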
You will first need the file name (use Get Metadata). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable 'NewFileName' and pass it as a parameter to the copy activity. The copy source will then have the original file name and the sink will have the new file name. The copy activity will be parameterized, since you will be passing the file name dynamically.
Hope this helps.
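As an illustration of the concat() idea (a sketch only: it assumes a ForEach over the Get Metadata child items, so item().name is the original file name, and the .csv extension and timestamp format are assumptions), the NewFileName value could be set with dynamic content like:

@concat(replace(item().name,'.csv',''),'_',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.csv')

which would turn data-file-001.csv into something like data-file-001_20221001093000.csv.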

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve the following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date indicated in the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
  data-file-001.parquet
  data-file-002.parquet
  data-file-003.parquet
DATE=2021-01-02/
  data-file-001.parquet
  data-file-002.parquet
...my output should look something like this:
output-data/
  data_2021-01-01_1.csv
  data_2021-01-01_2.csv
  data_2021-01-01_3.csv
  data_2021-01-02_1.csv
  data_2021-01-02_2.csv
Reading files from subfolders, filtering them, and saving them is easy. Problems start when I'm trying to set the output dataset file name dynamically. I can get the folder names using a Get Metadata activity and then use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow's sink dataset.
Update:
In my Get Metadata1 activity, I set the container as the input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files as an example.
My input looks something like this:
container: input
2021-01-01/
  data-file-001.csv
  data-file-002.csv
  data-file-003.csv
2021-01-02/
  data-file-001.csv
  data-file-002.csv
My debug result is as follows:
Use the Get Metadata1 activity to get the folder list and then use the ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add dynamic content @item().name to the parameter FolderName.
Key in File_Name as the 'Column to store file name' in the source settings. It will store the file path as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the Settings of the sink, we can select 'Name file as column data' and choose the File_Name column.
That's all.
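To see what the derived-column expression does, here is a rough walk-through with one of the example paths (mapping data flow string functions and array indexes are 1-based; the exact output depends on your real file names):

File_Name                   = '/2021-01-01/data-file-001.csv'
substring(File_Name, 2, 10) = '2021-01-01'
split(File_Name, '-')[5]    = '001.csv'
concat('data_', substring(File_Name,2,10), '_', split(File_Name,'-')[5])
                            = 'data_2021-01-01_001.csv'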

Get all files names in subfolders Azure Data factory

I have the below folder structure in Data Lake. I want to get all .csv file names from all subfolders of my ParentFolder directory. All my files are .csv files; is there a simple approach to do this using the Get Metadata activity?
ParentFolder > Year=2020Folder
  2020-10-20Folder > 2020-10-20.csv
  2020-10-21Folder > 2020-10-21.csv
  2020-10-22Folder > 2020-10-22.csv
I've made a test to get the FileNames successfully.
I created the same file structure as yours.
In ADF, we can define an Array type variable to store the file names later.
This is the summary of the pipeline.
In the Get Metadata1 activity, define a dataset of the root folder 2020Folder. Then use Child Items to get all the subfolders.
In the ForEach1 activity, we can use the expression @activity('Get Metadata1').output.childItems to iterate over the folder list.
Inside the ForEach1 activity, we can add dynamic content @item().name to pass the subfolder name to the Get Metadata2 activity. Then we can use the Get Metadata2 activity to get the Child Items of the subfolder.
In the Append Variable activity, we can use the array variable FileNames we defined previously to store all the file names. Here we use the expression @activity('Get Metadata2').output.childItems[0] to get the file name.
In the end, we can define another Array type variable to store and review the result.
In the output we can see the array.
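A minimal sketch of the Append Variable configuration (assuming exactly one file per subfolder, as in the structure above; the trailing .name is optional if you want only the file name string rather than the whole child-item object):

Variable name: FileNames   (Array type)
Value:         @activity('Get Metadata2').output.childItems[0].name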

Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy a few files from an ADLS Gen1 location to another ADLS Gen1 location, but I have to create the folder based on the file name.
I have a few files as below in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy these files into the destination ADLS as below, keeping only the csv files, and create the folder from the file name (if the folder exists, copy to that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy these files into the destination ADLS as below, keeping only the csv and json files, and create the folder from the file name (if the folder exists, copy to that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!
I am not sure if this will entirely help, but I had a similar situation where we had 1 zip file and I had to copy those files out into their own folders.
What you can do is use parameters in the sink dataset you are using, plus a Set Variable activity where you do a substring.
The job below is more for the delta job, but I think has enough stuff in it to hopefully help. My job can be divided into 3 sections.
The first orange section gets the latest file name date from the ADLS Gen1 folder that you want to copy.
At the bottom I get the latest file name based on the ADLS Gen1 date, and then I do a substring where I take out the date portion of the file name. In your case you might be able to use an array and capture all of the folder names that you need.
Getting file name
Getting Substring
In the top section I first extract and unzip that file into a test landing zone.
Source
Sink
I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will then become folders for the copy activity.
Get File names from initial landing zone:
I then pass on those childitems from "Get list of staged files" into ForEach:
In that ForEach activity I have one copy activity. For that I made two datasets. One grabs the files from the initial landing zone that we created; for this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into there. From that 1 zip file we expect 5 files.
In the Sink section, I created a new dataset with a parameter for folder and file name. In that dataset I am putting the data into the same container, but I created a new folder called "Stage" and concatenated it with the item name. I also added a "replace" command to remove the ".txt" from the file name.
What this will do is that whatever file name comes from that dummy staging folder will then get a folder name specific to each file. Based on your requirements I am not sure if that is exactly what you want to do, but you can always rework it to be more specific.
For the item name I basically take the same file name, replace the ".txt", concat the date value, and only after that add the ".txt" extension back. Otherwise the ".txt" would have ended up in the middle of the file name.
In the end I have created a Delete activity that will then be used to delete all the files (I am not sure if I have set that up properly, so feel free to adjust it).
Hopefully the description above gave you an idea on how to use parameters for your files. Let me know if this helps you in your situation.
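As a rough sketch of the folder-from-file-name idea applied to the scenario in this question (the parameter names FolderName and FileName and the underscore-split logic are assumptions, not the exact setup described above), the sink dataset parameters inside the ForEach could be set like this:

FolderName: @split(item().name,'_')[2]     (e.g. AB03 for ABCD_20200914_AB03_Part01.csv.gz)
FileName:   @item().name

The sink dataset would then use @dataset().FolderName as its folder path and @dataset().FileName as its file name, so each file lands in a folder named after the AB## part of its name.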
