Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy a few files from one ADLS Gen1 location to another ADLS Gen1 location, but I have to create folders based on the file names.
I have the following files in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy only the csv files to the destination ADLS as below, creating a folder from the file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy both the csv and json files to the destination ADLS as below, creating a folder from the file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!

I am not sure if this will entirely help, but I had a similar situation where we had one zip file and I had to copy its contents out into their own folders.
What you can do is use parameters on the sink dataset, plus a Set Variable activity where you do a substring.
The pipeline below is really for a delta job, but I think it has enough in it to hopefully help. It can be divided into three sections.
The first, orange section gets the latest file date from the ADLS Gen1 folder that you want to copy.
At the bottom of that section I get the latest file name based on that date and then do a substring to take out the date portion of the file name. In your case you might be able to build an array and capture all of the folder names that you need; a sketch follows the screenshots below.
Getting file name
Getting Substring
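For your file names specifically, a hedged sketch: if they always follow the ABCD_yyyymmdd_ABxx_Partxx pattern shown in the question, and you loop over the Get Metadata child items in a ForEach, the ABxx folder key can be pulled from each file name with a split rather than a fixed substring (the [2] index assumes that exact pattern) and used directly as the sink folder and file name parameters:
Folder: @{split(item().name, '_')[2]}
File name: @{item().name}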
In the top section I first extract and unzip that file into a test landing zone.
Source
Sink
I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will then become folders for the copy activity.
Get File names from initial landing zone:
I then pass those childItems from "Get list of staged files" into the ForEach:
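The Items field of that ForEach is the usual child items expression (the activity name is taken from the screenshot label above):
@activity('Get list of staged files').output.childItems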
In that ForEach activity I have one copy activity. For that I made two datasets, one of which grabs the files from the initial landing zone that we have created. For this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into there. From that 1 zip file we expect 5 files.
In the Sink section, what I did is create a new dataset with parameters for the folder and file name. In that dataset I am putting the data into the same container, but into a new folder called "Stage" concatenated with the item name. I also added a replace to remove the ".txt" from the file name.
What this does is give every file name coming from that dummy staging area a folder of its own. Based on your requirements I am not sure if that is exactly what you want to do, but you can always rework it to be more specific.
For the file name I take the same file name, replace the ".txt", concatenate the date value, and only after that add the ".txt" extension back; otherwise I would have ended up with ".txt" in the middle of the file name. A rough sketch of both parameters follows.
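As a rough sketch of those two sink dataset parameters (the parameter names Folder and FileName are my own, and dateValue stands in for however the date was captured earlier in the pipeline):
Folder: @{concat('Stage/', replace(item().name, '.txt', ''))}
FileName: @{concat(replace(item().name, '.txt', ''), '_', variables('dateValue'), '.txt')}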
At the end I created a Delete activity that is then used to delete all the staged files (I am not sure if I have set that up properly, so feel free to adjust).
Hopefully the description above gave you an idea on how to use parameters for your files. Let me know if this helps you in your situation.

Related

Azure data factory with a copy activity using a binary dataset fails to copy folder contents if parameterized

In my Azure Data Factory I need to copy data from an SFTP source that has structured the data into date-based directories with the following hierarchy:
year -> month -> date -> file
I have created a linked service and a binary dataset where the dataset "filesystem" points to the host and "Directory" points to the folder that contains the year directories, e.g. host/exampledir/yeardir/, with yeardir containing the year directories.
When I manually write into the dataset that I want the folder "2015", it copies the entirety of the 2015 folder; however, if I put a parameter on the directory and then pass the same folder path from a copy activity, it creates a file called "2015" inside my blob storage that contains no data.
My current workaround is a nested sequence of Get Metadata and ForEach loops that drill into each folder and subfolder and copy the individual files at the end. The desired result, however, is to have the single binary dataset copy each folder without the need for Get Metadata.
Is this possible within the scope of the data factory?
edit:
manual filepath that works
parameterized filepath
properties used in copy activity
To add further context: I have tried manually writing the filepath into the copy activity as shown in the photo. I have also attempted to use variables, dynamic content for the parameter (using the base filepath and concat), and putting the base filepath into the dataset alongside @dataset().filePath. None of these solutions have worked for me so far; they either copy nothing or create the empty file I mentioned earlier.
The sink is a binary dataset linked to Azure Data Lake Storage Gen2.
sink filepath
Update:
The accepted answer is the solution. My problem was that the source dataset value, when retrieved and passed as a parameter, had a newline at the end. I used concat to clean this up and it has worked since then.
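For reference, a stray trailing newline like that can also be stripped in the dynamic content itself; a hedged sketch, assuming the value arrives in a pipeline parameter named folderPath (uriComponentToString('%0A') decodes to a newline character):
@{replace(pipeline().parameters.folderPath, uriComponentToString('%0A'), '')}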
Since giving exampledir/yeardir/2015 worked perfectly for you and you want to copy all the folders present in exampledir/yeardir, you can follow the below procedure:
I have taken a get metadata activity to get the child items of the folder exampledir/yeardir/ (In my demonstration, I have taken path as 'maindir/yeardir'.).
This will give you all the year folders present. I have taken only 2020 and 2021 as an example.
Now, with a single ForEach activity whose Items value is the child items output of the Get Metadata activity, I have directly used a copy activity.
@activity('Get Metadata1').output.childItems
Now, inside the ForEach I have my copy data activity. For both source and sink, I have created a dataset parameter for the path. I have given the following dynamic content for the source path.
maindir/yeardir/@{item().name}
For sink, I have given the output directory as follows:
outputDir/@{item().name}
Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders using the Get Metadata activity, loop through each of them, and copy each folder with the source path exampledir/yeardir/<current_iteration_year_folder>.
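To spell out the dataset wiring (the parameter name DirPath is my own assumption), the source dataset's directory field holds only the dataset parameter and the copy activity fills it in on each iteration:
Source dataset directory: @dataset().DirPath
Copy source value for DirPath: maindir/yeardir/@{item().name}
Copy sink value for its path parameter: outputDir/@{item().name}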
Based on how I have given my sink path, the data will be copied with contents. The following is a reference image.

Unzip a file contains multiple text files using copy activity in azure data factory

I have an issue while unzipping a file that contains multiple text files. I used a copy activity to unzip the file, but it creates a folder named after the source zip file and I can see my text files inside that. My requirement is that the text files should be placed in the folder I want.
I tried the below copy sink properties, but nothing worked:
Flatten hierarchy + @{item().name}
None + @{item().name}
Preserve hierarchy + @{item().name}
Please unselect Preserve zip file name as folder on the source tab. ADF will then not create the xxx.zip folder.
At the source side dataset, we can select ZipDeflate as the Compression type.
At the sink side dataset, select None as the Compression type.

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date indicated in the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders, filtering them and saving them is easy. The problems start when I try to set the output dataset file name dynamically. I can get the folder names using a Get Metadata activity and then use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow's sink dataset.
Update:
In my Get Metadata1 activity, I set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files for example.
My input looks something like this
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
I use a Get Metadata1 activity to get the folder list and then a ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add dynamic content @item().name to the parameter FolderName.
Key in File_Name as the column to store the file name in the source options. It will store the file name as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
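For example, with File_Name = /2021-01-01/data-file-001.csv (Data Flow expressions index from 1):
substring(File_Name,2,10) returns 2021-01-01
split(File_Name,'-')[5] returns 001.csv
so the derived column becomes data_2021-01-01_001.csv.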
In the Settings of the sink, we can select Name file as column data and pick the File_Name column.
That's all.

Unzip files from a ZIP file and use them as source in an ADF Copy activity

I see there is a way to deflate a ZIP file, but when there are multiple .csv files within a ZIP, how do I specify which one to use as my source for the copy activity? It is currently parsing both csv files and producing a single file, and I'm not able to select the file I want as the source.
According to my test, we can't unzip a .zip file in ADF to get the file name list in the ADF dataset, so I provide the below workaround for your reference.
Firstly, you could use an Azure Function activity to trigger a function that decompresses your zip file. You only need to get the file name list, then return it as an array.
Secondly, use a ForEach activity to loop over the result to get your desired file name.
Finally, inside the ForEach activity, use @item() in the dataset to configure the specific file path so that you can refer to it in the copy activity.
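As a minimal sketch of such a function (Python, HTTP-triggered), assuming the zip sits in Blob Storage and using the azure-storage-blob SDK; the container name, blob paths and connection-string setting are placeholders:
import io
import json
import os
import zipfile

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Placeholder names: adjust the connection-string setting, container and blob path.
    svc = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION"])
    container = svc.get_container_client("landing")

    # Download the zip and open it in memory.
    data = container.download_blob("input/archive.zip").readall()
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            # Write each member next to the zip so ADF can copy it individually.
            container.upload_blob(f"unzipped/{name}", zf.read(name), overwrite=True)
            names.append(name)

    # ADF's Azure Function activity expects a JSON object, so wrap the list.
    return func.HttpResponse(json.dumps({"fileNames": names}), mimetype="application/json")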

How to Copy Files with Databricks dbutils in a particular order

A member of this group assisted me in copying files to a folder based on date:
copy based on date
I would like to tweak the code to copy files based on certain characters in a filename; in the example that follows, the characters are 1111, 1112, 1113 and 1114.
So, if we have four files as follows:
File_Account_1111_exam1.csv
File_Account_1112_testxx.csv
File_Account_1113_pringle.csv
File_Account_1114_sam34.csv
I would like File_Account_1114_sam34.csv copied to the folder only if File_Account_1113_pringle.csv has already been copied to the folder.
Likewise, I would only want File_Account_1113_pringle.csv copied if File_Account_1112_testxx.csv has already been copied to the folder, and so on.
Therefore, if all files have been copied to a folder it would look something like the following:
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1111_exam1.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1112_testxx.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1113_pringle.csv", data, True)
dbutils.fs.put("/mnt/adls2/demo/files/file_Account_1114_sam34.csv", data, True)
Instead of applying any business logic when uploading files to DBFS, I would recommend uploading all available files, then reading them using test = sc.wholeTextFiles("pathtofile"), which returns a key/value RDD of file name and file content (here is a corresponding thread). Once that is done, any sorting or filtering business logic based on the file name can be implemented and tested in the Spark job, for example as in the sketch below.
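A minimal sketch of that approach (the mount path and the position of the 4-digit token are assumptions based on the file names above):
import re

# Read every uploaded file as (path, content) pairs.
files = sc.wholeTextFiles("/mnt/adls2/demo/files/*")

def seq_token(path):
    # Pull the 4-digit sequence (1111, 1112, ...) out of names like File_Account_1111_exam1.csv.
    m = re.search(r"_(\d{4})_", path)
    return int(m.group(1)) if m else -1

# Sort by that token so the files are handled in the order 1111, 1112, 1113, 1114.
for path, content in files.sortBy(lambda kv: seq_token(kv[0])).collect():
    print(path)  # replace with the actual copy / processing logic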
I hope it is helpful.
