Azure data factory with a copy activity using a binary dataset fails to copy folder contents if parameterized


In my Azure data factory I need to copy data from an SFTP source that has structured the data into date based directories with the following hierarchy
year -> month -> date -> file
I have created a linked service and a binary dataset where the dataset "filesystem" points to the host and "Directory" points to the folder that contains the year directories. Ex: host/exampledir/yeardir/
with yeardir containing the year directories.
When I manually enter the folder "2015" in the dataset, the copy activity copies the entire 2015 folder. However, if I add a parameter for the directory and pass the same folder path from the copy activity, it creates a file called "2015" in my blob storage that contains no data.
My current workaround is a nested sequence of Get Metadata activities and ForEach loops that drill into each folder and subfolder and copy the individual files. However, the desired result is to have the single binary dataset copy each folder without the need for Get Metadata.
Is this possible within the scope of the data factory?
edit:
manual filepath that works
parameterized filepath
properties used in copy activity
To add further context: I have tried manually writing the filepath into the copy activity as shown in the photo. I have also tried variables, dynamic content for the parameter (using the base filepath and concat), and putting the base filepath into the dataset alongside @dataset().filePath. None of these solutions have worked for me so far; they either copy nothing or create the empty file I mentioned earlier.
The sink is a binary dataset linked to Azure Data Lake Storage Gen2.
sink filepath
Update:
The accepted answer is the solution. My problem was that the source dataset when retrieved would have a newline at the end when passed as a parameter. I used concat to clean this up and this has worked since then.
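The failure mode described in the update can be illustrated outside Data Factory. The following is a minimal Python sketch, not ADF code; the path value and the helper name `clean_path` are hypothetical. It only shows why a path with a stray trailing newline does not match the intended folder until it is cleaned.

```python
def clean_path(raw: str) -> str:
    """Remove surrounding whitespace, including a trailing newline."""
    return raw.strip()


# Hypothetical parameter value as it might arrive with a stray newline.
raw_param = "exampledir/yeardir/2015\n"
expected = "exampledir/yeardir/2015"

# The raw value is a different string, so it does not name the 2015 folder.
print(raw_param == expected)

# After cleaning, the two values are identical.
print(clean_path(raw_param) == expected)
```

Inspecting the exact parameter value in the activity's debug input is one way to spot this kind of invisible character.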

Since giving exampledir/yeardir/2015 worked perfectly for you and you want to copy all the folders present in exampledir/yeardir, you can follow the below procedure:
I have taken a get metadata activity to get the child items of the folder exampledir/yeardir/ (In my demonstration, I have taken path as 'maindir/yeardir'.).
This will give you all the year folders present. I have taken only 2020 and 2021 as an example.
Now, I have used a single ForEach activity whose items value is the child items output of the Get Metadata activity, with the copy activity placed directly inside it.
@activity('Get Metadata1').output.childItems
Now, inside the ForEach I have my copy data activity. For both source and sink, I have created a dataset parameter for paths. I have given the following dynamic content for the source path.
maindir/yeardir/@{item().name}
For sink, I have given the output directory as follows:
outputDir/@{item().name}
Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders using the Get Metadata activity, loop through each of them, and copy each folder with the source path exampledir/yeardir/<current_iteration_year_folder>.
Based on how I have given my sink path, the data will be copied with contents. The following is a reference image.
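The loop above can be sketched in plain Python to show which source and sink paths each iteration produces. This is an illustration only, not ADF code: the function name `build_copy_paths` is hypothetical, and the sample list assumes the usual shape of the Get Metadata childItems output (objects with `name` and `type`).

```python
def build_copy_paths(child_items, source_root, sink_root):
    """Return one (source_path, sink_path) pair per child item."""
    return [
        (f"{source_root}/{item['name']}", f"{sink_root}/{item['name']}")
        for item in child_items
    ]


# Assumed sample of the Get Metadata output for maindir/yeardir.
child_items = [
    {"name": "2020", "type": "Folder"},
    {"name": "2021", "type": "Folder"},
]

for source_path, sink_path in build_copy_paths(child_items, "maindir/yeardir", "outputDir"):
    print(source_path, "->", sink_path)
```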

Related

ADF - Pipeline expression builder to construct a folderpath

I'm using ADF to copy files from several folders in a container on a storage account.
My container name is cont01 and the folder structure is as follow :
cont01:
  --projA
    --Sub01
      --Sub02
        --2022-10-01
          -file01_A.gz
          -file02_A.gz
          -file03_A.gz
          -file04_A.gz
        --2022-10-02
          -file01_B.gz
          -file02_B.gz
          -file03_B.gz
          -file04_B.gz
The aim is copying all the files starting with file01 into a destination container.
To do so, I create a pipeline with a GetMetadata activity, filter on folders, and then use ForEach to iterate through the folders. To get the list of files inside each folder I need another GetMetadata activity inside the ForEach, and its dataset needs a file path that has to be dynamic, something like: proj01/Sub01/Sub02/ + the outcome of ForEach, like item().name
How can I dynamically point to my ForEach outcomes?

I reproduced the above and got the below result.
Since, as you said, all the files are at the same level, you can copy the files that start with file01 with the approach below.
These are my sample files in the source container. For this sample, I have used csv files.
First, use a Get Metadata activity to get the list of all files. Use a dataset parameter as a wildcard placeholder.
This will give you the list of all files inside the source container.
Then use a Filter activity to keep the files that start with file01.
Items - @activity('Get Metadata1').output.childItems
Condition - @startswith(item().name,'file01')
You will get the required files list.
Give this Value array to the ForEach activity as @activity('Filter1').output.Value.
Inside the ForEach, use a copy activity and give @item().name in the wildcard path of the source as follows.
In the sink dataset, give the same @item().name by using a dataset parameter.
Execute this pipeline and you will get the files in the target container.
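The Filter step can be pictured with a short Python sketch. This is an illustration rather than ADF code: `filter_by_prefix` is a hypothetical helper, the sample list assumes the usual `name`/`type` shape of childItems, and Python's `str.startswith` is case-sensitive, which may differ from the ADF `startswith` function.

```python
def filter_by_prefix(child_items, prefix):
    """Keep only the items whose name begins with the given prefix."""
    return [item for item in child_items if item["name"].startswith(prefix)]


# Assumed sample of the Get Metadata output (file names from the question).
child_items = [
    {"name": "file01_A.gz", "type": "File"},
    {"name": "file02_A.gz", "type": "File"},
    {"name": "file01_B.gz", "type": "File"},
    {"name": "file04_B.gz", "type": "File"},
]

# Each remaining item would drive one copy inside the ForEach.
for item in filter_by_prefix(child_items, "file01"):
    print(item["name"])
```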

I want to use part of the source file name as the file name in the sink folder. "Azure Synapse Analytics"

I am trying to unzip a Zip folder with 21 files in the Synapse pipeline and create a folder with each file name and put each file in there.
Example
source: zipfolder/zipfiles(SRI W 01 maker メーカー.zip and SRI W 02 product_name 製品名.zip)/files(SRI W 01 maker メーカー.csv and SRI W 02 product_name 製品名.csv)
I tried the answer below, but an error occurred.
I attached a screenshot of the error message.
*"メーカー","製品名" are Japanese.
sink: SRI W 01 maker(folder)/SRI W 01 maker.csv
SRI W 02 product_name(folder)/SRI W 02 product_name.csv
*I want to remove the Japanese part.
This is an example of file structure, and the actual file names contain Japanese characters.
The actual file name is
"SRI W 01 maker メーカー", which has a space and Japanese characters at the end, resulting in garbled characters.
I would like to remove this Japanese part and create the folder and file names dynamically.
Sorry for the poor explanation.
If necessary, I can give you screenshots and other information.
Source dataset is here:
1. Zipped folder. (It has 21 zip files.)
2. 21 zip files under the Zip folder. (Each of them has one file containing Japanese.)
3. Example of a file containing Japanese.

I have broken the requirements into 2 parts. The reason is that my local machine does not allow me to zip a file whose name contains non-English characters (for reproducing the requirement).
1. Creating a pipeline to unzip the main zip file.
Upload your zip file to storage account and use the following procedure. In this, I have considered the same storage account as source and sink as well.
Using copy data activity, create a binary dataset for source. Set compression type as ZipDeflate for this dataset. In the source tab, check the recursively checkbox.
UPDATE:
Please make sure that the copy data activity settings are identical to the one given in the following image.
Source settings:
Source dataset:
Sink settings:
Sink dataset:
For destination, create a binary dataset as well, and specify a name in folder field. It creates a folder, where your sub zip folders will be decompressed (I have given its value as repro1407). Run the pipeline and the following is the output it creates.
The following folders are created inside repro1407, each with a file containing Japanese characters (I had to manually upload a file containing Japanese characters for the reason given above).
2. Creating a pipeline to remove the Japanese characters.
Now this part will help you remove unwanted characters from the file names inside the folders (in above image).
After the above process the folders are structured as:
`repro1407 -> {country_data.zip -> {SRI W 01 maker メーカー.csv}, person.zip -> {SRI W 02 product_name 製品名.csv}}`
Use a Get Metadata activity to get the names of the folders in repro1407: create a dataset pointing to it and select Child items as the Field list. Pass the output of Get Metadata to a ForEach activity and give the items value as @activity('Get Metadata1').output.childItems
Inside ForEach, use a Set Variable activity to create a variable folder_name (String type) holding the folder name, with the value @item().name. Connect it to another Get Metadata activity whose dataset uses the dynamic content @concat('repro1407','/',item().name) for the folder (Field list is again Child items). Then create an Execute Pipeline activity as shown below:
In the image below, the Execute Pipeline activity invokes a second pipeline, named "for pipeline". The highlighted parameters are the parameters defined inside for pipeline, and we pass their values from this pipeline. (Execute Pipeline is needed because the nested folders require a nested ForEach, and nesting ForEach activities directly is not supported.)
Inside for pipeline, there is a single ForEach activity whose items value is @pipeline().parameters.file_names (the list of file names inside the folder_name folder), which we passed from the previous pipeline. Inside the ForEach, there are 3 activities.
The first Set Variable activity, split_file (Array), splits the current file name (there is only one). Since the file name contains spaces, I split on ' '. The value to be set is @split(item().name,' ')
The second Set Variable activity, japanese, captures the Japanese characters. Since they occur at the end of the file name, use @last(variables('split_file')) to get the last element of the split_file array.
The last is a Copy Data activity. I am saving the renamed file in the same folder it was taken from. Create source and sink datasets, each with 2 parameters (open dataset -> Parameters and create): folder_structure and ip_name for the source dataset, and folder_struct and output_name for the sink dataset (to use them as the path).
The value to be passed for folder_structure is @concat('repro1407','/',pipeline().parameters.folder_name) and for ip_name it is @item().name
Similarly, the value for the sink dataset's output_name is @concat(replace(item().name,variables('japanese'),''),'.csv') and for folder_struct it is @concat('repro1407','/',pipeline().parameters.folder_name)
The following is how you use these values in datasets:
Source dataset:
Sink dataset:
Publish and run the pipeline. It runs successfully and generates the required output. The following is an image of the inside of one folder (every folder looks the same).
NOTE:
If you don't need the original file (with Japanese characters), you can simply add delete activity and use dynamic content to delete the original file.
If the Japanese characters are not always at the end of the file name, then you have to find a pattern that can be applied to all the 21 sub-folders. This logic needs to be applied in the Set Variable activities inside the ForEach activity of for pipeline.
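The split / last / replace / concat chain used for the renaming can be traced in Python. This is a sketch for checking the string logic, not ADF code, and both function names are hypothetical. Tracing it suggests that the space preceding the removed token stays in the result, so a second variant that strips trailing whitespace is shown as one possible adjustment.

```python
def rename_like_expression(file_name: str, extension: str = ".csv") -> str:
    """Mirror split(name, ' '), last(...), replace(...), concat(..., '.csv')."""
    split_file = file_name.split(" ")   # split the name on spaces
    japanese = split_file[-1]           # last token, e.g. 'メーカー.csv'
    return file_name.replace(japanese, "") + extension


def rename_trimmed(file_name: str, extension: str = ".csv") -> str:
    """Same logic, but drop the space left before the removed token."""
    last_token = file_name.split(" ")[-1]
    return file_name.replace(last_token, "").rstrip() + extension


name = "SRI W 01 maker メーカー.csv"
print(repr(rename_like_expression(name)))  # a space remains before '.csv'
print(repr(rename_trimmed(name)))
```

Both variants assume, as the answer does, that the unwanted part is the final space-separated token of the file name.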

I would like to add the date and time string in addition to the source file name.(Azure Synapse Analytics)

When copying a file from S3 to AzureBlobStorage, I would like to add the date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.

Using the Get Metadata activity, ForEach activity, and Execute pipeline activity get the nested folder structure from the source dataset. Pass the extracted folder structure to the sink dataset dynamically by adding the required string value to the folder structure.
Create a source dataset with the dataset parameter for the directory.
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop each folder.
@activity('get sub folder list_yyyy').output.childItems
Inside the ForEach activity, add the Execute Pipeline activity. Create a new pipeline (pipeline2) with 2 parameters to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: @item().name
sink_dir1: @concat('year=',item().name)
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: @pipeline().parameters.SubFolder1
Inside the ForEach activity, add an Execute Pipeline activity to pass the current item to the nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold the source and sink structures.
SubFolder2: @concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2: @concat(pipeline().parameters.sink_dir1,'/month=',item().name)
Pipeline3:
Using the Get Metadata activity, get the child items under the source structure.
Dataset property - dir: @pipeline().parameters.SubFolder2
Pass the child items to the ForEach activity. Inside the ForEach activity, add a copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and the current child item.
dir: @concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: @concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: It creates the folder structure automatically if not available in the sink.
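The three nested pipelines amount to three nested loops that build a source path and a matching sink path per day folder. The Python sketch below mirrors that path construction only; it is not ADF code, the function name `plan_copies` is hypothetical, and the sample folder tree is made up for illustration.

```python
def plan_copies(tree):
    """Map each yyyy/mm/dd/ source folder to year=/month=/day=/ in the sink."""
    plan = []
    for year, months in tree.items():        # pipeline1: loop over yyyy folders
        for month, days in months.items():   # pipeline2: loop over mm folders
            for day in days:                 # pipeline3: loop over dd folders
                source_dir = f"{year}/{month}/{day}/"
                sink_dir = f"year={year}/month={month}/day={day}/"
                plan.append((source_dir, sink_dir))
    return plan


# Made-up sample of the folder names found under data/.
tree = {"2019": {"01": ["01", "02"]}, "2020": {"12": ["31"]}}

for source_dir, sink_dir in plan_copies(tree):
    print(source_dir, "->", sink_dir)
```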

You will first need the file name (use Getmetadata). Then for each filename, append date and time string using functions like concat(). You can also create a variable 'NewFileName' and use it to pass as a parameter to the copy activity. Then copy source will have the original file name and sink will have the new file name. Copy activity will be parameterized as you will be passing file name dynamically.
Hope this helps.
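As a rough illustration of the renaming step, the Python sketch below appends a date and time string to a file name before its extension. It is not ADF code: the function name `add_timestamp` and the timestamp format are assumptions, and in a pipeline the same effect would come from building the sink file name with concat() and a formatted date value.

```python
from datetime import datetime


def add_timestamp(file_name: str, when: datetime) -> str:
    """Insert a yyyyMMddHHmmss stamp between the name and its extension."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    stem, dot, extension = file_name.rpartition(".")
    if not dot:
        # No extension present: append the stamp at the end.
        return f"{file_name}_{stamp}"
    return f"{stem}_{stamp}.{extension}"


# The timestamp is passed in explicitly so the result is reproducible.
print(add_timestamp("sales.csv", datetime(2022, 10, 1, 9, 30, 0)))
```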

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date indicated in the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders and filtering them and saving them is easy. Problems start when I'm trying to set output dataset file name dynamically. I can get the folder names using Get Metadata activity and then I can use ForEach activity to set them into variables. However, I can't figure out how to use this variable in filtering data flow sinks dataset.

Update:
My Get Metadata1 activity, set the container input as:
Set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files for example.
My input looks something like this
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
Use the Get Metadata1 activity to get the folder list, then use the ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add the dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add the dynamic content @item().name to the parameter FolderName.
Key in File_Name to the tab. It will store the file name as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the Settings of the sink, we can select Name file as column data and File_Name.
That's all.
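To check what the derived-column expression produces, it can be traced in Python. This is a sketch under the assumption that the data flow expression language uses 1-based positions for both substring and array indexing; the function name `derive_output_name` is hypothetical. With the sample value, the numeric part keeps its zero padding and extension (001.csv), so further trimming would be needed to get names like data_2021-01-01_1.csv.

```python
def derive_output_name(file_name_column: str) -> str:
    """Mirror concat('data_', substring(F, 2, 10), '_', split(F, '-')[5])."""
    # substring(F, 2, 10): start at the 2nd character, take 10 characters.
    date_part = file_name_column[1:11]
    # split(F, '-')[5]: 5th element with 1-based indexing, index 4 in Python.
    last_part = file_name_column.split("-")[4]
    return "data_" + date_part + "_" + last_part


print(derive_output_name("/2021-01-01/data-file-001.csv"))
```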

Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy few files from an ADLS Gen1 location to another ADLS Gen1 location, but have to create folder based on file name.
I am having few files as below in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy these files into destination ADLS as below with only csv file and create folder from file name (If folder exists, copy to that folder) :
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy these files into destination ADLS as below with only csv and json files and create folder from file name (If folder exists, copy to that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!

So I am not sure if this will entirely help, but I had a similar situation where we have 1 zip file and I had to copy those files out into their own folders.
So what you can do is use parameters in the datasink that you would be using, plus a variable activity where you would do a substring.
The job below is more for the delta job, but I think has enough stuff in it to hopefully help. My job can be divided into 3 sections.
The first Orange section gets the latest file name date from ADLS gen 1 folder that you want to copy.
It is then moved to the orange block. On the bottom I get the latest file name based on the ADLS gen 1 date and then I do a sub-string where I take out the date portion of the file. In your case you might be able to do an array and capture all of the folder names that you need.
Getting file name
Getting Substring
In the top section I first extract and unzip that file into a test landing zone.
Source
Sink
I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will then become folders for the copy activity.
Get File names from initial landing zone:
I then pass on those childitems from "Get list of staged files" into ForEach:
In that ForEach activity I have one copy activity. For that I made two datasets. One grabs the files from the initial landing zone that we have created. For this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into there. From that 1 zip file we expect 5 files.
In the Sink section, I created a new dataset with parameters for folder and file name. In that dataset I am putting the data into the same container, but in a new folder called "Stage" concatenated with the item name. I also added a "replace" command to remove the ".txt" from the file name.
What this will do then is what ever the file name that is coming from that dummy staging it will then have a folder name specifically for each file. Based on your requirements I am not sure if that is what you want to do, but you can always rework that to be more specific.
For the item name I basically take the same file name, replace the ".txt", concat the date value, and only after that add the ".txt" extension. Otherwise the ".txt" would have ended up inside the file name.
In the end I have created a delete activity that will then be used to delete all the files (I am not sure if have set that up properly so feel free to adjust obviously).
Hopefully the description above gave you an idea on how to use parameters for your files. Let me know if this helps you in your situation.
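For the file names in the question, the folder name is the third underscore-separated token, so the substring idea above reduces to a split. The Python sketch below illustrates both scenarios; it is not ADF code, `plan_destination` is a hypothetical helper, and it assumes every file follows the ABCD_<date>_<folder>_<part> naming shown in the question.

```python
def plan_destination(file_name, allowed_suffixes):
    """Return '<folder>/<file>' for wanted files, or None to skip the file."""
    if not file_name.endswith(tuple(allowed_suffixes)):
        return None
    folder = file_name.split("_")[2]  # e.g. 'AB01'
    return f"{folder}/{file_name}"


files = [
    "ABCD_20200914_AB01_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.json.gz",
]

# Scenario 1: csv only. Scenario 2: csv and json.
scenario_1 = [plan_destination(f, [".csv.gz"]) for f in files]
scenario_2 = [plan_destination(f, [".csv.gz", ".json.gz"]) for f in files]
print(scenario_1)
print(scenario_2)
```

In a pipeline, the same token could be passed to a sink dataset folder parameter, with a Filter activity deciding which extensions are copied.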
