Split a file using Azure Data Factory based on a delimiter

I'm very new to ADF, and I'm stuck with the following use case:
I want to split a big file into multiple smaller files based on a delimiter. The delimiter appears after some rows. For example, the following is the input file content:
row1content
row2content
row3content
-----
row4content
row5content
-----
row6content
row7content
row8content
row9content
-----
row10content
row11content
row12content
Here ----- is the delimiter on which I want to split the file into multiple smaller output files, named MyFile1, MyFile2, MyFile3, MyFile4 and so on, so that their content is as follows (split on the delimiter):
MyFile1:
row1content
row2content
row3content
MyFile2:
row4content
row5content
MyFile3:
row6content
row7content
row8content
row9content
MyFile4:
row10content
row11content
row12content
I'm trying to achieve this using Data Flows in ADF.
The source and destination of the input/output files will be Azure Blob Storage.
It would be really helpful if someone could point me in a direction or to a resource from which I can proceed further.
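For reference, the desired split is easy to express outside ADF; a minimal Python sketch of the logic (assuming the delimiter is a row of dashes, and using input.txt as a placeholder name for the big source file) would be:

# split_by_delimiter.py - illustrative only; reads the big file and writes MyFile1, MyFile2, ...
import re

part = 1
out = open(f"MyFile{part}", "w")
with open("input.txt") as src:                   # placeholder name for the big source file
    for line in src:
        if re.fullmatch(r"-+", line.strip()):    # delimiter row: start the next output file
            out.close()
            part += 1
            out = open(f"MyFile{part}", "w")
        else:
            out.write(line)
out.close()

Inside ADF itself the same logic would have to be rebuilt in a data flow or an Azure Function, since a plain copy activity cannot split one file on a content delimiter.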

Related

I want to use part of the source file name as the file name in the sink folder (Azure Synapse Analytics).

I am trying to unzip a Zip folder with 21 files in a Synapse pipeline, create a folder for each file name, and put each file in there.
Example
source: zipfolder/zipfiles (SRI W 01 maker メーカー.zip and SRI W 02 product_name 製品名.zip)/files (SRI W 01 maker メーカー.csv and SRI W 02 product_name 製品名.csv)
I tried the answer below, but an error occurred.
I have attached a screenshot of the error message.
*"メーカー","製品名" are Japanese.
sink: SRI W 01 maker(folder)/SRI W 01 maker.csv
SRI W 02 product_name(folder)/SRI W 02 product_name.csv
*I want to remove the Japanese part.
This is an example of the file structure; the actual file names contain Japanese characters.
The actual file name is
"SRI W 01 maker メーカー", which has a space and Japanese characters at the end, and the Japanese part shows up as garbled characters.
I would like to remove this Japanese part and create the folder and file names dynamically.
Sorry for the poor explanation.
If necessary, I can give you screenshots and other information.
Source dataset is here:
1. Zipped folder (it has 21 zip files).
2. The 21 zip files under the zip folder (each of them has one file containing Japanese).
3. Example of a file containing Japanese.
I have broken the requirement into 2 parts, because my local machine does not allow me to zip a file whose name contains non-English characters (needed to reproduce the requirement).
1. Creating a pipeline to unzip the main zip file.
Upload your zip file to a storage account and use the following procedure. Here, I have used the same storage account as both source and sink.
Using a copy data activity, create a binary dataset for the source and set its compression type to ZipDeflate. In the Source tab, check the Recursively checkbox.
UPDATE:
Please make sure that the copy data activity settings are identical to the ones shown in the following images.
Source settings:
Source dataset:
Sink settings:
Sink dataset:
For the destination, create a binary dataset as well, and specify a name in the folder field. It creates a folder where your sub zip folders will be decompressed (I have given its value as repro1407). Run the pipeline, and the following is the output it creates.
The following folders are created inside repro1407, each with a file containing Japanese characters (I had to manually upload a file containing Japanese characters because of the reason specified above).
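For reference, the effect of this first pipeline can be sketched locally in Python like this (a rough equivalent only; the ADF copy activity with ZipDeflate does the same decompression for you):

# unzip_all.py - rough local equivalent of part 1: extract every zip into its own folder.
import os
import zipfile

source_dir = "zipfolder"    # folder holding the 21 zip files (name taken from the example above)
target_dir = "repro1407"    # destination folder, matching the sink folder name used in the pipeline

for name in os.listdir(source_dir):
    if name.lower().endswith(".zip"):
        folder = os.path.join(target_dir, name)            # one folder per zip, named after the zip file
        os.makedirs(folder, exist_ok=True)
        with zipfile.ZipFile(os.path.join(source_dir, name)) as zf:
            zf.extractall(folder)                          # the csv inside keeps its original name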
2. Creating a pipeline to remove the Japanese characters.
Now this part will help you remove the unwanted characters from the file names inside the folders (shown in the above image).
After the above process the folders are structured as:
`repro1407 -> {country_data.zip -> {SRI W 01 maker メーカー.csv}, person.zip -> {SRI W 02 product_name 製品名.csv}}`
Use a Get Metadata activity to get the names of the folders in repro1407 by creating a dataset pointing to it and setting the Field list to Child items. Give the output of Get Metadata to a ForEach activity and set its items value to @activity('Get Metadata1').output.childItems
Inside the ForEach, use a set variable activity to create a variable folder_name (String type) and set it to @item().name. Connect it to another Get Metadata activity whose dataset folder uses the dynamic content @concat('repro1407','/',item().name) (Field list is Child items). Then create an execute pipeline activity as shown below:
In the image below of the execute pipeline activity, which invokes the for pipeline, the highlighted parameters are parameters of the for pipeline, and we pass their values from this pipeline. (I had to use an execute pipeline activity because the nested folders require a nested ForEach, and nested ForEach activities are not supported directly.)
Inside the for pipeline there is a single ForEach activity whose items value is @pipeline().parameters.file_names (the list of file names inside the folder_name folder), which we passed from the previous pipeline. Inside this ForEach there are 3 activities.
The first, a set variable split_file (Array type), splits the current file name (there is only one). Since the file name contains spaces, I split on ' '. The value to set is @split(item().name,' ')
The second, a set variable japanese, gets the Japanese part. Since it occurs at the end of the file name, use the value @last(variables('split_file')) to get it with last() on the split_file array.
The last is a copy data activity. I save the renamed file in the same folder it was taken from. Create source and sink datasets (open the dataset -> Parameters and create). Create 2 parameters for each of these datasets: folder_structure and ip_name for the source dataset, and folder_struct and output_name for the sink dataset (to use them as the path).
The values to pass for the source dataset are @concat('repro1407','/',pipeline().parameters.folder_name) for folder_structure and @item().name for ip_name.
Similarly, the values for the sink dataset are @concat(replace(item().name,variables('japanese'),''),'.csv') for output_name and @concat('repro1407','/',pipeline().parameters.folder_name) for folder_struct.
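To make those expressions concrete, this is roughly what they compute for one of the example file names (a plain Python sketch of the same string logic, not ADF code):

# rename_logic.py - mirrors the split / last / replace expressions used in the for pipeline.
name = "SRI W 01 maker メーカー.csv"                  # item().name

split_file = name.split(" ")                          # @split(item().name,' ')
japanese = split_file[-1]                             # @last(variables('split_file')) -> "メーカー.csv"
output_name = name.replace(japanese, "") + ".csv"     # @concat(replace(item().name, japanese, ''), '.csv')

print(output_name)                                    # "SRI W 01 maker .csv" (add a trim if the trailing space matters)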
The following is how you use these values in datasets:
Source dataset:
Sink dataset:
Publish and run the pipeline. It runs successfully and generates the required output. The following is an image of one folder (every folder would look the same).
NOTE:
If you don't need the original file (with the Japanese characters), you can simply add a delete activity and use dynamic content to delete the original file.
If the Japanese characters are not always at the end of the file name, then you have to find a pattern that can be applied to all 21 sub-folders. This logic needs to be applied in the set variable activities inside the ForEach activity of the for pipeline.
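For example, one possible pattern (not part of the original answer, purely an illustration) is to drop every non-ASCII character from the name and tidy the leftover spaces, which works wherever the Japanese part appears:

# strip_japanese.py - one possible generic pattern, assuming only the unwanted part is non-ASCII.
import re

def clean(name: str) -> str:
    stem, ext = name.rsplit(".", 1)
    stem = re.sub(r"[^\x00-\x7F]+", "", stem)   # drop non-ASCII (Japanese) characters
    stem = re.sub(r"\s+", " ", stem).strip()    # tidy up the spaces left behind
    return f"{stem}.{ext}"

print(clean("SRI W 01 maker メーカー.csv"))      # -> "SRI W 01 maker.csv"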

Read only specific csv files in azure dataflow source

I have a data flow source, a delimited text dataset that points to a folder containing many csv files.
So the source reads all the csv files inside folder2. The files inside folder2 are:
abc.csv
someFile.csv
otherFile_2021.csv
predicted_file_1.csv
predicted_file_2.csv
predicted_file_99.csv
The aim is to read data only from files like predicted_file_*.csv, i.e. to read only the last three files. Is it possible to add dynamic content to the dataset so that it reads files matching a specific pattern?
In the source transformation, under Source options, you can provide a wildcard path with the file name prefix (for example folder2/predicted_file_*.csv) to read only the required files.
Example:
(For debugging purposes, I have added a column to store the file name, to verify the files.)
Source:
Source preview:
Refer to this document for more information.
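As a quick sanity check of what a prefix wildcard such as predicted_file_*.csv picks up from the example folder, the same match can be sketched locally:

# wildcard_check.py - shows which of the example files a predicted_file_*.csv pattern selects.
from fnmatch import fnmatch

files = ["abc.csv", "someFile.csv", "otherFile_2021.csv",
         "predicted_file_1.csv", "predicted_file_2.csv", "predicted_file_99.csv"]

matched = [f for f in files if fnmatch(f, "predicted_file_*.csv")]
print(matched)   # ['predicted_file_1.csv', 'predicted_file_2.csv', 'predicted_file_99.csv']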

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve the following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date indicated in the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders, filtering them, and saving them is easy. Problems start when I try to set the output dataset file name dynamically. I can get the folder names using a Get Metadata activity and then use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow's sink dataset.
Update:
My Get Metadata1 activity:
Set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files as an example.
My input looks something like this
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
Use a Get Metadata1 activity to get the folder list and then a ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add the dynamic content @dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add the dynamic content @item().name to the parameter FolderName.
Key in File_Name in the 'Column to store file name' field. It will store the file name as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
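For one of the sample paths, that expression evaluates roughly as follows (a Python equivalent; note that data flow substring() and array indexes are 1-based, hence the shifted indexes below):

# derived_column_check.py - mirrors concat('data_', substring(File_Name,2,10), '_', split(File_Name,'-')[5])
file_name = "/2021-01-01/data-file-001.csv"

date_part = file_name[1:11]              # substring(File_Name, 2, 10) -> "2021-01-01"
last_part = file_name.split("-")[4]      # split(File_Name, '-')[5]    -> "001.csv"

print("data_" + date_part + "_" + last_part)   # data_2021-01-01_001.csv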
In the Settings of the sink, we can select 'Name file as column data' and choose the File_Name column.
That's all.

Azure Data Factory - Data Flow - After completion - move

I am using an ADF v2 Data Flow activity to load data from a csv file in Blob Storage into a table in an Azure SQL database. In the Data Flow (source - Blob Storage), under Source options, there is an option 'After completion (No action / Delete source files / Move)'. I want to use the Move option to save those csv files in a container, renaming the files by concatenating today's date. How do I frame the logic for this? Can someone please help?
You can define the file name explicitly in both the From and To fields. This is not well documented (if at all), and I found it just by trying different approaches.
You can also add dynamic content such as timestamps. Here's an example:
concat('incoming/archive/', toString(currentUTC(), 'yyyy-MM-dd_HH.mm.ss_'), 'target_file.csv')
You could parameterize the source file to achieve that. Please refer to my example.
Data Flow parameter settings:
Set the source file and move expression in Source Options:
Expression to rename the source to "name + current date":
concat(substring($filename, 1, length($filename)-4),toString(currentUTC(),'yyyy-MM-dd') )
My full file name is "word.csv", and the output file name is "word2020-01-26".
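Traced locally for the "word.csv" example (remember the data flow substring() is 1-based, unlike Python slicing):

# move_rename_check.py - mirrors concat(substring($filename,1,length($filename)-4), toString(currentUTC(),'yyyy-MM-dd'))
from datetime import datetime, timezone

filename = "word.csv"
stem = filename[0:len(filename) - 4]                       # substring($filename, 1, length($filename)-4) -> "word"
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")    # toString(currentUTC(), 'yyyy-MM-dd')

print(stem + stamp)   # e.g. "word2020-01-26" when run on 26 Jan 2020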
HTH.

write data to text file in azure data factory version 2

It seems ADF v2 does not support writing data to a TEXT file (.TXT).
After selecting File System,
I don't see TextFormat on the next screen.
So is there any method to write data to a TEXT file?
Thanks,
Thai
Data Factory only supports these 6 file formats:
Please see: Supported file formats and compression codecs in Azure Data Factory.
If we want to write data to a txt file, the only format we can use is Delimited text; when the pipeline finishes, you will get a txt file.
Reference: Delimited text: Follow this article when you want to parse the delimited text files or write the data into delimited text format.
For example, I created a pipeline to copy data from Azure SQL to Blob, choosing the DelimitedText format for the sink dataset:
The txt file I get in Blob Storage:
Hope this helps
I think what you are looking for is the DelimitedText dataset. You can specify the extension as part of the file name.
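To make the point concrete: a .txt file produced this way is just delimited text with a different extension. A local sketch (the file name and rows below are placeholders standing in for the copied data):

# write_txt.py - the sink just writes delimited rows; the .txt extension is only part of the file name.
import csv

rows = [["id", "name"], [1, "alpha"], [2, "beta"]]   # placeholder data standing in for the SQL query result

with open("output.txt", "w", newline="") as f:       # extension chosen in the dataset file name
    writer = csv.writer(f, delimiter=",")            # column delimiter configured on the DelimitedText dataset
    writer.writerows(rows)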
