Unzip files from a ZIP file and use them as source in an ADF Copy activity - azure

I see there is a way to deflate a ZIP file, but when there are multiple .csv files within the ZIP, how do I specify which one to use as the source for the copy activity? Right now it parses both csv files and returns them as a single file, and I'm not able to select the file I want as the source.

Based on my tests, we can't unzip a .zip file in ADF to get the list of file names in an ADF dataset, so I'm providing the workaround below for your reference.
Firstly, you could use an Azure Function activity to trigger a function that handles the decompression of your zip file. The function only needs to get the list of file names and return it as an array.
Secondly, use a ForEach activity to loop over the result and get your desired file name.
Finally, inside the ForEach activity, use @item() in the dataset to configure the specific file path so that you can reference it in the copy activity.
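As a rough sketch of that first step, an HTTP-triggered PowerShell function could open the zip blob, read the entry names, and return them as an array. Everything below (the app setting name, the container/blob request parameters, and the fileNames property) is an assumption for illustration, and it assumes the Az.Storage module is enabled for the function app:
using namespace System.Net
param($Request, $TriggerMetadata)

# Storage connection and blob coordinates - assumed names, passed from ADF in the request body
$connectionString = $env:STORAGE_CONNECTION_STRING
$containerName    = $Request.Body.container
$blobName         = $Request.Body.blobName

# Download the zip to the function's temp folder
$ctx     = New-AzStorageContext -ConnectionString $connectionString
$tempZip = Join-Path ([System.IO.Path]::GetTempPath()) 'input.zip'
Get-AzStorageBlobContent -Container $containerName -Blob $blobName -Destination $tempZip -Context $ctx -Force | Out-Null

# Read the entry names without extracting the archive
Add-Type -AssemblyName System.IO.Compression.FileSystem
$zip = [System.IO.Compression.ZipFile]::OpenRead($tempZip)
$fileNames = @($zip.Entries | ForEach-Object { $_.FullName })
$zip.Dispose()

# Return the list as JSON; ADF reads it from the Azure Function activity output
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode  = [HttpStatusCode]::OK
    ContentType = 'application/json'
    Body        = @{ fileNames = $fileNames } | ConvertTo-Json
})
The ForEach activity's Items setting would then point at that array in the Azure Function activity's output, and @item() inside the loop gives you each file name.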

Related

how to segregate files in a blob storage using ADF copy activity

I have a copy data activity in ADF and I want to segregate the files into different containers based on the file type.
ex.
Container A - .jpeg, .png
Container B - .csv, .xml and .doc
My initial idea was to use an 'if condition' with an 'or' statement, but it looks like my approach won't work.
I'd appreciate it if you could give some inputs.
First, get the list of files from the source container, loop over each file in a ForEach activity, check the extension with an If Condition, and copy the files to their respective containers based on the condition.
I have the below files in my source container.
In ADF:
Using the Get metadata activity, get the list of all files from the source container.
Output of Get Metadata activity:
Pass the output list to Foreach activity.
@activity('Get Metadata1').output.childItems
Inside the Foreach activity, add If Condition activity to separate the files based on extension.
@or(contains(item().name, '.xml'), contains(item().name, '.csv'))
If the condition is true, copy the current file to container1.
If the condition returns false, copy the current file to container2 in the False activities branch.
Files in the container after running the pipeline.
Container1:
Container2:
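If it helps to see the routing rule on its own, here is a minimal PowerShell sketch of the same decision the If Condition makes (the file names and container labels are made up for illustration):
# Sketch only: route files by extension, mirroring the If Condition above
$files = 'photo1.jpeg', 'photo2.png', 'data.csv', 'config.xml', 'notes.doc'

foreach ($file in $files) {
    $ext = [System.IO.Path]::GetExtension($file)
    if ($ext -in '.jpeg', '.png') {
        "Copy $file to Container A"
    }
    else {
        "Copy $file to Container B"
    }
}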
You could use a Get Metadata activity to first get all the files of the types that need to be passed to a Copy Data activity that copies to Container A, and then add another Get Metadata activity to get the file types for the next Copy Data activity that copies to Container B.
So your ADF pipeline may look like GetMetadata1 -> CopyData1 -> GetMetadata2 -> CopyData2. Refer to how to use the Get Metadata activity in this article, and the documentation.
The Copy activity itself allows more than one wildcard path, so you could use that in the data source; see this.

Can you use ADF to copy a file to an existing Zip folder in a way that it will add that file to the other files within the zip folder?

I could not find a way to do this; I keep getting an error that a directory already exists, even though I am using zipDeflate on the sink side and no compression on the source side of the copy activity. My goal is to add this file to the zipped folder here:
and here are the contents of the zipped folder:
My expectation was that I could use ADF to simply add the file in the first screenshot to this zipped folder, but I have found no way to do this.
I have used ADF to uncompress zipped folders and files, and the opposite, compressing a single file or folder, but I have never been able to add a file to a pre-existing zip folder.
I think we need to use an Azure Function to do this. Create a PowerShell function in Azure.
PowerShell code for adding a file to the existing zip file:
# Add a file to an existing zip file
Function AddtoExistingZip ($ZIPFileName, $NewFileToAdd)
{
    # Load the compression assembly
    [Reflection.Assembly]::LoadWithPartialName('System.IO.Compression.FileSystem') | Out-Null
    # Open the existing archive in update mode
    $zip = [System.IO.Compression.ZipFile]::Open($ZIPFileName, "Update")
    # Use the file's own name as the entry name inside the archive
    $FileName = [System.IO.Path]::GetFileName($NewFileToAdd)
    [System.IO.Compression.ZipFileExtensions]::CreateEntryFromFile($zip, $NewFileToAdd, $FileName, "Optimal") | Out-Null
    $zip.Dispose()
}
Use an Azure Function activity in ADF to trigger the PowerShell function.
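For example, the function body that ADF triggers might download the existing zip and the new file, append the file with AddtoExistingZip, and upload the updated zip back. The container, blob, and setting names below are assumptions, and the sketch assumes the Az.Storage module is available to the function:
# Sketch only: download the zip and the new file, append, and upload the zip back
$ctx       = New-AzStorageContext -ConnectionString $env:STORAGE_CONNECTION_STRING
$zipLocal  = Join-Path ([System.IO.Path]::GetTempPath()) 'archive.zip'
$fileLocal = Join-Path ([System.IO.Path]::GetTempPath()) 'newfile.csv'

# Download the existing zip and the file to add (assumed container/blob names)
Get-AzStorageBlobContent -Container 'data' -Blob 'archive.zip' -Destination $zipLocal -Context $ctx -Force | Out-Null
Get-AzStorageBlobContent -Container 'data' -Blob 'newfile.csv' -Destination $fileLocal -Context $ctx -Force | Out-Null

# Append the file to the archive, then push the updated zip back to the container
AddtoExistingZip $zipLocal $fileLocal
Set-AzStorageBlobContent -Container 'data' -File $zipLocal -Blob 'archive.zip' -Context $ctx -Force | Out-Null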

Unzip a file contains multiple text files using copy activity in azure data factory

I have an issue while unzipping a file that contains multiple text files. I used a copy activity to unzip the file, but it creates a folder named after the source zip file and my text files end up inside that. My requirement is that the text files should be placed in the folder I want.
I tried the copy sink properties below, but nothing worked:
Flatten hierarchy + @{item().name}
None + @{item().name}
Preserve hierarchy + @{item().name}
Please unselect Preserve zip file name as folder on the source tab; ADF will then not create the xxx.zip folder.
On the source dataset, we can select ZipDeflate as the compression type.
On the sink dataset, select None as the compression type.

Data Factory Data Flow sink file name

I have a data flow that merges multiple pipe-delimited files into one file and stores it in an Azure Blob container. I'm using a file pattern for the output file name: concat('myFile' + toString(currentDate('PST')), '.txt').
How can I grab the file name that's generated after the dataflow is completed? I have other activities to log the file name into a database, but not able to figure out how to get the file name.
I tried @{activity('Data flow1').output.filePattern} but it didn't help.
Thank you
You can use a Get Metadata activity after the data flow to get the file name that was generated, for example by pointing it at the sink folder and requesting the childItems field.

Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy a few files from an ADLS Gen1 location to another ADLS Gen1 location, but I have to create folders based on the file names.
I have a few files as below in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy these files into the destination ADLS as below, keeping only the csv files, and create the folder from the file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy these files into the destination ADLS as below, keeping the csv and json files, and create the folder from the file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!
So I am not sure if this will entirely help, but I had a similar situation where we had one zip file and I had to copy its files out into their own folders.
What you can do is use parameters on the sink dataset, plus a variable activity where you do a substring.
The pipeline below is really for a delta job, but I think it has enough in it to hopefully help. My job can be divided into three sections.
The first (orange) section gets the latest file name date from the ADLS Gen1 folder that you want to copy.
It then moves on to the orange block. At the bottom I get the latest file name based on the ADLS Gen1 date, and then I do a substring where I take the date portion out of the file name. In your case you might be able to use an array and capture all of the folder names that you need.
Getting file name
Getting Substring
In the top section, I first extract and unzip that file into a test landing zone.
Source
Sink
I then get the names of all the files that were in that zip file, to be used in the ForEach activity. These file names will then become folders for the copy activity.
Get File names from initial landing zone:
I then pass those childItems from "Get list of staged files" into the ForEach:
In that ForEach activity I have one copy activity. For that I made two datasets, one of which grabs the files from the initial landing zone we just created. For this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into it. From that one zip file we expect 5 files.
In the sink section, I created a new dataset with parameters for folder and file name. In that dataset I am putting the data into the same container, but in a new folder called "Stage" concatenated with the item name. I also added a "replace" to remove the ".txt" from the file name.
What this does is give every file name coming from that dummy staging area a folder named specifically for it. Based on your requirements I am not sure if that is exactly what you want, but you can always rework it to be more specific.
For the item name I basically take the same file name, replace the ".txt", concatenate the date value, and only then add the ".txt" extension back; otherwise I would have ended up with ".txt" in the middle of the file name.
At the end I created a Delete activity that is then used to delete all the staged files (I am not sure if I have set that up properly, so feel free to adjust it).
Hopefully the description above gives you an idea of how to use parameters for your files. Let me know if this helps in your situation.
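For the folder-from-file-name part of this question specifically, the folder key (AB01, AB02, ...) is the third underscore-separated token of the file name, so inside the ForEach the sink folder parameter could be set with an expression along the lines of @{split(item().name, '_')[2]} (worth verifying the index against your exact naming). The same derivation as a standalone PowerShell sketch, with the file names taken from the question:
# Sketch only: derive the destination folder from names like ABCD_20200914_AB03_Part01.csv.gz
$files = @(
    'ABCD_20200914_AB01_Part01.csv.gz',
    'ABCD_20200914_AB03_Part01.csv.gz',
    'ABCD_20200914_AB03_Part01.json.gz'
)

foreach ($file in $files) {
    # Scenario 1 keeps only .csv.gz; include .json.gz as well for scenario 2
    if ($file -notlike '*.csv.gz' -and $file -notlike '*.json.gz') { continue }

    $folder = ($file -split '_')[2]   # -> AB01, AB03, ...
    "$folder/$file"                   # destination path, e.g. AB03/ABCD_20200914_AB03_Part01.csv.gz
}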