Azure Data Factory unzip and move files - zip

I know this has been asked before, both here and in another question, and both describe exactly my case. I have downloaded a zip file to Azure Blob storage (via ADF), and I am trying to decompress it and move the files to another location within the same blob container.
However, having tried both of those approaches, I only end up with the zipped file moved to another location, still compressed.

Trying to understand your question: was your outcome a zip file, or a folder with .zip in its name? It sounds crazy, so let me explain in detail. In ADF, decompressing a zip file with a copy activity creates a folder (with .zip in its name) that contains the actual files.
Example: let's say you have sample.txt inside abc.zip.
Blob source path: container1/abc.zip [here, abc.zip is a compressed file]
Output path will be: container2/abc.zip/sample.txt [here, abc.zip is the name of the decompressed folder]
This is what you get when the copy behaviour of the sink is "none". Hope it helps :)
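
As a local analogy only (plain Python, not ADF), the folder layout described above can be reproduced with the standard zipfile module. The helper name extract_like_adf and its preserve_zip_name_as_folder parameter are made up for this sketch, and container1/container2 would be ordinary local folders standing in for blob containers.

```python
import zipfile
from pathlib import Path


def extract_like_adf(zip_path, sink_dir, preserve_zip_name_as_folder=True):
    """Extract zip_path under sink_dir and return the extracted file paths.

    Hypothetical helper: it only imitates the folder layout described above,
    it does not call Azure Data Factory.
    """
    zip_path = Path(zip_path)
    target = Path(sink_dir)
    if preserve_zip_name_as_folder:
        # Mirrors the described behaviour: a folder literally named "abc.zip"
        target = target / zip_path.name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return [target / n for n in names]
```

Called with container1/abc.zip and container2, this would place the file at container2/abc.zip/sample.txt, which is the same shape as the output path in the example.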

Related

Can you use ADF to copy a file to an existing Zip folder in a way that it will add that file to the other files within the zip folder?

I could not find a way to do this. I keep getting an error that a directory already exists, even though I am using ZipDeflate on the sink side and no compression on the source side of the copy activity. My goal is to add this file to the zipped folder here:
and here are the contents of the zipped folder:
My expectation was that I could use ADF to just add the file in the first screenshot to this zipped folder, but I have found no way to do this.
I have used ADF to uncompress zipped folders and files, and to do the opposite (compress a single file or folder), but I have never been able to add a file to a pre-existing zip folder.
I think this needs an Azure Function. Create a PowerShell function in Azure.
PowerShell code for adding a file to an existing zip file:
# Add a file to an existing zip file
Function AddtoExistingZip ($ZIPFileName, $NewFileToAdd)
{
    # Load the .NET zip support (ZipFile and ZipFileExtensions)
    Add-Type -AssemblyName System.IO.Compression.FileSystem
    # "Update" mode opens the archive for both reading and writing
    $zip = [System.IO.Compression.ZipFile]::Open($ZIPFileName, "Update")
    try {
        $FileName = [System.IO.Path]::GetFileName($NewFileToAdd)
        [System.IO.Compression.ZipFileExtensions]::CreateEntryFromFile($zip, $NewFileToAdd, $FileName, "Optimal") | Out-Null
    }
    finally {
        # Disposing the archive writes the new entry to disk
        $zip.Dispose()
    }
}
Use an Azure Function activity in ADF to trigger the PowerShell function.
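
If the Azure Function were written in Python instead of PowerShell, the same idea could be sketched with the standard zipfile module. This is only a sketch under assumptions: add_to_existing_zip is a made-up name, and both the archive and the new file are assumed to be available as local files (for example after downloading the blobs inside the function).

```python
import zipfile
from pathlib import Path


def add_to_existing_zip(zip_path, new_file):
    """Append new_file to the archive at zip_path under its base name."""
    entry_name = Path(new_file).name
    # Mode "a" appends to an existing archive instead of overwriting it
    with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_DEFLATED) as archive:
        if entry_name in archive.namelist():
            raise FileExistsError(f"{entry_name} is already in {zip_path}")
        archive.write(new_file, arcname=entry_name)
    return entry_name
```

The duplicate check is a design choice for this sketch: zipfile would otherwise add a second entry with the same name rather than replace the first one.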

Unzip a file that contains multiple text files using copy activity in Azure Data Factory

I have an issue while unzipping a file that contains multiple text files. I have used a copy activity to unzip the file, but it creates a folder named after the source zip file, and my text files end up inside that folder. My requirement is that the text files should be placed in the folder I choose.
I tried the copy sink properties below, but nothing works:
flatten hierarchy + #{item().name}
none + #{item().name}
preserve hierarchy + #{item().name}
Please unselect "Preserve zip file name as folder" on the Source tab of the copy activity. ADF will then not create the xxx.zip folder.
In the source dataset, select ZipDeflate as the Compression type.
In the sink dataset, select None as the Compression type.
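
For readers who edit the pipeline JSON rather than the UI, the checkbox above presumably corresponds to the preserveZipFileNameAsFolder property of the copy activity source. The sketch below writes those source settings as a Python dict and prints them as JSON. The property names are recalled from the ADF Binary format documentation, and the Blob storage read settings are an assumption, so compare the result with the JSON your own pipeline exports.

```python
import json

# Copy activity source settings, written as a Python dict for illustration.
# Property names are assumptions recalled from the ADF Binary format docs;
# the Blob storage read settings are assumed, not taken from the question.
copy_source = {
    "type": "BinarySource",
    "storeSettings": {"type": "AzureBlobStorageReadSettings", "recursive": True},
    "formatSettings": {
        "type": "BinaryReadSettings",
        "compressionProperties": {
            "type": "ZipDeflateReadSettings",
            # False = do not create a folder named after the zip file
            "preserveZipFileNameAsFolder": False,
        },
    },
}

print(json.dumps(copy_source, indent=2))
```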

Is there a tool to extract a file from a ZIP archive when that file is not present in central directory but has its own LFH?

I'm looking for a tool that can extract files by searching aggressively through a ZIP archive. The compressed files are preceded by LFHs (local file headers), but no CDHs (central directory headers) are present. Unzip outputs an empty folder.
I found one called 'binwalk', but even though it finds the hidden files inside ZIP archives, it does not seem to know how to extract them.
Thank You in advance.
You can try sunzip. It reads the zip file as a stream, and will extract files as it encounters the local headers and compressed data.
Use the -r option to retain the files decompressed in the event of an error. You will be left with a temporary directory starting with _z containing the extracted files, but with temporary, random names.
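
To illustrate the general idea of extracting from local headers alone, here is a minimal Python sketch. It is not sunzip's implementation: the function name recover_entries is made up, the whole archive is assumed to fit in memory, and only stored and deflated entries are handled. A signature that happens to occur inside compressed data can produce a false match; the CRC check filters most of those out when the header carries a CRC.

```python
import struct
import zlib

LFH_SIGNATURE = b"PK\x03\x04"
# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length
LFH_FORMAT = "<4s5H3I2H"
LFH_SIZE = struct.calcsize(LFH_FORMAT)  # 30 bytes


def recover_entries(data):
    """Return {name: bytes} for entries found by scanning for local headers."""
    recovered = {}
    pos = data.find(LFH_SIGNATURE)
    while pos != -1 and pos + LFH_SIZE <= len(data):
        (_sig, _version, flags, method, _mtime, _mdate, crc,
         comp_size, _size, name_len, extra_len) = struct.unpack_from(LFH_FORMAT, data, pos)
        name_start = pos + LFH_SIZE
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        body = name_start + name_len + extra_len
        # Bit 3 set: sizes and CRC are stored after the data, not in this header
        sizes_deferred = bool(flags & 0x08)
        payload = None
        if method == 8:
            try:
                # Raw deflate ends itself, so the compressed size is not needed
                payload = zlib.decompressobj(-zlib.MAX_WBITS).decompress(data[body:])
            except zlib.error:
                payload = None
        elif method == 0 and not sizes_deferred:
            payload = data[body:body + comp_size]
        if payload is not None and (sizes_deferred or zlib.crc32(payload) == crc):
            recovered[name] = payload
        pos = data.find(LFH_SIGNATURE, pos + 4)
    return recovered
```

A caller would read the damaged archive with open(path, "rb").read() and write each recovered payload to a file of its choice.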

Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy a few files from one ADLS Gen1 location to another ADLS Gen1 location, but I have to create folders based on the file names.
I have the following files in the source ADLS:
ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz
Scenario-1
I have to copy only the csv files into the destination ADLS as shown below, creating a folder from each file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
Scenario-2
I have to copy only the csv and json files into the destination ADLS as shown below, creating a folder from each file name (if the folder already exists, copy into that folder):
AB01-
|-ABCD_20200914_AB01_Part01.csv.gz
AB02-
|-ABCD_20200914_AB02_Part01.csv.gz
AB03-
|-ABCD_20200914_AB03_Part01.csv.gz
|-ABCD_20200914_AB03_Part01.json.gz
AB04-
|-ABCD_20200914_AB04_Part01.csv.gz
|-ABCD_20200914_AB04_Part01.json.gz
Is there any way to achieve this in Data Factory?
Appreciate any leads!
I am not sure this will entirely help, but I had a similar situation where we had one zip file and I had to copy the files in it out into their own folders.
What you can do is use parameters in the sink dataset, plus a Set Variable activity where you take a substring.
The job below is really a delta job, but I think it has enough in it to help. My job can be divided into 3 sections.
The first orange section gets the date of the latest file name from the ADLS Gen1 folder that you want to copy.
It is then moved to the orange block. At the bottom I get the latest file name based on the ADLS Gen1 date, and then I take a substring to pull out the date portion of the file name. In your case you might be able to build an array and capture all of the folder names that you need.
Getting file name
Getting Substring
In the top section I first extract and unzip that file into a test landing zone.
Source
Sink
I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will become the folders for the copy activity.
Get File names from initial landing zone:
I then pass on those childitems from "Get list of staged files" into ForEach:
In that ForEach activity I have one copy activity. For that I made two datasets. One grabs the files from the initial landing zone that we created. For this example let's call it Staging (forgive the MS Paint drawing):
The purpose of this is to go to that dummy folder and grab each file that was just copied into it. From that one zip file we expect 5 files.
In the Sink section I created a new dataset with parameters for folder and file name. In that dataset I am putting the data into the same container, but in a new folder called "Stage" concatenated with the item name. I also added a "replace" call to remove the ".txt" from the file name.
What this does is give every file coming from that dummy staging folder a folder named specifically for it. Based on your requirements I am not sure that is what you want to do, but you can always rework it to be more specific.
For the item name I basically take the same file name, replace the ".txt", concat the date value, and only after that add the ".txt" extension. Otherwise I would have had two ".txt" in the file name.
At the end I created a Delete activity that is used to delete all the staged files (I am not sure I have set that up properly, so feel free to adjust).
Hopefully the description above gave you an idea on how to use parameters for your files. Let me know if this helps you in your situation.
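
For the file names in the question, the folder-from-file-name step can be sketched in plain Python as below. The helper destination_path is hypothetical and only shows the string logic. In a pipeline, the same split would presumably be written as an expression such as @split(item().name, '_')[2] on the sink dataset's folder parameter inside the ForEach; that expression is an assumption and has not been tested here.

```python
from pathlib import PurePosixPath


def destination_path(file_name, allowed_suffixes=(".csv.gz",)):
    """Return "<folder>/<file_name>", or None when the file type is filtered out."""
    if not file_name.endswith(tuple(allowed_suffixes)):
        return None
    # ABCD_20200914_AB01_Part01.csv.gz -> ["ABCD", "20200914", "AB01", "Part01.csv.gz"]
    folder = file_name.split("_")[2]
    return str(PurePosixPath(folder) / file_name)


files = [
    "ABCD_20200914_AB01_Part01.csv.gz",
    "ABCD_20200914_AB02_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.json.gz",
    "ABCD_20200914_AB04_Part01.json.gz",
    "ABCD_20200914_AB04_Part01.csv.gz",
]

# Scenario 1: csv only. Scenario 2: csv and json.
scenario_1 = [p for p in (destination_path(f) for f in files) if p]
scenario_2 = [p for p in (destination_path(f, (".csv.gz", ".json.gz")) for f in files) if p]
```

Because the folder is derived per file, a folder that already exists is simply reused, which matches the "if folder exists, copy to that folder" requirement.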

Deleting Zip Entry from Zip archive using Minizip API

I am using the Minizip API to zip and unzip files to and from my archive. I have a requirement to delete a zip entry from the archive as soon as I extract it.
If the zip archive has multiple entries, I am able to delete a particular entry as soon as I extract it and then rebuild the archive with the remaining entries. I achieve this using a temporary zip file.
But when I have a single file inside the zip archive, I am only able to delete the zip after complete extraction. Is there an optimized way to handle this situation, where I can extract and delete the zip entry in chunks? There is no direct API in Minizip to delete entries; I am using raw write and read.
Thanks in advance,
JP
No, there is no way to delete part of a file in a ZIP archive, short of extracting the whole file and archiving the part you don't want. (Which doesn't make sense here, since you're already trying to extract the file!)
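
For reference, the temporary-zip approach that the question describes for whole entries can be sketched as follows. This uses Python's zipfile module rather than the Minizip C API, so it shows the technique only, and remove_entry is a made-up name. Unlike a raw copy, it decompresses and recompresses each kept entry.

```python
import os
import tempfile
import zipfile


def remove_entry(zip_path, entry_name):
    """Rewrite zip_path without entry_name by copying every other entry."""
    directory = os.path.dirname(os.path.abspath(zip_path))
    # The temporary archive lives next to the original so os.replace stays on one volume
    fd, temp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(zip_path) as source, \
                zipfile.ZipFile(temp_path, "w") as target:
            for info in source.infolist():
                if info.filename != entry_name:
                    # Each kept entry is written with its original compression method
                    target.writestr(info, source.read(info))
        os.replace(temp_path, zip_path)
    except BaseException:
        os.remove(temp_path)
        raise
```

This still works on whole entries only, which is consistent with the answer above: a single large entry cannot be trimmed piece by piece.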

Resources