Copy and decompress a .tar file with Azure Data Factory

I'm trying to copy and decompress a .tar file from FTP to Azure Data Lake Store.
The .tar file contains HTML files. In the copy activity, on the dataset, I selected the compression type GZipDeflate, but I wonder which file format I need to use. Is it supported to do such a thing without a custom activity?

Unfortunately, Data Factory doesn't support decompression of .tar files. The supported types for FTP are GZip, Deflate, BZip2, and ZipDeflate (as seen here: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support).
A solution may be to save the files in one of the supported formats, or to try a custom activity as explained in Import .tar file using Azure Data Factory, although I'm not sure whether that applies to Data Factory v1 or v2.
Hope this helped!

So it's true that there is no way to simply decompress .tar files with ADF or ADL Analytics, but there is an option to read the content of every file inside the .tar archive and save it as output in U-SQL.
In my scenario I needed the content of the HTML files inside the .tar file, so I created a custom HTML extractor that reads the stream content of each HTML file in the .tar archive and saves it to a U-SQL output.
Maybe this can help someone who has a similar use case.
I used SharpCompress.dll for extracting and looping over the .tar files in C#.
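The extract-and-loop idea is language-agnostic; the original used SharpCompress in C#, but as an illustration the same loop can be sketched with Python's standard tarfile module (the .html filter and UTF-8 assumption are mine):

```python
import tarfile

def read_html_members(tar_path: str) -> dict[str, str]:
    # Loop over the archive and read the stream content of each HTML
    # member, keyed by its name inside the archive.
    contents = {}
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" auto-detects compression
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".html"):
                f = tar.extractfile(member)
                contents[member.name] = f.read().decode("utf-8")
    return contents
```

Nothing is written to disk here; each member is read as a stream, which mirrors how the U-SQL extractor approach feeds file content into an output variable.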

Related

Azure Data Factory unzip and move files

I know this has been asked before (in another question as well as here), both of which match my case exactly. I have downloaded (via ADF) a zip file to Azure Blob Storage and I am trying to decompress it and move the files to another location within the Azure Blob container.
However, having tried both of those approaches, I only end up with the zipped file moved to another location without it being unzipped.
Trying to understand your question: was your outcome a zip file, or does the folder name have .zip in it? It may sound odd, so let me explain in detail. In ADF, decompressing a zip file using the copy activity creates a folder (which has .zip in its name) that contains the actual files.
Example: Let's say you have sample.txt inside abc.zip
Blob source path: container1/abc.zip [here abc.zip is a compressed file]
Output path will be: container2/abc.zip/sample.txt [here abc.zip is the decompressed folder name]
This happens when the copy behaviour of the sink is "none". Hope it helps :)
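That folder-naming behaviour can be reproduced locally; here is a sketch of the equivalent extraction (the "container" paths are just placeholder directory names, not real blob containers):

```python
import os
import zipfile

def unzip_like_adf(zip_path: str, dst_container: str) -> list[str]:
    # With sink copy behaviour "none", ADF extracts into a folder that
    # keeps the archive's name, e.g. container2/abc.zip/sample.txt.
    out_dir = os.path.join(dst_container, os.path.basename(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        names = zf.namelist()
    return [os.path.join(out_dir, name) for name in names]
```

So if you see a path ending in .zip with files inside it, the decompression did happen; the archive name simply survives as a folder name.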

Can we extract a zip file in copy activity - Azure Datafactory

I have a zip file. I would like to uncompress the file, get the CSV file, and push it to the blob. I can achieve this with .gz, but with a .zip file we are not able to.
Could you please assist here?
Thanks,
Richard
You could set the Binary format as the source and sink dataset in the ADF copy activity, and select the compression type ZipDeflate, following this link: https://social.msdn.microsoft.com/Forums/en-US/a46a62f2-e211-4a5f-bf96-0c0705925bcf/working-with-zip-files-in-azure-data-factory
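A source dataset along those lines might look roughly like the following (a sketch only; the linked service, container, and file names are placeholders, and the exact property layout can vary with the ADF version):

```json
{
  "name": "ZipSourceDataset",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "container1",
        "fileName": "abc.zip"
      },
      "compression": {
        "type": "ZipDeflate"
      }
    }
  }
}
```

The sink dataset would be a plain Binary dataset with no compression block, so the copy activity writes the extracted files uncompressed.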
(The original answer included screenshots of the source dataset, the sink dataset, and the resulting files in the sink path.)

Is there any way to convert the encoding of json files in Azure Blob Storage?

I have copied files from a remote server to Azure Blob Storage using the Azure Data Factory copy activity (binary file copy). The files are JSON files and txt files. I would like to change the encoding of the files to UTF-16.
I know it's possible to change the encoding while copying text files from the remote server by simply specifying UTF-16 as the encoding on the sink side of the copy activity. I had implemented a copy activity that treated every file as a txt file, and it was working fine, but sometimes I got errors related to the row delimiter, so I changed the implementation to a binary copy. Now I would like to change the encoding of those files from UTF-8 to UTF-16, and I couldn't find any way to do it.
Any help/suggestions would be appreciated.
If a file is stored in Blob Storage, you cannot directly change its content encoding, even if you set the blob's content-encoding property.
The way to do this (via code or manually) is to download the file, re-encode it as UTF-16, and then upload it again.
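The re-encoding step in the middle is straightforward; here is a minimal sketch of it (the surrounding download and upload would use the Azure Storage SDK or a tool such as azcopy, which this sketch deliberately leaves out):

```python
def reencode_utf8_to_utf16(data: bytes) -> bytes:
    # Decode the downloaded blob bytes as UTF-8 and re-encode them as
    # UTF-16. Python's "utf-16" codec writes a byte-order mark; use
    # "utf-16-le" or "utf-16-be" if you need a specific BOM-less variant.
    return data.decode("utf-8").encode("utf-16")
```

Note that this will raise a UnicodeDecodeError if a file is not actually valid UTF-8, which is a useful safety check before overwriting the blob.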

Error using data factory for copyactivity from blob storage as source

Why do I keep getting this error while using a folder from a blob container (which contains only one GZ-compressed file) as the source in a copy activity in Data Factory v2, with another blob storage as the sink (where I want the file decompressed)?
"message":"ErrorCode=UserErrorFormatIsRequired,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Format setting is required for file based store(s) in this scenario.,Source=Microsoft.DataTransfer.ClientLibrary,'",
I know it means I need to specify explicitly the format for my sink dataset, but I am not sure how to do that.
I suggest using the Copy Data tool.
(The original answer included screenshots of step 1 and step 2.)
Following up on your comment: I tried many times, and unless you choose the compressed file as the source dataset and import its schema, the Azure Data Factory copy activity will not decompress the file for you.
If the files inside the compressed file don't all have the same schema, the copy activity may also fail.
Hope this helps.
The easiest way to do this: go to the dataset, click on the Schema tab, then Import Schema.
Hope this helped!!
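For reference, the UserErrorFormatIsRequired message points at a missing format block on the dataset. In the older Data Factory v2 dataset JSON, a text-format definition that would satisfy it might look roughly like this (a sketch; the dataset name and folder path are placeholders):

```json
{
  "name": "SinkDataset",
  "properties": {
    "type": "AzureBlob",
    "typeProperties": {
      "folderPath": "outputcontainer/decompressed",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    }
  }
}
```

Importing the schema through the UI, as suggested above, fills in the equivalent settings without hand-editing the JSON.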

Azure Data Factory Compression Type

I am working on a project trying to zip files in Azure Blob Storage. I know Azure Data Factory supports a compression type option, but I cannot find any reference describing how the compression process behaves.
If I want to generate a *.zip file:
Origin files:
ParentFolder
  Image1.jpeg
  Txt1.txt
  ChildFolder
    Image2.jpeg
    Txt.txt
Is it going to zip only the ParentFolder? Or is it going to zip every single file recursively?
The compression type does not seem to support .zip; it only supports GZip, Deflate, BZip2, and ZipDeflate, see this link.
I tested copying files like those in your sample from one storage account to another, using GZip.
After being copied (with the Copy file recursively option chosen), the files will look like this:
ParentFolder
  Image1.jpeg.gz
  Txt1.txt.gz
  ChildFolder
    Image2.jpeg.gz
    Txt.txt.gz
If the Copy file recursively option is not chosen, it will look like this:
ParentFolder
  Image1.jpeg.gz
  Txt1.txt.gz
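In other words, ADF compresses each file individually to <name>.gz rather than producing a single archive. That observed behaviour can be mimicked locally with a sketch like this (plain gzip compression is an assumption; ADF's internals are not documented here):

```python
import gzip
import os
import shutil

def gzip_tree(src_root: str, dst_root: str, recursive: bool = True) -> None:
    # Mimic ADF's GZip compression on copy: every file becomes an
    # individual <name>.gz in the sink; no single archive is produced.
    os.makedirs(dst_root, exist_ok=True)
    for entry in os.scandir(src_root):
        if entry.is_file():
            with open(entry.path, "rb") as fin, \
                 gzip.open(os.path.join(dst_root, entry.name + ".gz"), "wb") as fout:
                shutil.copyfileobj(fin, fout)
        elif entry.is_dir() and recursive:
            # With "Copy file recursively" enabled, subfolders are kept
            # and their files are compressed the same way.
            gzip_tree(entry.path, os.path.join(dst_root, entry.name), recursive)
```

So if you need a true single .zip archive of the whole folder, the compression type option alone will not produce it; you would need something like an Azure Function or custom activity to build the archive.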
