How to uncompress .rar files using Azure Data Factory

We have a new client. While setting up the project, we gave them a blob storage account where they can drop files so that we can later automate and process the information.
The idea is to use Azure Data Factory, but we can find no way of dealing with .rar files, and even .zip files coming from Windows are giving us trouble. Since it is the client who provides the .rar format, we wanted to make absolutely sure there is no way to process it before asking them to change it, or before deploying Databricks or a similar service just for the purpose of transforming the file.
Is there any way to get a .rar file from a blob storage, uncompress it, then process it?
I have been looking at posts like this one and the related official documentation, and the closest we have come is using ZipDeflate, but it does not seem to meet our requirement.
Thanks in advance!

The only compression types Data Factory supports are GZip, Deflate, BZip2, and ZipDeflate.
For unsupported file types and compression formats, Data Factory provides some workarounds:
You can use the extensibility features of Azure Data Factory to transform files that aren't supported. Two options include Azure Functions and custom tasks by using Azure Batch.
You can see a sample that uses an Azure function to extract the contents of a tar file. For more information, see Azure Functions activity.
You can also build this functionality using a custom dotnet activity. Further information is available here.
Following the same approach, you would need to figure out how to use an Azure Function to extract the contents of a .rar file; for example:
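For illustration, here is a minimal sketch of what such an extraction step could look like in Python (for example, inside an Azure Function or a Batch custom activity). It assumes the rarfile package with an unrar backend installed on the compute, plus azure-storage-blob; the container names and the connection-string setting are placeholders, not anything prescribed by Data Factory.

import os
import tempfile

import rarfile
from azure.storage.blob import ContainerClient

# Placeholder connection string and container names
conn_str = os.environ["STORAGE_CONNECTION_STRING"]
source = ContainerClient.from_connection_string(conn_str, "incoming")
target = ContainerClient.from_connection_string(conn_str, "extracted")

def extract_rar_blob(blob_name: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        local_rar = os.path.join(tmp, os.path.basename(blob_name))
        # Download the .rar archive from blob storage
        with open(local_rar, "wb") as f:
            f.write(source.download_blob(blob_name).readall())
        # Extract it locally (rarfile needs the unrar tool to be available)
        with rarfile.RarFile(local_rar) as rf:
            rf.extractall(tmp)
            members = rf.namelist()
        # Upload the extracted files so the rest of the pipeline can process them
        for member in members:
            path = os.path.join(tmp, member)
            if os.path.isfile(path):
                with open(path, "rb") as data:
                    target.upload_blob(member, data, overwrite=True)

Data Factory could then call something like this through an Azure Functions activity after the client drops a .rar file in the source container.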

You can use Logic Apps.
You can use a Webhook activity calling a runbook.
Both are easier than using a custom activity.

Related

Uploading a large .bak file to Azure Blob Storage through PowerShell

So I am trying to create a PowerShell script which will upload a large (> 4 GB) .bak file to Azure Blob Storage, but currently it is hanging. The script works with the small files I have been using to test.
Originally the issue I had was the requirement to have a Content-Length specified (I imagine due to the file's size), so I now calculate the size of the .bak file (as it varies slightly each week) and pass this through as a request header.
I am a total PowerShell newbie, as well as being very new to Azure Blob Storage. (NOTE: I am trying to do this purely in PowerShell, without relying on other tools such as AzCopy.)
Below is my script:
(PowerShell script omitted)
Any help would be greatly appreciated.
There are a few things to check. Since the file is big, are you sure it isn't uploading? Have you checked network activity in the Performance tab of Task Manager? AzCopy seems like a good option too that you can use from within PowerShell, but if it's not an option in your case, then why not use the native Az module for PowerShell?
I suggest trying the Set-AzStorageBlobContent cmdlet to see if it helps. You can find examples in the Microsoft docs.

Azure blob soft delete and versioning - how to restore files easily?

I am trying to understand how soft delete and versioning work within Azure blob storage.
It seems that if you have both soft delete and versioning turned on, you can’t just ‘undelete’ deleted files, as versioning actually saves a new version as a deleted file.
So instead you have to promote the last version of each deleted file.
But what if you have a structure of nested folders and thousands of blobs? You can’t just promote the top version of the top-level folder; you need to use PowerShell to list the files with no current version and promote them. How would you do this?
This seems awfully complicated, when without versioning a simple ‘undelete’ command is available from the GUI.
Am I missing something? What is the easiest way to ‘undelete’ a nested folder structure of thousands of blobs in folders, when versioning is turned on?
Thanks
As Rob Minson pointed out, the approach involves copying a blob version to the same container. For PowerShell, use the Copy-AzStorageBlob cmdlet; for Azure CLI, use the az storage blob copy start command. You can pass an account key or SAS token, or use Azure AD.
We've updated the documentation to shed some light on an approach to restoring blobs when soft-delete and/or versioning is enabled. Code samples are available for both PowerShell and Azure CLI.
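For reference, here is a minimal sketch of the same restore-by-copy idea using the Python SDK (azure-storage-blob) rather than the PowerShell or CLI commands above; the account, container, blob name, SAS token and version id below are placeholders.

from urllib.parse import quote

from azure.storage.blob import BlobClient

account_url = "https://accountname.blob.core.windows.net"
sas_token = "<SAS token with read and write permissions>"
container_name = "containername"
blob_name = "folder/blobname"
version_id = "2021-04-22T11:35:36.9385599Z"  # placeholder: the version to promote

# Copy the chosen version over the (soft-deleted) current blob to restore it
source_url = (f"{account_url}/{container_name}/{blob_name}"
              f"?versionid={quote(version_id)}&{sas_token}")
restored = BlobClient(account_url, container_name, blob_name, credential=sas_token)
restored.start_copy_from_url(source_url)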
Simply put, no.
The first point that needs to be emphasized is that blobs in blob storage are not nested the way you might think. Blob storage looks like the local file system: some nested folders, with many files inside. But in fact this is an illusion; the storage structure of blob storage is flat. Blob storage is not a matter of putting a small box inside a box and then putting items in the small box. In fact, all blobs are items of blob storage itself, and there is no such thing as a "small box".
The second point is that for blob storage, the soft-delete operation only supports two kinds of objects: blobs and containers.
Check out this document:
https://learn.microsoft.com/en-us/azure/storage/blobs/soft-delete-container-overview?tabs=azure-cli#how-container-soft-delete-works
However, you can only use container soft delete to restore blobs if the container itself was deleted. To restore a deleted blob when its parent container has not been deleted, you must use blob soft delete or blob versioning.
So unfortunately, there is no easy way. You need to operate on each blob; the nested structure does not actually exist.
If you are interested, you can read this blog:
https://medium.com/@loopjockey/structuring-azure-blobs-for-functions-8305ba427356
I completely agree that this seems really undocumented at the moment. I've raised a GitHub issue against this docs page to see if they can get the situation improved.
The best path through this that I've found is something like the following:
Using Azure Storage Explorer, open up the container with the soft-deleted, versioned blobs, then change the drop down to "All blobs and blobs without current version". Now you can select a blob and hit 'Promote Version'. The deleted blob will be restored and in the Activities pane you can expand the operation and hit 'Copy AzCopy Command to Clipboard'.
The result will show you something like the following:
./azcopy.exe copy
"https://accountname.blob.core.windows.net/containername/blobname?<sastoken>&versionid=2021-04-22T11%3A35%3A36.9385599Z"
"https://accountname.blob.core.windows.net/containername/blobname?<sastoken>"
--overwrite=true
--recursive
--trusted-microsoft-suffixes=;
Now, based on this you can see you have a building block for automating the process you're talking about. Your problem at this point is finding this thing:
versionid=2021-04-22T11%3A35%3A36.9385599Z
Unfortunately that's a timestamp to nanosecond precision which you're not going to be able to infer. There's no functionality I can find in PowerShell, in the REST APIs or in AzCopy to get this data; the only way I have found is this sample for the .NET SDK.
All this probably means you can either:
Implement your own C# console app using the Azure.Storage.Blobs library to list the versions for each blob, then perform the relevant copy command once you know the magic version string (a sketch of this idea follows below), or
Wait for the REST API or PowerShell library to get the ability to list versions.
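Option 1 can at least be sketched out; the example below swaps the C# console app for the Python SDK (azure-storage-blob), on the assumption that it exposes the same version listing as the .NET sample. The connection string and container name are placeholders. It prints, for each blob that has versions but no current version, the newest version id, which is the value you can plug into the azcopy command shown above.

from collections import defaultdict

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage connection string>", "containername")

current = set()               # blobs that still have a current version
history = defaultdict(list)   # blob name -> previous version ids

for props in container.list_blobs(include=["versions"]):
    if props.is_current_version:
        current.add(props.name)
    elif props.version_id:
        history[props.name].append(props.version_id)

# Blobs with versions but no current version are the soft-deleted ones;
# version ids are timestamps, so max() picks the latest version to promote.
for name, version_ids in history.items():
    if name not in current:
        print(name, max(version_ids))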

Use Azure Data Factory to copy files and produce a CSV of the files copied

I am trying to implement the following flow in an Azure Data Factory pipeline:
Copy files from an SFTP to a local folder.
Create a comma-separated file in the local folder with the list of files and their sizes.
The first step was easy enough, using a 'Copy Data' step with 'SFTP' as source and 'File System' as sink.
The files are being copied, but in the output of this step, I don't see any file information.
I also don't see an option to create a file using data from a previous step.
Maybe I'm using the wrong technology?
One of the reasons I'm using Azure Data Factory is the integration runtime, which allows us to have a single fixed IP to connect to the external SFTP (easier firewall configuration).
Is there a way to implement step 2?
Thanks for any insight!
There is no built-in feature to achieve this.
You need to use ADF together with another service; I suggest you first use an Azure Function to check the files and then do the copy.
The pipeline would combine an Azure Function activity with the Copy activity.
You can get the size of the files and save them to the CSV file.
Get size of files (Python):
How to fetch sizes of all SFTP files in a directory through Paramiko
And use pandas to save the results as CSV (Python):
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Writing a pandas DataFrame to CSV file
Simple HTTP trigger of an Azure Function (Python):
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-http-webhook-trigger?tabs=python
(Put the processing logic in the body of the Azure Function. Basically, you can do anything you want there, except for anything requiring a graphical interface and a few unsupported things. You can choose a language you are familiar with; but in short, there is no feature in ADF that satisfies your requirement out of the box.)
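A rough sketch of the suggested function body, assuming the paramiko and pandas packages; the SFTP host, credentials, directory and output path below are placeholders only.

import paramiko
import pandas as pd

def list_sftp_file_sizes(host, port, username, password, remote_dir):
    # Open an SFTP session with Paramiko
    transport = paramiko.Transport((host, port))
    transport.connect(username=username, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        # listdir_attr returns one SFTPAttributes entry per file, including st_size
        rows = [{"file_name": attr.filename, "size_bytes": attr.st_size}
                for attr in sftp.listdir_attr(remote_dir)]
    finally:
        sftp.close()
        transport.close()
    return pd.DataFrame(rows)

# Write the listing as the CSV the pipeline expects
df = list_sftp_file_sizes("sftp.example.com", 22, "user", "password", "/outbound")
df.to_csv("copied_files.csv", index=False)

The CSV could just as well be uploaded to the file-system sink or to blob storage instead of being written locally.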

Build a pipeline in Azure Data Factory to load Excel files, format content, transform to CSV and send to Azure SQL DB

I'm new to the Azure environment and have been watching tutorials and reading documents, but I'm trying to figure out how to set up a flow that enables the process I will describe below. The starting point is a set of reports in .xlsx format produced monthly by the Mktg Dept: the requirement is to bring them into Azure SQL DB so that the data can be stored and analysed. So far I have managed to put those files (previously converted manually to .csv format) in a blob storage and build an ADF pipeline that copies each file into a table in the SQL DB.
The problem is that, as far as I understand, ADF cannot directly handle .xlsx files, and I'm wondering how to set up an automated procedure that converts them from .xlsx to .csv and saves them to blob storage. I was thinking about adding a Python script/Databricks notebook to the pipeline to convert the format, but I'm not sure this is the best solution. Any hint/reference to existing tutorials or resources would be much appreciated.
I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.
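If the Python route is taken (for example the Custom activity or an Azure Function mentioned above), a minimal sketch of the conversion step could look like the following. It assumes pandas with the openpyxl engine and azure-storage-blob; the connection-string setting, container name and file name are placeholders.

import io
import os

import pandas as pd
from azure.storage.blob import ContainerClient

conn_str = os.environ["STORAGE_CONNECTION_STRING"]  # placeholder setting name
container = ContainerClient.from_connection_string(conn_str, "mktg-reports")

def convert_xlsx_blob_to_csv(blob_name: str) -> None:
    # Download the Excel report and read the first sheet
    data = container.download_blob(blob_name).readall()
    df = pd.read_excel(io.BytesIO(data), engine="openpyxl")
    # Upload the same data as CSV so the existing copy pipeline can load it
    csv_name = blob_name.rsplit(".", 1)[0] + ".csv"
    container.upload_blob(csv_name, df.to_csv(index=False), overwrite=True)

convert_xlsx_blob_to_csv("monthly_report_2021_04.xlsx")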

Output file in Azure Automation script

I'm adapting a PowerShell script I have at work for use in Azure Automation, which outputs 3 different CSV files. I'm trying to avoid having to create a DB and send the information there, since that would require changing the script too much, and it's quite complex.
Does anyone know if there's a way to just send the 3 files to some kind of folder in Azure? Or maybe another solution that wouldn't require messing too much with the script?
Sorry if it is a dumb question, I'm not very familiar with Azure yet.
Probably the easiest option is to continue writing the files as you are now, then after each file is written have your PowerShell code upload it to blob storage using Set-AzureStorageBlobContent. See https://savilltech.com/2018/03/25/writing-to-files-with-azure-automation/ for an example.
You can read more about using PowerShell to upload to blob storage, including all the steps you need to create the storage account and container, at https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-powershell.
