Moving files between Azure blob containers without downloading

Currently, I have a blob container with about 5 TB of archive files. I need to move some of those files to another container. Is there a way to avoid downloading and re-uploading the files involved? I do not need to access the data in those files, and I do not want to be billed for reading archive files either.
Thanks.

I suggest you use Azure Data Factory. It is commonly used to transfer big data.
See: Copy performance and scalability achievable using ADF.
You can learn from the tutorial below:
Copy and transform data in Azure Blob storage by using Azure Data Factory
Hope this helps.

You can use AzCopy for that. It is a command-line utility that you can use to initiate server-to-server transfers:
AzCopy uses server-to-server APIs, so data is copied directly between storage servers. These copy operations don't use the network bandwidth of your computer.
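If you would rather script that server-side copy than use the AzCopy CLI, the same Copy Blob capability is exposed in the Python storage SDK; a minimal sketch, assuming placeholder account, container, and credential values:

```python
# Sketch: server-side copy between two containers with azure-storage-blob.
# Account URL, credential, and container names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account-name>.blob.core.windows.net",
    credential="<account-key-or-sas>",
)
src = service.get_container_client("archive-source")
dst = service.get_container_client("archive-destination")

for blob in src.list_blobs():
    # start_copy_from_url asks the storage service to do the copy itself,
    # so the blob contents never pass through your machine.
    dst.get_blob_client(blob.name).start_copy_from_url(f"{src.url}/{blob.name}")
```

For a copy into a different storage account, the source URL generally needs a SAS token so the destination service can read it.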

Related

azureml register datastore file share or blob storage

I have a folder called data with a bunch of CSVs (about 80); the file sizes are fairly small. This data is clean and has already been preprocessed. I want to upload this data folder and register it as a datastore in AzureML. Which would be best for this scenario: a datastore created with a file share or a datastore created with blob storage?
AFAIK, based on your scenario you can use an Azure file share to create the datastore.
Please note that Azure Blob storage is suited to uploading large amounts of unstructured data, whereas an Azure file share is suited to uploading and processing structured files in chunks (more interaction with apps that share files).
I have a folder called data with a bunch of csvs (about 80), file sizes are fairly small. This data is clean and has already been preprocessed.
As you mentioned, the CSV data is clean and preprocessed, so it counts as structured data; you can therefore use an Azure file share to create the datastore.
To register a datastore backed by an Azure file share, you can follow this MsDoc.
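For illustration, registering a file-share-backed datastore and uploading the folder from the Python SDK might look roughly like this (the workspace config, share, account, and key values are placeholders):

```python
# Sketch: register an Azure file share as an AzureML datastore and upload
# the local "data" folder of CSVs. All names and keys are placeholders.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()  # assumes a local config.json for the workspace

datastore = Datastore.register_azure_file_share(
    workspace=ws,
    datastore_name="csv_datastore",
    file_share_name="myfileshare",
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
)

# Upload the preprocessed CSVs into the file share behind the datastore.
datastore.upload(src_dir="./data", target_path="data", overwrite=True)
```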
To learn more about Azure file shares and Azure Blob storage, see the links below:
Azure Blob Storage or Azure File Storage by Mike
azureml.data.azure_storage_datastore.AzureFileDatastore class - Azure Machine Learning Python | Microsoft Docs

Azure WebApp storing Files

I am updating a system that had all of its files stored inside SQL Server.
It's going from an on-prem server to an Azure web app.
My questions are:
I think I should be using a storage blob for these files. Is that correct or is there a better option inside of Azure that I should be using?
Is there a quick way to migrate files from sql to that blob?
For storage purposes, do I write the file to the blob and then store the hyperlink to that file?
The staging environment gets updated with the latest data from production when they do a release, is there a way to migrate storage blob to a different resource group for when they do this?
Yes, I would use blob.
The quickest way would be a quick PowerShell or CLI script, or a console app, to pull the files from the database and upload them to blob storage.
I don't store the entire hyperlink to the file in the database, just the path. That way the storage account and container can be environment configurations.
I would recommend against doing this... we've found that since we started doing automated continuous deployment, we haven't had a reason to move backwards, which has eliminated a lot of effort. That being said, AzCopy is a utility that allows you to do a server-side copy of blobs between storage accounts (along with many other source and destination types if needed). That should do what you need.
To answer your questions:
I think I should be using a storage blob for these files. Is that correct or is there a better option inside of Azure that I should be using?
That's correct. Blob storage is meant for exactly this purpose.
Is there a quick way to migrate files from sql to that blob?
I'm not aware of any automated way to do that. What you would need to do is read the binary data from the SQL database, create a stream out of it, and upload that stream. You can use the Azure Storage SDK for the upload.
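As a rough sketch of that approach in Python (the table, columns, and connection strings are hypothetical):

```python
# Sketch: one-off migration of files stored as varbinary in SQL Server into
# blob storage. Table/column names and connection strings are placeholders.
import pyodbc
from azure.storage.blob import BlobServiceClient

sql = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=myserver;Database=mydb;Trusted_Connection=yes;"
)
storage = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = storage.get_container_client("documents")

cursor = sql.cursor()
cursor.execute("SELECT FileId, FileName, FileData FROM dbo.Files")
for file_id, file_name, file_data in cursor:
    # Upload the raw bytes; store the resulting blob path back in the database
    # in place of the binary content.
    container.upload_blob(name=f"{file_id}/{file_name}", data=file_data, overwrite=True)
```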
For storage purposes, do I write the file to the blob and then store the hyperlink to that file?
Under normal circumstances that would be the recommended approach; however, considering you need a staging environment that is a copy of the production environment (including the database, I am assuming), I would recommend you store two things in your database: the blob container name and the blob name (or you could store a relative URL, e.g. <container-name>/<blob-name>). Assuming you keep the storage account name somewhere in a configuration file, you can build the URL dynamically using the https://<account-name>.blob.core.windows.net/<container-name>/<blob-name> pattern.
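For illustration, building that URL at runtime is a one-liner (all values below are placeholders):

```python
# Sketch: build the blob URL from per-environment configuration plus the
# container/blob names stored in the database. Values are placeholders.
def blob_url(account_name: str, container_name: str, blob_name: str) -> str:
    return f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"

# Account name comes from app settings; container/blob come from the database row.
print(blob_url("mystagingaccount", "documents", "42/report.pdf"))
```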
The staging environment gets updated with the latest data from production when they do a release, is there a way to migrate storage blob to a different resource group for when they do this?
Azure Storage provides Copy Blob functionality, with which you can copy blobs from one container to another in the same or a different storage account. You can use that to copy data from the production environment to the staging environment.
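A rough sketch of that copy between a production and a staging account using the Python SDK (the account names, container, and source SAS are placeholders):

```python
# Sketch: server-side copy of every blob in a container from a production
# account to a staging account. Accounts, container, and SAS are placeholders.
from azure.storage.blob import BlobServiceClient

prod = BlobServiceClient("https://prodaccount.blob.core.windows.net", credential="<prod-key>")
stage = BlobServiceClient("https://stagingaccount.blob.core.windows.net", credential="<staging-key>")

src = prod.get_container_client("documents")
dst = stage.get_container_client("documents")

sas = "<sas-token-granting-read-on-the-source-container>"
for blob in src.list_blobs():
    # Cross-account copies need a source URL the service can read, hence the SAS.
    dst.get_blob_client(blob.name).start_copy_from_url(f"{src.url}/{blob.name}?{sas}")
```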

Which Azure Storage method is best for a temporary file transfer?

I want to automate the transfer of files from a website not hosted in Azure to my client’s premises.
I am considering having an API on the website send the files to Azure Blob Storage, and then having another API running at the client site download them.
Both would make use of the Azure storage API, which I like because it is easy to implement.
The files do not need to stay in Azure and can be deleted from storage once they are downloaded.
However I am wondering if there is a faster way.
Should I be using Hot Blob Storage or File Storage perhaps?
I looked at https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers but am still unclear as to the fastest method for my use case.
I suggest you use a file share, which can be mounted locally as a mapped drive, making operations like read and delete easy and fast.
If you go with code only, the published comparison of blob and file storage shows both can reach up to 60 MiB/s, so I cannot say which is faster. There is also the Azure Storage Data Movement Library, which is designed for high-performance uploading, downloading, and copying of Azure Storage blobs and files; you can use it for your purpose.
I would recommend blob storage for this application. Logic apps can also be used to automate this pipeline based on timer triggers or some other trigger.
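If you go the blob route, the upload/download/delete round trip described in the question is only a few SDK calls; a minimal sketch (the connection string, container, and file names are placeholders):

```python
# Sketch of the temporary-transfer pattern: the website uploads, the client
# downloads, then the blob is deleted. All names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("transfers")

# Website side: upload the file.
with open("invoice.pdf", "rb") as f:
    container.upload_blob(name="invoice.pdf", data=f, overwrite=True)

# Client side: download it, then delete it so nothing lingers in storage.
blob = container.get_blob_client("invoice.pdf")
with open("invoice-downloaded.pdf", "wb") as f:
    f.write(blob.download_blob().readall())
blob.delete_blob()
```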

Can Azure Data Factory write to FTP

I want to write the output of a pipeline to an FTP folder. ADF seems to support on-premises file systems but not FTP folders.
How can I write the output in text format to an FTP folder?
Unfortunately FTP Servers are not a supported data store for ADF as of right now. Therefore there is no OOTB way to interact with an FTP Server for either reading or writing.
However, you can use a custom activity to make it possible, but it will require some custom development to make this happen. A fellow Cloud Solution Architect within MS put together a blog post that talks about how he did it for one of his customers. Please take a look at the following:
https://blogs.msdn.microsoft.com/cloud_solution_architect/2016/07/02/creating-ftp-data-movement-activity-for-azure-data-factory-pipeline/
I hope that this helps.
Upon thinking about it, you might be able to achieve what you want in a mildly convoluted way by writing the output to an Azure Blob storage account and then either
1) manually: downloading the file from the Blob storage account and pushing it to the "FTP" site, or
2) automatically: using the Azure CLI to pull the file locally and then pushing it to the "FTP" site with a batch or shell script as appropriate (see the sketch after this list).
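A rough Python take on option 2, using the blob SDK plus the standard library's ftplib in place of a CLI/batch script (the connection string, host, credentials, and paths are placeholders):

```python
# Sketch of option 2: pull the pipeline output from blob storage locally,
# then push it to the FTP site. All names and credentials are placeholders.
from ftplib import FTP
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="adf-output", blob="output.txt")

# Download the file produced by the pipeline.
with open("output.txt", "wb") as f:
    f.write(blob.download_blob().readall())

# Push it to the FTP folder.
with FTP("ftp.example.com") as ftp, open("output.txt", "rb") as f:
    ftp.login("<user>", "<password>")
    ftp.storbinary("STOR output.txt", f)
```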
As a lighter-weight approach than custom activities (which are certainly the better option for heavy work), you may wish to consider using Azure Functions to write to FTP (note there is a timeout when using a consumption plan, but not in other plans, so it will depend on how big the files are).
https://learn.microsoft.com/en-us/azure/azure-functions/functions-create-storage-blob-triggered-function
You could instruct Data Factory to write to an intermediary blob storage account, and use a blob storage trigger in Azure Functions to upload the files to FTP as soon as they appear in blob storage.
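A rough sketch of such a blob-triggered function in the Python programming model (the container path, FTP host, and credentials are placeholders):

```python
# Sketch: an Azure Function (Python v2 model) triggered by new blobs in an
# intermediary container, pushing each one to FTP. Path, host, and credentials
# are placeholders.
import io
from ftplib import FTP

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="adf-output/{name}",
                  connection="AzureWebJobsStorage")
def push_blob_to_ftp(blob: func.InputStream):
    data = blob.read()
    filename = blob.name.split("/")[-1]  # strip the container prefix
    with FTP("ftp.example.com") as ftp:
        ftp.login("<user>", "<password>")
        ftp.storbinary(f"STOR {filename}", io.BytesIO(data))
```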
Or alternatively, write to blob storage and then use a timer in Logic Apps to upload from blob storage to FTP. Logic Apps hide a tremendous amount of power behind their friendly exterior.
You can write a Logic app that will pick your file up from Azure storage and send it to an FTP site. Then call the Logic App using a Data Factory Web Activity.
Make sure you do some error handling in your Logic App so it returns a 400 if the FTP transfer fails.

Uploading 10,000,000 files to Azure blob storage from Linux

I have some experience with S3, and in the past have used s3-parallel-put to put many (millions of) small files there. Compared to Azure, S3 has an expensive PUT price, so I'm thinking of switching to Azure.
I don't however seem to be able to figure out how to sync a local directory to a remote container using azure cli. In particular, I have the following questions:
1- aws client provides a sync option. Is there such an option for azure?
2- Can I concurrently upload multiple files to Azure storage using cli? I noticed that there is a -concurrenttaskcount flag for azure storage blob upload, so I assume it must be possible in principle.
If you prefer the command line and have a recent Python interpreter, the Azure Batch and HPC team has released a code sample with some AzCopy-like functionality, written in Python, called blobxfer. It allows full recursive directory ingress into Azure Storage as well as full container copy back out to local storage. [full disclosure: I'm a contributor to this code]
To answer your questions:
blobxfer supports rsync-like operations using MD5 checksum comparisons for both ingress and egress
blobxfer performs concurrent operations, both within a single file and across multiple files. However, you may want to split up your input across multiple directories and containers which will not only help reduce memory usage in the script but also will partition your load better
https://github.com/Azure/azure-sdk-tools-xplat is the source of azure-cli, so more details can be found there. You may want to open issues to the repo :)
azure-cli doesn't support "sync" so far.
-concurrenttaskcount is to support parallel upload within a single file, which will increase the upload speed a lot, but it doesn't support multiple files yet.
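If you end up scripting the multi-file concurrency yourself rather than using blobxfer or the CLI, a rough sketch with the (newer) Python storage SDK and a thread pool (the connection string, container, and local directory are placeholders):

```python
# Sketch: upload many small files concurrently with a thread pool.
# Connection string, container, and local directory are placeholders.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("many-small-files")
root = Path("./data")

def upload(path: Path):
    # The blob name mirrors the file's path relative to the local root.
    with open(path, "rb") as f:
        container.upload_blob(name=path.relative_to(root).as_posix(), data=f, overwrite=True)

files = [p for p in root.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(upload, files))
```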
To bulk-upload files into blob storage, there is a tool provided by Microsoft: check out Azure Storage Explorer, which lets you do the required task.
