Uploading 10,000,000 files to Azure blob storage from Linux - azure

I have some experience with S3, and in the past have used s3-parallel-put to put many (millions of) small files there. Compared to Azure, S3 has an expensive PUT price, so I'm thinking of switching to Azure.
However, I can't seem to figure out how to sync a local directory to a remote container using the Azure CLI. In particular, I have the following questions:
1. The AWS client provides a sync option. Is there such an option for Azure?
2. Can I concurrently upload multiple files to Azure storage using the CLI? I noticed that there is a -concurrenttaskcount flag for azure storage blob upload, so I assume it must be possible in principle.

If you prefer the command line and have a recent Python interpreter, the Azure Batch and HPC team has released a code sample with some AzCopy-like functionality in Python called blobxfer. It allows full recursive directory ingress into Azure Storage as well as full container copy back out to local storage. [Full disclosure: I'm a contributor to this code.]
To answer your questions:
blobxfer supports rsync-like operations using MD5 checksum comparisons for both ingress and egress
blobxfer performs concurrent operations, both within a single file and across multiple files. However, you may want to split up your input across multiple directories and containers, which will not only help reduce memory usage in the script but also partition your load better

https://github.com/Azure/azure-sdk-tools-xplat is the source of azure-cli, so more details can be found there. You may want to open issues on the repo :)
azure-cli doesn't support "sync" so far.
-concurrenttaskcount enables parallel upload within a single file, which increases the upload speed a lot, but it doesn't cover multiple files yet.
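If you need parallelism across many files today, one workaround is to call the storage SDK yourself rather than the CLI. Below is a minimal sketch using the azure-storage-blob Python package; the connection string environment variable, container name, and local directory are placeholders, not anything the CLI itself provides.

```python
# Minimal sketch: concurrent upload of many small files with the
# azure-storage-blob Python SDK (v12). The connection string, container
# name and local directory are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container = ContainerClient.from_connection_string(CONN_STR, "mycontainer")

def upload_one(root, path):
    # Blob name mirrors the path relative to the local root directory.
    name = os.path.relpath(path, root).replace(os.sep, "/")
    with open(path, "rb") as data:
        container.upload_blob(name=name, data=data, overwrite=True)
    return name

def upload_dir(root, workers=32):
    files = [os.path.join(dirpath, f)
             for dirpath, _, filenames in os.walk(root)
             for f in filenames]
    # For millions of small files the win comes from parallelism across
    # files, not from intra-file parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name in pool.map(lambda p: upload_one(root, p), files):
            print("uploaded", name)

if __name__ == "__main__":
    upload_dir("./local_data")
```

As with blobxfer above, splitting the input across several containers or prefixes helps partition the load once you get into the millions of files.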

To upload files in bulk into blob storage, there is a tool provided by Microsoft: check out Azure Storage Explorer, which lets you do the required task.

Related

Azure File-Sharing Solutions Compatible With Databricks

I am in search of a file-sharing solution within the Azure ecosystem of tools/services.
The current need is to be able to write thousands of files (3-4 thousand per week) from a script that runs in Databricks, to a storage solution that allows access from a few other non-technical users. The script that generates the reports is a Python script, not PySpark, although it does run in Databricks (a number of PySpark jobs precede it). The storage solution must allow for:
1) writing/saving excel and html files from Python
2) users to view and download multiple files at a time (I believe this knocks out Blob storage?)
Thanks!
Thank you for sharing your question. Azure does offer a data-share service you can use: Azure Data Share lets you separate the store your Python script writes to from the store your non-technical users read from.
For point number 1, I do not see any issues. The storage solutions on Azure are mostly file-type agnostic. It is technically possible to write to any of the storage solutions; the main difference is how easy or involved the process is.
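As an illustration, a plain Python (non-PySpark) script can write the Excel and HTML reports to local disk and push them to a Blob container with the azure-storage-blob package; the connection string, container name, and file names below are placeholders:

```python
# Sketch only (placeholder names): write an Excel and an HTML report from
# plain Python, then upload both to a Blob container.
import os
import pandas as pd
from azure.storage.blob import ContainerClient

df = pd.DataFrame({"metric": ["a", "b"], "value": [1, 2]})
df.to_excel("/tmp/report.xlsx", index=False)   # needs openpyxl installed
df.to_html("/tmp/report.html", index=False)

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "weekly-reports")

for path in ("/tmp/report.xlsx", "/tmp/report.html"):
    with open(path, "rb") as data:
        container.upload_blob(name=os.path.basename(path), data=data,
                              overwrite=True)
```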
In point number 2, I think what you are hinting at is the ease with which your non-technical people can access the storage. It is possible to download multiple files at a time from Blob storage, but the Portal may not be the most user-friendly way to do this. I recommend you look into Azure Storage Explorer. Azure Storage Explorer provides one client application with which your users can manage or download the files from all the Azure Storage solutions.
Given how you specified html files, and viewing multiple files at a time, I suspect you want to render the files like a browser. Many resources have a URI. If a self-contained html file is made publicly accessible in Blob storage or ADLS gen2, and you navigate to it in a browser, the html page will render.

Moving locally stored documents to Azure

I want to spike whether azure and the cloud is a good fit for us.
We have a website where users upload documents to our currently hosted website.
Every document has an equivalent record in a database.
I am using terraform to create the azure infrastructure.
What is my best way of migrating the documents from the local file path on the server to azure?
Should I be using File Storage or Blob Storage? I am confused about the difference.
Is there anything in terraform that can help with this?
Based on your comments, I would recommend storing them in Blob Storage. This service is suited for storing and serving unstructured data like files and images. There are many other features like redundancy, archiving etc. that you may find useful in your scenario.
File Storage is more suitable in Lift-and-Shift kind of scenarios where you're moving an on-prem application to the cloud and the application writes data to either local or network attached disk.
You may also find this article useful: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
UPDATE
Regarding uploading files from local computer to Azure Storage, there are actually many options available:
Use a storage explorer tool such as Microsoft's Azure Storage Explorer.
Use AzCopy command-line tool.
Use Azure PowerShell Cmdlets.
Use Azure CLI.
Write your own code using any of the available storage client libraries or by directly consuming the REST API (see the sketch below).
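For that last option, here is a minimal sketch with the azure-storage-blob Python client library; the connection string, container name, and paths are placeholders for your actual setup:

```python
# Sketch of the "write your own code" option using the azure-storage-blob
# Python client library. Connection string, container and paths are
# placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage connection string>",
    container_name="documents",
    blob_name="contracts/2019/contract-0001.pdf")

with open("/var/www/uploads/contract-0001.pdf", "rb") as data:
    # max_concurrency uploads a large file's blocks in parallel.
    blob.upload_blob(data, overwrite=True, max_concurrency=4)
```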

Moving files among Azure blob containers without downloading

Currently, I have a blob container with about 5 TB of archive files. I need to move some of those files to another container. Is there a way to avoid downloading and re-uploading the files? I do not need to access the data in those files, and I do not want to be billed for reading the archive files either.
Thanks.
I suggest you use Data Factory; it is usually used to transfer big data, and the copy performance and scalability achievable using ADF are a good fit here. You can learn more from this tutorial:
Copy and transform data in Azure Blob storage by using Azure Data Factory
Hope this helps.
You can use AzCopy for that. It is a command-line util that you can use to initiate server-to-server transfers:
AzCopy uses server-to-server APIs, so data is copied directly between storage servers. These copy operations don't use the network bandwidth of your computer.
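If you'd rather script it than use ADF or the AzCopy binary, the same server-side copy can be started from Python with the azure-storage-blob SDK; the container names and prefix below are placeholders:

```python
# Sketch (placeholder names): server-side copy of blobs from one container
# to another in the same storage account, with no download involved.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
src = service.get_container_client("archive")
dst = service.get_container_client("archive-moved")

for blob in src.list_blobs(name_starts_with="2018/"):
    source_url = src.get_blob_client(blob.name).url
    # For a source blob in a different storage account, append a SAS
    # token to source_url.
    dst.get_blob_client(blob.name).start_copy_from_url(source_url)
```

start_copy_from_url only initiates the copy; the storage service moves the bytes itself, so nothing passes through your machine.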

How to transfer dependency jars / files to Azure storage from Linux?

I am trying to use Spark on Azure. To run my job, I need to copy my dependency jar and files to storage. I have created a storage account and container. Could you please guide me on how to access Azure storage from my Linux machine so as to copy data to/from it?
Since you didn't state your restrictions (e.g., command line, programmatically, gui), here are a few options:
If you have access to a recent Python interpreter on your Linux machine, you can use blobxfer (https://pypi.python.org/pypi/blobxfer), which can transfer entire directories of files into and out of Azure blob storage.
Use the Azure cross-platform CLI (https://azure.microsoft.com/en-us/documentation/articles/xplat-cli/), which has functionality to transfer files one at a time into or out of Azure storage.
Directly invoke Azure storage calls via the Azure storage SDKs programmatically. SDKs are available in a variety of languages, along with the REST API (see the sketch below).
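As a rough sketch of the third option (placeholder account, container, and jar names), this uploads a dependency jar with the azure-storage-blob Python package and prints the wasbs:// URI that Spark on Azure (e.g. HDInsight) can then reference:

```python
# Sketch only (placeholder names): upload a dependency jar from a Linux
# machine into a blob container using the azure-storage-blob Python SDK.
from azure.storage.blob import BlobClient

ACCOUNT = "mystorageaccount"
CONTAINER = "jars"
BLOB_NAME = "deps/my-dependency-1.0.jar"

blob = BlobClient.from_connection_string(
    conn_str="<storage connection string>",
    container_name=CONTAINER,
    blob_name=BLOB_NAME)

with open("/home/me/libs/my-dependency-1.0.jar", "rb") as data:
    blob.upload_blob(data, overwrite=True)

# Spark jobs can then reference the jar via a wasbs:// URI, e.g.:
print(f"wasbs://{CONTAINER}@{ACCOUNT}.blob.core.windows.net/{BLOB_NAME}")
```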

Is it possible to mount blob storage to my local machine for deployment?

I have a build script, and it would be very useful to configure it to dump some files into Azure blob storage so they can be picked up by my Azure web role.
My preferred plan was to find some way of mounting the blob storage on my build server as a mapped drive and simply using Robocopy to copy the files over. This would involve the least amount of friction, as I am already deploying some files like this to other web servers using WebDrive.
I found a piece of software that will allow me to do that: http://www.gladinet.com/
However on further investigation I found that it needs port 80 to run without some hairy looking hacking about on the server.
So is there another piece of software I could use or perhaps another way I haven't considered, such as deploying the files to a local folder that is automagically synced with blob storage?
Update in response to @David Makogon
I am using http://waacceleratorumbraco.codeplex.com/ this performs 2 way synchronisation between the blob storage and the web roles. I have tested this with http://cloudberrylab.com/ and I can deploy files manually to the blob and they are deployed correctly to the web roles. Also I have done the reverse and updated files in the web roles which have then been synced back to the blob and I have subsequently edited/downloaded them from blob storage.
What I'm really looking for is a way to automate the cloudberry side of things. So I don't have a manual step to copy a few files over. I will investigate the Powershell solutions in the meantime.
I know this is an old post - but in case someone else comes here... the answer is now "yes". I've been working on a CodePlex project to do exactly that. (All source code is available).
http://azuredrive.codeplex.com/
If you're comfortable using PowerShell in your build process, then you could use the Cerebrata Cmdlets to upload the files. If that doesn't work for you, you could write a custom activity (but this sounds quite a bit more involved).
Mounting a cloud drive from a non-Windows Azure compute instance (e.g. your local build machine) is not supported.
Having said that: Even if you could mount a Cloud Drive from your build machine, your compute instances would need access to it too, and there can only be one writer. If your compute instances only needed read-only access, they'd need to create a snapshot after you upload new files.
This really doesn't sound like a good idea though. As knightpfhor suggested, the Cerebrata cmdlets provide this capability (look at Import-File). This allows you to push individual files into their own blobs. You can optimize further by pushing a single ZIP file into a blob. You can then use a technique similar to the one described by Nate Totten in his multi-tenant web role sample, to detect new zip files and expand them to your local storage. Nate's blog post is here.
Oh, and if you don't want to use the Cerebrata cmdlets, you can upload blobs directly with the Windows Azure Storage REST API (though the cmdlets are very simple to use and integrate seamlessly with PowerShell).
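For reference, the ZIP-per-deployment idea from Nate's sample can be sketched in Python as well; the original sample is .NET, and the container name and paths here are made up:

```python
# Rough sketch (placeholder names): poll a container for new .zip blobs,
# download each one, and expand it into local storage. Illustrative only;
# the original sample is .NET and reacts to changes rather than polling.
import io
import os
import time
import zipfile
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "deployments")
seen = set()

while True:
    for blob in container.list_blobs():
        if blob.name.endswith(".zip") and blob.name not in seen:
            data = container.download_blob(blob.name).readall()
            zipfile.ZipFile(io.BytesIO(data)).extractall("/var/local/site")
            seen.add(blob.name)
    time.sleep(30)
```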
