Azure File-Sharing Solutions Compatible With Databricks

I am in search of a file-sharing solution within the Azure ecosystem of tools/services.
The current need is to write thousands of files (3-4 thousand per week) from a script that runs in Databricks to a storage solution that allows access by a few other non-technical users. The script that generates the reports is a Python script, not PySpark, although it does run in Databricks (a number of PySpark jobs precede it). The storage solution must allow for:
1) writing/saving excel and html files from Python
2) users to view and download multiple files at a time (I believe this knocks out Blob storage?)
Thanks!

Thank you for sharing your question.
Azure does offer a data-sharing service you can use. Azure Data Share lets you separate the store your Python script writes to from the store your non-technical users read from.
For point number 1, I do not see any issues. The storage solutions on Azure are mostly file-type agnostic: it is technically possible to write to any of them, and the main difference is how easy or involved the process is.
In point number 2, I think what you are hinting at is the ease with which your non-technical people can access the storage. It is possible to download multiple files at a time from Blob storage, but the Portal may not be the most user-friendly way to do this. I recommend you look into Azure Storage Explorer, which provides one client application your users can use to manage or download files from all the Azure Storage offerings.
Given that you specified HTML files and viewing multiple files at a time, I suspect you want the files rendered like a browser would. Blobs and ADLS Gen2 files are addressable by URI, so if a self-contained HTML file is made publicly accessible in Blob storage or ADLS Gen2 and you navigate to it in a browser, the HTML page will render.
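To make point 1 concrete, here is a minimal sketch of what the report-writing step could look like, assuming the pandas and azure-storage-blob packages; the connection string, container name, and paths are placeholders, not anything from your environment:

```python
# Minimal sketch: write an Excel and an HTML report from plain Python,
# then upload both to Blob storage / ADLS Gen2. Connection string,
# container, and paths are placeholders. (to_excel requires openpyxl.)
import pandas as pd
from azure.storage.blob import BlobServiceClient

df = pd.DataFrame({"region": ["east", "west"], "revenue": [1200, 3400]})

# Write the report files locally first (e.g. to the driver's disk in Databricks).
df.to_excel("/tmp/report.xlsx", index=False)
df.to_html("/tmp/report.html", index=False)

service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("reports")

for path in ("/tmp/report.xlsx", "/tmp/report.html"):
    with open(path, "rb") as f:
        # Blob names can include "/" to simulate folders for your users.
        container.upload_blob(name=f"weekly/{path.split('/')[-1]}", data=f, overwrite=True)
```

With the files laid out under virtual folders like this, your non-technical users can browse, multi-select, and download them in Azure Storage Explorer.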

Related

Moving locally stored documents to Azure

I want to spike whether Azure and the cloud are a good fit for us.
We have a currently self-hosted website where users upload documents.
Every document has an equivalent record in a database.
I am using terraform to create the azure infrastructure.
What is the best way of migrating the documents from the local file path on the server to Azure?
Should I be using File Storage or Blob Storage? I am confused about the difference.
Is there anything in terraform that can help with this?
Based on your comments, I would recommend storing them in Blob Storage. This service is suited for storing and serving unstructured data like files and images. There are many other features like redundancy, archiving etc. that you may find useful in your scenario.
File Storage is more suitable in Lift-and-Shift kind of scenarios where you're moving an on-prem application to the cloud and the application writes data to either local or network attached disk.
You may also find this article useful: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
UPDATE
Regarding uploading files from a local computer to Azure Storage, there are actually many options available:
1) Use a storage explorer like Microsoft's Azure Storage Explorer.
2) Use the AzCopy command-line tool.
3) Use the Azure PowerShell cmdlets.
4) Use the Azure CLI.
5) Write your own code using any of the available Storage client libraries, or consume the REST API directly (see the sketch below).
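As a sketch of the last option, assuming the azure-storage-blob Python package; the connection string, container, and file names are placeholders:

```python
# Minimal single-file upload using the azure-storage-blob client library.
# The connection string, container, and local path are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="documents", blob="contract.pdf")

with open("contract.pdf", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```

For a one-off bulk migration like yours, AzCopy is usually the simplest choice; a script like this is more useful when the website itself needs to write documents to Blob storage going forward.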

Can Azure Data Factory write to FTP

I want to write the output of pipeline to an FTP folder. ADF seems to support on-premises file but not FTP folder.
How can I write the output in text format to an FTP folder?
Unfortunately FTP Servers are not a supported data store for ADF as of right now. Therefore there is no OOTB way to interact with an FTP Server for either reading or writing.
However, you can use a custom activity to make it possible, but it will require some custom development to make this happen. A fellow Cloud Solution Architect within MS put together a blog post that talks about how he did it for one of his customers. Please take a look at the following:
https://blogs.msdn.microsoft.com/cloud_solution_architect/2016/07/02/creating-ftp-data-movement-activity-for-azure-data-factory-pipeline/
I hope that this helps.
Upon thinking about it, you might be able to achieve what you want in a mildly convoluted way by writing the output to an Azure Blob storage account and then either
1) manually downloading the file from the Blob storage account and pushing it to the "FTP" site, or
2) automatically pulling the file locally with the Azure CLI and then pushing it to the "FTP" site with a batch or shell script, as appropriate.
Either is a lighter-weight approach than a custom activity (which is certainly the better option for heavy work).
You may wish to consider using Azure Functions to write to FTP (note there is a timeout when using a Consumption plan, but not in other plans, so it will depend on how big the files are).
https://learn.microsoft.com/en-us/azure/azure-functions/functions-create-storage-blob-triggered-function
You could instruct Data Factory to write to an intermediary Blob storage account,
and use a Blob storage trigger in Azure Functions to upload the files to FTP as soon as they appear (a sketch of this follows below).
Alternatively, write to Blob storage and then use a timer in Logic Apps to upload from Blob storage to FTP. Logic Apps hide a tremendous amount of power behind their friendly exterior.
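A minimal sketch of the blob-triggered function approach, using the Azure Functions Python model and the standard-library ftplib; the trigger binding name, FTP host, and credentials are placeholders you would configure yourself:

```python
# Blob-triggered Azure Function (Python) that pushes each new blob to an FTP
# server. The trigger binding ("myblob") is configured in function.json; the
# FTP host and credentials are placeholders, read from app settings here.
import io
import os
from ftplib import FTP

import azure.functions as func


def main(myblob: func.InputStream):
    data = io.BytesIO(myblob.read())
    filename = os.path.basename(myblob.name)

    ftp = FTP(os.environ["FTP_HOST"])
    ftp.login(os.environ["FTP_USER"], os.environ["FTP_PASS"])
    try:
        # STOR uploads the in-memory bytes under the blob's file name.
        ftp.storbinary(f"STOR {filename}", data)
    finally:
        ftp.quit()
```

Keep the Consumption-plan timeout in mind here: a function like this reads the whole blob into memory before uploading, so very large files may need a different plan or a streaming approach.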
You can write a Logic app that will pick your file up from Azure storage and send it to an FTP site. Then call the Logic App using a Data Factory Web Activity.
Make sure you do some error handling in your Logic App to return a 400 if the FTP upload fails.

Uploading 10,000,000 files to Azure blob storage from Linux

I have some experience with S3, and in the past have used s3-parallel-put to put many (millions of) small files there. Compared to Azure, S3 has an expensive PUT price, so I'm thinking of switching to Azure.
I don't however seem to be able to figure out how to sync a local directory to a remote container using azure cli. In particular, I have the following questions:
1- The aws client provides a sync option. Is there such an option for azure?
2- Can I concurrently upload multiple files to Azure storage using the CLI? I noticed that there is a -concurrenttaskcount flag for azure storage blob upload, so I assume it must be possible in principle.
If you prefer the command line and have a recent Python interpreter, the Azure Batch and HPC team has released a code sample, written in Python, with some AzCopy-like functionality called blobxfer. This allows full recursive directory ingress into Azure Storage as well as full container copy back out to local storage. [full disclosure: I'm a contributor to this code]
To answer your questions:
blobxfer supports rsync-like operations using MD5 checksum comparisons for both ingress and egress
blobxfer performs concurrent operations, both within a single file and across multiple files. However, you may want to split your input across multiple directories and containers, which will not only help reduce memory usage in the script but also partition your load better.
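If you would rather roll your own, here is a rough sketch of concurrent multi-file upload using the azure-storage-blob Python package and a thread pool; the connection string, container name, worker count, and source directory are placeholder assumptions:

```python
# Concurrent upload of a local directory tree to Blob storage. Connection
# string, container name, and source directory are placeholders.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
container = service.get_container_client("backup")

def upload(path: Path) -> None:
    # Use the relative path as the blob name to preserve the directory layout.
    with path.open("rb") as f:
        container.upload_blob(name=str(path.relative_to("data")), data=f, overwrite=True)

files = [p for p in Path("data").rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=32) as pool:
    # pool.map is lazy; wrap it in list() to drain it and surface exceptions.
    list(pool.map(upload, files))
```

For ten million files you would still want to partition the work (e.g. one container or prefix per top-level directory) rather than pushing everything through a single process.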
https://github.com/Azure/azure-sdk-tools-xplat is the source of azure-cli, so more details can be found there. You may want to open issues on the repo :)
azure-cli doesn't support "sync" so far.
-concurrenttaskcount supports parallel upload within a single file, which will increase the upload speed a lot, but it doesn't cover multiple files yet.
For uploading files into Blob storage in bulk, there is a tool provided by Microsoft: check out Azure Storage Explorer, which lets you do the required task.

What is the best strategy for using Windows Azure as a file storage system - with http download capabilities

I need to store multiple files that users upload, and then provide these users with the capability of accessing their files via http. There are two key considerations:
- Storage (which is my primary concern here)
- Security (which let's leave aside for now)
The question is:
What is the most cost-efficient and performant way of storing all these files and giving access to them later? I believe the answer is:
- Store files within an Azure Storage account, and keep a key that references them in a SQL Azure database.
Am I correct on this?
Is Blob storage flat? Or can I create something like folders inside it to better organize my files?
The idea of using SQL Azure to store metadata for your blobs is a pretty common scenario, which allows you to take advantage of SQL for searching, and blobs for storage.
Blobs are organized by container. So you'd have something like:
http://mystorage.blob.core.windows.net/mycontainer/myfile.doc
You can also simulate a hierarchy using a delimiter, but in reality there's just container plus blob.
If you keep the container or blob private, the user would either have to go through your web front end (or web service), or you'd have to provide them with a special URL with a Shared Access Signature appended, which is a time-limited URL.
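A small sketch of both points, assuming the modern azure-storage-blob Python package (the original Windows Azure SDKs work similarly) and placeholder account details: a blob name can embed "/" to simulate folders, and generate_blob_sas produces the time-limited URL described above.

```python
# Build a time-limited (SAS) download URL for a blob whose name simulates a
# folder hierarchy. Account name/key, container, and blob name are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "mystorage"
container = "mycontainer"
blob_name = "users/alice/myfile.doc"  # "/" gives a virtual folder per user

sas = generate_blob_sas(
    account_name=account_name,
    container_name=container,
    blob_name=blob_name,
    account_key="<your-account-key>",
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),  # time-limited
)

url = f"https://{account_name}.blob.core.windows.net/{container}/{blob_name}?{sas}"
print(url)  # hand this URL to the user; it stops working after an hour
```

Your web front end would generate a URL like this on demand after checking the user's permissions against the metadata in SQL Azure.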
I would recommend you take a look at the BlobShare sample, a simple file-sharing application that demonstrates the storage services of the Windows Azure platform, together with the authentication and authorization capabilities of Access Control Service (ACS). The full sample code is located at the following link:
http://blobshare.codeplex.com/
You can use this sample code immediately, just by adding the proper references to your Windows Azure account credentials. The best thing about this sample is that you can provide blob access directly through Access Control Service. You can also modify the code to add SAS support, as well as blob download from public containers. Once you have it working and understand the concept, you can tweak it to work the way you want.

Is it possible to mount blob storage to my local machine for deployment?

It would be very useful to configure my build script to dump some files into Azure Blob storage so they can be picked up by my Azure web role.
My preferred plan was to find some way of mounting the Blob storage on my build server as a mapped drive and simply using Robocopy to copy the files over. This would involve the least amount of friction, as I am already deploying some files like this to other web servers using WebDrive.
I found a piece of software that will allow me to do that: http://www.gladinet.com/
However on further investigation I found that it needs port 80 to run without some hairy looking hacking about on the server.
So is there another piece of software I could use or perhaps another way I haven't considered, such as deploying the files to a local folder that is automagically synced with blob storage?
Update in response to @David Makogon
I am using http://waacceleratorumbraco.codeplex.com/ this performs 2 way synchronisation between the blob storage and the web roles. I have tested this with http://cloudberrylab.com/ and I can deploy files manually to the blob and they are deployed correctly to the web roles. Also I have done the reverse and updated files in the web roles which have then been synced back to the blob and I have subsequently edited/downloaded them from blob storage.
What I'm really looking for is a way to automate the cloudberry side of things. So I don't have a manual step to copy a few files over. I will investigate the Powershell solutions in the meantime.
I know this is an old post - but in case someone else comes here... the answer is now "yes". I've been working on a CodePlex project to do exactly that. (All source code is available).
http://azuredrive.codeplex.com/
If you're comfortable using PowerShell in your build process, then you could use the Cerebrata cmdlets to upload the files. If that doesn't work for you, you could write a custom activity (but this sounds quite a bit more involved).
Mounting a cloud drive from a non-Windows Azure compute instance (e.g. your local build machine) is not supported.
Having said that: Even if you could mount a Cloud Drive from your build machine, your compute instances would need access to it too, and there can only be one writer. If your compute instances only needed read-only access, they'd need to create a snapshot after you upload new files.
This really doesn't sound like a good idea though. As knightpfhor suggested, the Cerebrata cmdlets provide this capability (look at Import-File). This allows you to push individual files into their own blobs. You can optimize further by pushing a single ZIP file into a blob. You can then use a technique similar to the one described by Nate Totten in his multi-tenant web role sample, to detect new zip files and expand them to your local storage. Nate's blog post is here.
Oh, and if you don't want to use the Cerebrata cmdlets, you can upload blobs directly with the Windows Azure Storage REST API (though the cmdlets are very simple to use and integrate seamlessly with PowerShell).
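As a rough sketch of the single-ZIP optimization (using the modern azure-storage-blob Python package rather than the Cerebrata cmdlets; the directory, container, and blob names are placeholders):

```python
# Zip a build-output directory and push it to a single blob, so a worker role
# can detect the new ZIP and expand it locally. All names are placeholders.
import shutil

from azure.storage.blob import BlobServiceClient

# Create build-output.zip from the ./build directory.
archive = shutil.make_archive("build-output", "zip", root_dir="build")

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="deployments", blob="build-output.zip")

with open(archive, "rb") as f:
    blob.upload_blob(f, overwrite=True)  # one PUT per deployment, not per file
```

Uploading one archive instead of many small blobs keeps the build step fast and makes "a new deployment arrived" trivial to detect on the role side.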
