How to transfer dependency jars / files to Azure storage from Linux?

I am trying to use Azure Spark. To run my job, I need to copy my dependency jar and files to storage. I have created a storage account and a container. Could you please guide me on how to access Azure storage from my Linux machine so that I can copy data to and from it?

Since you didn't state your restrictions (e.g., command line, programmatic, GUI), here are a few options:
If you have access to a recent Python interpreter on your Linux machine, you can use blobxfer (https://pypi.python.org/pypi/blobxfer), which can transfer entire directories of files into and out of Azure blob storage.
Use the Azure cross-platform CLI (https://azure.microsoft.com/en-us/documentation/articles/xplat-cli/), which can transfer files one at a time into or out of Azure storage.
Invoke Azure Storage calls programmatically via the Azure Storage SDKs. SDKs are available in a variety of languages, along with the REST API.
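For the SDK route, here is a minimal sketch using the Python azure-storage-blob package (pip install azure-storage-blob); the connection string, container name, and jar name are placeholders for your own values.

    # Minimal sketch: upload a dependency jar to an existing blob container.
    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string; copy the real one from the Azure portal.
    conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
    service = BlobServiceClient.from_connection_string(conn_str)
    container = service.get_container_client("jars")  # the container you already created

    # Upload the jar; the blob name mirrors the local file name.
    with open("my-dependency.jar", "rb") as data:
        container.upload_blob(name="my-dependency.jar", data=data, overwrite=True)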

Related

Azure Databricks integration with Unix File systems

I am looking for help understanding the integration of Unix file systems with Azure Databricks. I would like to connect to on-prem Unix file systems, access the relevant files, process them through Databricks, and load them into ADLS Gen2.
I understand that if the files are available in DBFS, we should be able to process them. But my requirement is specifically to process files available on an on-prem Unix file system using Azure technologies such as Azure Databricks or Azure Data Factory.
Any suggestions or help in this regard would be very welcome.
Unfortunately, it is not possible to connect directly to on-prem Unix file systems.
However, you can try the workarounds below:
You can upload files to DBFS and then access them from there (a sketch of reading an uploaded file follows this list). Browse DBFS using the UI.
To copy large files, use AzCopy. AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account.
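As a rough illustration of the first workaround: once a CSV has been uploaded to DBFS (the upload UI typically places files under /FileStore), it can be read in a Databricks notebook with Spark. The paths below are placeholders, and the ADLS Gen2 write assumes the workspace already has access configured.

    # Minimal sketch: read a CSV that was uploaded to DBFS via the UI.
    # "dbfs:/FileStore/tables/input.csv" is a placeholder path; adjust it to
    # wherever your upload landed. `spark` is the SparkSession Databricks provides.
    df = spark.read.csv("dbfs:/FileStore/tables/input.csv", header=True, inferSchema=True)
    df.show(5)

    # The processed DataFrame could then be written to ADLS Gen2, assuming the
    # storage account is mounted or credentials are configured (placeholder URI):
    # df.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/processed/")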

Moving locally stored documents to Azure

I want to spike whether Azure and the cloud are a good fit for us.
We have a website where users upload documents to our currently hosted server.
Every document has an equivalent record in a database.
I am using Terraform to create the Azure infrastructure.
What is the best way to migrate the documents from the local file path on the server to Azure?
Should I be using File Storage or Blob Storage? I am confused about the difference.
Is there anything in Terraform that can help with this?
Based on your comments, I would recommend storing them in Blob Storage. This service is well suited to storing and serving unstructured data like files and images. There are many other features, such as redundancy and archiving, that you may find useful in your scenario.
File Storage is more suitable for lift-and-shift scenarios where you're moving an on-prem application to the cloud and the application writes data to a local or network-attached disk.
You may also find this article useful: https://learn.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks
UPDATE
Regarding uploading files from a local computer to Azure Storage, there are actually many options available:
Use a storage explorer tool, such as Microsoft's Azure Storage Explorer.
Use the AzCopy command-line tool.
Use the Azure PowerShell cmdlets.
Use the Azure CLI.
Write your own code using any of the available storage client libraries, or consume the REST API directly (a minimal sketch follows).
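As a minimal sketch of the last option (not a production migration tool), the snippet below walks a local folder and uploads each document to a Blob Storage container using the Python azure-storage-blob package; the connection string, container name, and local folder are placeholders.

    # Minimal sketch: upload every document under a local folder to Blob Storage.
    # Requires: pip install azure-storage-blob. All names below are placeholders.
    import os
    from azure.storage.blob import BlobServiceClient

    conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
    container = BlobServiceClient.from_connection_string(conn_str).get_container_client("documents")

    local_root = "/var/www/uploads"  # placeholder: the folder on the current server
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Preserve the relative folder structure in the blob name.
            blob_name = os.path.relpath(path, local_root).replace(os.sep, "/")
            with open(path, "rb") as data:
                container.upload_blob(name=blob_name, data=data, overwrite=True)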

How do I manage Azure "files" from a GUI and the command line?

I'm confused about the difference between "files" and other objects in Azure Storage. I understand how to upload a file to a share using the Azure web console and the command line, but in Azure Storage Explorer I don't see either of these; I only see "blobs". Although I can upload "files" there using the explorer, I can't upload to or see any of my "file" "shares".
Is there a way to browse and manage "files" and "shares" using Azure Storage Explorer, or some other client or CLI tool (on OS X)?
They are different services. Azure Storage is the "umbrella" service that consists of several sub-services: Queues (obvious :)), Tables (a kind of NoSQL table storage), Blobs (binary large objects, from text files to multimedia), and Files (the service that implements file shares, which can be mounted on a virtual machine, for example).
These services can all be used from Azure Storage Explorer, but which one to use depends on what you want to implement. If you just need to store files, you can use blobs. If you need to attach the storage to a VM as a file share, then the Files service is what you need. There is a good comparison of the options in the Azure documentation.
I am not sure whether you can manage Files with Azure Storage Explorer (update: I checked, and you cannot), but something like CloudXplorer is able to do that.
You can browse and add/edit/delete files in Azure file shares just as you would in any other file share once it is mounted. You can refer to these two articles on how to do so:
Mount Azure File Share in Windows
Mount Azure File Share in Linux
Alternatively, you can use the CLI or PowerShell; see the examples below:
PowerShell example
CLI example
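If you prefer to script the file share rather than mount it, here is a minimal sketch using the Python azure-storage-file-share package; the connection string, share name, and file names are placeholders, and this is an illustration rather than a full management tool.

    # Minimal sketch: list and upload files in an Azure file share from code.
    # Requires: pip install azure-storage-file-share. All names are placeholders.
    from azure.storage.fileshare import ShareClient

    conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
    share = ShareClient.from_connection_string(conn_str, share_name="myshare")

    # List what sits at the root of the share.
    for item in share.list_directories_and_files():
        print(item["name"], "(directory)" if item["is_directory"] else "(file)")

    # Upload a local file to the root of the share.
    file_client = share.get_file_client("summary.txt")
    with open("summary.txt", "rb") as data:
        file_client.upload_file(data)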

Using AzCopy in an Azure virtual machine

I have an Azure virtual machine that holds some application-specific CSV files (retrieved via FTP from on-premises) that need to be stored in a blob (and will eventually be read and pushed into an Azure SQL DB via a worker role). The question is about pushing the files from the VM to the blob. Is it possible to get AzCopy without installing the SDK so that the files can be copied to the blob? Is there a better solution than this? Please read the points below for further information.
Points to note:
1) Though the files could be uploaded directly to a blob rather than being brought into the VM first and copied from there, for security reasons the files have to be pulled into the VM, and this cannot be changed.
2) I also thought about a worker role talking to a VM folder share (via a common virtual network) to pull the files and upload them to the blob, but after reading some blogs this does not appear to be the right solution, as it requires changes to both VMs (the worker role VM and the IaaS VM).
3) The Azure File service is still in preview (?) and hence cannot be used.
Is it possible to get AzCopy without installing the SDK to have the files copied to the blob?
Absolutely, yes. You can download the AzCopy binaries directly, without installing the SDK, using the following links:
Version 3.1.0: http://aka.ms/downloadazcopy
Version 4.1.0: http://aka.ms/downloadazcopypr
Source: http://blogs.msdn.com/b/windowsazurestorage/archive/2015/01/13/azcopy-introducing-synchronous-copy-and-customized-content-type.aspx

Uploading 10,000,000 files to Azure blob storage from Linux

I have some experience with S3, and in the past I have used s3-parallel-put to put many (millions of) small files there. Compared to Azure, S3 has an expensive PUT price, so I'm thinking of switching to Azure.
However, I can't seem to figure out how to sync a local directory to a remote container using the Azure CLI. In particular, I have the following questions:
1- The aws client provides a sync option. Is there such an option for Azure?
2- Can I concurrently upload multiple files to Azure storage using the CLI? I noticed that there is a -concurrenttaskcount flag for azure storage blob upload, so I assume it must be possible in principle.
If you prefer the command line and have a recent Python interpreter, the Azure Batch and HPC team has released a code sample with some AzCopy-like functionality in Python, called blobxfer. It allows full recursive directory ingress into Azure Storage as well as full container copy back out to local storage. [Full disclosure: I'm a contributor to this code.]
To answer your questions:
blobxfer supports rsync-like operations using MD5 checksum comparisons for both ingress and egress.
blobxfer performs concurrent operations, both within a single file and across multiple files. However, you may want to split your input across multiple directories and containers, which will not only help reduce memory usage in the script but will also partition your load better.
https://github.com/Azure/azure-sdk-tools-xplat is the source of azure-cli, so more details can be found there. You may want to open issues in the repo :)
azure-cli doesn't support "sync" so far.
-concurrenttaskcount supports parallel upload within a single file, which increases the upload speed a lot, but it doesn't cover uploading multiple files concurrently yet.
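Since the CLI's -concurrenttaskcount only parallelizes within a single file, one workaround, sketched below under the assumption that the Python azure-storage-blob package is acceptable, is to drive concurrency across files yourself with a thread pool; the connection string, container name, and directory are placeholders, and the worker count is only a starting point.

    # Minimal sketch: upload many small files concurrently with a thread pool.
    # Requires: pip install azure-storage-blob. All names below are placeholders.
    import os
    from concurrent.futures import ThreadPoolExecutor
    from azure.storage.blob import BlobServiceClient

    conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
    container = BlobServiceClient.from_connection_string(conn_str).get_container_client("mycontainer")

    local_root = "/data/files"  # placeholder directory holding the files to upload

    def upload_one(path):
        # Blob name preserves the path relative to the local root.
        blob_name = os.path.relpath(path, local_root).replace(os.sep, "/")
        with open(path, "rb") as data:
            container.upload_blob(name=blob_name, data=data, overwrite=True)

    paths = [os.path.join(d, f) for d, _, files in os.walk(local_root) for f in files]
    # 32 workers is an arbitrary starting point; tune it for your bandwidth and file sizes.
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(upload_one, paths))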
In order to upload files in bulk to blob storage, there is a tool provided by Microsoft: check out Azure Storage Explorer, which allows you to do the required task.
