Download and unzip dataset directly in Azure

I need to download and unzip a 27 GB dataset directly in my Azure account so I can work on it with a Spark instance, using the textFile function, for some machine learning. How can I do it?
I would like to write more, but I have spent so many hours searching the internet and still I am not able to achieve anything useful.
This is the dataset:
https://academicgraphwe.blob.core.windows.net/graph-2016-02-05/index.html

If "directly" means from that location to your VM, then the simplest way, in my opinion, is to use AzCopy.
For example, in your case it could look like this:
AzCopy /Source:https://academicgraphwe.blob.core.windows.net/graph-2016-02-05/ /Dest:C:\myfolder /SourceKey:key /Pattern:"abc.txt"
Install AzCopy on your VM and run the command. You don't need the SourceKey here, since your dataset appears to be in a publicly accessible blob. But change the link to point to the blob you actually need, because the URL above goes to a page that just lists the download links.
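For instance, once you have picked an actual file from that index page, the command might look roughly like the line below (the file name is a placeholder, and no SourceKey is passed because the container is public):
AzCopy /Source:https://academicgraphwe.blob.core.windows.net/graph-2016-02-05 /Dest:C:\myfolder /Pattern:"<file-from-the-index>"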

Related

What is an efficient way to copy a subset of files from one container to another?

I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?
I have tried:
Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary, but I chose 1000 because I was concerned about overloading the command; even 1000 files makes a command of about 84k characters.) Extra caveat here: I cannot rename the files with this method, which is required for about 25% of them due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.
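For reference, one of those batched commands looks roughly like this (a sketch only; the account name, container names, SAS tokens, and file paths are placeholders):
azcopy copy "https://<account>.blob.core.windows.net/source-container?<SAS>" "https://<account>.blob.core.windows.net/dest-container?<SAS>" --include-path "dir1/file0001.bin;dir1/file0002.bin;dir2/file0003.bin"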
Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy then copying the files with that tag, but I couldn't find the arguments to do that.
Please check the references below:
1. AzCopy would give the best performance for copying blobs within the same storage account or to other storage accounts. You can force a synchronous copy by specifying the /SyncCopy parameter, which ensures the copy operation gets a consistent speed (azcopy sync | Microsoft Docs); a rough sketch of the newer azcopy v10 sync syntax follows this list. Note, however, that AzCopy performs the synchronous copy by downloading the blobs to local memory and then uploading them to the Blob storage destination, so performance will also depend on network conditions between the location where AzCopy is run and the Azure datacenter location. Also note that /SyncCopy might generate additional egress cost compared to an asynchronous copy; the recommended approach is to use this sync option with AzCopy on an Azure VM in the same region as your source storage account to avoid egress cost.
Choose a tool and strategy to copy blobs - Learn | Microsoft Docs
2. StartCopyAsync is one of the ways you can try for a copy within a storage account.
References:
1. .net - Copying file across Azure container without using azcopy - Stack Overflow
2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
3. You may consider Azure Data Factory in the case of millions of files, but note that it can be expensive and minor timeouts may occur; it may still be worth it for repeated work of this kind.
References:
1. Copy millions of files (andrewconnell.com), GitHub (Microsoft Docs)
2. File Transfer between container to another container - Microsoft Q&A
4. Also check out Azure Storage Explorer and try copying the blob container to another one.
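As a rough illustration, a container-to-container sync with the newer azcopy (v10) syntax might look like the line below; the account name, container names, and SAS tokens are placeholders:
azcopy sync "https://<account>.blob.core.windows.net/source-container?<SAS>" "https://<account>.blob.core.windows.net/dest-container?<SAS>" --recursive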

Azure ML studio really slow

I have been using Azure ML Studio for a while now and it was really fast, but now when I try to unzip folders containing around 3000 images using
!unzip "file.zip" -d "to unzip directory"
it took more than 30 minutes, and other activities (longer concatenation methods) also seem to take a long time, even using numpy arrays. I'm wondering if it is something with the configuration or another problem. I have tried switching locations, creating new resource groups and workspaces, and changing computes (both CPU and GPU).
The compute and the rest of the current configuration can be seen in the attached image.
When you are using a notebook, your local directory is persisted on a (remote) Blob store. Consequently, you are limited by network delays and, more significantly, by the IOPS your compute agent has.
What has worked for me is to use the local disk mounted on the compute agent. NOTE: This is not persisted and all the stuff on this will disappear when the compute agent is stopped.
After doing all your work, you can move the data to your persistent storage (which should be in your list of mounts). This might still be slow but you don't have to wait for it to complete.
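A minimal sketch of that workflow, assuming the compute instance has a local scratch directory at /tmp and using <persisted-dir> as a stand-in for your Blob-backed, persisted mount (both paths are assumptions):
mkdir -p /tmp/images                                   # local disk: fast, but wiped when the compute stops
unzip "file.zip" -d /tmp/images                        # unzip against local disk instead of the Blob-backed mount
# ... do the training / preprocessing against /tmp/images ...
cp -r /tmp/images/outputs "<persisted-dir>/outputs"    # copy only the results you want to keep back to persistent storage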

Quick way of uploading many images (3000+) to a google cloud server

I am working on object detection for a school project. To train my CNN model I am using a Google Cloud server, because I do not own a strong enough GPU to train it locally.
The training data consists of images (.jpg files) and annotations (.txt files) and is spread over around 20 folders, because the data comes from different sources and I do not want to mix pictures from different sources, so I want to keep this directory structure.
My current issue is that I could not find a fast way of uploading them to my Google Cloud server.
My workaround was to upload the image folders as .zip files to Google Drive, then download and unzip them on the cloud server. This process takes way too much time because I have to upload many folders, and Google Drive does not have a good API for downloading folders to Linux.
On my local computer, I am using Windows 10 and my cloud server runs Debian.
Therefore, I'd be really grateful if you know a fast and easy way to either upload my images directly to the server or at least to upload my zipped folders.
Couldn't you just create an infinite loop that looks for jpg files and scp/sftp them directly to your server as soon as a file appears? On Windows, you can achieve this using WSL.
(Sorry, this may not be your final answer, but I don't have the reputation to ask you this as a question.)
I would upload them to a Google Cloud Storage bucket using gsutil with multithreading. This means that multiple files are copied at once, so the only limitation here is your internet speed. Gsutil installers for Windows and Linux are found here. Example command:
gsutil -m cp -r dir gs://my-bucket
Then on the VM you do exactly the opposite:
gsutil -m cp -r gs://my-bucket dir
This is super fast, and you only pay a small amount for the storage, which is super cheap and/or falls within the GCP free tier.
Note: make sure you have write permissions on the storage bucket and the default compute service account (i.e. the VM service account) has read permissions on the storage bucket.
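For example, granting the VM's default compute service account read access on the bucket could look like the line below (the project number and bucket name are placeholders):
gsutil iam ch serviceAccount:<project-number>-compute@developer.gserviceaccount.com:objectViewer gs://my-bucket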
The best stack for this use case is gsutil plus a storage bucket.
Copy the zip files to a Cloud Storage bucket and set up a sync cron job to pull the files onto the VM (a rough sketch follows the link below).
Make use of gsutil:
https://cloud.google.com/storage/docs/gsutil
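A rough sketch of that sync cron approach, where the bucket name, local path, and schedule are assumptions:
# crontab entry on the VM: pull anything new from the bucket every 10 minutes
*/10 * * * * gsutil -m rsync -r gs://my-bucket /home/user/data >> /var/log/gsutil-sync.log 2>&1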

PDI slow loading into Azure databases

I have an Azure VM with Pentaho Data Integration installed, and I'm trying to build some ETL that loads a dimensional model from the staging area. But when I start a transformation, the load speed of PDI into any Azure database is painfully slow.
Is it possible to have PDI working in the cloud with Azure databases? Is there some configuration step needed to achieve a reasonable loading speed?
PS:
VM and databases are in the same region
There is a firewall rule to allow port access
Reading speed is working just fine
PDI 8.1, using table output step
I've been experiencing the same speed problem, but I will tell you my workarounds.
First of all: download and install the latest JDBC driver that lets you connect to Azure SQL Database. The documentation link is here, but the way I do it is to keep it synced from here on GitHub; either will let you use the latest driver in PDI.
Second workaround: for large files, what I've found most powerful is using the BCP utility driven from PowerShell or a Linux batch script (a rough sketch follows this list). It doesn't matter whether the files are local or in Azure Blob storage, but you might need credentials for the latter.
Last but not least: use Azure Data Factory V2 to move and load files (if you are like me, you'll try to keep everything in PDI until you have to load it; the HTTP GET step will let you trigger an ADF pipeline).
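A rough sketch of such a BCP load into an Azure SQL table (the server, database, table, credentials, and file path are all placeholders):
bcp dbo.DimCustomer in "C:\staging\dim_customer.csv" -S <server>.database.windows.net -d <database> -U <user> -P <password> -c -t "," -b 10000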
Good luck and let me know if you get it.

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial level cloud service and pretty confused about tons of terminologies and concepts. In short, here is my scenario:
I want to run the same algorithm on multiple datasets, i.e., data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries; however, these dynamic libraries can easily be installed via apt.
Each dataset is structured, meaning the data (images, other files, ...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system. Then I could submit my jobs with 'qsub' from a script or something similar. Is there a way to do this on Azure?
I investigated the Batch service, but had trouble installing dependencies after creating a compute node. I also had trouble with storage: so far I have only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" in PowerShell. Dependencies can easily be installed on the compute nodes via apt-get (a rough sketch follows).
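For example, installing a dependency on all Linux compute nodes from the head node could look roughly like this (the node group name and the package are placeholders):
clusrun /nodegroup:LinuxNodes sudo apt-get install -y <package>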
Create a file share inside one of the storage accounts. This can be mounted by all machines inside the cluster.
One glitch here is that, for encryption reasons, you cannot mount the file share on Linux machines outside Azure. Two solutions came to mind: (1) mount the file share on the Windows head node and share it from there, via FTP or SSH; (2) create another Linux VM (as a bridge), mount the file share on that VM, and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
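For reference, mounting the file share on a Linux machine inside Azure looks roughly like the commands below (the storage account, share name, and key are placeholders, and the cifs-utils package must be installed):
sudo mkdir -p /mnt/myshare
sudo mount -t cifs //<account>.file.core.windows.net/<share> /mnt/myshare -o vers=3.0,username=<account>,password=<storage-account-key>,dir_mode=0777,file_mode=0777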
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few dynamic objects. I uploaded those dynamic objects to Azure as well and set LD_LIBRARY_PATH when executing the program on the compute nodes.
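Something along these lines, where the library directory, binary name, and dataset path are all placeholders:
export LD_LIBRARY_PATH=/mnt/myshare/libs:$LD_LIBRARY_PATH   # point the loader at the uploaded shared objects
/mnt/myshare/bin/my_algorithm --input /mnt/myshare/datasets/dataset01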
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script that writes XML files; the Job Manager can then load these XML files to create a job. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. Hope this post can help somebody.
Azure Files could provide you a shared file solution for your Ubuntu boxes; details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage by using containers and "/" separators in your blob naming strategy.
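For example, blobs uploaded with "/" in their names show up as a folder-like hierarchy; a sketch with the Azure CLI, where the account, container, and file names are placeholders:
az storage blob upload --account-name <account> --container-name datasets --name dataset01/images/img_0001.jpg --file ./dataset01/images/img_0001.jpg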
To David's point, whilst Batch is generally the first thing looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or on a schedule, depending on your workload's behaviour.
