Slow data transfer from Azure Blob Storage to compute target - azure

It's taking about an hour to download a 48 GB dataset with 90,000 files.
I am training an image segmentation model in an Azure ML pipeline using the compute target p100-nc6s-v2.
In my script I access Azure Blob Storage using DataReference's as_download() functionality. The blob storage is in the same region as the workspace (I'm using get_default_datastore).
Note: I'm able to download the complete dataset to my local workstation within a few minutes using azcopy.
When I tried as_mount(), the first epoch was extremely slow (4,700 seconds versus 772 seconds for subsequent epochs).
Is this expected behavior? If not, what can be done to improve dataset loading speed?

The working folder of the run is mounted cloud storage, which could be defaulting to the file storage in your workspace.
Can you try setting the blob datastore instead and see if performance improves?
run_config.source_directory_data_store = 'workspaceblobstore'
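For reference, a minimal sketch of wiring that up with the Azure ML SDK v1 (where you attach the RunConfiguration depends on your pipeline; the names here are illustrative):
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
# Make the run's working directory live on the workspace blob datastore
# instead of the default file share.
run_config.source_directory_data_store = "workspaceblobstore"

# Attach run_config to whatever launches the training script,
# e.g. PythonScriptStep(..., runconfig=run_config).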

as_download() downloads the data to the current working directory, which is a mounted file share (or blob, if you do what reznikov suggested above).
Unfortunately, for lots of small files neither blob nor file share is very performant (although blob is much better); see this reply for some measurements: Disk I/O extremely slow on P100-NC6s-V2
When you are mounting, the reason the first epoch is so slow is that blobfuse (which is used for mounting blobs) caches to the local SSD, so after the first epoch everything is on your SSD and you get full performance.
As for why the first epoch takes much longer than azcopy, I suspect that the data reader of the framework you are using does not pipeline the reads. What are you using?
You could try one of two things:
1. Mount, but at the beginning of the job copy the data from the mount path to /tmp and consume it from there (see the sketch below).
2. If option 1 is significantly slower than azcopy, don't mount. Instead, at the beginning of the job copy the data to /tmp using azcopy.
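A minimal sketch of option 1, assuming the mounted DataReference path is handed to the training script as a --data-dir argument (the argument name and the /tmp/dataset path are illustrative, not from the original post):
import argparse
import shutil
import time

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, help="mounted input path from as_mount()")
args = parser.parse_args()

local_dir = "/tmp/dataset"

start = time.time()
# Pull everything through blobfuse once, onto the node's local SSD.
shutil.copytree(args.data_dir, local_dir)
print(f"Copied dataset to {local_dir} in {time.time() - start:.0f}s")

# Point the data loader at local_dir instead of the mount path from here on.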

Related

What is an efficient way to copy a subset of files from one container to another?

I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?
I have tried:
Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary, but I chose 1000 because I was concerned about overloading the command; even 1000 files makes a command with 84k characters.) Extra caveat here: I cannot rename files with this method, which is required for about 25% of them due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.
Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy then copying the files with that tag, but I couldn't find the arguments to do that.
Please check the references below:
1. AzCopy gives the best performance for copying blobs within the same storage account or to another storage account. You can force a synchronous copy by specifying the /SyncCopy parameter so that the copy operation runs at a consistent speed (azcopy sync | Microsoft Docs). Note that AzCopy performs the synchronous copy by downloading the blobs to local memory and then uploading them to the destination blob storage, so performance also depends on the network conditions between the location where AzCopy is run and the Azure datacenter. Also note that /SyncCopy might generate additional egress cost compared to an asynchronous copy; the recommended approach is to run AzCopy with this option on an Azure VM in the same region as your source storage account to avoid egress cost.
Reference: Choose a tool and strategy to copy blobs - Learn | Microsoft Docs
2. StartCopyAsync is one of the ways you can try for copying within a storage account (see the sketch after this list for the equivalent with the current Python SDK).
References:
1. .net - Copying file across Azure container without using azcopy - Stack Overflow
2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
3. You may consider Azure Data Factory in the case of millions of files, but note that it may be expensive and occasional timeouts may occur; it can still be worth it for repeated work.
References:
1. Copy millions of files (andrewconnell.com), GitHub (Microsoft Docs)
2. File Transfer between container to another container - Microsoft Q&A
4. Also check out Azure Storage Explorer's option to copy a blob container to another.
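Since the question already uses the Python SDK, here is a hedged sketch of the server-side copy from item 2 done concurrently; start_copy_from_url only starts an asynchronous copy on the service, so the per-file cost is mostly an HTTP round trip, which parallelizes well. The account, container, and blob names plus the SAS token are placeholders:
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://<account>.blob.core.windows.net"
SAS = "<sas-token>"   # placeholder

service = BlobServiceClient(account_url=ACCOUNT_URL, credential=SAS)
dst_container = service.get_container_client("dest-container")

def copy_one(blob_name: str) -> None:
    # Server-side copy: no data flows through the machine running this script.
    src_url = f"{ACCOUNT_URL}/source-container/{blob_name}?{SAS}"
    dst_container.get_blob_client(blob_name).start_copy_from_url(src_url)

blob_names = ["folder/file-0001.dat", "folder/file-0002.dat"]  # your ~100k names

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(copy_one, blob_names))   # consume to surface any exceptions
With a few dozen workers this usually goes well past the ~7 files per second of the sequential loop, though the exact rate depends on the storage account's throttling limits.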

Azure ML studio really slow

I had been using Azure ML Studio for a while and it was really fast, but now when I try to unzip a folder containing around 3,000 images using
!unzip "file.zip" -d "to unzip directory"
it takes more than 30 minutes, and other activities (longer concatenation operations) also seem to take a long time, even when using numpy arrays. I'm wondering whether this is a configuration problem or something else. I have tried switching locations, creating new resource groups and workspaces, and changing computes (both CPU and GPU).
The compute and other current configuration settings can be seen in the image.
When you are using a notebook, your local directory is persisted on a (remote) blob store. Consequently, you are limited by network latency and, more significantly, by the IOPS your compute agent has.
What has worked for me is to use the local disk mounted on the compute agent. Note: this is not persisted, and everything on it will disappear when the compute agent is stopped.
After doing all your work, you can move the data to your persistent storage (which should be in your list of mounts). This might still be slow, but you don't have to wait for it to complete.
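For example, a minimal sketch (assuming a notebook on an Azure ML compute instance; the paths are illustrative): extract onto the VM's local disk first, then copy whatever is worth keeping back to the persisted share at the end.
import shutil
import zipfile

local_dir = "/tmp/images"          # local SSD on the compute agent, not persisted
persisted_dir = "outputs/images"   # illustrative path on the mounted project share

with zipfile.ZipFile("file.zip") as zf:
    zf.extractall(local_dir)       # local disk I/O only, so this is fast

# ... do the heavy work against local_dir ...

# Copy anything worth keeping back to persisted storage at the end.
shutil.copytree(local_dir, persisted_dir)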

data lake file to blob poor performance

I'm using azcopy to upload local files to blob storage.
I'm using command:
azcopy copy "localpath" "destinationpath(with SAS)" --include="*.csv" --recursive=true
I also tried
azcopy sync "localpath" "destinationpath(with SAS)" --include="*.csv"
The files I'm trying to upload are each 1GB+.
When I manually upload a file to the data lake it takes 40+ minutes per file. If I do it with azcopy it takes 30+ minutes per file and often fails.
Is it normal that it takes this long? Am I doing something wrong or is there a faster way of doing this?
As you may know, azcopy is already optimized for performance, and nothing is missing from your command. If that's the case, there is little to tune on the azcopy side (although you may want to check whether it's a network issue).
You can also try Azure Data Factory; it provides very high performance and can reach data loading speeds of up to 1 GB/s into Data Lake Storage Gen1.
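If you want to rule the tool in or out, one hedged alternative to compare against is uploading the same file with the azure-storage-blob Python SDK and a high max_concurrency; if that is just as slow, the bottleneck is almost certainly the network path to the region. The URL, SAS token, and file path below are placeholders:
from azure.storage.blob import BlobClient

# Blob URL including a write-capable SAS token (placeholder).
blob_url = "https://<account>.blob.core.windows.net/<container>/data.csv?<sas-token>"
client = BlobClient.from_blob_url(blob_url)

with open("localpath/data.csv", "rb") as f:
    # max_concurrency uploads multiple blocks of the large file in parallel.
    client.upload_blob(f, overwrite=True, max_concurrency=8)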

Slow speed and high latency when downloading multiple small files from azure storage container

I'm trying to download the data of an Azure Blob Storage container to my machine. It consists of many small files, 12-60 KB each. When I use the Microsoft Azure Storage Explorer app, it downloads no more than a few hundred items at once and then halts for tens of minutes before trying to download the next batch.
This makes the download speed roughly less than 3 KB/s, which is quite horrible.
I've also tried using an open-source npm package to download the container files, with similar results.
Is there a way to decrease latency/increase speed? Or is there a better way to download all of the container's data?
It actually depends on your (local) machine's network speed. You can try creating a small instance in the same region and downloading to that instance; it will be much faster than on your local machine. Then archive all the files and transfer the archive back to your local machine (e.g. over FTP).
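Whether from a same-region VM or locally, a hedged sketch of pulling many small blobs in parallel with the azure-storage-blob Python SDK, which for small files usually matters more than raw bandwidth; the container URL/SAS and output directory are placeholders:
import os
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

container = ContainerClient.from_container_url(
    "https://<account>.blob.core.windows.net/<container>?<sas-token>"
)
out_dir = "downloaded"

def fetch(blob_name: str) -> None:
    path = os.path.join(out_dir, blob_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        container.download_blob(blob_name).readinto(f)

names = [b.name for b in container.list_blobs()]
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(fetch, names))   # consume to surface any exceptions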

Does Azcopy support moving and removing files

My question is: does AzCopy support moving and removing files from disk after copying them to Azure Storage?
From the local disk, no: it only copies to the cloud and does not take responsibility for any deletion or move. The best practice is to wrap AzCopy in a script that takes care of the local file handling based on a successful response from AzCopy.
You could also roll your own version using the Storage Data Movement Library.
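As a concrete illustration of the wrap-it-in-a-script approach, here is a minimal Python sketch (the paths, SAS URL, and azcopy flags are placeholders to adapt): it deletes the local files only if azcopy exits successfully.
import shutil
import subprocess

SRC = "/data/outgoing"   # local folder to "move"
DST = "https://<account>.blob.core.windows.net/<container>?<sas-token>"  # placeholder

result = subprocess.run(
    ["azcopy", "copy", SRC, DST, "--recursive"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    shutil.rmtree(SRC)   # move semantics: delete the local copy only after success
else:
    print("azcopy failed; keeping local files:\n", result.stderr)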
