Copy a large number of small blobs with AzCopy

I am trying to do an incremental copy of roughly 500,000 blobs from one storage account to another.
However, it seems that if I do not specify a /Pattern: parameter, AzCopy just hangs and never finishes (I actually stopped the process after about 15 minutes).
Is half a million (potentially up to 5 million) blobs too much for AzCopy to handle, or am I missing something here?
The command I'm using looks like this:
AzCopy /Source:<src>/documents /SourceKey:<srcKey> /Dest:<dest>/documents /DestKey:<destKey> /S /XO /Y
Adding the /pattern parameter solves it, but I'd like a complete copy of all blobs in the container.
I should add that it already managed to copy all the blobs once; it is the subsequent runs that fail, when it has to figure out which blobs have been added since the last full backup.

Which version of AzCopy are you using? This issue should have been fixed for many releases now. Several versions ago, AzCopy needed to list all the blobs to be transferred before starting the transfer; current versions list and transfer simultaneously.
To download the latest version of AzCopy and find more information, please refer to http://aka.ms/azcopy .
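If you move to AzCopy v10, its sync command covers this incremental scenario directly. Below is a minimal sketch that drives it from Python; it assumes a recent v10 build where sync supports blob-to-blob copies, and the SAS URLs are placeholders:
import subprocess
# Placeholder SAS URLs for the source and destination containers (assumptions).
SOURCE = "https://<srcaccount>.blob.core.windows.net/documents?<src-sas>"
DEST = "https://<destaccount>.blob.core.windows.net/documents?<dest-sas>"
# "azcopy sync" only transfers blobs that are new or changed since the last run,
# which matches the incremental-backup scenario described above.
result = subprocess.run(["azcopy", "sync", SOURCE, DEST, "--recursive"], check=False)
if result.returncode != 0:
    raise RuntimeError(f"azcopy sync failed with exit code {result.returncode}")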

Related

What is an efficient way to copy a subset of files from one container to another?

I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?
I have tried:
Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary, but I chose 1000 because I was concerned about overloading the command; even 1000 files makes a command with 84k characters.) Extra caveat here: I cannot rename the files with this method, which is required for about 25% of them due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.
Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy then copying the files with that tag, but I couldn't find the arguments to do that.
Please check the references below:
1. AzCopy gives the best performance for copying blobs within the same storage account or to another storage account. You can force a synchronous copy by specifying the /SyncCopy parameter, which ensures the copy operation runs at a consistent speed (azcopy sync | Microsoft Docs). Note that AzCopy performs a synchronous copy by downloading the blobs to local memory and then uploading them to the destination Blob storage, so performance also depends on the network conditions between the machine running AzCopy and the Azure datacenter. Also note that /SyncCopy may generate additional egress cost compared to an asynchronous copy; the recommended approach is to run AzCopy with this option on an Azure VM in the same region as your source storage account to avoid egress cost.
Choose a tool and strategy to copy blobs - Learn | Microsoft Docs
2. StartCopyAsync is one of the ways you can try for a copy within a storage account (see the sketch after this list).
References:
1. .net - Copying file across Azure container without using azcopy - Stack Overflow
2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
3. You may consider Azure Data Factory in the case of millions of files, but note that it may be expensive and occasional timeouts may occur; it can still be worth it for repeated work of this kind.
References:
1. Copy millions of files (andrewconnell.com), GitHub (Microsoft docs)
2. File Transfer between container to another container - Microsoft Q&A
4. Also check out and try Azure Storage Explorer to copy a blob container to another.
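Along the lines of option 2, here is a rough sketch of parallel server-side copies with the azure-storage-blob Python SDK; the account URL, SAS token, container names, and the names_to_copy list are placeholders. Because start_copy_from_url only schedules a server-side copy, running it from a thread pool usually pushes throughput well past the ~7 files per second seen with one copy at a time:
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient
ACCOUNT_URL = "https://<account>.blob.core.windows.net"
SAS = "<sas-token>"  # assumption: a SAS with read on the source and write on the destination
service = BlobServiceClient(account_url=ACCOUNT_URL, credential=SAS)
dst_container = service.get_container_client("dest-container")
def copy_one(name, new_name):
    # start_copy_from_url schedules an asynchronous, server-side copy; within the
    # same storage account it normally completes almost immediately.
    src_url = f"{ACCOUNT_URL}/source-container/{name}?{SAS}"
    dst_container.get_blob_client(new_name).start_copy_from_url(src_url)
# Pairs of (source name, destination name), which also covers the ~25% of files
# that need renaming. The single entry below is a placeholder.
names_to_copy = [("folder/a.txt", "folder/a_renamed.txt")]
with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(lambda pair: copy_one(*pair), names_to_copy)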

Slow data transfer from Azure Blob Storage to compute target

It's taking 1 hour to download a 48 GB dataset with 90,000 files.
I am training an image segmentation model on Azure ML pipeline using compute target p100-nc6s-v2.
In my script I'm accessing Azure Blob Storage using DataReference's as_download() functionality. The blob storage is in the same location as workspace (using get_default_datastore).
Note: I'm able to download the complete dataset to a local workstation within a few minutes using az copy.
When I tried to use as_mount() the first epoch was extremely slow (4700 seconds vs 772 seconds for subsequent epochs).
Is this expected behavior? If not, what can be done to improve dataset loading speed?
The working folder of the run is mounted cloud storage, which could be defaulting to the file storage in your workspace.
Can you try setting the blob datastore instead and see if performance improves?
run_config.source_directory_data_store = 'workspaceblobstore'
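For context, a minimal sketch of where that setting fits, assuming the v1 azureml-core SDK; the script name train.py is a placeholder:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import RunConfiguration
# Point the run's working directory at the workspace blob datastore instead of
# the default file share.
run_config = RunConfiguration()
run_config.source_directory_data_store = 'workspaceblobstore'
# 'train.py' is a placeholder for the training script used in the pipeline step.
src = ScriptRunConfig(source_directory='.', script='train.py', run_config=run_config)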
as_download() downloads the data to the current working directory, which is a mounted file share (or blob storage if you do what #reznikov suggested).
Unfortunately, for small files, neither blob nor file share is very performant (although blob is much better) -- see this reply for some measurements: Disk I/O extremely slow on P100-NC6s-V2
When you are mounting, the reason the first epoch is so slow is that blobfuse (which is used for mounting blobs) caches to the local SSD, so after the first epoch everything is on your SSD and you get the full performance.
As for why the first epoch takes much longer than the az copy, I would suspect that the data reader of the framework you are using does not pipeline the reads. What are you using?
You could try one of two things:
1. Mount, but at the beginning of the job copy the data from the mount path to /tmp and consume it from there (see the sketch after this list).
2. If option 1 is significantly slower than az copy, then don't mount. Instead, at the beginning of the job, copy the data to /tmp using az copy.
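A rough sketch of option 1, assuming the mount path is exposed to the training script; the environment variable name and folder paths below are placeholders:
import os
import shutil
# Placeholder: wherever as_mount() exposes the dataset inside the run.
mount_path = os.environ.get("DATASET_MOUNT", "/mnt/dataset")
local_path = "/tmp/dataset"
if not os.path.exists(local_path):
    # One bulk copy up front avoids 90,000 small random reads going through
    # blobfuse during the first epoch.
    shutil.copytree(mount_path, local_path)
data_dir = local_path  # point the data loader here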

Does Azcopy support moving and removing files

My question is: does AzCopy support moving and removing files from disk after copying to Azure Storage?
From the local disk, no, it only copies to the cloud and does not take responsibility for any deletion or move. Best practice is to wrap your AzCopy in a script that takes care of the local file handling based on a successful response from AzCopy.
You could roll your own version using the Storage Data Movement Library.
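As a sketch of the wrapper approach mentioned above, assuming AzCopy v10's copy syntax and that a zero exit code means success; the local path and SAS URL are placeholders:
import shutil
import subprocess
LOCAL_DIR = r"C:\data\outgoing"  # placeholder local folder
DEST = "https://<account>.blob.core.windows.net/<container>?<sas-token>"  # placeholder
result = subprocess.run(["azcopy", "copy", LOCAL_DIR, DEST, "--recursive"], check=False)
if result.returncode == 0:
    # "Move" semantics: delete the local copy only after a successful upload.
    shutil.rmtree(LOCAL_DIR)
else:
    print(f"AzCopy failed with exit code {result.returncode}; keeping local files")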

Download and unzip dataset directly in Azure

I need to load and unzip a 27 GB dataset directly in my Azure account so I can work on it with a Spark instance, using the textFile function, to do some machine learning. How can I do it?
I would like to write more, but I have spent so many hours searching the internet and still I am not able to achieve anything useful.
This is the dataset:
https://academicgraphwe.blob.core.windows.net/graph-2016-02-05/index.html
If "directly" means from that location to your VM, then the simplest way, in my opinion, is to use AzCopy.
For example, in your case it can be like that:
AzCopy /Source:https://academicgraphwe.blob.core.windows.net/graph-2016-02-05/ /Dest:C:\myfolder /SourceKey:key /Pattern:"abc.txt"
Install AzCopy on your VM and run the command. You do not need a SourceKey here, as it looks like your dataset is in a publicly available blob. But change the link to the actual file you need (the URL above points to an index page with a list of links).
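If you would rather do the download and unzip in Python on the VM (or on a Spark driver node), here is a rough sketch; the blob name Papers.zip is a placeholder for whichever file you pick from that index page:
import shutil
import urllib.request
import zipfile
# Placeholder blob name; pick the actual file from the index page linked above.
url = "https://academicgraphwe.blob.core.windows.net/graph-2016-02-05/Papers.zip"
local_zip = "/tmp/Papers.zip"
# Stream to disk so the multi-gigabyte download is not held in memory.
with urllib.request.urlopen(url) as resp, open(local_zip, "wb") as out:
    shutil.copyfileobj(resp, out)
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall("/tmp/dataset")  # textFile() can then read the extracted files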

Azure China Storage - AZCopy upload fail

I am trying to upload a 2.6 GB ISO to Azure China Storage using AzCopy from my machine here in the USA. I shared the file with a colleague in China and they didn't have a problem. Here is the command, which appears to work for about 30 minutes and then fails. I know there is a "Great Firewall of China," but I'm not sure how to get around the problem.
C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy> .\AzCopy.exe
/Source:C:\DevTrees\MyProject\Layout-Copy\Binaries\Iso\Full
/Dest:https://xdiso.blob.core.chinacloudapi.cn/iso
/DestKey:<my-key-here>
The network between the Azure server and your local machine is probably very slow, and AzCopy by default uses 8 * (number of cores) concurrent threads for data transfer, which might be too aggressive for a slow network.
I would suggest reducing the thread count with the /NC: parameter; set it to a smaller number such as /NC:2 or /NC:5 and see whether the transfer is more stable.
By the way, when the timeout issue occurs again, please resume with the same AzCopy command line; that way you always make progress through resume instead of starting from the beginning.
Since you're experiencing a timeout, you could try AzCopy in restartable mode like this:
C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy> .\AzCopy.exe
/Source:<path-to-my-source-data>
/Dest:<path-to-my-storage>
/DestKey:<my-key-here>
/Z:<path-to-my-journal-file>
The path to your journal file is arbitrary. For instance, you could set it to C:\temp\azcopy.log if you'd like.
Assume an interrupt occurs while copying your file, and 90% of the file has been transferred to Azure already. Then upon restarting, we will only transfer the remaining 10% of the file.
For more information, type .\AzCopy.exe /?:Z to find the following info:
Specifies a journal file folder for resuming an operation. AzCopy
always supports resuming if an operation has been interrupted.
If this option is not specified, or it is specified without a folder path,
then AzCopy will create the journal file in the default location,
which is %LocalAppData%\Microsoft\Azure\AzCopy.
Each time you issue a command to AzCopy, it checks whether a journal
file exists in the default folder, or whether it exists in a folder
that you specified via this option. If the journal file does not exist
in either place, AzCopy treats the operation as new and generates a
new journal file.
If the journal file does exist, AzCopy will check whether the command
line that you input matches the command line in the journal file.
If the two command lines match, AzCopy resumes the incomplete
operation. If they do not match, you will be prompted to either
overwrite the journal file to start a new operation, or to cancel the
current operation.
The journal file is deleted upon successful completion of the
operation.
Note that resuming an operation from a journal file created by a
previous version of AzCopy is not supported.
You can also find out more here: http://blogs.msdn.com/b/windowsazurestorage/archive/2013/09/07/azcopy-transfer-data-with-re-startable-mode-and-sas-token.aspx
