My question is: Does AZCOPY support moving and removing files from disk after copy to Azure storage?
No. From the local disk AzCopy only copies to the cloud; it does not delete or move the source files. Best practice is to wrap AzCopy in a script that handles the local files based on a successful response from AzCopy.
You could roll your own version using the Storage Data Movement Library.
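For example, a minimal sketch of that wrapper approach in Python, assuming AzCopy returns a zero exit code on success (the local folder and destination URL are placeholders):

import os
import subprocess

LOCAL_DIR = r"C:\data\outgoing"  # placeholder: local folder to upload
DEST_URL = "https://<account>.blob.core.windows.net/<container>?<SAS>"  # placeholder

# Run AzCopy and only delete the local files if it reports success.
result = subprocess.run(
    ["azcopy", "copy", LOCAL_DIR, DEST_URL, "--recursive=true"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # Upload succeeded: remove the local copies.
    for root, _dirs, files in os.walk(LOCAL_DIR, topdown=False):
        for name in files:
            os.remove(os.path.join(root, name))
else:
    print("AzCopy failed; keeping local files.")
    print(result.stderr)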
Related
I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?
I have tried:
Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary, but I chose 1000 because I was concerned about overloading the command; even 1000 files makes a command with 84k characters.) Extra caveat here: I cannot rename the files with this method, which is required for about 25% of them due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.
Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy and then copying the files with that tag, but I couldn't find the arguments to do that.
Please check the references below:
1. AzCopy gives the best performance for copying blobs within the same storage account or to another storage account. You can force a synchronous copy by specifying the "/SyncCopy" parameter, which ensures the copy operation runs at a consistent speed (azcopy sync | Microsoft Docs). Note that AzCopy performs the synchronous copy by downloading the blobs to local memory and then uploading them to the Blob storage destination, so performance also depends on network conditions between the location where AzCopy is run and the Azure DC location. Also note that /SyncCopy might generate additional egress cost compared to an asynchronous copy; the recommended approach is to use this sync option with AzCopy on an Azure VM in the same region as your source storage account to avoid egress cost.
Choose a tool and strategy to copy blobs - Learn | Microsoft Docs
2. StartCopyAsync is one of the ways you can try for copying within a storage account (see the Python sketch after this list).
References:
1. .net - Copying file across Azure container without using azcopy - Stack Overflow
2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
3. You may consider Azure Data Factory for millions of files, but note that it can be expensive and occasional timeouts may occur; it may be worth it for repeated work of this kind.
References:
1. Copy millions of files (andrewconnell.com), GitHub (Microsoft Docs)
2. File Transfer between container to another container - Microsoft Q&A
4. Also check out and try Azure Storage Explorer to copy a blob container to another.
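For the bulk container-to-container copy, note that start_copy_from_url schedules a server-side copy, so the roughly 7 files per second is mostly per-request overhead from issuing the copies one at a time. A minimal sketch of parallelising the scheduling with a thread pool, assuming azure-storage-blob and that the source blobs are readable at their plain URLs from within the same account as in the question (the connection string, container names, and blob list are placeholders):

from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

CONN_STR = "<connection string>"           # placeholder
SOURCE_CONTAINER = "source-container"      # placeholder
DEST_CONTAINER = "dest-container"          # placeholder
blob_names = ["folder/file1.dat", "folder/file2.dat"]  # placeholder: the ~100k names in practice

service = BlobServiceClient.from_connection_string(CONN_STR)
source_container = service.get_container_client(SOURCE_CONTAINER)
dest_container = service.get_container_client(DEST_CONTAINER)

def schedule_copy(name):
    # start_copy_from_url only schedules a server-side copy; the service
    # completes it in the background, so scheduling many in parallel is cheap.
    source_url = source_container.get_blob_client(name).url
    dest_container.get_blob_client(name).start_copy_from_url(source_url)

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(schedule_copy, blob_names))

If the source container is not readable with the destination's credentials, append a SAS token to source_url before starting the copy.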
I'm using azcopy to upload local files to a blob storage.
I'm using command:
azcopy copy "localpath" "destinationpath(with SAS)" --include="*.csv" --recursive=true
I also tried
azcopy sync "localpath" "destinationpath(with SAS)" --include="*.csv"
The files I'm trying to upload are each 1GB+.
When I manually upload a file to the data lake it takes 40min+ for 1 file. If I do it with azcopy it takes 30min+ per file and often fails.
Is it normal that it takes this long? Am I doing something wrong or is there a faster way of doing this?
As you may know, AzCopy is optimized for performance, and I don't see anything missing in your command. If that's the case, there is not much more to tune on the AzCopy side (although you may want to check whether it's a network issue).
You can try Azure Data Factory, which provides very high performance and can reach up to 1 GB/s loading speed into Data Lake Storage Gen1.
It's taking 1 hour to download a 48 GB dataset with 90,000 files.
I am training an image segmentation model on Azure ML pipeline using compute target p100-nc6s-v2.
In my script I'm accessing Azure Blob Storage using DataReference's as_download() functionality. The blob storage is in the same location as workspace (using get_default_datastore).
Note: I'm able to download complete dataset to local workstation within a few minutes using az copy.
When I tried to use as_mount() the first epoch was extremely slow (4700 seconds vs 772 seconds for subsequent epochs).
Is this expected behavior? If not, what can be done to improve dataset loading speed?
The working folder of the run is mounted cloud storage, which could be defaulting to the file storage in your workspace.
Can you try setting the blob datastore instead and see if performance improves?
run_config.source_directory_data_store = 'workspaceblobstore'
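For context, a minimal sketch of where that setting goes, assuming the v1 azureml-core SDK (the script name, source directory, and experiment name are placeholders):

from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()

run_config = RunConfiguration()
run_config.target = 'p100-nc6s-v2'  # the compute target from the question
# Use the workspace's default blob datastore for the run's working folder
# instead of the default file share.
run_config.source_directory_data_store = 'workspaceblobstore'

src = ScriptRunConfig(
    source_directory='.',   # placeholder: folder containing the training script
    script='train.py',      # placeholder
    run_config=run_config,
)

Experiment(ws, 'segmentation-perf-test').submit(src)  # placeholder experiment name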
as_download() downloads the data to the current working directory, which is a mounted file share (or blob if you do what #reznikov suggested).
Unfortunately, for small files, neither blob nor file-share are very performant (although blob is much better) -- see this reply for some measurements: Disk I/O extremely slow on P100-NC6s-V2
When you are mounting, the reason the first epoch is so slow lies in the fact that blobfuse (which is used for mounting blobs) caches to the local SSD, so after the first epoch everything is on your SSD and you get full performance.
As for why the first epoch takes much longer than the az copy, I would suspect that the data reader of the framework you are using does not pipeline the reads. What are you using?
You could try one of two things:
1. Mount, but at the beginning of the job copy the data from the mount path to /tmp and consume it from there (a minimal sketch follows below).
2. If #1 is significantly slower than az copy, then don't mount. Instead, at the beginning of the job, copy the data to /tmp using az copy.
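A minimal sketch of option 1, assuming the mounted dataset path is passed to the training script as an argument (the argument name and target folder are placeholders):

import argparse
import shutil

parser = argparse.ArgumentParser()
parser.add_argument('--data-path', type=str)  # placeholder: the mount point passed in by the pipeline
args = parser.parse_args()

local_copy = '/tmp/dataset'
# Bulk-copy the mounted data onto the node's local SSD once, then read
# everything from local disk during training.
shutil.copytree(args.data_path, local_copy)

# ...point the data loader at local_copy from here on.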
I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes in to ADLS. During execution I notice that there is a folder created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and many files in it, also GUIDs. The folder is temporary and after around 30 seconds it disappears.
Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? Although it appears in the Data Explorer in the portal, does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.
Any insight appreciated - a Google search turned up little.
So what you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.
For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory, one for each thread of the parallel operation. Once complete, the process concatenates those pieces into the single expected destination file.
Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.
I understand your concerns here. I've also done battle with this, whereby the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly, which caused corruption in my structured dataset. This was with files in the 4-8 GB region. Be warned!
Side note. I've found PowerShell most reliable for handling uploads into Data Lake Store.
Hope this helps.
Is an Azure blob available for download while it is being overwritten with a new version?
From my tests using Cloud Storage Studio the download is blocked until the overwrite is completed, however my tests are from the same machine so I can't be sure this is correct.
If it isn't available during an overwrite, then I presume the solution (to maintain availability) would be to upload using a different blob name and then rename once complete. Does anyone have any better solution than this?
The blob is available during overwrite. What you see will depend on whether you are using a block blob or a page blob, however. For block blobs, you will download the older version until the final block commit. That final PutBlockList operation atomically updates the blob to the new version. However, I am not actually sure what happens for a very large blob that you are in the middle of downloading when a PutBlockList atomically updates it. Choices are: a.) the request continues with the older blob, b.) the connection is broken, or c.) you start downloading bytes of the new blob. What a fun thing to test!
If you are using page blobs (without a lease), you will read inconsistent data as the page ranges are updated underneath you. Each page range update is atomic, but it will look weird unless you lease the blob and keep other readers out (readers can snapshot a leased blob and read the state).
I might try to test the block blob update in middle of read scenario to see what happens. However, your core question should be answered: the blob is available.
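As an illustration of the snapshot approach mentioned above, a minimal sketch with the azure-storage-blob Python package (the connection string, container, and blob name are placeholders):

from azure.storage.blob import BlobClient

CONN_STR = "<connection string>"  # placeholder
CONTAINER = "mycontainer"         # placeholder
BLOB_NAME = "data.bin"            # placeholder

blob = BlobClient.from_connection_string(CONN_STR, CONTAINER, BLOB_NAME)

# Take a point-in-time snapshot; later updates to the base blob do not
# affect the snapshot's contents.
snapshot_props = blob.create_snapshot()

# Download from the snapshot rather than the live blob for a consistent read.
snapshot_reader = BlobClient.from_connection_string(
    CONN_STR, CONTAINER, BLOB_NAME, snapshot=snapshot_props["snapshot"]
)
consistent_bytes = snapshot_reader.download_blob().readall()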