I have a use case that often requires copying a blob (file) from one Azure region to another. The file size ranges from 25 to 45 GB. Needless to say, this sometimes goes very slowly, with inconsistent performance. It might take up to two hours, sometimes more. Distance plays a role, but it varies; even within the same region copying is slower than I would expect. I've been trying:
The Python SDK, and its copy blob method from the blob service.
The Copy Blob REST API.
azcopy from the CLI.
I didn't really expect different results, though, since all of them use the same backend methods.
Is there any approach I am missing? Is there any way to speed this process up, or any kind of blob sharing built into Azure? VHD/disk sharing would also work.
You may want to try the /SyncCopy option in AzCopy:
Synchronously copy blobs from one storage account to another
AzCopy by default copies data between two storage endpoints asynchronously. The copy operation therefore runs in the background using spare bandwidth capacity, with no SLA on how fast a blob is copied, and AzCopy periodically checks the copy status until the copy completes or fails.
The /SyncCopy option ensures that the copy operation gets consistent speed. AzCopy performs the synchronous copy by downloading the blobs to copy from the specified source to local memory, and then uploading them to the Blob storage destination.
AzCopy /Source:https://myaccount1.blob.core.windows.net/myContainer/ /Dest:https://myaccount2.blob.core.windows.net/myContainer/ /SourceKey:key1 /DestKey:key2 /Pattern:ab /SyncCopy
/SyncCopy might generate additional egress cost compared to an asynchronous copy, so the recommended approach is to use this option from an Azure VM in the same region as your source storage account to avoid egress cost.
On Linux you can also try the --parallel-level option; look it up with azcopy --help. The official maximum is 512 concurrent operations. Go bonkers!
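If you prefer to script the same synchronous pattern yourself, it boils down to reading the source blob and re-uploading it to the destination. A minimal sketch with the Python SDK, assuming the azure-storage-blob v12 package and placeholder URLs/SAS tokens (for blobs in the 25-45 GB range you would stream in chunks rather than buffer the whole blob in memory):

from azure.storage.blob import BlobClient

# Placeholder URLs; in practice each would carry a SAS token or be built from credentials.
src = BlobClient.from_blob_url("https://myaccount1.blob.core.windows.net/myContainer/myblob?<SAS>")
dst = BlobClient.from_blob_url("https://myaccount2.blob.core.windows.net/myContainer/myblob?<SAS>")

# Synchronous copy through the machine running this script: download, then upload.
# Run it on a VM in the source region to keep speed up and egress cost down.
data = src.download_blob().readall()
dst.upload_blob(data, overwrite=True)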
I would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out. Is there any way to resume the copy intelligently? Is there any marker on the source blob storage that identifies blobs already copied/handled, so that the next Function run can skip them?
I would like to use an Azure Function to copy lots of files from blob
container A to another blob container B. However, some files were
missed because the Function timed out.
You can avoid this timeout problem by changing the plan level. For example, if you use an App Service plan and turn on Always On, there are no more timeout restrictions. But to be honest, if you have a lot of files and the copy takes a long time, then an Azure Function is not the recommended approach (the tasks a function performs should be lightweight).
Is there any marker on the source blob storage that identifies blobs
already copied/handled, so that the next Function run can skip them?
Yes, of course. Just add custom metadata to the source blob after it has been copied. The next time you copy files, check that metadata first and skip blobs that already carry it.
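A minimal sketch of that pattern with the Python SDK (azure-storage-blob v12); the connection string, container names and the "copied" metadata key are placeholders of my own choosing:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
src_container = service.get_container_client("container-a")
dst_container = service.get_container_client("container-b")

for blob in src_container.list_blobs(include=["metadata"]):
    # Skip blobs already marked as copied on a previous run.
    if blob.metadata and blob.metadata.get("copied") == "true":
        continue

    src_blob = src_container.get_blob_client(blob.name)
    dst_blob = dst_container.get_blob_client(blob.name)

    # Server-side copy; within the same storage account no SAS is needed on the source.
    # Across accounts, append a SAS token to src_blob.url.
    dst_blob.start_copy_from_url(src_blob.url)

    # Mark the source blob so the next run can skip it.
    src_blob.set_blob_metadata({"copied": "true"})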
It's a problem of plenty. You can:
Copy from the command line or code: the AZ CLI, AzCopy, or the .NET SDK (the same applies to the other language SDKs).
Use Storage Explorer.
Use Azure Data Factory, as Bowman suggested.
Use SSIS.
[Mis]use Databricks, especially if you are dealing with a massive amount of data and need scalability.
Write some code and use the new "Put XXX from URL" APIs, e.g. "Put Blob from URL" will create a new blob and "Put Block from URL" will create a block in a block blob (see the sketch after the next paragraph).
Options 1 and 2 use your local machine's internet bandwidth (download to local, then upload), whereas 3, 4 and 5 run entirely in the cloud. So even when your source and destination are in the same region, you'll end up paying egress charges with 1 and 2, whereas with 3, 4 and 5 you won't.
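For option 6, here is a minimal sketch; newer versions of the azure-storage-blob v12 SDK expose upload_blob_from_url as a wrapper around "Put Blob from URL", and the account names, container and SAS tokens below are placeholders. Note that "Put Blob from URL" has a maximum source size, so very large blobs would need "Put Block from URL" instead:

from azure.storage.blob import BlobClient

# The source must be readable by the service, e.g. a public blob or a URL carrying a SAS token.
source_url = "https://srcaccount.blob.core.windows.net/mycontainer/video.mp4?<SAS>"

dst = BlobClient.from_blob_url(
    "https://dstaccount.blob.core.windows.net/mycontainer/video.mp4?<SAS>"
)

# Synchronous, server-side copy: the service pulls the bytes straight from source_url,
# so nothing flows through the machine running this script.
dst.upload_blob_from_url(source_url, overwrite=True)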
Using Azure Functions to copy files is probably the worst thing you can do. Azure Functions cost is proportional to execution time (and memory usage). In this case, since it's taking more than 10 minutes, I assume you're moving a large amount of data, so you'll be paying for your Azure Function just to sit and wait on I/O while a file transfer completes.
I have a storage account with over 600k blobs.
I want to move them to another storage account, in a different region.
After googling I found someone recommending Azure Storage Explorer. When I tried it, it was extremely slow. It looked like it was going to take about a week to transfer them all, but after 24 hours it cancelled the copy, and I can see no option to restart it.
Is there a fast and convenient way of moving a large number of blobs from one storage account, in one region, to another storage account, in another region?
You are looking for the azcopy command. Example:
azcopy cp "https://<source-storage-account-name>.blob.core.windows.net/<container-name>/<blob-path>?<SAS-token>" "https://<destination-storage-account-name>.blob.core.windows.net/<container-name>/<blob-path>"
Read more here
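If you would rather drive the transfer from code, the same server-side copy can be issued blob by blob with the Python SDK. A minimal sketch, assuming azure-storage-blob v12, placeholder connection strings and container names, and a read SAS on the source container so the destination account is allowed to pull the data:

from azure.storage.blob import BlobServiceClient

src_service = BlobServiceClient.from_connection_string("<source-connection-string>")
dst_service = BlobServiceClient.from_connection_string("<destination-connection-string>")

src_container = src_service.get_container_client("mycontainer")
dst_container = dst_service.get_container_client("mycontainer")

sas = "<source-container-read-SAS>"  # placeholder

for blob in src_container.list_blobs():
    dst_blob = dst_container.get_blob_client(blob.name)
    # Asynchronous server-side copy; the data never touches the machine running this script.
    dst_blob.start_copy_from_url(f"{src_container.url}/{blob.name}?{sas}")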
Azure Blob Storage does not expose any kind of "blob rename" operation, which sounds preposterous because renaming an entity is a fundamental operation in almost any storage system. Azure's documentation makes no reference to how a blob's name is used internally (e.g. as a DHT key), but since we can specify our own names it's clear that Azure isn't using a content-addressable storage model, so renaming should be possible once the Azure Storage team decides to allow it.
Microsoft instead advocates that to "rename" a blob you simply copy it and then delete the original, which seems incredibly inefficient: imagine a 200 GB video file blob with a typo in its name. Unless Azure internally has some kind of dedupe system, in which case it makes perfect sense to eliminate the special case of "blob renaming", because internally it really would be just a "name copy" operation.
Unfortunately the current documentation for blob copy ( https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/copy-blob ) does not describe any internal processes and, in fact, suggests that a blob copy might be a very long-running operation:
State of the copy operation, with these values:
success: the copy completed successfully.
pending: the copy is in progress.
If it were using a dedupe system internally, all blob copy operations would be instantaneous and there would be no need for an "in progress" status. Confusingly, it also uses "pending" to mean "in progress", when normally "pending" means "enqueued, not started yet".
Alarmingly, the documentation also states this:
A copy attempt that has not completed after 2 weeks times out and leaves an empty blob
...which can be read as saying there are zero guarantees about how long it takes to copy a blob. Nothing on the page suggests smaller blobs are copied more quickly than bigger ones, so for some reason (a long queue, an unfortunate outage, and so on) it could take two weeks to correct the hypothetical typo in my hypothetical 200 GB video file. And don't forget that I cannot delete the original misnamed blob until the copy operation has completed, which means designing my client software to keep checking and eventually issue the delete operation (and ensuring my software runs continuously for up to two weeks...).
Is there any authoritative information regarding the runtime characteristics and nature of Azure Blob copy operations?
As you may already know, the Copy Blob operation is asynchronous, and everything you mention above is true, with one caveat: the copy is effectively synchronous when copying within the same storage account. You get the same copy state whether you're copying blobs across storage accounts or within one, but when the operation is performed within the same storage account it completes almost instantaneously.
So when you rename a blob, you're creating a copy of the blob in the same storage account (even the same container), which is instantaneous. I am not 100% sure about the internal implementation, but if I am not mistaken, when you copy a blob within the same storage account the service doesn't copy the bytes to some separate place; it just creates two pointers (the new blob and the old blob) pointing to the same stored data. Once you start making changes to either blob, I believe that is when it actually goes and touches those bytes.
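For completeness, a minimal "rename" sketch along those lines with the Python SDK (azure-storage-blob v12; the connection string, container and blob names are placeholders). It issues the copy, waits until the copy leaves the pending state, and only then deletes the original:

import time
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("videos")

old_blob = container.get_blob_client("vidoe-with-typo.mp4")
new_blob = container.get_blob_client("video.mp4")

# Same-account copy, so no SAS is needed on the source URL.
new_blob.start_copy_from_url(old_blob.url)

# Poll the copy status; within the same storage account this usually succeeds immediately.
while new_blob.get_blob_properties().copy.status == "pending":
    time.sleep(1)

# Only remove the misnamed original once the copy has actually succeeded.
if new_blob.get_blob_properties().copy.status == "success":
    old_blob.delete_blob()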
For internal understanding of Azure Storage, I would highly recommend that you read the paper published by the team a few years ago. Please look at my answer here which has links to this paper: Azure storage underlying technology.
We use block blobs to store some durable resources, and page blobs to store event data.
We need to back up the blobs, so I tried AzCopy. It works fine on my dev machine, but on another, slower machine it fails almost every time with the error "The remote server returned an error: (412) The condition specified using HTTP conditional header(s) is not met."
We write to the page blobs quite often (up to several times a second, though that is not the common case), so this might be the reason.
Is there a better strategy for backing up blobs that keep changing? Or is there any way to work around the problem with the ETag that AzCopy uses?
A changed ETag will always halt a copy, since a changing ETag signifies that the source has changed.
The general approach to blob backup is subjective, but objectively:
Blob copies within Azure itself, in the same region, from storage account to storage account, are going to be significantly faster than trying to copy a blob to an on-premises location (due to general Internet latency), or even than copying from a storage account to a local disk on a VM.
Blobs support snapshots, which are created more or less instantly. If you create a snapshot, the snapshot remains unchanged, allowing you to run the copy operation against the snapshot instead of the live blob (using AzCopy in your case) without fear of the source data changing during the copy. Note: you can create as many snapshots as you want; just be careful, since storage consumption grows as the underlying original blob changes.
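A minimal sketch of that snapshot-then-copy approach with the Python SDK (azure-storage-blob v12; the connection string, container and blob names are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
src_blob = service.get_blob_client("events", "page-blob-1")
dst_blob = service.get_blob_client("backups", "page-blob-1")

# Take a point-in-time snapshot; its contents (and ETag) will not change
# even while the live blob keeps receiving writes.
snapshot = src_blob.create_snapshot()

# Copy from the snapshot rather than the live blob, so the 412 precondition
# failure caused by concurrent writes cannot occur.
snapshot_url = f"{src_blob.url}?snapshot={snapshot['snapshot']}"
dst_blob.start_copy_from_url(snapshot_url)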
Summary
I have picked up support for a fairly old website which stores a bunch of blobs in Azure. What I would like to do is duplicate all of my blobs from live to the test environment so I can use them without affecting users.
Architecture
The website is a mix of VB webforms and MVC, communicating with an Azure blob service (e.g. https://x.blob.core.windows.net/LiveBlobs).
The test site mirrors the live setup, except it points to a different blob container in the same storage account (e.g. https://x.blob.core.windows.net/TestBlobs)
Questions
Can I copy all of the blobs from live to test without downloading
them? They would need to maintain the same names.
How do I work out what it will cost to do this? The live blob
storage is roughly 130GB, but it should just be copying the data within the same data centre right?
Things I've investigated
I've spent quite some time searching for an answer, but what I've found deals with copying between storage accounts or copying single blobs.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
I am fairly new to Azure so please forgive me if this is a silly question or I've missed out some important details. I'm more than happy to add any extra information should you need it.
Can I copy all of the blobs from live to test without downloading
them? They would need to maintain the same names.
Yes, you can. Copying a blob is an asynchronous server-side operation. You simply tell the blob service which blobs to copy and where to put them, and it does the job for you; there's no need to download the blobs first and upload them to the destination.
How do I work out what it will cost to do this? The live blob storage
is roughly 130GB, but it should just be copying the data within the
same data centre right?
There are three things you need to consider when it comes to cost: 1) storage costs, 2) transaction costs, and 3) data egress costs.
Since the copied blobs will be stored somewhere, they will be consuming storage and you will incur storage costs.
The copy operation performs read operations on the source blobs and write operations on the destination blobs (to create them), so you will incur transaction costs. At a minimum, expect two transactions per blob copied: a read on the source and a write on the destination (though there can be more).
You incur data egress costs if the destination storage account is not in the same region as your source storage account. As long as both storage accounts are in the same region, you would not incur this.
You can use Azure Storage Pricing Calculator to get an idea about how much it is going to cost you.
I've also found AzCopy which looks promising but it looks like it
would copy the files one by one so I'm worried it would end up taking
a long time and costing a lot.
Blobs are always copied one by one. Copying across storage accounts is always an asynchronous server-side operation, so you can't really predict how long the copy will take, but in my experience it is quite fast. If you want to control when the blobs are copied, you would need to download them first and then upload them; AzCopy supports this mode as well.
As far as costs are concerned, "a lot" is a relative term, but in general Azure Storage is very cheap and 130 GB is not a whole lot of data.