Are Azure Blob copy operations cheap? - azure

Azure Blob Storage does not expose any kind of "blob rename" operation - which sounds preposterous, because renaming an entity is a fundamental operation in almost any storage system. Azure's documentation makes no reference to how a blob's name is used internally (e.g. as a DHT key), but because we can specify our own names it's clear that Azure isn't using a content-addressable storage model - so renaming should be possible, once the Azure Storage team decides to allow it.
Microsoft advocates instead that to "rename" a blob you simply copy it and then delete the original - which seems incredibly inefficient if, for example, you have a 200GB video file blob with a typo in its name - unless internally Azure has some kind of dedupe system, in which case it makes perfect sense to eliminate the special case of "blob renaming", because internally it really would be just a "name copy" operation.
Unfortunately the current documentation for blob copy ( https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/copy-blob ) does not describe any internal processes and, in fact, suggests that a blob copy might be a very long operation:
State of the copy operation, with these values:
success: the copy completed successfully.
pending: the copy is in progress.
If it were using a dedupe system internally then all blob copy operations would be instantaneous, so there would be no need for an "in progress" status. Confusingly, it also uses "pending" to mean "in progress", when "pending" normally means "enqueued, not started yet".
Alarmingly, the documentation also states this:
A copy attempt that has not completed after 2 weeks times out and leaves an empty blob
...which can be read as saying there are zero guarantees about how long it takes to copy a blob. Nothing on the page suggests smaller blobs are copied faster than bigger blobs - so for some reason (a long queue, an unfortunate outage, and so on) it could take 2 weeks to correct my hypothetical typo in my hypothetical 200GB video file. And don't forget that I cannot delete my original misnamed blob until the copy operation has completed - which means designing my client software to keep checking and eventually issue the delete operation (and ensuring my software runs continuously for up to 2 weeks...).
Is there any authoritative information regarding the runtime characteristics and nature of Azure Blob copy operations?

As you may already know, the Copy Blob operation is an asynchronous operation, and all the things you mentioned above are true, with one caveat: the copy is effectively synchronous when copying within the same storage account. You get the same copy states whether you're copying blobs across storage accounts or within a single storage account, but when the operation is performed within the same storage account it completes almost instantaneously.
So when you rename a blob, you're creating a copy of the blob in the same storage account (even the same container), which is instantaneous. I am not 100% sure about the internal implementation, but if I am not mistaken, when you copy a blob within the same storage account it doesn't copy the bytes to some separate place; it just creates 2 pointers (the new blob and the old blob) pointing to the same stored data. Once you start making changes to one of the blobs, I think that is when it actually writes new bytes.
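To make the pattern concrete, here is a minimal sketch of the copy-then-delete "rename" using the Azure.Storage.Blobs v12 .NET SDK; the connection string, container and blob names are placeholders invented for illustration, and the polling loop is only there for the (rare) case where the service does not report the copy as finished immediately:

```csharp
// Minimal sketch: "rename" a blob via server-side copy + delete (Azure.Storage.Blobs v12).
// The connection string, container and blob names below are placeholders.
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

string connectionString = "<storage-connection-string>";
var container = new BlobServiceClient(connectionString).GetBlobContainerClient("videos");

BlobClient source = container.GetBlobClient("my-video-with-typo.mp4");
BlobClient target = container.GetBlobClient("my-video.mp4");

// Server-side copy; within the same storage account this usually completes immediately.
await target.StartCopyFromUriAsync(source.Uri);

// Poll the copy status so we only delete the source once the copy has succeeded.
BlobProperties props = await target.GetPropertiesAsync();
while (props.CopyStatus == CopyStatus.Pending)
{
    await Task.Delay(TimeSpan.FromSeconds(1));
    props = await target.GetPropertiesAsync();
}

if (props.CopyStatus == CopyStatus.Success)
{
    await source.DeleteAsync();
}
```

The CopyFromUriOperation returned by StartCopyFromUriAsync also exposes WaitForCompletionAsync(), which does essentially the same polling for you.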
For internal understanding of Azure Storage, I would highly recommend that you read the paper published by the team a few years ago. Please look at my answer here which has links to this paper: Azure storage underlying technology.

Related

Azure Blob Storage : Virtual Folder structure vs Blob Index Tags

I am trying to figure out what the benefit of Index Tags is versus creating a full virtual folder tree structure in Azure Blob Storage, when I have full programmatic control over the creation of the blobs.
Virtual Folder structure vs Blob Index Tags
You're asking us to compare just two separate features of Azure Blob Storage as though they were mutually exclusive, when in fact they can be used together - and there are more options for organizing blobs than just those two:
TL;DR:
Azure Blob Index Tags - arbitrary mutable tags on your blobs.
Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0-compliance) however it (currently) comes with major shortcomings.
In detail:
Azure Blob Index Tags is a recently introduced new feature to Azure Blob Storage: it entered preview in May 2020 and left the preview-stage in June 2021 (2 months ago at the time of writing).
Your storage account needs to be "General Purpose v2" - so if you have an older-style storage account you'll need to upgrade it.
Advantages:
It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
Tags are mutable: you can add/remove/edit them as you like.
Disadvantages:
As with maintaining your own blob index, the index updates are not instantaneous (unlike an RDBMS, where indexes are always up-to-date). The linked blog article handwaves this away by saying:
"...and the account indexing engine exposes the new blob index shortly after."
...note that they don't define what "shortly" means.
As of August 2021, Azure charges $0.03 per 10,000 tags (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, then that's $9/mo.
This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
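For illustration, here is a minimal sketch of setting and querying index tags with the Azure.Storage.Blobs v12 .NET SDK; the connection string, container/blob names, and the "project"/"status" tags are an invented example taxonomy:

```csharp
// Minimal sketch: setting and querying Blob Index Tags (Azure.Storage.Blobs v12).
// The tag names/values ("project", "status") are just an example taxonomy.
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;

string connectionString = "<storage-connection-string>";
var service = new BlobServiceClient(connectionString);
var blob = service.GetBlobContainerClient("assets").GetBlobClient("report-2021-08.pdf");

// Tags are mutable: this call replaces the blob's existing tag set.
await blob.SetTagsAsync(new Dictionary<string, string>
{
    ["project"] = "apollo",
    ["status"]  = "approved"
});

// Query across the whole account; new tags only show up once the index catches up.
await foreach (var hit in service.FindBlobsByTagsAsync("\"project\" = 'apollo' AND \"status\" = 'approved'"))
{
    Console.WriteLine(hit.BlobName);
}
```

Because of the indexing delay mentioned above, a blob you have just tagged may not appear in the FindBlobsByTagsAsync results immediately.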
"Virtual Folder tree structure" - By this I assume you mean giving your blob's hierarchical naming system and using Azure Blob Storage's blob-name-prefix search filter.
Advantages:
Tried-and-tested. Simple.
Doesn't cost you anything.
No indexing delay.
Disadvantages:
It's still as slow as enumerating blobs lexicographically.
You cannot conceptually move or rename blobs.
(You can, technically, provided the source and destination are in the same container, by doing a copy+delete, and the copy should be instantaneous - as I understand it, Blob Storage uses copy-on-write for same-container copies - but it's still imperfect: the client API still exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees.)
The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
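As promised above, here is a minimal sketch of the naming-convention approach with the Azure.Storage.Blobs v12 .NET SDK; the connection string, container name and "folder" names are invented placeholders:

```csharp
// Minimal sketch: "virtual folders" are just slash-separated blob names plus prefix listing.
using System;
using Azure.Storage.Blobs;

string connectionString = "<storage-connection-string>";
var container = new BlobServiceClient(connectionString).GetBlobContainerClient("media");

// No directory is ever created; the slashes are simply part of the blob name.
await container.UploadBlobAsync("videos/2021/08/intro.mp4", BinaryData.FromString("example content"));

// Enumerate everything "inside" videos/2021/ - a lexicographic prefix scan.
await foreach (var item in container.GetBlobsAsync(prefix: "videos/2021/"))
{
    Console.WriteLine(item.Name);
}
```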
NFS 3.0 Blob Storage - Also new in 2020/2021 with Blob Index Tags is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.
The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know any technical details of this so I can't say anything.
Advantages:
NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
It's cheaper than normal blob storage (whaaaaat?!):
In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
In other Azure locations and with other redundancy options then NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
Disadvantages:
Apparently you're limited to only block-blobs: not page-blobs or append-blobs.
Notes from the Known Issues page:
NFS can only be used with new accounts: you cannot update an existing account. You also cannot disable it once you enable it.
You cannot (currently) lock blobs/files - though this looks to come in a future version.
You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
The documentation for operations exclusive to hierarchical-namespace blobs only lists Set Blob Expiry - there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation; instead, the Protocol Support page implies that an operation to rename an NFS file is translated into raw blob storage operations behind the scenes... so I'm curious how they do that atomically.
When your application makes a request by using the NFS 3.0 protocol, that request is translated into combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.
Alternative concept: Content-addressable-storage
Because blobs cannot be atomically/synchronously renamed, a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time, because business requirements always change.
Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
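For illustration, a minimal sketch of this scheme using the Azure.Storage.Blobs v12 .NET SDK; the connection string, container name, and local file path are placeholders:

```csharp
// Minimal sketch: content-addressable blob naming (blob name = Base-16 SHA-256 of content).
using System;
using System.IO;
using System.Security.Cryptography;
using Azure.Storage.Blobs;

string connectionString = "<storage-connection-string>";
string localPath = "video.mp4"; // placeholder

var container = new BlobServiceClient(connectionString).GetBlobContainerClient("content");

byte[] content = await File.ReadAllBytesAsync(localPath);
string blobName = Convert.ToHexString(SHA256.HashData(content)); // 64 hex characters

BlobClient blob = container.GetBlobClient(blobName);

// Identical content always hashes to the same name, so duplicates are never re-uploaded.
if (!await blob.ExistsAsync())
{
    await blob.UploadAsync(BinaryData.FromBytes(content));
}

// Record blobName against your domain entities in the RDBMS.
```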
Advantages:
You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
You get integrity checks for free: if you download a blob and its hash doesn't match its blob name, then your storage account has likely been tampered with.
Disadvantages:
You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.

Azure Function(C#): How to copy lots of files from blob container A to another blob container B? (Function has timeout in 10 mins)

Would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out. Is there any method to resume it smartly? Is there any indication on the source blob storage that identifies blobs as copied/handled before, so that the next Function run can skip copying them?
Would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out.
You can avoid this timeout problem by changing the plan level. For example, if you use the App Service plan and turn on Always On, there are no more timeout restrictions. But to be honest, if you have a lot of files and it takes a long time, then an Azure Function is not the recommended method (tasks performed by Functions should be lightweight).
Is there any indication on the source blob storage that identifies blobs as copied/handled before, so that the next Function run can skip copying them?
Yes, of course you can. Just add custom metadata to each blob after it has been copied. The next time you copy files, check that custom metadata first.
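A minimal sketch of that approach with the Azure.Storage.Blobs v12 .NET SDK; the connection string, container names and the "copied" metadata key are invented for illustration:

```csharp
// Minimal sketch: mark copied blobs with custom metadata so the next run can skip them.
// The "copied" metadata key and the container names are arbitrary placeholders.
using System.Collections.Generic;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

string connectionString = "<storage-connection-string>";
var service = new BlobServiceClient(connectionString);
var sourceContainer = service.GetBlobContainerClient("container-a");
var targetContainer = service.GetBlobContainerClient("container-b");

// Request metadata while listing so the marker can be checked without extra calls.
await foreach (BlobItem item in sourceContainer.GetBlobsAsync(BlobTraits.Metadata))
{
    if (item.Metadata.ContainsKey("copied"))
        continue; // already handled by a previous run

    var source = sourceContainer.GetBlobClient(item.Name);

    // Same-account copy; a source in a different account would need a SAS on its URI.
    await targetContainer.GetBlobClient(item.Name).StartCopyFromUriAsync(source.Uri);

    // Mark the source so the next invocation skips it (note: this replaces existing metadata).
    await source.SetMetadataAsync(new Dictionary<string, string> { ["copied"] = "true" });
}
```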
It's a problem of plenty. You can:
1. Copy from the command line or code: AZ CLI, AzCopy, or the .NET SDK (which extends to the other language SDKs).
2. Use Storage Explorer.
3. Use Azure Data Factory, as Bowman suggested.
4. Use SSIS.
5. [Mis]use Databricks, especially if you are dealing with massive amounts of data and need scalability.
6. Write some code using the new "Put XXX from URL" APIs. E.g. "Put Blob from URL" creates a new blob; "Put Block from URL" creates a block in a block blob.
Options 1 and 2 use your local machine's internet bandwidth (download to local, then upload), whereas options 3-6 run entirely in the cloud. So even if your source and destination are in the same region, with options 1 and 2 you'll end up paying egress charges, whereas with 3-6 you won't.
Using Azure Functions to copy files is probably the worst thing you can do. Azure Functions cost is proportional to execution time (and memory usage), and in this case (as it's taking more than 10 minutes) I assume you're moving a large amount of data, so you'll be paying for your Azure Function while it just waits on I/O for the file transfers to complete.

Azure blobs backup

We use:
block blobs to store some durable resources, and
page blobs to store event data.
We need to back up the blobs, so I tried to use AzCopy. It works OK on my dev machine, but on another, slower machine it fails almost every time with the error "The remote server returned an error: (412) The condition specified using HTTP conditional header(s) is not met."
We write to the page blobs quite often (it might be up to several times a second, though this is not the common case), so this might be the reason.
Is there any better strategy for backing up changing blobs? Or is there any way to bypass the problem with the ETag check used by AzCopy?
A changed ETag will always halt a copy, since a changing ETag signifies that the source has changed.
The general approach to blob backup is subjective, but objectively:
blob copies within Azure itself, in the same region, from Storage account to Storage account, are going to be significantly faster than trying to copy a blob to an on-premises location (due to general Internet latency) or even copying from storage account to local disk on a VM.
Blobs support snapshots (which are taken almost instantly). If you create a snapshot, the snapshot remains unchanged, allowing you to then run your copy operation against the snapshot instead of the live blob (using AzCopy in your case) without fear of the source data changing during the copy. Note: you can create as many snapshots as you want; just be careful, since storage consumption grows as the underlying original blob changes.
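For illustration, a minimal sketch of the snapshot-then-copy approach using the Azure.Storage.Blobs v12 .NET SDK rather than AzCopy; the connection strings, container and blob names are placeholders:

```csharp
// Minimal sketch: snapshot a frequently-written blob, then copy the immutable snapshot.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

string sourceConnectionString = "<source-connection-string>";
string backupConnectionString = "<backup-connection-string>";

var live = new BlobServiceClient(sourceConnectionString)
    .GetBlobContainerClient("events")
    .GetBlobClient("event-log.dat");

var backup = new BlobServiceClient(backupConnectionString)
    .GetBlobContainerClient("backups")
    .GetBlobClient("event-log.dat");

// The snapshot is a read-only, point-in-time view; writers can keep modifying the live blob.
BlobSnapshotInfo snapshotInfo = await live.CreateSnapshotAsync();
BlobClient frozen = live.WithSnapshot(snapshotInfo.Snapshot);

// Copy from the snapshot URI, so a changing source ETag can no longer abort the copy.
// If the backup account is a different account, the snapshot URI needs a SAS appended.
await backup.StartCopyFromUriAsync(frozen.Uri);
```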

How to clone blob container and contents

Summary
I have picked up support for a fairly old website which stores a bunch of blobs in Azure. What I would like to do is duplicate all of my blobs from live to the test environment so I can use them without affecting users.
Architecture
The website is a mix of VB webforms and MVC, communicating with an Azure blob service (e.g. https://x.blob.core.windows.net/LiveBlobs).
The test site mirrors the live setup, except it points to a different blob container in the same storage account (e.g. https://x.blob.core.windows.net/TestBlobs)
Questions
Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre, right?
Things I've investigated
I've spent quite some time searching for an answer, but what I've found deals with copying between storage accounts or copying single blobs.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
I am fairly new to Azure so please forgive me if this is a silly question or I've missed out some important details. I'm more than happy to add any extra information should you need it.
Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
Yes, you can. Copying a blob is an asynchronous server-side operation. You simply tell the blob service which blobs to copy and the destination details, and it will do the job for you - no need to download the blobs first and re-upload them to the destination.
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre, right?
So there are 3 things you need to consider when it comes to costing: 1) Storage costs, 2) transaction costs and 3) data egress costs.
Since the copied blobs will be stored somewhere, they will be consuming storage and you will incur storage costs.
The copy operation will perform read operations on the source blobs and write operations on the destination blobs (to create them), so you will incur transaction costs. At a very minimum you can expect 2 transactions per blob copied - a read on the source and a write on the destination (though there can be more).
You would incur data egress costs only if the destination storage account were not in the same region as your source storage account. As long as both storage accounts are in the same region, you will not incur this cost.
You can use Azure Storage Pricing Calculator to get an idea about how much it is going to cost you.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
Blobs are always copied one by one. Copying across storage accounts is always an asynchronous server-side operation, so you can't really predict how long the copy operation will take, but in my experience it is quite fast. If you want to control when the blobs are copied, you would need to download them first and upload them yourself; AzCopy supports this mode as well.
As far as costs are concerned, "a lot" is a relative term, but in general Azure Storage is very cheap and 130 GB is not a whole lot of data.
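For illustration, a minimal sketch of the whole-container server-side copy with the Azure.Storage.Blobs v12 .NET SDK; the connection string is a placeholder, and the question's example container names are shown lowercased because real container names must be lowercase:

```csharp
// Minimal sketch: server-side copy of every blob from the live container to the test
// container, preserving names.
using Azure.Storage.Blobs;

string connectionString = "<storage-connection-string>"; // the "x" storage account
var service = new BlobServiceClient(connectionString);
var live = service.GetBlobContainerClient("liveblobs");
var test = service.GetBlobContainerClient("testblobs");

await test.CreateIfNotExistsAsync();

await foreach (var item in live.GetBlobsAsync())
{
    var source = live.GetBlobClient(item.Name);
    // Same storage account: no download/upload and no SAS needed on the source URI.
    await test.GetBlobClient(item.Name).StartCopyFromUriAsync(source.Uri);
}
```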

automatic backups of Azure blob storage?

I need to do an automatic periodic backup of an Azure blob storage to another Azure blob storage.
This is in order to guard against any kind of malfunction in the software.
Are there any services which do that? Azure doesn't seem to have this
As @Brent mentioned in the comments on Roberto's answer, the replicas are for HA; if you delete a blob, that delete is replicated instantly.
For blobs, you can very easily create asynchronous copies to a separate blob (even in a separate storage account). You can also take snapshots, which capture a blob at a moment in time. Initially a snapshot doesn't cost anything, but once you start modifying the blocks/pages it refers to, new blocks/pages are allocated. Over time you'll want to purge old snapshots. This is a great way to keep data "as-is" over time and revert to a snapshot if there's a malfunction in your software.
With queues, the malfunction story isn't quite the same, as typically you'd only have a small number of queue items present (at least that's the hope; if you have thousands of queue messages, that's typically a sign that your software is falling behind). In any event: when writing queue messages, you could also write them to blob storage for archival purposes, in case there's a malfunction. I wouldn't recommend using blob-based messaging for scaling/parallel processing, since blobs don't have the mechanisms that queues do, but you could fall back on the archived copies manually in case of a malfunction.
There's no copy function for tables. You'd need to write to two tables during your write.
Azure keeps 3 redundant copies of your data in different locations in the same data centre where your data is hosted (to guard against hardware failure).
This applies to blob, table and queue storage.
Additionally, you can enable geo-replication on all of your storage. Azure will then automatically keep redundant copies of your data in separate data centres, which guards against anything happening to the data centre itself.
See Here
