We use block blobs to store durable resources and page blobs to store event data.
We need to back up the blobs, so I tried to use AzCopy. It works fine on my dev machine, but on another, slower machine it fails almost every time with the error "The remote server returned an error: (412) The condition specified using HTTP conditional header(s) is not met."
We write to the page blobs quite often (up to several times per second, though that is not the common case), so this might be the reason.
Is there a better strategy for backing up blobs that are changing? Or is there a way to work around the problem with the ETag check that AzCopy uses?
A changed ETag will always halt a copy, since a changing ETag signifies that the source has changed.
The general approach to blob backup is subjective, but objectively:
blob copies within Azure itself, in the same region, from Storage account to Storage account, are going to be significantly faster than trying to copy a blob to an on-premises location (due to general Internet latency) or even copying from storage account to local disk on a VM.
Blobs support snapshots (which take place relatively instantly). If you create a snapshot, the snapshot remains unchanged, allowing you to then execute a copy operation against the snapshot instead of the actual blob itself (using AzCopy in your case) without fear of the source data changing during the copy. Note: You can create as many snapshots as you want; just be careful, since storage size grows as the underlying original blob changes.
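As a rough illustration of the snapshot-then-copy idea (the question uses AzCopy, but the same flow works with the azure-storage-blob Python SDK, v12; the connection string, container and blob names here are placeholders):

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="events", blob="page-blob-1")

    # Take a snapshot: it is read-only, so its ETag cannot change mid-copy.
    snapshot = blob.create_snapshot()  # dict containing the 'snapshot' timestamp

    # Address the snapshot rather than the live blob.
    snapshot_url = f"{blob.url}?snapshot={snapshot['snapshot']}"

    # Server-side copy of the frozen snapshot into a backup container
    # in the same storage account.
    backup = service.get_blob_client(container="backups", blob="page-blob-1")
    backup.start_copy_from_url(snapshot_url)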
Related
I am trying to figure out what the benefit of Index Tags is versus creating a full virtual folder tree structure in Azure Blob Storage, when I have full programmatic control over the creation of the blobs.
Virtual Folder structure vs Blob Index Tags
You're asking us to compare two separate features of Azure Blob Storage as though they were mutually exclusive, when in fact they can be used together, and there are more options for organizing blobs than just those two:
TL;DR:
Azure Blob Index Tags - arbitrary mutable tags on your blobs.
Virtual folder structure - this is just a naming convention where your blobs are named with slash-separated "directory" names.
NFS 3.0 Blob Storage and Data Lake Storage Gen2 - this is a major new version (or revision) of Azure Blob Storage that makes it behave almost exactly like a traditional disk file-system (hence the NFS 3.0 compliance); however, it (currently) comes with major shortcomings.
In detail:
Azure Blob Index Tags is a new feature of Azure Blob Storage: it entered preview in May 2020 and left the preview stage in June 2021 (two months ago at the time of writing).
Your storage account needs to be "General Purpose v2" - so if you have an older-style storage account you'll need to upgrade it.
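To make the tagging model concrete, here is a minimal sketch (azure-storage-blob Python SDK, 12.4 or later; the container, blob and tag names are made up):

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="documents", blob="invoice-123.pdf")

    # Tags are arbitrary key/value strings and are mutable at any time.
    blob.set_blob_tags({"project": "apollo", "status": "approved", "year": "2021"})

    # Query across the whole storage account; remember the index is only
    # updated "shortly" after the write, not instantaneously.
    query = "\"project\" = 'apollo' AND \"year\" = '2021'"
    for result in service.find_blobs_by_tags(query):
        print(result.name)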
Advantages:
It's built-in to Azure Blob Storage, so you don't need to maintain your own indexing infrastructure (which is what we used to have to do: I stored my own blob index in a table in Azure Table Storage in the same storage account, and had a process that ran on a disposable Azure VM nightly to index new blobs).
As it's a tagging system it means you can have your own taxonomy and don't have to force your nomenclature into a single hierarchy as with virtual folders.
Tags are mutable: you can add/remove/edit them as you like.
Disadvantages:
As with maintaining your own blob index, the index updates are not instantaneous (unlike an RDBMS, where indexes are always up-to-date). The blog article linked handwaves this away by saying:
"...and the account indexing engine exposes the new blob index shortly after."
...note that they don't define what "shortly" means.
As of August 2021, Azure charges $0.03 per 10,000 tags (regardless of the storage-tier in use). So if you have 1,000,000 blobs and 3 tags per blob, then that's $9/mo.
This isn't a significant cost by any means, but the cost-per-information-theoretic-unit is kinda-high, which is disappointing.
"Virtual Folder tree structure" - By this I assume you mean giving your blob's hierarchical naming system and using Azure Blob Storage's blob-name-prefix search filter.
Advantages:
Tried-and-tested. Simple.
Doesn't cost you anything.
No indexing delay.
Disadvantages:
Prefix searches are still just lexicographic enumeration of blob names, so listing is no faster than enumerating blobs lexicographically.
You cannot conceptually move or rename blobs.
(You can, technically, provided the source and destination are in the same container, by doing a copy+delete. The copy should be near-instantaneous, as I understand Blob Storage uses copy-on-write for same-container copies, but it's still imperfect: the client API exposes it as an asynchronous operation with an unbounded time-to-copy rather than giving hard guarantees.)
The fact this has been a limitation of Azure Blob Storage for a decade now utterly confounds me.
NFS 3.0 Blob Storage - Also new in 2020/2021 with Blob Index Tags is NFS 3.0 Blob Storage, which gives you a full "real" hierarchical filesystem for your blobs.
The Hierarchical Namespace feature is powered by Azure Data Lake Storage Gen 2. I don't know any technical details of this so I can't say anything.
Advantages:
NFS 3.0-compliant (that's huge!) so Linux clients can even mount it directly.
It's cheaper than normal blob storage (whaaaaat?!):
In West US 2, NFS+LRS+Hot is $0.018/GB while the old-school flat namespace with LRS+Hot is $0.0184/GB.
In other Azure locations and with other redundancy options then NFS can be slightly more expensive, but otherwise they're generally within $0.01 of each other.
Disadvantages:
Apparently you're limited to only block-blobs: not page-blobs or append-blobs.
Notes from the Known Issues page:
NFS can only be used with new accounts: you cannot update an existing account. You also cannot disable it once you enable it.
You cannot (currently) lock blobs/files - though this looks to come in a future version.
You cannot use both Blob Index Tags and NFS in the same storage account - or in fact most features of Blob Storage (ooo-er!).
The documentation for operations exclusive to Hierarchical Namespace blobs only lists Set Blob Expiry; there (still) doesn't seem to be a synchronous/atomic "move blob" or "rename blob" operation. Instead, the Protocol Support page implies that a request to rename an NFS file is translated into raw blob storage operations behind the scenes... so I'm curious how they do that atomically.
When your application makes a request by using the NFS 3.0 protocol, that request is translated into a combination of block blob operations. For example, NFS 3.0 read Remote Procedure Call (RPC) requests are translated into a Get Blob operation. NFS 3.0 write RPC requests are translated into a combination of Get Block List, Put Block, and Put Block List.
Alternative concept: Content-addressable-storage
Because blobs cannot be atomically/synchronously renamed, a few years ago I simply gave up trying to come up with a perfect blob nomenclature that would stand the test of time, because business requirements always change.
Instead, I noticed that my blobs were invariably immutable: once they've been uploaded to storage they're never updated, or when they are updated they're saved to new, separate blobs - which means that a content-addressable naming strategy suited my projects perfectly.
In short: give your immutable blobs a name which is a string-representation of a hash of their content, and store their hashes in a traditional RDBMS where you have much greater flexibility (and ideally: performance) with how they're indexed and referenced by the rest of your system.
In my case, I set my blobs' names to the Base-16 representation of their SHA-256 hash.
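A minimal sketch of that naming scheme (Python; the container name is hypothetical, and the RDBMS side is omitted):

    import hashlib

    from azure.core.exceptions import ResourceExistsError
    from azure.storage.blob import BlobServiceClient

    def upload_content_addressed(service: BlobServiceClient, data: bytes) -> str:
        """Upload an immutable blob named after the Base-16 SHA-256 of its content."""
        name = hashlib.sha256(data).hexdigest()
        blob = service.get_blob_client(container="cas", blob=name)
        try:
            blob.upload_blob(data)  # overwrite=False by default
        except ResourceExistsError:
            pass  # identical content already stored: de-duplication for free
        return name  # store this hash in your RDBMS and reference it from there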
Advantages:
You get de-duping for free: blobs with identical content will have identical hashes, so you can avoid uploading/downloading the same huge blob twice.
You get integrity checks for free: if you download a blob and its hash doesn't match its blob-name, then your storage account likely got hacked (or the data got corrupted).
Disadvantages:
You still need to maintain your own index in your RDBMS (if applicable) - but you can still use Blob Index Tags with content-addressable storage if you like.
Would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out. Is there any method to resume it smartly? Is there any indication on the source blob storage of which blobs were copied/handled before, so that the next Function run can skip copying them?
Would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out.
You can avoid this timeout problem by changing the hosting plan. For example, if you use an App Service plan and turn on Always On, there is no timeout restriction any more. But to be honest, if you have a lot of files and the copy takes a long time, an Azure Function is not the recommended method (the tasks performed by a Function should be lightweight).
Is there any indication on the source blob storage of which blobs were copied/handled before, so that the next Function run can skip copying them?
Yes, of course. Just add custom metadata to each blob after it has been copied. The next time the Function runs, check that metadata first and skip blobs that already have it.
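A minimal sketch of that marker pattern (azure-storage-blob Python SDK; the metadata key, container names and connection strings are placeholders):

    import time

    from azure.storage.blob import BlobServiceClient

    source = BlobServiceClient.from_connection_string("<source-connection-string>")
    dest = BlobServiceClient.from_connection_string("<dest-connection-string>")
    src_container = source.get_container_client("container-a")

    for item in src_container.list_blobs():
        src_blob = src_container.get_blob_client(item.name)
        if (src_blob.get_blob_properties().metadata or {}).get("copied") == "true":
            continue  # already handled by a previous Function run

        dst_blob = dest.get_blob_client(container="container-b", blob=item.name)
        # Across accounts the source URL must be readable by the service,
        # e.g. by appending a SAS token to src_blob.url.
        dst_blob.start_copy_from_url(src_blob.url)

        # Wait for the server-side copy before touching the source, since
        # modifying the source while a copy is pending can make the copy fail.
        while dst_blob.get_blob_properties().copy.status == "pending":
            time.sleep(1)

        src_blob.set_blob_metadata({"copied": "true"})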
It's a problem of plenty. You can:
copy from command line or code. AZ CLI or azcopy or the .NET SDK (which can be extended to other language SDKs).
use Storage explorer.
use Azure Data Factory as Bowman suggested.
use SSIS.
[mis]use Databricks, especially if you are dealing with massive amount of data and need scalability.
Write some code and use the new "Put XXX from URL" APIs. E.g. "Put Blob from URL" will create a new blob, and "Put Block from URL" will create a block in a block blob (see the sketch at the end of this answer).
#1 and #2 would use your local machine's internet bandwidth (download to local and then upload), whereas #3, #4, and #5 would run entirely in the cloud. So, if your source and destination are in the same region, with #1 and #2 you'll end up paying egress charges, whereas with #3, #4, and #5 you won't.
Using Azure Functions to copy files is probably the worst thing you can do. Azure Functions cost is proportional to execution time (and memory usage). In this case (as it's taking more than 10 minutes) I assume you're moving a large amount of data, so you'll be paying for your Azure Function while it just waits on I/O for the file transfers to complete.
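For option 6, a minimal sketch (azure-storage-blob Python SDK, assuming a recent version that exposes upload_blob_from_url; the names and the SAS-protected source URL are placeholders):

    from azure.storage.blob import BlobClient

    # Destination blob in container B.
    dest = BlobClient.from_connection_string(
        "<dest-connection-string>",
        container_name="container-b",
        blob_name="file-001.dat",
    )

    # The source must be readable by the service, e.g. a SAS URL on container A.
    source_url = "https://sourceaccount.blob.core.windows.net/container-a/file-001.dat?<sas-token>"

    # Put Blob From URL: the bytes move service-side, never through this machine.
    dest.upload_blob_from_url(source_url, overwrite=True)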
Why would I need to create a blob snapshot and incur additional cost if Azure already provides GRS(Geo redundant storage) or ZRS (Zone redundant storage)?
Redundancy (ZRS/GRS/RA-GRS) provides a means to achieve high availability of your resources (blobs, in your scenario). By enabling redundancy you ensure that a copy of your blob is available in another region/zone in case the primary region/zone becomes unavailable. It also protects against data corruption of the primary blob.
When you take a snapshot of your blob, a readonly copy of that blob in its current state is created and stored. If needed, you can restore a blob from a snapshot. This scenario is well suited if you want to store different versions of the same blob.
However, please keep in mind that neither redundancy nor snapshots are a backup: if you delete the base blob, all snapshots associated with that blob are deleted, and all copies of that blob in other zones/regions are deleted as well.
I guess you need to understand the difference between Backup and Redundancy.
Backups make sure if something is lost, corrupted or stolen, that a copy of the data is available at your disposal.
Redundancy makes sure that if something fails (your computer dies, a drive gets fried, a server freezes), you are able to keep working regardless of the problem. Redundancy means that all your changes are replicated to another location. In case of a failover, the replica can theoretically function as the primary and serve the (hopefully) latest state of your file system.
You could also turn on soft delete. That keeps a copy of every blob for every change made to it, even if someone deletes it. You then set a retention period so those copies are automatically removed after some period of time.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-soft-delete
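If you want to turn it on programmatically, here is a minimal sketch (azure-storage-blob Python SDK; the 14-day retention is just an example value):

    from azure.storage.blob import BlobServiceClient, RetentionPolicy

    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Keep deleted (and overwritten) blob data recoverable for 14 days.
    service.set_service_properties(
        delete_retention_policy=RetentionPolicy(enabled=True, days=14)
    )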
I have a use case which often requires copying a blob (file) from one Azure region to another. The file size spans from 25 to 45 GB. Needless to say, this sometimes goes very slowly, with inconsistent performance. It might take up to two hours, sometimes more. Distance plays a role, but it varies. Even within the same region, copying is slower than I would expect. I've been trying:
The Python SDK, and its copy blob method from the blob service.
The rest API copy blob
az copy from the CLI.
Although I didn't really expect different results, since all of them use the same backend methods.
Is there any approach I am missing? Is there any way to speed up this process, or any kind of blob sharing integrated in Azure? VHD/disk sharing could also do.
You may want to try /SyncCopy option in AzCopy:
Synchronously copy blobs from one storage account to another
AzCopy by default copies data between two storage endpoints asynchronously. Therefore, the copy operation runs in the background using spare bandwidth capacity that has no SLA in terms of how fast a blob is copied, and AzCopy periodically checks the copy status until the copying is completed or failed.
The /SyncCopy option ensures that the copy operation gets consistent speed. AzCopy performs the synchronous copy by downloading the blobs to copy from the specified source to local memory, and then uploading them to the Blob storage destination.
AzCopy /Source:https://myaccount1.blob.core.windows.net/myContainer/ /Dest:https://myaccount2.blob.core.windows.net/myContainer/ /SourceKey:key1 /DestKey:key2 /Pattern:ab /SyncCopy
/SyncCopy might generate additional egress cost compared to the asynchronous copy, so the recommended approach is to use this option from an Azure VM in the same region as your source storage account to avoid egress cost.
On Linux you can try the --parallel-level option; look it up with azcopy --help. Also, the official maximum operation limit is 512. Go bonkers!
Azure Blob Storage does not expose any kind of "blob rename" operation - which sounds preposterous, because renaming an entity is a fundamental operation in almost any storage system - and Azure's documentation makes no reference to how a blob's name is used internally (e.g. as a DHT key), but as we can specify our own names it's clear that Azure isn't using a content-addressable storage model (so renaming should be possible, once the Azure Storage team decides to allow it).
Microsoft advocates instead that to "rename" a blob, you simply copy it, then delete the original - which seems incredibly inefficient - for example, if you have a 200GB video file blob with a typo in the blob name - unless internally Azure has some kind of dedupe system - in which case it makes perfect sense to eliminate the special-case of "blob renaming" because internally it really would be a "name copy" operation.
Unfortunately the current documentation for Copy Blob ( https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/copy-blob ) does not describe any internal processes and, in fact, suggests that a blob copy might be a very long-running operation:
State of the copy operation, with these values:
success: the copy completed successfully.
pending: the copy is in progress.
If it were using a dedupe system internally then all blob copy operations would be instantaneous, so there would be no need for an "in progress" status; also, confusingly, it uses "pending" to mean "in progress", when normally "pending" means "enqueued, not started yet".
Alarmingly, the documentation also states this:
A copy attempt that has not completed after 2 weeks times out and leaves an empty blob
...which can be taken to mean that there are zero guarantees about the time it takes to copy a blob. There is nothing on the page to suggest that smaller blobs are copied more quickly than bigger blobs, so for some reason (a long queue, an unfortunate outage, and so on) it could take 2 weeks to correct my hypothetical typo in my hypothetical 200GB video file. And don't forget that I cannot delete the original misnamed blob until the copy operation has completed, which means designing my client software to keep checking and eventually issue the delete operation (and to ensure my software runs continuously for up to 2 weeks...).
Is there any authoritative information regarding the runtime characteristics and nature of Azure Blob copy operations?
As you may already know, the Copy Blob operation is asynchronous, and all the things you mentioned above are true, with one caveat: the copy is effectively synchronous when copying within the same storage account. You get the same copy states whether you're copying blobs across storage accounts or within one, but when the operation is performed within the same storage account it happens almost instantaneously.
So when you rename a blob, you're creating a copy of the blob in the same storage account (even the same container), which is instantaneous. I am not 100% sure about the internal implementation, but if I am not mistaken, when you copy a blob within the same storage account it doesn't copy the bytes to some separate place; it just creates two pointers (the new blob and the old blob) pointing to the same stored data. Once you start making changes to one of the blobs, I think that is when those bytes are actually copied.
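Tying that together, here is a minimal sketch of the copy-then-delete "rename" pattern within one container (azure-storage-blob Python SDK; names are hypothetical), including the status polling discussed above:

    import time

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("videos")

    def rename_blob(old_name: str, new_name: str) -> None:
        source = container.get_blob_client(old_name)
        target = container.get_blob_client(new_name)

        # Same-account copy, authorized by the destination's credentials;
        # in practice it completes almost immediately.
        target.start_copy_from_url(source.url)

        # The API still reports an asynchronous copy, so poll until it finishes
        # before deleting the misnamed original.
        props = target.get_blob_properties()
        while props.copy.status == "pending":
            time.sleep(1)
            props = target.get_blob_properties()
        if props.copy.status != "success":
            raise RuntimeError("copy did not complete successfully")

        source.delete_blob()

    rename_blob("my-vido.mp4", "my-video.mp4")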
For an internal understanding of Azure Storage, I would highly recommend reading the paper published by the team a few years ago. Please look at my answer here, which has links to the paper: Azure storage underlying technology.