How to clone a blob container and its contents - Azure

Summary
I have picked up support for a fairly old website which stores a bunch of blobs in Azure. What I would like to do is duplicate all of my blobs from live to the test environment so I can use them without affecting users.
Architecture
The website is a mix of VB webforms and MVC, communicating with an Azure blob service (e.g. https://x.blob.core.windows.net/LiveBlobs).
The test site mirrors the live setup, except it points to a different blob container in the same storage account (e.g. https://x.blob.core.windows.net/TestBlobs).
Questions
Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre right?
Things I've investigated
I've spent quite some time searching for an answer, but what I've found deals with copying between storage accounts or copying single blobs.
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
I am fairly new to Azure so please forgive me if this is a silly question or I've missed out some important details. I'm more than happy to add any extra information should you need it.

Can I copy all of the blobs from live to test without downloading them? They would need to maintain the same names.
Yes, you can. Copying a blob is an asynchronous server-side operation. You simply tell the blob service which blobs to copy and the destination details, and it will do the job for you. There is no need to download the blobs first and upload them to the destination.
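For illustration, here is a minimal sketch of that server-side copy using the azure-storage-blob v12 Python SDK (not part of the original answer; the connection string is a placeholder and the container names are taken from the question). Within the same storage account the source blob URL can be used directly; if the source were in a different private account you would need a SAS token on it.

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string - use your storage account's own.
    service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

    source = service.get_container_client("LiveBlobs")
    destination = service.get_container_client("TestBlobs")

    for blob in source.list_blobs():
        source_blob = source.get_blob_client(blob.name)
        # Same blob name in the test container; the copy happens server-side,
        # nothing is downloaded to the machine running this script.
        destination.get_blob_client(blob.name).start_copy_from_url(source_blob.url)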
How do I work out what it will cost to do this? The live blob storage is roughly 130GB, but it should just be copying the data within the same data centre right?
So there are three things you need to consider when it comes to cost: 1) storage costs, 2) transaction costs and 3) data egress costs.
Since the copied blobs will be stored somewhere, they will be consuming storage and you will incur storage costs.
The copy operation performs read operations on the source blobs and write operations on the destination blobs (to create them), so you will incur transaction costs. At a very minimum, for each blob copied you can expect 2 transactions - a read on the source and a write on the destination (though there can be more).
You incur data egress costs if the destination storage account is not in the same region as your source storage account. As long as both storage accounts are in the same region, you would not incur this.
You can use Azure Storage Pricing Calculator to get an idea about how much it is going to cost you.
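To make those three buckets concrete, here is a back-of-the-envelope sketch in Python; the blob count and all per-unit rates are placeholders you would replace with the current figures from the pricing calculator for your region and redundancy option.

    # All rates below are PLACEHOLDERS - substitute the real prices from the
    # Azure Storage pricing calculator.
    blob_count = 50_000                 # assumed number of blobs (placeholder)
    data_gb = 130                       # from the question

    storage_rate_per_gb_month = 0.02    # placeholder $/GB/month
    txn_rate_per_10k = 0.05             # placeholder $ per 10,000 transactions

    extra_storage_per_month = data_gb * storage_rate_per_gb_month     # the copy doubles the data stored
    transaction_cost = (blob_count * 2 / 10_000) * txn_rate_per_10k   # >= 1 read + 1 write per blob
    egress_cost = 0.0                                                 # same region, so no egress

    print(extra_storage_per_month, transaction_cost, egress_cost)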
I've also found AzCopy which looks promising but it looks like it would copy the files one by one so I'm worried it would end up taking a long time and costing a lot.
Blobs are always copied one by one. Copying across storage accounts is always an asynchronous server-side operation, so you can't really predict how long the copy operation will take to complete, but in my experience it is quite fast. If you want to control when the blobs are copied, you would need to download them first and then upload them; AzCopy supports this mode as well.
As far as costs are concerned, "a lot" is a relative term. But in general Azure Storage is very cheap, and 130 GB is not a whole lot of data.
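If you do want to watch an asynchronous copy, the copy state is exposed on the destination blob's properties. A hedged sketch with the v12 Python SDK (connection string, container, blob name and source URL are all placeholders):

    import time
    from azure.storage.blob import BlobClient

    destination = BlobClient.from_connection_string(
        "<storage-account-connection-string>", "TestBlobs", "some-blob-name")  # placeholders

    destination.start_copy_from_url("<source-blob-url>")  # placeholder source URL

    # Poll the copy status until the service reports it is no longer pending.
    status = destination.get_blob_properties().copy.status
    while status == "pending":
        time.sleep(5)
        status = destination.get_blob_properties().copy.status

    print(status)  # "success", "failed" or "aborted"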

Related

Azure Function (C#): How to copy lots of files from blob container A to another blob container B? (Function has a timeout of 10 mins)

I would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out. Is there any method to resume the copy smartly? Is there any indicator on the source blob storage that identifies blobs that were already copied/handled, so that the next Function run can skip copying them?
I would like to use an Azure Function to copy lots of files from blob container A to another blob container B. However, some files were missed because the Function timed out.
You can avoid this timeout problem by changing the plan level. For example, if you use the App Service plan and turn on Always On, there will be no more timeout restrictions. But to be honest, if you have a lot of files and it takes a long time, then an Azure Function is not a recommended method (the tasks performed by a function should be lightweight).
Is there any indicator on the source blob storage that identifies blobs that were already copied/handled, so that the next Function run can skip copying them?
Yes, of course you can. Just add custom metadata to each blob after it has been copied. The next time you copy files, you can check the custom metadata first.
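The question is about a C# Function, but the idea is easiest to show in a short Python sketch with the azure-storage-blob v12 SDK; the metadata key "copied" is just an assumed convention, all names are placeholders, and a production Function would also confirm the copy actually completed before marking the blob.

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
    source = service.get_container_client("container-a")                       # placeholder names
    destination = service.get_container_client("container-b")

    for blob in source.list_blobs(include=["metadata"]):
        metadata = blob.metadata or {}
        if metadata.get("copied") == "true":
            continue  # already handled by a previous run, skip it
        source_blob = source.get_blob_client(blob.name)
        destination.get_blob_client(blob.name).start_copy_from_url(source_blob.url)
        metadata["copied"] = "true"
        source_blob.set_blob_metadata(metadata)  # mark the source so the next run skips it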
It's a problem of plenty. You can:
copy from the command line or code: the Azure CLI, AzCopy, or the .NET SDK (which can be extended to the other language SDKs).
use Storage Explorer.
use Azure Data Factory as Bowman suggested.
use SSIS.
[mis]use Databricks, especially if you are dealing with massive amounts of data and need scalability.
write some code and use the new "Put XXX from URL" APIs, e.g. "Put Blob from URL" will create a new blob and "Put Block from URL" will create a block in a block blob (a sketch follows below).
#1 and #2 would use your local machine's internet bandwidth (download to local and then upload), whereas #3, #4 and #5 would run entirely in the cloud. So even if your source and destination are in the same region, for #1 and #2 you'll end up paying egress charges, whereas for #3, #4 and #5 you won't.
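For option 6, the v12 Python SDK appears to expose "Put Blob from URL" as upload_blob_from_url (verify this against your SDK version); a hedged sketch, with placeholder names and a placeholder SAS token:

    from azure.storage.blob import BlobClient

    destination = BlobClient.from_connection_string(
        "<destination-connection-string>", "container-b", "file.bin")  # placeholders

    # The source must be readable by the service, e.g. a public blob or a URL carrying a SAS token.
    destination.upload_blob_from_url(
        "https://source.blob.core.windows.net/container-a/file.bin?<sas-token>",
        overwrite=True)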
Using Azure Functions to copy files is probably the worst thing you can do. Azure Functions cost is proportional to execution time (and memory usage). In this case (as it's taking more than 10 minutes) I assume you're moving a large amount of data, so you'll be paying for your Azure Function while it just waits on I/O for the file transfers to complete.

Is rehydration of the (Azure Blob Storage) archive tier always needed?

I have studied the following link to understand the Hot, Cool and Archive tiers of Azure Storage V2.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
In the Blob rehydration section it says:
To read data in archive storage, you must first change the tier of the blob to hot or cool. This process is known as rehydration and can take up to 15 hours to complete.
My questions are:
Can I get just a list of all blobs without rehydration? Is it going to cost me?
Do I have to perform rehydration before reading/deleting a single file?
Do I have to perform rehydration to delete a file before 180 days?
All answers are taken from the article you linked to:
1) Yes, you can get a list and it will not cost you extra
2) Yes, you have to rehydrate to read file contents, but you can delete without rehydrating
While a blob is in archive storage, the blob data is offline and cannot be read, copied, overwritten, or modified. You can't take snapshots of a blob in archive storage. However, the blob metadata remains online and available, allowing you to list the blob and its properties. For blobs in archive, the only valid operations are GetBlobProperties, GetBlobMetadata, ListBlobs, SetBlobTier, and DeleteBlob.
As an addition to the answer to the reading part of question 2):
Blob-level tiering allows you to change the tier of your data at the object level using a single operation called Set Blob Tier. You can easily change the access tier of a blob among the hot, cool, or archive tiers as usage patterns change, without having to move data between accounts. All tier changes happen immediately. However, rehydrating a blob from archive can take several hours.
3) The 180 days is the minimum amount of time a blob needs to stay in archive storage. Changes before that period incur an early deletion charge. This does not change the way you delete blobs, so you can still call DeleteBlob (and be charged the early deletion charge).
Any blob that is deleted or moved out of the cool (GPv2 accounts only) or archive tier before 30 days and 180 days respectively will incur a prorated early deletion charge.
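Putting the three answers together in a short sketch with the azure-storage-blob v12 Python SDK (all names are placeholders): listing and deleting work while a blob is archived; reading its contents requires a tier change (rehydration) first.

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
    container = service.get_container_client("archive-container")              # placeholder

    # 1) Listing blobs and their properties works without rehydration.
    for blob in container.list_blobs():
        print(blob.name, blob.blob_tier)

    # 2) Reading contents requires rehydration: change the tier, then wait (this can take hours).
    container.get_blob_client("some-archived-blob").set_standard_blob_tier("Hot")

    # 3) Deleting works without rehydration (early deletion charges may still apply).
    container.get_blob_client("another-archived-blob").delete_blob()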

Speeding up blob copying on Azure

I have a use case which often requires copying a blob (file) from one Azure region to another. The file size ranges from 25 to 45GB. Needless to say, this sometimes goes very slowly, with inconsistent performance. It might take up to two hours, sometimes more. Distance plays a role, but it varies. Even within the same region, copying is slower than I would expect. I've been trying:
The Python SDK, and its copy blob method from the blob service.
The REST API Copy Blob operation.
az copy from the CLI.
Although I didn't really expect different results, since all of them use the same backend methods.
Is there any approach I am missing? Is there any way to speed up this process, or any kind of blob sharing integrated in Azure? VHD/disk sharing could also do.
You may want to try the /SyncCopy option in AzCopy:
Synchronously copy blobs from one storage account to another
AzCopy by default copies data between two storage endpoints asynchronously. Therefore, the copy operation runs in the background using spare bandwidth capacity that has no SLA in terms of how fast a blob is copied, and AzCopy periodically checks the copy status until the copying is completed or failed.
The /SyncCopy option ensures that the copy operation gets consistent speed. AzCopy performs the synchronous copy by downloading the blobs to copy from the specified source to local memory, and then uploading them to the Blob storage destination.
AzCopy /Source:https://myaccount1.blob.core.windows.net/myContainer/ /Dest:https://myaccount2.blob.core.windows.net/myContainer/ /SourceKey:key1 /DestKey:key2 /Pattern:ab /SyncCopy
/SyncCopy might generate additional egress cost compared to the asynchronous copy; the recommended approach is to use this option in an Azure VM that is in the same region as your source storage account, to avoid the egress cost.
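As a rough illustration of what a synchronous copy amounts to (download, then re-upload, which is what /SyncCopy does for you), here is a Python sketch with the v12 SDK and placeholder names; for 25-45 GB blobs you would stream in chunks rather than buffer the whole blob in memory, which is essentially what AzCopy handles.

    from azure.storage.blob import BlobClient

    source = BlobClient.from_connection_string(
        "<source-connection-string>", "source-container", "big-file.vhd")            # placeholders
    destination = BlobClient.from_connection_string(
        "<destination-connection-string>", "destination-container", "big-file.vhd")  # placeholders

    # Download through the machine running this code, then upload to the destination.
    # readall() buffers everything in memory - fine for a demo, not for 45 GB blobs.
    data = source.download_blob().readall()
    destination.upload_blob(data, overwrite=True)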
On Linux you can try the --parallel-level option; look it up using azcopy --help. Also, the official maximum is 512 concurrent operations. Go bonkers!

Are Azure Blob copy operations cheap?

Azure Blob Storage does not expose any kind of "blob rename" operation - which sounds preposterous because the idea of renaming an entity is a fundamental operation in almost any storage system - and Azure's documentation makes no reference to how a blob's name is used internally (e.g. as DHT key), but as we can specify our own names it's clear that Azure isn't using a content-addressable storage model (so renaming should be possible, once the Azure Storage team decides to allow it).
Microsoft advocates instead that to "rename" a blob, you simply copy it, then delete the original - which seems incredibly inefficient - for example, if you have a 200GB video file blob with a typo in the blob name - unless internally Azure has some kind of dedupe system - in which case it makes perfect sense to eliminate the special-case of "blob renaming" because internally it really would be a "name copy" operation.
Unfortunately the current documentation for blob copy ( https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/copy-blob ) does not describe any internal processes, and in-fact, suggests that the blob copy might be a very long operation:
State of the copy operation, with these values:
success: the copy completed successfully.
pending: the copy is in progress.
If it was using a dedupe system internally then all blob copy operations would be instantaneous so there would be no need for an "in progress" status; also confusingly it uses "pending" to refer to "in progress" - when normally "pending" means "enqueued, not starting yet".
Alarmingly, the documentation also states this:
A copy attempt that has not completed after 2 weeks times out and leaves an empty blob
...which can be taken to read that there are zero guarantees about the time it takes to copy a blob. There is nothing in the page to suggest smaller blobs are copied quicker compared to bigger blobs - so for some reason (such as a long queue, unfortunate outages, and so on) it could take 2 weeks to correct my hypothetical typo in my hypothetical 200GB video file - and don't forget that I cannot delete my original misnamed blob until the copy operation is completed - which means needing to design my client software to constantly check and eventually issue the delete operation (and to ensure my software runs continuously for up to 2 weeks...).
Is there any authoritative information regarding the runtime characteristics and nature of Azure Blob copy operations?
As you may already know, the Copy Blob operation is an asynchronous operation, and all the things you mentioned above are true, with one caveat: the copy operation is effectively synchronous when copying within the same storage account. You get the same copy states whether you're copying blobs across storage accounts or within a storage account, but when the operation is performed in the same storage account, it happens almost instantaneously.
So when you rename a blob, you're creating a copy of the blob in the same storage account (even the same container), which is instantaneous. I am not 100% sure about the internal implementation, but if I am not mistaken, when you copy a blob within the same storage account it doesn't copy the bytes to some separate place; it just creates 2 pointers (the new blob and the old blob) pointing to the same stored data. Once you start making changes to the blobs, I think that is when it goes and changes those bytes.
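A hedged sketch of the rename-as-copy pattern within one storage account, using the azure-storage-blob v12 Python SDK (container and blob names are placeholders); even though same-account copies are normally near-instant, it still polls before deleting the original:

    import time
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
    container = service.get_container_client("videos")                         # placeholder

    source = container.get_blob_client("video-with-typo.mp4")  # placeholder names
    renamed = container.get_blob_client("video-fixed.mp4")

    renamed.start_copy_from_url(source.url)  # same-account copy, usually completes almost instantly

    while renamed.get_blob_properties().copy.status == "pending":
        time.sleep(1)

    if renamed.get_blob_properties().copy.status == "success":
        source.delete_blob()  # only delete the original once the copy has succeeded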
For internal understanding of Azure Storage, I would highly recommend that you read the paper published by the team a few years ago. Please look at my answer here which has links to this paper: Azure storage underlying technology.

Azure blobs, what are they for?

I'm reading about Azure blobs and storage, and there are things I don't understand.
First, you can use Azure for just hosting, but when you create a web role... do you need storage for the .dll's and other files (.js and .css)? Or is there a small storage quota in a worker role you can use? How big is it? I can't see getting charged every time a browser downloads a CSS file, so I guess I can store those things in another kind of storage.
Second, you get charged for transactions and bandwidth, so it's not a good idea to provide direct links to the blobs on your websites. Then... what do you do? Download them from your website code and write them to the client output stream on the fly from ASP.NET? I think I've read that internal traffic/transactions are free, so it looks like a "too good to be true" solution :D
Is the traffic between hosting and storage also free?
Thanks in advance.
First, to answer your main question: blobs are best used for dynamic data files. If you ran a YouTube sort of site, you would use blobs to store the videos in every compressed state and the thumbnail images generated from those videos. Tables within Table Storage are best for dynamic data that does not require files. For example, comments on YouTube videos would likely be best stored in tables in ATS.
You generally want a storage account for at least two things: publishing your deployments into Azure, and having your compute nodes transfer their diagnostic data to it for when you're deployed and need to monitor your compute nodes.
Even though you publish your deployments THROUGH a storage account, the deployed code lives on your compute nodes. .CSS/.HTML files served by your app are served from your node's storage space, of which you get plenty (it is NOT a good place for your dynamic data, however).
You pay for traffic/data that crosses the Azure data center boundary, regardless of where it came from. Furthermore, transactions (reads or writes) between your Azure Table Storage and anywhere else are not free. You also pay for storing the data in the storage account (storing data on the compute nodes themselves is not metered). Data that does not leave the data center is not subject to transfer fees. Now in reality, the costs are so low that you have to be pushing gigabytes per day to start noticing.
Don't store any dynamic data only on compute instances. That data will get purged whenever you redeploy your app or whenever they decide to move your app onto a different node.
Hope this helps
